Republishing Old Books: 2020

While going through all of my old papers, I found a copy of my high school literary magazine.

While the title appears to be Greek, it's really a trans-language pun. The first character, for example, is a Phi, which is pronounced like an F. The next character, while it looks like a P, is a Rho, pronounced like an R. And so on. The theory is that if a person who knew Greek read it aloud, it would come out as "Fractured Fairytales".

Why is it Volume II? I don't know. Was there a Volume I, or was that another joke?

As I remember, we solicited stories from the students at the high school, selected a set of them, typed them up in typing class, and then had the magazine printed and bound, and sold it as a fundraiser. Printing probably meant photocopying. A couple of students did drawings to accompany some of the stories.

We also had a set of ads. We sold ads to local merchants to help pay the cost of publishing the magazine.

The idea is to bring this back into print, using print-on-demand. So basically we want to create a PDF file of the magazine, using the same techniques as we used for the Oak Island book and the Petri Net book.

The main problem is the poor quality of the printed images. Remember that the original was produced on a typewriter, probably with a cloth ribbon, so the characters are not very well defined.

We were able to get around this, in part, by writing code that "knew" we were dealing with a fixed-width typewriter font. Once we had found the start of a line, we could predict where the character boxes would be: each character was about 50 x 80 pixels, aligned vertically and horizontally.
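As a rough illustration of predicting character boxes on a fixed-width grid, here is a minimal Python sketch. It is not the code we actually used. It assumes that the 50 x 80 figure means 50 pixels wide by 80 pixels high, and the function name `predict_line_boxes` and its parameters are made up for this example.

```python
from typing import List, Tuple

# Approximate cell size from the text; treating 50 as the width and
# 80 as the height is an assumption of this sketch.
CHAR_W = 50
CHAR_H = 80

def predict_line_boxes(x0: int, y0: int, n_chars: int,
                       char_w: int = CHAR_W,
                       char_h: int = CHAR_H) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) cells for a line starting at (x0, y0).

    Every cell has the same size, so the i-th cell is just the first
    cell shifted right by i character widths.
    """
    return [(x0 + i * char_w, y0, x0 + (i + 1) * char_w, y0 + char_h)
            for i in range(n_chars)]

boxes = predict_line_boxes(100, 200, 3)
```

Each predicted cell can then be compared with what the scan actually shows at that position, which only works as long as the lines are not skewed.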

Or they would be, if things were not skewed. Several pages, however, had significant skew: the lines were not parallel to the edges of the paper. Specifically, pages 18, 62, 13, 47, 32, 38, 22, 67, 63, 37 all had significant skew. So I wrote a program to try to detect the skew and correct for it.

We had run into this problem before on one of the books, and had a program that computed the left and right margins, which should be parallel to each other and to the left and right edges of the paper. But this book was written with a typewriter, so the right margin is ragged and that didn't work. Even the left margin is not really aligned. An "m" character and an "i" character are each centered in their box, rather than being aligned to the left margin, so even the left margin varies too much to be used to find skew.

We tried dividing each line into five parts and computing the average baseline position of the characters in the first fifth and the last fifth. For a proper line, there would be no difference; for a skewed line, one end would be higher (or lower) than the other. But this approach is more difficult for lines with lots of characters that have descenders (like g, j, p, q, y).
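The first-fifth versus last-fifth comparison might look something like the following Python sketch. The input format (a list of `(x, baseline_y)` pairs for one line) and the name `fifths_offset` are assumptions made for illustration, not what our program actually used.

```python
from statistics import mean
from typing import List, Tuple

def fifths_offset(chars: List[Tuple[float, float]]) -> float:
    """Mean baseline of the last fifth of a line minus that of the first fifth.

    chars holds one (x, baseline_y) pair per character on the line.
    A result near zero suggests a level line; a clearly positive or
    negative result suggests the line is tilted.
    """
    if not chars:
        raise ValueError("need at least one character")
    ordered = sorted(chars)            # left to right by x
    k = max(1, len(ordered) // 5)      # number of characters in one fifth
    first = mean(y for _, y in ordered[:k])
    last = mean(y for _, y in ordered[-k:])
    return last - first

# Ten characters whose baseline drops one pixel per character.
tilted = [(i * 50, 100 + i) for i in range(10)]
offset = fifths_offset(tilted)
```

A descender in either fifth pulls that fifth's average down, which is why this measure gets unreliable on lines with many such characters.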

So the approach we ended up with was to pick a character which is a simple box with no descender, like an n or x, and find the first and last one on a line. These should have the same vertical position. If they are significantly offset, that defines the skew. Compute the difference for each line, and average to get the skew for a page. Once we have the skew, we can adjust the boxes up (or down) as defined by the skew, to put them all on the same baseline.
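Here is a minimal Python sketch of that idea, under some stated assumptions. The `Box` class and the function names are invented for the example, and coordinates are taken to have y increasing downward. One deliberate variation: rather than averaging the raw vertical difference, the sketch divides it by the horizontal distance between the two reference characters, so that lines of different lengths can be averaged together.

```python
from dataclasses import dataclass, replace
from statistics import mean
from typing import List, Optional

REFERENCE_CHARS = set("nx")  # simple box-shaped letters with no descender

@dataclass(frozen=True)
class Box:
    char: str
    left: int
    top: int
    right: int
    bottom: int   # y grows downward, so bottom is the baseline side

def line_slope(line: List[Box]) -> Optional[float]:
    """Vertical offset per horizontal pixel between first and last reference char."""
    refs = [b for b in line if b.char in REFERENCE_CHARS]
    if len(refs) < 2:
        return None                      # not enough evidence on this line
    first = min(refs, key=lambda b: b.left)
    last = max(refs, key=lambda b: b.left)
    dx = last.left - first.left
    if dx == 0:
        return None
    return (last.bottom - first.bottom) / dx

def page_slope(lines: List[List[Box]]) -> float:
    """Average the per-line slopes; 0.0 if no line had two reference chars."""
    slopes = [s for s in map(line_slope, lines) if s is not None]
    return mean(slopes) if slopes else 0.0

def deskew_line(line: List[Box], slope: float) -> List[Box]:
    """Shift each box vertically so the whole line sits on one baseline."""
    if not line:
        return []
    x0 = min(b.left for b in line)
    out = []
    for b in line:
        shift = round(slope * (b.left - x0))
        out.append(replace(b, top=b.top - shift, bottom=b.bottom - shift))
    return out

line = [Box("n", 0, 0, 50, 80), Box("a", 50, 1, 100, 81), Box("x", 100, 2, 150, 82)]
slope = page_slope([line])
fixed = deskew_line(line, slope)
```

In this toy line the baseline drops 2 pixels over 100 pixels of width; after `deskew_line`, all three boxes share the same bottom edge.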

Once the skew was corrected, and we could define lines of characters, we could check the quality of the OCR. To do that, we wrote a short shell script that searches our box files for a given character. It then uses the positioning information for each match to extract the bits that had been identified, and creates a small image file of just those bits plus a bit of their surroundings (for context). For example, here is one of the files that were created when we looked for all capital K characters.
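The lookup step can be sketched in Python as follows (our actual tool was a shell script, so this is only an illustration). It assumes a Tesseract-style box file, where each line reads `char left bottom right top page` and the origin is the bottom-left corner of the image; our files may differ in detail. The function names and the 20-pixel padding are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

def parse_box_lines(lines: Iterable[str]) -> Iterator[Tuple[str, int, int, int, int]]:
    """Yield (char, left, bottom, right, top) from 'c l b r t [page]' lines."""
    for raw in lines:
        parts = raw.split()
        if len(parts) < 5:
            continue                     # skip blank or malformed lines
        left, bottom, right, top = map(int, parts[1:5])
        yield parts[0], left, bottom, right, top

def crop_rects(lines: Iterable[str], target: str,
               image_w: int, image_h: int,
               pad: int = 20) -> List[Tuple[int, int, int, int]]:
    """Return padded (left, upper, right, lower) rectangles for every `target` char.

    The box file is assumed to measure y from the bottom of the page,
    so y values are flipped to the top-left origin most image
    libraries use, then clamped to the image bounds.
    """
    rects = []
    for char, left, bottom, right, top in parse_box_lines(lines):
        if char != target:
            continue
        rects.append((max(0, left - pad),
                      max(0, image_h - top - pad),
                      min(image_w, right + pad),
                      min(image_h, image_h - bottom + pad)))
    return rects

sample = ["K 100 900 150 980 0", "C 200 900 250 980 0"]
k_rects = crop_rects(sample, "K", image_w=1000, image_h=1000)
```

Each rectangle could then be handed to an image library's crop function to write out the small context image for review.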

Apparently something has mislabeled the C as a K; we can then go correct the box file.

We did this for every character, to make sure that the OCR work was correct.

We also extracted all the words from the text and ran them through a spelling checker, finding another 69 OCR errors and a number of typos in the original printed book.
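A word-list check of this kind can be sketched in a few lines of Python. This is a stand-in for a real spelling checker, not the tool we ran; the function name `unknown_words` and the tiny dictionary are made up for the example. Digits are kept inside words on purpose, since OCR often confuses letters with digits.

```python
import re
from collections import Counter
from typing import Iterable

# A word is a run of letters/digits, optionally joined by apostrophes (don't).
WORD_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")

def unknown_words(text: str, dictionary: Iterable[str]) -> Counter:
    """Count the words in text that are not in the dictionary (case-insensitive)."""
    known = {w.lower() for w in dictionary}
    counts: Counter = Counter()
    for token in WORD_RE.findall(text):
        word = token.lower()
        if word not in known:
            counts[word] += 1
    return counts

suspects = unknown_words("The qu1ck brown fox. The f0x!", {"the", "brown", "fox"})
```

Every flagged word still needs a human look, to decide whether it is an OCR error, a typo in the original, or simply a word missing from the dictionary.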

From our box files, which list where each character image is located, plus the actual scanned images, we can build a font to match our scanned characters. We then use a font editor to clean up the font images, making the size of all characters consistent.

That allows us to recreate the typed text of the magazine almost exactly as it originally was. Then we add in the images, and finally the ad pages.

We used our box file representation of the book to create a PDF of the complete book, plus a cover, and uploaded it to Kindle Direct Publishing. From that we printed a proof copy and, after we corrected some minor problems, the book is now online and available from Amazon.

Republishing Old Books

Saturday, July 11, 2020

My High School Literary Magazine