Saturday, November 19, 2022

Fonts and characters for Gordo Redux

Having cleaned up the pages, more or less, we can then proceed to taking them apart.  Most pages consist of a couple of images (the comic strip images) and a page number.  We wrote a short program to identify and strip out the images, placing each one in a separate file and recording its position on the page in a box file.  For example, page p008 is represented by the box file:

v 2
i 2gif/p008.gif 419 296 4912 1573 p008_1.gif      # 4494x1278
i 2gif/p008.gif 272 1763 4888 3034 p008_2.gif      # 4617x1272
i 2gif/p008.gif 272 3266 4905 4558 p008_3.gif      # 4634x1293
i 2gif/p008.gif 380 4771 4901 6037 p008_4.gif      # 4522x1267

which identifies the 4 strip files (p008_1, p008_2, p008_3, and p008_4) and their positions on page p008.  The "i" at the beginning of each line marks the entry as an "image" to be placed on the page.  The four numbers are the pixel coordinates of the image's corners (left, top, right, bottom), and a trailing comment (# 4617x1272) gives the size of the image.

The other item on this page is the page number.  File p008 is page 3, and so we identify that character, and its location on the page, with one additional line:

c 2gif/p008.gif 2523 6216 2571 6288 3      # 49x73
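A box file in this form is easy to read back in.  The following is only a minimal sketch, not the program we actually used: the names Box and parse_box_file are made up for illustration, and it assumes the corner coordinates are inclusive, which is what the size comments suggest (4912 - 419 + 1 = 4494).

```python
from dataclasses import dataclass


@dataclass
class Box:
    kind: str    # "i" for an image, "c" for a character
    source: str  # scanned page the box was cut from
    x0: int      # left
    y0: int      # top
    x1: int      # right
    y1: int      # bottom
    value: str   # strip file name for "i", the character for "c"

    @property
    def size(self):
        # Corners are assumed inclusive, matching the "# WxH" comments.
        return (self.x1 - self.x0 + 1, self.y1 - self.y0 + 1)


def parse_box_file(text):
    """Return (version, boxes) for a box file in the format shown above."""
    version = None
    boxes = []
    for raw in text.splitlines():
        # Drop the trailing size comment.  Limitation of this sketch:
        # a "c" record whose character is "#" would need special handling.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if fields[0] == "v":
            version = int(fields[1])
        elif fields[0] in ("i", "c"):
            kind, source, x0, y0, x1, y1, value = fields
            boxes.append(Box(kind, source, int(x0), int(y0),
                             int(x1), int(y1), value))
        else:
            raise ValueError(f"unknown record type: {fields[0]!r}")
    return version, boxes


SAMPLE = """\
v 2
i 2gif/p008.gif 419 296 4912 1573 p008_1.gif      # 4494x1278
c 2gif/p008.gif 2523 6216 2571 6288 3      # 49x73
"""

version, boxes = parse_box_file(SAMPLE)
for box in boxes:
    print(box.kind, box.value, box.size)
```

Recomputing the size from the corners and comparing it with the comment is a cheap sanity check on hand-edited lines.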

We obtain the location and digits of the page numbers by stripping out all the images and feeding the remaining scanned page into tesseract, an open-source OCR program.  Tesseract is not happy scanning just a character or two, so there is quite a bit of hand work to get all the page numbers in the right places, but once we are done with that we can check that all the page numbers are correct.
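If Tesseract is asked for character boxes, each output line has the form "symbol left bottom right top page", with the origin at the bottom-left corner of the image.  Our "c" lines measure from the top-left instead, so the rows have to be flipped.  Here is a rough sketch of that conversion.  The function name and the page height are hypothetical, and treating Tesseract's right and top values as exclusive edges is an assumption that should be checked against a real page before relying on it.

```python
def tesseract_box_to_c_line(box_line, page_height, source):
    """Convert one Tesseract box line to a "c" record with a top-left origin.

    Assumption: Tesseract's right/top are exclusive edges, while the "c"
    record stores inclusive corners, hence the "- 1" adjustments below.
    """
    symbol, left, bottom, right, top, _page = box_line.split()
    left, bottom, right, top = int(left), int(bottom), int(right), int(top)

    x0, x1 = left, right - 1
    # Flip the vertical axis: Tesseract counts rows up from the bottom.
    y0 = page_height - top
    y1 = page_height - bottom - 1

    width, height = x1 - x0 + 1, y1 - y0 + 1
    return f"c {source} {x0} {y0} {x1} {y1} {symbol}      # {width}x{height}"


# Hypothetical input: a "3" found on a page assumed to be 6462 pixels tall.
print(tesseract_box_to_c_line("3 2523 100 2572 173 0", 6462, "2gif/p008.gif"))
```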

Arriola used a personal style for numbering pages.  As is common, the front matter -- the title page, dedication, foreword -- is not numbered. So page 1 starts in file p006.  Normally every page is then numbered, with page 1 on the right-hand side of a book.  Here, however, page 1 is a left-hand page.  Since every spread has a left-hand page and a right-hand page, all left-hand pages should then be odd-numbered, and all right-hand pages even-numbered.

But Arriola does not apply page numbers to blank pages.  Page 9 (p014) is the end of a section, and is an odd-numbered, left-hand page.  To start the next section on a left-hand page, a blank page is inserted for the right-hand page, and then the next section starts.  Normally the blank page would be page 10, and the new section would start on page 11.  But Arriola skips the page number for the blank page and starts the next section as page 10, which means we now have an even-numbered left-hand page.

This problem occurs in three places: p015, p055, p079.  This means that page 10, page 49, and page 72 should actually be on the blank pages before them.

We have two options: We can just reproduce the book as Arriola numbered it, or we can renumber the pages to match standard book page design.  Note that even if we follow standard design and number the blank pages, we don't have to actually print the page number on the otherwise blank page.  
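The two numbering schemes are easy to compare in code.  This is just a sketch under some assumptions: that there is exactly one scan file per page, that p006 carries page 1, and that p015, p055 and p079 are the files for the three blank pages.  The function names are ours, chosen for illustration.

```python
FIRST_NUMBERED_FILE = 6      # p006 carries page 1
BLANK_FILES = (15, 55, 79)   # assumed: p015, p055, p079 are the blank pages


def arriola_number(file_index):
    """Page number as printed in the book, or None if the page has none."""
    if file_index < FIRST_NUMBERED_FILE or file_index in BLANK_FILES:
        return None
    # Every blank page before this file was skipped in the count.
    skipped = sum(1 for blank in BLANK_FILES if blank < file_index)
    return file_index - FIRST_NUMBERED_FILE + 1 - skipped


def standard_number(file_index):
    """Page number if blank pages are counted like any other page."""
    if file_index < FIRST_NUMBERED_FILE:
        return None
    return file_index - FIRST_NUMBERED_FILE + 1


for f in (8, 14, 15, 16):
    print(f"p{f:03d}: Arriola {arriola_number(f)}, standard {standard_number(f)}")
```

For p016 this gives 10 under Arriola's scheme and 11 under the standard one, which is the mismatch described above.  Even with standard numbering, nothing needs to be drawn on a blank page.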

While contemplating this problem, we can go ahead and use the identified page numbers to generate a font for these characters. This font has the ten decimal digits.  When we generate our book, we use the font characters to replace the scanned images for the page numbers, so that every "2" looks exactly the same.

If we decide to renumber the pages, we will generate the new page numbers from this font, so they will look the same as the original page numbers.

The one other issue with text in the book is the Foreword, file p004.  The foreword is a typewritten page from Mell Lazarus.  That page is all text, and is the only text in the book (except the page numbers).  We go back to the original scanned image and feed that into tesseract to OCR the characters.  Again there are missing characters and mis-identified characters, but with some work, we get a box file identifying each character and where it goes on the page.

The Foreword page is typewritten, which is a different typeface from the page numbers, so we again use the characters from the page to generate a font.  However, faced with the need to go back and edit those characters (with a limited number of characters, some of the automatically generated characters in the font are rather rough), I realized that we have a typewriter font already from the Willamette High School literary magazine, and so decided to try that.  It works! So now we have that page done.  There is some work to make sure that each character is sitting on the baseline perfectly, since scanned characters can easily be up or down a pixel or two.
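One way to take out that pixel or two of jitter is to pick a single baseline per line of text and move every character that should sit on it.  The sketch below is an illustration only, not the procedure we followed.  It assumes the characters have already been grouped into lines, that y grows downward as in the box files, and that plain letters and digits without descenders are the ones that rest on the baseline.  Everything else (descenders, quotes, hyphens, commas) is left where it is.

```python
import string
from statistics import median_low

# Assumed set of characters whose bottom edge rests on the baseline.
ON_BASELINE = set(string.ascii_letters + string.digits + ".") - set("gjpqyQ")


def snap_to_baseline(chars):
    """Align one line of text.

    chars is a list of (char, x0, y0, x1, y1) tuples with inclusive corners.
    Returns a new list in which every baseline character has the same bottom.
    """
    bottoms = [y1 for ch, _, _, _, y1 in chars if ch in ON_BASELINE]
    if not bottoms:
        return list(chars)
    # median_low returns an actual data value, so the baseline stays an integer.
    baseline = median_low(bottoms)

    aligned = []
    for ch, x0, y0, x1, y1 in chars:
        if ch in ON_BASELINE:
            dy = baseline - y1  # shift up or down, keeping the glyph height
            aligned.append((ch, x0, y0 + dy, x1, y1 + dy))
        else:
            aligned.append((ch, x0, y0, x1, y1))
    return aligned


line = [("T", 0, 10, 8, 30), ("h", 10, 11, 18, 31),
        ("e", 20, 10, 28, 30), ("y", 30, 15, 38, 36)]
print(snap_to_baseline(line))
```

Using the median rather than the mean keeps one badly placed character from dragging the whole line with it.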

As we are processing p004, we notice two typos on the page: (1) a space before a comma in one place and (2) a use of a semi-colon which should be a colon.  Since these are simple typos, we correct them.

