- We first scanned the book at high resolution (600 dpi) to get an image of each page.
- We applied OCR (Optical Character Recognition) to these pages to get the actual characters on the page. In fact, we used tesseract, an open-source OCR program, to produce both the characters and where on the page they are. The location of each character is given by a "bounding box" which says where the character starts and where it ends, in both the horizontal and vertical dimensions. The locations are given in pixels. (Each page is about 2000 x 3120 pixels. At 600 dpi that is 3.3 x 5.2 inches, not the 6 x 9 expected.)
- Some editing and a lot of error checking then produce a file that lists each character and where it is on the page. We added by hand information about the size and font face where necessary. (We added the "space" character to represent the space between words.)
c p17.gif 510 1234 538 1295 f
c p17.gif 536 1259 572 1295 o
c p17.gif 581 1259 615 1295 u
c p17.gif 627 1259 661 1295 n
c p17.gif 674 1235 710 1295 d
s p17.gif 711 1235 742 1293 space
c p17.gif 743 1235 768 1293 1 italic
c p17.gif 781 1237 817 1295 6 italic
s p17.gif 818 1259 844 1295 space
c p17.gif 845 1259 880 1312 p
c p17.gif 889 1259 925 1295 o
c p17.gif 932 1234 945 1295 i
c p17.gif 955 1259 989 1295 n
Generating a new version of the book then requires only a representation of each of the characters used in the book. So we know that the "f" of the first line above goes horizontally from pixels 510 to 538 and vertically from pixels 1234 to 1295 on page "p17.gif" (so we know it is 29 pixels wide and 62 pixels high), but we still need to know exactly what bits form the "f".
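As a small illustration, here is one way a record from the listing above could be read and its size computed. The field order (left, top, right, bottom) is taken from the description of the "f" above; the names `CharBox` and `parse_record` are made up for this sketch and are not from our actual tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CharBox:
    """One record of the character file (hypothetical name and field names)."""
    kind: str             # "c" for a character, "s" for a space
    page: str             # image file the character was found on
    x0: int               # leftmost pixel column
    y0: int               # topmost pixel row
    x1: int               # rightmost pixel column
    y1: int               # bottommost pixel row
    char: str             # the character itself, or the word "space"
    style: Optional[str] = None  # e.g. "italic", when added by hand

    @property
    def width(self) -> int:
        # Both end pixels belong to the box, hence the +1.
        return self.x1 - self.x0 + 1

    @property
    def height(self) -> int:
        return self.y1 - self.y0 + 1


def parse_record(line: str) -> CharBox:
    """Split one whitespace-separated record into a CharBox."""
    fields = line.split()
    kind, page = fields[0], fields[1]
    x0, y0, x1, y1 = (int(v) for v in fields[2:6])
    char = fields[6]
    style = fields[7] if len(fields) > 7 else None
    return CharBox(kind, page, x0, y0, x1, y1, char, style)


f = parse_record("c p17.gif 510 1234 538 1295 f")
print(f.width, f.height)  # 29 62
```

The `+1` is what turns "510 to 538" into 29 pixels rather than 28: both end columns are part of the character.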
The particular bits that represent an "f" are defined by the font. Different fonts represent characters with different images -- they look different. We can reproduce the book, from our file, in different fonts. As long as an "f" is (no more than) 29x62 pixels, we can place it in the right spot on the (image) page to re-create the book.
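A minimal sketch of that placement step, assuming the page and the glyph are both plain lists of rows of 0/1 bits and that the glyph is aligned to the top-left corner of its bounding box. The alignment rule is an assumption of this sketch (a real font would more likely align on the baseline), and `place_glyph` is a made-up name.

```python
def place_glyph(page, glyph, x0, y0, x1, y1):
    """Draw a glyph bitmap into a page bitmap inside an inclusive bounding box.

    page  -- list of rows, each a list of 0/1 pixels (modified in place)
    glyph -- list of rows of 0/1 pixels for one character
    The glyph is aligned to the top-left corner of the box (an assumption).
    """
    box_w = x1 - x0 + 1
    box_h = y1 - y0 + 1
    glyph_h = len(glyph)
    glyph_w = len(glyph[0]) if glyph else 0
    if glyph_w > box_w or glyph_h > box_h:
        raise ValueError(
            f"glyph is {glyph_w}x{glyph_h} but the box is only {box_w}x{box_h}"
        )
    for dy, row in enumerate(glyph):
        for dx, bit in enumerate(row):
            if bit:
                # Only set black pixels, so a neighbouring character whose
                # box overlaps this one is not erased.
                page[y0 + dy][x0 + dx] = 1


# Tiny example: a 2x2 glyph placed into a 3x3 box on a 5x4 page.
page = [[0] * 5 for _ in range(4)]
place_glyph(page, [[1, 0], [1, 1]], 1, 1, 3, 3)
```

Setting only the black pixels matters for our data: in the listing above, the box of the "f" ends at column 538 while the box of the following "o" starts at column 536, so the two boxes overlap slightly.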
So the problem is to find the "right" font. The book itself says "This book was hand-set in 10 point Caslon Old Style". A quick Google search on "Caslon" shows that William Caslon was a typeface designer in England from about 1730. Unfortunately, there are a lot of "Caslon" fonts, and it is not clear what "Caslon Old Style" specifically is.
However, to some degree, we don't need a formal definition of the font -- we have it in front of us in the book. Unfortunately, as we have seen in the previous posting, each instance of the characters, being a particular metal slug in 1953, printed on a particular piece of paper, and then scanned at 600 dpi, appears to be slightly different. Of the 42,272 different character instances we scanned, no more than a dozen are exactly the same as some other character instance.
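To make "exactly the same" concrete, here is a hedged sketch of how such a count could be made, assuming each character instance has already been cut out of the page as a list of rows of 0/1 bits. This is not necessarily how we did the count, and `count_exact_duplicates` is a name invented for this example.

```python
from collections import Counter


def count_exact_duplicates(bitmaps):
    """Return how many bitmaps are bit-for-bit identical to at least one other.

    bitmaps -- iterable of bitmaps, each a list of rows of 0/1 pixels.
    Two bitmaps match only if they have the same size and the same bits.
    """
    # Lists cannot be dictionary keys, so freeze each bitmap into tuples.
    counts = Counter(tuple(tuple(row) for row in bm) for bm in bitmaps)
    return sum(n for n in counts.values() if n > 1)


a = [[1, 0], [1, 1]]
b = [[1, 0], [0, 1]]
print(count_exact_duplicates([a, [row[:] for row in a], b]))  # 2
```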
So we have two options:
- We can try to find an existing well-defined font that looks close to what we have in the current printed copy, or
- We can try to make a representation of the font from the character images that we have.
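For the second option, one simple possibility (an illustration only, not something we have settled on) is a per-pixel majority vote: stack all the scanned instances of a character and keep a pixel black if it is black in more than half of them. The sketch assumes the instances have already been cropped to the same size and lined up, which is itself a non-trivial step; `majority_glyph` is a made-up name.

```python
def majority_glyph(instances):
    """Combine equally sized 0/1 bitmaps of one character by majority vote.

    instances -- non-empty list of bitmaps (lists of rows of 0/1 pixels),
                 all assumed to be the same size and already aligned.
    A pixel is black in the result if it is black in more than half of
    the instances.
    """
    if not instances:
        raise ValueError("need at least one instance")
    height = len(instances[0])
    width = len(instances[0][0])
    for bm in instances:
        if len(bm) != height or any(len(row) != width for row in bm):
            raise ValueError("all instances must have the same size")
    n = len(instances)
    return [
        [1 if 2 * sum(bm[y][x] for bm in instances) > n else 0
         for x in range(width)]
        for y in range(height)
    ]


samples = [
    [[1, 0], [1, 1]],
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
print(majority_glyph(samples))  # [[1, 0], [1, 1]]
```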
We have also found something called LibreCaslonText, an open-source font that looks a lot like the original text. (The representations of "Q" and "J" in particular seem to vary from font to font.) The main problem with this is that it is in TrueType format, which is a vector representation of the font, rather than the bitmap format that we are using. One advantage of the TrueType format is that it represents all sizes (point sizes) in one form, rather than having different bitmaps for the various sizes, as we have. This looks like it will work for approach 1. Converting the TrueType format to BDF format, we can then produce the following for the "The" of the last blog:
The differences are subtle.
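For readers who have not seen BDF, the sketch below writes one glyph entry in that format from a bitmap given as rows of 0/1 bits. In BDF each row is padded on the right to a whole number of bytes and written in hexadecimal. The function name `bdf_glyph` is invented here, and the defaults (10 point, 600 dpi, the glyph sitting on the baseline with no offset) are assumptions made for illustration; a complete BDF file also needs a header and font properties, which are not shown.

```python
def bdf_glyph(name, codepoint, bitmap, point_size=10, dpi=600):
    """Return the BDF text for one glyph.

    bitmap     -- list of rows of 0/1 pixels, top row first
    point_size -- assumed font size in points (10, from the book's colophon)
    dpi        -- assumed resolution (600, from the scan)
    The bounding-box offsets are written as 0 0, i.e. no descender.
    """
    height = len(bitmap)
    width = len(bitmap[0])
    row_bytes = (width + 7) // 8          # bytes needed for one row
    hex_digits = row_bytes * 2
    # SWIDTH is the device width rescaled to 1/1000ths of the point size.
    swidth = round(width * 72000 / (point_size * dpi))
    lines = [
        f"STARTCHAR {name}",
        f"ENCODING {codepoint}",
        f"SWIDTH {swidth} 0",
        f"DWIDTH {width} 0",
        f"BBX {width} {height} 0 0",
        "BITMAP",
    ]
    for row in bitmap:
        value = 0
        for bit in row:
            value = (value << 1) | (1 if bit else 0)
        value <<= row_bytes * 8 - width   # pad on the right to a byte boundary
        lines.append(f"{value:0{hex_digits}X}")
    lines.append("ENDCHAR")
    return "\n".join(lines)


print(bdf_glyph("x", ord("x"), [[1, 0, 1], [0, 1, 0]]))
```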