Sunday, March 3, 2019

Republishing the Petri Net book, Fonts

The font information is defined by the book's style sheet.  A skilled book designer at Prentice-Hall sat down and decided what the book would look like.  This involved deciding all sorts of things:

1. The page size of the book: 6 inches by 9 inches.
2. The size of the margins and the resulting size of the printed page.
3. What the running heads and page numbers look like.
4. The fonts and sizes for the various parts.

Normal text is 10 point Times Roman.
Figure captions are 8 point.
The Bibliography is 9 point.
Running heads are 8 point.
Page numbers are 8 point bold.
Chapter titles are 18 point Modula.
Section titles are 11 point Times Roman bold.
Subsection titles are 10 point Bold.
Subsubsection titles are 10 point Bold Italic.

The Chapter titles may or may not be 18 point.  The book design says the chapter numbers are 1/4 inch high, which at 72 points per inch would make them 18 point.  The Titles are supposed to be 1/8 inch high, which would be 9 point, but clearly are not.

Partly this is a question of what is supposed to be 1/8 inch high. The capital letters? Or the lower-case letters? And which of them? All capital letters (except Q) and all digits (0-9) are the same height in our fonts. For lower-case letters, those with ascenders (bdfhkl) should be the same height as capital letters and digits. Characters with descenders (gpqy) should be the same height as characters with ascenders (but shifted down on the line, of course). The compact lower-case letters with neither ascenders nor descenders (cemnorsuvwxz) are shorter than the others. And then the remaining characters (aijt) are just different. And for a particular design, other characters may also be "different". For example, 'a' is sometimes a variant of an 'o' and other times more like an upside-down '9'. The design of 'g' and 'q' may also vary significantly.

And then the italic and bold forms may also be different.

But if we stick with the height of digits and (most) capital letters, then it seems that our most common text, which is supposed to be set as 10 point, is about 60 dots high. At 600 dpi, that works out to about 6 dots of capital-letter height per point of nominal size. So our Chapter numbers are, empirically, from 160 to 168 dots high, or about 27 points, while the chapter titles (measuring the height of the capital letters) are 112 or 113 dots high, or about 18 points.

Note that it does not really matter what we call the font used for our Chapter titles -- we can call it 17 points or 18 points or 19 points -- as long as we use the same name for all characters of the same height (and same font family and font face).  We are going to reproduce the original characters just as they are in the printed book.  Although it would be nice if we used the same names/sizes as in the book design document.

So empirically, we have:

11 point -- Section heading numbers, titles
10 point -- normal text, section, subsection titles
 9 point -- bibliography
 8 point -- figure captions, page numbers, running heads, sub/superscripts for normal text.
 7 point -- All caps tags for DEFINITIONS, THEOREMS, LEMMAS, sub/superscripts for bibliography or figure captions
 6 point -- second level sub/superscripts

Actually we are having some difficulty with superscripts and subscripts.  None of the book design documents say what they should be.  Other documents suggest that for a 10 point text, a subscript should be 7 point and then a second level subscript should be 5 point.  But our smallest second level subscripts seem more like 6 point, so we suspect that subscripts (and superscripts) should be 8 point.

One way to try to determine this is to look at the actual characters on the page.  On page 23, for example, we have superscripts of 0, 1, and 2 (last paragraph) which are 43 or 44 dots high, which (at 6 dots per point) would be 7 point.  There are also second level subscripts of 0, 1, and 2 which are 37 dots, or 6 points. So this example suggests that for 10 point text, subscripts should be 7 point, and second level subscripts should be 6 point.
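The arithmetic is simple enough to write down as a small Python sketch (the 6 dots per point is our empirical factor for this scan, not a general constant):

# Convert a measured character height in scanned dots to a nominal point
# size, using the empirical 6 dots (of capital-letter height) per point
# from the 10 point body text, which measures about 60 dots high at 600 dpi.
DOTS_PER_POINT = 6

def dots_to_points(height_in_dots):
    return height_in_dots / DOTS_PER_POINT

print(round(dots_to_points(43)))   # superscripts on page 23 -> 7 point
print(round(dots_to_points(37)))   # second level subscripts -> 6 point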

Our current approach is to build an ideal image of a particular character  (for a given point size, font family, and font face) by averaging together all the bits of all the instances of that character that we have.  So we take all the 8 point, italic, Times Roman instances of "e" and average them together.  Since we have 730 instances of these,  the law of large numbers suggests that errors or deviations in a few of those will not matter.  (And we will review and edit the fonts produced anyway.)
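Roughly, the averaging step looks like this sketch (assuming each instance has already been clipped to its box and padded to a common size; the helper names and the 50% threshold are our own choices for illustration):

import numpy as np
from PIL import Image

def load_ink(path):
    """Load one scanned character instance as a boolean ink mask (True = black).
    Assumes the instance has already been clipped out of the page image and
    padded/aligned to a common size for its character class."""
    gray = np.array(Image.open(path).convert("L"))
    return gray < 128

def ideal_glyph(instance_paths, threshold=0.5):
    """Average every instance of one character (same point size, family, and
    face) and keep a pixel black if at least `threshold` of them have ink there."""
    stack = np.stack([load_ink(p) for p in instance_paths]).astype(float)
    return stack.mean(axis=0) >= threshold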

But to check, after we have defined our ideal 8 point italic Times Roman "e", we run back over all 730 of the specific instances and compute the difference between each instance and the ideal, looking for any large differences. A large difference suggests that that particular instance may be something other than an 8 point italic Times Roman "e" -- it might be a different size or a different font face. In fact, we also check to see whether that specific "e" matches a different point size more closely than it does the assigned point size.
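And the checking pass, again only as a sketch reusing load_ink from above (the mismatch score and the 25% cutoff are illustrative, not necessarily what the real program uses):

def mismatch(instance, ideal):
    """Fraction of pixels where one instance disagrees with the ideal glyph."""
    return float(np.mean(instance != ideal))

def suspicious_instances(instance_paths, ideal, limit=0.25):
    """Flag instances that differ too much from the ideal -- they may really be
    a different point size or face, and deserve a second look (or a comparison
    against the ideals for neighboring point sizes)."""
    scored = [(mismatch(load_ink(p), ideal), p) for p in instance_paths]
    return sorted([sp for sp in scored if sp[0] > limit], reverse=True)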

This approach fixes most of our issues, but also exposes that the printed book is highly analog.  Many characters are "broken".

Other characters are the wrong size. For example, the subscript 'j' in the Index entry for "Maximal tj-dead submarkings" (page 284) should be a 7 point italic j, but to all appearances is a 10 point italic j. The horizontal ellipsis in the caption for Figure 8.10 should be 8 point, but empirically is 10 point (as are the commas before and after it). The text at the top of pages 268 and 274 is the wrong point size (it should be 9 point, not 10 point) and has the wrong margins.


Republishing the Petri Net book, Part 1

Another book that I wrote, back in the 70's and 80's, was the Petri Net book.  It was published by Prentice-Hall in 1981, and went out of print about ten years later.  For this book, I pushed the state of publishing at the time by doing it all in troff.  I submitted a magnetic tape of the source of the book, with troff layout macros, and Prentice-Hall found a printer that could phototypeset from that tape.

So I still have the original troff files.  Obviously to bring it back into print, I should be able to just run it thru an updated version of troff to generate a PDF file, and it's done!  Well, not quite.  There are a lot of figures in this book -- after all Petri nets are basically a visual representation of parallelism, with their circles and arrows.  I did the original art for the figures, then a professional artist re-did them, and they were pasted into the book before printing.  Now we would want them to be merged directly into the PDF.  troff does not really do images, which is what scuttled the troff version of the MIX book.

Plus there is front matter (the title pages, the table of contents) and back matter (the Index) that really wasn't part of my original troff files (although it could be added). The table of contents and Index, of course, depend heavily on getting the page numbers correct, so if we used either troff or LaTeX, we would run the risk of different fonts, different line and page breaks, and different page numbers.

And, unlike the MIX book, we don't have a lot of corrections to be made to the book.

So, this seems like a good candidate for the programs and techniques used for the Oak Island book -- to reproduce it exactly as it was originally printed.  But of course it is a larger book -- about 300 pages -- and much more technical.  The MIX book was a book on computer science practice; the Petri Net book is computer science theory, with much more math.

The first step was getting a good scan of the book.  We now have, at home, a decent scanner with auto-feed.  So our first attempt was to cut the spine off a copy of the book and feed it into the scanner, using the auto-feeder.  A couple of problems came up.  First, the auto-feeder uses a different scanning mechanism and only does 300 dpi scanning.  I think we could have worked with that, but after weeks of work we found that many of the pages were scanned slightly off-kilter.  We tried to adjust those, but eventually had to switch to the flat-bed scanner, with its 600 dpi resolution.

Scanning all 300 pages took about 4 hours -- scanning first the odd pages and then the even pages.  We used xsane to drive the scanner.  xsane has a nice ability to step the file names for the pages, incrementing by +2 while scanning the odd pages (1, 3, 5, 7, ...) and then by -2 for the even pages (290, 288, 286, ...).  We kept the outer corner of the page against the edge of the scanner, so that half the pages were upside down (but that was easy to fix with a short script), which meant we never had to worry about the straightness of the cut-off edge (the inner edge).
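The flip itself could be a couple of lines per file; a sketch using Pillow (the file-name pattern for which pages came out upside down is assumed here, not how we actually named them):

import glob
from PIL import Image

# Half the scanned pages went through the scanner upside down, so
# rotate those image files 180 degrees in place.
for path in sorted(glob.glob("scans/even/*.png")):
    Image.open(path).rotate(180).save(path)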

Once we had the scanned page images, we used tesseract as an OCR engine to create files of the characters and the boxes that contained them. tesseract is supposed to be the best (free) OCR program, but most of the work (so far, months) has been making sure that we have the correct character in the correct place on each page.  This means that the box is correctly positioned and the character is correctly identified, and then we needed to add in font, size, face, and positioning information.
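For reference, tesseract's box output is what gives us the characters and their boxes; a sketch of driving it and reading the result (the makebox configs are tesseract's standard ones, but the file names here are made up):

import subprocess

def ocr_boxes(image_path, base="page"):
    """Run tesseract in box mode and return a list of
    (char, left, bottom, right, top, page) tuples.  Box coordinates
    have their origin at the bottom-left corner of the image."""
    subprocess.run(["tesseract", image_path, base, "batch.nochop", "makebox"],
                   check=True)
    boxes = []
    with open(base + ".box", encoding="utf-8") as f:
        for line in f:
            ch, left, bottom, right, top, page = line.split()
            boxes.append((ch, int(left), int(bottom), int(right), int(top), int(page)))
    return boxes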

We extracted the figures (152 of them) from the scanned images and put each one in its own file.  We will need to process them to make them as clean as possible before they are printed, but for now let us concentrate on the text.  Horizontal lines (rules) also needed to be identified and not treated as text.
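One plausible way to find the rules, sketched below, is to look for rows containing a run of ink far wider than any character could be (the 300 pixel threshold is a guess for 600 dpi, not a measured value):

def find_rules(ink, min_width=300):
    """Return the rows of a page that contain a horizontal run of ink at
    least `min_width` pixels long -- almost certainly a rule, not text.
    `ink` is a boolean page image (True = black pixel)."""
    rule_rows = []
    for y in range(ink.shape[0]):
        run = best = 0
        for black in ink[y]:
            run = run + 1 if black else 0
            best = max(best, run)
        if best >= min_width:
            rule_rows.append(y)
    return rule_rows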

Two pages (47 and 225) are full-page images turned on their sides, which means that the captions are also sideways.  We treat the captions as text, so those pages are rotated and will need to be rotated back for final assembly.  Actually, each of these pages becomes two pages, because while the caption is rotated, the page number is not.  So we have mostly white space (for the image to be inserted), plus a rotated caption and a non-rotated page number.

tesseract, for some reason, does not OCR all the bits on a page -- a handful of characters are just not processed.  So we have a program (remains) which searches each page for bunches of bits that are not in any box.  If the bunch is big enough, we create a new box and add it to the file, with a "dummy" character name, which I then fill in by hand.  We had at least 632 such un-OCRed characters, out of our total of 411,645 characters.
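The idea behind remains is roughly this (a sketch only -- scipy's connected-component labelling stands in for whatever the real program does, and the size cutoff is invented):

import numpy as np
from scipy import ndimage

def remains(ink, boxes, min_pixels=20):
    """Find blobs of ink that fall outside every OCR box.
    `ink` is a boolean page image (True = black); `boxes` are
    (left, top, right, bottom) rectangles already converted to
    image coordinates (origin at the top left)."""
    uncovered = ink.copy()
    for left, top, right, bottom in boxes:
        uncovered[top:bottom, left:right] = False    # erase ink already claimed by a box
    labels, count = ndimage.label(uncovered)         # connected components of the leftovers
    missed = []
    for i in range(1, count + 1):
        ys, xs = np.nonzero(labels == i)
        if len(ys) >= min_pixels:                    # big enough to be a missed character
            missed.append((int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())))
    return missed    # candidate boxes to add to the file with a "dummy" character name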

We also have to contend with all the "extended" symbols.  tesseract is pretty good at catching ligatures -- fi, ff, fl, ffi, ffl -- but it does not have a clue about the Greek characters that math uses -- alpha, beta, delta, sigma, and so on.  We adopt a naming for those using HTML entity names, &alpha; for α and &Sigma; for Σ, for example.  It turns out that ligatures have HTML entity names too -- &filig;, &fflig;, and so on -- so we can use those.  We can also tag things like open and close quotes (both double and single), primes, the ellipsis, and even the different meanings of the hyphen/dash -- &hyphen;, &minus;, &ndash;, &mdash;.  This works as a general mechanism for identifying any character or box that needs to be treated specially.
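Conveniently, Python's standard library already knows these entity names, so the tags are easy to decode; a small illustration:

import html

# The entity names we use as tags decode straight to the glyphs we mean.
for name, expected in [("&alpha;", "\u03b1"),   # Greek small alpha
                       ("&Sigma;", "\u03a3"),   # Greek capital sigma
                       ("&filig;", "\ufb01"),   # the fi ligature
                       ("&hellip;", "\u2026"),  # horizontal ellipsis
                       ("&ndash;", "\u2013"),   # en dash
                       ("&mdash;", "\u2014")]:  # em dash
    assert html.unescape(name) == expected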

Now we can start running programs to identify lines of text and make sure that the boxes are all in order -- top to bottom and left to right.  This is the step where it can become obvious that some pages were not scanned properly.  We identify lines by scanning each bit row for rows with no black bits -- no characters.  We expect those to be the (vertical) space between lines.  But on a slanted page, the space between the lines starts low on one side of the page and rises as we go across, so there is no all-white row.  We had to rescan some pages at this point.  A few pages were actually printed slightly askew on the page, and we had to adjust those by hand.
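The line finder is essentially a horizontal projection profile; a sketch of the idea (where `ink` is a boolean page image, True for black, as in the earlier sketches):

def text_lines(ink):
    """Split a page into text lines by looking for scan rows with no black
    bits.  Returns (top, bottom) row ranges, one per line of text.  On a
    slanted page the all-white gaps disappear, which is how bad scans show up."""
    has_ink = ink.any(axis=1)             # does this row contain any black bits?
    lines, start = [], None
    for y, black in enumerate(has_ink):
        if black and start is None:
            start = y                     # first inked row of a new line
        elif not black and start is not None:
            lines.append((start, y))      # a blank row ends the current line
            start = None
    if start is not None:
        lines.append((start, len(has_ink)))
    return lines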

Once we had the files organized, we could then compare the scanned OCRed character stream with our HTML version.  I believe the HTML version was created from the original troff code.  Converting both to a stream with one character per line, we could use the standard diff utility to identify where tesseract had mis-identified characters and fix them.  tesseract has particular trouble with ones and the letter "ell" (1/l) and with zeros and "ohs" (0/O), but it also mis-identified all the Greek symbols, for example.
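A sketch of that comparison, using Python's difflib in place of the command-line diff (file names assumed; both files have one character per line):

import difflib

def compare_streams(ocr_path, html_path):
    """Show where the OCRed character stream disagrees with the stream
    derived from the HTML version (both files: one character per line)."""
    with open(ocr_path, encoding="utf-8") as f:
        ocr = f.read().splitlines()
    with open(html_path, encoding="utf-8") as f:
        ref = f.read().splitlines()
    for line in difflib.unified_diff(ocr, ref, fromfile=ocr_path,
                                     tofile=html_path, lineterm=""):
        print(line)   # e.g. "-1" where tesseract saw a one, "+l" for the real ell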

Once the character streams for the HTML and OCRed versions were mostly the same, we could augment the HTML version with font information, and transfer that to the OCRed version, getting us to our current representation of:

c gif/p150.gif 1096 786 1124 826 r 10pt i

which identifies a character (c), the file that its image is in (gif/p150.gif), the corners (x,y) of the box that contains it (1096 786 1124 826), the character (r), its point size (10pt), and (in this case) a face code of italic (i).  Other lines define images, lines, and meta-information like newlines and spaces.
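A sketch of a parser for these "c" records (the field order follows the example above; which coordinate is which corner, and the handling of the other record types, are assumptions here):

from dataclasses import dataclass

@dataclass
class CharRecord:
    image: str    # scanned page file, e.g. gif/p150.gif
    x1: int       # box corners as given in the record
    y1: int
    x2: int
    y2: int
    char: str     # the character itself, e.g. "r" (or an entity name like "&alpha;")
    points: int   # point size, e.g. 10
    face: str     # face code, e.g. "i" for italic, "" if none given

def parse_record(line):
    """Parse a 'c' line like:  c gif/p150.gif 1096 786 1124 826 r 10pt i
    Lines for images, rules, and meta-information start with other codes."""
    fields = line.split()
    if not fields or fields[0] != "c":
        return None
    face = fields[8] if len(fields) > 8 else ""
    return CharRecord(fields[1],
                      int(fields[2]), int(fields[3]),
                      int(fields[4]), int(fields[5]),
                      fields[6],
                      int(fields[7].rstrip("pt")),
                      face)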