Another book that I wrote, back in the 70's and 80's, was the Petri Net book. It was published by Prentice-Hall in 1981, and went out of print about ten years later. For this book, I pushed the state of publishing at the time by doing it all in troff. I submitted a magnetic tape of the source of the book, with troff layout macros, and Prentice-Hall found a printer that could phototypeset from that tape.
So I still have the original troff files. Obviously to bring it back into print, I should be able to just run it thru an updated version of troff to generate a PDF file, and it's done! Well, not quite. There are a lot of figures in this book -- after all Petri nets are basically a visual representation of parallelism, with their circles and arrows. I did the original art for the figures, then a professional artist re-did them, and they were pasted into the book before printing. Now we would want them to be merged directly into the PDF. troff does not really do images, which is what scuttled the troff version of the MIX book.
Plus there is front matter (the title pages, the table of contents) and back matter (the Index) that really weren't part of my original troff files (although they could be added). The table of contents and Index, of course, depend heavily on getting the page numbers correct, so if we used either troff or LaTeX, we would run the risk of different fonts, different line and page breaks, and different page numbers.
And, unlike the MIX book, we don't have a lot of corrections to be made to the book.
So, this seems like a good candidate for the programs and techniques used for the Oak Island book -- to reproduce it exactly as it was originally printed. But of course it is a larger book -- about 300 pages -- and much more technical. The MIX book was a book on computer science practice; the Petri Net book is computer science theory, with much more math.
The first step was getting a good scan of the book. We now have, at home, a decent scanner, with auto-feed. So our first attempt was to cut the spine off a copy of the book and feed it into the scanner, using the auto-feeder. A couple of problems came up. First, the auto-feeder uses a different actual scanning mechanism, and only does 300dpi scanning. I think we could have worked with that, but after weeks of work, we found that many of the pages were scanned slightly off-kilter. We tried to adjust those, but eventually had to go to using the flat-bed scanner, with its 600dpi resolution.
Scanning all 300 pages took about 4 hours -- scanning first the odd pages and then the even pages. We used xsane to drive the scanner. xsane has a nice ability to adjust the file names for the pages, first by +2 scanning the odd pages (1, 3, 5, 7, ...) and then by -2 for the even pages (290, 288, 286, ...). We kept the outer corner of the page against the edge of the scanner, so that half the pages were upside down (but that was easy to fix with a short script), which meant we never had to worry about the straightness of the cut-off edge (the inner edge).
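The numbering and the upside-down fix can be sketched in a few lines of Python. This is only an illustration, not the script we actually used: the names scan_order and rotate_180 are made up, and the page is assumed to be a plain list of rows of 0/1 bits rather than a real image file.

```python
def scan_order(last_page):
    """Page numbers in the order they are scanned: odds going up, evens coming down."""
    odds = list(range(1, last_page + 1, 2))      # 1, 3, 5, ... (step +2)
    last_even = last_page - (last_page % 2)      # largest even page number
    evens = list(range(last_even, 0, -2))        # ..., 6, 4, 2 (step -2)
    return odds + evens


def rotate_180(bitmap):
    """Turn an upside-down page right side up: reverse the rows, and each row."""
    return [row[::-1] for row in reversed(bitmap)]


order = scan_order(290)
print(order[:4])        # start of the odd pass
print(order[145:148])   # start of the even pass
```

Which half of the pages needs rotate_180 depends on which scanner edge the outer corner was held against, so that choice is left to the caller.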
Once we had the scanned page images, we used tesseract as an OCR engine to create files of the characters and the boxes that contained them. tesseract is supposed to be the best (free) OCR program, but most of the work (so far, months) has been making sure that we have the correct character in the correct place on each page. This means that the box is correctly positioned and the character is correctly identified; after that, we need to add font, size, face, and positioning information.
We extracted the figures (152 of them) from the scanned images and put each one in its own file. We will need to process them to make them as clean as possible before they are printed, but for now, let us concentrate on the text. Also horizontal lines (rules) needed to be identified and not treated as text.
Two pages (47 and 225) are full page images turned on their sides, which means that the captions are also sideways. We treat the captions as text, so those pages are rotated and will need to be rotated back for final assembly. Actually each of these pages becomes two pages, because while the captions are rotated, the page numbers are not. So we have mostly white space (for the image to be inserted), plus a rotated caption text, and a non-rotated page number.
tesseract, for some reason, does not OCR all the bits on a page -- a handful of characters are just not processed. So we have a program (remains) which searches each page for bunches of bits that are not in any box. If the bunch is big enough, we create a new box and add it to the file, with a "dummy" character name, which I then fill in by hand. We had at least 632 such un-OCRed characters, out of our total of 411,645 characters.
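The idea behind remains can be sketched roughly as follows. This is not the actual program: the function name find_remains, the min_pixels threshold, the inclusive (x0, y0, x1, y1) box layout, and the page as a list of rows of 0/1 bits are all assumptions made for the example. It flood-fills each cluster of black bits that lies outside every known box and reports a bounding box for the clusters that are big enough.

```python
from collections import deque


def find_remains(bitmap, boxes, min_pixels=4):
    """Bounding boxes of black-bit clusters that fall outside every known box.

    bitmap: list of rows, 1 = black, 0 = white
    boxes:  list of (x0, y0, x1, y1) with inclusive corners
    """
    height, width = len(bitmap), len(bitmap[0])

    # Mark every pixel that some existing box already accounts for.
    covered = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(y0, 0), min(y1, height - 1) + 1):
            for x in range(max(x0, 0), min(x1, width - 1) + 1):
                covered[y][x] = True

    seen = [[False] * width for _ in range(height)]
    found = []
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1 or covered[y][x] or seen[y][x]:
                continue
            # Flood fill (8-connected) to collect one cluster of stray bits.
            queue = deque([(x, y)])
            seen[y][x] = True
            pixels = []
            while queue:
                cx, cy = queue.popleft()
                pixels.append((cx, cy))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and bitmap[ny][nx] == 1
                                and not covered[ny][nx]
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            # Only clusters that are big enough become new boxes.
            if len(pixels) >= min_pixels:
                xs = [p[0] for p in pixels]
                ys = [p[1] for p in pixels]
                found.append((min(xs), min(ys), max(xs), max(ys)))
    return found


page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 0, 0],
]
print(find_remains(page, boxes=[(0, 0, 1, 1)]))   # [(4, 1, 5, 2)]
```

The lone bit in the bottom row is below the size threshold and is ignored as a speck; each box that is returned would get the "dummy" character name to be filled in by hand.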
We also have to contend with all the "extended" symbols. tesseract is pretty good at catching ligatures -- fi, ff, fl, ffi, ffl -- but it does not have a clue about the Greek characters that math uses -- alpha, beta, delta, sigma, and so on. We adopt a naming for those using HTML entity names: &alpha; for α and &Sigma; for Σ, for example. It turns out that ligatures have HTML entity names too -- &filig;, &fflig;, and so on -- so we can use those. Also we can tag things like open and close quotes (both double and single), and primes, and ellipsis, and even different meanings of the hyphen/dash -- &hyphen;, &minus;, &ndash;, &mdash;. This works as a general mechanism for identifying any character or box that needs to be treated specially.
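One convenient side effect of using HTML entity names is that standard tools already know them. As a small sketch (not part of our own programs), Python's html.entities.html5 table maps each name to its character; the helper name entity_char is made up for the example.

```python
from html.entities import html5


def entity_char(name):
    """Look up an HTML5 entity name (without the & and ;); None if unknown."""
    return html5.get(name + ";")


for name in ["alpha", "Sigma", "filig", "ffllig", "hyphen", "minus", "ndash", "mdash"]:
    char = entity_char(name)
    # Print the code point so look-alike dashes can be told apart.
    print(name, "U+%04X" % ord(char))
```

Printing the code point rather than the glyph matters most for the dashes, which are nearly indistinguishable on screen.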
Now we can start running programs to identify lines of text and making sure that the boxes are all in order -- top to bottom and left to right. This is the step where it can become obvious that some pages were not scanned properly. We identify lines by scanning each bit row for rows with no black bits -- no characters. We expect that to be the (vertical) space between lines. But on a slanted page, the space between the lines starts low on one side of the page and then rises as we go across the page, so there is no row of white space. We had to rescan some pages at this point. A few pages were actually printed slightly askew on the page, and we had to adjust those by hand.
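Here is a minimal sketch of that row scan, again assuming the page is a list of rows of 0/1 bits; find_text_lines is a made-up name, not our actual program. It also shows why a slanted page fails: with no all-white row between them, two lines of text merge into one.

```python
def find_text_lines(bitmap):
    """Return (top, bottom) row ranges, inclusive, for each run of rows with black bits."""
    lines = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # first inked row of a new text line
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # an all-white row ends the line
            start = None
    if start is not None:                  # text runs to the bottom of the page
        lines.append((start, len(bitmap) - 1))
    return lines


straight = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
slanted = [
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
print(find_text_lines(straight))   # [(1, 1), (3, 3)]
print(find_text_lines(slanted))    # [(1, 4)]
```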
Once we had the files organized, we could then compare the scanned OCRed character stream with our HTML version. I believe the HTML version was created from the original troff code. Converting both to a stream with one character per line, we could use the standard diff utility to identify where tesseract had mis-identified a character and fix it. tesseract has particular trouble with ones and the letter "ell" (1/l), and zeros and "ohs" (0/O), but it also mis-identified all the Greek symbols, for example.
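The same comparison can be sketched in Python, with difflib standing in for the diff utility (a substitution made for this example; we used diff itself). Each stream is a list with one token per character, so an entity name such as &alpha; counts as a single character. The function name char_mismatches is made up.

```python
import difflib


def char_mismatches(reference, ocr):
    """List the places where the OCR stream disagrees with the reference stream.

    Both arguments are lists with one character (or entity name) per element.
    Returns (tag, reference_index, reference_tokens, ocr_tokens) tuples.
    """
    matcher = difflib.SequenceMatcher(None, reference, ocr, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((tag, i1, reference[i1:i2], ocr[j1:j2]))
    return mismatches


html_stream = list("t1 fires")   # what the HTML version says
ocr_stream = list("tl fires")    # tesseract read the one as an "ell"
print(char_mismatches(html_stream, ocr_stream))
# [('replace', 1, ['1'], ['l'])]
```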
Once the character streams for the HTML and OCRed versions were mostly the same, we could augment the HTML version with font information, and transfer that to the OCRed version, getting us to our current representation of:
c gif/p150.gif 1096 786 1124 826 r 10pt i
which identifies a character (c), the file that its image is in (gif/p150.gif), the corners (x,y) of the box that contains it (1096 786 1124 826), the character (r), its point size (10pt), and (in this case) a face code of italic (i). Other lines define images, lines, and meta-information like newlines and spaces.
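For illustration, a character line like the one above could be read with a few lines of Python. This is a sketch under assumptions, not our actual code: it assumes the fields are separated by whitespace, that all nine fields are always present (the face code may well be optional in the real files), and it only handles the c record type. The names CharRecord and parse_char_record are made up.

```python
from dataclasses import dataclass


@dataclass
class CharRecord:
    image: str   # file holding the page image
    x0: int      # first corner of the box
    y0: int
    x1: int      # opposite corner of the box
    y1: int
    char: str    # the character, or an entity name
    size: str    # point size, e.g. "10pt"
    face: str    # face code, e.g. "i" for italic


def parse_char_record(line):
    """Parse one 'c' line; raise ValueError for anything else."""
    fields = line.split()
    if len(fields) != 9 or fields[0] != "c":
        raise ValueError("not a character record: %r" % line)
    _, image, x0, y0, x1, y1, char, size, face = fields
    return CharRecord(image, int(x0), int(y0), int(x1), int(y1), char, size, face)


rec = parse_char_record("c gif/p150.gif 1096 786 1124 826 r 10pt i")
print(rec.char, rec.size, rec.face)        # r 10pt i
print(rec.x1 - rec.x0, rec.y1 - rec.y0)    # 28 40
```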