Tuesday, March 31, 2015

Problems with PDF

It will not be easy to generate PDF.  I am working from the ISO standard document defining PDF, PDF 32000-1:2008.  But it is not at all clear how PDF works, or rather, how to make it work.

I have all the gross structure in place -- the objects, the object reference table, the document structure, the pages, and so on.  The problem comes down just to putting the characters where they are supposed to be.

The main problem is the units.  The space we are working with is 600 dots per inch, with the origin of the page in the upper left hand corner (positive x goes to the right, positive y goes down).   The space for PDF has the origin at the lower left hand corner (positive y goes up).  I can deal with that.  But defining the units of the numbers is not working.

PDF uses its default units as 1/72 of an inch.  So if we move to (100,100) that is 100/72 inches in each direction. So our 600 dpi resolution needs to be converted to put it in PDF 72 dpi resolution.  I could do that.  Actually PDF has a property UserUnit that appears to do this automatically.  UserUnit is defined as:
 A positive number that shall give the size of default user space units, in multiples of 1⁄72 inch. 
So I should be able to define my UserUnit as 72/600 (or 0.12), which is the size of my user space units as a multiple of 1/72 inch.  But doing so has no effect.  I can add it to my PDF or not and the resulting PDF image is identical. I've even looked at the source code of a program that reads PDF files; the code recognizes this property, reads the value in, stores the value in a variable, and never uses that variable at all.
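If UserUnit cannot be relied on, the conversion can be done by hand before any number is written to the file. The following is a minimal sketch of that conversion, not code from this project: the function name is made up, and the page height of 3120 pixels is an assumed value that would really come from the page image.

```python
DPI = 600            # resolution of our page coordinates
PT_PER_INCH = 72     # PDF default user space: 1 unit = 1/72 inch

def to_pdf_point(x_px, y_px, page_height_px):
    """Convert a 600 dpi pixel position (origin top left, y down)
    to PDF default user space (origin bottom left, y up)."""
    scale = PT_PER_INCH / DPI                  # 0.12 points per pixel
    x_pt = x_px * scale
    y_pt = (page_height_px - y_px) * scale     # flip the y axis
    return x_pt, y_pt

# Example: a character at pixel (416, 2654) on a page assumed to be 3120 pixels high
print(to_pdf_point(416, 2654, 3120))
```

Doing the flip and the scaling in one place keeps every other part of the generator working in the original pixel coordinates.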

This is compounded by several places where the size of the user unit seems to be arbitrarily changed.   For example, the TJ command allows characters to be put down, one after another, with possible adjustments between each character. After a character is placed, the position where the next character will be placed is automatically increased by the width of the last character.  So we put down an "A" and automatically move over the width of an "A", to be ready for the next character.

In our case, we have the position of each character in absolute terms, so we know that "A" goes at "416,2654", and the following "B" goes at "474,2654".  Since an "A" is 58 units wide, that works perfectly.  But later we have an "A" at "1853,882", and a following "d" at "1912,882".  After the "A", we will be at 1911, not 1912, so we need to move one more unit to the right.  The TJ command lets us do this, but the units for the "adjustment" to the position "shall be expressed in thousandths of a unit of text space".  So we don't move by "1", but by "1000"?  But this does not appear to work either.

Another approach to the units problem is to scale using the text matrix.  The text matrix defines how to scale (and translate) text.  I should be able to define the scaling as 72/600 to convert from 600 dpi to 72 dpi units.

The same problem comes up when defining Type 3 fonts.  With Type 3 fonts, we define a small graphic function for each character that defines what that character looks like.  The units for these functions are defined by the FontMatrix which defines the scaling between glyph space and the text space.  Since all our numbers are in units of 600 dpi, we don't need these two spaces to have different units, so we define the scaling to be 1, but this does not appear to work -- the resulting glyphs are much much too large.  The standard says: 
A common practice is to define glyphs in terms of a 1000-unit glyph coordinate system, in which case the font matrix is [ 0.001 0 0 0.001 0 0 ]. 

and I suspect that the 1000 multiple is there even tho I don't want it.  (The 1000 for the glyph coordinate system may be related to the 1000 multiple in the TJ command.)
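One reading of the standard that might explain the oversized glyphs: glyph coordinates are scaled first by the FontMatrix and then by the font size given to Tf, so a FontMatrix of 1 combined with a 12-point Tf would make each glyph unit 12 points wide. This is an assumption, not something tested here. Under that assumption, a matrix for glyphs drawn in 600 dpi units could be computed as below; both helper names are invented for illustration.

```python
def type3_font_matrix(dpi=600, font_size=1.0):
    """FontMatrix that makes one glyph unit equal to 1/dpi inch on the page,
    assuming the viewer multiplies glyph coordinates by the Tf font size."""
    k = 72.0 / dpi / font_size
    return [k, 0, 0, k, 0, 0]

def format_matrix(m):
    """Render the matrix the way it appears in a PDF dictionary."""
    return "[ " + " ".join(f"{v:g}" for v in m) + " ]"

# Matrix to pair with a font size of 1 (hypothetical "/F1 1 Tf")
print(format_matrix(type3_font_matrix()))
```

With a larger Tf size the same helper shrinks the matrix to compensate, so the glyph should come out the same size either way.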

In theory, I should be able to code around these problems and just express my numbers in the standard 72 dpi units, particularly since I can use real numbers.  So my "1853,882" would be "1853*72/600,882*72/600" or "222.36,105.84" and then moving 1 pixel to the right between an "A" and a following "d" would be not 1 unit but 1*72/600*1000 = 120 units (and probably -120 since the standard says that this adjustment is subtracted from the current position).
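The arithmetic above can be written as a small helper. This is a sketch under two assumptions: that all positions are emitted in the default 1/72 inch units, and that the TJ number is also scaled by the current font size (the displacement formula in the standard multiplies the adjustment by the font size), so the value 120 would hold only for a font size of 1. The function names are illustrative.

```python
def px_to_pt(px, dpi=600):
    """Convert a length in pixels at the given dpi to 1/72 inch units."""
    return px * 72 / dpi

def tj_adjustment(extra_px, font_size=1.0, dpi=600):
    """Number to place in a TJ array to move extra_px pixels to the right.
    TJ numbers are thousandths of a text space unit and are subtracted
    from the position, hence the negative sign."""
    return -px_to_pt(extra_px, dpi) * 1000 / font_size

print(px_to_pt(1853), px_to_pt(882))   # position of the "A" in PDF units
print(tj_adjustment(1))                # one pixel to the right, font size 1
print(tj_adjustment(1, font_size=10))  # same move with a font size of 10
```

If the font size assumption is right, it could account for some of the unexplained fudge factors, since the same pixel gap needs a different TJ number at each point size.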

But PDF seems to be particularly screwy in terms of its commands.  For example, remember that our representation of a document gives the x,y position of each character.  Mostly, we can keep track of the position of the last character, and position the next one relative to it.  But in general, we may just need to move to a fixed location: for example, for the first character on the page, or a character at the beginning of a new paragraph, or after an in-line image.  PDF provides a Td operator that moves to an (x,y) position, but not an absolute (x,y) position.  Rather the (x,y) parameters are relative.  And not relative to the previous position, but "offset from the start of the current line".  What is the meaning of the "start of the current line"?  It appears that is the result of the last Td command, but that is not really clear, and does not seem to really work.
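One way to read the definition: every Td is measured from the point set by the previous Td (or from the origin right after BT), regardless of how far the text shown in between has advanced. Under that reading, absolute positions can be emitted by remembering the last line start. A sketch, with the class name invented for illustration and coordinates assumed to be already in PDF units:

```python
class TextPositioner:
    """Turns absolute (x, y) targets into relative Td operands."""

    def __init__(self):
        # BT resets the text line matrix, so the line start is the origin.
        self.line_x = 0.0
        self.line_y = 0.0

    def move_to(self, x, y):
        dx, dy = x - self.line_x, y - self.line_y
        self.line_x, self.line_y = x, y     # Td makes this the new line start
        return f"{dx:.2f} {dy:.2f} Td"

pos = TextPositioner()
print(pos.move_to(100, 700))   # first move after BT
print(pos.move_to(100, 686))   # next target, 14 units lower
```

The Tm operator is the other option: it replaces the text matrix rather than adding to it, so "1 0 0 1 x y Tm" places the next character at an absolute position without any bookkeeping.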

By applying "fudge" factors in various places, I can mostly get a document that comes close to what I want, but I have no explanation for these fudge factors, and the result is not close enough for actual publication.

It may be possible to work around these issues, but it will take time and experimentation.  Rather than delay publication trying to work these things out, it seems more expeditious to use my existing code to generate 600 dpi GIF images and then use existing software to convert these into a PDF file where each page is simply an embedded image.  This should take more bits (larger file size), but it should produce exactly the result that I want.

Saturday, March 21, 2015

Generating a PDF

Our current format for the book is a file that indicates where each character goes:

c p03.gif 1822 885 1832 923 l 8pt
c p03.gif 1839 900 1855 923 s 8pt
c p03.gif 1862 884 1886 898 cdq 8pt
n p03.gif 1888 882 1888 934 new-line
i p03.gif 400 1000 2399 1007 figures/line.gif
c p03.gif 407 1086 591 1274 O 28pt
c p03.gif 616 1106 650 1142 n
c p03.gif 658 1105 689 1142 e
s p03.gif 691 1108 726 1145 space
c p03.gif 729 1106 760 1142 a
c p03.gif 769 1106 803 1142 u


The first character (c,i,s,n) indicates what the line represents:
  • c -- a character to be printed
  • i -- an image
  • s -- a space between two words
  • n -- the end of a line
The next item is the page name -- the name of the file that represents a particular page of the book (each page is a separate file).  Then we have 4 numbers defining the bounding box of the item.  These numbers are locations on the page, in 600 dots-per-inch, first the x,y coordinates of the upper left corner of the box, and then the x,y coordinates of the lower right corner of the box.

The next item depends upon the type of line:
  • For a character (c), it is the name of the character, possibly with the point size (6pt, 8pt, 10pt, 12pt, 28pt)  and/or font face (italic) of the character.  Most characters have a single character name, but some (like ff, fi, fl, ffi, ffl, cdq, odq) have names that are more than just one character.
  • For an image (i), we have the name of the file which is the image to place in the bounding box.
  • For a space (s), it is the word "space".
  • For a new-line (n), it is the word "new-line".
We want a PDF file of this same information.
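Reading this format is straightforward, since every line is a whitespace-separated record. The sketch below is one possible parser, not the code actually used for the book; the Item class and parse_line function are names made up here.

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    kind: str          # "c", "i", "s" or "n"
    page: str          # page file name, e.g. "p03.gif"
    x0: int            # upper left corner of the bounding box
    y0: int
    x1: int            # lower right corner of the bounding box
    y1: int
    name: str          # character name, image file, "space" or "new-line"
    attrs: list = field(default_factory=list)   # e.g. ["8pt"] or ["italic"]

def parse_line(line):
    """Split one input line into its fields."""
    parts = line.split()
    kind, page = parts[0], parts[1]
    x0, y0, x1, y1 = (int(p) for p in parts[2:6])
    return Item(kind, page, x0, y0, x1, y1, parts[6], parts[7:])

item = parse_line("c p03.gif 1862 884 1886 898 cdq 8pt")
print(item.name, item.attrs, item.x1 - item.x0 + 1)   # name, attributes, width in pixels
```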

One approach, which we have used, is to create an HTML file which is just a list of each page in the document, in order, as a sequence of images.  We can then import this HTML file into a browser or word processor and ask that it be saved as a PDF file.
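The HTML file for this approach is nothing more than one image tag per page. A minimal sketch of generating it, assuming the page images are named p01.gif, p02.gif and so on (only p03.gif and p17.gif appear in the examples here; the other names are placeholders):

```python
from html import escape

def pages_to_html(page_files):
    """Return an HTML document showing each page image in order."""
    lines = ["<html>", "<body>"]
    for name in page_files:
        lines.append(f'<img src="{escape(name, quote=True)}">')
    lines += ["</body>", "</html>"]
    return "\n".join(lines)

print(pages_to_html(["p01.gif", "p02.gif", "p03.gif"]))
```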

Another approach would be to generate a PDF file directly from this format.  A PDF file can do lots of things, but we only have 3 main things to represent:
  1. Text
  2. Fonts
  3. Images 
Text  is represented as a sequence of strings, with location information:

BT
      /F13 12 Tf
      407 1086 Td
   (One) Tj
ET

The first line is "Begin Text"; the last line is "End Text".  The Tf command selects a font (font F13 in 12-point size).  The Td defines where to put the next character.  The Tj shows a string, in the example above "One".
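A text block like the one above is easy to generate. The sketch below uses made-up function names; the only subtle point is that parentheses and backslashes inside a PDF string have to be escaped with a backslash.

```python
def pdf_string(text):
    """Escape backslashes and parentheses for a PDF literal string."""
    for ch in ("\\", "(", ")"):          # backslash first, so escapes are not doubled
        text = text.replace(ch, "\\" + ch)
    return "(" + text + ")"

def text_block(font, size, x, y, text):
    """Return a BT ... ET block that shows text at the given position."""
    return "\n".join([
        "BT",
        f"/{font} {size} Tf",
        f"{x} {y} Td",
        f"{pdf_string(text)} Tj",
        "ET",
    ])

print(text_block("F13", 12, 407, 1086, "One"))
```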

We want to be able to define the position of each character.  Normally, PDF will show the first character ("O") and then space over the width of that character and output the next ("n"), space over the width of that character and output the next ("e").  If the font defines the width of the characters as being the same as what we want to do, this will work fine.  If not, we can output each character and then move, output the next character, and then move, and so on.  In fact, PDF has a TJ operator (instead of Tj), that takes, instead of a string like (One), an array of strings (generally one character strings) and integers, where we output the string, and then advance an extra amount to move:

    [ (O) 17 (n) 5 (e) ] TJ


With this in mind, we would define our font spacing as the minimum amount after any occurrence of a character, and then indicate the extra amount to move after that as necessary.  For example, if we look at the 10-point "O", the distance from the O to the next character varies from 65 to 68 pixels:


  6 occurrences of 65 pixels,
 28 occurrences of 66 pixels,
 22 occurrences of 67 pixels,
  6 occurrences of 68 pixels,


If we define the width of the O, in the font, as being 66 pixels, then we can adjust that by -1, +1, or +2 as necessary.  We can pick the width for the character that minimizes the number of times we need to adjust the width.
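Picking the width that needs the fewest adjustments amounts to taking the most common distance. A short sketch using the counts listed above for the 10-point "O" (the function name is illustrative):

```python
from collections import Counter

# distance in pixels -> number of occurrences, for the 10-point "O"
advances = Counter({65: 6, 66: 28, 67: 22, 68: 6})

def best_width(advances):
    """Return the most common advance and how many occurrences
    would still need an adjustment if it were the font width."""
    width, hits = advances.most_common(1)[0]
    adjustments = sum(advances.values()) - hits
    return width, adjustments

print(best_width(advances))
```

For these counts the choice is 66 pixels, which leaves 34 of the 62 occurrences needing an adjustment.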

To match our input, it would be best if we define all numbers in the PDF file to be in units of 600 dpi.  By default, PDF uses 1/72 inch as the size of a unit, but we can define the "UserUnit" value to be what we want.  Presumably, the "units" for UserUnit is the default 1/72 (1.0 point), so we would define it to be 72/600 of those units (0.12 points).

Our input images are GIF images; we could convert those to other formats, but it appears that PDF does not support many image formats directly.  Rather it allows several different ways to represent images.  Some examples we have seen are in FlateDecode (zlib/deflate).  It also looks like JBIG2 or LZWDecode are possibilities.  Wikipedia says that FlateDecode may be related to PNG or TIFF.
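As a rough illustration of the FlateDecode route, the sketch below wraps raw 8-bit gray pixels in an image object using Python's zlib module. Decoding the GIF into raw pixels is not shown and is assumed to have been done already; the function name is made up, and the pixels are assumed to be stored row by row from the top of the image.

```python
import zlib

def image_xobject(width, height, gray_pixels):
    """Build the dictionary and stream for a grayscale image.
    gray_pixels: width*height bytes, one 8-bit sample per pixel."""
    if len(gray_pixels) != width * height:
        raise ValueError("pixel data does not match image size")
    data = zlib.compress(gray_pixels)          # FlateDecode is zlib/deflate
    header = (
        f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
        f"/ColorSpace /DeviceGray /BitsPerComponent 8 "
        f"/Filter /FlateDecode /Length {len(data)} >>\n"
    ).encode("ascii")
    return header + b"stream\n" + data + b"\nendstream"

# A 2x2 checkerboard: 0 is black, 255 is white
obj = image_xobject(2, 2, bytes([0, 255, 255, 0]))
print(len(obj))
```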

If we represent the text of the book as text, we need to also define the fonts that will be used to represent those characters on the page.  All of the print-on-demand services stress that the fonts need to be embedded in the PDF file.  We have our fonts in BDF format, and can read and understand that format.  PDF files can represent several different types of fonts, but basically they are either Type 1, Type 3, or TrueType.  Type 1 is an outline font format, as is TrueType.  Type 3 allows the glyphs of the font to be defined by PostScript programs.

I think we could translate our bitmap raster fonts into little PostScript programs by tracing the outlines of the outside and inside of each character and then filling the area between them. We could start at the origin of each character and define a path along the outside of the glyph.  In the PDF graphics language this would be something like:

  0 0 m
  0 36 l
  8 36 l
  8 0 l
  0 0 l
  0 52 m
  0 54 l
  2 54 l
  2 52 l
  0 52 l
  b*

to draw a simple low-res "i".  Units are, as always, in 600 dpi, relative to the origin of the i.  ("m" is a command to move, "l" draws a straight line from the previous position to a new position, and "b*" is a PDF graphics command to close the path, then fill and stroke it with the current colors, which default to black.)
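Tracing outlines takes some care. A simpler alternative, which is not the outline tracing described above, is to emit one filled rectangle for each horizontal run of black pixels, using the "re" (rectangle) and "f" (fill) operators. The sketch below assumes the glyph bitmap is given as rows of text with "#" for black, which is a format invented for this example rather than BDF itself.

```python
def glyph_to_ops(rows):
    """rows: list of strings, top row first, '#' for a black pixel.
    Returns PDF path operators with y measured up from the bottom row."""
    ops = []
    height = len(rows)
    for r, row in enumerate(rows):
        y = height - 1 - r            # flip: bitmap rows go down, PDF y goes up
        x = 0
        while x < len(row):
            if row[x] == "#":
                start = x
                while x < len(row) and row[x] == "#":
                    x += 1
                ops.append(f"{start} {y} {x - start} 1 re")   # x y width height
            else:
                x += 1
    ops.append("f")                   # fill all the rectangles at once
    return "\n".join(ops)

# A tiny "i": a dot, a gap, and a two-row stem
print(glyph_to_ops(["##", "..", "##", "##"]))
```

This produces more operators than a traced outline would, but each pixel row maps directly to the output, which makes the result easy to check against the bitmap.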

PDF files represent the various parts of the document by "objects".  A font is an object; an image is an object; text is an object, and so on.  The overall PDF file format is a header, a body, a table of objects, and a trailer.  The file is meant to be read from the end, so the trailer indicates the number of objects and where in the file (byte offset) the start of the object table is.  The object table is a list of all the objects, consisting mainly of the byte-offset in the file of the start of that object.  The header is pretty minimal, consisting of bytes to identify it as a PDF file, and what version of PDF.  The body contains the definitions of the objects.

So to create a PDF file, we would first run thru our input file and create an internal list of all the objects we have -- the fonts, the images, and the text pages.  We could generate them in that order.  Then output the object table, and the trailer.
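The file layout described above can be sketched in a few lines. This is only a skeleton under stated assumptions: object 1 is taken to be the document catalog, every object has generation number 0, and the two objects in the example are placeholders that give an empty document, not a usable book.

```python
def build_pdf(objects):
    """objects: list of bytes, the bodies of objects 1, 2, 3, ... in order.
    Object 1 is assumed to be the document catalog."""
    out = bytearray(b"%PDF-1.4\n")                 # header
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))                   # byte offset of this object
        out += f"{number} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_pos = len(out)                            # where the object table starts
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"                # entry 0 is always the free entry
    for off in offsets:
        out += f"{off:010d} 00000 n \n".encode("ascii")   # each entry is 20 bytes
    out += f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n".encode("ascii")
    out += f"startxref\n{xref_pos}\n%%EOF\n".encode("ascii")
    return bytes(out)

pdf = build_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
print(len(pdf))
```

Because the offsets are recorded as the objects are written, the fonts, images, and pages can be generated in any convenient order.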


  

Lulu or Createspace

To generate a print-on-demand hard-copy of the book, we can use several services.  Lulu.com is the largest print-on-demand service, but Amazon provides its own as Createspace.com.  After trying to evaluate which of these would be best, it became apparent that we do not need to choose one or the other -- we can do both!

The basic file provided for print-on-demand is a PDF file.  PDF is a page-ready format used for documents.   PDF files can include text and images.  In our case, we can generate our text as images and so have a PDF file that simply contains the correct size and content of images.

An alternative would be to generate a PDF file directly from our representation of the book.  Our representation says where each character should be placed.  By combining that with the representation of each character in the font definition, we know what image to generate.  We can encode this information, in this way, to create a PDF file.  PDF files can have embedded fonts, images, and text.


Tuesday, March 3, 2015

Getting a Font to Print with

The last blog posting (How to process the scanned images to create a new book) took the scanned images and showed how they can be "merged" to create an ideal image which then replaces the individual characters to create an "ideal" representation of the original book.  This process takes many steps:
  1. We first scanned the book, at high resolution (600 dpi) to get images of each page.
  2. We applied OCR (Optical Character Recognition) to these pages to get the actual characters on the page.  In fact, we used tesseract, an open-source OCR program, to produce both the characters and where on the page they are.  The location on the page is given by a "bounding box" which says where it starts and where it ends, in both the horizontal and vertical dimensions.  The locations are given in pixels. (Each page is about 2000 x 3120 pixels.  At 600 dpi that is 3.3 x 5.2 inches, not the 6x9 expected).
  3. Some editing and a lot of error checking then produces a file that lists both each character and where it is on the page. We added by hand information about the size and font-face where necessary. 
The result of this processing is a file which can be used to reproduce the book as it originally was:
c p17.gif 510 1234 538 1295 f
c p17.gif 536 1259 572 1295 o
c p17.gif 581 1259 615 1295 u
c p17.gif 627 1259 661 1295 n
c p17.gif 674 1235 710 1295 d
s p17.gif 711 1235 742 1293 space
c p17.gif 743 1235 768 1293 1 italic
c p17.gif 781 1237 817 1295 6 italic
s p17.gif 818 1259 844 1295 space
c p17.gif 845 1259 880 1312 p
c p17.gif 889 1259 925 1295 o
c p17.gif 932 1234 945 1295 i
c p17.gif 955 1259 989 1295 n
(We added the "space" character to represent the space between words.)

Generating a new version of the book then requires only a representation of each of the characters used in the book.  So we know that the "f" of the first line above goes horizontally from pixels 510 to 538 and vertically from pixels 1234 to 1295 on page "p17.gif" (so we know it is 29 pixels wide and 62 pixels high), but we still need to know exactly what bits form the "f".

The particular bits to represent an "f" are defined by the font.  Different fonts represent characters with different images -- they look different.  We can reproduce the book, from our file, in different fonts.  As long as an "f" is (no more than) 29x62 pixels, we can place it in the right spot on the (image) page to re-create the book.

So the problem is to find the "right" font.  The book itself says "This book was hand-set in 10 point Caslon Old Style".  A quick Google search on "Caslon" shows that William Caslon was a typeface designer in England from about 1730.  Unfortunately, there are a lot of "Caslon" fonts, and it is not clear what "Caslon Old Style" specifically is.

However, to some degree, we don't need a formal definition of the font -- we have it in front of us in the book.  Unfortunately, as we have seen in the previous posting, each instance of the characters, being a particular metal slug in 1953, printed on a particular piece of paper, and then scanned at 600 dpi, appears to be slightly different. Of the 42,272 different character instances we scanned, no more than a dozen are exactly the same as some other character instance.

So we have two options:

  1. We can try to find an existing well-defined font that looks close to what we have in the current printed copy, or
  2. We can try to make a representation of the font from the character images that we have.
We've tried both.  We can "merge" all the characters together, and from that have produced an approximate font.  But it clearly needs to be "cleaned" up by hand, to make straight lines of the same size.  This requires some skill and design capability.  The result of this is a set of font files that can be used to create a version of the book that approximates the printed version.  For example, the "The" used in the last blog post looks like:



We have also found something called LibreCaslonText which is an open-source representation that looks a lot like the original text.  (The representations of "Q" and "J" seem particularly to vary from font to font.) The main problem with this is that it is in TrueType format, which is a vector representation of the font, rather than the bitmap format that we are using. One advantage of the TrueType format is that it represents all sizes (point sizes) in one form, rather than having different bitmaps for the various sizes, as we have.  This looks like it will work for approach 1.  Converting the TrueType format to a BDF format, we can then produce the following for the "The" of the last blog:


The differences are subtle.