Republishing Old Books: Generating a PDF

Our current format for the book is a file that indicates where each character goes:

c p03.gif 1822 885 1832 923 l 8pt
c p03.gif 1839 900 1855 923 s 8pt
c p03.gif 1862 884 1886 898 cdq 8pt
n p03.gif 1888 882 1888 934 new-line
i p03.gif 400 1000 2399 1007 figures/line.gif
c p03.gif 407 1086 591 1274 O 28pt
c p03.gif 616 1106 650 1142 n
c p03.gif 658 1105 689 1142 e
s p03.gif 691 1108 726 1145 space
c p03.gif 729 1106 760 1142 a
c p03.gif 769 1106 803 1142 u

The first character (c, i, s, or n) indicates what the line represents:

c -- a character to be printed
i -- an image
s -- a space between two words
n -- the end of a line

The next item is the page name -- the name of the file that represents a particular page of the book (each page is a separate file). Then we have 4 numbers defining the bounding box of the item. These numbers are locations on the page, measured in dots at 600 dots per inch: first the x,y coordinates of the upper left corner of the box, and then the x,y coordinates of the lower right corner of the box.

The next item depends upon the type of line:

For a character (c), it is the name of the character, possibly followed by the point size (6pt, 8pt, 10pt, 12pt, 28pt) and/or the font face (italic) of the character. Most characters have a single-character name, but some (like ff, fi, fl, ffi, ffl, cdq, odq) have names that are more than one character long.
For an image (i), we have the name of the file which is the image to place in the bounding box.
For a space (s), it is the word "space".
For a new-line (n), it is the word "new-line".
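
As a rough illustration of the format described above, here is a minimal Python sketch of a parser for one input line. The names Item and parse_line are our own, and we are assuming that the italic face shows up as the literal word "italic" after the character name; the samples above only show point sizes, so that part is a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Item:
    kind: str                  # 'c', 'i', 's', or 'n'
    page: str                  # page file name, e.g. p03.gif
    x0: int                    # upper left corner, 600 dpi
    y0: int
    x1: int                    # lower right corner, 600 dpi
    y1: int
    name: str                  # character name, image file, "space", or "new-line"
    size: Optional[str] = None # e.g. "8pt", when given
    italic: bool = False       # assumed to be spelled "italic" in the input


def parse_line(line: str) -> Item:
    fields = line.split()
    kind, page = fields[0], fields[1]
    x0, y0, x1, y1 = (int(v) for v in fields[2:6])
    name = fields[6]
    size = None
    italic = False
    # Anything after the name is an optional point size and/or font face.
    for extra in fields[7:]:
        if extra.endswith("pt") and extra[:-2].isdigit():
            size = extra
        elif extra == "italic":
            italic = True
    return Item(kind, page, x0, y0, x1, y1, name, size, italic)


item = parse_line("c p03.gif 407 1086 591 1274 O 28pt")
print(item.name, item.size, item.x1 - item.x0)
```

Keeping the bounding box as four plain integers makes it easy to compute the distance from one character to the next later on.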

We want a PDF file of this same information.

One approach, which we have used, is to create an HTML file which is just a list of each page in the document, in order, as a sequence of images. We can then import this HTML file into a browser or word processor and ask that it be saved as a PDF file.

Another approach would be to generate a PDF file directly from this format. A PDF file can do lots of things, but we only have 3 main things to represent:

Text
Fonts
Images

Text is represented as a sequence of strings, with location information:

BT
    /F13 12 Tf
    407 1086 Td
    (One) Tj
ET

The first line is "Begin Text"; the last line is "End Text". The Tf command selects a font (font F13 in 12-point size). The Td command sets the position where the next text will be placed. The Tj command shows a string, in the example above "One".

We want to be able to define the position of each character. Normally, PDF will show the first character ("O"), advance by the width of that character, show the next ("n"), advance by the width of that character, and show the next ("e"). If the widths that the font defines for its characters match the spacing we want, this will work fine. If not, we can output each character and then move, output the next character and then move, and so on. In fact, PDF has a TJ operator (instead of Tj) for this. Instead of a string like (One), it takes an array of strings (generally one-character strings) and numbers. Each string is shown in turn, and each number adjusts the position of whatever follows it. The numbers are in thousandths of a unit of text space and are subtracted from the current position, so a negative number adds extra space and a positive number removes space:

    [ (O) -17 (n) -5 (e) ] TJ
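
To make the bookkeeping concrete, here is a small Python sketch that builds a TJ array from character positions. It relies on the PDF rule that numbers in a TJ array are in thousandths of a unit of text space, scaled by the font size, and subtracted from the position. The names tj_array, pdf_string and fmt are our own, and the widths table (the advance for each character, already scaled to the same units as the x positions) is a made-up example, not measured data.

```python
def pdf_string(text):
    # Escape the characters that are special inside a PDF literal string.
    for ch in ("\\", "(", ")"):
        text = text.replace(ch, "\\" + ch)
    return "(" + text + ")"


def fmt(value):
    # Plain decimal notation, without trailing zeros or exponents.
    return f"{value:.3f}".rstrip("0").rstrip(".")


def tj_array(chars, widths, font_size):
    """chars: list of (glyph, x) pairs in reading order.
    widths: glyph -> advance the font would apply, in the same units as x.
    font_size: font size in those same units."""
    parts = []
    for i, (glyph, x) in enumerate(chars):
        parts.append(pdf_string(glyph))
        if i + 1 < len(chars):
            wanted = chars[i + 1][1] - x      # advance we measured on the page
            extra = wanted - widths[glyph]    # what the font does not cover
            if extra != 0:
                # Negative TJ numbers add space; positive ones remove it.
                parts.append(fmt(-1000.0 * extra / font_size))
    return "[ " + " ".join(parts) + " ] TJ"


# Hypothetical widths, chosen only to exercise the function.
widths = {"O": 200, "n": 42, "e": 30}
print(tj_array([("O", 407), ("n", 616), ("e", 658)], widths, 100))
```

Characters whose measured advance already matches the font width produce no number at all, which keeps the array short.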

With this in mind, we would define a width for each character in the font, and then indicate any extra amount to move after a character as necessary. One choice for that width is the minimum distance seen after any occurrence of the character, so that every adjustment adds space. For example, if we look at the 10-point "O", the distance from the O to the next character varies from 65 to 68 pixels:

6 occurrences of 65 pixels,
28 occurrences of 66 pixels,
22 occurrences of 67 pixels,
6 occurrences of 68 pixels.

If we instead define the width of the O, in the font, as being 66 pixels, then we can adjust that by -1, +1, or +2 as necessary. Since 66 is the most common distance, 28 of the 62 occurrences need no adjustment at all, while the minimum (65) would leave 56 occurrences to adjust. So we can pick the width for the character that minimizes the number of times we need to adjust the width.
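
Picking the width that needs the fewest adjustments amounts to taking the most common distance. A short Python sketch, using the counts for the 10-point "O" given above (the function name best_width is our own):

```python
from collections import Counter


def best_width(advances):
    """Return (width, adjustments): the most common advance, and how many
    occurrences would still need a correction if the font used that width."""
    counts = Counter(advances)
    # Most frequent value wins; on a tie, prefer the smaller width.
    width, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    adjustments = sum(n for value, n in counts.items() if value != width)
    return width, adjustments


# Distances after the 10-point "O", from the table above.
advances = [65] * 6 + [66] * 28 + [67] * 22 + [68] * 6
print(best_width(advances))
```

With these numbers the function returns a width of 66 with 34 occurrences left to adjust.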

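To match our input, it would be best if we define all numbers in the PDF file to be in units of 600 dpi. By default, PDF uses 1/72 inch as the size of a unit, but a page can define a "UserUnit" value (PDF 1.6 and later) to change that. UserUnit is expressed in multiples of the default 1/72 inch (1.0 point), so we would define it to be 72/600 of those units (0.12 points). The range of UserUnit values that a viewer supports is implementation-dependent, so a fallback is to scale the page contents by 0.12 with the cm operator. Note also that PDF's default origin is the lower left corner of the page, with y increasing upward, while our boxes are measured from the upper left, so the y coordinates have to be flipped.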
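
Here is a small Python sketch of the unit handling, under two assumptions of ours: that we scale with the cm operator rather than UserUnit, and that the page is 6600 dots tall (11 inches at 600 dpi), which is only an example value since the real page height comes from the scanned page. The function names are our own.

```python
DPI = 600
POINTS_PER_INCH = 72


def scale_operator():
    # "a b c d e f cm" multiplies the current transformation matrix;
    # here it maps one unit to one 600-dpi dot (72/600 = 0.12 points).
    s = POINTS_PER_INCH / DPI
    return f"{s:g} 0 0 {s:g} 0 0 cm"


def flip_box(x0, y0, x1, y1, page_height_dots):
    """Convert a box measured from the upper left corner (y down) into
    (left, bottom, right, top) measured from the lower left corner (y up)."""
    return x0, page_height_dots - y1, x1, page_height_dots - y0


PAGE_HEIGHT = 6600  # hypothetical: 11 inches at 600 dpi
print(scale_operator())
print(flip_box(407, 1086, 591, 1274, PAGE_HEIGHT))
```

Only the y values change in the flip; the coordinates stay in 600 dpi dots, and the cm line at the top of the page contents does the scaling.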

Our input images are GIF images. We could convert those to other formats, but it appears that PDF does not embed most image file formats directly. Rather, an image is stored as a stream of pixel data, and that stream can be encoded in several different ways. Some examples we have seen use FlateDecode (zlib/deflate compression, the same compression method used by PNG and available in TIFF). It also looks like JBIG2Decode or LZWDecode are possibilities.
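
As a sketch of the FlateDecode route, the Python below packs a black-and-white bitmap into 1-bit samples, compresses it with zlib, and wraps it in an image object dictionary. It assumes the GIF has already been decoded into rows of 0/1 values by some other tool (decoding the GIF is not shown), and the function name image_xobject is our own.

```python
import zlib


def image_xobject(rows):
    """rows: list of equal-length lists, 1 = ink (black), 0 = paper.
    Returns the bytes of a PDF image stream object body."""
    height = len(rows)
    width = len(rows[0])
    raw = bytearray()
    for row in rows:
        byte = 0
        nbits = 0
        for px in row:
            # In DeviceGray with 1 bit per sample, 0 is black and 1 is white.
            byte = (byte << 1) | (0 if px else 1)
            nbits += 1
            if nbits == 8:
                raw.append(byte)
                byte = 0
                nbits = 0
        if nbits:
            # Each row starts on a byte boundary; pad the last byte.
            raw.append(byte << (8 - nbits))
    data = zlib.compress(bytes(raw))
    header = (
        f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
        f"/ColorSpace /DeviceGray /BitsPerComponent 1 "
        f"/Filter /FlateDecode /Length {len(data)} >>\nstream\n"
    ).encode("ascii")
    return header + data + b"\nendstream"


obj = image_xobject([[1, 0, 1], [0, 0, 0]])
print(obj[:60])
```

The /Length entry counts the compressed bytes, so it has to be computed after compression.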

If we represent the text of the book as text, we need to also define the fonts that will be used to represent those characters on the page. All of the print-on-demand services stress that the fonts need to be embedded in the PDF file. We have our fonts in BDF format, and can read and understand that format. PDF files can represent several different types of fonts, but basically they are either Type 1, Type 3, or TrueType. Type 1 is an outline font format, as is TrueType. Type 3 allows each glyph of the font to be defined by a small program; in PostScript that is a PostScript procedure, while in a PDF file it is a stream of PDF graphics operators.

I think we could translate our bitmap raster fonts into small glyph-drawing programs by tracing the outlines of the outside and inside of each character and then filling the area between them. We could start at the origin of each character and define a path along the outside of the glyph. In the PDF graphics language this would be something like:

0 0 m
0 36 l
8 36 l
8 0 l
0 0 l
0 52 m
0 54 l
2 54 l
2 52 l
0 52 l
b*

to draw a simple low-res "i". Units are, as always, in 600 dpi, relative to the origin of the i. ("m" is a command to move, "l" draws a straight line from the previous position to a new position, and "b*" is a PDF graphics command that closes the path, fills it using the even-odd rule, and strokes its outline. The fill uses the current fill color, which is black by default. If we only want the fill, "f*" does the same without the stroke.)
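
Tracing outlines takes some care, so here is a Python sketch of a simpler alternative, not the outline tracing described above: emit one rectangle for each horizontal run of ink pixels, using the PDF "re" operator (x y width height), and fill everything with "f". The bitmap representation (a list of strings with '#' for ink) and the name glyph_to_path are our own choices for the example.

```python
def glyph_to_path(bitmap):
    """bitmap: list of strings, top row first, '#' = ink.
    Returns PDF path operators, one rectangle per horizontal run."""
    ops = []
    height = len(bitmap)
    for r, row in enumerate(bitmap):
        y = height - 1 - r          # PDF y grows upward; bottom row is y = 0
        x = 0
        while x < len(row):
            if row[x] == "#":
                start = x
                while x < len(row) and row[x] == "#":
                    x += 1
                ops.append(f"{start} {y} {x - start} 1 re")
            else:
                x += 1
    ops.append("f")                 # fill all the rectangles at once
    return "\n".join(ops)


# A tiny made-up "i": a dot, a gap, and a two-row stem.
print(glyph_to_path(["##", "..", "##", "##"]))
```

This produces more path segments than a traced outline would, but it cannot get the inside and outside of a character confused.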

PDF files represent the various parts of the document as "objects". A font is an object; an image is an object; the text of a page is an object (a content stream), and so on. The overall PDF file format is a header, a body, a table of objects (the cross-reference table), and a trailer. The file is meant to be read from the end, so the trailer gives the number of objects and the byte offset in the file where the object table starts. The object table is a list of all the objects, consisting mainly of the byte offset in the file of the start of each object. The header is pretty minimal, consisting of bytes that identify the file as a PDF file and give the PDF version. The body contains the definitions of the objects.

So to create a PDF file, we would first run through our input file and create an internal list of all the objects we have -- the fonts, the images, and the text pages. We could generate them in that order, and then output the object table and the trailer.
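
A minimal Python sketch of that file layout follows: it writes the header, the objects (remembering the byte offset of each), the object table, and the trailer. The function name build_pdf is our own, and the example page uses the standard Helvetica font and a US Letter page purely as stand-ins; our real file would carry the embedded fonts, images, and text pages instead.

```python
def build_pdf(objects):
    """objects: list of bytes; objects[i] becomes object number i + 1.
    Object 1 must be the document catalog."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset of this object
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"         # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off    # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


content = b"BT /F1 12 Tf 72 720 Td (One) Tj ET"
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
    b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",  # stand-in font
]
pdf = build_pdf(objects)
print(len(pdf), "bytes")
```

Because every offset is taken from the length of the output so far, the objects can be generated in any order as long as the references between them use the right object numbers.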

Saturday, March 21, 2015
