Monday, April 13, 2015

PDF explained (part 1): Overall Structure

I finally have my code generating PDF and wanted to take the opportunity while this is still fresh in my mind to write it down.  Maybe it will help someone else generate PDF directly.

To refresh, this is a project to re-publish an existing book from 1953.  The idea is to scan the book at reasonably high resolution (600 dpi), extract from the scan the font that was used to print it, and then re-generate the book from that font.  Everything is raster (pixel) based, since it all comes from the scan; there are no vector fonts or anything like that involved.

After the scan and processing, I have a data file that describes, in 600 dpi units, exactly where each character is to be put on each page.  The format is derived from the "box" format of the OCR program that I used (tesseract).  The book pages were scanned into separate files in a TIFF format, then converted to GIF and OCR'ed to get the text and locations, then processed into a format like:

c p17.gif 629 1260 666 1295 n
c p17.gif 676 1234 713 1295 d
s p17.gif 714 1233 741 1234 space
c p17.gif 742 1236 767 1295 1 italic
c p17.gif 780 1236 816 1295 6 italic
s p17.gif 817 1235 844 1236 space
c p17.gif 845 1260 880 1312 p
c p17.gif 889 1260 925 1295 o
c p17.gif 931 1235 946 1295 i
n f02.gif 2174 2701 2174 2745 new-line
i f02.gif 409 3002 2402 3432 figures/f02b.gif

The first character on each line identifies the type of line.  The most common is the "c" or "character" line.  The first character is followed by the name of the page image that contains the character, followed by a bounding box which contains the character -- start x, start y, end x, end y.  These are in 600 dpi units, with the upper left corner of the page being 0,0.  Page sizes varied slightly, but most were 2800 x 3920 (which at 600 dpi would be 4.66 inches by 6.53 inches).  Actually, I think this was after I added a 400-pixel margin on all four edges, so the actual non-blank text would be in a 2000 x 3120 pixel box.

After the page and bounding box, we have the actual character which is to be found there.  Mostly this is a single letter representing the character, but sometimes we have a "name" -- for example "odq" and "cdq" for open and closed double quotes, and "ff", "fi", "fl", "ffi", and "ffl" for ligatures.  There also may be font information -- point size and font face -- if need be.  The sizes are somewhat arbitrary -- we don't know what the original font was -- but the book says it was "set in 10-point Caslon Old Style", so the default is 10 points, and we defined new sizes for larger (12, 28) and smaller (6, 8) characters.  And we can see that some characters are italic.  In theory, we could have bold, serif and sans-serif fonts, and other sizes.  But most of the text is just plain old 10-point regular font; out of 42,660 characters, 40,254 were 10-point regular.
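
As a concrete illustration, here is a minimal C sketch of reading one line of this format.  The struct and the parse_box_line() helper are made up for this post, not our actual code:

#include <stdio.h>

struct box_line {
    char type;            /* 'c', 's', 'n', or 'i' */
    char page[64];        /* e.g. "p17.gif" */
    int  x0, y0, x1, y1;  /* bounding box in 600 dpi units, origin at upper left */
    char text[64];        /* character, name ("odq", "fi", ...), or image file name */
    char font[32];        /* optional: "italic", a point size, ... */
};

static int parse_box_line(const char *line, struct box_line *b)
{
    b->font[0] = '\0';    /* the font field may be absent */
    int n = sscanf(line, " %c %63s %d %d %d %d %63s %31s",
                   &b->type, b->page, &b->x0, &b->y0, &b->x1, &b->y1,
                   b->text, b->font);
    return n >= 7;
}

int main(void)
{
    struct box_line b;
    if (parse_box_line("c p17.gif 742 1236 767 1295 1 italic", &b))
        printf("type=%c page=%s box=(%d,%d)-(%d,%d) text=%s font=%s\n",
               b.type, b.page, b.x0, b.y0, b.x1, b.y1, b.text, b.font);
    return 0;
}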

We wrote a program to read the original GIF files and the list of bounding boxes, and from those generate small images of each of the characters used, separated by size and font face.  These were written out in BDF format as a font description.  For example, here is a generated BDF representation of a 6-point comma:

STARTCHAR ,
ENCODING 44
SWIDTH 138 166
DWIDTH 10 0
BBX 10 12 0 -5
BITMAP
1800
7E00
7F00
FF80
FF80
7FC0
7FC0
3FC0
0F80
0780
0780
0300
ENDCHAR
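
The BITMAP rows are hexadecimal, one scan line per row, most-significant bit first, padded on the right to a whole byte.  Here is a small C sketch (purely illustrative) that expands the rows of the comma above into an ASCII picture, using the width and height from its BBX line:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const char *rows[] = { "1800", "7E00", "7F00", "FF80", "FF80", "7FC0",
                           "7FC0", "3FC0", "0F80", "0780", "0780", "0300" };
    int width = 10, height = 12;              /* from "BBX 10 12 0 -5" */

    for (int r = 0; r < height; r++) {
        unsigned long bits = strtoul(rows[r], NULL, 16);
        int nbits = (int)strlen(rows[r]) * 4; /* bits in this padded row */
        for (int c = 0; c < width; c++)
            putchar(((bits >> (nbits - 1 - c)) & 1) ? '#' : '.');
        putchar('\n');
    }
    return 0;
}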

I could then use an old font editor I wrote to modify the characters slightly, to clean them up.  Once the fonts were cleaned up, I could use the same list of characters to generate a new output page with the cleaned-up, idealized characters placed where they belong.  The same file can thus be used both to find the original characters and to replace them with new images.  All the uses of "T" in the book are now represented by an idealized version of a perfect "T" instead of slightly different versions of the same basic image.

I could also compute on the box definitions and adjust the positions slightly -- for example, to make the baseline of each line of text perfectly horizontal, and to smooth out the horizontal spacing between characters.  To do this, it helps to know where the lines of text are, and where the spaces between words are.  So two additional line types start not with "c" but with "s", for the spaces between words, and "n", for the new-lines at the end of a line of text.  For spaces, the bounding box is the size of the space between words (really only the horizontal positions are relevant), and for a new-line, the bounding box is the position just after the last character (only the horizontal position really matters).

Lines with "s" and "n" generate no output on the page, but are useful hints when we adjust character positions.  For example, we want (a) all characters to sit on a horizontal baseline, (b) the spaces between words in a line to be roughly the same size, and (c) the space between characters in a word to be roughly the same.  We thought about adding (d) the space between lines should be roughly the same and (e) the space between characters in a word should be roughly the same for all characters in all words in the same line, but decided that was probably too much.

One last type of line is the "i" line, which defines an "image" to be put on the page.  An "i" line gives the bounding box on the page where the image should be, and the name of an image file (a GIF file for our work) to hold the image.  During the first phase of processing, the image is extracted from the page image and written to that file; during the second phase, we use the file to re-create the image on the paper -- or at least in the image file that represents the piece of paper.

So we have a program which can read a file describing a page, plus a set of fonts, and will generate an image file with each character and image put in the right place.  Thus we can generate a new image of each scanned page where every character and image is "cleaned-up" and there are no extraneous bits of dust or marks on the page image.  This page image can then be printed, or whatever, as an image.  Although it is constructed of glyphs representing specific characters and copies of bit-map images, the output is just an image file -- a sequence of black and white dots (possibly grey scale or even color bits in the images) with no concept of why a particular bit is white or black.
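
As a rough illustration of the core step in that program, here is a hedged C sketch of placing one glyph bitmap onto the page bitmap at the position given by a "c" line.  The page size, the byte-per-pixel layout, and the place_glyph() helper are simplifications made up for this post; the real program works from BDF glyphs and GIF page images:

#include <stdio.h>
#include <string.h>

#define PAGE_W 2800
#define PAGE_H 3920

static unsigned char page[PAGE_H][PAGE_W];    /* 1 = black, 0 = white */

static void place_glyph(const unsigned char *glyph, int gw, int gh, int x0, int y0)
{
    for (int y = 0; y < gh; y++)
        for (int x = 0; x < gw; x++)
            if (glyph[y * gw + x])
                page[y0 + y][x0 + x] = 1;     /* OR black pixels onto the page */
}

int main(void)
{
    unsigned char dot[4] = { 1, 1, 1, 1 };    /* a 2x2 "glyph" for illustration */
    memset(page, 0, sizeof page);
    place_glyph(dot, 2, 2, 629, 1260);        /* position from a "c" line */
    printf("pixel at (629,1260): %d\n", page[1260][629]);
    return 0;
}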

We can then create a PDF file, using traditional programs like Word or LibreOffice, which is just a sequence of pages, each page being a full-page GIF image.  This was our first take at generating a PDF file of our newly re-formatted book.  But a PDF file has the ability to define fonts and then place characters from those fonts (and images) on a page of paper; that is what PDF is for.  So we should be able to translate directly from our representation into a PDF file without having to go to a full-page image.

PDF was defined by Adobe Systems as a follow-on to PostScript, but was then offered up as an open standard.  It was picked up by ISO (International Organization for Standardization) and published as ISO 32000-1:2008, Document management -- Portable document format.  The first thing I did was read that.  It defines things, but does not really show how they are used, particularly not the parts I figured I needed: a different unit of measurement (600 dpi), representing my own bit-mapped glyphs for fonts, and images.

I also read Developing with PDF by Leonard Rosenthol (O'Reilly Media) from our local City Library.  That helped on some points, particularly hinting at how to represent images in Chapter 3, but it leaves out the critical details (like how to represent the actual image data -- it just says "image data goes here", even for small examples).  Much of the book seems to be copied from the ISO standard (although he does at least correct his Example 4-7, which makes much more sense than the corresponding Figure 46 in the ISO standard).

A more useful book was PDF Explained by John Whitington (O'Reilly Media), which I found on-line at http://issuu.com/wangzuo/docs/pdf_explained_2011-1, although the ability to read it was limited by having to go thru it one page at a time from the front.  Chapter 6 on Text and Fonts (pages 71 to 88) does not really say anything more than Rosenthol did; maybe it just seemed to make more sense because it was the second or third time I had read the material.  But it did give me another reference, Fonts and Encodings by Yannis Haralambous (O'Reilly Media), which is a truly fascinating combination of theory and practice -- but that is getting a bit far afield.

It was pretty difficult, if not impossible, to find the information I wanted by just searching with Google, or at least it seemed that way to me.  If I mentioned that I wanted to know about PDF, I started to get PDF documents, not information about PDF.  So mostly, I muddled thru, trying this and that until I found the (apparent) key to generating my own PDF.   PDF is a strange mixture of general things (objects) and specific things that are apparently there to support someone's specific idea of how documents will be structured (and ours is not structured like that).

A PDF file is a collection of objects.  Each object is mainly a list of key/value pairs, with occasional data.  An object is defined by the keyword "obj"; the definition ends with the keyword "endobj".  For example:

548 0 obj % page object for cover.gif
<<
 /Type /Page
 /Parent 545 0 R
 /Resources 544 0 R
 /MediaBox [ 0 0 765.36 454.56 ]
 /Contents 546 0 R
>>
endobj

is the definition of an object (object 548), which has a number of properties given by the key/value pairs.  Some of the values are names (like /Page), others are references to other objects (545 0 R, 544 0 R, and 546 0 R), and one is an array of numeric values (the /MediaBox).
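
Here is a hedged C sketch of how a generator might write such an object.  The emit_page_object() helper and the offsets[] array are names made up for illustration; the essential idea is recording each object's byte offset (ftell() gives us that) so the cross-reference table, described below, can be written later:

#include <stdio.h>

static long offsets[1000];               /* offsets[i] = where object i starts */

static void emit_page_object(FILE *f, int num, int parent, int resources,
                             int contents, double w, double h)
{
    offsets[num] = ftell(f);             /* remember this for the xref table */
    fprintf(f, "%d 0 obj\n<<\n", num);
    fprintf(f, " /Type /Page\n");
    fprintf(f, " /Parent %d 0 R\n", parent);
    fprintf(f, " /Resources %d 0 R\n", resources);
    fprintf(f, " /MediaBox [ 0 0 %.2f %.2f ]\n", w, h);
    fprintf(f, " /Contents %d 0 R\n", contents);
    fprintf(f, ">>\nendobj\n");
}

int main(void)
{
    FILE *f = fopen("page.part", "w");
    if (!f) return 1;
    emit_page_object(f, 548, 545, 544, 546, 765.36, 454.56);
    fclose(f);
    return 0;
}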

The overall PDF file structure has a header, a body, a cross-reference table, and a trailer.

The file, in general, is all in ASCII, with the exception of some binary data streams (we will talk about that later).

The header is one line

%PDF-1.6

which identifies the file as a PDF and gives the version of the language that is used.  PDF files can be version 1.0, 1.1, ... 1.7.  Different features of the language have been introduced in different versions, and you need to specify the highest version required by any feature you use.  For example, we tried to use "UserUnit", which was introduced with PDF 1.6; Resources come from PDF 1.2.  I think these are all upward-compatible.

Notice that the "%" character is also the "comment" character for PDF -- everything after a "%" until the end of the line is a comment, and ignored.  So in some sense, this first line is a comment.

The Standard (and other sources) say that if the PDF has binary data, then by convention the second line of the PDF file is a comment line with at least one binary character.  The idea is that if a program tries to determine whether a file is ASCII or binary, it will look at a small number of its initial characters.  If they are all just ASCII, it will assume the rest of the file is ASCII; if it sees any non-ASCII (binary) characters, it will assume the file is binary.  So this second line is for those file classification programs, not for PDF itself -- it's a comment started by a "%", so the line is ignored.  We decided to make the entire file ASCII, so we don't bother with this second line.

The trailer is used for two things:

1. To find the cross reference table, and
2. To find the "Root" object.

The cross-reference table says, for each object, where in the file that object is, allowing direct access to a particular object.  The Root object is the one that defines the content of the PDF file.  The trailer says which object the Root is, and the cross-reference table tells you where in the file that object is.

The trailer uses the same syntactic structure as the rest of a PDF file: a list of key/value pairs:

trailer
  <<
      key-1  value-1
      key-2  value-2
           ...
      key-n  value-n
  >>
startxref
Byte_offset_of_cross-reference_table
%%EOF

We only need two keys in the trailer section:  the Root and the total number of objects (Size).

The "Byte_offset_of_cross-reference_table" is just what it says -- the byte offset, from the beginning of the file, of the start of the cross-reference table.  So if the entire PDF file is 3,382,705 bytes long, and the cross-reference table starts at byte 3,372,180  then the end of the PDF file is:

startxref
3372180
%%EOF

This allows a program that wants to read a PDF file to first read the tail end of it, back up over the "%%EOF" and the numeric digits before it to the "startxref" keyword and then compute where the cross reference table is in the PDF file.
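
Here is a hedged C sketch of that step: read the last few hundred bytes of the file, find "startxref", and parse the decimal offset after it.  The find_xref_offset() name and the 512-byte tail size are arbitrary choices for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static long find_xref_offset(const char *path)
{
    char tail[512];
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    long want = size < 511 ? size : 511;
    fseek(f, size - want, SEEK_SET);
    size_t got = fread(tail, 1, (size_t)want, f);
    tail[got] = '\0';
    fclose(f);

    char *p = strstr(tail, "startxref");   /* fine for our all-ASCII files */
    if (!p) return -1;
    return strtol(p + strlen("startxref"), NULL, 10);
}

int main(int argc, char **argv)
{
    if (argc > 1)
        printf("cross-reference table at byte %ld\n", find_xref_offset(argv[1]));
    return 0;
}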

The trailer itself always has at least two entries:

Size -- which is the number of entries in the cross-reference table (one more than the highest object number, since the reserved object 0 is counted too), and

Root -- which defines the root object.

So, for example, we could have a PDF file that ends with:

trailer
<<
 /Size 550
 /Root 549 0 R
>>
startxref
3372180
%%EOF

which says there are 550 cross-reference entries altogether (objects 0 through 549, with object 0 reserved), and the Root object is number 549.

A couple of comments about syntax.  Blank spaces do not matter.  Numbers are represented as numbers, in decimal, in ASCII (with a possible plus or minus sign); numbers may have fractional parts (again in decimal). 

Boolean values are written as the keywords true and false (in lower case).

The "<<" and ">>" tokens define a "dictionary" object.  Dictionary objects are a list of key/value pairs.

Strings are surrounded by "(" and ")", so (foo) is a string consisting of the three characters "f", "o", and "o".  In a string, if you want a right parenthesis, you need to escape it with a back-slash (\). If you want a back-slash, you have to escape it with a back-slash (\\).  You can also represent some white-space characters using a back-slash, as you would expect from C: \n, \r, \t, ...  You can also represent any character in a string with \ooo, where ooo is a 3-digit octal ASCII character code.  So a right parenthesis is both \) and \051.

Hexadecimal strings are an arbitrarily long sequence of hexadecimal digits (using either A-F or a-f or a mixture), surrounded by "<" and ">", so something like <4E6F20>.  I don't think we use any of these.

Arrays are delimited by "[" and "]", and are just a sequence of things, not necessarily all the same.

Symbols can be keywords, or constants (true and false, for example), or occasionally names.  To differentiate a "thing" from a "name", names are prefixed with a slash, like /Size.

So the characters "<", ">", "(", ")", "[", "]" have special meaning for hexadecimal strings, dictionaries, strings, and arrays.  Plus "%" starts a comment.  And "/" defines a name.  So if we need to use these characters as characters, and not as syntactic tokens, we need a way to quote them.  Within a name, this is done by representing the character as a hash-sign "#" followed by the two-digit hexadecimal ASCII code for that character.  So if you want a "%" in a name, you represent it as #25 (and a literal "#" as #23).  A right parenthesis is #29.

PDF then has these two different quoting mechanisms -- one for strings and a different one for names.
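
To make the two mechanisms concrete, here is a small C sketch that writes a (...) string with back-slash escapes and a /Name with # escapes.  The emit_string() and emit_name() helpers are made up, and the set of characters escaped in a name is a conservative guess rather than a quote from the standard:

#include <stdio.h>
#include <string.h>

static void emit_string(FILE *f, const char *s)     /* (...) string */
{
    fputc('(', f);
    for (; *s; s++) {
        if (*s == '(' || *s == ')' || *s == '\\')
            fputc('\\', f);                         /* back-slash escape */
        fputc(*s, f);
    }
    fputc(')', f);
}

static void emit_name(FILE *f, const char *s)       /* /Name */
{
    fputc('/', f);
    for (; *s; s++) {
        if (strchr("()<>[]{}/%# \t\r\n", *s))
            fprintf(f, "#%02X", (unsigned char)*s); /* e.g. ')' becomes #29 */
        else
            fputc(*s, f);
    }
}

int main(void)
{
    emit_string(stdout, "50% done (almost)");  /* prints (50% done \(almost\)) */
    fputc('\n', stdout);
    emit_name(stdout, "50% done");             /* prints /50#25#20done */
    fputc('\n', stdout);
    return 0;
}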


The object cross-reference table is a list of n entries, where n is given by the /Size value in the trailer dictionary.  The cross-reference table begins with the keyword "xref".  This is followed by a pair of numbers: the object number of the first entry (always zero for us) and the number of entries in the table (which should be exactly equal to the /Size in the trailer, since the reserved entry for object 0 is counted too).  The zero object is reserved, so our real objects start at 1 and go up from there.

After "xref" and "0 n", we have a list of n lines each describing the location of one object in the PDF file.   Each line is exactly of the same format:

nnnnnnnnnn ggggg n

where nnnnnnnnnn is a 10-digit decimal byte offset to the i-th object and ggggg is the 5-digit generation number (always zero for us), and n is the literal character "n" (for "in-use").

The one exception to this is the first entry in the object table, which is for object 0, which is not used.  So its entry in the cross-reference table is

0000000000 65535 f

The "f" means "free".

So a small PDF file with 4 objects might have a cross-reference table and trailer that look like:

xref
0 5
0000000000 65535 f
0000000009 00000 n
0000000088 00000 n
0000000114 00000 n
0000000525 00000 n
trailer
<<
 /Size 5
 /Root 1 0 R
>>
startxref
680
%%EOF
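
Putting those pieces together, here is a hedged C sketch of writing the cross-reference table and trailer, assuming we recorded each object's byte offset as we wrote it.  The write_xref_and_trailer() helper and the offsets[] array are illustrative names; the offsets in main() are just the ones from the example above:

#include <stdio.h>

static void write_xref_and_trailer(FILE *f, const long *offsets, int nentries, int root)
{
    long xref_start = ftell(f);              /* this goes after "startxref" */

    fprintf(f, "xref\n0 %d\n", nentries);
    fprintf(f, "0000000000 65535 f \n");     /* reserved entry for object 0 */
    for (int i = 1; i < nentries; i++)
        fprintf(f, "%010ld 00000 n \n", offsets[i]);   /* exactly 20 bytes each */

    fprintf(f, "trailer\n<<\n /Size %d\n /Root %d 0 R\n>>\n", nentries, root);
    fprintf(f, "startxref\n%ld\n%%%%EOF\n", xref_start);
}

int main(void)
{
    long offsets[5] = { 0, 9, 88, 114, 525 };    /* object 0 is unused */
    FILE *f = fopen("tail.part", "w");
    if (!f) return 1;
    write_xref_and_trailer(f, offsets, 5, 1);
    fclose(f);
    return 0;
}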

Generation numbers have to do with being able to modify PDF files after they are written; in our work they will always be zero since we are not modifying existing PDF files, but writing new ones.

Notice how the /Root, in the example above, is "1 0 R".  That is a reference to object 1 (generation 0).  This is called an "indirect reference".  The idea is that, in general, when you are building an object and you need another object, you can either put its value directly in place, or you can make an indirect reference to it.  (The /Root is always an indirect reference; some other things also have to be indirect references.)  An indirect reference is an object number, followed by a generation number (0, as we've said), followed by R.

Objects, as we showed earlier, are introduced by the object number, a generation number, and the keyword "obj", then a dictionary of key/value pairs, followed by "endobj". Almost all objects have a /Type, and the content of the object depends on its type. We will go over all of those that we care about, but here is a really simple case.

The content objects that define what goes on a page are of the form:

k 0 obj
<<
     /Length n
>>
stream
...
endstream
endobj

where the contents of the stream (between the keywords "stream" and "endstream") define what goes on the page.  The only key/value pair (for our purposes) is the /Length of the stream -- the number of bytes that it takes up.  My image of a PDF viewer is that it finds a content object, finds the /Length and the beginning of the stream, reads and processes exactly /Length bytes, and then expects to see the keyword "endstream".  So we need to know the length of the stream before we put it into the PDF file.

Our basic approach is to just generate the content stream as we read thru our character description input file, moving to the right spot on the page, outputting characters and images, until we are done with this page.  So we don't know, in advance, how long the content stream will be.  One option would be to run thru our code twice, once to compute the length, and then once to compute the output.  Another would be to buffer the content stream in memory (or a file) until it is complete, then compute the length, and copy the buffered contents into the PDF file.

Both of these are possible, but tedious or error-prone to code.  Indirect objects, however, give us another possibility.  We generate the object header, but instead of giving a value to /Length, we give it an indirect reference to the next object:

546 0 obj
<<
/Length 547 0 R
>>
stream
...
endstream
endobj

As we generate the stream, we remember the byte offset in the file (ftell() will give us that) of the first byte of the stream; then, after we generate the last byte and before we write out the "endstream", we get the byte offset just past the last byte of the stream and compute the length.  Then we finish the definition of that object and add another one right after it which gives the length:

547 0 obj
15998
endobj


This increases the number of objects, but simplifies our programming.  Now when the PDF viewer looks at our file and goes to object 546, it sees that the /Length of the stream is an indirect reference to object 547, goes to the cross-reference table to find where object 547 is (a simple indexing operation into the table), goes to that part of the PDF file, gets the value (15,998), and pops back up to object 546 with the correct /Length.
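
Here is a hedged C sketch of that trick, using ftell() as described.  The write_content_stream() helper, the object numbers, and the one-line page content are illustrative; the content assumes a font resource named /F1, which a real file would have to define elsewhere:

#include <stdio.h>

static long offsets[1000];

static void write_content_stream(FILE *f, int obj, int len_obj)
{
    offsets[obj] = ftell(f);
    fprintf(f, "%d 0 obj\n<<\n/Length %d 0 R\n>>\nstream\n", obj, len_obj);

    long start = ftell(f);                 /* first byte of the stream */
    fprintf(f, "BT /F1 10 Tf 72 720 Td (Hello) Tj ET\n");  /* ... page content ... */
    long length = ftell(f) - start;        /* bytes actually written */

    fprintf(f, "endstream\nendobj\n");

    offsets[len_obj] = ftell(f);           /* now the separate length object */
    fprintf(f, "%d 0 obj\n%ld\nendobj\n", len_obj, length);
}

int main(void)
{
    FILE *f = fopen("content.part", "w");
    if (!f) return 1;
    write_content_stream(f, 546, 547);
    fclose(f);
    return 0;
}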

This posting is getting long.  Let's stop at this point; we will cover the content of our objects in the next posting.
