Republishing Old Books: PDF explained (part 2): Pages

From Part 1, we understand the overall file structure of a PDF, and how the PDF file is a list of objects. The cross-reference table shows us how to get from an object number to the location of the object in the file.

From a programming point of view, we will need to generate objects as we need them, and create a table of the objects we have created, along with their location in the file. The function ftell() will report to us where we are in the file. So to create an object, we use ftell() to say what the current offset is, enter the object number and offset in our table and then write out the definition of the object. When we are done, we write out the cross-reference table from our object table, then the trailer which points at the cross-reference table.
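The bookkeeping described above can be sketched in a few lines. This is only an illustration in Python, not the code used for the book: the class name `PdfObjectWriter` and its methods are hypothetical, `tell()` plays the role of ftell(), and object numbers are assumed to be handed out sequentially starting at 1.

```python
import io


class PdfObjectWriter:
    """Hypothetical helper: records each object's byte offset as it is written."""

    def __init__(self, out):
        self.out = out                 # binary file-like object with tell()
        self.offsets = []              # offsets[i] is the offset of object i + 1
        self.out.write(b"%PDF-1.4\n")

    def write_object(self, body):
        """Write one object and return its object number."""
        self.offsets.append(self.out.tell())   # Python's analogue of ftell()
        num = len(self.offsets)
        self.out.write(f"{num} 0 obj\n{body}\nendobj\n".encode("latin-1"))
        return num

    def finish(self, root_num):
        """Write the cross-reference table and the trailer that points at it."""
        xref_pos = self.out.tell()
        size = len(self.offsets) + 1   # object 0 plus every object we wrote
        self.out.write(f"xref\n0 {size}\n".encode("latin-1"))
        self.out.write(b"0000000000 65535 f \n")   # entry 0: head of free list
        for offset in self.offsets:
            # Each entry is exactly 20 bytes: 10-digit offset, generation, 'n'.
            self.out.write(f"{offset:010d} 00000 n \n".encode("latin-1"))
        self.out.write(
            f"trailer\n<< /Size {size} /Root {root_num} 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n".encode("latin-1")
        )
```

Writing into an `io.BytesIO()` buffer instead of a real file is a convenient way to check that every recorded offset really lands on the matching "n 0 obj" line.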

The top-level object is the /Root. The /Root is always of /Type /Catalog and then has a reference to the /Pages object. So, for example:

549 0 obj % root object
<<
/Type /Catalog
/Pages 545 0 R
>>
endobj

The /Pages object is a list of all of the pages of the document; it is called a "page tree" in the PDF documentation. It has a /Type of /Pages. It has a /Count of how many pages, and then a key /Kids whose value is an array of all the individual pages. Since we have a table of the pages of the book, we can use that to generate the list of pages:

568 0 obj % page tree object
<<
/Type /Pages
/Count 46
/Kids [
571 0 R % page bcover.gif
574 0 R % page f01.gif
577 0 R % page f02.gif
580 0 R % page f02a.gif
...
703 0 R % page p36.gif
706 0 R % page p40.gif
]
/Resources 567 0 R
>>
endobj

Notice how the pages are an array "[ ... ]", and in our case we make each page a separate object, so we have just a list of indirect object references (plus a comment saying what page each object is).
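Generating the page tree from a table of pages might look something like the following Python sketch. The function name `page_tree_object` and the shape of the `pages` list (pairs of object number and image name) are assumptions made for illustration.

```python
def page_tree_object(obj_num, pages, resources_num):
    """Build the text of a /Pages object.

    pages is a list of (page_object_number, image_name) in reading order;
    this layout is an assumption, not the book program's actual table.
    """
    lines = [
        f"{obj_num} 0 obj % page tree object",
        "<<",
        "/Type /Pages",
        f"/Count {len(pages)}",
        "/Kids [",
    ]
    for page_num, name in pages:
        # One indirect reference per page, with a comment naming the page.
        lines.append(f"{page_num} 0 R % page {name}")
    lines += ["]", f"/Resources {resources_num} 0 R", ">>", "endobj"]
    return "\n".join(lines) + "\n"
```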

One other thing in the page tree is the /Resources object.

For reasons that I don't understand, PDF wants page content to refer to various objects by a "name", not by the object number (a standard indirect reference). So when we set the current font, we refer to font "F1" and not object 44. Likewise, when we put an image in place, we refer to image "Im6" and not object 6. Maybe this is to create "short" names for things we use a lot. We then need to define the mapping of names to objects, and the Resource object does this. Since we need to reference fonts and images when we are defining pages, we need a Resource object for each page, or one for all pages. If we use the same Resource object for all pages, we can define it in the /Pages (page tree) object. Or we can define a Resource object for each page. Or both.

In our case, we will do both; there is very little cost for doing both. And we only have one Resource object, which lists all the images and all the fonts. In our code, we have a list of all the images and all the fonts, so we can associate the object number with that data structure and generate the Resource object by running through those tables. It does not hurt to include a definition of a font on a page that does not use it; we just need to be sure that if we use a font or image name, it is defined.

So our Resource object is just:

567 0 obj % resource object
<<
/XObject <<   % image resources
/Im1 1 0 R % image object for figures/p40.gif
/Im2 2 0 R % image object for figures/p36.gif
/Im3 3 0 R % image object for figures/p34.gif
...
/Im23 23 0 R % image object for figures/vline.gif
       >>
/Font <<   % font resources
/caslon28 61 0 R
/caslon12i 100 0 R
/caslon10i 233 0 R
/caslon10 394 0 R
/caslon8 517 0 R
/caslon6 566 0 R
       >>
>>
endobj

Notice how for images, we just define a name "Imx" for object number "x". For the fonts, we define them to be our normal font names. This allows us to say:

/caslon10i 1 Tf

to set the current font to our 10-point, italic font.
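Running through the image and font tables to produce the Resource object could be sketched like this in Python. The function name `resource_object` and the two list layouts are hypothetical; the naming rule (image "Imx" for object x, fonts under their own names) is the one described above.

```python
def resource_object(obj_num, images, fonts):
    """Build the text of the shared Resource object.

    images: list of (object_number, path); fonts: list of (name, object_number).
    Both layouts are assumptions made for this sketch.
    """
    lines = [f"{obj_num} 0 obj % resource object", "<<",
             "/XObject <<   % image resources"]
    for num, path in images:
        # Image names are simply "Im" followed by the object number.
        lines.append(f"/Im{num} {num} 0 R % image object for {path}")
    lines += [">>", "/Font <<   % font resources"]
    for name, num in fonts:
        lines.append(f"/{name} {num} 0 R")
    lines += [">>", ">>", "endobj"]
    return "\n".join(lines) + "\n"
```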

So, we have defined our Root, which points to a Page Tree which is an array of Pages, plus the Resource object which defines the names of the fonts and images we will use. Now a Page object itself turns out to be simple. Each Page object defines the page, which is to say its size and resources and the object that contains its content:

613 0 obj % page object for p07.gif
<<
/Type /Page
/Parent 568 0 R
/Resources 567 0 R
/MediaBox [ 0 0 336.00 470.40 ]
/Contents 611 0 R
>>
endobj

Remember, the stuff after the percent sign on the first line is just a comment. The /Type of the page is /Page (the type of the page tree is /Pages, plural). We have a key/value that points back to the page tree object (the /Parent), and one that points to the /Resources object -- the same one that we put in the page tree. Then come the two genuinely new items.

First, the size of the page: /MediaBox. This is an array of 4 values, a rectangle given by the coordinates of its lower-left and upper-right corners. I don't know why a page would ever start at other than 0,0 but I guess we have that generality. Maybe this would allow for margins, but that actually would be a "crop box" rather than a "media box".

But the larger issue is what the numbers mean -- what units are they in? We would like to use the same units we have internally, 600 DPI units, which would make our pages 2800x3920, but we would need some way to define our units.

PDF has such a definition capability -- UserUnit. This was introduced in PDF 1.6. By default, PDF uses a 72 DPI unit; each unit is 1/72 inch. To switch to 600 DPI, we should only have to say:

/UserUnit 0.12

which says that each of our units is 0.12 of 1/72 of an inch -- the size of our units in multiples of 1/72 inch. Our units are much smaller than 1/72 inch, which is why we have such a small number.

But this does not work. Empirically, setting the UserUnit value has no effect on the interpretation of a PDF file. I went so far as to get the source for xpdf -- an open source PDF package. If you search for "UserUnit", you can find the place in the code where it processes "UserUnit":

if (dict->lookup("UserUnit", &obj1)->isNum()) {
    userUnit = obj1.getNum();

and see that the value for /UserUnit is assigned to the internal variable userUnit (makes sense). Then searching the code further for uses of this variable, we find there are none. Yes, there is a "getter" function to get the value of this variable:

double getUserUnit() { return userUnit; }

but that function is never called, anywhere in the source code. So you can set the variable, but the variable has no effect on the interpretation of the PDF file.

So that means that we have to take our internal 600 DPI units and convert them to PDF units of 72 DPI. A small matter of coding. And that means that the page that we saw above, 336.00 by 470.40, converts to 4.67 inches by 6.53 inches, or 2800 x 3920 pixels at 600 DPI.

And every value that is output to our PDF file will need to be scaled from our internal 600 DPI to PDF's default 72 DPI.
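The scaling itself is a one-liner. Here is a small Python sketch of it; the names `to_pdf_units` and `media_box` are made up for illustration.

```python
INTERNAL_DPI = 600   # resolution of our internal page description
PDF_DPI = 72         # PDF's default user space: 1 unit = 1/72 inch


def to_pdf_units(value):
    """Scale a length in 600 DPI pixels to default PDF units."""
    return value * PDF_DPI / INTERNAL_DPI


def media_box(width_px, height_px):
    """Format the /MediaBox entry for a page given its size in pixels."""
    return (f"/MediaBox [ 0 0 {to_pdf_units(width_px):.2f} "
            f"{to_pdf_units(height_px):.2f} ]")
```

For our 2800 x 3920 page, `media_box(2800, 3920)` gives the same `/MediaBox [ 0 0 336.00 470.40 ]` line as in the page object above.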

But once that is settled, we have only the last part of a page to consider, the /Contents, which is the object that will describe how to generate our page image.

Most of our page is text, so most of our PDF would seem to be a text object. For a text page, we need to be able to position to a particular location and print a character in the correct font, then move to the location of the next character and print it. Text objects start with "BT" and end with "ET". In between these are commands to set the text state and output characters.

To set a font, we just say:

BT
/fn 1 Tf

where fn is a font name, as given in the /Resources object for this text page. The number (1 in the above example) is the font size. The examples all suggest that for a 10-point font, we would say:

/caslon 10 Tf

This is designed for scalable fonts -- where a given font can be made larger and smaller to get all the various sizes of the font. But in our case, we have a separate font, with different character images, for each point size. So we have two choices: (1) do not scale them -- use a "size" of 1 or (2) scale each use of the font the same. In this latter case, every use of a 10-point font would be:

/caslon10 10 Tf

and all the images that are generated will be magnified by a factor of 10. But since we are generating each glyph of each font as the set of bits that we want for that character in that size, we don't need any additional scaling and we go with option (1) -- we do not further scale them; all font scale factors are 1.

Once we start a text block (Begin Text -- BT), and have established the font, we then want to move to where the first character should be on the page. You would think this is a simple common operation and is supported directly, but it turns out it is not. The "normal" way to move in text mode is the Td operator which takes two parameters, x and y:

x y Td

Normally, of course x and y are numbers. For the line in our book representation:

c p04.gif 400 425 426 465 t

we want to move to location 400,425 and output the glyph for a "t". We might try

400 425 Td
(t) Tj

where the Td moves to the right spot and Tj outputs (or "shows") the string whose contents are the character "t". But remember, our internal units are 600 DPI, but PDF is 72 DPI, so we need to scale these units down to:

48 51 Td
(t) Tj

Where is location (48,51) on the page? In our internal model, we put the origin of the page (0,0) at the upper left. PDF puts the origin at the lower left. Thus (48,51) is near the bottom left corner, not the upper left corner. Since the page is 470.40 units long, we need to say 470.40 - 51 = 419.40, and we need:

48 419.40 Td
(t) Tj

This is close, but when the character "t" is put on the page, the glyph is painted so that its origin is positioned on the current location. For the "t", which sits on the baseline, the origin is at the bottom of the box, not the top, so we actually want 465, not 425, which is scaled to 55.8 and then adjusted to 470.40 - 55.8 = 414.6 to give:

48 414.60 Td
(t) Tj

This sometimes works, but generally does not. The problem is that the Td command does not move to (x,y); it moves to the start of the next line, offset from the start of the current line by (x,y). So x,y is not an absolute location but a relative amount that depends on where the last line started. PDF then remembers the new location as the start of the line, which defines where the next Td command moves from.
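Leaving aside the Td problem for a moment, the coordinate arithmetic itself (scale to 72 DPI, flip the y-axis, use the bottom of the box) can be collected into one function. This Python sketch assumes the four numbers in a "c" line are left, top, right, bottom in 600 DPI pixels, and that the glyph sits on the baseline with no descender; the name `glyph_origin` is hypothetical.

```python
def glyph_origin(box, page_height_px, dpi=600):
    """Return the PDF (x, y) of a glyph's origin.

    box is (left, top, right, bottom) in pixels, with y measured downward
    from the top of the page. Assumes the glyph sits on the baseline, so
    the bottom of the box is the baseline.
    """
    left, top, right, bottom = box
    x = left * 72 / dpi
    # PDF measures y upward from the bottom of the page, so flip first.
    y = (page_height_px - bottom) * 72 / dpi
    return x, y
```

For the "t" above, `glyph_origin((400, 425, 426, 465), 3920)` comes out at approximately (48, 414.6). A character with a descender would need its baseline from somewhere else, since the bottom of its box is below the baseline.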

The current state of the Text object is defined by two transformation matrices: the current text matrix and the current text line matrix. PDF, like PostScript, uses a matrix so that translation, scaling, and rotation can all be represented by a matrix multiply. Suppose we have the transformation matrix:

[ a b 0 ]
[ c d 0 ]
[ e f 1 ]

and represent a location (x,y) as [ x y 1 ] (we add the 1 so that we can multiply the 1x3 location vector by the 3x3 transformation matrix). Then we get a result location vector [ x' y' 1 ] where

x' = a*x + c*y + e;
y' = b*x + d*y + f;

Suppose b and c are zero, then we have

x' = a*x + e;
y' = d*y + f;

Here we have scaled (x,y) by a in the x-dimension and d in the y-dimension and then translated it by (e,f). (If a,b and c,d are set to the right sin() and cos() values, we get rotation.) So this one matrix can provide translation and scaling, as well as rotation. For the Text matrix and Text Line matrix we always keep b and c set to zero. The a and d values are used for scaling, and (e,f) define a location on the page.
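As a quick check of the formulas, here is the transformation written out in Python. The matrix is passed as the six numbers a b c d e f, in the same order the PDF operators take them; the function name `transform` is just for illustration.

```python
def transform(matrix, x, y):
    """Apply a PDF-style matrix (a, b, c, d, e, f) to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)
```

With b and c zero, `transform((2, 0, 0, 3, 10, 20), 1, 1)` scales by (2,3) and then translates by (10,20), giving (12, 23). With a = 0, b = 1, c = -1, d = 0, the point (1,0) goes to (0,1), a rotation by 90 degrees.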

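With this background, the "tx ty Td" operation is defined in terms of the Text matrix (Tm) and Text Line matrix (Tlm) by

           [ 1  0  0 ]
Tm = Tlm = [ 0  1  0 ] x Tlm
           [ tx ty 1 ]

That is, the start of the line is translated by (tx,ty), and the Text matrix is reset to that new line start.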

Initially the Text matrix and Text Line matrix are the identity matrix -- so the (e,f) position is (0,0), and there is no scaling (all scaling is 1.0). So the first "x y Td" operation will move to (x,y), but a subsequent "p q Td" will move to (x+p,y+q). We could keep track of this location, and when we want to move to (x,y), just compute the difference between that location and the previous location.

Or we can just set the (e,f) values in the matrix directly. There is another PDF command that will set the Text matrix directly (Tm). If we want to move to (x,y) (with no scaling), we can set the Tm matrix with:

1 0 0 1 x y Tm

and this moves us to (x,y).

So we have two ways to move to (x,y) -- just set the Text matrix with Tm, or remember where the previous line started and use Td to move relative to that location.
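Both approaches fit in a small helper that remembers where the current line started. This is a Python sketch with hypothetical names (`TextPositioner`, `move_td`, `move_tm`); it assumes no scaling, and that a new instance is created for each BT, since the matrices start out as the identity in every text object.

```python
class TextPositioner:
    """Tracks the start of the current line inside one BT ... ET block."""

    def __init__(self):
        # After BT both matrices are the identity, so the line starts at (0,0).
        self.line_x = 0.0
        self.line_y = 0.0

    def move_td(self, x, y):
        """Emit a relative Td that lands on the absolute position (x, y)."""
        dx, dy = x - self.line_x, y - self.line_y
        self.line_x, self.line_y = x, y
        return f"{dx:.2f} {dy:.2f} Td"

    def move_tm(self, x, y):
        """Emit an absolute Tm; it also resets the line start to (x, y)."""
        self.line_x, self.line_y = x, y
        return f"1 0 0 1 {x:.2f} {y:.2f} Tm"
```

Tm is simpler because no state is needed to use it, but it is longer to write out; Td produces smaller numbers when successive positions are close together.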

Once we are positioned, we want to show a character. Then we move to the next location and show another character and so on. But generally, the next character is just to the right of the current character. The PDF model associates with each character its "width", and if we show a character at (x,y) with width d, it automatically adjusts the current location to (x+d,y). So if the next character goes at (x+d,y), we don't need to explicitly move between them; we can just show the next character. So if our character positions match the character widths, we may be able to just do:

1 0 0 1 106.20 359.64 Tm
(The) Tj
1 0 0 1 1027.08 359.64 Tm
(Gates) Tj

and so on. Unfortunately, that seems to almost never work.

The problem is with the space between characters. Our positioning of characters is defined by where they are in the book, and we create our glyph for each character from the scanned images in the book, so we really do not know how wide a character is. We know how wide the bits for the glyph are but there is bound to be some extra space after the character -- characters do not butt directly one upon the other. If we look at the space after a given character, for example, the letter "e" in the book, we get a distribution from 28 to 49 pixels, with a peak at 40. So one option is to make the width of an "e" 40 pixels, but this will only handle some cases; in all the other cases we need to adjust the horizontal positioning after we show an "e" to be in the right place. Sometimes we will need to move to the left; sometimes to the right.

PDF provides a convenient way to do this. The TJ command takes an array whose elements are either numbers or strings. Working from left to right, if the element is a string, it is shown. If the element is a number, it is subtracted from the current x-position. Thus, the previous example could be written as:

1 0 0 1 106.20 359.64 Tm
[ (The) x (Gates) ] TJ

for some value of x. Our presentation of TJ says that x would be the size of the space between the "e" and the "G". From our book description file that would be about 22 pixels. But remember that 22 pixels is at 600 DPI; it is only 2.64 units at PDF's 72 DPI. And the number in a TJ array is not in the default text units; it is in thousandths of a text unit (multiplied by the font size, which for us is always 1). So not 72 DPI, but 72,000 DPI. There is no explanation of why the scale of the units changes, but the normal scaling for defining glyphs for a font is also by 1000. That means we need something like:

1 0 0 1 106.20 359.64 Tm
[ (The) -2640 (Gates) ] TJ

The spacing is negative because the value is subtracted from the current position (not added), so positive values move to the left; negative values move to the right.
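The conversion from a pixel gap to a TJ number, and the assembly of the array, can be sketched in Python as follows. The function names are hypothetical, the font size is assumed to be 1 as discussed earlier, and a gap is taken to be the extra rightward distance in 600 DPI pixels beyond what the glyph widths already advance.

```python
def tj_adjustment(gap_px, dpi=600, font_size=1.0):
    """Convert a rightward gap in scanner pixels into a TJ array number."""
    gap_units = gap_px * 72 / dpi            # pixels -> default PDF units
    # TJ numbers are thousandths of a text unit and are subtracted,
    # so a move to the right is a negative number.
    return -gap_units * 1000 / font_size


def escape_pdf_string(text):
    """Escape the characters that are special inside a (...) string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def tj_array(pieces):
    """Build a TJ command from a mix of strings and pixel gaps."""
    parts = []
    for piece in pieces:
        if isinstance(piece, str):
            parts.append(f"({escape_pdf_string(piece)})")
        else:
            parts.append(str(round(tj_adjustment(piece))))
    return "[ " + " ".join(parts) + " ] TJ"
```

With the 22-pixel gap from the example, `tj_array(["The", 22, "Gates"])` produces the `[ (The) -2640 (Gates) ] TJ` line shown above.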

This gives us enough to create the PDF file for our text. We can select our fonts (Tf), move to a position on the page (Tm or Td), and show our text, character by character, adjusting our position as we go (TJ). We can code it so that we invoke TJ for each word (stopping a TJ when we get a "space" in our book representation), or for each line (stopping a TJ only when we get a new-line).

If we want to try this out, we can define our fonts to be some of the standard fonts (the pre-defined Type 1 fonts) instead of our own fonts, but it looks terrible since the sizes and widths are all off. To get a good view, we really need to define our own fonts, which is the subject of our next blog entry.

Tuesday, April 14, 2015
