Republishing Old Books: Problems with PDF

It will not be easy to generate PDF. I am working from the ISO standard that defines PDF, ISO 32000-1:2008. But it is not at all clear how PDF works, or rather, how to make it work.

I have all the gross structure in place -- the objects, the object reference table, the document structure, the pages, and so on. The problem comes down just to putting the characters where they are supposed to be.

The main problem is the units. The space we are working with is 600 dots per inch, with the origin of the page in the upper left hand corner (positive x goes to the right, positive y goes down). The space for PDF has the origin at the lower left hand corner (positive y goes up). I can deal with that. But defining the units of the numbers is not working.

PDF's default unit is 1/72 of an inch. So if we move to (100,100), that is 100/72 inches in each direction. So our 600 dpi coordinates need to be converted to PDF's 72 dpi units. I could do that. Actually, PDF has a page property, UserUnit, that appears to do this automatically. UserUnit is defined as:

A positive number that shall give the size of default user space units, in multiples of 1⁄72 inch.

So I should be able to define my UserUnit as 72/600 (or 0.12), which is the size of my user space units as a multiple of 1/72 inch. But doing so has no effect. I can add it to my PDF or not, and the resulting PDF image is identical. I've even looked at the source code of a program that reads PDF files; the code recognizes this property, reads the value in, stores it in a variable, and never uses that variable at all.

This is compounded by several places where the size of the user unit seems to be arbitrarily changed. For example, the TJ command allows characters to be put down, one after another, with possible adjustments between each character. After a character is placed, the position where the next character will be placed is automatically increased by the width of the last character. So we put down an "A" and automatically move over the width of an "A", to be ready for the next character.

In our case, we have the position of each character in absolute terms, so we know that "A" goes at "416,2654", and the following "B" goes at "474,2654". Since an "A" is 58 units wide, that works perfectly. But later we have an "A" at "1853,882", and a following "d" at "1912,882". After the "A", we will be at 1911, not 1912, so we need to move one more unit to the right. The TJ command lets us do this, but the units for the "adjustment" to the position "shall be expressed in thousandths of a unit of text space". So we don't move by "1", but by "1000"? But this does not appear to work either.
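The bookkeeping described above can be sketched in Python. This is only a sketch under assumptions: the function names are mine, positions and widths are taken to be in text space units, and the conversion follows the displacement formula in section 9.4.4 of the standard as I read it, where the TJ number is divided by 1000, multiplied by the font size, and then subtracted from the position.

```python
def num(value):
    """Format a number for a PDF content stream (no exponent notation)."""
    text = ("%.4f" % value).rstrip("0").rstrip(".")
    return text if text not in ("", "-0") else "0"


def pdf_escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def tj_adjustment(gap, font_size):
    """TJ number that moves the next glyph `gap` text space units to the right.

    Assumption: displacement = -(number / 1000) * font_size, so the number
    depends on the font size set by the Tf operator.
    """
    return -gap * 1000.0 / font_size


def tj_operands(glyphs, font_size):
    """Build a TJ command from (char, x, width) tuples on one line of text."""
    parts = []
    run = ""
    expected_x = None  # where PDF will place the next glyph by itself
    for char, x, width in glyphs:
        if expected_x is not None and x != expected_x:
            # Close the current string and insert an explicit adjustment.
            parts.append("(" + pdf_escape(run) + ")")
            parts.append(num(tj_adjustment(x - expected_x, font_size)))
            run = ""
        run += char
        expected_x = x + width
    if run:
        parts.append("(" + pdf_escape(run) + ")")
    return "[" + " ".join(parts) + "] TJ"


print(tj_operands([("A", 416, 58), ("B", 474, 58)], 1))    # [(AB)] TJ
print(tj_operands([("A", 1853, 58), ("d", 1912, 59)], 1))  # [(A) -1000 (d)] TJ
```

With a font size of 1 the one-unit gap comes out as -1000, but with a font size of 100 the same gap would be -10, which may be part of why a fixed factor of 1000 does not appear to work.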

Another approach to the units problem is to scale using the text matrix. The text matrix defines how to scale (and translate) text. I should be able to define the scaling as 72/600 to convert from 600 dpi to 72 dpi units.
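A minimal sketch of what such a text matrix might look like, written in Python. The function names and the page height of 6600 pixels (11 inches at 600 dpi) are assumptions for illustration only. The y coordinate is flipped by hand, since a negative scale in the matrix would mirror the glyphs as well.

```python
SCALE = 72.0 / 600.0  # points per 600 dpi pixel


def num(value):
    """Format a number for a PDF content stream (no exponent notation)."""
    text = ("%.4f" % value).rstrip("0").rstrip(".")
    return text if text not in ("", "-0") else "0"


def tm_operator(x_px, y_px, page_height_px):
    """Return a Tm command that scales text space to 600 dpi pixels and
    places the text origin at the given top-left-origin pixel position."""
    x_pt = x_px * SCALE
    y_pt = (page_height_px - y_px) * SCALE  # flip: PDF's y axis points up
    return "%s 0 0 %s %s %s Tm" % (num(SCALE), num(SCALE), num(x_pt), num(y_pt))


# Hypothetical page height: 11 inches at 600 dpi.
print(tm_operator(1853, 882, 6600))  # 0.12 0 0 0.12 222.36 686.16 Tm
```

Because Tm replaces the text matrix rather than adding to it, a command like this also positions text at an absolute location.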

The same problem comes up when defining Type 3 fonts. With Type 3 fonts, we define a small graphic function for each character that defines what that character looks like. The units for these functions are defined by the FontMatrix which defines the scaling between glyph space and the text space. Since all our numbers are in units of 600 dpi, we don't need these two spaces to have different units, so we define the scaling to be 1, but this does not appear to work -- the resulting glyphs are much much too large. The standard says:

A common practice is to define glyphs in terms of a 1000-unit glyph coordinate system, in which case the font matrix is [ 0.001 0 0 0.001 0 0 ].

and I suspect that the 1000 multiple is there even though I don't want it. (I suspect the 1000 for the glyph coordinate system is related to the 1000 multiple in the TJ command.)
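One way to check where the extra scale might come from is to multiply out the factors that sit between glyph space and the page. The Python sketch below assumes, from my reading of the standard, that the size of a glyph unit is the FontMatrix scale times the font size from the Tf operator times any text matrix scale; it ignores horizontal scaling and the current transformation matrix, and the function names are my own.

```python
def glyph_unit_in_points(font_matrix_scale, font_size, text_matrix_scale=1.0):
    """Size of one glyph space unit on the page, in 1/72 inch units.

    Assumes uniform scaling everywhere and default user space.
    """
    return font_matrix_scale * font_size * text_matrix_scale


def font_size_for_pixel_glyphs(font_matrix_scale, text_matrix_scale=1.0, dpi=600):
    """Font size that makes one glyph space unit equal one pixel at `dpi`."""
    target = 72.0 / dpi  # one pixel, expressed in points
    return target / (font_matrix_scale * text_matrix_scale)


# FontMatrix scale 1 with a 12 point font: each glyph unit is 12 points,
# or 100 pixels at 600 dpi, which would look much too large.
print(glyph_unit_in_points(1.0, 12.0))

# Font sizes that would bring one glyph unit back to one 600 dpi pixel.
print(font_size_for_pixel_glyphs(0.001))      # with the common 0.001 FontMatrix
print(font_size_for_pixel_glyphs(1.0, 0.12))  # with FontMatrix 1 and a 0.12 text matrix
```

If this reading is right, a FontMatrix scale of 1 only gives pixel-sized glyph units when the font size and text matrix together also come out to 72/600.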

In theory, I should be able to code around these problems and just express my numbers in the standard 72 dpi units, particularly since I can use real numbers. So my "1853,882" would be "1853*72/600,882*72/600" or "222.36,105.84", and then moving 1 pixel to the right between an "A" and a following "d" would be not 1 unit but 1*72/600*1000 = 120 units (and probably -120, since the standard says that this adjustment is subtracted from the current position).
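The arithmetic above can be written down as a few small Python helpers. The names are my own, the page height passed to the flip is a hypothetical 6600 pixels, and the font size parameter reflects my reading of the standard that the TJ adjustment is also multiplied by the font size, so -120 holds only for a font size of 1.

```python
DPI = 600
POINTS_PER_INCH = 72


def px_to_pt(value_px, dpi=DPI):
    """Convert a length in pixels to PDF default units (1/72 inch)."""
    return value_px * POINTS_PER_INCH / dpi


def page_px_to_pdf_pt(x_px, y_px, page_height_px, dpi=DPI):
    """Convert a top-left-origin pixel position to a bottom-left-origin
    position in points."""
    return px_to_pt(x_px, dpi), px_to_pt(page_height_px - y_px, dpi)


def tj_number_for_gap(gap_px, font_size_pt, dpi=DPI):
    """TJ number that moves the next glyph `gap_px` pixels to the right.

    Negative because the adjustment is subtracted from the position.
    """
    return -px_to_pt(gap_px, dpi) * 1000 / font_size_pt


print(round(px_to_pt(1853), 2), round(px_to_pt(882), 2))  # 222.36 105.84
print(round(tj_number_for_gap(1, 1), 2))                  # -120.0
print(page_px_to_pdf_pt(1853, 882, 6600))                 # y is measured from the bottom
```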

But PDF seems to be particularly screwy in terms of its commands. For example, remember that our representation of a document gives the x,y position of each character. Mostly, we can keep track of the position of the last character, and position the next one relative to it. But in general, we may just need to move to a fixed location: for the first character on the page, for example, or a character at the beginning of a new paragraph, or one after an in-line image. PDF provides a Td operator that moves to an (x,y) position, but not an absolute (x,y) position. Rather, the (x,y) parameters are relative. And not relative to the previous position, but "offset from the start of the current line". What is the meaning of the "start of the current line"? It appears that it is the result of the last Td command, but that is not really clear, and does not seem to really work.
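If the "start of the current line" really is the point set by the last Td (and the origin right after BT), then absolute positions could be turned into Td commands by remembering that point. The Python sketch below assumes exactly that, along with an unscaled text matrix; the class name is hypothetical. Showing text with Tj or TJ is assumed not to move the remembered line start.

```python
def num(value):
    """Format a number for a PDF content stream (no exponent notation)."""
    text = ("%.4f" % value).rstrip("0").rstrip(".")
    return text if text not in ("", "-0") else "0"


class TdTracker:
    """Turn absolute text positions into relative Td commands.

    Assumption: after BT the line start is (0, 0), and each Td moves the
    line start by its operands.
    """

    def __init__(self):
        self.line_x = 0.0
        self.line_y = 0.0

    def move_to(self, x, y):
        """Return the Td command that moves the line start to (x, y)."""
        tx = x - self.line_x
        ty = y - self.line_y
        self.line_x, self.line_y = x, y
        return "%s %s Td" % (num(tx), num(ty))


tracker = TdTracker()           # create a new tracker after every BT
print(tracker.move_to(100, 700))  # 100 700 Td
print(tracker.move_to(100, 686))  # 0 -14 Td
print(tracker.move_to(150, 686))  # 50 0 Td
```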

By applying "fudge" factors in various places, I can mostly get a document that comes close to what I want, but I have no explanation for these fudge factors, and the result is not close enough for actual publication.

It may be possible to work around these issues, but it will take time and experimentation. Rather than delay publication trying to work these things out, it seems more expeditious to use my existing code to generate a 600 dpi GIF image of each page and then use existing software to convert these images into a PDF file where each page is simply an embedded image. This should take more bits (larger file size), but it should produce exactly the result that I want.

Tuesday, March 31, 2015
