Wednesday, April 15, 2015

PDF explained (part 3): Fonts and Images

In our last post, we showed how to print the characters (or how to "show" them), but we didn't go into how the fonts are defined.

The easy thing to do is to use one of the predefined fonts that are supposed to be available with PDF, but we want to define our own.  We have raster images for each character from the scan of the book that we did.  Our book description file (mainly produced by the OCR program Tesseract) gives the bounding box around the bits for each character in our scanned images.  If we take all the copies of a particular character, we can create a sort of average, or idealized, version of that character -- partly mechanically, partly by using a font editor to clean things up -- and then we can use this idealized version to replace all the individual characters with a perfect version.

So how do we get PDF to use our font, not the predefined ones?  The text content streams use font names.  The font names are defined in the /Resources object to be specific objects.  So we need to define a font object.

A font object is, like all other objects in PDF, basically a collection of key/value pairs.  It just has a bit more that we need to define than our other objects. 

394 0 obj % font object for caslon10
<<
/Type /Font
/Subtype /Type3
/FontMatrix [ .001 0 0 .001 0 0 ]
/FontBBox [ 0 0 0 0]
/Resources 24 0 R
/Encoding 393 0 R
/CharProcs 392 0 R
/FirstChar 33
/LastChar 135
...

The /Type of the font object is /Font.  Fonts can be several different types: Type 0, Type 1, TrueType, and Type 3.  Most of these are vector fonts.  We need to define a raster font, which is not scalable, and so PDF does not really want to encourage it.  But a Type 3 font "defines glyphs with streams of PDF graphics operators", which means that we should be able to define our own glyphs as raster bit images, so we will define Type 3 fonts.

The /FontMatrix defines the transformation matrix applied when we give the graphics operators for our glyphs.  Since we really still want to work in 600 DPI units, it is tempting to define the scaling for this matrix to put it in that unit, but remember that the TJ operator used 1/1000 units for the adjustments to character positions, and we suspect that 1/1000 is related to the "standard" /FontMatrix definition for Type 3 fonts, so we will stick with this standard definition.

The /FontBBox is the font bounding box -- the box that is big enough to include all our glyphs.  We could go thru all the glyphs and compute it, but if we set it to all zeros, PDF will compute it for us, so why should we bother to compute anything?  If we are wrong, the ISO Standard says "incorrect behavior may result".

We originally did not specify a /Resources value.  The ISO Standard says it is optional, and if absent will use the resource dictionary for the page that the font is used on.  But experience shows this is not true.  If we leave out an explicit /Resources value, when we make a reference to an image or anything, our viewer (evince) complains the name is undefined.  So to placate it, we include an explicit /Resources.  But this /Resources object is specific to the fonts -- it gives names to all the images that we need for the glyphs.

The next two objects, /Encoding and /CharProcs, are the heart of the font.  The /Encoding object is an object of /Type /Encoding which defines a /Differences array giving the differences from a standard character encoding.  We find it easiest to just define the entire encoding.  The array consists of pairs of items -- the first of each pair is the numeric character code and the second is the name of the corresponding glyph.  So this is basically a definition of ASCII, for the printable characters, possibly with extensions.  We only need to give the encodings for the characters we are going to use for this font.  Thus, for Caslon28, the 28-point size used for the title page, we have only a few characters, and can define an /Encoding object as:

60 0 obj % font encoding for caslon28
<<
/Type /Encoding
/Differences [
46 /.
65 /A
68 /D
69 /E
71 /G
72 /H
73 /I
75 /K
76 /L
77 /M
78 /N
79 /O
80 /P
82 /R
83 /S
84 /T
87 /W
]
>>
endobj
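Generating such an /Encoding object from a table of the characters a font actually uses is mechanical.  Here is a sketch in Python (a made-up helper, not the code actually used for the book):

```python
# Sketch: build an /Encoding object with a /Differences array mapping each
# character code we use to a glyph name of our own choosing.
def make_encoding_object(obj_num, used_chars, comment=""):
    """used_chars: dict mapping character code -> glyph name (without the slash)."""
    lines = [f"{obj_num} 0 obj % {comment}", "<<", "/Type /Encoding", "/Differences ["]
    for code in sorted(used_chars):
        lines.append(f"{code} /{used_chars[code]}")
    lines += ["]", ">>", "endobj"]
    return "\n".join(lines)

print(make_encoding_object(60, {46: ".", 65: "A", 87: "W"},
                           "font encoding for caslon28"))
```

Running over the full character table for caslon28 would reproduce the object above.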

When we try to show a character in caslon28 (the current font), the PDF engine will take the character encoding that is given and look it up in the /Encoding object for this font.  The selected entry gives the name of the glyph that we want to show.  The name of the glyph is then looked up in the /CharProcs object, which is a dictionary giving the object number for each glyph name.  So the /CharProcs object is just a list of each glyph name and the object which defines how it can be drawn.

59 0 obj % mapping a character for caslon28 to a draw function
<<
/. 26 0 R
/A 28 0 R
/D 30 0 R
/E 32 0 R
/G 34 0 R
/H 36 0 R
/I 38 0 R
/K 40 0 R
/L 42 0 R
/M 44 0 R
/N 46 0 R
/O 48 0 R
/P 50 0 R
/R 52 0 R
/S 54 0 R
/T 56 0 R
/W 58 0 R
>>
endobj
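The two-step lookup the PDF engine performs can be mimicked in a few lines.  A Python sketch, using a few of the caslon28 entries from the objects above:

```python
# character code -> glyph name (/Encoding), then glyph name -> drawing
# object (/CharProcs); names and references taken from the objects above.
encoding = {46: ".", 65: "A", 87: "W"}                      # /Differences pairs
charprocs = {".": "26 0 R", "A": "28 0 R", "W": "58 0 R"}   # name -> object ref

def glyph_object_for(code):
    name = encoding[code]     # step 1: /Encoding lookup
    return charprocs[name]    # step 2: /CharProcs lookup

print(glyph_object_for(87))   # the object that draws 'W'
```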

Each glyph name thus maps to an object whose content is the procedure to draw that glyph's image.

Note that we can extend these beyond the standard 128 ASCII characters.  In the book, we have two extensions: open and close double quotes, and ligatures.  For one of these extensions, we decided to encode them as the character codes 128, 129, ... 135.  When we want to output one of these characters, we find its code and emit that code in the Text object.  For example,

[(\201) -480 (free) -1080 (\202)] TJ

which shows the open-double-quote (\201 or 129) and the close-double-quote (\202 or 130).  This code (say \201) is then looked up in the /Encoding object to get the name of the glyph

...
122 /z
129 /odq
130 /cdq
...

And the name of the glyph is looked up in the /CharProcs object to get the object that draws that character

...
/z 377 0 R
/odq 379 0 R
/cdq 381 0 R
...

The last form of information in the /Font object is the character width information.  It seems to me that this could easily be done like the /CharProcs information, or the /Encoding information, having an object that maps the glyph or encoding to how wide the character is -- how much to move in the horizontal dimension after displaying the glyph.  But PDF does it differently.  It has a key /Widths whose value is an array of character widths.  The widths should be given in units which are 1/1000 of a standard unit -- like the widths in the TJ command and the default FontMatrix.  The ISO Standard says they are in the units defined by the FontMatrix, but we will just stick with the default.

However, notice that the /Encoding list does not necessarily cover the entire ASCII code range (0 to 127) -- it may start later (no use of the control codes or of the blank, code 32) and may go higher (past 127, up to 255).  One approach would be to make the array always 256 long, but instead PDF adds two other key/value pairs to the /Font object, /FirstChar and /LastChar, which give the smallest and largest character codes in the /Encoding object; the /Widths array then goes from /FirstChar to /LastChar.

For example, for our Caslon 28-point font, with only a few characters, we can define

/FirstChar 46
/LastChar 87
/Widths [ 4560 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23280 0 0 24000 20280 0 25560 25560 11040 0 21360 20160 28920 25920 24120 19080 0 21600 16080 24240 0 0 31560 ]

since we only have characters from period (encoding 46) to 'W' (encoding 87).  The units here are 1/1000 of a 72 DPI unit.  So the period in our 28-point Caslon is 4.560/72 inches wide -- about .06 inches.
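Building the /Widths array from a per-character width table is mechanical: find the smallest and largest codes, and fill unused codes in between with zeros.  A Python sketch (made-up helper, not the actual generator):

```python
def widths_array(width_by_code):
    """width_by_code: dict of character code -> width in 1/1000 of a 72 DPI unit.
    Returns (/FirstChar, /LastChar, widths) with zeros for unused codes."""
    first, last = min(width_by_code), max(width_by_code)
    widths = [width_by_code.get(c, 0) for c in range(first, last + 1)]
    return first, last, widths

# period, 'A', and 'W' from the caslon28 example above
first, last, w = widths_array({46: 4560, 65: 23280, 87: 31560})
print(first, last, len(w))   # 46 87 42
```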

This leaves us with the definition of the /CharProcs objects, one for each glyph; once those are defined, we have a complete font definition.

The object that draws a glyph can use any PDF graphics command.  The definition of Type 3 fonts says that the first command has to be either d0 or d1.  The d0 command tells the PDF engine that the glyph drawing code may set both shape and color; d1 says we will only set shape -- the color is set by the environment calling the glyph drawing code.  In our case, we only want ordinary text characters, so we do not need color, so we use d1.  The d1 command wants to know both the width and the bounding box of the glyph.  The width is redundant -- it's already in the /Widths array -- so I don't know why we need to repeat it.  In fact, the definition of the d1 command says it has to be "consistent" with the /Widths array, but we can easily generate that.  We can also look at the bit-map that we have for this glyph and compute its bounding box.

The ISO standard only gives one example of a Type 3 font, and it shows two glyphs -- one a triangle and one a rectangle.  We want a more complex display for each glyph, but what we have described so far is really a lot of structure -- the overall structure of the PDF, the page tree, the /Resource object, the /Font objects, and the text streams, and the multiple different types of scaling we are doing.  Let's check first that all of this is in place and working before we try to draw the actual raster images.  We can mimic the drawing of the rectangle that is given in the ISO standard, but make the size of the rectangle equal to the size of the bounding box for each character.  That will show placement and spacing for where each character goes.

If we do this, we would define an object to draw a 28 point W, for example, as:

58 0 obj % a function to draw W 28pt
 << /Length 50 >>
stream
 31560 0 0 0 30120 22320 d1
 0 0 30120 22320 re f
endstream
endobj

Notice that we need to define the /Length of each function.  We could use the indirect object technique that we showed earlier, but for this code it seems easiest to just put the function, which is small, in a buffer and then compute the length of the buffer and dump the length and the buffer at the same time.  Looking up what these commands do: the "re" command defines a rectangle by its lower-left corner, width, and height, and the "f" command fills that rectangle (presumably with a black color).
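The buffer-then-length approach might be sketched like this in Python (a made-up helper, not the actual code); note that the example body works out to exactly the /Length 50 shown in the object above:

```python
# Sketch: build the tiny glyph procedure in a buffer first, then emit
# the object header with /Length computed from the buffer size.
def glyph_stream_object(obj_num, body, comment=""):
    buf = body if body.endswith("\n") else body + "\n"
    return (f"{obj_num} 0 obj % {comment}\n"
            f" << /Length {len(buf)} >>\n"
            "stream\n" + buf + "endstream\nendobj")

obj = glyph_stream_object(58,
                          " 31560 0 0 0 30120 22320 d1\n 0 0 30120 22320 re f",
                          "a function to draw W 28pt")
print(obj)
```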

It took a day of trial and error to get our cross reference table correct, to define the right scaling, and so on, but eventually the approach we have just outlined produced a PDF file that generated no errors when viewed and produced an image like:

This suggests we are close to defining our complete PDF file.  The problem is how to show the actual bit-map raster images for each character.

The book by Rosenthol, Developing with PDF, comes in handy at this point.  He has a chapter on "Images", and starts off with "Raster Images".  This gives a step-by-step description of how to create a bit-map image in PDF.  In addition to needing to do this for fonts, we also have to be able to handle the "image" lines in our book description file, so this seems like it will handle both of our remaining problems.

To define an image, we have two parts:  (1) the definition of the image and (2) how to show the image.  PDF has an object type that supports raster images, so we can just define it:

21 0 obj % image object for figures/cover1.gif
<<
/Type /XObject
/Subtype /Image
/ColorSpace /DeviceGray
/Width 431
/Height 397
/BitsPerComponent 8
/Filter /ASCIIHexDecode
/Length 342611
>>
stream
FFFFFFFFF ... FFFFFFFFFFFFFFFF>
endstream
endobj

The image has a /Width and /Height -- only rectangular images are supported.  We have to define the color space (in our case everything is black and white, but we could support color), so we use /DeviceGray, which gives us values from 0.0 (black) to 1.0 (white) for our colors.  The /BitsPerComponent value says how many bits we use to encode that.  For 8, we get from 0 (black) to 255 (white), giving 256 levels of gray.  This works out really well for our GIF files, with their color maps of up to 256 values.  To compute the value of a pixel, we take the pixel from the GIF file, index into the color map to get 8-bit R,G,B values, and then convert those to an 8-bit gray scale.  If R=G=B, then this value is the gray scale value.  If not, we found a comment in the ISO standard noting that the NTSC video standard uses:

 gray = 0.3*r + 0.59*g + 0.11*b;
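As an illustrative sketch (in Python, with a made-up function name), the per-pixel conversion might look like:

```python
def gif_pixel_to_gray(r, g, b):
    """Convert an 8-bit RGB color-map entry to an 8-bit gray value,
    using the NTSC weighting quoted from the ISO standard."""
    if r == g == b:
        return r                      # already gray
    return round(0.3 * r + 0.59 * g + 0.11 * b)

print(gif_pixel_to_gray(255, 255, 255))   # 255 (white)
print(gif_pixel_to_gray(0, 0, 0))         # 0 (black)
```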

With a 431x397 image of 8-bit values, we would need 171,107 bytes, with each byte being in the range 0 to 255.  This would clearly be binary data.  But at least to start, we want to keep it in an ASCII encoding.  The /ASCIIHexDecode value says we will write each byte as two hex digits, so we have a stream of 342,214 bytes.  The /ASCIIHexDecode representation ignores white space, so just for ease of understanding, let us write each row of data as a separate line; the 397 new-lines boost our total to 342,611 bytes, which as you see is the /Length value for this image.

And for some reason, even tho we specify the /Length of the image, the /ASCIIHexDecode representation says that the stream must end with a ">" character.  Maybe it's trying to use the same decoder function as is used for hex strings (<...>), but then it seems it would also start with a "<".  Whatever.
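Emitting the stream this way can be sketched in a few lines of Python (a made-up helper; the /Length bookkeeping follows the arithmetic above -- hex digits plus one newline per row, with the ">" tacked onto the last row):

```python
def ascii_hex_image(rows):
    """rows: list of bytes objects, one per image row.
    Returns (/Length value, stream text): one line of hex digits per row,
    ending with the '>' that /ASCIIHexDecode requires."""
    lines = [row.hex().upper() for row in rows]
    length = sum(len(l) + 1 for l in lines)   # hex digits + one newline per row
    stream = "\n".join(lines) + ">\n"
    return length, stream

# three one-byte rows, as in a narrow glyph image
length, stream = ascii_hex_image([bytes([0xC6]), bytes([0x82]), bytes([0x8E])])
print(length)    # 9: three rows of 2 hex digits plus 3 newlines
print(stream, end="")
```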

We can use different /Filter values to signal different encodings for the image.  For example, we see here that an image of n bytes takes slightly more than 2*n bytes to represent -- each 8-bit value takes two hex digits.  But we could use /ASCII85Decode instead, which takes 4 8-bit bytes (32 bits) and uses 5 ASCII printing characters (there are 85 different characters used) to represent those 32 bits.  This would mean the n bytes of data take only 5/4*n bytes to represent (with a bit more time to encode and decode the image stream).  Or we could go to a binary representation with compression; there are 10 different standard /Filter values defined in the ISO standard to encode (and possibly encrypt) binary data streams.  But let's stick with this to get things working.

Once we have our image object defined, we need a way to refer to it.  We define a name for it in the /Resources object.  All the examples in the ISO Standard, and in Rosenthol's book, just use names like "Im1" or "Im2" -- "Im" for image, plus a number suffix to make the name unique.  We know that each object has a unique object number, so let us just use the object number as the suffix.  We then add to our /Resources object for our pages:

 /XObject  <<   % image resources
 /Im1 1 0 R % image object for figures/p40.gif
 /Im2 2 0 R % image object for figures/p36.gif
 /Im3 3 0 R % image object for figures/p34.gif
 ...
 /Im22 22 0 R % image object for figures/cover2.gif
 /Im23 23 0 R % image object for figures/vline.gif
       >>

to define names for all of the images that are in the book.

Now to display the image, at a particular point in the book, we will have a line like:

i bcover.gif 545 3093 975 3489 figures/cover1.gif

which says that we want the image from figures/cover1.gif to be put starting at location (545 3093) and going to (975 3489).  The previous line was probably showing a character, so we first exit text mode (ET).  Since PDF has the origin at the lower-left, we need to move to location (545 3489), but expressed in 72 DPI units, and remember that the y-coordinate has zero at the bottom of the page, not the top, so this becomes a move to (65.4 61.2).
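The coordinate conversion can be wrapped in a small helper.  This is a Python sketch (a made-up function, not the actual converter); the page height is in our internal 600 DPI units, and defaults to 3920 for the 2800x3920 body pages (the cover page height differs):

```python
SCALE = 72 / 600   # 0.12: one 600 DPI unit expressed in 72 DPI units

def to_pdf(x, y, page_height=3920):
    """Convert internal 600 DPI, top-origin coordinates to PDF's
    72 DPI, bottom-origin coordinates."""
    return round(x * SCALE, 2), round((page_height - y) * SCALE, 2)

print(to_pdf(400, 465))   # (48.0, 414.6)
```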

And it turns out that the image definition is interpreted as defining a raster image painted into a 1x1 unit square, so if we want it at the actual size we have (431x397, but in 72 DPI units, so 51.72x47.64), we have to scale it up to this size.  Both setting the location and the scale are what the transformation matrix is used for, so our first step is to set the transformation matrix.  Then we just display the object (the Do command):

ET
51.72 0 0 47.64 65.4 61.2 cm
/Im21 Do

(Of course, we only have to exit text mode if we were in it, and we may go back in it again if we have more text next.)

Adding this to our code to translate from our internal book description to a PDF file, we get images as well as our current black rectangles for letters.


And this seems like almost what we want also for our font glyph code.  If we define a small image object which is the bit-map for our glyph, we can then just display it the way we display the page images.

Generating the images for the glyphs is fairly easy, and then changing the glyph drawing object to draw the glyph image is just:


525 0 obj % a function to draw A 6pt
 << /Length 58 >>
stream
 4080 0 0 0 3480 3360 d1
 3480 0 0 3360 0 0 cm
 /Im524 Do
endstream
endobj

But this turns out to still draw little black boxes for each glyph, and not the image of the character.  There is a hint in the ISO Standard where it says:

NOTE 6      

One of the most important uses of stencil masking is for painting
character glyphs represented as bitmaps.  Using such a glyph as a
stencil mask transfers only its “black” bits to the page, leaving
the “white” bits (which are really just background) unchanged.
For reasons discussed in 9.6.5, "Type 3 Fonts", an image mask,
rather than an image, should almost always be used to paint glyph
bitmaps.

Now, I can't find anything in 9.6.5 about image masks instead of images, but we can see that images are not really the same as glyphs.  Glyphs, in our internal representation, have only two values for each pixel, black or white, while images have a color-map index which leads to 256 gray scales, or even color.  Font glyph pixels are only 1 bit each.  So representing them as we did with images (where each pixel is 8 bits) is obviously overkill.  It seems that we may want to try a "Stencil Mask" instead of an image.

Rosenthol's book has a section on Stencil Masks, and it seems there are really only two differences from our current images:

1. The /ImageMask key is set to "true", and
2. The /BitsPerComponent is set to 1 (not 8).

An initial stab at changing these two keys in the definition of a glyph image gives us non-black rectangles for our fonts -- nothing readable, but a start, so this looks promising.

Next we have the problem of how to represent an image object with only one bit per pixel.  Clearly we would want to pack the bits, so that we have 8 1-bit pixels in a byte.  But suppose we have an image, like our 6-point A above, which is 29x28 bits?  That's a total of 812 bits, which is 101 8-bit bytes plus 4 left over bits.  We can pad that out to 102 8-bit bytes with 4 extra zeros.

If we do this, some characters show up just fine, but others look like they have a skew problem.

The "O" looks right, and the "A", "T", and "V" are close.  Examining the "O", we find that it is 184x186, and specifically, notice that the 184 is a multiple of 8 -- 8*23 = 184.  The "A" is 183x186 and the "T" is 191x190.  It would seem that we need to pad each row of the image to a multiple of 8, not just the entire image.
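Packing the bits with each row padded out to a byte boundary looks something like this (a Python sketch, not the actual implementation):

```python
def pack_rows(bitmap):
    """bitmap: list of rows, each a list of 0/1 pixels (1 = black).
    Packs each row into whole bytes, padding the ROW -- not just the
    whole image -- out to a multiple of 8 bits."""
    out = bytearray()
    for row in bitmap:
        padded = row + [0] * (-len(row) % 8)   # pad this row to a byte boundary
        for i in range(0, len(padded), 8):
            byte = 0
            for bit in padded[i:i + 8]:
                byte = (byte << 1) | bit
            out.append(byte)
    return bytes(out)

# a 7-pixel-wide row fits in one byte: 1100011 pads to 11000110 = 0xC6
print(pack_rows([[1, 1, 0, 0, 0, 1, 1]]).hex())
```

With this scheme, a 29-pixel-wide row takes 4 bytes (32 bits), so every row starts on a byte boundary and the skew disappears.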

With this change, everything works fine.  We can position characters and show the correct raster image.  Our final Type 3, raster image object for a 6-point comma, for example, is:

520 0 obj % image object for , 6pt
<<
/Type /XObject
/Subtype /Image
/ImageMask true
/Width 7
/Height 11
/BitsPerComponent 1
/Filter /ASCIIHexDecode
/Length 33
>>
stream
C6
82
00
00
00
80
C0
F0
E2
C6
8E>
endstream
endobj

and the object to actually display it is:

521 0 obj % a function to draw , 6pt
 << /Length 61 >>
stream
 1200 0 0 -480 840 840 d1
 840 0 0 1320 0 -480 cm
 /Im520 Do
endstream
endobj

(Notice that the origin of the comma is not at 0,0 since part of the comma extends below the baseline, so we have to position at (0, -480) to put the origin of the comma on the baseline.)

This allows us to completely define our book directly in PDF from our character and box representation.

Now our file is huge -- 85,522,909 bytes, compared to only 20,179,903 bytes if we just import our GIF page images into LibreOffice and have it generate a PDF from them -- but this is largely because of the way we are representing our images.  If we use /ASCII85Decode instead of /ASCIIHexDecode, we should cut our file size almost in half.  Going to a compressed binary representation of our images (using /FlateDecode, for example) would probably bring us to a very small size (at the cost of being a binary file).  The 23 GIF images in our book take up 84,506,939 bytes of our PDF file, so only 1,015,970 bytes are needed for the fonts and text.  The GIF file representations of the 23 figures are only about 17 mega-bytes total, so we should be able to reduce the size of the file to around 18 mega-bytes with a compressed binary representation.

And we may be able to reduce the size of the text part by working on the widths of the characters or how we position to show text (using Td or Tm). But we at least have our Type 3 font working.


Tuesday, April 14, 2015

PDF explained (part 2): Pages

From Part 1, we understand the overall file structure of a PDF, and how the PDF file is a list of objects.  The cross-reference table shows us how to get from an object number to the location of the object in the file.

From a programming point of view, we will need to generate objects as we need them, and create a table of the objects we have created, along with their location in the file.  The function ftell() will report to us where we are in the file.  So to create an object, we use ftell() to say what the current offset is, enter the object number and offset in our table and then write out the definition of the object.  When we are done, we write out the cross-reference table from our object table, then the trailer which points at the cross-reference table.
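The bookkeeping described above can be sketched in Python (using an in-memory buffer whose tell() plays the role of ftell(); class and method names are made up for illustration):

```python
import io

class PdfWriter:
    """Minimal sketch: record each object's byte offset as we write it,
    then emit the cross-reference table from the offset table."""
    def __init__(self):
        self.f = io.BytesIO()     # stands in for the output file
        self.offsets = {}         # object number -> byte offset

    def write_obj(self, num, body):
        self.offsets[num] = self.f.tell()          # ftell() equivalent
        self.f.write(f"{num} 0 obj\n{body}\nendobj\n".encode())

    def write_xref(self):
        xref_at = self.f.tell()
        n = max(self.offsets) + 1
        self.f.write(f"xref\n0 {n}\n".encode())
        self.f.write(b"0000000000 65535 f \n")     # required free entry 0
        for i in range(1, n):
            self.f.write(f"{self.offsets.get(i, 0):010d} 00000 n \n".encode())
        return xref_at   # the trailer's startxref points here

w = PdfWriter()
w.write_obj(1, "<< /Type /Catalog /Pages 2 0 R >>")
print(w.offsets[1])   # 0
```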

The top-level object is the /Root.  The /Root is always of /Type /Catalog and then has a reference to the /Pages object.  So, for example:

549 0 obj % root object
<<
 /Type /Catalog
 /Pages 545 0 R
>>
endobj

The /Pages object is a list of all of the pages of the document; it is called a "page tree" in the PDF documentation.  It has a /Type of /Pages.  It has a /Count of how many pages, and then a key /Kids whose value is an array of all the individual pages.  Since we have a table of the pages of the book, we can use that to generate the list of pages:

568 0 obj % page tree object
<<
 /Type /Pages
 /Count 46
 /Kids [
 571 0 R  % page bcover.gif
 574 0 R  % page f01.gif
 577 0 R  % page f02.gif
 580 0 R  % page f02a.gif
 ...
 703 0 R  % page p36.gif
 706 0 R  % page p40.gif
 ]
 /Resources 567 0 R
>>
endobj


Notice how the pages are an array "[ ... ]",  and in our case we make each page a separate object, so we have just a list of indirect object references (plus a comment saying what page each object is).

One other thing in the page tree is the /Resources object.

For reasons that I don't understand, PDF wants to refer to various objects by a "name" not by the object number (a standard indirect reference).  So when we set the current font, we refer to font "F1" and not object 44.  Also when we put an image in place, we refer to image "Im6" and not object 6.  Maybe this is to create "short" names for things we use a lot.  So we need to then define the mapping of names to objects.  The Resource object does this.  And since we need to reference fonts and images when we are defining pages, we need a Resource object for each page, or maybe for all pages.  If we use the same Resource object for all pages, we can define it in the /Pages or page tree object.  Or we can define a Resource object for each page.  Or both.


In our case, we will do both; there is very little cost for doing both.  And we only have one Resource object which lists all the images and all the fonts.  In our code, we have a list of all the images and all the fonts, so we can associate the object number with that data structure and generate the Resource object by running thru those tables.  It does not hurt to include a definition of a font on a page that does not use it; we just need to be sure that if we use a font or image name, it is defined.
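Since we keep tables of all the images and fonts anyway, generating the single Resource object is just running thru those tables.  A Python sketch (a made-up helper, not the actual code):

```python
def resources_object(obj_num, images, fonts):
    """images: dict of object number -> comment string;
    fonts: dict of font name -> object number.
    Names every image /Im<objnum> and every font by its font name."""
    lines = [f"{obj_num} 0 obj % resource object", "<<",
             " /XObject <<   % image resources"]
    for n, comment in images.items():
        lines.append(f" /Im{n} {n} 0 R % {comment}")
    lines += [" >>", " /Font <<   % font resources"]
    for name, n in fonts.items():
        lines.append(f" /{name} {n} 0 R")
    lines += [" >>", ">>", "endobj"]
    return "\n".join(lines)

print(resources_object(567, {1: "image object for figures/p40.gif"},
                       {"caslon28": 61, "caslon10": 394}))
```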

So our Resource object is just:

567 0 obj % resource object
<<
 /XObject  <<   % image resources
 /Im1 1 0 R % image object for figures/p40.gif
 /Im2 2 0 R % image object for figures/p36.gif
 /Im3 3 0 R % image object for figures/p34.gif
...
 /Im23 23 0 R % image object for figures/vline.gif
       >>
 /Font <<   % font resources
 /caslon28 61 0 R
 /caslon12i 100 0 R
 /caslon10i 233 0 R
 /caslon10 394 0 R
 /caslon8 517 0 R
 /caslon6 566 0 R
       >>
>>
endobj

Notice how for images, we just define a name "Imx" for object number "x".  For the fonts, we define them to be our normal font names.  This allows us to say:

/caslon10i 1 Tf

to set the current font to our 10-point, italic font.

So, we have defined our Root, which points to a Page Tree which is an array of Pages, plus the Resource object which defines the names of the fonts and images we will use.  Now a Page object itself turns out to be simple.  Each Page object defines the page, which is to say its size and resources and the object that contains its content:


613 0 obj % page object for p07.gif
<<
 /Type /Page
 /Parent 568 0 R
 /Resources 567 0 R
 /MediaBox [ 0 0 336.00 470.40 ]
 /Contents 611 0 R
>>
endobj

Remember the stuff after the percent sign on the first line is just a comment.  The /Type of the page is /Page (the type of the page tree is /Pages, plural).  We have a key/value that points back to the page list object (the /Parent), and one that points to the /Resources object -- the same one that we put in the page tree.  Then the two actual new items.

First, the size of the page: /MediaBox.  This is an array of 4 values, a rectangle, which defines the start and end points of the page.  I don't know why a page would ever start at other than 0,0 but I guess we have that generality.  Maybe this would allow for margins, but that actually would be a "crop box" rather than a "media box".

But the larger issue is what the numbers mean -- what units are they in?  We would like to use the same units we have internally, 600 DPI units, which would make our pages 2800x3920, but we would need some way to define our units.

PDF has such a definition capability -- /UserUnit, defined in PDF 1.6.  By default, PDF uses a 72 DPI unit; each unit is 1/72 inch.  To switch to 600 DPI, we should only have to say:

/UserUnit 0.12

which says that each of our units is 0.12 of 1/72 of an inch -- the size of our units in multiples of 1/72 inch.  Our units are much smaller than 1/72 inch, which is why we have such a small number.

But this does not work.  Empirically, setting the UserUnit value has no effect on the interpretation of a PDF file.  I went so far as to get the source for xpdf -- an open source PDF package.  If you search for "UserUnit", you can find the place in the code where it processes "UserUnit":

  if (dict->lookup("UserUnit", &obj1)->isNum()) {
    userUnit = obj1.getNum();

and see that the value for /UserUnit is assigned to the internal variable userUnit (makes sense).  Then searching the code further for uses of this variable, we find there are none.  Yes, there is a "getter" function to get the value of this variable:

double getUserUnit() { return userUnit; }

but that function is never called, anywhere in the source code.  So you can set the variable, but the variable has no effect on the interpretation of the PDF file.

So that means that we have to take our internal 600 DPI units and convert them to PDF units of 72 DPI.  A small matter of coding.  And that means that the page that we saw above: 336.00 470.40 converts to 4.66 inches by 6.53 inches, or 2800 x 3920 pixels in 600 DPI.

And every value that is output to our PDF file will need to be scaled from our internal 600 DPI to PDF's default 72 DPI.

But once that is settled, we have only the last part of a page to consider, the /Contents, which is the object that will describe how to generate our page image.

Most of our page is text, so most of our PDF would seem to be a text object.   For a text page, we need to be able to position to a particular location and print a character in the correct font, then move to the location of the next character and print it.  Text streams start with "BT" and end with "ET".  In between these are commands to set the text state and output characters.

To set a font, we just say:

BT
/fn 1 Tf


where fn is a font name, as given in the /Resources object for this text page.  The number (1 in the above example) is the font size.  The examples all suggest that for a 10-point font, we would say:

/caslon 10 Tf

This is designed for scalable fonts -- where a given font can be made larger and smaller to get all the various sizes of the font.  But in our case, we have a separate font, with different character images, for each point size.  So we have two choices: (1) do not scale them -- use a "size" of 1 -- or (2) scale each use of the font the same.  In the latter case, every use of a 10-point font would be:

/caslon10 10 Tf

and all the images that are generated will be magnified by a factor of 10.  But since we are generating each glyph of each font as the set of bits that we want for that character in that size, we don't need any additional scaling and we go with option (1) -- we do not further scale them; all font scale factors are 1.

Once we start a text block (Begin Text -- BT), and have established the font, we then want to move to where the first character should be on the page.  You would think this is a simple common operation and is supported directly, but it turns out it is not.  The "normal" way to move in text mode is the Td operator  which takes two parameters, x and y:

x y Td

Normally, of course x and y are numbers.  For the line in our book representation:

c p04.gif 400 425 426 465 t

we want to move to location 400,425 and output the glyph for a "t".  We might try

400 425 Td
(t) Tj

where the Td moves to the right spot and Tj outputs (or "shows") the string whose contents are the character "t".  But remember, our internal units are 600 DPI, but PDF is 72 DPI, so we need to scale these units down to:

48 51 Td
(t) Tj

Where is location (48,51) on the page?  In our internal model, we put the origin of the page (0,0) at the upper left.  PDF puts the origin at the lower left.  Thus (48,51) is near the bottom left corner, not the upper left corner.  If the page is 470.40 units long, we need to say 470.40 - 51 = 419.40, and we need:

48 419.40 Td
(t) Tj

This is close, but when the character "t" is put on the page, the glyph is painted so that its origin is positioned at the current location.  For the "t", which sits on the baseline, the origin is at the bottom of the box, not the top, so we actually want 465, not 425, which is scaled to 55.8 and then adjusted to 470.40 - 55.8 = 414.6 to give:

48 414.6 Td
(t) Tj

This sometimes works, but generally does not.  The problem is that the Td command does not move to x,y, but rather moves to the start of a new line, offset from the start of the current line by x,y.  So x,y is not an absolute location but a relative amount, depending upon where the last line started.  Td then remembers this new location as the start of the current line, defining where the next Td command moves from.

The current state of the Text object is defined by two transformation matrices: the current Text matrix and the current Text Line matrix.  PDF, like PostScript, uses a matrix so that both translation and scaling (and rotation) can be represented by a matrix multiply.  Suppose we have the transformation matrix:

 [ a b 0 ]
 [ c d 0 ]
 [ e f 1 ]

and represent a location (x,y) as [ x y 1 ] (we add the one so that we can multiply the 1x3 location vector by the 3x3 transformation matrix).  Then we get a result location vector [ x' y' 1 ] where

x' = a*x + c*y + e;
y' = b*x + d*y + f;

Suppose b and c are zero, then we have

x' = a*x + e;
y' = d*y + f;

Here we have scaled (x,y) by a in the x-dimension and d in the y-dimension and then translated it by (e,f).  (If a,b and c,d are set to the right sin() and cos() values, we get rotation.)  So this one matrix can provide translation and scaling, as well as rotation.  For the Text matrix and Text Line matrix we always keep b and c set to zero.  The a and d values are used for scaling, and (e,f) define a location on the page.
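We can check this arithmetic with a few lines of code (a throwaway Python sketch, using PDF's six-number [ a b c d e f ] form of the matrix):

```python
def transform(point, m):
    """Apply a PDF-style transformation matrix [a b c d e f] to (x, y):
    the row vector [x y 1] times the 3x3 matrix described above."""
    x, y = point
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

# scale by (2, 3), then translate by (10, 20)
print(transform((5, 5), [2, 0, 0, 3, 10, 20]))   # (20, 35)
```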

With this background, the "tx ty Td" operation is defined in terms of the Text matrix (Tm) and Text Line matrix (Tlm) by

           [ 1  0  0 ]
Tm = Tlm = [ 0  1  0 ] x Tlm
           [ tx ty 1 ]

where the Tlm on the right-hand side is its previous value.
Initially the Text matrix and Text Line matrix are the identity matrix -- so the (e,f) position is (0,0), and there is no scaling (all scaling is 1.0).  So the first "x y Td" operation will move to (x,y), but a subsequent "p q Td" will move to (x+p,y+q).  We could keep track of this location, and when we want to move to (x,y), just compute the difference between that location and the previous location.

Or we can just set the (e,f) values in the matrix directly.  There is another PDF command that will set the Text matrix directly (Tm).  If we want to move to (x,y) (with no scaling), we can set the Tm matrix with:

1 0 0 1 x y Tm

and this moves us to (x,y).

So we have two ways to move to (x,y) -- just set the Text matrix with Tm, or remember where the previous line started and use Td to move relative  to that location.
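The two options can be sketched as (Python; the helper names are mine):

```python
def move_absolute(x, y):
    """Set the Text matrix directly: no scaling, origin at (x, y)."""
    return f"1 0 0 1 {x} {y} Tm"

def move_relative(x, y, line_start):
    """Move with Td, which is relative to the start of the previous line."""
    px, py = line_start
    return f"{x - px} {y - py} Td"
```

With Td we have to remember where the previous line started ourselves; with Tm we do not.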

Once we are positioned, we want to show a character.  Then we move to the next location and show another character, and so on.  But generally, the next character is just to the right of the current character.  The PDF model associates a "width" with each character, and if we show a character at (x,y) with width d, PDF automatically adjusts the current location to (x+d,y).  So if the next character goes at (x+d,y), we don't need to explicitly move between them; we can just show the next character.  So if our character positions match the character widths, we may be able to just do:

1 0 0 1 106.20 359.64 Tm
(The) Tj
1 0 0 1 1027.08 359.64 Tm
(Gates) Tj

and so on. Unfortunately, that seems to almost never work.

The problem is with the space between characters.  Our positioning of characters is defined by where they are in the book, and we create our glyph for each character from the scanned images in the book, so we really do not know how wide a character is.  We know how wide the bits for the glyph are, but there is bound to be some extra space after the character -- characters do not butt directly one upon the other.  If we look at the space after a given character, for example the letter "e" in the book, we get a distribution from 28 to 49 pixels, with a peak at 40.  So one option is to make the width of an "e" 40 pixels, but this will only handle some cases; in all the other cases we need to adjust the horizontal positioning after we show an "e" to be in the right place.  Sometimes we will need to move to the left; sometimes to the right.
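Picking the peak of such a distribution is easy to sketch (Python; modal_gap is my own name, and the sample gaps below are made up for illustration):

```python
from collections import Counter

def modal_gap(gaps):
    """Pick the most common inter-character gap (in 600-dpi pixels)."""
    return Counter(gaps).most_common(1)[0][0]
```

Feeding it every gap observed after an "e" would yield 40 for our book.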

PDF provides a convenient way to do this.  The TJ command takes an array of either numbers or strings.  Working from the left to the right, if the operand is a string, it is shown. If the operand is a number, it is subtracted from the current x-position.  Thus, the previous example could be written as:

1 0 0 1 106.20 359.64 Tm
[ (The) x (Gates) ] TJ

for some value of x.  Our presentation of TJ says that x would be the size of the space between the "e" and the "G".  From our book description file that would be about 22 pixels.  But remember that 22 pixels is at 600 DPI; it's only 2.64 in the PDF 72 DPI units.  But the space in the TJ is not in the default text units; it is in units which are 1/1000 of a standard text unit.  So not 1/72 inch, but 1/72,000 inch.  There is no explanation why the scale of the units changes, but the normal scaling for defining glyphs for a font is also by 1000.  That means we need something like:

1 0 0 1 106.20 359.64 Tm
[ (The) -2640 (Gates) ] TJ

The spacing is negative because the value is subtracted from the current position (not added), so positive values move to the left; negative values move to the right.
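The conversion from a 600-dpi gap to a TJ operand is then a fixed factor of 72/600 x 1000 = 120.  A sketch (Python; helper names are mine -- and note the PDF standard actually defines TJ operands as thousandths of a text-space unit, further scaled by the font size; we follow the simpler 72-DPI model used above):

```python
def tj_adjust(gap_px):
    """600-dpi pixels -> thousandths of a 1/72-inch text unit.
    TJ subtracts its numbers from x, so negate to move right."""
    return -round(gap_px * (72 / 600) * 1000)

def tj_line(parts):
    """parts alternates strings and 600-dpi gaps, e.g. ["The", 22, "Gates"]."""
    out = [f"({p})" if isinstance(p, str) else str(tj_adjust(p)) for p in parts]
    return "[ " + " ".join(out) + " ] TJ"
```

tj_line(["The", 22, "Gates"]) produces exactly the example above.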

This gives us enough to create the PDF file for our text.  We can select our fonts (Tf), move to a position on the page (Tm or Td), and show our text, character by character, adjusting our position as we go (TJ).  We can code it so that we invoke TJ for each word (stopping a TJ when we get a "space" in our book representation), or for each line (stopping a TJ only when we get a new-line).

If we want to try this out, we can define our fonts to be some of the standard fonts (pre-defined Type1 fonts) instead of our own fonts, but it looks terrible since the sizes and widths are all off.  To get a good view, we really need to define our own fonts.  Which is our next blog entry.



Monday, April 13, 2015

PDF explained (part 1): Overall Structure

I finally have my code generating PDF and wanted to take the opportunity while this is still fresh in my mind to write it down.  Maybe it will help someone else generate PDF directly.

To refresh, this is a project to re-publish an existing book from 1953.  The idea is to scan the book, at reasonably high resolution (600 dpi), and then extract from the scan the font that was used to print it, and then re-generate the book from that font.  Everything is raster or pixel based since it is all from the scan; there are no vector fonts or anything involved.

After the scan and processing, I have a data file that describes, in 600 dpi units, exactly where each character is to be put on each page.  The format is derived from the "box" format of the OCR program that I used (tesseract).  The book pages were scanned into separate files in a TIFF format, then converted to GIF and OCR'ed to get the text and locations, then processed into a format like:

c p17.gif 629 1260 666 1295 n
c p17.gif 676 1234 713 1295 d
s p17.gif 714 1233 741 1234 space
c p17.gif 742 1236 767 1295 1 italic
c p17.gif 780 1236 816 1295 6 italic
s p17.gif 817 1235 844 1236 space
c p17.gif 845 1260 880 1312 p
c p17.gif 889 1260 925 1295 o
c p17.gif 931 1235 946 1295 i
n f02.gif 2174 2701 2174 2745 new-line
i f02.gif 409 3002 2402 3432 figures/f02b.gif

The first character on each line identifies the type of line.  The most common is the "c" or "character" line.  The type character is followed by the name of the page image that has the character, followed by a bounding box which contains the character -- start x, start y, end x, end y.  These are in 600 dpi units with the upper left corner of the page being (0,0).  Page sizes varied slightly, but most were 2800 x 3920 (which at 600 dpi would be 4.66 inch by 6.53 inch).  Actually, I think this was after I added a 400 pixel margin on all four edges, so the actual non-blank text would be in a 2000 x 3120 pixel box.

After the page and bounding box, we have the actual character which is to be found there.  Mostly this is a single letter representing the character, but sometimes we have a "name" -- for example "odq" and "cdq" for open and closed double quotes, and "ff", "fi", "fl", "ffi", and "ffl" for ligatures.  There also may be font information -- point size and font face -- if need be.  The sizes are sort of arbitrary -- we don't know what the original font was -- but the book says it was "set in 10-point Caslon Old Style", so the default is set as 10 points, and we defined new sizes for larger (12, 28) and smaller (6, 8) characters.  And we can see that some characters are italic.  In theory, we could have bold, or serif and sans-serif fonts, and other sizes.  But most of the text is just plain old 10-point regular font; out of 42,660 characters, 40,254 were 10-point regular.
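A parser for these lines can be sketched as follows (Python; the dictionary field names are my own invention, not part of the format):

```python
def parse_box_line(line):
    """Parse one line of the book description ("box") file.

    Types: 'c' character, 's' space, 'n' new-line, 'i' image.
    """
    f = line.split()
    return {
        "type": f[0],                          # line type
        "page": f[1],                          # page image file name
        "bbox": tuple(int(v) for v in f[2:6]), # start x, start y, end x, end y
        "what": f[6],                          # character, name, or image file
        "attrs": f[7:],                        # e.g. ["italic"] or a point size
    }
```
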

We wrote a program to read the original GIF files and the list of bounding boxes, and from that to generate little images of each of the characters used, separated by size and font face.  These were written out in BDF format as a font description.  For example, here is a generated BDF representation of a 6-point comma:

STARTCHAR ,
ENCODING 44
SWIDTH 138 166
DWIDTH 10 0
BBX 10 12 0 -5
BITMAP
1800
7E00
7F00
FF80
FF80
7FC0
7FC0
3FC0
0F80
0780
0780
0300
ENDCHAR

I could then use an old font editor I wrote to modify the characters slightly, to clean them up.  Once the fonts were cleaned up, I could use the same list of characters to generate a new output page with the cleaned-up idealized characters placed where they were supposed to be put.  The same file could be used to find the characters and then to replace the original characters by a new image.  All the uses of "T" in the book are now represented by an idealized version of a perfect "T" instead of slightly different versions of the same basic image.

I could compute on the box definitions, and adjust the positions slightly, to, for example, make the baseline of each line of text perfectly horizontal, and to smooth out the horizontal spacing between characters.  To do this, it helps to know where the lines of text are, and where the spaces between words are.  So two additional lines start not with "c" but with "s", for the spaces between words, and "n" for the new-lines at the end of a line of text.  For spaces, the bounding box is the size of the space between words (really only the horizontal positions are relevant), and for a new-line, the bounding box is the position just after the last character (only the horizontal position really matters).

Lines with "s" and "n" generate no output on the page, but are useful hints when we adjust character positions.  For example, we want (a) all characters to sit on a horizontal baseline, (b) the spaces between words in a line to be roughly the same size, and (c) the space between characters in a word to be roughly the same.  We thought about adding (d) the space between lines should be roughly the same and (e) the space between characters in a word should be roughly the same for all characters in all words in the same line, but decided that was probably too much.

One last type of line is the "i" line which defines an "image" to be put on the page.  An "i" line defines the bounding box on the page where the image should be, and the name of a image file (a GIF file for our work) to hold the image.  During the first phase of processing, the image is extracted from the page image and written to the file; during the second phase, we want to use the file image to create the image on the paper -- or at least in the image file that represents the piece of paper.

So we have a program which can read a file describing a page, plus a set of fonts, and will generate an image file with each character and image put in the right place.  Thus we can generate a new image of each scanned page where every character and image is "cleaned-up" and there are no extraneous bits of dust or marks on the page image.  This page image can then be printed, or whatever, as an image.  Although it is constructed of glyphs representing specific characters and copies of bit-map images, the output is just an image file -- a sequence of black and white dots (possibly grey scale or even color bits in the images) with no concept of why a particular bit is white or black.

We can then create a PDF file, using traditional programs like Word or LibreOffice, which is just a sequence of pages, each page being a full-page GIF image.  This was our first take at generating a PDF file of our newly re-formatted book.  But a PDF file has an ability to have fonts and then put characters from that font (and images) on a page of paper; that is what PDF is for.  So we should be able to translate directly from our representation into a PDF file without having to go to a full-page image.

PDF was defined by Adobe Systems as a follow-on to PostScript, but was then offered up as an open standard. It was picked up by ISO (International Organization for Standardization) and made into an open standard, ISO 32000-1:2008 Document management — Portable document format.  The first thing I did was read that standard.  It defines things, but really does not show how they are used, particularly not the parts that I figured I needed: a different unit of measurement (600 dpi), representing my own bit-mapped glyphs for fonts, and images.

I also read Developing with PDF by Leonard Rosenthol (O'Reilly Media) from our local City Library.  That helped on some points, particularly hinting at how to represent images in Chapter 3, but leaving out the critical details (like how to represent the actual image -- it just says "image data goes here", even for small examples).  Much of the book seems to be just copied from the ISO standard (although he does at least correct his Example 4-7, which makes much more sense than the same Figure 46 in the ISO standard).

A more useful book was PDF Explained by John Whitington (O'Reilly Media) which I found on-line at http://issuu.com/wangzuo/docs/pdf_explained_2011-1 although the ability to read it was limited by having to go thru it one page at a time from the front.  Chapter 6 on Text and Fonts (pages 71 to 88) does not really say anything more than Rosenthol did; maybe it just seemed to make more sense because it was the second or third time I had read it.  But it did give me another reference to Fonts and Encodings by Yannis Haralambous (O'Reilly Media) which is a truly fascinating combination of theory and practice, but, while interesting, we are beginning to get a bit far afield.

It was pretty difficult, if not impossible, to find the information I wanted by just searching with Google, or at least it seemed that way to me.  If I mentioned that I wanted to know about PDF, I started to get PDF documents, not information about PDF.  So mostly, I muddled thru, trying this and that until I found the (apparent) key to generating my own PDF.   PDF is a strange mixture of general things (objects) and specific things that are apparently there to support someone's specific idea of how documents will be structured (and ours is not structured like that).

A PDF file is a collection of objects.  Each object is mainly a list of key/value pairs, with occasional data.  An object is defined by the keyword "obj"; the definition ends with the keyword "endobj".  For example:

548 0 obj % page object for cover.gif
<<
 /Type /Page
 /Parent 545 0 R
 /Resources 544 0 R
 /MediaBox [ 0 0 765.36 454.56 ]
 /Contents 546 0 R
>>
endobj

is the definition of an object (object 548) which has a number of properties given by the key/value pairs.  Some of the values are constants (like /Page), others are references to other objects (545 0 R, 544 0 R, and 546 0 R), and one is an array of numeric values (the /MediaBox).

The overall PDF file structure has a header, a body, a cross-reference table, and a trailer.

The file, in general, is all in ASCII, with the exception of some binary data streams (we will talk about that later).

The header is one line

%PDF-1.6

which defines the file as a PDF, and gives the version of the language that is used.  PDF files can be version 1.0, 1.1, ... 1.7.  Different features of the language have been introduced in different versions, and you need to specify the largest version for any features you use.  For example, we tried to use "UserUnit" which was introduced with PDF 1.6; Resources come from PDF 1.2.  I think these are all upward-compatible.

Notice that the "%" character is also the "comment" character for PDF -- everything after a "%" until the end of the line is a comment, and ignored.  So in some sense, this first line is a comment.

The Standard (and other sources) say that if the PDF has binary data, then by convention, the second line of a PDF file is a comment line with at least one binary character.  The idea is that if a program tries to determine if a file is ASCII or binary, it will look at a small number of its initial characters.  If they are all just ASCII, it will assume the rest of the file is ASCII; if it sees any non-ASCII (binary) characters, it will assume the file is binary.  So this second line is for those file classification programs, not for PDF itself -- it's a comment started by a "%", so the line is ignored.  We decided to make the entire file ASCII, so we don't bother with this second line.

The trailer is used for two things:

1. To find the cross reference table, and
2. To find the "Root" object.

The cross reference table says, for each object, where in the file that object is, allowing direct access to a particular object.  The Root object is the one that defines the content of the PDF file.  The trailer defines what object it is, and then the cross reference table tells you where in the file it is.

The trailer uses the same syntactic structure as the rest of a PDF file: a list of key/value pairs:

trailer
  <<
      key-1  value-1
      key-2  value-2
           ...
      key-n  value-n
  >>
startxref
Byte_offset_of_cross-reference_table
%%EOF

We only need two keys in the trailer section:  the Root and the total number of objects (Size).

The "Byte_offset_of_cross-reference_table" is just what it says -- the byte offset, from the beginning of the file, of the start of the cross-reference table.  So if the entire PDF file is 3,382,705 bytes long, and the cross-reference table starts at byte 3,372,180  then the end of the PDF file is:

startxref
3372180
%%EOF

This allows a program that wants to read a PDF file to first read the tail end of it, back up over the "%%EOF" and the numeric digits before it to the "startxref" keyword and then compute where the cross reference table is in the PDF file.
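That reader-side search can be sketched like this (Python; find_xref_offset is my own name):

```python
def find_xref_offset(pdf_bytes):
    """Locate the cross-reference table the way a PDF reader would:
    scan backward from the end of the file for the "startxref" keyword."""
    tail = pdf_bytes[-1024:]
    i = tail.rfind(b"startxref")
    # the byte offset sits on the line after the keyword
    return int(tail[i + len(b"startxref"):].split()[0].decode())
```
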

The trailer itself always has at least two entries:

Size -- which is the number of objects in the file, and

Root -- which defines the root object.

So, for example, we could have a PDF file that ends with:

trailer
<<
 /Size 550
 /Root 549 0 R
>>
startxref
3372180
%%EOF

which says there are 550 cross-reference entries altogether (object numbers 0 through 549, with object 0 reserved), and the Root object is number 549.

A couple of comments about syntax.  Blank spaces do not matter.  Numbers are represented as numbers, in decimal, in ASCII (with a possible plus or minus sign); numbers may have fractional parts (again in decimal). 

Boolean values are the keywords true and false, written in lower case.

The "<<" and ">>" tokens define a "dictionary" object.  Dictionary objects are a list of key/value pairs.

Strings are surrounded by "(" and ")", so (foo) is a string consisting of the three characters "f", "o", and "o".  In a string, if you want a right parenthesis, you need to escape it with a back-slash (\). If you want a back-slash, you have to escape it with a back-slash (\\).  You can also represent some white-space characters using a back-slash, as you would expect from C: \n, \r, \t, ...  You can also represent any character in a string with \ooo, where ooo is a 3-digit octal ASCII character code.  So a right parenthesis is both \) and \051.

Hexadecimal strings are an arbitrarily long sequence of hexadecimal digits (using either A-F or a-f or a mixture), surrounded by "<" and ">", so something like <4E6F20>.  I don't think we use any of these.

Arrays are delimited by "[" and "]", and are just a sequence of things, not necessarily all the same.

Symbols can be keywords, or constants (true and false, for example), or occasionally names.  To differentiate a "thing" from a "name", names are prefixed with a slash, like /Size.

So the characters "<", ">", "(", ")", "[", "]" have special value for hexadecimal strings, dictionaries, strings, and arrays.  Plus "%" starts a comment.  And "/" defines a name.  So if we need to use these characters in a name, as characters and not as syntactic tokens, we need a way to quote them.  This is done by representing the character by a hash-sign "#" followed by the two-digit hexadecimal ASCII code for that character.  So if you want a "%", you represent it as #25.  A right parenthesis is #29.

PDF then has these two different quoting mechanisms -- one for strings and a different one for names.
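Both quoting mechanisms are easy to sketch (Python; the helper names are mine, and the exact set of characters escaped in names is my assumption, based on the delimiters listed above):

```python
# Characters that cannot appear literally in a /Name (assumed set).
DELIMS = set("()<>[]{}/%#")

def escape_string(s):
    """Quote for a (...) literal string: backslash-escape \\, ( and )."""
    return "(" + s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)") + ")"

def escape_name(s):
    """Quote for a /Name: delimiters and whitespace become #hh (two hex digits)."""
    return "/" + "".join(f"#{ord(ch):02X}" if ch in DELIMS or ch <= " " else ch
                         for ch in s)
```

So escape_name("50%") gives /50#25, and escape_string quotes a right parenthesis as \).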


The object cross-reference table is a list of n entries, where n is given by the /Size value in the trailer dictionary.  The cross-reference table begins with the keyword "xref".  This is then followed by the number of entries in the cross-reference table (which should be exactly equal to the /Size in the trailer).  This is given as a pair of numbers: the first object number in the range and the count of entries.  For us the range always starts at zero (0), and the count is the number of objects plus one, since the table includes a reserved entry for object 0; real objects start at 1 and go up from there.

After "xref" and "0 n", we have a list of n lines, each describing the location of one object in the PDF file.   Each line has exactly the same format:

nnnnnnnnnn ggggg n

where nnnnnnnnnn is a 10-digit decimal byte offset to the i-th object, ggggg is the 5-digit generation number (always zero for us), and n is the literal character "n" (for "in-use").  The standard requires each entry to be exactly 20 bytes long, counting the end-of-line.

The one exception to this is the first entry in the object table, which is for object 0, which is not used.  So its entry in the cross-reference table is

0000000000 65535 f

The "f" means "free".

So a small PDF file with 4 objects might have a cross-reference table and trailer that look like:

xref
0 5
0000000000 65535 f
0000000009 00000 n
0000000088 00000 n
0000000114 00000 n
0000000525 00000 n
trailer
<<
 /Size 5
 /Root 1 0 R
>>
startxref
680
%%EOF

Generation numbers have to do with being able to modify PDF files after they are written; in our work they will always be zero since we are not modifying existing PDF files, but writing new ones.
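Generating the table is straightforward once we have recorded the byte offset of every object (a Python sketch, names mine; the 20-byte padding -- a trailing space before the newline -- follows the standard's entry format):

```python
def xref_table(offsets):
    """Build the cross-reference table.  offsets[i] is the byte offset of
    object i+1; object 0 gets the reserved free entry.  Each entry is
    padded to 20 bytes (19 characters plus a newline)."""
    lines = ["xref", f"0 {len(offsets) + 1}", "0000000000 65535 f "]
    for off in offsets:
        lines.append(f"{off:010d} 00000 n ")
    return "\n".join(lines) + "\n"
```
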

Notice how the /Root, in the example above, is "1 0 R".  That is a reference to object 1 (generation 0).  This is called an "indirect" object.  The idea is that, in general, when you are building an object and you need another object, you can either put its value directly in place, or you can make an indirect reference to it.  (The /Root is always an indirect reference; some other things have to be indirect references.)  An indirect reference is an object number followed by a generation number (0 as we've said) followed by R.

Objects, as we showed earlier, are defined by the keyword "obj", and then a dictionary of key/value pairs, followed by "endobj". Almost all objects have a /Type, and the content of the object depends on its type. We will go over all of those that we care about, but here is a really simple case.

The content objects that define what goes on a page are of the form:
k 0 obj
<<
     /Length n
>>
stream
...
endstream
endobj

where the contents of the stream (between the keywords "stream" and "endstream") define what goes on the page.  The only key/value pair (for our purposes) is the /Length of the stream -- the number of bytes that it takes up.  My image of a PDF viewer is that it finds a content object, finds the /Length and the beginning of the stream and then reads and processes exactly /Length bytes and then expects to see the keyword "endstream".  So we need to know the length of the stream before we put it into the PDF file.

Our basic approach is to just generate the content stream as we read thru our character description input file, moving to the right spot on the page, outputting characters and images, until we are done with this page.  So we don't know, in advance, how long the content stream will be.  One option would be to run thru our code twice, once to compute the length, and then once to compute the output.  Another would be to buffer the content stream in memory (or a file) until it is complete, then compute the length, and copy the buffered contents into the PDF file.

Both of these are possible but tedious or error prone coding.  But indirect objects give us another possibility.  We generate the object header, but instead of giving a value to /Length, we give it an indirect reference to the next object:

546 0 obj
<<
/Length 547 0 R
>>
stream
...
endstream
endobj

As we generate the stream, we remember the byte offset in the file (ftell() will give us that) of the first byte of the stream, and then after we generate the last byte, before we write out the "endstream", we get the byte offset of the last byte of the stream, and compute the length.  Then we finish the definition of that object and add another one right after it which gives the length:

547 0 obj
15998
endobj


This increases the number of objects, but simplifies our programming.  Now when the PDF viewer looks at our file, and goes to object 546, it sees that the /Length of the stream is an indirect reference to object 547, goes to the cross-reference table to find where object 547 is (a simple indexing operation into the table), goes to that part of the PDF file, gets the value (15,998), and pops back up to object 546 with the correct /Length.
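Here is how that deferred-length trick might look (a Python sketch using an in-memory buffer; tell() plays the role of ftell(), and the helper name is mine):

```python
import io

def write_content(out, obj_num, payload):
    """Write a content stream whose /Length is an indirect reference to
    the next object, whose value is filled in after the stream is written."""
    out.write(f"{obj_num} 0 obj\n<<\n/Length {obj_num + 1} 0 R\n>>\nstream\n".encode())
    start = out.tell()                      # ftell() in C
    out.write(payload)
    length = out.tell() - start             # now we know the length
    out.write(b"\nendstream\nendobj\n")
    out.write(f"{obj_num + 1} 0 obj\n{length}\nendobj\n".encode())
    return length
```
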

This Posting is getting long.  Let's stop at this point and we will cover the content of our objects in the next posting.

Friday, April 3, 2015

The physical size of the book

The original book was 5.125 x 6.3125 inches.  (This is a width/height ratio of .81.)  The inner margin was 1.18 (1 3/16) inch; the outer margin was .625 (5/8) inch.  The top margin was also 5/8 inch (.625) and the bottom margin was 3/8 inch (.375).  This leaves a text area with a line length of 3 5/16 inch and a height of 5 5/16 inch.

The closest "standard book size" is then 5.25 x 8.  Notice that since the width is about right, the height of the actual text is about the same as the original book (5.3125 inches of text, plus margins to make the 6.3125).  But we have an 8 inch page, so this leaves about 2 inches of unused vertical space.   I put this at the bottom -- I could put it at top and bottom equally, but, in my opinion, it's just too much unused space, no matter where it is.

The basic problem is that the shape of a 5.25/8.00 book is not the same as the shape of a 5.125/6.3125 book -- it's too long.  A 5.125/6.3125 book is .81/1 ratio, while a 5.25/8.00 book is .65/1.

Looking at standard book sizes (at least those that CreateSpace, Amazon's Print-on-Demand, offers), the closest to a .81 ratio is a 7.5 x 9.25 book (8x10 or 8.5x11 would also work, but those are even larger).  This is 1.46 times the size of the original book -- so almost 50% larger.  But things fit nicely.

We have the same number of pages in both cases -- 52 pages.  It's just that in one case (book3.pdf) the page is smaller, the text is smaller, and there is 2 inches of blank space at the bottom  of the page, and with the other, the page is larger, the text is larger, and there is no extra space.  The smaller book is "closer" to the size of the original; the larger is the same shape as the original, everything is just larger.

Or we can try to make it a non-standard size.  It looks like the primary result of doing that is to limit where it can be sold: '"Enter my own size" trim sizes can be sold on Amazon.com and your eStore, but are ineligible for the bookstores and Online Retailers channel within Expanded Distribution.'

But there also is a comment that "Books with cream paper must be one of  the following trim sizes: 5" x 8", 5.25" x 8", 5.5" x 8.5", or 6" x 9" in order to enroll in Expanded Distribution.", so we may be limited already by the use of cream paper, so maybe the loss of "Expanded Distribution" is just going to happen, in which case we could just make our book the same size as the original.

So three options:

1. 5.25 x 8 book -- with two inches of blank space on each page.

2. 7.5 x 9.25 book -- same shape, but 46% larger than the original.

3. 5.125 x 6.3125 book -- original size, custom size. Limited distribution.

(but with "cream paper", we seem to have limited distribution anyway).