Republishing Old Books: PDF explained (part 3): Fonts and Images

In our last post, we showed how to print the characters (or how to "show" them), but we haven't gone into how the fonts are defined.

The easy thing to do is to use one of the predefined fonts that are supposed to be available with PDF, but we want to define our own. We have raster images for each character from the scan of the book that we did. Our book description file (which is mainly provided by the OCR program tesseract) gives the bounding box around the bits for each character in our scanned images. If we take all the copies of a particular character, we can create a sort of average, or idealized, version of that character, partly mechanically, partly by using a font editor to clean things up, and then we can use this idealized version to replace all the individual characters with a perfect version.

So how do we get PDF to use our font, not the predefined ones? The text content streams use font names. The font names are defined in the /Resources object to be specific objects. So we need to define a font object.

A font object is, like all other objects in PDF, basically a collection of key/value pairs. It just has a bit more that we need to define than our other objects.

394 0 obj % font object for caslon10
<<
/Type /Font
/Subtype /Type3
/FontMatrix [ .001 0 0 .001 0 0 ]
/FontBBox [ 0 0 0 0]
/Resources 24 0 R
/Encoding 393 0 R
/CharProcs 392 0 R
/FirstChar 33
/LastChar 135
...

The /Type of the font object is /Font. Fonts can be several different types: Type 0, Type 1, TrueType and Type 3. Most of these are vector fonts. We need to define a raster font, which is not scalable, and so PDF does not really want to encourage it. But a Type 3 font "defines glyphs with streams of PDF graphics operators", which means that we should be able to define our own glyphs as raster bit images, so we will define Type 3 fonts.

The /FontMatrix defines the transformation matrix for when we give the graphics operators for our glyphs. Since we really still want to work in 600 DPI units, it is tempting to define the scaling for this matrix to put it in that unit, but remember that the TJ operator used 1/1000 units for the adjustments to character positions, and we suspect that that 1/1000 is related to the "standard" /FontMatrix definition for Type3 fonts, and so we will stick with this standard definition.

The /FontBBox is the font bounding box -- the box that is big enough to include all our glyphs. We could go thru all the glyphs and compute it, but if we set it to all zeros, the PDF reader will make no assumptions based on it, so why should we bother to compute anything? If we give a nonzero box and get it wrong, the ISO Standard says "incorrect behavior may result".

We originally did not specify a /Resources value. The ISO Standard says it is optional, and if absent will use the resource dictionary for the page that the font is used on. But experience shows this is not true. If we leave out an explicit /Resources value, when we make a reference to an image or anything, our viewer (evince) complains the name is undefined. So to placate it, we include an explicit /Resources. But this /Resources object is specific to the fonts -- it gives names to all the images that we need for the glyphs.

The next two objects, /Encoding and /CharProcs, are the heart of the font. The /Encoding object is an object of /Type /Encoding which defines a /Differences array giving the differences from a standard character coding. We find it easiest to just define the entire coding. This consists of an array of pairs -- the first item of each pair is the numeric character code and the second is the name of the corresponding glyph. So this is basically a definition of ASCII, for the printable characters, possibly with extensions. We only need to give the encodings for the characters we are going to use for this font. Thus, for Caslon28, the 28-point size used for the title page, we have only a few characters, and can define an /Encoding object as:

60 0 obj % font encoding for caslon28
<<
/Type /Encoding
/Differences [
46 /.
65 /A
68 /D
69 /E
71 /G
72 /H
73 /I
75 /K
76 /L
77 /M
78 /N
79 /O
80 /P
82 /R
83 /S
84 /T
87 /W
]
>>
endobj

When we try to show a character in caslon28 (the current font), the PDF engine will take the character code that is given and look it up in the /Encoding object for this font. The selected entry gives the name of the glyph that we want to show. The name of the glyph is then looked up in the /CharProcs object, which is a dictionary giving the object number for each glyph name. So the /CharProcs object is just a list of each glyph name and the object which defines how it can be drawn.

59 0 obj % mapping a character for caslon28 to a draw function
<<
/. 26 0 R
/A 28 0 R
/D 30 0 R
/E 32 0 R
/G 34 0 R
/H 36 0 R
/I 38 0 R
/K 40 0 R
/L 42 0 R
/M 44 0 R
/N 46 0 R
/O 48 0 R
/P 50 0 R
/R 52 0 R
/S 54 0 R
/T 56 0 R
/W 58 0 R
>>
endobj

Each glyph (name) defines an object which is the procedure to create the selected glyph image.
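
The two-step lookup can be mimicked with two small tables. This is only a sketch of the idea in Python, using a few of the caslon28 entries shown above; the names ENCODING, CHAR_PROCS and resolve_glyph are made up for illustration and are not part of PDF.

```python
# Sketch of the code -> glyph name -> glyph procedure lookup.
# The tables copy a few entries from the caslon28 objects above.
ENCODING = {46: ".", 65: "A", 68: "D", 87: "W"}      # from /Differences
CHAR_PROCS = {".": 26, "A": 28, "D": 30, "W": 58}    # glyph name -> object number


def resolve_glyph(code):
    """Return (glyph name, object number) for a character code."""
    name = ENCODING[code]           # step 1: /Encoding gives the glyph name
    return name, CHAR_PROCS[name]   # step 2: /CharProcs gives the object


print(resolve_glyph(ord("W")))
```

A code that is missing from the first table simply has no glyph in this font, which is why we only list the characters we actually use.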

Note that we can extend these beyond the standard 128 ASCII characters. In the book, we have two extensions: open and close double quotes, and ligatures. We decided to encode these extensions as the character codes 128, 129, ... 135. When we want to output one of these characters, we find its code and write that code, as an octal escape, in the Text object. For example,

[(\201) -480 (free) -1080 (\202)] TJ

which shows the open-double-quote (\201 or 129) and the close-double-quote (\202 or 130). This code (say \201) is then looked up in the /Encoding object to get the name of the glyph

...
122 /z
129 /odq
130 /cdq
...

And the name of the glyph is looked up in the /CharProcs object to get the object that draws that character

...
/z 377 0 R
/odq 379 0 R
/cdq 381 0 R
...
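
Codes above 127 cannot be typed directly, so they go into the PDF string as three-digit octal escapes. A small Python sketch of that step might look like the following; the helper name pdf_string is ours and is only for illustration.

```python
def pdf_string(codes):
    """Write a list of character codes as a PDF literal string."""
    out = []
    for code in codes:
        ch = chr(code)
        if ch in "()\\":
            out.append("\\" + ch)          # these three must be escaped
        elif 32 <= code < 127:
            out.append(ch)                 # printable ASCII goes in as-is
        else:
            out.append("\\%03o" % code)    # anything else as 3-digit octal
    return "(" + "".join(out) + ")"


print(pdf_string([129]))                          # open double quote
print(pdf_string([ord(c) for c in "free"]))
```

Code 129 comes out as \201 and 130 as \202, the two escapes used in the TJ example.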

The last form of information in the /Font object is the character width information. It seems to me that this could easily be done like the /CharProcs information, or the /Encoding information, having an object that maps the glyph or encoding to how wide the character is -- how much to move in the horizontal dimension after displaying the glyph. But PDF does it differently. It has a keyword /Widths whose value is an array of character widths. The widths should be given in units which are 1/1000 of a standard unit -- like the widths in the TJ command and the default FontMatrix. The ISO Standard says it is in the units defined by the FontMatrix, but we will just stick with the default.

However, notice that the /Encoding list does not necessarily cover the entire ASCII code range (0 to 127) -- it may start later (no use of the control codes or for the blank code, 32) and may go longer (past 127 up to 255). One approach would be to make the array always 256 long, but instead PDF adds two other key/value pairs to the /Font object: /FirstChar and /LastChar which give the encoding values for the smallest and largest character codes in the /Encoding object, and the /Widths array goes from /FirstChar to /LastChar.

For example, for our Caslon 28-point font, with only a few characters, we can define

/FirstChar 46
/LastChar 87
/Widths [ 4560 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 23280 0 0 24000 20280 0 25560 25560 11040 0 21360 20160 28920 25920 24120 19080 0 21600 16080 24240 0 0 31560 ]

since we only have characters from period (encoding 46) to 'W' (encoding 87). The units here are 1/1000 of a 72 DPI unit, so the numbers are 1000 times the width in 72 DPI units. So the period in our 28-point Caslon is 4.560/72 inches wide, about .06 inches.
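
Filling in the zeros by hand is error-prone, so the array is worth generating. Here is a minimal Python sketch, assuming the widths are kept in a table indexed by character code; the function name widths_entries and the table CASLON28 are ours, with the numbers copied from the array above.

```python
def widths_entries(widths_by_code):
    """Build /FirstChar, /LastChar and /Widths from a {code: width} map.

    Codes inside the range that have no glyph get a width of 0.
    """
    first = min(widths_by_code)
    last = max(widths_by_code)
    widths = [widths_by_code.get(code, 0) for code in range(first, last + 1)]
    return first, last, widths


# Widths for the caslon28 characters, copied from the array above.
CASLON28 = {
    46: 4560, 65: 23280, 68: 24000, 69: 20280, 71: 25560, 72: 25560,
    73: 11040, 75: 21360, 76: 20160, 77: 28920, 78: 25920, 79: 24120,
    80: 19080, 82: 21600, 83: 16080, 84: 24240, 87: 31560,
}

first, last, widths = widths_entries(CASLON28)
print("/FirstChar", first)
print("/LastChar", last)
print("/Widths [", " ".join(str(w) for w in widths), "]")
```

The array always has LastChar - FirstChar + 1 entries, 42 in this case.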

This leaves us with the definition of the /CharProcs objects, one for each glyph; then we have a complete font definition.

The object that draws a glyph can use any PDF graphic command. The definition of the Type 3 fonts says that the first command has to be either d0 or d1. The d0 command tells the PDF engine that the glyph drawing code may set both shape and color; the d1 says we will only set shape -- the color is set by the environment calling the glyph drawing code. In our case, we only want ordinary text characters, so we do not need color, so we use d1. The d1 command wants to know both the width and bounding box of the glyph. The width is redundant -- it's already in the /Widths array -- so I don't know why we need to repeat it. In fact the definition of the d1 command says it has to be "consistent" with the /Widths array, but we can easily generate that. We can also look at the bit-map that we have for this glyph and compute its bounding box.

The ISO standard only gives one example of a Type 3 font, and it shows two glyphs -- one a triangle and one a rectangle. We want a more complex display for each glyph, but what we have described so far is really a lot of structure -- the overall structure of the PDF, the page tree, the /Resources object, the /Font objects, and the text streams, and the multiple different types of scaling we are doing. Let's check first that all of this is in place and working before we try to draw the actual raster images. We can mimic the drawing of the rectangle that is given in the ISO standard, but make the size of the rectangle equal to the size of the bounding box for each character. That will show placement and spacing for where each character goes.

If we do this, we would define an object to draw a 28 point W, for example, as:

58 0 obj % a function to draw W 28pt
<< /Length 50 >>
stream
31560 0 0 0 30120 22320 d1
0 0 30120 22320 re f
endstream
endobj

Notice that we need to define the /Length of each function. We could use the indirect object technique that we showed earlier, but for this code it seems easiest to just put the function, which is small, in a buffer and then compute the length of the buffer and dump the length and the buffer at the same time. Looking up what these commands do, the "re" command defines a rectangle by its lower-left corner, its width and its height, and the "f" command fills that rectangle (with the current fill color, black in our case).
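
The buffer-then-length idea can be sketched in a few lines of Python. The function names here are ours. One assumption needs to be stated loudly: the sketch ends every line with CR LF, because that is the line ending that makes the byte count for the 28-point W come out to the /Length 50 shown above.

```python
def stream_object(obj_num, comment, lines):
    """Return a complete PDF stream object as bytes.

    Assumption: every line ends with CR LF. That is what makes the byte
    count for the 28-point W come out to the /Length 50 shown above.
    """
    body = "".join(line + "\r\n" for line in lines).encode("ascii")
    head = "%d 0 obj %% %s\r\n<< /Length %d >>\r\nstream\r\n" % (
        obj_num, comment, len(body))
    return head.encode("ascii") + body + b"endstream\r\nendobj\r\n"


def rectangle_glyph(advance, width, height):
    """Glyph procedure that just fills the glyph's bounding box."""
    return [
        "%d 0 0 0 %d %d d1" % (advance, width, height),
        "0 0 %d %d re f" % (width, height),
    ]


w_proc = rectangle_glyph(31560, 30120, 22320)
print(stream_object(58, "a function to draw W 28pt", w_proc).decode("ascii"))
```

Because the body is built first, the length is known before the dictionary is written and no second pass over the file is needed.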

It took a day of trial and error to get our cross-reference table correct, to define the right scaling, and so on, but eventually the approach we have just outlined produced a PDF file that generated no errors when viewed and showed a filled black rectangle where each character belongs.

This suggests we are close to defining our complete PDF file. The problem is how to show the actual bit-map raster images for each character.

The book by Rosenthol, Developing with PDF, comes in handy at this point. He has a chapter on "Images", and starts off with "Raster Images". This gives a step-by-step description of how to create a bit-map image in PDF. In addition to needing to do this for fonts, we also have to be able to handle the "image" lines in our book description file, so this seems like it will handle both of our remaining problems.

To define an image, we have two parts: (1) the definition of the image and (2) how to show the image. PDF has an object type that supports raster images, so we can just define it:

21 0 obj % image object for figures/cover1.gif
<<
/Type /XObject
/Subtype /Image
/ColorSpace /DeviceGray
/Width 431
/Height 397
/BitsPerComponent 8
/Filter /ASCIIHexDecode
/Length 342611
>>
stream
FFFFFFFFF ... FFFFFFFFFFFFFFFF>
endstream
endobj

The image has a /Width and /Height -- only rectangular images are supported. We have to define the color space (in our case everything is black and white, but we could support color), so we use DeviceGray which gives us values from 0.0 (black) to 1.0 (white) for our colors. The /BitsPerComponent value says how many bits we use to encode that. For 8, we get from 0 (black) to 255 (white), giving 256 levels of gray. This works out really well for our GIF files, with their color-maps of up to 256 values. To compute the value of a pixel, we take the pixel from the GIF file, index into the color map to get 8-bit R,G,B values and then convert those to an 8-bit gray scale. If R=G=B, then this value is the gray scale value. If not, we found a comment in the ISO standard that says the NTSC video standard uses:

gray = 0.3*r + 0.59*g + 0.11*b;
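
As a concrete version of that conversion, here is a small Python sketch. It uses the same weights, but in integer arithmetic so that the rounding is predictable; the function names and the three-entry color map are made up for the example.

```python
def to_gray(r, g, b):
    """Convert 8-bit R, G, B to an 8-bit gray value (0 = black, 255 = white).

    Uses the 0.3 / 0.59 / 0.11 weights in integer arithmetic, rounding
    to the nearest value.
    """
    return (30 * r + 59 * g + 11 * b + 50) // 100


def gray_pixels(indices, color_map):
    """Map GIF color-map indices to gray values."""
    return bytes(to_gray(*color_map[i]) for i in indices)


# A made-up three-entry color map: black, white, pure red.
color_map = [(0, 0, 0), (255, 255, 255), (255, 0, 0)]
print(list(gray_pixels([0, 1, 2], color_map)))
```

When R=G=B the weights add up to 1, so the function returns that common value unchanged.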

With a 431x397 image of 8-bit values, we would need 171,107 bytes, with each byte being in the range 0 to 255. This would clearly be binary data. But at least to start, we want to keep it in an ASCII encoding. The /ASCIIHexDecode value says we will write each byte as two hex digits, so we have a stream of 342,214 bytes. The /ASCIIHexDecode representation ignores white space, so just for ease of understanding, let us write each row of data as a separate line, so we get 397 new-lines to boost our total to 342,611 bytes, which as you see is the /Length value for this image.
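
The row-per-line hex layout is easy to sketch in Python. The function name is ours, and the exact placement of the new-lines and the closing ">" is a guess at one layout that gives the same byte count as the /Length above (two hex digits per pixel plus one extra byte per row).

```python
def hex_image_stream(pixels, width, height):
    """Encode 8-bit gray pixels as ASCIIHexDecode data, one row per line.

    `pixels` is a bytes object of width*height gray values, row by row.
    The closing ">" is appended directly after the last row.
    """
    assert len(pixels) == width * height
    rows = []
    for y in range(height):
        row = pixels[y * width:(y + 1) * width]
        rows.append(row.hex().upper())
    return "\n".join(rows) + ">"


# A tiny 3x2 image: white, black, white / black, white, black.
tiny = bytes([255, 0, 255, 0, 255, 0])
print(hex_image_stream(tiny, 3, 2))

# Size arithmetic for the 431x397 cover image.
raw = 431 * 397
print(raw, 2 * raw, 2 * raw + 397)
```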

And for some reason, even tho we specify the /Length of the image, the /ASCIIHexDecode representation says that the stream must end with a ">" character. Maybe it's trying to use the same decoder function as is used for hex strings (<...>), but then it seems it would also start with a "<". Whatever.

We can use different /Filter values to signal different encodings for the image. For example, we see here that an image of n bytes takes slightly more than 2*n bytes to represent -- each 8-bit value takes two hex digits. But we could use /ASCII85Decode instead, which takes 4 8-bit bytes (32 bits) and uses 5 ASCII printing characters (there are 85 different characters used) to represent those 32 bits. This would mean the n bytes of data takes only 5/4*n bytes to represent them (with a bit more time to encode and decode the image stream). Or we could go to a binary representation with compression; there are 10 different standard /Filter values defined in the ISO standard to encode (and possibly encrypt) binary data streams. But let's stick with /ASCIIHexDecode to get things working.
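
To see the 2*n versus 5/4*n difference on real bytes, here is a short Python comparison using the standard base64 module. The sample data is arbitrary and only stands in for image bytes; the "~>" added at the end is the end-of-data marker that an /ASCII85Decode stream expects.

```python
import base64

data = bytes(range(256)) * 4          # 1024 bytes of sample "image" data

hex_form = data.hex().upper() + ">"              # for /ASCIIHexDecode
a85_form = base64.a85encode(data) + b"~>"        # for /ASCII85Decode

print(len(data), len(hex_form), len(a85_form))

# Round trip, to check that nothing was lost.
assert base64.a85decode(a85_form[:-2]) == data
```

One wrinkle: a group of four zero bytes is written as the single character "z", so data with long runs of zeros can come out even shorter than 5/4*n.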

Once we have our image object defined, we need a way to refer to it. We define a name for it in the /Resources object. All the examples in the ISO Standard and in Rosenthol's book just use "Im1" or "Im2", or such, to suggest an image "Im" with a number suffix to make it unique. We know that each object has a unique object number, so let us just use the object number as the unique number suffix. We then add to our /Resources object for our pages:

/XObject << % image resources
/Im1 1 0 R % image object for figures/p40.gif
/Im2 2 0 R % image object for figures/p36.gif
/Im3 3 0 R % image object for figures/p34.gif
...
/Im22 22 0 R % image object for figures/cover2.gif
/Im23 23 0 R % image object for figures/vline.gif
>>

to define names for all of the images that are in the book.
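
Since the names follow directly from the object numbers, this entry can be generated rather than typed. A small Python sketch, with a made-up function name and a list of (object number, file name) pairs as its assumed input:

```python
def xobject_resources(images):
    """Return the /XObject entry as text.

    `images` is a list of (object_number, file_name) pairs; the name of
    each image is "Im" followed by its object number.
    """
    lines = ["/XObject << % image resources"]
    for obj_num, file_name in images:
        lines.append("/Im%d %d 0 R %% image object for %s"
                     % (obj_num, obj_num, file_name))
    lines.append(">>")
    return "\n".join(lines)


print(xobject_resources([(21, "figures/cover1.gif"),
                         (22, "figures/cover2.gif")]))
```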

Now to display the image, at a particular point in the book, we will have a line like:

i bcover.gif 545 3093 975 3489 figures/cover1.gif

which says that we want the image from figures/cover1.gif to be put starting at location (545 3093) and going to (975 3489). The previous line was probably showing a character, so we first exit text mode (ET). Since PDF puts the origin at the lower-left, we need to move to the lower-left corner of the image, location (545 3489), expressed in 72 DPI units and with the y-coordinate measured from the bottom of the page, not the top, so this becomes a move to (65.4 61.2).

And it turns out that the image definition is interpreted as defining a raster image for a 1x1 display, so if we want it to be the actual size we have (431x397) (but in 72 DPI units, so 51.72x47.64), we have to scale it up to this size. Both setting the location and the scale are what the transformation matrix is used for, so our first step is to set the transformation matrix. Then we just display the object (Do command):

ET
51.72 0 0 47.64 65.4 61.2 cm
/Im21 Do

(Of course, we only have to exit text mode if we were in it, and we may go back in it again if we have more text next.)
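
The arithmetic behind those numbers can be written out as a short Python sketch. The function name is ours, and the page height of 3999 pixels is a guess, picked only because it reproduces the 61.2 above; the real scan height is not given here.

```python
SCALE = 72.0 / 600.0   # scan pixels (600 DPI) to PDF units (72 per inch)


def place_image(name, x0, y0, x1, y1, page_height_px):
    """Return the content-stream lines that draw an image.

    (x0, y0)-(x1, y1) is the box in scan pixels, with y measured down
    from the top of the page; page_height_px is the scan height.
    """
    width = (x1 - x0 + 1) * SCALE            # box is inclusive at both ends
    height = (y1 - y0 + 1) * SCALE
    left = x0 * SCALE
    bottom = (page_height_px - y1) * SCALE   # flip y: PDF counts from the bottom
    return ["%g 0 0 %g %g %g cm" % (width, height, left, bottom),
            "/%s Do" % name]


# 3999 is a guessed page height, chosen to reproduce the numbers above.
print("\n".join(place_image("Im21", 545, 3093, 975, 3489, 3999)))
```

Since cm changes the transformation matrix for everything that follows, it is probably wise to wrap these two lines in q and Q so that the change is undone after the image is drawn.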

Adding this to our code to translate from our internal book description to a PDF file, we get images as well as our current black rectangles for letters.

And this seems like almost what we want also for our font glyph code. If we define a small image object which is the bit-map for our glyph, we can then just display it like we do for images.

Generating the images for the glyphs is fairly easy, and then changing the glyph drawing object to draw the glyph image is just:

525 0 obj % a function to draw A 6pt
<< /Length 58 >>
stream
4080 0 0 0 3480 3360 d1
3480 0 0 3360 0 0 cm
/Im524 Do
endstream
endobj

But this turns out to still draw little black boxes for each glyph, and not the image of the character. There is a hint in the ISO Standard where it says:

NOTE 6

One of the most important uses of stencil masking is for painting
character glyphs represented as bitmaps. Using such a glyph as a
stencil mask transfers only its “black” bits to the page, leaving
the “white” bits (which are really just background) unchanged.
For reasons discussed in 9.6.5, "Type 3 Fonts", an image mask,
rather than an image, should almost always be used to paint glyph
bitmaps.

Now, I can't find anything in 9.6.5 about image masks instead of images, but we can see that images are not really the same as glyphs. Glyphs, in our internal representation, only have two values for each bit, black or white, while images have a color map index which leads to 256 gray scales, or even color. Font glyph pixels are only 1 bit each. So representing them as we did with images (where each pixel is 8 bits) is obviously overkill. It seems that we may want to try a "Stencil Mask" instead of an image.

Rosenthol's book has a section on Stencil Masks, and it seems there are really only two differences from our current images:

1. The /ImageMask key is set to "true", and
2. The /BitsPerComponent is set to 1 (not 8).

An initial stab at changing these two keys in the definition of a glyph image gives us non-black rectangles for our fonts -- nothing readable, but a start, so this looks promising.

Next we have the problem of how to represent an image object with only one bit per pixel. Clearly we would want to pack the bits, so that we have 8 1-bit pixels in a byte. But suppose we have an image, like our 6-point A above, which is 29x28 bits? That's a total of 812 bits, which is 101 8-bit bytes plus 4 left over bits. We can pad that out to 102 8-bit bytes with 4 extra zeros.

If we do this, some characters show up just fine, but others look like they have a skew problem.

The "O" looks right, and the "A", "T", and "V" are close. Examining the "O", we find that it is 184x186, and specifically, notice that the 184 is a multiple of 8 -- 8*23 = 184. The "A" is 183x186 and the "T" is 191x190. It would seem that we need to pad each row of the image to a multiple of 8, not just the entire image.
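
The difference between padding the whole image and padding each row is easy to check with a little arithmetic. In this Python sketch (the function names are ours) the three sizes are the "O", the "A" and the 6-point A mentioned earlier.

```python
def bytes_per_row(width):
    """Bytes needed for one row of 1-bit pixels, padded to a whole byte."""
    return (width + 7) // 8


def mask_size(width, height):
    """Total bytes for a 1-bit image with every row padded."""
    return bytes_per_row(width) * height


for w, h in [(184, 186), (183, 186), (29, 28)]:
    pad_bits = bytes_per_row(w) * 8 - w
    print(w, h, bytes_per_row(w), pad_bits, mask_size(w, h))
```

A width of 184 needs no padding at all, which is why the "O" came out right either way. The 29x28 glyph needs 4 bytes per row and 112 bytes in total, not the 102 bytes we got by packing the whole image as one run of bits.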

With this change, everything works fine. We can position characters and show the correct raster image. Our final Type 3, raster image object for a 6-point comma, for example, is:

520 0 obj % image object for , 6pt
<<
/Type /XObject
/Subtype /Image
/ImageMask true
/Width 7
/Height 11
/BitsPerComponent 1
/Filter /ASCIIHexDecode
/Length 33
>>
stream
C6
82
00
00
00
80
C0
F0
E2
C6
8E>
endstream
endobj
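
Reading the hex data above as bits, a 0 bit appears to be ink and a 1 bit background, which agrees with the default behaviour of an image mask. Under that reading, the packing step can be sketched in Python as below. The picture of the comma was reconstructed from the hex digits, and the function name is ours.

```python
def pack_mask_rows(rows):
    """Pack rows of '#' (ink) and '.' (background) into hex, one row per line.

    Ink is written as a 0 bit and background as a 1 bit; each row is
    padded with 0 bits to a whole number of bytes.
    """
    out = []
    for row in rows:
        bits = "".join("0" if ch == "#" else "1" for ch in row)
        bits += "0" * (-len(bits) % 8)               # pad the row to a byte
        out.append("".join("%02X" % int(bits[i:i + 8], 2)
                           for i in range(0, len(bits), 8)))
    return out


COMMA = [
    "..###..",
    ".#####.",
    "#######",
    "#######",
    "#######",
    ".######",
    "..#####",
    "....###",
    "...###.",
    "..###..",
    ".###...",
]

print("\n".join(pack_mask_rows(COMMA)) + ">")
```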

and the object to actually display it is:

521 0 obj % a function to draw , 6pt
<< /Length 61 >>
stream
1200 0 0 -480 840 840 d1
840 0 0 1320 0 -480 cm
/Im520 Do
endstream
endobj

(Notice that the origin of the comma is not at 0,0 since part of the comma extends below the baseline, so we have to position at (0, -480) to put the origin of the comma on the baseline.)
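
All the numbers in the glyph procedure follow from the pixel measurements of the glyph: one 600 DPI pixel is 1000*72/600 = 120 glyph units. A Python sketch of that calculation is below. The function name is ours, and the pixel figures in the example (advance 10, 4 rows below the baseline) are back-computed from the numbers above by dividing by 120.

```python
UNITS_PER_PIXEL = 120   # 1000 * 72 / 600: one 600 DPI pixel in glyph units


def mask_glyph_proc(image_name, advance_px, width_px, height_px, below_px):
    """Glyph procedure lines for a bitmap glyph drawn with an image mask.

    below_px is how many pixel rows of the glyph hang below the baseline.
    """
    u = UNITS_PER_PIXEL
    w, h, low = width_px * u, height_px * u, -below_px * u
    return [
        "%d 0 0 %d %d %d d1" % (advance_px * u, low, w, low + h),
        "%d 0 0 %d 0 %d cm" % (w, h, low),
        "/%s Do" % image_name,
    ]


# The 6-point comma: 7x11 pixels, 4 rows below the baseline, advance 10 pixels.
print("\n".join(mask_glyph_proc("Im520", 10, 7, 11, 4)))
```

For a glyph that sits on the baseline, such as the 6-point A, below_px is 0 and both the bounding box and the image start at the origin.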

This allows us to completely define our book directly in PDF from our character and box representation.

Now our file is huge -- 85,522,909 bytes for what is only 20,179,903 bytes if we just import our GIF page images into LibreOffice and have it generate a PDF from them, but this is largely because of the way we are representing our images. If we use /ASCII85Decode instead of /ASCIIHexDecode, we should cut the image data by more than a third (from 2*n to 5/4*n bytes). Going to a compressed binary representation of our images (using /FlateDecode for example) would probably bring us to a much smaller size (at the cost of having binary data in the file). The 23 GIF images in our book take up 84,506,939 bytes of our PDF file, so only 1,015,970 bytes are needed for the fonts and text. The GIF file representations of the 23 figures are only about 17 mega-bytes total, so we should be able to reduce the size of the file to around 18 mega-bytes with a compressed binary representation.
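
As a rough feel for what a compressed representation could buy us, here is a Python sketch using the standard zlib module, which produces the kind of data a /FlateDecode stream holds. The image is a stand-in (a white page with one black row), not one of our scans, so the ratio it shows is far better than real pages would give.

```python
import zlib

# Stand-in for raw 8-bit gray image data: a mostly white 431x397 page
# with one black row. Real scans will compress less well than this.
width, height = 431, 397
raw = bytearray(b"\xff" * (width * height))
raw[10 * width:11 * width] = b"\x00" * width
raw = bytes(raw)

hex_size = 2 * len(raw) + height        # /ASCIIHexDecode, one row per line
flate = zlib.compress(raw, 9)           # data for a /FlateDecode stream

print(len(raw), hex_size, len(flate))
assert zlib.decompress(flate) == raw    # round trip
```

The compressed bytes would be written as the stream body, with /Filter /FlateDecode and /Length set to the compressed size.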

And we may be able to reduce the size of the text part by working on the widths of the characters or how we position to show text (using Td or Tm). But we at least have our Type 3 font working.

Wednesday, April 15, 2015