Monday, July 22, 2019

Creating PDF for the Petri Net book

Much of the coding for creating the PDF representation of the Petri Net book was already done in the Oak Island project.  This was just a much larger version of the same problem -- more pages, more figures, more fonts, ...  So we were producing PDFs pretty quickly and then resolving the problems that came up.

Many of the problems were simple and immediately obvious.  Running the pdfbuild program would create a PDF file.  Looking at that file with the local Linux PDF viewer (evince) would show any problems.  Correct the code for that case and rerun until it looked right.

Then I went to Amazon to create the book project and upload the PDF file for the book.  It was a big file (1.18 gigabytes), but with compression it came down to only about 12 megabytes.

But after a long time, Amazon rejected the file, without really saying why.  I sent a short note to the Amazon (actually "Kindle Direct Publishing") people and got back a reply that the file showed an error when opened with Acrobat Reader (the standard Adobe program for viewing PDF files on Windows and Mac computers).  Acrobat Reader said "Cannot extract the embedded font 'T3Font_33'.  Some characters may not display or print correctly."  This error did not show up with evince on my Linux computer, but it was clear that Amazon would not take my file if it did not work with Acrobat Reader.

Googling the error message didn't help much, but it did reveal a set of forums run by Adobe -- the Adobe Community.  I registered and searched to see if I could find suggestions about what was wrong with my files.  Two people, MikelKlink and Leonard_Rosenthol, gave good comments in the PDF Language and Specification Forum.  They helped with scaling my 600 dpi fonts to make the numbers smaller.  The PDF spec suggests units of 1/1000 of a point (a point being 1/72 inch), while my units are 1/600 inch (600 dpi).  The least common multiple of 72 and 600 is 1800 (25*72 = 3*600), so a unit of 1/1800 inch evenly represents both points and my 600 dpi pixels, and I could use those units instead of 1/1000.  They also pointed out that I was short one byte in my xref at the end of the file -- xref entries have to be exactly 20 bytes, even though the end-of-line marker can be either just a newline (NL) or a carriage return plus newline (CR/NL).
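The 20-byte requirement is easy to miss when writing a bare newline.  A minimal sketch in Python (the helper name xref_entry is mine, not from pdfbuild) that formats each entry with a two-byte CR/NL terminator so the length always comes out to exactly 20:

```python
def xref_entry(offset, gen=0, in_use=True):
    """Format one cross-reference table entry.

    The PDF spec requires each entry to be exactly 20 bytes:
    a 10-digit offset, a space, a 5-digit generation number, a
    space, 'n' (in use) or 'f' (free), then a 2-byte end-of-line.
    Using CR/NL supplies the 2 bytes directly; a lone NL has to be
    padded with a preceding space to keep the record at 20 bytes.
    """
    kind = "n" if in_use else "f"
    entry = f"{offset:010d} {gen:05d} {kind}\r\n"
    assert len(entry) == 20, "xref entries must be exactly 20 bytes"
    return entry
```

Writing the entry with a lone "\n" and no padding space would make it 19 bytes -- presumably the one-byte shortfall they spotted.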

But that did not resolve the problem with the fonts.  I thought it might be a problem with the /Resources objects, but no one picked up on that.  Searching for other ways to "debug" a PDF file, I tried Adobe Acrobat Pro, but it did not give useful error messages either.  Eventually I saw a reference to a "PDF repair tool" and found one at ILovePDF.com.  I could upload a sample chapter to their site and then download a "repaired" version.  The repaired version showed no errors in Acrobat Reader, but appeared to render the same images as my file.

So then the problem came down to comparing the two files -- my buggy PDF and the repaired PDF from ILovePDF.  I have a program that reads a PDF and outputs a human-readable version of it, but the two files still had lots of differences.  Some were minor (they always use "1 0 0 1 x y Tm" to move to position (x,y), while I use the shorter "x y Td"), but the major problem was that they renumbered all the objects.  It appeared that they read the entire document into memory, built the object tree, and then renumbered the objects starting at the root and working down, breadth-first.  So the root object of their file is object 1 and the first object in the file.  My code numbers objects from the bottom of the tree, working up, so the root object is the last object in the file and has the highest object number.
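That breadth-first renumbering is easy to sketch.  Assuming each object's outgoing references have already been extracted (the data layout here is hypothetical, not theirs), a queue-based walk from the root assigns object 1 to the root and numbers the rest level by level:

```python
from collections import deque

def renumber_breadth_first(refs, root):
    """Map old object numbers to new ones, breadth-first from the root.

    `refs` maps an old object number to the list of old object
    numbers it references (a hypothetical pre-extracted layout).
    The root becomes object 1, the objects it references become
    2, 3, ..., and so on down the tree.
    """
    mapping = {}
    queue = deque([root])
    while queue:
        old = queue.popleft()
        if old in mapping:
            continue                      # already numbered via an earlier path
        mapping[old] = len(mapping) + 1   # next unused object number
        queue.extend(refs.get(old, []))
    return mapping
```

With a root that references objects 7 and 4, which both reference object 2, the root maps to 1, then 7 and 4 to 2 and 3, and the shared leaf 2 to 4 -- the top-down ordering the repaired file showed.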

So the problem then became one of identifying the corresponding objects in the two files and comparing them, knowing that the important difference must be down in the fonts somewhere.

And after some time, I found a difference.

Some characters, particularly in small point sizes, are very small images -- the period, the single quote, the center dot, and so on.  A center dot, for example, in a 7 point font is (in my font definition) just 7 bits by 7 bits.  Each row has to be rounded up to a multiple of 8 bits, so this becomes an 8-bit by 7-bit image, which is then 7 bytes of data.  But compressed data has at least a little bit of overhead, and while a couple of bytes of overhead does not matter when 100K of data is compressed to 7K (plus the couple of bytes of overhead), it does matter when you only have 7 bytes to start with.  The "compressed" version of the 7 bytes takes 12 bytes of data.  Plus you then have to add "/Filter /FlateDecode" to the description of the object, losing another 20 bytes or so.
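The overhead is easy to demonstrate with zlib directly.  Exact sizes depend on the zlib version and settings, so this sketch only asserts that the "compressed" stream grows:

```python
import zlib

# A 7-bit-wide, 7-row glyph bitmap: each row is padded up to a
# whole byte, so the raw image is 7 bytes.
width_bits, rows = 7, 7
bytes_per_row = (width_bits + 7) // 8   # round each row up to a byte boundary
raw = bytes(bytes_per_row * rows)       # blank 7-byte bitmap for the demo

compressed = zlib.compress(raw)
# The zlib wrapper alone (2-byte header plus 4-byte Adler-32
# checksum) adds 6 bytes, so the output is larger than the
# 7-byte input no matter how well the deflate step does.
assert len(compressed) > len(raw)
```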

So my code was checking whether the output of compression was bigger than the input, and skipping the compression in that case.  But it was doing it wrong.  I was not interpreting the return codes from zlib correctly, so I was adding the /Filter /FlateDecode to the description of the font character image, but not actually compressing the bits.  Somehow evince handled that okay, but Acrobat Reader did not.
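The correct logic amounts to keeping the filter entry and the bytes written in sync -- emit /Filter /FlateDecode if and only if the compressed bytes are the ones actually stored.  A minimal sketch of that decision (the function name is mine, not from the actual pdfbuild code, which works in C terms with zlib return codes):

```python
import zlib

def encode_stream(data):
    """Compress a PDF stream only when compression actually shrinks it.

    Returns (stream_bytes, filter_entry).  The invariant is that the
    /Filter /FlateDecode entry appears if and only if the returned
    bytes really are the compressed ones -- the bug was claiming the
    filter while writing the uncompressed bits.
    """
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return compressed, "/Filter /FlateDecode"
    return data, ""
```

A tiny 7-byte glyph image comes back unchanged with no filter entry, while a large repressible stream comes back compressed with the filter attached.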

But fixing my compression code and using zlib properly solved that problem and got my PDF files accepted by Acrobat Reader (and evince).  That then allowed me to upload the files to Amazon/Kindle and move the book along toward becoming available as a print-on-demand book.