Republishing Old Books: December 2022

Monday, December 26, 2022

Fixing the Photocopy Lines

From reading graphics processing books, it appears that fixing the thin little horizontal lines that have been introduced into the images by the photocopying process should be fairly easy. All we have to do is use a Hough transform.

A Hough transform is used to find straight lines in an image. The idea is to look at each pixel and ask "if there was a straight line going thru this point, what would it be?". The equation for a line with slope a and y-intercept b going thru the point (x,y) is y = ax+b. If we know (x,y) and want to solve for a and b, that is better written as b = y - ax. Now, given x, y and a, we can solve for b.

A Hough transform defines an accumulator matrix m and, stepping thru the possible slopes a, computes b = y - ax for each allowable a and adds one to the matrix location m[a,b]. The idea is that if the line is actually there, all the points along the line will add to m[a,b] for the line with that particular slope a and intercept b. All the points that are not on a line will just sort of add randomly to other locations in the matrix. So the matrix locations with large counts correspond to lines and the others are all just noise.
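The voting described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the book: the function name, the small list of candidate slopes, and the rounding of b to an integer bin are all assumptions made for the example.

```python
from collections import Counter

def hough_slope_intercept(points, slopes):
    """Each point votes for b = y - a*x at every candidate slope a."""
    votes = Counter()
    for x, y in points:
        for a in slopes:
            b = round(y - a * x)  # quantize the intercept to an integer bin
            votes[(a, b)] += 1
    return votes

# Five points on the line y = 2x + 1, plus two stray points (made-up data).
points = [(x, 2 * x + 1) for x in range(5)] + [(3, 0), (1, 9)]
votes = hough_slope_intercept(points, slopes=[-1, 0, 1, 2])

# The bin with the most votes is the detected line.
(best_a, best_b), count = votes.most_common(1)[0]
print(best_a, best_b, count)
```

The five points on the line all land in the same bin, while the stray points scatter their votes across other bins.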

There are problems with this approach when the slope approaches vertical since the slope tends towards infinity, but in our case we are only interested in horizontal (or mostly horizontal) lines. In fact, we expect that if we limit our work to each panel of a strip, instead of an entire strip, then the skew will be minor and we are actually looking for horizontal lines. In that case a = 0, and we simply accumulate into a vector m[b], where the line is defined by b = y - 0x, or just b = y.

So for each point (x,y) that we think might be on a line, we add one to m[y] and look for the high values of the resulting vector.
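With the slope fixed at zero, the accumulator collapses to one counter per row. A minimal sketch, assuming the candidate points have already been found and using a hypothetical min_votes threshold to decide what counts as a "high value":

```python
def row_votes(candidates, height):
    """Accumulate one vote in m[y] for every candidate point (x, y)."""
    m = [0] * height
    for _x, y in candidates:
        m[y] += 1
    return m

def peak_rows(m, min_votes):
    """Rows whose vote count reaches the (assumed) threshold."""
    return [y for y, v in enumerate(m) if v >= min_votes]

# Eight candidate points along row 3, plus two isolated ones (made-up data).
candidates = [(x, 3) for x in range(8)] + [(2, 0), (5, 6)]
m = row_votes(candidates, height=8)
peaks = peak_rows(m, min_votes=4)
print(m, peaks)
```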

So we are interested in a point if it is on a line. We know the lines look like:

So we look for a column with a small number of white pixels with black pixels above and below. Then we accumulate the number in a particular row, and look for the rows with the largest accumulation. For a row like that, we then step along it looking for the small clusters of white pixels and setting them all to black.

There are some parameters for this, like how many consecutive white pixels we allow, and how many black pixels we need above and below the white pixels. We start with up to 5 white pixels and needing 10 black pixels above and below.
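Putting the pieces together, the detect-then-fill pass might look something like the sketch below. It is a reconstruction from the description, not the actual program: the pixel encoding (0 for black, 1 for white), the function names, and the min_votes threshold are assumptions, while max_white=5 and min_black=10 are the starting parameters mentioned above.

```python
BLACK, WHITE = 0, 1  # assumed pixel encoding

def thin_white_runs(img, x, max_white=5, min_black=10):
    """Yield (top, length) for short vertical white runs in column x that
    have at least min_black black pixels directly above and below."""
    h = len(img)
    y = 0
    while y < h:
        if img[y][x] != WHITE:
            y += 1
            continue
        top = y
        while y < h and img[y][x] == WHITE:
            y += 1
        length = y - top
        if (length <= max_white
                and top - min_black >= 0 and y + min_black <= h
                and all(img[r][x] == BLACK for r in range(top - min_black, top))
                and all(img[r][x] == BLACK for r in range(y, y + min_black))):
            yield top, length

def fill_thin_lines(img, min_votes, max_white=5, min_black=10):
    """Vote per row, then blacken candidate runs on rows with enough votes."""
    h, w = len(img), len(img[0])
    runs = [(x, top, n) for x in range(w)
            for top, n in thin_white_runs(img, x, max_white, min_black)]
    votes = [0] * h
    for x, top, n in runs:
        for r in range(top, top + n):
            votes[r] += 1
    for x, top, n in runs:
        for r in range(top, top + n):
            if votes[r] >= min_votes:
                img[r][x] = BLACK
    return votes

# Made-up test image: 45 rows x 6 columns of black, a 2-pixel white line
# across rows 15-16, and one isolated 2-pixel white gap in column 2.
img = [[BLACK] * 6 for _ in range(45)]
for x in range(6):
    img[15][x] = img[16][x] = WHITE
img[30][2] = img[31][2] = WHITE
votes = fill_thin_lines(img, min_votes=4)
```

In this toy image the line across rows 15 and 16 collects a vote from every column and is filled in, while the isolated gap at rows 30 and 31 gets only one vote per row and is left alone.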

As we suspected, the fine white lines mainly show up on the odd numbered files, which would be the left hand pages, which was the "back" of the sheets as they were copied.

When we use this approach, we find two types of errors. One is some very noticeable white lines that are not caught. Going thru all 1884 panels, we find 150 that still need work.

Some of those are because the white lines are bigger than we anticipated. For example, here on panel 3 of strip 4 on p013, the white line is sometimes 6 pixels high.

Another problem is where we have what we think might be a white line, but is actually part of the original art. For example in the 2nd panel of the 2nd strip of p034, we have the following art:

Which actually has no white lines in it. But notice that there are two marks on the plate, just below the top rim. The distance here between the white and the black is such that my code thinks this is part of a thin white line, and fills those white pixels in, producing:

So I need to either improve my code, or ferret out all these sorts of problems and fix them by hand.

Tuesday, December 13, 2022

Starting the Second Gordo Book

Most of the technical issues for the first Gordo book (Gordo Redux) are handled, so we turn our attention to the second Gordo book (Gordo Galore). As with the first one, we scanned it at 600 dpi. Then we can run the same set of programs we used on the first book: convert the pages from gray scale to black and white, clean up the images, create the cboxes, and pull the individual strips out as images.

But a new problem is presented by this particular book: when it was photocopied at some point, thin white lines were introduced into the copy. This seems particularly a problem on the backs of the pages, the left-hand sides which would be odd pages. For example look at part of p029:

I've seen these sorts of lines on Xerox copies for decades. They are caused by dirty or defective points in the photocopy machinery or toner.

But we would like to identify these lines, and fill them in with the missing black to restore the original image drawn by Arriola.

If we expand in and look at the individual pixels, we see that the lines are not constant, but rather a mixture of light and dark pixels.

When we convert this to pure black and white pixels, the picture changes somewhat, but we still have these lines.

It appears that many of these could be detected by simply changing small pieces of white surrounded by black to black, but as with the first book, there are some places where that would change the image. For the first book, though, the image that limited us to clusters of 10 pixels or less involved a black cluster on a white background.

Still there are some places where that would be insufficient. If we follow the lines shown above to the left in this same image, we get:

At least for the line on the right, the little white channels breaking the otherwise solid vertical line probably should not be filled in without the context of knowing that there is a thin white line crossing the page.

To study this problem, I tried to look at where the white lines are, in this image, by examining the borders for the panels. This image has 3 panels, so a left border, two gutters of white with borders at roughly 1/3 and 2/3 horizontally, and a right border. If I try to list the white lines at each vertical border, I get:

The largest, most obvious white lines are marked with an "@" symbol. The numbers represent the starting row of the white line, and the number of pixels high that it is. So "447,3" is a gap that starts at row 447 and continues to row 449. There are a couple of things we can note from this chart.
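As an illustration of the "start,height" notation, here is a small sketch that lists the white gaps in a single column of pixels taken from a border that should be solid black. The function name and the pixel encoding (0 for black, 1 for white) are assumptions, and the column is made up to reproduce the "447,3" example.

```python
from itertools import groupby

def white_gaps(column, white=1):
    """Return (start_row, height) for each run of white pixels in a column."""
    gaps, row = [], 0
    for value, group in groupby(column):
        n = len(list(group))
        if value == white:
            gaps.append((row, n))
        row += n
    return gaps

# A made-up 600-pixel border column with one gap covering rows 447 to 449.
border = [0] * 600
for r in (447, 448, 449):
    border[r] = 1
gaps = white_gaps(border)
start, height = gaps[0]
last_row = start + height - 1
print(gaps, last_row)
```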

One is that the white lines are not actually strictly horizontal. As they move from left to right, following a white line, for this image, the row position decreases. So there is skew. The last column shows the difference between the left hand row position and the right hand row position. For this image it decreases by a skew of 20 to 27 pixels in the span of 4446 columns, about 0.45% to 0.6%. But notice also that, at least in this case, the skew is not constant for this image, but seems to decrease almost uniformly as the row changes. And since the skew is not constant, it is not a property of the scanning.
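The skew arithmetic is easy to check. The sketch below computes the percentage for the two extremes, and also shows where a line would be expected partway across the image if the drop were linear. The linear model, the function names, and the pairing of row 447 with a 20-row drop are assumptions for illustration only.

```python
WIDTH = 4446  # columns between the left and right borders

def skew_percent(drop, width=WIDTH):
    """Row drop across the image as a percentage of its width."""
    return 100 * drop / width

def expected_row(left_row, drop, x, width=WIDTH):
    """Row of a skewed line at column x, assuming the drop is linear in x."""
    return left_row - drop * x / width

low = round(skew_percent(20), 2)
high = round(skew_percent(27), 2)
# Hypothetical line: row 447 at the left border, dropping 20 rows in total.
mid_row = expected_row(447, 20, WIDTH // 2)
print(low, high, mid_row)
```

A drop of 20 rows works out to about 0.45% and a drop of 27 rows to about 0.61%.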

And lines do not seem to appear across the entire image. A line may stop or start in the middle of the image. From p029, for example:

Here we see a major line towards the bottom that continues thru the image, although varying somewhat in "strength". Other lines, at the top, start and stop at intervals.

If we can identify the lines, we can fill them in to change the above image to