Saturday, November 25, 2023

More on fixing the little white lines

The basic problem we are having with the strips for the second Gordo book is that some of the images (particularly the odd-numbered pages, which would be the right-hand side images -- the backs of the pages) were copied on a machine which introduced thin white lines across the images.

If we look at an image, and blow it up we find:


In this small segment of an image, we can see 4 white lines.  The green pixels are ones that our programs have identified and marked as being white, but should be black.   A lot of the green pixels are just single pixel errors -- general noise -- and not associated with the white lines explicitly.

We have tried for months to write programs that will find just the white lines and fill them in.  It seems at first to be relatively simple:  look for white pixels with black pixels above and below.  But consider the following image segment:


This would seem to be an obvious case of a thin (3 pixel) white line.  But if we pull back from this and look at the larger context,


The gap in the previous image is the space between the bottom of one A and the top of the A on the next line.  It is not something that should be fixed.  Although, it may in fact actually be a 3-pixel white line that just happened to be in exactly that place; it requires judgement to determine if it should be filled in or not.

And there are lots of places in the images where a "small" number of white pixels have black above and below, but are not caused by a white line.  Consider, in the original image we showed above:


 The space between the top line of the F and the middle line should not be filled in, but meets the simple definition -- a small number of white pixels with black above and below.  The letter E is an even worse case.
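To make the problem concrete, here is a minimal sketch of the simple rule (white pixels with black directly above and below), applied to a tiny hand-made letter F.  The function name, the 0/1 pixel encoding, and the `max_gap` limit are assumptions for illustration, not our actual program.

```python
def naive_fill_candidates(img, max_gap=3):
    """Return the (row, col) of white pixels that the simple rule would fill.

    img is a list of rows; 0 = black, 1 = white (an assumed encoding).
    A vertical run of white pixels qualifies if it is at most max_gap long
    and has a black pixel directly above and directly below it.
    """
    height, width = len(img), len(img[0])
    found = set()
    for c in range(width):
        r = 0
        while r < height:
            if img[r][c] != 1:
                r += 1
                continue
            start = r
            while r < height and img[r][c] == 1:
                r += 1
            # The run covers rows start..r-1; it must not touch the image edge.
            if start > 0 and r < height and (r - start) <= max_gap:
                found.update((rr, c) for rr in range(start, r))
    return found


# A crude letter F: top bar, stem, and a shorter middle bar.
letter_f = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
]
print(sorted(naive_fill_candidates(letter_f)))
```

The two pixels it reports are the gap between the top bar and the middle bar of the F, which is exactly the kind of place that should be left alone.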

So after months of trying to program fixing the lines, it seemed that I needed to just do it by hand.  On my Linux system, the image editor GIMP provides me the ability to edit images one pixel at a time.  So I just need to load each image and then find the little white lines, and change the appropriate white pixels to black. 

Of course that would be tedious at best.  Pixel by pixel editing is very time consuming.  We want to be able to see what we have changed and what was the original image, so we change the pixels from white to green, and then can use a later program to change the green pixels to black.  We use a pencil setting to draw green pixels over the white pixels that we want to change.

And to avoid having to change just exactly the right (white) pixels, we developed a process that allows us to draw slightly outside the lines and also change black pixels.  A program then compares the modified image with the original image and if it finds a green pixel where there used to be a black pixel, it keeps that pixel black, keeping the green pixel only if the original pixel was white.  This allows us to be a bit sloppy in our editing so we can work faster.
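The comparison step can be sketched roughly as follows.  This is only an illustration: the single-letter pixel codes and the function names are made up here, and it assumes that any pixel not painted green is simply taken from the original image.

```python
WHITE, BLACK, GREEN = "W", "B", "G"  # assumed pixel codes for this sketch


def keep_valid_marks(original, marked):
    """Keep a green mark only where the original pixel was white."""
    result = []
    for orig_row, mark_row in zip(original, marked):
        row = []
        for orig_px, mark_px in zip(orig_row, mark_row):
            if mark_px == GREEN and orig_px == WHITE:
                row.append(GREEN)    # a real fix: white painted green
            else:
                row.append(orig_px)  # sloppy green over black stays black
        result.append(row)
    return result


def green_to_black(image):
    """The later pass: turn every remaining green pixel into black."""
    return [[BLACK if px == GREEN else px for px in row] for row in image]


original = [list("BBWWBB"), list("BWWWWB")]
marked = [list("GGGGGB"), list("BWWWWB")]  # drawn a bit outside the lines
cleaned = keep_valid_marks(original, marked)
print(["".join(row) for row in cleaned])
print(["".join(row) for row in green_to_black(cleaned)])
```

Keeping the green pixels in an intermediate file is what lets us look at what was changed before committing it to black.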

So if we take our example image, we can mark it as:

and then process it down to 


We are working on each panel of the book, one at a time.  We started this in May, and it looks like we may finish it by November or December.



Tuesday, May 16, 2023

Understanding the Hough Transform

Our basic code to find the white lines that the photocopying process has introduced uses a Hough Transform, which we described in a previous post, "Fixing the Photocopy Lines".  But we are having a number of problems with our code for that use.  So let us try to better understand Hough Transforms and how they work.

Let us start with a really simple case: a 48-pixel-long, well-defined, perfectly horizontal line (from (5,16) to (52,16)).


If we feed this into our Hough Transform code and look at the matrix that is produced by it, we find it looks like:


The maximum value (48) is right in the middle of row 16, corresponding to a slope of 0 and an intercept of 16.  But we also compute for lines that go thru each point of the line for 31 different slopes, each a tiny bit different from the previous (both positive and negative).

So although the maximum is a slope of 0 and an intercept of 16, we also have the same maximum for a slope slightly less and slightly more (a slope so that at the end of the image, the line would be one pixel higher or lower, so not quite level).  This is because of the integer arithmetic used.  As the slope gets higher (or lower), some pixels will actually intercept the edge of the image higher and lower.  Notice that the sum in each column (one column per slope, summed over all intercepts) is 48, since all 48 green pixels contribute to some b-intercept line, for each different slope.
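A small sketch of a slope/intercept accumulator reproduces these observations for the single-line case.  It is not our actual code: the 64x32 image size, the choice to express each slope as a whole number of pixels of rise across the image width, and the truncation toward zero are all assumptions made for the illustration.

```python
def hough_slope_intercept(pixels, width, height, n_slopes=31):
    """Vote for (intercept, slope) pairs; rows are intercepts, columns slopes.

    Slope column `col` means a rise of (col - n_slopes // 2) pixels over the
    full image width.  Returns the matrix and the row offset that was added
    so that intercepts slightly outside the image still fit.
    """
    half = n_slopes // 2
    offset = half
    acc = [[0] * n_slopes for _ in range(height + 2 * half)]
    for x, y in pixels:
        for col in range(n_slopes):
            rise = col - half
            # Integer arithmetic: int() truncates toward zero.
            b = y - int(rise * x / width)
            acc[b + offset][col] += 1
    return acc, offset


line = [(x, 16) for x in range(5, 53)]  # 48 pixels, (5,16) to (52,16)
acc, offset = hough_slope_intercept(line, width=64, height=32)
center = 31 // 2
print(acc[16 + offset][center - 1 : center + 2])
print([sum(row[col] for row in acc) for col in range(31)])
```

With these assumptions the value 48 shows up at intercept 16 for the level slope and for the slopes on either side of it, and every column sums to 48.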

If the green line gets thicker, then there are multiple intercepts for the thicker line (from 15 to 17, for example for a 3-pixel wide line), and the Hough transform matrix expands similarly.

Note that each column still adds up to 48 (times 3 now that we have 3 lines of 48 green pixels).  But the maximum value (48) occurs 15 times -- five times on each of the 3 lines for the intercept lines corresponding to the 3 straight, horizontal lines.

And we start to see other local maximums, for example the 39 up one line from the 3 real intercepts, and to the left and right of the center.  This probably corresponds to a line that would start at the lower of the 3 lines at the left and then move right into the center line and up to the upper line, exiting at the right-most pixel of the upper line.  And a similar line going down from the left-most green pixel for the upper line to the right-most green pixel for the lower line. 

So while a maximum value in the Hough Transform matrix corresponds to a particular slope/intercept line, there may be multiple maximums suggesting a thicker line.

In a noisy environment, however, we would expect that there are no perfect lines.  For a perfectly horizontal single-pixel line, that would mean a lower maximum -- some of the pixels that "should" contribute to it are missing -- but we would still expect that the maximum is larger than the lower values in the matrix. 

If we poke holes in the lines, corresponding to noise, we can end up with more or fewer non-green holes in each of the 3 lines corresponding to the width of the line, and higher or lower values in the Hough Transform matrix.  For example, putting one non-green pixel in the lowest line, five in the central line, and 3 in the upper line, moves the maximum to 47 in the line corresponding to the lowest line, 43 for the line corresponding to the middle line, and 45 in the upper line.   This gives a cluster of values from 47 to 43, and then a drop to 38, 37 ... and much lower values.  But it is not clear how to differentiate between the 43 (which is 4 below the maximum) and the 38 (which is only 5 below that).

But there could be other high-value entries "close" to the maximum that correspond to a thicker line -- not just one pixel.

If we look at a real example, panel p027_1_1, we find the maximum at 783,22 with a value of 276.  There is a 274 at 784,20, and a 239 at 785,18.  We have to drop to 182 to get away from the cluster of high numbers in rows 784 to 790.


Monday, April 3, 2023

Panel Borders

Most of the panels have borders -- a line box drawn around the contents of the panel. So it seems that we should be able to recognize the border as a black line of some thickness, and fill in any missing pixels -- white pixels in the black border.

In particular, for those pages with little white lines from the copying process, if we find a set of white pixels on the left and the right, it suggests there is a little white line connecting them.  That could help us repair those lines.

So the first problem is to recognize the border, where it is in the panel and its thickness. There are four sides to a panel: left, top, right, and bottom.  The border should run along the outside next to each of these sides, as a continuous line of black pixels.

Except the border, as with all the art of the panels, was hand drawn, so it's not perfectly straight and uniform.  We could, if we can recognize it, adjust it so that it was in fact a perfectly straight line of uniform width on all sides.  But that would change the nature of the panel, and we don't want to do that; we want to maintain the hand-drawn nature of the borders.

So we wrote code that will scan, starting from the outside edge over any white pixels, until we find the black pixels that make up the border, and then scan over those black pixels to find the other side of the border, giving us the location and width of the border.  Trying to match this simple model of the border reveals several problems:

(1) Not all panels have borders.  For example, p152_1_3 has no border.


Turns out there are a lot of our panels that do not have borders.  This is especially true at the very beginning, where each letter of a large title is a "panel" and none of them have borders. 


 

(2) Not all panels stay inside the borders.  For example, in p016_3_3 we may find stuff before the border, and the border may be interrupted by artwork lying "on top" of the border.



(3) Some of our "panels" are not just one panel.  For example, p150_4_2 is actually two panels with a word balloon that spans between them, over the "gutter" between them. 


(4) Notice that we then also have to allow for "gaps" in the border.  If we have a file that is one or more panels that cannot be easily separated, such as p146_3_1, the border may not be complete or contiguous. 

 
(5) Arriola has a style of a "panel" of words next to a panel of art instead of a large word balloon.  These word panels do not have borders, and there are a lot of them.

(6) But these word panels are the speech of a particular character and so there is the "tail" of the word balloon pointing at the character doing the speaking.  This tail consists of a sideways V that pierces the border and points to the speaker.  See for example, p010_3_3 and the following art panel p010_3_4.  These V-shaped tails can be on either the left or the right.

 


(7) This latter art panel, p010_3_4, also illustrates that the borders may have strange corners and ticks, as illustrated in the lower left corner.  

 


Sometimes it is even more extreme, like p012_2_4 which has a noticeable corner tick in both the upper right and lower right. 


(8) The border may run into the art.  Consider p041_3_1 which shows the two characters in silhouette against a porthole.  The "border" is obvious on the top, but as it runs down the sides and across the bottom is merged directly into the art.
 


(9) Some panels may be both borderless text and an art panel together, for example, p041_2_2.  This is similar to point (3) above.

 

So first we try to isolate and identify those panels that are simply rectangles with solid borders (well, except for the V-shaped gaps for the tail of a word panel). This produces a list of 1276 panels.  This should allow us to find the average width of the borders when they are present.

Doing this, and then looking at the computed border widths, shows that they peak at about 19 pixels in width, largely between 15 and 26 pixels. 

Except for a smaller peak at 3 or 4 pixels.  For example, p043_3_1 has a left border of 3, a top border of 3, a right border of 4, and a bottom border of 7.  

 

 

Of course these numbers are only a summary.  The borders vary from 1 to 429 because of effects such as (8) above where the border and the art run together.  For example in p043_3_1, the right and bottom borders merge with the hair of the one character in the lower right.  The top border, on the other hand is not affected by these problems, but even it runs from 1 pixel in spots to 11 in another.  The histogram of the various widths shows:

   1   81
   2  319
   3  335
   4  238
   5  225
   6  166
   7  134
   8   64
   9    6
  10    4
  11    1

So the most common width is 3, but anywhere from 2 to 5 (possibly 7) is pretty much the same.

(This particular part of the book is describing a time travel story and maybe the thinner border is a reflection that these events are in the past.)

So we adopt the definition that the border starts with the first black pixel and is at most 20 to 22 pixels wide, but may be narrower.
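As a rough sketch of that definition for the left side of a panel: scan each row from the outside edge over the white pixels, take the first black pixel as the start of the border, and count black pixels up to the cap.  The function name, the 0/1 pixel encoding, and the choice of 22 as the cap are assumptions for this example.

```python
def find_left_border(img, max_width=22):
    """For each row, return (start, width) of the left border, or None.

    img is a list of rows; 0 = black, 1 = white (an assumed encoding).
    The border starts at the first black pixel and is the run of black
    pixels that follows, capped at max_width.
    """
    result = []
    for row in img:
        start = next((i for i, px in enumerate(row) if px == 0), None)
        if start is None:
            result.append(None)  # no black pixel in this row at all
            continue
        width = 0
        while (start + width < len(row)
               and row[start + width] == 0
               and width < max_width):
            width += 1
        result.append((start, width))
    return result


rows = [
    [1, 1, 0, 0, 0, 1, 0],  # border starts at column 2 and is 3 wide
    [1] * 7,                # borderless row
    [0] * 30,               # border merged into solid black art
]
print(find_left_border(rows))
```

The cap is what keeps a row like the third one, where the border runs into solid black art, from reporting a width in the hundreds.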

Another property that we would expect is that, at the 600 dpi at which we are scanning the images, the border, while it may vary, will not vary by very much from one scan line to the next.  We would expect it to often be the same, or maybe off by one or two pixels in either direction.  If we look at where the border begins, and then at where it begins for the next row or column, and look at the histogram of the difference between those two, we get numbers that vary from -89 to +58, but the peak (5,345,976) is at zero, and by plus/minus 3 the count is down to about 6,000.
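The difference histogram can be computed along these lines.  This is a sketch with made-up border positions; the function name and the convention that a row with no border found is recorded as None are assumptions.

```python
from collections import Counter


def start_differences(starts):
    """Histogram of the change in border start from one scan line to the next.

    starts holds one border start position per scan line, or None where
    no border was found; pairs involving None are skipped.
    """
    diffs = Counter()
    prev = None
    for start in starts:
        if start is not None and prev is not None:
            diffs[start - prev] += 1
        prev = start
    return diffs


starts = [5, 5, 6, 6, 4, None, 4, 4]  # invented values for illustration
histogram = start_differences(starts)
for diff in sorted(histogram):
    print(f"{diff:4d}  {histogram[diff]}")
```

Large absolute differences in this histogram are the outliers worth looking at by hand.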

Look at the histogram of differences. 

We can use the outliers to identify issues with the panels.  For example, the 58 value corresponds to p081_3_3, and shows that the quotes that belong in the following text panel have been incorrectly put in this panel.  

 


If we pick those values that are plus/minus 20 pixels or more, we get a selection of 144 panels to examine. Making it plus/minus 30 reduces this to just 28.  For example, checking p025_2_1 which has a value of -30 on the bottom, we find it is caused by a dust speck on the bottom of the panel, which we remove. 

While we are looking at p025_2_1, we notice that it seems skewed. If you follow the border from left to right, across the top, it drops about 17 pixels, 20 pixels on the right, 13 pixels across the bottom, and 24 on the left.  This suggests that it is skewed -- not aligned to the page that it is on, but slightly rotated.

We can check for skew by using the outer set of pixels that define the border to define a line that is the linear least squares best fit for the border.  We can then compare the slopes for the four sides and if we find they are all of the same sign and roughly the same magnitude, then it suggests that the image is skewed.

images1/p025_2_1.gif:   left border is slope -0.018, intercept    25.58
images1/p025_2_1.gif: bottom border is slope -0.009, intercept    32.83
images1/p025_2_1.gif:  right border is slope -0.013, intercept    22.73
images1/p025_2_1.gif:    top border is slope -0.012, intercept    18.21
images1/p025_2_1.gif: min slope=-0.018;max slope=-0.009;diff slope= 0.009 ###


That is not technically correct, since obviously the top and bottom sides are perpendicular to the left and right.  However our code is written to only work on one side -- the left side of the panel.  To get the top, right, and bottom sides, we rotate the original panel 90 degrees, and then run our code again.  Rotate again for the right; rotate again for the top. This makes the computations for each side the same and gives comparable numbers for any of the statistics, like the slopes.
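The least squares fit and the skew test described above can be sketched as follows.  The function names and the tolerance of 0.01 for "roughly the same magnitude" are assumptions for illustration; the four slopes in the usage example are the ones printed for p025_2_1.

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


def looks_skewed(slopes, tolerance=0.01):
    """True if all slopes are nonzero, share a sign, and are close in size."""
    same_sign = all(s < 0 for s in slopes) or all(s > 0 for s in slopes)
    magnitudes = [abs(s) for s in slopes]
    return same_sign and (max(magnitudes) - min(magnitudes)) <= tolerance


# x is the position along the side, y is where the border starts there.
slope, intercept = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
print(slope, intercept)
print(looks_skewed([-0.018, -0.009, -0.013, -0.012]))
```

Because every side is measured after rotating it into the "left" position, a rotated panel gives four slopes with the same sign, which is what the test looks for.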

This rotation approach causes one difficulty.  If we have rotated the image 90 degrees, then what the program is seeing as location (i,j) in the image is not what I see as (i,j) in the image. If we are printing the location of a pixel, we want to see it in its true position, not in its rotated position.

In particular, if we compute a line that approximates each border, we should be able to compute the intersection of the lines for the left side and the top side to compute the corner pixel that defines where we should start working on the image -- we don't want to process the background that exists outside the border.  But this is more difficult because the lines for the top and the left sides have been computed in different coordinate systems and will need to be translated to the true coordinate system to identify the spot where they intersect.
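The translation back to true coordinates is mechanical once the direction of rotation is fixed.  The sketch below assumes clockwise quarter turns and (row, column) coordinates; both are assumptions, since the direction depends on how the rotation is actually done, and the function names are invented for the example.

```python
def rotate_cw(grid):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def to_original(r, c, turns, height, width):
    """Map (row, col) in an image rotated `turns` clockwise quarter turns
    back to (row, col) in the unrotated image of size height x width."""
    turns %= 4
    # Width of the rotated image; it equals the height of the image
    # as it was one turn earlier.
    w = width if turns % 2 == 0 else height
    h = height if turns % 2 == 0 else width
    for _ in range(turns):
        r, c = w - 1 - c, r  # undo one clockwise turn
        h, w = w, h          # the earlier image has swapped dimensions
    return r, c


image = [[1, 2, 3],
         [4, 5, 6]]  # 2 rows by 3 columns
rotated = rotate_cw(image)
print(rotated)
print(to_original(0, 0, 1, height=2, width=3))
```

Applying the same mapping to two points on each fitted line would put the left and top lines into one coordinate system, where their intersection can then be computed.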








Friday, March 10, 2023

Fixing the Photocopy Lines, Part 2

 We have spent months, literally months, trying to fix the code that uses a Hough transform to find the little white lines introduced by the photocopying process and have more problems than we anticipated.   First our code does not find all the white lines, and second it finds lines where there are no lines.

The first problem seems to be because there are lots of potential white lines, since there are so many white pixels.  The second is that some parts of the images happen to be arranged so that, in isolation, they can be interpreted as being part of a white line, when they are not.

But in working with the images,  I noticed that if you look at the lines so that you can see the individual pixels, like:

You can see that the white lines tend to be represented by small groups of disconnected white pixels.  It is only when you get a really big white line that you get long sections of white pixels.  And the small groups of pixels seem to be in repeating patterns. For example in the above image, the last group of pixels in the last line is repeated earlier in that line, as well as in the line above it.

So it looks like, if we scan the images for small groups of white pixels (say something that would fit in an 8x8 square) and count how many times those patterns are repeated in the images, then we can take the most common ones and, when we find them in an image, fill them in with black.

In fact we can use the partially fixed images from using the Hough transform and diff them with the original images to get just bits that we believe are in little white lines.  From that we can produce a list of the most frequent patterns, and then use those patterns on the original images to find a larger set of little white lines.

Writing a program to find the patterns, produces thousands of potential patterns, but looking at the frequencies, most patterns occur in just a few places -- less than 10.  There are 776 patterns of more than 10 occurrences, and they are no bigger than 7x7.
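One way to collect and count such patterns is sketched below.  It is an illustration, not our program: it assumes a pattern is a connected group of white pixels (touching diagonally counts), identifies the pattern by its shape relative to its bounding box, and uses a 0/1 pixel encoding.

```python
from collections import Counter


def white_patterns(img, max_size=8):
    """Count the shapes of small connected groups of white pixels.

    img is a list of rows; 0 = black, 1 = white (an assumed encoding).
    Groups whose bounding box exceeds max_size x max_size are ignored.
    """
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    counts = Counter()
    for r0 in range(height):
        for c0 in range(width):
            if img[r0][c0] != 1 or seen[r0][c0]:
                continue
            stack, group = [(r0, c0)], []
            seen[r0][c0] = True
            while stack:  # flood fill over the 8 neighbours
                r, c = stack.pop()
                group.append((r, c))
                for rr in range(max(r - 1, 0), min(r + 2, height)):
                    for cc in range(max(c - 1, 0), min(c + 2, width)):
                        if img[rr][cc] == 1 and not seen[rr][cc]:
                            seen[rr][cc] = True
                            stack.append((rr, cc))
            rows = [r for r, _ in group]
            cols = [c for _, c in group]
            if (max(rows) - min(rows) < max_size
                    and max(cols) - min(cols) < max_size):
                shape = frozenset((r - min(rows), c - min(cols))
                                  for r, c in group)
                counts[shape] += 1
    return counts


img = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
patterns = white_patterns(img)
for shape, count in patterns.most_common():
    print(count, sorted(shape))
```

Sorting the counter by frequency gives the list of most common patterns, and a cutoff such as "more than 10 occurrences" is then a simple filter on the counts.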

A second program can find those patterns.  The pattern is all white, but we believe it should be black.  To observe what we are finding, we set the pixels to green.  This produces an output file like:

Notice that a lot of the previous white lines have been filled in.  Not all of the line is filled in, since we are requiring each pattern to be white pixels surrounded by black pixels.  If the white pixels are too close, we do not replace them.  And if the white line is just too big, we don't find anything at all.  And if the white line goes thru a vertical black line, just leaving a gap in the black line, we miss that too.

But if we now use these green pixels to drive a Hough transform, we may be able to better identify the lines, and then go back and fill in the missing white pixels.  We will try that next.

Oh, and if we just count the number of green pixels on a page and plot that, ordered by the number of pixels, tagging each file with its file name, it becomes clear that the problem is almost exclusively with the odd pages (the back sides) and not the even pages.


The sudden jump in the number of green pixels is from 20K green pixels for p128 to 40K green pixels for p019.  All of the even pages are below p128, and all of the odd pages (except one) are above p019.  The exception is p001 which has no green pixels.  p001 is the copyright page, which we had to process by hand to make visible.

But even the even pages can have thousands of green pixels, which when looked at closely show a lower level of the same photocopy problem.  So our techniques will be applied to all the pages.