Most of the panels have borders -- a line box drawn around the contents of the panel. So it seems that we should be able to recognize the border as a black line of some thickness, and fill in any missing pixels -- white pixels in the black border.
In particular, for those pages with little white lines from the copying process, if we find a set of white pixels on the left and the right, it suggests there is a little white line connecting them. That could help us repair those lines.
So the first problem is to recognize the border: where it is in the panel and its thickness. There are four sides to a panel: left, top, right, and bottom. The border should run along the outside next to each of these sides, as a continuous line of black pixels.
Except the border, as with all the art of the panels, was hand drawn, so it's not perfectly straight and uniform. We could, if we can recognize it, adjust it so that it was in fact a perfectly straight line of uniform width on all sides. But that would change the nature of the panel, and we don't want to do that; we want to maintain the hand-drawn nature of the borders.
So we wrote code that will scan, starting from the outside edge over any white pixels, until we find the black pixels that make up the border, and then scan over those black pixels to find the other side of the border, giving us the location and width of the border. Trying to match this simple model of the border reveals several problems:
(1) Not all panels have borders. For example, p152_1_3 has no border.
It turns out that a lot of our panels do not have borders. This is especially true at the very beginning, where each letter of a large title is a "panel" and none of them have borders.
(2) Not all panels stay inside the borders; see, for example, p016_3_3. So we may find stuff before the border, and the border may be interrupted by artwork lying "on top" of the border.
(3) Some of our "panels" are not just one panel. For example, p150_4_2 is actually two panels with a word balloon that spans between them, over the "gutter" between them.
(4) Notice that we then also have to allow for "gaps" in the border. If we have a file that is one or more panels that cannot be easily separated, such as p146_3_1, the border may not be complete or contiguous.
(5) Arriola has a style of a "panel" of words next to a panel of art instead of a large word balloon. These word panels do not have borders, and there are a lot of them.
(6) But these word panels are the speech of a particular character, and so there is the "tail" of the word balloon pointing at the character doing the speaking. This tail consists of a sideways V that pierces the border and points to the speaker. See, for example, p010_3_3 and the following art panel p010_3_4. These V-shaped tails can be on either the left or the right.
(7) This latter art panel, p010_3_4, also illustrates that the borders may have strange corners and ticks, as illustrated in the lower left corner.
Sometimes it is even more extreme, like p012_2_4 which has a noticeable corner tick in both the upper right and lower right.
(8) The border may run into the art. Consider p041_3_1, which shows the two characters in silhouette against a porthole. The "border" is obvious on the top, but as it runs down the sides and across the bottom it merges directly into the art.
(9) Some panels may be both borderless text and an art panel together, for example, p041_2_2. This is similar to point (3) above.
So first we try to isolate and identify those panels that are simply rectangles with solid borders (well, except for the V-shaped gaps for the tail of a word panel). This produces a list of 1276 panels. This should allow us to find the average width of the border when it is there.
Doing this, and then looking at the computed border widths, shows that they peak at about 19 pixels in width, largely between 15 and 26 pixels.
Except for a smaller peak at 3 or 4 pixels. For example, p043_3_1 has a left border of 3, a top border of 3, a right border of 4, and a bottom border of 7.
Of course, these numbers are only a summary. The borders vary from 1 to 429 because of effects such as (8) above, where the border and the art run together. For example, in p043_3_1, the right and bottom borders merge with the hair of the one character in the lower right. The top border, on the other hand, is not affected by these problems, but even it runs from 1 pixel in spots to 11 in another. The histogram of the various widths shows:
1 81
2 319
3 335
4 238
5 225
6 166
7 134
8 64
9 6
10 4
11 1
So the most common width is 3, but anywhere from 2 to 5 (possibly 7) is pretty much the same.
(This particular part of the book is describing a time travel story and maybe the thinner border is a reflection that these events are in the past.)
So we adopt the definition that the border starts with the first black pixel and is at most 20 to 22 pixels wide, but may be shorter.
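This definition can be sketched for a single scan line as follows. It is only an illustrative sketch in Python, not the code used for the numbers above: it assumes a row is a list in which 1 means black and 0 means white, and the name find_border and the default cap of 22 pixels are choices made for the example.

```python
BLACK = 1  # assumed pixel encoding: 1 = black, 0 = white


def find_border(row, max_width=22):
    """Scan one row from the outside edge (index 0) inward.

    Skips leading white pixels, then counts black pixels, stopping at the
    first white pixel or after max_width pixels, whichever comes first.
    Returns (start, width), or None if the row has no black pixel.
    """
    start = None
    for x, pixel in enumerate(row):
        if pixel == BLACK:
            start = x
            break
    if start is None:
        return None

    width = 0
    while (start + width < len(row)
           and row[start + width] == BLACK
           and width < max_width):
        width += 1
    return start, width
```

For the row [0, 0, 0, 1, 1, 1, 1, 0, 0, 1] this reports a border starting at position 3 with width 4. A row where the border runs into the art is cut off at the cap instead of reporting a width in the hundreds.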
Another property that we would expect is that, at the 600 dpi at which we are scanning the images, the border, while it may vary, will not vary by very much from one scan line to the next. We would expect it to often be the same, or maybe move one or two pixels in either direction. If we look at where the border begins, and then at where it begins for the next row or column, and look at the histogram of the difference between those two, we get numbers that vary from -89 to +58, but the peak (5,345,976) is at zero, and by plus/minus 3 the count is down to about 6,000.
Look at the histogram of differences.
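A histogram of this kind could be tallied along the following lines. This is a minimal sketch under stated assumptions, not the code that produced the counts above: it assumes the border start for each scan line of one side is already in a list, with None where no border was found, and the function name is made up for illustration.

```python
from collections import Counter


def start_difference_histogram(starts):
    """Count the differences between consecutive border start positions.

    starts holds one start position per scan line; None marks a scan
    line where no border was found, and such pairs are skipped.
    """
    hist = Counter()
    for prev, cur in zip(starts, starts[1:]):
        if prev is not None and cur is not None:
            hist[cur - prev] += 1
    return hist
```

Adding up the counters for every side of every panel would give one overall histogram, with the count for a difference of zero expected to dominate.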
We can use the outliers to identify issues with the panels. For example, the 58 value corresponds to p081_3_3, and shows that the quotes that belong in the following text panel have been incorrectly put in this panel.
If we pick those values that are plus/minus 20 pixels or more, we get a selection of 144 panels to examine. Making it plus/minus 30 reduces this to just 28. For example, checking p025_2_1 which has a value of -30 on the bottom, we find it is caused by a dust speck on the bottom of the panel, which we remove.
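Selecting the panels to examine could look something like the sketch below. It assumes a dictionary from panel name to its list of border start positions (None where no border was found); the function name, the dictionary layout, and the panel names in the usage note are hypothetical.

```python
def panels_with_large_jumps(starts_by_panel, threshold=20):
    """Return {panel: largest jump} for panels whose border start moves
    by at least threshold pixels between two consecutive scan lines."""
    flagged = {}
    for name, starts in starts_by_panel.items():
        worst = 0
        for prev, cur in zip(starts, starts[1:]):
            if prev is None or cur is None:
                continue
            diff = cur - prev
            if abs(diff) > abs(worst):
                worst = diff
        if abs(worst) >= threshold:
            flagged[name] = worst
    return flagged
```

Called on {"panel_a": [10, 10, 11, 10], "panel_b": [12, 12, 42, 42, 12]} with the default threshold, it flags only panel_b, with a jump of 30. Raising the threshold shortens the list, in the same way that going from 20 to 30 does above.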
While we are looking at p025_2_1, we notice that it seems skewed. If you follow the border from left to right across the top, it drops about 17 pixels, with 20 pixels on the right, 13 pixels across the bottom, and 24 on the left. This suggests that it is skewed -- not aligned to the page that it is on, but slightly rotated.
We can check for skew by using the outer set of pixels that define the border to define a line that is the linear least squares best fit for the border. We can then compare the slopes for the four sides, and if we find they are all of the same sign and roughly the same magnitude, then it suggests that the image is skewed.
images1/p025_2_1.gif: left border is slope -0.018, intercept 25.58
images1/p025_2_1.gif: bottom border is slope -0.009, intercept 32.83
images1/p025_2_1.gif: right border is slope -0.013, intercept 22.73
images1/p025_2_1.gif: top border is slope -0.012, intercept 18.21
images1/p025_2_1.gif: min slope=-0.018;max slope=-0.009;diff slope= 0.009 ###
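The fit and the comparison of slopes can be sketched as follows. This is an illustration only: it uses the textbook formulas for a least squares line through (x, y) points, and the thresholds in looks_skewed (a minimum slope of 0.005 and a maximum spread of 0.01) are assumptions picked for the example, not values taken from our code.

```python
def fit_line(points):
    """Least squares fit of y = slope * x + intercept to (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def looks_skewed(slopes, min_slope=0.005, max_spread=0.01):
    """True if all slopes share a sign, none is near zero, and they agree."""
    same_sign = all(s > 0 for s in slopes) or all(s < 0 for s in slopes)
    big_enough = all(abs(s) >= min_slope for s in slopes)
    spread = max(slopes) - min(slopes)
    return same_sign and big_enough and spread <= max_spread
```

With the four slopes reported for p025_2_1 (-0.018, -0.009, -0.013, -0.012), looks_skewed returns True under these assumed thresholds.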
Comparing the slopes directly like this is not technically correct, since obviously the top and bottom sides are perpendicular to the left and right. However, our code is written to only work on one side -- the left side of the panel. To get the top, right, and bottom sides, we rotate the original panel 90 degrees, and then run our code again. Rotate again for the right; rotate again for the top. This makes the computations for each side the same and gives comparable numbers for any of the statistics, like the slopes.
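The rotate-and-rerun idea can be sketched like this. It assumes the image is a list of rows and that each rotation is a clockwise quarter turn, which brings the bottom, then the right, then the top around to the left; the names rotate_cw and for_each_side, and the measure_left callback, are made up for the example.

```python
def rotate_cw(image):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]


def for_each_side(image, measure_left):
    """Run a left-side-only measurement on all four sides.

    measure_left is any function that examines the left side of the
    image it is given; the result for each side is stored by name.
    """
    results = {}
    current = image
    for side in ("left", "bottom", "right", "top"):
        results[side] = measure_left(current)
        current = rotate_cw(current)
    return results
```

Because every side is measured by the same code in the same orientation, the numbers that come back for the four sides can be compared directly.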
This rotation approach causes one difficulty. If we have rotated the image 90 degrees, then what the program is seeing as location (i,j) in the image is not what we see as (i,j) in the image. If we are printing the location of a pixel, we want to see it in its true position, not in its rotated position.
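One way to report true positions is to undo the rotations one quarter turn at a time. The sketch below assumes clockwise quarter turns and (row, column) coordinates counted from the upper left; the function name and its parameters are hypothetical.

```python
def to_true_position(r, c, turns, height, width):
    """Map (row, col) in the rotated image back to the original image.

    turns is the number of clockwise quarter turns that were applied;
    height and width are the rows and columns of the original image.
    """
    turns %= 4
    # Size of the rotated image: an odd number of turns swaps the two.
    rows, cols = (height, width) if turns % 2 == 0 else (width, height)
    for _ in range(turns):
        # Undo one clockwise turn: a pixel at (r, c) came from
        # row (cols - 1 - c), column r of the image before that turn.
        r, c = cols - 1 - c, r
        rows, cols = cols, rows
    return r, c
```

For an original image of 2 rows and 3 columns, the pixel the program sees at (0, 0) after one turn is the pixel at (1, 0), the lower left corner, of the original.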
In particular, if we compute a line that approximates each border, we should be able to compute the intersection of the lines for the left side and the top side to compute the corner pixel that defines where we should start working on the image -- we don't want to process the background that exists outside the border. But this is more difficult because the lines for the top and the left sides have been computed in different coordinate systems and will need to be translated to the true coordinate system to identify the spot where they intersect.
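A sketch of that translation for the upper left corner follows. It rests on several assumptions that would need checking against the real code: each fit gives the border column as a function of the row in the image the program was looking at; the left side is fitted in the unrotated image; and the top side is fitted after three clockwise quarter turns, so that a rotated (row, column) corresponds to true row = column and true column = width - 1 - row. The function name is made up.

```python
def top_left_corner(left_fit, top_fit, width):
    """Intersect the left and top border lines in true coordinates.

    left_fit = (a, b):  true col = a * row + b
    top_fit  = (s, t):  rotated col = s * (rotated row) + t
    width is the number of columns in the original image.
    Returns (row, col) as floats.
    """
    a, b = left_fit
    s, t = top_fit
    # Rewrite the top line in true coordinates as row = m * col + k.
    m = -s
    k = s * (width - 1) + t
    # Substitute into col = a * row + b and solve for col.
    denom = 1 - a * m
    if denom == 0:
        raise ValueError("border lines are parallel")
    col = (a * k + b) / denom
    row = m * col + k
    return row, col
```

For an unskewed panel with the left border at column 5 and the top border at row 3, this gives the corner (3.0, 5.0). The other three corners would need the matching formulas for the bottom and right sides.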