We have spent months, literally months, trying to fix the code that uses a Hough transform to find the little white lines introduced by the photocopying process, and have had more problems than we anticipated. First, our code does not find all the white lines, and second, it finds lines where there are no lines.
The first problem seems to arise because there are so many white pixels that there are lots of potential white lines. The second is that some parts of the images, taken in isolation, look like part of a white line when they are not.
But in working with the images, I noticed that if you look at the lines so that you can see the individual pixels, like:
You can see that the white lines tend to be represented by small groups of disconnected white pixels. It is only when you get a really big white line that you get long sections of white pixels. And the small groups of pixels seem to be in repeating patterns. For example in the above image, the last group of pixels in the last line is repeated earlier in that line, as well as in the line above it.
So it looks like, if we scan the images for small groups of white pixels (say, something that would fit in an 8x8 square) and count how many times those patterns are repeated in the images, then we can take the most common ones and, when we find them in an image, fill them in with black.
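As a rough sketch of that scan, assuming each page is a list of rows of 0 (black) and 1 (white), and assuming a "group" means an 8-connected cluster of white pixels (the names white_groups and count_patterns are made up for illustration, not taken from our actual program):

```python
from collections import Counter

def white_groups(image, max_size=8):
    """Yield each connected group of white pixels (1s) as a normalized pattern.

    image is a list of rows, each a list of 0 (black) / 1 (white).
    Groups whose bounding box exceeds max_size x max_size are skipped.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood fill (8-connected, an assumption) to collect one group.
            stack, group = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                group.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            top = min(y for y, _ in group)
            left = min(x for _, x in group)
            rows = max(y for y, _ in group) - top + 1
            cols = max(x for _, x in group) - left + 1
            if rows <= max_size and cols <= max_size:
                # Store offsets from the bounding-box corner so the same
                # shape at different positions compares equal.
                yield frozenset((y - top, x - left) for y, x in group)

def count_patterns(images, max_size=8):
    """Count how often each small white pattern occurs across all images."""
    counts = Counter()
    for image in images:
        counts.update(white_groups(image, max_size))
    return counts
```

Storing each pattern as a frozenset of offsets makes it hashable, so a Counter can tally repeats directly.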
In fact we can use the partially fixed images from using the Hough transform and diff them with the original images to get just bits that we believe are in little white lines. From that we can produce a list of the most frequent patterns, and then use those patterns on the original images to find a larger set of little white lines.
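The diff step might look something like this, again assuming 0/1 pixel grids of equal size; line_pixel_mask is a hypothetical helper name:

```python
def line_pixel_mask(original, fixed):
    """Return a grid with 1 wherever the original was white (1)
    and the Hough-fixed image is black (0), and 0 everywhere else."""
    return [
        [1 if o == 1 and f == 0 else 0 for o, f in zip(orow, frow)]
        for orow, frow in zip(original, fixed)
    ]
```

The resulting mask holds only the pixels the Hough pass decided to fill, so pattern counting can be run on it instead of on the full page.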
Writing a program to find the patterns produces thousands of potential patterns, but looking at the frequencies, most patterns occur in just a few places, fewer than 10. There are 776 patterns with more than 10 occurrences, and none of them is bigger than 7x7.
A second program can find those patterns. Each matched pattern is made of white pixels that we believe should be black. To see what we are finding, we set those pixels to green instead. This produces an output file like:
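Picking out the frequent patterns is then a simple filter. A minimal sketch, assuming the counts are a mapping from pattern (a set of (row, col) offsets) to number of occurrences; both function names are illustrative:

```python
def frequent_patterns(counts, min_occurrences=11):
    """Keep patterns seen at least min_occurrences times, most common first.

    The default of 11 corresponds to "more than 10 occurrences".
    """
    kept = [(p, n) for p, n in counts.items() if n >= min_occurrences]
    return sorted(kept, key=lambda item: item[1], reverse=True)

def bounding_box(pattern):
    """Return (rows, cols) of a pattern given as (row, col) offsets from 0."""
    return (max(y for y, _ in pattern) + 1, max(x for _, x in pattern) + 1)
```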
Notice that a lot of the previous white lines have been filled in. Not all of the line is filled in, since we are requiring each pattern to be white pixels surrounded by black pixels. If the white pixels are too close, we do not replace them. And if the white line is just too big, we don't find anything at all. And if the white line goes through a vertical black line, just leaving a gap in the black line, we miss that too.
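To make the "white pixels surrounded by black pixels" requirement concrete, here is a sketch of one way to match a single pattern. It assumes pixel values 0 (black), 1 (white) and 2 (green), a one-pixel black border around the pattern's bounding box, and that anything off the page counts as black; these are illustrative choices, and mark_pattern is a hypothetical name:

```python
WHITE, BLACK, GREEN = 1, 0, 2

def mark_pattern(image, pattern):
    """Return a copy of image with isolated occurrences of pattern set to GREEN.

    pattern is a set of (row, col) offsets of white pixels within its
    bounding box. An occurrence counts only if every other pixel in the
    bounding box, plus a one-pixel border around it, is black.
    """
    height, width = len(image), len(image[0])
    p_rows = max(y for y, _ in pattern) + 1
    p_cols = max(x for _, x in pattern) + 1
    out = [row[:] for row in image]

    def pixel(y, x):
        # Treat anything outside the page as black (an assumption).
        if 0 <= y < height and 0 <= x < width:
            return image[y][x]
        return BLACK

    for r in range(height - p_rows + 1):
        for c in range(width - p_cols + 1):
            # Compare the bordered window against the pattern exactly.
            matches = all(
                pixel(r + dy, c + dx) == (WHITE if (dy, dx) in pattern else BLACK)
                for dy in range(-1, p_rows + 1)
                for dx in range(-1, p_cols + 1)
            )
            if matches:
                for dy, dx in pattern:
                    out[r + dy][c + dx] = GREEN
    return out
```

Because the match must be exact, a two-pixel pattern sitting inside a longer run of white pixels is left alone, which is the "too big" case described above.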
But if we now use these green pixels to drive a Hough transform, we may be able to better identify the lines, and then go back and fill in the missing white pixels. We will try that next.
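For reference, a bare-bones Hough vote over a list of green pixel coordinates could be sketched as below. This is not our existing Hough code; the one-degree angle step and integer rounding of rho are assumptions, and hough_peak is a made-up name:

```python
import math
from collections import Counter

def hough_peak(points, n_angles=180):
    """Return ((theta_index, rho), votes) for the strongest line through points.

    Each point (y, x) votes for every line rho = x*cos(theta) + y*sin(theta)
    passing through it, with theta sampled in n_angles steps over [0, pi)
    and rho rounded to the nearest integer.
    """
    votes = Counter()
    for y, x in points:
        for i in range(n_angles):
            theta = math.pi * i / n_angles
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(i, rho)] += 1
    return votes.most_common(1)[0]
```

Since only the green pixels vote, the accumulator is not swamped by every white pixel on the page, and a line with gaps in it can still collect enough votes to stand out.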
Oh, and if we just count the number of green pixels on a page and plot that, ordered by the number of pixels, tagging each file with its file name, it becomes clear that the problem is almost exclusively with the odd pages (the back sides) and not the even pages.
The sudden jump in the number of green pixels is from 20K green pixels for p128 to 40K green pixels for p019. All of the even pages are below p128, and all of the odd pages (except one) are above p019. The exception is p001 which has no green pixels. p001 is the copyright page, which we had to process by hand to make visible.
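Counting and ordering the pages is easy to script. A small sketch, assuming green pixels are stored as the value 2 and that the per-page counts are kept in a dict keyed by page name (both function names are illustrative):

```python
def count_green(image, green=2):
    """Count the green pixels in one page (a list of rows of pixel values)."""
    return sum(row.count(green) for row in image)

def largest_jump(counts_by_page):
    """Sort pages by green count and return the adjacent pair with the biggest gap.

    The result is ((page_a, count_a), (page_b, count_b)), where page_b
    directly follows page_a in the sorted order.
    """
    ordered = sorted(counts_by_page.items(), key=lambda item: item[1])
    pairs = zip(ordered, ordered[1:])
    return max(pairs, key=lambda pair: pair[1][1] - pair[0][1])
```

Looking for the largest gap between neighbors in the sorted list is just one way to locate the jump automatically; reading it off the plot works equally well.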
But even the even pages can have thousands of green pixels, which, when looked at closely, show a lower level of the same photocopy problem. So our techniques will be applied to all the pages.