Tuesday, November 15, 2022

Cleaning up the scanned images

Having scanned the pages at 600 dpi in grayscale, and then converted them to pure black and white, we need to look at each image to see that it is good enough for printing.  Experience tells us that there will be stray black and white bits where they should not be.

So the first step is to clean up those little bits.  We have a program that looks for a single black bit in a field of white, or a single white bit in a field of black.  At 600 dpi, you cannot see something that small.  Once we have that, we can extend it to a small clump of bits in a field of the opposite color.  The question becomes: at what point does a small clump of bits become part of the image, instead of just extraneous bits?  My original thought was maybe 40 bits: any clump of 40 bits or fewer, surrounded by a field of the opposite color, can be erased without changing the image.
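
A minimal sketch of that clump-erasing idea, in Python.  This is not our actual program; the function name `remove_small_clumps`, the convention that 1 is black and 0 is white, and the choice to treat diagonal neighbors as part of the same clump are all assumptions made for illustration.

```python
from collections import deque

def remove_small_clumps(image, max_size=40):
    """Return a copy of image (rows of 0 = white, 1 = black) in which
    every same-colored clump of max_size bits or fewer is flipped to
    the opposite color."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0]:
                continue
            color = image[y0][x0]
            seen[y0][x0] = True
            clump = [(y0, x0)]
            queue = deque(clump)
            # Breadth-first search over the 8 neighbors of each bit.
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx]
                                and image[ny][nx] == color):
                            seen[ny][nx] = True
                            clump.append((ny, nx))
                            queue.append((ny, nx))
            # Small clumps are treated as noise and take the other color.
            if len(clump) <= max_size:
                for y, x in clump:
                    result[y][x] = 1 - color
    return result
```

Calling it with `max_size=1` gives the single-bit case; raising `max_size` gives the larger clumps.  Clumps are found on the original image and flipped in a copy, so erasing one clump does not change the size of its neighbors.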

But working on that premise, I noticed the following panel (page 9):


The nose on this person is:

So the right nostril is 10 pixels in a 5x3 rectangle.  The left nostril is fewer than 35 pixels.  If either of them were "cleaned up" as random noise, it would change the image significantly.  So it would seem that Arriola's drawings are meaningful at a very small number of bits.

Of course that is in the actual drawings -- the strips.  Outside the strips, any stray bits can safely be considered extraneous and cleaned up.  This leads us to another approach: if we identify the "background" white paper regions that are not part of the strips, then we can clean up larger spurious black bits there.  So we wrote a program to identify the background of the page.  It simply starts at the upper left corner and "floods" the page until it comes to a barrier, a set of black bits that surround the comic strips themselves.  If we turn these background bits green, to make it easy to see what we are classifying as background, we get pages like:
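
The flooding step can be sketched as a breadth-first flood fill.  This is an illustration rather than our actual program: the name `mark_background`, the pixel values (0 = white, 1 = black, 2 = background, standing in for green) and the use of only up/down/left/right neighbors are assumptions.  Using four neighbors keeps the flood from leaking through a diagonal gap in a black border.

```python
from collections import deque

WHITE, BLACK, BACKGROUND = 0, 1, 2

def mark_background(image, start=(0, 0)):
    """Return a copy of image in which every white bit reachable from
    start, without crossing a black bit, is relabeled BACKGROUND."""
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    if height == 0 or width == 0:
        return result

    sy, sx = start
    if result[sy][sx] != WHITE:
        return result  # the corner itself is black: nothing to flood

    result[sy][sx] = BACKGROUND
    queue = deque([(sy, sx)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < height and 0 <= nx < width
                    and result[ny][nx] == WHITE):
                result[ny][nx] = BACKGROUND
                queue.append((ny, nx))
    return result
```

White bits inside a closed black border, such as the inside of a panel, are never reached and stay white, which is what lets us treat them differently from the paper around the strips.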


The smallest group of meaningful bits that are not background would be individual panels, or letters, or the page numbers.  But even a page number, like the "1" on this page, is 100x300 pixels.


The smallest meaningful character would be a period, which would be 40x40 pixels.  But in the background areas, we could reasonably clean up anything under, say, 15x15 pixels.
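
One way to apply that 15x15 rule is sketched below.  It assumes a page already labeled with 0 = white, 1 = black and 2 = background, and it treats a black clump as "in the background" when it directly touches a background bit; the name `clean_background_specks`, that test, and the use of the clump's bounding box are all assumptions, not a description of our program.

```python
from collections import deque

WHITE, BLACK, BACKGROUND = 0, 1, 2

def clean_background_specks(labeled, max_side=15):
    """Return a copy of labeled in which black clumps that touch the
    background and fit in a box smaller than max_side x max_side are
    relabeled BACKGROUND."""
    height = len(labeled)
    width = len(labeled[0]) if height else 0
    result = [row[:] for row in labeled]
    seen = [[False] * width for _ in range(height)]

    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or labeled[y0][x0] != BLACK:
                continue
            seen[y0][x0] = True
            clump = [(y0, x0)]
            queue = deque(clump)
            touches_background = False
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if not (0 <= ny < height and 0 <= nx < width):
                            continue
                        value = labeled[ny][nx]
                        # Only side-by-side contact counts as touching.
                        if value == BACKGROUND and (dy == 0 or dx == 0):
                            touches_background = True
                        elif value == BLACK and not seen[ny][nx]:
                            seen[ny][nx] = True
                            clump.append((ny, nx))
                            queue.append((ny, nx))
            ys = [y for y, _ in clump]
            xs = [x for _, x in clump]
            box_height = max(ys) - min(ys) + 1
            box_width = max(xs) - min(xs) + 1
            if (touches_background
                    and box_height < max_side and box_width < max_side):
                for y, x in clump:
                    result[y][x] = BACKGROUND
    return result
```

A speck inside a panel touches only white bits, so it is left alone, and anything as large as a period or a page number fails the size test.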

At the same time, we ran across several pages where the holes for the binding went through part of the drawing -- some 20 pages had problems with the holes.  We used a bit editor (GIMP) to go in and fix up those pages.  For example, page 89 was scanned as:

but using the bit editor, we restored it by hand to:

