Day-to-day thoughts, technical information, a random grab-bag of thoughts, discoveries, and interesting little tidbits.

Sunday, October 4, 2009

Cleaning up Scanned Documents with the GIMP

Recently, I've decided that I don't really want to keep a lot of paper documents lying around, such as bills and receipts, but I also don't want to lose the information they contain.  So, I decided to scan them into my computer, and store them as images, eventually to be converted into PDF format.

After scanning in the first document, I came across the problem that no doubt a lot of you encounter: the scanned document came out off-kilter, no matter how determinedly I aligned the paper on the glass.  Additionally, the document's background was inconsistently shaded at best.  Of course, there was also the ever-present black border where the paper did not cover the glass.  Obviously, this would not do.

After about an hour of experimenting, I put together a method of cleaning up documents to the point that they are cleanly reproduced in the image file, so cleanly that even signatures look realistic.  This article describes how these scanning problems arise, and how they can be corrected.  For simplicity's sake, I will describe this using the free Photoshop workalike known as the GNU Image Manipulation Program (or Gimp, as it's usually shortened to).

Common Issues and their Causes

The following image shows the basic layout of a scanner and the document placed in it.

The blue box at the right is the scanner's lamp and imaging system; it goes over the glass at a consistent rate, and constantly takes snapshots of its location as it goes.  Under perfect conditions, where the paper is perfectly aligned and flat, where the lamp is the only source of light there is, and where the lamp/sensor itself is perfectly aligned, this can result in a perfectly clean scan with no artifacts or strange bending.

However, such perfect scanners generally cost megabucks, so the typical scanner is merely "close enough": the scan is clean, clear, not too badly warped, and mostly legible.  The developers of such scanners expect the cleanup to be done in software, so their tolerances can be looser than those of the more expensive devices.

You'll notice with the above pictures that the blue lamp/sensor device isn't completely aligned with the paper.  It will still scan straight, but it results in one of the more common faults of scanned documents: shear.  This means that one side of the scan will be captured earlier than the other side, causing the whole document to look crooked, as the above illustration shows.  What makes this a bit more annoying is that rotation alone won't fix the problem, as it will simply rotate the text as well as the lines and borders.
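To see why a rotation can't undo a shear, here is a minimal Python sketch.  It assumes a simplified model in which the scanner's misalignment is a pure vertical shear (every point is pushed up or down in proportion to its horizontal position); the function names and the 10-pixel figure are made up for illustration.

```python
import math

def shear_y(point, k):
    """Vertical shear: move a point up or down in proportion to its x position."""
    x, y = point
    return (x, y + k * x)

def rotate(point, angle_rad):
    """Rotate a point about the origin."""
    x, y = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (x * c - y * s, x * s + y * c)

# Assumed distortion: a text baseline drifts 10 px over a 1000 px wide page.
k = 10 / 1000
baseline_right = (1000.0, 10.0)      # where the point (1000, 0) ended up
left_margin_bottom = (0.0, 1000.0)   # a point on the still-vertical left margin

# Undoing the shear levels the baseline and leaves the margin vertical.
fixed_right = shear_y(baseline_right, -k)
fixed_bottom = shear_y(left_margin_bottom, -k)

# Rotating by the matching angle also levels the baseline, but tilts the margin.
angle = -math.atan(k)
rot_right = rotate(baseline_right, angle)
rot_bottom = rotate(left_margin_bottom, angle)

print(fixed_right, fixed_bottom)
print(rot_right, rot_bottom)
```

In this model the rotated margin ends up roughly 10 pixels out of vertical at the bottom of the page, which is the "rotate the text as well as the lines and borders" effect described above.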

Another problem that often pops up is the strange "fog" of color that warps over the image, usually marbled based on imperfections in the paper; the lamp puts out a steady light that is reflected off the paper and returned to the sensor on the assembly.

 As you can see, the first example shows a perfect scan, if the paper is perfectly flat and even.  The second example is a bit more realistic; paper consists of millions of miniature bumps, and is usually something less than perfectly flat, especially if it was handled in some way, and definitely if it was recently folded or crumpled.  In this case, those areas that are further from the glass (and lamp) become slightly darker, and the texture of the paper itself can affect the darkness somewhat.  This is the source of the colored (or gray) "fog" that can cover a good portion of the paper.

Example Document Used

The following document will be used for the examples; it was scanned using the green channel at 300dpi in 8-bit grayscale.  It was originally part of a larger but mostly obsolete document I had stored in a 3-ring binder for a while (for those curious, it was from the "Encrypted Root Filesystem HOWTO" website).
The reasons for my scanning choices are as follows:
  • 300dpi is the typical resolution for a mid-grade laser printout of a document.
  • The green channel generally has the fewest defects (less splotchy fog, and smoother, fainter fog).
  • 8-bit grayscale covers 256 shades of gray, which makes an excellent starting point toward clearing the fog.
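
If your scanning software only hands you a full-color image, keeping just the green channel can be sketched in Python as below.  This assumes the image is already loaded as rows of (red, green, blue) tuples with values from 0 to 255; green_channel is a made-up helper name, not part of any scanner software, and the pixel values are invented.

```python
def green_channel(rgb_rows):
    """Keep only the green value of each (r, g, b) pixel, giving 8-bit grayscale."""
    return [[g for (_r, g, _b) in row] for row in rgb_rows]

# A tiny 2x2 "scan": paper white, a faintly pink smudge, dark text, mid gray.
scan = [
    [(250, 248, 245), (240, 225, 228)],
    [(30, 28, 35), (128, 128, 128)],
]
gray = green_channel(scan)

# 8 bits per pixel gives 2**8 gray levels, numbered 0 (black) to 255 (white).
levels_available = 2 ** 8

print(gray, levels_available)
```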

Correcting Document Shear

The first issue that will need to be resolved is the straightening of the document.  This is not very hard to accomplish, but it does require some patience and practice... and a little help with the guide and shear tools in the Gimp.

The first step involves a calculation: You need to determine how wide your total scan is, and then place a vertical guide at the exact (or bordering) halfway point.  Then, you need to locate a line long enough on the document to compare ends, and align a horizontal guide exactly where the vertical guide and the document's line meet:

In the case of my example, my flatbed scanner has a maximum width (at 300dpi) of 2563 pixels, so 1281 would be an appropriate location for the cross.  Also, notice the dotted blue line going across.  On the far left and far right of that border, the blue line is above and below it, respectively.
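
The arithmetic behind the guides can be sketched in a few lines of Python.  The 2563-pixel width comes from my scanner; the two end points of the document's line are invented measurements, and the helper names are hypothetical.  How the resulting drift maps onto the number in the shear dialog is something to check against the preview rather than assume.

```python
def guide_center(width_px):
    """Column for the vertical guide: the halfway point, rounded down."""
    return width_px // 2

def line_slope(left_end, right_end):
    """Rise over run of a line that should be horizontal (y grows downward)."""
    (x0, y0), (x1, y1) = left_end, right_end
    return (y1 - y0) / (x1 - x0)

def drift_across(width_px, slope):
    """How far the line wanders vertically over the full scan width."""
    return slope * width_px

scan_width = 2563
center = guide_center(scan_width)

# Assumed measurements: the line sits lower at the left than at the right.
slope = line_slope((100, 512), (2500, 488))
drift = drift_across(scan_width, slope)

print(center, slope, drift)
```

A negative drift here just means the right end of the line is higher than the left end, so the correction has to push the right side down (or the left side up).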

Now that the guides have been placed, we can use the shear tool in the Gimp.  In the toolbox on the right, look for the shear tool:

Up will pop the shear dialog box, which contains two options and four buttons:

 When you make changes to the values in the magnitude options, the displayed image will automatically update as a preview.  The easiest way to ensure you don't overshoot the mark is to use the up and down arrows to the right of the entry fields to take the changes one step at a time.  In this case, we want to change the "Shear magnitude Y."

The above picture shows the before; here you see the shear box with both 0 entries, and the image below with the two guides.  Notice that the horizontal guide shows just how bad the shear is on the document.  In the following image, however, all of this is corrected:

Here, you can see that the lines are now matched up, and since the guide lines are always perfectly horizontal and vertical, you know that the document has been corrected.  Something to keep in mind, however: when you try this, you'll notice that while you're doing the shearing, the lines will become jagged and uneven.  Once the process is completed, the jagged lines will be corrected, and everything will be smooth and straight.
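
For the curious, here is a rough Python sketch of what a vertical shear does to the pixels.  A plausible explanation for the jagged preview (an assumption, not verified against the Gimp's source) is that the preview moves whole pixels while the final result blends neighboring pixels; the smooth flag below switches between those two behaviors.  The function name and the tiny test page are made up.

```python
import math

def shear_columns(img, k, fill=255, smooth=True):
    """Slide each column of a grayscale image vertically by k * (x - center).

    Positive k moves columns right of center downward.  With smooth=True the
    two nearest source pixels are blended; otherwise the nearest one is copied.
    """
    height, width = len(img), len(img[0])
    cx = width // 2

    def sample(x, y):
        # Anything that slides in from outside the image is filled with paper white.
        return img[y][x] if 0 <= y < height else fill

    out = [[fill] * width for _ in range(height)]
    for x in range(width):
        shift = k * (x - cx)
        for y in range(height):
            src = y - shift  # where this output pixel comes from
            if smooth:
                y0 = math.floor(src)
                frac = src - y0
                value = sample(x, y0) * (1 - frac) + sample(x, y0 + 1) * frac
                out[y][x] = round(value)
            else:
                out[y][x] = sample(x, round(src))
    return out

# 5x5 white page with a dark line sloping down one row per column.
page = [[0 if y == x else 255 for x in range(5)] for y in range(5)]
fixed = shear_columns(page, k=-1)

# A half-pixel shift blends black and white into gray instead of a hard step.
soft = shear_columns([[255, 255, 0], [255, 255, 255], [255, 255, 255]], k=0.5)

for row in fixed:
    print(row)
```

After the shear, the sloping line in the test page lies entirely on the middle row, which is the same thing the horizontal guide confirms in the real document.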

Correcting "Scanner Fog"

Now, you'll notice that the image does have a hazy gray color over most of the surface.  Most of the time, the haze is paler than anything else on the page, which means that cleaning it up is just a levels tool away.

The levels tool is designed to change how light the lights are, and how dark the darks are.  You can activate the levels tool by going to the "Colors" menu and clicking on "Levels..."  You will then see the dialog box shown below:

Most of what you see on this dialog box is unimportant for our purposes; all we're interested in is the histogram and the input level controls.  The histogram is the big white box above, with the black "mountain" shown in it.  Beneath it, you see three triangles under a gradient.  We'll call those triangles the blackpoint, greypoint, and the whitepoint, as they mark where the black, middle gray, and white are in the image.

The histogram is essential to the task of cleaning the fog.  You'll notice the mountain showing up at the far right of the histogram.  That mountain sits just above the very light gray part of the grayscale, which means that each of those shades of gray has a lot of pixels of that brightness.  At this point, you probably realize that the mountain describes the pixels in the scanner fog, which is usually in that brightness range, as compared with the actual text, which is considerably darker.
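
If you'd like to see what the histogram is actually counting, here is a small Python sketch.  It assumes an 8-bit grayscale image stored as rows of integers; the sample patch and the helper names are invented for illustration.

```python
def histogram(gray_rows):
    """Count how many pixels have each gray level from 0 to 255."""
    counts = [0] * 256
    for row in gray_rows:
        for value in row:
            counts[value] += 1
    return counts

def tallest_bin(counts, lo, hi):
    """Gray level with the most pixels in the inclusive range lo..hi."""
    return max(range(lo, hi + 1), key=lambda level: counts[level])

# Made-up 3x6 patch: mostly light-gray fog near 230, plus a few dark text pixels.
patch = [
    [231, 230, 229, 230, 40, 35],
    [230, 232, 230, 228, 38, 230],
    [229, 230, 231, 230, 230, 233],
]
counts = histogram(patch)
fog_level = tallest_bin(counts, 128, 255)  # peak of the "mountain" in the light half

print(fog_level, counts[fog_level])
```

The tall cluster of counts around the fog level is the mountain; the handful of counts down near 40 is the text.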

Underneath the histogram, we have the grayscale with three triangles.  The outer two set the level at or below which everything becomes pure black, and the level at or above which everything becomes pure white.  To fix the problem with the scanner fog, we need to move the white triangle to the point just left of the mountain, which will turn all the pixels of scanner fog pure white, essentially losing those pixels as data (don't click OK just yet):

But, now we have a new problem: the text in the document has become quite faint!  Well, that can be corrected as well.  Notice the low, long hill on the left side of the histogram?  Those are the pixels that contain the content, much of which was black on paper.  We just need to move the black slider to the right so that hill is above the black region at the bottom:

As you can see, the document has been darkened again, but the fog is still gone.  The "fog mountain" in the histogram is now above the pure white region, making all "fog" pixels white, while the "content hill" is now completely above the pure black region, meaning that much of the content is back to black.  At this point, the corrections are mostly done.  To remove the punch-holes on the left-hand side, a quick swipe of white paint will fix them in no time.
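
The two slider moves described above amount to a simple remapping of gray levels, which can be sketched in Python.  This is a simplified model that ignores the middle (gray) triangle and works on one pixel value at a time; the black point of 60 and white point of 220 are made-up settings, not the ones from my document.

```python
def apply_levels(value, black_point, white_point):
    """Stretch gray levels so black_point maps to 0 and white_point to 255.

    Anything at or below black_point becomes pure black; anything at or
    above white_point becomes pure white.
    """
    if white_point <= black_point:
        raise ValueError("white_point must be above black_point")
    scaled = (value - black_point) / (white_point - black_point)
    scaled = min(1.0, max(0.0, scaled))  # clip to the 0..1 range
    return round(scaled * 255)

black_point, white_point = 60, 220

fog_pixel = apply_levels(230, black_point, white_point)   # right of the white triangle
text_pixel = apply_levels(40, black_point, white_point)   # left of the black triangle
edge_pixel = apply_levels(100, black_point, white_point)  # in between: stays gray

print(fog_pixel, text_pixel, edge_pixel)
```

Pixels between the two points keep their relative order, so the soft gray edges of letters and signatures survive while the fog and the solid text are pushed to the extremes.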

Now, you'll notice that the borders are a little off, and, of course, the black border is still there.  Cropping is the word for this occasion: the crop tool deletes everything outside of a rectangular region you select.  Simply open the tool, adjust the borders to match where you want the page cut, and voilà.  To begin, you need to select the crop tool from the toolbox:

Then click and drag from one corner of your desired selection to the other, and use the boxes along the edges and over the corners to resize the selection.  I also recommend using the zoom function if you want to crop away only as much as is needed to clear the outer edges:

To complete the cropping process, you simply need to click on the inside of the crop selection, and it is done.
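
In code terms, cropping is nothing more than keeping a rectangle of pixels and throwing the rest away.  The Python sketch below shows that, along with a hypothetical helper that guesses the rectangle by finding where the light paper ends and the black border begins.  That guess is my own illustration, not something the crop tool does for you in the steps above, and the threshold of 50 is an arbitrary assumption.

```python
def crop(gray_rows, left, top, right, bottom):
    """Keep columns left..right-1 and rows top..bottom-1; discard the rest."""
    return [row[left:right] for row in gray_rows[top:bottom]]

def paper_bounds(gray_rows, dark_below=50):
    """Smallest rectangle holding every pixel at least as light as dark_below."""
    xs, ys = [], []
    for y, row in enumerate(gray_rows):
        for x, value in enumerate(row):
            if value >= dark_below:
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("no light pixels found")
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1

# Made-up 4x5 scan: a 3x3 page (one black text pixel) with a black border
# along the right and bottom where the paper did not cover the glass.
scan = [
    [255, 255, 255, 0, 0],
    [255, 0, 255, 0, 0],
    [255, 255, 255, 0, 0],
    [0, 0, 0, 0, 0],
]
bounds = paper_bounds(scan)
page = crop(scan, *bounds)

print(bounds)
for row in page:
    print(row)
```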

You might notice that the gray regions are not all that even.  There can be several reasons for this, but the main one is often that the printer that produced the page was not perfect either, putting more ink or toner into some dots or regions than others.  In this document's case, the printer was getting near the time when fresh toner would be needed.  Even though these techniques cleaned up the document quite nicely, remember that unless you actively edit the document yourself, the best you can get on the screen is exactly what was on the paper you scanned.

Well, enjoy this technique, and good luck!