After scanning in the first document, I came across a problem that no doubt a lot of you encounter: the scanned document came out off-kilter, no matter how determinedly I aligned the paper on the glass. Additionally, the document's background was inconsistently shaded at best. Of course, there was also the ever-present black border where the paper did not cover the glass. Obviously, this would not do.
After an hour's worth of attempts, I started to put together a method of cleaning up documents so that they would be cleanly replicated in the image file, to the point that even signatures could look realistic. This article describes how these problems happen, and how they can be corrected. For simplicity's sake, I will describe this using the free Photoshop workalike known as the GNU Image Manipulation Program (or Gimp, as it's usually shortened to).
Common Issues and their Causes
The following image shows the basic layout of a scanner and the document placed in it.
The blue box at the right is the scanner's lamp and imaging system; it goes over the glass at a consistent rate, and constantly takes snapshots of its location as it goes. Under perfect conditions, where the paper is perfectly aligned and flat, where the lamp is the only source of light there is, and where the lamp/sensor itself is perfectly aligned, this can result in a perfectly clean scan with no artifacts or strange bending.
However, such perfect scanners generally cost in the megabucks, so the typical scanner is generally "close enough," in that the document is clean, clear, not too badly warped, and mostly legible. The developers of such scanners expect the cleanup to be done in software, so the tolerances can be a bit looser than those of the more expensive devices.
You'll notice with the above pictures that the blue lamp/sensor device isn't completely aligned with the paper. It will still scan straight, but it results in one of the more common faults of scanned documents: shear. This means that one side of the scan will be captured earlier than the other side, causing the whole document to look crooked, as the above illustration shows. What makes this a bit more annoying is that rotation alone won't fix the problem, as it will simply rotate the text as well as the lines and borders.
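To make the difference between shear and rotation concrete, here is a minimal sketch in Python. The page corners, the shear factor `k`, and the function names are all made up for illustration; this is not how the Gimp implements its tools. It shears an ideal page the way a misaligned sensor might, then tries to repair it both ways.

```python
import math

def shear_y(point, k):
    """Vertical shear: shift y in proportion to the x position."""
    x, y = point
    return (x, y + k * x)

def rotate(point, angle):
    """Rotate a point about the origin by angle (radians)."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

# Hypothetical page corners: top-left, top-right, bottom-left, bottom-right
page = [(0.0, 0.0), (1000.0, 0.0), (0.0, 500.0), (1000.0, 500.0)]

k = 0.01  # assumed drift: 10 px over a 1000 px width
scanned = [shear_y(p, k) for p in page]

# Repair 1: the opposite shear puts every corner back where it started
unsheared = [shear_y(p, -k) for p in scanned]

# Repair 2: rotate until the top edge is level
rotated = [rotate(p, -math.atan(k)) for p in scanned]
top_edge_drop = rotated[1][1] - rotated[0][1]   # essentially zero: top is level
left_edge_lean = rotated[2][0] - rotated[0][0]  # about 5 px: left edge now leans
```

The rotation levels the top edge but tilts the left edge, which was vertical before; the opposite shear restores both.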
Another problem that often pops up is the strange "fog" of color that warps over the image, usually marbled based on imperfections in the paper; the lamp puts out a steady light that is reflected off the paper and returned to the sensor on the assembly.
As you can see, the first example shows a perfect scan, if the paper is perfectly flat and even. The second example is a bit more realistic; paper consists of millions of miniature bumps, and is usually something less than perfectly flat, especially if it was handled in some way, and definitely if it was recently folded or crumpled. In this case, those areas that are further from the glass (and lamp) become slightly darker, and the texture of the paper itself can affect the darkness somewhat. This is the source of the colored (or gray) "fog" that can cover a good portion of the paper.
Example Document Used
The following document will be used for the examples; it was scanned using the green channel at 300dpi with an 8-bit grayscale. It was originally part of a larger, but mostly-obsolete document I had stored in a 3-ring binder for a while (for those curious, it was from the "Encrypted Root Filesystem HOWTO" website).
- 300dpi is the typical resolution for a mid-grade laser printout of a document.
- The green channel generally has the fewest defects (less splotchy fog, and smoother, fainter fog).
- The 8-bit grayscale covers 256 shades of gray, from black (0) to white (255), which makes an excellent starting point toward clearing the fog.
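For those who like to see the numbers, the scan settings above can be sketched in a few lines of Python. The tiny 2x2 "scan" and the helper names are hypothetical; a real scan would come from the scanner software or an image library.

```python
def inches_to_pixels(inches, dpi=300):
    """Pixel count along one side of a page scanned at the given resolution."""
    return round(inches * dpi)

def green_channel(rgb_rows):
    """Keep only the green component of each (r, g, b) pixel."""
    return [[g for (_r, g, _b) in row] for row in rgb_rows]

# 8 bits per pixel gives 2**8 gray levels, from 0 (black) to 255 (white)
GRAY_LEVELS = 2 ** 8

# A letter-size page is 8.5 inches wide
letter_width_px = inches_to_pixels(8.5)

# A made-up 2x2 color scan: light paper, dark ink
scan = [[(250, 240, 235), (20, 18, 25)],
        [(245, 238, 230), (255, 255, 255)]]
gray = green_channel(scan)
```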
Correcting Document Shear
The first issue that will need to be resolved is the straightening of the document. This is not very hard to accomplish, but it does require some patience and practice... and a little help with the guide and shear tools in the Gimp.
The first step involves a calculation: You need to determine how wide your total scan is, and then place a vertical guide at the exact (or bordering) halfway point. Then, you need to locate a line long enough on the document to compare ends, and align a horizontal guide exactly where the vertical guide and the document's line meet:
In the case of my example, my flatbed scanner has a maximum width (at 300dpi) of 2563 pixels, so 1281 would be an appropriate location for the cross. Also, notice the dotted blue line going across. On the far left and far right of that border, the blue line is above and below it, respectively.
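The guide placement is simple arithmetic, and the same measurements can give a rough first guess for the shear. Below is a minimal sketch; the line endpoints are invented, and how the resulting number maps onto the Gimp's shear field (its sign and scale) is an assumption on my part, so treat it as a starting value to nudge, not an answer.

```python
def center_guide(scan_width_px):
    """Position of the vertical guide at (or bordering) the halfway point."""
    return scan_width_px // 2

def estimate_shear_offset(left_end, right_end, scan_width_px):
    """Vertical drift of a nominally horizontal line, extrapolated to the
    full scan width. Endpoints are (x, y) pixel positions, y growing downward."""
    (x0, y0), (x1, y1) = left_end, right_end
    slope = (y1 - y0) / (x1 - x0)
    return slope * scan_width_px

guide_x = center_guide(2563)

# Hypothetical endpoints of a ruled line that rises toward the right
offset = estimate_shear_offset((200, 1010), (2300, 989), 2563)
```

A negative result here just means the line climbs as it goes right; the correction has to push it the other way.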
Now that the guides have been placed, we can use the shear tool in the Gimp. In the toolbox on the right, look for the shear tool:
When you make changes to the values in the magnitude options, the displayed image will automatically update as a preview. The easiest way to ensure you don't overshoot the mark is to use the up and down arrows to the right of the entry fields to take the changes one step at a time. In this case, we want to change the "Shear magnitude Y."
The above picture shows the before; here you see the shear box with both 0 entries, and the image below with the two guides. Notice that the horizontal guide shows just how bad the shear is on the document. In the following image, however, all of this is corrected:
Here, you can see that the lines are now matched up, and since the guide lines are always perfectly horizontal and vertical, you know that the document has been corrected. Something to keep in mind, however: when you try this, you'll notice that while you're doing the shearing, the lines will become jagged and uneven. Once the process is completed, the jagged lines will be corrected, and everything will be smooth and straight.
Correcting "Scanner Fog"
Now, you'll notice that the image does have a hazy gray color over most of the surface. Most of the time, the haze is paler than anything else on the page, which means that cleaning it up is just a levels tool away.
The levels tool is designed to change how light the lights are, and how dark the darks are. You can activate the levels tool by going to the "Colors" menu and clicking on "Levels..." You will then see the dialog box shown below:
Most of what you see on this dialog box is unimportant for our purposes; all we're interested in is the histogram and the input level controls. The histogram is the big white box above, with the black "mountain" shown in it. Beneath it, you see three triangles under a gradient. We'll call those triangles the blackpoint, greypoint, and the whitepoint, as they mark where the black, middle gray, and white are in the image.
The histogram is essential to the task of cleaning the fog. You'll notice the mountain showing up at the far right of the histogram. That mountain happens to be just above the very light gray part of the grayscale. This means that each of those shades of gray has a lot of pixels of that brightness. At this point, you probably realize that the mountain describes the pixels in the scanner fog, which is usually in that brightness range, as compared with the actual text, which is considerably darker.
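If you're curious what the histogram actually counts, here is a minimal sketch in Python. The pixel values are invented to mimic a foggy page with a little text on it; the function names are mine, not the Gimp's.

```python
def histogram(gray_rows, levels=256):
    """Count how many pixels have each brightness value."""
    counts = [0] * levels
    for row in gray_rows:
        for value in row:
            counts[value] += 1
    return counts

def peak_level(counts):
    """Brightness value with the most pixels: the top of the 'mountain'."""
    return max(range(len(counts)), key=lambda level: counts[level])

# Made-up scan: mostly light-gray fog (around 230), a few dark text pixels
foggy = [[230] * 6 + [20] * 2,
         [228] * 3 + [230] * 4 + [25]]

counts = histogram(foggy)
fog_peak = peak_level(counts)
```

On a real scan, the peak lands in the light grays for the same reason: far more of the page is fogged paper than ink.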
Underneath the histogram, we have the grayscale with three triangles. Those triangles adjust where the black color starts lightening up, and where the white color begins to darken. To fix the problem with the scanner fog, we need to move the white triangle (the whitepoint) to the point just left of the mountain, which will turn all the pixels of scanner fog pure white, essentially losing those pixels as data (don't click OK just yet):
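In code terms, moving the whitepoint amounts to a simple remapping of every pixel. The sketch below assumes a plain linear mapping and ignores the gray (gamma) triangle and the output levels that the real dialog also offers; the whitepoint of 215 is an example value chosen to sit just left of a fog peak around 230.

```python
def apply_levels(value, black=0, white=255):
    """Linear input-levels mapping: 'black' and below become 0,
    'white' and above become 255, everything between is stretched."""
    if value <= black:
        return 0
    if value >= white:
        return 255
    return round((value - black) * 255 / (white - black))

whitepoint = 215  # assumed position, just left of the fog "mountain"

fog_pixel = apply_levels(230, white=whitepoint)   # fog is pushed to pure white
text_pixel = apply_levels(20, white=whitepoint)   # dark text stays dark
```

Anything at or above the whitepoint ends up at 255, which is exactly why the fog's detail is lost for good once you accept the change.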
Now, you'll notice that the borders are a little off, and, of course, the black border is still there. Cropping is the word for this occasion: the crop tool deletes everything outside of a rectangular region you select. Simply open the tool, adjust the borders to match where you want the page cut, and voila. To begin, you need to select the crop tool from the toolbox:
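Underneath, a crop is nothing more than keeping a rectangle of pixels. Here is a minimal sketch; the `bright_bounds` helper, which guesses the rectangle from where the light paper sits, is my own hypothetical convenience and not something the crop tool does for you in the steps described here.

```python
def crop(gray_rows, left, top, right, bottom):
    """Keep pixels with left <= x < right and top <= y < bottom."""
    return [row[left:right] for row in gray_rows[top:bottom]]

def bright_bounds(gray_rows, threshold=128):
    """Bounding box (left, top, right, bottom) of pixels brighter than
    threshold, i.e. a rough guess at where the paper is."""
    xs = [x for row in gray_rows for x, v in enumerate(row) if v > threshold]
    ys = [y for y, row in enumerate(gray_rows) if any(v > threshold for v in row)]
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1

# Made-up scan: a white page with one dark text pixel, and a black
# border along the right and bottom where the glass was uncovered
scan = [[255, 255, 255, 0],
        [255,  30, 255, 0],
        [255, 255, 255, 0],
        [  0,   0,   0, 0]]

bounds = bright_bounds(scan)
page = crop(scan, *bounds)
```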
Well, enjoy this technique, and good luck!