[sane-devel] scanning for archival and OCR
David H. Durgee
dhdurgee at comcast.net
Tue Jan 22 17:34:05 UTC 2013

Now that I have my Canon PIXMA MG2120 working as a scanner, I am looking
at the possibility of using it to scan some documents for archival
purposes. My intent is to scan some old bills and other documents and
store them in a database of some kind before disposing of the originals.
As these are mostly bills and similar documents, they tend to be B&W, with
the possible exception of some logos or a letterhead or footer in color.

I am trying to determine how best to scan and save these documents. I
would like to be able to reprint them in the future if that becomes
necessary, which implies a high-quality scan. But I am also concerned with
the size of the saved documents on my system. What is the best file
format for saving such documents? I would imagine that 16 or possibly
even 8 colors would suffice to represent the documents, assuming the
palette can be specified. The documents are likely to be mostly
white space, so I would expect them to be highly compressible.
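
To put rough numbers on the size concern, here is a back-of-the-envelope sketch of the uncompressed size of one page at several bit depths. The page size (US Letter) and the 300 dpi resolution are assumptions for illustration only, and the figures say nothing about how well any particular format would compress the result.

```python
# Rough uncompressed size of one scanned page at several bit depths.
# Assumptions (for illustration only): US Letter paper, 300 dpi.

def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes, padding each row to a whole byte."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px

# Bit depths matching the palettes discussed above.
DEPTHS = {
    "B&W (1 bit)": 1,
    "16 colors (4 bits)": 4,
    "256 levels (8 bits)": 8,
    "full color (24 bits)": 24,
}

def size_table(width_in=8.5, height_in=11.0, dpi=300):
    """Map each palette label to its raw size in bytes."""
    return {
        label: raw_size_bytes(width_in, height_in, dpi, bits)
        for label, bits in DEPTHS.items()
    }

if __name__ == "__main__":
    for label, size in size_table().items():
        print(f"{label:22s} {size / 1e6:6.2f} MB")
```

Under these assumptions a 16-color page needs one sixth of the raw storage of a full-color page (4 versus 24 bits per pixel), before any compression is applied.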

Since I want to store them in a database, I would also think that using
an OCR program to extract the content would be useful. I tried a few
things with that on a small document and ran into problems because of
that document's mixed content. Are there any specific recommendations
for scanning a document with OCR in mind? As I am running Linux, I am
looking at tesseract and cuneiform at this point. At least one of them
wants a B&W document as input, and its results appear to vary quite a
bit with the scan resolution.
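
Since the OCR results seem to depend on resolution, one way to compare settings is to scan the same page at a few resolutions and run the OCR on each. The sketch below only builds the command lines and does not run anything. The resolutions, the file names, and the mode value "Gray" are assumptions; scanimage option names and values depend on the backend, so check `scanimage --help` for the device before relying on them.

```python
import shlex

def scanimage_command(resolution_dpi, mode):
    """Build a scanimage call; --resolution and --mode are backend options (assumed here)."""
    return ["scanimage", "--resolution", str(resolution_dpi),
            "--mode", mode, "--format=tiff"]

def tesseract_command(image_path, output_base, language="eng"):
    """Build a tesseract call that writes its text to <output_base>.txt."""
    return ["tesseract", image_path, output_base, "-l", language]

def trial_plan(resolutions=(150, 300, 600), mode="Gray"):
    """Return (scan, ocr) shell command strings for each trial resolution."""
    plan = []
    for dpi in resolutions:
        image = f"page_{dpi}.tif"  # hypothetical file name
        scan = " ".join(shlex.quote(a) for a in scanimage_command(dpi, mode))
        scan += " > " + shlex.quote(image)  # scanimage writes the image to stdout
        ocr = " ".join(shlex.quote(a)
                       for a in tesseract_command(image, f"page_{dpi}"))
        plan.append((scan, ocr))
    return plan

if __name__ == "__main__":
    for scan, ocr in trial_plan():
        print(scan)
        print(ocr)
```

Comparing the resulting text files side by side should show which resolution gives usable output for a given kind of document.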

Am I dealing with conflicting goals? Will I need to scan each document
twice or more to accomplish what I want here?
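
One possibility that might avoid a second scan, offered only as an untested sketch: scan once in grayscale for the archive copy and derive the B&W input for OCR from it by thresholding. The fixed threshold of 128 and the tiny sample "image" are assumptions for illustration; whether a derived copy is good enough for the OCR programs is exactly what I do not know.

```python
def to_bilevel(gray_rows, threshold=128):
    """Map 8-bit gray values to 1 (ink) if darker than threshold, else 0 (paper)."""
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_rows]

def ink_fraction(bilevel_rows):
    """Fraction of pixels marked as ink; mostly-white pages give a small value."""
    total = sum(len(row) for row in bilevel_rows)
    ink = sum(sum(row) for row in bilevel_rows)
    return ink / total if total else 0.0

# Hypothetical 3x4 grayscale patch: light paper with a few dark strokes.
sample = [
    [255, 250, 30, 245],
    [252, 20, 25, 248],
    [255, 251, 128, 249],
]

if __name__ == "__main__":
    bilevel = to_bilevel(sample)
    for row in bilevel:
        print("".join("#" if bit else "." for bit in row))
    print(f"ink fraction: {ink_fraction(bilevel):.2f}")
```

The same idea applies to a full page: the grayscale original stays in the archive and only the derived B&W copy is handed to the OCR program.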
Dave