[sane-devel] scanning for archival and OCR
Julien Michielsen
michkloo at xs4all.nl
Wed Jan 23 11:29:23 UTC 2013
On 01/22/13 18:34, David H. Durgee wrote:
> Now that I have my Canon PIXMA MG2120 working as a scanner I was looking
> at the possibility of using it to scan some documents for archival
> purposes. My intent is to scan some old bills and other documents and
> database them in some manner before disposing of the originals. As
> these are mostly bills and other documents they tend to be B&W with the
> possible exception of some logos or letter head/foot in color.
>
> I am trying to determine how best to scan and save these documents. I
> would like to be able to reprint them in the future if that becomes
> necessary, implying a high quality scan. But I am also concerned with
> the size of the saved documents on my system. What is the best file
> format for saving such documents? I would imagine that 16 or possibly
> even 8 colors would suffice to represent the documents, assuming the
> palette can be specified. The documents are likely to be mostly
> white-space, so I would expect them to be highly compressible.
>
> Given I am interested in databasing them I would also think that using
> an OCR program to extract the content would be useful. I tried a few
> things with that on a small document and encountered problems, given the
> mixed content of the particular document. Are there any specific
> recommendations for scanning a document with OCR in mind? As I am
> running linux I am looking at tesseract and cuneiform at this point, at
> least one of which wants a B&W document for input and whose results
> appear to vary quite a bit with the scan resolution.
>
> Am I dealing with conflicting goals? Will I need to scan each document
> twice or more to accomplish what I want here?
>
> Dave
I have been scanning for archival purposes and will tell you how I
do it.
-1 xsane lets you choose the file type it uses when writing a scan
to disk. For archival purposes I use pdf. I saved one scan under
the name vkweer1.pdf. For the next scan xsane proposed the name
vkweer2.pdf; the number goes up by 1 for each following scan. After
finishing the scanning, I combine the separate scans into one file
with ghostscript. An example of a pdf-combining ghostscript command:
gs -q -sPAPERSIZE=a4 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
   -sOutputFile=vkweerc.pdf vkweer1.pdf vkweer2.pdf
This command combines the two input files (vkweer1.pdf and
vkweer2.pdf) into one output file ("vkweerc.pdf"), which is much
smaller than the original input files:
ls vkw* -l
-rw-r--r-- 1 julien users 682131 Jan 23 11:51 vkweerc.pdf
-rw-r----- 1 julien users 14439950 Jan 23 11:46 vkweer.pdf
-rw-r----- 1 julien users 14439950 Jan 23 11:47 vkweer2.pdf
As you can see, the output of gs (the file vkweerc.pdf) is more
than 42 times smaller than the input files vkweer1.pdf and
vkweer2.pdf combined. Great for archival purposes ;-) . In the setup
of xsane I specified to save the pdf zlib compressed, but apparently
ghostscript does a much better job as far as compression is concerned.
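For more than two scans, the same gs call can be wrapped in a small
shell function. This is only a sketch, assuming the scans are numbered
consecutively from 1 the way xsane proposes (vkweer1.pdf, vkweer2.pdf,
...). The function name combine_scans and the GS override variable are
made up for illustration; they are not part of xsane or ghostscript.

```shell
# Sketch: merge PREFIX1.pdf, PREFIX2.pdf, ... into PREFIXc.pdf.
# combine_scans is a hypothetical helper name, not an xsane or gs feature.
combine_scans() {
    prefix=$1

    # Collect the inputs by counting upward, so that scan 10 follows
    # scan 9 (a plain shell glob would sort 10 before 2).
    set --
    n=1
    while [ -f "${prefix}${n}.pdf" ]; do
        set -- "$@" "${prefix}${n}.pdf"
        n=$((n + 1))
    done

    if [ "$#" -eq 0 ]; then
        echo "no scans found for prefix ${prefix}" >&2
        return 1
    fi

    # Same options as the command above; GS (an assumed variable) allows
    # pointing at a different ghostscript binary, the default is "gs".
    "${GS:-gs}" -q -sPAPERSIZE=a4 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -sOutputFile="${prefix}c.pdf" "$@"
}
```

Calling "combine_scans vkweer" in the scan directory would then write
vkweerc.pdf. The loop stops at the first gap in the numbering, so if
vkweer3.pdf is missing, any later scans are left out of the result.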
Hope this is useful to you.
Julien
--
Julien Michielsen
julien_at_michkloo.xs4all.nl