[sane-devel] scanning for archival and OCR
Martin Dengler
martin at martindengler.com
Tue Jan 22 23:18:44 UTC 2013
On Tue, Jan 22, 2013 at 12:34:05PM -0500, David H. Durgee wrote:
> I am trying to determine how best to scan and save these documents.
I have found the following process to be useful:
Scanner input, jpg (or pdf)
|
v
tidy up image using 'unpaper'
|
v
convert to grayscale via ppmtopgm -> pamtotiff
|
v
OCR using tesseract
Tesseract can also embed the OCR text in the PDF (search for
"tesseract hocr").
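As a rough sketch of the steps above (not the contents of my makefile),
something like the following Python would chain the tools for a single
page. It assumes the page has already been converted to a PNM file, that
the cleanup tool is invoked as `unpaper`, and that the netpbm tools and
tesseract are on the PATH. The function names and the intermediate file
names (`-unpaper.pnm`, `-gray.pgm`, `-ocr`) are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_steps(page: Path):
    """Return (argv, stdout_path) pairs for one scanned page in PNM format.

    File naming here is hypothetical; adjust it to your own layout.
    """
    stem = page.with_suffix("")
    cleaned = Path(f"{stem}-unpaper.pnm")
    gray = Path(f"{stem}-gray.pgm")
    tiff = Path(f"{stem}.tif")
    return [
        # unpaper takes the input and output file names as arguments.
        (["unpaper", str(page), str(cleaned)], None),
        # The netpbm converters write to stdout, so capture it in a file.
        (["ppmtopgm", str(cleaned)], gray),
        (["pamtotiff", str(gray)], tiff),
        # tesseract adds the output extension itself; the trailing "hocr"
        # config asks for hOCR output instead of plain text.
        (["tesseract", str(tiff), f"{stem}-ocr", "hocr"], None),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling `run_steps(build_steps(Path("scan.pnm")))` would then run the four
tools in order. Building the command list separately from running it makes
it easy to print the commands first and check them before touching any
files.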
This is a makefile I use to automate that process, starting from a PDF
(image only) generated by my scanner:
http://www.martindengler.com/proj/scan-post-process-Makefile
...like so:
make -f scan-post-process-Makefile $(basename input.pdf .pdf)-processed
Tesseract isn't perfect, but it's pretty good.
> Dave
Martin