[sane-devel] scanning for archival and OCR
martin at martindengler.com
Tue Jan 22 23:18:44 UTC 2013
On Tue, Jan 22, 2013 at 12:34:05PM -0500, David H. Durgee wrote:
> I am trying to determine how best to scan and save these documents.
I have found the following process to be useful:
Scanner input, jpg (or pdf)
tidy up image using 'unpaper'
convert to grayscale TIFF via ppmtopgm -> pamtotiff
OCR using tesseract
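
The steps above could be sketched for a single page roughly as follows.
This is only an illustration, not the makefile mentioned in this message.
It assumes unpaper, Netpbm and tesseract are installed, and that the page
has already been turned into a PPM file (for example with pdfimages or
jpegtopnm). The function names and the "-clean" suffix are made up for
this example.

```shell
#!/bin/sh
# Sketch of the per-page pipeline: unpaper -> ppmtopgm -> pamtotiff -> tesseract.
# Function names and intermediate file names are hypothetical.

# Derive the output base name from a page image, e.g. scan-000.ppm -> scan-000
page_base() {
    basename "$1" .ppm
}

# Process one PPM page image: writes <base>-clean.ppm, <base>.tif and
# tesseract's hOCR output for that page in the current directory.
process_page() {
    page=$1
    base=$(page_base "$page")
    # Tidy up the scan (unpaper takes an input file and an output file).
    unpaper "$page" "$base-clean.ppm" || return 1
    # Convert to grayscale, then to TIFF.
    ppmtopgm "$base-clean.ppm" | pamtotiff > "$base.tif" || return 1
    # OCR; the "hocr" config file asks tesseract for hOCR output.
    tesseract "$base.tif" "$base" hocr
}
```

With those functions loaded, something like `process_page scan-000.ppm`
would handle one page; looping over all pages is left out on purpose.
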
Tesseract can embed the OCR text in the PDF (search for "tesseract hocr").
This is how I invoke the makefile I use to automate that process, starting
from an image-only PDF generated by my scanner:
make -f scan-post-process-Makefile $(basename input.pdf .pdf)-processed
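
In case the $(...) part of that command is unclear: the shell runs
basename first and substitutes its output, so make is asked for a target
named after the input file without its .pdf suffix. A small sketch, using
the same assumed input name:

```shell
# Assumed input name, taken from the command above.
input=input.pdf
# basename strips the directory part and the given suffix.
target="$(basename "$input" .pdf)-processed"
echo "$target"   # prints: input-processed
```
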
Tesseract isn't perfect, but it's pretty good.