[sane-devel] scanning for archival and OCR

Tue Jan 22 23:18:44 UTC 2013

On Tue, Jan 22, 2013 at 12:34:05PM -0500, David H. Durgee wrote:
> I am trying to determine how best to scan and save these documents.

I have found the following process to be useful:

Scanner input, jpg (or pdf)
  |
  v
tidy up image using 'unpaper'
  |
  v
convert to grayscale via ppmtopgm -> pamtotiff
  |
  v
OCR using tesseract

Tesseract can also produce hOCR output, which lets the recognized text
be embedded in the PDF (search for tesseract hocr).
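
For a single JPEG page, the steps above could be scripted roughly like
this. It is only a sketch, not the Makefile linked below: it assumes
jpegtopnm, ppmtopgm and pamtotiff (netpbm), unpaper and tesseract are
installed, and the function name ocr_page and the file names are made up
for illustration.

```shell
#!/bin/sh
# Sketch: clean up, convert and OCR one scanned JPEG page.
# Assumes the netpbm tools, unpaper and tesseract are on PATH.
set -e

ocr_page() {
    in=$1              # e.g. page1.jpg
    base=${in%.*}      # strip the extension: page1

    # unpaper works on PNM files, so convert the JPEG first.
    jpegtopnm "$in" > "$base.pnm"

    # Tidy up the scan (unpaper will not overwrite an existing output file).
    unpaper "$base.pnm" "$base-clean.pnm"

    # Convert to grayscale, then to TIFF for tesseract.
    ppmtopgm "$base-clean.pnm" | pamtotiff > "$base.tif"

    # Plain-text OCR; writes "$base.txt".
    tesseract "$base.tif" "$base"
    # For hOCR output instead, append the hocr config:
    #   tesseract "$base.tif" "$base" hocr
}

# Process every file named on the command line, if any.
if [ "$#" -gt 0 ]; then
    for f in "$@"; do
        ocr_page "$f"
    done
fi
```

Saved as, say, ocr-page.sh (a hypothetical name), it would be run as
"sh ocr-page.sh page1.jpg" and leave page1.tif and page1.txt next to the
input.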

This is a makefile I use to automate that process, starting from a PDF
(image only) generated by my scanner:

http://www.martindengler.com/proj/scan-post-process-Makefile

...like so:

make -f scan-post-process-Makefile $(basename input.pdf .pdf)-processed
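
In case the $(...) part is unclear: the shell expands it before make
runs, so make only sees the resulting target name. A quick illustration
(the variable name target is just for this example):

```shell
# basename strips the directory and the given suffix from its argument.
target="$(basename input.pdf .pdf)-processed"
echo "$target"    # prints: input-processed
```

So for input.pdf the command asks make to build the target
input-processed.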

Tesseract isn't perfect, but it's pretty good.

> Dave

Martin