[sane-devel] Using Find Command in XSane PDF File

Wed Feb 1 20:22:10 UTC 2017

Here's a quick list of commands: first scan a receipt at the recommended
300 DPI (adjust for your scanner), then optionally clean up the image with
ImageMagick's convert (for some odd reason I failed to document these
commands, doh!, but you can likely skip them), and finally run Tesseract OCR
to create a text file of whatever text it recognizes.

# First scan in the image.  300 DPI is recommended; I think 450 DPI is the
# optimal resolution when attempting OCR.

# (scanimage writes PNM unless told otherwise, so ask for TIFF explicitly.)
$ scanimage --format=tiff > ./receipt.tif

# Or a more extravagant method:

$ scanimage --format=tiff --progress --custom-gamma=no --source Flatbed --resolution=300 --icc-profile=${HOME}/ICC/CanoScan9000F/CNSR0D.ICC > receipt.tif

# The commands below attempt to auto-crop the background from the receipt.
# Because the scanner's background is white, they fail to tell white paper
# from the background.  They should work with a black background, after
# some adjustment.  (eg. Use black paper from a hobby shop for providing a
# black background during scanning.)  Note that -fuzz is a setting and has to
# come before -trim to have any effect.
$ convert receipt.tif -fuzz 55% -trim receipt-trim.tif
$ convert -verbose receipt.tif -border 10x10 -fuzz 75% -trim +repage receipt-trim.tif

# This prints the recognized text to the terminal.  Replace "stdout" with an
# output base name (eg. "receipt") and a receipt.txt is created within the
# immediate folder.
$ tesseract receipt.tif stdout

# As extensively described here, this creates a PDF with the OCR text
# included.  The trailing "pdf" selects Tesseract's PDF output config and
# writes receipt.pdf.  The text within the PDF file is stored in compressed
# (binary) streams and cannot be simply grepped!
$ tesseract receipt.tif receipt pdf
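
To keep both results from one scan, the two tesseract calls can be wrapped
in a small shell function.  This is only a sketch: the name ocr_receipt is
made up for illustration, it assumes a Tesseract build that ships the pdf
config, and it assumes the image file name has an extension.

```shell
# ocr_receipt IMAGE: write the OCR text to <base>.txt and a searchable PDF
# to <base>.pdf, next to the image.  (Hypothetical helper, not part of
# Tesseract.)
ocr_receipt() {
    image=$1
    # Strip the extension: receipt.tif -> receipt.  Assumes the file name
    # actually has an extension.
    base=${image%.*}
    # Plain text output: tesseract appends ".txt" to the output base.
    tesseract "$image" "$base" || return 1
    # Searchable PDF: the trailing "pdf" selects tesseract's pdf config.
    tesseract "$image" "$base" pdf || return 1
}
```

Calling "ocr_receipt receipt.tif" should then leave receipt.txt and
receipt.pdf beside the original image.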

There are two possible end results:
1) A scanned image (eg. receipt.tif) and a text file (eg. receipt.txt)
containing the possibly recognized text.  If you archive data, this is
probably your best method for preserving image detail and avoiding FUD and
extravagant proprietary formats.  Searching simple text files is extremely
easy.  Maintaining two separate files can be troublesome.

2) A scanned image (eg. receipt.tif) imported into a PDF file containing the
OCR text.  As far as I know, recent versions of Tesseract can write a PDF
file including both the image and the text when given the pdf config, while
the default output (and the only output of older versions) is a text file.
Choose the PDF file method if you like simplicity and care less about
details.  The downside: the image is compressed further, and significantly.

I prefer the first solution, as it leaves me with a high resolution TIF/JPEG
image, whereas creating the PDF file compresses the image drastically.  On
the flip side, the one PDF file includes both the image and the text, rather
than having to deal with two separate files.  (eg. receipt.tif and
receipt.txt)
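
For the first solution, searching the text files could look something like
the sketch below.  The function name search_receipts and the assumption that
all OCR output ends in .txt are mine, for illustration only.

```shell
# search_receipts DIR PATTERN: list the .txt files under DIR that contain
# PATTERN, ignoring case.  (Hypothetical helper name.)
search_receipts() {
    # -i ignores case, -l prints only the names of matching files, and
    # -e protects patterns that begin with a dash.
    find "$1" -name '*.txt' -exec grep -il -e "$2" {} +
}
```

For example, "search_receipts ~/receipts acme" would print the path of every
text file mentioning ACME, and the matching image sits right next to it.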

The final incantation of find searches PDF files containing OCR text or
general text.

# Search multiple PDF files for TEXT
# (The file name is passed to sh as $1, so names with quotes or spaces are safe.)
find /tmp -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "TEXT"' sh {} \;
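
If only the names of the matching PDF files are wanted, a variant along
these lines might do.  It is a sketch under assumptions: pdfs_matching is a
made-up name, pdftotext (from poppler-utils) is assumed to be installed, and
file names containing newlines are not handled.

```shell
# pdfs_matching DIR PATTERN: print the name of each PDF under DIR whose
# extracted text contains PATTERN.  (Hypothetical helper name.)
pdfs_matching() {
    find "$1" -name '*.pdf' -print | while IFS= read -r pdf; do
        # pdftotext writes the text to stdout when the output file is "-".
        if pdftotext "$pdf" - 2>/dev/null | grep -q -e "$2"; then
            printf '%s\n' "$pdf"
        fi
    done
}
```

For example, "pdfs_matching /tmp TEXT" lists each PDF once, no matter how
many lines inside it match.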

Last but not least, somebody actively maintains gscan2pdf
(http://gscan2pdf.sourceforge.net/), a GUI front-end written in Perl that
makes scanning to PDF simple and easy.  I've installed & tried it, but I am
extremely biased toward command line utilities versus troublesome clicky
front-ends.

-- 
Roger
http://rogerx.freeshell.org/