[sane-devel] Using Find Command in XSane PDF File
Roger
rogerx.oss at gmail.com
Wed Feb 1 20:22:10 UTC 2017
Here's a quick list of commands for first scanning a receipt at the
recommended 300 DPI (adjust for your scanner), then trimming the image with
ImageMagick's convert (for some odd reason I failed to document these, doh!
You can likely skip the convert commands), and finally running Tesseract OCR
to create a text file of whatever text it recognizes.
# First scan in the image. 300 DPI is recommended, and I think 450 DPI is the
# optimal resolution when attempting OCR.
$ scanimage > ./receipt.tif
# Or a more extravagant method:
$ scanimage --format=tiff --progress --custom-gamma=no --source Flatbed --resolution=300 --icc-profile=${HOME}/ICC/CanoScan9000F/CNSR0D.ICC > receipt.tif
# The commands below attempt to auto-crop the background from the receipt, but
# due to the scanner's white background, they fail to tell the background
# apart from white paper. They should work with a black background, after
# some adjustment. (E.g. use black paper from a hobby shop to provide a
# black background during scanning.) Note that -fuzz is a setting, so it has
# to come before the -trim it affects, and the input file is read first.
$ convert /tmp/receipt.tif -fuzz 55% -trim /tmp/receipt-trim.tif
$ convert -verbose receipt.tif -border 10x10 -fuzz 75% -trim +repage receipt-trim.tif
# This prints the recognized text to the terminal. Replace "stdout" with an
# output base name (e.g. "receipt") and a receipt.txt is created within the
# immediate folder.
$ tesseract receipt.tif stdout
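# As extensively described here, this creates a PDF with the OCR text included.
# The text within the PDF file is written in binary form and cannot simply be
# grepped! The second argument is the output base name and the trailing "pdf"
# selects PDF output, so this writes receipt.pdf.
$ tesseract receipt.tif receipt pdf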
There are two possible end results:
1) A scanned image (e.g. receipt.tif) and a text file (e.g. receipt.txt)
containing the possibly recognized text. If you archive data, this is probably
your best method for preserving image detail and avoiding FUD and extravagant
proprietary formats. Searching simple text files is extremely easy.
Maintaining two separate files can be troublesome.
2) A scanned image (e.g. receipt.tif) imported into a PDF file containing the
OCR text. As far as I can tell, Tesseract still writes a text file by default,
while recent versions produce a PDF holding both the image and the text when
given the "pdf" config. Choose the PDF method if you like simplicity and care
less about detail. The downside: the image is compressed significantly
further.
I prefer the first solution, as it leaves me with a high resolution TIF/JPEG
image, whereas creating the PDF file compresses the image drastically. On the
flip side, the single PDF file includes both the image and the text rather
than leaving you with two separate files to deal with (e.g. receipt.tif and
receipt.txt).
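If you want both results from one scan, the two Tesseract runs can be wrapped
in a small shell function. This is only a sketch: the function name
ocr_receipt is made up for illustration, and it assumes a Tesseract build
that ships the "pdf" config.

```shell
# ocr_receipt IMAGE
# Hypothetical helper: writes a .txt and a .pdf next to the scanned image,
# e.g. receipt.tif -> receipt.txt and receipt.pdf.
ocr_receipt() {
    img=$1
    base=${img%.*}    # strip the extension: receipt.tif -> receipt

    # Default run: plain text output in "$base.txt".
    tesseract "$img" "$base" || return 1

    # Second run with the "pdf" config: searchable PDF in "$base.pdf".
    tesseract "$img" "$base" pdf
}

# Usage:
#   ocr_receipt receipt.tif
```

The original image is left untouched, so you keep the full-resolution scan
for archiving and still get a single searchable PDF to hand around.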
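The final incantation of find searches PDF files containing OCR text or general text.
# Search multiple PDF files for TEXT. The file name is passed to sh as $1
# rather than pasted into the script, so odd file names cannot break the quoting.
find /tmp -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "TEXT"' sh {} \;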
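For the first method (image plus separate text file), no PDF extraction is
needed, since grep can read the text files directly. A minimal sketch, where
the function name search_receipts and its two arguments are my own invention:

```shell
# search_receipts DIR PATTERN
# Hypothetical helper: case-insensitive search of every .txt file below DIR,
# printing matches as "file:matching line".
search_receipts() {
    dir=$1
    pattern=$2
    # -H always prints the file name, even if only one file is found.
    # "--" keeps a pattern starting with "-" from being read as an option.
    find "$dir" -type f -name '*.txt' -exec grep -H -i -- "$pattern" {} +
}

# Usage:
#   search_receipts ~/receipts 'coffee'
```

Each hit names the text file, and the matching image sits right next to it
under the same base name.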
Last but not least, somebody actively maintains gscan2pdf
(http://gscan2pdf.sourceforge.net/), a GUI front-end written in Perl that
makes scanning to PDF simple and easy. I've installed & tried it, but am
extremely biased toward command line utilities over troublesome clicky
front-ends.
--
Roger
http://rogerx.freeshell.org/