[sane-devel] help with improving text scans

Fri Dec 19 18:36:27 UTC 2008

On Friday 19 December 2008 11:34:09 gobo wrote:
> for some time now i've been using homemade scripts with scanimage and
> scanadf to scan my paper documents. most of my documents are plain
> text. the results have always been poor and marginally acceptable. i'm
> using suse 10.3 and an hp aio j6450 or psc1210xi.
>
> recently i obtained a canon scanner w/adf for use at work where i must
> use windows. to get around the image compatibility issues of microsoft
> document imaging (office 2003) i simply print the scanned image to pdf
> with acrobat. the results obtained with mdi are far superior to
> anything i've ever been able to achieve with sane apps.
>
> i've spent hours fumbling around with scanimage options, imagemagick
> convert to resize the images and ps2pdf to produce the pdf files.
> while i have made some slight improvements over the default settings,
> i've never been able to get even close to the mdi output. in the few
> places where i must have a good scan, i use resolutions of 150 or 300,
> but to get prints of the image becomes a real pain. i must load the
> image in gimp, fiddle around resizing it and then printing.
>
>
> my standard scanimage script would contain:
> scanimage -x 215.9 -y 297 \
>   -d hpaio:/net/Officejet_J6400_series?ip=192.168.1.103 \
>   -pv --mode gray > $FILE
>
>
> pieces from a perl script using the adf:
> # this is the scan device
> @scanr = ("hpaio:/net/Officejet_J6400_series?ip=192.168.1.103");
> # these are the command line options for scanadf
> @opts = ("-x 215.9 -y 297 -v --mode=gray --source ADF --batch-scan=no -e 1");
>
> # scan page
> system("scanadf @opts -d @scanr -o $fnamepg");
>
> adding --resolution=150, or 300 does produce a larger image, with less
> artifacting, and much more readable, but difficult to print.
>
> the answer must be one of two things -- either i'm missing something
> real simple about producing hi-res 8.5x11" images (that is right in
> front of my nose) or we are just not there yet with linux scanning.
>
> can someone correct, or put me on a better path?
>
> thanks.

I use two bash scripts for document scanning, bscan and scans2pdf,
located at http://www.acjlaw.net:8080/~jeremy/Ricoh/scripts/
The scripts are based on simpler versions I found on the net (I forget where).

The bscan (batch scan) script acquires pnm images from the scanner using
scanimage and then processes those images into a multipage pdf using
pnmtools.

The scans2pdf script takes sequential pnm images from xsane
(e.g. file.%04d.pnm) and converts them into a multipage pdf.

The processing logic in scans2pdf is exactly the same as in bscan.
I never got around to replacing the processing logic in bscan
with a call to scans2pdf. It is mainly a matter of repackaging the
arguments to bscan so that they work with scans2pdf: e.g. the bscan
option "-gray nshades" both enables grayscale scanning and sets the
number of gray shades to keep in the final processed pdf.
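
The double duty of an option like "-gray nshades" can be sketched as a
small argument parser. This is only an illustration, not the bscan source:
the function name parse_scan_opts, the variable names, the defaults and
the mode names Gray/Color/Lineart are all assumptions (mode names differ
between SANE backends).

```shell
# Hypothetical sketch; the real bscan may parse its arguments differently.
# Sets MODE (for scanimage --mode), NSHADES (levels kept when processing),
# PAGE, BW and NAME from bscan-style arguments.
parse_scan_opts() {
    MODE=Lineart; NSHADES=2; PAGE=letter; BW=no; NAME=   # assumed defaults
    while [ $# -gt 0 ]; do
        case "$1" in
            -gray|-color|-page)
                # These options take a value; refuse to run off the end.
                [ $# -ge 2 ] || { echo "$1 needs a value" >&2; return 1; }
                case "$1" in
                    -gray)  MODE=Gray;  NSHADES=$2 ;;   # scan mode AND shade count
                    -color) MODE=Color; NSHADES=$2 ;;
                    -page)  PAGE=$(printf '%s' "$2" | tr 'A-Z' 'a-z') ;;
                esac
                shift 2 ;;
            -bw) BW=yes; shift ;;
            -*)  echo "unknown option: $1" >&2; return 1 ;;
            *)   NAME=$1; shift ;;                      # output file name
        esac
    done
}
```

After "parse_scan_opts -gray 2 -bw filename" the scanning half of a script
would use $MODE and the processing half would use $NSHADES and $BW, which
is the repackaging a call to scans2pdf would need.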

To facilitate one-key scanning it's convenient to define some aliases:
alias B='bscan -gray 2'
alias BL='bscan -gray 2 -page Legal'
alias CL='bscan -color 32 -page Legal'
alias b='bscan -s 0'
alias bl='bscan -s 0 -page Legal'
alias c='bscan -color 32'

Thus to scan a letter-sized document in grayscale and then convert it to
black+white using adaptive (dynamic) thresholding, I simply use the
command "B -bw filename", which creates filename.pdf.
To scan legal sheets in lineart mode: "bl filename", or in color: "CL filename".
I have here a 13-page legal-sized document which was scanned in
grayscale and converted to b/w. It is 749K, or about 57K/page, which is
reasonable. I could have scanned in b/w, but it would not have saved all
that much space.
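
To show what adaptive thresholding means, here is a toy version in awk.
It is one simple variant (compare each pixel with the mean of its
neighbourhood) and is not what bscan actually runs; the function name
adaptive_threshold and its two parameters are made up for this sketch. It
reads a plain-text PGM (P2) on stdin and writes a plain PBM (P1), and is
far too slow for real scans, where a dedicated tool such as netpbm's
pamthreshold is the better choice.

```shell
# Toy local-mean threshold, for illustration only.
# usage: adaptive_threshold [radius] [offset] < in.pgm > out.pbm
# A pixel turns black (1) when it is darker than the mean of the
# (2*radius+1)^2 window around it by more than offset.
adaptive_threshold() {
    awk -v r="${1:-1}" -v off="${2:-0}" '
    { sub(/#.*/, ""); for (i = 1; i <= NF; i++) tok[n++] = $i }  # drop comments, collect tokens
    END {
        if (tok[0] != "P2") { print "expected plain PGM (P2)" > "/dev/stderr"; exit 1 }
        w = tok[1] + 0; h = tok[2] + 0      # tok[3] is maxval; pixels start at tok[4]
        print "P1"; print w, h
        for (y = 0; y < h; y++) {
            line = ""
            for (x = 0; x < w; x++) {
                sum = 0; cnt = 0
                for (dy = -r; dy <= r; dy++)
                    for (dx = -r; dx <= r; dx++) {
                        yy = y + dy; xx = x + dx
                        if (yy < 0 || yy >= h || xx < 0 || xx >= w) continue  # clip at borders
                        sum += tok[4 + yy * w + xx]; cnt++
                    }
                bit = (tok[4 + y * w + x] + 0 < sum / cnt - off) ? 1 : 0
                line = line bit (x < w - 1 ? " " : "")
            }
            print line
        }
    }'
}
```

Because the threshold follows the local background, a page that is darker
on one side than the other still comes out with clean text, which a single
global threshold cannot do.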

The bscan program accepts many options for
changing the default behavior:

SCANNER OPTIONS:
  -d "device name"             e.g. HS2P or SP15C
  -source ADF= Y | N
  -page legal | letter
  -color number_of_colors      enables color scanning and sets the max number of colors
  -gray number_of_gray_shades
  -res resolution
  -duplex                      enables duplex
  -s                           (user settings defaults)

PROCESSING OPTIONS:
  -bw                          convert to black+white using adaptive thresholding
  -dither                      e.g. atkinson, see pamdither
  -color                       remap colorspace to number_of_colors
  -gray                        downsample to nshades of gray
  -flip r180                   rotates 180 degrees

OUTPUT OPTIONS:
  -pnm                         don't convert to pdf
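
A page-size option like -page has to end up as scan geometry for
scanimage, which most backends take in millimetres via -x and -y. The two
helpers below are a sketch under that assumption; their names are
hypothetical and they are not taken from bscan.

```shell
# Hypothetical helper: print scanimage geometry options (mm) for a page name.
page_geometry() {
    case "$(printf '%s' "$1" | tr 'A-Z' 'a-z')" in
        letter) echo "-x 215.9 -y 279.4" ;;   # 8.5 x 11 in
        legal)  echo "-x 215.9 -y 355.6" ;;   # 8.5 x 14 in
        a4)     echo "-x 210 -y 297" ;;
        *)      echo "unknown page size: $1" >&2; return 1 ;;
    esac
}

# Pixels along one side for a length in mm at a given dpi,
# rounded to the nearest pixel.
mm_to_pixels() {
    awk -v mm="$1" -v dpi="$2" 'BEGIN { printf "%d\n", mm / 25.4 * dpi + 0.5 }'
}
```

It could be used as "scanimage $(page_geometry legal) --resolution 300 >
page.pnm" (the unquoted substitution is intentional so the options split
into words; --resolution is backend-dependent). A letter page at 300 dpi
should then come out as 2550 x 3300 pixels, which is a quick sanity check
on what the scanner delivered. Note that 215.9 x 297 is letter width with
A4 height.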

Some documents don't have enough contrast to remain readable after
conversion to b/w; those are simply scanned in gray or color mode:
"B filename" or "c filename"

The large filename.pdf can then be reduced in size by conversion to djvu:
pdf2djvu filename.pdf -o filename.djv
djview4 filename.djv   (then print to ps from within djview4)
ps2pdf14 filename.ps filename.pdf   (now much smaller, ~1/50 of the original size)

Some documents may need user interaction in xsane to set cropping,
brightness/contrast/gamma, etc.
The scans (file.0001.pnm, file.0002.pnm, ...) can then be converted to pdf
with "scans2pdf -bw file", which converts all the file*.pnm into a single
multipage file.pdf containing b/w images.
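
The file-gathering step can be sketched as follows. This is not the real
scans2pdf: the function names are made up, and ImageMagick's convert is
swapped in for the pnmtools processing purely to keep the example short.
Passing -density before the input files records the scan resolution, so
the pdf pages should come out at their physical size and print without
manual resizing.

```shell
# Hypothetical sketch, not the real scans2pdf.
# Print PREFIX.0001.pnm, PREFIX.0002.pnm, ... in page order, one per line.
list_pages() {
    for f in "$1".[0-9][0-9][0-9][0-9].pnm; do
        [ -e "$f" ] && printf '%s\n' "$f"   # skip the unexpanded pattern
    done
    return 0
}

# usage: pages_to_pdf PREFIX [DPI]   (DPI defaults to 300, an assumption)
pages_to_pdf() {
    [ -n "$(list_pages "$1")" ] || { echo "no pages found for $1" >&2; return 1; }
    # The zero-padded numbers make the shell glob expand in page order.
    convert -units PixelsPerInch -density "${2:-300}" \
        "$1".[0-9][0-9][0-9][0-9].pnm "$1.pdf"
}
```

For example, "pages_to_pdf file 150" would bundle file.0001.pnm,
file.0002.pnm, ... scanned at 150 dpi into file.pdf.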

It should be straightforward to modify this script to recognize your scanners'
options and device names.

There is also a promising GUI program, gscan2pdf, on sourceforge:
http://gscan2pdf.sourceforge.net/
There was a bug in the program which would not let me change my SP15C
scanner's options. I submitted a bug report to the author, but he has not
yet been able to fix or work around the problem. The program may still
work for you.
