[sane-devel] scanning for archival and OCR

Jeremy Johnson jeremy at acjlaw.net
Wed Jan 23 16:02:24 UTC 2013

Hmmm, I guess I learn something new every day.
I wouldn't have suspected that ghostscript's writer could concatenate PDFs and 
save so much space during compression.

So I just did a test, scanning some tax forms
in 8-bit grayscale to z[0001 --- 0019].pdf using xsane
and then combining using both gs and pdftk.

The results:

$ du -csh z00??.pdf
388K    z0001.pdf
1.2M    z0002.pdf
1.3M    z0003.pdf
1.1M    z0004.pdf
1.6M    z0005.pdf
892K    z0006.pdf
724K    z0007.pdf
908K    z0008.pdf
1.3M    z0009.pdf
728K    z0010.pdf
556K    z0011.pdf
196K    z0012.pdf
1.4M    z0013.pdf
196K    z0014.pdf
580K    z0015.pdf
472K    z0016.pdf
376K    z0017.pdf
920K    z0018.pdf
1.3M    z0019.pdf
16M     total

# Now concatenate using pdftk
$ pdftk z00??.pdf cat output PDFTK.pdf
$ ls -sh PDFTK.pdf
16M PDFTK.pdf

# Concatenate using ghostscript's re-write
$ gs -q  -dNOPAUSE -dBATCH -sDEVICE=pdfwrite  -sOutputFile=GS.pdf z00??.pdf
$ ls -sh GS.pdf
8.5M GS.pdf
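
Going by the rounded figures above (16M vs 8.5M), the ghostscript output is roughly 47% smaller. To get the saving from exact byte counts instead of rounded ls output, something like the following sketch might do. The function name size_saving is made up for illustration, and the file names PDFTK.pdf and GS.pdf are simply the ones from the test above.

```shell
#!/bin/sh
# size_saving is a hypothetical helper, not part of pdftk or ghostscript.
# Prints how much smaller file $2 is than file $1, as a whole percentage.
size_saving() {
    # $(( ... )) also strips the padding some wc implementations emit.
    before=$(( $(wc -c < "$1") ))
    after=$(( $(wc -c < "$2") ))
    if [ "$before" -eq 0 ]; then
        echo 0   # avoid dividing by zero on an empty input
        return
    fi
    # Integer arithmetic, so the result is truncated toward zero.
    echo $(( (before - after) * 100 / before ))
}

# Only report when both outputs from the test above are present.
if [ -f PDFTK.pdf ] && [ -f GS.pdf ]; then
    echo "GS.pdf is $(size_saving PDFTK.pdf GS.pdf)% smaller than PDFTK.pdf"
fi
```

Since it only compares byte counts, the same helper should work for any pair of files, e.g. a single scanned page before and after a gs rewrite.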

Of course, pdftk allows mixing paper sizes. Ghostscript's writer will truncate 
pages which are larger than the default or specified page size. Not sure if 
ghostscript can write PDFs with mixed paper sizes.

For good measure, I also tried pdfjam/pdfjoin/pdflatex, and it too just 
concatenates the PDFs into a 16M file.
