[sane-devel] High Volume document scanning !!
Jeremy Johnson
jeremy at acjlaw.net
Mon Nov 3 18:05:05 UTC 2008
On Monday 03 November 2008 07:00:04 m. allan noah wrote:
> On Mon, Nov 3, 2008 at 4:44 AM, Rod De Beer <REDeBeer at dla.gov.za> wrote:
> > hi Allan
> >
> > Thanks so much for your reply !!!!!
> >
> > A little more info:
> >
> > I am going with the idea of dropping the Windows-based scanning
> > system, loading Linux on my servers and workstations, and looking
> > for a Linux-based high-volume scan interface where I can save the
> > documents as multipage PDFs and slot them as BLOBs in the database
> > with the metadata, using MySQL as the DB, and then develop my
> > system around that?
>
> I personally don't know of any such system, but perhaps someone else
> will. Writing something like that would not be very hard: gscan2pdf
> already does the front half of it, and it is Perl, so it is very easy
> to extend to a backend db.
>
> > My scanner requirements are page sizes up to and including A3,
> > black and white, 200dpi, multipage PDF (embedded metadata), and
> > volumes as mentioned below. At the moment I am leasing about 200
> > Canon DR-9080Cs. YES!! The Fujitsu does look like a strong option
> > for outright purchase.
>
> Oh, the Canon backend I am working on is to support the DR-9080C, so
> you might be able to continue your lease :)
>
> allan
> --
> "The truth is an offense, but not a sin"
Scanning 5000 pages/day/scanner works out to around 10 ppm: 5000 pages /
(8 hr x 60 min/hr) = ~10.4 pages per minute, assuming an 8-hour work day.
Many low-end scanners can achieve more than 10 ppm of A4 at 200dpi b/w.
Document prep and archiving will probably be at least as important as the
actual time spent scanning. It's certainly faster to scan one 100-page
document and upload it to a database than to scan 100 1-page documents and
upload them to a db with individual filenames and index field data.
A search on www.sourceforge.net for "document archiving scanning" brings
up a number of open-source solutions such as OpenLSD, Maarch Archiving
DMS, and Maxview Document Management. There are also a number of document
management systems (DMS) such as KnowledgeTree (a PHP 4-based web
application) which could be used to store scanned documents in a db
backend.
I myself have used archiveindex's Repository program to store scanned
documents in a db. Repository is a C-based CGI program which stores
document paths and filenames in a hierarchical BerkeleyDB 4 file. Scanned
documents are uploaded (and renamed to a hex representation of the system
time; see the sketch below) to user-defined storage areas with size
quotas (each storage area can be set up to hold a maximum of, e.g., one
CD or one DVD worth of files). The web app presents the user with a
hierarchical storage scheme with folder icons and thumbnails of documents
(PDF, JPEG, DVI, ...). There are also scripts which can be run from the
command line (or from other programs) to do, e.g., batch queries or batch
uploads/downloads. Scanned documents are automatically tokenized and
indexed. There is also an optional OCR program (for Windows only). The
web app provides access to your db from any browser (but you can restrict
access to your webserver however you like). Some drawbacks of the program
are:
1) Not open source (your data is locked into a proprietary format); the
greater your financial investment in scanning, the riskier the
proprietary lock-in feels.
2) 32-bit binaries, with dependencies on Berkeley DB 4.0, netpbm, and
certain system utilities (e.g. the keylock program parses the output of
'ifconfig eth0 | grep HWaddr' and won't work with
sys-apps/net-tools-1.60-r13 or greater).
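As an aside, the rename-to-hex-of-system-time trick is easy to
reproduce. This is only my guess at the scheme in Perl (not Repository's
actual code), and the 700 MB quota is just an example:

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Copy qw(move);

my ($src, $storage_dir) = @ARGV;
my $quota = 700 * 1024 * 1024;   # e.g. one CD worth of storage

# Bytes already used in the storage area.
my $used = 0;
$used += (-s $_ || 0) for glob("$storage_dir/*");

my $size = -s $src;
die "cannot stat $src\n" unless defined $size;
die "storage area full\n" if $used + $size > $quota;

# Rename to a hex representation of the current system time, keeping
# the original extension. (A real implementation would have to guard
# against two uploads landing in the same second.)
my (undef, undef, $ext) = fileparse($src, qr/\.[^.]*$/);
my $dest = sprintf('%s/%08x%s', $storage_dir, time(), $ext);
move($src, $dest) or die "move to $dest failed: $!";
print "uploaded as $dest\n";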
It should probably be fairly straightforward to implement something
similar using modern open-source tools such as the Catalyst framework or
Ruby on Rails for the web-based frontend and MySQL or PostgreSQL for the
backend.
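The retrieval side would be just as small. A minimal CGI sketch (again
assuming the hypothetical 'documents' table from the earlier example)
that serves a stored PDF back to the browser:

#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use DBI;

my $q  = CGI->new;
my $id = $q->param('id');
die "numeric id required\n" unless defined $id && $id =~ /^\d+$/;

my $dbh = DBI->connect('DBI:mysql:database=archive;host=localhost',
                       'scanuser', 'secret', { RaiseError => 1 });

# Fetch the BLOB for one document.
my ($title, $pdf) = $dbh->selectrow_array(
    'SELECT title, pdf FROM documents WHERE id = ?', undef, $id);
die "no such document\n" unless defined $pdf;

# Hand the PDF straight back to the browser.
print $q->header(-type           => 'application/pdf',
                 -attachment     => "$title.pdf",
                 -Content_Length => length $pdf);
binmode STDOUT;
print $pdf;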