[sane-devel] High Volume document scanning !!
Jeremy Johnson
jeremy at acjlaw.net
Mon Nov 3 18:05:05 UTC 2008
On Monday 03 November 2008 07:00:04 m. allan noah wrote:
> On Mon, Nov 3, 2008 at 4:44 AM, Rod De Beer <REDeBeer at dla.gov.za> wrote:
> > hi Allan
> >
> > Thanks so much for your reply !!!!!
> >
> > A little more info:
> >
> > I am going with the idea of dropping the Windows-based scanning
> > system, loading Linux on my servers and workstations, and looking
> > for a Linux-based high-volume scan interface where I can save the
> > documents as multipage PDFs and slot them as BLOBs in the database
> > with the metadata, using MySQL as the DB, and then develop my
> > system around that?
>
> I personally don't know of any such system, but perhaps someone else
> will. Writing something like that would not be very hard: gscan2pdf
> already does the front half of it, and it is Perl, so it is very easy
> to extend to a backend db.
>
> > My scanner requirements are page sizes up to and including A3,
> > black and white, 200dpi, multipage PDF (embedded metadata), and
> > volumes as mentioned below. At the moment I am leasing about 200
> > Canon DR-9080Cs. YES!! The Fujitsu does look like a strong option
> > for outright purchase.
>
> Oh, the Canon backend I am working on is to support the DR-9080C, so
> you might be able to continue your lease :)
>
> allan
> --
> "The truth is an offense, but not a sin"
Scanning 5000 pages/day/scanner works out to around 10 ppm: 5000 pages /
(8 hr x 60 min/hr) = ~10.4 pages per minute, assuming an 8-hour work day.
Many low-end scanners can achieve more than 10 ppm of A4 at 200dpi b/w.
Document prep and archiving will probably be at least as important as the
actual time spent scanning. It's certainly faster to scan one 100-page
document and upload it to a database than to scan 100 1-page documents and
upload them to a db with individual filenames and index field data.
A search on www.sourceforge.net for "document archiving scanning" brings
up a number of open-source solutions such as OpenLSD, Maarch Archiving
DMS, and Maxview Document Management. There are also a number of document
management systems (DMS) such as KnowledgeTree (a PHP 4-based web
application) which could be used to store scanned documents in a db
backend.
I myself have used archiveindex's Repository program to store scanned
documents in a db. Repository is a C-based CGI program which stores
document paths and filenames in a hierarchical BerkeleyDB 4 file. Scanned
documents are uploaded (and renamed to a hex representation of the system
time; see the sketch below) to user-defined storage areas with size
quotas (each storage area can be set up to hold a maximum of, e.g., one
CD or one DVD worth of files). The web app presents the user with a
hierarchical storage scheme with folder icons and thumbnails of documents
(PDF, JPEG, DVI, ...). There are also scripts which can be run from the
command line (or from other programs) to do, e.g., batch queries or batch
uploads/downloads. Scanned documents are automatically tokenized and
indexed. There is also an optional OCR program (for Windows only). The
web app provides access to your db from any browser (but you can restrict
access to your webserver however you like). Some drawbacks of the program
are:
1) Not open source (your data is locked into a proprietary format); the
greater your financial investment in scanning, the riskier the
proprietary lock-in feels.
2) 32-bit binaries, with dependencies on Berkeley DB 4.0, netpbm, and
certain system utilities (e.g. the keylock program parses the output of
'ifconfig eth0 | grep HWaddr' and won't work with
sys-apps/net-tools-1.60-r13 or greater).
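As an aside, the rename-to-hex-of-system-time trick is easy to
reproduce. This is only my guess at the scheme in Perl (not Repository's
actual code), and the 700 MB quota is just an example:

#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Copy qw(move);

my ($src, $storage_dir) = @ARGV;
my $quota = 700 * 1024 * 1024;   # e.g. one CD worth of storage

# Bytes already used in the storage area.
my $used = 0;
$used += (-s $_ || 0) for glob("$storage_dir/*");

my $size = -s $src;
die "cannot stat $src\n" unless defined $size;
die "storage area full\n" if $used + $size > $quota;

# Rename to a hex representation of the current system time, keeping
# the original extension. (A real implementation would have to guard
# against two uploads landing in the same second.)
my (undef, undef, $ext) = fileparse($src, qr/\.[^.]*$/);
my $dest = sprintf('%s/%08x%s', $storage_dir, time(), $ext);
move($src, $dest) or die "move to $dest failed: $!";
print "uploaded as $dest\n";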
It should probably be fairly straightforward to implement something
similar using modern open-source tools such as the Catalyst framework or
Ruby on Rails for the web-based frontend and MySQL or PostgreSQL for the
backend.
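The retrieval side would be just as small. A minimal CGI sketch (again
assuming the hypothetical 'documents' table from the earlier example)
that serves a stored PDF back to the browser:

#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use DBI;

my $q  = CGI->new;
my $id = $q->param('id');
die "numeric id required\n" unless defined $id && $id =~ /^\d+$/;

my $dbh = DBI->connect('DBI:mysql:database=archive;host=localhost',
                       'scanuser', 'secret', { RaiseError => 1 });

# Fetch the BLOB for one document.
my ($title, $pdf) = $dbh->selectrow_array(
    'SELECT title, pdf FROM documents WHERE id = ?', undef, $id);
die "no such document\n" unless defined $pdf;

# Hand the PDF straight back to the browser.
print $q->header(-type           => 'application/pdf',
                 -attachment     => "$title.pdf",
                 -Content_Length => length $pdf);
binmode STDOUT;
print $pdf;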