[DRE-maint] Bug#960229: RFP: pdfbeads -- utility to take scanned page images and convert them to a single PDF file

Rogério Brito rbrito at ime.usp.br
Sun May 10 21:24:12 BST 2020


Package: wnpp
Severity: wishlist

* Package name    : pdfbeads
  Version         : 1.1.1
  Upstream Author : Alexey Kryukov <amkryukov at gmail.com>
* URL             : https://rubygems.org/gems/pdfbeads/versions/1.1.1
* License         : GPL
  Programming Lang: Ruby
  Description     : utility to take scanned page images and convert them to a single PDF file

PDFBeads is a small utility written in Ruby which takes scanned page images
and converts them into a single PDF file. Unlike other PDF creation tools,
PDFBeads attempts to implement the approach typically used for DjVu
books. Its key feature is separating scanned text (typically black, but
indexed images with a small number of colors are also accepted) from
halftone pictures. Each type of graphical data is encoded into its own layer
with a specific compression method and resolution.

The name `PDFBeads' has been selected for the package because building PDF
files from separate images is comparable to threading beads on a string. It
also seems to be a good choice for a Ruby application.

Here are a few operations you can perform with PDFBeads:

* encode B&W images using either the CCITT Group 4 Fax or the JBIG2
  compression method (you'll need Adam Langley's jbig2 utility, available
  at github.com/agl/jbig2enc/, for JBIG2 compression);

* combine halftone or indexed pictures with previously binarized text pages,
  placing them into the background layer. Various compression methods of
  background images (JPEG2000, JPEG or PNG-styled deflate compression) are
  supported;

* split mixed images where binarized text is combined with color or
  grayscale pictures (such pages may be produced with ScanTailor, an
  interactive post-processing tool for scanned pages, available at
  scantailor.sourceforge.net) and encode each layer separately;

* correctly process indexed images with a limited number of colors, encoding
  each color separately into the foreground layer;

* split color images into background and foreground layers (similar to BG44
  and FG44 chunks in a DjVu file) according to a given mask;

* create PDF files with TOC and metadata;

* read text from hOCR files and create a hidden text layer in the PDF file.

Note that PDFBeads is intended for creating PDF files from previously
processed images, so it can't do some operations (e.g. converting color or
grayscale scans to B&W), which should typically be performed with a
dedicated scan processing application such as ScanTailor.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This program is similar in spirit to djvubind (which we already have in our
repositories), but for PDF files, of course.

I tested it and it works quite well with only a few adjustments. To make it
run, I installed ruby-rmagick, then installed the gem via

    gem install pdfbeads

and only a minor modification was needed to make it run with the current
Ruby 2.7 that we have in Debian: I had to remove the line `require 'iconv'`
from the top-level pdfbeads binary, after which everything worked perfectly
fine.
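
For anyone who wants to reproduce the workaround, here is a rough sketch of
the edit as a shell helper. It is only an illustration under assumptions: the
function name strip_iconv_require is made up for this example, the offending
line is assumed to be written as a plain Ruby `require` of iconv, and the
`sed -i.bak` form is assumed to be available (it is on GNU and BSD sed). The
helper keeps a backup copy next to the edited file.

```shell
#!/bin/sh
# Sketch only: strip the line that loads iconv from a Ruby script.
# strip_iconv_require is a hypothetical helper name, not part of pdfbeads.
strip_iconv_require() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "no such file: $script" >&2
        return 1
    fi
    # Delete lines of the form  require 'iconv'  or  require "iconv"
    # (assumed form of the offending line). The unmodified original is
    # kept as "$script.bak".
    sed -i.bak "/^[[:space:]]*require[[:space:]]*['\"]iconv['\"]/d" "$script"
}

# Possible usage (the path lookup is an assumption; adjust to your setup):
#   apt install ruby-rmagick
#   gem install pdfbeads
#   strip_iconv_require "$(gem contents pdfbeads | grep '/bin/pdfbeads$')"
```

The file that needs the edit is the script shipped inside the gem, not the
wrapper that RubyGems generates in its bin directory, which is why the usage
comment looks it up with `gem contents`.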

Since I don't know much Ruby, I guess that it would be best to have people
from the Ruby team maintain and/or package it. I am even willing to
co-maintain it, if necessary, but, again, my knowledge of Ruby is minimal.

I plan on uploading a version of Adam Langley's jbig2 encoder (JBIG2 is an
efficient B/W format that PDFs use) to our repository. I have it packaged
already, but not uploaded; pdfbeads (and other tools that we have packaged,
like OCRmyPDF) can use it if it is installed.


Thanks for any help,

Rogério Brito.

-- 
Rogério Brito : rbrito@{ime.usp.br,gmail.com} : GPG key 4096R/BCFCAAAA
http://cynic.cc/blog/ : github.com/rbrito : profiles.google.com/rbrito
DebianQA: http://qa.debian.org/developer.php?login=rbrito%40ime.usp.br


