Bug#1076283: poppler-utils: pdfimages cannot extract images on some PDFs produced by hplip

Manny debbug.poppler-utils at sideload.33mail.com
Sat Jul 13 17:28:31 BST 2024


Package: poppler-utils
Version: 22.12.0-2+b1
Severity: normal
Tags: upstream
X-Debbugs-Cc: debbug.poppler-utils at sideload.33mail.com
control: affects -1 hplip

pdfimages is routinely used to extract the images from scanned PDFs
so that the original images are preserved as-is, with no conversion
or transcoding losses:

  $ pdfimages -all sample.pdf guts

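Before extracting, the image table can be inspected to see which
encodings are present, since that decides what -all writes out. A
minimal sketch, assuming the usual "pdfimages -list" layout (two header
lines, encoding in the ninth column, "enc"); list_encodings is a
made-up helper name, not part of poppler:

```shell
# list_encodings FILE: print each distinct value of the "enc" column
# from `pdfimages -list` (assumed: column 9, after two header lines).
list_encodings() {
    pdfimages -list "$1" | awk 'NR > 2 { print $9 }' | sort -u
}
```

  $ list_encodings sample.pdf

As far as I understand, encodings such as jpeg are copied out unchanged
by -all, while images listed as plain "image" are written as PNG.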
HPLIP is a FOSS driver for HP scanners. I use this command to scan a
document to PDF:

  $ hp-scan --mode=gray --adf -oscan.pdf --device="hpaio:/net/hp_$model?ip=$IPaddress&queue=false"

That produces a grayscale PDF of all pages fed into the ADF. As far as
I can tell the PDF itself is always fine; it always renders correctly
in evince. But for some pages pdfimages sometimes prints error
messages and writes a blank PNG image. Sample output:

===8<----------------------------------------
  $ pdfimages -all extraction_broken.pdf broken
  Syntax Error (281): Unknown compression method in flate stream
  Syntax Error (2406): Illegal character '>'
  Syntax Error (2406): Unknown operator 'E'
  Syntax Error (2416): Unknown operator ']'
  Syntax Error (2561): Unknown operator '^GPBSNeGNT''
  Syntax Error (2566): Illegal character '>'
  Syntax Error (2566): Unknown operator ''1o`SP0VoiGpK"`B""o1'
  Syntax Error (2641): Illegal character '>'
  Syntax Error (2641): Unknown operator 'C'
  Syntax Error (2646): Unknown operator 'CMbtB^&@ZZ$24'
  Syntax Error (2691): Illegal character '>'
  Syntax Error (2691): Unknown operator '@!F0;'
  Syntax Error (2691): Too few (0) args to 'c' operator
===8<----------------------------------------
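
Since only some pages fail, it may help to run the extraction one page
at a time. A rough sketch, assuming pdfinfo (also from poppler-utils)
prints a "Pages:" line and using the -f/-l page-range options of
pdfimages; find_bad_pages is a hypothetical helper name:

```shell
# find_bad_pages FILE: print the number of every page whose extraction
# makes pdfimages emit a "Syntax Error" message on stderr.
find_bad_pages() {
    pdf=$1
    # Page count as reported by pdfinfo; give up if it cannot be read.
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ { print $2 }')
    [ -n "$pages" ] || return 1
    tmp=$(mktemp -d)
    page=1
    while [ "$page" -le "$pages" ]; do
        # Extract a single page; only stderr is passed on to grep.
        if pdfimages -all -f "$page" -l "$page" "$pdf" "$tmp/p$page" 2>&1 >/dev/null |
            grep -q 'Syntax Error'; then
            echo "$page"
        fi
        page=$((page + 1))
    done
    rm -rf "$tmp"
}
```

  $ find_bad_pages extraction_broken.pdf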

I scanned the same page twice using the same hplip command. The two
PDFs should be nearly identical apart from page alignment
differences. Yet pdfimages cannot extract the image from one PDF,
while it has no problem with the other.
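
One way to put the two scans side by side is to count the syntax
errors each file triggers. A minimal sketch; count_syntax_errors is a
made-up helper, and extraction_ok.pdf in the usage line is only a
placeholder for the working sample, whose real file name is not given
here:

```shell
# count_syntax_errors FILE: run a full extraction into a scratch
# directory and print how many "Syntax Error" lines appear on stderr.
count_syntax_errors() {
    tmp=$(mktemp -d)
    # stderr goes to the pipe, stdout is discarded; awk prints 0 if none match.
    pdfimages -all "$1" "$tmp/img" 2>&1 >/dev/null |
        awk '/Syntax Error/ { n++ } END { print n + 0 }'
    rm -rf "$tmp"
}
```

  $ for f in extraction_broken.pdf extraction_ok.pdf; do printf '%s: ' "$f"; count_syntax_errors "$f"; done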

I will attach extraction_broken.pdf to this bug report. Since the
PDFs are about 2 MB, I will attach a working sample from the same
scanner after the bug report has a number.

-- System Information:
Debian Release: 12.5
  APT prefers stable-updates
  APT policy: (990, 'stable-updates'), (990, 'stable-security'), (990, 'stable'), (500, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-28-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages poppler-utils depends on:
ii  libc6          2.36-9+deb12u7
ii  libcairo2      1.16.0-7
ii  libfreetype6   2.12.1+dfsg-5
ii  liblcms2-2     2.14-2
ii  libpoppler126  22.12.0-2+b1
ii  libstdc++6     12.2.0-14

poppler-utils recommends no packages.

poppler-utils suggests no packages.

-- no debconf information
-------------- next part --------------
A non-text attachment was scrubbed...
Name: extraction_broken.pdf
Type: application/pdf
Size: 2203850 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/pkg-freedesktop-maintainers/attachments/20240713/0fa98a02/attachment-0001.pdf>

