[Pkg-gmagick-im-team] Bug#1043109: imagemagick: metadata leaks into body on TIFF → PDF conversion
debbug.imagemagick at sideload.33mail.com
Sun Aug 6 09:31:07 BST 2023
Package: imagemagick
Version: 8:6.9.11.60+dfsg-1.3
Severity: normal
Tags: upstream
X-Debbugs-Cc: debbug.imagemagick at sideload.33mail.com
Metadata from a TIFF file (the “PageName” field in the case below) is
being transferred to the *body* of the target file when “converting” to
a PDF file. This results in a PDF file that falsely appears to have
searchable text. One side effect is that OCR programs raise errors
saying the PDF has already been OCR-processed.
Steps to reproduce:
① Use GIMP to save a TIFF file. The options to save metadata probably
need to be enabled.
② Verify that the “PageName” field is populated:
$ tiffinfo gimp_output.tif
TIFFReadDirectory: Warning, Unknown field with tag 326 (0x146) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 327 (0x147) encountered.
TIFFReadDirectory: Warning, Unknown field with tag 328 (0x148) encountered.
TIFF Directory at offset 0x8 (8)
Image Width: 3544 Image Length: 6240
Resolution: 204, 196 pixels/inch
Bits/Sample: 1
Sample Format: unsigned integer
Compression Scheme: None
Photometric Interpretation: min-is-white
Orientation: row 0 top, col 0 lhs
Samples/Pixel: 1
Rows/Strip: 128
Planar Configuration: single image plane
SubIFD Offsets: 5392
PageName: pg04-5.tiff
Software: GIMP 2.10.22
DateTime: 2023:08:05 20:24:13
XMLPacket (XMP Metadata):
③ Use ImageMagick's convert to produce a PDF:
$ convert gimp_output.tif imagemagick_output.pdf
convert-im6.q16: Unknown field with tag 326 (0x146) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
convert-im6.q16: Unknown field with tag 327 (0x147) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
convert-im6.q16: Unknown field with tag 328 (0x148) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
convert-im6.q16: Unknown field with tag 327 (0x147) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
convert-im6.q16: Unknown field with tag 328 (0x148) encountered. `TIFFReadDirectory' @ warning/tiff.c/TIFFWarnings/985.
④ Use pdf2txt to see the stray text that was injected into the PDF body:
$ pdf2txt imagemagick_output.pdf
pg04-5.tiff
⑤ Use pdfinfo to prove that the TIFF metadata (“PageName:”) did not make it into the PDF metadata:
$ pdfinfo imagemagick_output.pdf
Title: imagemagick_output
Producer: https://imagemagick.org
CreationDate: Sun Aug 6 10:14:34 2023 CEST
ModDate: Sun Aug 6 10:14:34 2023 CEST
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1
Encrypted: no
Page size: 1250.82 x 2292.24 pts
Page rot: 0
File size: 27485613 bytes
Optimized: no
PDF version: 1.7
⑥ Use ocrmypdf to try to make the text in the PDF searchable:
$ ocrmypdf imagemagick_output.pdf searchable.pdf
Scanning contents: 100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 68.27page/s]
Using Tesseract OpenMP thread limit 2
OCR: 0%| | 0.0/1.0 [00:00<?, ?page/s]
PriorOcrFoundError: page already has text! - aborting (use --force-ocr to force OCR; see also help for the arguments --skip-text and --redo-ocr
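The checks in steps ② and ④ can also be scripted. The following is only a minimal sketch: the helper names page_name and has_text are made up for illustration, and the sed pattern assumes the “PageName: value” line layout shown in the tiffinfo output above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any package mentioned in this report.

# Print the PageName value found in tiffinfo output read from stdin.
# Assumes a "PageName: value" line, optionally indented.
page_name() {
    sed -n 's/^[[:space:]]*PageName:[[:space:]]*//p' | head -n 1
}

# Succeed if stdin (e.g. pdf2txt output) contains at least one
# non-whitespace character, i.e. the PDF body carries some text.
has_text() {
    grep -q '[^[:space:]]'
}
```

With the files from the steps above, these would be used as:
$ tiffinfo gimp_output.tif 2>/dev/null | page_name
$ pdf2txt imagemagick_output.pdf | has_text && echo "PDF body contains text"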
Workaround:
For this particular workflow, the workaround is to pass the --force-ocr
option to ocrmypdf. That may not be an option in other situations.
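Where forcing OCR unconditionally is undesirable, one option is to force it only when the text found in the PDF body is nothing more than the leaked PageName value. A minimal sketch of such a guard follows; the function name is_leaked_label is hypothetical, and the comparison assumes that ignoring all whitespace on both sides is acceptable.

```shell
#!/bin/sh
# Hypothetical guard, not part of ocrmypdf or ImageMagick.
# $1: PageName value from the TIFF, $2: text extracted from the PDF.
# Succeeds only when both are equal after removing all whitespace
# and the page name is not empty.
is_leaked_label() {
    want=$(printf '%s' "$1" | tr -d '[:space:]')
    got=$(printf '%s' "$2" | tr -d '[:space:]')
    [ -n "$want" ] && [ "$want" = "$got" ]
}
```

With the files from the steps above, this could be combined with the workaround as:
$ name=$(tiffinfo gimp_output.tif 2>/dev/null | sed -n 's/^ *PageName: *//p')
$ is_leaked_label "$name" "$(pdf2txt imagemagick_output.pdf)" && ocrmypdf --force-ocr imagemagick_output.pdf searchable.pdf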
-- Package-specific info:
ImageMagick program version
---------------------------
animate: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
compare: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
convert: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
composite: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
conjure: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
display: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
identify: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
import: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
mogrify: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
montage: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
stream: ImageMagick 6.9.11-60 Q16 x86_64 2021-01-25 https://imagemagick.org
-- System Information:
Debian Release: 11.5
APT prefers oldstable-updates
APT policy: (990, 'oldstable-updates'), (990, 'oldstable-security'), (990, 'testing'), (990, 'oldstable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 5.10.0-19-amd64 (SMP w/2 CPU threads)
Kernel taint flags: TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages imagemagick depends on:
ii imagemagick-6.q16 8:6.9.11.60+dfsg-1.3
imagemagick recommends no packages.
imagemagick suggests no packages.
-- no debconf information