[Python-modules-team] Bug#563443: python-pypdf: parsing not robust to whitespace
John V. Belmonte
jbelmonte at debian.org
Sun Jan 3 00:02:17 UTC 2010
Package: python-pypdf
Version: 1.12-2
Severity: normal
While using pdfshuffler on PDF statements from my stock broker, on export I'd
consistently get an exception from pypdf. Note that pdfshuffler's own display,
along with evince, acroread, kpdf, etc. have no problem with these documents.
On inspection, it turns out that pypdf's parsing is rather primitive:
it doesn't handle extra spaces, linefeeds in place of spaces, and so
on. Here is an example of PDF source that causes problems:
9 0 obj
<<
/Type /Font
/Subtype /Type1
/Encoding 4 0 R
/BaseFont /Times-Bold
>> endobj
I will attach a patch that makes parsing more lax about whitespace in a few
places that were significant to my document. However, this is just the tip
of the iceberg. Unfortunately, the pypdf code is written in a rather low-level
fashion, and addressing the problem fully will be a large task.
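To illustrate the kind of leniency meant here, below is a minimal sketch. It is not the attached patch and not pypdf's actual code; the helper names (skip_whitespace, read_token, read_object_header) are made up for the example. It assumes the six whitespace characters defined by the PDF format (NUL, tab, linefeed, formfeed, carriage return, space) and treats any run of them as a single separator:

```python
import io

# The six whitespace bytes defined by the PDF format.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def skip_whitespace(stream):
    """Advance past any run of PDF whitespace; return how many bytes were skipped."""
    skipped = 0
    while True:
        byte = stream.read(1)
        if not byte:
            return skipped  # end of stream
        if byte not in PDF_WHITESPACE:
            stream.seek(-1, io.SEEK_CUR)  # put the non-whitespace byte back
            return skipped
        skipped += 1


def read_token(stream):
    """Read one whitespace-delimited token, ignoring leading whitespace."""
    skip_whitespace(stream)
    token = b""
    while True:
        byte = stream.read(1)
        if not byte or byte in PDF_WHITESPACE:
            return token
        token += byte


def read_object_header(stream):
    """Parse 'N G obj' however the three tokens happen to be separated."""
    num = int(read_token(stream))
    gen = int(read_token(stream))
    keyword = read_token(stream)
    if keyword != b"obj":
        raise ValueError("expected 'obj', got %r" % keyword)
    return num, gen
```

With this approach, "9 0 obj" parses the same way whether the tokens are separated by single spaces, several spaces, or linefeeds. The sketch deliberately ignores PDF delimiter characters such as "<<" and "/", so a real parser would need more than this.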
-- System Information:
Debian Release: squeeze/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: i386 (i686)
Kernel: Linux 2.6.30-2-686 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages python-pypdf depends on:
ii python-support 1.0.6 automated rebuilding support for P
python-pypdf recommends no packages.
python-pypdf suggests no packages.
-- no debconf information
-- debsums errors found:
debsums: changed file /usr/share/python-support/python-pypdf/pyPdf/pdf.py (from python-pypdf package)
debsums: changed file /usr/share/python-support/python-pypdf/pyPdf/generic.py (from python-pypdf package)
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lax_whitespace_1.12.patch
Type: text/x-java
Size: 2747 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/python-modules-team/attachments/20100102/78501fdb/attachment.java>