[Python-modules-team] Bug#948467: python3-feedparser: Handling of invalid XHTML differs from upstream package
Val Lorentz
progval at progval.net
Wed Jan 8 23:59:28 GMT 2020
Package: python3-feedparser
Version: 5.2.1-1
Severity: normal
Dear maintainer(s),
The attached script uses feedparser to parse an invalid XHTML document.
If feedparser is installed from PyPI with pip, then the script
exits without error.
If feedparser is installed from the Debian 10 repositories (or Arch Linux,
I am told), it fails with: "TypeError: startswith first arg must be bytes
or a tuple of bytes, not str" (full traceback attached).
In all cases, feedparser 5.2.1 is used (5.2.1-1 on Debian).
I did not investigate further, but this might be caused by a different
version of sgmllib (bundled in Debian's python3-feedparser package).
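For reference, the error message quoted above is the one Python 3 produces when bytes.startswith() is called with a str argument. The sketch below reproduces only that message and does not involve feedparser; the value of `piece` is hypothetical, since the actual data handled by the parser is not shown in this report.

```python
# Minimal illustration, independent of feedparser.
# Assumption: `piece` is a made-up bytes value standing in for whatever the
# parser was holding when the error occurred.
piece = b"</content>"

try:
    # A str prefix against a bytes object raises TypeError in Python 3.
    piece.startswith("</")
except TypeError as exc:
    message = str(exc)
    print(message)

# Comparing like with like avoids the mismatch: either use a bytes prefix
# or decode the bytes to str first.
is_end_tag_bytes = piece.startswith(b"</")
is_end_tag_str = piece.decode("utf-8").startswith("</")
```

This only shows how the message can arise from a bytes/str mismatch; it does not establish where the bytes value comes from in the Debian package.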
-- System Information:
Debian Release: 10.2
APT prefers oldstable-debug
APT policy: (500, 'oldstable-debug'), (500, 'stable'), (500,
'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: armhf
Kernel: Linux 4.19.0-6-amd64 (SMP w/4 CPU cores)
Kernel taint flags: TAINT_DIE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8),
LANGUAGE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages python3-feedparser depends on:
ii python3 3.7.3-1
python3-feedparser recommends no packages.
python3-feedparser suggests no packages.
-- no debconf information
-------------- next part --------------
A non-text attachment was scrubbed...
Name: feedparser_invalid_xhtml.py
Type: text/x-python
Size: 392 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/python-modules-team/attachments/20200109/c895e77b/attachment.py>
-------------- next part --------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 352, in finish_endtag
    method = getattr(self, 'end_' + tag)
AttributeError: '_LooseFeedParser' object has no attribute 'end_content'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "feedparser_invalid_xhtml.py", line 19, in <module>
    feedparser.parse(data)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 3972, in parse
    feedparser.feed(data.decode('utf-8', 'replace'))
  File "/usr/lib/python3/dist-packages/feedparser.py", line 2131, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 98, in feed
    self.goahead(0)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 137, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 314, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 354, in finish_endtag
    self.unknown_endtag(tag)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 704, in unknown_endtag
    method()
  File "/usr/lib/python3/dist-packages/feedparser.py", line 1840, in _end_content
    value = self.popContent('content')
  File "/usr/lib/python3/dist-packages/feedparser.py", line 1011, in popContent
    value = self.pop(tag)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 863, in pop
    if piece.startswith('</'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str