[Python-modules-team] Bug#948467: python3-feedparser: Handling of invalid XHTML differs from upstream package

Val Lorentz progval at progval.net
Wed Jan 8 23:59:28 GMT 2020


Package: python3-feedparser
Version: 5.2.1-1
Severity: normal


Dear maintainer(s),

The attached script uses feedparser to parse an invalid XHTML document.

If feedparser is installed from PyPI with pip, then the script exits
without error.

If feedparser is installed from Debian 10 repositories (or Archlinux, I
am told), it errors with: "TypeError: startswith first arg must be bytes
or a tuple of bytes, not str" (full traceback attached).

In all cases, feedparser 5.2.1 is used (5.2.1-1 on Debian).


I did not investigate further, but this might be caused by a different
version of sgmllib (bundled in Debian's python3-feedparser package).
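
For reference, the TypeError itself is what Python 3 raises when
bytes.startswith() is given a str prefix. The following is only a minimal
sketch, independent of feedparser; the function names are hypothetical, and
it does not establish where the bytes value comes from in the Debian package:

```python
def reproduce_type_error():
    """Trigger the same kind of error as in the traceback, without feedparser."""
    piece = b'</content>'          # assumption: a bytes value where str was expected
    try:
        piece.startswith('</')     # str prefix against a bytes object
    except TypeError as exc:
        return str(exc)
    return None


def starts_with_close_tag(piece):
    """Hypothetical type-tolerant check: pick a prefix matching the type of piece."""
    prefix = b'</' if isinstance(piece, bytes) else '</'
    return piece.startswith(prefix)


if __name__ == '__main__':
    print(reproduce_type_error())
    print(starts_with_close_tag(b'</content>'), starts_with_close_tag('</content>'))
```

This only shows that a bytes/str mismatch is sufficient to produce the
message; whether the bundled sgmllib is the source of the bytes remains
unverified.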



-- System Information:
Debian Release: 10.2
  APT prefers oldstable-debug
  APT policy: (500, 'oldstable-debug'), (500, 'stable'), (500,
'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: armhf

Kernel: Linux 4.19.0-6-amd64 (SMP w/4 CPU cores)
Kernel taint flags: TAINT_DIE, TAINT_OOT_MODULE, TAINT_UNSIGNED_MODULE
Locale: LANG=fr_FR.UTF-8, LC_CTYPE=fr_FR.UTF-8 (charmap=UTF-8),
LANGUAGE=fr_FR.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages python3-feedparser depends on:
ii  python3  3.7.3-1

python3-feedparser recommends no packages.

python3-feedparser suggests no packages.

-- no debconf information
-------------- next part --------------
A non-text attachment was scrubbed...
Name: feedparser_invalid_xhtml.py
Type: text/x-python
Size: 392 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/python-modules-team/attachments/20200109/c895e77b/attachment.py>
-------------- next part --------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 352, in finish_endtag
    method = getattr(self, 'end_' + tag)
AttributeError: '_LooseFeedParser' object has no attribute 'end_content'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "feedparser_invalid_xhtml.py", line 19, in <module>
    feedparser.parse(data)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 3972, in parse
    feedparser.feed(data.decode('utf-8', 'replace'))
  File "/usr/lib/python3/dist-packages/feedparser.py", line 2131, in feed
    sgmllib.SGMLParser.feed(self, data)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 98, in feed
    self.goahead(0)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 137, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 314, in parse_endtag
    self.finish_endtag(tag)
  File "/usr/lib/python3/dist-packages/feedparser_debian/sgmllib3.py", line 354, in finish_endtag
    self.unknown_endtag(tag)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 704, in unknown_endtag
    method()
  File "/usr/lib/python3/dist-packages/feedparser.py", line 1840, in _end_content
    value = self.popContent('content')
  File "/usr/lib/python3/dist-packages/feedparser.py", line 1011, in popContent
    value = self.pop(tag)
  File "/usr/lib/python3/dist-packages/feedparser.py", line 863, in pop
    if piece.startswith('</'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str


