[Python-apps-team] Bug#735837: python-html5lib 0.999-2 breaks planet-venus

Fri Jan 17 20:34:54 UTC 2014

Package: python-html5lib
Version: 0.999-2
Severity: important

(Filing this as “important” since it breaks the entire package for my
 use case and breaks a reverse dependency; feel free to downgrade the
 severity.)

python-html5lib seems to have changed its API in the 0.999 upstream
release.

I get the following error when running planet-venus:

ERROR:planet.runner:Error processing http://blogs.noname-ev.de/commandline-tools/feeds/index.rss2
ERROR:planet.runner:AttributeError: 'module' object has no attribute 'TreeBuilder'
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
    writeCache(uri, feed_info, data)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 279, in writeCache
    reconstitute.source(xdoc.documentElement,data.feed,data.bozo,data.version)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/reconstitute.py", line 231, in source
    content(xsource, 'subtitle', source.get('subtitle_detail',None), bozo)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/reconstitute.py", line 167, in content
    parser = html5parser.HTMLParser(tree=dom.TreeBuilder)

This error can be addressed by patching planet-venus as follows:

--- i/planet/reconstitute.py
+++ w/planet/reconstitute.py
@@ -18,6 +18,7 @@ from xml.sax.saxutils import escape
 from xml.dom import minidom, Node
 from html5lib import html5parser
 from html5lib.treebuilders import dom
+from html5lib import treebuilders
 import planet, config
 
 try:
@@ -164,7 +165,7 @@ def content(xentry, name, detail, bozo):
             bozo=1
 
     if detail.type.find('xhtml')<0 or bozo:
-        parser = html5parser.HTMLParser(tree=dom.TreeBuilder)
+        parser = html5parser.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
         html = parser.parse(xdiv % detail.value, encoding="utf-8")
         for body in html.documentElement.childNodes:
             if body.nodeType != Node.ELEMENT_NODE: continue
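
A version-tolerant variant of this lookup could look roughly like the
sketch below. It is only an illustration: resolve_dom_tree_builder and
the stand-in objects are hypothetical names, and the stand-ins replace
the real html5lib modules so the snippet runs without html5lib
installed. In planet-venus the two arguments would be
html5lib.treebuilders.dom and html5lib.treebuilders.

```python
from types import SimpleNamespace


def resolve_dom_tree_builder(dom_module, treebuilders_module):
    """Return a DOM tree builder class under either module layout."""
    builder = getattr(dom_module, "TreeBuilder", None)
    if builder is not None:
        # Layout that the unpatched reconstitute.py expects.
        return builder
    # Layout seen with 0.999-2: go through the factory function instead.
    return treebuilders_module.getTreeBuilder("dom")


# Stand-ins (hypothetical) so the sketch runs without html5lib.
class FakeBuilder(object):
    pass


old_dom = SimpleNamespace(TreeBuilder=FakeBuilder)
new_dom = SimpleNamespace()  # no TreeBuilder attribute, as in the traceback
factory = SimpleNamespace(getTreeBuilder=lambda name: FakeBuilder)

old_result = resolve_dom_tree_builder(old_dom, factory)
new_result = resolve_dom_tree_builder(new_dom, factory)
```

Since the unchanged context lines of scrub.py already call
treebuilders.getTreeBuilder('dom'), the unconditional replacement in
the patch above may well be sufficient on its own.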

Even with that patch applied, though, there are further errors which I
wasn’t able to fix:

ERROR:planet.runner:Error processing https://raumzeitlabor.de/feed/
ERROR:planet.runner:AttributeError: 'module' object has no attribute 'XHTMLSerializer'
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
    writeCache(uri, feed_info, data)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
    scrub.scrub(feed_uri, data)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 137, in scrub
    xhtml = serializer.XHTMLSerializer(inject_meta_charset = False)

The XHTMLSerializer seems to be gone entirely. I am not sure whether the
HTMLSerializer is enough for what planet-venus does. Simply switching to
it did not work for me, though; with the following patch I get
exceptions:

--- i/planet/scrub.py
+++ w/planet/scrub.py
@@ -131,10 +131,11 @@ def scrub(feed_uri, data):
             # Run this through HTML5's serializer
             from html5lib import html5parser, sanitizer, treebuilders
             from html5lib import treewalkers, serializer
+            from html5lib.serializer import HTMLSerializer
             p = html5parser.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
               tree=treebuilders.getTreeBuilder('dom'))
             doc = p.parseFragment(node.value, encoding='utf-8')
-            xhtml = serializer.XHTMLSerializer(inject_meta_charset = False)
+            xhtml = HTMLSerializer(inject_meta_charset = False)
             walker = treewalkers.getTreeWalker('dom')
-            tree = xhtml.serialize(walker(doc), encoding='utf-8')
+            tree = xhtml.render(walker(doc), encoding='utf-8')
             node['value'] = ''.join([str(token) for token in tree])

The error is:

ERROR:planet.runner:Error processing http://soup.wobbl.es/rss
ERROR:planet.runner:TypeError: 'NoneType' object has no attribute '__getitem__'
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
    writeCache(uri, feed_info, data)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
    scrub.scrub(feed_uri, data)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 140, in scrub
    tree = xhtml.render(walker(doc), encoding='utf-8')
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 307, in render
    return b"".join(list(self.serialize(treewalker, encoding)))
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 199, in serialize
    for token in treewalker:
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 18, in __iter__
    type = token["type"]
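
One possible reading of this traceback, not confirmed against the
html5lib source, is that the tree walker produced no tokens at all
(for example for an empty fragment) and that the look-ahead generator
in the optional-tags filter still emitted one final triple whose
middle element is None. The sketch below is a hypothetical
reconstruction of such a generator, not the library's code; slider and
token_types are illustrative names.

```python
def slider(source):
    """Yield (previous, token, next) triples with one token of look-ahead."""
    previous1 = previous2 = None
    for token in source:
        if previous1 is not None:
            yield previous2, previous1, token
        previous2 = previous1
        previous1 = token
    # The final triple is emitted unconditionally, even for empty input.
    yield previous2, previous1, None


def token_types(source):
    """Collect the "type" field of every token the slider hands out."""
    return [token["type"] for _, token, _ in slider(source)]


# With at least one token, every middle element is a real token.
non_empty = token_types([{"type": "StartTag"}, {"type": "EndTag"}])

# With no tokens, the only triple is (None, None, None), so indexing fails.
try:
    token_types([])
    empty_error = None
except TypeError as exc:
    empty_error = exc
```

If that is what happens here, skipping serialization for empty values
in scrub.py might avoid the exception, but that is untested.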

I then tried s/dom/lxml/ and installed python-lxml, but no dice:

ERROR:planet.runner:Error processing http://soup.wobbl.es/rss
ERROR:planet.runner:AttributeError: DocumentFragment instance has no attribute 'tag'
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
    writeCache(uri, feed_info, data)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
    scrub.scrub(feed_uri, data)
ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 140, in scrub
    tree = xhtml.render(walker(doc), encoding='utf-8')
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 307, in render
    return b"".join(list(self.serialize(treewalker, encoding)))
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 199, in serialize
    for token in treewalker:
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 17, in __iter__
    for previous, token, next in self.slider():
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 9, in slider
    for token in self.source:
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/treewalkers/_base.py", line 144, in __iter__
    details = self.getNodeDetails(currentNode)
ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/treewalkers/lxmletree.py", line 145, in getNodeDetails
    elif node.tag == etree.Comment:

Any idea on how to fix planet-venus? Would it make sense to add some
compatibility code?
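
As a starting point for discussion, such compatibility code for the
serializer might look like the sketch below. It is untested against
planet-venus. make_xhtml_serializer, XHTML_STYLE_OPTIONS and the
stand-in classes are hypothetical, and the option values are an
assumption about which HTMLSerializer settings approximate the old
XHTMLSerializer; they should be checked against the installed html5lib
before relying on them.

```python
from types import SimpleNamespace

# Assumption: HTMLSerializer options that give XML-friendly output,
# roughly what the removed XHTMLSerializer is believed to have set.
XHTML_STYLE_OPTIONS = {
    "quote_attr_values": True,
    "minimize_boolean_attributes": False,
    "use_trailing_solidus": True,
    "omit_optional_tags": False,
}


def make_xhtml_serializer(serializer_module, **options):
    """Prefer XHTMLSerializer if it exists, else configure HTMLSerializer."""
    cls = getattr(serializer_module, "XHTMLSerializer", None)
    if cls is not None:
        return cls(**options)
    merged = dict(XHTML_STYLE_OPTIONS)
    merged.update(options)  # caller-supplied options take precedence
    return serializer_module.HTMLSerializer(**merged)


# Stand-ins (hypothetical) so the sketch runs without html5lib.
class FakeSerializer(object):
    def __init__(self, **options):
        self.options = options


class FakeXHTMLSerializer(FakeSerializer):
    pass


old_module = SimpleNamespace(HTMLSerializer=FakeSerializer,
                             XHTMLSerializer=FakeXHTMLSerializer)
new_module = SimpleNamespace(HTMLSerializer=FakeSerializer)

old_style = make_xhtml_serializer(old_module, inject_meta_charset=False)
new_style = make_xhtml_serializer(new_module, inject_meta_charset=False)
```

In scrub.py the argument would be html5lib.serializer. Turning off
omit_optional_tags would presumably also keep the optionaltags filter
from the tracebacks above out of the pipeline, but that has not been
verified.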