[Python-modules-team] Bug#735837: python-html5lib 0.999-2 breaks planet-venus

Mon Jan 20 15:18:13 UTC 2014

Control: affects -1 + planet-venus

(CC-ing planet-venus maintainer)

Hi.

Michael Stapelberg <stapelberg at debian.org> writes:

> python-html5lib seems to have changed its API in the 0.999 upstream
> release.
>

Indeed. Sorry about the breakage.

https://github.com/html5lib/html5lib-python/blob/master/CHANGES.rst#10b1
lists a few of the changes that you've just run into. See my comments
below.

> I get the following error when running planet-venus:
>
> ERROR:planet.runner:Error processing http://blogs.noname-ev.de/commandline-tools/feeds/index.rss2
> ERROR:planet.runner:AttributeError: 'module' object has no attribute 'TreeBuilder'
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
>     writeCache(uri, feed_info, data)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 279, in writeCache
>     reconstitute.source(xdoc.documentElement,data.feed,data.bozo,data.version)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/reconstitute.py", line 231, in source
>     content(xsource, 'subtitle', source.get('subtitle_detail',None), bozo)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/reconstitute.py", line 167, in content
>     parser = html5parser.HTMLParser(tree=dom.TreeBuilder)
>
> This error can be addressed by patching planet-venus as follows:
>
> --- i/planet/reconstitute.py
> +++ w/planet/reconstitute.py
> @@ -18,6 +18,7 @@ from xml.sax.saxutils import escape
>  from xml.dom import minidom, Node
>  from html5lib import html5parser
>  from html5lib.treebuilders import dom
> +from html5lib import treebuilders
>  import planet, config
>  
>  try:
> @@ -164,7 +165,7 @@ def content(xentry, name, detail, bozo):
>              bozo=1
>  
>      if detail.type.find('xhtml')<0 or bozo:
> -        parser = html5parser.HTMLParser(tree=dom.TreeBuilder)
> +        parser = html5parser.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
>          html = parser.parse(xdiv % detail.value, encoding="utf-8")
>          for body in html.documentElement.childNodes:
>              if body.nodeType != Node.ELEMENT_NODE: continue
>

Indeed, the upstream changelog says:

 "* Removed default DOM treebuilder, so ``html5lib.treebuilders.dom`` is
   no longer supported. ``html5lib.treebuilders.getTreeBuilder("dom")``
   will return the default DOM treebuilder, which uses
   ``xml.dom.minidom``."
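
As for compatibility code, here is a minimal sketch of what a lookup
helper could look like. The helper name get_dom_treebuilder and the two
stub objects are made up for illustration; only getTreeBuilder("dom")
and dom.TreeBuilder come from the changelog and the traceback above.
Judging from the unchanged context lines of scrub.py, getTreeBuilder
already exists in the older release as well, so the fallback branch may
never be needed in practice.

```python
from types import SimpleNamespace


def get_dom_treebuilder(treebuilders):
    """Return a DOM treebuilder from an html5lib-like ``treebuilders`` module.

    Prefers the getTreeBuilder("dom") factory named in the changelog and
    falls back to the legacy ``treebuilders.dom.TreeBuilder`` attribute.
    """
    factory = getattr(treebuilders, "getTreeBuilder", None)
    if factory is not None:
        return factory("dom")
    # Legacy layout: the attribute that reconstitute.py relied on.
    return treebuilders.dom.TreeBuilder


# Hypothetical stand-ins for the two module layouts (not real html5lib
# objects), so that the helper can be exercised without html5lib.
class NewBuilder(object):
    pass


class OldBuilder(object):
    pass


new_style = SimpleNamespace(getTreeBuilder=lambda name: NewBuilder)
old_style = SimpleNamespace(dom=SimpleNamespace(TreeBuilder=OldBuilder))

print(get_dom_treebuilder(new_style).__name__)
print(get_dom_treebuilder(old_style).__name__)
```

In reconstitute.py this would presumably be called with the real module,
i.e. html5parser.HTMLParser(tree=get_dom_treebuilder(treebuilders)).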

> But, even after that is fixed, there are errors which I wasn’t able to
> fix:
>
> ERROR:planet.runner:Error processing https://raumzeitlabor.de/feed/
> ERROR:planet.runner:AttributeError: 'module' object has no attribute 'XHTMLSerializer'
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
>     writeCache(uri, feed_info, data)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
>     scrub.scrub(feed_uri, data)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 137, in scrub
>     xhtml = serializer.XHTMLSerializer(inject_meta_charset = False)
>
> The XHTMLSerializer seems to be gone entirely. I am not sure whether the
> HTMLSerializer is enough for what planet-venus does. 

 "* Removed the ``XHTMLSerializer`` as it never actually guaranteed its
   output was well-formed XML, and hence provided little of use."

Hmm... I don't know what can be done here.
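
One untested idea, purely as a sketch: since getTreeBuilder("dom")
builds xml.dom.minidom nodes, minidom's own toxml() might be enough to
get well-formed output without any html5lib serializer. The function
name fragment_to_xml and the hand-built fragment below are assumptions
for illustration; I have not checked this against planet-venus or
against what parseFragment() really returns.

```python
from xml.dom import minidom


def fragment_to_xml(fragment):
    """Serialize the children of a minidom node as XML text."""
    # Serialize child by child; each child's toxml() escapes text and
    # writes empty elements in self-closing form.
    return "".join(child.toxml() for child in fragment.childNodes)


# Hand-built fragment standing in for the parser's output (assumption:
# the real input would come from p.parseFragment(...) in scrub.py).
doc = minidom.Document()
fragment = doc.createDocumentFragment()
para = doc.createElement("p")
para.appendChild(doc.createTextNode("a & b"))
fragment.appendChild(para)
fragment.appendChild(doc.createElement("br"))

print(fragment_to_xml(fragment))
```

Whether this output is acceptable for what scrub.py does with
node['value'] afterwards would still need to be tested on real feeds.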

> Just using that did
> not work for me, though, I get exceptions with this patch:
>
> --- i/planet/scrub.py
> +++ w/planet/scrub.py
> @@ -131,10 +131,11 @@ def scrub(feed_uri, data):
>              # Run this through HTML5's serializer
>              from html5lib import html5parser, sanitizer, treebuilders
>              from html5lib import treewalkers, serializer
> +            from html5lib.serializer import HTMLSerializer
>              p = html5parser.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
>                tree=treebuilders.getTreeBuilder('dom'))
>              doc = p.parseFragment(node.value, encoding='utf-8')
> -            xhtml = serializer.XHTMLSerializer(inject_meta_charset = False)
> +            xhtml = HTMLSerializer(inject_meta_charset = False)
>              walker = treewalkers.getTreeWalker('dom')
> -            tree = xhtml.serialize(walker(doc), encoding='utf-8')
> +            tree = xhtml.render(walker(doc), encoding='utf-8')
>              node['value'] = ''.join([str(token) for token in tree])
>
> The error is:
>
> ERROR:planet.runner:Error processing http://soup.wobbl.es/rss
> ERROR:planet.runner:TypeError: 'NoneType' object has no attribute '__getitem__'
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
>     writeCache(uri, feed_info, data)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
>     scrub.scrub(feed_uri, data)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 140, in scrub
>     tree = xhtml.render(walker(doc), encoding='utf-8')
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 307, in render
>     return b"".join(list(self.serialize(treewalker, encoding)))
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 199, in serialize
>     for token in treewalker:
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 18, in __iter__
>     type = token["type"]
>
> I then tried s/dom/lxml/ and installed python-lxml, but no dice:
>
> ERROR:planet.runner:Error processing http://soup.wobbl.es/rss
> ERROR:planet.runner:AttributeError: DocumentFragment instance has no attribute 'tag'
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
>     writeCache(uri, feed_info, data)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
>     scrub.scrub(feed_uri, data)
> ERROR:planet.runner:  File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 140, in scrub
>     tree = xhtml.render(walker(doc), encoding='utf-8')
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 307, in render
>     return b"".join(list(self.serialize(treewalker, encoding)))
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 199, in serialize
>     for token in treewalker:
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 17, in __iter__
>     for previous, token, next in self.slider():
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 9, in slider
>     for token in self.source:
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/treewalkers/_base.py", line 144, in __iter__
>     details = self.getNodeDetails(currentNode)
> ERROR:planet.runner:  File "/usr/lib/python2.7/dist-packages/html5lib/treewalkers/lxmletree.py", line 145, in getNodeDetails
>     elif node.tag == etree.Comment:
>
> Any idea on how to fix planet-venus? Would it make sense to add some
> compatibility code?
>

I don't have a clue. Actually, I haven't played a lot with html5lib
myself :-(

However, have you checked that the code in planet-venus is the latest
upstream version? The Debian changelog seems to be lagging behind
somewhat, although I can't see any notices of upstream releases either,
only commits ;) Unless there has been a fork or some other trick, I see
quite a few differences between what is at
https://github.com/rubys/venus.git and the current source package in
unstable.

Upstream seems to ship a copy of html5lib in there that looks more
recent, so there may be hope.

Maybe you could test your feeds with this latest upstream code and check
whether updating planet-venus for jessie looks like an option?

Looking forward to getting your feedback.

Best regards,
-- 
Olivier BERGER 
http://www-public.telecom-sudparis.eu/~berger_o/ - OpenPGP-Id: 2048R/5819D7E8
Ingenieur Recherche - Dept INF
Institut Mines-Telecom, Telecom SudParis, Evry (France)