[Python-apps-team] Bug#735837: python-html5lib 0.999-2 breaks planet-venus
Michael Stapelberg
stapelberg at debian.org
Fri Jan 17 20:34:54 UTC 2014
Package: python-html5lib
Version: 0.999-2
Severity: important
(Filing this as “important” since it breaks the entire package for my
use case and breaks a reverse dependency; feel free to downgrade the
severity.)
python-html5lib seems to have changed its API in the 0.999 upstream
release.
I get the following error when running planet-venus:
ERROR:planet.runner:Error processing http://blogs.noname-ev.de/commandline-tools/feeds/index.rss2
ERROR:planet.runner:AttributeError: 'module' object has no attribute 'TreeBuilder'
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
writeCache(uri, feed_info, data)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 279, in writeCache
reconstitute.source(xdoc.documentElement,data.feed,data.bozo,data.version)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/reconstitute.py", line 231, in source
content(xsource, 'subtitle', source.get('subtitle_detail',None), bozo)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/reconstitute.py", line 167, in content
parser = html5parser.HTMLParser(tree=dom.TreeBuilder)
This error can be addressed by patching planet-venus as follows:
--- i/planet/reconstitute.py
+++ w/planet/reconstitute.py
@@ -18,6 +18,7 @@ from xml.sax.saxutils import escape
from xml.dom import minidom, Node
from html5lib import html5parser
from html5lib.treebuilders import dom
+from html5lib import treebuilders
import planet, config
try:
@@ -164,7 +165,7 @@ def content(xentry, name, detail, bozo):
bozo=1
if detail.type.find('xhtml')<0 or bozo:
- parser = html5parser.HTMLParser(tree=dom.TreeBuilder)
+ parser = html5parser.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
html = parser.parse(xdiv % detail.value, encoding="utf-8")
for body in html.documentElement.childNodes:
if body.nodeType != Node.ELEMENT_NODE: continue
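If planet-venus has to run against both the old and the new
python-html5lib, the lookup could be wrapped in a small helper instead
of being patched in place. The following is only a sketch:
resolve_dom_treebuilder is a hypothetical name, and its two arguments
stand for the html5lib.treebuilders.dom and html5lib.treebuilders
modules, passed in explicitly so that the fallback can be exercised
without html5lib installed.

```python
def resolve_dom_treebuilder(dom_module, treebuilders_module):
    """Return a DOM tree builder class under either html5lib API.

    Hypothetical helper, not part of planet-venus or html5lib.
    """
    # Older releases expose the class directly as dom.TreeBuilder.
    builder = getattr(dom_module, "TreeBuilder", None)
    if builder is not None:
        return builder
    # 0.999 no longer has that attribute; ask the factory instead.
    return treebuilders_module.getTreeBuilder("dom")
```

In reconstitute.py this would presumably be used as
html5parser.HTMLParser(tree=resolve_dom_treebuilder(dom, treebuilders)).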
But even after that is fixed, there are further errors which I wasn’t
able to fix:
ERROR:planet.runner:Error processing https://raumzeitlabor.de/feed/
ERROR:planet.runner:AttributeError: 'module' object has no attribute 'XHTMLSerializer'
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
writeCache(uri, feed_info, data)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
scrub.scrub(feed_uri, data)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 137, in scrub
xhtml = serializer.XHTMLSerializer(inject_meta_charset = False)
The XHTMLSerializer seems to be gone entirely. I am not sure whether the
HTMLSerializer is enough for what planet-venus does. Simply substituting
it did not work for me, though; I get exceptions with this patch:
--- i/planet/scrub.py
+++ w/planet/scrub.py
@@ -131,10 +131,11 @@ def scrub(feed_uri, data):
# Run this through HTML5's serializer
from html5lib import html5parser, sanitizer, treebuilders
from html5lib import treewalkers, serializer
+ from html5lib.serializer import HTMLSerializer
p = html5parser.HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
tree=treebuilders.getTreeBuilder('dom'))
doc = p.parseFragment(node.value, encoding='utf-8')
- xhtml = serializer.XHTMLSerializer(inject_meta_charset = False)
+ xhtml = HTMLSerializer(inject_meta_charset = False)
walker = treewalkers.getTreeWalker('dom')
- tree = xhtml.serialize(walker(doc), encoding='utf-8')
+ tree = xhtml.render(walker(doc), encoding='utf-8')
node['value'] = ''.join([str(token) for token in tree])
The error is:
ERROR:planet.runner:Error processing http://soup.wobbl.es/rss
ERROR:planet.runner:TypeError: 'NoneType' object has no attribute '__getitem__'
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
writeCache(uri, feed_info, data)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
scrub.scrub(feed_uri, data)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 140, in scrub
tree = xhtml.render(walker(doc), encoding='utf-8')
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 307, in render
return b"".join(list(self.serialize(treewalker, encoding)))
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 199, in serialize
for token in treewalker:
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 18, in __iter__
type = token["type"]
I then tried s/dom/lxml/ and installed python-lxml, but no dice:
ERROR:planet.runner:Error processing http://soup.wobbl.es/rss
ERROR:planet.runner:AttributeError: DocumentFragment instance has no attribute 'tag'
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 472, in spiderPlanet
writeCache(uri, feed_info, data)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/spider.py", line 166, in writeCache
scrub.scrub(feed_uri, data)
ERROR:planet.runner: File "/usr/lib/pymodules/python2.7/planet/scrub.py", line 140, in scrub
tree = xhtml.render(walker(doc), encoding='utf-8')
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 307, in render
return b"".join(list(self.serialize(treewalker, encoding)))
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/serializer/htmlserializer.py", line 199, in serialize
for token in treewalker:
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 17, in __iter__
for previous, token, next in self.slider():
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/filters/optionaltags.py", line 9, in slider
for token in self.source:
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/treewalkers/_base.py", line 144, in __iter__
details = self.getNodeDetails(currentNode)
ERROR:planet.runner: File "/usr/lib/python2.7/dist-packages/html5lib/treewalkers/lxmletree.py", line 145, in getNodeDetails
elif node.tag == etree.Comment:
Any idea on how to fix planet-venus? Would it make sense to add some
compatibility code?
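One shape such compatibility code could take for the serializer, as an
untested sketch: use XHTMLSerializer where it still exists, and
otherwise construct an HTMLSerializer with XHTML-like options.
make_xhtml_serializer and XHTML_LIKE_OPTIONS are hypothetical names,
and the option values are an assumption about what the removed class
used to set, so they should be checked against the old html5lib source.
Setting omit_optional_tags to False should in particular keep the
optionaltags filter seen in the tracebacks out of the pipeline.

```python
# Assumed XHTML-style settings; verify against the old XHTMLSerializer.
XHTML_LIKE_OPTIONS = {
    "quote_attr_values": True,
    "minimize_boolean_attributes": False,
    "use_trailing_solidus": True,
    "escape_lt_in_attrs": True,
    "omit_optional_tags": False,
    "escape_rcdata": True,
}


def make_xhtml_serializer(serializer_module, **kwargs):
    """Build an XHTML-style serializer under either html5lib API.

    Hypothetical helper; serializer_module stands for html5lib.serializer.
    """
    xhtml_cls = getattr(serializer_module, "XHTMLSerializer", None)
    if xhtml_cls is not None:
        # Old API: the dedicated class is still there.
        return xhtml_cls(**kwargs)
    # New API: emulate it with HTMLSerializer keyword options,
    # letting explicit caller arguments win over the defaults.
    options = dict(XHTML_LIKE_OPTIONS)
    options.update(kwargs)
    return serializer_module.HTMLSerializer(**options)
```

In scrub.py the call would then read
xhtml = make_xhtml_serializer(serializer, inject_meta_charset=False).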