[DRE-maint] Bug#534721: libhpricot-ruby1.8: Hpricot's XML parser fails to parse simple, valid XML
T Chan
something-bz at sodium.serveirc.com
Fri Jun 26 17:16:08 UTC 2009
Package: libhpricot-ruby1.8
Version: 0.8-2
Severity: grave
Justification: renders package unusable
This bug also applies to libhpricot-ruby1.9.
Problems:
- Valid XML is rendered invalid.
- XML is no longer parseable.
- Invalid XML is not rejected by default, although the XML standard requires it. (minor)
Workaround:
$ aptitude install libhpricot-ruby1.8=0.6-2
Discussion:
Closing tags are sometimes not parsed correctly, causing the parser to "helpfully" add closing tags. Whether this happens seems to be pseudorandom:
$ ruby -e "require 'hpricot'; print Hpricot.XML('<aaaa></aaaa>')"
<aaaa></aaaa>
$ ruby -e "require 'hpricot'; print Hpricot.XML('<zzzz></zzzz>')"
<zzzz></zzzz></zzzz>
The effect is similar to the (incorrect) behaviour when it detects malformed XML:
$ ruby -e "require 'hpricot'; print Hpricot.XML('<a></b>')"
<a></b></a>
$ ruby -e "require 'hpricot'; print Hpricot.XML('<a>b')"
<a>b</a>
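As a rough illustration of the behaviour expected here, the sketch below checks only that opening and closing tags nest correctly. It is not Hpricot code and not a full XML parser: the helper name balanced_tags? is made up for this example, and comments, CDATA and a quoted ">" inside attributes are ignored. It only shows that inputs such as <a></b> and <a>b fail a basic nesting check, so a conforming parser should report them as errors rather than repair them.

```ruby
# Minimal tag-nesting check (hypothetical helper, NOT a full XML parser).
# Returns true if every closing tag matches the most recently opened
# element and nothing is left open at the end of the input.
def balanced_tags?(xml)
  stack = []
  xml.scan(%r{<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>}) do |closing, name, self_closing|
    if closing == '/'
      # A closing tag must match the innermost open element.
      return false unless stack.pop == name
    elsif self_closing != '/'
      # An ordinary opening tag; self-closing tags need no bookkeeping.
      stack.push(name)
    end
  end
  stack.empty?
end

# Inputs taken from this report.
['<aaaa></aaaa>', '<zzzz></zzzz>', '<a></b>', '<a>b'].each do |xml|
  verdict = balanced_tags?(xml) ? 'well-formed nesting' : 'should be rejected'
  puts "#{xml}: #{verdict}"
end
```

The first two inputs pass the check, which is why the extra </zzzz> in the 0.8-2 output is a parser fault and not a property of the input.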
The unparsed tag appears to be treated like <zzzz/>:
$ ruby -e "require 'hpricot'; print Hpricot.XML('<zzzz></zzzz>').search('/zzzz')"
<zzzz></zzzz></zzzz>
$ ruby -e "require 'hpricot'; print Hpricot.XML('<zzzz></zzzz>').search('/zzzz/zzzz')"
</zzzz>
This causes the nesting to break, rendering most XML completely unparseable:
$ ruby -e "require 'hpricot'; print Hpricot.XML('<a><zzzz></zzzz><b></b></a>')"
<a><zzzz></zzzz><b></b></zzzz></a>
$ ruby -e "require 'hpricot'; print Hpricot.XML('<a><zzzz></zzzz><b></b></a>').search('/a/b')"
(no output)
$ ruby -e "require 'hpricot'; print Hpricot.XML('<a><zzzz></zzzz><b></b></a>').search('/a/zzzz/b')"
<b></b>
This might be related to how Hpricot treats unrecognized closing tags.
0.6-2 closes the correct tag, ignoring the contents of the closing tag (this is also invalid behaviour for an XML parser):
$ ruby -e "require 'hpricot'; print Hpricot.XML('<a></b>')"
<a></a>
0.8-2 is broken as above:
$ ruby -e "require 'hpricot'; print Hpricot.XML('<a></b>')"
<a></b></a>
I suspect the problem is in hpricot_scan.so, but hpricot_scan.c is full of auto-generated code.
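The well-formed inputs used above could be collected into a small round-trip check when comparing package versions. This is only a sketch: round_trip_failures and ROUND_TRIP_CASES are names invented here, and the check assumes that printing the result of Hpricot.XML (as in the one-liners above) yields the serialised document. The Hpricot part is skipped if the library is not installed.

```ruby
# Well-formed inputs from this report; each should serialise back to
# exactly the same string.
ROUND_TRIP_CASES = [
  '<aaaa></aaaa>',
  '<zzzz></zzzz>',
  '<a><zzzz></zzzz><b></b></a>'
].freeze

# `parser` is any callable taking an XML string and returning an object
# whose #to_s is the serialised document. Returns a hash mapping each
# failing input to the output that was actually produced.
def round_trip_failures(cases, parser)
  cases.each_with_object({}) do |xml, failures|
    out = parser.call(xml).to_s
    failures[xml] = out unless out == xml
  end
end

begin
  require 'hpricot'
  failures = round_trip_failures(ROUND_TRIP_CASES, ->(xml) { Hpricot.XML(xml) })
  if failures.empty?
    puts 'all cases round-trip cleanly'
  else
    failures.each { |xml, out| puts "FAIL: #{xml} became #{out}" }
  end
rescue LoadError
  warn 'hpricot is not installed; skipping the Hpricot check'
end
```

Going by the output shown above, 0.8-2 would be expected to fail the second and third cases.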
-- System Information:
Debian Release: squeeze/sid
APT prefers testing
APT policy: (990, 'testing'), (500, 'stable')
Architecture: i386 (x86_64)
Kernel: Linux 2.6.26-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash
Versions of packages libhpricot-ruby1.8 depends on:
ii libc6 2.9-12 GNU C Library: Shared libraries
ii libruby1.8 1.8.7.174-1 Libraries necessary to run Ruby 1.
libhpricot-ruby1.8 recommends no packages.
libhpricot-ruby1.8 suggests no packages.
-- no debconf information