[xml/sgml-pkgs] Bug#649189: libxml2-utils: Html parser accepts invalid element name (starting with full stop) to document tree
Zsban Ambrus
ambrus at math.bme.hu
Fri Nov 18 17:24:48 UTC 2011
Package: libxml2-utils
Version: 2.7.8.dfsg-2+squeeze1
Severity: normal
Dear maintainer,
When the HTML parser of libxml2 (with the recover option) meets a tag whose
name starts with a full stop, it correctly detects that this is invalid
HTML, but nevertheless accepts the tag with that name into the document
tree. This means that if you output the same document tree as XML, the
output is malformed XML.
Here's an example.
$ xmllint --html --xmlout - <<<'<.m>r'
-:1: HTML parser error : Tag .m invalid
<.m>r
^
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><.m>r
</.m></body></html>
$
The `<.m>' part is not well-formed XML, because XML element names cannot
start with a full stop. You can see this if you try to parse the output
with an XML parser, e.g. with xmllint.
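
The same check can be sketched without xmllint. The Python snippet below is
only an illustration and not part of the original test: it assumes Python 3
and uses the standard library's xml.etree.ElementTree (an expat-based XML
parser, not libxml2). The helper name is_well_formed is made up for this
example, and the element markup is copied from the output above with the
XML declaration and DOCTYPE left out.

```python
import xml.etree.ElementTree as ET

# Element markup copied from the xmllint output above
# (XML declaration and DOCTYPE omitted for brevity).
SERIALIZED = "<html><body><.m>r\n</.m></body></html>"

# The same tree with a legal element name, for comparison.
LEGAL = "<html><body><m>r\n</m></body></html>"


def is_well_formed(text):
    """Return True if text parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print("reported output well-formed:", is_well_formed(SERIALIZED))
    print("legal-name variant well-formed:", is_well_formed(LEGAL))
```

An independent XML parser rejecting the first string while accepting the
second suggests the problem is the element name itself, not the rest of the
serialized tree.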
In case you're interested, I noticed this bug when I tried to parse
some (invalid) HTML documents with the Perl module XML::LibXML (which
uses the libxml2 library as its backend) and output them as XML.
-- System Information:
Debian Release: 6.0.3
APT prefers stable-updates
APT policy: (500, 'stable-updates'), (500, 'stable')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.37 (SMP w/2 CPU cores)
Locale: LANG=C, LC_CTYPE=hu_HU (charmap=ISO-8859-2)
Shell: /bin/sh linked to /bin/bash
Versions of packages libxml2-utils depends on:
ii libc6 2.11.2-10 Embedded GNU C Library: Shared lib
ii libreadline6 6.1-3 GNU readline and history libraries
ii libxml2 2.7.8.dfsg-2+squeeze1 GNOME XML library
libxml2-utils recommends no packages.
libxml2-utils suggests no packages.
-- no debconf information