[xml/sgml-pkgs] Bug#993638: Bug#993638: libxml2: XHTML 1.0 validation is broken

Vincent Lefevre vincent at vinc17.net
Mon Sep 20 02:18:46 BST 2021


On 2021-09-19 22:59:31 +0200, Mattia Rizzolo wrote:
> On Sun, Sep 19, 2021 at 09:45:19PM +0200, Vincent Lefevre wrote:
> > On 2021-09-19 19:15:54 +0200, Mattia Rizzolo wrote:
> > > I can never manage to download DTDs from w3.org (how could you?!), so,
> > > taking your testcase and a copy of the same DTD:
> > 
> > The DTD is provided by Debian, no need to download it.
> 
> But you need to instruct xmllint to use said DTD, it won't by its own
> decision to pick a random DTD from the filesystem.

No, this is not necessary with a correctly configured system.
This is not a random DTD, but the DTD mentioned in the HTML file,
which has the standard public identifier

  "-//W3C//DTD XHTML 1.0 Strict//EN"

Then libxml2 can find the right file on the local file system via
catalogs. In my case (which is the *default* setup with Debian
packages on my system, i.e. I haven't changed anything about that
in /etc):

/etc/xml/catalog contains

<delegatePublic publicIdStartString="-//W3C//DTD XHTML 1.0" catalog="file:///etc/xml/w3c-dtd-xhtml.xml"/>

so that libxml2 then uses /etc/xml/w3c-dtd-xhtml.xml, which contains

<delegatePublic publicIdStartString="-//W3C//DTD XHTML 1.0 Strict//EN" catalog="file:///usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml"/>

so that libxml2 then uses
/usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml, which contains

<public publicId="-//W3C//DTD XHTML 1.0 Strict//EN" uri="xhtml1-strict.dtd"/>

so that libxml2 gets the file

  /usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd

There is the same mechanism for the .ent files referenced
by xhtml1-strict.dtd, i.e. via public identifiers.
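
The lookup chain above can be sketched in Python. This is a simplified
model, not libxml2's implementation: it handles only <public> and
<delegatePublic>, skips public-identifier normalization, and reads the
catalogs from an in-memory dictionary. The names wrap, CATALOGS and
resolve_public are made up for this sketch, and the <catalog> wrapper
with the OASIS namespace is added only so that the three quoted entries
parse as documents.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"
DTD_DIR = "file:///usr/share/xml/xhtml/schema/dtd/1.0/"
STRICT = "-//W3C//DTD XHTML 1.0 Strict//EN"


def wrap(entry):
    # Wrap one catalog entry in a minimal <catalog> document (sketch only).
    return f'<catalog xmlns="{NS}">{entry}</catalog>'


# In-memory stand-ins for the three catalog files quoted in the message.
CATALOGS = {
    "file:///etc/xml/catalog": wrap(
        '<delegatePublic publicIdStartString="-//W3C//DTD XHTML 1.0"'
        ' catalog="file:///etc/xml/w3c-dtd-xhtml.xml"/>'),
    "file:///etc/xml/w3c-dtd-xhtml.xml": wrap(
        f'<delegatePublic publicIdStartString="{STRICT}"'
        f' catalog="{DTD_DIR}catalog.xml"/>'),
    DTD_DIR + "catalog.xml": wrap(
        f'<public publicId="{STRICT}" uri="xhtml1-strict.dtd"/>'),
}


def resolve_public(public_id, catalog_uri, catalogs=CATALOGS):
    """Return the URI mapped to public_id, or None if nothing matches."""
    root = ET.fromstring(catalogs[catalog_uri])
    # An exact <public> match wins over delegation.
    for entry in root.iter(f"{{{NS}}}public"):
        if entry.get("publicId") == public_id:
            # Relative URIs are resolved against the catalog's own location.
            return urljoin(catalog_uri, entry.get("uri"))
    # Otherwise follow matching <delegatePublic> entries, longest prefix first.
    delegates = [e for e in root.iter(f"{{{NS}}}delegatePublic")
                 if public_id.startswith(e.get("publicIdStartString"))]
    delegates.sort(key=lambda e: len(e.get("publicIdStartString")),
                   reverse=True)
    for entry in delegates:
        target = urljoin(catalog_uri, entry.get("catalog"))
        found = resolve_public(public_id, target, catalogs)
        if found is not None:
            return found
    return None
```

With these stand-in catalogs, resolve_public(STRICT,
"file:///etc/xml/catalog") follows both delegations and ends at
file:///usr/share/xml/xhtml/schema/dtd/1.0/xhtml1-strict.dtd.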

>  I also know how to
> use apt-file myself:
> | % apt-file search xhtml1-strict.dtd
> | dita-ot: /usr/share/dita-ot/demo/h2d/dtd/xhtml1-strict.dtd
> | erlang-erl-docgen: /usr/lib/erlang/lib/erl_docgen-1.1.1/priv/dtd/xhtml1-strict.dtd
> | kate5-data: /usr/share/katexmltools/xhtml1-strict.dtd.xml
> | libpxp-ocaml-dev: /usr/share/doc/libpxp-ocaml-dev/examples/namespaces/xhtml1-strict.dtd.gz
> | librdf-rdfa-parser-perl: /usr/share/perl5/auto/share/dist/RDF-RDFa-Parser/catalogue/www.w3.org/MarkUp/DTD/xhtml1-strict.dtd
> | w3-recs: /usr/share/doc/w3-recs/html/www.w3.org/TR/2002/REC-xhtml1-20020801/DTD/xhtml1-strict.dtd.gz
> | w3c-sgml-lib: /usr/share/xml/w3c-sgml-lib/schema/dtd/REC-xhtml1-20020801/xhtml1-strict.dtd
> | xemacs21-basesupport: /usr/share/xemacs21/xemacs-packages/etc/psgml-dtds/xhtml1-strict.dtd
> | xmlcopyeditor: /usr/share/xmlcopyeditor/dtd/xhtml1-strict.dtd
> | %
> 
> indeed the one I used is the one from xmlcopyeditor (I picked a random
> package, trusting that said .dtd is actually the same as all of the
> above).

The one I'm using is from w3c-dtd-xhtml, apparently no longer
available in Debian (my machine is a Debian/unstable one installed
about 5 years ago, and Debian won't replace the package with
w3c-sgml-lib automatically). In any case, the corresponding files
from w3c-sgml-lib seem to be the same apart from minor differences.

> My system is fine.  That error message is only a red herring due to
> --nonet,

Everything is on the local filesystem. There is no reason to do
any network access! If libxml2 tries to do a network access, this
means that something on your system is broken... perhaps catalogs
that are not set up correctly.

> and indeed the return code of xmllint is 0.

Don't look at the return code of xmllint; it is not reliable.
Even in case of bad usage, it will sometimes return 0:

  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=727075

Validation issues are reported on stderr, e.g. with a working libxml2:

$ xmllint --loaddtd --nonet --noout test.html
test.html:6: parser error : EndTag: '</' not found

^
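
For scripted checks, one way to act on this is to treat any output on
stderr as a failure, whatever the exit status. A minimal Python sketch
follows; run_and_check and DEMO_CMD are hypothetical names, and
DEMO_CMD is only a stand-in for a command that exits with status 0
while still printing an error.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run cmd; report success only if it exits 0 AND stderr is empty."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    ok = proc.returncode == 0 and proc.stderr.strip() == ""
    return ok, proc.stderr


# Stand-in for a tool that reports an error on stderr but still exits 0.
DEMO_CMD = [sys.executable, "-c",
            "import sys; sys.stderr.write('error : demo\\n')"]

# Intended use (assumes xmllint is installed and test.html exists):
#   ok, err = run_and_check(
#       ["xmllint", "--loaddtd", "--nonet", "--noout", "test.html"])
```

The stderr test is deliberately strict: a warning would also count as
a failure, which may or may not be wanted.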

> If you prefer, I can modify the DOCTYPE and do this instead, so there
> won't be "I/O error"s and the return code is clear:
> 
> mattia@warren /tmp/tmp/xml % cat test.html
> <?xml version="1.0" encoding="utf-8"?>
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "file:///tmp/tmp/xml/xhtml1-strict.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head><title>title</title></head>
> <body><p>text</p></body>
> </html>
> mattia@warren /tmp/tmp/xml % xmllint --noout --nonet test.html ; echo $?
> 0

Wrong test. You forgot to load the DTD!

Please try:

  xmllint --loaddtd --noout --nonet test.html

Note: you may also need to copy the 3 .ent files referenced by
the DTD into the same directory:

<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "xhtml-lat1.ent">
%HTMLlat1;

<!ENTITY % HTMLsymbol PUBLIC
   "-//W3C//ENTITIES Symbols for XHTML//EN"
   "xhtml-symbol.ent">
%HTMLsymbol;

<!ENTITY % HTMLspecial PUBLIC
   "-//W3C//ENTITIES Special for XHTML//EN"
   "xhtml-special.ent">
%HTMLspecial;

I have tried that:

$ ls -l /tmp/tmp/xml
total 68
-rw-r--r-- 1 vinc17 vinc17 13484 2012-04-24 22:49:16 xhtml-lat1.ent
-rw-r--r-- 1 vinc17 vinc17  4486 2012-04-24 22:49:16 xhtml-special.ent
-rw-r--r-- 1 vinc17 vinc17 13748 2012-04-24 22:49:16 xhtml-symbol.ent
-rw-r--r-- 1 vinc17 vinc17 25473 2012-04-24 22:49:15 xhtml1-strict.dtd
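
The list of .ent files to copy can also be extracted from the DTD
instead of being read off by eye. A rough Python sketch, assuming the
external parameter entities are all declared in the
PUBLIC "..." "..." form with double quotes, as in the three
declarations quoted in this message (EXTERNAL_PE, DTD_EXCERPT and
external_parameter_entities are names chosen for the sketch):

```python
import re

# Matches: <!ENTITY % name PUBLIC "public id" "system id">
EXTERNAL_PE = re.compile(
    r'<!ENTITY\s+%\s+(\S+)\s+PUBLIC\s+"([^"]*)"\s+"([^"]*)"\s*>')

# The three declarations quoted in the message.
DTD_EXCERPT = """
<!ENTITY % HTMLlat1 PUBLIC
   "-//W3C//ENTITIES Latin 1 for XHTML//EN"
   "xhtml-lat1.ent">
%HTMLlat1;

<!ENTITY % HTMLsymbol PUBLIC
   "-//W3C//ENTITIES Symbols for XHTML//EN"
   "xhtml-symbol.ent">
%HTMLsymbol;

<!ENTITY % HTMLspecial PUBLIC
   "-//W3C//ENTITIES Special for XHTML//EN"
   "xhtml-special.ent">
%HTMLspecial;
"""


def external_parameter_entities(dtd_text):
    """Return (name, public id, system id) for each PUBLIC parameter entity."""
    return [m.groups() for m in EXTERNAL_PE.finditer(dtd_text)]
```

The system identifiers are relative, so they are looked up next to
the DTD unless a catalog maps the public identifier elsewhere.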

With libxml2 2.9.10+dfsg-6.7, strace shows that every file is loaded
from this directory, and I get no output, as expected.

But with libxml2 2.9.12+dfsg-4, I get:

$ xmllint --loaddtd --noout --nonet test.html
error : xmlAddEntity: invalid redeclaration of predefined entity
error : xmlAddEntity: invalid redeclaration of predefined entity

and strace still shows that every file is loaded from this directory.

Something interesting:

openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-lat1.ent", O_RDONLY) = 5
lseek(5, 0, SEEK_CUR)                   = 0
read(5, "<!-- ..........................."..., 8192) = 8192
read(5, "  \"&#215;\" ><!-- multiplication "..., 16384) = 5292
read(5, "", 11092)                      = 0
brk(0x559087649000)                     = 0x559087649000
close(5)                                = 0
[...]
openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-symbol.ent", O_RDONLY) = 5
lseek(5, 0, SEEK_CUR)                   = 0
read(5, "<!-- ..........................."..., 8192) = 8192
read(5, "     rArr can be used for 'impli"..., 16384) = 5556
read(5, "", 10828)                      = 0
close(5)                                = 0
[...]
openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-special.ent", O_RDONLY) = 5
lseek(5, 0, SEEK_CUR)                   = 0
read(5, "<!-- ..........................."..., 8192) = 4486
read(5, "", 3706)                       = 0
write(2, "error : ", 8)                 = 8
write(2, "xmlAddEntity: invalid redeclarat"..., 57) = 57
write(2, "error : ", 8)                 = 8
write(2, "xmlAddEntity: invalid redeclarat"..., 57) = 57
close(5)                                = 0

So the issue seems to occur when reading xhtml-special.ent.
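
The same reading of the trace can be automated. The Python sketch
below is a heuristic under simplifying assumptions: it looks only at
openat, close and write(2, ...) lines in the format shown above, and
it reports every file that is open when something is written to
stderr. TRACE is an abridged copy of the excerpt, and the function
name is made up.

```python
import re

OPEN_RE = re.compile(r'openat\(AT_FDCWD, "([^"]+)", [^)]*\) = (\d+)')
CLOSE_RE = re.compile(r'close\((\d+)\)')
STDERR_WRITE_RE = re.compile(r'write\(2, ')

# Abridged from the strace excerpt in the message.
TRACE = '''\
openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-symbol.ent", O_RDONLY) = 5
close(5)                                = 0
openat(AT_FDCWD, "/tmp/tmp/xml/xhtml-special.ent", O_RDONLY) = 5
write(2, "error : ", 8)                 = 8
close(5)                                = 0
'''.splitlines()


def files_open_during_stderr_writes(lines):
    """Return the set of paths that were open when stderr was written to."""
    open_files = {}   # file descriptor -> path
    suspects = set()
    for line in lines:
        opened = OPEN_RE.match(line)
        if opened:
            open_files[int(opened.group(2))] = opened.group(1)
            continue
        closed = CLOSE_RE.match(line)
        if closed:
            open_files.pop(int(closed.group(1)), None)
            continue
        if STDERR_WRITE_RE.match(line):
            suspects.update(open_files.values())
    return suspects
```

On a full trace, other files (the document itself, the DTD) may still
be open at that point, so the result narrows the search rather than
proving which file is at fault.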

Hmm... there seems to be a subtle difference in xhtml-special.ent:

With the file from w3c-dtd-xhtml:

<!ENTITY quot    "&#34;" ><!-- quotation mark = APL quote, U+0022 ISOnum -->
<!ENTITY amp     "&#38;" ><!-- ampersand, U+0026 ISOnum -->
<!ENTITY lt      "&#60;" ><!-- less-than sign, U+003C ISOnum -->
<!ENTITY gt      "&#62;" ><!-- greater-than sign, U+003E ISOnum -->

But with the file from w3c-sgml-lib:

<!ENTITY lt      "&#38;#60;" ><!-- less-than sign, U+003C ISOnum -->
<!ENTITY gt      "&#62;" ><!-- greater-than sign, U+003E ISOnum -->
<!ENTITY amp     "&#38;#38;" ><!-- ampersand, U+0026 ISOnum -->
<!ENTITY apos    "&#39;" ><!-- The Apostrophe (Apostrophe Quote, APL Quote), U+0027 ISOnum -->
<!ENTITY quot    "&#34;" ><!-- quotation mark (Quote Double), U+0022 ISOnum -->

The errors correspond to amp and lt.

Now, I don't know whether the new libxml2 version is too picky,
or whether there was a real issue with the old entity files (ignored
by all parsers until now?). In the latter case, I think that
there should be a Breaks against w3c-dtd-xhtml.
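
For reference, XML 1.0 (section 4.6, "Predefined Entities") says that
if lt or amp are declared, the replacement text must be a
double-escaped character reference, whereas gt, apos and quot may
also use a plain character reference or the literal character. If
that is the rule the new check enforces (not verified against the
libxml2 source here), it would point to the old entity files. The
Python sketch below applies the rule to a string of declarations; it
is a simplified check (no leading zeros, no external entities), and
the two sample strings are illustrative, not copies of either
package's file.

```python
import re

CODEPOINTS = {"lt": 60, "gt": 62, "amp": 38, "apos": 39, "quot": 34}
MUST_DOUBLE_ESCAPE = {"lt", "amp"}

# General entity with a quoted literal value; "<!ENTITY % ..." is skipped.
ENTITY_DECL = re.compile(
    r"""<!ENTITY\s+(\w+)\s+(?:"([^"]*)"|'([^']*)')\s*>""")


def char_refs(codepoint):
    # Decimal and hexadecimal character references for one code point.
    return {f"&#{codepoint};", f"&#x{codepoint:x};", f"&#x{codepoint:X};"}


def bad_predefined(text):
    """Return names of predefined entities declared against section 4.6."""
    bad = []
    for match in ENTITY_DECL.finditer(text):
        name = match.group(1)
        if name not in CODEPOINTS:
            continue
        value = match.group(2) if match.group(2) is not None else match.group(3)
        refs = char_refs(CODEPOINTS[name])
        # Double escaping: the "&" of the reference is itself written "&#38;".
        allowed = {"&#38;" + ref[1:] for ref in refs}
        if name not in MUST_DOUBLE_ESCAPE:
            allowed |= refs | {chr(CODEPOINTS[name])}
        if value not in allowed:
            bad.append(name)
    return bad


# Illustrative samples: single escaping vs. double escaping of amp and lt.
SINGLE = '<!ENTITY amp "&#38;" >\n<!ENTITY lt "&#60;" >\n<!ENTITY gt "&#62;" >'
DOUBLE = '<!ENTITY amp "&#38;#38;" >\n<!ENTITY lt "&#38;#60;" >\n<!ENTITY gt "&#62;" >'
```

Here bad_predefined(SINGLE) flags amp and lt, and
bad_predefined(DOUBLE) flags nothing.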

One more thing: I've just checked on my Debian/stable machine,
which has only w3c-sgml-lib installed:
"xmllint --loaddtd --nonet --noout" works without any error.
Thus switching from w3c-dtd-xhtml to w3c-sgml-lib should not
cause any issue.

-- 
Vincent Lefèvre <vincent at vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)


