Bug#1037063: libxml-libxml-perl: Seemingly incorrect handling of escaped characters in patterns

Xan Charbonnet xan at charbonnet.com
Sat Jun 3 04:38:02 BST 2023


Package: libxml-libxml-perl
Version: 2.0134+dfsg-2+b1
Severity: normal

Dear Maintainer,

I use XML::LibXML::Reader to work with files that validate against the Library
of Congress's MARCXML Schema, available here:
https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd

That schema includes a pattern:
[\dA-Za-z!&quot;#$%&amp;'()*+,-./:;&lt;=&gt;?{}_^`~\[\]\\]{1}
or, with the XML escaping processed:
[\dA-Za-z!"#$%&'()*+,-./:;<=>?{}_^`~\[\]\\]{1}

That regex requires a single character, any one of a long list of allowable
characters.  Note how three of the characters require escaping because they
would have meaning in the regex itself: the two square brackets [ and ], and
the backslash \.
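
As a quick sanity check of the pattern itself, independent of libxml2, the
character class can be fed to Perl's own regex engine.  This is only a sketch:
Perl regexes are not the XML Schema regex dialect, and the anchors and variable
names below are my own additions.  I would still expect this particular class
to be read the same way by both.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The pattern exactly as it appears after XML unescaping.  A single-quoted
# heredoc is used so that Perl leaves the backslashes untouched.
my $pattern = <<'END_PATTERN';
[\dA-Za-z!"#$%&'()*+,-./:;<=>?{}_^`~\[\]\\]{1}
END_PATTERN
chomp $pattern;

# XML Schema patterns match the whole value, so anchor both ends here.
my $re = qr/\A(?:$pattern)\z/;

# The three escaped characters, plus one character that is in the class
# and one (a space) that is not.
for my $char ('[', ']', '\\', 'a', ' ') {
    printf "%-4s %s\n", "'$char'", ($char =~ $re ? 'matches' : 'does not match');
}
```

If the three escaped characters match here, that at least suggests the pattern
is written the way its authors intended.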

An online XML Schema validator that I found with a quick search:
https://www.liquid-technologies.com/online-xsd-validator
shows that those three characters are valid.  The problem is that
XML::LibXML::Reader seems to believe that they are not.

I wrote a simple test script, validate.pl:

-------------------
#!/usr/bin/perl

use strict;
use warnings;
use File::Slurp;
use XML::LibXML::Reader;

my ($xml_path, $xsd_path) = @ARGV;

my %parameters = (
    'location'=>$xml_path,
    'Schema'=>XML::LibXML::Schema->new(string=>scalar(read_file($xsd_path))),
);

my $reader = XML::LibXML::Reader->new(%parameters);
while($reader->read())
{}
print "Finished reading; document must be valid.\n";
-------------------

Along with a basic XML Schema file, test.xsd:

-------------------
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

<xs:element name="root">
  <xs:complexType>
   <xs:attribute name="code">
     <xs:simpleType>
       <xs:restriction base="xs:string">
         <xs:pattern value="[\dA-Za-z!&quot;#$%&amp;'()*+,-./:;&lt;=&gt;?{}_^`~\[\]\\]{1}"/>
       </xs:restriction>
     </xs:simpleType>
    </xs:attribute>
  </xs:complexType>
</xs:element>

</xs:schema>
-------------------

and a VERY basic XML file, test.xml:

-------------------
<root code="["/>
-------------------

Running:
$ perl validate.pl test.xml test.xsd
results in:
test.xml:1: Schemas validity error : Element 'root', attribute 'code': [facet
'pattern'] The value '[' is not accepted by the pattern
'[\dA-Za-z!"#$%&'()*+,-./:;<=>?{}_^`~\[\]\\]{1}'.

I believe that value in fact should match that pattern.  The online schema
validator from earlier validates this pair of files.  If you replace the data
in the "code" attribute with any of the other characters, validation passes.
It only fails for the three characters that are escaped.
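
The character-by-character comparison can be scripted along these lines.  This
is a rough sketch rather than the exact script I ran: the helper names and the
candidate list are made up for illustration, and it validates a parsed document
with XML::LibXML::Schema's validate() instead of going through the Reader, so
it may not exercise the same code path.  Pass the path to test.xsd as the only
argument.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special inside a double-quoted XML attribute.
sub xml_attr_escape {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Build a one-element test document carrying $char in the "code" attribute.
sub test_document {
    my ($char) = @_;
    return '<root code="' . xml_attr_escape($char) . '"/>';
}

# The three escaped characters plus a few unescaped ones for comparison.
my @candidates = ('[', ']', '\\', 'a', '7', '&', '~');

# XML::LibXML is loaded at run time, and only when a schema path is given.
if (@ARGV && eval { require XML::LibXML; 1 }) {
    my $schema = XML::LibXML::Schema->new(location => $ARGV[0]);
    for my $char (@candidates) {
        my $doc = XML::LibXML->load_xml(string => test_document($char));
        # validate() dies when the document is invalid, so trap that in eval.
        my $ok = eval { $schema->validate($doc); 1 };
        printf "%-4s %s\n", "'$char'", ($ok ? 'accepted' : 'rejected');
    }
}
```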



-- System Information:
Debian Release: 11.7
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.10.0-21-amd64 (SMP w/8 CPU threads)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages libxml-libxml-perl depends on:
ii  libc6                         2.31-13+deb11u6
ii  libxml-namespacesupport-perl  1.12-1.1
ii  libxml-sax-perl               1.02+dfsg-1
ii  libxml2                       2.9.10+dfsg-6.7+deb11u4
ii  perl                          5.32.1-4+deb11u2
ii  perl-base [perlapi-5.32.0]    5.32.1-4+deb11u2

libxml-libxml-perl recommends no packages.

libxml-libxml-perl suggests no packages.

-- no debconf information


