Bug#1037063: libxml-libxml-perl: Seemingly incorrect handling of escaped characters in patterns
Xan Charbonnet
xan at charbonnet.com
Sat Jun 3 04:38:02 BST 2023
Package: libxml-libxml-perl
Version: 2.0134+dfsg-2+b1
Severity: normal
Dear Maintainer,
I use XML::LibXML::Reader to work with files that validate against the Library
of Congress's MARCXML Schema, available here:
https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd
That schema includes a pattern:
[\dA-Za-z!&quot;#$%&amp;'()*+,-./:;&lt;=&gt;?{}_^`~\[\]\\]{1}
or, with the XML escaping processed:
[\dA-Za-z!"#$%&'()*+,-./:;<=>?{}_^`~\[\]\\]{1}
That regex requires a single character, any one of a long list of allowable
characters. Note how three of the characters require escaping because they
would have meaning in the regex itself: the two square brackets [ and ], and
the backslash \.
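As a rough cross-check of that reading (not part of the original test case), the unescaped pattern can be tried in a different regex engine. The sketch below uses Python's `re` module purely as a convenient stand-in: it is not the XML Schema regex dialect, though this particular character class should read the same way in both. The helper name `accepts` is hypothetical.

```python
import re

# The schema pattern after XML unescaping. Python's `re` is NOT the XML Schema
# regex dialect; it is assumed here only as an independent engine in which
# this character class has the same meaning.
PATTERN = re.compile(r'''[\dA-Za-z!"#$%&'()*+,-./:;<=>?{}_^`~\[\]\\]{1}''')


def accepts(value: str) -> bool:
    """Return True if the whole value matches the pattern.

    XML Schema patterns are implicitly anchored at both ends, so fullmatch()
    is used rather than search().
    """
    return PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    # The three escaped characters, two ordinary ones, one character that is
    # not in the list, and one value that is too long.
    for candidate in ["[", "]", "\\", "a", "7", "@", "ab"]:
        print(repr(candidate), accepts(candidate))
```

Under this interpretation the three escaped characters are accepted just like the unescaped ones, while a character outside the list (such as `@`) or a two-character value is rejected.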
An online XML Schema validator that I found with a quick search:
https://www.liquid-technologies.com/online-xsd-validator
shows that those three characters are valid. The problem is that
XML::LibXML::Reader seems to believe that they are not.
I wrote a simple test script, validate.pl:
-------------------
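#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;
use XML::LibXML::Reader;

# Usage: validate.pl <xml_path> <xsd_path>
my ($xml_path, $xsd_path) = @ARGV;

# Read the document with the pull parser, validating against the schema.
my %parameters = (
    'location' => $xml_path,
    'Schema'   => XML::LibXML::Schema->new(string => scalar(read_file($xsd_path))),
);
my $reader = XML::LibXML::Reader->new(%parameters);

# Walk every node; schema validity errors surface while reading.
while ($reader->read()) {}

print "Finished reading; document must be valid.\n";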
-------------------
Along with a basic XML Schema file, test.xsd:
-------------------
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="root">
    <xs:complexType>
      <xs:attribute name="code">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:pattern value="[\dA-Za-z!&quot;#$%&amp;'()*+,-./:;&lt;=&gt;?{}_^`~\[\]\\]{1}"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
-------------------
and a VERY basic XML file, test.xml:
-------------------
<root code="["/>
-------------------
Running:
$ perl validate.pl test.xml test.xsd
results in:
test.xml:1: Schemas validity error : Element 'root', attribute 'code': [facet
'pattern'] The value '[' is not accepted by the pattern
'[\dA-Za-z!"#$%&'()*+,-./:;<=>?{}_^`~\[\]\\]{1}'.
I believe that value in fact should match that pattern. The online schema
validator from earlier validates this pair of files. If you replace the data
in the "code" attribute with any of the other characters, validation passes.
It only fails for the three characters that are escaped.
-- System Information:
Debian Release: 11.7
APT prefers stable-updates
APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 5.10.0-21-amd64 (SMP w/8 CPU threads)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages libxml-libxml-perl depends on:
ii libc6 2.31-13+deb11u6
ii libxml-namespacesupport-perl 1.12-1.1
ii libxml-sax-perl 1.02+dfsg-1
ii libxml2 2.9.10+dfsg-6.7+deb11u4
ii perl 5.32.1-4+deb11u2
ii perl-base [perlapi-5.32.0] 5.32.1-4+deb11u2
libxml-libxml-perl recommends no packages.
libxml-libxml-perl suggests no packages.
-- no debconf information