Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser

Vincent Lefevre vincent at vinc17.net
Sun Jun 8 19:03:03 UTC 2014


Package: libhtml-html5-parser-perl
Version: 0.301-1
Severity: important

(The severity reflects possible data loss as a consequence.)

Consider the following HTML file:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>title</title>
  </head>
  <body>
    <p>↓</p>
  </body>
</html>

When run on this file, the following script

#!/usr/bin/env perl

use strict;
use HTML::HTML5::Parser;

use utf8;                            # for the characters in the script.
use open ':encoding(UTF-8)';         # for the file arguments.
binmode STDIN, ':encoding(UTF-8)';   # for stdin.
binmode STDOUT, ':encoding(UTF-8)';  # for stdout.

@ARGV == 1 or die "Usage: $0 <file.html>\n";

my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_file($ARGV[0]);
print "Charset: '", $parser->charset($doc), "'\n";
print $doc->toString();

outputs:

Charset: ''
<?xml version="1.0" encoding="windows-1252"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>

If I replace the ↓ (U+2193 DOWNWARDS ARROW) with é (U+00E9 LATIN SMALL
LETTER E WITH ACUTE), then I get:

Charset: 'utf-8'
<?xml version="1.0" encoding="utf-8"?>
<!--?xml version="1.0" encoding="utf-8"?-->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
    <title>title</title>
  </head>
  <body>
    <p>�</p>
  

</body></html>

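which is also incorrect (the XML declaration is repeated as a comment,
the xmlns attribute appears twice on <html>, and the é is not preserved),
but at least the charset is correct.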
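
For reference when comparing the two outputs, here is a minimal sketch
(separate from the failing script, and not a diagnosis of the parser)
that prints the UTF-8 bytes of the two test characters and the code
points obtained if those bytes are read as windows-1252, the encoding
reported in the first output. It uses only the core Encode module; the
helper names utf8_hex and misread_as_cp1252 are made up for this
illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return the UTF-8 encoding of a character as space-separated hex bytes.
sub utf8_hex {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Return the code points seen when the character's UTF-8 bytes are
# decoded as windows-1252 (cp1252) instead of UTF-8.
sub misread_as_cp1252 {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    my $wrong = decode('cp1252', $bytes);
    return join ' ', map { sprintf('U+%04X', ord($_)) } split //, $wrong;
}

# The two characters used in the test files above.
for my $cp (0x2193, 0x00E9) {
    my $char = chr($cp);
    printf "U+%04X: UTF-8 bytes %s; read as windows-1252: %s\n",
        $cp, utf8_hex($char), misread_as_cp1252($char);
}
```

The arrow takes three bytes in UTF-8 (E2 86 93) and the é takes two
(C3 A9), so the two test files differ in more than just the code point.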

-- System Information:
Debian Release: jessie/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.11-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages libhtml-html5-parser-perl depends on:
ii  libhtml-html5-entities-perl       0.003-2
ii  libio-html-perl                   1.00-1
ii  libtry-tiny-perl                  0.22-1
ii  liburi-perl                       1.60-1
ii  libxml-libxml-perl                2.0116+dfsg-1
ii  perl                              5.18.2-4
ii  perl-modules [libhttp-tiny-perl]  5.18.2-4

libhtml-html5-parser-perl recommends no packages.

Versions of packages libhtml-html5-parser-perl suggests:
pn  libxml-libxml-devel-setlinenumber-perl  <none>

-- no debconf information


