Bug#750946: libhtml-html5-parser-perl: UTF-8 character confuses the parser
Vincent Lefevre
vincent at vinc17.net
Sun Jun 8 19:03:03 UTC 2014
Package: libhtml-html5-parser-perl
Version: 0.301-1
Severity: important
(with possible data loss as a consequence)
Consider the following HTML file:
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>title</title>
</head>
<body>
<p>↓</p>
</body>
</html>
On this file, the following script
#!/usr/bin/env perl
use strict;
use HTML::HTML5::Parser;
use utf8; # for the characters in the script.
use open ':encoding(UTF-8)'; # for the file arguments.
binmode STDIN, ':encoding(UTF-8)'; # for stdin.
binmode STDOUT, ':encoding(UTF-8)'; # for stdout.
@ARGV == 1 or die "Usage: $0 <file.html>\n";
my $parser = HTML::HTML5::Parser->new;
my $doc = $parser->parse_file($ARGV[0]);
print "Charset: '", $parser->charset($doc), "'\n";
print $doc->toString();
outputs:
Charset: ''
<?xml version="1.0" encoding="windows-1252"?>
<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>
If I replace the ↓ (U+2193 DOWNWARDS ARROW) with é (U+00E9 LATIN SMALL
LETTER E WITH ACUTE), then I get:
Charset: 'utf-8'
<?xml version="1.0" encoding="utf-8"?>
<!--?xml version="1.0" encoding="utf-8"?-->
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>title</title>
</head>
<body>
<p>�</p>
</body></html>
which is also incorrect (for instance, the é is garbled and the xmlns
attribute appears twice on the html element), but at least the charset is
correct.
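For reference, the two test characters differ in how they are encoded in UTF-8: ↓ takes three bytes and é takes two. The following is only a minimal sketch using the core Encode module to show those bytes. The helper name utf8_hex is made up for illustration, and nothing here establishes that the byte length is what triggers the parser's behaviour.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of HTML::HTML5::Parser): return the UTF-8
# encoding of one code point as space-separated uppercase hex bytes.
sub utf8_hex {
    my ($codepoint) = @_;
    my $bytes = encode('UTF-8', chr($codepoint));
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# The two characters used in the report: U+2193 and U+00E9.
printf "U+%04X -> %s\n", $_, utf8_hex($_) for 0x2193, 0x00E9;
```

This prints "U+2193 -> E2 86 93" and "U+00E9 -> C3 A9", which may help when comparing the input file with the parser's output byte by byte.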
-- System Information:
Debian Release: jessie/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 3.11-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages libhtml-html5-parser-perl depends on:
ii libhtml-html5-entities-perl 0.003-2
ii libio-html-perl 1.00-1
ii libtry-tiny-perl 0.22-1
ii liburi-perl 1.60-1
ii libxml-libxml-perl 2.0116+dfsg-1
ii perl 5.18.2-4
ii perl-modules [libhttp-tiny-perl] 5.18.2-4
libhtml-html5-parser-perl recommends no packages.
Versions of packages libhtml-html5-parser-perl suggests:
pn libxml-libxml-devel-setlinenumber-perl <none>
-- no debconf information