Bug#633511: libwww-perl: Incorrect encoding handling for text/html files with LWP::Simple::get and insufficient documentation
Vincent Lefevre
vincent at vinc17.net
Mon Jul 11 01:30:45 UTC 2011
Package: libwww-perl
Version: 6.02-1
Severity: normal
Tags: upstream
This bug report is more or less what I gave on
https://rt.cpan.org/Public/Bug/Display.html?id=69393
with some additional information concerning Debian.
When a file declared as iso-8859-1 and served as text/html is also
a valid UTF-8 file, LWP::Simple::get from libwww-perl 6.02 regards
it as a UTF-8 encoded file. This is incorrect.
For instance, with lwp-dump being
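#!/usr/bin/env perl
# Fetch a URL with LWP::Simple::get and dump the internal representation
# of the returned scalar (Devel::Peek writes its output to stderr).
use strict;
use Devel::Peek;
use LWP::Simple;
@ARGV == 1 or die "Usage: $0 <URL>\n";
my $url = shift;
my $file = LWP::Simple::get($url);
defined $file or die "$0: can't fetch $url\n";
Dump $file;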
and when running
for i in 1a 1h 2a 2h
do
./lwp-dump http://www.vinc17.net/test/perl-lwp-test$i.xml \
2> perl-lwp-test$i.dump
done
I get (see perl-lwp-test1h.dump in particular):
==> perl-lwp-test1a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... A</root>\n"]
CUR = 71
LEN = 80
==> perl-lwp-test1h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x13097d0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
CUR = 69
LEN = 80
==> perl-lwp-test2a.dump <==
SV = PV(0x194dac8) at 0x6a02d0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1308cd0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
CUR = 72
LEN = 80
==> perl-lwp-test2h.dump <==
SV = PV(0x194dac8) at 0x6a02d0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1309850 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\203\302\251... \303\203</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{c3}\x{a9}... \x{c3}</root>\n"]
CUR = 72
LEN = 80
Note: my examples are not HTML files, but this doesn't matter. I first
thought the problem occurred for all text/* files (e.g. text/xml, that's
why I just wrote basic XML files), but in fact only text/html seems to
be affected.
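To make the difference between the 1a and 1h dumps concrete, here is a minimal sketch that uses only the core Encode module and does not involve LWP at all. The variable names are illustrative. It decodes the same byte sequence (the \303\251 seen in the dumps) once as ISO-8859-1, which matches the declared encoding and the 1a result, and once as UTF-8, which matches the 1h result.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The bytes "\303\251" (0xC3 0xA9) as they appear in the test files.
my $bytes = "post\xC3\xA9";

# Declared encoding: each byte is one character (U+00C3, U+00A9).
my $as_latin1 = decode('ISO-8859-1', $bytes);

# Guessed encoding: the two bytes form a single character (U+00E9).
my $as_utf8 = decode('UTF-8', $bytes);

printf "ISO-8859-1: %d characters\n", length $as_latin1;
printf "UTF-8:      %d characters\n", length $as_utf8;
```

The two results differ in length by one character, which is the same difference as between CUR = 71 and CUR = 69 in the dumps above once Perl stores the ISO-8859-1 result internally as UTF-8.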
How the bug should be fixed depends on the expected behavior. However,
LWP::Simple::get is not sufficiently documented. This means that the
other cases are potentially wrong too. Indeed, in lenny, I always get
a sequence of bytes (no UTF8 flag):
==> perl-lwp-test1a.dump <==
SV = PVIV(0x1b1ef38) at 0x1bec568
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
IV = 0
PV = 0x1c04130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
CUR = 69
LEN = 72
==> perl-lwp-test1h.dump <==
SV = PVIV(0x166af38) at 0x1738568
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
IV = 0
PV = 0x1750130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
CUR = 69
LEN = 72
==> perl-lwp-test2a.dump <==
SV = PVIV(0x2150f38) at 0x221e568
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
IV = 0
PV = 0x2236130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
CUR = 69
LEN = 72
==> perl-lwp-test2h.dump <==
SV = PVIV(0x1752f38) at 0x1820568
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
IV = 0
PV = 0x1838130 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
CUR = 69
LEN = 72
and in squeeze, ditto except perl-lwp-test1h.dump, which is already
wrong:
==> perl-lwp-test1a.dump <==
SV = PV(0x23ce758) at 0x1e455f0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x23ce5b0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0
CUR = 69
LEN = 72
==> perl-lwp-test1h.dump <==
SV = PV(0x2afe758) at 0x25755f0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x2d5f9f0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... A</root>\n"\0 [UTF8 "<?xml version="1.0" encoding="iso-8859-1"?>\n<root>post\x{e9}... A</root>\n"]
CUR = 69
LEN = 72
==> perl-lwp-test2a.dump <==
SV = PV(0x2a5d758) at 0x24d45f0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2a5d5b0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
CUR = 69
LEN = 72
==> perl-lwp-test2h.dump <==
SV = PV(0x28cd758) at 0x23445f0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2b8e0c0 "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>\n<root>post\303\251... \303</root>\n"\0
CUR = 69
LEN = 72
A sequence of bytes is probably what one expects for files without
an HTTP charset (e.g. served as application/xml).
Also, what happens if a file is sent as text/html with UTF-8 charset,
but isn't a valid UTF-8 file?
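For comparison only, and without claiming that this is what LWP does, the following sketch shows the two behaviors the core Encode module offers for such input: silent substitution of malformed bytes by U+FFFD (the default), or a fatal error (with FB_CROAK). The sample bytes are modeled on the end of the 2a/2h test files, where a lone \303 precedes "</root>".

```perl
use strict;
use warnings;
use Encode qw(decode);

# A lone 0xC3 right before "<" is not a valid UTF-8 sequence.
my $bytes = "post\xC3\xA9... \xC3</root>";

# Default (lenient) mode: the malformed byte is replaced by U+FFFD.
my $lenient = decode('UTF-8', $bytes);

# Strict mode: FB_CROAK makes decode() die on malformed input.
# decode() may modify its argument in this mode, so work on a copy.
my $copy   = $bytes;
my $strict = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
my $error  = $@;

print defined $strict
    ? "strict decode succeeded\n"
    : "strict decode failed: $error";
```

Which of the two is appropriate for LWP::Simple::get is exactly the kind of thing the documentation would need to state.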
The problem with the 1h file may come from HTTP::Message, if
LWP::Simple::get uses decoded_content from HTTP::Message with a
default charset guessed by content_charset(). Charset guessing
should strictly follow the explicit rules from
http://www.w3.org/TR/REC-html40/charset.html#spec-char-encoding
to avoid inconsistencies like the ones shown here.
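As a possible workaround on the caller's side, here is a minimal sketch that avoids content-based guessing altogether. It assumes the HTTP::Message distribution is installed. decode_by_header is a hypothetical helper, not part of LWP, and the ISO-8859-1 fallback is an assumption of this sketch. It covers only the first rule (the charset parameter of the HTTP Content-Type header) and does not look at META declarations. The response is built by hand so that no network access is needed; with a real fetch it would come from LWP::UserAgent instead.

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTTP::Response;

# Hypothetical helper: decode using only the charset given in the HTTP
# header, or a caller-supplied fallback; the body is never inspected.
sub decode_by_header {
    my ($res, $fallback) = @_;
    # charset => 'none' still undoes any Content-Encoding (e.g. gzip)
    # but returns the body as undecoded bytes.
    my $bytes   = $res->decoded_content(charset => 'none');
    my $charset = $res->content_type_charset || $fallback;
    return decode($charset, $bytes);
}

# Same bytes, served once without and once with an explicit charset.
my $body     = "post\xC3\xA9";
my $no_param = HTTP::Response->new(200, 'OK',
    [ 'Content-Type' => 'text/html' ], $body);
my $with_param = HTTP::Response->new(200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ], $body);

my $text_no_param   = decode_by_header($no_param,   'ISO-8859-1');
my $text_with_param = decode_by_header($with_param, 'ISO-8859-1');

printf "no charset parameter: %d characters\n", length $text_no_param;
printf "charset=utf-8:        %d characters\n", length $text_with_param;
```

With this approach the result depends only on what the server declares, so a file that merely happens to be valid UTF-8 is not reinterpreted.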
-- System Information:
Debian Release: wheezy/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.39-2-amd64 (SMP w/2 CPU cores)
Locale: LANG=POSIX, LC_CTYPE=en_US.ISO8859-1 (charmap=ISO-8859-1)
Shell: /bin/sh linked to /bin/dash
Versions of packages libwww-perl depends on:
ii ca-certificates 20110502 Common CA certificates
ii libencode-locale-perl 1.02-1 utility to determine the locale en
ii libfile-listing-perl 6.01-1 module to parse directory listings
ii libhtml-parser-perl 3.68-1+b1 collection of modules that parse H
ii libhtml-tagset-perl 3.20-2 Data tables pertaining to HTML
ii libhtml-tree-perl 4.2-1 Perl module to represent and creat
ii libhttp-cookies-perl 6.00-2 HTTP cookie jars
ii libhttp-date-perl 6.00-1 module of date conversion routines
ii libhttp-message-perl 6.01-1 perl interface to HTTP style messa
ii libhttp-negotiate-perl 6.00-2 implementation of content negotiat
ii liblwp-mediatypes-perl 6.01-1 module to guess media type for a f
ii liblwp-protocol-https-perl 6.02-1 https driver for LWP::UserAgent
ii libnet-http-perl 6.01-1 module providing low-level HTTP co
ii liburi-perl 1.58-1 module to manipulate and access UR
ii libwww-robotrules-perl 6.01-1 database of robots.txt-derived per
ii netbase 4.46 Basic TCP/IP networking system
ii perl 5.12.4-1 Larry Wall's Practical Extraction
Versions of packages libwww-perl recommends:
ii libauthen-ntlm-perl 1.08-1 authentication module for NTLM
ii libhtml-form-perl 6.00-1 module that represents an HTML for
pn libhtml-format-perl <none> (no description available)
ii libhttp-daemon-perl 6.00-1 simple http server class
ii libmailtools-perl 2.08-1 Manipulate email in perl programs
libwww-perl suggests no packages.
-- no debconf information