Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file
gregor herrmann
gregoa at debian.org
Sat Aug 5 16:16:04 UTC 2017
On Wed, 22 Oct 2014 14:13:17 +0200, Vincent Lefevre wrote:
> Control: retitle -1 libhtml-html5-parser-perl: UTF-8 character breaks parse_file
>
> As a consequence of this bug, html2xhtml doesn't work at all when
> applied on a file. No problems when the HTML document is provided
> in the standard input, though.
[..]
> parse_file is used in the former test (like in my original bug report),
> and parse_string is used in the latter test. Thus it seems that's
> parse_file that is broken. Hence the retitle.
Thanks for all those test cases.
Out of curiosity, I looked at the code a bit and ran bin/html2xhtml on
your test HTML file. What happens is:
lib/HTML/HTML5/Parser.pm: parse_file():
    my $response = HTML::HTML5::Parser::UA->get($file, $opts->{user_agent});

lib/HTML/HTML5/Parser/UA.pm: get():
    interestingly takes the _get_lwp route for file:/// and returns stuff

lib/HTML/HTML5/Parser.pm: parse_file():
    then takes $response->{decoded_content};
    which generates, when printed, a wide character warning, and
    presumably from here on things go south
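
To illustrate the wide character part: the following is only a minimal
sketch in core Perl, not code from HTML::HTML5::Parser. The helper name
warnings_from_print and the sample string are made up for the example.
It shows that a decoded string (characters) triggers the warning when
printed to a handle without an encoding layer, while the raw octets do
not:

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Spec;

# UTF-8 octets for the euro sign (U+20AC), as they would sit in a file.
my $octets = "\xE2\x82\xAC";

# Decoding turns the three octets into a single character above 0xFF.
my $chars = decode('UTF-8', $octets);

# Hypothetical helper: print a string to a handle that has no encoding
# layer and return whatever warnings perl emitted while doing so.
sub warnings_from_print {
    my ($string) = @_;
    my @seen;
    local $SIG{__WARN__} = sub { push @seen, @_ };
    open(my $fh, '>', File::Spec->devnull) or die "cannot open: $!";
    print {$fh} $string;
    close $fh;
    return @seen;
}

printf "octets: length %d, warnings %d\n",
    length($octets), scalar(warnings_from_print($octets));
printf "chars:  length %d, warnings %d\n",
    length($chars), scalar(warnings_from_print($chars));
```

Encoding the characters again with encode('UTF-8', $chars), or putting
an :encoding(UTF-8) layer on the handle, avoids the warning. Whether
that is what goes wrong further down in the parser is a guess on top of
the observation above.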

What helps is:

- replace in lib/HTML/HTML5/Parser.pm
    $response->{decoded_content} with $response->{content}
  which feels a bit dangerous
- or in lib/HTML/HTML5/Parser/UA.pm's get:
  move the
    if ($uri =~ /^file:/i)
  up so it's the first alternative and then _get_fs is used

The latter change would be, as a diff:
#v+
--- a/lib/HTML/HTML5/Parser/UA.pm
+++ b/lib/HTML/HTML5/Parser/UA.pm
@@ -18,14 +18,14 @@ sub get
 {
     my ($class, $uri, $ua) = @_;
+    if ($uri =~ /^file:/i)
+        { goto \&_get_fs }
     if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
         { goto \&_get_tiny }
     if (ref $ua and $ua->isa('LWP::UserAgent'))
         { goto \&_get_lwp }
     if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
         { goto \&_get_lwp }
-    if ($uri =~ /^file:/i)
-        { goto \&_get_fs }
     goto \&_get_tiny;
 }
#v-
While this helps for reading local files, I guess the _get_lwp() case
might still be buggy.
Cheers,
gregor
--
.''`. https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
: :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D 85FA BB3A 6801 8649 AA06
`. `' Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
`-