Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

Mon Aug 7 01:11:34 UTC 2017

On Sat, 5 Aug 2017 12:16:04 -0400 gregor herrmann <gregoa at debian.org> wrote:
> What helps is:
> - replace in lib/HTML/HTML5/Parser.pm
>   $response->{decoded_content} with $response->{content}
>   which feels a bit dangerous
> - or in lib/HTML/HTML5/Parser/UA.pm's get:
>   move the
>   if ($uri =~ /^file:/i)
>   up so it's the first alternative and then _get_fs is used
> 
> 
> The latter change would be, as a diff:
> 
> #v+
> --- a/lib/HTML/HTML5/Parser/UA.pm
> +++ b/lib/HTML/HTML5/Parser/UA.pm
> @@ -18,14 +18,14 @@ sub get
>  {
>         my ($class, $uri, $ua) = @_;
> 
> +       if ($uri =~ /^file:/i)
> +               { goto \&_get_fs }
>         if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
>                 { goto \&_get_tiny }
>         if (ref $ua and $ua->isa('LWP::UserAgent'))
>                 { goto \&_get_lwp }
>         if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
>                 { goto \&_get_lwp }
> -       if ($uri =~ /^file:/i)
> -               { goto \&_get_fs }
> 
> 
> 
> While this helps for reading local files, I guess the _get_lwp() case
> might still be buggy.


I also looked into this and found another possible fix:

diff -ru HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm

--- HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm	2013-07-08 07:12:25.000000000 -0700
+++ HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm	2017-08-06 12:42:58.000000000 -0700
@@ -13,6 +13,7 @@
 use HTML::HTML5::Parser::TagSoupParser;
 use Scalar::Util qw(blessed);
 use URI::file;
+use Encode qw(encode_utf8);
 use XML::LibXML;
 
 BEGIN {
@@ -102,6 +103,11 @@
 	{
         # XXX AGAIN DO THIS TO STOP ENORMOUS MEMORY LEAKS
         my ($errh, $errors) = @{$self}{qw(error_handler errors)};
+        
+        if (utf8::is_utf8($text)) {
+        	$text	= encode_utf8($text);
+        }
+        
 		$self->{parser}->parse_byte_string(
             $opts->{'encoding'}, $text, $dom,
             sub {


Part of the underlying issue here is that many variables and methods in these modules are confusingly named: they expect (in fact require) encoded bytes, yet their names suggest decoded character strings.
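
To make the bytes-versus-characters distinction concrete, here is a minimal standalone sketch of the guard used in the patch. The helper name as_utf8_bytes and the sample string are made up for illustration and are not part of HTML::HTML5::Parser; only utf8::is_utf8 and Encode::encode_utf8 come from the patch itself.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (not in the module) mirroring the guard in the patch:
# hand back UTF-8 octets whether the caller passed a decoded character
# string or an already-encoded byte string.
sub as_utf8_bytes {
    my ($text) = @_;
    return utf8::is_utf8($text) ? encode_utf8($text) : $text;
}

# A character string; U+263A is above 0xFF, so Perl must store it with the
# internal UTF8 flag set.
my $chars = "caf\x{E9} \x{263A}";    # 6 characters
my $bytes = encode_utf8($chars);     # 9 octets, flag off

printf "chars: %d, bytes: %d\n", length($chars), length($bytes);
printf "guard gives identical octets: %s\n",
    as_utf8_bytes($chars) eq as_utf8_bytes($bytes) ? "yes" : "no";
```

One caveat with this approach: utf8::is_utf8 only inspects Perl's internal flag, so a character string containing nothing above U+00FF may not be flagged and would pass through the guard unencoded.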

The Parser.pm patch above should also handle the LWP case, which the previously suggested patch avoids. It still passes the test suite (which should probably be extended to cover this case) and also fixes the test case detailed in this bug report. I should mention, though, that I believe the test script included by Vincent Lefevre contains a double-encoding bug: $doc->toString() actually returns UTF-8 encoded bytes, which the :encoding(UTF-8) PerlIO layer on stdout will then attempt to encode a second time.
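
The double-encoding effect can be reproduced without XML::LibXML. In this rough sketch, $bytes merely stands in for the kind of value toString() returns, and the second encode_utf8 call models what an :encoding(UTF-8) output layer does to a string that is already encoded.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

my $chars = "\x{E9}";               # one character: e with acute accent
my $bytes = encode_utf8($chars);    # two octets, C3 A9 (stand-in for toString() output)

# An :encoding(UTF-8) layer treats each octet of $bytes as a character and
# encodes it again; calling encode_utf8 on the bytes models that step.
my $double = encode_utf8($bytes);

# Hex dump of the double-encoded result: four octets instead of two.
print join(' ', map { sprintf '%02X', ord } split //, $double), "\n";

# Decoding first gives the layer the character string it expects.
my $fixed = encode_utf8(decode_utf8($bytes));
print $fixed eq $bytes ? "round trip ok\n" : "round trip broken\n";
```

So a test script should presumably either decode the toString() result before printing it through the encoding layer, or print the bytes to a handle without that layer.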

thanks,
.greg
