Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

gregor herrmann gregoa at debian.org
Sat Aug 5 16:16:04 UTC 2017


On Wed, 22 Oct 2014 14:13:17 +0200, Vincent Lefevre wrote:

> Control: retitle -1 libhtml-html5-parser-perl: UTF-8 character breaks parse_file
> 
> As a consequence of this bug, html2xhtml doesn't work at all when
> applied on a file. No problems when the HTML document is provided
> in the standard input, though. 
[..]
> parse_file is used in the former test (like in my original bug report),
> and parse_string is used in the latter test. Thus it seems that's
> parse_file that is broken. Hence the retitle.

Thanks for all those test cases.

Out of curiosity, I looked at the code a bit and ran bin/html2xhtml on
your test HTML file. What happens is:

lib/HTML/HTML5/Parser.pm: parse_file():
my $response = HTML::HTML5::Parser::UA->get($file, $opts->{user_agent});

lib/HTML/HTML5/Parser/UA.pm: get():
interestingly takes the _get_lwp route even for file:/// URIs and
returns its response

lib/HTML/HTML5/Parser.pm: parse_file():
then takes $response->{decoded_content},
which generates a "Wide character" warning when printed, and
presumably things go wrong from here on
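
As background on that warning, here is a minimal sketch, independent of
HTML::HTML5::Parser, of how a decoded character string produces a
"Wide character" warning when printed to a handle without an encoding
layer. The variable names and the sample character (U+20AC) are
assumptions chosen for illustration; they are not taken from the test
file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Make sure STDOUT has no encoding layer, as in a plain script run.
binmode(STDOUT, ':raw');

# Raw UTF-8 octets for the euro sign (U+20AC), as they would sit in a file.
my $bytes = "\xE2\x82\xAC";

# Decoding turns the three octets into one character (code point > 255).
my $chars = decode('UTF-8', $bytes);

# Collect warnings instead of letting them go straight to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

print length($bytes), " bytes, ", length($chars), " character\n";

print $chars, "\n";    # decoded string: warns "Wide character in print"
print $bytes, "\n";    # raw octets: no warning

print STDERR "captured: $_" for @warnings;
```

Only the decoded string warns; the raw octets pass through silently.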

What helps is:
- replace in lib/HTML/HTML5/Parser.pm
  $response->{decoded_content} with $response->{content}
  which feels a bit dangerous
- or in lib/HTML/HTML5/Parser/UA.pm's get():
  move the
  if ($uri =~ /^file:/i)
  check up so that it's the first alternative and _get_fs is used


The latter change would be, as a diff:

#v+
--- a/lib/HTML/HTML5/Parser/UA.pm
+++ b/lib/HTML/HTML5/Parser/UA.pm
@@ -18,14 +18,14 @@ sub get
 {
        my ($class, $uri, $ua) = @_;

+       if ($uri =~ /^file:/i)
+               { goto \&_get_fs }
        if (ref $ua and $ua->isa('HTTP::Tiny') and $uri =~ /^https?:/i)
                { goto \&_get_tiny }
        if (ref $ua and $ua->isa('LWP::UserAgent'))
                { goto \&_get_lwp }
        if (UNIVERSAL::can('LWP::UserAgent', 'can') and not $NO_LWP)
                { goto \&_get_lwp }
-       if ($uri =~ /^file:/i)
-               { goto \&_get_fs }

        goto \&_get_tiny;
 }
#v-
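
The effect of the reordering can be sketched with a simplified,
hypothetical stand-in for get(). pick_route() and its two flags are made
up for illustration; the real sub jumps via goto and also inspects the
$ua argument, which is left out here.

```perl
use strict;
use warnings;

# Hypothetical model of the dispatch order in get(): returns the name of
# the route that would be taken instead of jumping to it.
#   $lwp_available    - stands in for "LWP::UserAgent is loadable"
#   $file_check_first - true models the patched order, false the current one
sub pick_route {
    my ($uri, $lwp_available, $file_check_first) = @_;

    # Patched order: the file: check comes before any LWP check.
    return '_get_fs' if $file_check_first and $uri =~ /^file:/i;

    # With LWP available, this branch catches every remaining URI.
    return '_get_lwp' if $lwp_available;

    # Current order: only reached for file: URIs when LWP is unavailable.
    return '_get_fs' if $uri =~ /^file:/i;

    return '_get_tiny';
}

my $uri = 'file:///tmp/test.html';
print "current order: ", pick_route($uri, 1, 0), "\n";   # _get_lwp
print "patched order: ", pick_route($uri, 1, 1), "\n";   # _get_fs
```

In this model, non-file URIs take the same route in both orders, so the
change only affects local files.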


While this helps for reading local files, I guess the _get_lwp() case
might still be buggy.


Cheers,
gregor


-- 
 .''`.  https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   