Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file

gregor herrmann gregoa at debian.org
Mon Aug 7 15:26:08 UTC 2017


Control: tag -1 + patch

On Sun, 06 Aug 2017 18:11:34 -0700, Gregory Williams wrote:

> I also looked into this and found another possible fix:
> 
> diff -ru HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm
> --- HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm	2013-07-08 07:12:25.000000000 -0700
> +++ HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm	2017-08-06 12:42:58.000000000 -0700
> @@ -13,6 +13,7 @@
>  use HTML::HTML5::Parser::TagSoupParser;
>  use Scalar::Util qw(blessed);
>  use URI::file;
> +use Encode qw(encode_utf8);
>  use XML::LibXML;
>  
>  BEGIN {
> @@ -102,6 +103,11 @@
>  	{
>          # XXX AGAIN DO THIS TO STOP ENORMOUS MEMORY LEAKS
>          my ($errh, $errors) = @{$self}{qw(error_handler errors)};
> +        
> +        if (utf8::is_utf8($text)) {
> +        	$text	= encode_utf8($text);
> +        }
> +        
>  		$self->{parser}->parse_byte_string(
>              $opts->{'encoding'}, $text, $dom,
>              sub {
> 

This indeed looks much better than my crude workarounds; thanks for
that!
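
A minimal standalone sketch of what the guard in the patch does, assuming
the input reaches the parser as a Perl character string. The helper name
to_bytes is hypothetical and not part of HTML::HTML5::Parser; it only
mirrors the added lines so the effect can be tried in isolation:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper (not part of HTML::HTML5::Parser) that mirrors the
# guard added by the patch.
sub to_bytes {
    my ($text) = @_;
    # Only strings carrying Perl's internal UTF-8 flag are encoded;
    # anything else is assumed to be bytes already and passes through.
    if (utf8::is_utf8($text)) {
        $text = encode_utf8($text);
    }
    return $text;
}

my $chars = "caf\x{e9} \x{263a}";   # 6 characters, two of them non-ASCII
my $bytes = to_bytes($chars);

# The encoded result is a byte string, which is what a method named
# parse_byte_string would be expected to receive.
printf "%d characters -> %d bytes\n", length($chars), length($bytes);
```

Note that the check relies on the internal UTF-8 flag, so an unflagged
string containing Latin-1 characters would be passed through unchanged.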

Do you think you can take this up with upstream?


Cheers,
gregor

-- 
 .''`.  https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
   `-   