Bug#750946: libhtml-html5-parser-perl: UTF-8 character breaks parse_file
gregor herrmann
gregoa at debian.org
Mon Aug 7 15:26:08 UTC 2017
Control: tag -1 + patch
On Sun, 06 Aug 2017 18:11:34 -0700, Gregory Williams wrote:
> I also looked into this and found another possible fix:
>
> diff -ru HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm
> --- HTML-HTML5-Parser-0.301/lib/HTML/HTML5/Parser.pm 2013-07-08 07:12:25.000000000 -0700
> +++ HTML-HTML5-Parser-0.301-patched/lib/HTML/HTML5/Parser.pm 2017-08-06 12:42:58.000000000 -0700
> @@ -13,6 +13,7 @@
> use HTML::HTML5::Parser::TagSoupParser;
> use Scalar::Util qw(blessed);
> use URI::file;
> +use Encode qw(encode_utf8);
> use XML::LibXML;
>
> BEGIN {
> @@ -102,6 +103,11 @@
> {
> # XXX AGAIN DO THIS TO STOP ENORMOUS MEMORY LEAKS
> my ($errh, $errors) = @{$self}{qw(error_handler errors)};
> +
> + if (utf8::is_utf8($text)) {
> + $text = encode_utf8($text);
> + }
> +
> $self->{parser}->parse_byte_string(
> $opts->{'encoding'}, $text, $dom,
> sub {
>
This indeed looks much better than my crude workarounds; thanks for
that!
Do you think you can take this up with upstream?
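
For anyone following the bug, the core of the quoted patch can be tried
in isolation. Below is a minimal sketch, not the module's actual code:
the helper name to_utf8_bytes and the sample markup are made up for
illustration. It only shows what the utf8::is_utf8 / encode_utf8 check
does to a string before it would be handed to a byte-oriented parser.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Hypothetical helper mirroring the check added by the patch:
# return octets suitable for a parser that expects a byte string.
sub to_utf8_bytes {
    my ($text) = @_;
    # utf8::is_utf8 reports whether Perl's internal UTF8 flag is set,
    # which is the case for strings holding characters above U+00FF.
    return utf8::is_utf8($text) ? encode_utf8($text) : $text;
}

# Sample character string (an assumption for this sketch); U+2603
# forces Perl to store it with the UTF8 flag on.
my $chars = "<p>caf\x{e9} \x{2603}</p>";
my $bytes = to_utf8_bytes($chars);

printf "UTF8 flag before: %d, after: %d\n",
    (utf8::is_utf8($chars) ? 1 : 0),
    (utf8::is_utf8($bytes) ? 1 : 0);
printf "length: %d characters, %d bytes\n",
    length($chars), length($bytes);
```

Note that utf8::is_utf8 tests Perl's internal flag rather than "is this
decoded text", so a string that is already a byte string passes through
unchanged.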
Cheers,
gregor
--
.''`. https://info.comodo.priv.at/ - Debian Developer https://www.debian.org
: :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D 85FA BB3A 6801 8649 AA06
`. `' Member of VIBE!AT & SPI, fellow of the Free Software Foundation Europe
`-