Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity

Fri Jun 5 12:29:12 UTC 2015

-=| Mathieu Roy, 05.06.2015 13:35:24 +0200 |=-
> Package: libhtml-parser-perl
> Version: 3.71-1+b3
> Severity: important
> 
> Hello,
> 
> According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm
> 
> 
>  use HTML::Entities;
>  $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
>  print encode_entities($input), "\n"
> 
> prints
> 
>  vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
>  papier-m&acirc;ch&eacute; r&eacute;sum&eacute;
> 
> 
> That's correct.
> 
> 
> However, here:
> 
>   $ cat test.pl 
> #!/usr/bin/perl
> 
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
> 
> # EOF 
> 
>   $ perl test.pl 
> vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve
> papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

I can confirm that. However, adding "use utf8;" to the test script 
fixes the output. So it seems that your test file is encoded in 
UTF-8 and you need to tell perl so.
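
For illustration, here is a sketch of the test script with that one 
change (plus strict/warnings, which are my additions). It assumes the 
file itself is saved as UTF-8; the variable $encoded is just a name I 
picked for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;             # tell perl that this source file is UTF-8
use HTML::Entities;

# With "use utf8", each accented letter in the literal is one character.
my $input   = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
my $encoded = encode_entities($input);

# The result contains only ASCII, so no output layer is needed here.
print $encoded, "\n";
```

The output should then contain entities such as &agrave; and &eacute; 
and no &Atilde; at all.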

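HTML::Entities encodes characters, so its output depends on how perl 
interprets the source text. Without an explicit 'use utf8', perl reads 
each byte of a string literal as one Latin-1 character, so the two 
UTF-8 bytes of "à" (0xC3 0xA0) become "Ã" plus a no-break space, and 
each of them gets its own entity. That is the garbage above.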

If you recode the test file to Latin-1, everything will work as 
expected, since Latin-1 is what perl assumes by default.
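
To make the mechanism visible without depending on how the file is 
saved, here is a small sketch that writes "à" as its two UTF-8 bytes 
and encodes it both ways. Using Encode::decode here is my choice for 
the demonstration, not something the original script did, and the 
variable names are made up:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities;

# "à" as the two bytes it occupies in a UTF-8 file (0xC3 0xA0).
my $bytes = "vis-\xC3\xA0-vis";

# Undecoded: perl sees two characters, U+00C3 and U+00A0.
my $as_latin1 = encode_entities($bytes);

# Decoded first: perl sees the single character U+00E0.
my $as_utf8 = encode_entities(decode('UTF-8', $bytes));

print "$as_latin1\n";   # vis-&Atilde;&nbsp;-vis
print "$as_utf8\n";     # vis-&agrave;-vis
```

The same decode step would apply to text read from a file or a pipe 
whose encoding you know.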

> Where do these &Atilde; come from?
> According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it's for Ã.
> 
> I tested the same script on Debian stable and on some Ubuntu, with the exact same result.
> 
> I don't know what I'm doing wrong here, but a simple copy/paste of the documented example does not work.

I guess the documentation needs 'use utf8;' somewhere, or maybe 
something more generic, since the same text may also be encoded in 
Latin-1.

> Other similar commands work as expected. For instance:
> 
> echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
> vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;
> 
> 
> 
> 
> Plus, as a side bug (does it require a report of its own?),
> man HTML::Entities prints
> 
>    For example, this:
> 
>         $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
>         print encode_entities($input), "\n"
> 
>        Prints this out:
> 
>         [...]
> 
> Yes, the man page example is actually stripped of entities to encode!

Not sure where the problem is here. perldoc works fine:

 perldoc HTML::Entities

pod2man /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm 
generates stuff like:

 \& $input = "vis\-a\*`\-vis Beyonce\*'\*(Aqs 
 nai\*:ve\enpapier\-ma\*^che\*' re\*'sume\*'";

Which I guess is *roff speak for accents.

Adding --utf8 seems to get it right:

 pod2man --utf8 /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm \
     |   man -l -
