Bug#787821: libhtml-parser-perl: encode_entities() convert chars to &Atilde; instead of their proper entity
Damyan Ivanov
dmn at debian.org
Fri Jun 5 12:29:12 UTC 2015
-=| Mathieu Roy, 05.06.2015 13:35:24 +0200 |=-
> Package: libhtml-parser-perl
> Version: 3.71-1+b3
> Severity: important
>
> Hello,
>
> According to http://search.cpan.org/dist/HTML-Parser/lib/HTML/Entities.pm
>
>
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
>
> prints
>
> vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve
> papier-m&acirc;ch&eacute; r&eacute;sum&eacute;
>
>
> That's correct.
>
>
> However, here:
>
> $ cat test.pl
> #!/usr/bin/perl
>
> use HTML::Entities;
> $input = "vis-à-vis Beyoncé's naïve\npapier-mâché résumé";
> print encode_entities($input), "\n"
>
> # EOF
>
> $ perl test.pl
> vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;'s na&Atilde;&macr;ve
> papier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

I can confirm that. However, adding "use utf8;" to the test script
fixes the output. So it seems to me that your test file is encoded in
UTF-8 and you need to tell perl so.

HTML::Entities encodes characters, so its result depends on perl's
interpretation of the source text. Without an explicit "use utf8" the
source is considered to be Latin-1, one character per byte. Each
accented letter takes two bytes in UTF-8 (é is C3 A9, for example), so
perl sees two Latin-1 characters (Ã and ©) and each gets its own
entity, which I think leads to the garbage above.

If you recode the test file to Latin-1, everything will work as
expected, since Latin-1 is the default source encoding.
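
To make the difference concrete, here is a minimal sketch, not taken from the report. It spells the UTF-8 bytes of "résumé" with escapes so that the script itself is pure ASCII and behaves the same however it is saved. Decoding the bytes with Encode stands in for what "use utf8" does for string literals; the variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# UTF-8 bytes of "résumé" (é is C3 A9), written as escapes.
my $bytes = "r\xC3\xA9sum\xC3\xA9";

# What happens without "use utf8": every byte is one Latin-1 character,
# so C3 becomes &Atilde; and A9 becomes &copy;.
my $as_latin1 = encode_entities($bytes);

# Decode the bytes as UTF-8 first, so perl sees the character é itself.
my $as_utf8 = encode_entities(decode('UTF-8', $bytes));

print "$as_latin1\n";   # r&Atilde;&copy;sum&Atilde;&copy;
print "$as_utf8\n";     # r&eacute;sum&eacute;
```

The first line of output matches the pattern seen in the report, the second matches the documented example.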

> Where do these &Atilde; come from?
> According to http://www.w3schools.com/charsets/ref_html_entities_4.asp it's for Ã.
>
> I tested the same script on Debian stable and on an Ubuntu system, with the exact same result.
>
> I don't know what I'm doing wrong here but a simple copy/paste of the documented example does not work.

I guess the documentation needs "use utf8;" somewhere, or maybe
something more generic, since the same text may just as well be
encoded in Latin-1.
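
One possible shape for "something more generic" is to decode the input explicitly from whatever encoding it is known to be in, and only then encode entities. This is just a sketch: entities_from_bytes() is a hypothetical helper, not part of HTML::Entities, and it assumes the caller knows the encoding of the bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: turn raw bytes in a known encoding into characters,
# then replace the unsafe characters with HTML entities.
sub entities_from_bytes {
    my ($bytes, $encoding) = @_;
    return encode_entities(decode($encoding, $bytes));
}

my $utf8_bytes   = "na\xC3\xAFve";   # "naïve" as UTF-8 bytes
my $latin1_bytes = "na\xEFve";       # "naïve" as Latin-1 bytes

my $from_utf8   = entities_from_bytes($utf8_bytes,   'UTF-8');
my $from_latin1 = entities_from_bytes($latin1_bytes, 'ISO-8859-1');

print "$from_utf8\n";     # na&iuml;ve
print "$from_latin1\n";   # na&iuml;ve
```

Both encodings of the same text end up as the same entity once the bytes have been decoded with the matching encoding name.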
> Other similar commands work as expected. For instance:
>
> echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" | recode utf8..html
> vis-&agrave;-vis Beyonc&eacute;'s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;
>
>
>
>
> Plus, as a side bug (require a report on its own?),
> man HTML::Entities prints
>
> For example, this:
>
> $input = "vis-a-vis Beyonce's naieve\npapier-mache resume";
> print encode_entities($input), "\n"
>
> Prints this out:
>
> [...]
>
> Yes, the man page example is actually stripped of entities to encode!

Not sure where the problem is here. perldoc works fine:

  perldoc HTML::Entities

  pod2man /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm

generates stuff like:

  \& $input = "vis\-a\*`\-vis Beyonce\*'\*(Aqs
  nai\*:ve\enpapier\-ma\*^che\*' re\*'sume\*'";

Which I guess is *roff speak for accents.

Adding --utf8 seems to get it right:

  pod2man --utf8 /usr/lib/x86_64-linux-gnu/perl5/5.20/HTML/Entities.pm \
    | man -l -