Bug#787821: libhtml-parser-perl: encode_entities() convert chars to Ã instead of their proper entity

Fri Jun 5 14:20:24 UTC 2015

> On Friday 5 June 2015 14:31:17, you wrote:
> On Fri, 05 Jun 2015 14:34:42 +0200, Mathieu ROY wrote:
> > Ok, so after further testing, it turns out that if I change the encoding of
> > the string from UTF-8 to ISO-8859..., it encodes to the proper entities.
> 
> Good.
> 
> > I obviously can adjust the script to pre convert UTF-8 to ISO-8859
> 
> Or just add "use utf8;" to your script if it contains utf8-encoded
> strings.

That works for the test script all right.

But in the script I'm actually working on, the string is read from an image's 
EXIF data. In that case use utf8; has no effect at all, since it only concerns 
literals in the script's own source. The string is UTF-8 and encode_entities() 
fails to convert it properly.
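
A quick way to see what encode_entities() is being handed in that case, using 
only the core Encode module. This is just a sketch: the sample character is an 
arbitrary stand-in for the EXIF string, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E0 (a with grave), built from its code point so the source stays ASCII.
my $char  = "\x{e0}";
# What a file or an EXIF field holds: the UTF-8 *bytes* for that character.
my $bytes = encode('UTF-8', $char);

# Two bytes, 0xC3 and 0xA0. Read as Latin-1 they are A-tilde and a no-break
# space, which is where the stray &Atilde; comes from.
printf "bytes: %s\n", join(' ', map { sprintf '0x%02X', ord } split //, $bytes);

# Decoding turns the two bytes back into a single character.
my $decoded = decode('UTF-8', $bytes);
printf "length before decode: %d, after: %d\n", length($bytes), length($decoded);
```

This prints "bytes: 0xC3 0xA0" and "length before decode: 2, after: 1".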

Instead of keeping strings as UTF-8 and expecting HTML::Entities to cope 
properly with them (it does not), I actually need to do the contrary: decode 
the UTF-8 into perl's internal format and then call encode_entities().

Consider the following:

  $ cat test.pl 
#!/usr/bin/perl
use utf8;
use HTML::Entities;

open(INPUT, "< testdata");
while (<INPUT>) {
    print encode_entities($_), "\n"
}
close(INPUT);

  $ echo "vis-à-vis Beyoncé's naïve\npapier-mâché résumé" > testdata 

  $ perl test.pl 
vis-&Atilde;&nbsp;-vis Beyonc&Atilde;&copy;&#39;s na&Atilde;&macr;ve\npapier-m&Atilde;&cent;ch&Atilde;&copy; r&Atilde;&copy;sum&Atilde;&copy;

Back to square one.

Now, without use utf8; but decoding:

#!/usr/bin/perl

use HTML::Entities;
use Encode qw(decode);
use Encode::Detect::Detector;

open(INPUT, "< testdata");
while (<INPUT>) {
    print encode_entities(decode(detect($_),$_)), "\n"
}
close(INPUT);

  $ perl test.pl 
vis-&agrave;-vis Beyonc&eacute;&#39;s na&iuml;ve\npapier-m&acirc;ch&eacute; r&eacute;sum&eacute;
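
If the input is known to be UTF-8, the detection step can be dropped and the 
encoding named explicitly. A sketch under that assumption: entities_from_utf8 
is a hypothetical helper name, and the sample bytes stand in for whatever the 
EXIF reader returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: takes a string of UTF-8 bytes, returns entity-encoded text.
sub entities_from_utf8 {
    my ($bytes) = @_;
    # Decode first, so encode_entities() sees characters rather than bytes.
    my $chars = decode('UTF-8', $bytes);
    return encode_entities($chars);
}

# "resume" with acute accents as raw UTF-8 bytes (e-acute is 0xC3 0xA9),
# as it would arrive from a file.
my $raw = "r\xC3\xA9sum\xC3\xA9";
print entities_from_utf8($raw), "\n";    # r&eacute;sum&eacute;
print encode_entities($raw), "\n";       # r&Atilde;&copy;sum&Atilde;&copy;
```

For file input, opening the file with an :encoding(UTF-8) layer performs the 
same decoding as each line is read.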

> > but it should be at least documented (but I don't see any reason why
> > encode_entities should actually not be able to deal with UTF-8)
> 
> That's how encoding in perl works in general, and I'm sure it's
> documented somewhere :)
> (I just don't find the correct perldoc right now ...)

I had assumed that use utf8 / no utf8 were meant to be transitional and should 
be avoided unless absolutely necessary.

The description of use utf8; says:

"When UTF-8 becomes the standard source format, this pragma will effectively 
become a no-op."

Well, that day, if that day comes, HTML::Entities will definitely have to deal 
properly with UTF-8 first hand. :-)

Anyway, in the meantime, I prefer explicitly decoding strings into perl's 
internal format over declaring that all strings are UTF-8.

Regards,

-- 
http://yeupou.wordpress.com/
