Bug#787821: libhtml-parser-perl: encode_entities() convert chars to Ã instead of their proper entity

Fri Jun 5 13:40:18 UTC 2015

-=| Mathieu ROY, 05.06.2015 14:34:42 +0200 |=-
> Ok, so after further testing, it turns out that if I change the encoding of the 
> string from UTF-8 to ISO-8859..., it encodes to the proper entities.

This is because, in the absence of an explicit encoding statement, the perl 
interpreter considers the source text to be encoded in Latin-1.

From 'perldoc encoding', "Implicit upgrading for byte strings":

       By default, if strings operating under byte semantics and
       strings with Unicode character data are concatenated, the new
       string will be created by decoding the byte strings as ISO
       8859-1 (Latin-1).
       The encoding pragma changes this to use the specified
       encoding instead.

(Note, though, that the encoding pragma is deprecated. It is better to use 
the utf8 pragma and save your source as UTF-8.)
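
To illustrate the difference, here is a minimal sketch. It uses escape 
sequences instead of literal characters so that the result does not depend 
on how the example itself is saved. The variable names are only 
illustrative, and "é" is just an example character.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Without "use utf8", a UTF-8 encoded "é" in the source is seen as two
# bytes, and each byte is treated as one Latin-1 character.
my $as_bytes = "\xC3\xA9";

# With "use utf8" (and a UTF-8 source file), the same literal is seen as
# one character, U+00E9.
my $as_chars = "\x{E9}";

print encode_entities($as_bytes), "\n";   # &Atilde;&copy;
print encode_entities($as_chars), "\n";   # &eacute;
```

The first line of output is the symptom from the bug report; the second is 
what you get once perl knows the string holds characters rather than raw 
UTF-8 bytes.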

> I obviously can adjust the script to pre-convert UTF-8 to ISO-8859, 
> but it should at least be documented (though I don't see any reason why 
> encode_entities should not be able to deal with UTF-8)

encode_entities deals with whatever the perl interpreter supplies, and 
the perl interpreter needs your help in determining the meaning of the 
byte sequence you feed it.
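
If the text does not come from the source file but from outside (a file, 
a socket, a form), one way to give that help is to decode the bytes 
explicitly before encoding entities. A minimal sketch, assuming the input 
is UTF-8; the string "café" and the variable names are only illustrative:

```perl
use strict;
use warnings;
use Encode qw(decode);
use HTML::Entities qw(encode_entities);

# Raw UTF-8 bytes, as they would arrive from a file or socket
my $raw = "caf\xC3\xA9";

# Tell perl what the bytes mean: turn them into a character string
my $text = decode('UTF-8', $raw);

print encode_entities($text), "\n";   # caf&eacute;
```

When reading from a filehandle, an I/O layer such as 
binmode($fh, ':encoding(UTF-8)') should have the same effect as the 
explicit decode() call.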
