Bug#798727: Encode::Unicode decode() dies unnecessarily

Sat Sep 12 09:17:34 UTC 2015

On Sat, Sep 12, 2015 at 01:40:29AM +0200, Damian Lukowski wrote:
> The Encode::Unicode documentation states the following:
> 
> When BE or LE is omitted during decode(), it checks if BOM is at the
> beginning of the string; if one is found, the endianness is set to what
> the BOM says. If no BOM is found, the routine dies.
> 
> To reproduce:
> ---
> use Encode qw/decode/;
> decode("utf-16be", "Hello World"); # does not die
> decode("utf-16le", "Hello World"); # does not die
> decode("utf-16", "\xFE\xFFHello World"); # does not die
> decode("utf-16", "Hello World"); # dies with "UTF-16:Unrecognised BOM"
> ---
> 
> Unicode Standard version 8.0:
> 
> The UTF-16 encoding scheme may or may not begin with a BOM. However,
> when there is no BOM, and in the absence of a higher-level protocol, the
> byte order of the UTF-16 encoding scheme is big-endian.
> 
> RFC2781:
> 
> If the first two octets of the text is not 0xFE followed by
> 0xFF, and is not 0xFF followed by 0xFE, then the text SHOULD be
> interpreted as being big-endian.
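
For callers who want the RFC 2781 behaviour in the meantime, a workaround
along the following lines might do. This is only a sketch: the helper name
decode_utf16_lenient is hypothetical and not part of Encode, and it assumes
the input is a byte string rather than already-decoded characters.

```perl
use strict;
use warnings;
use Encode qw/decode/;

# Hypothetical helper, not part of Encode: decode UTF-16 octets and fall
# back to big-endian when no BOM is present, as RFC 2781 recommends.
sub decode_utf16_lenient {
    my ($octets) = @_;

    # A BOM is present: let Encode choose the byte order and strip the BOM.
    return decode('UTF-16', $octets)
        if $octets =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # No BOM: assume big-endian instead of dying.
    return decode('UTF-16BE', $octets);
}

# No BOM, big-endian octets for "Hi".
print decode_utf16_lenient("\x00H\x00i"), "\n";

# Little-endian BOM followed by little-endian octets for "Hi".
print decode_utf16_lenient("\xFF\xFE" . "H\x00i\x00"), "\n";
```

The BOM check is done by the caller so that decode('UTF-16', ...) is only
reached when it cannot die with "Unrecognised BOM".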

Thanks for the bug report. I've added your patch to the upstream bug
report and will wait for upstream's comments.

Dominic.