Bug#959761: Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Tue May 5 11:16:17 BST 2020

On Tue, 05 May 2020 10:53:29 +0200, Axel Beckert wrote:

> > Perhaps the strings in wml need to be decoded from UTF-8 so that they 
> > aren't treated as a sequence of independent bytes?
> ... and would have expected "use feature unicode_strings;" already
> activates all of this.

(I haven't read the thread in detail …).

Personally I often use "use utf8::all" (from libutf8-all-perl) if I'm
reasonably sure that the input is not weird and I want to output
UTF-8. In my experience it is sometimes a bit slow, but it handles all
the en/decoding.

> > Explicitly using Encode helps:
> > 
> >  echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
> >  Wide character in print at -e line 1, <> line 1.
> >  包

% time echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
Wide character in print at -e line 1, <> line 1.
包
echo 包  0.00s user 0.00s system 42% cpu 0.002 total
perl -E   0.03s user 0.01s system 97% cpu 0.034 total

% time echo 包 | perl -Mutf8::all -E ' while(<>) { s|\s+\n|\n|sg; print }'
包
echo 包  0.00s user 0.00s system 63% cpu 0.002 total
perl -Mutf8::all -E ' while(<>) { s|\s+\n|\n|sg; print }'  0.04s user 0.01s system 98% cpu 0.050 total

% time echo 包 | perl -CS -E 'while(<>) { s|\s+\n|\n|sg; print }'
包
echo 包  0.00s user 0.00s system 60% cpu 0.002 total
perl -CS -E 'while(<>) { s|\s+\n|\n|sg; print }'  0.00s user 0.00s system 83% cpu 0.005 total

Cheers,
gregor

-- 
 .''`.  https://info.comodo.priv.at -- Debian Developer https://www.debian.org
 : :' : OpenPGP fingerprint D1E1 316E 93A7 60A8 104D  85FA BB3A 6801 8649 AA06
 `. `'  Member VIBE!AT & SPI Inc. -- Supporter Free Software Foundation Europe
   `-   BOFH excuse #378:  Operators killed by year 2000 bug bite.