Bug#959474: Issues with Chinese language (all variants) when building some pages in buster

Axel Beckert abe at debian.org
Tue May 5 02:34:28 BST 2020


Hi,

I found the culprit quicker than expected. I'm no longer sure, though,
whether it's really a WML issue or whether it sits even deeper:

Axel Beckert wrote:
> → echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1
> → echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2
>
Level 2 actually only consists of these two regular expressions being
applied:

* s|(\S+)[ \t]{2,}|$1 |sg
* s|\s+\n|\n|sg

It's the second one, s|\s+\n|\n|sg (a really simple regexp), that
causes the breakage. But not always: it depends on which Perl version
feature set is in effect:

→ echo 包 | perl -pe 's|\s+\n|\n|sg;'
包
→ echo 包 | perl -pE 's|\s+\n|\n|sg;'
�

"-E" instead of "-e" means "use the most recent Perl version feature
set". For this bug it is equivalent to "use 5.014;", as that's what is
used in htmlstrip.

In a way, we're lucky, because the feature bundle of Perl 5.14 isn't
that big: "say state switch unicode_strings".

It's obvious that none of say, state or switch is causing this. So
it seems as if "use feature 'unicode_strings'" is the culprit. Proof:

→ echo 包 | perl -pe 's|\s+\n|\n|sg;'
包
→ echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�

That kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
perl package (not the whole Debian Perl Team); maybe they have some
insight into what actually goes wrong here and whether that's indeed a
Perl bug.

I'm leaving #959761 open in wml, as I now have an idea of how to fix
this there (adding "no feature 'unicode_strings'" to htmlstrip, in the
hope that this doesn't cause any collateral damage):

→ echo 包 | perl -pE 'no feature unicode_strings; s|\s+\n|\n|sg;'
包

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <abe at debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE