Bug#661736: HTML::FormatText and UTF-8

brian m. carlson sandals at crustytoothpaste.net
Thu Dec 6 02:26:04 UTC 2012


The issue is this line (line 135):

  $text =~ tr/\xA0\xAD/ /d;

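This works great if your data is in a Unicode (character) string.  It
also works great if your data is a byte string using Latin-1.  It works
very poorly if your UTF-8 data is in a byte string: "à" is then the two
bytes C3 A0, and the tr turns the trailing A0 into a space, leaving an
invalid sequence.  In the example given in the original bug report,
-Mutf8 was not used, so the string literal is treated as a series of
(two) Latin-1 characters.  With -Mutf8 the data survives in both runs
below; -C6 only decides whether the output is written as UTF-8 or as
Latin-1.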
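
To see the difference without HTML::FormatText in the picture, here is a
minimal sketch that applies the same tr to "à" once as raw UTF-8 bytes
and once as a decoded character string.  The helper names strip_nbsp and
hex_of are made up for this illustration; only the tr itself comes from
the module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same substitution as the line quoted above:
# NBSP (A0) becomes a space, soft hyphen (AD) is deleted.
sub strip_nbsp {
    my ($text) = @_;
    $text =~ tr/\xA0\xAD/ /d;
    return $text;
}

# Hypothetical helper: show each character's ordinal in hex.
sub hex_of {
    my ($text) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $text;
}

my $bytes = "\xC3\xA0";               # "à" as raw UTF-8 bytes
my $chars = decode('UTF-8', $bytes);  # "à" as one character, U+00E0

print 'bytes:   ', hex_of(strip_nbsp($bytes)), "\n";  # c3 20
print 'decoded: ', hex_of(strip_nbsp($chars)), "\n";  # e0
```

The byte string loses its continuation byte; the decoded string is left
alone because U+00E0 is neither U+00A0 nor U+00AD.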

vauxhall ok % perl -MHTML::FormatText -Mutf8 -C6 -E 'print HTML::FormatText->new->format_string("à")' |hd
00000000  c3 a0 0a                                          |...|
00000003
vauxhall ok % perl -MHTML::FormatText -Mutf8 -E 'print HTML::FormatText->new->format_string("à")' |hd 
00000000  e0 0a                                             |..|
00000002

I suspect the correct fix for this bug is documentation.
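
What such documentation might recommend, as a sketch only: decode the
input to a character string before calling format_string, and encode the
result on output.  This assumes the HTML arrives as UTF-8 bytes; the
variable names are illustrative, and the upstream maintainers may prefer
different wording or a different approach.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::FormatText;

# Assumption: the HTML arrives as UTF-8 bytes (from a file, a socket,
# or a literal in a script without "use utf8").
my $html_bytes = "\xC3\xA0";

# Decode first, so the tr on line 135 sees the single character U+00E0
# instead of the bytes C3 A0.
my $html = decode('UTF-8', $html_bytes);
my $text = HTML::FormatText->new->format_string($html);

# Encode again on the way out (or use -C / binmode on the handle).
print encode('UTF-8', $text);
```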

-- 
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187