Bug#496266: UTF-8 string characters not properly recognized
Changwoo Ryu
cwryu at debian.org
Tue Sep 2 22:12:21 UTC 2008
2008-09-02 (Tue), 13:19 -0500, Adam Majer:
> Christian Perrier wrote:
> >> On Saturday, 23 August 2008 at 19:59 -0500, Adam Majer wrote:
> >>> Package: gedit
> >>> Version: 2.22.3-1
> >>> Severity: normal
> >>>
> >>> The following UTF-8 string is not correctly handled in gedit,
> >>>
> >>> const char *unicode_insert = "?Э";
> >>>
> >>> The " and the ? characters are viewed as one character, making the
> >>> entire thing next to impossible to copy/paste/edit.
> >> Looks like an issue in pango, since it is not specific to gedit.
> >>
> >> Such things seem to happen a lot when using Tibetan characters, so this
> >> may or may not be intentional. I’d prefer to have the input of someone
> >> who uses them. Is there anyone on debian-i18n who’s more knowledgeable
> >> about Tibetan glyphs?
> >
> >
> > Adding Pema Geyleg and Tenzin Dendup, our fellow Dzongkha translation
> > coordinators, who certainly have expertise in Tibetan-family scripts
> > (Dzongkha is one of these) and could maybe point you to people with
> > the needed knowledge.
>
>
> I'm sorry, but aren't we missing the entire point here? This is not
> about bad handling of some Tibetan characters. It is about bad handling
> of 3-byte UTF-8 characters.
>
> http://en.wikipedia.org/wiki/UTF-8
>
> So, the following characters should have the same problems,
>
> "ऄक
>
> "ঈউঊ
>
> "ਜਗਏ
>
> "ଜଁଂ
>
> "ஔ
>
> "ంఁః
>
> "ಂಖ
>
> "ഈഃ
>
> etc..
>
>
> I've put an ASCII " in front of each group of characters. In emacs,
> I'm able to select the " in front of these characters and copy it. vim
> under a UTF-8 gnome terminal also allows the " to be selected. On the
> second-to-last line above (using icedove), I can't independently select
> the " but I can select the " and ಂ together and then remove the 2nd character.
>
> Maybe it is just my misunderstanding of UTF-8, I'm not sure. But at
> least my expected behaviour was being able to select 1 UTF-8 character
> at a time, even if linguistically it does not make any sense.
The Tibetan code point in this case, U+0FA1, is NOT a character by itself.
It is a code point that combines with other Tibetan code points to form a
Tibetan character. Unicode code points do not necessarily represent
characters. Selecting a combined character as a whole is more expected than
selecting its sub-parts (even when that is possible).
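
As a rough illustration of the difference between a code point and a character, the Unicode properties of U+0FA1 can be inspected with Python's standard unicodedata module. This is only a sketch for looking at the data; it says nothing about how gedit or Pango work internally, and the variable names are made up for the example.

```python
import unicodedata

# U+0FA1, the code point discussed above. The escape is used so the
# example does not depend on fonts or on the mail encoding.
cp = "\u0fa1"

info = {
    "name": unicodedata.name(cp),
    # A general category starting with "M" marks a combining mark.
    "category": unicodedata.category(cp),
    # Number of bytes this single code point takes in UTF-8.
    "utf8_bytes": len(cp.encode("utf-8")),
}
print(info)

# An ASCII quote followed by U+0FA1: two code points, four UTF-8 bytes.
s = '"' + cp
print(len(s), len(s.encode("utf-8")))
```

The byte length and the code point count are both well defined here; what an editor treats as one selectable "character" is a separate question.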
This issue is about how Unicode combining sequences are handled. In this
case, Pango interprets a quote mark (") followed by the Tibetan code point
U+0FA1 (a wrong combination) as one combined character. I'm not sure whether
that is defined behavior.
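
To show how a quote mark and U+0FA1 can end up as one selectable unit, here is a minimal sketch that attaches every combining mark to the code point before it. It is a deliberate simplification of Unicode grapheme cluster segmentation (UAX #29), not Pango's actual algorithm, and rough_clusters is a hypothetical helper written only for this example.

```python
import unicodedata

def rough_clusters(text):
    """Group each combining mark (general category M*) with the code
    point before it. Simplified on purpose; real segmentation per
    UAX #29 has more rules than this."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

# A quote mark directly followed by U+0FA1, as in the reported string.
sample = 'x = "\u0fa1";'
clusters = rough_clusters(sample)
print(len(sample), len(clusters))
```

Under this simplified rule the opening quote and U+0FA1 fall into the same cluster, so a cursor that moves cluster by cluster cannot stop between them.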
--
Changwoo Ryu <cwryu at debian.org>