[Debian-in-workers] Debian-in-workers Digest, Vol 22, Issue 8

Mon Jun 4 14:54:00 UTC 2007

Moving thread to festival-te-devel.

On 6/4/07, U. Sudhakar <sudhakaru at cdac.in> wrote:
> I am referring to the Telugu Festival source.
>
> It shows some garbage like MAT2, given below.
>
> I want to know what MAT2 is. Please help me.
>
>
>
> (lts.ruleset
>  telugu
>   (
>         ; Matras can be formed by (OCT340 OCT260 MAT1 )  and (OCT340 OCT260 MAT2)
>         ( OCT340 à )
>         ( OCT260 ° )
>         ( OCT261  ± )
>         ( MAT1   Ÿ  ¿ )
>         ( MAT2  €  ‚ ƒ „ 
 † ‡   ˆ ‰ Š ‹ Œ )
>   )
>

Festival has no native support for Unicode. A Telugu character encoded
in UTF-8 occupies 3 bytes, but Festival does not recognize those 3
bytes as a single character. It treats each byte as a separate
single-byte character, as if the text were ASCII.

In festival-te, we use a simple hack to process UTF-8 encoded
strings: split each Unicode character into the 3 bytes it is composed
of, and treat each byte as a character. Refer to the Unicode tables
for the byte values. For example, the Telugu Unicode character A (అ)
is composed of the octal values \340 \260 \205 (hex 0xE0 0xB0 0x85).
You will find the gucharmap package useful for finding the mappings.
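
If you have Python at hand, the byte split can also be checked with a
few lines. This is only a sketch for illustration and is not part of
festival-te; the helper name utf8_octets is made up here.

```python
# Sketch: split a Unicode character into the UTF-8 bytes it is made of.
# utf8_octets is a hypothetical helper, not a festival-te function.

def utf8_octets(ch):
    """Return the UTF-8 bytes of ch as a list of integers."""
    return list(ch.encode("utf-8"))

telugu_a = "\u0c05"  # TELUGU LETTER A
octets = utf8_octets(telugu_a)

print([oct(b) for b in octets])  # ['0o340', '0o260', '0o205']
print([hex(b) for b in octets])  # ['0xe0', '0xb0', '0x85']
```

The octal values are the ones that appear in names such as OCT340 and
OCT260 in the ruleset.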

The junk characters that you see in the source file are precisely
these bytes, each displayed as a separate single-byte character.
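
To see where the junk comes from, decode the same bytes with a
single-byte encoding. The sketch below assumes your editor or mail
client showed the file as Windows-1252 (cp1252); that is a guess on my
part, and the helper name show_as_single_bytes is made up.

```python
# Sketch: show how UTF-8 bytes look when each byte is displayed as one
# character. The cp1252 default is an assumption about the viewer.

def show_as_single_bytes(text, encoding="cp1252"):
    """Encode text as UTF-8, then decode every byte on its own."""
    return text.encode("utf-8").decode(encoding)

# TELUGU VOWEL SIGN II (U+0C40) is the bytes 0xE0 0xB1 0x80 in UTF-8.
junk = show_as_single_bytes("\u0c40")
print(junk)  # à±€
```

The first two characters are the ones listed against OCT340 and OCT261
in the ruleset, and the third is the first character listed under MAT2.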

--
Chaitanya