Bug#611625: libglib2.0: g_utf16_to_ucs4 performs incorrectly

Fri Aug 8 10:50:38 UTC 2014

On Mon, 31 Jan 2011 at 14:15:40 +0200, alex bodnaru wrote:
>     glong read2, written2;
>     GError *error = NULL;
>     const wchar_t *ws = L"\x0428\x04d9\x043d\x0431\x04d9";
>     gchar *conv2 = g_utf16_to_utf8(ws, -1, &read2, &written2, &gerror);
>     gunichar *ws2 = g_utf16_to_ucs4(ws, -1, &read2, &written2, &gerror);
>     /*ws2 should be: {0x0428,0x04d9,0x043d,0x0431,0x04d9}
>     but it's not.*/

(That won't compile as written: "error" and "gerror" are not the same name.)

You seem to be assuming that wchar_t* is always a UTF-16 string. This is not
the case: wchar_t is typically 16 bits on Windows but 32 bits on Unix.
In particular, the platform ABI used on Debian has 32-bit wchar_t.
(A wchar_t* also doesn't have to be Unicode.)
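
To illustrate what probably goes wrong here, the following is a minimal
sketch in plain C (no GLib). The helper name utf16_units_before_nul is made
up for this example. It reinterprets the bytes of a wide string as 16-bit
units and counts how many come before the first zero unit, which is roughly
what a UTF-16 function given a length of -1 would see.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only: count the 16-bit units that
 * precede the first zero unit when the storage of a wide string is read
 * as if it were a NUL-terminated UTF-16 string. */
static size_t utf16_units_before_nul(const wchar_t *ws)
{
    assert(ws != NULL);

    /* Total storage of the wide string, including its terminator. */
    size_t nbytes = (wcslen(ws) + 1) * sizeof(wchar_t);
    const unsigned char *bytes = (const unsigned char *)ws;
    size_t n = 0;

    for (size_t off = 0; off + sizeof(uint16_t) <= nbytes;
         off += sizeof(uint16_t)) {
        uint16_t unit;

        /* memcpy avoids aliasing problems when reading the raw bytes. */
        memcpy(&unit, bytes + off, sizeof unit);
        if (unit == 0)
            break;
        n++;
    }
    return n;
}
```

With 16-bit wchar_t this returns 5 for the string in the report. With 32-bit
wchar_t every character is followed (little-endian) or preceded (big-endian)
by a zero 16-bit unit, so the "UTF-16 string" appears to end after at most
one character.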

When compiled with GCC's -fshort-wchar option, which makes wchar_t 16 bits
wide, code similar to that does work (but it will probably be incompatible
with other platform libraries, which expect the platform's normal wchar_t).

For best results, use gunichar * (or a pointer to another 32-bit type)
for UCS-4, gunichar2 * (or a pointer to another 16-bit type) for UTF-16
or UCS-2, and gchar * (or a pointer to another 8-bit type) for UTF-8 or
legacy encodings like ISO-8859-*. Only use wchar_t (whose size and
encoding are unspecified) if you must interact with platform APIs that
use it. You can convert between standard encodings (UTF-16, UCS-4, etc.)
and the unspecified encoding used by wchar_t* by passing "WCHAR_T" to
g_iconv_open() or g_convert().

    S