Bug#702082: evince doesn't handle "astral" (non-BMP) Unicode characters in PDF link names

David Madore david+bugs at madore.org
Sat Mar 2 15:25:41 UTC 2013


Package: evince
Version: 3.4.0-3.1
Severity: minor

Open the file <URL: http://www.madore.org/~david/.misc/astral.pdf >
using evince. It contains a table of contents with a single entry
whose name is the same as the text in the document itself, i.e., "The name "
followed by the four characters U+10909 U+10904 U+10905 U+10904 (four
Phoenician letters, which are written right to left in the document,
but that's completely irrelevant here). When evince tries to display
the table of contents, it fails with the following error message:

(evince:8324): Gtk-WARNING **: Failed to set text from markup due to error parsing markup: Error on line 1 char 42: Invalid UTF-8 encoded text in name - not valid 'The name \xed\xa0\x82\xed\xb4\x89\xed\xa0\x82\xed\xb4\x84\xed\xa0\x82\xed\xb4\x85\xed\xa0\x82\xed\xb4\x84'

What this means is that some program or library somewhere (either
within evince itself or within libpoppler or something - I haven't
been able to discover which) took the Unicode string which in the PDF
is (correctly) encoded as UTF-16 (the PDF is uncompressed so it can be
easily checked that the encoding is as I state it):

"\xfe\xff\x00\x54\x00\x68\x00\x65\x00\x20\x00\x6e\x00\x61\x00\x6d\x00\x65\x00\x20\xd8\x02\xdd\x09\xd8\x02\xdd\x04\xd8\x02\xdd\x05\xd8\x02\xdd\x04"

and instead of converting it correctly to UTF-8

"The name \xf0\x90\xa4\x89\xf0\x90\xa4\x84\xf0\x90\xa4\x85\xf0\x90\xa4\x84"

produced the octet stream

"The name \xed\xa0\x82\xed\xb4\x89\xed\xa0\x82\xed\xb4\x84\xed\xa0\x82\xed\xb4\x85\xed\xa0\x82\xed\xb4\x84"

which is not valid UTF-8 and rightfully rejected by the Gtk toolkit.
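
To make the difference concrete, here is a minimal Python sketch that
reproduces both byte strings from the UTF-16 input quoted above. It is
not taken from evince or libpoppler; the function names are
hypothetical, and the "surrogatepass" error handler is used only to
imitate a converter that treats each 16-bit code unit as a character of
its own.

```python
# The entry name exactly as quoted in the report: UTF-16, big-endian, with BOM.
RAW = (b"\xfe\xff\x00\x54\x00\x68\x00\x65\x00\x20\x00\x6e\x00\x61\x00\x6d\x00\x65\x00\x20"
       b"\xd8\x02\xdd\x09\xd8\x02\xdd\x04\xd8\x02\xdd\x05\xd8\x02\xdd\x04")


def convert_correctly(raw: bytes) -> bytes:
    """Decode UTF-16 (the BOM selects the byte order), then encode as UTF-8."""
    return raw.decode("utf-16").encode("utf-8")


def convert_per_code_unit(raw: bytes) -> bytes:
    """Imitate the suspected bug: encode every 16-bit code unit on its own."""
    body = raw[2:] if raw[:2] == b"\xfe\xff" else raw
    units = [int.from_bytes(body[i:i + 2], "big") for i in range(0, len(body), 2)]
    # chr() on a surrogate value yields a lone surrogate; "surrogatepass"
    # lets Python emit its 3-byte form instead of raising an error.
    return "".join(chr(u) for u in units).encode("utf-8", "surrogatepass")


good = convert_correctly(RAW)
bad = convert_per_code_unit(RAW)
print(good)
print(bad)
```

A strict UTF-8 decoder refuses the second result: in Python,
bad.decode("utf-8") raises UnicodeDecodeError, which mirrors the
rejection by Gtk.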

The reason for this is obviously that some idiot thought that UTF-16
can be converted to UTF-8 by simply taking each 16-bit UTF-16 code
unit separately and converting it to UTF-8, whereas in fact surrogate
pairs (designating "astral" characters) must be combined into a single
code point before encoding. (The byte sequence that results from this
error is the encoding known as CESU-8, which is not valid UTF-8.)
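
As an illustration of handling surrogate pairs together, the following
Python sketch combines a high and a low surrogate into one code point
before encoding. The helper names are hypothetical and this is only a
sketch of the standard UTF-16 arithmetic, not code from any of the
libraries involved.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Return the code point designated by a high/low surrogate pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf16_units_to_utf8(units) -> bytes:
    """Convert a sequence of 16-bit code units to UTF-8, pairing surrogates."""
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # A high surrogate followed by a low one: one astral character.
            chars.append(chr(combine_surrogates(u, units[i + 1])))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # A surrogate without its partner cannot be represented in UTF-8.
            raise ValueError(f"unpaired surrogate 0x{u:04X} at index {i}")
        else:
            chars.append(chr(u))
            i += 1
    return "".join(chars).encode("utf-8")


# The pair D802 DD09 from the PDF designates U+10909.
print(hex(combine_surrogates(0xD802, 0xDD09)))
print(utf16_units_to_utf8([0x0054, 0xD802, 0xDD09]))
```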

I've been unable to find who the culprit is (from a superficial
glance, the code in both evince and libpoppler seems sane and calls
iconv, which is certainly not buggy itself), so I'm filing this bug
against evince, which exhibits it.
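
One possible way to narrow down the faulty layer is to dump the string
at each boundary and check it for UTF-8-encoded surrogates, which
always have the byte pattern ED, then A0..BF, then 80..BF. The Python
sketch below is a hypothetical diagnostic helper, not part of evince,
libpoppler or Gtk.

```python
import re

# A surrogate (U+D800..U+DFFF) forced through the 3-byte UTF-8 bit pattern
# always starts with ED followed by a byte in A0..BF.
SURROGATE_IN_UTF8 = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def find_encoded_surrogates(data: bytes) -> list:
    """Return the byte offsets of every encoded surrogate in data."""
    return [m.start() for m in SURROGATE_IN_UTF8.finditer(data)]


bad = (b"The name \xed\xa0\x82\xed\xb4\x89\xed\xa0\x82\xed\xb4\x84"
       b"\xed\xa0\x82\xed\xb4\x85\xed\xa0\x82\xed\xb4\x84")
good = b"The name \xf0\x90\xa4\x89\xf0\x90\xa4\x84\xf0\x90\xa4\x85\xf0\x90\xa4\x84"

print(find_encoded_surrogates(bad))
print(find_encoded_surrogates(good))
```

The first string reported by a layer that yields a non-empty list is
where the per-code-unit conversion has already taken place.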

-- 
     David A. Madore
   ( http://www.madore.org/~david/ )


