[PATCH 1/4] docs: API: introduce unicode module

Tue Feb 10 17:04:40 GMT 2015

Signed-off-by: Nicolas Sebrecht <nicolas.s-dev at laposte.net>
---
 docs/doc-src/API.rst | 256 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 254 insertions(+), 2 deletions(-)

diff --git a/docs/doc-src/API.rst b/docs/doc-src/API.rst
index 774f47b..0344b8f 100644
--- a/docs/doc-src/API.rst
+++ b/docs/doc-src/API.rst
@@ -7,7 +7,13 @@
 :mod:`offlineimap's` API documentation
 ======================================
 
-Within :mod:`offlineimap`, the classes :class:`OfflineImap` provides the high-level functionality. The rest of the classes should usually not needed to be touched by the user. Email repositories are represented by a :class:`offlineimap.repository.Base.BaseRepository` or derivatives (see :mod:`offlineimap.repository` for details). A folder within a repository is represented by a :class:`offlineimap.folder.Base.BaseFolder` or any derivative from :mod:`offlineimap.folder`.
+Within :mod:`offlineimap`, the class :class:`OfflineImap` provides the
+high-level functionality. The rest of the classes should usually not need to
+be touched by the user. Email repositories are represented by a
+:class:`offlineimap.repository.Base.BaseRepository` or derivatives (see
+:mod:`offlineimap.repository` for details). A folder within a repository is
+represented by a :class:`offlineimap.folder.Base.BaseFolder` or any derivative
+from :mod:`offlineimap.folder`.
 
 This page contains the main API overview of OfflineImap |release|.
 
@@ -34,7 +40,8 @@ number of resources and conventions you may find useful.
 :class:`offlineimap.account`
 ============================
 
-An :class:`accounts.Account` connects two email repositories that are to be synced. It comes in two flavors, normal and syncable.
+An :class:`accounts.Account` connects two email repositories that are to be
+synced. It comes in two flavors, normal and syncable.
 
 .. autoclass:: offlineimap.accounts.Account
 
@@ -57,6 +64,251 @@ An :class:`accounts.Account` connects two email repositories that are to be sync
    `severity` that denotes the severity level of the error.
 
 
+:mod:`offlineimap.utils.uni` -- module for Unicode
+==================================================
+
+.. module:: offlineimap.utils.uni
+
+Module :mod:`offlineimap.utils.uni` provides the functions to work with Unicode.
+It is used when Unicode support is enabled from the command line.
+
+And yes, Python 2.x is painful when working with encodings. You might like to
+read https://pythonhosted.org/kitchen/unicode-frustrations.html (although we
+are NOT using kitchen) to get the picture.
+
+To learn quickly how to use this module, jump to the `The Idiom in OfflineIMAP`
+section.
+
+Definitions
+-----------
+
+* the ``decode`` operation expects an encoded string of bytes and converts it to
+  Unicode (code points).
+* the ``encode`` operation expects Unicode (code points) and converts it to an
+  encoded string of bytes.
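+
+As a minimal sketch of both operations (our own illustration, written to run on
+Python 2.7 and Python 3.3+ alike), the two UTF-8 bytes of 'à' decode to a single
+code point and encode back to the same bytes:
+
+.. code-block:: python
+
+   # UTF-8 encoded string of bytes for U+00E0 (a-grave).
+   raw = b'\xc3\xa0'
+
+   # decode: encoded string of bytes -> Unicode code points.
+   text = raw.decode('utf-8')
+   assert text == u'\xe0'
+   assert len(raw) == 2 and len(text) == 1
+
+   # encode: Unicode code points -> encoded string of bytes.
+   assert text.encode('utf-8') == raw
+   # The same code point gives different bytes with another encoding.
+   assert text.encode('latin-1') == b'\xe0'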
+
+Unicode in Python 2.x
+---------------------
+
+The names of the types are misleading:
+
+* type ``str`` is a string of bytes (already encoded with the encoding of a
+  charset);
+* type ``unicode`` is a string of Unicode code points. ASCII characters are
+  displayed as themselves and all other characters as escaped code points. The
+  mix is... embarrassing sometimes;
+* both ``str`` and ``unicode`` are strings of "something".
+
+Mixing types (unicode and str) and encodings might not work as expected. Here
+are samples of what Python does with 'à', Unicode ``U+00E0`` (``u'\xe0'``),
+UTF-8 ``c3a0`` (``'\xc3\xa0'``):
+
+>>> u'z'  # Result: type unicode, displayed as the character.
+u'z'
+>>> u'à'  # Result: type unicode, displayed as an escaped code point.
+u'\xe0'
+
+Python 2.x is not consistent in how it displays values of type unicode: ASCII
+characters are shown as characters while non-ASCII characters are shown as
+escaped Unicode code points.
+
+>>> u'à'.encode('UTF-8')  # Result: type str, UTF-8 encoded string of bytes.
+'\xc3\xa0'
+>>> 'à'  # Result: type str. The bytes depend on the declared source encoding.
+'\xc3\xa0'
+
+Note: in the Python interpreter the encoding used relies on the locale; the
+output above assumes a UTF-8 locale.
+
+>>> u'z' + 'z'  # Success (result: type unicode)
+u'zz'
+>>> u'à' + 'z'  # Success (result: type unicode)
+u'\xe0z'
+>>> u'z' + 'à'  # Failure
+UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 [...]
+>>> u'à' + 'à'  # Failure
+UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 [...]
+
+The same applies when substituting with ``"%s"``.
+
+When mixing str and unicode types, Python tries to decode the string of bytes
+(type str) with the most restrictive mode ('ascii' charset with 'strict'
+errors). This is why mixing types in Python 2.x works only when the bytes are
+pure ASCII.
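+
+The implicit conversion can be reproduced explicitly. The sketch below is our
+own illustration (``implicit_coercion`` is a hypothetical name, not part of any
+library) and runs on Python 2.7 and Python 3.3+:
+
+.. code-block:: python
+
+   utf8_bytes = b'\xc3\xa0'   # 'à' encoded as UTF-8
+
+   def implicit_coercion(data):
+       """Do explicitly what Python 2 does to a str mixed with unicode."""
+       return data.decode('ascii', 'strict')
+
+   # Pure ASCII bytes survive the coercion.
+   assert implicit_coercion(b'z') == u'z'
+
+   # Any byte above 0x7F makes the 'ascii' codec fail.
+   try:
+       implicit_coercion(utf8_bytes)
+       failed = False
+   except UnicodeDecodeError:
+       failed = True
+   assert failed
+
+   # Decoding with the right codec before mixing works.
+   assert u'z' + utf8_bytes.decode('utf-8') == u'z\xe0'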
+
+>>> u'z'.encode('UTF-8') + 'z'  # Success (result: type str, UTF-8 encoded bytes)
+'zz'
+>>> u'à'.encode('UTF-8') + 'z'  # Success (result: type str, UTF-8 encoded bytes)
+'\xc3\xa0z'
+>>> u'z'.encode('UTF-8') + 'à'  # Success (result: type str, UTF-8 encoded bytes)
+'z\xc3\xa0'
+>>> u'à'.encode('UTF-8') + 'à'  # Success (result: type str, UTF-8 encoded bytes)
+'\xc3\xa0\xc3\xa0'
+
+The same applies when substituting with ``"%s"``.
+
+When mixing only encoded strings of bytes, Python never raises an error, but
+this is dangerous because mixing bytes from different encodings can give
+illegal or unreadable results. It remains safe as long as the bytes are pure
+ASCII, because ASCII is a subset of both Latin-1 and UTF-8. Beyond ASCII the two
+differ: 'à' is ``'\xe0'`` in Latin-1 and ``'\xc3\xa0'`` in UTF-8. ASCII, Latin-1
+and UTF-8 are the encodings mixed in `offlineimap`.
+
+A word about IMAP encoding
+--------------------------
+
+To make things more fun (or frustrating, it's up to you), IMAP uses a dedicated
+non-standard encoding: a modified UTF-7 for IMAP. The modified UTF-7 IMAP
+charset (later shortened to `IMAP encoding`) encodes non-ASCII characters with
+only ASCII characters. Encoded characters are variable-length. E.g. lowercase
+e-acute (é) is encoded to ``&AOk-``.
+
+UTF-7 is NOT a Unicode standard but it is more efficient on the internet and
+legacy compatible with the ASCII assumptions of the server-side software
+providing Usenet, SMTP, IMAP, etc.
+
+IMAP uses a modified version of UTF-7. See http://tools.ietf.org/html/rfc2060
+(section 5.1.3) for details.
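+
+The encoding rules of that RFC section can be sketched as follows: printable
+ASCII stands for itself, ``&`` becomes ``&-``, and any other run of characters
+is written as ``&`` + base64 of its UTF-16BE bytes (with ``,`` instead of ``/``
+and no ``=`` padding) + ``-``. ``imap_utf7_encode`` is a hypothetical helper
+written for this illustration; it is not the function used by `offlineimap` or
+imaplib2. It runs on Python 2.7 and Python 3.3+:
+
+.. code-block:: python
+
+   import base64
+
+   def _encode_run(run):
+       """Encode a run of non-ASCII characters as one '&...-' sequence."""
+       raw = run.encode('utf-16-be')
+       b64 = base64.b64encode(raw).rstrip(b'=').replace(b'/', b',')
+       return b'&' + b64 + b'-'
+
+   def imap_utf7_encode(text):
+       """Encode Unicode text to modified UTF-7 bytes (illustration only)."""
+       out = []
+       run = u''
+       for ch in text:
+           if 0x20 <= ord(ch) <= 0x7e:
+               # Printable ASCII: close any pending run, then copy as-is.
+               if run:
+                   out.append(_encode_run(run))
+                   run = u''
+               out.append(b'&-' if ch == u'&' else ch.encode('ascii'))
+           else:
+               run += ch
+       if run:
+           out.append(_encode_run(run))
+       return b''.join(out)
+
+   assert imap_utf7_encode(u'\xe9') == b'&AOk-'        # e-acute
+   assert imap_utf7_encode(u'A&B') == b'A&-B'          # literal ampersand
+   assert imap_utf7_encode(u'\u53f0\u5317') == b'&U,BTFw-'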
+
+Anyway, this pure-ASCII encoding means that in Python, an encoded string can
+either be of type str or unicode...
+
+Have fun! ,-)
+
+Encodings inside and around `offlineimap`
+-----------------------------------------
+
+With the ``--unicode`` option enabled, OfflineImap works with the unicode type
+as much as possible. Notice this is not always possible. E.g., configuration
+options can be used to name the threads or in a ``__str__()`` method. Another
+fun part comes when handling the various encodings from I/O, libraries and some
+builtins since each has its own way of handling Unicode.
+
+* The IMAP protocol uses a modified version of UTF-7. UTF-7 is not a Unicode
+  standard. Neither is the IMAP UTF-7. It contains only ASCII characters.
+* stdin/stdout encodings depend on the system configuration (the result of
+  ``nl_langinfo(CODESET)`` relying on ``LC_CTYPE``, see man nl_langinfo(3)).
+* The configuration file supports UTF-8, but only partially (see
+  ``offlineimap.conf`` for details).
+* Sqlite3 uses UTF-8 encoding by default. Contrary to what it seems, this can
+  hurt in some corner cases. We hit such corner cases in `OfflineIMAP`. The code
+  should be well documented on this.
+* imaplib2 works with IMAP UTF-7 encoded strings of bytes (`str`).
+* ``eval()`` expects Latin-1 or UTF-8 arguments.
+* ``__str__()`` relies on ASCII. There is a ``__unicode__()`` method in v2.x,
+  but in v3.x ``__str__()`` becomes ``__bytes__()`` and ``__unicode__()``
+  becomes ``__str__()``. This is why we stick with ``__str__()`` and add ASCII
+  requirements on the configuration options internally using this method.
+* The Thread module does not support Unicode and makes ASCII assumptions; pass
+  it encoded strings. If information is extracted from it, it must be turned to
+  Unicode early.
+* The Curses module does not support Unicode; pass it encoded strings.
+* Exceptions do not support Unicode; messages are encoded to the filesystem
+  encoding.
+* Log files are encoded to the filesystem encoding. Encoding happens late: send
+  Unicode values to info(), warn() and the like.
+* Whatever encoding is allowed by the backend, folder names are encoded to IMAP
+  UTF-7 in the cache.
+
+Overall design to handle encodings
+----------------------------------
+
+Everything related to encodings lives in the `offlineimap.utils.uni` module.
+
+The first prerequisite is to keep Unicode support optional. Most software out
+there made Unicode mandatory when implementing Unicode support. With
+`OfflineImap` it would hurt for mainly two reasons:
+
+* It's hard to fix subtle bugs in OfflineIMAP (truncated or silenced errors, not
+  enough software debug capabilities, etc). Things are evolving but the current
+  status is still not ideal. Also, we have few active maintainers. Being forced to
+  use Unicode could make things worse for end-users.
+
+* Emails are often sensitive for our end-users. Since working with Unicode in
+  Python (v2 at least) is so delicate, we can expect very bad things to happen.
+  `OfflineImap` always does its best to avoid data loss and Unicode should
+  not be an exception.
+
+So, having Unicode support optional helps the transition. This means that the
+code has to handle both contexts (with or without Unicode) in a smart way.
+
+To get this, the module provides an API which is free of context. When using the
+`uni` module, the context is set once and it's possible to use it without having
+to care whether it's working with Unicode support enabled or in legacy mode.
+
+The second prerequisite is for the module to be easy to use. Making things
+simple helps focusing on the real code, the core implementation. To keep it
+simple, it must handle the hard things itself and not bother developers with the
+internals of encoding/decoding strings.
+
+The Idiom
+---------
+
+To limit or avoid issues with Unicode, the usual rules are:
+
+* ``decode`` all bytes data as early as possible: keyboard strokes, files, data
+  received from the network, ...
+* ``encode`` Unicode back to bytes as late as possible: write text to a file,
+  log a message, send data to the network, ...
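+
+A minimal sketch of these two rules, using in-memory streams in place of real
+files or sockets. The function names are hypothetical and chosen for this
+illustration only; the code runs on Python 2.7 and Python 3.3+:
+
+.. code-block:: python
+
+   import io
+
+   def read_folder_names(stream, encoding='utf-8'):
+       """Decode early: bytes become unicode as soon as they are read."""
+       return [line.rstrip(b'\n').decode(encoding) for line in stream]
+
+   def write_folder_names(stream, names, encoding='utf-8'):
+       """Encode late: unicode becomes bytes only when written out."""
+       for name in names:
+           stream.write(name.encode(encoding) + b'\n')
+
+   source = io.BytesIO(b'INBOX\nArchiv\xc3\xa9s\n')
+   names = read_folder_names(source)
+   assert names == [u'INBOX', u'Archiv\xe9s']
+
+   # In between, the program only handles unicode values.
+   upper = [name.upper() for name in names]
+
+   sink = io.BytesIO()
+   write_folder_names(sink, upper)
+   assert sink.getvalue() == b'INBOX\nARCHIV\xc3\x89S\n'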
+
+While somewhat easy at first glance, this idiom is much, MUCH harder in
+practice than it seems. Here are some obstacles:
+
+* interlaced variables over the objects while each usage has its own limitations
+  and assumptions (e.g. the Python idiom of public access to attributes really
+  sucks here, BTW);
+* callbacks and function passing to a library must be clearly identified to not
+  break the library's assumptions;
+* having to deal with the same type for multi-encodings (as discussed above);
+* heritage from libraries handling Unicode in their own ways.
+
+On top of those, the main obstacle is that the usage of strings is not
+consistent in substance: a string can be used to display a message,
+retrieve/store data on disk, feed exceptions, name Threads, key dict values,
+flag a state, raise a protocol action, etc. Of course, each usage comes with its
+own encoding assumptions and limitations.
+
+In fact, the life of the string itself is usually not that simple. It jumps from
+one type of usage to another: e.g. a dict key might easily become a Thread
+attribute which will be used later when raising an exception or storing encoded
+data on disk. So, the encoding of the string depends on what is done with it at
+each usage time. You are a Generics advocate? Fine, I was one too. But you
+should agree on their limits.
+
+Picturing all of the possible usages mixed in arbitrary ways gives a good part
+of the real picture and of how hard it can be to apply this idiom.
+
+
+The uni Basics
+--------------
+
+`UniError` is the error raised for all errors from the uni module.
+
+The uni Functions
+-----------------
+
+The functions are thin wrappers around the Python encode/decode functions. The
+encodings used are limited to a small group: filesystem, IMAP, ASCII and
+Unicode.
+
+* `uni.uni2bytes(<string>[, encoding, errors, exception_msg])`: encode to encoding.
+* `uni.bytes2uni(<string>[, encoding, errors, exception_msg])`: decode from encoding.
+* `uni.uni2str(<string>[, exception_msg])`: coerce with ASCII charset. Raise
+  UniError if coercing is not possible. This is a wrapper around the `str()`
+  builtin with better error reporting.
+* `uni.uni2std(<string>[, exception_msg])`: encode with default encoding. This
+  function should never fail. Useful for encoding suspect data to exceptions, for
+  example.
+* `uni.uni2fs(<string>[, errors, exception_msg])`: encode from Unicode to
+  filesystem encoding.
+* `uni.fs2uni(<string>[, errors, exception_msg])`: decode from filesystem
+  encoding to Unicode.
+* `uni.uni2imap(<string>[, errors, exception_msg])`: encode from Unicode to IMAP
+  encoding. Returns a str or unicode type.
+* `uni.imap2uni(<string>[, errors, exception_msg])`: decode from IMAP encoding
+  str/unicode to Unicode (unicode).
+
+Unless specifically defined above, the arguments are:
+
+* encoding: encoding as expected by Python encode/decode functions. Default is
+  `uni.ENCODING` (UTF-8).
+* errors: errors as expected by the Python encode/decode functions ('strict',
+  'ignore', 'replace').
+* exception_msg: on any error, raise `UniError` with exception_msg. This
+  helps adding information about what was expected from where.
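+
+To give an idea of the wrapper pattern, here is a rough sketch of how two such
+functions could look. This is NOT the module's implementation: the class and
+functions below are stand-ins that only mirror the signatures listed above, and
+the message format is an assumption. It runs on Python 2.7 and Python 3.3+:
+
+.. code-block:: python
+
+   class UniError(Exception):
+       """Stand-in for the UniError described above."""
+
+   ENCODING = 'utf-8'   # stand-in for uni.ENCODING
+
+   def uni2bytes(text, encoding=ENCODING, errors='strict', exception_msg=None):
+       """Encode unicode to bytes; report failures as UniError."""
+       try:
+           return text.encode(encoding, errors)
+       except UnicodeError as exc:
+           raise UniError("%s (%s)" % (exception_msg or "cannot encode", exc))
+
+   def bytes2uni(data, encoding=ENCODING, errors='strict', exception_msg=None):
+       """Decode bytes to unicode; report failures as UniError."""
+       try:
+           return data.decode(encoding, errors)
+       except UnicodeError as exc:
+           raise UniError("%s (%s)" % (exception_msg or "cannot decode", exc))
+
+   assert uni2bytes(u'\xe0') == b'\xc3\xa0'
+   assert bytes2uni(b'\xc3\xa0') == u'\xe0'
+   assert uni2bytes(u'\xe0', 'ascii', 'replace') == b'?'
+
+   try:
+       uni2bytes(u'\xe0', 'ascii', exception_msg="folder name must be ASCII")
+       raised = False
+   except UniError:
+       raised = True
+   assert raised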
+
+
+
 :mod:`offlineimap.globals` -- module with global variables
 ==========================================================
 
-- 
2.2.2