[PATCH 1/4] docs: API: introduce unicode module

Wed Feb 11 15:09:25 UTC 2015

Signed-off-by: Nicolas Sebrecht <nicolas.s-dev at laposte.net>
---
 docs/doc-src/API.rst | 455 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 453 insertions(+), 2 deletions(-)

diff --git a/docs/doc-src/API.rst b/docs/doc-src/API.rst
index 774f47b..8916b90 100644
--- a/docs/doc-src/API.rst
+++ b/docs/doc-src/API.rst
@@ -7,7 +7,13 @@
 :mod:`offlineimap's` API documentation
 ======================================
 
-Within :mod:`offlineimap`, the classes :class:`OfflineImap` provides the high-level functionality. The rest of the classes should usually not needed to be touched by the user. Email repositories are represented by a :class:`offlineimap.repository.Base.BaseRepository` or derivatives (see :mod:`offlineimap.repository` for details). A folder within a repository is represented by a :class:`offlineimap.folder.Base.BaseFolder` or any derivative from :mod:`offlineimap.folder`.
+Within :mod:`offlineimap`, the class :class:`OfflineImap` provides the
+high-level functionality. The rest of the classes usually do not need to be
+touched by the user. Email repositories are represented by a
+:class:`offlineimap.repository.Base.BaseRepository` or derivatives (see
+:mod:`offlineimap.repository` for details). A folder within a repository is
+represented by a :class:`offlineimap.folder.Base.BaseFolder` or any derivative
+from :mod:`offlineimap.folder`.
 
 This page contains the main API overview of OfflineImap |release|.
 
@@ -34,7 +40,8 @@ number of resources and conventions you may find useful.
 :class:`offlineimap.account`
 ============================
 
-An :class:`accounts.Account` connects two email repositories that are to be synced. It comes in two flavors, normal and syncable.
+An :class:`accounts.Account` connects two email repositories that are to be
+synced. It comes in two flavors, normal and syncable.
 
 .. autoclass:: offlineimap.accounts.Account
 
@@ -57,6 +64,450 @@ An :class:`accounts.Account` connects two email repositories that are to be sync
    `severity` that denotes the severity level of the error.
 
 
+:mod:`offlineimap.utils.uni` -- module for Unicode
+==================================================
+
+.. module:: offlineimap.utils.uni
+
+Module :mod:`offlineimap.utils.uni` provides functions for working with Unicode.
+It is used when Unicode support is enabled from the command line.
+
+And yes, Python 2.x is painful when working with encodings. You might like to
+read https://pythonhosted.org/kitchen/unicode-frustrations.html (although we are
+NOT using kitchen) for some background.
+
+To learn quickly how to use this module, jump to the `The Idiom in OfflineIMAP`_
+section.
+
+Definitions
+-----------
+
+* the ``decode`` operation expects an encoded string of bytes and converts it to
+  Unicode (code points).
+* the ``encode`` operation expects Unicode (code points) and converts it to an
+  encoded string of bytes.
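+
+As a minimal illustration of both operations, here is a sketch that should run
+unchanged on Python 2 and Python 3 (the variable names are only for this
+example):
+
```python
# 'à' is the code point U+00E0; its UTF-8 encoding is the two bytes c3 a0.
text = u'\xe0'

# encode: Unicode (code points) -> encoded string of bytes
encoded = text.encode('utf-8')

# decode: encoded string of bytes -> Unicode (code points)
decoded = encoded.decode('utf-8')

print(repr(encoded))
print(decoded == text)
```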
+
+Unicode in Python 2.x
+---------------------
+
+The names of the types are misleading:
+
+* type ``str`` is a string of bytes (already encoded with the encoding of a
+  charset);
+* type ``unicode`` is a string of mixed ASCII characters and Unicode code points
+  (covering all other existing characters). The mix is... embarrassing sometimes.
+* both ``str`` and ``unicode`` are strings of "something".
+
+Mixing types (unicode and str) and encodings might not work as expected. Here
+are samples of what Python does with 'à', Unicode 'U+00E0' (u'\xe0'), UTF-8
+'c3a0' ('\xc3\xa0'):
+
+* >>> u'z'  # Result: ``type unicode``, print decoded Unicode code point (character).
+  u'z'
+* >>> u'à'  # Result: ``type unicode``, Unicode code point.
+  u'\xe0'
+
+Python 2.x is not consistent in how it shows values of the unicode type: ASCII
+characters are shown as characters while non-ASCII characters are shown as
+Unicode code points.
+
+* >>> u'à'.encode('UTF-8')  # Result: ``type str``, UTF-8 encoded string of bytes.
+  '\xc3\xa0'
+* >>> 'à'   # Result: string of encoded bytes. Encoding depends on the declared encoding of the source.
+  '\xc3\xa0'  (with UTF-8; '\xe0' with Latin-1)
+
+Note: in the Python interpreter the encoding used relies on the locale.
+
+* >>> u'z' + 'z'  # Success (result: ``type unicode``)
+  u'zz'
+* >>> u'à' + 'z'  # Success (result: ``type unicode``)
+  u'\xe0z'
+* >>> u'z' + 'à'  # Failure
+  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 [...]
+* >>> u'à' + 'à'  # Failure
+  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 [...]
+
+The same applies when substituting with ``"%s"``.
+
+When mixing str and unicode types, Python tries to decode the string of bytes
+(type str) with the most restrictive settings ('ascii' charset with 'strict'
+errors). This is why mixing types in Python sometimes works and sometimes fails.
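+
+The implicit conversion can be reproduced explicitly. The sketch below (meant to
+run on both Python 2 and Python 3) performs by hand the decoding that Python 2
+attempts when a ``str`` meets a ``unicode``:
+
```python
utf8_bytes = u'\xe0'.encode('utf-8')   # the two bytes c3 a0

# Pure ASCII bytes decode fine with the strict 'ascii' codec...
ascii_ok = b'z'.decode('ascii', 'strict')

# ...but any byte above 0x7f is rejected, which is the error seen above.
try:
    utf8_bytes.decode('ascii', 'strict')
    failed = False
except UnicodeDecodeError as error:
    failed = True
    print(error)
```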
+
+* >>> u'z'.encode('UTF-8') + 'z'  # Success (result: ``type str``, string of UTF-8 encoded bytes)
+  'zz'
+* >>> u'à'.encode('UTF-8') + 'z'  # Success (result: ``type str``, string of UTF-8 encoded bytes)
+  '\xc3\xa0z'
+* >>> u'z'.encode('UTF-8') + 'à'  # Success (result: ``type str``, string of UTF-8 encoded bytes)
+  'z\xc3\xa0'
+* >>> u'à'.encode('UTF-8') + 'à'  # Success (result: ``type str``, string of UTF-8 encoded bytes)
+  '\xc3\xa0\xc3\xa0'
+
+The same applies when substituting with ``"%s"``.
+
+As long as only encoded strings of bytes are mixed, Python raises no error, but
+this is dangerous because mixing encoded bytes can give illegal/unreadable
+results. It remains safe while the bytes are plain ASCII, because ASCII is a
+subset of both Latin-1 and UTF-8 (bytes above 0x7f differ between the two). This
+is why we mix these encodings in `offlineimap`.
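+
+To see why mixing encoded bytes is dangerous, here is a small sketch (meant to
+run on Python 2 and Python 3) that concatenates the same character encoded two
+different ways. The concatenation itself never fails; the problem only shows up
+when the result is decoded:
+
```python
latin1_bytes = u'\xe0'.encode('latin-1')  # one byte: e0
utf8_bytes = u'\xe0'.encode('utf-8')      # two bytes: c3 a0

mixed = latin1_bytes + utf8_bytes         # no error at this point

try:
    mixed.decode('utf-8')
    readable = True
except UnicodeDecodeError:
    readable = False                      # e0 c3 is not a valid UTF-8 sequence

# Pure ASCII bytes, however, decode identically under all three encodings.
ascii_bytes = b'INBOX'
same = (ascii_bytes.decode('ascii')
        == ascii_bytes.decode('latin-1')
        == ascii_bytes.decode('utf-8'))
```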
+
+A word about IMAP encoding
+--------------------------
+
+To make things more fun (or frustrating, it's up to you), IMAP uses a dedicated
+non-standard encoding: a modified UTF-7 for IMAP. The modified UTF-7 IMAP
+(hereafter shortened to `IMAP encoding`) charset encodes non-ASCII characters
+with only ASCII characters. Encoded characters are variable-length.
+E.g. lowercase e-acute is encoded to `&AOk-`.
+
+UTF-7 is NOT a Unicode standard but it is more efficient on the internet and
+legacy compatible with the ASCII assumptions of the server-side software
+providing Usenet, SMTP, IMAP, etc.
+
+IMAP uses a modified version of UTF-7. See http://tools.ietf.org/html/rfc2060
+for details.
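+
+To make the encoding less abstract, here is a minimal encoder sketch following
+the rules of RFC 2060, section 5.1.3. The function name ``imap_utf7_encode`` is
+made up for this example; it is not the `offlineimap` implementation:
+
```python
import base64


def imap_utf7_encode(text):
    """Encode Unicode text to IMAP modified UTF-7 (illustrative sketch)."""
    out = []
    pending = []  # consecutive characters that must be base64-encoded

    def flush():
        # Encode pending characters as UTF-16BE, then modified base64:
        # no '=' padding, and ',' is used instead of '/'.
        if pending:
            raw = u''.join(pending).encode('utf-16-be')
            b64 = base64.b64encode(raw).decode('ascii')
            out.append(u'&' + b64.rstrip('=').replace('/', ',') + u'-')
            del pending[:]

    for ch in text:
        if u' ' <= ch <= u'~':      # printable ASCII, 0x20 to 0x7e
            flush()
            # '&' is the shift character, so a literal '&' becomes '&-'.
            out.append(u'&-' if ch == u'&' else ch)
        else:
            pending.append(ch)
    flush()
    return u''.join(out)


print(imap_utf7_encode(u'\xe9'))
```
+
+The result contains only ASCII characters, which is why such a string can end
+up being of either type str or unicode.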
+
+Anyway, this pure-ASCII encoding means that in Python, an encoded string can
+either be of type str or unicode...
+
+Have fun! ,-)
+
+Encodings inside and around `offlineimap`
+-----------------------------------------
+
+With the ``--unicode`` option enabled, OfflineImap works with the unicode type
+as much as possible. Notice this is not always possible. E.g., configuration
+options can be used to name the threads or for the ``__str__()`` method. Another
+fun part comes when handling the various encodings from I/O, libraries and some
+builtins since each has its own way of handling Unicode.
+
+* The IMAP protocol uses a modified version of UTF-7. UTF-7 is not a Unicode
+  standard. Neither is the IMAP UTF-7. It contains only ASCII characters.
+* stdin/stdout encodings depend on the system configuration (the result of
+  nl_langinfo(CODESET) relying on LC_CTYPE, see man nl_langinfo(3)).
+* The configuration file supports UTF-8. There is no full support of UTF-8 (see
+  ``offlineimap.conf`` for details).
+* Sqlite3 uses UTF-8 encoding by default. Contrary to what it seems, this can
+  hurt in some corner cases. We hit such corner cases in `OfflineIMAP`. Code
+  should be well documented on this.
+* imaplib2 works with IMAP UTF-7 encoded strings of bytes (`str`).
+* ``eval()`` expects Latin-1 or UTF-8 arguments.
+* ``__str__()`` relies on ASCII (there is a ``__unicode__()`` method in v2.x but
+  both are renamed ``__bytes__()`` and ``__str__()`` in v3.x). This is why we
+  stick with ``__str__()`` and add ASCII requirements on the configuration
+  options internally using this method.
+* The Thread module does not support Unicode and makes ASCII assumptions; pass
+  it encoded strings. If information is extracted from it, it must be turned
+  to Unicode early.
+* The Curses module does not support Unicode; pass it encoded strings.
+* Exceptions do not support Unicode; messages are encoded to the filesystem
+  encoding.
+* Log files are encoded to the filesystem encoding. Encoding happens late, so
+  send Unicode values to info(), warn() and the like.
+* Whatever encoding is allowed by the backend, folder names are encoded to IMAP
+  UTF-7 in the cache.
+
+Overall design to handle encodings
+----------------------------------
+
+Everything related to encodings lives in the `offlineimap.utils.uni` module.
+
+The first prerequisite is to have Unicode support optional. Most software out
+there made Unicode mandatory when implementing Unicode support. With
+`OfflineImap` that would hurt for two main reasons:
+
+* It's hard to fix subtle bugs in OfflineIMAP for various reasons (truncated or
+  silenced errors, not enough software debug capabilities, etc). Things are
+  evolving but the current status is still not ideal. Also, we have few active
+  maintainers. Being forced to use Unicode could make things worse for end-users.
+
+* Emails are often sensitive for our end-users. Since working with Unicode in
+  Python (v2 at least) is so delicate, we can expect very bad things to happen.
+  `OfflineImap` always does its best to avoid data loss and Unicode should
+  not be an exception.
+
+So, having Unicode support optional helps the transition. This means that the
+code has to handle both contexts (with or without Unicode) in a smart way.
+
+To get this, the module provides an API which is free of context. When using the
+`uni` module, the context is set once and it's possible to use it without having
+to care if it's working with Unicode support enabled or in legacy mode.
+
+The second prerequisite is for the module to be easy to use. Making things
+simple helps to focus on the real code, the core implementation. To keep it
+simple, it must handle the hard things itself and not bother developers with
+the internals of encoding/decoding strings.
+
+The Usual Idiom
+---------------
+
+To limit or avoid issues with Unicode, the usual rules are:
+
+* ``decode`` all bytes data as early as possible: keyboard strokes, files, data
+  received from the network, ...
+* ``encode`` back Unicode to bytes as late as possible: write text to a file, log
+  a message, send data to the network, ...
+
+IOW, the usual idiom is no more than having everybody agree on the same
+expectations with the technical benefit of working with Unicode, the
+representation which natively supports all encodings.
+
+While somewhat easy at first glance, the usual idiom is much, MUCH harder in
+practice than it seems. Here are some obstacles:
+
+* interlaced variables over the objects while each usage has its own limitations
+  and assumptions (e.g. the Python idiom of public access to attributes highly
+  sucks here, BTW);
+* callbacks and function passing to a library must be clearly identified to not
+  break the library's assumptions;
+* having to deal with the same type for multi-encodings (as discussed above);
+* heritage from libraries handling Unicode in their own ways.
+
+On top of those, the main obstacle is that the usage of the strings is not
+consistent in substance: a string can be used to display a message,
+retrieve/store data on disk, feed exceptions, name Threads, key dict values,
+flag a state, raise a protocol action, etc. Of course, each usage comes with
+its own encoding assumptions and limitations.
+
+In fact, the life of the string itself is usually not that simple. It jumps from
+one type of usage to another: e.g. a dict key might easily become a Thread
+attribute which will be used later when raising an exception or storing encoded
+data on disk. So, the encoding of the string depends on what is done with it at
+each usage time. You are an advocate of generics? Fine, I was one too. But you
+should agree on their limits.
+
+Picturing all of these possible usages mixed in some random ways gives a good
+part of the real picture and shows how hard it can be to apply the usual idiom.
+
+And that is without talking about digging into such hidden breakage in the
+code. Debugging with soft encapsulation and poor debugging tools can easily turn
+your happy coding course into a nightmare. I already hear some Python fanatics
+objecting to my statements. So, I'd like to add that when you have already spent
+days carefully encoding/decoding everywhere and you still get weird unexpected
+issues, it's time to move on to something new.
+
+I came to the conclusion that the underlying problem is the IMPLICIT encodings
+and types when using a variable. Any offending encoding/type can break the
+software in very unexpected ways. By requiring explicit encoding coming with a
+predictable type, you are at least forced to think about the encoding of the
+string each time it is used, which helps avoid most of the issues.
+
+Please, believe me. I've spent DAYS digging into missed or broken
+expectations. If you're not inclined to believe me, just take a step back, start
+off with OfflineImap v6.5.7-rc2 and implement Unicode support while sticking to
+the usual idiom. Make it work for the whole program to get a good overview.
+I'd be so happy to compare our results, seriously.
+
+Time to take a fresh new approach.
+
+The Idiom in OfflineIMAP
+------------------------
+
+We are not going to break everything. Let's start with the usual idiom and
+improve it a bit more.
+
+To limit or avoid issues with Unicode, try to follow these rules:
+
+* ``decode`` all bytes data as early as possible: keyboard strokes, files, data
+  received from the network, ...
+* ``encode`` back to bytes as late as possible: write text to a file, log
+  a message, send data to the network, ...
+* make encodings EXPLICIT, with reasonable types, EVERYWHERE it makes sense.
+* in legacy mode, limit encoding/decoding processes as much as possible (yes,
+  it's not even possible to avoid all encoding/decoding).
+
+Forcing explicit encodings is not possible with the types that Python provides.
+As a consequence, switch from ``Unicode strings everywhere`` to ``objects
+handling strings everywhere``. Requiring explicit encodings forces developers to
+think in Unicode most of the time. This is a good thing and contributors might
+well be surprised to realize how much encoding issues matter in the code.
+
+Call me crazy if it helps to make yourself less frustrated but that's the best
+thing I could come up with. I find this design elegant and damn effective in
+practice.
+
+Because the uni objects are expected to be used most of the time, they must be
+easy to use, with encode/decode operations done for the developer. Python
+properties, the Pythonic getters and setters for attributes, are well suited to
+the job.
+
+Notice that a string in ``Unicode`` is not very interesting outside the
+internals of the software. Inside the software, developers don't have to care
+about the original encodings (e.g. for string comparisons) if everything is
+Unicode. Outside, it must be encoded/decoded to/from the correct encoding to be
+useful (ASCII, UTF-8, etc). So, changing ``unicode everywhere`` to
+``something else everywhere`` does not hurt from this POV.
+
+Objects can be used transparently in either legacy or Unicode mode.
+
+Here we are!  With all the above said, let's be more concrete. There are a few
+functions and features in the module to know.
+
+The uni Basics
+--------------
+
+`UniError` is the error raised for all errors from the uni module.
+
+The context:
+
+* `uni.set_unicode_context(<type bool>)`: True enables Unicode support.
+* `uni.use_unicode()`: get context.
+
+The context can only be set once. Any other attempt will raise a `UniError`.
+
+* `uni.isASCII(<string>(, exception_msg))`: return True or False. If an
+  exception_msg is supplied, raise UniError with this message instead.
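+
+The behaviour described above can be pictured with the following sketch. It
+only mirrors the documented semantics (set-once context, optional exception
+message); the module-level variable and the function bodies are assumptions made
+for illustration, not the code of the `uni` module:
+
```python
class UniError(Exception):
    """Stand-in for the error raised by the uni module."""


_unicode_context = None  # None means the context has not been set yet


def set_unicode_context(enabled):
    """Set the context once; any further attempt raises UniError."""
    global _unicode_context
    if _unicode_context is not None:
        raise UniError("Unicode context is already set")
    _unicode_context = bool(enabled)


def use_unicode():
    """Return True when Unicode support is enabled."""
    return bool(_unicode_context)


def isASCII(string, exception_msg=None):
    """Return True if string is pure ASCII, else False (or raise UniError)."""
    try:
        if isinstance(string, bytes):
            string.decode('ascii')
        else:
            string.encode('ascii')
        return True
    except (UnicodeDecodeError, UnicodeEncodeError):
        if exception_msg is not None:
            raise UniError(exception_msg)
        return False
```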
+
+The uni Functions
+-----------------
+
+The "low-level" functions are simple wrappers around the Python encode/decode
+functions. The encodings used belong to a limited set: filesystem, IMAP,
+ASCII and Unicode.
+
+* `uni.uni2bytes(<string>(, encoding, errors, exception_msg))`: encode to encoding.
+* `uni.bytes2uni(<string>(, encoding, errors, exception_msg))`: decode from encoding.
+* `uni.uni2str(<string>(, exception_msg))`: coerce with ASCII charset.  Raise
+  UniError if coercing is not possible. This is a wrapper to the `str()` builtin
+  with better error reporting.
+* `uni.uni2std(<string>(, exception_msg))`: encode with default encoding. This
+  function should never fail. Useful for encoding suspect data to exceptions, for
+  example.
+* `uni.uni2fs(<string>(, errors, exception_msg))`: encode from Unicode to
+  filesystem encoding.
+* `uni.fs2uni(<string>(, errors, exception_msg))`: decode from filesystem
+  encoding to Unicode.
+* `uni.uni2imap(<string>(, errors, exception_msg))`: encode from Unicode to IMAP
+  encoding. Returns a str or unicode type.
+* `uni.imap2uni(<string>(, errors, exception_msg))`: decode from IMAP encoding
+  str/unicode to Unicode (unicode).
+
+Unless specifically defined above, the arguments are:
+
+* encoding: encoding as expected by Python encode/decode functions. Default is
+  `uni.ENCODING` (UTF-8).
+* errors: errors as expected by the Python encode/decode functions ('strict',
+  'ignore', 'replace').
+* exception_msg: on any error, raise `UniError` with exception_msg. This
+  helps add information about what was expected from where.
+
+
+The uni Objects
+---------------
+
+The `uni` module should mostly be used with the factories.
+
+* `uni.noneString()`: factory to get a uni object without bundled string (None).
+* `uni.valueString(<string>)`: factory to get a uni object FROM any encoding.
+* `uni.uniString(<string>)`: factory to get a uni object FROM a Unicode string.
+* `uni.dbytesString(<string>)`: factory to get a uni object FROM the default
+  encoding.
+* `uni.fsString(<string>)`: factory to get a uni object FROM the filesystem
+  encoding.
+* `uni.imapString(<string>)`: factory to get a uni object FROM the IMAP encoding.
+
+Under the hood the returned object of the factories can either be of type
+`StrObject` or `UnicodeObject`.  This depends on the context. But you should not
+care about that, they both have the exact same semantics (attributes and
+methods). You're seeing a bird that walks like a duck, swims like a duck and
+quacks like a duck... call that bird a duck.
+
+Internally, the factories use the setters.
+
+What's most interesting are the getters and setters. They handle the
+encode/decode jobs for you. They are:
+
+* `uni_object.value`: get/set the bundled RAW value.
+* `uni_object.uni`: get/set the Unicode string.
+* `uni_object.dbytes`: get/set the string with default encoding.
+* `uni_object.fs`: get/set the string with filesystem encoding.
+* `uni_object.imap`: get/set the string with IMAP encoding.
+
+In legacy context, the `StrObject` NEVER does encode or decode tasks. All
+getters and setters are basically aliases to the `value` accessor. Setters
+taking Unicode data coerce the string to the expected type `str`.
+
+Setters only accept a string or None.
+
+The encoding of the bundled value depends on the context:
+
+* no encoding, `str`, in legacy mode
+* Unicode, `unicode`, in unicode mode
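+
+How properties can hide the encode/decode work is sketched below for the
+Unicode context only. The class name ``UnicodeObjectSketch`` and its body are
+assumptions for illustration; only the accessor names (`value`, `uni`,
+`dbytes`, `fs`) come from the list above:
+
```python
import sys


class UnicodeObjectSketch(object):
    """Bundle a Unicode value; convert on access (illustrative only)."""

    ENCODING = 'utf-8'  # assumed default encoding, like uni.ENCODING

    def __init__(self, value=None):
        self.value = value  # RAW bundled value: Unicode text or None

    @property
    def uni(self):
        return self.value

    @uni.setter
    def uni(self, text):
        self.value = text

    @property
    def dbytes(self):
        # Encode late, only when bytes are actually requested.
        return None if self.value is None else self.value.encode(self.ENCODING)

    @dbytes.setter
    def dbytes(self, data):
        # Decode early, as soon as bytes enter the object.
        self.value = None if data is None else data.decode(self.ENCODING)

    @property
    def fs(self):
        if self.value is None:
            return None
        return self.value.encode(sys.getfilesystemencoding())

    @fs.setter
    def fs(self, data):
        if data is None:
            self.value = None
        else:
            self.value = data.decode(sys.getfilesystemencoding())


obj = UnicodeObjectSketch()
obj.dbytes = b'\xc3\xa0'   # bytes in the default encoding go in...
print(repr(obj.uni))       # ...and come out as Unicode
```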
+
+Patterns
+--------
+
+Some patterns are useful to know and come up regularly in the code.
+
+* Non `uni objects` variables are prefixed with the encoding they are supposed
+  to be (`fs_string`, `uni_string`, etc).
+* Encoding/decoding in place: `fsString(string_fs).uni`
+
+s % y
+s + y
+
+uni_obj.split(), equivalent of:
+activeaccounts = [uniString(a) for a in activeaccounts.split(",")]
+
+list of uni object into list of strings:
+[o.uni for o in uni_objs]
+the join() method:
+d = {a: 1, b: 2, c: 3}
+u'-'.join([o.uni for o in d])
+
+d.keys() -> list of uni objects.
+
+
+Patterns samples
+----------------
+
+Here are some samples of the most used patterns.
+
+Working with filesystem paths::
+
+        self._lockfilepath = fsString(os.path.join(
+            self.config.getmetadatadir().fs, "%s.lock"% self.name.fs))
+
+More advanced::
+
+        newfilename = fsString(os.path.join(fs_dir_prefix, fs_filename))
+        if (newfilename != oldfilename):
+            try:
+                fs_old = os.path.join(self.getfullname().fs, oldfilename.fs)
+                fs_new = os.path.join(self.getfullname().fs, newfilename.fs)
+                os.rename(fs_old, fs_new)
+            except OSError as e:
+                raise OfflineImapError("Can't rename file '%s' to '%s': %s"% (
+                    oldfilename.fs, newfilename.fs, e[1]),
+                    OfflineImapError.ERROR.FOLDER), \
+                    None, exc_info()[2]
+
+Configuration retrieving::
+
+        password = uniString(self.getconf('remotepass', None))
+
+Checks with None::
+
+        password = uniString(self.getconf('remotepass', None))
+        if password != None:
+            return rawString(password.dbytes)
+
+Notice that the configuration file is read with codecs.open() set to
+`uni.ENCODING`. Hence the first use of `uniString`. Then, we use the
+`rawString(password.dbytes)` pattern to encode the password back to what the
+user defined and ensure further reads remain unchanged.
+
+Limitations
+-----------
+
+* Since the context must be set for the `uni objects` to work properly, most
+  factories work at runtime. This means that something like::
+
+    call_a_function(uniString(variable).dbytes)
+
+  won't work.
+
+
+uniString(u'').uni -> redress type to str in legacy mode.
+
+Tips/Hints
+----------
+
+TODO:
+* uni-tests
+* hack module
+* split statements
+* list values are unicode (for v in mylist), ...
+* keys in dicts are unicode (ditto)
+
+
 :mod:`offlineimap.globals` -- module with global variables
 ==========================================================
 
-- 
2.2.2