<DKIM> [PATCH,review]: add lmdb folder backend

Luke Kenneth Casson Leighton lkcl at lkcl.net
Mon Dec 19 14:06:04 GMT 2016


---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68


On Mon, Dec 19, 2016 at 12:31 PM, Nicolas Sebrecht
<nicolas.s-dev at laposte.net> wrote:
>
>   Hi,
>
> I did a quick review.
>
> This feature would obviously require documentation in offlineimap.conf
> and this must be marked EXPERIMENTAL.

 ohhh yeah.  definitely.  wrote it up since posting

> On Mon, Dec 19, 2016 at 09:29:29AM +0000, lkcl wrote:
>
>> diff --git a/offlineimap/folder/LocalStatusLMDB.py b/offlineimap/folder/LocalStatusLMDB.py
>> index e69de29..3108595 100644
>> --- a/offlineimap/folder/LocalStatusLMDB.py
>> +++ b/offlineimap/folder/LocalStatusLMDB.py
>> @@ -0,0 +1,299 @@
>> +# Local status cache virtual folder: LMDB backend
>> +# Copyright (C) 2009-2016 Stewart Smith and contributors.
>> +# Copyright (C) 2016 Luke Kenneth Casson Leighton <lkcl at lkcl.net>
>
> We decided to use "and contributors" to avoid adding new copyright
> lines.

 i quite like explicit copyright notices.  plus, it's hell on earth
for distros to work out who actually contributed *unless* they're
explicitly listed... and if they're not listed, the package cannot be
added until the package maintainer has gone through every single damn
commit working out exactly who the contributors are to every single
file... by hand.

 if you're familiar with debian's copyright file, it lists the
contributors per-file using file-matching patterns (i.e. you can't
simply claim "all contributors contributed all files", because stating
that someone owns copyright on code that they don't actually own would
be misleading and quite possibly a criminal offense).

 i had to maintain a package, once, that literally had tens of
thousands of files in it.  even considering reviewing them to create
that initial debian/copyright file was out of the question, so i wrote
a program to work out the discrepancies (it was an O(N^3) algorithm
but it did the job).

 without that it would have taken literally weeks to review all the
files... and it relied *critically* on the copyright holders being
explicitly listed.

 bottom line is, there's really *really* good reasons not to just say
"and contributors".

> If this new file is yours, I'd suggest:
>
>   # Copyright (C) 2016 Luke Kenneth Casson Leighton and contributors.

 it's based on the sqlite file.

> to keep things simple.

 ... and make life harder for debian, ubuntu, fedora and all other
distros that have strict policies for explicitly and meticulously
maintaining an accurate and up-to-date list of all copyright holders
for all files...

>> +import os
>> +import lmdb
>
>
> Python 2.7.10 (default, Nov  9 2016, 23:16:09)
> [GCC 4.9.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> import lmdb
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   ImportError: No module named lmdb
>>>>
>
> We need to have this dependency made optional.

 the sqlite import should probably be made optional likewise.

> Notice this is likely
> easier to achieve while importing/using LocalStatusLMDB in
> repository/LocalStatus.py.

 moved it into the __init__: once you have an env there's no further
need for the lmdb module anyway.
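
 roughly along these lines (a sketch only: the class name, attribute
names and lmdb.open() arguments here are illustrative, not necessarily
what the patch ends up with):

    class LocalStatusLMDBFolder(object):
        """sketch: LMDB-backed status folder, dependency imported lazily."""

        def __init__(self, databasename):
            try:
                import lmdb   # imported only here, so the module stays optional
            except ImportError:
                raise UserWarning("the LMDB backend requires the python "
                                  "'lmdb' module to be installed")
            # after this point only the environment handle is used, so
            # lmdb never needs to be importable at module level.
            self._databasename = databasename
            self._env = lmdb.open(databasename, max_dbs=2, map_size=1 << 30)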

>     """LocalStatus backend implemented with an LMDB database."""
>
> (style)

 yep got that one (and a few others)

>> +        if not os.path.exists(dirname):
>> +            os.makedirs(dirname)
>> +        if not os.path.isdir(dirname):
>> +            raise UserWarning("LMDB database path '%s' is not a directory."%
>> +                               dirname)
>
> Race condition. Not significant, though.

 the same race condition is present in the sqlite code
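
 (the same fix works for both, fwiw: do the makedirs unconditionally
and tolerate EEXIST.  rough sketch, the helper name is just
illustrative:)

    import errno
    import os

    def ensure_dir(dirname):
        # create the directory if needed; tolerating EEXIST closes the
        # race window between an exists() check and makedirs().
        try:
            os.makedirs(dirname)
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
        if not os.path.isdir(dirname):
            raise UserWarning("LMDB database path '%s' is not a directory." %
                              dirname)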


>> +        with self._env.begin() as txn:
>
> I wonder if it's missing locks. This class will be *instantiated* and
> used more than once in different threads.

 ok lmdb is a multi-reader design with a single global mutex on
writing.  it's extremely clever.  so all those "with
self._env.begin()" blocks are transactions that don't block readers;
the only ones that will block (across all threads) are the "with
self._env.begin(write=True)" ones.

 soOoo... the only bit that has me concerned is the get/check on the
db_version....  i know what to do there....  use a
self._env.begin(write=True) as txn then pass that txn through to
__create_db and __upgrade_db.
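
 sketched out, it would look something like this (CURRENT_DB_VERSION
and the method bodies are placeholders here, not the patch's actual
code):

    CURRENT_DB_VERSION = 2   # placeholder value for illustration

    class LocalStatusLMDBFolder(object):
        def __init__(self, env):
            self._env = env

        def connect(self):
            # take the write lock once, so the check/create/upgrade
            # sequence can't interleave with another thread doing the same
            with self._env.begin(write=True) as txn:
                version = txn.get(b'db_version')
                if version is None:
                    self.__create_db(txn)
                elif int(version) < CURRENT_DB_VERSION:
                    self.__upgrade_db(txn, int(version))

        def __create_db(self, txn):
            txn.put(b'db_version', str(CURRENT_DB_VERSION).encode())

        def __upgrade_db(self, txn, from_version):
            txn.put(b'db_version', str(CURRENT_DB_VERSION).encode())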


> In order to avoid repeating the same code/patterns in the backends, I
> guess the best would be to add a new layer (class) to serialize the I/O
> on top of them. Not sure about that, though.

 would mean close synchronisation / analysis of the backends... not
sure if it's worth it... and some may have finer-grained opportunities
for locking (and different rules) than others.

 sqlite for example does locking on writes *and* reads... whereas lmdb
does locking *only* on writes, using copy-on-write and only needs a
global mutex for the very top block of the b+tree... it's amazingly
elegant.


>> +                 'flags': set(),
>> +                 'labels': set(),
>
> Trailing spaces for above two lines.

 :%s/ $//g :)



>> +                        'flags': set(flags or () ),
>                                                    ^
>
>> +                        'labels': set(labels or () )
>                                                      ^

 damn gmail variable-width fonts, can't tell what you're pointing
at.  i assume you mean the space between () and ), which i put there
deliberately so as to emphasise the empty tuple... yeh ok it goes :)


 amazingly this code actually works.  i found a couple of errors;
also, in the current version i removed the dict and replaced it with a
4-tuple: no point storing the keys of the dict repeated all the time,
they just take up space, esp. when there's versioning.
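
 i.e. something along these lines (a sketch: the field order and the
use of json here are assumptions, not necessarily what the patch
actually does):

    import json

    # store just the values for every message, not the dict keys as well
    def encode_status(uid, flags, labels, mtime):
        # sets become sorted lists so the value is stable and serializable
        return json.dumps((uid, sorted(flags), sorted(labels), mtime)).encode()

    def decode_status(raw):
        uid, flags, labels, mtime = json.loads(raw.decode())
        return uid, set(flags), set(labels), mtime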

the one thing i'm really not happy about is the *massive* in-memory
caching.  i have 200,000 messages, it's a 9GB gmail folder, and as a
libre project maintainer i need to be keeping an eye on messages
several times an hour.  100% CPU usage for up to 90 seconds, just to
take an in-memory copy of the database, isn't okay.

however.... it looks like in-memory caching is an integral part of
the design.  i thought about creating a derivative class of UserDict
(one that understands and mirrors the status_db) and then returning
that from cachemessagelist, the general idea being that the
replacement items(), __getitem__() etc. functions actually *directly*
reference the lmdb database on-demand instead of keeping the data
in-memory.  lmdb is a particularly good candidate for this because it
does *not* take copies of keys or values: because it's a mmap'd shmem
file (with COW enabled) it passes the *real* memory address around.
kinda cool.
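
 roughly this shape (again just a sketch: the class name, key layout
and json serialization are assumptions, not what the patch does):

    import json

    class LMDBMessageList(object):
        """dict-like view over the status db: entries are fetched from
        LMDB on access instead of being copied wholesale into memory."""

        def __init__(self, env):
            self._env = env

        def __getitem__(self, uid):
            with self._env.begin() as txn:       # read txn: no locking
                raw = txn.get(str(uid).encode())
                if raw is None:
                    raise KeyError(uid)
                return json.loads(raw.decode())

        def __setitem__(self, uid, status):
            # status must be json-serializable (e.g. sets already
            # converted to sorted lists)
            with self._env.begin(write=True) as txn:   # only writers serialize
                txn.put(str(uid).encode(), json.dumps(status).encode())

        def __contains__(self, uid):
            with self._env.begin() as txn:
                return txn.get(str(uid).encode()) is not None

        def keys(self):
            with self._env.begin() as txn:
                return [int(k) for k, _ in txn.cursor()]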

ok off to taiwan in the morning, will take this up again in a day or so.

l.



