[PATCH 5/7] Re: Don't keep sqlite connections open

Sat May 7 15:08:35 BST 2011

On Fri, 6 May 2011 11:14:17 -0700 (PDT), chris coleman wrote:
> I want to put this out there and get the opinion of you and the list.
> 
> For performance (while multi-threading.. dealing with huge inboxes.. multiple accounts on multiple servers) and data-integrity reasons (crashes or other interruptions in the code that might damage the data stored previously in flat text files, typically 100KB in size, that were getting written to disk possibly 100 times in a single invocation of offlineimap, every 3 minutes for one week... Now I know why the disk light stays on solid for 1-2minutes when the script is running... yikes!).

Yes, OfflineImap has always played super safe and writes out the cache
after every single change. It does so by writing it first to a new tmp
file and then moving it into place of the old one to avoid partially
written content (often fsyncing in between). We basically have to expect
to be killed or to crash at any time, so playing safe is good, in general. It is
safe but it kills performance, especially for those guys that have
multi-million email boxes (yes, they do exist).
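
For readers unfamiliar with the write-to-tmp-then-move pattern described
above, here is a minimal sketch of it. It is not the actual offlineimap
code: the function name `write_atomically` and the one-entry-per-line
format are assumptions made for illustration only.

```python
import os
import tempfile


def write_atomically(path, lines):
    """Replace `path` so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the tmp file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # move the new file over the old one
    except BaseException:
        # Never leave a half-written tmp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Doing this after every single change means rewriting and fsyncing the
whole file each time, which is where the performance cost comes from.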

That's why the sqlite patches have been floating around since 2008 or
so. But due to development stagnation, they were never incorporated.

A database such as sqlite is very well suited to our purposes, I
think, although people have pointed out the benefits of plain text
files when it comes to e.g. recovering from corruption.

This change is a major step in my opinion, in terms of performance and
generally being nicer to our hard disks. However, there is plenty of
stuff left to do in offlineimap on which I would rather focus than
implementing or even incorporating an abstract database backend. sqlite
is good, but *I* don't really see any benefit in being able to stuff
your cache into a postgres. After all, firefox doesn't offer you the
possibility to put your bookmarks and its cache into postgres either.

> There are some already existing frameworks , pre-packaged, tested and working, and they're available with a simple "apt-get python-sqlobject" or "apt-get python-sqlalchemy" for example.

Now that we have a 2nd LocalStatus backend implemented, it would be
rather trivial to implement more backends, including one that wraps a
db abstraction layer. There are only a few functions to
implement. Patches are welcome, but *I* am not gonna introduce another
level of abstraction for a smallish cache.
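
To illustrate what "only a few functions" could look like, here is a
rough sketch of a pluggable status backend. All names in it
(`StatusBackend`, `load`, `save_message`, `delete_message`,
`make_backend`) are hypothetical and are not taken from the offlineimap
source; the real LocalStatus interface may well differ.

```python
from abc import ABC, abstractmethod


class StatusBackend(ABC):
    """Hypothetical minimal interface a status cache backend would fill in."""

    @abstractmethod
    def load(self):
        """Return a dict mapping message uid -> set of flags."""

    @abstractmethod
    def save_message(self, uid, flags):
        """Store or update a single message entry."""

    @abstractmethod
    def delete_message(self, uid):
        """Forget a single message entry."""


class InMemoryBackend(StatusBackend):
    """Toy backend that keeps everything in a dict (illustration only)."""

    def __init__(self):
        self._messages = {}

    def load(self):
        return dict(self._messages)

    def save_message(self, uid, flags):
        self._messages[uid] = set(flags)

    def delete_message(self, uid):
        self._messages.pop(uid, None)


# Hypothetical registry: a config value would select the backend by name.
BACKENDS = {"memory": InMemoryBackend}


def make_backend(name):
    return BACKENDS[name]()
```

A plaintext, sqlite or ORM-based backend would each be one more subclass
registered under its own name.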

> I think it would be really cool to let the user pick the database that they have available, with a setting in the .offlineimaprc, and the offlineimap python code using one of these persistence frameworks , would be unchanged.

Once we are convinced that sqlite works great (and everyone has
forgotten that it can even do plaintext), *I* would rather remove the
current option from offlineimap.conf again and just use the sensible
default. offlineimap.conf is a monster as it is, and each additional
code path means more paths to test (and conversely, fewer code paths
actively being used), which is bound to introduce more regressions and
failure opportunities. I'd rather try to keep offlineimap as simple as
possible.

> 1) the added performance and reliability would really be awesome!  

I am 100% sure that using a 3rd party db abstraction would not gain us
performance over just using sqlite. But I am willing to be convinced by
benchmarks :-).

> 2) no need to close the connection every time you go thru the loop because another thread will corrupt it.  
Fixed in the latest revision; sqlite3 has been multithreading capable
since 3.3. after all (published in April 2008 or so). We don't close it
anymore.
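
As a rough illustration of keeping one connection open across threads,
here is a sketch using Python's standard sqlite3 module. It is not the
code from the patch: the class name `StatusCache`, the `status` table
and its columns are assumptions. Note that Python's sqlite3 module by
default refuses to use a connection from a thread other than the one
that created it, so the sketch passes check_same_thread=False and
serializes access with its own lock.

```python
import sqlite3
import threading


class StatusCache:
    """Hypothetical cache that shares one open connection between threads."""

    def __init__(self, path):
        self._lock = threading.Lock()
        # check_same_thread=False lets other threads use this connection;
        # the lock above makes sure only one of them does so at a time.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        with self._lock, self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS status "
                "(id INTEGER PRIMARY KEY, flags TEXT)"
            )

    def save_flags(self, uid, flags):
        # `with self._conn` commits on success and rolls back on error.
        with self._lock, self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO status (id, flags) VALUES (?, ?)",
                (uid, flags),
            )

    def count(self):
        with self._lock:
            row = self._conn.execute("SELECT COUNT(*) FROM status").fetchone()
            return row[0]

    def close(self):
        with self._lock:
            self._conn.close()
```

The connection is opened once and closed once, instead of being
reopened on every pass through the loop.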

[SNIP lots of valid stuff]
> What's your opinion?

All nice and good. In the end, it comes down to someone getting their
hands dirty and implementing it. When it comes to developer time, *I*
would rather spend my time debugging IMAP hangs and improving our error
message handling than adding one more level of abstraction that
requires additional packages to install. This is just a smallish cache,
it's not like we are doing a db-based web app. :-)

But this being open source, the door is always open to contributions for
everyone to scratch their itches. ;)

Sorry if that is not what you wanted to hear from me, but you asked for
my opinion :)

Sebastian

==========================================

Sebastian, 

I appreciate your opinion.

I bring this up because I see the project has been spending valuable energy reinventing some basic database technology that already exists and has already been tested by millions of users:
a) flat files that are written to disk so often that they play the role of journaling and transaction logs, to prevent crashes from losing/corrupting data.
b) a cached local copy of a remote database index (this is what the LocalStatus file is).

With the trend being toward larger and larger IMAP inboxes (unlimited email storage available nearly everywhere), there are larger and larger LocalStatus cache files.

The chances of crashing and losing/corrupting data in the middle of a multi-megabyte write that takes place several times per second are going to get bigger, not smaller.

So, seriously, why not just let a proven, reliable database handle the data integrity concerns? Even a 64MB laptop can run free mysql with room to spare, so it can't be because of system requirements...

I would say that if you're willing to try, the next step should be to point out which source files contain database calls.  

Do you have, or could you write up, a document that lists the source files that call the db, along with the unwritten rules that must be followed when making calls to the db?

That is the hard part.

The next step after that is easy: just alter the calls to talk through one of the highly rated persistence frameworks... and test.

Chris