[PATCH 5/7] Re: Don't keep sqlite connections open
christocoleman at yahoo.com
Fri May 6 19:14:17 BST 2011
On Fri, 6 May 2011 00:34:39 +0200, Nicolas Sebrecht wrote:
> >     def upgrade_db(self, from_ver):
> >         """Upgrade the sqlite format from version 'from_ver' to current"""
> This is a good improvement but...
> ...we tend to have a lot of constructs like this
>     conn, cursor = self.get_cursor()
>     with conn:
>         while True:
>             try:
>                 <execute some SQL request and commit>
>             except blablah:
>                 <retry on database lock>
>             <bail out of the while loop>
Yes, 4 times or so, as sqlite will simply bail out if another thread
currently locks the db, rather than waiting for the lock.
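For concreteness, the open-run-retry-close pattern being discussed could be sketched like this (the function name, helper structure and fixed retry count are illustrative, not actual OfflineIMAP code):

```python
import os
import sqlite3
import tempfile
import time

def execute_with_retry(dbfile, sql, args=(), retries=4):
    """Open the db, run one statement, commit, and close again,
    retrying a few times when another thread holds the lock.

    Minimal sketch of the pattern discussed above; not OfflineIMAP API.
    """
    for attempt in range(retries):
        conn = sqlite3.connect(dbfile)
        try:
            with conn:                     # commits, or rolls back on error
                return conn.execute(sql, args).fetchall()
        except sqlite3.OperationalError as e:
            if 'locked' not in str(e) or attempt == retries - 1:
                raise                      # not a lock, or out of retries
            time.sleep(0.1)                # let the other thread finish
        finally:
            conn.close()                   # never keep the connection open
```

The `with conn:` block commits on success and rolls back on error, so a crash between statements never leaves a half-written transaction behind.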
> it would be nice to factorize out. Not that this is wrong by design but
> it will cause harder work to maintain with possible errors and bugs.
I agree that factoring that out would be nice, but... ( :-) )
... I really don't want to put that into this topic series. I have been
spending plenty of time refining each of the patches (always doing
'git rebase -i next'), and the series is already as big as I am
comfortable with.
How about we think about merging this as is now, and work on a followup
series that builds on what we have and factors out whatever is
possible. For example, we still don't know if this approach [opening and
closing the db for each transaction] is scalable at all, or whether we
need a "db owning thread" that we pass the data to instead. I'd rather
gather some data from others first.
Alternatively, it got pointed out that *newer versions* of sqlite can
access the db from multiple threads. I would love to find out *how new*
that version must be, and improve our design so that we don't always have
to close and reopen our database, before embarking on more architecture
redesign.
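For reference, the stdlib sqlite3 module already exposes both the library version and its DB-API thread-safety level, so one way to probe this at runtime (just a diagnostic sketch, not OfflineIMAP code) would be:

```python
import sqlite3

# Report what the local sqlite build supports, before deciding whether
# sharing connections across threads is viable at all.
print("SQLite library version:", sqlite3.sqlite_version)
print("DB-API threadsafety level:", sqlite3.threadsafety)

# check_same_thread=False lets a connection created in one thread be
# used from another; the caller must then serialize access itself
# (e.g. with a lock), since the sqlite3 module itself only promises
# module-level safety at threadsafety level 1.
conn = sqlite3.connect(':memory:', check_same_thread=False)
```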
> I think we need a wrapper (or more) processing the SQL requests with
> generic data methods. Let me try to categorize the different requests:
> - read only: mostly SELECTs with no need to commit()
Right, no need to check for locks here, as we won't conflict. (Well,
after reading the docs, they basically state that one must always expect
the database to be locked, even for reads.)
> - write (we need to commit()):
> * one statement: easy
> * multiple statements: easy using a list
> * statements including additional code to execute: pass both the
> request and a function to the wrapper
Mmmh, I am not really sure I get this yet, can you give an example? What
additional code could you be thinking of? Are you thinking of something like:
    def savemessageflags(UID, flags):
        res = data.write(sql='UPDATE status SET flags=:flags WHERE UID=:UID',
                         func=<a function that does something (with what?)>)
        if res: ...
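If the proposal is "pass both the statements and a callback", then maybe something like this sketch is meant (all names here are hypothetical, not existing OfflineIMAP code):

```python
import sqlite3

def data_write(conn, statements, func=None):
    """Hypothetical wrapper: run a list of (sql, args) write statements
    in one transaction, optionally calling 'func' on the cursor before
    the commit, e.g. to inspect rowcount or run a dependent query."""
    with conn:                           # one commit for the whole batch
        cursor = conn.cursor()
        for sql, args in statements:
            cursor.execute(sql, args)
        if func is not None:
            return func(cursor)          # extra code runs inside the transaction

# e.g. update flags and learn whether any row actually changed:
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE status (uid INTEGER, flags TEXT)')
conn.execute('INSERT INTO status VALUES (?, ?)', (1, ''))
changed = data_write(
    conn,
    [('UPDATE status SET flags=? WHERE uid=?', ('S', 1))],
    func=lambda cur: cur.rowcount > 0)
```

Because the callback runs inside the `with conn:` block, whatever it does is covered by the same commit/rollback as the statements themselves.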
> - statistics
> - advanced tracking system for logging IMAP/cache sessions
> - manage miscellaneous exceptions
> - multiple data formats as backend (useful for testing purposes ;-))
The above 4 we can already do with much less effort, I think; see how
*relatively easy* it was to plug in a new data format as
backend. Improved exception handling will come anyhow, now that a
foundation for it is in place.
> - thread safe concurrent access by managing queues
Right, that would be an advantage, but we need to make sure we don't
continue before the queue has been worked off. Offlineimap takes care
that it can "crash" at any time without losing data, so we shouldn't
just stuff queries into a queue and continue. Not unsolvable, but it
requires some design thinking.
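To illustrate the design constraint: a "db owning thread" could drain a queue of writes while callers block on join() so they never run ahead of uncommitted data. A minimal sketch (hypothetical, not OfflineIMAP code):

```python
import os
import queue
import sqlite3
import tempfile
import threading

dbfile = os.path.join(tempfile.mkdtemp(), 'owned.db')
requests = queue.Queue()

def db_owner():
    """Single thread that owns the connection and commits each request."""
    conn = sqlite3.connect(dbfile)
    while True:
        item = requests.get()
        try:
            if item is None:            # sentinel: shut down
                break
            sql, args = item
            with conn:                  # commit this request
                conn.execute(sql, args)
        finally:
            requests.task_done()        # only now does join() count it as done
    conn.close()

worker = threading.Thread(target=db_owner)
worker.start()
requests.put(('CREATE TABLE t (a INTEGER)', ()))
requests.put(('INSERT INTO t VALUES (?)', (1,)))
requests.join()                         # block until every write is committed
requests.put(None)
worker.join()
```

The crucial bit is `requests.join()`: task_done() is only called after the commit, so a caller that joins the queue knows its data is on disk before it continues.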
That is why I would love to get what we have as experimental and
optional in the code now and go on improving it from there.
> I think the SQL topic is a very good opportunity for improving our data
> driver design. In short, I would expect something like data.write(),
> data.read(), (...).
mmmh, I would think that the current system already provides quite a bit
of abstraction by providing overridable classes for each backend for
each functionality, but I would be happy to get a discussion started on
what and how to improve things. But I still feel that is somewhat out of
the scope of this topic series.
Sebastian, Nicolas, List,
I want to put this out there and get the opinion of you and the list.
I am thinking of both performance (multi-threading, huge inboxes, multiple accounts on multiple servers) and data integrity: crashes or other interruptions in the code can damage the data stored in the flat text files. Those files are typically 100KB in size and may get written to disk a hundred times in a single invocation of offlineimap, every 3 minutes, for a week. Now I know why the disk light stays on solid for 1-2 minutes when the script is running... yikes!
There are already existing frameworks, pre-packaged, tested and working, available with a simple "apt-get install python-sqlobject" or "apt-get install python-sqlalchemy", for example.
These handle thread-safety and provide a generic interface to most database backends (mysql, postgresql...).
I think it would be really cool to let the user pick whichever database they have available, via a setting in the .offlineimaprc, while the offlineimap python code, using one of these persistence frameworks, would stay unchanged.
The user could select from: flat text file, or a backend database, with connection hostname, username, password and table prefix.
In my opinion,
1) the added performance and reliability would really be awesome!
2) no need to close the connection every time through the loop for fear that another thread will corrupt the data.
3) And no need to reinvent, develop and troubleshoot a framework that someone has already built.
4) There is a reason why these frameworks exist: the problem of handling persistence in an efficient and thread-safe way is not trivial, nor is it built into python.
5) I would bet you that 99% of people running offlineimap are running on a machine that
has a db that one of these persistence frameworks could talk to, even if it's just sqlite.
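To make the idea concrete, backend selection via .offlineimaprc could look like this sketch. The option names are invented for illustration (they are not existing OfflineIMAP settings); the database-URL format is the one SQLAlchemy's create_engine() actually accepts:

```python
import configparser

# Hypothetical .offlineimaprc additions -- option names are made up.
RC = """\
[general]
statusbackend = sqlalchemy
statusdburl = sqlite:////home/user/.offlineimap/status.db
"""

config = configparser.ConfigParser()
config.read_string(RC)

backend = config.get('general', 'statusbackend', fallback='plaintext')
dburl = config.get('general', 'statusdburl', fallback='')

# With SQLAlchemy installed, the same code path could then simply do
#     engine = sqlalchemy.create_engine(dburl)
# and work unchanged whether the URL names sqlite, mysql or postgresql.
```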
What's your opinion?
More information about the OfflineIMAP-project