[Soc-coordination] Status Report: ListArchive - week 5

Pali Rohár pali.rohar at gmail.com
Sun Jun 22 13:11:17 UTC 2014


Hello,

this week I worked on speeding up the parsing, processing and 
importing of emails from mbox files into the archive.

My choice of module for parsing mbox archives, 
Email::Folder::Mbox, is still not perfect, and its maintainers 
have not reviewed my first patch yet. I have included the module 
in my tree and added more patches that speed up email processing.

My program now gets the Message-ID (used as the unique 
identifier) of an email as soon as it reads the next message from 
the mbox archive, and no longer has to wait until full MIME 
processing is complete. This speeds up skipping emails that have 
already been processed.
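
The idea of reading the Message-ID before any MIME processing can 
be sketched as follows. This is only an illustration in Python, 
not the actual Perl code in my tree; the names peek_message_id, 
import_new and full_parse are hypothetical, and the sketch assumes 
LF line endings and an unfolded Message-ID header.

```python
def peek_message_id(raw_message):
    """Return the Message-ID from the header block, without MIME parsing."""
    # Headers end at the first empty line; the body is never inspected.
    headers, _, _ = raw_message.partition(b"\n\n")
    for line in headers.split(b"\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"message-id":
            return value.strip().decode("ascii", "replace")
    return None


def import_new(raw_messages, seen_ids, full_parse):
    """Run the expensive full_parse only on messages not seen before."""
    imported = 0
    for raw in raw_messages:
        msg_id = peek_message_id(raw)
        if msg_id is not None and msg_id in seen_ids:
            continue  # already processed: skip the full MIME parse
        full_parse(raw)
        if msg_id is not None:
            seen_ids.add(msg_id)
        imported += 1
    return imported
```

Messages without a Message-ID are always parsed in this sketch, 
since there is nothing to compare them against.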

Next, I changed the module for parsing dates (now using 
Date::Format), which appears to be faster than the old one 
(DateTime). The new module can also parse more of the date 
formats found in debian-devel emails (various variants that 
violate RFC 2822).
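
As a rough illustration of lenient date parsing (in Python, using 
the standard email.utils functions rather than the Perl modules 
named above), a Date header can be converted to a Unix timestamp 
like this. The helper name parse_mail_date is hypothetical, and 
which malformed variants are accepted depends on the parser.

```python
from email.utils import mktime_tz, parsedate_tz


def parse_mail_date(value):
    """Return a Unix timestamp for a Date header, or None if unparseable."""
    # parsedate_tz tolerates some deviations from RFC 2822,
    # for example a missing day-of-week prefix.
    parsed = parsedate_tz(value)
    if parsed is None:
        return None
    return mktime_tz(parsed)
```

Returning None instead of raising lets the importer decide what to 
do with a message whose date cannot be parsed.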

With these changes, the time for importing all emails from the 
debian-devel archive decreased from 23 min to 17 min. Running the 
program again (when it skips all emails) takes only 2.40 min 
(before it was 16 min).

A repeated run on the same data (which only skips all emails) is 
better, but still not ideal. Because Debian emails from one 
mailing list are stored in several mbox archives, I started using 
the last modification time of each mbox archive. Caching the 
timestamp of each processed mbox file lets me skip opening the 
mbox file entirely if it has not been modified since the cached 
timestamp. After implementing this feature, a repeated run over 
all already imported mbox archives takes less than one second.
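
The timestamp cache could look roughly like the following Python 
sketch. It is an illustration under assumptions, not the code in 
my tree: process_changed is a hypothetical name, and the cache is 
a plain in-memory dict that would have to be saved between runs.

```python
import os


def process_changed(paths, cache, process):
    """Call process(path) only for mbox files modified since the last run.

    cache maps a path to the modification time seen when that file
    was last processed.
    """
    processed = 0
    for path in paths:
        mtime = os.stat(path).st_mtime
        cached = cache.get(path)
        if cached is not None and cached >= mtime:
            continue  # unchanged since last run: do not even open it
        process(path)
        # Store the mtime read before processing, so a write that
        # happens during processing is picked up on the next run.
        cache[path] = mtime
        processed += 1
    return processed
```

An unchanged file costs only one stat call here, which is 
consistent with a full repeated run finishing very quickly.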

-- 
Pali Rohár
pali.rohar at gmail.com