[Soc-coordination] Status Report: ListArchive - week 9

Pali Rohár pali.rohar at gmail.com
Sun Jul 20 23:06:37 UTC 2014


Hello,

this week I fixed the last problems with parsing and processing
mbox archives containing MIME emails from all Debian archives. As
a result, the archiving program can now process all emails from
all Debian mailing list archives.

There was a problem with processing mixed mboxcl/mboxrd archives
from a pipe, because the Email::Folder::Mbox module uses the seek
function internally. I fixed this by introducing an in-memory
cache for non-seekable filehandles. Seeks are used only for
backward (fallback) reading, so a cache of the previous lines is
enough.
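
A minimal sketch of such a line cache, not the actual change made
to the archiver. The LineCache package, its method names and the
default of 100 kept lines are hypothetical. It remembers the most
recently read lines so that a reader can step back a few lines
without calling seek on the underlying handle:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Email::Folder::Mbox: wraps a filehandle
# and remembers the last $keep lines so a reader can step back without seek().
package LineCache;

sub new {
    my ($class, $fh, $keep) = @_;
    my $self = { fh => $fh, keep => $keep || 100, seen => [], pending => [] };
    return bless $self, $class;
}

# Return the next line, serving re-queued lines before reading the handle.
sub next_line {
    my ($self) = @_;
    my $line = @{ $self->{pending} }
        ? shift @{ $self->{pending} }
        : readline($self->{fh});
    return unless defined $line;
    push @{ $self->{seen} }, $line;
    # Forget the oldest lines once the cache is full.
    shift @{ $self->{seen} } while @{ $self->{seen} } > $self->{keep};
    return $line;
}

# Step back up to $n cached lines (emulates a short backward seek).
# Returns the number of lines actually restored.
sub rewind_lines {
    my ($self, $n) = @_;
    $n = @{ $self->{seen} } if $n > @{ $self->{seen} };
    unshift @{ $self->{pending} }, splice(@{ $self->{seen} }, -$n) if $n > 0;
    return $n;
}

package main;

# An in-memory handle stands in for a pipe here.
my $data = "From a\nbody 1\nFrom b\nbody 2\n";
open my $fh, '<', \$data or die $!;

my $cache = LineCache->new($fh, 10);
$cache->next_line for 1 .. 3;    # read up to "From b"
$cache->rewind_lines(2);         # step back two lines
print $cache->next_line;         # prints "body 1"
```

The cache only has to be as deep as the longest backward step the
parser takes, so memory use stays bounded regardless of the size
of the archive.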

Another very big problem was processing emails that have too many
recipient addresses in the From, To or Cc headers, where "too
many" means 7 or more. An email address with a name can contain
comments in these headers, and for an unknown reason the regular
expressions in the Email::Address module (which parses these
headers) are too slow. Some To headers with a lot of spaces and
brackets took more than 10 minutes to parse, which is not usable.
It looks like the problem was caused by the support for parsing
nested comments (which the RFCs allow). The module has a special
variable, COMMENT_NEST_LEVEL, for setting the nesting level, but
it was ignored once the module was loaded.

Finally I found a way to use that variable (without changing the
source code of the module). With nested comments in mail
addresses disabled, the module parses those headers immediately
(without the 10-minute delay).

I did not find any documentation on how to use
COMMENT_NEST_LEVEL, but in case somebody has the same problem:
instead of the traditional "use Email::Address;" you need to set
the variable manually and then call require. Something like this
works:

BEGIN {
	# Set the variable before the module is loaded;
	# changing it afterwards has no effect.
	local $Email::Address::COMMENT_NEST_LEVEL = 1;
	require Email::Address;
	Email::Address->import;
}

Current performance of the archiver:

Processing and archiving each mbox file into the correct archive
(about 27GB) takes 302 minutes.

Calling the script again (when all mbox archives are already
imported and there are no new emails) takes about 30 seconds,
which is quite good. So incremental import should be fast enough,
as it skips mbox files which were not modified.
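
How the archiver detects unmodified mbox files is not described
here. One common approach, shown as a sketch under that
assumption, is to record each file's size and modification time
and skip files whose values are unchanged. The changed_files
function and the %state hash are hypothetical names:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: returns the files whose size or mtime differ from
# the values recorded in %$state, and records the new values for them.
sub changed_files {
    my ($state, @files) = @_;
    my @changed;
    for my $file (@files) {
        my @st = stat($file) or next;    # skip files that cannot be read
        my $sig = "$st[7]:$st[9]";       # "size:mtime"
        next if defined $state->{$file} && $state->{$file} eq $sig;
        $state->{$file} = $sig;
        push @changed, $file;
    }
    return @changed;
}

# Usage with a temporary file standing in for an mbox archive.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "From a\n";
close $tmp_fh;

my %state;
print scalar(changed_files(\%state, $path)), "\n";    # first run: 1
print scalar(changed_files(\%state, $path)), "\n";    # second run: 0
```

In a real importer the state would have to be stored between runs
(for example in a small file next to the archive), and only the
files returned by the check would be parsed again.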

-- 
Pali Rohár
pali.rohar at gmail.com