[Soc-coordination] Status Report: ListArchive - week 9
Pali Rohár
pali.rohar at gmail.com
Sun Jul 20 23:06:37 UTC 2014
Hello,
this week I fixed the last problems with parsing and processing
mbox archives containing MIME emails from all Debian archives. As
a result, the archiving program can now process every email from
all Debian mailing list archives.
There was a problem with processing mixed mboxcl/mboxrd archives
read from a pipe, because the internal Email::Folder::Mbox module
used the seek function. I fixed this by introducing an in-memory
cache for non-seekable filehandles. Seeks are used only for
backward (fallback) reading, so a cache of the previously read
lines is enough.
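The idea of that cache can be sketched roughly as follows. This is
only an illustration, not the actual change made to
Email::Folder::Mbox: the CachedLineReader package and its next_line
and step_back methods are hypothetical names, and the cache size of
100 lines is an arbitrary assumption.

```perl
use strict;
use warnings;

package CachedLineReader;    # hypothetical helper, for illustration only

sub new {
    my ($class, $fh, $max) = @_;
    return bless {
        fh      => $fh,
        history => [],            # lines already returned, oldest first
        pending => [],            # lines stepped back over, returned again
        max     => $max // 100,   # assumed cache size
    }, $class;
}

# Return the next line, preferring lines that were stepped back over.
sub next_line {
    my ($self) = @_;
    my $line = @{ $self->{pending} }
        ? shift @{ $self->{pending} }
        : readline($self->{fh});
    return unless defined $line;
    push @{ $self->{history} }, $line;
    shift @{ $self->{history} } while @{ $self->{history} } > $self->{max};
    return $line;
}

# Emulate a backward seek of $count lines using only the cache.
sub step_back {
    my ($self, $count) = @_;
    return if $count < 1;
    die "cannot step back $count lines\n"
        if $count > @{ $self->{history} };
    unshift @{ $self->{pending} }, splice @{ $self->{history} }, -$count;
    return;
}

package main;

# An in-memory handle stands in for a non-seekable pipe here.
my $data = "From a\nbody\nFrom b\n";
open my $fh, '<', \$data or die "open: $!\n";
my $reader = CachedLineReader->new($fh);
$reader->next_line for 1 .. 2;    # consume "From a" and "body"
$reader->step_back(1);            # go back one line without seek
print $reader->next_line;         # prints "body" again
```

Because only the most recent lines are kept, memory use stays
bounded even for very large archives read from a pipe.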
Another very big problem was with processing emails that have too
many recipient addresses in the From, To or Cc headers, where "too
many" means 7 or more. An email address with a name can contain
comments in these headers, and for some unknown reason the regular
expressions in the Email::Address module (used for parsing these
headers) are too slow. Some To headers with a lot of spaces and
brackets took more than 10 minutes to parse, which is not usable.
It looks like this problem was caused by the support for parsing
nested comments (which the RFCs allow). The module has a special
variable, COMMENT_NEST_LEVEL, for setting the nesting level, but
changing it after the module was loaded had no effect. In the end
I found a way to use that variable (without changing the module's
source code): with nested comments in mail addresses disabled, the
module parses those headers immediately (without the 10 minute
delay).
I did not find any documentation on how to use COMMENT_NEST_LEVEL,
but in case somebody has the same problem: instead of the
traditional "use Email::Address;", you need to set the variable
manually and then call require. Something like this works:
BEGIN {
    # Must be set before the module is loaded; it is ignored afterwards.
    local $Email::Address::COMMENT_NEST_LEVEL = 1;
    require Email::Address;
    Email::Address->import;
}
Current performance of the archiver:
Processing and archiving every mbox file into the correct archive
(about 27GB in total) takes 302 minutes, which is roughly 1.5 MB/s.
Running the script again without any new emails (when all mbox
archives are already imported) takes about 30 seconds, which is
quite good. So incremental import should be fast enough, as it
skips mbox files which were not modified.
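One possible way to detect unmodified mbox files is sketched below.
It is an assumption for illustration, not necessarily how the
archiver decides: the needs_import function and the %seen hash are
hypothetical, and comparing file size and modification time is just
one common choice.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the file's size or mtime differs
# from what was recorded in %$seen (or nothing was recorded yet), and
# 0 when the file looks unchanged and can be skipped.
sub needs_import {
    my ($path, $seen) = @_;
    my @st = stat $path or die "stat $path: $!\n";
    my $stamp = "$st[7]:$st[9]";    # size and modification time
    return 0 if defined $seen->{$path} && $seen->{$path} eq $stamp;
    $seen->{$path} = $stamp;        # remember the state for the next run
    return 1;
}
```

A real importer would have to store %seen between runs and should
record the new state only after the import succeeded; both are left
out here to keep the sketch short.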
--
Pali Rohár
pali.rohar at gmail.com