[Teammetrics-discuss] Phase I: Statistics for mailing lists on Alioth

Fri May 27 08:07:57 UTC 2011

Hi!

The first phase of the project is to start with parsing the archives
of the mailing lists at lists.alioth.debian.org.

As we had decided initially, we will be parsing the mbox archives that
are generated by Pipermail, the message archiver that comes with
Mailman (the lists on Alioth run on Mailman). There is some sample
code [0] that parses an mbox archive and outputs the statistical data
we need. You need to pass an mbox archive to the script, but otherwise
this code is good to go as far as our requirements are concerned.
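
To make the idea concrete, here is a minimal sketch of that kind of
parsing. It is not the code in [0]; it uses the stdlib mailbox module
instead, is written for Python 3 rather than 2.6, and assumes that
"statistical data" means the number of posts per sender per month. The
function name count_posts is hypothetical, and the archive is assumed
to be already decompressed.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr, parsedate_tz


def count_posts(mbox_path):
    """Return a Counter mapping (year, month, sender) to a message count.

    Assumption: the statistic of interest is posts per sender per month.
    """
    counts = Counter()
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            # str() guards against Header objects for non-ASCII headers.
            parsed_date = parsedate_tz(str(message.get("Date", "")))
            if parsed_date is None:
                continue  # skip messages without a usable Date header
            sender = parseaddr(str(message.get("From", "")))[1].lower()
            if not sender:
                continue  # skip messages without a usable From header
            year, month = parsed_date[0], parsed_date[1]
            counts[(year, month, sender)] += 1
    finally:
        box.close()
    return counts
```

Calling count_posts("2011-May.txt") on a decompressed archive would
give one counter entry per (year, month, address) combination. Messages
whose Date or From header cannot be parsed are skipped here; real
archives may need more forgiving handling.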

For our specific case, we will have to download the mbox archives for
each list locally and then parse them. As an example, consider the
blends-commit mailing list [1]. The archives for the blends-commit
mailing list span February 2009 to May 2011. For each mailing list,
I will parse the HTML and download all the gzip files, which
correspond to the months the list was active. That way we will
download and parse only what is required.
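
The download step could look roughly like the sketch below. It is
written for Python 3 with the stdlib only, and it assumes that the
Pipermail index page links each monthly archive with an href ending in
".txt.gz" (for example "2011-May.txt.gz"). The names ArchiveLinkParser,
find_archive_urls and download_archives are hypothetical.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlretrieve


class ArchiveLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .txt.gz files."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(".txt.gz"):
                self.hrefs.append(value)


def find_archive_urls(index_html, base_url):
    """Return absolute URLs of the monthly archives linked in index_html."""
    parser = ArchiveLinkParser()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def download_archives(urls, target_dir):
    """Download each URL into target_dir, skipping files already present."""
    os.makedirs(target_dir, exist_ok=True)
    paths = []
    for url in urls:
        path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(path):
            urlretrieve(url, path)
        paths.append(path)
    return paths
```

Only months that appear on the index page produce a link, so nothing
is requested for months in which the list was inactive. The downloaded
files are gzip-compressed and would still need to be decompressed (for
example with the gzip module) before parsing.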

I need your thoughts on this approach. Is this OK, or can we do
better? Note that the aim is to automate the entire process. There is
no information that we need to give the program; it does
everything itself.

I reinvented the wheel (thanks to Andreas for allowing me to do this!)
to parse the mbox, but it resulted in maintainable and (possibly)
faster code that runs on Python 2.6, uses only the stdlib, and that we
can tailor to our requirements. Yayay!

--
Thanks,
Sukhbir.

[0] - http://anonscm.debian.org/gitweb/?p=teammetrics/teammetrics.git;a=blob_plain;f=liststat.py;hb=HEAD
[1] - http://lists.alioth.debian.org/pipermail/blends-commit/