[Teammetrics-discuss] Findings from NNTPStat and Web Archive Parser

Sukhbir Singh sukhbir.in at gmail.com
Fri Dec 9 08:26:19 UTC 2011


Hi,

> Mboxes are just deleted (I hoped that you would be able to do this
> yourself - finally /var/cache/teammetrics is group rw).

I hadn't checked the permissions, but since I do have write access, I
will do it myself in future :)

> I wonder how you implemented the web archive parser.  Don't you use intermediate
> mboxes as well?  I hope I did not delete too much.

Here is what I do:

1. Read a page for a given message.
2. Get the required fields from it, such as Name, Subject, Date,
Message-ID. These fields are defined in the FIELDS tuple on line 19 of
archiveparser.py.
3. Populate the database on the fly, i.e., we don't save anything to
the disk because there is no need.
4. Note that we are *not* saving the message body as of now. I will do
it later once we fix other things.
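
To make the steps above concrete, here is a rough sketch of the flow.
It is not the code from archiveparser.py: the page layout, the regular
expression, the table name and the use of sqlite3 are all assumptions
made only so that the example is self-contained. Only the FIELDS names
come from the list above.

```python
import re
import sqlite3

# Stand-in for the FIELDS tuple mentioned in step 2 (contents assumed).
FIELDS = ("Name", "Subject", "Date", "Message-ID")

# Assumed page layout: one "<li><em>Field</em>: value</li>" entry per field.
FIELD_RE = re.compile(r"<li><em>([^<]+)</em>:\s*(.*?)</li>", re.IGNORECASE)


def parse_page(html):
    """Steps 1-2: return a dict holding the FIELDS found in one message page."""
    found = {name.lower(): value.strip() for name, value in FIELD_RE.findall(html)}
    return {field: found.get(field.lower(), "") for field in FIELDS}


def store(conn, record):
    """Step 3: insert the parsed fields directly, with no intermediate file."""
    conn.execute(
        "INSERT INTO messages (name, subject, date, message_id) "
        "VALUES (?, ?, ?, ?)",
        [record[field] for field in FIELDS],
    )


# sqlite3 is a stand-in here; the hypothetical "messages" table has no
# body column because the body is not saved (step 4).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (name TEXT, subject TEXT, date TEXT, message_id TEXT)"
)

# Made-up sample page in the assumed layout.
sample_page = """
<ul>
<li><em>Name</em>: Jane Doe</li>
<li><em>Subject</em>: Test message</li>
<li><em>Date</em>: Fri, 9 Dec 2011 08:26:19 +0000</li>
<li><em>Message-ID</em>: example-1@example.org</li>
</ul>
"""

record = parse_page(sample_page)
store(conn, record)
conn.commit()
```

Nothing touches the disk between reading the page and the INSERT, which
is why there are no intermediate mboxes to clean up.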

> Thinking twice about it:  What about if you *exactly* reimplement the
> mbox structure (I mean regarding directory layout and file naming
> scheme) of the official (hidden) mailing list archive.  This has two
> really great advantages:

As I pointed out above, we don't save anything to disk, but yes, this
is a good idea.

Based on what I mentioned about our approach with the current archive
parser, do you want me to implement the mbox creation from the web
archive? It should not be much work because most of the code already
exists in nntpstat.py.
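
For illustration, mbox creation from the parsed fields could look
roughly like the sketch below. It uses Python's standard mailbox module
and is not taken from nntpstat.py. The function name, the record keys,
the placeholder body and the file path are assumptions; the real
directory layout and file naming would have to follow the official
archive, which is not reproduced here.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def append_to_mbox(mbox_path, record, body="(message body not stored)\n"):
    """Append one parsed message to an mbox file (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = record["Name"]
    msg["Subject"] = record["Subject"]
    msg["Date"] = record["Date"]
    msg["Message-ID"] = record["Message-ID"]
    # The parser does not save bodies yet, so a placeholder is written.
    msg.set_content(body)

    box = mailbox.mbox(mbox_path)  # the file is created if it is missing
    box.lock()
    try:
        box.add(msg)
        box.flush()
    finally:
        box.unlock()
        box.close()


# Made-up record and a throwaway path, only to show the call.
record = {
    "Name": "Jane Doe <jane@example.org>",
    "Subject": "Test message",
    "Date": "Fri, 09 Dec 2011 08:26:19 +0000",
    "Message-ID": "<example-1@example.org>",
}
mbox_path = os.path.join(tempfile.mkdtemp(), "example.mbox")
append_to_mbox(mbox_path, record)
```

The resulting file can be read back with mailbox.mbox(mbox_path), so
tools that already work on mboxes should be able to consume it without
changes.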


