[Teammetrics-discuss] NNTP parsing update.

Sukhbir Singh sukhbir.in at gmail.com
Mon Jul 4 21:21:26 UTC 2011


Hi,

repository.update() if you want to see the code or you can view it online.

There is a new file called nntpstat.py. It uses the same configuration
file as liststat.py, so the user does not need to worry about the
technical details; all they need to do is give the list name with the
URL in the single configuration file.

Let's look at nntpstat.py. Here is what we are doing:

- We connect to list.gmane.org with the name of the list, which
redirects to the Gmane group web page.
- We parse that page using BeautifulSoup and get the name of the group.
- We open the group and download the From, Subject and Date fields,
and then the body of each message.

Downloading the entire header at once took a lot of time, so instead
we download the From, Subject and Date fields individually, which is
much faster.

- We generate the mbox from this information.
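
As a rough illustration of the per-field download, here is a minimal
sketch. It assumes the fields are fetched with the NNTP XHDR command,
which returns one header for a whole range of articles; whether
nntpstat.py does exactly this is an assumption. fetch_fields and its
return layout are made up for the example. The conn argument can be any
object with an xhdr(header, range) method returning (response, list of
(article number, value) pairs), which is the shape nntplib.NNTP.xhdr()
uses.

```python
from collections import defaultdict

FIELDS = ("From", "Subject", "Date")


def fetch_fields(conn, first, last, fields=FIELDS):
    """Return {article_number: {field: value}} for articles first..last.

    Hypothetical helper: issues one XHDR request per field instead of
    downloading the full header of every article.
    """
    span = "%d-%d" % (first, last)
    articles = defaultdict(dict)
    for field in fields:
        # xhdr() yields (article_number, value) pairs for this one field.
        _response, pairs = conn.xhdr(field, span)
        for number, value in pairs:
            articles[int(number)][field] = value
    return dict(articles)
```

With a real connection, first and last would come from the values the
server reports when the group is selected. Note that nntplib is part of
the standard library only up to Python 3.12.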

If you look at the generated mbox file, it's pretty good -- with all
the timezones updated and everything :-) So we have the mbox archives
now. I am happy with the outcome; it's just what I wanted it to be.
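
For readers who want to try the mbox step, here is a minimal sketch
using Python's standard mailbox module. It is not the code in
nntpstat.py: append_to_mbox and the dict layout ("From", "Subject",
"Date", "Body" keys) are assumptions made for the example, and the body
is assumed to be plain ASCII text.

```python
import mailbox
from email.message import Message


def append_to_mbox(path, articles):
    """Append one message per article dict to the mbox file at `path`.

    Hypothetical helper: each article is a dict with "From", "Subject",
    "Date" and "Body" keys.
    """
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for article in articles:
            msg = Message()
            for field in ("From", "Subject", "Date"):
                msg[field] = article[field]
            msg.set_payload(article["Body"])
            box.add(msg)
        box.flush()  # write pending messages to disk
    finally:
        box.close()
    return path
```

The Date value is stored as given, so any timezone offset in the header
string is preserved in the archive.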

Tomorrow, I am going to implement a mechanism to generate checksums,
which can be tricky, but it will be done!

-- 
Sukhbir


