[Soc-coordination] Debian Teams Activity Metrics - Report IV [Update]

Sukhbir Singh sukhbir.in at gmail.com
Thu Aug 4 05:46:26 UTC 2011


Hi Olly,

> Hmm, it sounds like you're fetching individual messages from gmane via
> NNTP (or worse by scraping the web frontend) and then building mbox
> archives yourself.

Yes, we fetch individual messages via NNTP (but there is no HTML parsing!).

> Gmane actually has an mbox export feature for this sort of thing:

I am aware :) There are a couple of reasons why we thought that this
is a bad idea:

1). Gmane actually doesn't favor this. The export page says:

    This interface is a slight CPU and bandwidth hog, so if it's
    abused, it will be shut down. (List admins will then have to get a
    user name/password thing going.)

Even though I have a redundancy check in place that prevents the same
articles from being fetched again, imagine this running for the first
time on lists that have 40,000 articles. And as we are not going to be
parsing just a single list, for two to three teams the article count
goes up to ~150,000 articles. This caused the following problems:

a). This is not a good way of fetching articles from Gmane and would
strain the Gmane server and ...
b). ... we seemed to get disconnected randomly while fetching
articles, even in the very initial stages.
c). So not only is it possible abuse, but it doesn't suit us either.

2). We get incomplete archives because we get disconnected during
the process.
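
To illustrate the kind of redundancy check mentioned above, here is a
minimal sketch in Python. It is not the project's actual code: the
names pending_articles, fetch_missing and fetch_article are
hypothetical, and fetch_article stands in for whatever issues the
request for a single article.

```python
def pending_articles(first, last, already_fetched):
    """Return the article numbers in [first, last] not fetched so far."""
    return [n for n in range(first, last + 1) if n not in already_fetched]


def fetch_missing(first, last, already_fetched, fetch_article):
    """Fetch only the missing articles, recording each one as it arrives.

    fetch_article is a hypothetical callable (article number -> raw bytes).
    already_fetched is a set of article numbers and is updated in place.
    """
    fetched = {}
    for number in pending_articles(first, last, already_fetched):
        try:
            fetched[number] = fetch_article(number)
        except ConnectionError:
            # Disconnected: stop here. Everything fetched so far is
            # recorded, so the next run picks up from this article.
            break
        already_fetched.add(number)
    return fetched
```

If already_fetched is saved between runs, a disconnect costs only the
article that was in flight, not the whole archive.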

Why NNTP actually wins:

1). Fetching messages over NNTP doesn't strain Gmane and is very fast
(as is obvious).
2). We fetch around 40,000 header fields in a few seconds. The bodies
take longer, but it is still much better than the export option.
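
Since the headers arrive quickly and the bodies slowly, one way to
structure this is a fast first pass that collects the headers, then a
second pass that fetches each body and appends the complete message to
an mbox file. The sketch below uses Python's standard mailbox module
and is only an illustration under assumptions: build_mbox,
headers_by_number and fetch_body are hypothetical names, and
fetch_body stands in for the NNTP request for one article body.

```python
import mailbox


def build_mbox(path, headers_by_number, fetch_body):
    """Append one message per article to the mbox file at path.

    headers_by_number maps article number -> raw header block (bytes),
    as collected in the fast first pass. fetch_body is a hypothetical
    callable (article number -> raw body bytes) for the slow second pass.
    """
    box = mailbox.mbox(path)  # created if it does not exist yet
    try:
        for number in sorted(headers_by_number):
            # A blank line separates the header block from the body.
            raw = headers_by_number[number] + b"\n\n" + fetch_body(number)
            box.add(mailbox.mboxMessage(raw))
        box.flush()
    finally:
        box.close()
    return path
```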

I have actually tried all of this -- if we could have fetched the mbox
archives, it would have saved me lots of extra code! But no, it
doesn't work and introduces more problems.

Feel free to comment and suggest.

-- 
Sukhbir


