[Teammetrics-discuss] Checksums for NNTP list parsing.
Andreas Tille
andreas at an3as.eu
Tue Jul 5 07:11:44 UTC 2011
On Tue, Jul 05, 2011 at 12:07:31PM +0530, Sukhbir Singh wrote:
> 1. We implement a checksum for the entire list. We download the entire
> list from the server and generate the checksum.
> PROS: Easy to implement and manage, will just take a few hours to implement.
> CONS: The entire list has to be downloaded. So for a list with 40,000
> messages, it can take time (a few hours).
This is a lot to download, and IMHO the checksum check will fail in
all normal cases: on a "normal" list there will usually be new mails
within one week (and I think we will not run the job more frequently
than once a week - more likely once a month). So if the check is based
on the complete archive you can leave it out, because it will end up
rereading the whole archive for any "living" mailing list.
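
To illustrate why a whole-archive checksum fails on any active list,
here is a minimal Python sketch. The function name archive_checksum
and the choice of SHA-256 are assumptions for illustration only, not
part of the existing scripts.

```python
import hashlib


def archive_checksum(messages):
    """Return one hex digest over a sequence of raw messages (bytes).

    Hypothetical helper: each message is length-prefixed so that message
    boundaries also influence the digest.
    """
    digest = hashlib.sha256()
    for raw in messages:
        digest.update(len(raw).to_bytes(8, "big"))
        digest.update(raw)
    return digest.hexdigest()


# Archive as seen during the previous run.
old_archive = [b"first mail", b"second mail"]
# The same archive one week later with a single new mail.
new_archive = old_archive + [b"third mail"]

# The digests differ, so the check would force a full re-read.
changed = archive_checksum(old_archive) != archive_checksum(new_archive)
```

A single new mail is enough to change the digest, so the comparison
only ever succeeds on a list that received no traffic at all.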
> 2. We don't implement a checksum for now and just save which lists
> have been downloaded.
> PROS: Very fast because we will keep track of what has been downloaded
> without the need to download it again.
> CONS: If the message changes in the archives (does this happen
> often?), the metrics can change.
I'd prefer something like this. IMHO we can assume the following: Let's
store the time when the {list,nntp}stat script last ran, say t_last.
I would assume that messages older than t_last - 30 days will not
change any more. IMHO this is a reasonable and safe assumption for
normal mails. It will fail for SPAM mails that are removed later, which
might make our input for listmaster about spam less useful - but hey,
let's use this as an argument to enable mboxes for us!
:-)
So if you parse the mails dating from
t_last - 30 days up to today
and leave the history untouched, this makes perfect sense to me.
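
The window described above could be sketched in Python roughly as
follows. This is only an illustration under stated assumptions: the
names REPARSE_WINDOW, reparse_cutoff and split_messages are
hypothetical, and messages are modelled as simple (message_id, date)
pairs rather than whatever the {list,nntp}stat scripts really store.

```python
from datetime import datetime, timedelta, timezone

# Assumed safety margin: mails older than this are treated as final.
REPARSE_WINDOW = timedelta(days=30)


def reparse_cutoff(t_last):
    """Return the date from which mails have to be parsed again."""
    return t_last - REPARSE_WINDOW


def split_messages(messages, t_last):
    """Split (message_id, date) pairs into untouched history and mails to reparse."""
    cutoff = reparse_cutoff(t_last)
    history = [m for m in messages if m[1] < cutoff]
    reparse = [m for m in messages if m[1] >= cutoff]
    return history, reparse


# Hypothetical data: the last run was on 2011-07-05.
t_last = datetime(2011, 7, 5, tzinfo=timezone.utc)
messages = [
    ("a", datetime(2011, 5, 1, tzinfo=timezone.utc)),
    ("b", datetime(2011, 6, 10, tzinfo=timezone.utc)),
    ("c", datetime(2011, 7, 4, tzinfo=timezone.utc)),
]
history, reparse = split_messages(messages, t_last)
```

With these example dates the cutoff is 2011-06-05, so only "a" stays
in the untouched history while "b" and "c" are parsed again.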
> 3. We implement and save the checksum for each article on the server.
> This makes it easier to download new articles.
> PROS: Makes per-message management and downloading new messages
> easier. But messages still have to be downloaded (as in 1).
> CONS: As mentioned, some lists have 40,000 messages. So saving this
> to a file is probably not a good idea. A database could be a better
> option.
This option also creates a lot of redundant downloads which could be
prevented.
Kind regards
Andreas.
--
http://fam-tille.de