[Teammetrics-discuss] Checksums for NNTP list parsing.

Andreas Tille andreas at an3as.eu
Tue Jul 5 07:11:44 UTC 2011


On Tue, Jul 05, 2011 at 12:07:31PM +0530, Sukhbir Singh wrote:
> 1. We implement a checksum for the entire list. We download the entire
> list from the server and generate the checksum.
> PROS: Easy to implement and manage, will just take a few hours to implement.
> CONS: The entire list has to be downloaded. So for a list with 40,000
> messages, it can take time (a few hours).

This is a lot to download and IMHO makes the checksum check fail in
all normal cases: on a "normal" list there will usually be new mails
within any given week (and I think we will not run the job more often
than once a week - it is rather once a month).  So if the check is
based on the complete archive you can leave it out, because on any
"living" mailing list it will end up rereading the whole archive
anyway.
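
A small illustration of the point (a sketch only: the function
archive_checksum and the sample messages are made up here and are not
part of the teammetrics scripts; SHA-256 is just an arbitrary choice of
checksum).  One new mail is enough to change a whole-archive checksum:

```python
import hashlib


def archive_checksum(messages):
    """Return one SHA-256 hex digest over all message bodies, in order."""
    digest = hashlib.sha256()
    for body in messages:
        digest.update(body.encode("utf-8"))
        digest.update(b"\0")  # separator so message boundaries matter
    return digest.hexdigest()


# Hypothetical archive state at the previous run and at the current run.
last_run = ["mail 1", "mail 2"]
this_run = last_run + ["mail 3"]  # a single new mail arrived

# The two checksums differ, so the check would force a full re-download.
print(archive_checksum(last_run) == archive_checksum(this_run))
```

So on a list with any traffic at all the comparison would come out
"changed" at every run.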
 
> 2. We don't implement a checksum for now and just save which lists
> have been downloaded.
> PROS: Very fast because we will keep track of what has been downloaded
> without the need to download it again.
> CONS: If the message changes in the archives (does this happen
> often?), the metrics can change.

I'd prefer something like this.  IMHO we can assume the following: Let's
store the time when the {list,nntp}stat script was last run, say
t_last.  I would assume that messages which are older than
t_last - 30 days will not change any more.  IMHO this is a reasonable
and safe assumption for normal mails.  It will fail for SPAM mails that
are removed later, which might make our input for listmaster about spam
less useful - but hey, let's use this as an argument to enable mboxes
for us!
:-)

So if you parse those mails dating from

    t_last - 30 days  up to  today

and leave the history untouched, this makes perfect sense to me.
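
To make the window concrete, here is a minimal sketch.  The names
SAFETY_MARGIN, reparse_window and needs_reparse are only assumptions
for illustration, not existing functions of the {list,nntp}stat
scripts, and the dates are invented:

```python
from datetime import datetime, timedelta, timezone

# Assumption from the text above: mails older than t_last - 30 days
# are treated as final and never parsed again.
SAFETY_MARGIN = timedelta(days=30)


def reparse_window(t_last, now=None):
    """Return (start, end) of the period whose mails get parsed again."""
    if now is None:
        now = datetime.now(timezone.utc)
    return t_last - SAFETY_MARGIN, now


def needs_reparse(message_date, t_last, now=None):
    """True if the mail falls inside the window, False if it is history."""
    start, end = reparse_window(t_last, now)
    return start <= message_date <= end


# Hypothetical run: the script last ran on 2011-06-05, today is 2011-07-05.
t_last = datetime(2011, 6, 5, tzinfo=timezone.utc)
today = datetime(2011, 7, 5, tzinfo=timezone.utc)

recent_mail = datetime(2011, 5, 20, tzinfo=timezone.utc)
old_mail = datetime(2011, 4, 1, tzinfo=timezone.utc)

print(needs_reparse(recent_mail, t_last, today))  # inside the window
print(needs_reparse(old_mail, t_last, today))     # untouched history
```

With these dates the window starts on 2011-05-06, so the mail from
2011-05-20 is parsed again and the one from 2011-04-01 is left alone.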
 
> 3. We implement and save the checksum for each article in the server.
> This makes it easier to download new articles.
> PROS: Makes per-message management and downloading new messages
> easier. But messages still have to be downloaded (as in 1).
> CONS: As mentioned, some lists have 40,000 messages. So saving this
> to a file is probably not a good idea. A database could be a better
> option.

This option also creates a lot of redundant downloads which could be
prevented.

Kind regards

      Andreas. 

-- 
http://fam-tille.de


