[Teammetrics-discuss] How does filling up the database work?
andreas at an3as.eu
Tue Aug 9 09:42:17 UTC 2011
It seems we are talking past each other, so let's discuss the
algorithm to make sure we are talking about the same thing. First, I
would like to explain the algorithm *I* used formerly.
Let's assume a given mailing list X.
1. Delete all entries of this mailing list from the database
2. Download *all* index pages of the web archive (downloading the
index pages was much cheaper than downloading all mails)
3. Inject all mails for mailing list X that I found in those index
pages (except spam)
Once this was done I went to mailing list Y, deleted those entries,
fetched index pages and inserted entries and so on.
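The three steps above can be sketched roughly as follows. This is only
an illustration, not the code I actually ran: the table layout, the
name refresh_list and the fetch_index_entries callable are assumptions
made up for this example, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical table layout, only for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    list_name TEXT NOT NULL,
    msg_id    TEXT NOT NULL,
    author    TEXT NOT NULL
)
"""


def refresh_list(conn, list_name, fetch_index_entries):
    """Full refresh of one mailing list: drop its rows, then re-import.

    fetch_index_entries is a hypothetical callable that returns one dict
    per mail found in the index pages, with keys "msg_id", "author" and
    an optional "spam" flag.
    """
    # Step 2: fetch everything listed in the index pages.
    entries = fetch_index_entries(list_name)
    rows = [
        (list_name, e["msg_id"], e["author"])
        for e in entries
        if not e.get("spam", False)  # spam is skipped
    ]
    # Steps 1 and 3 in one transaction: delete, then insert.
    with conn:
        conn.execute("DELETE FROM messages WHERE list_name = ?", (list_name,))
        conn.executemany(
            "INSERT INTO messages (list_name, msg_id, author) VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


def fake_index(list_name):
    # Stand-in for downloading and parsing the real index pages.
    return [
        {"msg_id": "1", "author": "a@example.org"},
        {"msg_id": "2", "author": "spammer@example.org", "spam": True},
        {"msg_id": "3", "author": "b@example.org"},
    ]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
refresh_list(conn, "list-x", fake_index)
refresh_list(conn, "list-x", fake_index)  # running again does not duplicate rows
```

Note that the sketch fetches before it deletes and wraps delete and
insert in one transaction, so a failed download would not leave the
list empty. That ordering is a choice made for the example.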
So in *my* algorithm all the information that was supposed to be
injected into the database was downloaded from scratch every month.
I am not saying that this should be reimplemented that way.
I was just using this strategy because it is quite common in database
programming to clear the data first and then import everything.
For some reason I assumed you might do something similar, but from our
discussion I suspect you are handling the injection of the data
differently (which is fine - I was just describing one option). Before
we continue the discussion, could you please describe your method to
inject the data (you might need this description for your report anyway).
On Tue, Aug 09, 2011 at 02:43:11PM +0530, Sukhbir Singh wrote:
> You are talking about 'downloading all this stuff' again. By this you
> assume that the messages are going to change, right?
> But most messages have already been labelled spam and I really don't
> think that many messages are removed. I mean, sure I can implement the
> code but do you personally feel it is worth the effort? I think
> reading mbox archives and then deleting them later serves our purpose
> well enough. And either way, we keep track of which mbox archives were
> downloaded so we just check and proceed. Right?
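To make the comparison concrete, here is a minimal sketch of the
incremental approach as I understand it from the quoted text. It is a
guess, not the actual implementation: the processed_archives table, the
function import_new_archives and the parse_archive callable are all
hypothetical names introduced for this example.

```python
import sqlite3


def import_new_archives(conn, list_name, available, parse_archive):
    """Import only the mbox archives that were not processed before.

    available is an iterable of archive names; parse_archive is a
    hypothetical callable returning (msg_id, author) tuples for one
    archive. Returns the names of the archives imported in this run.
    """
    done = {
        row[0]
        for row in conn.execute(
            "SELECT archive FROM processed_archives WHERE list_name = ?",
            (list_name,),
        )
    }
    imported = []
    for name in available:
        if name in done:
            continue  # already downloaded earlier: check and proceed
        rows = [(list_name, mid, author) for mid, author in parse_archive(name)]
        with conn:  # messages and bookkeeping row are committed together
            conn.executemany(
                "INSERT INTO messages (list_name, msg_id, author) VALUES (?, ?, ?)",
                rows,
            )
            conn.execute(
                "INSERT INTO processed_archives (list_name, archive) VALUES (?, ?)",
                (list_name, name),
            )
        imported.append(name)
    return imported


def fake_parse(name):
    # Stand-in for reading one downloaded mbox file.
    return [(name + "-1", "a@example.org"), (name + "-2", "b@example.org")]


conn2 = sqlite3.connect(":memory:")
conn2.execute("CREATE TABLE messages (list_name TEXT, msg_id TEXT, author TEXT)")
conn2.execute("CREATE TABLE processed_archives (list_name TEXT, archive TEXT)")
first = import_new_archives(conn2, "list-x", ["2011-July", "2011-August"], fake_parse)
second = import_new_archives(conn2, "list-x", ["2011-July", "2011-August"], fake_parse)
```

With this scheme a message that is removed from an already processed
archive (for instance because it was marked as spam later) would stay
in the database, which is the trade-off we are discussing.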