[Teammetrics-discuss] Web Archive Parser ready for your testing.

Andreas Tille andreas at an3as.eu
Tue Jan 3 15:19:22 UTC 2012


Hi Sukhbir,

On Mon, Jan 02, 2012 at 12:34:31AM +0530, Sukhbir Singh wrote:
> 
> The web archive parser resumes from the last message.

Yep.  I noticed the file

    /etc/teammetrics/archiveparser.conf

The only problem I have is the location.  This file is not actually a
configuration file (it is not intended to be edited manually to
influence the web archive parser directly).  So I'd vote for something
like

    /var/cache/teammetrics/archiveparser.status

or something like this - feel free to find a better name.
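
To illustrate what I mean, here is a minimal sketch of keeping the resume
state in a status file under a cache directory.  The function names and the
JSON layout are assumptions made up for this example, not what the parser
currently does; only the path is the one suggested above.

```python
import json
import os
import tempfile

# Hypothetical location, following the suggestion above.
STATUS_FILE = "/var/cache/teammetrics/archiveparser.status"


def load_status(path=STATUS_FILE):
    """Return the saved state as a dict, or an empty dict if none exists."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        # First run: nothing to resume from.
        return {}


def save_status(status, path=STATUS_FILE):
    """Write the state via a temporary file so an interrupted run
    cannot leave a half-written status file behind."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".status-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(status, handle, indent=2, sort_keys=True)
        # Atomic rename onto the final name.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The point of the separate location is that the file is program state, so
it can be deleted to force a full re-fetch without touching anything in
/etc.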

> The checking
> 'algorithm' is very fast and I optimized it to the maximum so that it
> resumes after fetching the least possible data from lists.d.o. I like
> it.

Seems to work fine.  I'm just running the parser and it fetches a lot of
mails in a short time.  However, I do not see any records in the listspam
table.  Is this intended behaviour?
 
> Other than that, try it out for a few lists when you have time. Re-run
> it for a few times and see the log file output. You might notice the
> script takes time to finish on each run, that is because it calls
> updatenames.py on every run. But for our case since the script will
> run once a month, that will be no problem.

It's fine to run updatenames.py several times.  In the end it does not
matter whether it runs, say, 1h per month longer (and it is not even that
long).

> I suggest you observe the log file output and that should serve your
> purpose well.  Let me know if you want any changes to be made.

For the moment it looks good.  I removed all data yesterday and the
current state is:

teammetrics=# SELECT project, count(*) from listarchives group by project;
        project        | count  
-----------------------+--------
 debian-amd64          |  27726
 debian-boot           | 155106
 debian-curiosa        |   4842
 debian-devel          | 309831
 debian-embedded       |   6808
 debian-derivatives    |    757
 debian-isp            |   2876
 debian-accessibility  |   4135
 debian-gis            |   4012
 debian-firewall       |   8872
 debian-desktop        |   2938
 debian-edu            |  23873
 debian-custom         |   2728
 debian-ctte           |   2931
 debian-devel-games    |   5356
 debian-arm            |  12978
 debian-blends         |    570
 debian-i18n           |  16617
 debian-enterprise     |    170
 debian-devel-announce |   2526

So for the moment we just need to wait.  After the run I will do an

    UPDATE listarchives SET project = 'debian-blends' WHERE project = 'debian-custom';

because this list was renamed.  The logfile looks perfectly normal
(although here, too, there is no sign of SPAM handling).
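
For reference, the effect of the UPDATE above can be tried out on toy data
first.  The following is only a sketch: it uses an in-memory SQLite
database in place of our PostgreSQL instance, a one-column stand-in for the
listarchives table, and made-up rows; the helper names are hypothetical.

```python
import sqlite3


def rename_project(conn, old, new):
    """Relabel all rows of project `old` as `new`; return rows changed."""
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE listarchives SET project = ? WHERE project = ?",
            (new, old),
        )
    return cur.rowcount


def counts(conn):
    """Return {project: message count}, as in the query above."""
    rows = conn.execute(
        "SELECT project, count(*) FROM listarchives GROUP BY project"
    )
    return dict(rows.fetchall())


# Toy data only; the real table has more columns and far more rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listarchives (project TEXT)")
conn.executemany(
    "INSERT INTO listarchives (project) VALUES (?)",
    [("debian-custom",)] * 3 + [("debian-blends",)] * 2,
)
changed = rename_project(conn, "debian-custom", "debian-blends")
```

After the rename the two projects show up as a single debian-blends row in
the grouped count, which is what we want for the graphs.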

I will keep you updated after creating the graphs.

Kind regards

        Andreas.


-- 
http://fam-tille.de


