[Teammetrics-discuss] Observations from current run of `archiveparser.py`.

Sukhbir Singh sukhbir.in at gmail.com
Sat Feb 4 19:14:12 UTC 2012


Hi Andreas,

`archiveparser.py` is chugging along nicely (from scratch, because we
implemented the spam filter this time), and that's good, but there is
one thing that I am not comfortable with.

Start by having a look at the log file:
/home/sukhbir/teammetrics/teammetrics @ blends.d.n.

Start from here:

    2012-02-04 18:06:52,585 INFO: 	List 11 of 55
    2012-02-04 18:06:52,585 INFO: Fetching 'debian-devel'

If you remember, it flags messages as spam based on spamfilter.py
(line 11). So in the case of the logs above, it is flagging messages
whose subjects contain the words 'promotion', 'lottery', or 'loan'.
The problem is that all the messages flagged for `debian-devel` in
this case are NOT spam; they are genuine messages that just happen to
have the words above in the Subject field.

    Subject: Bug#2766: forwarded message from Mail Delivery Subsystem
        --> (genuine message)
    Subject: Re: Gratuitous promotion of random binaries to standard
        --> (genuine message)

(As of this writing, 22 messages have been flagged by these keywords,
and all of them are genuine.)
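
To make the false positive concrete, here is a rough sketch of a
keyword check on the Subject field. This is NOT the actual
spamfilter.py code: the function name `is_flagged` and the
`SPAM_KEYWORDS` tuple are made up for illustration, and the keywords
are only the three mentioned above.

```python
import re

# Hypothetical keyword list: only the three words mentioned in this mail.
SPAM_KEYWORDS = ('promotion', 'lottery', 'loan')


def is_flagged(subject, keywords=SPAM_KEYWORDS):
    """Return True if any keyword occurs as a whole word in the subject.

    Illustrative stand-in for the real check in spamfilter.py.
    """
    # Lowercase the subject and split it into alphabetic words.
    words = re.findall(r'[a-z]+', subject.lower())
    return any(keyword in words for keyword in keywords)


# A genuine debian-devel subject that still trips the keyword check.
genuine = 'Re: Gratuitous promotion of random binaries to standard'
print(is_flagged(genuine))
```

Any check of this kind looks only at the words and not at the context,
so a legitimate thread about package promotion gets flagged just like a
real spam message would.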

There are many such messages. Now, as we discussed, these messages go
through the spam filter, are caught, get a log entry, and are pushed
into the `listspam` table.

That is fine. But once a message is flagged as spam, it is *not*
added to the `listarchives` table. We just populate `listspam` and
move on. Because we have worked hard on this (and aimed for
perfection), I want to save every message possible. And right now, we
are losing lots of genuine messages.

What I suggest is this -- populate `listspam` *and also* populate
`listarchives`. That way we serve both purposes: help the spam
fighting effort and not lose any messages. I had discussed this
earlier and you were not comfortable with it, but given that we are
losing so many genuine messages, I thought I'd bring it up again.
Have a look at the log file; hopefully it will help you make up your
mind about this!
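
Roughly, the change I have in mind looks like the sketch below. It is
only an illustration of the idea: it uses an in-memory `sqlite3`
database to stay self-contained, and the `store_message` function and
the two-column table layout are assumptions, not our real schema or
the real code in `archiveparser.py`.

```python
import sqlite3


def store_message(conn, message_id, subject, flagged):
    """Always archive the message; also record it in listspam if flagged."""
    cur = conn.cursor()
    # Every message goes into listarchives, flagged or not.
    cur.execute(
        'INSERT INTO listarchives (message_id, subject) VALUES (?, ?)',
        (message_id, subject))
    if flagged:
        # Flagged messages are additionally kept for the spam effort.
        cur.execute(
            'INSERT INTO listspam (message_id, subject) VALUES (?, ?)',
            (message_id, subject))
    conn.commit()


# Hypothetical, simplified tables for the purpose of this sketch.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE listarchives (message_id TEXT, subject TEXT)')
conn.execute('CREATE TABLE listspam (message_id TEXT, subject TEXT)')

store_message(conn, '<1@example.org>',
              'Re: Gratuitous promotion of random binaries to standard',
              flagged=True)
store_message(conn, '<2@example.org>', 'Re: Bug in dpkg', flagged=False)
```

With this, a false positive costs us nothing in `listarchives`, and a
flagged message can still be found in `listspam` by its message ID if
we want to review or exclude it later.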

If you think we can do better with spamfilter.py, please let me know.

PS: Summer of Code 2012 was just announced. Who knows, we might get a
student :)!

-- 
Sukhbir


