[Teammetrics-discuss] Web Archive Parser ready for your testing.

Wed Jan 4 16:49:01 UTC 2012

On Wed, Jan 04, 2012 at 12:05:32PM +0530, Sukhbir Singh wrote:
> Ok, good idea. Fixed to:
> 
>     CONFIG_FILE = '/var/cache/teammetrics/archiveparser.status'

OK.

> Yes, that is expected and I will explain why.
> 
> When running the script, I noticed that many messages which should fit
> the description of spam also match the messages which are definitely
> not spam. Now, I thought, it is better to have spam messages in our
> metrics *rather* than missing out genuine messages.

That's correct.

> So as such, the
> methods of possible spam that we discussed:
> 
>     Invalid date:
>         A large number of messages had this problem but the quantity
> of messages which were NOT spam was *more* than the quantity of
> messages that were spam. So this renders this check invalid for us.
> 
>     Missing message-IDs:
>         Again, the same problem as above.

As far as I remember we did NOT use these as qualifiers for SPAM.  Any
poster might have set their clock wrongly or have a broken mailer (rather
the contrary - I expect SPAMers today to be clever enough to set these
parameters to valid values).
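
To illustrate the point, here is a minimal sketch (not the code in
spamfilter.py) of how these two header problems could be recorded as
informational flags without treating them as a SPAM verdict.  The function
name header_flags and the flag strings are made up for this example, and it
assumes Python 3 and its standard email package:

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def header_flags(msg):
    """Return a set of (hypothetical) flag names for suspicious headers.

    The flags are informational only: a wrong clock or a broken mailer
    produces them just as easily as SPAM does.
    """
    flags = set()
    if not msg.get('Message-ID'):
        flags.add('missing-message-id')
    try:
        # Raises TypeError or ValueError (depending on the Python version)
        # when the Date header is missing or cannot be parsed.
        parsedate_to_datetime(msg.get('Date'))
    except (TypeError, ValueError):
        flags.add('invalid-date')
    return flags


good = message_from_string(
    "Message-ID: <1@example.org>\n"
    "Date: Wed, 04 Jan 2012 16:49:01 +0000\n"
    "\n"
    "body\n")
bad = message_from_string("Date: not a date\n\nbody\n")
```

Such flags could be stored next to a message for later inspection, while the
decision about SPAM is left to the other filters.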

>     Other filters in spamfilter.py (follows ahead)
> 
> So now here is how I want to handle this and your suggestion is needed:
> 
>     Populate `listspam` based on all the above filters BUT let's not
> skip the message if it is spam, populate it in `listarchives` also.
>     So: if it was spam, it will be at the bottom of the list. If it
> was not, it will add to the metrics.
> 
> This way our spam fighting efforts and our metrics, both will be satisfied.

I think that in any case SPAM does not have a big influence on our metrics
by design, because spam usually does not come from the same poster.
However, for several reasons it is important to identify potential SPAM.
On one hand I would like to keep our tables clean from obvious cruft.
But even more importantly we want to help listmaster and other SPAM
fighters to detect this easily.  That's why I originally created a file
containing all potential SPAM messages which was used to give voting
points to these messages in the listmaster algorithm.  For instance if
I imagine that we create mboxes from the mailing list postings I would
imagine that we could do something like:

    <listname>_<year-month>.mbox
    <listname>_<year-month>_spam.mbox.

The SPAM fighting effort Christian Perrier was talking about is that
people read those mboxes with their MUA and reply to a certain
address.  Having those pre-sorted mboxes would really help those
people (and finally we could prove some use for listmaster).
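
The naming scheme above could be sketched roughly as follows.  This is only
an illustration under assumptions: the helper names mbox_path and
store_message, the directory argument and the is_spam flag are invented
here, and how a message gets classified as SPAM is left to the existing
filters.  It uses the mailbox module from the Python standard library:

```python
import mailbox
import os


def mbox_path(directory, listname, year, month, spam=False):
    """Build <listname>_<year-month>.mbox or <listname>_<year-month>_spam.mbox."""
    suffix = '_spam' if spam else ''
    name = '%s_%04d-%02d%s.mbox' % (listname, year, month, suffix)
    return os.path.join(directory, name)


def store_message(directory, listname, year, month, msg, is_spam):
    """Append msg (a string or email.message.Message) to the matching mbox."""
    path = mbox_path(directory, listname, year, month, spam=is_spam)
    box = mailbox.mbox(path)   # the file is created if it does not exist yet
    try:
        box.add(msg)
    finally:
        box.close()            # close() flushes pending changes to disk
    return path
```

With something like this, the regular and the potential SPAM postings of one
list and month end up in two separate files that any MUA can open.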

In short:  I would vote for trying to detect SPAM as we did in our other
algorithms and as I did in my original hack, which was ultimately based on
the same input data and worked to some extent (just ping me if you need
some additional explanation of the Perl code).

Kind regards

        Andreas.

-- 
http://fam-tille.de