[Teammetrics-discuss] Next phase: Handling spam
Andreas Tille
andreas at an3as.eu
Thu Jun 9 20:30:50 UTC 2011
On Fri, Jun 10, 2011 at 01:36:12AM +0530, Sukhbir Singh wrote:
> The query:
>
> INSERT INTO listarchive (project, yearmonth, author, subject, url,
> ts) VALUES (?, ?, ?, ?, ?, '$today')
>
> is of main importance to us. So let's work on this.
>
> * project - the name of the mailing list.
> * yearmonth - Ok.
> * author -
>
> We are going to insert names here, right? So by parsing 'From' of an
> mbox archive, we will get this (an example):
To give you an idea, I selected a few random rows from the database on
blends.debian.net:
# SELECT * from listarchive limit 5 ;
 project      | yearmonth  | author                  | subject              | url                                                               | ts
--------------+------------+-------------------------+----------------------+-------------------------------------------------------------------+------------
 user-spanish | 2000-12-01 | Pablo Dorronsoro        | x free 4             | http://lists.debian.org/debian-user-spanish/2000/12/msg00388.html | 2011-06-01
 user-spanish | 2000-12-01 | Carles Pina i Estany    | pgp4pine             | http://lists.debian.org/debian-user-spanish/2000/12/msg00365.html | 2011-06-01
 user-spanish | 2000-12-01 | Miguel Angel Vilela     | Cambio de e-mail     | http://lists.debian.org/debian-user-spanish/2000/12/msg00367.html | 2011-06-01
 user-spanish | 2000-12-01 | Ramiro Alba             | Una de modems PCMCIA | http://lists.debian.org/debian-user-spanish/2000/12/msg00370.html | 2011-06-01
 user-spanish | 2000-12-01 | Alfonso Cepeda Caballos | pseudo-image-kit     | http://lists.debian.org/debian-user-spanish/2000/12/msg00373.html | 2011-06-01
> tille at debian dot org (Andreas Tille)
I just kept the names without the e-mail address. Thinking about it
again, I am no longer sure that throwing away the e-mail address is a
good idea. We could store it in addition. That is not normalised at
all, but I do not think database normalisation is a real issue here.
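To illustrate keeping both pieces, here is a minimal Python sketch that
splits a 'From' value of the obfuscated form quoted above into name and
address. The function name split_from and the idea of storing the second
value in an extra column are assumptions for illustration only, not what
the current script does.

```python
import re

# Matches the obfuscated archive form "local at domain (Real Name)".
# Assumption: headers look like the example quoted in this thread.
_FROM_RE = re.compile(r"^\s*(\S+)\s+at\s+(.+?)\s*\((.*)\)\s*$")


def split_from(header):
    """Return (name, email) for a 'From' value.

    If the value does not match the expected form, return the stripped
    input as the name and None as the e-mail address.
    """
    match = _FROM_RE.match(header)
    if match is None:
        return header.strip(), None
    local, domain, name = match.groups()
    # Undo the " dot " obfuscation in the domain part.
    email = local + "@" + domain.replace(" dot ", ".")
    return name.strip(), email


name, email = split_from("tille at debian dot org (Andreas Tille)")
```

The name would go into the author column as before, and the address
could be stored next to it if we decide to keep it.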
> For the guest account problem you mentioned:
>
> tille-guest at debian dot org (Andreas Tille)
>
> So I was thinking we do a split on '-' and then push the names?
No. Splitting does not work. There are a lot of cases where it
will fail completely:
'charles-debian-nospam', 'plessy', 'charles-guest'
all have to resolve to
'Charles Plessy'
So the only chance we have is another lookup list. Perhaps this
should be kept in the database itself rather than in a config file.
With this strategy the names can be changed using an SQL UPDATE
query.
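A rough sketch of such a lookup list kept in the database, written in
Python with an in-memory SQLite database as a stand-in for the real
one. The table name author_alias, its columns, and the helper
canonical_author are hypothetical and only meant to show the idea.

```python
import sqlite3

# In-memory SQLite is only a stand-in for the real database.
conn = sqlite3.connect(":memory:")

# Hypothetical lookup table: one row per known ID, mapped to the name
# that should end up in listarchive.author.
conn.execute(
    "CREATE TABLE author_alias (alias TEXT PRIMARY KEY, name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO author_alias (alias, name) VALUES (?, ?)",
    [
        ("charles-debian-nospam", "Charles Plessy"),
        ("plessy", "Charles Plessy"),
        ("charles-guest", "Charles Plessy"),
    ],
)


def canonical_author(conn, raw):
    """Return the mapped name for raw, or raw itself if it is unknown."""
    row = conn.execute(
        "SELECT name FROM author_alias WHERE alias = ?", (raw,)
    ).fetchone()
    return row[0] if row else raw


resolved = [
    canonical_author(conn, a)
    for a in ("charles-debian-nospam", "plessy", "charles-guest")
]

# A name can later be corrected with a plain UPDATE on the lookup table.
conn.execute(
    "UPDATE author_alias SET name = ? WHERE alias = ?",
    ("C. Plessy", "plessy"),
)
```

Unknown IDs fall through unchanged, so the lookup list only needs
entries for the cases that cause trouble.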
> the above two addresses are there in the mbox, they are treated the same
> for the user Andreas. Is this approach the one you talked about?
>
> * subject - Do we need to save this in the DB? If yes, why?
Because it's there. :-)
Well, I have not actively used it. However, it was somewhat useful
for detecting some spam patterns. I do not really mind either way for
the moment, but keeping it does no harm.
> * URL - Ok.
> * TS - Ok.
>
> So the author issue needs to be sorted out. And I remember you
> mentioning something about multiple IDs so that is why I brought this
> up as this is important.
Yes, it is. Look at the
  $query = $query . "UPDATE listarchive SET author =
statements that fill up the get-archive-pages script.
Kind regards
Andreas.
--
http://fam-tille.de