[Teammetrics-discuss] Next phase: Handling spam

Thu Jun 9 07:39:34 UTC 2011

Hi Sukhbir,

On Wed, Jun 08, 2011 at 08:48:01PM +0530, Sukhbir Singh wrote:
> Perl is so weird!

Well, sometimes it is cute. :-)

> my $spamurls = "spamurls.txt" ;
> 
> Where can I see this file?

This is explained in the next line:

   open(SPAMURLS, ">$spamurls")

The '>' means the file is opened for writing (it is created if missing
and truncated if it already exists).  This file simply logs which mails
are considered SPAM.

> Sorry for the simultaneous emails but I
> want to work on this in the night.

As I said, I used some quite simple means to detect SPAM.  First I noticed
some SPAM authors whose names contained the strings:

  'Pls check this new site', 'Tim.com.br'

There are probably many more, but these "authors" made it into the top X
ranking on some low-traffic lists.  I would suggest putting those
"authors" in a config file (say /etc/teammetrics/spam-handling.conf or
something like this), so you can easily add strings you definitely do
not want to see in the statistics.
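
To make the config file idea concrete, here is a minimal Python sketch
(not the original Perl).  The section name "spam", the option name
"authors" and the case-insensitive substring match are assumptions made
for illustration only; the two strings are the ones quoted above.

```python
import configparser

# Hypothetical contents of a spam-handling config file; the section and
# option names are assumptions, not an agreed format.
EXAMPLE_CONF = """
[spam]
authors =
    Pls check this new site
    Tim.com.br
"""


def load_spam_authors(text):
    """Return the configured spam author strings, one per config line."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw = parser.get("spam", "authors", fallback="")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def is_spam_author(author, spam_authors):
    """True if any configured string occurs in the author field.

    Matching is a case-insensitive substring test (an assumption).
    """
    lowered = author.lower()
    return any(s.lower() in lowered for s in spam_authors)


SPAM_AUTHORS = load_spam_authors(EXAMPLE_CONF)
```

Adding a new string then only means adding one indented line to the
config file.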

The next thing I noticed was that certain strings in the subject
are a clear sign of SPAM, which is simply not relevant for our teammetrics:

  'File blocked - ScanMail for Lotus Notes',
  '^u?n?subscribe\s+.?$'

Same here: the list is far from complete, but it helped me sort out
a certain amount of useless subjects.  I would add this list as
SPAMSUBJECTS in the same config file.
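
As a rough Python sketch of the subject check (again not the original
Perl): the two patterns are copied from above and treated as regular
expressions.  The names SPAM_SUBJECTS and is_spam_subject are only
illustrative, and in practice the list would be read from the config
file rather than hard-coded.

```python
import re

# The two patterns quoted in the mail, used as regular expressions.
SPAM_SUBJECTS = [
    r"File blocked - ScanMail for Lotus Notes",
    r"^u?n?subscribe\s+.?$",
]

# Compile once so every subject is checked against ready-made patterns.
SPAM_SUBJECT_RES = [re.compile(pattern) for pattern in SPAM_SUBJECTS]


def is_spam_subject(subject):
    """True if any of the spam subject patterns matches the subject."""
    return any(rx.search(subject) for rx in SPAM_SUBJECT_RES)
```

Note that the second pattern requires at least one whitespace character
after "subscribe", so the subject should not be stripped before the test.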

The next thing is that I tried to put a limit on non-ASCII UTF-8
characters, which helped a lot against some Chinese SPAM.  However,
this has to be handled with some caution on mailing lists for
languages with a lot of such letters (Russian, Chinese, Japanese
etc.).  I handled this via

    if ( $project =~ /russian/ ) {
        $regard_utf8_spam = 0;
    } else {
        $regard_utf8_spam = 1;
    }

because *currently* the only relevant list which had a lot of non-ASCII
UTF-8 characters was debian-russian.  However, to make it general you
need another configurable list which names all mailing lists that should
allow a lot of non-ASCII characters in the subject.  In Perl you can
fetch those spammy subjects with

   $subject =~ /^[-&#x\d;\sA-F\?:,]+$/

and there are also authors whose names contain those non-ASCII
characters; these are counted here:

   my $countstrangechars = 0;
   while ($author =~ /;\s*&#x[\dA-F][\dA-F][\dA-F]/g) { $countstrangechars++ }

   if ( $author =~ /^[-&#x\d;\sA-F\?:,]+$/ || $countstrangechars > 7 || $numspamauthors > 0 ) {
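
Put together, the non-ASCII heuristic could look roughly like this in
Python.  This is only a sketch under assumptions: the two regular
expressions and the threshold of 7 are taken from the Perl lines above,
while NON_ASCII_LISTS (the configurable list of exempt mailing lists),
the function name and the exact-name lookup (the Perl code matches
/russian/ as a substring) are hypothetical.  The $numspamauthors
condition is left out, and applying the exemption to the author check
as well is my assumption.

```python
import re

# Field consists only of numeric entities, whitespace and a little punctuation.
ENTITY_ONLY = re.compile(r"^[-&#x\d;\sA-F?:,]+$")
# One entity directly following another, as counted in the Perl loop.
ENTITY_RUN = re.compile(r";\s*&#x[\dA-F][\dA-F][\dA-F]")

# Hypothetical configurable set of lists where non-ASCII text is normal.
NON_ASCII_LISTS = {"debian-russian"}


def looks_like_entity_spam(project, subject, author, max_entities=7):
    """Apply the entity heuristic unless the list is exempt."""
    if project in NON_ASCII_LISTS:
        return False
    if ENTITY_ONLY.search(subject) or ENTITY_ONLY.search(author):
        return True
    # Count follow-up entities in the author field, like $countstrangechars.
    return len(ENTITY_RUN.findall(author)) > max_entities
```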

Finally, I had some INSERT problems where inserting the record into the
database failed because of a strange encoding:

   if ( $DBI::err == 7 ) {
      if ( $DBI::errstr =~ /UTF8/ ) {

This also turned out to be SPAM.
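
The Perl script only notices this after the INSERT has failed.  A
different way to catch the same class of mails, shown here purely as an
assumption-laden Python sketch and not as what the script does, is to
test whether the raw bytes are valid UTF-8 before touching the database.
The function name decode_or_flag is hypothetical.

```python
def decode_or_flag(raw):
    """Return (text, is_suspect) for a raw header or body value.

    Invalid UTF-8 is flagged as suspect instead of raising, and the
    text is decoded with replacement characters so it can still be logged.
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        return raw.decode("utf-8", errors="replace"), True
```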

There are also cases with no sender ("# sometimes SPAM has no sender ...").
While this is not relevant for our statistics anyway, it helps listmasters
to have a record of these messages as well, to support their anti-SPAM
battle in the archive.

All the mails I regarded as SPAM were logged in the file mentioned
above (spamurls.txt).  The file serves as data for listmasters and, of
course, for checking whether we are really sorting out SPAM or whether
we have some false positives (which was the purpose of this file in the
first place).
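
A log that is meant for reviewing false positives is easier to use if
each entry records why the mail was rejected.  The following Python
sketch shows one possible layout; the tab-separated format, the field
names and the example URL are assumptions, not the actual format of
spamurls.txt.

```python
import io


def log_spam(log, reason, author, subject, url):
    """Append one tab-separated record to an open, writable log."""
    fields = [reason, author or "(no sender)", subject, url]
    log.write("\t".join(fields) + "\n")


# Demonstration with an in-memory buffer instead of a real file.
buf = io.StringIO()
log_spam(buf, "subject", None, "unsubscribe ",
         "http://lists.example.org/msg00001.html")
```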

Remark:  We have several ROBOTS posting to our mailing lists which are
definitely not SPAM but are not interesting for the statistics either.  I
just ignored those ROBOT mails, and you can move the list @ROBOTS
into (another) configuration file.  There are two options to handle
these: ignore them (as I did) or use an additional flag in the database
to exclude them in the evaluation process.  Either is fine.  We should
just make sure that 'Debian Installer' and 'Debian Wiki' do not end up
as the most active "persons" in the team ...
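
The two options for robots can be sketched in Python as follows.  The
set below contains only the two names mentioned here, not the full
@ROBOTS list, and the message dictionaries and the "is_robot" key are
hypothetical stand-ins for whatever the database layer uses.

```python
# Only the two robots named in this mail; the real list is longer.
ROBOTS = {"Debian Installer", "Debian Wiki"}


def drop_robots(messages):
    """Option 1: ignore robot mails completely."""
    return [m for m in messages if m["author"] not in ROBOTS]


def flag_robots(messages):
    """Option 2: keep every mail but mark robots for the evaluation step."""
    return [dict(m, is_robot=m["author"] in ROBOTS) for m in messages]
```

With the second option the evaluation queries have to filter on the
flag, but the robot mails stay available if somebody wants them later.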

Kind regards

      Andreas.

-- 
http://fam-tille.de