[Teammetrics-discuss] Results from Web Archive Parser
Andreas Tille
andreas at an3as.eu
Wed Dec 21 15:30:39 UTC 2011
On Wed, Dec 21, 2011 at 07:54:22PM +0530, Sukhbir Singh wrote:
> The web archive parser has finally finished parsing all the 55 lists.
> Yayay! It started on 17th December at 19:40 and completed on 21st
> December at 09:49. A lot of time but look at the # of messages:
>
> count
> ---------
> 2751087
> (1 row)
>
> Cool!
Yep. And we need to make sure that the parser continues from what is
already in the database the next time it is called. :-)
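A minimal sketch of what "continue from what is in the database" could look like, assuming each message carries a Message-ID and the database can be asked which IDs it already holds. The table name `messages`, its columns, the helper `unseen_messages`, and the use of sqlite3 are all hypothetical stand-ins for illustration, not the actual teammetrics schema or code.

```python
import sqlite3

def unseen_messages(conn, list_name, fetched):
    """Yield only the (message_id, payload) pairs not yet stored for list_name."""
    rows = conn.execute(
        "SELECT message_id FROM messages WHERE list_name = ?", (list_name,)
    )
    known = {row[0] for row in rows}
    for message_id, payload in fetched:
        if message_id not in known:
            known.add(message_id)  # also guards against duplicates within one run
            yield message_id, payload

# Hypothetical in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (list_name TEXT, message_id TEXT)")
conn.execute("INSERT INTO messages VALUES ('debian-med', '<a@example.org>')")

fetched = [("<a@example.org>", "old"), ("<b@example.org>", "new")]
new_messages = list(unseen_messages(conn, "debian-med", fetched))
```

Here only the second message would be handed on for parsing, since the first is already recorded for that list.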
> Anyways, the good news is that it seems like it is working perfectly
> now. I have checked this with the graphs even and compared randomly. I
> will keep on checking more as required.
I will try to have a look soon.
> The only issue is that of the invalid dates. Like we discussed, we
> implemented a log output whenever an invalid date was encountered
> because it *could* be spam. Seems like it won't work out.
>
> Consider this:
>
> $ grep -c skipping teammetrics/liststat.log
> 1070
>
> Very few of these messages are _actually_ spam. The others are those
> from which the Date parsing just won't work! We have dates like:
>
> Date: Tue Aug 29 11:19:01 2006
> Date: Date: Fri, 15 Jun 101 10:33:41 -0400 (EDT)
>
> No regex can parse all such dates. And because the dates can't be
> parsed, we can't extract the year or the month. And when that doesn't
> match, we end up logging a 'Date' mismatch error. I hope I explained
> it clearly.
I think so. However, I do not remember that we decided to treat a
broken date string as a sign of a spam message. As far as I remember
we agreed on the following logic: we are unable to parse the date
correctly, but we know in which month the mail arrived on the
mailing list. So let's fix the date at something like
- 1.<month>.<year> / 15.<month>.<year>
- <random%30>.<month>.<year>
- date >= date of previous mail && date <= date of next mail
All of these assumptions will work reasonably well for our purpose. We
only consider the month in which a mail reached the mailing list, and
we know that month with reasonable certainty.
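A minimal sketch of the first option, assuming the parser knows the year and month of the archive page a message came from. The function name `resolve_date` and its parameters are hypothetical. Treating a parsed date that falls outside the archive month as untrustworthy is an extra assumption made here, not something agreed in this thread.

```python
import datetime
from email.utils import parsedate_tz

def resolve_date(raw_header, archive_year, archive_month, fallback_day=1):
    """Return a date for a message, falling back to the archive month.

    raw_header is the value of the Date header (may be None or garbage).
    """
    parsed = parsedate_tz(raw_header) if raw_header else None
    if parsed is not None:
        try:
            candidate = datetime.date(parsed[0], parsed[1], parsed[2])
        except ValueError:
            candidate = None  # e.g. day 31 in a 30-day month
        # Assumption: only trust a parsed date that agrees with the archive month.
        if candidate is not None and (
            (candidate.year, candidate.month) == (archive_year, archive_month)
        ):
            return candidate
    # Unparseable or implausible date: pin it to a fixed day of the known month.
    return datetime.date(archive_year, archive_month, fallback_day)
```

With this, a message is never skipped because of its Date header; at worst it is counted on the 1st (or the 15th, with `fallback_day=15`) of the month in which it reached the list.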
> So what I feel is that we should remove this and perhaps throw the
> messages to the spam thing we wrote instead, for filtering spam.
> Because parsing the Date field just won't work, there seems to be no
> way to filter such valid messages. This is possible:
>
> if (date_can_be_parsed)
>     compare
> else
>     don't compare
>
> But then it defeats our purpose. So IMHO, I see no way.
I'm afraid I do not understand this paragraph.
> Other than that, I will now add support so that the same message is
> not fetched again and make improvements so we can finalize this at the
> earliest.
Sounds good.
Kind regards
Andreas.
--
http://fam-tille.de