[Teammetrics-discuss] Results from Web Archive Parser
Sukhbir Singh
sukhbir.in at gmail.com
Wed Dec 21 14:24:22 UTC 2011
Hi Andreas,
PS: Please feel free to reply later as Christmas is near.
The web archive parser has finally finished parsing all 55 lists.
Yayay! It started on 17th December at 19:40 and completed on 21st
December at 09:49. That is roughly three and a half days, but look at
the number of messages:
      count
    ---------
     2751087
    (1 row)
Cool!
Anyways, the good news is that it seems to be working perfectly now. I
have checked the results against the graphs and also compared random
samples. I will keep checking more as required.
The only issue is that of the invalid dates. As we discussed, we
implemented a log output whenever an invalid date was encountered
because it *could* be spam. It seems this approach won't work out.
Consider this:
$ grep -c skipping teammetrics/liststat.log
1070
Very few of these messages are _actually_ spam. The rest are valid
messages whose Date header just can't be parsed! We have dates like:
Date: Tue Aug 29 11:19:01 2006
Date: Date: Fri, 15 Jun 101 10:33:41 -0400 (EDT)
No single regex can parse all such dates. And because the dates can't
be parsed, we can't extract the year or the month. And when the
comparison therefore fails, we end up logging a 'Date' mismatch error.
I hope I explained it clearly.
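
As a rough illustration of the problem, here is a minimal Python sketch of a more lenient year/month extractor. It is not the liststat code. The function name `parse_date`, the fallback format list, the plausible-year range and the "year 101 means 2001" heuristic are all assumptions made for this example; only `email.utils.parsedate_to_datetime` and `datetime.strptime` are standard library calls.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Assumption: extra layouts to try when the RFC-style parser gives up.
FALLBACK_FORMATS = (
    "%a %b %d %H:%M:%S %Y",  # ctime-like, e.g. "Tue Aug 29 11:19:01 2006"
)


def parse_date(raw, min_year=1970, max_year=2100):
    """Return (year, month) for a Date header value, or None if hopeless."""
    value = raw.strip()
    # Drop any leading "Date:" prefixes, including a duplicated one.
    while value.lower().startswith("date:"):
        value = value[len("date:"):].strip()

    parsed = None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError, IndexError):
        # Older Pythons raise TypeError, newer ones ValueError.
        parsed = None

    if parsed is None:
        for fmt in FALLBACK_FORMATS:
            try:
                parsed = datetime.strptime(value, fmt)
                break
            except ValueError:
                continue
    if parsed is None:
        return None

    year = parsed.year
    # Assumption: a three-digit year such as 101 comes from a Y2K-style
    # bug (years since 1900 printed as-is), so 101 is read as 2001.
    if 100 <= year <= 199:
        year += 1900
    if not (min_year <= year <= max_year):
        return None
    return year, parsed.month
```

With these assumptions, both example headers above yield a usable year and month, while a value that matches no known layout still returns `None`. It only narrows the set of unparseable dates; it does not make the problem go away.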
So what I feel is that we should remove this check and perhaps pass
the messages to the spam filter we wrote instead.
Because parsing the Date field just won't work, there seems to be no
way to filter such valid messages by date. This is possible:
    if (date_can_be_parsed)
        compare
    else
        don't compare
But then it defeats our purpose. So IMHO, I see no way.
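
For reference, the conditional above could be sketched in Python as a three-way outcome, so that unparseable dates are kept apart from real mismatches and can be handed to the spam filter instead. The function name `classify` and the result labels are hypothetical, and the sketch assumes the comparison is between the header's year/month and the year/month of the archive page being parsed.

```python
def classify(parsed, archive_year, archive_month):
    """Compare a parsed (year, month) tuple against the archive's month.

    `parsed` is None when the Date header could not be parsed at all.
    """
    if parsed is None:
        # Nothing to compare; do not count this as a date mismatch.
        return "unparsed"
    if parsed == (archive_year, archive_month):
        return "match"
    return "mismatch"
```

This does not recover the spam signal for the unparsed messages; it only stops them from being logged as mismatches.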
Other than that, I will now add support so that the same message is
not fetched again and make improvements so we can finalize this as
soon as possible.
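
One possible shape for the "don't fetch the same message again" support is sketched below. How the parser actually identifies a message (Message-ID, archive URL, or something else) is not decided here, so `message_ids`, `already_fetched` and the function name `unseen` are placeholders for this example.

```python
def unseen(message_ids, already_fetched):
    """Yield only identifiers that have not been fetched before.

    `already_fetched` is a set that is updated in place, so duplicates
    within the same run are skipped as well.
    """
    for message_id in message_ids:
        if message_id in already_fetched:
            continue
        already_fetched.add(message_id)
        yield message_id
```

In a real run the set would have to be loaded from wherever the fetched messages are stored, which this sketch leaves out.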
--
Sukhbir