[Teammetrics-discuss] Results from Web Archive Parser

Sukhbir Singh sukhbir.in at gmail.com
Wed Dec 21 16:23:03 UTC 2011


Hi,

> Yep.  And we need to make sure that the parsing is continued from what
> is in the database once the parser is called next time. :-)

Don't worry about it!

> I think so.  However, I do not remember that we decided to consider a
> broken date string as a sign for a SPAM message.  As far as I remember

You are right, we did not. But to get the date from the 'Date' field
we have to parse it, and that is where we run into the broken date
strings. We also need the year, and values like 101 and 099 are no
use as years :)

To sum up:

We will now try to get the 'day'. If we can't, we will just set it to 15.

The month and the year we will take from the list URL through which
the message itself was fetched (as you said).
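
Roughly, the fallback could look like the sketch below. This is only an
illustration, not the actual parser code: the function name
`message_date`, the constant `FALLBACK_DAY` and the assumption that the
archive URL contains `/<year>/<month>/` are all mine.

```python
import re
from datetime import date
from email.utils import parsedate_tz

# Day used when the 'Date' header gives no usable day (hypothetical name).
FALLBACK_DAY = 15

# Assumption: the archive URL looks like .../<list>/<year>/<month>/...
URL_DATE_RE = re.compile(r'/(\d{4})/(\d{2})/')


def message_date(date_header, list_url):
    """Return a date using year/month from the URL and day from the header."""
    match = URL_DATE_RE.search(list_url)
    if match is None:
        raise ValueError('no /<year>/<month>/ in URL: %r' % list_url)
    year, month = int(match.group(1)), int(match.group(2))

    day = FALLBACK_DAY
    try:
        # parsedate_tz returns None when it cannot parse the header.
        parsed = parsedate_tz(date_header) if date_header else None
    except (TypeError, ValueError, IndexError):
        parsed = None

    if parsed is not None:
        candidate = parsed[2]  # day of the month from the header
        try:
            # Only trust the day if it is valid for the URL's year/month.
            date(year, month, candidate)
            day = candidate
        except (TypeError, ValueError):
            pass

    return date(year, month, day)
```

With this, a header whose year is broken (say 101) still contributes its
day, and a header that cannot be parsed at all ends up on the 15th.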

> I'm afraid I do not understand this paragraph.

I meant that we should pass the messages through `spamfilter.py`.

I am going to make all of these changes and then do a hopefully
'working' run of this.

PS: I added a check to count how many messages are skipped because
`lists.d.o` does not respond. In this run: 1. Let's hope it stays this
way. I really like the web archive parser; it's much better than NNTPStat.

-- 
Sukhbir


