[Teammetrics-discuss] Commitstat using key authentication?
sukhbir.in at gmail.com
Wed Sep 7 14:59:15 UTC 2011
Changes in liststat.py:
- In the earlier code, if an encoding could not be resolved, we
set it to 'ascii'. That was an error-prone approach, as tests on
non-English lists showed. Instead, we now use the `chardet` module to
detect the encoding and then attempt to call unicode() with it. This
has resulted in proper handling of messages with encoding issues.
Messages with encoding errors were mostly spam. However, after parsing
debian-user-german from lists.d.o, I noticed a large number of
messages that had encoding errors but should not have, and that were
not spam. That was not good news.
So I set out to fix this. I noticed that I was doing this:
subject = u" ".join([unicode(text, charset or 'ascii')
I was defaulting it to 'ascii' but that was not cool! Sure, it works
for the English lists in almost all cases but not for i18n-ized lists.
So I used the 'chardet' module that helps with encoding detection. So
now we do this:
subject = u" ".join([unicode(text, charset or
And this has fixed almost 99% of the encoding errors and I am happy :)
The remaining messages are either spam or cases where the encoding can't
be detected (None). We will confirm this more when we run a proper
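
To make the fallback concrete, here is a rough sketch of the idea: use the
declared charset when there is one, otherwise ask chardet for a guess, and
give up (None) if neither works. This is not the code in liststat.py. It is
written for Python 3 rather than the unicode() call quoted above, the names
guess_encoding and decode_header_text are made up for illustration, and the
`detect` parameter exists only so the sketch can run without chardet.

```python
# Illustrative sketch only; function names are hypothetical and this is
# not the actual liststat.py code.
try:
    import chardet  # third-party module, assumed to be installed
except ImportError:
    chardet = None


def guess_encoding(raw, detect=None):
    """Return a detected encoding name for raw bytes, or None."""
    if detect is None:
        if chardet is None:
            return None
        detect = chardet.detect
    # chardet.detect() returns a dict whose 'encoding' value may be None.
    return detect(raw).get("encoding")


def decode_header_text(raw, charset=None, detect=None):
    """Decode raw bytes with the declared charset, else a detected one.

    Returns None when no encoding can be determined or decoding fails,
    so the caller can decide what to do with the message.
    """
    if isinstance(raw, str):
        return raw
    encoding = charset or guess_encoding(raw, detect)
    if encoding is None:
        return None
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess or unknown codec name: treat as undecodable.
        return None
```

Returning None instead of silently assuming 'ascii' keeps the undetectable
cases visible, which matches the "(None)" case mentioned above.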
- For messages that had invalid dates, we now use the date of the
previous message. This helps us to avoid skipping the message
entirely, as was being done earlier.
Fixed as discussed.
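
For reference, the date fallback described above could look roughly like
the sketch below. It is an assumption-laden illustration, not the code in
liststat.py: resolve_dates is a hypothetical name, it uses Python 3's
email.utils.parsedate_to_datetime, and it leaves the date as None when the
very first message has an invalid date, since there is no previous message
to borrow from.

```python
from email.utils import parsedate_to_datetime


def resolve_dates(date_headers):
    """Parse Date headers in order, reusing the previous date on failure.

    Hypothetical helper for illustration; entries stay None only when no
    earlier message had a valid date.
    """
    resolved = []
    previous = None
    for header in date_headers:
        try:
            current = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            # Invalid or missing Date header: fall back to the last good one.
            current = previous
        resolved.append(current)
        if current is not None:
            previous = current
    return resolved
```

Called with a list of raw Date header strings in archive order, it returns
one datetime (or None) per message, so no message is dropped for a bad date.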
- Messages with invalid payloads are not skipped; instead, they go
through the spam filter.