[Teammetrics-discuss] Web Parser
Andreas Tille
andreas at an3as.eu
Thu Dec 1 13:39:53 UTC 2011
Hi Sukhbir,
(just git pull - I have chmod a+x archiveparser.py)
On Thu, Dec 01, 2011 at 05:08:46PM +0530, Sukhbir Singh wrote:
> Hi,
>
> Yesterday I updated the web archive parser and fixed some issues
> with the Date handling. It is now in a much better operational state
> than the previous version; however, there is one issue that I should
> discuss.
>
> After a certain number of messages have been downloaded (usually 100+),
> lists.d.o stops responding for some time. This causes
> urllib2.URLError to be thrown because lists.d.o is not responding; to be
> sure that this problem was not in the code, I checked and it failed
> to load even through the browser. After a few seconds, it starts
> responding again. This is totally random. I have handled this
> exception but *not* implemented a mechanism that tries to download the
> message again.
Sounds like some kind of bandwidth limiting implemented by the Debian
server admins. Would you mind asking this question on
debian-services-admin at lists.debian.org ?
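Until that is clarified, a retry with a growing pause might be enough to
get past the short outages. The following is only a rough sketch, not the
code in archiveparser.py: it assumes Python 3, where URLError lives in
urllib.error rather than urllib2, and the name fetch_with_retry as well as
the retries, delay, opener and sleep parameters are made up for
illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=4, delay=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, pausing and retrying when the server does not respond.

    Hypothetical helper: the pause doubles after every failed attempt,
    and the last URLError is re-raised once all attempts are used up.
    """
    for attempt in range(1, retries + 1):
        try:
            response = opener(url)
            try:
                return response.read()
            finally:
                response.close()
        except urllib.error.URLError:
            # Note: HTTPError is a subclass of URLError, so HTTP error
            # responses are retried as well in this simple version.
            if attempt == retries:
                raise  # give up; the caller can log and skip the message
            sleep(delay)
            delay *= 2
```

The opener and sleep arguments exist only so the helper can be tried out
without hitting lists.d.o; in normal use they keep their defaults.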
> I was wondering, is this expected? I mean, did you face this issue
> with your code, ever?
No, but I was only reading the index files and not the whole pages, which
might have kept me far below any such limit.
BTW, I just tried
$ ./archiveparser.py
Traceback (most recent call last):
  File "./archiveparser.py", line 150, in <module>
    main(conn, cur)
  File "./archiveparser.py", line 93, in main
    day, message_month, message_year = ''.join(date).split()
ValueError: need more than 0 values to unpack
The logfile simply says:
2011-12-01 13:26:08,933 INFO: Starting Web Archive Parser
2011-12-01 13:26:08,959 INFO: List 1 of 55
2011-12-01 13:26:08,959 INFO: Fetching 'debian-accessibility'
2011-12-01 13:26:09,261 INFO: List archives are from 2003 to 2011
2011-12-01 13:26:09,510 WARNING: Possible spam: date mismatch in message 871y1t91zy.fsf at lexx.delysid.or
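The ValueError means that ''.join(date).split() returned an empty list,
so there was nothing to unpack into the three names. One way to guard
against that is sketched below. It assumes that date is a list of string
fragments (I have not checked how it is built), and the helper name
split_date is hypothetical.

```python
def split_date(date):
    """Return (day, month, year) from a list of string fragments.

    Returns None when the joined text does not split into exactly
    three fields, instead of raising ValueError on unpacking.
    """
    fields = ''.join(date).split()
    if len(fields) != 3:
        return None  # caller can log a warning and skip this message
    day, message_month, message_year = fields
    return day, message_month, message_year
```

main() could then skip any message for which split_date() returns None
and carry on with the next one rather than aborting the whole run.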
Kind regards
Andreas.
--
http://fam-tille.de