[Debian Wiki] crawler not allowed to perform ?action=raw
Frank Lin PIAT
fpiat at klabs.be
Tue May 11 06:26:51 UTC 2010
retitle 569191 crawler not allowed to perform ?action=raw
thanks
Andreas B. Mundt wrote:
> we use GET to download a wikipage and further process the data to
> prepare the manual of Debian Edu. The command:
> GET "http://wiki.debian.org/DebianEdu/Documentation/Lenny/AllInOne?action=raw"
> works fine in Lenny, but stopped working in squeeze where "You are not
> allowed to access this!" is returned. If you remove "?action=raw" from
> the URL everything is fine. Is this intended and do we have to provide a
> header?
Damyan Ivanov wrote:
> On Lenny (works)
> ================
> User-Agent: lwp-request/0.810
>
> On Sid (breaks)
> ===============
> User-Agent: lwp-request/5.834 libwww-perl/5.834
Yes, this is standard MoinMoin behavior.
The wiki engine has surge protection mechanisms to prevent web
crawlers (and users) from DoS'ing the wiki.
Well-known web crawlers (including libwww-perl/*) are only allowed to
fetch HTML-rendered pages.
As was mentioned, you should change your crawler's User-Agent string
(use something meaningful, so the admin can get in touch with you
rather than just blacklisting the "offending" IPs).
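
For illustration, here is a minimal sketch of fetching the raw page with a
descriptive User-Agent, written in Python rather than with the GET tool from
libwww-perl. The agent string, the contact address and the function names are
hypothetical placeholders, and the sketch assumes the page is served as UTF-8;
whether the wiki accepts a given string depends on its configuration.

```python
import urllib.request

# URL taken from the original report.
WIKI_URL = "http://wiki.debian.org/DebianEdu/Documentation/Lenny/AllInOne?action=raw"

# Hypothetical identification string: name your tool and give a real contact
# so the wiki admin can reach you instead of blocking your IP.
USER_AGENT = "debian-edu-doc-fetcher/0.1 (contact: admin@example.org)"


def build_request(url, user_agent=USER_AGENT):
    """Build a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_raw(url, user_agent=USER_AGENT, timeout=30):
    """Download the page and return its body as text (UTF-8 is assumed)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(fetch_raw(WIKI_URL))
```

The point is only that the request identifies itself with something other
than the library default, which is what the wiki matches against.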
Thanks,
Franklin