google and non-existent urls

Author: C.Kruslicky (2 Jul 05 12:59pm)

Since Googlebot requests don't list a referer, I'm not 100% sure this is due to a honeypot on my site, but whois indicates the IPs are definitely Google's.

What I'd been seeing appeared to be spamtrap addresses being crawled by Google as relative URLs. More recently, I'm seeing what I'm guessing are hashes of some sort with .html on the end. I'm curious whether this is intended behavior or whether some of the honeypot creations are malformed.

I'm also reluctant to post Apache logs here in case it messes up the statistics of the project itself. But to be more specific, the first kind look like:
/same/relative_path_as_honeypot/name@subdomain.example.com
while the newer ones I mention are like this:
/randomlookingalphachars.html

Just pointing it out. They all result in 404s; whether they indicate that Google is not honoring tags or that there's a problem with my honeypot, I do not know.
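
For anyone wanting to pick these requests out of their own 404s, here is a rough sketch that sorts request paths into the two shapes described above. The function name, the labels, and the 8-letter minimum for "random looking" names are all assumptions made for illustration, not anything the honeypot project or Google defines, and the letter threshold will misfire on real pages with long alphabetic names.

```python
import re

# Assumption: an email address used as the last path segment,
# e.g. /some/dir/name@subdomain.example.com
EMAIL_PATH = re.compile(r"^/(?:[^/]+/)*[^/@]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

# Assumption: a top-level page whose name is 8 or more letters and nothing else.
# The threshold of 8 is a guess; tune it against your own site's real page names.
RANDOM_HTML = re.compile(r"^/[A-Za-z]{8,}\.html$")


def classify(path):
    """Label a requested path as one of the two odd shapes, or 'other'."""
    if EMAIL_PATH.match(path):
        return "email-as-url"
    if RANDOM_HTML.match(path):
        return "random-html"
    return "other"
```

Feeding it the request paths from 404 log entries for Google's IPs should give a quick count of each kind.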

Re: google and non-existent urls

Author: C.Dijkgraaf (4 Jul 05 8:43pm)

The random page request has been theorised on WebmasterWorld to be Googlebot's way of determining what a 404 Page Not Found response looks like on a particular site. I think they are doing this just in case some web servers don't issue a 404 code correctly when serving up a not found page.
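
To check how your own server answers this kind of probe, something like the following minimal sketch could work. It only illustrates the theory above and is not Googlebot's actual method; the function names, the 16-letter path length, and the rule "any 2xx answer to a made-up URL is suspicious" are assumptions.

```python
import random
import string
import urllib.error
import urllib.request


def random_probe_path(length=16, rng=random):
    """Build a path that almost certainly does not exist on the site."""
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return "/" + letters + ".html"


def probe_status(base_url, timeout=10):
    """Request a random path under base_url and return the final HTTP status code."""
    url = base_url.rstrip("/") + random_probe_path()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is what we want.
        return err.code


def looks_like_soft_404(status):
    """A made-up URL answered with 2xx suggests the error page is not sent as a 404."""
    return 200 <= status < 300
```

Calling probe_status with your site's base URL and passing the result to looks_like_soft_404 shows whether a not found page is being served with a success code. Note that urlopen follows redirects, so a server that redirects missing pages to its home page would also show up as a 2xx here.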

Re: google and non-existent urls

Author: C.Kruslicky (7 Jul 05 10:04am)

That's interesting, thanks for the info. I seem to remember coming across something goofy with the default IE settings, where you cannot return an access-denied page with a login form because IE "helps" by not showing your content. So the probing makes some sense if it's common for sites to return a valid page even in error situations.