Legitimate robots

Author: W.Daniels (11 Dec 04 4:02pm)

What happens when Google or any of the other search engines catalog my site and hit the provided page?

Re: Legitimate robots

Author: M.Prince (11 Dec 04 5:47pm)

We've architected the system to protect good robots from being labeled harvesters. It's important to note that hitting the honey pot page alone is not what gets a robot declared a harvester. The robot has to visit the page, and then email has to be sent to the address that was handed out to it there. Google, presumably, is not in the business of harvesting, so the addresses handed to its spiders will probably never receive any spam.
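
A minimal Python sketch of that two-step rule. It is an illustration only, not the project's actual code; the `HoneyPotLog` class, its method names, and the example IPs and addresses are all hypothetical.

```python
class HoneyPotLog:
    """Tracks which visitor was handed which address (hypothetical names)."""

    def __init__(self):
        self.issued = {}          # address -> IP of the visitor that received it
        self.harvesters = set()   # IPs whose issued address later received mail

    def record_visit(self, visitor_ip, address):
        # A visit alone proves nothing; just remember who was handed this address.
        self.issued[address] = visitor_ip

    def record_mail(self, to_address):
        # Mail arriving at an issued address implicates the visitor that got it.
        visitor_ip = self.issued.get(to_address)
        if visitor_ip is not None:
            self.harvesters.add(visitor_ip)
        return visitor_ip

    def is_harvester(self, visitor_ip):
        return visitor_ip in self.harvesters


log = HoneyPotLog()
log.record_visit("192.0.2.10", "a1@example.com")    # e.g. a search engine spider
log.record_visit("198.51.100.7", "b2@example.com")  # some other robot
log.record_mail("b2@example.com")                   # spam arrives for b2 only
```

Both robots visited, but only the second would be flagged, because only its address received mail.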

The other risk is that the honey pot pages get archived by Google or other search engines and then spammers search the indexes to find email addresses. We have a couple of ways to prevent this. First, we include a meta tag in the head of the honey pot pages telling robots not to index or archive the pages, which good robots respect. Second, we have a fairly extensive list of IP addresses of good robots and, when they visit the pages, we do not display an email address. Over time, from the data we're gathering, the list of good robots will continue to adapt as new robots come online or drop off.
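
Roughly, the two safeguards could be combined as in the Python sketch below. This assumes a hypothetical `render_honeypot` function and a made-up allow-list (documentation-range IPs stand in for real crawler ranges); it is not how the project's pages are actually generated.

```python
import ipaddress

# Hypothetical allow-list; a documentation-range network stands in for real crawlers.
GOOD_ROBOT_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

# Asks compliant robots not to index or archive the page.
ROBOTS_META = '<meta name="robots" content="noindex,noarchive">'


def is_good_robot(visitor_ip):
    ip = ipaddress.ip_address(visitor_ip)
    return any(ip in net for net in GOOD_ROBOT_NETWORKS)


def render_honeypot(visitor_ip, address):
    # Safeguard 1: every page carries the robots meta tag.
    # Safeguard 2: known good robots are never shown an address.
    if is_good_robot(visitor_ip):
        body = "<p>Nothing to see here.</p>"
    else:
        body = f'<p><a href="mailto:{address}">{address}</a></p>'
    return f"<html><head>{ROBOTS_META}</head><body>{body}</body></html>"


page_for_spider = render_honeypot("192.0.2.10", "a1@example.com")
page_for_unknown = render_honeypot("198.51.100.7", "b2@example.com")
```

Even if a good robot ignored the meta tag and archived the page, there would be no address in it for a spammer to find.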

Welcome to the Project!

Re: Legitimate robots

Author: S.Goodman (14 Dec 04 7:39pm)

Hmm. So you must be using the robots meta-tag or something like that? I put the honeypot in a separate directory and added a robots.txt file to keep legitimate robots out of that directory. It sounds like that might not be necessary if robots obey the meta-tags. Is this doing any good or should I remove the exclusion file?
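
For reference, an exclusion file of this kind might look like the `ROBOTS_TXT` string in the sketch below. The `/honeypot/` directory name and the example.com URLs are assumptions made for illustration. Python's standard `urllib.robotparser` module is used here to show how a compliant robot would interpret the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of robots.txt; the directory name is hypothetical.
ROBOTS_TXT = """\
User-agent: *
Disallow: /honeypot/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant robot checks each URL against the rules before requesting it.
honeypot_ok = parser.can_fetch("Googlebot", "http://example.com/honeypot/page.html")
index_ok = parser.can_fetch("Googlebot", "http://example.com/index.html")
print(honeypot_ok, index_ok)
```

With this file in place, a compliant robot never requests anything under the excluded directory, so it never sees the honey pot page or its meta tag at all.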

Re: Legitimate robots

Author: M.Prince (15 Dec 04 12:16am)

You can use robots.txt to keep legitimate robots away from the honey pot as well. It won't do any serious harm. We actually build our lists of legitimate robot visitors in part based on which spiders visit the page but consistently do not send spam to the addresses we display there.

You are correct, however, that we do include a robots meta tag to keep the pages from getting cached. The legitimate robots that we've seen appear to respect this. Again, it doesn't matter if Google or other legitimate spiders simply visit the honey pot page. The only real risk is if a robot stores the page and the email address it contains (e.g., caching) and then allows individuals to access these cached pages. There are actually fairly few robots that do this, and those that do cache pages respect the robots meta tag.

Re: Legitimate robots

Author: S.Goodman (17 Dec 04 5:59pm)

Thanks for the reply. When you say "It won't do any serious harm", does that mean it causes minor problems? Just trying to learn the ins and outs of honeypots and spambots.

Re: Legitimate robots

Author: M.Prince (17 Dec 04 6:08pm)

It would, for your site, deprive us of some of the data about legitimate spiders visiting. That data may end up being useful in some way in helping give perspective to illegitimate bot traffic.

For example, let's say we wanted to try and determine why some honey pots get harvested often and others are generally ignored. One of the things we could look at is how "popular" a page is. One metric, potentially, for calculating that may be how many times Google has visited a honey pot. I don't know if that's a good metric or not -- I can actually think of several things that may be wrong with it -- but the general point is that seeing legitimate bots visit could give us some kind of yardstick against which to measure illegitimate bots.

Does that make sense? I'm not sure yet how we'll end up quantifying the data we're gathering, but knowing the legitimate bot traffic may be something that's useful.

Re: Legitimate robots

Author: S.Goodman (20 Dec 04 3:59pm)

Yes, it does make sense, and I can see the difficulty of coming up with a measure of "visibility" or "popularity". If you want to know about legitimate bot visits, doesn't the robot exclusion meta-tag prevent that, to the extent that they honor it? I think what you really want is something that allows legitimate bots to visit the page and have their IPs harvested, but keeps them from caching the page contents. I don't know if there is any such construct in the robot exclusion protocol.

Meanwhile, I'll remove robots.txt.

Post Edited (20 Dec 04 3:01pm)

Re: Legitimate robots

Author: M.Prince (21 Dec 04 4:46pm)

The meta tag does exactly what you suggest we need: it keeps the spiders from caching or indexing the pages, but the spider still visits. The tag is on the page, so in order to read it a spider has to access the page in the first place. At that point we capture its IP and other pertinent information.
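
Here is a small Python sketch of that ordering, using the standard `html.parser` module. The `serve_page` and `crawl` functions and the `visit_log` list are hypothetical stand-ins for a server and a well-behaved spider, not our actual code.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


visit_log = []  # what the honey pot server records (hypothetical)


def serve_page(visitor_ip):
    # The server sees the request, and so the IP, before the robot reads any tag.
    visit_log.append(visitor_ip)
    return ('<html><head><meta name="robots" content="noindex,noarchive">'
            '</head><body>...</body></html>')


def crawl(visitor_ip):
    # A compliant spider fetches the page, then obeys the directives it finds.
    html = serve_page(visitor_ip)
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_cache = "noarchive" not in parser.directives
    return may_index, may_cache


result = crawl("192.0.2.10")
```

The visit is logged regardless of what the spider decides afterwards, which is the difference from a robots.txt exclusion, where a compliant spider never requests the page.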

If we begin to see harvesters ignoring pages with the noindex/noarchive meta tags then we may have to refine our strategy. At that point we may selectively turn off the noindex/noarchive tags but not serve up an address when a legitimate robot visits. That should have the same practical effect of keeping the honey pot pages out of Google's cache.

For now, however, harvesters don't seem to be ignoring our pages. We'll definitely need to adapt if they begin to.

Re: Legitimate robots

Author: D.Tetreault (19 Jan 05 5:12am)

Harvesters ignoring noindex/noarchive pages. Hmmm...

Wouldn't that be a good thing? Ultimate success in the fight against web-based email harvesting? Simply by tagging pages containing mailto:?

A result where spammers show any respect whatsoever must be a benefit to mankind (well, netkind, at least). I think your project is one of the best ideas for combating spam I've seen yet! Thank you for sharing.

Re: Legitimate robots

Author: J.Coghill (3 Feb 05 1:32pm)

My only concern is whether or not it would affect pages that I would like Google (or other search engines) to index.