Message Board

Newbie/Basic Questions

How are spiders to find these pages?

Author: S.Haneda (24 Jan 05 5:36am)

Curious: my understanding of how search engines work is that they need to be able to get to your site either through submission of your site or by following other sites' links to you. If I am to assume that spammers' robots work in much the same way, in that they simply follow links from one place to another, how is a spammer robot to find my honeypot page?

Take http://www.newdomain.com/honeypot.php, and let's assume that newdomain.com is a really popular site. Even so, the auto-generated page name that this site gives your honeypot install files will be a link that is never found, because no one knows about it at all. I find it hard to believe that robots will try all possible combos of filenames, and even if they did, the chances of a hit are iffy at best.

I cannot find any reference to this on the site or in the docs, but it sure seems to me like you need to put the file in place of some other well-known page, or at least link to it. Ideally, you would pop it in place of some other page, but that may not be feasible.

Am I missing something here?

Re: How are spiders to find these pages?

Author: M.Prince (24 Jan 05 10:17am)

The last step in the honey pot installation process is to include links from your existing pages to your honey pot script. We provide instructions on how to format these links so that most humans visiting your pages will never see them, but harvesters and other robots will see and access them.
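As a rough illustration of why a link that humans never see is still visible to a robot, here is a minimal Python sketch of the link-extraction step of a naive crawler. It is not taken from any real harvester; the page content, the `/honeypot.php` path, and the names `LinkCollector` and `extract_links` are assumptions made for the example. The point is only that a parser reading raw HTML ignores CSS, so a link inside a `display:none` element is collected like any other.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects every href found in <a> tags; CSS is never evaluated."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return all link targets in document order (hypothetical helper)."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Hypothetical page: one visible link, one link hidden from human visitors.
PAGE = """
<html><body>
<a href="/about.html">About us</a>
<div style="display:none"><a href="/honeypot.php">contact</a></div>
</body></html>
"""

print(extract_links(PAGE))  # ['/about.html', '/honeypot.php']
```

A browser would render only the first link, while a robot that follows every extracted href would request both pages.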

I agree that there are no harvesters today that are trying every possible file name. While some hacker tools will try common files that contain known administrator tools in order to test whether they may provide a weakness in a site's security, the processing time required to randomly guess domains would be beyond the capabilities of Google, let alone a spam harvester.

Sorry for the confusion. If you go through the installation process you'll see the instructions. In the meantime, I'll try to update our FAQ in order to make it clear that this is a required step.

Thanks for your help with the Project!

Re: How are spiders to find these pages?

Author: B.Rapp (24 Jan 05 10:36am)

His question gave me an idea. Have you considered a special honeypot variant for formmail spammers? I see a lot of people trying to access /cgi-bin/formmail.cgi or .pl, and it seems like you could use this special honeypot to capture the IPs of people doing the submissions as well as their email address, or at least the one they're sending test messages.

I'm not sure that this violates CAN SPAM in the same way as harvesting, but it still might be useful information.

Re: How are spiders to find these pages?

Author: M.Prince (24 Jan 05 10:41am)

Yeah, we've played with the idea of trying to catch "comment spammers" by including a hidden <form> on the pages and watching what spiders try to submit to it. That's definitely in the works for the next version of the honey pot scripts. We'll send out an email to existing members as soon as a new version is available. There's also a lot we can do to add features to already installed honey pots without the user having to modify his or her installation at all.
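To make the hidden-form idea concrete, here is a small Python sketch of what "watching what spiders try to submit" could look like on the receiving end. It is only an assumption about one possible design, not how the honey pot scripts work; the function `record_submission`, the log structure, and the simple email pattern are all hypothetical. It records the submitter's IP along with any email addresses found in the submitted fields, which is the information discussed above.

```python
import re
from datetime import datetime, timezone

# Deliberately simple pattern; real address matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def record_submission(remote_ip, form_fields, log):
    """Append one record describing a submission to the hidden form.

    remote_ip   -- address the request came from
    form_fields -- dict mapping field name to submitted string value
    log         -- list used here as a stand-in for real storage
    """
    addresses = []
    for value in form_fields.values():
        addresses.extend(EMAIL_RE.findall(value))
    entry = {
        "ip": remote_ip,
        "time": datetime.now(timezone.utc).isoformat(),
        "emails": sorted(set(addresses)),
        "fields": sorted(form_fields),
    }
    log.append(entry)
    return entry


# Example: a robot posts a test message to the hidden form.
submissions = []
entry = record_submission(
    "192.0.2.10",
    {"recipient": "test@example.com", "message": "hello"},
    submissions,
)
print(entry["ip"], entry["emails"])
```

Since no human visitor can see the form, any record in the log would come from an automated client.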

While I don't think there's as clear a legal recourse against such spammers (at least not under CAN-SPAM), as you said, it still would be interesting to know who they were. I'm also curious whether the same robots doing the harvesting are responsible for the comment and referrer log spamming. Are robots doing double duty? We aim to find out.

Thanks for the feedback!

Re: How are spiders to find these pages?

Author: S.Friedkitten. (25 Jan 05 6:44pm)

Would it benefit if one named their honeypot formmail.cgi? That might lure formmail abusers into grabbing the honeypot address as well, and as such expose themselves as spammers.

Just my 2c.

ServMe.

Re: How are spiders to find these pages?

Author: S.Stewart (7 Feb 05 5:21pm)

I read about this project earlier today on New Scientist, and immediately set about putting it on our website, linked from two of our most popular pages. Respect. We have an anchor link to our organisation email on every one of our 300-plus pages, and as a result get hammered by harvesters. Nice to see this project doing something about it. BUT....

[quote]We provide instructions on how to format these links so that most humans visiting your pages will never see them[/quote]

Several of the examples shown on the projecthoneypot website could cause you problems with Search Engines, being essentially "hidden links". Specifically, a link in this form <a href="http://www.mysite.org/cgi-bin/filename.cgi"></a>, is dicey, and although I can't find the suggested linking forms now, there were a few that were along those lines.

I did note in the source code:

<meta name="robots" content="noarchive">
<meta name="robots" content="noindex">

but it is uncertain that this is enough to satisfy Google, Yahoo, and MSN that it isn't hidden linking, something that can get you booted right out of the index.

If there's something I'm missing, please pass it along. I got around it by using a small jpg as <a href="http://www.mysite.org/cgi-bin/filename.cgi"><img src="example.jpg" alt="Not a link - Do not click" title="Not a link - Do not click"></a> down at the bottom of the page, (it's a helmet, and we're cavers, so that's cool).

Stefan

Re: How are spiders to find these pages?

Author: M.Prince (7 Feb 05 5:59pm)

We understand this is a concern and are working on getting in touch with the folks at Google and the other search engines in order to ensure this does not cause a penalty to our users. I think that using a small image is a good way to go. I also think that using CSS to hide the links is unlikely to incur any sort of penalty. I'm confident as to the latter because there are a number of legitimate reasons to create an element that is set to display: none. For example, if you have a CSS/Javascript-based drop-down menu then you are likely to hide the DIVs using display:none and only show them on the occurrence of some event. In this case, there simply isn't an event.

If you want to be even more careful then you could add the new Google tag rel=nofollow to keep the search engines away. We'd prefer everyone not do that as we actually use search engine traffic in order to help quantify the harvester traffic, but if you're worried we certainly won't be upset.
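The linking options mentioned in this thread can be compared side by side with a short Python sketch that builds each form as an HTML string. This is not the Project's official link format; the function `hidden_link`, the anchor text "contact", and the wrapping `div` are assumptions for illustration, while the image form follows the example posted earlier in the thread.

```python
from html import escape


def hidden_link(url, style="css", nofollow=False):
    """Build one honeypot link variant as an HTML string (hypothetical helper).

    style="css"   -- link inside an element set to display:none
    style="image" -- small image link, as suggested earlier in the thread
    nofollow      -- add rel="nofollow" for the extra-cautious case
    """
    href = escape(url, quote=True)
    rel = ' rel="nofollow"' if nofollow else ""
    if style == "css":
        return (
            f'<div style="display:none">'
            f'<a href="{href}"{rel}>contact</a></div>'
        )
    if style == "image":
        return (
            f'<a href="{href}"{rel}>'
            f'<img src="example.jpg" alt="Not a link - Do not click"></a>'
        )
    raise ValueError(f"unknown style: {style}")


url = "http://www.mysite.org/cgi-bin/filename.cgi"
print(hidden_link(url))
print(hidden_link(url, style="image", nofollow=True))
```

Leaving `nofollow` off by default matches the preference stated above, since search engine visits to the honey pot are useful as a baseline.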

Since this is a question that several people have been asking recently, I'm going to redouble my efforts to get in touch with the folks at Google. If it makes you feel any better, we use hidden links for our honey pots all over this site and the Unspam corporate site. In the last 4 months, since the launch of the Project, our PageRank for the Project site has increased steadily, and the Unspam site has also increased. If we notice any downward trend in our own rankings we'll be the first ones to put out the alert and make sure the search engines aren't punishing those folks who are trying to clean up the web.

Thanks for the feedback. I'll let you know if I can get in touch with anyone at Google, etc. If anyone has a contact they'd be willing to share, send us feedback through the Contact Us form.

Re: How are spiders to find these pages?

Author: S.Stewart (7 Feb 05 6:43pm)

Great stuff, M. Prince. Again, respect, this is a very cool project you have underway.

I'm not sure where the linking suggestions were, maybe the activate part...? Trying to remember it now, probably that <a href=""></a>, with no anchor text, is the only one that is truly dodgy. Perhaps you could strip that one out of the examples. The two gifs in the "Extras" could be best for a lot of people to use, in lieu of anchor text. (I briefly thought of using text like "Project Honey Pot", but then realized that the harvesters might filter for it).

Anyway, excellent, man. I'll put it on our other site tomorrow.

Stefan

Copyright © 2004–25, Unspam Technologies, Inc. All rights reserved.