Message Board

Bugs & Development

 generated honeypot pages
Author: S.Goodman   (20 Feb 05 4:46pm)
I have one question and one comment on the pages generated by the honeypot script. The one on my site is the PHP version.

The question concerns the nature of the generated local-parts of the mailto: addresses that I assume you want harvested. The addresses appear to be straight alpha text made of English-sounding nonsense words, so how do you correlate an address to a harvester IP and time without access to my log files? Feel free to answer off-line if this is sensitive information. Another aspect of this is that when I intentionally visit the generated page, view the source to see the generated address, close the browser and do it again, I get the same generated address. Granted, I am visiting it from the same IP, but the visit time is different. What's going on?
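For illustration, a purely deterministic scheme along the following lines would produce exactly the behavior I'm describing: the local-part depends only on the visitor's IP (plus a per-site secret), so repeat visits from one IP yield one address, and the central server can recompute the IP when spam later arrives. This is only my guess at one possible design (the secret, the syllable mapping and the domain below are all invented), and for all I know the server simply records the (address, IP, time) tuple when the page is generated.

<?php
// Hypothetical sketch only; Project Honey Pot has not published its
// actual encoding. The local-part is a keyed hash of the visitor's IP
// mapped onto syllables so it reads as English-sounding nonsense.
$secret = 'per-site-secret-issued-at-signup';   // invented for this sketch
$ip     = $_SERVER['REMOTE_ADDR'];

// Keyed hash of the IP; the server can recompute the same value later.
$digest = hash_hmac('sha1', $ip, $secret);

// Map the first six hex digits onto syllables.
$syllables = array('ba','de','fi','go','hu','ka','lo','mi',
                   'no','pe','ra','su','ta','vu','we','zo');
$local = '';
for ($i = 0; $i < 6; $i++) {
    $local .= $syllables[hexdec($digest[$i])];
}

echo '<a href="mailto:' . $local . '@example.com">'
   . $local . '@example.com</a>';
?>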

The comment I have is that the generated page also contains the Project Honeypot logo and other non-variable text, making it easy to identify the page as one of this project's honeypots. This would seem to make all the "word salad" text inserted into the legal part of the page useless as an obfuscation tool. I would propose that there is no real reason to put the logo, the accompanying text "MEMBER OF PROJECT HONEY POT Spam Harvester Protection Network provided by Unspam" and the Project Honeypot URL on the page. While attribution and advertising are great, those are dead giveaways. Since legitimate robots do _not_ catalog the honeypot pages and humans will rarely, if ever, see them, the logo, text and link do nothing for your search ranking or project awareness while acting as an easy marker for a honeypot.

I might suggest that you provide the logo and link separately, to be placed on a non-honeypot page on the site, preferably the home page. That would improve your search ranking, be seen by human visitors and make the honeypot page harder for harvesters to recognize.

Post Edited (20 Feb 05 3:57pm)
 
 Re: generated honeypot pages
Author: S.Goodman   (20 Feb 05 6:00pm)
As a follow-up to the comment I made in the previous post in this thread, my general concern is making it harder for harvesters to distinguish the script-generated addresses from real ones. I suspect that something is not working as it should, given the very high ratio of harvested addresses to spam received at those addresses. I know there is often a delay, but my sense is that the spam is arriving too slowly, and I am afraid that many harvesters have learned to identify this script.

I'm sure that the legal contract is necessary for you to eventually go after spammers who use harvesters, or you wouldn't have put it there. My question is whether it is necessary to have the generated addresses on the same page. In other words, is it sufficient for the legal contract to exist anywhere on the site in order to protect the whole site?

If posting the contract on a site does protect the entire site, it would be possible to have calls to honeypot address scripts on regular pages, mixed in with regular content. I wouldn't mind putting a visible link to the legal contract page on the site home page, perhaps labelled "Terms and Conditions". The contract page could then be static, non-obfuscated and human-readable. Having curious visitors read the contract would advance the anti-spamming cause in general and Project Honeypot in particular. For that matter, we could even add some standard language to the contract that site owners would generally like and could modify to suit their purposes. For example: the site owner owns the site contents; copying, distribution or incorporation into derivative works is permitted only with written permission from the site owner; the site may be linked to only in a manner specified by the site owner; etc. Having boilerplate protective language that had been gone over by a lawyer would be a good inducement to get people to post this contract on their sites.

The script that generates honeypot addresses could then go anywhere on the site. We might wish to put robot exclusion commands around the script call to prevent indexing of the generated addresses, but that could be part of the code that people install (see the sketch below). You would still get your google-bot heartbeat, but the generated address would not be indexed. For that matter, is it really important to keep the address out of a search engine's index? IANAL, but an addition to the contract might be able to cover the case of addresses harvested from a search engine that got them from a site covered by the contract. The site content is our property, so it seems reasonable that the search engine's index would be considered a derivative work on which we could still assert restrictions as to permissible use. This is just a layman's guess, so maybe one of the lawyers could comment.
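Something like the following is what I have in mind (the file name, hidden styling and robots.txt line are invented, not an official installation method): the regular page carries a link to the honeypot script that well-behaved robots are told to skip, while harvesters, which routinely ignore robots.txt, follow it and receive a trap address.

<?php
// Sketch only: a honeypot link embedded in a regular page.
// In robots.txt:   Disallow: /cgi-bin/honeypot.php
// Legitimate crawlers skip the link; harvesters ignore robots.txt,
// follow it, and are served a freshly generated trap address.
?>
<p>Regular page content goes here...</p>
<a href="/cgi-bin/honeypot.php" rel="nofollow"
   style="display:none">do not follow this link</a>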

If feasible, this might accomplish two desirable things. First, if the visible, non-obfuscated legal contract becomes a sufficient deterrent to make harvesters avoid a site, that's a win. If that happens, it should be fairly easy to publicize this and get people to post the contract (and sometimes the honeypot script) on many sites. If, on the other hand, harvesters aren't scared away by the contract, they have no way of discerning real addresses from generated ones. Well, they could, for example, by repeatedly visiting the site and discarding addresses that change too frequently, but that would just be the next step in the arms race, and we can deal with it.

Any thoughts?
 
 Re: generated honeypot pages
Author: A.Daviel   (21 Feb 05 12:50pm)
I totally agree - the invariant text (below, from "lynx -dump") is far too easy for harvesters to identify and filter out. It should be part of the word salad, with perhaps a ubiquitously worded "terms and conditions" link.

(Hey, by just putting the text "projecthoneypot.org" on a page, can we protect it from harvesters?? Cool!)


@ [2]MEMBER OF PROJECT HONEY POT
[3]Spam Harvester Protection Network
provided by Unspam

References

2. http://www.projecthoneypot.org/
3. http://www.unspam.com/

BTW, the word salad in the honeypot page changes per site, but is (I think) invariant when decoded by a standards-compliant, non-scripting parser. Since Web robots (and possibly harvesters) convert pages to text before indexing, there may be more "invariant" text on the page than just the logo etc.
 
 Re: generated honeypot pages
Author: M.Prince   (21 Feb 05 3:12pm)
I posted something substantially similar to what's below in another forum, but it's relevant here as well.

In terms of the static text on the honey pot pages, there is some method to our madness. Let me explain....

First, it's possible for us to pass an instruction to the page from our servers that will hide or replace the logo and just about any other element of the page. This means that if spiders begin to filter on any page that links to Project Honey Pot we can simply turn that element of the honey pot pages off, or replace it with something different.

Second, we're actually trying to bait spiders somewhat. We wanted to create something that was easy for them to filter on and would be easy for legitimate websites to include on their pages. We do that not only with the logo/links at the bottom, but also the no-email-collection meta tag at the top of the page. If we can get spider authors to build into their code instructions that filter on these, then any website owner can include them on their pages and be safe from spiders. And, again, we can alter the honey pot pages to randomize even that element.
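To sketch the idea (the meta tag attributes below are illustrative, so check the source of a generated honey pot page for the exact form, and the random toggle merely stands in for the server-sent instruction):

<?php
// Sketch of the two "bait" elements described above. The meta-tag
// attributes are an assumption; check a generated honey pot page for
// the authoritative form. The random toggle stands in for the
// server-sent instruction that can hide or replace the logo.
// (The meta tag belongs in <head>, the logo link in the page footer.)
$show_logo = (mt_rand(0, 1) === 1);   // stand-in for the remote flag

echo '<meta name="no-email-collection" content="/terms.html">' . "\n";

if ($show_logo) {
    echo '<a href="http://www.projecthoneypot.org/">'
       . 'MEMBER OF PROJECT HONEY POT</a>' . "\n";
}
?>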

There is no doubt that this will become an arms race, but I like that the spiders are on the defensive here. As soon as they start building in code to avoid certain pages I think we've begun to win this battle. Making spiders selective -- creating "false positives" for them -- would be a huge victory for the Project.

We've already begun to randomly turn off the logo at the bottom of the page and track whether that has any effect on whether an address is harvested. As soon as we begin to notice a significant statistical difference, we'll make the information known so webmasters can exploit it. Along the lines of S.Goodman's suggestion, if you want to include the grey box that's at the bottom of the Project Honey Pot pages on your own site, the HTML for it is available here:

http://www.projecthoneypot.org/how_to_avoid_spambots_5.php

We haven't seen any robots used by harvesters that actually take the time to decode the HTML on the page. In our tests, even Google (with resources infinitely greater than those of even the most prolific spammers) mis-indexes the pages horribly. Even if you had an HTML parser built into your harvester, you'd need to write a Project Honey Pot-specific rule to teach it that columns next to each other in a table should be read across, rather than down. That's a challenging, computationally intensive project, and, best of all, it will inevitably create "false positives" where non-honey-pot pages are misinterpreted as honey pots. If we can force the spammers to play defense, we'll consider it a big victory.
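To make "read across, rather than down" concrete, here's a toy illustration (not our actual markup): alternate words of a sentence go into adjacent table columns, so a human reading the rendered rows sees the real sentence while a parser that walks the cells in document order recovers word salad.

<?php
// Toy illustration of table-based obfuscation (not the actual honey
// pot markup). A human reads the rendered columns across: "Visitors
// agree not to harvest email addresses from this site." A naive
// extractor reads cell by cell and gets:
// "Visitors not harvest addresses this agree to email from site".
$words = explode(' ', 'Visitors agree not to harvest email addresses from this site');

echo "<table><tr>\n";
for ($col = 0; $col < 2; $col++) {
    echo "  <td>";
    for ($i = $col; $i < count($words); $i += 2) {
        echo $words[$i] . '<br>';
    }
    echo "</td>\n";
}
echo "</tr></table>\n";
?>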

All that said, we recognize that the more variation between honey pots the better. As a result, we're also working on the 0.2 version of the script as I type this. That version will use not only text-based obscuring, but also an image version of the legal text. We want to keep installation as simple as possible, so getting the image version to work is going to be a bit of a trick, but we're on the case and will send out a notice when it's available.
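As a taste of what an image version might involve, here's a minimal sketch using PHP's GD extension (purely illustrative; it bears no relation to the actual 0.2 implementation):

<?php
// Minimal GD sketch of rendering legal text as an image -- purely
// illustrative, not the actual 0.2 implementation.
header('Content-Type: image/png');

$img = imagecreatetruecolor(480, 40);
$bg  = imagecolorallocate($img, 255, 255, 255);
$ink = imagecolorallocate($img, 40, 40, 40);
imagefill($img, 0, 0, $bg);

// Built-in font 3 keeps the sketch free of external font files.
imagestring($img, 3, 8, 12, 'Visitors agree not to harvest email addresses.', $ink);

imagepng($img);
imagedestroy($img);
?>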

Hopefully this makes some sense. Keep the feedback coming! And thanks for your help with the Project.
 
 Re: generated honeypot pages
Author: A.Daviel   (24 Feb 05 1:05am)
Thanks for the explanation. I had not understood that what I was seeing in "elinks" was tables rather than whitespace. Cool. I note your current script is more sophisticated than the "example page".
I have added your meta tag to my not-very-well-maintained-recently metatag dictionary.

Andrew

- I found your conference presentation online in RealVideo; good job :-)
 
 Re: generated honeypot pages
Author: M.Prince   (24 Feb 05 3:09am)
Yeah, the example page is just a general outline of what the pages look like. We don't want to give up all our secrets too easily..... ;-)

Thanks for the kudos on the MIT presentations. Hopefully more to come.....
 
 Re: generated honeypot pages
Author: P.Simpson2   (26 Jul 15 10:26am)
I use CloudFlare and have activated CloudFlare's automatic JavaScript obfuscation of email addresses. But the effectiveness of the honeypot modules is greatly reduced if their embedded email addresses cannot be read by the majority of email harvesters.

Take a look at:
https://www.cloudflare.com/a/content-protection/abtuk.org.uk#email_obfuscation
This lists several situations in which CloudFlare will not obfuscate an email address. Can I suggest you wrap one of these methods around the email address(es) in the honeypots you provide, perhaps the method that uses HTML comments featuring "email_off" and "/email_off"?
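In concrete terms, something like this (the address is just a placeholder for whatever your script generates):

<?php
// Wrap the trap address in CloudFlare's email_off comment markers so
// its scrape-shield leaves the address readable to harvesters.
// $trap_address is a placeholder for the script-generated address.
$trap_address = 'generated-nonsense-words@example.com';

echo '<!--email_off-->'
   . '<a href="mailto:' . $trap_address . '">' . $trap_address . '</a>'
   . '<!--/email_off-->';
?>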


