
Message Board

Bugs & Development


recognizability of Project Honeypot MX's

Author: S.Goodman (12 Mar 05 3:01pm)

This question has to do with the fact that even though you go to great lengths to obfuscate the fact that the page is a honeypot, with a delegated MX, the IP to send spam to is one of a small group that could be relatively easily discovered.

The motivation behind the question is that although my honeypot has been up and active for a while, and addresses have been handed out to numerous bots, including one confirmed harvester, there has yet to be a single spam received for one of the harvested addresses. I know from previous experience that, until my provider started aggressively blocking at the MX, I used to receive hundreds of spam per day for each user in my domain. In addition to individual users receiving high volumes of spam, which can be attributed to use of their addresses on public forums, the domain info@ address, which was never used in emails on a public list, received the largest volume of spam. Therefore, spam to the info@ address _must_ have been the result of harvesters.

Despite this, I am concerned that not a single spam has come from any of the addresses handed out by my honeypot. One possibility, that we have discussed at length, is that the harvesters can recognize Project Honeypot pages. The discussion of this leads to the likely conclusion that while this is technically very difficult, though not impossible, it is very unlikely at this early stage. Another possibility is that spammers don't use harvested addresses for a long time after harvest. While others have observed this before, I would suggest this goes against common sense. Email addresses are ephemeral and people often change them. It would make little sense for a spammer to sit on freshly harvested addresses, which have the best chance of being active accounts.

The possibility that concerns me most is that the set of MX's that we have delegated our honeypot subdomains to is too small and thus discoverable. The overall stats for the project show a lower number of spam than I would expect from the number of addresses handed out. Though many of these addresses went to legitimate bots, a few spammers' bots should account for a _large_ amount of spam, and that has not happened.

As an experiment, if you haven't already done this, I would be willing to pick up an email hosting account and delegate a newly created honeypot subdomain to it. This account would be used for no other purpose and would forward all received mail to the Project Honeypot servers. If this account starts receiving spam, this would be an indication that something is rotten in Denmark. I realize that as a general model, this is more dangerous for Project Honeypot, as you have to trust that people actually don't use these accounts for anything else. With a delegated MX for a sub-domain, you are in control.

Re: recognizability of Project Honeypot MX's

Author: M.Prince (14 Mar 05 3:58pm)

I agree with your concerns. I too am surprised at the relatively small amount of spam we have received. I am encouraged, however, that the trend in the spam has continued upward. Behind the scenes, we're doing a lot of things to test whether our default configuration is being somehow recognized or avoided. My biggest fear is actually not the domains with which we create email addresses, nor the smaller set of addresses to which we instruct individuals to point their MXs, but rather our limited pool of IP addresses. While our tests, so far, show that there is no statistical evidence that any portion of the mail delivery chain is being avoided, I fear that at some point it will be.

That's a mixed blessing. At some level, if we could get spammers to avoid something like the IP space we have been using, then we could simply have our members forward their legitimate mail through that space and we'd have a good first-level spam filter. Of course, for the Project to remain successful, we need to come up with a more robust solution. To that end, we've approached companies with large IP space (like Akamai and large ISPs) to see if there's a way we can work together.

We're in the midst of creating the v.2 of the honey pot software. That software will allow experimentation with a number of other tests. For example, your honey pot may have handed out an address with a country code top level domain (e.g., bob.smith@hpot.example.br). It could be that U.S.-based spammers regularly avoid foreign-looking domains, or vice versa. Our current setup doesn't allow us to easily track and test this information, but v.2 is built to answer exactly these questions.

The other thing to be aware of is how much longer the time from harvest to first email is than we expected. The average is posted on our statistics page and currently sits at over 18 days. However, almost all the messages that come in faster than 18 days turn out to be phishing or Nigerian fraud scams. If you take those out of the mix, the time for a traditional spammer to go from harvest to first message is well over a month, and the longest we've seen is over 120 days -- virtually the length the Project has been online.

I agree that the amount of time this takes seems surprising, and we continue to test our systems to see if there's a weakness being exploited, but for now it seems the most likely explanation is that 1) the time from harvest to first message is surprisingly long, and 2) while there are a lot of harvesters running around, the number that are actually responsible for the great VOLUME of spam is surprisingly small. I think that, as our addresses make their way onto "100 Million Addresses for $19.95" CDs, we'll start to see a massive increase in the amount of spam we receive each day. And, in the meantime, we continue to test our systems in exactly the ways you suggest.

Thanks for your offer of help. We'll send out notice as the new version of the honey pot software comes online. Please keep thinking of ways in which we can help hide the identity of the spamtrap addresses and the honey pots themselves as those are problems that seem critical to our continued success.

Re: recognizability of Project Honeypot MX's

Author: T.Brolin (11 Apr 05 1:24pm)

The results of this project are indeed surprising. I expected a fairly fast acceleration in the amount of received spam.
I have logged some stats from the stats page, and as far as I can tell the acceleration is something like 1% per day.
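
To get a feel for what "about 1% per day" could mean, here is a minimal sketch. It assumes the figure is a compound daily growth rate in spam volume; the post does not say how it was measured, so this is only one possible reading. The function names and the starting volume of 100 messages per day are made up for illustration.

```javascript
// Assumption: daily spam volume compounds at a fixed rate per day.
// This is one reading of "1% per day", not a measured model.

// Projected messages per day after `days` days of compound growth.
function projectDailyVolume(startPerDay, dailyRate, days) {
  return startPerDay * Math.pow(1 + dailyRate, days);
}

// Number of days for the daily volume to double at the given rate.
function doublingTimeDays(dailyRate) {
  return Math.log(2) / Math.log(1 + dailyRate);
}

// Hypothetical starting point of 100 messages per day.
console.log(doublingTimeDays(0.01).toFixed(1)); // about 69.7 days
console.log(projectDailyVolume(100, 0.01, 30).toFixed(1));
```

Under that reading, volume would double roughly every ten weeks, which is growth but hardly an explosion.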

Re: recognizability of Project Honeypot MX's

Author: M.Prince (11 Apr 05 1:54pm)

Yeah, I'm surprised too at how slowly the spam volume has increased. We built the infrastructure to handle thousands and thousands of messages from day one. We've received only a trickle. And, while it's trending upward, it's doing so much more slowly than I would have guessed.

Here are some things we've tested and believe we can say with statistical confidence. First, the construction of the spamtrap addresses themselves does not make a difference in whether they receive spam messages. In other words:

bob.smith@example.com

is as likely to get spam as:

orangeelephant34@imahpot.example.com

We've looked at the data broken down by the username, the domain name, whether the domains are two-level, three-level, or more.... regardless, it appears the same statistical percentage of addresses are sent to any of the subgroups as to the whole.

Initially we had a grey box at the bottom of every honey pot page with some text and links to the Project Honey Pot site. We tested a number of harvesters and found that some, when set in the "avoid honey pots" mode, would avoid pages with that box. That's great news if you're running a legitimate site as it gives you a way to avoid receiving spam at your current addresses, but it potentially compromised the value of the hpots. As a result, we turned off the grey box most of the time and got about a 15% increase in the number of harvesters identified per period of time. Not a huge bump up in spam, but some.

We've tried turning off the legal text to see if harvesters are scanning on that somehow, and that has made no statistically significant difference in what addresses are picked up. We've also tracked individual addresses. While many appear to receive one message and never receive a second (typically the fraud-based messages like phishing or Nigerian 419 scams), if a spamtrap address receives, say, 5 messages then it appears to continue to receive messages at a regular clip. In other words, it doesn't look like spammers are, after the fact, culling spamtraps from their lists.

We do notice that harvesters target bigger sites much more than smaller ones. Therefore, we are working to get more hpots installed on high-traffic sites. Soon we'll be launching a free service for ISPs and businesses to monitor their IP space for spammers and harvesters in exchange for their installing a honey pot somewhere. We're also going to be launching the http:BL service, which will allow you to block access to known harvesters if you're an active member of the Project. Hopefully these services will increase the number of installed honey pots, which, in turn, may increase the amount of spam and harvester traffic.

I'm encouraged that our top spam-sending countries line up with the top spam-sending countries reported by major anti-spam vendors. That's some indication that our sample is somewhat representative. I'm also stunned that almost 5.75% of traffic to honey pots turns out to ultimately be caused by harvesters. That's a much higher percentage than I ever would have guessed, and makes me believe that it's not likely that harvesters are avoiding our pages.

My working hypothesis is that real spam volume to an address only occurs once an email address makes its way onto some of these "100M Email Addresses for $19.95" CDs. Right now it doesn't appear many (if any) of our addresses have made it there. That could be that we just haven't been online long enough. It could be that somehow the CD makers are removing our hpots before adding them.

The process of how address volume builds is hidden from the typical user since it's hard to distinguish individual spammers from each other if you're receiving all your mail at one address. For the most part, it appears that most of our addresses only receive one or two messages per day. In other words, most of the spammers we're seeing aren't completely bombarding all the addresses they have with zillions of messages, but instead trying to limit the messages they send to one or two per day. This makes business sense, if you think about it. It could be that spammers only start trading lists with one another after they feel like they've totally used up their value. Maybe we haven't hit that point yet. I can't imagine it's too far off, and that's when I'd expect the real volume will begin.

....then again, I've been saying this for a while now. So if anyone else has an idea, let us know....

Re: recognizability of Project Honeypot MX's

Author: M.Prince (11 Apr 05 2:05pm)

One more thing that's been interesting is that we've begun downloading and testing commercially available harvesting software to see if we can learn anything from it. We have -- for example, as I mentioned above, the "avoid spamtraps" option some harvesters have and what we needed to do to avoid it. (Turns out, just including the terms "honey pot" or "spamtrap" somewhere on your page -- even in the non-printing <head> section -- is enough to cause "smart" harvesters to avoid the addresses displayed on your page. They were avoiding our box not because they were specifically afraid of the Project, but because it had the term "spam harvester" in the text.)
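
For anyone who wants to check their own pages, here is a rough sketch of the kind of keyword scan described above. The term list comes from this post; how real harvesters actually match is not known here, so the case-insensitive substring search and the names `TRIGGER_TERMS` and `findTriggerTerms` are assumptions for illustration only.

```javascript
// Terms reported in this thread as causing "smart" harvesters to skip a page.
// Real harvesters may use different or longer lists (assumption).
const TRIGGER_TERMS = ["honey pot", "spamtrap", "spam harvester"];

// Returns the trigger terms found anywhere in the raw HTML,
// including non-printing parts such as the <head> section.
function findTriggerTerms(html) {
  const haystack = html.toLowerCase();
  return TRIGGER_TERMS.filter((term) => haystack.includes(term));
}

// Hypothetical page: the term appears only in a <meta> tag in the <head>.
const samplePage =
  '<html><head><meta name="description" content="A Spamtrap test">' +
  "</head><body>bob.smith@example.com</body></html>";
console.log(findTriggerTerms(samplePage));
```

Because the scan runs over the raw markup rather than the rendered text, a term hidden in the head section is caught just like one in the visible body.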

A scary feature that many of these harvesters have is to harvest exclusively off the Google cache. We specifically tell Google not to cache our honey pot pages. As a result, a spammer harvesting off the Google cache would avoid hitting any of our traps. I have a feeling a number of spammers are using this feature, which is why we may not be seeing them.

This is currently bad news, but it does create a choke point in the harvesting process. If Google wanted to, they could control this process by either restricting searches, inserting honey pots in their own cache, or blocking the IPs of known spam harvesters.

One more general solution which may make sense is a version of the rel=nofollow tag that Google and the other search engines recently introduced. What if they created a similar rel=nocache tag? That way, on a tag-by-tag basis you could designate what information is allowed into the Google cache. You could mark personal information from a site as being non-cacheable with a tag like:

<div rel=nocache>
Bob Smith
121 Mockingbird Lane, Apartment 21
Oklahoma City, OK 51423
443-987-6543
bob.smith@example.com
</div>

Or even specific email addresses:

<a href="mailto:bob.smith@example.com" rel="nocache">bob.smith@example.com</a>

Publishers have a right to control how the information they put online is copied and redistributed. It doesn't necessarily have to be true that these rights are specified only on a page-by-page basis. It seems like it may make sense to allow a finer distinction, especially for problems like harvesting email addresses and other personal information from the Google cache.

Re: recognizability of Project Honeypot MX's

Author: C.Dijkgraaf (12 Apr 05 1:50am)

Another thought on stopping your emails being cached: if you are using JavaScript to munge them anyway, then you can check document.location to see whether the JavaScript is being called from your site or is running from a cached page. (Google's cache sets the base to
<BASE HREF="http://yourdomain/yourpage"> so any included JavaScript will still run.)
If the document.location doesn't match your page, then don't display the e-mail address.
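
A minimal sketch of that check might look like the following. The hostname `www.example.com`, the element id `contact-email`, and the function names are hypothetical placeholders, and comparing hostnames is just one way to decide whether document.location "matches your page".

```javascript
// Hypothetical value: replace with the hostname your page is really served from.
const EXPECTED_HOST = "www.example.com";

// True when the script is running on the expected host rather than a cache copy.
function isOnOwnSite(currentHost, expectedHost) {
  return currentHost.toLowerCase() === expectedHost.toLowerCase();
}

// Assembles the address from its parts only when served from the real site;
// returns an empty string otherwise, so a cached copy shows nothing.
function buildAddress(user, domain, currentHost, expectedHost) {
  return isOnOwnSite(currentHost, expectedHost) ? user + "@" + domain : "";
}

// Browser-only part: skipped when there is no document (e.g. under Node).
if (typeof document !== "undefined") {
  const address = buildAddress(
    "bob.smith", "example.com", document.location.hostname, EXPECTED_HOST);
  const slot = document.getElementById("contact-email"); // hypothetical element id
  if (slot && address) {
    slot.textContent = address;
  }
}
```

Keeping the comparison in a plain function makes it easy to try out without a browser. A harvester that reads the script source could still reassemble the address, so this only helps against tools that scrape the rendered cached page.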

Re: recognizability of Project Honeypot MX's

Author: T.Brolin (12 Apr 05 3:31pm)

One possible explanation for the lack of acceleration of spam per address might be that spammers, contrary to what we currently think, generally do not trade e-mail lists with each other.
Yes, I know about those 100M addresses for $19.95 CDs, but does anyone actually buy them? And do they contain good addresses?
This would explain everything: since one honey pot address equals one spammer, the volume of spam sent to that address will not increase unless it is traded on some list.

Normal addresses, on the other hand, can see increases in spam volume because they are harvested by more and more spammers.

I can personally say that the volume of spam sent to my own address has not increased the past few years. It has stayed pretty much steady at about 100 spams/scams a day. This would actually be the case if the number of active spammers is limited, and they all already got my e-mail address :-)

You say you think the number 5.75% harvester traffic on honey pots sounds high. I beg to differ... Bots are pretty much the only traffic to honey pots. And how many legitimate bots are there? There is Google and the other search engines. There is also Netcraft and other statistical organisations, but then what? I actually expected there to be more spammers than search engines in the world... how about you?

One thought... Why not allow Google to cache honey pots? We will know if a honey pot is cached by Google, so there is no risk of contaminating the data gathered.
The benefit is obvious. While we can't identify the harvester, and we run a (very slightly) higher risk of receiving legitimate e-mails, we can still track the spammers based on the e-mails received by the Google-cached honey pots.
The only problem I see is how to identify the less common search engines. Those could cause contamination of the data gathered... Would it be possible to allow only Google, and no other search engine, to cache honey pots?

Re: recognizability of Project Honeypot MX's

Author: M.Prince (12 Apr 05 8:55pm)

I think you're right: either address trading is less common than we previously suspected or it is a slower process than we previously suspected, at least for harvested addresses. That doesn't completely explain the slow rise in the volume of spam, since you'd expect that, even if there are only 100 spammers in the world, they'd continue to hit different honey pots and pick up more, unique spamtrap addresses.

We're working on v.2 of the honey pot scripts. One of the things we are going to begin experimenting with is occasionally turning off the nocache meta tag to allow Google and other search engines to cache the page.

I agree that there are more spammers/harvesters than there are search engines, but I am pretty sure that the resources Google, Yahoo, and Microsoft put toward indexing the Internet are substantially larger than those of spammers. The 5.75% number is a percentage of total visits, not a percentage of unique useragents or even unique IPs. Given how many honey pot visits can be accounted for by Google, Yahoo, and Microsoft (more than half of all honey pot trap visits appear with one of their useragents), and that there are a bunch of other kinds of both legitimate and non-legitimate non-harvester robots running around, I'm still surprised that so many of the visits are harvesters.

One of the other things we're working on for v.2 of the script is a way to track other bad robot behavior. We hope to be able to watch for comment spammers, robots that ignore robots.txt, robots that fetch pages extremely quickly, etc.... With this data it will be interesting to see, for example, if there's overlap between robots that are doing harvesting and those that are doing comment spamming.
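
As a rough illustration of one of those signals, robots that fetch pages extremely quickly, here is a minimal sketch. It is not the v.2 script: the visit record shape (`ip`, `timeMs`), the function name, and the thresholds are all assumptions chosen for the example.

```javascript
// Flags IPs that made more than `maxRequests` requests within any
// sliding window of `windowMs` milliseconds.
// `visits` is an array of { ip, timeMs } records (hypothetical log format).
function findFastFetchers(visits, maxRequests, windowMs) {
  // Group request timestamps by IP address.
  const byIp = new Map();
  for (const { ip, timeMs } of visits) {
    if (!byIp.has(ip)) byIp.set(ip, []);
    byIp.get(ip).push(timeMs);
  }

  const flagged = [];
  for (const [ip, times] of byIp) {
    times.sort((a, b) => a - b);
    let start = 0;
    for (let end = 0; end < times.length; end++) {
      // Shrink the window until it spans at most windowMs.
      while (times[end] - times[start] > windowMs) start++;
      if (end - start + 1 > maxRequests) {
        flagged.push(ip);
        break;
      }
    }
  }
  return flagged;
}

// Hypothetical log: one client fetching five pages in under a second,
// another fetching two pages ten seconds apart.
const sampleVisits = [
  { ip: "192.0.2.1", timeMs: 0 }, { ip: "192.0.2.1", timeMs: 100 },
  { ip: "192.0.2.1", timeMs: 200 }, { ip: "192.0.2.1", timeMs: 300 },
  { ip: "192.0.2.1", timeMs: 400 },
  { ip: "192.0.2.2", timeMs: 0 }, { ip: "192.0.2.2", timeMs: 10000 },
];
console.log(findFastFetchers(sampleVisits, 3, 1000)); // [ '192.0.2.1' ]
```

The same list of flagged IPs could then be compared against IPs already identified as harvesters to look for the overlap mentioned above.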

We'll let everyone know when the new version of the script is available to download. We'll continue to support v.1, and will do everything we can to make upgrading as simple as possible.

Re: recognizability of Project Honeypot MX's

Author: T.Brolin (13 Apr 05 4:20pm)

"even if there are only 100 spammers in the world, they'd continue to hit different honey pots and pick up more, unique spamtrap addresses."

If the same harvester is hitting the same honey pot several times then we should see some acceleration, but not necessarily if they are smart enough to harvest a given website only once, or at least wait a significant amount of time before harvesting it again. If they do it like that, and they limit the amount of spam they send to any individual address, then the spam acceleration should be low.

Re: recognizability of Project Honeypot MX's

Author: T.Brolin (29 Apr 05 3:52pm)

I did some calculations.
If the current trend keeps up, then by the end of this year you will have received over 80000 spam mails.
And by the end of the next year you will have received over 300000 spam mails.

Not too bad.

Re: recognizability of Project Honeypot MX's

Author: C.Kruslicky (30 Apr 05 1:13pm)

I don't know if this indicates anything, but I had 3 harvesters visit in the first couple of months, then none at all for the following few months. Too small a sample to draw any conclusions, but maybe it supports the notion that harvesters wait a bit, or maybe it indicates my honeypot page has been ruled out by harvesters? Being ruled out seems unlikely for such a small site as mine. I do check the Apache logs from time to time though, and the only visits are from search engines lately, so it's not a matter of harvesters deciding after getting the email addresses.

Re: recognizability of Project Honeypot MX's

Author: T.Brolin (2 May 05 2:09am)

C.Kruslicky: That is just an effect of the way the honeypot servers assign addresses to you. You are assigned addresses in batches, and the stats are not recorded until your honeypot has used up all the addresses in one batch. From what I understand, the usual batch size is something like 3-5 addresses.

Re: recognizability of Project Honeypot MX's

Author: A.Degives Mas (11 May 05 9:03pm)

Just a few observations on something mentioned in passing by M. Prince, regarding comment spammers. That phenomenon is certainly a scourge among (mostly) bloggers, but I believe there's a different dynamic or "spamming business model" underlying comment spam compared to the harvester types.

Comment spammers are mainly trolling for search bots, hoping to boost the page rank -- in search engines -- of the sites whose URIs they're injecting into comments.

Undoubtedly, there's a strong relation between "traditional" spammers and comment spammers, in that the idea is to get poor souls visiting the comment spam-advertised sites to provide their e-mail address and subsequently spam them to smithereens.

I'm less certain about the joint interests among comment spammers and spam harvesters; after all, a comment spammer leaves fake addresses to post their URI, and a bad address is not what a harvester is looking for.

Having said that, I'm both pleasantly surprised and much encouraged that comment spammers are also getting in Project Honeypot's cross-hairs (it's probably no surprise that I run a blog inundated with comment spam, no? If I'd get a penny for every comment spammer trying to get into my blog I'd be *really* rich...)

Apologies for diverting from the main topic here of recognizability of "our" honeypots!

Re: recognizability of Project Honeypot MX's

Author: C.Dijkgraaf (11 May 05 11:41pm)

Comment spammers aren't just a scourge for bloggers, they also hit guestbooks. I've noticed two varieties of spammers in this regard:
1) Nigerian scammers, who leave a comment in the guestbook. Usually it will have MUGU, TOGO, LOME or MUMU entered in one or more of the address, name or location fields, and they usually arrive in your guestbook via a search engine search including the terms guestbook and the current year (e.g. 2005).
They visit guestbooks looking for e-mail addresses. Why they leave an entry I don't know for sure, but maybe it is so they know where they have been recently.
2) Link spammers, who leave links to commercial sites, usually sex, gambling or pharmacies, aiming of course to get a better ranking in search engines.


Copyright © 2004–24, Unspam Technologies, Inc. All rights reserved.