Spam stats are starting to take off.

Author: T.Brolin (31 Aug 05 5:28pm)

If I understand the statistics page correctly, there are 5.1 messages sent to the average hpot address.
But only 7.22% of all the addresses were ever harvested by spammers; the rest of the addresses were fetched by innocent search engine spiders.
So, if my math is not too far off, this means that the average spam harvester visit has resulted in about 70 spam messages.
I think this sounds about right.
The massive amounts of spam some of us get on our private e-mail addresses are simply the result of those addresses being harvested multiple times by multiple harvesters.
A hpot address on the other hand is only harvested once.
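For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the two figures quoted from the statistics page, and it assumes that addresses fetched only by search engine spiders receive no spam at all. The function and variable names are made up for illustration.

```python
# Figures quoted from the statistics page in the post above.
avg_messages_per_address = 5.1   # spam messages per hpot address, averaged over ALL addresses
harvested_fraction = 0.0722      # share of addresses ever picked up by a harvester


def messages_per_harvested_address(avg_per_address: float, harvested_share: float) -> float:
    """Average spam per harvested address.

    Assumption: unharvested addresses receive no spam, so all of the
    messages are spread over the harvested share only.
    """
    if not 0 < harvested_share <= 1:
        raise ValueError("harvested_share must be in the range (0, 1]")
    return avg_per_address / harvested_share


estimate = messages_per_harvested_address(avg_messages_per_address, harvested_fraction)
print(f"about {estimate:.1f} messages per harvested address")  # about 70.6
```

Since a hpot address is only harvested once, the per-address figure is also the per-harvest figure, which is where the "about 70" comes from.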

Re: Spam stats are starting to take off.

Author: M.Prince (11 Dec 05 9:46pm)

Those numbers are about right looking at all the data. We've seen a slow but constant rise in the volume of total spam the project receives. However, there appears to have been a lot less address sharing between spammers than we had originally guessed.

Another interesting point that doesn't come through in the stats -- although we're working on how to quantify it -- is that we are getting a disproportionately high level of phishing spam compared with a lot of stats I see elsewhere online -- maybe as much as 65% of our email traffic. It appears to me that traditional spammers aren't harvesting as much as they maybe once were. (I hold no delusion that we caused them to stop, I just think the market for spam services shifted due to an onslaught from spam filters, laws, etc.) It also seems to indicate that most of the harvesting that's going on today is being done for the people committing true fraud: phishing, 419 scams, lotto scams, etc.

One other x-factor that we have a really, really hard time tracking is the amount of harvesting that is taking place through Google. We've downloaded and played with most of the major harvester software packages available. They're all increasingly encouraging their users to harvest directly through the Google cache. Not only is it faster, they argue, but it also avoids a lot of the pesky honey pots (like our own) that are architected to not hand out addresses to Google.

To give you a sense of the problem of Google cache harvesting, we are CONSTANTLY getting a Google IP listed among our top-25 harvesters list. That is in spite of the fact that we have no-archive tags at the top of every honey pot, so they shouldn't ever end up in the Google cache at all. Imagine what must be happening for that to occur: Google has to have visited a honey pot, somehow the Google spider gets confused and ignores the no-archive tag, the page appears in the Google cache, and then some spammer's harvester surfs the cache and finds the honey pot page. A highly unlikely series of events. The fact that it is happening with some frequency means either: 1) Google's code recognizing no-archive is very messed up, and/or 2) there's a lot of harvesting going on through the Google cache. I've talked with Google engineers about the no-archive tag and, while they admit that the spider may make a mistake every once in a while, it certainly doesn't do so on a regular basis. That leads me to the conclusion that #2 must be the case.

While I've sort of rambled onto a different topic than I started, increasingly our advice to everyone is that if they want to avoid harvesting, one of the best things they can do is add a "no-archive" metatag to the top of their pages. Doing so won't stop Google from indexing you, it will just stop that "View in Cache" link from appearing next to your search results. Unless you have a specific reason you want your pages in the Google cache -- and I can't think of many good ones -- it's probably a best practice to include such a metatag by default.
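For reference, the tag is normally written as <meta name="robots" content="noarchive"> inside the page's head section. Below is a rough sketch of how you might check your own pages for it, using only Python's standard library. The class name, function name, and sample page are hypothetical, and this is not how our honey pots are implemented.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in robots-style meta tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # "robots" addresses all crawlers, "googlebot" only Google's.
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noarchive(html: str) -> bool:
    """Return True if the page asks crawlers not to keep a cached copy."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


# Hypothetical page used only to illustrate the check.
sample_page = (
    "<html><head>"
    '<meta name="robots" content="noarchive">'
    "</head><body>Hello</body></html>"
)
print(has_noarchive(sample_page))
```

A check like this only tells you the tag is present on the page. Whether a given crawler honors it is a separate question, as the top-25 list shows.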

Re: Spam stats are starting to take off.

Author: T.Brolin (25 Jan 06 8:44am)

If your theory is right that Google's spiders once in a while ignore the no-archive tag, then you should see HUGE amounts of spam sent to a few hpot addresses collected by Google spiders. Is this the case?

By the way, when will we get to see some trends on the "trends" page? You should have enough data by now to plot some interesting diagrams.