Author: M.Prince (11 Dec 05 9:46pm)
Those numbers are about right looking at all the data. We've seen a slow but constant rise in the volume of total spam the project receives. However, there appears to have been a lot less address sharing between spammers than we had originally guessed.
Another interesting point that doesn't come through in the stats -- although we're working on how to quantify it -- is that we are getting a disproportionately high level of phishing spam compared with a lot of stats I see elsewhere online -- maybe as much as 65% of our email traffic. It appears to me that traditional spammers aren't harvesting as much as they once did. (I'm under no delusion that we caused them to stop; I just think the market for spam services shifted due to an onslaught from spam filters, laws, etc.) It also seems to indicate that most of the harvesting that's going on today is being done for the people committing true fraud: phishing, 419 scams, lotto scams, etc.
One other x-factor that we have a really, really hard time tracking is the amount of harvesting that is taking place through Google. We've downloaded and played with most of the major harvester software packages available. They're all increasingly encouraging their users to harvest directly through the Google cache. Not only is it faster, they argue, but it also avoids a lot of the pesky honey pots (like our own) that are architected to not hand out addresses to Google.
To give you a sense of the problem of Google cache harvesting, we are CONSTANTLY seeing a Google IP on our top-25 harvesters list. That is in spite of the fact that we have no-archive tags at the top of every honey pot, so they shouldn't ever end up in the Google cache at all. Think about what must be happening for that to occur: Google has to have visited a honey pot, the Google spider somehow gets confused and ignores the no-archive tag, the page appears in the Google cache, and then some spammer's harvester surfs the cache and finds the honey pot page. A highly unlikely series of events. The fact that it is happening with some frequency means either: 1) Google's code for recognizing no-archive is very messed up, and/or 2) there's a lot of harvesting going on through the Google cache. I've talked with Google engineers about the no-archive tag and, while they admit that the spider may make a mistake every once in a while, it certainly doesn't do so on a regular basis. That leads me to the conclusion that #2 must be the case.
While I've sort of rambled onto a different topic than the one I started on, increasingly our advice to everyone is that if they want to avoid harvesting, one of the best things they can do is add a "no-archive" metatag to the top of their pages. Doing so won't stop Google from indexing you; it will just stop that "View in Cache" link from appearing next to your search results. Unless you have a specific reason you want your pages in the Google cache -- and I can't think of many good ones -- it's probably a best practice to include such a metatag by default.
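For anyone who wants to check their own pages, here is a rough sketch in Python, not anything the project itself runs. Note that in the markup the directive is spelled "noarchive" (no hyphen), as in <meta name="robots" content="noarchive">. The helper names has_noarchive and add_noarchive are made up for this example, and the sketch assumes the page is already available as a string of HTML.

```python
import re
from html.parser import HTMLParser

# The tag to add; "noarchive" asks search engines not to offer a cached copy.
NOARCHIVE_TAG = '<meta name="robots" content="noarchive">'


class RobotsMetaParser(HTMLParser):
    """Collects directives from robots/googlebot <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() not in ("robots", "googlebot"):
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noarchive(html):
    """Return True if the page already carries a noarchive directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noarchive" in parser.directives


def add_noarchive(html):
    """Insert the noarchive tag just after the opening <head> tag if it is missing."""
    if has_noarchive(html):
        return html
    # Match <head> or <head ...>, but not <header>.
    match = re.search(r"<head(\s[^>]*)?>", html, flags=re.IGNORECASE)
    if match is None:
        # Assumption for this sketch: with no <head>, prepend the tag to the page.
        return NOARCHIVE_TAG + html
    return html[:match.end()] + NOARCHIVE_TAG + html[match.end():]
```

Running add_noarchive over your templates once, and has_noarchive as a sanity check afterwards, should be enough for most static sites. It only handles the meta tag; it does not touch robots.txt or anything else that controls indexing.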