Message Board

Newbie/Basic Questions

 google looking for honey?
Author: T.Vaida   (9 May 06 11:06am)
http://www.projecthoneypot.org/bots_and_servers.php
shows this ip:
64.233.172.35 (and consequently other ips in the neighbourhood) as being used for email harvesting, followed by spam from Asia.

I wonder which of the following is the reason for that:
- Google (or hacked computers at Google) was collecting spam targets;
- Google's bots did not get any hint that they should not index that honeypot page;
- Google's bots ignored hints that they should not index that page;
- the owner of the honeypot modified the code to allow search engine indexing.
 
 Re: google looking for honey?
Author: T.Vaida   (9 May 06 11:08am)
I'll add another possibility (I don't think it's real, though, because the whole class is using NS*.GOOGLE.COM):
- DNS records faked so they look like they belong to Google.
 
 Re: google looking for honey?
Author: M.Prince   (16 May 06 7:27pm)
This has been a constant "problem" we have encountered, but it's something we intend to research further as it appears to be a significant part of the spam economy. Let me explain what I think is going on:

As you probably know, Google builds its index by using crawlers that trawl the web. These crawlers are really no different than the harvesters that spammers use. Google, of course, promises to only use their data for good, so they respect things like the <meta> nocache tags that you can put at the top of your pages. We put these nocache tags at the top of all the honey pots that are installed and go through pretty substantial efforts -- documented elsewhere in this message board -- to make sure that they aren't removed by our users. These tags should mean that legitimate crawlers -- like Google -- do not add the data from the pages (like the email addresses) to their caches. What's interesting is that this isn't always what happens.

We have verified on several occasions that Google is caching pages that were marked with a nocache tag. It doesn't always happen or even, so far as we can tell, happen most of the time. However, something occasionally happens which allows our honey pots to get into the Google cache.

Now that's not enough for Google's crawler to end up on our list of harvesters. In order to be marked a harvester you not only have to visit the honey pot, but you then have to send at least one message to the email addresses that the honey pot handed out. I think it's highly unlikely that Google is secretly augmenting their revenues by spamming. So what's going on?
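To make the two-part rule above concrete, here is a minimal sketch in Python. It is not Project Honey Pot's actual code: the `Visit` and `Message` records, the `find_harvesters` function, and the idea that each visit is handed its own unique address are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Visit:
    ip: str              # IP that requested the honey pot page
    address_given: str   # address handed out on that visit (assumed unique per visit)
    when: datetime


@dataclass(frozen=True)
class Message:
    recipient: str       # honey pot address the mail was sent to
    when: datetime


def find_harvesters(visits, messages):
    """Return IPs that visited a honey pot AND whose handed-out address got mail afterwards."""
    # Latest delivery time per honey pot address.
    last_mail = {}
    for m in messages:
        if m.recipient not in last_mail or m.when > last_mail[m.recipient]:
            last_mail[m.recipient] = m.when

    harvesters = set()
    for v in visits:
        mail_time = last_mail.get(v.address_given)
        # A visit alone is not enough: at least one message must follow it.
        if mail_time is not None and mail_time >= v.when:
            harvesters.add(v.ip)
    return harvesters
```

Under this rule a crawler that only visits never shows up in the result. Google's IP appears only if the address it was handed later receives mail, which is exactly the situation the rest of this post tries to explain.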

We've looked into this a bit and it appears that at least a few of the major harvesting programs out there allow you to harvest directly through the Google cache. Why would a spammer want to do this? Well, for one thing, the Google cache responds a LOT faster than many websites. Additionally, you avoid a lot of the spider traps. Google has already taken the time to avoid things like those pages marked as off-limits by robots.txt -- where spider traps are often hidden -- so you don't have to.

Along the same lines, a spammer also eliminates the risk that we will be able to identify the spammer's harvesting IP because Google does the harvesting and the spammer then just takes the data from Google. In other words, Google is "laundering" the IP addresses of spammers.

Here's the interesting part. In order for Google to appear in our list of harvesters a series of relatively unlikely events has to happen. First, Google has to run across one of our honey pots. That's not very unlikely from our perspective -- Google's robot makes up about 15% of the traffic we see -- but it's very unlikely from Google's perspective given the huge number of pages they index daily.

Second, something needs to go wrong at Google for them to ignore the nocache meta tag. It is possible that somehow our honey pots get messed up and don't contain the nocache tag. However, we've looked at the places this happens and verified that the nocache tag is there and appears to be displaying correctly. So I'm guessing that it's an issue on Google's side. Further evidence for this is that we've noticed these wrongly cached pages disappear from the cache not long after we notice they are there. Trying to access the URL where they once were results in a "the page you have requested is not found"-type error.
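For anyone who wants to repeat that check on their own honey pot page, a rough sketch follows. It only uses Python's standard `html.parser`. The thread calls this a "nocache" tag; the robots meta directive Google documents for keeping a page out of its cache is `noarchive`, so that is the default here. The exact markup the honey pots emit is an assumption, and `has_no_cache_directive` is a made-up helper name.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"|"googlebot" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                directive = directive.strip().lower()
                if directive:
                    self.directives.add(directive)


def has_no_cache_directive(html, directive="noarchive"):
    """True if the page carries the given robots meta directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return directive in parser.directives
```

Pass `directive="nocache"` if that is the literal value your page uses. A `True` result only shows the tag is present in the HTML you fed in; it says nothing about whether a crawler honoured it.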

Third, in the vast sea that is Google's cache, a harvester has to run across one of these incorrectly cached honey pots, pick up the address stored there, and then send to it. This isn't an impossible series of events by any stretch, but you'd think it'd be fairly unlikely to happen often. What's striking, however, is that it appears to happen all the time. Now I'd love to tell you that we just have millions of honey pots installed and that's the reason statistically this happens as often as it does. And, while we do have a lot of pages floating around out there, it's not so high a number that you can account for this.

So what's the alternative explanation? The only thing we've been able to come up with is that a TON of harvesting is being done through Google's cache. We've talked to the Google folks about this and they seem to suggest that it's really not their problem. If a substantial percentage of spam harvesting is going through Google's cache, however, it may be time for the company to start taking some measures to control it.

Two possibilities: 1) The Internet community could create some sort of local nocaching tag. For example, if there were data that you didn't want to appear in Google's cache, maybe you could create something like <a href="mailto:email@address.com" rel="nocache">email@address.com</a>. Much like rel="nofollow" right now tells indexers not to follow particular links, maybe this could be a way to effectively black out chunks of data from appearing in Google's cache.
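As a rough idea of what honouring such a tag could look like on the cache side, here is a sketch that drops every element marked rel="nocache" before a page is stored. Keep in mind that rel="nocache" is only the proposal above, not an existing standard, and that `NoCacheStripper` and `strip_nocache` are hypothetical names. It assumes reasonably well-nested HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoCacheStripper(HTMLParser):
    """Re-emits a page, omitting any element whose rel attribute contains "nocache"."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a rel="nocache" element

    def _is_nocache(self, attrs):
        rel = dict(attrs).get("rel") or ""
        return "nocache" in rel.lower().split()

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
        elif self._is_nocache(attrs):
            if tag not in VOID_TAGS:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never open a nested scope.
        if not self.skip_depth and not self._is_nocache(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID_TAGS:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_nocache(html):
    """Return the page with all rel="nocache" elements (and their contents) removed."""
    parser = NoCacheStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Running `strip_nocache` on a paragraph containing the example link above would leave the surrounding text in the cached copy while the address itself never gets stored. Comments and doctype declarations are dropped by this sketch for brevity.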

2) Google could install honey pots on the cache pages themselves. We've offered and would be happy to provide the technical backend to do this. So far no one at Google is listening. But, if any of you have any suggestions on how to get Google to take this problem seriously, please let us know.

In the meantime, we plan to launch a new version of the honey pots shortly that we will intentionally let get cached by Google. After the fact, we will see how much spam these honey pot pages actually receive in order to help quantify how much of a role Google is playing in the spam economy. Our best guess, however, is that it is substantial.

We welcome any thoughts or suggestions.





Copyright © 2004–25, Unspam Technologies, Inc. All rights reserved.
