Message Board

Tracking Harvesters/Spammers

Blocking current harvesters//blacklist?

Author: J.Cridland (25 Dec 04 7:02pm)

Given the alarming number of IPs harvesting one of my sites (not surprisingly - it's a directory of loads of people), I'm playing with banning these IPs.

If they appear in the statistics page, periodically I'll transfer them to the ban list on my site, where they get this page...
http://www.mediauk.com/errordocs/allowed/403.html
(the 'ugly people' link goes to my honeypot).

The question, I suppose, is whether this is a good idea.

Good, because it:
- reduces my bandwidth bill
- stops people/companies on my website being spammed
- still should mean they get to the honeypot

Bad, because it:
- might alert harvesters that I'm on to them
- and therefore may harm my effectiveness to Project Honeypot

What does anyone else think?
(And is there a way I can automate the IP bannings if this is a good idea? Is there such a thing as a blacklist for this?)

Re: Blocking current harvesters//blacklist?

Author: M.Prince (26 Dec 04 8:18pm)

We're working on something internally that will allow our members to do exactly what you describe automatically. We're toying with a couple different versions:

- A pure HTTP RBL that would block access from known harvesters (as you describe);
- An HTTP RBL that kicked harvesters to a gateway page with a CAPTCHA or other access restriction before allowing them onto the rest of your site;
- A web server plugin that would, when a harvester visits, automatically rewrite any email addresses the website contains with either 1) honey pot addresses, or 2) other unusable information.

Before we deploy any tool like this we want to make sure our data is robust. There are two primary and competing concerns: 1) are we wrongly listing anyone? and 2) are we fast enough to do any real good?

Currently we're being very safe and are therefore very slow. Under our current system there is an inherent lag between a harvester visiting and it being listed on our site. This is because something is only declared a "harvester" after it has done two critical things: 1) visited a honey pot page, and 2) the email address that was handed out there receives at least one message.

The upshot of this is that a harvester can stay off our list for a long period of time by running about collecting email addresses, but not sending to any of them until AFTER all the harvesting activities are complete. Of course, the moment that the first email arrives at one of the harvested addresses we'll list the harvester. That, however, does little good if our data is being used as an HTTP RBL, because by then the harvester has already finished visiting sites.
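A minimal sketch of the two-condition rule described above, to make the lag concrete. The class and method names are hypothetical, and the addresses are documentation placeholders; this is not Project Honey Pot's actual code, only an illustration of "visited a honey pot page, then the handed-out address received mail".

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class HoneypotLedger:
    """Hypothetical ledger linking handed-out addresses to visitor IPs."""

    issued: Dict[str, str] = field(default_factory=dict)  # address -> visitor IP
    listed: Set[str] = field(default_factory=set)         # IPs confirmed as harvesters

    def record_visit(self, ip: str, address: str) -> None:
        # Condition 1: a visitor loaded a honey pot page and was given an address.
        self.issued[address] = ip

    def record_mail(self, address: str) -> Optional[str]:
        # Condition 2: mail arrived at that address, so list the IP that collected it.
        ip = self.issued.get(address)
        if ip is not None:
            self.listed.add(ip)
        return ip

    def is_harvester(self, ip: str) -> bool:
        return ip in self.listed


ledger = HoneypotLedger()
ledger.record_visit("192.0.2.10", "trap-001@example.com")
print(ledger.is_harvester("192.0.2.10"))  # visit alone does not list the IP
ledger.record_mail("trap-001@example.com")
print(ledger.is_harvester("192.0.2.10"))  # listed once the first message arrives
```

Between `record_visit` and `record_mail` the IP stays unlisted, which is exactly the window a patient harvester can exploit.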

We have some thoughts on how to potentially solve this problem and get better at identifying "bad bots" as well as true harvesters. Then, assuming we can ensure it is accurate, bad bot data can be used in some sort of HTTP RBL and fed out on a nearly realtime basis. A bad bot is spotted by one Project Honey Pot member and it is instantly blocked from gaining access to any one of our members' sites.

Until we get an automatic system up and running, doing a .htaccess redirect (assuming you're using Apache; I'm not sure of the Windows equivalent, although I'm sure there is one) for known-bad IPs and user agents is unlikely to do the project any harm and may save you from a number of potential attacks. For example, there's no reason you should be allowing visitors with the IP addresses 67.138.247.2, 70.84.20.52, 80.46.67.1, or 66.235.201.125 anywhere near your site.
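As a rough illustration, the four addresses above could be turned into Apache rules with a small script. This sketch assumes Apache 2.4's `Require` syntax and that the server permits authorization directives in .htaccess; servers from the era of this thread would have used the older `Order`/`Deny from` directives instead. The function name is made up for the example.

```python
import ipaddress
from typing import Iterable


def htaccess_deny_rules(ips: Iterable[str]) -> str:
    """Return an Apache 2.4 block that admits everyone except the given IPs."""
    lines = ["<RequireAll>", "    Require all granted"]
    for raw in ips:
        # ip_address() raises ValueError on malformed input, so a typo in the
        # list cannot end up as a broken rule in the generated file.
        ip = ipaddress.ip_address(raw.strip())
        lines.append(f"    Require not ip {ip}")
    lines.append("</RequireAll>")
    return "\n".join(lines) + "\n"


# The addresses named in the post above.
KNOWN_BAD = ["67.138.247.2", "70.84.20.52", "80.46.67.1", "66.235.201.125"]
print(htaccess_deny_rules(KNOWN_BAD))
```

Blocked visitors receive a 403, so the honey pot link belongs on the 403 error page if you still want to collect data on them.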

Moreover, if a visitor has a useragent beginning "Java/_#..." (like Java/1.4.1_04), "Mozilla/3.0 (compatible; Indy Library)", "Missigua Locator", or "Franklin Box Company" I'd block them unless you enjoy receiving phishing scams or those emails from Nigerians offering you money from mysterious bank accounts. In the New Year we'll begin putting together a list of the top user agents harvesters are reporting in order to help website administrators wanting to keep out bad bots even as they move around from IP to IP.
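A quick sketch of how those user agent checks might look in a script. The patterns are one reading of the strings quoted above (for instance, `Java/` followed by a digit stands in for "Java/_#..."), and the names are hypothetical; adjust them to whatever actually shows up in your logs.

```python
import re

# Each pattern is anchored at the start, since the post describes agents
# "beginning" with these strings. The exact regexes are an assumption.
BAD_AGENT_PATTERNS = [
    re.compile(r"^Java/\d"),
    re.compile(r"^Mozilla/3\.0 \(compatible; Indy Library\)"),
    re.compile(r"^Missigua Locator"),
    re.compile(r"^Franklin Box Company"),
]


def is_bad_agent(user_agent: str) -> bool:
    """True if the User-Agent header matches one of the known-bad prefixes."""
    return any(p.search(user_agent) for p in BAD_AGENT_PATTERNS)


for ua in ["Java/1.4.1_04", "Missigua Locator 1.9", "Mozilla/5.0 (X11; Linux x86_64)"]:
    print(ua, "->", "block" if is_bad_agent(ua) else "allow")
```

User agents are trivially forged, so this only catches harvesters that do not bother to disguise themselves.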

I don't think there's much risk that your site blocking some IPs will raise much suspicion. Remember that these harvesters are accessing literally thousands of pages per minute. Chances are that even if the spammer behind the harvesting program reviews the logs they won't even notice your site blocking access; it'll just look like the spider followed a URL typo. (This will be a more serious concern when we get the automated tool up and running and a number of website admins begin blocking access.)

If you want to continue to gather data on the harvesters that you've blocked you can include a (hidden) link to your honey pot page on your 404 page, or whatever error page you direct harvesters to. It might even be worth not 404ing them, but instead just directing the traffic to a "Harvesters Not Allowed" page with some minimal content including a link to your site's honey pot page.

Please let us know how the manual listing of harvesters goes and whether you experience any drop in traffic or spam as a result. (FYI: My guess is that, at least in the short term, the number of phishing/Nigerian scams you receive will drop but your overall volume will stay about the same.) We'll let users know when we get an automated tool up and running. If anyone has suggestions, or wants to help out designing or testing such a tool, don't hesitate to drop us a line through the contact us page and we'll get you involved!

Re: Blocking current harvesters//blacklist?

Author: C.Pettinger (26 Dec 04 10:14pm)

I agree with Mr. Prince that the .htaccess redirect is probably the best thing for this although I can guess it is very labor intensive for you to manually add idiots to the list. I whipped something together for a similar problem as I too was tired of seeing junk in my log files. But I also had the issue of CONNECT and OPTIONS which don't like to work with redirects.
Maybe you could tweak this to do some automatic listing for you, don't know.

It's at http://www.pettingers.org/code/DAVBlack.html

Unfortunately it is pretty hard core and doesn't have any historical perspective. That is, it doesn't keep a count of "violations".

If you speak German, you might be able to make use of something like this with minor modifications:
http://linux.newald.de/new_design/login_check.html

Again, this is for a different application but might be a good starting point for a crafty programmer.

Something like a distributed, dynamic blacklist (like the DNSBLs of the smtp world) would be very cool although I am guessing that the load would be too great on both the host and the servers.

Re: Blocking current harvesters//blacklist?

Author: J.Cridland (28 Dec 04 6:22am)

If there was a 'simple' page containing the IP addresses or user agents I need to block, then that's all I need; I could write a PHP script to grab that and write it to my local .htaccess files. I'd be happy to share the code with others (and build in a way that it could only run once every 24 hours, for example).

I already have a script which runs every day; it would be an easy job to add this to the daily run.

This would probably be all I need... I could fiddle about with parsing http://www.projecthoneypot.org/bots_and_servers.php but it would be nicer if there was a pure text page for importing that way.
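A sketch of the import step described above, written in Python rather than the PHP mentioned in the post. No plain-text list exists according to this thread, so the one-address-per-line format with `#` comments is an assumption, as are all function names. The network fetch is left out; the sketch covers parsing, the once-per-24-hours guard, and a safe file write.

```python
import ipaddress
import os
import time
from typing import List, Optional

MIN_INTERVAL = 24 * 60 * 60  # seconds between refreshes


def parse_blocklist(text: str) -> List[str]:
    """Extract valid IP addresses, one per line; skip blanks, comments, junk."""
    ips = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        try:
            ips.append(str(ipaddress.ip_address(entry)))
        except ValueError:
            continue  # ignore anything that is not an address
    return ips


def due_for_refresh(path: str, now: Optional[float] = None) -> bool:
    """True if the file is missing or was last modified 24+ hours ago."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) >= MIN_INTERVAL
    except OSError:
        return True


def write_ban_file(path: str, ips: List[str]) -> None:
    """Write to a temp file, then rename, so readers never see a partial list."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write("\n".join(ips) + "\n")
    os.replace(tmp, path)


sample = "# hypothetical feed\n67.138.247.2\n\nnot-an-ip\n70.84.20.52  # note\n"
print(parse_blocklist(sample))
```

A daily job would call `due_for_refresh` first, fetch and parse the list only if it returns true, and then translate the resulting addresses into whatever rule syntax the web server expects.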

It strikes me that, unlike the cat and mouse business of spam-blocking, Project Honeypot is a simple and possibly wholly automateable way of spotting spam-spiders, and could have a really big effect on spammers. I'm quite bullish about its future - and would like to use the data it's already creating to save me money and hassle, and stop them spamming those on my website.

Re: Blocking current harvesters//blacklist?

Author: J.Johnson (28 Dec 04 10:26am)

If your site has support for SQL, I have put together a suite of robot traps. One simply records the hit, and the other uses .htaccess to block the IP of the user agent.

You need to be very careful about setting up your robots.txt to make sure it doesn't trap any legitimate search engines. The trap that blocks access is fully automated, and you would only need to decide whether to remove a block on an IP of a search engine or other traveler that was not harvesting email addresses.

The download also includes a simple html control panel to help organize any records you may want to keep.

I need to make a few changes to the instructions for implementing the traps, but you can access the download at: http://ih8spammers.com/

The changes in implementation I recommend are as follows:

- Direct the hidden links to the trap to a page that uses iframes to load the traps.
- Load the Project Honey Pot trap in one iframe.
- If you want the Project Honey Pot trap to be able to record the visit, load a blank page that uses a 3-5 second delay to redirect to the ih8Spammers trap.
- If you maintain more than one site, an alternative to the last would be to crib the Project Honey Pot trap on one site that may not be as vulnerable or defenseless against harvesters.

You can also use any number of the very effective CGI, PHP or JavaScript methods listed in the Webmaster Resources on ih8spammers.com to protect your email addresses. I use and recommend Master Spambot Buster. I also recommend adding the OKDomains routine available from bignosebird.com to the script, in the event that spambots ever advance to the point that they are able to harvest CGI protected addresses.

Re: Blocking current harvesters//blacklist?

Author: J.Cridland (28 Dec 04 6:00pm)

I trust Project Honeypot, since it's linking IP visits to received spam. I believe it's fairly unique in doing this, especially since there are hundreds of sites running the same code. For me, no other method of spotting spam-spiders will do. I have no wish to block even ONE normal user.

I'd question your reply. You recommend the use of robots.txt, even though spam-spiders won't care what robots.txt says. You recommend IFRAMES, which don't generally get spidered to anywhere near the same degree as straight pages. You recommend a 3-5 second delay, presumably through JavaScript or Meta-tags, to redirect a robot, even though spiders don't understand JavaScript nor normally follow links in meta-tags. I don't really want to start a flame-war, but I'm not convinced that you're fully aware how spiders work - for example, in http://ih8spammers.com/guerilla.html there are no links that Google can follow, since you use some JavaScript slidy menu, which is incidentally against disability access laws in the UK.

(Notwithstanding that you don't mention SpamCop.net in your 'how to report spam', but I've said quite enough!)

Re: Blocking current harvesters//blacklist?

Author: M.Prince (29 Dec 04 12:49am)

With our next version we may introduce some additional features to the honey pot page. One of those features may be an additional link that, if followed, would give us instant feedback that something fishy may be going on. We probably won't make that information available quite as publicly in order to avoid any false positives that may ensue. However, I think that without resorting to iframes or JavaScript we can put together something that will 1) catch most harvesters, 2) be almost entirely invisible to humans, and 3) not harm legitimate robots (like Google, MSN, Yahoo, etc.).

The more likely immediate feature we'll add is a way to track comment spammers. If either of you are bloggers you know the problem in the blogging community of spiders that target comment forms on pages with junk. The purpose appears to be to boost the Google PageRank of the spamvertised site by creating a huge number of backlinks. As with email spam, once a few people saw that it worked a lot have jumped on board and bloggers are drowning in the comments.

What we've proposed doing is including a hidden form on the honey pot page that is generated by the script. That form will submit back to the script and any information posted to it will be recorded. We hope to share that information with the makers of such programs as MT-Blacklist, and potentially create a tool that will allow website owners to either restrict commenting, or potentially entirely block access, for visitors with known-troublesome IP addresses.
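To picture the idea, here is a hedged sketch of a hidden form and a recorder for whatever gets posted to it. How the real honey pot script hides or names its form is not stated in this thread; the inline `display:none` style, the field names, and both function names are assumptions made for illustration.

```python
import html
from typing import Dict, List, Tuple

Submission = Tuple[str, Dict[str, str]]  # (poster IP, posted fields)


def render_trap_form(action_url: str, token: str) -> str:
    """Build a form that humans should never see but form-filling bots may post to."""
    action = html.escape(action_url, quote=True)
    tok = html.escape(token, quote=True)
    return (
        f'<form method="post" action="{action}" style="display:none">\n'
        f'  <input type="hidden" name="token" value="{tok}">\n'
        '  <textarea name="comment"></textarea>\n'
        '  <input type="submit" value="Post">\n'
        "</form>"
    )


def record_submission(log: List[Submission], ip: str, fields: Dict[str, str]) -> None:
    """Keep a copy of everything posted, keyed by the poster's IP address."""
    log.append((ip, dict(fields)))


log: List[Submission] = []
print(render_trap_form("/honeypot.php", "abc123"))
record_submission(log, "192.0.2.77", {"token": "abc123", "comment": "cheap pills"})
print(log)
```

Since no person can see the form, anything that arrives through it is a strong signal of an automated comment spammer.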

There are a number of kinks to work out. For example, a lot of the spam programs appear to only be going after specific comment scripts present in Movable Type and other blogging software. As a result, they may be less likely to post to our honey pot page. We're trying to work all that out.

As J. Cridland suggested, I think a single false-positive is unacceptable, especially as we begin to impose consequences on the IP addresses we list. To that end we don't publish an IP address on our list of known spammers until it has 1) visited a honey pot page, and 2) at least one email has been sent to the email address handed out there. By our definition of what it means to be a "harvester," until you do both we have to assume you are innocent. It should be recognized that this will inherently mean we will miss some other "bad robots" which other tools may be useful at spotting and stopping.

Even with our precautions, there is a risk that we could harm the innocent if we're not careful. The biggest problem comes from users with something like a dialup account and dynamic IP address allocation. For example, the 24.248.240.* IP address range is clearly inhabited by a prolific, if somewhat amateur, harvester/spammer ("Travel/Baseball/Basketball Partnership" spam). It looks like the individual is probably based in the Oklahoma City, OK area and is using a Cox Communications DSL account. However, either intentionally or unintentionally, as the harvester starts its activities each morning it is jumping around in the netblock and, as a result, we have a bunch of different IPs listed for what is pretty clearly the same person. I want to be careful that whoever is unlucky enough to be in the same IP range, but is otherwise innocent, isn't unfairly harmed.
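One way to see this kind of clustering is to group listed addresses by netblock. The sketch below assumes a /24 grouping to match the 24.248.240.* example; the individual host addresses are invented for illustration, and the function name is hypothetical.

```python
import ipaddress
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_netblock(ips: Iterable[str], prefix: int = 24) -> Dict[str, List[str]]:
    """Group addresses by the network that contains them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ip in ips:
        # strict=False lets a host address stand in for its enclosing network.
        block = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[str(block)].append(ip)
    return dict(groups)


# Made-up hosts inside the range mentioned above, plus one unrelated address.
listed = ["24.248.240.17", "24.248.240.93", "24.248.240.201", "192.0.2.5"]
for block, members in group_by_netblock(listed).items():
    print(block, len(members))
```

Grouping like this only highlights that several listings probably share one source. It is not an argument for blocking the whole range, which is exactly the harm to innocent neighbours described above.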

Some RBL services have taken the stance that the ISPs behind those services will be more responsive to cleaning up their spam problem if some of their legitimate users are hurt. While there is some logic to that, I'm not sure I'm convinced. As a result, while we have a lot of data that could do at least some good, we're proceeding very cautiously.

We have some tricks up our sleeve and will begin to roll them out in 2005 and as our data set becomes more robust. Again, if people want to help with our alpha testing of any of these features, contact us through the contact us page and we'll give you the first peek.

Re: Blocking current harvesters//blacklist?

Author: J.Johnson (9 Jan 05 4:35pm)

J.Cridland (28 Dec 04 6:00pm) wrote:

I'd question your reply. You recommend the use of robots.txt, even though spam-spiders won't care what robots.txt says. You recommend IFRAMES, which don't generally get spidered to anywhere near the same degree as straight pages. You recommend a 3-5 second delay, presumably through JavaScript or Meta-tags, to redirect a robot, even though spiders don't understand JavaScript nor normally follow links in meta-tags. I don't really want to start a flame-war, but I'm not convinced that you're fully aware how spiders work - for example, in http://ih8spammers.com/guerilla.html there are no links that Google can follow, since you use some JavaScript slidy menu, which is incidentally against disability access laws in the UK.

(Notwithstanding that you don't mention SpamCop.net in your 'how to report spam', but I've said quite enough!)

My response:

Thank you very much for pointing out some possible misunderstandings I may entertain, but I have a few questions regarding your post as well. Before I do, I have a few observations that I think I can address with a reasonable degree of confidence.

I could not agree more fully with your praise for projecthoneypot.org. They provide a great service to us all, and I like the fact that I do not need to hide their trap behind a robots.txt. A number of spambots are however outside the reach of law enforcement, and will continuously traverse a site, and I think it is useful to be able to block them and not have to spend the time needed to look them up on a regular basis. This goes for bots like baidu.com's Baidu spider, and the one used by cyveillance.com and others of their ilk, which are just badly behaved spiders. They do at least appear not to be harvesting email addresses.

As for badly behaved individuals, I will not regret being spared the time needed to investigate their unauthorized access either, should they decide to misbehave more than once. I do not use the .htaccess trap on each site that I maintain however, and will not disagree that it is a bit much for anyone who just wants to record and track spambots. I will not judge those who do so either though. I have several other reasons for using this trap however.

I maintain a site that provides free access to a large database of information for which several other sites charge a service fee. I pride myself too for doing a more thorough and comprehensive job in creating the database, and actually began this pursuit as a consequence of someone downloading the database. The .htaccess trap therefore serves the purpose of preventing anyone from capitalizing on something I spent months developing and still spend hours each week maintaining.

The same domain also provides literary critique group forums that require a significant level of security to prevent the manuscripts from being indexed by search engines that do not honor the robots.txt, and the .htaccess trap also provides an added level of security against certain shortcomings in the security of IM identities that may be posted in the member profiles against my advice.

Thank you for your insight into the Internet accessibility law in Britain. Do you suppose my site map would serve to mitigate my violation of British law against the use of a JavaScript slidy menu? Would you also be able to tell me how and where to report violations of Britain's anti-spam laws? I understand it is patently against the law, but I can find nothing about where to report violations on any of the British anti-spam or law enforcement sites. I was spammed by a British spammer selling email lists, an Internet service provider no less, and would like to report them.

My next question is, wouldn't you advise using a robots.txt file to disallow access to a trap file that would use .htaccess to block an IP, and thereby avoid blocking spiders that one might normally want to index a site? If you would recommend against using the robots.txt for this purpose, why would you do so? I'm puzzled.
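For what it's worth, it is easy to check what a well-behaved crawler would conclude from such a robots.txt. This sketch uses Python's standard `urllib.robotparser`; the `/trap/` path and the helper name are hypothetical stand-ins for wherever the blocking trap actually lives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that fences off the blocking trap for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Would a crawler that honors robots.txt fetch this path?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


print(allowed(ROBOTS_TXT, "Googlebot", "/trap/block.php"))  # trap is off limits
print(allowed(ROBOTS_TXT, "Googlebot", "/index.html"))      # normal pages are fine
```

A crawler that respects the file never reaches the trap and so never gets blocked, while one that ignores it walks straight in. That only protects search engines that actually honor robots.txt.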

Perhaps I need to clarify that the pages that load in the iframes are not intended to be parsed by search engines - they are after all trap files that are disallowed by the robots.txt. I have not heard that spambots avoid iframes, and would appreciate your definitive statement regarding the viability of a trap that uses them to trap such a creeper. I have only trapped a few dozen spambots and other unwelcome intruders, and wonder how many I am missing. Is there a ratio I can apply to estimate how many I missed?

I do admit to erring when I paired the PHP trap with my .htaccess trap. This caused several bots to be blocked from the PHP trap, and my recommendation that a delay be used is intended to allow time for the bot to hit the PHP trap prior to being blocked. A simple HTML redirect is all that's needed, and doing this works quite well.

I realize that many indexing agents do not parse and follow JavaScript links, but Google follows those that are text browser friendly, and appears to have indexed the pages I have linked in my floating menu that are not disallowed. Perhaps you are right about my lack of familiarity with search engines in general though, and I would appreciate some more detailed insights. Which important search engines am I missing that do not use Google, MSN and the others that do index every page I want them to index to fill in the deficiencies in their own indexing agents? Do you suppose they have any trouble following my site map too?

At your suggestion, I think I will point to SpamCop.net. Perhaps those who may visit ih8Spammers, but lack the wherewithal to make an independent effort, will use it. I had intended to do a little more research into the procedures that each of the major RBL sites employed for sending spammers to the nether reaches of the space-time continuum before doing so, but throwing a quick comment about SpamCop.net's quickie reporting procedure on the page won't hurt.

Post Edited (10 Jan 05 10:42pm)

Copyright © 2004–25, Unspam Technologies, Inc. All rights reserved.