Author: J.Johnson (9 Jan 05 4:35pm)
J.Cridland (28 Dec 04 5:00pm) wrote:
I'd question your reply. You recommend the use of robots.txt, even though spam-spiders won't care what robots.txt says. You recommend IFRAMES, which don't generally get spidered to anywhere near the same degree as straight pages. You recommend a 3-5 second delay, presumably through JavaScript or Meta-tags, to redirect a robot, even though spiders don't understand JavaScript nor normally follow links in meta-tags. I don't really want to start a flame-war, but I'm not convinced that you're fully aware how spiders work - for example, in http://ih8spammers.com/guerilla.html there are no links that Google can follow, since you use some JavaScript slidy menu, which is incidentally against disability access laws in the UK.
(Notwithstanding that you don't mention SpamCop.net in your 'how to report spam', but I've said quite enough!)
My response:
Thank you very much for pointing out some possible misunderstandings I may entertain, but I have a few questions regarding your post as well. Before I ask them, I have a few observations that I think I can make with a reasonable degree of confidence.
I could not agree more fully with your praise for projecthoneypot.org. They provide a great service to us all, and I like the fact that I do not need to hide their trap behind a robots.txt. A number of spambots are, however, outside the reach of law enforcement and will continuously traverse a site, and I think it is useful to be able to block them rather than spend the time needed to look them up on a regular basis. This goes for bots like baidu.com's Baidu spider, the one used by cyveillance.com, and others of their ilk, which are just badly behaved spiders. They do at least appear not to be harvesting email addresses.
As for badly behaved individuals, I will not regret being spared the time needed to investigate their unauthorized access either, should they decide to misbehave more than once. I do not use the .htaccess trap on every site that I maintain, and I will not disagree that it is a bit much for anyone who just wants to record and track spambots. I will not judge those who do so either. I do have several other reasons for using this trap.
I maintain a site that provides free access to a large database of information for which several other sites charge a service fee. I also pride myself on doing a more thorough and comprehensive job in creating the database, and I actually began using the trap as a consequence of someone downloading the database. The .htaccess trap therefore serves the purpose of preventing anyone from capitalizing on something I spent months developing and still spend hours each week maintaining.
The same domain also provides literary critique group forums that require a significant level of security to prevent the manuscripts from being indexed by search engines that do not honor the robots.txt, and the .htaccess trap also provides an added level of security against certain shortcomings in the security of IM identities that may be posted in the member profiles against my advice.
Thank you for your insight into the Internet accessibility law in Britain. Do you suppose my site map would serve to mitigate my violation of British law against the use of a JavaScript slidy menu? Would you also be able to tell me how and where to report violations of Britain's anti-spam laws? I understand that spamming is patently against the law there, but I can find nothing about where to report violations on any of the British anti-spam or law enforcement sites. I was spammed by a British spammer selling email lists, an Internet service provider no less, and would like to report them.
My next question is, wouldn't you advise using a robots.txt file to disallow access to a trap file that would use .htaccess to block an IP, and thereby avoid blocking spiders that one might normally want to index a site? If you would recommend against using the robots.txt for this purpose, why would you do so? I'm puzzled.
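To make the question concrete, here is a minimal sketch in Python of the reasoning, not my actual setup. It uses the standard library's robots.txt parser to show that a spider which honors robots.txt never requests the trap, so only agents that ignore the file ever reach it. The /trap/ path, the example.com host, the sample IP, and the helper names are assumptions for illustration, and the deny line assumes Apache's older "Deny from" access-control syntax.

```python
import ipaddress
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the /trap/ path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robots.txt-honoring agent is allowed to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def deny_line(ip: str) -> str:
    """Build the line a trap script might append to .htaccess for one visitor.

    Assumes Apache's older 'Deny from' directive; raises ValueError if the
    string is not a valid IP address.
    """
    return f"Deny from {ipaddress.ip_address(ip)}"


if __name__ == "__main__":
    # A well-behaved spider is told to stay out of the trap, so it is never blocked.
    print(may_fetch("Googlebot", "http://example.com/trap/index.php"))
    # It may still index the rest of the site.
    print(may_fetch("Googlebot", "http://example.com/index.html"))
    # An agent that ignores robots.txt and hits the trap would earn a line like this.
    print(deny_line("203.0.113.7"))
```

The point of the sketch is only that the robots.txt entry protects the spiders one wants, while doing nothing to stop the ones that ignore it from walking into the trap.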
Perhaps I need to clarify that the pages that load in the iframes are not intended to be parsed by search engines - they are after all trap files that are disallowed by the robots.txt. I have not heard that spambots avoid iframes, and would appreciate your definitive statement regarding the viability of a trap that uses them to trap such a creeper. I have only trapped a few dozen spambots and other unwelcome intruders, and wonder how many I am missing. Is there a ratio I can apply to estimate how many I missed?
I do admit to erring when I paired the PHP trap with my .htaccess trap. This caused several bots to be blocked before they could reach the PHP trap, and my recommendation that a delay be used is intended to allow time for the bot to hit the PHP trap prior to being blocked. A simple HTML redirect is all that's needed, and doing this works quite well.
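As a rough illustration of what such a delayed HTML redirect could look like, here is a small Python sketch that generates a page with a meta refresh tag. The function name, the target path, and the 4-second default are assumptions for illustration, not the exact page I use, and whether a given bot follows the refresh is a separate question.

```python
from html import escape


def delayed_redirect_page(target_url: str, delay_seconds: int = 4) -> str:
    """Return a minimal HTML page that redirects after a delay via meta refresh.

    target_url and delay_seconds are illustrative; the delay gives the first
    trap time to log the visitor before the second trap is requested.
    """
    content = escape(f"{delay_seconds}; url={target_url}", quote=True)
    return (
        "<!DOCTYPE html>\n"
        "<html><head>\n"
        f'<meta http-equiv="refresh" content="{content}">\n'
        "</head><body></body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical path to the second (.htaccess) trap.
    print(delayed_redirect_page("/trap/block.php", 4))
```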
I realize that many indexing agents do not parse and follow JavaScript links, but Google follows those that are text-browser friendly, and it appears to have indexed the pages linked in my floating menu that are not disallowed. Perhaps you are right about my lack of familiarity with search engines in general though, and I would appreciate some more detailed insights. Which important search engines am I missing, that is, ones that do not rely on Google, MSN, and the others that do index every page I want indexed to fill in the deficiencies of their own indexing agents? Do you suppose they have any trouble following my site map too?
At your suggestion, I think I will point to SpamCop.net. Perhaps those who may visit ih8Spammers, but lack the wherewithal to make an independent effort, will use it. I had intended to do a little more research into the procedures that each of the major RBL sites employed for sending spammers to the nether reaches of the space-time continuum before doing so, but throwing a quick comment about SpamCop.net's quickie reporting procedure on the page won't hurt.
Post Edited (10 Jan 05 10:42pm)