Message Board

Bugs & Development

 Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (20 Feb 05 10:44pm)
Hi, I think that the title of these stats might be misleading. It doesn't appear to be showing the Top 25 Harvesters at all, but rather the last 25 seen. Am I right?
 
 Re: Top 25 Global Spam Harvester List
Author: M.Prince   (21 Feb 05 12:42am)
Yeah, you're right. We're currently showing the last 25 seen....

That, however, is not our ultimate goal. It's our intent to come up with an algorithm to show the "top" harvesters. We're not quite sure how to quantify that, however. For example, do we just show the harvesters that have had the most visits? Or do we show the harvesters that have hit the most honey pots? Or maybe the harvesters that have resulted in the most number of messages ultimately?

Ideally, the algorithm would include all of these things. We're slowly making upgrades to the site to include more of the data points that will ultimately be necessary in order to calculate the true "top harvesters."

In the meantime, if anyone has a suggestion for an algorithm to describe which harvesters are worse than others, please let us know; we're definitely looking for suggestions.
 
 Re: Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (21 Feb 05 5:17am)
Well the last 25 seen list could still be useful, so you may actually want to retain that.

In the end I think it is number of spam e-mails they are sending out that is our main complaint about them, so I'd say number of e-mails sent should definitely be part of the formula if not the key indicator.

Maybe you should show top stats in all the categories for now and I'm sure eventually trends will start to come out.
So you could have the following lists:
Last 25 Seen
Most honey pots hit (not counting multiple visits to the same honey pot)
Most visits to honey pots (counting multiple visits to honey pots)
Most e-mails received

Other measures could be
Time between visits to a honey pot (the smaller the time interval, the higher the ranking).
Time between visits to different honey pots (possibly indicative of how many pages they are requesting per day).

That's all I can think of at the moment.

Colin
 
 Re: Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (21 Feb 05 6:07am)
Another thing to consider is whether the stats are cumulative or per period, i.e. you could have some stats expressed as hits/visits per hour, day, week, etc.

Also, should an IP address or user agent ranking start decreasing when it stops showing harvesting activity? I'm sure eventually they will have to move their harvesting around to various IPs if they aren't already doing so.
For some of the cumulative stats you will probably want the absolute number to determine the ranking (most hits/visits ever seen); for others, which show which harvesters have been most active recently, you'll want the last-seen date as part of the equation, e.g. dividing by (current date - last seen date) as part of the formula.
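That recency weighting could be sketched like this (a minimal illustration with hypothetical names, not the site's actual formula):

```python
from datetime import date

def recency_weighted_score(total_hits, last_seen, today=None):
    """Rank a harvester by total hits, discounted by days since last seen.

    Dividing by the age (plus one, so a harvester seen today doesn't
    divide by zero) pushes stale IPs down the list over time.
    """
    today = today or date.today()
    days_idle = (today - last_seen).days
    return total_hits / (days_idle + 1)

# A harvester with fewer hits but seen today outranks a stale one
# that racked up more hits two months ago.
fresh = recency_weighted_score(100, date(2005, 3, 14), today=date(2005, 3, 14))
stale = recency_weighted_score(500, date(2005, 1, 1), today=date(2005, 3, 14))
```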

Another factor that could be measured is whether a harvester obeys robots.txt. You could allow users to specify whether a certain honey pot page is listed in robots.txt or not; it would be especially interesting to have some sites with two honey pot pages, one listed in robots.txt as not allowed. I've read on some web sites that certain badly behaved robots actually target pages listed as not allowed.
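The two-page setup only needs a one-line robots.txt entry (paths here are hypothetical):

```
# /robots.txt -- one honey pot page disallowed, the other left unlisted
User-agent: *
Disallow: /trap-hidden.php
```

A well-behaved robot would then only ever reach /trap-open.php; any visit to /trap-hidden.php means the bot either ignored robots.txt or deliberately mined it for "forbidden" URLs.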

Can you consistently resolve IP addresses to countries? It would be interesting to see some country-based statistics as well.
Resolving back to domains and having stats on those could be interesting as well.

Colin
 
 Re: Top 25 Global Spam Harvester List
Author: M.Prince   (21 Feb 05 3:27pm)
Thanks for the feedback! Keep it coming. Behind the scenes we've started to track all these measures. We'll roll out updates when we're happy with them and think they provide useful information without compromising the

I'm sensitive to C.Dijkgraaf's point above that we need to be careful about how exactly we tally the stats. The last thing we want is to create false positives for machines that are no longer actively harvesting, especially as web admins start using our data to block access by offending IPs.

I'm particularly interested in any suggestions people have on how to take into account the number of resulting emails stat. Like C.Dijkgraaf pointed out, ultimately that's what we care about most. But you could imagine a situation where the IP address 127.0.0.1 harvested a single spamtrap address on 1/1/2005. Whoever is behind 127.0.0.1 (who could it be?? subtle geek humor...) is particularly good at marketing his or her CD of 1M email addresses for $19.95 so that single spamtrap address gets spread far and wide. Then someone else takes over 127.0.0.1 (that's kinda Existential, if you think about it) but they don't spam at all. Meanwhile, spammers around the world are firing messages at the spamtrap address originally harvested by the previous user of 127.0.0.1.

If we rely too heavily on the total number of messages received by a spamtrap address (or even on the number received in the last X period of time), then we risk having some IPs that were maybe not particularly widespread harvesters, but were very good address distributors, getting on the list and staying on the list for a LONG time even though the actual bad guys behind them have moved on. Make sense?
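One common way to keep stale IPs from sticking to a list like that is to decay each message's contribution by its age, e.g. with an exponential half-life (a sketch under assumed parameters, not Project Honey Pot's actual algorithm):

```python
from datetime import date

def decayed_message_score(message_dates, half_life_days=30, today=None):
    """Sum message counts with an exponential decay by age.

    A message received today contributes 1.0; one received
    half_life_days ago contributes 0.5; one twice that old, 0.25.
    An IP whose harvested spamtrap address is still drawing spam,
    but that stopped harvesting long ago, fades from the ranking
    on its own instead of squatting on the list.
    """
    today = today or date.today()
    return sum(
        0.5 ** ((today - d).days / half_life_days)
        for d in message_dates
    )
```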

This is what we're struggling with, and why we haven't migrated to a new algorithm for calculating the Top-25 yet. One of the nice things about the current setup is that we're fairly confident the IPs on the list are still being used as some sort of harvesting robot if they're listed there. As we move to something fancier, I want to make sure we're not creating as many problems as we're helping solve.

Thanks for the feedback. Any statisticians in the audience who want to suggest algorithms that take into account the above concerns.... we're really, really all ears.
 
 Re: Top 25 Global Spam Harvester List
Author: T.Brolin   (26 Feb 05 3:20pm)
The measure for how "bad" a harvester is has to be the number of harvested addresses.
The harvesters are not really in control of the amount of spam ultimately sent to the harvested addresses; the spammers are.

Amount of spam sent is a measure for how bad the spam servers are, not the harvesters.
 
 Re: Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (27 Feb 05 4:59am)
That is assuming that the harvesters and spammers are separate entities. I wouldn't be surprised if they were one and the same in quite a few instances. Why pay money for lists of e-mails when you can gather them yourself?
An indication of whether someone is both the harvester and the spammer is how long it takes between the address being gathered and the first spam being received. If it is a matter of hours or even days, they could be the same entity.
But since it isn't always clear-cut whether they are one and the same, then yes, the number of addresses harvested is probably a more accurate measure.
 
 Re: Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (10 Mar 05 6:57pm)
This idea is probably out of scope of what you are trying to achieve, but I've created a scoring system for bot behaviour, not just harvesting, and I've given two examples as well.
See http://www.dijkgraaf.org/robots.html
As you can see Googlebot scored a 10 (with a potential maximum of 24, although this can change), whereas Port Huron Labs scored negative 4. For both bots there are some behaviours I haven't determined yet, which would change the scores a bit.
This scoring system certainly isn't set in stone yet, so feedback is appreciated ;-)

Colin
 
 Re: Top 25 Global Spam Harvester List
Author: M.Prince   (11 Mar 05 1:43am)
Very interesting! Thanks for letting us know.

We're definitely planning some more general bot information and scoring with the next version of our honey pot scripts. Some of the things you're tracking we'll be able to handle. Others will be tough, but you've definitely given us some good things to think about.

A couple other thoughts you might want to incorporate:

- You might want to also track whether robots fetch images. Not sure whether that gets points off or not (on one hand, it likely means it's a more sophisticated operation; on the other, it hogs bandwidth), but it would be interesting to keep track of.
- You might want to not only track whether a useragent is provided, but also give a big negative if the robot pretends to be a real visitor, e.g. "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT)."
- You might want to track if the useragent changes over time and give a negative score for that. Small changes may be acceptable ("msnbot/0.3 (+http://search.msn.com/msnbot.htm)" becomes "msnbot/1.0 (+http://search.msn.com/msnbot.htm)" maybe ok) but a robot that changes frequently within a short interval seems likely to be trouble.
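The last two points might be scored along these lines (a rough sketch; the point values, the 30-day window, and the browser-prefix check are all made up for illustration):

```python
# Prefixes that suggest a robot masquerading as a real browser.
BROWSER_PREFIXES = ("Mozilla/4.0 (compatible; MSIE",)

def useragent_penalty(ua_events, window_days=30):
    """Score useragent behaviour from (day_number, useragent) events.

    Masquerading as a real browser costs 2 points; each distinct new
    useragent appearing within window_days of an earlier, different
    one costs 1 point.
    """
    penalty = 0
    if any(ua.startswith(BROWSER_PREFIXES) for _, ua in ua_events):
        penalty -= 2
    events = sorted(ua_events)
    for i, (day, ua) in enumerate(events):
        recent = {u for d, u in events[:i] if day - d <= window_days}
        if recent and ua not in recent:
            penalty -= 1
    return penalty
```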

You might also want to check out some pages we put up to track the useragents of harvesters:

http://www.projecthoneypot.org/harvester_useragents.php

As well as robots in general:

http://www.projecthoneypot.org/robot_useragents.php
 
 Re: Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (11 Mar 05 4:13am)
Yes, what file types bots request would be interesting information to capture.

I think bots that do download images shouldn't lose points unless
1) you put your images in a folder and banned bots from that folder,
or 2) they don't send an If-Modified-Since header (HTTP_IF_MODIFIED_SINCE) when they next load the image (if they did, then a particular bot would only use up bandwidth once per image).
#1) I think is already covered by obeys robots.txt, but
#2) would probably be worth doing as a separate item for images.

I'm currently working on adapting some of my PHP pages to issue a 304 status if
1) the main PHP file hasn't changed, and 2) the maximum modified date of database data shown in the page is less than or equal to HTTP_IF_MODIFIED_SINCE.
This is to reduce the bandwidth that bots use, because I have a lot of pages that are generated from a database.
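The two conditions boil down to comparing the newer of the two modification dates against the client's header. Sketched in Python rather than PHP (the function and parameter names are hypothetical):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime

def conditional_get(if_modified_since, page_mtime, db_max_mtime):
    """Decide between a 304 and a full 200 response.

    Returns 304 only when both the page source and the newest
    database row shown on the page are no newer than the date the
    client sent in its If-Modified-Since header. Otherwise returns
    200 plus the Last-Modified value to send back.
    """
    last_modified = max(page_mtime, db_max_mtime)
    if if_modified_since is not None:
        since = parsedate_to_datetime(if_modified_since)
        if last_modified <= since:
            return 304, None
    return 200, format_datetime(last_modified, usegmt=True)
```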

I've seen one bot that even sets HTTP_IF_MODIFIED_SINCE when fetching robots.txt (well done to that programmer, and that's why I've got a bonus point for that in my scoring system).

There are some specialist bots out there now, some that specifically catalog images, and even one that just catalogs web sites' favourite icons (favicons).

Other bots load JavaScript files to scan them for links, as some sites using JavaScript menus don't have links in the page that bots can follow.

Yes, I've seen both the user agent pages; very useful, thanks. I've already used that information to ban certain bots from my guestbook pages.

Yes, I agree on the rapidly changing user agents, or those showing a wrong or misleading name (I already had a -1 for a wrong name).
I would also give points off for bots that use one user agent to request robots.txt and another to request the pages; I've seen several bots do that and I think it is a bad practice.

Colin
 
 Re: Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (12 Mar 05 4:36pm)
Ok, I've simplified the scoring a bit, added some of your suggestions, and published it as http://www.dijkgraaf.org/robots2.html
I've also made some of the rules less subjective and more quantitative, although that still needs a bit of work (especially those labelled "Quantatative measure needed!").

Which things do you think will be tough to do?
I suppose robots.txt would be one area that could be tough. Possibly this could be done by redirecting robots.txt requests to a dynamic page, such as your honey pot, that would log the visit and output properly formatted robots commands.
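With Apache's mod_rewrite, for instance, that redirect could be a one-liner (paths are hypothetical; assumes mod_rewrite is enabled):

```
# .htaccess: serve robots.txt from a logging script instead of a static file
RewriteEngine On
RewriteRule ^robots\.txt$ /robots.php [L]
```

The [L] (last) flag hands the request to /robots.php internally, so the bot still sees a normal robots.txt response while the script records who asked for it.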

Request speed would be another one, unless you could get the site owner to put tracking code in each page (of course, this doesn't capture requests for images etc.).

For meta tags you need a few pages linked together, so probably not too hard.

For status codes you would need to give the robot a URL to visit the first time, give it a 404 or 301 status the next time it requests it, and then measure how many times it still tries to request the original URL. The link to the page would also need to morph.
You might also want to have the default 404 page be a page that can do logging, to trap munging of links etc. On my site I have one that currently does that, although I'm still making improvements so that if it knows a page has moved it will redirect visitors to the new page.

Let me know your thoughts

Colin
 
 Re: Top 25 Global Spam Harvester List
Author: C.Dijkgraaf   (12 Mar 05 5:27pm)
Some more test and criteria ideas can be found on this site http://www.searchtools.com/test/index.html
 
 Re: Top 25 Global Spam Harvester List
Author: M.Prince   (14 Mar 05 3:38pm)
Cool! Thanks! We're currently building our own scoring system for harvesters. As we move forward and score robots generally, this is all extremely useful information.



