Reasonable threat score thresholds?

Author: F.Mertz (8 Jun 08 4:33am)

I'm tidying up a Perl module I've written to do http:BL queries, and would like to better understand threat scores so I can find a happy medium between false positives and missed threats.

For example, the documentation states that a score of 25 for a harvester is "not necessarily as threatening" as a 25 for a comment spammer. Is this because the slope of harvesters' scores is steeper than that of comment spammers' scores, or because comment spammers are perceived to be causing more harm?

What sorts of things must a comment spammer do to achieve a threat score of 25? Similarly, what must a harvester do to achieve that score?

If a given host is both a harvester and a comment spammer (4th octet = 6), how do we interpret the threat score?

I've just looked up an IP address that I found in some email spam I received, and found that it's merely "suspicious", has a threat score of 82, and was last seen 13 days ago. I'd guess that 82 is a high enough score that I might consider it likely to be a harvester bot, or at least some system whose activity is of no real use to me, so I might just as well hide my blog's comment form and trackback URLs from it. Is this a reasonable assumption?
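For readers following along, here is a minimal sketch of how such a lookup result breaks down. It assumes the answer format described in the http:BL API documentation (127.days.threat.type, with the type octet acting as a bitmask); the function name and the returned dictionary are illustrative only, and the sample answer is simply what the lookup described above would look like under that format.

```python
# Flag values for the 4th octet, as described in the http:BL API documentation.
SUSPICIOUS, HARVESTER, COMMENT_SPAMMER = 1, 2, 4

def decode_response(answer):
    """Split an http:BL answer such as '127.13.82.1' into its fields.

    Hypothetical helper, not part of any published module.
    """
    first, days, threat, type_bits = (int(part) for part in answer.split("."))
    if first != 127:
        raise ValueError("not a valid http:BL answer: " + answer)
    if type_bits == 0:
        # For search engines the 3rd octet identifies the engine, not a threat level.
        types = ["search engine"]
    else:
        names = [(SUSPICIOUS, "suspicious"),
                 (HARVESTER, "harvester"),
                 (COMMENT_SPAMMER, "comment spammer")]
        types = [name for bit, name in names if type_bits & bit]
    return {"days": days, "threat": threat, "types": types}

# "Suspicious", threat score 82, last seen 13 days ago:
print(decode_response("127.13.82.1"))
# 4th octet = 6 means harvester (2) plus comment spammer (4):
print(decode_response("127.1.25.6")["types"])
```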

Thanks in advance to any/all who can shed some light on this subject for me!

Re: Reasonable threat score thresholds?

Author: F.Mertz (12 Jun 08 5:50am)

Replying to myself, but partially for someone else's benefit: I've finished that Perl module and may soon upload it to CPAN for public consumption. I'd like to include in its documentation some information about determining sane thresholds... hint hint.

I've created a blosxom plugin with it, and will release that publicly after the module is on CPAN.

Re: Reasonable threat score thresholds?

Author: M.Prince (12 Jun 08 1:12pm)

First, thanks for the CPAN module. That's awesome. Send us the URL and we'll get it referenced on the http:BL implementations page.

Second, thresholds... that's tricky for a couple of reasons. The threat scores are on a logarithmic scale. We try to take into account all the information we have. So, for harvesters, we look at how many messages a harvest event results in. If it results in a lot, the threat score is higher. If it results in only a few, the threat score is lower. Since, currently, different information is taken into account for different types of bad IPs, the scores mean different things. If an IP falls into multiple buckets, we take that into account too, and it pushes the score up higher.

How's that for a non-answer? We're not trying to be evasive; it's more that we don't know. Different websites will make different decisions about what threat score is sufficient to trigger different actions. For example, a website may block visitors entirely when the threat score is above 50, but otherwise let everyone in and just hide its forms and email addresses from any IP address that has a threat score at all. We're as much seeking feedback on where people find the appropriate thresholds as anything.
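The two-tier example above could be sketched roughly as follows. The function name, the action labels, and the default cutoff of 50 are only illustrations of that example, not a recommendation from the Project.

```python
def action_for(threat_score, block_above=50):
    """Pick an action for one visitor.

    threat_score is None when the IP is not listed at all.
    Hypothetical policy: block above the cutoff, otherwise hide
    forms and email addresses from anything that is listed.
    """
    if threat_score is None:
        return "allow"
    if threat_score > block_above:
        return "block"
    return "hide_forms"

print(action_for(None))  # not listed
print(action_for(12))    # listed, low score
print(action_for(82))    # listed, high score
```

A site that never wants to block outright could pass a cutoff of 255, the largest value an octet can hold.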

All that said, we're trying to rationalize the threat scoring across the different IP types. We also need to make the threat scoring faster. Right now, calculating the scores for all the IPs takes forever because we take into account so many factors. Our new system that we're building is designed around treating IPs all the same and coming up with a more universal threat score across them.

Even with the new system the choice of what threat score is appropriate will be up to individual web sites. But, hopefully, we'll be able to gather up the experience of a number of websites and create a collective pool of knowledge on what works in different situations.

I know... still not answering your question. But hopefully something in this missive helps.

Thanks again for your work expanding http:BL! Please do give us feedback as you determine which threat scores work best for you.

Re: Reasonable threat score thresholds?

Author: M.Prince (12 Jun 08 1:18pm)

PS - something important that we've heard from a number of websites is that the ability to whitelist individual IPs is very helpful. We may decide that an IP address that is an out-bound proxy for, say, the UAE is threatening because a lot of bad stuff comes through it. But there's a lot of good traffic that comes through it too. Letting individual visitors to a website authenticate and whitelist themselves, or at least letting website admins whitelist individual IPs, makes the decision over where to set the threat score less critical. Hopefully your new module does this.

PS#2 - the other thing that's great is if the module can help people automatically insert links to a honey pot (whether their own honey pot or a QuickLink) on their existing web pages. That assists us in getting more data back into the Project from the people who are benefiting from it.

Re: Reasonable threat score thresholds?

Author: F.Mertz (12 Jun 08 6:06pm)

I wish I could say that my knowledge of the subject was meaningfully increased :-) I'm not going to allow my ignorance to slow me down, though.

If I were able to wrap my head around the threat score algorithm, I would incorporate some kind of configurable threshold calculation into the module that takes into consideration the type of threat, the threat score, and the days since last observation. Ideally it would use a logarithmic damping function and a switch one could set to varying degrees of aggression, maybe a scale from one to ten or just a few like aggressive, normal, passive, none. It's just a bit too dark for me to shoot in the direction of that noise, though. Maybe when the scores are normalized it'll be time to take this back up.
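Purely to make the idea concrete, a damped, preset-driven check might look something like the sketch below. Everything in it is an assumption: the damping formula, the damping constant, and the preset cutoffs are invented for illustration and have no basis in how http:BL actually computes or intends its scores.

```python
import math

# Hypothetical presets: effective score at or above which a visitor is flagged.
AGGRESSION_CUTOFFS = {"aggressive": 10, "normal": 25, "passive": 50}

def effective_score(threat, days, damping=0.5):
    """Shrink the raw threat score as the last sighting gets older.

    Uses a logarithmic damping term, so the first few days matter
    more than the difference between, say, day 100 and day 110.
    """
    return threat / (1.0 + damping * math.log1p(days))

def is_flagged(threat, days, aggression="normal"):
    """Return True if the damped score reaches the preset's cutoff."""
    if aggression == "none":
        return False
    return effective_score(threat, days) >= AGGRESSION_CUTOFFS[aggression]

print(is_flagged(82, 13, "normal"))
print(is_flagged(82, 13, "passive"))
```

Until the scores are normalized across threat types, a single set of cutoffs like this would treat a harvester's 25 and a comment spammer's 25 as equivalent, which the documentation says they are not.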

At present, the module offers the convenience of setting thresholds for each threat type independently, setting a global threshold, mixing and matching the two, or just running without thresholds (all default to zero). It also has a single, optional threshold for the days since last observation, which I may break out into a global threshold and three per-type thresholds, the same way the threat score thresholds are implemented. In this way the module can be configured to return a binary threat indication for simplicity, or can be used just as a data source for later processing. Or both, I suppose, but I can't see any point to doing additional processing if the necessary logic is provided by the configuration.
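A rough sketch of that configuration scheme, written in Python rather than Perl and not taken from the module itself: the function, its parameter names, and the rule that a multi-type IP is flagged if any one of its types meets its threshold are all assumptions made for illustration.

```python
# Bit values of the http:BL type octet.
TYPE_BITS = {"suspicious": 1, "harvester": 2, "comment_spammer": 4}

def is_threat(days, threat, type_bits,
              thresholds=None, global_threshold=0, max_days=None):
    """Return True when a listed IP meets the configured thresholds.

    thresholds       per-type overrides, e.g. {"harvester": 50}
    global_threshold used for any type without an override
    max_days         optional cap on days since last observation
    """
    thresholds = thresholds or {}
    if max_days is not None and days > max_days:
        return False  # last sighting is too old to act on
    for name, bit in TYPE_BITS.items():
        if type_bits & bit and threat >= thresholds.get(name, global_threshold):
            return True
    return False

# With everything left at zero, any listed IP is reported as a threat.
print(is_threat(13, 82, 1))
# Mixed per-type thresholds for a harvester + comment spammer (type 6).
print(is_threat(13, 30, 6, thresholds={"harvester": 50, "comment_spammer": 25}))
```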

Of course, any time a lookup is performed that returns anything other than NXDOMAIN, the module will provide the threat score, threat type, and days since last observation via object method calls.
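For context, the lookup itself is an ordinary DNS query. The sketch below assumes the query format from the http:BL API documentation (access key, then the reversed IPv4 octets, then the dnsbl.httpbl.org zone); the function names are made up, and the access key shown is a placeholder rather than a real key.

```python
import socket

def query_name(access_key, ip, zone="dnsbl.httpbl.org"):
    """Build the DNS name for an http:BL lookup of an IPv4 address."""
    reversed_ip = ".".join(reversed(ip.split(".")))
    return "{}.{}.{}".format(access_key, reversed_ip, zone)

def lookup(access_key, ip):
    """Return the raw answer (e.g. '127.13.82.1'), or None if nothing came back."""
    try:
        return socket.gethostbyname(query_name(access_key, ip))
    except socket.gaierror:
        # NXDOMAIN means the IP is not listed. Other resolver failures
        # also end up here and are treated the same way in this sketch.
        return None

# Placeholder key; a real one is issued by Project Honey Pot.
print(query_name("abcdefghijkl", "192.0.2.10"))
```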

The ability to do local whitelisting/blacklisting will be provided in the module by a callback hook, so the developer is not forced into any specific implementation and is free to provide or not provide a callback. In this way it's up to the developer to determine where his whitelist data comes from, be it a text file, a database, a DNS lookup, or some other magic. I intend to incorporate a couple of reference implementations into the module package before it goes off to CPAN.
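The callback idea might be sketched like this, again in Python rather than Perl. The names and the three-way return convention (True, False, or None for "no opinion") are assumptions for illustration, not the module's actual interface.

```python
def check_ip(ip, lookup, override=None):
    """Return True if the visitor should be treated as a threat.

    lookup(ip)   -> True/False from the http:BL decision logic
    override(ip) -> True (blacklisted), False (whitelisted),
                    or None (no local opinion, fall through to lookup)
    """
    if override is not None:
        verdict = override(ip)
        if verdict is not None:
            return verdict
    return lookup(ip)

# A reference-style callback backed by a plain set; the data could
# just as well come from a text file, a database, or a DNS lookup.
WHITELIST = {"192.0.2.10"}

def whitelist_override(ip):
    return False if ip in WHITELIST else None

def always_listed(ip):
    """Stand-in for the real lookup so the example needs no network."""
    return True

print(check_ip("192.0.2.10", always_listed, whitelist_override))
print(check_ip("192.0.2.99", always_listed, whitelist_override))
```

Consulting the callback before the lookup also saves a DNS query for every locally listed address.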

Because the primary audience of a CPAN module is developers, it seems a bit pointless to incorporate QuickLinks into the module. In the MVC context, this module is in the Controller while QuickLinks would be something better suited to the View. That said, I can and will incorporate QuickLinks into the blosxom plugin I've written before it goes out for public consumption. (Never mind that blosxom is in decline...)

Thanks for shedding what light you can on the matter, and for the module functionality suggestions. I'll follow up in this thread as things progress.