Auto-Filtering for mail applications?

Author: N.Jackson (23 Jan 05 2:59pm)

As a user of SpamBayes for Outlook (and POPFile for a couple of other mail apps), I can see an excellent opportunity here for Project Honey Pot (PHP) to automatically generate a very accurate corpus of 'bad' emails. Could a POP mail relay or an Outlook/Thunderbird plugin be developed which:

a. Automatically downloads the latest known spam corpus from PHP and

b. Relays back to PHP the exact sender domains of received emails which are known to be spam.

Surely it would help to get information on spam that is reaching legitimate mailboxes, as well as spam that is only sent to harvested addresses. All it takes is for a harvester to make one slip-up in avoiding honey pots (a domain not previously recognised, for example) and suddenly anybody using the plugin can tell whether their address has been harvested: they receive spam identical to the message that reached the honey pot.

That, and it would be great never to receive any piece of 'known' spam.
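
To make steps a and b above a little more concrete, here is a minimal Python sketch of the client-side logic only. Everything in it is an assumption for illustration: the thread does not describe any download or reporting API, so the corpus is represented by a hard-coded set (`KNOWN_SPAM_DOMAINS`) and the function names are hypothetical. It also trusts the From: header, which is easily forged, so a real plugin would likely need a stronger signal.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical stand-in for a downloaded corpus of known spam sender domains.
# No actual download endpoint is described in this thread.
KNOWN_SPAM_DOMAINS = {"spam-example.test", "bulk-mailer.test"}


def sender_domain(raw_message: str) -> str:
    """Return the lower-cased domain of the From: address, or '' if absent."""
    msg = message_from_string(raw_message)
    _, address = parseaddr(msg.get("From", ""))
    # Note: the From: header can be forged; this is only a sketch.
    return address.rpartition("@")[2].lower() if "@" in address else ""


def is_known_spam(raw_message: str, known_domains=KNOWN_SPAM_DOMAINS) -> bool:
    """Step a: check a received message against the (assumed) corpus."""
    domain = sender_domain(raw_message)
    return domain != "" and domain in known_domains


def build_report(raw_messages) -> list:
    """Step b: collect the distinct sender domains of messages marked as spam."""
    return sorted({d for d in map(sender_domain, raw_messages) if d})
```

A plugin along these lines would call `is_known_spam` on each incoming message and periodically send the result of `build_report` back to the project, by whatever transport the project chose to offer.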

Re: Auto-Filtering for mail applications?

Author: M.Prince (23 Jan 05 6:23pm)

I think that's a great idea. We've talked with the author of DSPAM, a well-regarded Bayesian filtering system, about potentially trying to do something like this. We're happy to work with other developers and will soon be releasing our first corpus of the spam messages we've received so far, to help anyone studying the problem.

My only concern is that it may invite abuse. For example, if it became common knowledge that we were doing this, you could imagine a spammer installing a honey pot, pulling down some addresses from it, and intentionally sending legitimate mail to those addresses, effectively poisoning the well. If this concern could be overcome then absolutely, I think it could be a great way to pre-seed Bayesian filters with a corpus of known spam.

The other potential problem may be that we're getting a ton of phishing messages at our honey pots. These phish messages are generally very similar to legitimate bank messages, but with just one or two links changed. I'd want to make sure the Bayesian filter was smart enough that it could pick up the subtle difference, and not just start blocking every message from PayPal.

But we're definitely open to exploring ways to work with filtering companies. If there's some way our data can benefit you, please contact us and we'll see if we can help!

Thanks for the suggestion.
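
On the phishing concern raised above, one way a filter could see the "one or two links changed" is to treat each link's host name as its own token. The Python sketch below is only an illustration under that assumption; it is not how SpamBayes or DSPAM actually tokenise mail, and the `url-host:` prefix, the function name, and the sample messages are all made up for the example.

```python
import re
from urllib.parse import urlparse

# Deliberately simple URL pattern for the sketch; real mail needs more care.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def tokenize(body: str) -> set:
    """Split a body into word tokens plus one 'url-host:' token per link."""
    tokens = set()
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname
        if host:
            tokens.add("url-host:" + host.lower())
    # Remove the URLs so their pieces are not also counted as plain words.
    text_only = URL_RE.sub(" ", body)
    tokens.update(w.lower() for w in re.findall(r"[A-Za-z']+", text_only))
    return tokens


# Hypothetical messages: identical wording, different link host.
legit = "Please log in at https://www.paypal.com/account to review your payment."
phish = "Please log in at http://paypal.example.test/account to review your payment."

# Tokens present in one message but not the other.
difference = tokenize(legit) ^ tokenize(phish)
```

Here the two messages share every word token and differ only in their `url-host:` tokens, so a Bayesian filter trained on such tokens has something to separate the phish from the genuine message instead of learning to block both.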