-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Simon Waters wrote:
> > Wetware rules!
>
> Nah, we are only good at spotting the obvious spams, and the computers do that much quicker anyway. I suspect we are only doing better because the spam filters tend to ignore what they don't understand, whereas we count some of it against the mail.
There is a commonly held view that an automatic spam filter shouldn't produce any false positives. This generally stops people from implementing spam filters that generalise a great deal. Such filters are much better at classifying spam, but could misclassify some ham.

There is also a common view that an automatic spam filter should perform as a human does when identifying spam, since we are good at it. It has been proposed that this idea doesn't agree with the first: how many times have you accidentally deleted a good mail in a flurry of bashing "del"?

A recent MIT spam conference discussed this. Those present seemed to think that we need to change our ideas about what is acceptable performance from spam filters. I'm sceptical whether people will start looking at false positives as acceptable in order to get a spam filter that generalises well. We shall see how things develop.
> For example, quite a lot of the spam getting past SpamAssassin deliberately misspells all the obvious keywords. Well, I spot "Vaigra" and hit delete. Since quick and effective spell-checking tools exist, I dare say this is a class of spam we could kill if anyone cared enough to code it.
Perfectly feasible stuff; in fact, I looked at doing this. The trouble is getting the balance right between catching a few more spam mails and taking longer to work out what features to classify on. SpamAssassin, with its growing ruleset, is already a monster when it comes to feature extraction times. A recent paper put it at 1784s per 1000 messages on an Athlon XP 1800. It doesn't score much better than methods like mine that consider only 200 words and how often they occur in each mail. There is a huge difference in speed, however.
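To make the idea concrete, here is a rough sketch of how misspelt keywords could be folded back onto a fixed word list before counting. It is only an illustration, not the code from my filter or anything SpamAssassin does: the five-word VOCABULARY, the 0.8 cutoff and the function names are all made up for the example, and the fuzzy matching uses difflib from the Python standard library rather than a real spell checker.

```python
import difflib
import re
from collections import Counter

# Hypothetical word list; a real filter would pick its words (e.g. 200 of
# them) from training mail rather than hard-coding them.
VOCABULARY = ["viagra", "mortgage", "lottery", "meeting", "kernel"]


def normalise(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the lowercased token if none is close."""
    token = token.lower()
    # get_close_matches ranks candidates by SequenceMatcher similarity and
    # drops anything scoring below the cutoff.
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def count_features(text, vocabulary=VOCABULARY):
    """Count occurrences of each vocabulary word after normalising misspellings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(normalise(t, vocabulary) for t in tokens)
    # One count per vocabulary word, in vocabulary order.
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    print(normalise("Vaigra"))
    print(count_features("Buy Vaigra now, vaigra is cheap"))
```

Note that every token is compared against every vocabulary word, so the extra catch comes at the price of slower feature extraction, which is exactly the balance mentioned above.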
> > So to solve the spam problem, first, solve the AI Problem.
>
> Nope, you can pretty much solve the spam problem today by checking that the sender is known to you. It's crude, but even OE gives you a button to do it.
A quote from Paul Graham's "A Plan for Spam":

"The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognises their messages, there is no way they can get around that."

I tend to agree with this. Yes, methods such as signing mail can stop things dead, but they aren't practical for all. Spamming is a business at the end of the day. If you can come up with something that filters well on content, then the message doesn't get through. If the message doesn't get through, then the spammer makes no profit.

An interesting paper comparing spam filtering techniques can be found at:
http://nexp.cs.pdx.edu/twiki-psam/pub/PSAM/PsamDocumentation/spam.pdf
It was part of a USENIX conference so it's quite readable.

- -- 
Dave Trudgian - Cornish Dave
- ----------------------------
[w] www.trudgian.net
[e] dave@xxxxxxxxxxxx
[j] trudgiad@xxxxxxxxxxxxxxx
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFAfTBct+PdOLWW6O4RAhe6AJ9OFFVkFzO16VtNe8pujZ5bA8EneACfUvWQ
5yw9c8zA0rWQBfeTCYhI23I=
=ptDz
-----END PGP SIGNATURE-----

-- 
The Mailing List for the Devon & Cornwall LUG
Mail majordomo@xxxxxxxxxxxx with "unsubscribe list" in
the message body to unsubscribe.