Filtering Spam With Lamson
Lamson has initial support for filtering spam with the SpamBayes spam filter library. What Lamson provides is a set of easy-to-use decorators that you attach to your state functions to indicate that you want spam filtered. It also uses the standard SpamBayes configuration file and database formats, so if you have an existing SpamBayes setup you should be able to use it right away.
Using lamson.spam
Lamson gives you a simple decorator to place on any state functions that should block spam. Typically you do not want spam filtering on your entire application, since that would prevent legitimate registrations and put too much burden on your system. It’s better to put spam filtering on the “insider” parts, and to have confirmation emails on “outsider” pieces.
There will be a simple handler you can register for the cases where you do want to filter everything, but none is available right now.
Instead, mark your “choke points” as spam-filtered using lamson.spam.spam_filter, so that when spam is received the sender is put into a “spam black hole”.
Here’s a trivial example where the user is in the POSTING state, and you want everything to work like normal, but if they spam then you flip them into a SPAMMING state.
from lamson.routing import route
from lamson.spam import spam_filter

@route(".+")
def SPAMMING(message):
    # the spam black hole
    pass

@route("(anything)@(host)", anything=".+", host=".+")
@spam_filter("run/spamdb", "run/.hammierc", "run/spam", next_state=SPAMMING)
def POSTING(message, **kw):
    print "Ham message received."
    ...
The line to look at is the spam_filter line, which tells Lamson to:
- Use the SpamBayes training database run/spamdb for the detection.
- Use the SpamBayes run/.hammierc file for your config (optional and ignored if it is not there).
- Use run/spam as the dumping ground for anything classified as spam.
- Transition to next_state if they send a spam message. This is optional, but very helpful.
With this, the spam_filter then wraps your state function, and every message is fed to SpamBayes. If SpamBayes says it’s spam then Lamson will dump it into your run/spam and transition to SPAMMING without running your POSTING state.
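The wrapping behavior described above can be sketched without Lamson or SpamBayes installed. This is only an illustration of the flow, not Lamson's implementation: sketch_spam_filter, looks_like_spam, the list used as a queue, and the states dictionary are all hypothetical stand-ins for the real decorator, the SpamBayes classifier, the run/spam queue, and Lamson's routing state.

```python
from functools import wraps

def sketch_spam_filter(is_spam, spam_queue, states, next_state=None):
    """Hypothetical stand-in for a spam-filtering decorator.

    is_spam    -- callable returning True for spam (stands in for SpamBayes)
    spam_queue -- list collecting rejected mail (stands in for run/spam)
    states     -- dict of sender -> handler (stands in for routing state)
    next_state -- handler the sender is moved to after sending spam
    """
    def decorator(handler):
        @wraps(handler)
        def wrapper(message, *args, **kw):
            if is_spam(message):
                # Dump the message and flip the sender; the handler never runs.
                spam_queue.append(message)
                if next_state is not None:
                    states[message["from"]] = next_state
                return None
            return handler(message, *args, **kw)
        return wrapper
    return decorator

spam_queue, states = [], {}

def looks_like_spam(message):
    # Toy rule for the demo only; a real setup would ask SpamBayes.
    return "viagra" in message["body"].lower()

def SPAMMING(message):
    # the spam black hole
    return None

@sketch_spam_filter(looks_like_spam, spam_queue, states, next_state=SPAMMING)
def POSTING(message):
    return "posted"

ham = {"from": "a@example.com", "body": "Hello list"}
junk = {"from": "b@example.com", "body": "Cheap VIAGRA"}
result_ham = POSTING(ham)
result_junk = POSTING(junk)
```

The point of the design is that the decision happens before your state function is called, so POSTING itself never needs to know about spam at all.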
Once you are in this new SPAMMING state (or any state you like) you can do whatever you want. You can leave them there, or you can have an external tool that lets you un-block someone. Pretty much any spam handling scheme you want is available.
Since your spam is placed into a queue you can inspect it later and check for any accidentally miscategorized mail, then use the SpamBayes tools to retrain for the misdetection.
Lamson only treats a message as spam when SpamBayes marks it as actual spam: it looks at the X-Spambayes-Classification header and checks whether the value starts with 'spam'. If the value is 'unsure' or 'ham', the message is let through.
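If you want to apply the same rule to mail sitting in a queue or mailbox, the check is easy to reproduce with the standard library. This is a minimal sketch, not code from Lamson: is_marked_spam and make_message are hypothetical helper names, and the sample header values are only illustrative.

```python
from email.message import Message

def is_marked_spam(message):
    """Return True only when the classification header starts with 'spam'."""
    classification = message.get("X-Spambayes-Classification", "")
    return classification.startswith("spam")

def make_message(classification=None):
    # Hypothetical helper that builds a tiny test message.
    msg = Message()
    msg["Subject"] = "test"
    if classification is not None:
        msg["X-Spambayes-Classification"] = classification
    return msg

spam_verdict = is_marked_spam(make_message("spam; 0.99"))
unsure_verdict = is_marked_spam(make_message("unsure; 0.50"))
ham_verdict = is_marked_spam(make_message("ham; 0.01"))
missing_verdict = is_marked_spam(make_message())
```

Note that a message with no classification header at all is let through, the same as 'unsure' and 'ham'.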
Effectiveness
I’ve been running a variant of this since the middle of May 2009 and it works great. The code I run is a custom version that fits the weirdness of my email setup but the principles are the same. I’m currently using the above spam filtering, some gray listing, and a few other tricks to block most of my incoming spam.
With all the spam block measures I’ve managed to cut down my spam to about 2-3 a day out of about 100-200 I receive. The majority of the “spam” that gets through is actually email that’s classified as “unsure” which I then use to retrain SpamBayes to make it stronger.
However, that’s my personal server, so in the case of a Lamson application you’ll want to be careful that your spam blocking activities don’t prevent too much legitimate use.
Changing What “Spam” Means
You can also change how spam is determined by sub-classing lamson.spam.spam_filter and doing your own implementation of the spam method.
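As a rough illustration of that idea, the sketch below overrides a spam method so that 'unsure' mail is blocked as well. It does not import Lamson: StandInSpamFilter is a hypothetical stand-in for lamson.spam.spam_filter, and the assumption that spam takes a message and returns a boolean should be checked against the Lamson source before you rely on it.

```python
from email.message import Message

HEADER = "X-Spambayes-Classification"

class StandInSpamFilter(object):
    """Hypothetical stand-in base class; NOT lamson.spam.spam_filter."""

    def spam(self, message):
        # Default policy: only 'spam' counts, 'unsure' and 'ham' pass.
        return message.get(HEADER, "").startswith("spam")

class StrictSpamFilter(StandInSpamFilter):
    """Changes what "spam" means: 'unsure' mail is blocked too."""

    def spam(self, message):
        return message.get(HEADER, "").startswith(("spam", "unsure"))

unsure = Message()
unsure[HEADER] = "unsure; 0.50"  # illustrative header value

default_verdict = StandInSpamFilter().spam(unsure)
strict_verdict = StrictSpamFilter().spam(unsure)
```

A stricter policy like this one blocks more junk but will also catch more legitimate mail, so it fits best on the “insider” parts of an application.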
Using SpamBayes
An important point about SpamBayes is that it comes with all the command line tools you need to configure and train your database using a corpus of spam you might have. All Lamson needs to do is read this database to determine if it is spam or not.
With mutt, I save the message to “=spam”, which places the spam in Mail/spam along with all of the others. Then I run this command:
sb_mboxtrain.py -s ~/Mail/spam -d run/spamdb
This goes through the spam mailbox, and any emails that SpamBayes has not already classified get used for training.
SpamBayes comes with other commands you can read about on their site (if you can find it).
Autotraining
Lamson doesn’t support “autotraining” directly, since it’s not clear in each situation what is obviously spam. In my personal setup I know that any email not for registered users is obviously spam, so I can autotrain those.
If you want to implement autotraining for a part of your application, then look at the API for lamson.spam.Filter and simply use it in the right state function.