Writing your own spam filter

Written by Bjarke Viksoe.
This article was submitted 9/1/2013.

Just a few months after the transition to Windows 8, spam started to flood my main e-mail account. My e-mail host only provides a very basic spam filter, which just marks the most incriminating e-mails with a tag in the subject field. So until then I had relied on Microsoft Outlook Express and its ability to detect and deal with spam. Even its simple ability to mark certain domains as untrusted was extremely useful.

And all this was lost with Windows 8, and its puny built-in mail app.
Sure, I could install a third-party e-mail application, or even the Microsoft Desktop Live whatever thing, but why even provide an e-mail tool that is so horribly featureless compared to what Windows had built in before?

Anyway, I set out to build my own spam filter.
Thinking it would be terribly difficult to deal with e-mail protocols and spam detection (language classification) algorithms, I looked for existing libraries to help me get started, and found what I needed in the Node.js framework and its plentiful extension library.

Here is a step-by-step overview of what it took to write a custom spam filter:

1. Install Node.js

Download and install the Node.js framework.
Node.js is based on the Google Chrome V8 JavaScript engine, allowing you to leverage your skills in JavaScript scripting. I've found JavaScript to be much more powerful now than my initial judgment a few years ago. It's really grown a lot, although I still hate all the clumsy attempts to add a more advanced Object Oriented methodology to it. Just keep it to plain prototypal inheritance if any, please.

2. Download cool extensions

Node.js comes with a package manager. So once installed, create a new project folder and run the following commands in a prompt.

CD C:\Myproject
npm install inbox
npm install step
npm install bayes

These are the extensions needed to access the mail account using the IMAP protocol (inbox), to ease the hardship of asynchronous programming in JavaScript (step), and to classify spam mails (bayes).

Actually, you can find plenty of other extensions with similar and better functionality on GitHub. The ones I chose here seem to fit my needs as a very inexperienced Node.js developer.

3. Step through the inbox

The inbox extension allows you to access an e-mail account using the IMAP protocol. It supports authentication with regular SSL encryption, as well as XOAuth for Google Gmail. In writing your code, you could end up with a series of nested callback functions, so to alleviate the async nature of the library, we'll use the step extension to serialize and group all the asynchronous result callbacks.

The step extension works like this:

Without step:
client.openMailbox("INBOX", function(err, info) {
  client.listMessages(-20, function(err, more) {
    ... more nested stuff
  })
});

And with step:
step(
  function() {
    client.openMailbox("INBOX", this);
  },
  function(err, info) {
    client.listMessages(-20, this);
  },
  ... not so nested stuff
);

So, that's already the beginning of the code: We open the Inbox folder and fetch the last 20 messages.
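
To see why passing this as the callback chains the functions together, here is a minimal sketch of the idea behind step. It is not the library's actual code: miniStep, fakeOpenMailbox and fakeListMessages are hypothetical names made up for this illustration, and grouping, parallel calls and error handling are left out.

```javascript
// Illustrative only: a stripped-down serializer in the spirit of step.
// miniStep is a hypothetical name, not part of the real step package.
function miniStep() {
  var steps = Array.prototype.slice.call(arguments);

  // "next" is what each step sees as "this". Passing it as a callback
  // runs the following step with whatever arguments the callback receives.
  function next() {
    var fn = steps.shift();
    if (!fn) return;              // no more steps to run
    fn.apply(next, arguments);
  }
  next();
}

// Stand-ins for asynchronous mail calls, so the sketch runs without a server.
function fakeOpenMailbox(name, callback) {
  setTimeout(function () { callback(null, { name: name }); }, 0);
}
function fakeListMessages(count, callback) {
  setTimeout(function () { callback(null, ["first", "second"]); }, 0);
}

miniStep(
  function () {
    fakeOpenMailbox("INBOX", this);
  },
  function (err, info) {
    fakeListMessages(-20, this);
  },
  function (err, messages) {
    console.log("Fetched " + messages.length + " messages");
  }
);
```

Each function hands its result to the next one, so the code reads top to bottom even though every call is asynchronous.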

4. Fight the spam

Next up, we try to detect which of the newly received e-mails are spam.

  function(err, messages) {
    var group = this.group();
    messages.forEach(function(message) {
      // Only look at unread messages (flags is an array of IMAP flag strings)
      if (message.flags.indexOf("\\Seen") < 0) {

        var isSpam = false;

        if (message.title.indexOf("[SPAM WARNING]") >= 0) isSpam = true;

        var fromFilter = [ "@126.com", "@163.com" ];
        fromFilter.forEach(function(k) {
           if (message.from.address.indexOf(k) >= 0) isSpam = true;
        });

        var toFilter = [ ".cn" ];
        toFilter.forEach(function(k) {
          if (message.to[0].address.indexOf(k) >= 0) isSpam = true;
        });

        if (detectAsianLanguage(message)) isSpam = true;

        if (classifier.categorize(message) === "spam") isSpam = true;

        var whiteFilter = [ "family.dk" ];
        whiteFilter.forEach(function(k) {
          if (message.from.address.indexOf(k) >= 0) isSpam = false;
        });

        if (isSpam) {
          client.moveMessage(message.UID, "INBOX.Spam", group());
        }
      }
    });
  },

OK, I should probably explain that my problems with spam mostly come from Chinese e-mail hosts, in particular a few big-player hosts. It turns out that if I can just prevent Chinese e-mails from arriving in my mailbox, 90% of all spam goes away.

Consequently, in its simplest form, the spam algorithm just triggers on certain sender addresses and throws those e-mails into the Spam folder. It's cruel, but it works.

Next, we try to detect e-mails that actually contain Asian text / glyphs. A lot of the spam I receive is written entirely in Chinese, a language I'm not really empowered to read. My little detectAsianLanguage() function simply scans all the characters in the subject text, and triggers on any character that falls into the Han ideograph Unicode range.
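
The function itself is not listed in this article, so the following is only one possible sketch of it. It assumes the subject is available as message.title (as in the filter code above) and that "Han ideographs" means the two main CJK Unified Ideographs blocks; the exact ranges checked by the original may differ.

```javascript
// One possible implementation; the article's own function is not shown.
// Assumes the subject text lives in message.title.
function detectAsianLanguage(message) {
  var subject = (message && message.title) || "";
  for (var i = 0; i < subject.length; i++) {
    var code = subject.charCodeAt(i);
    // CJK Unified Ideographs Extension A: U+3400..U+4DBF
    // CJK Unified Ideographs:             U+4E00..U+9FFF
    if ((code >= 0x3400 && code <= 0x4DBF) ||
        (code >= 0x4E00 && code <= 0x9FFF)) {
      return true;
    }
  }
  return false;
}
```

Note that these ranges also cover the kanji used in Japanese text, so a check like this cannot tell the two languages apart. Accented Latin letters such as å or ø fall far below the ranges and do not trigger it.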

Finally, we'll analyze the e-mail message text and determine if it looks like spam.

5. Feed spam to the Naive

You might have noticed one line in the code above...

if (classifier.categorize(message) === "spam") isSpam = true;

This determines if the e-mail is spam or not, based on the e-mail content. It uses an algorithm called a Naive Bayes classifier, a common spam-fighting tool that determines (with some probability) whether an e-mail is to be treated as spam.

To make it work, we'll need to train the Bayes classifier. This is done at the start of the application. Like this:

var bayes = require('bayes');

var classifier = bayes();
classifier.learn('sale', 'spam');
classifier.learn('cheap', 'spam');
classifier.learn('coupon', 'spam');
classifier.learn('undelivered mail', 'spam');
//
classifier.learn('reply to mail', 'mail');
classifier.learn('status of project', 'mail');
classifier.learn('hello', 'mail');

This sets up two categories for the Bayes engine to choose from. One contains possible spam sentences, phrases and word lists, which are weighted against a white list. The bayes extension returns its verdict as either "spam" or "mail". The sample here is a little simplistic, but with a larger training set I actually completely got rid of my spam problems. The Bayes classifier is part of a large Natural Language Processing toolset, and it works better when trained on many complete spam e-mails.
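
To make the idea less of a black box, here is a hand-rolled sketch of the general technique: count words per category while learning, then pick the category with the highest log-probability, using add-one smoothing for unseen words. TinyBayes is a hypothetical name and this is not the code of the bayes extension, which may tokenize and weigh words differently.

```javascript
// Illustrative Naive Bayes text classifier with add-one (Laplace) smoothing.
// TinyBayes is a made-up name; this only mirrors the general technique.
function TinyBayes() {
  this.docCount = Object.create(null);    // category -> number of training texts
  this.wordCount = Object.create(null);   // category -> { word -> occurrences }
  this.totalWords = Object.create(null);  // category -> total word occurrences
  this.vocabulary = Object.create(null);  // every distinct word seen so far
  this.totalDocs = 0;
}

// Very naive tokenizer: lowercase, keep only runs of a-z and 0-9.
TinyBayes.prototype.tokenize = function (text) {
  return String(text).toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
};

TinyBayes.prototype.learn = function (text, category) {
  if (!(category in this.docCount)) {
    this.docCount[category] = 0;
    this.wordCount[category] = Object.create(null);
    this.totalWords[category] = 0;
  }
  this.docCount[category]++;
  this.totalDocs++;
  var self = this;
  this.tokenize(text).forEach(function (word) {
    self.wordCount[category][word] = (self.wordCount[category][word] || 0) + 1;
    self.totalWords[category]++;
    self.vocabulary[word] = true;
  });
};

TinyBayes.prototype.categorize = function (text) {
  var words = this.tokenize(text);
  var vocabSize = Object.keys(this.vocabulary).length;
  var best = null, bestScore = -Infinity, self = this;
  Object.keys(this.docCount).forEach(function (category) {
    // log P(category) + sum over words of log P(word | category)
    var score = Math.log(self.docCount[category] / self.totalDocs);
    words.forEach(function (word) {
      var count = self.wordCount[category][word] || 0;
      score += Math.log((count + 1) / (self.totalWords[category] + vocabSize));
    });
    if (score > bestScore) { bestScore = score; best = category; }
  });
  return best;   // null if nothing has been learned yet
};

var demo = new TinyBayes();
demo.learn("cheap sale coupon", "spam");
demo.learn("status of project", "mail");
console.log(demo.categorize("cheap coupon inside"));   // prints "spam"
```

Working in logarithms avoids multiplying many tiny probabilities together, and the smoothing keeps a single unseen word from forcing a category's score to zero.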

6. Wipe'em

The last bit of code opens the folder named Spam on the e-mail account and deletes all mails older than X days. The code to do this is very similar to steps 3 and 4, so I won't repeat it.
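
The only new part compared to the earlier steps is the age check, which could be sketched as below. This is an assumption-laden illustration: it presumes each listed message carries a date field next to its UID, and isOlderThan and findExpired are hypothetical helper names.

```javascript
// Sketch of the age check only. Assumes each listed message has a "date"
// (a Date or a parseable date string) and a "UID"; both are assumptions.
var MS_PER_DAY = 24 * 60 * 60 * 1000;

function isOlderThan(message, days, now) {
  var received = new Date(message.date).getTime();
  if (isNaN(received)) return false;   // unknown date: keep the message
  return (now.getTime() - received) > days * MS_PER_DAY;
}

// Returns the UIDs of all messages older than "days" days.
function findExpired(messages, days, now) {
  var reference = now || new Date();
  return messages
    .filter(function (m) { return isOlderThan(m, days, reference); })
    .map(function (m) { return m.UID; });
}
```

Inside a step function, each returned UID would then be handed to the mail client's delete call with group() as the callback, the same way moveMessage is used in step 4.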

To run the final app, you just need to run the file with node:

C:\Myproject> node app.js <IMAP host> <email address> <password>
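
Picking up those three arguments inside app.js could look roughly like this. It is a sketch only: parseArgs and the field names host, user and password are hypothetical and not taken from the downloadable source.

```javascript
// Sketch: read <IMAP host> <email address> <password> from the command line.
// parseArgs and the returned field names are hypothetical.
function parseArgs(argv) {
  // argv[0] is the node binary and argv[1] the script path.
  var args = argv.slice(2);
  if (args.length < 3) {
    throw new Error("Usage: node app.js <IMAP host> <email address> <password>");
  }
  return { host: args[0], user: args[1], password: args[2] };
}

// In app.js this would be called as:
//   var config = parseArgs(process.argv);
```

Keep in mind that a password given on the command line can end up in the shell history, so reading it from an environment variable or a file may be the safer choice.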

You'll need to run the tool periodically. You can probably find some clever way of doing this. I ended up buying one of these things, sniffing out its internal USB protocol (since Windows 8 64-bit wasn't supported), and writing a small utility that, via the Windows USB HID driver, sent commands to make it turn into a Flashing Knobs gizmo whenever new non-spam mails arrived in my mailbox. Oh, the fun...
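
A less adventurous way to run the filter periodically is to keep the Node.js process alive and repeat the check on a timer. The sketch below assumes the whole filter is wrapped in a function, here called checkMail, which is a hypothetical name just like runEvery.

```javascript
// Sketch: run a task immediately and then again every "minutes" minutes.
// runEvery and checkMail are hypothetical names.
function runEvery(minutes, task) {
  task();
  return setInterval(task, minutes * 60 * 1000);
}

// Usage, assuming checkMail() performs one pass over the inbox:
//   var timer = runEvery(15, checkMail);
//   clearInterval(timer);   // stops the repetition
```

Alternatively, the Windows Task Scheduler can start node app.js at fixed intervals, which avoids having a process running all the time.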

Installation Requirements

Node.js 0.10

Installation Guide

Install Node.js on the machine.
Extract the ZIP file to a temporary folder.
Run npm install from a prompt in that folder.

Download Files

Source code (3 Kb)