And all this was lost with Windows 8, and its puny built-in mail app.
Sure, I could install a third-party e-mail application, or even the Microsoft Desktop
Live whatever thing, but why even provide an e-mail tool that is so horribly featureless
compared to what Windows had built-in before?
Anyway, I set out to build my own spam filter.
Thinking it would be terribly difficult dealing with e-mail protocols and spam
detection (language classification) algorithms, I looked for existing
libraries to help me get started,
and found what I needed in the Node.js framework and its plentiful
extension library.
Here is a step-by-step overview of what it took to write a custom spam filter:
1. Install Node.js
Download and install the Node.js framework. Node.js is based on the Google Chrome V8 JavaScript engine, allowing you to leverage your skills in JavaScript scripting. I've found JavaScript to be much more powerful now than I judged it to be a few years ago. It's really grown a lot, although I still hate all the clumsy attempts to add a more advanced Object Oriented methodology to it. Just keep it to plain prototypal inheritance if any, please.
2. Download cool extensions
Node.js comes with a package manager. So once installed, create a new project folder and run the following commands in a prompt.
CD C:\Myproject
npm install inbox
npm install step
npm install bayes
These are the extensions needed to access the mail account using
the IMAP protocol
(inbox),
to ease the hardship of asynchronous programming in JavaScript
(step),
and to classify spam mails
(bayes).
Actually, you can find plenty of other extensions with similar and better functionality on GitHub. The ones I chose here seem to fit my needs as a very inexperienced Node.js developer.
3. Step through the inbox
The inbox extension allows you to access an e-mail account using the IMAP protocol. It supports authentication with regular SSL encryption, as well as XOAuth for Google Gmail. In writing your code, you could end up with a series of nested callback functions, so to tame the async nature of the library, we'll use the step extension to serialize and group all the asynchronous result callbacks.
The step extension works like this:
Without step:
client.openMailbox("INBOX", function(err, info) {
    client.listMessages(-20, function(err, more) {
        ... more nested stuff
    });
});
And with step:
step(
    function() {
        client.openMailbox("INBOX", this);
    },
    function(err, info) {
        client.listMessages(-20, this);
    },
    ... not so nested stuff
);
So, that's already the beginning of the code: We open the Inbox folder and fetch the last 20 messages.
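To make the idea behind step a bit more concrete, here is a rough sketch of how a helper like it can chain callbacks. This is not the step extension's source code: miniStep, fakeOpenMailbox and fakeListMessages are hypothetical stand-ins invented for the illustration, and the fake calls answer immediately instead of talking to a mail server.

```javascript
// miniStep: hypothetical, simplified stand-in for a step-style helper.
// Each function is called with `this` set to a callback; invoking that
// callback runs the next function with the callback's arguments.
function miniStep() {
    var steps = Array.prototype.slice.call(arguments);
    var index = 0;
    function next() {
        if (index >= steps.length) return;
        var fn = steps[index++];
        fn.apply(next, arguments);
    }
    next();
}

// Fake stand-ins for IMAP calls (assumptions, not the inbox API).
// They call back right away to keep the sketch short.
function fakeOpenMailbox(name, callback) {
    callback(null, { name: name });
}
function fakeListMessages(limit, callback) {
    callback(null, [ { title: "first" }, { title: "second" } ]);
}

var log = [];
miniStep(
    function() {
        fakeOpenMailbox("INBOX", this);
    },
    function(err, info) {
        log.push("opened " + info.name);
        fakeListMessages(-2, this);
    },
    function(err, messages) {
        log.push("got " + messages.length + " messages");
    }
);
```

The point is only that each stage hands its own callback to the async call, so the code reads top to bottom instead of drifting to the right.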
4. Fight the spam
Next up, we try to detect which of the newly received e-mails are spam.
function(err, messages) {
    var group = this.group();
    messages.forEach(function(message) {
        // Only look at messages that have not been read yet
        if (message.flags.indexOf("\\Seen") < 0) {
            var isSpam = false;
            // Subject already tagged as spam by the mail host
            if (message.title.indexOf("[SPAM WARNING]") >= 0) isSpam = true;
            // Known bad sender hosts
            var fromFilter = [ "@126.com", "@163.com" ];
            fromFilter.forEach(function(k) {
                if (message.from.address.indexOf(k) >= 0) isSpam = true;
            });
            // Suspicious recipient addresses
            var toFilter = [ ".cn" ];
            toFilter.forEach(function(k) {
                if (message.to[0].address.indexOf(k) >= 0) isSpam = true;
            });
            if (detectAsianLanguage(message)) isSpam = true;
            if (classifier.categorize(message) === "spam") isSpam = true;
            // The white list overrules everything above
            var whiteFilter = [ "family.dk" ];
            whiteFilter.forEach(function(k) {
                if (message.from.address.indexOf(k) >= 0) isSpam = false;
            });
            if (isSpam) {
                client.moveMessage(message.UID, "INBOX.Spam", group());
            }
        }
    });
},
OK, I'll probably have to explain that my problems with spam mostly come from Chinese e-mail hosts. In particular a few big-player hosts. It turns out that if I can just prevent Chinese e-mails from arriving in my mailbox, 90% of all spam goes away.
Consequently, in its simplest form, the spam algorithm just triggers on certain e-mail sender addresses and throws them into the Spam folder. It's cruel, but it works.
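Pulled out on its own, the sender check could look something like the sketch below. The helper names containsAny and isBlockedSender are made up for this example, and the lists simply reuse the sample entries from the code above.

```javascript
// Sample lists taken from the filter code; adjust to taste.
var blockedSenders = [ "@126.com", "@163.com" ];
var trustedSenders = [ "family.dk" ];

// True if `text` contains at least one of the given fragments.
function containsAny(text, fragments) {
    return fragments.some(function(k) {
        return text.indexOf(k) >= 0;
    });
}

// Hypothetical helper: trusted senders always win over the block list.
function isBlockedSender(address) {
    if (containsAny(address, trustedSenders)) return false;
    return containsAny(address, blockedSenders);
}
```

Checking the white list first means a trusted sender can never be caught by a block list entry added later.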
Next, we try to detect e-mails that actually contain Asian text / glyphs. A lot of the
spam I receive is written entirely in Chinese, a language I'm not really empowered
to read. My little detectAsianLanguage()
function simply scans
all the characters in the subject text, and triggers on any character that
falls into the Han Ideographs Unicode range.
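A check along those lines could be sketched as follows. This is an illustration rather than the exact function used in the app: containsHanCharacters is a hypothetical name, it takes a plain string (for instance the subject text), and it only looks at the CJK Unified Ideographs block (U+4E00 to U+9FFF) and its Extension A (U+3400 to U+4DBF).

```javascript
// Hypothetical helper: returns true if any character in `text` falls
// within the CJK Unified Ideographs block or its Extension A.
function containsHanCharacters(text) {
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);
        if (code >= 0x4E00 && code <= 0x9FFF) return true;  // main block
        if (code >= 0x3400 && code <= 0x4DBF) return true;  // Extension A
    }
    return false;
}
```

Note that these ranges are shared by Chinese, Japanese kanji and Korean hanja, so the check says "contains Han characters", not "is written in Chinese".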
Finally, we'll analyze the e-mail message text and determine if it looks like spam.
5. Feed spam to the Naive
You might have noticed one line in the code above...
if (classifier.categorize(message) === "spam") isSpam = true;
This determines if the e-mail is spam or not, based on the e-mail content.
It uses an algorithm called a
Naive Bayes classifier,
and it's a common spam-fighting tool to determine (with some probability) that an e-mail
is to be treated as spam.
To make it work, we'll need to train the Bayes classifier. This is done at the start of the application. Like this:
var bayes = require('bayes');
var classifier = bayes();
classifier.learn('sale', 'spam');
classifier.learn('cheap', 'spam');
classifier.learn('coupon', 'spam');
classifier.learn('undelivered mail', 'spam');
//
classifier.learn('reply to mail', 'mail');
classifier.learn('status of project', 'mail');
classifier.learn('hello', 'mail');
This sets up two categories for the Bayes engine to choose from.
One contains possible spam sentences, phrases and word-lists,
which are weighted against a white list. The Bayes extension returns its
verdict as either "spam" or "mail". The sample here is a little simplistic, but with
a larger training set I actually completely got rid of my spam problems.
Bayes classification is part of a large Natural Language Processing toolset,
and can be trained better when fed with many and complete spam e-mails.
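If you are curious what goes on behind learn() and categorize(), here is a minimal word-counting sketch of the Naive Bayes idea. It is not the bayes extension's implementation: TinyBayes is a hypothetical name, the tokenizer only handles plain ASCII words, and add-one smoothing is one common choice I'm assuming here.

```javascript
// TinyBayes: hypothetical, simplified Naive Bayes text classifier.
function TinyBayes() {
    this.wordCounts = Object.create(null);  // category -> { word -> count }
    this.totalWords = Object.create(null);  // category -> words learned
    this.docCounts = Object.create(null);   // category -> texts learned
    this.vocabulary = Object.create(null);  // every distinct word seen
}

// Lower-case the text and split it into ASCII words.
TinyBayes.prototype.tokenize = function(text) {
    return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
};

TinyBayes.prototype.learn = function(text, category) {
    var self = this;
    if (!this.wordCounts[category]) {
        this.wordCounts[category] = Object.create(null);
        this.totalWords[category] = 0;
        this.docCounts[category] = 0;
    }
    this.docCounts[category]++;
    this.tokenize(text).forEach(function(word) {
        self.wordCounts[category][word] = (self.wordCounts[category][word] || 0) + 1;
        self.totalWords[category]++;
        self.vocabulary[word] = true;
    });
};

// Pick the category with the highest log-probability for the text.
TinyBayes.prototype.categorize = function(text) {
    var self = this;
    var words = this.tokenize(text);
    var categories = Object.keys(this.docCounts);
    var vocabSize = Object.keys(this.vocabulary).length;
    var totalDocs = 0;
    categories.forEach(function(c) { totalDocs += self.docCounts[c]; });

    var best = null, bestScore = -Infinity;
    categories.forEach(function(category) {
        // Prior: how often this category was seen during training
        var score = Math.log(self.docCounts[category] / totalDocs);
        words.forEach(function(word) {
            var count = self.wordCounts[category][word] || 0;
            // Add-one smoothing keeps unseen words from zeroing the score
            score += Math.log((count + 1) / (self.totalWords[category] + vocabSize));
        });
        if (score > bestScore) { bestScore = score; best = category; }
    });
    return best;
};

var tiny = new TinyBayes();
tiny.learn("cheap sale coupon", "spam");
tiny.learn("status of project", "mail");
tiny.learn("hello reply to mail", "mail");
```

With this toy training set, tiny.categorize("cheap coupon today") leans towards "spam" and tiny.categorize("status of the project") towards "mail". Scores are summed as logarithms so that long texts don't underflow to zero.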
6. Wipe'em
The last bit of code opens the folder named Spam on the e-mail account and deletes all mails older than X days. The code to do this is very similar to steps 3 and 4, so I won't repeat it.
To run the final app, you just need to run the file with node:
C:\Myproject> node app.js <IMAP host> <email address> <password>
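Inside app.js, the three arguments could be picked up from process.argv roughly like this. parseArgs is a hypothetical helper name, and the property names in the returned object are chosen only for this sketch.

```javascript
// parseArgs: hypothetical helper. `argv` has the same shape as
// process.argv: [ node binary, script path, ...user arguments ].
function parseArgs(argv) {
    var args = argv.slice(2);
    if (args.length < 3) {
        throw new Error("Usage: node app.js <IMAP host> <email address> <password>");
    }
    return { host: args[0], user: args[1], password: args[2] };
}

// In the app itself this would be: var settings = parseArgs(process.argv);
```

Keep in mind that a password passed on the command line is visible in the prompt history and the process list.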
You'll need to run the tool periodically.
You can probably find some clever way of doing this.
I ended up buying one of
these things,
sniffing out its internal USB protocol (since Windows 8 64-bit wasn't supported),
and writing a small utility that, via the Windows USB HID driver, sent commands to make it turn
into a Flashing Knobs gizmo whenever new non-spam mails arrived in my mailbox.
Oh, the fun...
Installation Requirements
Node.js 0.10
Installation Guide
- Install Node.js on the machine.
- Extract the ZIP file to a temporary folder.
- Run npm install from a prompt in that folder.
Download Files
Source code (3 Kb)