I Hate Email Spam
By
Andrew Pitonyak
Spam, what can I say about it? I am too busy to be eloquent, which is the same reason that I am too busy to read spam.
There is an excellent article by Paul Graham on how to use statistical analysis to identify spam.
Gary Arnold has an excellent implementation of this method.
Unfortunately, the initial implementation did not meet my needs, so I developed my own.
Of course, I no longer use my own spam filter now that I have moved to the Thunderbird
email program, but this is how I used to filter spam using PMMail. I have to admit that I
liked PMMail better than I like Thunderbird, but Thunderbird actually meets my needs
and PMMail does not (it has had no new development in a long time). Thunderbird has its own
spam recognition method, and it seems to actually work, although not as well
as my filters did. When I have time, perhaps I will filter my email using my own
filters rather than those provided by Thunderbird.
Summary
The method at a glance is as follows:
- Collect a large number of good email messages
- Collect a large number of bad (spam) email messages
- Count the occurrences of each token (word) in the good email messages.
- Count the occurrences of each token (word) in the bad (spam) email messages.
- Determine the probability that a particular token would appear in a bad email message.
- For each email message of interest, use the probability that each token would appear in a good or
bad email message to assign a probability that the message is bad.
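The last two steps can be sketched roughly as follows. This is a minimal illustration in Python, not the Perl code described on this page, and it assumes the formulas from Paul Graham's article: good counts are doubled, rarely seen tokens are skipped, probabilities are clamped to the range 0.01 to 0.99, unknown tokens score 0.4, and the 15 most extreme tokens are combined. The function names and default values are assumptions made for the example.

```python
def token_probabilities(good_counts, bad_counts, n_good, n_bad, min_count=5):
    """Estimate, for each token, the probability that a message containing it is spam.

    good_counts / bad_counts map token -> number of occurrences;
    n_good / n_bad are the numbers of good and bad messages (both > 0).
    """
    probs = {}
    for token in set(good_counts) | set(bad_counts):
        g = 2 * good_counts.get(token, 0)  # doubled to bias against false positives
        b = bad_counts.get(token, 0)
        if g + b <= min_count:
            continue  # seen too rarely to say anything useful
        good_ratio = min(1.0, g / n_good)
        bad_ratio = min(1.0, b / n_bad)
        p = bad_ratio / (good_ratio + bad_ratio)
        probs[token] = max(0.01, min(0.99, p))  # never fully certain
    return probs


def spam_probability(tokens, probs, unknown=0.4, most_interesting=15):
    """Combine the most extreme token probabilities into one score for a message."""
    ps = [probs.get(t, unknown) for t in set(tokens)]
    # "Interesting" means far from the neutral value 0.5.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = 1.0
    prod_good = 1.0
    for p in ps[:most_interesting]:
        prod_spam *= p
        prod_good *= 1.0 - p
    return prod_spam / (prod_spam + prod_good)
```

A message with no tokens at all scores 0.5 in this sketch, which is neither good nor bad.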
I have about 400 bad email messages and a few thousand good email messages that I have received and collected.
Experience indicates that if you filter your email messages with my probability file, the filtering will be poor.
It is important that you collect a large number of your own good and bad messages.
My desire was to produce something that would work with my email software of choice, PMMail.
I do multiple things to filter the good email from the bad.
First, I created address books. I have address books for family members, people with whom I work, and even for my mailing lists.
I also created folders to receive my filtered messages.
In PMMail I added a complex filter similar to the following:
h.fromid="$ab.Family" | h.fromid="$ab.Work" | h.fromid="$ab.Friends"
This acts as a white list, moving email from people in my address books to a Pass folder.
I have similar filters for my mailing lists.
Email that passes no filter I leave in my inbox.
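The white list logic amounts to an "or" across the address books. A small Python sketch of the idea follows; it is not PMMail's filter engine, and the function name, the folder names as strings, and the example addresses are all hypothetical.

```python
def route_message(sender, address_books):
    """Return 'Pass' if the sender appears in any address book, otherwise 'Inbox'.

    address_books maps a book name (for example "Family") to a list of addresses.
    """
    sender = sender.strip().lower()
    for addresses in address_books.values():
        if sender in {a.strip().lower() for a in addresses}:
            return "Pass"  # white listed: skip the spam check
    return "Inbox"  # unknown sender: left for the statistical filter


books = {
    "Family": ["mom@example.org"],
    "Work": ["boss@example.com"],
}
print(route_message("Boss@Example.com", books))      # Pass
print(route_message("stranger@example.net", books))  # Inbox
```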
PMMail cannot call an external filter directly.
It does, however, support a "user hook" for messages that pass a filter.
I have a filter that searches the header for a space; it always passes and
calls my batch file. My batch file does a statistical analysis of the message
and then inserts a new header into the message if it thinks that the message is spam.
This only happens to messages that do not pass the white list filters.
My next filter searches the headers for the text "X-Andy-Spam: Probably spam" and moves matching messages to the Spam folder.
If a header is inserted for a message considered good, it reads "X-Andy-Spam: Probably not spam" and includes the
probability. Whether a header is inserted is configurable, as shown later.
How Do I Do This
Install Perl
I wrote the filter in Perl.
There are many Perl implementations available for Windows;
I use a version that is freely available from ActiveState.
If you follow the link to ActiveState, you can click on the button in the upper left corner that says "Download."
ActiveState requires you to register before downloading (I am told that you do not have to supply any information). Choose what you desire, download it, and install it.
Install Required Modules
My code uses an extra module. Perl can automatically install it for you.
- Open a command prompt
- Go to the Perl bin directory. This is probably c:\perl\bin.
- Type ppm.bat to run the Perl Package Manager
- Type help if you want help
- Type install MIME-tools
- Type quit
Copy The Code
My code uses my Perl utility packages.
These packages must be in a directory called Pitonyak.
When Perl searches for packages, it looks in its own lib directory, probably c:\perl\lib.
If you do not place the Pitonyak directory there, then you must tell Perl where it is located
by setting the perllib environment variable.
If you create the directory "C:\spam\Pitonyak", then set "perllib=C:\spam".
You may place the spam scripts wherever you desire.
Configure The Scripts
pmmail_build_tokens.bat builds the good, bad, and probability token files.
It is hard coded to build my token files from my email messages.
You must modify the batch file to build your own token files from your own email messages.
The batch file is very short.
-
del q:\devsrc\Perl\spam\andy_*.dat
Delete the token files. You may want to change the paths and file names.
-
perl -w q:\devsrc\Perl\spam\tokenize_file.pl -r -o q:\devsrc\Perl\spam\andy_good.dat -s c:\pmmail\andyp_0.act\Known1.FLD\*.msg -s c:\pmmail\andyp_0.act\pass0.FLD\*.msg -s c:\pmmail\andyp_0.act\PERSON0.FLD\*.msg --log_cfg q:\devsrc\Perl\spam\logger.dat
The meaning of each parameter is summarized here.
This creates the andy_good.dat token file from my good email messages. Be certain to modify the directory locations and file names
as appropriate. Add a -s search pattern for each folder of email message files that you want to include.
-
perl -w q:\devsrc\Perl\spam\tokenize_file.pl -r -o q:\devsrc\Perl\spam\andy_bad.dat -s c:\pmmail\andyp_0.act\SPAM0.FLD\Verif0.FLD\*.msg --log_cfg q:\devsrc\Perl\spam\logger.dat
This creates the bad token file using the same methods as in the previous step.
-
perl -w q:\devsrc\Perl\spam\build_probabilities.pl -b q:\devsrc\Perl\spam\andy_bad.dat -g q:\devsrc\Perl\spam\andy_good.dat -p q:\devsrc\Perl\spam\andy_prob.dat --log_cfg q:\devsrc\Perl\spam\logger.dat
The meaning of each parameter is summarized here.
Build the probability file containing the likelihood that a token is good or bad.
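To give a feel for what building a token file involves, here is a rough Python sketch of tokenizing messages and counting tokens. It is not tokenize_file.pl, and it says nothing about the format of the .dat files. It assumes the token rules from Paul Graham's article (letters, digits, dashes, apostrophes, and dollar signs form tokens, case is ignored, and tokens that are all digits are dropped), and the function names are made up for the example. Python's standard email package stands in for MIME-tools.

```python
import re
from collections import Counter
from email import message_from_string

# Assumed token rule: runs of letters, digits, dollar signs, apostrophes, dashes.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def message_tokens(raw_message):
    """Return the lowercased tokens from the Subject, From, and text parts of one message."""
    msg = message_from_string(raw_message)
    pieces = [str(msg.get("Subject", "")), str(msg.get("From", ""))]
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip attachments and multipart containers
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            pieces.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name in the message
            pieces.append(payload.decode("latin-1", errors="replace"))
    text = "\n".join(pieces)
    return [t.lower() for t in TOKEN_RE.findall(text) if not t.isdigit()]


def count_tokens(raw_messages):
    """Count token occurrences across a collection of raw messages."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(message_tokens(raw))
    return counts
```

Running count_tokens once over the good messages and once over the bad messages gives the two sets of counts that the probability step needs.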
Configure pmmail_spam.bat
PMMail, using the external user hook, calls this batch file once for each message.
I can also run the Perl scripts manually and check a few thousand messages at a time.
This batch file is very simple, consisting of one line:
perl -w q:\devsrc\Perl\spam\spam_check_file.pl -sl 0.90 -a 0.5 -p q:\devsrc\Perl\spam\andy_prob.dat --log_cfg q:\devsrc\Perl\spam\logger.dat -s %1
The meaning of each parameter is summarized here.
Besides the obvious changes related to the directory structure and file names,
you may want to change -sl 0.90 to -sl 0.99. This reduces the likelihood of a false positive.
You may also want to change -a 0.5 to -a 0.99 so that the header is only inserted if
the email is considered spam.
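The way the two thresholds interact can be sketched as follows. This Python snippet is an illustration, not spam_check_file.pl. It assumes that -sl is the probability at or above which a message is labeled spam and that -a is the probability at or above which any header is added; the function name, the exact comparisons, and the way the probability is formatted in the header are guesses.

```python
def spam_header(probability, spam_level=0.90, add_level=0.5):
    """Return the header line to insert, or None when no header should be added."""
    if probability < add_level:
        return None  # below the -a threshold: leave the message untouched
    if probability >= spam_level:
        verdict = "Probably spam"
    else:
        verdict = "Probably not spam"
    return f"X-Andy-Spam: {verdict} ({probability:.4f})"


print(spam_header(0.30))              # None
print(spam_header(0.70))              # X-Andy-Spam: Probably not spam (0.7000)
print(spam_header(0.95))              # X-Andy-Spam: Probably spam (0.9500)
print(spam_header(0.95, 0.99, 0.99))  # None
```

With both values set to 0.99, a header appears only on messages labeled spam, which matches the suggestion above.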
Configure logger.dat
This is the file that the batch files use to configure the output of the scripts.
You can check the meaning of each parameter here.
When a message is logged, it is given a type. This type is used to decide where the message should be logged.
The primary keys of interest are screen_output and file_output.
The primary output types are as follows:
Type | Type Meaning | SmallLogger Method |
W | Warning | warn($message) |
I | Info | info($message) |
E | Error | error($message) |
T | Trace | trace($message) |
D | Debug | debug($message) |
F2 | | write_log_type('F2', $message); |
What does a large business do?
By using a dedicated business email hosting service, you get some built-in spam filtering that
helps keep your business email from being deluged with unwanted spam.
Last Modified July 5, 2010 01:59:46 PM UTC
© 1999-2024 Andrew Pitonyak (email me at: andy @ pitonyak.org)