Pitonyak::BayesianTokenCounter - I use this to decide if a piece of email is SPAM. This counts tokens in "good" files and "bad" files and then does a statistical analysis of which tokens belong in which group.
use File::Basename;
use strict;
use Pitonyak::SmallLogger;
use Pitonyak::SafeGlob qw(glob_spec_from_path);
use Pitonyak::BayesianTokenCounter;
my $log = new Pitonyak::SmallLogger;
$log->log_name_date('');
$log->screen_output('D', 0);
$log->screen_output('I', 1);
$log->file_output('D', 1);
$log->file_output('T', 1);
$log->message_loc_format('(sub):(line):');
$log->open_append(0);
$log->log_path('./');
my $good_tokens = new Pitonyak::BayesianTokenCounter;
my $bad_tokens = new Pitonyak::BayesianTokenCounter;
my $probability_tokens = new Pitonyak::BayesianTokenCounter;
$good_tokens->set_log($log);
$bad_tokens->set_log($log);
$probability_tokens->set_log($log);
#Read the bad tokens
$bad_tokens->read_from_file('bad_file.dat');
#Read the good tokens
# and then add a few new files with good tokens
# to it
$good_tokens->read_from_file('good_file.dat');
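my $want_files = 1;
my $want_dirs = 0;
# Assumed setting: match file names case sensitively
my $files_case_sensitive = 1;
my $glob = new Pitonyak::SafeGlob();
$glob->case_sensitive($files_case_sensitive);
$glob->recurse(0);
$glob->return_dirs($want_dirs);
$glob->return_files($want_files);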
foreach my $file_name ($glob->glob_spec_from_path('~andy/new_good_files/*.MSG'))
{
$good_tokens->tokenize_file($file_name);
}
# Save the new good tokens
$good_tokens->write_to_file();
# Build a probability file. You probably already
# built this and simply want to read it in.
$probability_tokens->build_probabilities($good_tokens, $bad_tokens);
my $token_list = new Pitonyak::BayesianTokenCounter;
$token_list->tokenize_file('test_message.MSG');
my $prob = $probability_tokens->rate_tokens($token_list);
$log->warn("The file has probability $prob of being SPAM") if ($prob > 0.9);
$log->info("Finished!");
This contains methods to create, read, and write token files. A token file that contains probabilities can also be created. After tokenizing a file, it can be compared against the good and bad tokens and a guess made to see if the file is a good or bad file.
The initial ideas came from http://www.paulgraham.com/spam.html. Gary Arnold then did an implementation (http://www.garyarnold.com/projects.php). Gary's initial code did not meet my needs, so I wrote my own.
I placed a limit on the length of a token. There is a max_token_len and min_token_len attribute that may be set.
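As a rough illustration of the length limit, the sketch below filters a token list by a minimum and maximum length. The helper name filter_tokens_by_length and the limits 3 and 10 are assumptions made for the example; they are not the module's internals or its defaults.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only tokens whose length lies in [$min, $max].
sub filter_tokens_by_length {
    my ($min, $max, @tokens) = @_;
    return grep { length($_) >= $min && length($_) <= $max } @tokens;
}

# 'a' is too short and the last token is too long, so both are dropped.
my @kept = filter_tokens_by_length(3, 10, qw(a free offer supercalifragilistic));
print "@kept\n";
```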
This code ignores PGP signatures. I am on several mailing lists with members that have PGP signatures. These signatures are long and I did not want them in the token lists. Deep down inside, I think that perhaps if a piece of email contains a PGP signature, then I should probably just assume that it is NOT SPAM.
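One way to drop signature blocks before tokenizing is a single substitution over the message text. This is only a sketch under the assumption that signatures use the usual ASCII-armor BEGIN/END marker lines; strip_pgp_signatures is a hypothetical name, and the module may detect signatures differently.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every armored PGP signature block,
# including the BEGIN and END marker lines themselves.
sub strip_pgp_signatures {
    my ($text) = @_;
    $text =~ s/^-----BEGIN PGP SIGNATURE-----.*?^-----END PGP SIGNATURE-----[ \t]*\r?\n?//msg;
    return $text;
}

my $message = "Hello\n"
            . "-----BEGIN PGP SIGNATURE-----\n"
            . "iQEcBAEBAgAGBQJ\n"
            . "-----END PGP SIGNATURE-----\n"
            . "Bye\n";
print strip_pgp_signatures($message);
```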
Some of my email is pre-filtered by SpamAssassin, which inserts certain headers into my email. Although SpamAssassin does a good job, I did not want my token filters to be based on this. The ignore_headers attribute contains these values.
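The idea of skipping filter-inserted headers can be sketched with a lookup hash. The header names below (X-Spam-Status, X-Spam-Level) and the helper drop_ignored_headers are assumptions for illustration; the real list lives in the ignore_headers attribute, and the sketch is meant to be applied to the header block only, not the message body.

```perl
use strict;
use warnings;

# Assumed header names, stored lowercase for case-insensitive lookup.
my %ignore_headers = map { lc($_) => 1 } qw(X-Spam-Status X-Spam-Level);

# Hypothetical helper: drop ignored headers and their continuation lines.
sub drop_ignored_headers {
    my ($ignore, @lines) = @_;
    my @kept;
    my $skipping = 0;
    for my $line (@lines) {
        if ($line =~ /^([\w-]+):/) {
            # A new header starts; decide whether to skip it.
            $skipping = exists $ignore->{ lc($1) } ? 1 : 0;
        }
        elsif ($line !~ /^\s/) {
            # Not a continuation line, so stop skipping.
            $skipping = 0;
        }
        push @kept, $line unless $skipping;
    }
    return @kept;
}

my @headers = ("From: a\@example.com", "X-Spam-Status: Yes,", "\tscore=9.1", "Subject: hi");
print join("\n", drop_ignored_headers(\%ignore_headers, @headers)), "\n";
```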
I list the content types that I know. Some content types I will accept, and others I will simply ignore. Check out the content_types attribute.
new()
Note that this is written in such a manner
that it can be inherited. Also note that it
is written such that $obj2 = $obj1->new()
is valid!
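A constructor that supports both inheritance and $obj1->new() is usually written with the ref($invocant) || $invocant idiom. The sketch below shows that idiom with a hypothetical class My::Counter; it is not the module's actual constructor.

```perl
package My::Counter;    # hypothetical class name
use strict;
use warnings;

sub new {
    my $invocant = shift;
    # Accept either a class name or an existing object.
    my $class = ref($invocant) || $invocant;
    my $self  = { tokens => {} };
    # Two-argument bless lets subclasses inherit this constructor.
    return bless($self, $class);
}

package main;

my $obj1 = My::Counter->new();
my $obj2 = $obj1->new();    # valid: a new object of the same class
print ref($obj2), "\n";
```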
copy($object)
Make a copy of this object
$obj1->copy($obj2) is the same as $obj1 = $obj2.
Returns the Bayesian probability tokens for the input tokens.
The initial ideas came from http://www.paulgraham.com/spam.html. Gary Arnold then did an implementation (http://www.garyarnold.com/projects.php). Unfortunately for me, that code did not meet my needs, and I also wanted to be able to skip certain attachments, so I had to write my own code!
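For orientation, the per-token probability described in the Paul Graham essay can be sketched as below. This follows the essay's formula (good counts doubled, tokens seen fewer than five times skipped, results clamped to 0.01..0.99); whether this module uses exactly these constants is an assumption, and token_probabilities is a hypothetical name.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# $good and $bad map token => count; $ngood and $nbad are message counts.
sub token_probabilities {
    my ($good, $ngood, $bad, $nbad) = @_;
    my %prob;
    my %seen = map { $_ => 1 } (keys %$good, keys %$bad);
    for my $token (keys %seen) {
        # Good counts are doubled to bias against false positives.
        my $good_count = 2 * ($good->{$token} // 0);
        my $bad_count  = $bad->{$token} // 0;
        next if $good_count + $bad_count < 5;    # too rare to be informative
        my $good_ratio = min(1, $good_count / $ngood);
        my $bad_ratio  = min(1, $bad_count / $nbad);
        $prob{$token} = max(0.01, min(0.99, $bad_ratio / ($good_ratio + $bad_ratio)));
    }
    return \%prob;
}

my $probs = token_probabilities({ hello => 10, free => 1 }, 20,
                                { free => 8, hello => 1 }, 10);
printf "%s => %.4f\n", $_, $probs->{$_} for sort keys %$probs;
```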
Returns, and optionally sets, the boolean that controls whether tokens are treated as case sensitive.
Returns, and optionally sets, the boolean for using a fast mime decode algorithm.
If this evaluates to true, then my own processing is used to find and decode MIME attachments. This is much faster but does not use the standard methods that were written by someone who probably has a better understanding of how this works.
file_name([$new_file_name])
Returns, and optionally sets, the current file_name. This is the name of the file that will be read or written.
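Many methods here share this "returns, and optionally sets" shape. A minimal sketch of such a combined getter/setter follows, using a hypothetical class My::Thing. It returns the value after any update; whether the module returns the old or the new value is not stated in this documentation.

```perl
package My::Thing;    # hypothetical class name
use strict;
use warnings;

sub new {
    my $class = shift;
    return bless({ file_name => '' }, $class);
}

# Combined getter/setter: set only when an argument is supplied.
sub file_name {
    my $self = shift;
    $self->{file_name} = shift if @_;
    return $self->{file_name};
}

package main;

my $thing = My::Thing->new();
$thing->file_name('tokens.dat');      # set
print $thing->file_name(), "\n";      # get
```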
Remember that the call $obj->method(@parms) is the same as method($obj, @parms).
If there is only one parameter, the first parameter is assumed to be an attribute name and the default attribute value is returned.
get_class_attribute($attribute_name)
If there are two parameters, then the first parameter is assumed to be a SmallLogger object and the second parameter is assumed to be an attribute name. The attribute value for the object is returned.
If three parameters are given, then the first parameter is the object, the second parameter is used to set a new value for the attribute, and the third parameter is the attribute name. The attribute value is then returned.
Returns the default ignore_headers hash reference
ignore_headers($hash_ref)
Sets the current ignore_headers hash to the parameter.
Return the state of the current header and optionally set it
max_token_len([$max_token_len])
Returns, and optionally sets, the maximum token length accepted.
min_token_len([$min_token_len])
Returns, and optionally sets, the minimum token length accepted.
num_files([$num_files])
Returns, and optionally sets, the number of files processed.
num_tokens()
Get the current number of tokens in this object.
ProcessMimeMessage($text)
This assumes that the text string is a single email message. The text and HTML portions are extracted and returned.
purge_tokens_with_count_less_than($lower_limit)
Delete tokens that occur fewer than the specified number of times.
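The purge can be pictured as a filter over a token => count hash. The sketch below assumes that storage layout and uses a hypothetical helper name, purge_below; the module's internal representation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: delete tokens whose count is below $lower_limit.
# Returns the number of tokens removed.
sub purge_below {
    my ($tokens, $lower_limit) = @_;
    my @doomed = grep { $tokens->{$_} < $lower_limit } keys %$tokens;
    delete @{$tokens}{@doomed};    # hash slice delete
    return scalar @doomed;
}

my %tokens = (free => 12, hello => 5, xyzzy => 1);
my $removed = purge_below(\%tokens, 2);
print "removed $removed, kept: ", join(' ', sort keys %tokens), "\n";
```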
rate_tokens($tokens_to_rate)
Returns the probability that the given tokens are bad tokens.
It is assumed that this token object is a probability token object.
The calling code will look something like this:
my $log = new Pitonyak::SmallLogger;
my $token_list = new Pitonyak::BayesianTokenCounter;
$log->log_path($program_path);
$token_list->set_log($log);
$token_list->read_from_file($config_file);
my $file_tokens = new Pitonyak::BayesianTokenCounter;
$file_tokens->tokenize_file($file_name);
my $prob = $token_list->rate_tokens($file_tokens);
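To show what a rating step might compute, here is a sketch of the combining rule from the Paul Graham essay: take the probabilities furthest from 0.5 (the essay uses 15), treat unknown tokens as 0.4, and combine them as prod(p) / (prod(p) + prod(1 - p)). The helper name combined_probability and the use of these exact constants are assumptions, not a description of rate_tokens itself.

```perl
use strict;
use warnings;

# $prob_for maps token => spam probability (assumed clamped to 0.01..0.99).
sub combined_probability {
    my ($prob_for, @tokens) = @_;
    my %unique = map { $_ => 1 } @tokens;
    # Unknown tokens get a mildly innocent default of 0.4.
    my @probs = map { $prob_for->{$_} // 0.4 } keys %unique;
    # Most "interesting" first: furthest from the neutral 0.5.
    @probs = sort { abs($b - 0.5) <=> abs($a - 0.5) } @probs;
    splice(@probs, 15) if @probs > 15;
    my ($product, $inverse) = (1, 1);
    for my $p (@probs) {
        $product *= $p;
        $inverse *= 1 - $p;
    }
    return $product / ($product + $inverse);
}

my %prob_for = (free => 0.99, hello => 0.01);
printf "%.4f\n", combined_probability(\%prob_for, qw(free free unknown));
```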
This will create an appropriate object and then read the file.
read_from_file($file_name)
Read the current file and then return the object used to read it.
set_log([$logger_instance])
If the logger instance is not present, then any existing logger will be deleted from the object.
If the object is present, then it must be an instance of Pitonyak::SmallLogger and it is set as the object to use.
Returns, and optionally sets, the true/false value for skipping HTML comments.
An object is created and then the file is tokenized into the object
tokenize_file($file_name)
If the $file_name is '-', then STDIN is read. If not, then the file is opened from disk and read. The file is then tokenized.
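The '-' convention can be sketched as follows. The helper slurp_source is hypothetical and only shows the input selection and whole-file read; the tokenizing step itself is omitted.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whole input as one string.
# A file name of '-' means "read from STDIN".
sub slurp_source {
    my ($file_name) = @_;
    my $fh;
    if ($file_name eq '-') {
        $fh = \*STDIN;
    }
    else {
        open($fh, '<', $file_name) or die "Cannot open $file_name: $!";
    }
    local $/;                # slurp mode: read everything at once
    my $text = <$fh>;
    close($fh) unless $file_name eq '-';
    return $text;
}

# Usage (not run here, since '-' would wait on STDIN):
#   my $text = slurp_source('-');
#   my $text = slurp_source('test_message.MSG');
```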
tokenize_string(@strings_to_tokenize)
This assumes that the list of strings is a mail message to be tokenized. In the program, the entire file is read into a single variable and then this is called.
tokens([$token_hash_ref])
Returns, and optionally sets, the internal token hash.
write_to_file([$file_name])
Write the tokens to either the current file name, or to the new file name as specified by the parameter.
This can be slow because the tokens are sorted by frequency and name.
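A two-key sort of this kind might look like the sketch below. The ordering (highest count first, then token name ascending) and the helper name sorted_tokens are assumptions; the documentation only says that frequency and name are both used.

```perl
use strict;
use warnings;

# Hypothetical helper: token names ordered by count (descending),
# with ties broken by name (ascending). Call in list context.
sub sorted_tokens {
    my ($tokens) = @_;
    return sort { $tokens->{$b} <=> $tokens->{$a} or $a cmp $b } keys %$tokens;
}

my %tokens = (beta => 2, alpha => 2, gamma => 5);
for my $token (sorted_tokens(\%tokens)) {
    print "$tokens{$token} $token\n";
}
```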
Copyright 1998-2002, Andrew Pitonyak (perlboy@pitonyak.org)
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Version 1.00 First release
Version 1.01 Changed internal documentation to POD documentation. Added parameter checking.