Pitonyak::BayesianTokenCounter - I use this to decide if a piece of email is SPAM. It counts tokens in "good" files and "bad" files and then does a statistical analysis of which tokens belong in which group.
    use File::Basename;
    use strict;
    use Pitonyak::SmallLogger;
    use Pitonyak::SafeGlob qw(glob_spec_from_path);
    use Pitonyak::BayesianTokenCounter;

    my $log = new Pitonyak::SmallLogger;
    $log->log_name_date('');
    $log->screen_output('D', 0);
    $log->screen_output('I', 1);
    $log->file_output('D', 1);
    $log->file_output('T', 1);
    $log->message_loc_format('(sub):(line):');
    $log->open_append(0);
    $log->log_path('./');

    my $good_tokens        = new Pitonyak::BayesianTokenCounter;
    my $bad_tokens         = new Pitonyak::BayesianTokenCounter;
    my $probability_tokens = new Pitonyak::BayesianTokenCounter;
    $good_tokens->set_log($log);
    $bad_tokens->set_log($log);
    $probability_tokens->set_log($log);

    # Read the bad tokens
    $bad_tokens->read_from_file('bad_file.dat');

    # Read the good tokens and then add a few
    # new files with good tokens to it
    $good_tokens->read_from_file('good_file.dat');

    my $want_files = 1;
    my $want_dirs  = 0;
    # Example value only; set this to match your file system.
    my $files_case_sensitive = 1;

    my $glob = new Pitonyak::SafeGlob();
    $glob->case_sensitive($files_case_sensitive);
    $glob->recurse(0);
    $glob->return_dirs($want_dirs);
    $glob->return_files($want_files);

    foreach my $file_name ($glob->glob_spec_from_path('~andy/new_good_files/*.MSG'))
    {
        $good_tokens->tokenize_file($file_name);
    }

    # Save the new good tokens
    $good_tokens->write_to_file();

    # Build a probability file. You probably already
    # built this and simply want to read it in.
    $probability_tokens->build_probabilities($good_tokens, $bad_tokens);

    my $token_list = new Pitonyak::BayesianTokenCounter;
    $token_list->tokenize_file('test_message.MSG');

    my $prob = $probability_tokens->rate_tokens($token_list);

    $log->warn("The file has probability $prob of being SPAM") if ($prob > 0.9);
    $log->info("Finished!");
This contains methods to create, read, and write token files. A token file that contains probabilities can also be created. After tokenizing a file, it can be compared against the good and bad tokens and a guess made to see if the file is a good or bad file.
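To make the statistical step concrete, here is a minimal sketch of the per-token and combined probabilities in the style of the Paul Graham essay this module draws on. It is not taken from this module. The names token_probability and combine_probabilities are hypothetical, and the clamping to 0.01 and 0.99 and the neutral 0.5 for unseen tokens are simplifying assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: estimate how strongly one token indicates SPAM.
# Assumes both file counts are greater than zero.
sub token_probability {
    my ($good_count, $bad_count, $num_good_files, $num_bad_files) = @_;
    my $good_rate = $good_count / $num_good_files;
    my $bad_rate  = $bad_count / $num_bad_files;

    # Assumption: a token never seen before is treated as neutral.
    return 0.5 if $good_rate + $bad_rate == 0;

    my $p = $bad_rate / ($good_rate + $bad_rate);

    # Clamp so that a single token can never be fully decisive.
    $p = 0.01 if $p < 0.01;
    $p = 0.99 if $p > 0.99;
    return $p;
}

# Hypothetical helper: combine per-token probabilities into one score
# using prod(p) / (prod(p) + prod(1 - p)).
sub combine_probabilities {
    my @probabilities = @_;
    my ($product, $inverse_product) = (1, 1);
    foreach my $p (@probabilities) {
        $product         *= $p;
        $inverse_product *= (1 - $p);
    }
    return $product / ($product + $inverse_product);
}

my $score = combine_probabilities(
    token_probability(0, 5, 10, 10),    # seen only in bad files
    token_probability(5, 5, 10, 10),    # seen equally often in both
);
print "Combined score: $score\n";
```

A neutral token (0.5) leaves the combined score unchanged, so only tokens that lean clearly one way move the result.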
The initial ideas came from Paul Graham (http://www.paulgraham.com/spam.html), and then Gary Arnold did an implementation (http://www.garyarnold.com/projects.php). Gary's initial code did not meet my needs, so I wrote my own.
I placed a limit on the length of a token. The max_token_len and min_token_len attributes may be set to control it.
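As an illustration of how a length limit might be applied while counting, consider the sketch below. It is an assumption about the general technique, not this module's tokenizer: tokens_within_limits is a hypothetical name, and splitting on \w+ is a simplification.

```perl
use strict;
use warnings;

# Hypothetical helper: count the tokens in $text whose length lies
# between $min_len and $max_len (inclusive).
sub tokens_within_limits {
    my ($text, $min_len, $max_len) = @_;
    my %count;
    foreach my $token ($text =~ /(\w+)/g) {
        my $len = length $token;
        next if $len < $min_len || $len > $max_len;
        $count{$token}++;
    }
    return \%count;
}

my $counts = tokens_within_limits('a free offer, free supercalifragilistic', 3, 10);
foreach my $token (sort keys %$counts) {
    print "$token $counts->{$token}\n";
}
```

Very short tokens carry little information and very long ones (encoded data, for example) rarely repeat, which is the usual motivation for bounding both ends.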
This code ignores PGP signatures. I am on several mailing lists with members that have PGP signatures. These signatures are long and I did not want them in the token lists. Deep down inside, I think that perhaps if a piece of email contains a PGP signature, then I should probably just assume that it is NOT SPAM.
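One way to drop signature blocks before tokenizing is sketched below. This is an assumption about how such filtering could be done, not the module's own code; strip_pgp_signatures is a hypothetical name, and the sketch relies only on the standard ASCII-armor markers.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every ASCII-armored PGP signature block,
# from the BEGIN marker line through the END marker line.
sub strip_pgp_signatures {
    my ($text) = @_;
    # /m lets ^ match at each line start, /s lets . span newlines.
    $text =~ s/^-----BEGIN PGP SIGNATURE-----.*?^-----END PGP SIGNATURE-----[^\n]*\n?//msg;
    return $text;
}

my $message = "Hello\n"
            . "-----BEGIN PGP SIGNATURE-----\n"
            . "iQEcBAEBAgAGBQJ\n"
            . "-----END PGP SIGNATURE-----\n"
            . "Bye\n";
print strip_pgp_signatures($message);
```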
Some of my email is pre-filtered by SpamAssassin, which inserts certain headers into my email. Although SpamAssassin does a good job, I did not want my token filters to be based on this. The ignore_headers attribute contains these values.
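The idea of skipping selected headers can be sketched as follows. The way this module keys its ignore_headers hash is not documented here, so the lowercase header names used as keys are an assumption, and filter_headers is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: drop header lines whose name appears in the
# $ignore hash reference, including their folded continuation lines.
sub filter_headers {
    my ($ignore, @header_lines) = @_;
    my @kept;
    my $skipping = 0;
    foreach my $line (@header_lines) {
        if ($line =~ /^([^\s:]+):/) {
            # A new header starts here; decide whether to skip it.
            $skipping = $ignore->{lc $1} ? 1 : 0;
        }
        # Lines starting with whitespace continue the previous header,
        # so they inherit its skip state.
        push @kept, $line unless $skipping;
    }
    return @kept;
}

my %ignore = ('x-spam-status' => 1, 'x-spam-level' => 1);
my @kept = filter_headers(
    \%ignore,
    'Subject: hi',
    'X-Spam-Status: No, score=0.1',
    "\ttests=NONE",
    'From: a@example.com',
);
print "$_\n" foreach @kept;
```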
I list the content types that I know. Some content types I will accept, and others I will simply ignore. Check out the content_types attribute.
new()
Calling new() on an existing object, as well as on the class name, is valid!
copy($object)
$obj1->copy($obj2) is the same as $obj1 = $obj2.
The initial ideas came from http://www.paulgraham.com/spam.html, and then Gary Arnold did an implementation (http://www.garyarnold.com/projects.php). Unfortunately for me, Gary Arnold's code did not meet my needs, and I also wanted to be able to avoid certain attachments, so I had to write my own code!
case_sensitive([0|1])
fast_mime_decode([0|1])
If this evaluates to true, then my own processing is used to find and decode MIME attachments. This is much faster, but it does not use the standard methods, which were written by someone who probably has a better understanding of how this works.
file_name([$new_file_name])
Remember that the call $obj->method(@parms) is the same as method($obj, @parms).
get_class_attribute($attribute_name)
The first parameter is assumed to be a SmallLogger object and the second parameter is assumed to be an attribute name. The attribute value for the object is returned.
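The equivalence between $obj->method(@parms) and method($obj, @parms) can be seen with a small stand-alone class. The Counter package below is purely hypothetical and unrelated to this module. Note that the two forms are only interchangeable when the function is named with its full package and no inheritance lookup is needed.

```perl
use strict;
use warnings;

package Counter;

# Constructor that accepts either a class name or an existing object.
sub new {
    my ($class) = @_;
    return bless { n => 0 }, ref($class) || $class;
}

# Add $amount to the running total and return the new total.
sub add {
    my ($self, $amount) = @_;
    $self->{n} += $amount;
    return $self->{n};
}

package main;

my $counter = Counter->new;
$counter->add(2);            # method call syntax
Counter::add($counter, 3);   # same sub, object passed explicitly
print "Total: $counter->{n}\n";
```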
ignore_headers($hash_ref)
max_token_len([$max_token_len])
min_token_len([$min_token_len])
num_files([$num_files])
num_tokens()
ProcessMimeMessage($text)
purge_tokens_with_count_less_than($lower_limit)
rate_tokens($tokens_to_rate)
It is assumed that this token object is a probability token object.
The calling code will look something like this:

    my $log = new Pitonyak::SmallLogger;
    my $token_list = new Pitonyak::BayesianTokenCounter;
    $log->log_path($program_path);
    $token_list->set_log($log);
    $token_list->read_from_file($config_file);

    my $file_tokens = new Pitonyak::BayesianTokenCounter;
    $file_tokens->tokenize_file($file_name);
    my $prob = $token_list->rate_tokens($file_tokens);
read_from_file($file_name)
set_log([$logger_instance])
If the object is present, then it must be an instance of Pitonyak::SmallLogger and it is set as the object to use.
skip_html_comments([0|1])
tokenize_file($file_name)
tokenize_string(@strings_to_tokenize)
tokens([$token_hash_ref])
write_to_file([$file_name])
This can be slow because the tokens are sorted by frequency and name.
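A two-key sort of this kind might look like the sketch below. The direction of the frequency ordering is not stated in this documentation, so descending count with ascending name as the tie-breaker is an assumption, and sorted_token_names is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: return token names ordered by descending count,
# breaking ties alphabetically. Call it in list context.
sub sorted_token_names {
    my ($tokens) = @_;
    return sort {
        $tokens->{$b} <=> $tokens->{$a}    # higher counts first
            || $a cmp $b                   # then by name
    } keys %$tokens;
}

my %tokens = (free => 3, offer => 1, click => 3);
foreach my $name (sorted_token_names(\%tokens)) {
    print "$name $tokens{$name}\n";
}
```

The sort runs in O(n log n) comparisons, each of which does two hash lookups, which is why writing a large token set can take noticeable time.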
Copyright 1998-2002, Andrew Pitonyak (perlboy@pitonyak.org)
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Version 1.00 First release
Version 1.01 Changed internal documentation to POD documentation. Added parameter checking.