NAME

Pitonyak::BayesianTokenCounter - I use this to decide if a piece of email is SPAM. This counts tokens in "good" files and "bad" files and then does a statistical analysis of which tokens belong in which group.


SYNOPSIS

    use strict;
    use File::Basename;
    use Pitonyak::SmallLogger;
    use Pitonyak::SafeGlob qw(glob_spec_from_path);
    use Pitonyak::BayesianTokenCounter;

    my $log = new Pitonyak::SmallLogger;
    $log->log_name_date('');
    $log->screen_output('D', 0);
    $log->screen_output('I', 1);
    $log->file_output('D', 1);
    $log->file_output('T', 1);
    $log->message_loc_format('(sub):(line):');
    $log->open_append(0);
    $log->log_path('./');

    my $good_tokens        = new Pitonyak::BayesianTokenCounter;
    my $bad_tokens         = new Pitonyak::BayesianTokenCounter;
    my $probability_tokens = new Pitonyak::BayesianTokenCounter;
    $good_tokens->set_log($log);
    $bad_tokens->set_log($log);
    $probability_tokens->set_log($log);

    # Read the bad tokens
    $bad_tokens->read_from_file('bad_file.dat');

    # Read the good tokens
    # and then add a few new files with good tokens
    # to it
    $good_tokens->read_from_file('good_file.dat');

    my $want_files = 1;
    my $want_dirs  = 0;

    # Assumed value; set this to match your file system.
    my $files_case_sensitive = 1;

    my $glob = new Pitonyak::SafeGlob();
    $glob->case_sensitive($files_case_sensitive);
    $glob->recurse(0);
    $glob->return_dirs($want_dirs);
    $glob->return_files($want_files);

    foreach my $file_name ($glob->glob_spec_from_path('~andy/new_good_files/*.MSG'))
    {
        $good_tokens->tokenize_file($file_name);
    }

    # Save the new good tokens
    $good_tokens->write_to_file();

    # Build a probability file. You probably already
    # built this and simply want to read it in.
    $probability_tokens->build_probabilities($good_tokens, $bad_tokens);

    my $token_list = new Pitonyak::BayesianTokenCounter;
    $token_list->tokenize_file('test_message.MSG');

    my $prob = $probability_tokens->rate_tokens($token_list);

    $log->warn("The file has probability $prob of being SPAM") if ($prob > 0.9);
    $log->info("Finished!");


DESCRIPTION

This contains methods to create, read, and write token files. A token file that contains probabilities can also be created. After tokenizing a file, it can be compared against the good and bad tokens and a guess made to see if the file is a good or bad file.

The initial ideas came from http://www.paulgraham.com/spam.html, and then Gary Arnold did an implementation (http://www.garyarnold.com/projects.php). Gary's initial code did not meet my needs, so I wrote my own.

I placed a limit on the length of a token. There is a max_token_len and min_token_len attribute that may be set.
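To illustrate how a length limit of this kind behaves, here is a minimal sketch that filters a token list by length. The helper name filter_by_length and the limits 3 and 10 are made up for the example; they are not the module's internals or its defaults.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): keep only the tokens
# whose length lies between $min_len and $max_len, inclusive.
sub filter_by_length {
    my ($min_len, $max_len, @tokens) = @_;
    return grep { length($_) >= $min_len && length($_) <= $max_len } @tokens;
}

# The limits 3 and 10 are illustrative values only.
my @kept = filter_by_length(3, 10, qw(a free offer supercalifragilistic ok));
print join(' ', @kept), "\n";
```

Very short tokens tend to carry little information and very long ones are often encoded data, which is the usual reason for bounding both ends.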

This code ignores PGP signatures. I am on several mailing lists with members that have PGP signatures. These signatures are long and I did not want them in the token lists. Deep down inside, I think that perhaps if a piece of email contains a PGP signature, then I should probably just assume that it is NOT SPAM.
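One plausible way to drop a signature block before tokenizing is a multi-line substitution on the armor lines. This is only a sketch of the idea; the function name strip_pgp_signatures is hypothetical and the module may recognize signatures differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): remove every block that
# starts with a "BEGIN PGP SIGNATURE" armor line and ends with the
# matching "END PGP SIGNATURE" line.
sub strip_pgp_signatures {
    my ($text) = @_;
    # /m lets ^ match at the start of each line, /s lets . span newlines,
    # and .*? stops at the first END line after each BEGIN line.
    $text =~ s/^-----BEGIN PGP SIGNATURE-----.*?^-----END PGP SIGNATURE-----[ \t]*\n?//msg;
    return $text;
}

my $message = "Hello\n"
            . "-----BEGIN PGP SIGNATURE-----\n"
            . "iQEcBAEBAgAGBQJ\n"
            . "-----END PGP SIGNATURE-----\n"
            . "Bye\n";
print strip_pgp_signatures($message);
```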

Some of my email is pre-filtered by SpamAssassin, which inserts certain headers into my email. Although SpamAssassin does a good job, I did not want my token filters to be based on this. The ignore_headers attribute contains these values.

I list the content types that I know. Some content types I will accept, and others I will simply ignore. Check out the content_types attribute.

new

new()
Note that this is written in such a manner that it can be inherited. Also note that it is written such that $obj2 = $obj1->new() is valid!

copy

copy($object)
Make a copy of this object.

$obj1->copy($obj2) is the same as $obj1 = $obj2.

build_probabilities

build_probabilities($good_token_list, $bad_token_list)
Returns the Bayesian probability tokens for the input tokens.

The initial ideas came from http://www.paulgraham.com/spam.html, and then Gary Arnold did an implementation (http://www.garyarnold.com/projects.php). Unfortunately for me, Gary Arnold did not produce code that met my needs, and I also wanted to be able to avoid certain attachments, so I had to write my own code!
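For reference, the per-token probability described in the Paul Graham essay linked above can be sketched as follows. This follows the essay's formula, not necessarily what this module computes; the function name token_probability and its argument order are assumptions made for the example, and both file counts are assumed to be greater than zero.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Sketch of the per-token formula from the essay (not the module's code).
#   $good_count, $bad_count : occurrences of the token in each corpus
#   $ngood, $nbad           : number of good and bad messages (assumed > 0)
# Returns undef when the token is too rare to be rated.
sub token_probability {
    my ($good_count, $bad_count, $ngood, $nbad) = @_;

    my $g   = 2 * $good_count;   # good counts are doubled to bias against false positives
    my $bad = $bad_count;
    return undef if $g + $bad < 5;

    my $good_ratio = min(1, $g / $ngood);
    my $bad_ratio  = min(1, $bad / $nbad);

    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, $bad_ratio / ($good_ratio + $bad_ratio)));
}

printf "%.4f\n", token_probability(5, 5, 100, 100);
printf "%.4f\n", token_probability(0, 10, 100, 100);
```

With these inputs, a token seen equally often in both corpora scores below 0.5 because of the doubling, and a token seen only in bad files is clamped to 0.99.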

case_sensitive

case_sensitive([0|1])
Returns, and optionally sets, the boolean that controls whether tokens are treated as case sensitive.

fast_mime_decode

fast_mime_decode([0|1])
Returns, and optionally sets, the boolean for using a fast MIME decode algorithm.

If this evaluates to true, then my own processing is used to find and decode MIME attachments. This is much faster but does not use the standard methods that were written by someone who probably has a better understanding of how this works.

file_name

file_name([$new_file_name])
Returns, and optionally sets, the current file_name. This is the name of the file that will be read or written.

get_class_attribute

Remember that the call $obj->method(@parms) is the same as method($obj, @parms).

SmallLogger::get_class_attribute($attribute_name)
If there is only one parameter, the first parameter is assumed to be an attribute name and the default attribute value is returned.

$obj->get_class_attribute($attribute_name)
If there are two parameters, then the first parameter is assumed to be a SmallLogger object and the second parameter is assumed to be an attribute name. The attribute value for the object is returned.

$obj->get_class_attribute($attribute_name, $attribute_value)
If three parameters are given, then the first parameter is the object, the second parameter is used to set a new value for the attribute, and the third parameter is the attribute name. The attribute value is then returned.

ignore_headers

Pitonyak::BayesianTokenCounter::ignore_headers
Returns the default ignore_headers hash reference

$obj->ignore_headers($hash_ref)
Sets the current ignore_headers hash to the parameter.

$obj->ignore_headers($key, [0|1])
Returns the state of the given header and optionally sets it.

max_token_len

max_token_len([$max_token_len])
Returns, and optionally sets, the max token length accepted.

min_token_len

min_token_len([$min_token_len])
Returns, and optionally sets, the min token length accepted.

num_files

num_files([$num_files])
Returns, and optionally sets, the number of files processed.

num_tokens

num_tokens()
Get the current number of tokens in this object.

ProcessMimeMessage

ProcessMimeMessage($text)
This assumes that the text string is a single email message. The text and html portions are processed out and returned.

purge_tokens_with_count_less_than

purge_tokens_with_count_less_than($lower_limit)
Delete tokens that occur fewer than the specified number of times.
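The effect can be pictured with a plain hash. The sketch below assumes a token table that maps each token to its count, which is a guess at the layout and not a statement about the module's internal structure; the function name purge_below is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): delete every token whose
# count is below $lower_limit and return how many were removed.
# Assumes $tokens is a hash reference of token => count.
sub purge_below {
    my ($tokens, $lower_limit) = @_;
    my @rare = grep { $tokens->{$_} < $lower_limit } keys %$tokens;
    delete @{$tokens}{@rare};
    return scalar @rare;
}

my %counts  = (free => 7, offer => 2, zymurgy => 1);
my $removed = purge_below(\%counts, 2);
print "removed $removed, kept ", join(' ', sort keys %counts), "\n";
```

Purging rare tokens keeps the token file small and drops words that appear too seldom to say anything useful about either group.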

rate_tokens

rate_tokens($tokens_to_rate)
Returns the probability that the given tokens are bad tokens.

It is assumed that this token object is a probability token object.

The calling code will look something like this:

    my $log        = new Pitonyak::SmallLogger;
    my $token_list = new Pitonyak::BayesianTokenCounter;
    $log->log_path($program_path);
    $token_list->set_log($log);
    $token_list->read_from_file($config_file);
    my $file_tokens = new Pitonyak::BayesianTokenCounter;
    $file_tokens->tokenize_file($file_name);
    my $prob = $token_list->rate_tokens($file_tokens);
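The essay that inspired this module combines the per-token probabilities of the most "interesting" tokens (those furthest from 0.5) into a single score. The sketch below shows that combination rule; whether rate_tokens does exactly this is an assumption, and the function names and the limit passed to most_interesting are chosen only for illustration.

```perl
use strict;
use warnings;

# Sketch (not the module's code): pick the $limit probabilities that are
# furthest from the neutral value 0.5.
sub most_interesting {
    my ($limit, @probs) = @_;
    my @sorted = sort { abs($b - 0.5) <=> abs($a - 0.5) } @probs;
    $limit = @sorted if $limit > @sorted;
    return @sorted[0 .. $limit - 1];
}

# Combine probabilities as prod(p) / (prod(p) + prod(1 - p)).
# Assumes each probability lies strictly between 0 and 1, which holds
# when per-token values are clamped to the range 0.01 to 0.99.
sub combine_probabilities {
    my @probs = @_;
    my ($prod, $inv_prod) = (1, 1);
    for my $p (@probs) {
        $prod     *= $p;
        $inv_prod *= (1 - $p);
    }
    return $prod / ($prod + $inv_prod);
}

my @top = most_interesting(3, 0.99, 0.5, 0.95, 0.2, 0.6);
printf "%.4f\n", combine_probabilities(@top);
```

An empty list yields 0.5, which reads naturally as "no evidence either way".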

read_from_file

Pitonyak::BayesianTokenCounter::read_from_file($file_name)
This will create an appropriate object and then read the file.

$obj->read_from_file($file_name)
Read the current file and then return the object used to read it.

set_log

set_log([$logger_instance])
If the logger instance is not present, then any existing logger will be deleted from the object.

If the object is present, then it must be an instance of Pitonyak::SmallLogger and it is set as the object to use.

skip_html_comments

skip_html_comments([0|1])
Returns, and optionally sets, the true/false value for skipping HTML comments.

tokenize_file

Pitonyak::BayesianTokenCounter::tokenize_file($file_name)
An object is created and then the file is tokenized into the object

$obj->tokenize_file($file_name)
If the $file_name is '-', then STDIN is read. If not, then the file is opened from disk and read. The file is then tokenized.

tokenize_string

tokenize_string(@strings_to_tokenize)
This assumes that the list of strings is a mail message to be tokenized. In the program, the entire file is read into a single variable and then this is called.
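To make the idea of tokenizing concrete, here is a deliberately simplified sketch that splits text into tokens and counts them. It ignores headers, MIME parts, and length limits, and its token pattern (letters, digits, apostrophes, and hyphens) is an assumption rather than the module's actual definition; count_tokens is a hypothetical name.

```perl
use strict;
use warnings;

# Simplified sketch (not the module's code): return a hash reference of
# token => count for the given text. Tokens are folded to lower case
# unless $case_sensitive is true.
sub count_tokens {
    my ($text, $case_sensitive) = @_;
    $text = lc $text unless $case_sensitive;

    my %count;
    $count{$_}++ for $text =~ /[A-Za-z0-9'-]+/g;
    return \%count;
}

my $counts = count_tokens('Free money, FREE offer', 0);
print "$_ => $counts->{$_}\n" for sort keys %$counts;
```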

tokens

tokens([$token_hash_ref])
Returns, and optionally sets, the internal token hash.

write_to_file

write_to_file([$file_name])
Write the tokens to either the current file name, or to the new file name as specified by the parameter.

This can be slow because the tokens are sorted by frequency and name.
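A two-key sort of this kind can be sketched as below. The direction (most frequent first, then alphabetical) and the token => count layout are assumptions made for the example, and tokens_in_write_order is a hypothetical name.

```perl
use strict;
use warnings;

# Sketch (not the module's code): order tokens by descending count and
# break ties alphabetically. Assumes $tokens is a hash reference of
# token => count. Call this in list context.
sub tokens_in_write_order {
    my ($tokens) = @_;
    return sort { $tokens->{$b} <=> $tokens->{$a} or $a cmp $b } keys %$tokens;
}

my %counts = (offer => 3, free => 7, click => 3);
print join(' ', tokens_in_write_order(\%counts)), "\n";
```

Sorting costs on the order of n log n comparisons for n tokens, which is why writing a large token file takes noticeably longer than reading one.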


COPYRIGHT

Copyright 1998-2002, Andrew Pitonyak (perlboy@pitonyak.org)

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.


Modification History

March 13, 1998

Version 1.00 First release

September 10, 2002

Version 1.01 Changed internal documentation to POD documentation. Added parameter checking.