Effectiveness of Bayesian Filtering and Gzip Compression as Spam Filtering Techniques

Oct 15, 2013

Author(s):

Shakuntala Baichoo, Dept. of Computer Science and Engineering
Sunilduth Baichoo, Dept. of Computer Science and Engineering
Abhisheik Coomar Motee, Dept. of Computer Science and Engineering
Extended Abstract

Spam is the abuse of electronic messaging systems to indiscriminately send unsolicited bulk messages [1]. With the emergence of the Internet, electronic mail has evolved into a human-communication medium on which people now rely. But all good things have their drawbacks, and for email the major one is Unsolicited Bulk Email (UBE), otherwise known as spam. To prevent inboxes from being flooded with spam emails, spam filters are used. A spam filter is a program that attempts to block spam messages before they actually get into a user’s inbox.

There exist different filtering techniques, which can be classified into two groups: server-side spam filtering techniques and client-side spam filtering techniques. Server-side filters are programs that reside on the mail server and deal with mail as it comes in; in most cases they either tag emails as potential spam before a user downloads them or delete them entirely. Client-side filters operate once the mails have been downloaded: they examine the mails and then decide what to do with them.

In this paper we provide a comparison of two spam filtering algorithms, namely the Bayesian algorithm and the gzip compression algorithm, and assess their suitability as client-side filtering techniques.

Bayesian filtering was proposed by Sahami et al. [2] and gained attention in 2002 when it was described in a paper by Paul Graham [3]. Since then it has become a popular mechanism for distinguishing illegitimate spam email from legitimate email. A Bayesian filter uses machine learning concepts to learn the difference between spam and non-spam over time. For a Bayesian filter to be effective, it first needs a good mix of spam and non-spam (ham) for training purposes. The Bayesian filter then identifies all the tokens¹ of each training message.

Once the tokens of a message are identified, they are turned into an ordered list, which is then added to a database of tokens. The database contains one entry for each token, together with the frequency of that token. The training process creates two separate token databases, one for spam tokens and another for normal (ham) tokens. These databases are used to determine the probability that a message is spam given that it contains particular tokens.
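
The training step described above can be illustrated with a minimal Python sketch. The token pattern, the function names `tokenize` and `train`, and the toy messages are assumptions made for illustration; the abstract does not specify how tokens are delimited or how the databases are stored.

```python
import re
from collections import Counter

# Hypothetical token pattern: runs of word characters plus a few symbols,
# so that a URL such as http://example.com/win survives as a single token.
TOKEN_RE = re.compile(r"[\w$'./:-]+")

def tokenize(text):
    """Return the ordered list of lowercase tokens found in a message."""
    return [tok.lower() for tok in TOKEN_RE.findall(text)]

def train(messages):
    """Build a token database: one entry per token with its frequency."""
    database = Counter()
    for message in messages:
        database.update(tokenize(message))
    return database

# Two separate databases, one per class (toy training messages).
spam_db = train(["Buy cheap pills now", "cheap loans, click http://example.com/win"])
ham_db = train(["Meeting moved to Monday", "Please review the attached report"])
```

Here `spam_db["cheap"]` counts how often the token was seen in spam, while looking up a token that never occurred simply yields a frequency of zero.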

The trained filter then examines an incoming message by breaking it down into tokens, just as in the training process. The spam and ham probabilities for each token are calculated using the token databases. The most interesting tokens are fed into the Bayes formula shown below, which gives the combined probability for the entire message.

$$\Pr(\text{Spam or Ham} \mid \text{Words}) = \frac{\Pr(\text{Words} \mid \text{Spam or Ham}) \cdot \Pr(\text{Spam or Ham})}{\Pr(\text{Words})}$$




¹ A token is usually a word or a word-like entity, for example a URL.

The result of the above formula always lies between 0 and 1. If the message’s probability for spam is greater than its probability for ham, the message is treated as spam, and vice versa.
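
As an illustration of how the classification step might look, the Python sketch below combines per-token probabilities with Bayes' rule. Several choices here are assumptions that the abstract does not state: tokens are treated as independent (the "naive" assumption), add-one smoothing is applied to the frequencies, the prior `prior_spam` defaults to 0.5, and the "most interesting" tokens are taken to be the `top_n` tokens whose individual spam ratio lies farthest from 0.5, loosely in the spirit of [3].

```python
import math
from collections import Counter

def posterior_spam(tokens, spam_db, ham_db, prior_spam=0.5, top_n=15):
    """Return Pr(Spam | tokens) under a naive independence assumption."""
    vocab = max(1, len(set(spam_db) | set(ham_db)))
    spam_total = sum(spam_db.values())
    ham_total = sum(ham_db.values())

    scored = []
    for tok in set(tokens):
        # Add-one smoothing keeps unseen tokens from zeroing the product.
        p_s = (spam_db.get(tok, 0) + 1) / (spam_total + vocab)
        p_h = (ham_db.get(tok, 0) + 1) / (ham_total + vocab)
        # "Interestingness": distance of the token's spam ratio from 0.5.
        scored.append((abs(p_s / (p_s + p_h) - 0.5), p_s, p_h))
    scored.sort(reverse=True)

    # Accumulate Pr(Words | class) * Pr(class) in log space for stability.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for _, p_s, p_h in scored[:top_n]:
        log_spam += math.log(p_s)
        log_ham += math.log(p_h)

    # Dividing by Pr(Words) = sum over both classes normalises to [0, 1].
    peak = max(log_spam, log_ham)
    e_spam = math.exp(log_spam - peak)
    e_ham = math.exp(log_ham - peak)
    return e_spam / (e_spam + e_ham)

# Toy token databases (token -> frequency), invented for this example.
spam_db = Counter({"cheap": 4, "pills": 3, "click": 2})
ham_db = Counter({"meeting": 3, "report": 2, "monday": 2})

p = posterior_spam(["cheap", "pills"], spam_db, ham_db)
label = "spam" if p > 0.5 else "ham"
```

Because the ham probability is one minus the returned value, comparing the result with 0.5 is equivalent to asking whether the spam probability exceeds the ham probability.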

The gzip compression method builds a model of spam and a model of ham. A new message is compressed using both models. If the message compresses better with the spam model, it is most likely spam; if it compresses better with the ham model, it is most likely legitimate. The compression algorithm looks for repeated strings within a text and replaces each repeat with a reference to an earlier occurrence. The compression ratio achieved therefore measures how many repeated fragments, words or phrases occur in the text. A message gets a better compression ratio if it has more fragments, words or phrases in common with a model, and a worse ratio if it is dissimilar.
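
One way to approximate this idea is sketched below in Python, using the standard `zlib` module (which implements DEFLATE, the algorithm underlying gzip). The abstract does not say how the models are represented; this sketch assumes each model is simply a byte string of example text, and it measures how many extra compressed bytes a message costs when appended to that text. The function names and the toy models are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Size in bytes of the data after DEFLATE compression at level 9."""
    return len(zlib.compress(data, 9))

def extra_bytes(model: bytes, message: bytes) -> int:
    """Bytes the message adds when compressed together with a model."""
    return compressed_size(model + message) - compressed_size(model)

def classify(message: str, spam_model: bytes, ham_model: bytes) -> str:
    """Label the message by whichever model it compresses better with."""
    data = message.encode("utf-8")
    # Ties fall through to "ham" so that doubtful mail is not discarded.
    if extra_bytes(spam_model, data) < extra_bytes(ham_model, data):
        return "spam"
    return "ham"

# Toy models; a real model would hold far more text.
spam_model = b"Buy cheap pills now. Click here for cheap loans. Win a free prize today.\n"
ham_model = b"The meeting is moved to Monday. Please review the attached report before then.\n"

result = classify("Click here for cheap pills", spam_model, ham_model)
```

A caveat on this design: DEFLATE only looks back over a 32 KB window, so with this concatenation approach only the tail of a large model influences the result.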

The gzip algorithm was compared to the Bayesian algorithm using the SpamAssassin corpus. The two were tested for accuracy and speed, with fixed corpus sizes, on the same machine. Bayesian filtering proved to be better in terms of both accuracy and speed.
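
A comparison of this kind could be set up along the following lines. This Python sketch is only a measurement harness: the `evaluate` function, the keyword-based stand-in classifier and the three sample messages are hypothetical, and none of them reproduces the corpus, the filters or the results reported here.

```python
import time

def evaluate(classify, labelled_messages):
    """Return (accuracy, elapsed_seconds) over (text, label) pairs."""
    start = time.perf_counter()
    correct = sum(1 for text, label in labelled_messages if classify(text) == label)
    elapsed = time.perf_counter() - start
    return correct / len(labelled_messages), elapsed

# Toy stand-ins only: not the SpamAssassin corpus and not a real filter.
sample = [
    ("cheap pills", "spam"),
    ("meeting on Monday", "ham"),
    ("cheap flights for the team", "ham"),
]

def keyword_filter(text):
    """Placeholder classifier used purely to exercise the harness."""
    return "spam" if "cheap" in text else "ham"

accuracy, seconds = evaluate(keyword_filter, sample)
```

Passing each filter to the same harness with the same labelled messages keeps the corpus size and the machine fixed, as the comparison requires.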

After comparing the two algorithms, an application program was developed using the built-in heuristic capabilities of Mozilla Thunderbird. The mbox file containing all the concatenated mails, downloaded from a POP3 mail server, was passed through the application program for spam classification using both the gzip algorithm and the Bayesian algorithm.
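
For illustration, reading an mbox file and handing each message to a classifier might look like the Python sketch below, which uses the standard `mailbox` module. It is not the application described here and does not involve Thunderbird; the function names and the caller-supplied `classify` callback are assumptions.

```python
import mailbox

def iter_mbox_texts(path):
    """Yield (subject, plain_text_body) for each message in an mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            parts = msg.walk() if msg.is_multipart() else [msg]
            chunks = []
            for part in parts:
                # Keep only text/plain parts; attachments are skipped.
                if part.get_content_type() == "text/plain":
                    payload = part.get_payload(decode=True) or b""
                    chunks.append(payload.decode("utf-8", errors="replace"))
            yield msg.get("Subject", ""), "\n".join(chunks)
    finally:
        box.close()

def classify_mbox(path, classify):
    """Apply a caller-supplied classify(text) -> label to every message."""
    return [(subject, classify(body)) for subject, body in iter_mbox_texts(path)]
```

Any classifier that maps message text to a label can be passed as `classify`, so the same loop can drive either of the two algorithms.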

As stated above, the Bayesian algorithm proved to be better than the gzip algorithm in terms of speed and accuracy, but gzip was also found to be a good filter. Most spam filters that exist today use not just one method but several to determine the “spamminess” of an e-mail message, so it was essential to use both. The filter application, though developed as a client-side application, can also reside on the server. However, as pointed out earlier, it should not be used as the only means of spam filtering; it should be used in conjunction with other filtering techniques.








References:

[1] Wikipedia - Spam Electronic, Available from: http://en.wikipedia.org/wiki/Spam_ [Accessed in July 2008].

[2] Sahami, M., Dumais, S., Heckerman, D., Horvitz, E., “A Bayesian Approach to Filtering Junk E-Mail”, Learning for Text Categorization: Papers from the 1998 Workshop. AAAI Technical Report WS-98-05.

[3] Graham, P., “A Plan for Spam”, August 2002, Available at: http://paulgraham.com/antispam.html [Accessed in July 2008].