A Neural Network Classifier for Junk E-Mail

Ian Stuart, Sung-Hyuk Cha, and Charles Tappert

CSIS Student/Faculty Research Day

May 7, 2004

Spam, spam, spam, …

Fighting spam


Several commercial applications exist


Server-side: expensive


Client-side: time-consuming


No approach is 100% effective


Spammers are aggressive and adaptable


Best solutions are typically hybrids of
different approaches and criteria

Common approaches


Simple filters


Common words or phrases


Unusual punctuation or capitalization


Blacklisting: “just say NO” (if you can)


Reject e-mail from known spammers


Whitelisting: “friends only, please”


Accept e-mail only from known correspondents


Classifiers: examine each e-mail and decide


Only a few publications on spam classifiers

Naïve Bayesian classifiers


Used in commercial classifiers


Assumes recognition features are independent


Max likelihood = product of likelihoods of features


E-mail classifier examines each word


Training assigns a probability to each word


Look up each word/probability in a dictionary


If the product of the probabilities exceeds a given
threshold, it is spam


Challenge: creating the “dictionary”


We compare our Neural Network against two
published Naïve Bayesian classifiers
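
The word-lookup scheme described above can be sketched as follows. This is a minimal illustration, not the method of either published classifier: the dictionary entries, the default probability for unknown words, and the 0.9 threshold are all made-up values, and the score normalises the product of spam probabilities against the product of their complements (a common variant of the plain product rule).

```python
import math
import re

# Hypothetical per-word spam probabilities; a real dictionary comes from training.
WORD_SPAM_PROB = {"guarantee": 0.95, "refinance": 0.97, "tennis": 0.05, "schedule": 0.10}
DEFAULT_PROB = 0.4  # assumed value for words missing from the dictionary


def spam_score(text):
    """Combine per-word probabilities, treating the words as independent."""
    words = re.findall(r"[a-z']+", text.lower())
    # Sum logs instead of multiplying, so long messages do not underflow.
    log_spam = sum(math.log(WORD_SPAM_PROB.get(w, DEFAULT_PROB)) for w in words)
    log_ham = sum(math.log(1.0 - WORD_SPAM_PROB.get(w, DEFAULT_PROB)) for w in words)
    # Normalise the two products into a score between 0 and 1.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def is_spam(text, threshold=0.9):
    """Flag the message when the combined score exceeds the threshold."""
    return spam_score(text) > threshold
```

Every choice here (which words, what default, what threshold) is exactly one of the open issues such a classifier has to settle.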

Naïve Bayesian classifier issues


How many features (words), which ones?


How is degradation avoided as spammers’
vocabulary changes?


What values are assigned to new words?


What are the thresholds?


How to avoid “sabotage” of classifier?

Which one isn’t spam?

(subject headers)


5 Be a mighty warrior in bed! vcrhwt ygjztyjjh


Money Back Guarantee_HGH


kindle life pddez liw mzac


v a l i u m - D i a z e p a m used to relieve anxiety


Fairfield tennis schedule


:Dramatic E,nhancement fo=r .Men = f"fumqid


,Refina'nce now. Don't wait

Spammers make patterns


The more they try to hide, the easier it
is to see them


Therefore, we use common spammer
patterns (instead of vocabulary) as
features for classification


Learn these patterns with a Neural
Network

Neural Network features


Total of 17 features



6 from the subject header



2 from priority and content-type headers



9 from the e-mail body

Features from subject header

1. Number of words with no vowels

2. Number of words with at least two of letters J, K, Q, X, Z

3. Number of words with at least 15 characters

4. Number of words with non-English characters, special characters such as punctuation, or digits at beginning or middle of word

5. Number of words with all letters in uppercase

6. Binary feature indicating 3 or more repeated characters
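
A rough sketch of how the six subject-header features might be computed. The slides do not give exact rules, so several details are assumptions: words are split on whitespace, only a, e, i, o, u count as vowels, "beginning or middle" means any position except the last character, and "repeated characters" means the same character three or more times in a row. The function name `subject_features` is hypothetical.

```python
import re

VOWELS = set("aeiou")   # assumption: "y" is not treated as a vowel
RARE = set("jkqxz")


def subject_features(subject):
    """Return the six subject-header features as a list [f1, ..., f6]."""
    words = subject.split()
    # 1. Words that contain letters but no vowels.
    f1 = sum(1 for w in words
             if any(c.isalpha() for c in w) and not (VOWELS & set(w.lower())))
    # 2. Words with at least two occurrences of J, K, Q, X, Z.
    f2 = sum(1 for w in words if sum(c in RARE for c in w.lower()) >= 2)
    # 3. Words with at least 15 characters.
    f3 = sum(1 for w in words if len(w) >= 15)
    # 4. Words with a non-English letter, punctuation mark, or digit
    #    anywhere except the final position.
    f4 = sum(1 for w in words
             if any(not (c.isascii() and c.isalpha()) for c in w[:-1]))
    # 5. Words written entirely in uppercase letters.
    f5 = sum(1 for w in words if w.isalpha() and w.isupper())
    # 6. Binary: some character appears three or more times in a row.
    f6 = 1 if re.search(r"(.)\1\1", subject) else 0
    return [f1, f2, f3, f4, f5, f6]
```

Applied to the sample subject "5 Be a mighty warrior in bed! vcrhwt ygjztyjjh", the two random letter strings are what trigger features 1 and 2.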

Features from priority and content-type headers

1. Binary feature indicating whether the priority had been set to any level besides normal or medium

2. Binary feature indicating whether a content-type header appeared within the message headers or whether the content type had been set to “text/html”

Features from message body

1. Proportion of alphabetic words with no vowels and at least 7 characters

2. Proportion of alphabetic words with at least two of letters J, K, Q, X, Z

3. Proportion of alphabetic words at least 15 characters long

4. Binary feature indicating whether the strings “From:” and “To:” were both present

5. Number of HTML opening comment tags

6. Number of hyperlinks (“href=”)

7. Number of clickable images represented in HTML

8. Binary feature indicating whether a text color was set to white

9. Number of URLs in hyperlinks with digits or “&”, “%”, or “@”
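
The nine body features could be extracted along these lines. This is a sketch under stated assumptions, not the authors' parser: tags are stripped with a simple regular expression before words are counted, a "clickable image" is taken to be an `<img>` directly inside an `<a>` tag, and "white" is matched as the keyword or as `#fff`/`#ffffff`. The function name `body_features` is hypothetical.

```python
import re

VOWELS = set("aeiou")
RARE = set("jkqxz")


def body_features(body):
    """Return the nine message-body features as a list [f1, ..., f9]."""
    # Drop HTML tags, then keep tokens made only of English letters.
    text = re.sub(r"<[^>]*>", " ", body)
    tokens = (t.strip(".,;:!?\"'()") for t in text.split())
    words = [t.lower() for t in tokens if t.isalpha() and t.isascii()]
    n = len(words) or 1  # avoid division by zero on empty bodies

    f1 = sum(len(w) >= 7 and not (VOWELS & set(w)) for w in words) / n
    f2 = sum(sum(c in RARE for c in w) >= 2 for w in words) / n
    f3 = sum(len(w) >= 15 for w in words) / n
    f4 = int("From:" in body and "To:" in body)
    f5 = body.count("<!--")
    f6 = len(re.findall(r"href\s*=", body, re.I))
    # Assumption: an image is clickable when it directly follows an <a ...> tag.
    f7 = len(re.findall(r"<a\b[^>]*>\s*<img\b", body, re.I))
    # Matches color="white" or color:#fff, but not bgcolor or background-color.
    f8 = int(bool(re.search(
        r"(?<![-\w])color\s*[=:]\s*[\"']?\s*(white|#ffffff|#fff)\b", body, re.I)))
    urls = re.findall(r"href\s*=\s*[\"']?([^\"'\s>]+)", body, re.I)
    f9 = sum(1 for u in urls if re.search(r"[0-9&%@]", u))
    return [f1, f2, f3, f4, f5, f6, f7, f8, f9]
```

Features 1 to 3 are proportions so that long and short messages are comparable; the remaining features are raw counts or binary flags, as on the slide.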


Neural Network spam classifier


3-layer, feed-forward network (Perceptron)


17 input units, variable # hidden layer units, 1 output unit


Data


1,654 e-mails: 854 spam, 800 legitimate


Use half of each (spam/non-spam) for training,
the other half for testing


Test with variations of hidden nodes (4 to 14)
and epochs (100 to 500)
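
A network of this shape can be sketched in plain Python. The slides do not state the activation function, learning rate, or training algorithm, so the sigmoid units, the squared-error loss, the learning rate of 0.1, and the per-example backpropagation update below are all assumptions; `TinyMLP` and `train` are hypothetical names.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """17 inputs -> n_hidden sigmoid units -> 1 sigmoid output."""

    def __init__(self, n_in=17, n_hidden=12, seed=0):
        rng = random.Random(seed)
        # One weight row per hidden unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                  for row in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden))
                      + self.w_out[-1])
        return hidden, out

    def train_step(self, x, target, lr=0.1):
        """One backpropagation update on a single example."""
        hidden, out = self.forward(x)
        delta_out = (out - target) * out * (1.0 - out)
        # Hidden deltas use the output weights from before they are updated.
        delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                        for j, h in enumerate(hidden)]
        for j, h in enumerate(hidden):
            self.w_out[j] -= lr * delta_out * h
        self.w_out[-1] -= lr * delta_out
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(x):
                row[i] -= lr * delta_hidden[j] * v
            row[-1] -= lr * delta_hidden[j]

    def predict(self, x):
        """Return 1 (spam) when the output unit fires at 0.5 or above."""
        return 1 if self.forward(x)[1] >= 0.5 else 0


def train(net, data, epochs=100, lr=0.1):
    """data: list of (feature_vector, label) pairs, label 1 for spam."""
    for _ in range(epochs):
        for x, y in data:
            net.train_step(x, y, lr)
```

Varying `n_hidden` from 4 to 14 and `epochs` from 100 to 500 would reproduce the shape of the experiment grid, though not necessarily its results.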

Definitions used for classifier
success measures

$n_{SS}$ = number of spam classified as spam

$n_{SL}$ = number of spam classified as legitimate

$n_{LL}$ = number of legitimate classified as legitimate

$n_{LS}$ = number of legitimate classified as spam

Measure of success: precision

Precision: the percentage of labeled
spam/legitimate e-mail correctly classified

Measure of success: accuracy

Accuracy: the percentage of actual
spam/legitimate e-mail correctly classified
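
The formulas on these two slides did not survive extraction. The sketch below reconstructs them from the verbal definitions and the four counts defined earlier; it is a reconstruction, not the original slide content. The counts used in the example are inferred from the reported results (427 test spam with 35 misclassified, 400 test legitimate with 32 misclassified), and the function names are hypothetical. The per-class "accuracy" defined here is what is more commonly called recall.

```python
def spam_precision(n_ss, n_ls):
    """Of everything labeled spam, the percentage that really is spam."""
    return 100.0 * n_ss / (n_ss + n_ls)


def legit_precision(n_ll, n_sl):
    """Of everything labeled legitimate, the percentage that really is."""
    return 100.0 * n_ll / (n_ll + n_sl)


def spam_accuracy(n_ss, n_sl):
    """Of all actual spam, the percentage labeled spam."""
    return 100.0 * n_ss / (n_ss + n_sl)


def legit_accuracy(n_ll, n_ls):
    """Of all actual legitimate mail, the percentage labeled legitimate."""
    return 100.0 * n_ll / (n_ll + n_ls)


# Counts inferred from the reported test-set results.
n_ss, n_sl, n_ll, n_ls = 392, 35, 368, 32
print(f"spam precision  {spam_precision(n_ss, n_ls):.2f}%")    # 92.45%
print(f"legit precision {legit_precision(n_ll, n_sl):.2f}%")   # 91.32%
print(f"spam accuracy   {spam_accuracy(n_ss, n_sl):.2f}%")     # 91.80%
print(f"legit accuracy  {legit_accuracy(n_ll, n_ls):.2f}%")    # 92.00%
```

These definitions reproduce all four reported percentages, which supports the reconstruction.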


Neural Network results


Best overall results with 12 hidden nodes at
500 epochs


Spam Precision: 92.45%


Legitimate Precision: 91.32%


Spam Accuracy: 91.80%


Legitimate Accuracy: 92.00%


35 spams misclassified: 8.20%


32 legitimates misclassified: 8.00%

Misclassified e-mails


Most spam misclassified as legitimate
were short in length, with few hyperlinks


Most legitimate e-mails misclassified as
spam had unusual features for personal
e-mail (that is, they were “spam-like” in
appearance)

Comparing Neural Network and Naïve Bayesian Classifiers


Accuracy of the NN classifier is comparable to
that reported for Naïve Bayesian classifiers


NN classifier required fewer features (17 versus
100 in one study and 500 in another)


NN classifier uses descriptive qualities of words
and messages similar to those used by human
readers


Blacklisting Experiment


Manually entered IP addresses of e-mail
incorrectly tagged by NN classifier


Entered first (original) IP address and, when present,
second IP address (e.g., mail server or ISP)


Submitted the addresses to a website that sends them to 173
working spam blacklists and returns the number of hits,
http://www.declude.com/junkmail/support/ip4r.htm



Counted only hit counts greater than one as spam,
since single-list hits were assumed to be anomalies
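
The counting rule can be sketched as below. The experiment used a web form, so the DNS-style query name is only an illustration of how IPv4 blacklists are conventionally queried (octets reversed, list zone appended); the zone `dnsbl.example.org` is a placeholder, not one of the 173 lists. How the hit counts of a message's first and second IP address were combined is not stated, so "either address qualifies" is an assumption.

```python
def dnsbl_query_name(ip, zone):
    """Build the conventional DNS blacklist query name for an IPv4 address."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def tagged_as_spam(hit_counts, min_hits=2):
    """hit_counts: blacklist hits for each IP address taken from one message.

    A single-list hit is ignored; two or more hits on any address count as spam.
    """
    return any(h >= min_hits for h in hit_counts)


print(dnsbl_query_name("192.0.2.10", "dnsbl.example.org"))
print(tagged_as_spam([1, 0]), tagged_as_spam([3]))
```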

Blacklisting Experimental Results


Of the 32 legitimate e-mails misclassified
by the NN, 53% were identified as spam


Of the 35 spam e-mails misclassified by
the NN, 97% were identified as spam


These poor results indicate that the
blacklisting strategy, at least for these
databases, is inadequate

Conclusions


NN is competitive with Naïve Bayesian studies
despite using a much smaller feature set


Room for refinement of parsing for features


Use of descriptive, more human-like
features makes NN less subject to
degradation than Naïve Bayesian

Conclusions (cont.)


Neural Network approach is useful and
accurate, but too many legitimate -> spam


Should be powerful when used in
conjunction with a whitelist to reduce
legitimate -> spam ($n_{LS}$), increasing spam
precision and legitimate accuracy


Blacklisting strategy is not very helpful