Computer aided mail filtering using SVM

Lin Liao, Jochen Jaeger
Department of Computer Science & Engineering
University of Washington, Seattle
March 14, 2002
CSE 574
Efficient Mail Filtering Using SVM
Introduction
What is SPAM?

Electronic version of junk mail, bulk-mail, unwanted messages

Unsolicited Commercial E-mail (UCE), Unsolicited Bulk Email (UBE)
Some impacts

Annoying unsolicited advertisement or even harmful, offending content

Waste of resources (time and bandwidth), ~1 million per day

Threat to the viability of e-mail and Internet commerce
Reasons for the proliferation of spam

Efficiency and increasing popularity of e-mail, prone to misuse

Very low cost for distribution with relatively high number of responses

Trading of private e-mail addresses has become a business
Filtering E-mail
Typical counter-measures
Either technical or regulatory; several bills and lawsuits have had limited effect
Anti-spam filters
Automatically identify/classify incoming messages

Accept legitimate messages and reject spam
Several approaches
Blacklist of spam sender addresses: undermined by forged headers and faked addresses

Simple rule-based solutions, hand-crafted key-word patterns

Perform poorly, require frequent updates and fine-tuning

More sophisticated solutions take advantage of machine learning

Adaptive, user-specific, improves with experience
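
To illustrate why hand-crafted keyword patterns need frequent updates, here is a minimal sketch of a rule-based filter. The pattern list and the function name `looks_like_spam` are illustrative assumptions and do not come from the slides.

```python
import re

# Hypothetical hand-crafted patterns; a real rule set would be much larger.
SPAM_PATTERNS = [
    re.compile(r"\bunsubscribe\b", re.IGNORECASE),
    re.compile(r"\bclick here\b", re.IGNORECASE),
    re.compile(r"\bspecial offer\b", re.IGNORECASE),
]


def looks_like_spam(text):
    """Return True if any hand-crafted pattern matches the message text."""
    return any(p.search(text) for p in SPAM_PATTERNS)


# The first message is caught; the second is trivially obfuscated and slips through.
print(looks_like_spam("Click here to unsubscribe"))
print(looks_like_spam("C1ick h3re to un-subscribe"))
```

Every new obfuscation requires another rule, which is the maintenance burden that motivates a learned filter.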
Mail Filtering Using SVM
Support Vector Machines (SVM) for mail filtering
Outperforms other techniques: rule-based learners, decision trees, naïve Bayesian classifiers
Scenario

Training mode: analyze a training set of marked (classified) mails

Extract features, build feature database with number of occurrences for each class

Feature selection: select the most decisive (unbalanced) features from the database

Build a feature vector for each message: project the message into feature space

Use SVM to learn feature characteristics and build support vectors

Classify incoming mails using SVM

Regularly update feature database with new messages

Repeat feature selection and apply SVM again
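
The feature selection step above can be sketched with the textbook chi-square statistic for a 2x2 contingency table (feature present or absent versus legitimate or spam). The slides do not state the exact formula behind their chi2 and chi2n scores, so this is an assumption and will not necessarily reproduce their numbers. The function names, the per-class totals of 500 messages, and the example counts are all illustrative.

```python
def chi_square(legit_with, spam_with, n_legit, n_spam):
    """Chi-square score of one feature from its per-class message counts."""
    a, b = legit_with, spam_with            # messages containing the feature
    c, d = n_legit - a, n_spam - b          # messages lacking the feature
    n = n_legit + n_spam
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0                          # feature in all or no messages
    return n * (a * d - c * b) ** 2 / denom


def select_features(counts, n_legit, n_spam, k):
    """Return the k features whose class counts are most unbalanced."""
    scored = sorted(
        ((chi_square(l, s, n_legit, n_spam), name)
         for name, (l, s) in counts.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]


# Illustrative database: feature -> (legitimate count, spam count).
database = {".edu": (295, 0), "click": (4, 256), "but": (247, 93)}
print(select_features(database, n_legit=500, n_spam=500, k=2))
```

A feature that occurs equally often in both classes scores zero, which matches the slide's notion that the most "unbalanced" features are the most decisive.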
Analyzing Mail
Traditional text classification
Each word is a feature, each document is a binary feature vector
Special text categorization suggests several extensions 
Integer value vectors for number of occurrences

Different types of features, e.g.

Single words, pairs of words, and characters: body content

Domain-specific, non-phrasal features, mostly header information

Addresses: Resolved familiar E-mail addresses, domain, domain type

Content-type, attachments, priority, time zone, receive path

Percentage of non-alphanumerical characters and capital letters
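
As a rough sketch of the difference between binary and integer-valued vectors, the following counts single words and adjacent word pairs in a message body and projects them onto a fixed list of selected features. The tuple representation, the function names, and the sample message are assumptions made for illustration.

```python
from collections import Counter


def extract_body_features(body):
    """Count single words and adjacent word pairs in a message body."""
    words = body.lower().split()
    counts = Counter(("word", w) for w in words)
    counts.update(("pair", f"{a} {b}") for a, b in zip(words, words[1:]))
    return counts


def to_vector(counts, selected, binary=False):
    """Project feature counts onto a fixed, ordered list of selected features."""
    if binary:
        return [1 if counts[f] > 0 else 0 for f in selected]
    return [counts[f] for f in selected]


selected = [("word", "offer"), ("pair", "to unsubscribe"), ("word", "thanks")]
counts = extract_body_features("Special offer offer click to unsubscribe")
print(to_vector(counts, selected))               # integer occurrence counts
print(to_vector(counts, selected, binary=True))  # presence or absence only
```

The integer vector keeps the information that "offer" appears twice, which the binary vector discards.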
Feature data structure

Type of the feature plus a string value; non-string values are discretized
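
One way to fit a numeric measurement into a (type, string value) feature is to bucket it. The sketch below does this for the percentage of non-alphanumeric characters. The bucket width of 10, the choice to ignore whitespace, and all names are assumptions; the slides do not say how the discretization was done.

```python
def non_alnum_percentage(text):
    """Percentage of non-whitespace characters that are not alphanumeric."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return 100.0 * sum(1 for c in chars if not c.isalnum()) / len(chars)


def discretize(percentage, bucket_width=10):
    """Map a percentage to a string bucket label such as '50-60'."""
    low = int(percentage // bucket_width) * bucket_width
    low = min(low, 100 - bucket_width)   # keep 100% inside the top bucket
    return f"{low}-{low + bucket_width}"


def char_feature(text):
    """Build a (type, string value) feature from a numeric measurement."""
    return ("non_alnum_pct", discretize(non_alnum_percentage(text)))


print(char_feature("ab!!"))
```

After bucketing, the numeric feature can be stored and counted in the feature database exactly like a word feature.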
SVM classifier

Find separating hyperplane with max distance to closest training example

Advantage: avoids overfitting

We used the libsvm implementation for Java
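
To make the "max distance to closest training example" criterion concrete, the sketch below computes the geometric margin of a candidate hyperplane w·x + b = 0 and compares two candidates on a toy data set. It does not use libsvm and does not train an SVM; the data points, the hyperplanes, and the function name `margin` are illustrative assumptions.

```python
import math


def margin(w, b, points, labels):
    """Smallest signed distance from any training point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )


# Toy 2-D training set: label -1 on the left, +1 on the right.
points = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
labels = [-1, -1, 1, 1]

centered = margin((1.0, 0.0), -2.0, points, labels)   # hyperplane x = 2
shifted = margin((1.0, 0.0), -1.5, points, labels)    # hyperplane x = 1.5
print(centered, shifted)
```

Both hyperplanes separate the classes, but the centered one keeps the closest example farther away, so it is the one a maximum-margin classifier prefers.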
Result 1
Prediction accuracy for different feature selection methods
[Figure: accuracy (y-axis, 73% to 100%) against number of features (x-axis, 0 to 200) for the chi2 and chi2n selection methods]
Results, top 30 features for chi2
T    F            chi2         W
5    295 : 0      398.59744    '.edu'
50   283 : 15     335.20468    're'
14   245 : 4      317.8967     '.edu'
13   236 : 2      312.24432    'cs.washington.edu'
54   290 : 30     302.8129     'jochen'
35   15 : 440     286.56357    'text/html'
10   200 : 1      266.90472    'jochen jaeger'
12   195 : 1      260.14935    'jj@cs.washington.edu'
56   42 : 427     216.32468    '#'
56   29 : 383     212.34502    '"'
4    151 : 0      204.02783    'cs.washington.edu'
56   20 : 331     194.832      '%'
35   415 : 181    179.06517    'text/plain'
54   4 : 256      178.75574    'click'
56   67 : 419     164.30302    '&'
54   123 : 2      159.62344    'seattle'
54   246 : 72     157.80038    'me'
54   1 : 214      155.65927    'removed'
55   133 : 8      154.75426    'jochen jaeger'
52   28 : 288     146.57521    '!'
54   0 : 196      145.05864    'unsubscribe'
54   219 : 63     142.36621    'my'
54   121 : 9      135.88109    'washington'
55   1 : 186      134.93947    'be removed'
54   12 : 223     134.66705    'offer'
54   147 : 24     131.97598    'thanks'
55   0 : 177      130.99684    'to unsubscribe'
54   99 : 2       127.22627    'wrote'
54   247 : 93     126.156136   'but'
54   182 : 48     126.041695   'hi'

Feature-type labels shown on the slide: Subject, Body, chars, From, To, Content type