
(Naive) Bayesian Text Classification for Spam Filtering

David D. Lewis, Ph.D.

Ornarose, Inc.

& David D. Lewis Consulting

www.daviddlewis.com




Presented at ASA Chicago Chapter Spring Conference, Loyola Univ., May 7, 2004.



Copyright 2004, David D. Lewis


Menu

Spam

Spam Filtering

Classification for Spam Filtering



Classification

Bayesian Classification

Naive Bayesian Classification

Naive Bayesian Text Classification

Naive Bayesian Text Classification for Spam Filtering


(Feature Extraction for) Spam Filtering


Text Classification (for Marketing)

(Better) Bayesian Classification





Spam


Unsolicited bulk email


or, in practice, whatever email you don’t want


Large fraction of all email sent


Brightmail est. 64%, Postini est. 77%


Still growing


Est. cost to US businesses exceeded $30 billion in 2003


Approaches to Spam Control


Economic (email pricing, ...)


Legal (CAN-SPAM, ...)


Societal pressure (trade groups, ...)


Securing infrastructure (email servers, ...)


Authentication (challenge/response,...)


Filtering



Spam Filtering


Intensional (feature-based) vs. Extensional (white/blacklist-based)


Applied at sender vs. receiver


Applied at email client vs. mail server vs. ISP


Statistical Classification

1. Define classes of objects

2. Specify probability distribution model connecting classes to observable features

3. Fit parameters of model to data

4. Observe features on inputs and compute probability of class membership

5. Assign object to a class
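As a toy illustration of the five steps (my example, not from the talk, using scikit-learn's BernoulliNB; the feature matrix and labels are invented):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

classes = np.array(["nonspam", "spam"])                      # 1. define classes
model = BernoulliNB()                                        # 2. specify probability model
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]])   # observed binary features
y = np.array([1, 0, 1, 0])                                   # labels index into classes
model.fit(X, y)                                              # 3. fit parameters to data
probs = model.predict_proba(np.array([[1, 0, 1]]))           # 4. compute class probabilities
print(classes[probs.argmax()])                               # 5. assign object to a class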


Classifier

[Diagram: message -> Feature Extraction -> CLASSIFIER -> Interpreter]



Classification for Spam Filtering

Define classes: [diagram: example messages, spam vs. nonspam]

Extract features from header, content

Train classifier

Classify message and process:

Block message, insert tag, put in folder, etc.

Two Classes of Classifier


Generative: Naive Bayes, LDA,...


Model joint distribution of class and features


Derive class probability by Bayes rule


Discriminative: logistic regression, CART, ...

Model conditional distribution of class given known feature values


Model directly estimates class probability
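In symbols, with x the feature vector and c the class, the generative route applies Bayes rule:

P(c \mid x) = \frac{P(c)\,P(x \mid c)}{\sum_{c'} P(c')\,P(x \mid c')}

while the discriminative route estimates P(c \mid x) directly.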


Bayesian Classification (1)

1. Define classes

2. Specify probability model

2b. And prior distribution over parameters

3. Find posterior distribution of model parameters, given data

4. Compute class probabilities using posterior distribution (or element of it)

5. Classify object


Bayesian Classification (2)


= "Naive"/"Idiot"/"Simple" Bayes

A particular generative model

Assumes independence of observable features within each class of messages

Bayes rule used to compute class probability

Might or might not use a prior on model parameters
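Written out, the assumption is that within each class c the likelihood of a d-dimensional feature vector factors:

P(x_1, \ldots, x_d \mid c) = \prod_{j=1}^{d} P(x_j \mid c)

Each term P(x_j \mid c) can then be estimated separately, which is what makes fitting so cheap.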


Naive Bayes for Text Classification: History

Maron (JACM, 1961): automated indexing

Mosteller and Wallace (1964): author identification

van Rijsbergen, Robertson, Sparck Jones, Croft, Harper (early 1970s): search engines

Sahami, Dumais, Heckerman, Horvitz (1998): spam filtering





Bayesian Classification (3)

Graham’s A Plan for Spam

And its mutant offspring...

Naive Bayes-like classifier with weird parameter estimation

Widely used in spam filters

Classic Naive Bayes superior when appropriately used


NB & Friends: Advantages


Simple to implement


No numerical optimization, matrix algebra, etc.


Efficient to train and use


Fitting = computing means of feature values


Easy to update with new data


Equivalent to linear classifier, so fast to apply


Binary or polytomous
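A minimal from-scratch sketch of these points, assuming binary (word-presence) features; the function names, toy messages, and Laplace smoothing choice are mine, not from the talk. Fitting really is just computing smoothed per-class means of the features:

import math
from collections import Counter

def train(docs, labels):
    # Fitting = computing means: a word's parameter for a class is its
    # Laplace-smoothed fraction of that class's training documents containing it.
    vocab = {w for d in docs for w in d}
    model = {}
    for c in set(labels):
        class_docs = [set(d) for d, l in zip(docs, labels) if l == c]
        n = len(class_docs)
        df = Counter(w for d in class_docs for w in d)
        model[c] = {"log_prior": math.log(n / len(docs)),
                    "p": {w: (df[w] + 1) / (n + 2) for w in vocab}}
    return model, vocab

def score(model, vocab, doc):
    # Sum of per-feature log terms: this additive form is why naive Bayes
    # is equivalent to a linear classifier and fast to apply.
    words, scores = set(doc), {}
    for c, m in model.items():
        s = m["log_prior"]
        for w in vocab:
            p = m["p"][w]
            s += math.log(p if w in words else 1 - p)
        scores[c] = s
    return scores

docs = [["buy", "pills", "now"], ["meeting", "notes"],
        ["cheap", "pills"], ["lunch", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
model, vocab = train(docs, labels)
scores = score(model, vocab, ["cheap", "pills", "now"])
print(max(scores, key=scores.get))  # -> spam

Updating with new data just means updating the counts, which is why incremental training is easy.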


NB & Friends: Advantages


Independence allows parameters to be estimated on different data sets, e.g.

Estimate content features from messages with headers omitted

Estimate header features from messages with content missing


NB & Friends: Advantages


Generative model


Comparatively good effectiveness with small training sets

Unlabeled data can be used in parameter estimation (in theory)



NB & Friends: Disadvantages


Independence assumption wrong


Absurd estimates of class probabilities


Threshold must be tuned, not set analytically


Generative model


Generally lower effectiveness than discriminative techniques (e.g. logistic regression)

Improving parameter estimates can hurt classification effectiveness



Feature Extraction


Convert message to feature vector


Header: sender, recipient, routing,…


Possibly break up domain names


Text


Words, phrases, character strings


Become binary or numeric features


URLs, HTML tags, images,…
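A rough sketch of such an extractor (my own illustration: the feature-naming scheme is invented, and it assumes a single-part, plain-text message):

import re
from email import message_from_string

def extract_features(raw_message):
    msg = message_from_string(raw_message)
    feats = set()
    # Header features: sender domain, optionally broken into components.
    match = re.search(r"@([\w.-]+)", msg.get("From", ""))
    if match:
        domain = match.group(1).lower()
        feats.add("from_domain=" + domain)
        for part in domain.split("."):
            feats.add("from_domain_part=" + part)
    body = msg.get_payload()  # assumes a single-part message
    # Text features: lowercase word tokens become binary features.
    for word in re.findall(r"[a-z']+", body.lower()):
        feats.add("word=" + word)
    # Markup features: URLs and HTML image tags.
    if re.search(r"https?://", body):
        feats.add("has_url")
    if re.search(r"<\s*img", body, re.IGNORECASE):
        feats.add("has_image_tag")
    return feats

raw = "From: a@mail.example.com\nSubject: hello\n\nbuy pills at http://x.example\n"
print(sorted(extract_features(raw)))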


[Annotated example spam message:

From: Sam Elegy <aj6xfdou7@yahoo.com>  (randomly generated name and email)
To: ddlewis4@att.net
Subject: you can buy V!@gra  (typographic variations)

Spamlike content in image form; irrelevant legit content doubles as hash buster.]


Defeating Feature Extraction


Misspellings, character set choice, HTML games: mislead extraction of words

Put content in images

Forge headers (to avoid identification, but also interferes with classification)

Innocuous content to mimic distribution in nonspam

Hashbusters (zyArh73Gf) clog dictionaries
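A two-line demonstration of the first trick against a word tokenizer like the one sketched under Feature Extraction (the regex is my assumption):

import re
print(re.findall(r"[a-z']+", "you can buy V!@gra".lower()))
# -> ['you', 'can', 'buy', 'v', 'gra']: the 'viagra' feature the filter learned never fires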


Survival of the Fittest


Filter designers get to see spam


Spammers use spam filters


Unprecedented arms race for a statistical field

Countermeasures mostly target feature extraction, not modeling assumptions



Miscellany

1. Getting legitimate bulk mail past spam filters

2. Other uses of text classification in marketing

3. Frontiers in Bayesian classification


Getting Legit Bulk Email Past Filters

Test email against several filters

Send to accounts on multiple ISPs

Multiple client-based filters if particularly concerned

Coherent content, correctly spelled

Non-tricky headers and markup

Avoid spam keywords where possible

Don’t use spammer tricks


Text Classification in Marketing


Routing incoming email


Responses to promotions


Detect opportunities for selling


(Automated response sometimes possible)


Analysis of text/mixed data on customers


e.g. customer or CSR comments


Content analysis


Focus groups, email, chat, blogs, news,…



Better Bayesian Classification


Discriminative


Logistic regression with informative priors


Sharing strength across related problems


Calibration and confidence of predictions


Generative


Bayesian networks/graphical models


Use of unlabeled and partially labeled data


Hybrid


BBR


Logistic regression w/ informative priors


Gaussian = ridge logistic regression


Laplace = lasso logistic regression


Sparse data structures & fast optimizer


10^4 cases, 10^5 predictors, few seconds!


Accuracy competitive with SVMs


Free for research use


www.stat.rutgers.edu/~madigan/BBR/


Joint work w/ Madigan & Genkin (Rutgers)
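BBR itself is a standalone program; as a rough stand-in (not BBR's interface), penalized logistic regression in scikit-learn shows the same contrast, since an L2 penalty corresponds to a Gaussian prior and L1 to a Laplace prior:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 50))                # toy data; only feature 0 is informative
y = (X[:, 0] > 0.5).astype(int)

ridge = LogisticRegression(penalty="l2").fit(X, y)                      # Gaussian prior
lasso = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)  # Laplace prior
print((ridge.coef_ != 0).sum(), (lasso.coef_ != 0).sum())               # lasso is much sparser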


[Figure: Gaussian vs. Laplace prior densities]
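For reference, the standard densities behind the figure, for a single coefficient \beta:

p_{\text{Gauss}}(\beta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\beta^2}{2\sigma^2}\right)
\qquad
p_{\text{Laplace}}(\beta) = \frac{\lambda}{2}\exp\!\left(-\lambda\,|\beta|\right)

The Laplace density's sharp peak at zero is what drives many fitted coefficients exactly to zero.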


Future of Spam Filtering


More attention to training data selection, personalization


Image processing



Robustness against word variations


More linguistic sophistication


Replacing naive Bayes with better learners



Keep hoping for economic cure


Summary


By volume, spam filtering is easily the biggest application of text classification

Possibly the biggest application of supervised learning


Filters have helped a lot


Naive Bayes is just a starting point


Other interesting applications of Bayesian classification