Classification and Learning

Classification is the process of distributing objects, items, or concepts into classes or categories of the same type. It is closely related to grouping, indexing, relegation, taxonomy, etc. A classifier is the tool that carries out the classification of things. The process of classification is one of the important tools of data mining.
What knowledge can we extract from a given set of data? Given data D, can we assert rules of the form

    D_1 -> M_1, ..., D_i -> M_i ?

How do we learn these? In what other forms could knowledge be extracted from the given data?
Classification: What type? What class? What group? A labeling process.

Clustering: partitioning data into similar groupings. A cluster is a grouping of 'similar' items! Clustering is a process that partitions a set of objects into equivalence classes.
Classification is used extensively in
■ marketing
■ healthcare outcomes
■ fraud detection
■ homeland security
■ investment analysis
■ automatic website and image classification
Classification allows prioritization and filtering. It accommodates keyword search.
[Diagram: Data Mining branches into Classification and Clustering.]
A typical example. Data: a relational database comprising tuples about emails passing through a port. Each tuple = <sender, recipient, date, size>. Classify each mail as either authentic or junk.
Data preparation before data mining:
► Normal data to be mined is noisy, with many unwanted attributes, etc.
► Discretization of continuous data
► Data normalization to the [-1 .. +1] or [0 .. 1] range
► Data smoothing to reduce noise, removal of outliers, etc.
► Relevance analysis: feature selection to ensure a relevant set of wanted features only
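Two of these preparation steps, min-max normalization and equal-width discretization, can be sketched as follows. The function names and the sample data are invented for illustration:

```python
# Sketch of two preparation steps: min-max normalization to [0, 1]
# and equal-width discretization of a continuous attribute.

def min_max_normalize(values):
    """Rescale a list of numbers into the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, n_bins):
    """Assign each value an equal-width bin index in 0 .. n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

sizes = [120, 4500, 87, 950, 2300]    # e.g. email sizes in KB (invented)
print(min_max_normalize(sizes))
print(discretize(sizes, 3))           # [0, 2, 0, 0, 1]
```

Normalization to [-1 .. +1] is the same idea with an extra shift and scale.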
Process of classification:
► Model construction
Each tuple belongs to a predefined class and is given a class label. The set of all tuples thus offered is called a training set. The attendant model is expressed as
** classification rules (IF-THEN statements)
** a decision tree
** a mathematical formula
► Model evaluation
Estimate the accuracy on a test set: compare the known label of each test sample with the computed label and compute the percentage of error. Ensure the test set is independent of the training set.
► Implement the model to classify unseen objects:
** assign a label to a new tuple
** predict the value of an actual attribute
Different techniques from statistics, information retrieval, and data mining are used for classification. These include:
■ Bayesian methods
■ Bayesian belief networks
■ Decision trees
■ Neural networks
■ Associative classifiers
■ Emerging patterns
■ Support vector machines
Combinations of these are used as well.
The basic approach to constructing a classification model:

    Training Data -> Classification Algorithm -> Classifier model

Rule1: (term = "cough") && (term =~ "chest Xray") -> ignore;
Rule2: (temp >= 103) && (bp = 180/100) -> malaria or pneumonia;
Rule3: (term = "general pain") && (term = "LBC") -> infection
If one or several attributes or features a_i in A occur together in more than one itemset (data sample, target data) assigned the topic T, then output a rule

    Rule: a_i ∧ a_j ∧ ... ∧ a_m -> T

or, when the joint occurrence is only probable,

    Rule: P(a_1 ∧ ... ∧ a_t | X) >= thres -> T
Examples. Naïve Bayes Classifiers.
Best for classifying texts, documents, etc. Major drawback: the unrealistic independence assumption among individual items.
The basic issue here: do we accept a document d in class C? If we do, what is the penalty for misclassification? For a good mail classifier, a junk mail should be assigned the "junk" label with a very high probability; the cost of doing this to a good mail is very high.
The probability that a document d_i belongs to topic C_j is computed by Bayes' rule:

    P(C_j | d_i) = P(d_i | C_j) P(C_j) / P(d_i)    ... (1)
Define the prior odds on C_j as

    O(C_j) = P(C_j) / (1 - P(C_j))    ... (2)
Then Bayes' equation gives us the posterior odds

    O(C_j | d_i) = O(C_j) P(d_i | C_j) / P(d_i | ~C_j) = O(C_j) L(d_i | C_j)    ... (3)

where L(d_i | C_j) is the likelihood ratio. This is one way we could use the classifier to yield a posterior estimate for a document.
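A small numeric sketch of eqns (2)-(3); the probabilities below are invented for illustration:

```python
# Posterior odds = prior odds x likelihood ratio, per eqns (2)-(3).

def prior_odds(p_c):
    """Eqn (2): O(C) = P(C) / (1 - P(C))."""
    return p_c / (1.0 - p_c)

def posterior_odds(p_c, p_d_given_c, p_d_given_not_c):
    """Eqn (3): O(C|d) = O(C) * L(d|C)."""
    likelihood_ratio = p_d_given_c / p_d_given_not_c
    return prior_odds(p_c) * likelihood_ratio

# Suppose 20% of mail is junk, and a given document is six times
# as likely under "junk" as under "not junk":
odds = posterior_odds(0.20, 0.30, 0.05)   # 0.25 * 6 = 1.5
print(odds)
```

Odds above 1 favour membership in the class, so this document would be labeled junk.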
Another way would be to go back to (1):

    P(C_j | d_i) = P(d_i | C_j) P(C_j) / P(d_i)

with

    P(C_j) = n_d(C_j) / |D|    ... (4)

where |D| is the total volume of documents in the database and n_d(C_j) is the number of documents in class C_j.
The following is outlined from Julia Itskevitch's work [1].

[1] Julia Itskevitch, "Automatic Hierarchical E-mail Classification Using Association Rules", M.Sc. thesis, Computing Science, Simon Fraser University, July ...
www-sal.cs.uiuc.edu/~hanj/pubs/theses.html
Multivariate Bernoulli model
Assumption: each document is a collection of a set of terms (keywords, etc.) t. Either a specific term is present or it is absent (we are not interested in its count, or in its position in the document).
In this model,

    P(d_i | C_j) = Π_t [ δ_it P(t | C_j) + (1 - δ_it)(1 - P(t | C_j)) ]    ... (5)

where δ_it = 1 if t ∈ d_i, and zero otherwise.
If we use (5),

    P(t | C_j) = (1 + n_d(C_j, t)) / (2 + n_d(C_j))    ... (6)

where n_d(C_j, t) is the number of documents in class C_j containing term t.
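A minimal sketch of this model, estimating P(C_j) per eqn (4), P(t | C_j) per eqn (6), and scoring documents per eqn (5). The tiny training corpus and all names are invented:

```python
# Minimal multivariate Bernoulli Naive Bayes, following eqns (4)-(6).

from collections import Counter

def train(docs, labels):
    """docs: list of term sets; labels: the class of each doc.
    Returns priors P(C) (eqn 4) and smoothed P(t|C) (eqn 6)."""
    n_docs = Counter(labels)                       # n_d(C_j)
    vocab = set().union(*docs)
    term_counts = {c: Counter() for c in n_docs}   # n_d(C_j, t)
    for terms, c in zip(docs, labels):
        term_counts[c].update(terms)
    prior = {c: n_docs[c] / len(docs) for c in n_docs}
    p_term = {c: {t: (1 + term_counts[c][t]) / (2 + n_docs[c])
                  for t in vocab} for c in n_docs}
    return prior, p_term

def score(doc, c, prior, p_term):
    """P(d|C) P(C), with P(d|C) as the product in eqn (5)."""
    p = prior[c]
    for t, p_t in p_term[c].items():
        p *= p_t if t in doc else (1.0 - p_t)
    return p

docs = [{"viagra", "offer"}, {"meeting", "report"},
        {"offer", "win"}, {"report", "budget"}]
labels = ["junk", "authentic", "junk", "authentic"]
prior, p_term = train(docs, labels)
new_doc = {"offer", "win"}
best = max(prior, key=lambda c: score(new_doc, c, prior, p_term))
print(best)                                        # junk
```

The "+1" and "+2" in eqn (6) are the Laplace correction that keeps unseen terms from zeroing out the whole product.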
An alternative to this would be a term-count model, as outlined below. In this, for every term we count its frequency of occurrence if it is present; positional effects are ignored. In that case,

    P(d_i | C_j) = const × Π_t P(t | C_j)^(n_it) / n_it!    ... (7)

where n_it is the count of term t in document d_i.
Jason Rennie's ifile Naïve Bayesian approach outlines a multinomial model [2].

[2] Reference: http://cbbrowne.com/info/mail.html#IFILE

Every new item considered is allowed to change the frequency count dynamically. The frequent terms are kept; non-frequent terms are abandoned if their

    count < log_2(age) - 1

where age is the total time (space) elapsed since the first encounter.
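Assuming the abandon test reads count < log_2(age) - 1, the pruning decision is a one-liner (names invented):

```python
import math

def keep_term(count, age):
    """ifile-style pruning: abandon a term once its count drops
    below log2(age) - 1 (assumed reading of the condition)."""
    return count >= math.log2(age) - 1

print(keep_term(1, 4))    # log2(4) - 1 = 1, so a count of 1 survives: True
print(keep_term(1, 16))   # log2(16) - 1 = 3, so a count of 1 is dropped: False
```

The logarithmic bar rises slowly, so a term only has to recur occasionally to stay in the vocabulary.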
ID3 Classifier (Quinlan 1983).
A decision-tree approach. Suppose the problem domain is on a feature space a_i. Should we use every feature to discriminate? Isn't there a possibility that some feature is more important (revealing) than others and therefore should be used more heavily?
e.g. the TB or ~TB case.
Training space, with three features (Coughing, Temp, Chest-pain). Possible values over a vector of features:
Coughing (yes, no)
Temp (hi, med, lo)
Chest-pain (yes, no)

Case  Description
1.  (yes, hi, yes, T)
2.  (no, hi, yes, T)
3.  (yes, lo, yes, ~T)
4.  (yes, med, yes, T)
5.  (no, lo, no, ~T)
6.  (yes, med, no, ~T)
7.  (no, hi, yes, T)
8.  (yes, lo, no, ~T)
Consider the feature "Coughing". Just on this feature, the training set splits into two groups: a "yes" group and a "no" group. The decision tree on this feature appears as:

Coughing = yes: 1. (yes, hi, yes, T), 3. (yes, lo, yes, ~T), 4. (yes, med, yes, T), 6. (yes, med, no, ~T), 8. (yes, lo, no, ~T)
Coughing = no:  2. (no, hi, yes, T), 5. (no, lo, no, ~T), 7. (no, hi, yes, T)

Entropy of the overall system (4 positive, 4 negative cases) before further discrimination based on the "coughing" feature:

    T = -(4/8) log_2(4/8) - (4/8) log_2(4/8) = 1 bit
Entropy of the (yes-coughing) branch with 2 positive and 3 negative cases:

    T_yes-coughing = -(2/5) log_2(2/5) - (3/5) log_2(3/5) = 0.9710 bit

Similarly, the entropy of the (no-coughing) branch with 2 positives and 1 negative gives us:

    T_no-coughing = -(2/3) log_2(2/3) - (1/3) log_2(1/3) = 0.9183 bit
Entropy of the combined branches (weighted by their probabilities):

    T_coughing = (5/8) T_yes-coughing + (3/8) T_no-coughing = 0.951 bit

Therefore, the information gained by testing the "coughing" feature is

    T - T_coughing = 1 - 0.951 = 0.049 bit
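The entropy figures for this split can be recomputed with a short script (function names invented):

```python
# Recompute the entropies and information gain for the "coughing" split.

import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg class split, in bits."""
    total = pos + neg
    h = 0.0
    for n in (pos, neg):
        if n:                     # 0 log 0 is taken as 0
            p = n / total
            h -= p * math.log2(p)
    return h

h_all = entropy(4, 4)             # whole training set: 4 T, 4 ~T
h_yes = entropy(2, 3)             # yes-coughing branch
h_no = entropy(2, 1)              # no-coughing branch
h_split = (5 / 8) * h_yes + (3 / 8) * h_no
gain = h_all - h_split
print(round(h_yes, 4), round(h_no, 4), round(h_split, 3), round(gain, 3))
# 0.971 0.9183 0.951 0.049
```

ID3 would compute this gain for every feature and split on the one with the largest gain.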
ID3 yields a strategy of testing attributes in succession to discriminate on the feature space. What combination of features one should take, and in what order, to determine a class membership is resolved here.
Learning discriminants: the Generalized Perceptron.
Widrow-Hoff Algorithm (1960).
Adaline (Adaptive Linear neuron that learns via the Widrow-Hoff Algorithm).
Given an object with components x_i on a feature space i, a neuron is a unit that receives the object components and processes them as follows.
1. Each neuron in a NN has a set of links to receive weighted input. Each link i receives its input x_i, weighs it by w_i, and then sends it out to the adder to get summed:

    u = Σ_j w_j x_j

This is what the adder produces.
2. The sum, offset by a bias b, is passed through an activation function, and the neuron fires its output

    y = φ(u + b)

The choice of φ(.) determines the neuron model.
Step function:

    φ(v) = 1 if v ≥ b, 0 otherwise
Sigmoid function:

    φ(v) = 1 / (1 + exp(-v))

Gaussian function:

    φ(v) = (1/√(2π)) exp(-v²/2)
We first consider a single-neuron system; this could be generalized to a more complex system. The single-neuron Perceptron is a single-neuron system that has a step-function activation unit

    φ(v) = 1 if v ≥ 0, -1 otherwise

This is used for binary classification. Given training vectors and two classes C_1 and C_2: if the output φ(v) = 1, assign class C_1 to the vector; otherwise assign class C_2.
To train the system is equivalent to adjusting the weights associated with its links. How do we adjust the weights?
1. k = 1
2. Get w_k. Initial weights could be randomly chosen in (0, 1).
3. While (there are misclassified training examples):

    w_{k+1} = w_k + η δ x

where η is the learning-rate parameter and δ is the error (the difference between the desired and the actual output) on the misclassified example x.
Since the correction to the weights could be expressed as Δw = η δ x, the rule is known as the delta rule.
A perceptron can only model linearly separable functions like AND, OR, and NOT. It cannot model XOR.
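The training loop above can be sketched on the linearly separable AND function. All names are invented; integer inputs and η keep the arithmetic exact:

```python
# Single-neuron perceptron with a step activation, trained by the
# delta-style update w <- w + eta * (t - y) * x.

def step(v):
    return 1 if v >= 0 else -1

def train_perceptron(samples, eta=1, epochs=100):
    """samples: list of (inputs, target) with targets in {-1, +1}.
    A constant bias input of 1 is appended to every vector."""
    w = [0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        errors = 0
        for x, t in samples:
            x = list(x) + [1]                     # bias input
            y = step(sum(wi * xi for wi, xi in zip(w, x)))
            if y != t:                            # misclassified: apply delta rule
                errors += 1
                w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
        if errors == 0:                           # converged
            break
    return w

and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_samples)
predictions = [step(sum(wi * xi for wi, xi in zip(w, list(x) + [1])))
               for x, _ in and_samples]
print(predictions)        # [-1, -1, -1, 1]
```

Running the same loop on XOR targets never reaches zero errors, illustrating the limitation just stated.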
Generalized δ-rule for the semilinear feedforward net with backpropagation of error.
The net input to a node in layer j is

    net_j = Σ_i w_ji o_i    ... (1)
The output of a node j is

    o_j = f_j(net_j) = 1 / (1 + e^(-(net_j + θ_j)/θ_0))    ... (2)
This is the nonlinear sigmoid activation function; it tells us how the hidden-layer nodes would fire, if they fire at all.
The input to the nodes of layer k (here the output layer) is

    net_k = Σ_j w_kj o_j    ... (3)
[Figure: a three-layer net. The input pattern feeds the input layer (nodes i); weights w_ji connect input to hidden layer (nodes j); weights w_kj connect hidden to output layer (nodes k).]
and the output of layer k is

    o_k = f_k(net_k)    ... (4)
In the learning phase a number of training samples are sequentially introduced. Let {x_pi} be one such input object. The net, seeing this, adjusts the weights on its links. The output pattern {o_pk} might be different from the ideal (target) output pattern {t_pk}. The network link-weight adjustment strategy is to "adjust the link weights so that the net square of the error

    E_p = (1/2) Σ_k (t_pk - o_pk)²

is minimized."
For convenience, we omit the subscript p and ask for what changes in the weights

    E = (1/2) Σ_k (t_k - o_k)²    ... (5)

is minimized.
We attempt to do so by gradient descent to the minimum. That is,

    Δw_kj ∝ -∂E/∂w_kj,  i.e.  Δw_kj = -η ∂E/∂w_kj    ... (6)
Now

    ∂E/∂w_kj = (∂E/∂net_k)(∂net_k/∂w_kj)

and

    ∂net_k/∂w_kj = ∂(Σ_j w_kj o_j)/∂w_kj = o_j    ... (7)
Let us compute

    ∂E/∂net_k = (∂E/∂o_k)(∂o_k/∂net_k)
According to (5),

    ∂E/∂o_k = -(t_k - o_k)    ... (8)
And

    ∂o_k/∂net_k = f_k'(net_k)    ... (9)
So that eqn (6) can now be expressed as

    Δw_kj = η (t_k - o_k) f_k'(net_k) o_j = η δ_k o_j    ... (10a)

where δ_k = (t_k - o_k) f_k'(net_k) = -∂E/∂net_k.
Similarly, the weight adjustment for the hidden layer is

    Δw_ji = -η ∂E/∂w_ji = -η (∂E/∂net_j)(∂net_j/∂w_ji) = η δ_j o_i

with

    δ_j = -∂E/∂net_j = -(∂E/∂o_j)(∂o_j/∂net_j) = -(∂E/∂o_j) f_j'(net_j)
But ∂E/∂o_j cannot be computed directly. Instead, we express it in terms of the known output-layer quantities:

    ∂E/∂o_j = Σ_k (∂E/∂net_k)(∂net_k/∂o_j) = Σ_k (∂E/∂net_k) ∂(Σ_m w_km o_m)/∂o_j = Σ_k (∂E/∂net_k) w_kj = -Σ_k δ_k w_kj
Thus, it implies

    δ_j = f_j'(net_j) Σ_k δ_k w_kj

In other words, the deltas in the hidden nodes can be evaluated from the deltas at the output layer.
Note that, given

    o_j = f_j(net_j) = 1 / (1 + e^(-(net_j + θ_j)/θ_0))

we have

    ∂o_j/∂net_j = o_j (1 - o_j)
This results in the following delta rules for the upper and the hidden layer, respectively:

    δ_pk = (t_pk - o_pk) o_pk (1 - o_pk)

and

    δ_pj = o_pj (1 - o_pj) Σ_k δ_pk w_kj
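These delta rules can be sketched for a tiny 2-input, 2-hidden, 1-output net and checked against a numerical gradient of E = (1/2)(t - o)². The weights and training pattern are arbitrary illustrative values, and θ_j = 0, θ_0 = 1 is assumed in the sigmoid:

```python
# Backprop deltas for a 2-2-1 net, verified by central differences.

import math

def sigmoid(v):
    # eqn (2) with theta_j = 0 and theta_0 = 1 (an assumption)
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    """w_hidden[j][i] links input i to hidden node j;
    w_out[j] links hidden node j to the single output node."""
    o_hidden = [sigmoid(sum(wji * xi for wji, xi in zip(row, x)))
                for row in w_hidden]
    o_out = sigmoid(sum(wj * oj for wj, oj in zip(w_out, o_hidden)))
    return o_hidden, o_out

def backprop_grads(x, t, w_hidden, w_out):
    """delta_k = (t - o) o (1 - o); delta_j = o_j (1 - o_j) delta_k w_kj.
    Returns analytic gradients dE/dw = -delta * (upstream output)."""
    o_hidden, o = forward(x, w_hidden, w_out)
    delta_k = (t - o) * o * (1.0 - o)
    delta_j = [oj * (1.0 - oj) * delta_k * wkj
               for oj, wkj in zip(o_hidden, w_out)]
    grad_out = [-delta_k * oj for oj in o_hidden]             # dE/dw_kj
    grad_hidden = [[-dj * xi for xi in x] for dj in delta_j]  # dE/dw_ji
    return grad_hidden, grad_out

x, t = [1.0, 0.5], 0.9
w_hidden = [[0.2, -0.4], [0.7, 0.1]]
w_out = [0.5, -0.3]
grad_hidden, grad_out = backprop_grads(x, t, w_hidden, w_out)

# central-difference check of one weight in each layer
eps = 1e-6
def E(wh, wo):
    _, o = forward(x, wh, wo)
    return 0.5 * (t - o) ** 2

num_out = (E(w_hidden, [w_out[0] + eps, w_out[1]])
           - E(w_hidden, [w_out[0] - eps, w_out[1]])) / (2 * eps)
num_hid = (E([[w_hidden[0][0] + eps, w_hidden[0][1]], w_hidden[1]], w_out)
           - E([[w_hidden[0][0] - eps, w_hidden[0][1]], w_hidden[1]], w_out)) / (2 * eps)

print(abs(num_out - grad_out[0]) < 1e-6,
      abs(num_hid - grad_hidden[0][0]) < 1e-6)   # True True
```

A training step would then move each weight by Δw = η δ (upstream output), exactly as in eqn (10a) and its hidden-layer analogue.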