Classification and Learning


Classification is the process of distributing objects, items, and concepts into classes or categories of the same type.

This is synonymous with grouping, indexing, relegation, taxonomy, etc.

A classifier is the tool that performs the classification of things.


The process of classification is an important tool of data mining. What knowledge can we extract from a given set of data? Given data $D$, can we assert

$D_1 \Rightarrow M_1,\ \ldots,\ D_i \Rightarrow M_i$ ?

How do we learn these? In what other forms could knowledge be extracted from the given data?




Classification: What type? What class? What group? It is a labeling process.

Clustering: partitioning data into similar groupings. A cluster is a grouping of 'similar' items. Clustering is a process that partitions a set of objects into equivalence classes.



Classification is used extensively in

■ marketing
■ healthcare outcomes
■ fraud detection
■ homeland security
■ investment analysis
■ automatic website and image classification


Classification allows prioritization and filtering. It accommodates keyword search.

[Diagram: Data Mining branches into Classification and Clustering.]


A typical example.

Data: a relational database comprising tuples about emails passing through a port.

Each tuple = <sender, recipient, date, size>

Classify each mail as either authentic or junk.


Data preparation before data mining:

► Normal data to be mined is noisy, with many unwanted attributes, etc.

► Discretization of continuous data

► Data normalization into the [-1 .. +1] or [0 .. 1] range

► Data smoothing to reduce noise, removal of outliers, etc.

► Relevance analysis: feature selection to ensure a relevant set of wanted features only

A small sketch of these preparation steps follows the list.
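A minimal sketch of these preparation steps in Python; the column values, bin edges, and outlier threshold are invented for illustration.

```python
import numpy as np

def normalize(x, lo=0.0, hi=1.0):
    """Rescale a numeric column into the [lo .. hi] range."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.full_like(x, lo) if span == 0 else lo + (hi - lo) * (x - x.min()) / span

def discretize(x, bins):
    """Map continuous values to discrete bin indices (e.g. temperature -> lo/med/hi)."""
    return np.digitize(x, bins)

def smooth(x, window=3):
    """Simple moving-average smoothing to reduce noise."""
    return np.convolve(x, np.ones(window) / window, mode="same")

def drop_outliers(x, k=1.5):
    """Drop points more than k standard deviations away from the mean."""
    x = np.asarray(x, dtype=float)
    return x[np.abs(x - x.mean()) <= k * x.std()]

temps = np.array([97.1, 98.6, 103.2, 99.0, 180.0])   # last value looks like an outlier
print(normalize(temps, -1, 1))                       # normalization into [-1 .. +1]
print(discretize(temps, bins=[99.0, 102.0]))         # 0 = lo, 1 = med, 2 = hi
print(smooth(temps))                                 # smoothed series
print(drop_outliers(temps))                          # 180.0 removed
```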


The process of classification:

► Model construction

Each tuple belongs to a predefined class and is given a class label.

The set of all tuples thus offered is called a training set.

The attendant model is expressed as

** classification rules (IF-THEN statements)
** a decision tree
** a mathematical formula


► Model evaluation

Estimate the accuracy on a test set: compare the known label of each test sample with the computed label, and compute the percentage of error. Ensure the test set is disjoint from the training set.

► Implement the model

to classify unseen objects:

** assign a label to a new tuple
** predict the value of an actual attribute

A small end-to-end sketch of construction, evaluation, and use follows.
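Below is a sketch in Python, using an invented labelled email sample and a deliberately simple one-rule model; the field values, the size-threshold grid, and the 70/30 split are illustrative assumptions, not part of the original example.

```python
import random

# Hypothetical labelled tuples <sender, recipient, date, size>; junk mails tend to be large.
def make_example():
    size = random.randint(1, 200_000)
    label = "junk" if size > 120_000 and random.random() < 0.9 else "authentic"
    return ("someone@example.com", "me@example.com", "2013-11-20", size), label

data = [make_example() for _ in range(1000)]
split = int(0.7 * len(data))
train, test = data[:split], data[split:]          # test set kept disjoint from the training set

# Model construction: learn a one-rule model (a size threshold) from the training set.
def build_model(train):
    best_thresh, best_err = None, 1.1
    for thresh in range(10_000, 200_000, 10_000):
        err = sum((t[3] > thresh) != (label == "junk") for t, label in train) / len(train)
        if err < best_err:
            best_thresh, best_err = thresh, err
    return best_thresh

def classify(model, tup):
    return "junk" if tup[3] > model else "authentic"

# Model evaluation: compare known labels with computed labels and compute the error percentage.
model = build_model(train)
errors = sum(classify(model, t) != label for t, label in test)
print(f"threshold: {model}, percentage of error: {100 * errors / len(test):.1f}%")

# Implement the model: label an unseen tuple.
print(classify(model, ("new@sender.com", "me@example.com", "2013-11-21", 150_000)))
```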




Different techniques from statistics, information retrieval, and data mining are used for classification. These include:

■ Bayesian methods
■ Bayesian belief networks
■ Decision trees
■ Neural networks
■ Associative classifiers
■ Emerging patterns
■ Support vector machines

Combinations of these are used as well.

The basic approach to a classification model construction:

Training Data → Classification Algorithm → Classifier model

Rule1: (term = "cough" && term =~ "chest Xray") ⇒ ignore;
Rule2: (temp >= 103 && bp = 180/100) ⇒ malaria || pneumonia;
Rule3: (term = "general pain") && (term = "LBC") ⇒ infection;

If one or several attributes or features $a_i \in A$ occur together in more than one itemset (data sample, target data) assigned the topic $T$, then output a rule

$\mathrm{Rule}:\ a_i \wedge a_j \wedge \ldots \wedge a_m \Rightarrow T$

or

$\mathrm{Rule}:\ P\!\left(\bigwedge_{t \ge 1} a_t \mid X\right) \geq \theta_{\mathrm{thres}} \Rightarrow T$


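One possible reading of the second rule form, sketched on an invented toy collection of itemsets: for each topic, estimate how often a combination of attributes occurs among the itemsets assigned to that topic, and emit a rule when the estimate clears a threshold. The data, topics, and threshold value are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy itemsets (sets of attributes/terms), each assigned a topic T.
itemsets = [
    ({"cough", "chest_xray"}, "respiratory"),
    ({"cough", "fever"}, "respiratory"),
    ({"cough", "chest_xray", "fever"}, "respiratory"),
    ({"general_pain", "LBC"}, "infection"),
    ({"general_pain", "LBC", "fever"}, "infection"),
]

threshold = 0.6   # minimum estimated frequency within the topic to emit a rule

def rules_for(topic, max_size=2):
    """Emit 'a_i ∧ ... ∧ a_m ⇒ topic' for attribute combinations frequent within the topic."""
    in_topic = [items for items, t in itemsets if t == topic]
    counts = Counter()
    for items in in_topic:
        for size in range(1, max_size + 1):
            counts.update(combinations(sorted(items), size))
    for combo, n in counts.items():
        p = n / len(in_topic)                  # estimated P(combination | topic)
        if p >= threshold and len(combo) > 1:
            yield f"Rule: {' ∧ '.join(combo)} ⇒ {topic}   (p = {p:.2f})"

for topic in ("respiratory", "infection"):
    for rule in rules_for(topic):
        print(rule)
```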

Examples: Naïve Bayes Classifiers.

Best for classifying texts, documents, ...

Major drawback: the unrealistic independence assumption among individual items.

Basic issue here: do we accept a document $d$ into class $C$? If we do, what is the penalty for misclassification?

For a good mail classifier, a junk mail should be assigned the "junk" label with a very high probability, while the cost of doing this to a good mail is very high.


The probability that a document $d_i$ belongs to topic $C_j$ is computed by Bayes' rule

$P(C_j \mid d_i) = \dfrac{P(d_i \mid C_j)\, P(C_j)}{P(d_i)}$    ... (1)

Define the prior odds on $C_j$ as

$O(C_j) = \dfrac{P(C_j)}{1 - P(C_j)}$    ... (2)

Then Bayes' equation gives us the posterior odds

$O(C_j \mid d_i) = \dfrac{P(d_i \mid C_j)}{P(d_i \mid \bar{C_j})}\, O(C_j) = O(C_j)\, L(d_i \mid C_j)$    ... (3)

where $L(d_i \mid C_j)$ is the likelihood ratio. This is one way we could use the classifier to yield a posterior estimate of a document.
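A tiny numeric illustration of (2) and (3); the probabilities are made-up values, chosen only to show the arithmetic.

```python
# Made-up numbers purely to illustrate equations (2) and (3).
p_junk = 0.2                                   # P(C_j): prior probability that a mail is junk
prior_odds = p_junk / (1 - p_junk)             # O(C_j) = 0.25

p_doc_given_junk = 0.08                        # P(d_i | C_j)
p_doc_given_not_junk = 0.002                   # P(d_i | ~C_j)
likelihood_ratio = p_doc_given_junk / p_doc_given_not_junk   # L(d_i | C_j) = 40

posterior_odds = prior_odds * likelihood_ratio               # O(C_j | d_i) = 10
posterior_prob = posterior_odds / (1 + posterior_odds)       # back to a probability ~ 0.91
print(prior_odds, likelihood_ratio, posterior_odds, round(posterior_prob, 2))
```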


Another way would be to go back to (1). Here

$P(C_j \mid d_i) = \dfrac{P(d_i \mid C_j)\, P(C_j)}{P(d_i)}$

with

$P(C_j) = \dfrac{n_d(C_j)}{|D|}$    ... (4)

where $|D|$ is the total volume of documents in the database, and $n_d(C_j)$ is the number of documents in class $C_j$. The following, from Julia Itskevitch's work, is outlined¹.


¹ Julia Itskevitch, "Automatic Hierarchical E-mail Classification Using Association Rules", M.Sc. thesis, Computing Science, Simon Fraser University, July ...
www-sal.cs.uiuc.edu/~hanj/pubs/theses.html




Multi-variate Bernoulli model

Assumption: each document is a collection of a set of terms (key-words, etc.) $\{\tau_t\}$. Either a specific term is present or it is absent (we are not interested in its count, or in its position in the document).

In this model,

$P(d_i \mid C_j) = \prod_{t \ge 1}\Big[\beta_{it}\, P(\tau_t \mid C_j) + (1 - \beta_{it})\big(1 - P(\tau_t \mid C_j)\big)\Big]$    ... (5)

where $\beta_{it} = 1$ if $\tau_t \in d_i$, or zero otherwise.


If we use (5), the term probabilities are estimated by

$P(\tau_t \mid C_j) = \dfrac{1 + n_d(C_j, \tau_t)}{2 + n_d(C_j)}$    ... (6)

where $n_d(C_j, \tau_t)$ is the number of documents in class $C_j$ that contain the term $\tau_t$.
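A compact sketch of a multi-variate Bernoulli classifier built from (1), (4), (5), and (6), on an invented toy mail corpus; the vocabulary and labels are assumptions for illustration.

```python
import math
from collections import defaultdict

# Toy training corpus: (set of terms present in the document, class label).  Invented data.
train = [
    ({"viagra", "winner", "free"}, "junk"),
    ({"free", "offer"}, "junk"),
    ({"meeting", "report"}, "authentic"),
    ({"report", "free"}, "authentic"),
]
vocab = set().union(*(terms for terms, _ in train))
classes = {"junk", "authentic"}

n_docs = {c: sum(label == c for _, label in train) for c in classes}   # n_d(C_j)
n_docs_with = {c: defaultdict(int) for c in classes}                   # n_d(C_j, tau_t)
for terms, label in train:
    for t in terms:
        n_docs_with[label][t] += 1

def p_term(t, c):
    # Equation (6): smoothed estimate of P(tau_t | C_j).
    return (1 + n_docs_with[c][t]) / (2 + n_docs[c])

def log_p_doc(doc_terms, c):
    # Equation (5): product over the vocabulary of beta*P + (1-beta)*(1-P), in log space.
    s = 0.0
    for t in vocab:
        p = p_term(t, c)
        s += math.log(p if t in doc_terms else 1 - p)
    return s

def classify(doc_terms):
    # Combine with the prior P(C_j) = n_d(C_j) / |D| as in (1) and (4).
    prior = {c: n_docs[c] / len(train) for c in classes}
    return max(classes, key=lambda c: math.log(prior[c]) + log_p_doc(doc_terms, c))

print(classify({"free", "winner"}))   # expected: junk
```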


An alternative to this would be a term-count model, as outlined below.

In this, for every term we count its frequency of occurrence if it is present. Positional effects are ignored. In that case,

$P(d_i \mid C_j) = \mathrm{const} \times \prod_{t \ge 1} \dfrac{P(\tau_t \mid C_j)^{n_{it}}}{n_{it}!}$    ... (7)

where $n_{it}$ is the number of times term $\tau_t$ occurs in document $d_i$.
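The same idea can be scored under the term-count model. A minimal sketch of (7) in log space, dropping the constant and the $n_{it}!$ terms since they are the same for every class; the per-class term counts are invented.

```python
import math
from collections import Counter

# Term counts per class, estimated from a toy corpus (invented numbers).
class_term_counts = {
    "junk":      Counter({"free": 5, "winner": 3, "offer": 2}),
    "authentic": Counter({"report": 6, "meeting": 3, "free": 1}),
}

def p_term(t, c, vocab_size):
    # Smoothed multinomial estimate of P(tau_t | C_j).
    counts = class_term_counts[c]
    return (1 + counts[t]) / (vocab_size + sum(counts.values()))

def log_p_doc(doc_term_counts, c):
    # Equation (7) in log space; the constant and the n_it! factors are class-independent.
    vocab = set().union(*class_term_counts.values())
    return sum(n * math.log(p_term(t, c, len(vocab))) for t, n in doc_term_counts.items())

doc = Counter({"free": 2, "winner": 1})
print(max(class_term_counts, key=lambda c: log_p_doc(doc, c)))   # expected: junk
```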


Jason Rennie's ifile Naïve Bayesian approach outlines a multinomial model².

Reference: http://cbbrowne.com/info/mail.html#IFILE

Every new item considered will be allowed to change the frequency counts dynamically. The frequent terms are kept; non-frequent terms are abandoned if their

$\mathrm{count} \leq \log_2(\mathrm{age}) - 1$

where age is the total time (space) elapsed since first encounter.
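A sketch of that pruning rule; measuring "age" in items processed since first encounter is a simplifying assumption of this sketch, not a detail taken from ifile itself.

```python
import math

def prune(term_stats, now):
    """Drop terms whose count has not kept up with log2 of their age.

    term_stats: {term: (count, first_seen)}; 'age' is measured in items processed,
    a simplifying assumption for this sketch.
    """
    kept = {}
    for term, (count, first_seen) in term_stats.items():
        age = max(now - first_seen, 1)
        if count > math.log2(age) - 1:        # keep only sufficiently frequent terms
            kept[term] = (count, first_seen)
    return kept

stats = {"free": (12, 0), "xyzzy": (1, 0), "offer": (8, 40)}
print(prune(stats, now=100))    # 'xyzzy' is abandoned: count 1 <= log2(100) - 1 ~ 5.6
```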


ID3 Classifier (Quinlan 1983).

A decision-tree approach. Suppose the problem domain is on a feature space $\{a_i\}$. Should we use every feature to discriminate? Isn't there a possibility that some feature is more important (revealing) than others and therefore should be used more heavily?

e.g. the TB or ~TB case.

Training space: three features (Coughing, Temp, and Chest-pain). Possible values over a vector of features:

Coughing (yes, no)
Temp (hi, med, lo)
Chest-pain (yes, no)








Case  Description
1.    (yes, hi, yes, T)
2.    (no, hi, yes, T)
3.    (yes, lo, yes, ~T)
4.    (yes, med, yes, T)
5.    (no, lo, no, ~T)
6.    (yes, med, no, ~T)
7.    (no, hi, yes, T)
8.    (yes, lo, no, ~T)

Consider the feature "Coughing". Just on this feature, the training set splits into two groups: a "yes" group and a "no" group. The decision tree on this feature appears as:



Coughing = yes: cases 1 (yes, hi, yes, T), 3 (yes, lo, yes, ~T), 4 (yes, med, yes, T), 6 (yes, med, no, ~T), 8 (yes, lo, no, ~T)
Coughing = no: cases 2 (no, hi, yes, T), 5 (no, lo, no, ~T), 7 (no, hi, yes, T)

Entropy of the overall system before further discrimination based on the "Coughing" feature:

$T = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1 \text{ bit}$

Entropy of the (yes | coughing) branch, with 2 positive and 3 negative cases, is:

$T_{yes|coughing} = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.9710 \text{ bit}$

Similarly, the entropy of the (no | coughing) branch, with 2 positives and 1 negative, gives us:

$T_{no|coughing} = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.9183 \text{ bit}$

Entropy of the combined branches (weighted by their probabilities):

$T_{coughing} = \frac{5}{8}\,T_{yes|coughing} + \frac{3}{8}\,T_{no|coughing} = 0.951 \text{ bit}$

Therefore, the information gained by testing the "Coughing" feature is $T - T_{coughing} = 0.049$ bit.


ID3 yields a strategy of testing attributes in succession to discriminate on the feature space. What combination of features one should take, and in what order, to determine class membership is resolved here. A short sketch of the entropy computation follows.
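A small sketch that reproduces the numbers above for the eight training cases and compares the information gain of the features.

```python
import math
from collections import Counter

# The eight training cases: (coughing, temp, chest_pain, class)
cases = [
    ("yes", "hi",  "yes", "T"),  ("no",  "hi",  "yes", "T"),
    ("yes", "lo",  "yes", "~T"), ("yes", "med", "yes", "T"),
    ("no",  "lo",  "no",  "~T"), ("yes", "med", "no",  "~T"),
    ("no",  "hi",  "yes", "T"),  ("yes", "lo",  "no",  "~T"),
]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(cases, feature_index):
    total_entropy = entropy([c[-1] for c in cases])
    remainder = 0.0
    for value in {c[feature_index] for c in cases}:
        subset = [c[-1] for c in cases if c[feature_index] == value]
        remainder += len(subset) / len(cases) * entropy(subset)
    return total_entropy - remainder

print(entropy([c[-1] for c in cases]))   # 1.0 bit
print(information_gain(cases, 0))        # ~0.049 bit for "Coughing"
print(information_gain(cases, 1))        # gain for "Temp" (ID3 would test the largest first)
```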


Learning discriminants: the Generalized Perceptron.

Widrow-Hoff Algorithm (1960).

Adaline (an Adaptive Linear neuron that learns via the Widrow-Hoff Algorithm).

Given an object with components $x_i$ on a feature space. A neuron is a unit that receives the object components and processes them as follows.

1. Each neuron in a NN has a set of links to receive weighted input. Each link $i$ receives its input $x_i$, weighs it by $\omega_i$, and then sends it out to the adder to get summed:

$u = \sum_j \omega_j x_j$

This is what the adder produces.


2. The adder's output is passed through an activation function, $y = \varphi(u + b)$; if the sum exceeds the threshold, the neuron fires its output.

The choice of $\varphi(\cdot)$ determines the neuron model. The common choices are sketched below.

Step function: $\varphi(v) = 1$, if $v \geq b$; 0, otherwise.

Sigmoid function: $\varphi(v) = \dfrac{1}{1 + \exp(-v)}$

Gaussian function: $\varphi(v) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2}\left(\frac{v}{\sigma}\right)^2\right)$

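A minimal sketch of the adder and the three activation functions; the input components, weights, and bias value are invented.

```python
import numpy as np

def adder(x, w):
    """u = sum_j w_j * x_j -- the weighted sum the adder produces."""
    return float(np.dot(w, x))

# Three possible activation functions phi(.), applied here to v = u + b.
def step(v):
    return 1.0 if v >= 0 else 0.0

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gaussian(v, sigma=1.0):
    return np.exp(-0.5 * (v / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.array([0.5, -1.0, 2.0])   # object components x_i (invented)
w = np.array([0.8,  0.3, 0.1])   # link weights w_i (invented)
b = -0.2                         # bias (negative threshold)
v = adder(x, w) + b
print(step(v), sigmoid(v), gaussian(v))
```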
We first consider a single-neuron system. This could be generalized to a more complex system.

The single-neuron Perceptron is a single-neuron system that has a step-function activation unit

$\varphi(v) = +1$ if $v \geq 0$; $-1$, otherwise.

This is used for binary classification. Given training vectors and two classes $C_1$ and $C_2$: if the output is $\varphi(v) = +1$, assign class $C_1$ to the vector; otherwise, class $C_2$.

To train the system is equivalent to adjusting the weights associated with its links.

How do we adjust the weights?


1. k = 1

2. Get $\omega_k$. Initial weights could be randomly chosen in (0, 1).

3. while (there are misclassified training examples)
       $\omega_{k+1} = \omega_k + \eta\,\delta\,x$;
   where $\eta$ is the learning-rate parameter and $\delta$ is the error on the misclassified example $x$ (the difference between the target output and the computed output).

Since the correction to the weights can be expressed as $\eta\,\delta\,x$, the rule is known as the delta-rule.

A perceptron can only model linearly separable functions like AND, OR, and NOT, but it cannot model XOR. A minimal training sketch on the AND function follows.
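A minimal training sketch of the delta-rule perceptron on the AND function (the learning rate, seed, and epoch limit are arbitrary choices).

```python
import numpy as np

# Perceptron with step activation in {-1, +1}, trained by the delta-rule on AND.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, +1], dtype=float)           # AND targets in {-1, +1}

rng = np.random.default_rng(0)
w = rng.uniform(0, 1, size=2)                          # initial weights in (0, 1)
b = rng.uniform(0, 1)
eta = 0.1                                              # learning-rate parameter

def phi(v):
    return 1.0 if v >= 0 else -1.0

for epoch in range(100):
    errors = 0
    for x, target in zip(X, t):
        y = phi(np.dot(w, x) + b)
        if y != target:                                # misclassified training example
            delta = target - y                         # error term delta
            w += eta * delta * x                       # delta-rule: w_{k+1} = w_k + eta*delta*x
            b += eta * delta
            errors += 1
    if errors == 0:
        break

print(w, b, [phi(np.dot(w, x) + b) for x in X])        # reproduces AND; XOR would never converge
```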


Generalized Δ-rule for the semilinear feedforward net with backpropagation of error.





The net input to a node in layer j is






i
ji
j
o
w
net

… (1)


The output of a node j is




0
/
)
(
1
1
)
(


j
j
net
j
j
e
net
f
o






… (2)


This is the no
nlinear sigmoid activation function;
this tells us how the hidden layer nodes would fire
if they at all fire.


The input to the nodes of layer k (here the output layer) is

$net_k = \sum_j w_{kj}\, o_j$    … (3)

[Figure: an input pattern feeds the input layer (nodes i), connected to the hidden layer (nodes j) through weights $w_{ji}$, which in turn is connected to the output layer (nodes k) through weights $w_{kj}$.]

and the output of layer k is

$o_k = f_k(net_k)$    … (4)


In the learning phase a number of training samples are sequentially introduced.

Let $x_p = \{x_{pi}\}$ be one such input object. The net, seeing this, adjusts the weights on its links. The output pattern $\{o_{pk}\}$ might be different from the ideal (target) output pattern $\{t_{pk}\}$. The network link-weight adjustment strategy is to

"adjust the link weights so that the net square of the error

$E_p = \frac{1}{2}\sum_k (t_{pk} - o_{pk})^2$

is minimized."


For convenience, we omit the subscript p and ask for what changes in the weights

$E = \frac{1}{2}\sum_k (t_k - o_k)^2$    … (5)

is minimized.

We attempt to do so by gradient descent to the minimum. That is,

$\Delta w_{kj} = -\,\mathrm{const}\,\dfrac{\partial E}{\partial w_{kj}} \propto -\dfrac{\partial E}{\partial w_{kj}}$    … (6)


Now

$\dfrac{\partial E}{\partial w_{kj}} = \dfrac{\partial E}{\partial net_k}\,\dfrac{\partial net_k}{\partial w_{kj}}$

and

$\dfrac{\partial net_k}{\partial w_{kj}} = \dfrac{\partial}{\partial w_{kj}}\sum_j w_{kj}\, o_j = o_j$    … (7)


Let us compute

$\dfrac{\partial E}{\partial net_k} = \dfrac{\partial E}{\partial o_k}\,\dfrac{\partial o_k}{\partial net_k}$

According to (5),

$\dfrac{\partial E}{\partial o_k} = -(t_k - o_k)$    … (8)

and

$\dfrac{\partial o_k}{\partial net_k} = f'_k(net_k)$    … (9)


So that eqn (6) can now be expressed as

$\Delta w_{kj} = \eta\,(t_k - o_k)\, f'_k(net_k)\, o_j = \eta\,\delta_k\, o_j$    … (10a)


Similarly, the weight adjustment on the inside (hidden-layer) links is

$\Delta w_{ji} = -\eta\,\dfrac{\partial E}{\partial w_{ji}} = -\eta\,\dfrac{\partial E}{\partial net_j}\,\dfrac{\partial net_j}{\partial w_{ji}} = \eta\,\delta_j\, o_i$

with

$\delta_j = -\dfrac{\partial E}{\partial net_j} = -\dfrac{\partial E}{\partial o_j}\,\dfrac{\partial o_j}{\partial net_j} = -\dfrac{\partial E}{\partial o_j}\, f'_j(net_j)$


But $\dfrac{\partial E}{\partial o_j}$ cannot be computed directly. Instead, we express it in terms of the known quantities:

$\dfrac{\partial E}{\partial o_j} = \sum_k \dfrac{\partial E}{\partial net_k}\,\dfrac{\partial net_k}{\partial o_j} = \sum_k \dfrac{\partial E}{\partial net_k}\,\dfrac{\partial}{\partial o_j}\sum_m w_{km}\, o_m = \sum_k \dfrac{\partial E}{\partial net_k}\, w_{kj} = -\sum_k \delta_k\, w_{kj}$

Thus, it implies

$\delta_j = f'_j(net_j)\sum_k \delta_k\, w_{kj}$

In other words, the deltas in the hidden nodes can be evaluated from the deltas at the output layer.


Note that, given

$o_j = f_j(net_j) = \dfrac{1}{1 + e^{-(net_j + \theta_j)/\theta_0}}$

we have

$\dfrac{\partial o_j}{\partial net_j} = o_j(1 - o_j)$

This results in the following delta-rules in the upper (output) and the hidden layers, respectively:

$\delta_{pk} = (t_{pk} - o_{pk})\, o_{pk}(1 - o_{pk})$

and

$\delta_{pj} = o_{pj}(1 - o_{pj})\sum_k \delta_{pk}\, w_{kj}$
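A compact sketch that puts the forward pass (1)-(4) and the two delta-rules together for a single training pattern; the layer sizes, the pattern, and the learning rate are invented, and the thresholds $\theta_j$ and the scale $\theta_0$ are omitted for brevity.

```python
import numpy as np

# Gradient-descent steps of the generalized delta-rule for a one-hidden-layer net
# with sigmoid units, following equations (1)-(4) and the two delta-rules above.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2
w_ji = rng.normal(scale=0.5, size=(n_hidden, n_in))    # input -> hidden weights
w_kj = rng.normal(scale=0.5, size=(n_out, n_hidden))   # hidden -> output weights
eta = 0.5

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = np.array([0.2, 0.7, -0.1])      # input pattern {x_pi}
t = np.array([1.0, 0.0])            # target output pattern {t_pk}

for step in range(200):
    # Forward pass: equations (1)-(4)
    o_j = sigmoid(w_ji @ x)                              # hidden-layer outputs
    o_k = sigmoid(w_kj @ o_j)                            # output-layer outputs

    # Backward pass: the two delta-rules
    delta_k = (t - o_k) * o_k * (1 - o_k)                # output-layer deltas
    delta_j = o_j * (1 - o_j) * (w_kj.T @ delta_k)       # hidden-layer deltas

    # Weight updates: Delta w = eta * delta * o, as in (10a) and its hidden analogue
    w_kj += eta * np.outer(delta_k, o_j)
    w_ji += eta * np.outer(delta_j, x)

print(np.round(sigmoid(w_kj @ sigmoid(w_ji @ x)), 3))    # approaches the target [1, 0]
```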