
Clustering

SDSC Summer Institute 2012

Natasha Balac, Ph.D.

© Copyright 2012, Natasha Balac

CLUSTERING

Basic idea: group similar things together
Unsupervised learning
Useful when no other information is available
K-means: partitioning instances into k disjoint clusters based on a measure of similarity

Clustering

Partition unlabeled examples into disjoint subsets (clusters) such that:
  Examples within a cluster are very similar
  Examples in different clusters are very different
Discover new categories in an unsupervised manner (no sample category labels provided).

CLUSTERING

[Figure: scatter of example data points to be grouped into clusters]

Clustering Techniques

K-means clustering
Hierarchical clustering
Conceptual clustering
Probability-based clustering
Bayesian clustering

Common uses of Clustering

Often used as an exploratory data analysis tool
In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
Also used for choosing color palettes on old-fashioned graphical display devices
Color image segmentation

Clustering

Unsupervised: no target value to be predicted
Different ways clustering results can be produced/represented/learned:
  Exclusive vs. overlapping
  Deterministic vs. probabilistic
  Hierarchical vs. flat
  Incremental vs. batch learning

Hierarchical Clustering

Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples.
Recursive application of a standard clustering algorithm can produce a hierarchical clustering.

Example taxonomy:
  animal
    vertebrate: fish, reptile, amphibian, mammal
    invertebrate: worm, insect, crustacean
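As an illustration (not part of the original slides), here is a minimal sketch of building such a hierarchy with SciPy's agglomerative clustering; the toy data and the choice of Ward linkage are assumptions.

# Minimal sketch: agglomerative hierarchical clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
# Two loose groups of 2-D points (assumed toy data).
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (10, 2))])

Z = linkage(X, method="ward")                    # bottom-up merge tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
# dendrogram(Z) draws the taxonomy if a matplotlib backend is available.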


The k-means algorithm

Iterative distance-based clustering
Clusters the data into k groups, where k is specified in advance:
1. Cluster centers are chosen at random
2. Instances are assigned to clusters based on their distance to the cluster centers
3. Centroids of the clusters are computed ("means")
4. Repeat from step 2, using the centroids as the new cluster centers, until convergence

K-Means Clustering Pros & Cons

Simple and reasonably effective
The final cluster centers do not represent a global minimum but only a local one
Results can vary significantly based on the initial choice of seeds
Completely different final clusters can arise from differences in the initial randomly chosen cluster centers
The algorithm can easily fail to find a reasonable clustering

Getting trapped in a local minimum

Example: four instances at the vertices of a two-dimensional rectangle
Local minimum: two cluster centers at the midpoints of the rectangle's long sides
Simple way to increase the chance of finding the global optimum: restart with different random seeds

K-Means Algorithm

Let d be the distance measure between instances.
Select k random instances {s1, s2, ..., sk} as seeds.
Until the clustering converges or another stopping criterion is met:
  For each instance xi:
    Assign xi to the cluster cj such that d(xi, sj) is minimal.
  (Update the seeds to the centroid of each cluster.)
  For each cluster cj:
    sj = μ(cj), the centroid of cj
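A minimal NumPy sketch of this procedure (illustrative only, not the original course code; the toy data, value of k, and convergence test are assumptions):

# Minimal k-means sketch in NumPy.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Select k random instances as the initial seeds.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each instance to the nearest center (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the centroid ("mean") of its cluster;
        # keep the old center if a cluster happens to be empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (20, 2)) for m in (0, 4)])
centers, labels = kmeans(X, k=2)
print(centers)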

K-Means Example (k = 2)

[Figure: pick seeds; assign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged]

Seed Choice

Results can vary based on random seed selection
Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusters
Select good seeds using a heuristic or the results of another method
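For example, scikit-learn's KMeans combines a seeding heuristic (k-means++) with multiple random restarts; a short sketch with assumed toy data (scikit-learn is not prescribed by the slides):

# Sketch: heuristic seeding plus multiple restarts in scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in (0, 3, 6)])

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # within-cluster sum of squares (lower is better)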

Clustering

Incremental clustering
Probability-based clustering
EM clustering
Bayesian clustering

Incremental clustering

Works incrementally, instance by instance, forming a hierarchy of clusters
COBWEB handles nominal attributes; CLASSIT handles numeric attributes
Instances are added one at a time
The tree is updated appropriately at each step:
  finding the right leaf for an instance
  restructuring the tree
How and where to update is decided based on the category utility value

Iris Data

Please refer to IrisDataSetClusters.pdf

Clustering with cutoff

Please refer to Clusteringwithcutoff.pdf

Category utility

Category utility is a kind of quadratic loss function defined on conditional probabilities:
  C1, ..., Ck are the k clusters
  ai is the i-th attribute
  ai takes on values vi1, vi2, ...
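Written out over these quantities, the standard definition of category utility (a reconstruction following Witten & Frank's treatment, not copied from the original slide image) is:

CU(C_1,\dots,C_k) = \frac{1}{k} \sum_{\ell=1}^{k} \Pr[C_\ell] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_\ell]^2 - \Pr[a_i = v_{ij}]^2 \right)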

Probability-based clustering

Problems with k-means and hierarchical methods:
  Division by k
  Order of examples
  Merging/splitting operations might not be sufficient to reverse the effects of a bad initial ordering
  Is the result at least a local minimum of category utility?
Solution:
  Find the most likely clusters given the data
  An instance has a certain probability of belonging to a particular cluster

Soft Clustering

So far, the clustering methods have assumed that each instance has a "hard" assignment to exactly one cluster
No uncertainty about class membership, and no instance belonging to more than one cluster
Soft clustering gives the probability that an instance belongs to each of a set of clusters
Each instance is assigned a probability distribution across the set of discovered clusters
The probabilities over all clusters must sum to 1

Finite mixtures

Probabilistic clustering algorithms model the data using a mixture of distributions
Each cluster is represented by one distribution
The distribution governs the probabilities of attribute values in the corresponding cluster
They are called finite mixtures because only a finite number of clusters is being represented
Usually the individual distributions are normal
The distributions are combined using cluster weights

A two-class mixture model

[Figure: example data generated from a two-class mixture of normal distributions]

Using the mixture model

The probability of an instance x belonging to cluster A is:
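A reconstruction of the formula (not copied from the original slide), assuming the standard two-class normal mixture with cluster weights pA and pB introduced above:

\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]} = \frac{f(x; \mu_A, \sigma_A)\, p_A}{f(x; \mu_A, \sigma_A)\, p_A + f(x; \mu_B, \sigma_B)\, p_B}

where f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) is the normal density.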

Learning the clusters

Assume we know that there are k clusters
To learn the clusters we need to determine their parameters, i.e., their means and standard deviations
Start with an initial guess for the parameters (for the two-cluster case above, the five parameters μA, σA, μB, σB, and pA); use them to calculate cluster probabilities for each instance, use these probabilities to re-estimate the parameters, and repeat
We actually have a performance criterion: the likelihood of the training data given the clusters

Expectation Maximization (EM)

Probabilistic method for soft clustering
Iterative method for learning a probabilistic categorization model from unsupervised data
Direct method that assumes k clusters: {c1, c2, ..., ck}
Soft version of k-means
Assumes a probabilistic model of categories that allows computing P(ci | E) for each category ci, for a given example E

EM Algorithm

Initially assume a random assignment of examples to categories
Learn an initial probabilistic model by estimating model parameters from this randomly labeled data

The EM algorithm

EM algorithm: expectation-maximization algorithm
Generalization of k-means to a probabilistic setting
Similar iterative procedure:
1. Calculate the cluster probability for each instance (expectation step)
2. Estimate the distribution parameters based on the cluster probabilities (maximization step)
Cluster probabilities are stored as instance weights
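A minimal NumPy sketch of these two steps for a one-dimensional, two-component Gaussian mixture (illustrative only; the toy data, initial guesses, and iteration count are assumptions, not from the slides):

# Minimal EM sketch for a 1-D mixture of two Gaussians.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.5, 100)])

# Initial guesses for the five parameters: muA, sdA, muB, sdB, pA.
muA, sdA, muB, sdB, pA = -1.0, 1.0, 1.0, 1.0, 0.5

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: probability that each instance belongs to cluster A (instance weights).
    num_a = pA * normal_pdf(x, muA, sdA)
    num_b = (1 - pA) * normal_pdf(x, muB, sdB)
    wA = num_a / (num_a + num_b)
    wB = 1 - wA
    # M-step: re-estimate the distribution parameters from the weighted instances.
    muA, muB = np.average(x, weights=wA), np.average(x, weights=wB)
    sdA = np.sqrt(np.average((x - muA) ** 2, weights=wA))
    sdB = np.sqrt(np.average((x - muB) ** 2, weights=wB))
    pA = wA.mean()

print(muA, sdA, muB, sdB, pA)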

Summary

Labeled clusters can be interpreted using supervised learning: train a tree or learn rules
Can be used to fill in missing attribute values
All methods make a basic assumption of independence between the attributes
Some methods allow the user to specify in advance that two or more attributes are dependent and should be modeled with a joint probability

Thank you!
Questions?
natashab@sdsc.edu

Coming up: use cases and hands-on activity

Regression Trees & Clustering
SDSC Summer Institute 2012
Natasha Balac, Ph.D.

REGRESSION TREE INDUCTION

Why regression trees?
Ability to:
  Predict a continuous variable
  Model conditional effects
  Model uncertainty

Regression Trees

Continuous goal variables
Induction by means of an efficient recursive partitioning algorithm
Uses linear regression to select internal nodes (Quinlan, 1992)

Regression trees

Differences from decision trees:
  Splitting: minimizing intra-subset variation
  Pruning: numeric error measure
  A leaf node predicts the average class value of the training instances reaching that node
Can approximate piecewise constant functions
Easy to interpret and understand the structure
Special kind: model trees
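As an illustration of variance-based splitting with mean-valued leaves, here is a short scikit-learn sketch. It uses scikit-learn's CART-style regressor, which is analogous to (not identical to) the M5/Quinlan approach in these slides; the data and depth limit are assumptions.

# Sketch: a regression tree with variance-based splits and mean-value leaves.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)   # noisy continuous target

tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=4).fit(X, y)
print(export_text(tree, feature_names=["x"]))         # piecewise-constant model
print(tree.predict([[2.5]]))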

Model trees

Regression trees with linear regression functions at each leaf
Linear regression (LR) is applied to the instances that reach a node after the full regression tree has been built
Only a subset of the attributes is used for LR
Fast: the overhead for LR is minimal because only a small subset of attributes is used in the tree

Building the tree

Splitting criterion: standard deviation reduction (SDR), computed on the portion T of the data reaching the node, where T1, T2, ... are the sets that result from splitting the node according to the chosen attribute (see the formula below)
Termination criteria:
  The standard deviation becomes smaller than a certain fraction of the SD for the full training set (e.g., 5%)
  Too few instances remain (< 4)
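Written out under the definitions above, the standard M5 statement of the criterion (supplied here as a reconstruction, not copied from the original slide image) is:

\mathrm{SDR} = \mathrm{sd}(T) - \sum_i \frac{|T_i|}{|T|} \times \mathrm{sd}(T_i)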

Nominal attributes

Nominal attributes are converted into binary attributes (which can be treated as numeric ones)
Nominal values are sorted using the average class value
If there are k values, k-1 binary attributes are generated
It can be proven that the best split on one of the new attributes is the best binary split on the original
M5' only does the conversion once
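A small sketch of that conversion (illustrative; the attribute values and the exact 0/1 encoding are assumptions for the example):

# Sketch: convert a k-valued nominal attribute into k-1 binary attributes,
# ordering the values by their average class (target) value.
import numpy as np

values = np.array(["red", "blue", "red", "green", "blue", "green"])  # nominal attribute
target = np.array([3.0, 7.0, 4.0, 5.0, 8.0, 6.0])                    # class values

# Sort the distinct values by average target value.
order = sorted(set(values), key=lambda v: target[values == v].mean())
# k-1 binary attributes: "is the value among the first i values in the ordering?"
binary = {
    f"{'|'.join(order[:i])} vs rest": np.isin(values, order[:i]).astype(int)
    for i in range(1, len(order))
}
for name, col in binary.items():
    print(name, col)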

Pruning Model Trees

Based on the estimated absolute error of the LR models
Heuristic estimate used in a smoothing calculation (formula below), where p' is the prediction passed up to the next higher node, p is the prediction passed up from the node below, q is the value predicted by the model at this node, n is the number of training instances in the node below, and k is a smoothing constant
Pruned by greedily removing terms to minimize the estimated error
Heavy pruning is allowed: a single LR model can replace a whole subtree
Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree
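The smoothing rule these variables plug into is, in its standard M5 form (supplied here as a reconstruction, not copied from the original slide image):

p' = \frac{n\,p + k\,q}{n + k}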

Building the tree

Splitting criterion: standard deviation reduction
  T: the portion of the training data reaching the node
  T1, T2, ...: the sets that result from splitting the node on the chosen attribute
  Treating the SD of the class values in T as a measure of the error at the node, calculate the expected reduction in error from testing each attribute at the node
Termination criteria:
  The standard deviation becomes smaller than a certain fraction of the SD for the full training set (e.g., 5%)
  Too few instances remain (< 4)

Pseudo-code for M5'

Four methods:
  Main method: MakeModelTree()
  Method for splitting: split()
  Method for pruning: prune()
  Method that computes the error: subtreeError()
We'll briefly look at each method in turn
The linear regression method is assumed to perform attribute subset selection based on error

MakeModelTree

  SD = sd(instances)
  for each k-valued nominal attribute:
    convert it into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)

split(node)

  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05 * SD:
    node.type = LEAF
  else:
    node.type = INTERIOR
    for each attribute:
      for all possible split positions of the attribute:
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    partition node.instances on node.attribute into node.left and node.right
    split(node.left)
    split(node.right)

prune(node)

  if node = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF

subtreeError(node)

  l = node.left; r = node.right
  if node = INTERIOR then
    return (sizeof(l.instances) * subtreeError(l)
          + sizeof(r.instances) * subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)
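For readers who want something executable, here is a compact Python sketch of the same splitting idea: SDR-based splits with mean-value leaves. It is a simplification under stated assumptions (no linear models at the leaves, no pruning or smoothing), i.e. a plain regression tree rather than full M5'.

# Compact, runnable sketch of SDR-based regression-tree growth.
import numpy as np

def sdr(y, left_mask):
    # Standard deviation reduction for a boolean split of the target values y.
    parts = [y[left_mask], y[~left_mask]]
    return y.std() - sum(len(p) / len(y) * p.std() for p in parts if len(p))

def build(X, y, full_sd, min_size=4, sd_frac=0.05):
    # Termination: too few instances, or SD already a small fraction of the full SD.
    if len(y) < min_size or y.std() < sd_frac * full_sd:
        return {"type": "LEAF", "value": y.mean()}
    best = None
    for a in range(X.shape[1]):                 # each (numeric) attribute
        for t in np.unique(X[:, a])[:-1]:       # each possible split position
            mask = X[:, a] <= t
            score = sdr(y, mask)
            if best is None or score > best[0]:
                best = (score, a, t, mask)
    if best is None:
        return {"type": "LEAF", "value": y.mean()}
    _, a, t, mask = best
    return {"type": "INTERIOR", "attr": a, "thresh": t,
            "left": build(X[mask], y[mask], full_sd, min_size, sd_frac),
            "right": build(X[~mask], y[~mask], full_sd, min_size, sd_frac)}

def predict(node, x):
    while node["type"] == "INTERIOR":
        node = node["left"] if x[node["attr"]] <= node["thresh"] else node["right"]
    return node["value"]

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = np.where(X[:, 0] < 5, 1.0, 4.0) + rng.normal(0, 0.1, 100)
tree = build(X, y, full_sd=y.std())
print(predict(tree, np.array([2.0])), predict(tree, np.array([8.0])))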

MULTI-VARIATE REGRESSION TREES

All the characteristics of a regression tree
Capable of predicting two or more outcomes
Examples: activity and toxicity; monetary gain and time
(Balac & Gaines, ICML 2001)

MULTI-VARIATE REGRESSION TREE INDUCTION

[Figure: example multi-variate regression tree. Internal nodes test Var 1-Var 4 against numeric thresholds (e.g., > 0.5, <= -3.61), and each leaf predicts both outcomes, e.g., Activity = 7.05 with Toxicity = 0.173, or Activity = 7.39 with Toxicity = 2.89.]


Clustering: Weather data

[Figure: the weather data set]

Clustering the weather data

[Figures: the cluster hierarchy after steps 1 through 5 of incremental clustering]

Merging

Considering all pairs of nodes for merging and evaluating the category utility of each is computationally expensive
Instead, when scanning the nodes for a suitable host, both the best matching node and the runner-up are noted
The best match becomes the host for the new instance, unless merging the host and the runner-up produces better category utility

Final Hierarchy

Please refer to Final Hierarchy Weather data set.pdf


Category Utility Extended to Numeric Attributes

Assuming a normal distribution (see the formula below)
When the standard deviation of attribute ai is zero, it produces an infinite value of the category utility formula
Acuity parameter: a pre-specified minimum variance on each attribute
  Only one instance in a node would otherwise produce zero variance
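Under the normality assumption, the usual numeric form of category utility (a reconstruction following Witten & Frank, whose treatment the acuity discussion above follows; not copied from the original slide image) is:

CU = \frac{1}{k} \sum_{\ell=1}^{k} \Pr[C_\ell] \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{i\ell}} - \frac{1}{\sigma_i} \right)

where σil is the standard deviation of attribute ai within cluster Cl and σi is its standard deviation over all the data.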
