Semi-Supervised Clustering I - Network Protocols Lab - University of ...

CS685 : Special Topics in Data Mining, UKY

The UNIVERSITY of KENTUCKY

Semi-Supervised Clustering

CS 685: Special Topics in Data Mining

Spring 2008

Jinze Liu

Outline of lecture

- Overview of clustering and classification
- What is semi-supervised learning?
  - Semi-supervised clustering
  - Semi-supervised classification
- Semi-supervised clustering
  - What is semi-supervised clustering?
  - Why semi-supervised clustering?
  - Semi-supervised clustering algorithms

Supervised classification versus unsupervised clustering

- Unsupervised clustering
  - Group similar objects together to find clusters
    - Minimize intra-class distance
    - Maximize inter-class distance
- Supervised classification
  - Class label for each training sample is given
  - Build a model from the training data
  - Predict class label on unseen future data points


What is clustering?

- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: inter-cluster distances are maximized; intra-cluster distances are minimized]
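The two quantities named in the figure can be measured directly. The following is a minimal sketch, assuming Euclidean distance and NumPy; the function name `intra_inter_distances` and the toy data are illustrative and not part of the lecture.

```python
import numpy as np

def intra_inter_distances(X, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # same[i, j] is True when points i and j share a cluster.
    same = labels[:, None] == labels[None, :]
    # Upper triangle (excluding the diagonal) so each pair is counted once.
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    intra = dist[same & upper].mean()
    inter = dist[~same & upper].mean()
    return intra, inter

# Hypothetical toy data: two tight pairs of points that lie far apart.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
good = intra_inter_distances(X_demo, [0, 0, 1, 1])  # natural grouping
bad = intra_inter_distances(X_demo, [0, 1, 0, 1])   # mixed grouping
```

On this toy data the natural grouping has the smaller intra-cluster distance and the larger inter-cluster distance, which is the property a clustering algorithm searches for.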

What is Classification?

[Figure: the training set feeds a learning algorithm that learns a model (induction); the model is then applied to the test set (deduction)]

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Clustering algorithms

- K-Means
- Hierarchical clustering
- Graph based clustering (Spectral clustering)


Classification algorithms

- Decision Trees
- Naïve Bayes classifier
- Support Vector Machines (SVM)
- K-Nearest-Neighbor classifiers
- Logistic Regression
- Neural Networks
- Linear Discriminant Analysis (LDA)


Supervised Classification Example

[Figure sequence over three slides: scatter plot of data points]

Unsupervised Clustering Example

[Figure sequence over two slides: scatter plot of data points]

Semi-Supervised Learning

Combines labeled and unlabeled data during training to improve performance:

- Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
- Semi-supervised clustering: Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

[Figure: unsupervised clustering, semi-supervised learning, supervised classification]

Semi-Supervised Classification Example

[Figure sequence over two slides: scatter plot of data points]

Semi-Supervised Classification

- Algorithms:
  - Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00].
  - Co-training [Blum:COLT98].
  - Transductive SVMs [Vapnik:98, Joachims:ICML99].
  - Graph-based algorithms
- Assumptions:
  - Known, fixed set of categories given in the labeled data.
  - Goal is to improve classification of examples into these known categories.
- More discussions next week

Semi-Supervised Clustering Example

[Figure sequence over two slides: scatter plot of data points]

Second Semi-Supervised Clustering Example

[Figure sequence over two slides: scatter plot of data points]

Semi-supervised clustering: problem definition

- Input:
  - A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
  - A small amount of domain knowledge
- Output:
  - A partitioning of the objects into k clusters (possibly with some discarded as outliers)
- Objective:
  - Maximum intra-cluster similarity
  - Minimum inter-cluster similarity
  - High consistency between the partitioning and the domain knowledge

Why semi-supervised clustering?

- Why not clustering?
  - The clusters produced may not be the ones required.
  - Sometimes there are multiple possible groupings.
- Why not classification?
  - Sometimes there is insufficient labeled data.
- Potential applications
  - Bioinformatics (gene and protein clustering)
  - Document hierarchy construction
  - News/email categorization
  - Image categorization

Semi-Supervised Clustering

- Domain knowledge
  - Partial label information is given
  - Apply some constraints (must-links and cannot-links)
- Approaches
  - Search-based Semi-Supervised Clustering (this class)
    - Alter the clustering algorithm using the constraints
  - Similarity-based Semi-Supervised Clustering (next class)
    - Alter the similarity measure based on the constraints
  - Combination of both
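The two kinds of domain knowledge are related: partial labels imply pairwise constraints. The following is a minimal sketch of that conversion, assuming labels are given as a dictionary from point index to class label; the function name `constraints_from_labels` is hypothetical and not taken from the lecture.

```python
from itertools import combinations

def constraints_from_labels(partial_labels):
    """Derive pairwise constraints from partial labels.

    partial_labels: dict mapping point index -> class label
    (labeled points only). Two points with the same label yield a
    must-link; two points with different labels yield a cannot-link.
    """
    must_link, cannot_link = [], []
    for (i, li), (j, lj) in combinations(sorted(partial_labels.items()), 2):
        if li == lj:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link

# Three labeled points out of a larger, mostly unlabeled dataset.
ml, cl = constraints_from_labels({0: "a", 3: "a", 7: "b"})
```

The reverse direction does not hold in general: a set of must-link and cannot-link pairs says which points belong together, but not which class they belong to.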

Search-Based Semi-Supervised Clustering

- Alter the clustering algorithm that searches for a good partitioning by:
  - Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz:ANNIE99].
  - Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
  - Using the labeled data to initialize clusters in an iterative refinement algorithm (K-Means) [Basu:ICML02].


Overview of K-Means Clustering

- K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.

Algorithm:

Initialize K cluster centers $\{\mu_l\}_{l=1}^{K}$ randomly. Repeat until convergence:

- Cluster Assignment Step: Assign each data point $x$ to the cluster $X_l$ such that the L2 distance of $x$ from $\mu_l$ (the center of $X_l$) is minimum.
- Center Re-estimation Step: Re-estimate each cluster center $\mu_l$ as the mean of the points in that cluster.
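The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation; it assumes that "random" initialization means picking K distinct data points as the initial centers, and it keeps the old center whenever a cluster becomes empty. The function name `kmeans` and its parameters are chosen for this sketch.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization (assumption): use k distinct data points as centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Cluster assignment step: nearest center under L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Center re-estimation step: mean of the points in each cluster.
        new_centers = np.array([
            X[labels == l].mean(axis=0) if np.any(labels == l) else centers[l]
            for l in range(k)
        ])
        # Converged when the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical toy data: two well-separated pairs of points.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
demo_labels, demo_centers = kmeans(X_demo, k=2)
```

On this toy data the two pairs end up in separate clusters; which pair receives cluster id 0 depends on the random initialization.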

K-Means Objective Function

- Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers, where $\mu_l$ is the center of cluster $X_l$:

  $$\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$$

- Initialization of K cluster centers:
  - Totally random
  - Random perturbation from global mean
  - Heuristic to ensure well-separated centers
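As a rough illustration, the objective can be evaluated for any assignment and set of centers. The sketch below also shows one possible heuristic for well-separated centers, farthest-first traversal; the slide does not say which heuristic is meant, so that choice is an assumption, as are the function names.

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum of squared L2 distances from each point to its assigned center."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)
    return float(((X - centers[labels]) ** 2).sum())

def farthest_first_centers(X, k, first=0):
    """Assumed heuristic: start from point `first`, then repeatedly add
    the point farthest from all centers chosen so far."""
    X = np.asarray(X, dtype=float)
    chosen = [first]
    # Distance from every point to its nearest chosen center.
    min_dist = np.linalg.norm(X - X[first], axis=1)
    while len(chosen) < k:
        nxt = int(min_dist.argmax())
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen]

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_demo = np.array([0, 0, 1, 1])
centers_demo = np.array([[0.0, 1.0], [10.0, 1.0]])  # the cluster means

sse_at_means = kmeans_objective(X_demo, labels_demo, centers_demo)
sse_shifted = kmeans_objective(X_demo, labels_demo, centers_demo + 0.5)
init_centers = farthest_first_centers(X_demo, k=2)
```

For a fixed assignment, moving the centers away from the cluster means increases the objective, which is why the re-estimation step uses the mean.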

K-Means Example

[Figure sequence, one step per slide]

1. Randomly initialize means
2. Assign points to clusters
3. Re-estimate means
4. Re-assign points to clusters
5. Re-estimate means
6. Re-assign points to clusters
7. Re-estimate means and converge

Semi-Supervised K-Means

- Partial label information is given
  - Seeded K-Means
  - Constrained K-Means
- Constraints (Must-link, Cannot-link)
  - COP K-Means

Semi-Supervised K-Means for partially labeled data

- Seeded K-Means:
  - Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i.
  - Seed points are only used for initialization, and not in subsequent steps.
- Constrained K-Means:
  - Labeled data provided by user are used to initialize K-Means algorithm.
  - Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
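The difference between the two variables is a single step in the assignment loop. The sketch below is a minimal illustration under stated assumptions: seed labels are integer cluster ids in `[0, k)`, unlabeled points carry `-1`, and every cluster has at least one seed. The function name `semi_supervised_kmeans` and the `constrained` flag are hypothetical.

```python
import numpy as np

def semi_supervised_kmeans(X, seed_labels, k, constrained=False, n_iter=100):
    """Seeded K-Means (constrained=False) or Constrained K-Means (True)."""
    X = np.asarray(X, dtype=float)
    seed_labels = np.asarray(seed_labels)
    is_seed = seed_labels >= 0
    # Initialization: center l is the mean of the seed points labeled l.
    centers = np.array([X[seed_labels == l].mean(axis=0) for l in range(k)])
    for _ in range(n_iter):
        # Cluster assignment step.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if constrained:
            # Constrained K-Means: seed points keep their given labels.
            labels[is_seed] = seed_labels[is_seed]
        # Center re-estimation step (keep old center if a cluster is empty).
        new_centers = np.array([
            X[labels == l].mean(axis=0) if np.any(labels == l) else centers[l]
            for l in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical 1-D data; the seed at index 2 is deliberately "noisy":
# it is labeled 1 although it sits next to the cluster-0 points.
X_demo = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
seeds = np.array([0, -1, 1, -1, -1, 1])

seeded_labels, _ = semi_supervised_kmeans(X_demo, seeds, k=2)
constrained_labels, _ = semi_supervised_kmeans(X_demo, seeds, k=2, constrained=True)
```

In this toy run the seeded variant relabels the noisy seed to cluster 0, while the constrained variant keeps it in cluster 1.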

Seeded K-Means

- Use labeled data to find the initial centroids and then run K-Means.
- The labels for seeded points may change.

Seeded K-Means Example

[Figure sequence, one step per slide]

1. Initialize means using labeled data
2. Assign points to clusters
3. Re-estimate means
4. Assign points to clusters and converge (the label is changed)

Constrained K-Means

- Use labeled data to find the initial centroids and then run K-Means.
- The labels for seeded points will not change.

Constrained K-Means Example

[Figure sequence, one step per slide]

1. Initialize means using labeled data
2. Assign points to clusters
3. Re-estimate means and converge

COP K-Means

- COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.
- Initialization: Cluster centers are chosen randomly, but as each one is chosen, any must-link constraints that it participates in are enforced (so that the points linked to it cannot later be chosen as the center of another cluster).
- Algorithm: During the cluster assignment step in COP K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
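The constrained assignment step can be sketched as follows. This is a simplified illustration rather than the published algorithm: initialization is plain random selection of data points (the must-link handling during initialization described above is omitted), constraints are checked only against points already assigned in the current pass, and "abort" is modeled by raising `ValueError`. The names `violates` and `cop_kmeans` are chosen for this sketch.

```python
import numpy as np

def violates(i, cluster, labels, must_link, cannot_link):
    """True if putting point i in `cluster` breaks a constraint with an
    already-assigned point (label -1 means not yet assigned)."""
    for a, b in must_link:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] != -1 and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if a == i else a
            if labels[other] == cluster:
                return True
    return False

def cop_kmeans(X, k, must_link=(), cannot_link=(), n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Simplified initialization (assumption): k distinct random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.full(len(X), -1)
        for i in range(len(X)):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = np.argsort(np.linalg.norm(centers - X[i], axis=1))
            for c in order:
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c
                    break
            else:
                # No cluster satisfies the constraints: abort.
                raise ValueError(f"no feasible cluster for point {i}")
        new_centers = np.array([
            X[labels == l].mean(axis=0) if np.any(labels == l) else centers[l]
            for l in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two neighbouring points forced apart by a cannot-link constraint.
X_demo = np.array([[0.0], [1.0], [10.0], [11.0]])
cl_labels, _ = cop_kmeans(X_demo, k=2, cannot_link=[(0, 1)])
```

Because points are processed in a fixed order and never reassigned within a pass, the outcome can depend on that order, and a feasible clustering may exist even when this greedy pass aborts.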

COP K-Means Algorithm: Illustration

[Figure sequence, one case per slide: determine the label of a point under constraints]

1. Must-link: assign to the red class.
2. Cannot-link: assign to the red class.
3. Cannot-link and must-link together: the clustering algorithm fails.

Summary

- Seeded and Constrained K-Means: partially labeled data
- COP K-Means: constraints (must-link and cannot-link)
- Constrained K-Means and COP K-Means require all the constraints to be satisfied.
  - May not be effective if the seeds contain noise.
- Seeded K-Means uses the seeds only in the first step to determine the initial centroids.
  - Less sensitive to the noise in the seeds.
- Experiments show that semi-supervised K-Means outperforms traditional K-Means.

Reference

- Semi-supervised Clustering by Seeding
  - http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
- Constrained K-means clustering with background knowledge
  - http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf

Next class

- Topics
  - Similarity-based semi-supervised clustering
- Readings
  - Distance metric learning, with application to clustering with side-information
    - http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
  - From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering
    - http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf