Machine Learning
Data Clustering
Yang Sun
Outline
Machine Learning
Introduction
Algorithms
Applications
Problems
Data Clustering
Introduction
Applications
Algorithms
Problems
Introduction
What is machine learning?
“changes in [a] system that ... enable [it] to do the
same task or tasks drawn from the same population
more efficiently and more effectively the next time.”
[Simon, 1983]
“The goal of machine learning is to build
computer systems that can adapt and learn
from their experience.”
[Tom Dietterich]
Why is Machine Learning Important?
Some tasks cannot be defined well, except
by examples (e.g., recognizing people).
Relationships and correlations can be
hidden within large amounts of data.
Machine Learning/Data Mining may be able
to find these relationships.
Human designers often produce machines
that do not work as well as desired in the
environments in which they are used.
The amount of knowledge available about
certain tasks might be too large for explicit
encoding by humans (e.g., medical
diagnosis).
Environments change over time.
New knowledge about tasks is constantly
being discovered by humans. It may be
difficult to continuously redesign systems
“by hand”.
Why is Machine Learning Important (Cont’d)?
Methods of learning
Procedure-based classification
Rote learning
Advice or instructional learning
Learning by example or practice
Learning by analogy
Discovery learning
[Giles, 2004]
Machine learning algorithms
Supervised learning:
training set has input & output
Decision/regression trees
Neural networks
…
Unsupervised learning:
observation, discovery
Bayesian networks
Clustering
Semi-supervised learning:
combines both labeled and unlabeled examples
to generate an appropriate function or classifier
Reinforcement learning:
the algorithm learns a policy of how to act given an
observation of the world. Every action has some impact in the environment, and the
environment provides feedback that guides the learning algorithm
Transduction:
tries to predict new outputs based on training inputs, training
outputs, and new inputs
Learning to learn:
algorithm learns its own inductive bias based on previous
experience
Supervised Learning
Given: training examples (x_i, y_i)
for some unknown function (system) y = f(x)
Find: an approximation f' of f
Predict y' = f'(x'), where x' is not in the
training set
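The setting above can be made concrete with a tiny sketch. Any supervised learner fits the same fit/predict interface; here the learner is 1-nearest-neighbor, a hypothetical choice made purely for illustration, and the training pairs are made-up data.

```python
# Minimal illustration of the supervised-learning setting:
# learn from (x, y) training pairs, then predict y for an x
# that is not in the training set.

def fit_1nn(training_set):
    """"Training" for 1-nearest-neighbor just stores the examples."""
    return list(training_set)

def predict_1nn(model, x):
    """Predict the y of the stored example whose x is closest."""
    nearest_x, nearest_y = min(model, key=lambda pair: abs(pair[0] - x))
    return nearest_y

# The unknown system here is f(x) = 2x, observed only through examples.
training = [(1, 2), (2, 4), (3, 6), (4, 8)]
model = fit_1nn(training)
print(predict_1nn(model, 2.9))  # x = 2.9 is not in the training set -> 6
```

The learner never sees f itself, only the example pairs, which is exactly the point of the definition above.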
Supervised Learning Classification
Example: Cancer diagnosis
Use this
training set
to learn how to classify patients
where diagnosis is not known:
The
input data
is often easily obtained, whereas the
classification
is not.
(Table: each example pairs input data with a classification;
the training set has both, the test set has input data only.)
1-R (A Decision Tree Stump)
Main Assumptions
Only one attribute is necessary.
Finite number of splits on the attribute.
Hypothesis Space
Fixed size (parametric): Limited modeling potential
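The 1-R learner above can be sketched in a few lines: build one rule per attribute that maps each attribute value to its most common class, and keep the single attribute whose rule makes the fewest training errors. The data layout and the toy weather data are assumptions for illustration only.

```python
# Sketch of a 1-R ("one rule") learner: a decision stump that
# splits on exactly one attribute.
from collections import Counter, defaultdict

def one_r(rows, labels):
    """rows: list of attribute tuples; labels: the class of each row."""
    best = None  # (errors, attribute_index, rule)
    for a in range(len(rows[0])):
        # count classes seen for each value of attribute a
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[a]][y] += 1
        # map each value to its most common class
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[a]] != y for row, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2]  # chosen attribute and its value->class rule

rows = [("sunny", "hot"), ("sunny", "cool"),
        ("rainy", "hot"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
attr, rule = one_r(rows, labels)
print(attr, rule)  # picks attribute 0: it classifies with zero errors
```

Because only one attribute is ever used, the hypothesis space is fixed in size, which is why the slide calls its modeling potential limited.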
Decision Tree
(Figure: an example decision tree. Test the value of X1:
if Small, Y is big; if Medium or Large, test the value of X2:
if < 0.34, Y is small; if > 0.34, Y is very big.)
[Louis Wehenkel, 2002]
Decision Tree Building
Growing the tree (uses part of the training set)
Top down
At each step
Select tip node to split (best first, greedy approach)
Find best input variable and best question
Split
Pruning the tree (uses remaining part of training
set)
Bottom up
At each step
Select test node to prune (worst first, greedy…)
Prune subtree and evaluate
A Particular Type of ANN
Multilayer Perceptrons
(Figure: a multilayer perceptron with an input layer,
a hidden layer, and an output layer.)
[Louis Wehenkel, 2002]
Support Vector Machines
Main Assumption:
Build a model using minimal number of training
instances (Support Vectors).
Hypothesis Space
Variable size (nonparametric): Can model any
function
Based on PAC (probably approximately correct) learning
Bayesian Network
Main Assumptions:
All attributes are equally important.
All attributes are statistically independent (given the
class value).
Hypothesis Space
Fixed size (parametric): Limited modeling potential
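With the two assumptions above (equal importance and conditional independence given the class), the model reduces to the naive Bayes special case. A minimal sketch over categorical attributes follows; the toy data and add-one smoothing are assumptions for illustration.

```python
# Minimal naive Bayes classifier: attributes are treated as
# conditionally independent given the class value.
from collections import Counter, defaultdict

def train_nb(rows, labels):
    class_counts = Counter(labels)
    # value_counts[class][attribute_index][value] = count
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for a, v in enumerate(row):
            value_counts[y][a][v] += 1
    return class_counts, value_counts

def classify_nb(model, row):
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, 0.0
    for y, cy in class_counts.items():
        score = cy / total  # prior P(y)
        for a, v in enumerate(row):
            # likelihood P(attribute_a = v | y), with add-one smoothing
            score *= (value_counts[y][a][v] + 1) / (cy + 2)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

rows = [("sunny", "hot"), ("sunny", "mild"),
        ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
print(classify_nb(model, ("rainy", "mild")))  # -> "yes"
```

The independence assumption is what lets the joint likelihood factor into a simple per-attribute product, keeping the parameter count fixed.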
Research Directions
Improving classification accuracy by
learning ensembles of classifiers
Methods for scaling up supervised learning
algorithms
Reinforcement learning
Learning complex stochastic models
[Dietterich, 1997]
Applications of Machine Learning
search engines
medical diagnosis
detecting credit card fraud
stock market analysis
classifying DNA sequences
speech and handwriting recognition
game playing
robot locomotion.
……
Autonomous Land Vehicle In a
Neural Network (ALVINN)
the project is no longer active
Drives 70 mph on a public highway
Camera image (30x32 pixels) as inputs
30 outputs for steering
30x32 weights into each of four hidden units
4 hidden units
UIUC Hexapod Robot
Application: Breast Cancer
Diagnosis
Research by Mangasarian, Street, and Wolberg
Open Problems
Ensembles of classifiers
Best way to construct ensembles
How to understand the decision
Scaling
Large Training Set
Large Number of Features
Reinforcement learning
Clarifying properties
Hierarchical problem solving
Intelligent exploration methods
Optimizing cumulative discounted reward is not always appropriate
The entire state of the environment is not always visible at each time
step.
Stochastic Models
General purpose
Tractable approximation
Introduction
What is data clustering?
Classification of similar data into different
groups
Partitioning of a data set into subsets so that
the data in each subset (ideally) share some
common trait
[Wikipedia, Data Clustering]
Machine learning typically regards data
clustering as a form of
unsupervised
learning
.
Quality of Clustering
What is a good clustering method?
Intra-cluster similarity is high
Inter-cluster similarity is low
Clustering quality depends on similarity
measure and implementation
The quality of a clustering method is measured
by its ability to discover hidden patterns
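The two criteria above can be checked numerically: a good clustering has small distances within clusters and large distances between them. A sketch using Euclidean distance on 2-D points follows; the sample clusters are made up for illustration.

```python
# Compare intra-cluster vs. inter-cluster distances for a clustering.
from itertools import combinations
from math import dist

def mean_intra_distance(clusters):
    """Average distance between points in the same cluster."""
    pairs = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average distance between points in different clusters."""
    pairs = [dist(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return sum(pairs) / len(pairs)

clusters = [[(0, 0), (0, 1), (1, 0)],
            [(10, 10), (10, 11), (11, 10)]]
# For a good clustering, intra-cluster distance is much smaller.
print(mean_intra_distance(clusters) < mean_inter_distance(clusters))  # True
```

Swapping in a different distance (or a similarity measure) changes the verdict, which is the dependence on the similarity measure noted above.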
Clustering Methods
Hierarchical clustering
:
Create a hierarchical
decomposition of the set of data using some criterion
agglomerative
divisive
Partitional clustering
:
Construct various partitions
and then evaluate them by some criterion
K-means
QT clustering
Fuzzy c-means
Spectral clustering
:
make use of the spectrum of
the similarity matrix of the data to cluster the points
Hierarchical Clustering
Create a hierarchical decomposition of the set of data using some
criterion
[wikipedia]
AGNES (AGglomerative NESting)
[Kaufmann and Rousseeuw, 1990]
Use the single link method and dissimilarity
matrix
Merge nodes that have the least dissimilarity
Repeat
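The AGNES loop above can be sketched directly: start with each point in its own cluster, repeatedly merge the pair with the least dissimilarity, and measure dissimilarity by the single-link criterion (distance between the closest cross-cluster pair). One-dimensional points and the stopping rule (a target number of clusters) are assumptions for illustration.

```python
# Sketch of agglomerative nesting (AGNES) with single-link merging.
from itertools import combinations

def single_link(c1, c2):
    """Single-link dissimilarity: the closest cross-cluster pair."""
    return min(abs(a - b) for a in c1 for b in c2)

def agnes(points, num_clusters):
    clusters = [[p] for p in points]  # each point starts alone
    while len(clusters) > num_clusters:
        # find the two clusters with the least dissimilarity
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: single_link(clusters[ij[0]],
                                              clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # merge them, then repeat
    return clusters

print(agnes([1, 2, 3, 10, 11, 12], 2))  # -> [[1, 2, 3], [10, 11, 12]]
```

The all-pairs search in each merge step is what drives the O(n²)-or-worse time complexity discussed later, and once two clusters are merged the decision is never revisited.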
DIANA (DIvisive ANAlysis)
[Kaufmann and Rousseeuw, 1990]
Inverse of AGNES
Problems in Hierarchical Clustering
Do not scale well:
Time complexity is at least O(n²)
Cannot undo a previous merge or split
Improvements:
BIRCH [1996]: uses a CF-tree and incrementally
adjusts the quality of sub-clusters
CURE [1998]: selects well-scattered points
from the cluster and then shrinks them towards
the center of the cluster by a specified fraction
CHAMELEON [1999]: hierarchical clustering
using dynamic modeling
Partitional Clustering: K-means
Randomly generate
k
clusters and determine the
cluster centers or directly generate
k
seed points
as cluster centers
Assign each point to the nearest cluster center.
Recompute the new cluster centers.
Repeat until some convergence criterion is met
(usually that the assignment hasn't changed).
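The four steps above can be sketched on 1-D data: seed k centers, assign each point to its nearest center, recompute the centers, and stop once the assignment no longer changes. The seed centers below are supplied directly rather than generated randomly, purely to keep the example deterministic.

```python
# Sketch of the K-means loop: assign, recompute, repeat to convergence.
def k_means(points, centers):
    assignment = None
    while True:
        # assign each point to the index of its nearest cluster center
        new_assignment = [min(range(len(centers)),
                              key=lambda c: abs(points[i] - centers[c]))
                          for i in range(len(points))]
        if new_assignment == assignment:  # convergence: nothing changed
            return centers, assignment
        assignment = new_assignment
        # recompute each center as the mean of its assigned points
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if its cluster is empty
                centers[c] = sum(members) / len(members)

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, assignment = k_means(points, centers=[1.0, 12.0])
print(centers)  # -> [2.0, 11.0]
```

Different seed centers can converge to different local optima, which is why the choice of k and of the seeds matters in practice.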
K-means clustering (cont’d)
Pros:
Relatively efficient, scaling better than
hierarchical clustering
Terminates at a local optimum
Cons:
Cannot handle categorical data (no mean defined)
Need to specify k first
Sensitive to noise and outliers
[Osmar R. Zaïane, 1999]
Other partitional clustering algorithms
QT Clust
does not require specifying the number of
clusters
a priori
Fuzzy c-means
Clusters are overlapping
Spectral Clustering
make use of the
spectrum
of the similarity matrix of the
data to cluster the points
Shi-Malik algorithm
Separate data into two clusters
For each cluster, separate again
Similar idea as hierarchical clustering
Used in dimensionality reduction, image
segmentation
Applications of Data Clustering
Image Segmentation
Object and Character Recognition
Information retrieval
Biology
Transcriptomics
Bioinformatics
Marketing Research
Social network mining
Data mining
Image Segmentation with
Normalized Cuts
http://www.cis.upenn.edu/~jshi/software/
Social network mining
http://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/bootstrapping_the_foaf_web/
DNA clustering
www.research.ibm.com/journal/sj/402/inman.html
Open Problems
Scaling problem
Current clustering techniques do not
address all the requirements adequately
and concurrently
Large numbers of dimensions and large
numbers of data items
Distinct clusters vs. overlapping clusters
Thank you!