Machine Learning Data Clustering - Wang


Machine Learning

Data Clustering

Yang Sun

Outline

- Machine Learning
  - Introduction
  - Algorithms
  - Applications
  - Problems
- Data Clustering
  - Introduction
  - Applications
  - Algorithms
  - Problems

Introduction

What is machine learning?

"...changes in [a] system that ... enable [it] to do the same task or tasks drawn from the same population more efficiently and more effectively the next time." [Simon, 1983]

"The goal of machine learning is to build computer systems that can adapt and learn from their experience." [Tom Dietterich]

Why is Machine Learning Important?

- Some tasks cannot be defined well except by examples (e.g., recognizing people).
- Relationships and correlations can be hidden within large amounts of data. Machine learning/data mining may be able to find these relationships.
- Human designers often produce machines that do not work as well as desired in the environments in which they are used.
- The amount of knowledge available about certain tasks might be too large for explicit encoding by humans (e.g., medical diagnosis).
- Environments change over time.
- New knowledge about tasks is constantly being discovered by humans. It may be difficult to continuously re-design systems by hand.


Methods of learning

- Procedure-based classification
- Rote learning
- Advice or instructional learning
- Learning by example or practice
- Learning by analogy
- Discovery learning

[Giles, 2004]

Machine learning algorithms

- Supervised learning: the training set has inputs & outputs
  - Decision/regression trees
  - Neural networks
- Unsupervised learning: observation, discovery
  - Bayesian networks
  - Clustering
- Semi-supervised learning: combines both labeled and unlabeled examples to generate an appropriate function or classifier
- Reinforcement learning: the algorithm learns a policy of how to act given an observation of the world. Every action has some impact in the environment, and the environment provides feedback that guides the learning algorithm.
- Transduction: tries to predict new outputs based on training inputs, training outputs, and new inputs
- Learning to learn: the algorithm learns its own inductive bias based on previous experience

Supervised Learning

- Given: training examples (x_i, y_i), i = 1, ..., n, for some unknown function (system) y = f(x)
- Find: a good approximation f_hat of f
- Predict: y = f_hat(x), where x is not in the training set
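The Given/Find/Predict loop above can be sketched with the simplest possible learner, a 1-nearest-neighbour rule; the training pairs and function names here are illustrative, not from the slides:

```python
# Toy training set: (x, y) pairs sampled from some unknown function f.
train = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0)]

def predict(x, examples):
    """A 1-nearest-neighbour approximation f_hat of the unknown f:
    answer with the output of the closest training input."""
    _, nearest_y = min(examples, key=lambda xy: abs(xy[0] - x))
    return nearest_y

# Predict for an input that is not in the training set.
print(predict(2.4, train))  # nearest training input is 2.0, so 4.0
```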

Supervised Learning: Classification

- Example: cancer diagnosis
- Use a training set to learn how to classify patients whose diagnosis is not known.
- The input data is often easily obtained, whereas the classification is not.

[Figure: a table of patient records with columns Input Data and Classification, split into a Training Set and a Test Set]

1-R (A Decision Tree Stump)

- Main assumptions:
  - Only one attribute is necessary.
  - Finite number of splits on the attribute.
- Hypothesis space:
  - Fixed size (parametric): limited modeling potential
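A minimal 1-R sketch: for each single attribute, build a one-level rule mapping each value to its majority class, then keep the attribute whose rule makes the fewest training errors. The weather-style attributes and labels below are invented for illustration:

```python
from collections import Counter

# Toy training set: each row is (attribute values, class label).
rows = [
    ({"outlook": "sunny", "windy": "no"},  "stay_in"),
    ({"outlook": "sunny", "windy": "yes"}, "stay_in"),
    ({"outlook": "rainy", "windy": "no"},  "go_out"),
    ({"outlook": "rainy", "windy": "yes"}, "go_out"),
]

def one_r(data):
    """1-R: one rule on one attribute, chosen by lowest training error."""
    best = None
    for attr in data[0][0]:
        counts = {}  # attribute value -> Counter of class labels
        for feats, label in data:
            counts.setdefault(feats[attr], Counter())[label] += 1
        # Majority class for each attribute value.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(rule[feats[attr]] != label for feats, label in data)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best

attr, rule, errors = one_r(rows)
print(attr, rule, errors)  # "outlook" alone classifies this toy set perfectly
```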

Decision Tree

[Figure: a decision tree splitting on the value of X1 (Small / Medium or Large) and the value of X2 (< 0.34 / > 0.34), with leaves "Y is big", "Y is small", and "Y is very big"]

[Louis Wehenkel, 2002]

Decision Tree Building

- Growing the tree (uses part of the training set)
  - Top down
  - At each step:
    - Select the tip node to split (best first, greedy approach)
    - Find the best input variable and best question
    - Split
- Pruning the tree (uses the remaining part of the training set)
  - Bottom up
  - At each step:
    - Select the test node to prune (worst first, greedy)
    - Prune the subtree and evaluate

A Particular Type of ANN: Multilayer Perceptrons

[Figure: a multilayer perceptron with an input layer, a hidden layer (units such as H3), and an output layer]

[Louis Wehenkel, 2002]
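A forward pass through such a network can be sketched in a few lines; the layer sizes and random weights below are illustrative, not taken from the figure:

```python
import numpy as np

def mlp_forward(x, w_hidden, w_out):
    """One forward pass of a multilayer perceptron:
    input layer -> sigmoid hidden layer -> linear output layer."""
    hidden = 1.0 / (1.0 + np.exp(-(w_hidden @ x)))  # hidden-unit activations
    return w_out @ hidden

rng = np.random.default_rng(0)
x = rng.normal(size=4)              # 4 input features
w_hidden = rng.normal(size=(3, 4))  # 3 hidden units
w_out = rng.normal(size=(2, 3))     # 2 outputs
y = mlp_forward(x, w_hidden, w_out)
print(y.shape)  # (2,)
```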

Support Vector Machines

- Main assumption:
  - Build a model using a minimal number of training instances (support vectors).
- Hypothesis space:
  - Variable size (nonparametric): can model any function
- Based on PAC (probably approximately correct) learning

Bayesian Network

- Main assumptions:
  - All attributes are equally important.
  - All attributes are statistically independent (given the class value).
- Hypothesis space:
  - Fixed size (parametric): limited modeling potential


Research Directions

- Improving classification accuracy by learning ensembles of classifiers
- Methods for scaling up supervised learning algorithms
- Reinforcement learning
- Learning complex stochastic models

[Dietterich, 1997]

Applications of Machine Learning

- Search engines
- Medical diagnosis
- Detecting credit card fraud
- Stock market analysis
- Classifying DNA sequences
- Speech and handwriting recognition
- Game playing
- Robot locomotion
- and more

Autonomous Land Vehicle In a Neural Network (ALVINN)

(The project is no longer active.)

- Drives 70 mph on a public highway
- A 30x32-pixel camera image as inputs
- 4 hidden units, each with 30x32 weights from the input image
- 30 outputs for steering

UIUC Hexapod Robot

Application: Breast Cancer Diagnosis

Research by Mangasarian, Street, and Wolberg

Open Problems

- Ensembles of classifiers
  - Best way to construct ensembles
  - How to understand the decision
- Scaling
  - Large training sets
  - Large numbers of features
- Reinforcement learning
  - Clarifying properties
  - Hierarchical problem solving
  - Intelligent exploration methods
  - Optimizing cumulative discounted reward is not always appropriate
  - The entire state of the environment is not always visible at each time step
- Stochastic models
  - General purpose
  - Tractable approximation

Introduction

What is data clustering?

- Classification of similar data into different groups
- Partitioning of a data set into subsets so that the data in each subset (ideally) share some common trait [Wikipedia, Data Clustering]
- Machine learning typically regards data clustering as a form of unsupervised learning.

Quality of Clustering

What is a good clustering method?

- Intra-cluster similarity is high
- Inter-cluster similarity is low
- Clustering quality depends on the similarity measure and its implementation
- The quality of a clustering method is measured by its ability to discover hidden patterns

Clustering Methods

- Hierarchical clustering: create a hierarchical decomposition of the set of data using some criterion
  - Agglomerative
  - Divisive
- Partitional clustering: construct various partitions and then evaluate them by some criterion
  - K-means
  - QT clustering
  - Fuzzy c-means
- Spectral clustering: make use of the spectrum of the similarity matrix of the data to cluster the points
Hierarchical Clustering

Create a hierarchical decomposition of the set of data using some criterion.

[Figure: a dendrogram of a hierarchical clustering] [Wikipedia]

AGNES (AGglomerative NESting) [Kaufmann and Rousseeuw, 1990]

- Uses the single-link method and the dissimilarity matrix
- Merges the nodes that have the least dissimilarity
- Repeats

DIANA (DIvisive ANAlysis) [Kaufmann and Rousseeuw, 1990]

- The inverse of AGNES
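A minimal AGNES sketch on 1-D values, merging by single-link (minimum pairwise) dissimilarity as described above; the values and the stopping condition are illustrative:

```python
def agnes_single_link(points, target_clusters):
    """AGNES sketch: start with singleton clusters and repeatedly merge
    the pair with the smallest single-link distance."""
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # Find the pair of clusters with least dissimilarity.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters

print(agnes_single_link([1.0, 1.2, 5.0, 5.1, 9.0], 3))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```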

Problems in Hierarchical Clustering

- Does not scale well:
  - Time complexity at least O(n^2)
  - Can never UNDO a previous operation
- Improvements:
  - BIRCH [1996]: uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - CURE [1998]: selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
  - CHAMELEON [1999]: hierarchical clustering using dynamic modeling

Partitional Clustering: K-means

- Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
- Assign each point to the nearest cluster center.
- Recompute the new cluster centers.
- Repeat until some convergence criterion is met (usually that the assignment hasn't changed).
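The steps above can be sketched directly in plain Python for 2-D points; the toy data set is illustrative:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """K-means following the steps above: seed k centers, assign each
    point, recompute centers, repeat until assignments stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # k seed points as centers
    assignment = None
    for _ in range(iters):
        # Assign each point to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - centers[c][0]) ** 2
                                        + (p[1] - centers[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:     # converged: no change
            break
        assignment = new_assignment
        # Recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, assignment

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(pts, k=2)
print(sorted(centers))
```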


K-means clustering (cont'd)

- Pros:
  - Relatively efficient, scaling better than hierarchical clustering
  - Terminates at a local optimum
- Cons:
  - Categorical data (no mean)
  - Need to specify k first
  - Sensitive to noise and outliers

[Osmar R. Zaïane, 1999]

Other partitional clustering algorithms

- QT Clust
  - Does not require specifying the number of clusters a priori
- Fuzzy c-means
  - Clusters are overlapping

Spectral Clustering

- Makes use of the spectrum of the similarity matrix of the data to cluster the points
- Shi-Malik algorithm:
  - Separate the data into two clusters
  - For each cluster, separate again
  - Similar idea to hierarchical clustering
- Used in dimensionality reduction and image segmentation



Applications of Data Clustering

- Image segmentation
- Object and character recognition
- Information retrieval
- Biology
  - Transcriptomics
  - Bioinformatics
- Marketing research
- Social network mining
- Data mining

Image Segmentation with Normalized Cuts

http://www.cis.upenn.edu/~jshi/software/

Social Network Mining

http://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/bootstrapping_the_foaf_web/

DNA Clustering

www.research.ibm.com/journal/sj/402/inman.html

Open Problems

- Scaling
- Current clustering techniques do not address all the requirements adequately and concurrently
- Large numbers of dimensions and large data sets
- Distinct clusters vs. overlapping clusters

Thank you!