# Bayesian Networks & Clustering

AI and Robotics

Nov 7, 2013

Data Mining: Practical Machine Learning Tools and Techniques

## Implementation: Real machine learning schemes

- Decision trees: from ID3 to C4.5 (pruning, numeric attributes, ...)
- Classification rules: from PRISM to RIPPER and PART (pruning, numeric data, ...)
- Association rules: frequent-pattern trees
- Extending linear models: support vector machines and neural networks
- Instance-based learning: pruning examples, generalized exemplars, distance functions

## Implementation: Real machine learning schemes (cont.)

- Numeric prediction: regression/model trees, locally weighted regression
- Bayesian networks: learning and prediction, fast data structures for learning
- Clustering: hierarchical, incremental, probabilistic, Bayesian
- Semisupervised learning: clustering for classification, co-training
- Multi-instance learning: converting to single-instance learning, dedicated multi-instance methods

## From naïve Bayes to Bayesian networks

Naïve Bayes assumes that attributes are conditionally independent given the class.

- This assumption rarely holds in practice, but classification accuracy is often high anyway
- However, performance is sometimes much worse than that of other methods, e.g. decision trees
- Can we eliminate the assumption? Enter Bayesian networks

Bayesian networks are graphical models that can represent any probability distribution:

- Graphical representation: a directed acyclic graph, with one node for each attribute
- The overall probability distribution is factorized into component distributions
- The graph's nodes hold the component distributions (conditional distributions)

## Computing the class probabilities

Two steps: computing a product of probabilities for each class, then normalization.

- For each class value:
  - Take all attribute values and the class value
  - Look up the corresponding entries in the conditional probability distribution tables
  - Take the product of all these probabilities
- Divide the product for each class by the sum of the products (normalization)
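The two steps above can be sketched in a few lines. This is a minimal illustration, not the book's code; the toy numbers below assume a naïve-Bayes-shaped network (each attribute's only parent is the class) so the table lookups stay simple, whereas in a general network a node's parents may include other attributes.

```python
def class_probabilities(class_prior, cpts, instance):
    """Product of looked-up table entries per class, then normalize."""
    products = {}
    for c, prior in class_prior.items():
        p = prior
        for attr, value in instance.items():
            p *= cpts[attr][c][value]   # look up Pr[attr = value | parents]
        products[c] = p
    total = sum(products.values())      # normalization constant
    return {c: p / total for c, p in products.items()}

# Illustrative tables (weather-data-style counts, purely for the example)
class_prior = {"yes": 9 / 14, "no": 5 / 14}
cpts = {"outlook": {"yes": {"sunny": 2 / 9, "overcast": 4 / 9, "rainy": 3 / 9},
                    "no":  {"sunny": 3 / 5, "overcast": 0 / 5, "rainy": 2 / 5}}}
probs = class_probabilities(class_prior, cpts, {"outlook": "sunny"})
```

After normalization the class probabilities sum to one, so they can be read directly as Pr[class | instance].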

## Why can we do this?

Single assumption: the values of a node's parents completely determine the probability distribution for that node:

Pr[node | ancestors] = Pr[node | parents]

This means that a node/attribute is conditionally independent of its other ancestors given its parents.
## Why can we do this? (cont.)

Chain rule from probability theory:

Pr[a_1, a_2, ..., a_n] = ∏_{i=1..n} Pr[a_i | a_{i-1}, ..., a_1]

Because of our assumption from the previous slide:

Pr[a_1, a_2, ..., a_n] = ∏_{i=1..n} Pr[a_i | a_{i-1}, ..., a_1] = ∏_{i=1..n} Pr[a_i | a_i's parents]

## Learning Bayes networks

Basic components of algorithms for learning Bayes nets:

- A method for evaluating the goodness of a given network
  - A measure based on the probability of the training data given the network (or the logarithm thereof)
- A method for searching through the space of possible networks
  - Amounts to searching through sets of edges, because the nodes are fixed

## Problem: overfitting

Can't just maximize the probability of the training data, because then it is always better to add more edges (fitting the training data more closely). Need to use cross-validation or some penalty for the complexity of the network:

- AIC score = -LL + K
- MDL score = -LL + (K/2) log N

where LL is the log-likelihood (log of the probability of the data), K is the number of free parameters, and N is the number of instances. Another possibility: a Bayesian approach with a prior distribution over networks.
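The two penalized scores are simple functions of the log-likelihood; a direct transcription (treating both as costs to minimize, matching the signs above):

```python
from math import log

def aic_score(log_likelihood, k):
    # AIC: each free parameter costs a constant penalty of 1
    return -log_likelihood + k

def mdl_score(log_likelihood, k, n):
    # MDL: each free parameter costs (1/2) log N, growing with dataset size
    return -log_likelihood + 0.5 * k * log(n)
```

For any non-trivial dataset size the MDL penalty exceeds AIC's, so MDL prefers sparser networks.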

## Searching for a good structure

The task can be simplified: each node can be optimized separately, because the probability of an instance is the product of the individual nodes' probabilities. This also works for the AIC and MDL criteria, because the penalty term can be decomposed into one component per node.

- A node can be optimized by adding or removing edges from other nodes
- Must not introduce cycles!

## The K2 algorithm

- Starts with a given ordering of the nodes (attributes)
- Processes each node in turn
- Greedily tries adding edges from previously processed nodes to the current node
- Moves to the next node when the current node can't be optimized further
- The result depends on the initial order
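The greedy loop can be sketched as follows. This is an illustrative skeleton, not the original implementation: `score` is an assumed stand-in for any decomposable per-node scoring function (e.g. per-node log-likelihood plus a penalty), and `max_parents` is a hypothetical cap often used in practice.

```python
def k2(nodes, score, max_parents=2):
    """Greedy parent selection following a fixed node ordering."""
    parents = {n: [] for n in nodes}
    for i, node in enumerate(nodes):
        candidates = list(nodes[:i])       # only earlier nodes: no cycles possible
        best = score(node, parents[node])
        improved = True
        while improved and len(parents[node]) < max_parents:
            improved = False
            for cand in candidates:
                if cand in parents[node]:
                    continue
                s = score(node, parents[node] + [cand])
                if s > best:               # keep the best single addition
                    best, best_cand, improved = s, cand, True
            if improved:
                parents[node].append(best_cand)
    return parents
```

Restricting candidate parents to earlier nodes in the ordering is what guarantees acyclicity without any explicit cycle check.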

## Some tricks

- Sometimes it helps to start the search with a naïve Bayes network
- It can also help to ensure that every node is in the Markov blanket of the class node
  - The Markov blanket of a node includes all parents, children, and children's parents of that node
  - Given values for the Markov blanket, a node is conditionally independent of all nodes outside the blanket
  - I.e. a node is irrelevant to classification if it is not in the Markov blanket of the class node
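The Markov blanket definition above translates directly into code. A small sketch (my own helper, not from the slides) where the DAG is given as a map from each node to its parent list:

```python
def markov_blanket(node, parents):
    """Parents, children, and children's other parents of `node`."""
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])   # children's parents ("co-parents")
    blanket.discard(node)                # the node itself is not in its blanket
    return blanket
```

Checking whether an attribute lies in `markov_blanket(class_node, parents)` tells you whether it can influence the classification at all.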

## Other algorithms

- Extending K2 to consider greedily adding or deleting edges between any pair of nodes
- A further step: also consider inverting the direction of edges
- TAN (Tree Augmented Naïve Bayes):
  - Starts with naïve Bayes
  - Considers adding a second parent to each node (apart from the class node)
  - An efficient algorithm exists for this

## Likelihood vs. conditional likelihood

- In classification, what we really want to maximize is the probability of the class given the other attributes, not the probability of the instances
- But: there is no closed-form solution for the probabilities in the nodes' tables that maximize this
- However: the conditional probability of the data given a network is easy to compute, and seems to work well when used for network scoring

## Data structures for fast learning

- Learning Bayes nets involves a lot of counting to compute conditional probabilities
- Naïve strategy for storing the counts: a hash table
  - Runs into memory problems very quickly
- More sophisticated strategy: all-dimensions (AD) trees
  - Analogous to the kD-tree for numeric data
  - Stores counts in a tree, in a clever way that eliminates redundancy
  - Only makes sense for large datasets

Assume each attribute in the data has been assigned an index, with the root node given index zero. Then the node for attribute i is expanded with the values of all attributes j > i. Two important restrictions:

- The most populous expansion for each attribute is omitted (breaking ties arbitrarily)
- Expansions with counts that are zero are also omitted

## Discussion

- We have assumed: discrete data, no missing values, no new nodes
- A different method of using Bayes nets for classification: Bayesian multinets
  - I.e. build one network for each class and make predictions using Bayes' rule
- A different class of learning methods for Bayes nets: testing conditional independence assertions
- Bayes nets can also be built for regression tasks

## Clustering: how many clusters?

How to choose k in k-means? Possibilities:

- Choose the k that minimizes the cross-validated squared distance to the cluster centers
- Use penalized squared distance on the training data (e.g. using an MDL criterion)
- Apply k-means recursively with k = 2 and use a stopping criterion (e.g. based on MDL)
  - Seeds for the subclusters can be chosen by seeding along the direction of greatest variance in the cluster (one standard deviation away in each direction from the cluster center of the parent cluster)
  - Implemented in an algorithm called X-means (using the Bayesian Information Criterion)

## Hierarchical clustering

- Recursively splitting clusters produces a hierarchy that can be represented as a dendrogram
- It could also be represented as a Venn diagram of sets and subsets (without intersections)
- The height of each node in the dendrogram can be made proportional to the dissimilarity between its children

## Agglomerative clustering

A bottom-up approach; a simple algorithm that requires a distance/similarity measure:

- Start by considering each instance to be a cluster
- Find the two closest clusters and merge them
- Continue merging until only one cluster is left
- The record of mergings forms a hierarchical clustering structure: a binary dendrogram
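The merging loop can be sketched directly. A deliberately naïve illustration (not the slides' code) using single-linkage on 1-D points; real implementations maintain a distance matrix or priority queue instead of rescanning all pairs:

```python
def agglomerate(points):
    """Bottom-up single-linkage clustering; returns the record of mergings."""
    clusters = [[p] for p in points]          # every instance starts as a cluster
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters whose closest members are closest
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges    # the merge record encodes the binary dendrogram
```

Each entry of `merges` records the two clusters fused and their distance, which is exactly the information a dendrogram drawing needs.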

## Distance measures

- Single-linkage
  - Minimum distance between the two clusters: the distance between the clusters' closest two members
  - Can be sensitive to outliers
- Complete-linkage
  - Maximum distance between the two clusters: two clusters are considered close only if all instances in their union are relatively similar
  - Also sensitive to outliers
  - Seeks compact clusters

## Distance measures (cont.)

Compromises between the extremes of minimum and maximum distance:

- Centroid-linkage
  - Represent clusters by their centroid, and use the distance between centroids
  - Works well for instances in multidimensional Euclidean space
  - Not so good if all we have is pairwise similarity between instances
- Average-linkage
  - Calculate the average distance between each pair of members of the two clusters
- Technical deficiency of both: results depend on the numerical scale on which distances are measured

## More distance measures

- Group-average clustering
  - Uses the average distance between all members of the merged cluster
  - Differs from average-linkage because it includes pairs from the same original cluster
- Ward's clustering method
  - Calculates the increase in the sum of squared distances of the instances from the centroid caused by fusing two clusters
  - Minimizes this increase at each clustering step
- All these measures will produce the same result if the clusters are compact and well separated

## Incremental clustering

Heuristic approach (COBWEB/CLASSIT): form a hierarchy of clusters incrementally.

- Start: the tree consists of an empty root node
- Then: update the tree appropriately as each instance arrives
  - To update, find the right leaf for the instance
  - May involve restructuring the tree
- Update decisions are based on category utility

## Clustering weather data

| ID | Outlook  | Temp. | Humidity | Windy |
|----|----------|-------|----------|-------|
| A  | Sunny    | Hot   | High     | False |
| B  | Sunny    | Hot   | High     | True  |
| C  | Overcast | Hot   | High     | False |
| D  | Rainy    | Mild  | High     | False |
| E  | Rainy    | Cool  | Normal   | False |
| F  | Rainy    | Cool  | Normal   | True  |
| G  | Overcast | Cool  | Normal   | True  |
| H  | Sunny    | Mild  | High     | False |
| I  | Sunny    | Cool  | Normal   | False |
| J  | Rainy    | Mild  | Normal   | False |
| K  | Sunny    | Mild  | Normal   | True  |
| L  | Overcast | Mild  | High     | True  |
| M  | Overcast | Hot   | Normal   | False |
| N  | Rainy    | Mild  | High     | True  |

(The accompanying figure, showing the tree after the first few updates, is omitted.)

## Clustering weather data (cont.)

As further weather instances are added, two operations repair damage caused by a poor instance ordering:

- Merge the best host and the runner-up
- Consider splitting the best host if merging doesn't help

(The data table repeats the one on the previous slide and is omitted here, as are the tree figures.)

## Final hierarchy

The final hierarchy for the first four instances:

| ID | Outlook  | Temp. | Humidity | Windy |
|----|----------|-------|----------|-------|
| A  | Sunny    | Hot   | High     | False |
| B  | Sunny    | Hot   | High     | True  |
| C  | Overcast | Hot   | High     | False |
| D  | Rainy    | Mild  | High     | False |

## Example: the iris data (subset)

## Clustering with cutoff

## Category utility

Category utility is defined on conditional probabilities:

CU(C_1, C_2, ..., C_k) = (1/k) Σ_l Pr[C_l] Σ_i Σ_j ( Pr[a_i = v_ij | C_l]² - Pr[a_i = v_ij]² )

If every instance is placed in a different category, the numerator becomes maximal:

n - Σ_i Σ_j Pr[a_i = v_ij]²

where n is the number of attributes.
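The CU formula above can be computed directly from relative frequencies. A small illustrative sketch (not the COBWEB implementation) for nominal attributes, where each cluster is a list of instances and each instance is a dict mapping attribute names to values:

```python
from collections import Counter

def category_utility(clusters):
    """CU = (1/k) * sum_l Pr[C_l] * sum_{i,j} (Pr[a_i=v_ij|C_l]^2 - Pr[a_i=v_ij]^2)."""
    k = len(clusters)
    all_instances = [x for c in clusters for x in c]
    n = len(all_instances)
    attrs = all_instances[0].keys()

    def sq_sum(instances, attr):
        # sum over values of the squared relative frequency of each value
        counts = Counter(x[attr] for x in instances)
        m = len(instances)
        return sum((c / m) ** 2 for c in counts.values())

    total = 0.0
    for cluster in clusters:
        pr_c = len(cluster) / n
        gain = sum(sq_sum(cluster, a) - sq_sum(all_instances, a) for a in attrs)
        total += pr_c * gain
    return total / k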

## Numeric attributes

Assume a normal distribution:

f(a) = (1 / (√(2π) σ)) exp( -(a - μ)² / (2σ²) )

Then the sum over values is replaced by an integral:

Σ_j Pr[a_i = v_ij | C_l]²  ↔  ∫ f(a_i)² da_i = 1 / (2√π σ_i)

Thus CU(C_1, C_2, ..., C_k) becomes:

CU = (1/k) Σ_l Pr[C_l] (1 / (2√π)) Σ_i ( 1/σ_il - 1/σ_i )

A prespecified minimum variance is enforced: the acuity parameter.
## Probability-based clustering

Problems with the heuristic approach:

- Division by k?
- Order of examples?
- Are the restructuring operations sufficient?
- Is the result at least a local minimum of category utility?

The probabilistic perspective: seek the most likely clusters given the data. Also, an instance then belongs to a particular cluster only with a certain probability.

## Finite mixtures

- Model the data using a mixture of distributions
- One cluster, one distribution: the distribution governs the probabilities of attribute values in that cluster
- Finite mixtures: a finite number of clusters
- The individual distributions are (usually) normal
- The distributions are combined using cluster weights

## Two-class mixture model

Example data (51 one-attribute instances with cluster labels):

A 51, A 43, B 62, B 64, A 45, A 42, A 46, A 45, A 45, B 62, A 47, A 52, B 64, A 51, B 65, A 48, A 49, A 46, B 64, A 51, A 52, B 62, A 49, A 48, B 62, A 43, A 40, A 48, B 64, A 51, B 63, A 43, B 65, B 66, B 65, A 46, A 39, B 62, B 64, A 52, B 63, B 64, A 48, B 64, A 48, A 51, A 48, B 64, A 42, A 48, A 41

Model: μ_A = 50, σ_A = 5, p_A = 0.6; μ_B = 65, σ_B = 2, p_B = 0.4

## Using the mixture model

Probability that instance x belongs to cluster A:

Pr[A | x] = Pr[x | A] Pr[A] / Pr[x] = f(x; μ_A, σ_A) p_A / Pr[x]

with the normal density

f(x; μ, σ) = (1 / (√(2π) σ)) exp( -(x - μ)² / (2σ²) )

Probability of an instance given the clusters:

Pr[x | the clusters] = Σ_i Pr[x | cluster_i] Pr[cluster_i]
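Bayes' rule above turns into a few lines of code. A minimal sketch (the parameter values are the illustrative two-class model from the earlier slide, not learned values):

```python
from math import exp, pi, sqrt

def normal_density(x, mu, sigma):
    # f(x; mu, sigma) as in the formula above
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def responsibility(x, params):
    """Pr[cluster | x] for each cluster, by Bayes' rule."""
    weighted = {c: p["weight"] * normal_density(x, p["mu"], p["sigma"])
                for c, p in params.items()}
    total = sum(weighted.values())        # = Pr[x | the clusters]
    return {c: w / total for c, w in weighted.items()}

params = {"A": {"mu": 50, "sigma": 5, "weight": 0.6},
          "B": {"mu": 65, "sigma": 2, "weight": 0.4}}
r = responsibility(48, params)
```

Note the denominator Pr[x] never needs to be computed separately: it is just the sum of the per-cluster numerators.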

## Learning the clusters

- Assume we know there are k clusters
- Learning the clusters means determining their parameters, i.e. the means and standard deviations
- Performance criterion: probability of the training data given the clusters
- The EM algorithm finds a local maximum of this likelihood

## EM algorithm

EM = Expectation-Maximization; generalizes k-means to a probabilistic setting. Iterative procedure:

- E ("expectation") step: calculate the cluster probability for each instance
- M ("maximization") step: estimate the distribution parameters from the cluster probabilities
- Cluster probabilities are stored as instance weights
- Stop when the improvement is negligible

## More on EM

Estimate the parameters from the weighted instances:

μ_A = (w_1 x_1 + w_2 x_2 + ... + w_n x_n) / (w_1 + w_2 + ... + w_n)

σ_A² = (w_1 (x_1 - μ)² + w_2 (x_2 - μ)² + ... + w_n (x_n - μ)²) / (w_1 + w_2 + ... + w_n)

Stop when the log-likelihood saturates. The log-likelihood is:

Σ_i log( p_A Pr[x_i | A] + p_B Pr[x_i | B] )
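Putting the E step and these weighted M-step formulas together gives a complete two-component, one-dimensional EM loop. This is an illustrative sketch (fixed iteration count rather than a log-likelihood saturation test, no guard against variance collapse):

```python
from math import exp, pi, sqrt

def density(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def em(data, mu, sigma, p_a, iters=50):
    """EM for a two-component 1-D Gaussian mixture."""
    mu_a, mu_b = mu
    sg_a, sg_b = sigma
    for _ in range(iters):
        # E step: weight w_i = Pr[A | x_i] for each instance
        w = []
        for x in data:
            a = p_a * density(x, mu_a, sg_a)
            b = (1 - p_a) * density(x, mu_b, sg_b)
            w.append(a / (a + b))
        # M step: weighted means, standard deviations, and mixing weight
        sa = sum(w)
        sb = len(data) - sa
        mu_a = sum(wi * x for wi, x in zip(w, data)) / sa
        mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / sb
        sg_a = sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / sa)
        sg_b = sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / sb)
        p_a = sa / len(data)
    return (mu_a, sg_a), (mu_b, sg_b), p_a
```

Each instance contributes fractionally to both clusters through its weight, which is exactly how EM generalizes the hard assignments of k-means.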

## Extending the mixture model

- More than two distributions: easy
- Several attributes: easy, assuming independence!
- Correlated attributes: difficult
  - Joint model: bivariate normal distribution with a (symmetric) covariance matrix
  - n attributes: need to estimate n + n(n+1)/2 parameters

## More mixture model extensions

- Nominal attributes: easy if independent
- Correlated nominal attributes: difficult
  - Two correlated attributes require v_1 v_2 parameters
- Missing values: easy
- Distributions other than normal can be used:
  - "log-normal" if a predetermined minimum is given
  - "log-odds" if bounded from above and below
  - Poisson for attributes that are integer counts
- Use cross-validation to estimate k!

## Bayesian clustering

- Problem: many parameters → overfitting
- Bayesian approach: give every parameter a prior probability distribution
  - Incorporate the prior into the overall likelihood figure
  - Penalizes the introduction of parameters
  - E.g. the Laplace estimator for nominal attributes
  - Can also have a prior on the number of clusters!
- Implementation: NASA's AUTOCLASS

## Discussion

- Clusters can be interpreted by using supervised learning as a post-processing step
- Decrease dependence between attributes? Use a pre-processing step, e.g. principal component analysis
- Clustering can be used to fill in missing values
- Can estimate the likelihood of the data: use it to compare different models objectively

## Semisupervised learning

- Semisupervised learning attempts to use unlabeled data as well as labeled data, with the aim of improving classification performance
- Why try to do this? Unlabeled data is often plentiful and labeling data can be expensive
  - Web mining: classifying web pages
  - Text mining: identifying names in text
  - Video mining: classifying people in the news
- Leveraging the large pool of unlabeled examples would be very attractive

## Clustering for classification

Idea: use naïve Bayes on the labeled examples and then apply EM:

1. Build a naïve Bayes model on the labeled data
2. Label the unlabeled data based on the class probabilities ("expectation" step)
3. Train a new naïve Bayes model on all the data ("maximization" step)
4. Repeat steps 2 and 3 until convergence

Essentially the same as EM for clustering, with fixed cluster membership probabilities for the labeled data and #clusters = #classes.

- Has been applied successfully to document classification
  - Certain phrases are indicative of classes
  - Some of these phrases occur only in the unlabeled data, some in both sets
  - EM can generalize the model by taking advantage of co-occurrence of these phrases
- Refinement 1: reduce the weight of the unlabeled data
- Refinement 2: allow multiple clusters per class

## Co-training

Method for learning from multiple views (multiple sets of attributes), e.g.:

- First set of attributes describes the content of a web page
- Second set of attributes describes the links that point to the web page

Steps:

1. Build a model from each view
2. Use the models to assign labels to the unlabeled data
3. Select the unlabeled examples that were most confidently predicted (ideally, preserving the ratio of classes)
4. Add those examples to the training set
5. Go to step 1 until the data is exhausted

Assumption: the views are independent.

## EM and co-training

- Like EM for semisupervised learning, but the view is switched in each iteration of EM
  - Uses all the unlabeled data (probabilistically labeled) for training
- Has also been used successfully with support vector machines, using logistic models fit to the output of the SVMs
- Co-training also seems to work when the views are chosen randomly! Why? Possibly because the co-trained classifier is more robust

## Multi-instance learning

Converting to single-instance learning:

- We have already seen aggregation of the input or the output
- Simple, and often works well in practice, but will fail in some situations
- Aggregating the input loses a lot of information, because attributes are condensed to summary statistics individually and independently
- Can a bag instead be converted to a single instance without losing as much information?

## Converting to single-instance learning

- We can convert to a single instance without losing so much information, but more attributes are needed in the "condensed" representation
- Basic idea: partition the instance space into regions, with one attribute per region in the single-instance representation
- Simplest case: boolean attributes
  - The attribute corresponding to a region is set to true for a bag if the bag has at least one instance in that region
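The boolean-per-region conversion is a one-liner per region. A minimal sketch where the regions are hypothetical fixed intervals on a single numeric attribute (in general they would come from hypercubes, clustering, or a decision tree, as discussed next):

```python
def bag_to_instance(bag, regions):
    """Condense a bag into one boolean attribute per region.
    regions: list of (lo, hi) intervals; attribute i is True if the bag
    has at least one instance falling in region i."""
    return [any(lo <= x < hi for x in bag) for lo, hi in regions]

regions = [(0, 5), (5, 10), (10, 15)]     # illustrative partition of [0, 15)
row = bag_to_instance([1.0, 7.5], regions)
```

Replacing `any(...)` with `sum(...)` gives the numeric-count variant mentioned below.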

- Numeric counts could be used instead of booleans
- Main problem: how to partition the instance space?
  - Simple approach: partition into equal-sized hypercubes; only works for few dimensions
  - More practical: use unsupervised learning
    - Take all instances from all bags (minus the class labels) and cluster them
    - Create one attribute per cluster (region)

- Clustering ignores class membership; a decision tree can be used to partition the space instead
  - Each leaf corresponds to one region of instance space
- How can the tree be learned when class labels apply to entire bags?
  - Aggregating the output can be used: take the bag's class label and attach it to each of its instances
  - Many of these labels will be incorrect, but they are only used to obtain a partitioning of the space

- Using decision trees yields "hard" partition boundaries; so does k-means clustering into regions, where the cluster centers (reference points) define the regions
- Region membership can be made "soft" by using distance, transformed into similarity, to compute the attribute values in the condensed representation
  - Just need a way of aggregating the similarity scores between each bag and a reference point into a single value, e.g. the maximum similarity between any instance in the bag and the reference point

- Converting to single-instance learning is appealing because many existing algorithms can then be applied without modification, but it may not be the most efficient approach

## Upgrading learning algorithms

An alternative is to upgrade a single-instance algorithm to the multi-instance setting:

- Can be achieved elegantly for distance/similarity-based methods (e.g. nearest neighbor or SVMs): compute the distance/similarity between two bags of instances
- Kernel-based methods: the similarity must be a proper kernel function that satisfies certain mathematical properties
  - One example is the set kernel: given a kernel function for pairs of instances, the set kernel sums it over all pairs of instances from the two bags being compared
  - It is generic and can be applied with any single-instance kernel function
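The set kernel is short enough to write in full. A minimal sketch, shown here with a linear kernel on 1-D instances purely for illustration; any single-instance kernel `k` can be plugged in:

```python
def set_kernel(bag1, bag2, k):
    """Sum a single-instance kernel k over all cross-bag instance pairs."""
    return sum(k(x, y) for x in bag1 for y in bag2)

# e.g. with a linear kernel on 1-D instances:
linear = lambda x, y: x * y
value = set_kernel([1, 2], [3, 4], linear)
```

Because it is a sum of kernel evaluations, the result is itself a valid kernel whenever `k` is.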

## Nearest neighbor learning

- Apply variants of the Hausdorff distance, which is defined for sets of points
- Given two bags and a distance function between pairs of instances, the Hausdorff distance between the bags is the largest distance from any instance in one bag to its closest instance in the other bag
- Can be made more robust to outliers by using the nth-largest distance instead
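The definition above translates into a max-of-mins in each direction. A minimal sketch of the standard (symmetrized) form; swapping `max` for a sorted-selection of the nth-largest value gives the outlier-robust variant mentioned above:

```python
def hausdorff(bag1, bag2, dist):
    """Largest distance from any instance in one bag to its closest
    instance in the other bag, taken over both directions."""
    d1 = max(min(dist(x, y) for y in bag2) for x in bag1)
    d2 = max(min(dist(x, y) for x in bag1) for y in bag2)
    return max(d1, d2)

# e.g. on 1-D instances with absolute difference as the instance distance:
d = hausdorff([0, 1], [1, 5], lambda a, b: abs(a - b))
```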

## Dedicated multi-instance methods

- Some methods are not based directly on single-instance algorithms
- One basic approach: find a single hyperrectangle that contains at least one instance from each positive bag and no instances from any negative bag
  - The rectangle encloses an area of the instance space where all positive bags overlap
  - Other shapes can be used, e.g. hyperspheres (balls)
  - Boosting can also be used to build an ensemble of balls

- The methods described so far have hard decision boundaries: an instance either falls inside or outside a hyperrectangle/ball
- Other methods use probabilistic, soft concept descriptions, e.g. diverse density:
  - Learns a single reference point in instance space
  - The probability that an instance is positive decreases with increasing distance from the reference point

- Diverse density combines the instance probabilities within a bag to obtain the probability that the bag is positive, using "noisy-OR" (a probabilistic version of logical OR):
  - If all instance-level probabilities are 0, the noisy-OR value (the bag-level probability) is 0
  - If at least one instance-level probability is 1, the value is 1
- Diverse density, like the geometric methods, is maximized when the reference point is located in an area where positive bags overlap and no negative bags are present
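The noisy-OR combination described above is one minus the product of the instance-level "negative" probabilities, which reproduces both boundary cases:

```python
def noisy_or(instance_probs):
    """Bag-level positive probability: 1 minus the product of the
    instance-level probabilities of being negative."""
    prod = 1.0
    for p in instance_probs:
        prod *= (1 - p)
    return 1 - prod
```

With all inputs 0 the product stays 1 and the result is 0; a single input of 1 zeroes the product and forces the result to 1, matching the logical-OR limits.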