Performance Metrics for Graph Mining Tasks


Outline

- Introduction to Performance Metrics
- Supervised Learning Performance Metrics
- Unsupervised Learning Performance Metrics
- Optimizing Metrics
- Statistical Significance Techniques
- Model Comparison


Introduction to Performance Metrics

A performance metric measures how well a data mining algorithm performs on a given dataset.

For example, if we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and its formal name is "accuracy."

Performance metrics also help us decide whether one algorithm is better or worse than another.

For example, if classification algorithm A classifies 80% of the data points correctly and classification algorithm B classifies 90% correctly, we immediately conclude that B is doing better. There are, however, some intricacies that we discuss in this chapter.


Supervised Learning Performance Metrics

Metrics that are applied when the ground truth is known (e.g., classification tasks)

Outline:
- 2x2 Confusion Matrix
- Multi-level Confusion Matrix
- Visual Metrics
- Cross-validation

2X2 Confusion Matrix

A 2x2 matrix is used to tabulate the results of a 2-class supervised learning problem. Entry (i,j) represents the number of elements with class label i but predicted to have class label j. Here "+" and "-" are the two class labels.

                            Predicted Class
                            +                        -
  Actual Class   +          f++ (True Positive)      f+- (False Negative)     C = f++ + f+-
                 -          f-+ (False Positive)     f-- (True Negative)      D = f-+ + f--

                            A = f++ + f-+            B = f+- + f--            T = f++ + f+- + f-+ + f--

2X2 Confusion Matrix: Example

Results from a classification algorithm:

  Vertex ID    Actual Class    Predicted Class
  1            +               +
  2            +               +
  3            +               +
  4            +               +
  5            +               -
  6            -               +
  7            -               +
  8            -               -

Corresponding 2x2 matrix for the given table:

                            Predicted Class
                            +         -
  Actual Class   +          4         1         C = 5
                 -          2         1         D = 3

                            A = 6     B = 2     T = 8

- True Positives = 4
- False Negatives = 1
- False Positives = 2
- True Negatives = 1


2X2 Confusion Matrix: Performance Metrics

Walk through the different metrics using the example above.

1. Accuracy is the proportion of correct predictions:
   Accuracy = (f++ + f--) / T = (4 + 1) / 8 = 0.625

2. Error rate is the proportion of incorrect predictions:
   Error rate = (f+- + f-+) / T = (1 + 2) / 8 = 0.375

3. Recall is the proportion of "+" data points predicted as "+":
   Recall = f++ / (f++ + f+-) = f++ / C = 4 / 5 = 0.8

4. Precision is the proportion of data points predicted as "+" that are truly "+":
   Precision = f++ / (f++ + f-+) = f++ / A = 4 / 6 ≈ 0.67
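The same four metrics can be reproduced in a few lines of base R. The sketch below is independent of the PerformanceMetrics package used later in these slides; the two vectors simply restate the eight-vertex example, and the variable names (fpp, fpm, fmp, fmm) are ad hoc stand-ins for f++, f+-, f-+ and f--.

  # Rebuild the 2x2 confusion matrix for the eight-vertex example and derive
  # accuracy, error rate, recall and precision from it (base R only).
  actual    <- c("+", "+", "+", "+", "+", "-", "-", "-")
  predicted <- c("+", "+", "+", "+", "-", "+", "+", "-")

  cm <- table(Actual = actual, Predicted = predicted)   # rows = actual, cols = predicted
  cm <- cm[c("+", "-"), c("+", "-")]                     # order the labels as +, -

  fpp <- cm["+", "+"]; fpm <- cm["+", "-"]               # f++ and f+-
  fmp <- cm["-", "+"]; fmm <- cm["-", "-"]               # f-+ and f--
  total <- sum(cm)

  accuracy  <- (fpp + fmm) / total      # 0.625
  error     <- (fpm + fmp) / total      # 0.375
  recall    <- fpp / (fpp + fpm)        # 0.8
  precision <- fpp / (fpp + fmp)        # 0.667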




Multi-level Confusion Matrix

An n x n matrix, where n is the number of classes and entry (i,j) represents the number of elements with class label i but predicted to have class label j.

Multi-level Confusion Matrix: Example

                                  Predicted Class                      Marginal Sum of
                                  Class 1    Class 2    Class 3        Actual Values
  Actual Class     Class 1        2          1          1              4
                   Class 2        1          2          1              4
                   Class 3        1          2          3              6

  Marginal Sum of Predictions     4          5          5              T = 14

Multi-level Confusion Matrix: Conversion to 2X2

The multi-level matrix can be collapsed into a 2x2 matrix specific to one class by treating that class as "+" and all remaining classes as "-".

                                  Predicted Class
                                  Class 1    Class 2    Class 3
  Actual Class     Class 1        2          1          1
                   Class 2        1          2          1
                   Class 3        1          2          3

2x2 matrix specific to Class 1:

                                       Predicted Class
                                       Class 1 (+)     Not Class 1 (-)
  Actual Class     Class 1 (+)         2  (f++)        2  (f+-)           C = 4
                   Not Class 1 (-)     2  (f-+)        8  (f--)           D = 10

                                       A = 4           B = 10             T = 14

We can now apply all the 2x2 metrics:
  Accuracy = (2 + 8) / 14 = 10/14
  Error = (2 + 2) / 14 = 4/14
  Recall = 2/4
  Precision = 2/4
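The class-specific collapse above can be written as a small base-R helper. This is a sketch, not a function from the PerformanceMetrics package; the function name to_two_by_two and its arguments are ad hoc.

  # Collapse an n x n confusion matrix into the 2x2 matrix for one chosen class.
  to_two_by_two <- function(cm, class_index) {
    fpp <- cm[class_index, class_index]            # f++ : correct predictions for the class
    fpm <- sum(cm[class_index, ]) - fpp            # f+- : class points predicted as other classes
    fmp <- sum(cm[, class_index]) - fpp            # f-+ : other classes predicted as this class
    fmm <- sum(cm) - fpp - fpm - fmp               # f-- : everything else
    matrix(c(fpp, fmp, fpm, fmm), nrow = 2,
           dimnames = list(Actual = c("+", "-"), Predicted = c("+", "-")))
  }

  M <- matrix(c(2, 1, 1,
                1, 2, 1,
                1, 2, 3), nrow = 3, byrow = TRUE)  # the 3-class example above
  to_two_by_two(M, 1)                              # f++ = 2, f+- = 2, f-+ = 2, f-- = 8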

Multi-level Confusion Matrix: Performance Metrics

                                  Predicted Class
                                  Class 1    Class 2    Class 3
  Actual Class     Class 1        2          1          1
                   Class 2        1          2          1
                   Class 3        1          2          3

1. Critical Success Index (or Threat Score) is the ratio of correct predictions for class L to the sum of vertices that belong to L and those predicted as L.

2. Bias: for each class L, it is the ratio of the total points with class label L to the number of points predicted as L. Bias helps us understand whether a model is over- or under-predicting a class.




Confusion Metrics: R-code

  library(PerformanceMetrics)
  data(M)
  M
       [,1] [,2]
  [1,]    4    1
  [2,]    2    1
  twoCrossConfusionMatrixMetrics(M)

  data(MultiLevelM)
  MultiLevelM
       [,1] [,2] [,3]
  [1,]    2    1    1
  [2,]    1    2    1
  [3,]    1    2    3
  multilevelConfusionMatrixMetrics(MultiLevelM)

Visual Metrics

Metrics that are plotted on a graph to obtain a visual picture of the performance of two-class classifiers.

[ROC plot: True Positive Rate (y-axis) versus False Positive Rate (x-axis), both ranging from 0 to 1]
- (0,1) is the ideal point
- (0,0) corresponds to predicting the -ve class all the time
- (1,1) corresponds to predicting the +ve class all the time
- The diagonal corresponds to AUC = 0.5

Plot the performance of multiple models to decide which one performs best.

Understanding Model Performance Based on the ROC Plot

[ROC plot annotated by region, relative to the AUC = 0.5 diagonal]

- Models that lie in the upper left have good performance. Note: this is where you aim to get the model.
- Models that lie below the diagonal perform worse than random. Note: such models can be negated to move them toward the upper left.
- Models that lie in the lower left are conservative: they will not predict "+" unless there is strong evidence, giving low false positives but high false negatives.
- Models that lie in the upper right are liberal: they will predict "+" with little evidence, giving high false positives.

ROC Plot: Example

[ROC plot showing three models: M1 at (0.1, 0.8), M2 at (0.5, 0.5), M3 at (0.3, 0.5)]

M1's performance lies furthest in the upper-left direction (high true positive rate, low false positive rate) and hence it is considered the best model.
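An ROC plot like the one described here can be reproduced with base R graphics. The sketch below simply places the three example models as (false positive rate, true positive rate) points and draws the AUC = 0.5 diagonal; the model names and coordinates are taken from the example above.

  fpr <- c(M1 = 0.1, M2 = 0.5, M3 = 0.3)
  tpr <- c(M1 = 0.8, M2 = 0.5, M3 = 0.5)

  plot(fpr, tpr, xlim = c(0, 1), ylim = c(0, 1), pch = 19,
       xlab = "False Positive Rate", ylab = "True Positive Rate", main = "ROC plot")
  abline(0, 1, lty = 2)                          # AUC = 0.5 diagonal (random performance)
  text(fpr, tpr, labels = names(fpr), pos = 4)   # label each model's point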

Cross-validation

Cross-validation, also called rotation estimation, is a way to analyze how a predictive data mining model will perform on an unknown dataset, i.e., how well the model generalizes.

Strategy:
1. Divide the dataset into two non-overlapping subsets
2. One subset is called the "test" set and the other the "training" set
3. Build the model using the "training" set
4. Obtain predictions for the "test" set
5. Use the "test" set predictions to calculate all the performance metrics

Typically cross-validation is performed for multiple iterations, selecting a different non-overlapping test and training set each time.

Types of Cross-validation

- Hold-out: a random 1/3rd of the data is used as the test set and the remaining 2/3rd as the training set
- k-fold: divide the data into k partitions; use one partition as the test set and the remaining k-1 partitions for training (see the sketch below)
- Leave-one-out: special case of k-fold where k equals the number of data points, so each test set contains a single point

Note: Selection of data points is typically done in a stratified manner, i.e., the class distribution in the test set is similar to that in the training set.
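A minimal k-fold cross-validation loop in base R is sketched below. The "model" is a deliberately trivial majority-class predictor, used only as a stand-in so the fold logic is runnable; the toy labels and the choice k = 5 are assumptions, and any real classifier would take the stand-in's place.

  set.seed(1)
  labels <- sample(c("+", "-"), 30, replace = TRUE)        # toy class labels
  k <- 5
  fold <- sample(rep(1:k, length.out = length(labels)))    # random fold assignment

  accuracy <- numeric(k)
  for (i in 1:k) {
    train <- labels[fold != i]                             # k-1 partitions for training
    test  <- labels[fold == i]                             # held-out partition
    majority <- names(which.max(table(train)))             # "train" the stand-in model
    accuracy[i] <- mean(test == majority)                  # evaluate on the test fold
  }
  mean(accuracy)                                           # cross-validated accuracy estimate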


Unsupervised Learning Performance Metrics

Metrics that are applied when the ground truth is not always available (e.g., clustering tasks)

Outline:
- Evaluation Using Prior Knowledge
- Evaluation Using Cluster Properties

Evaluation Using Prior Knowledge

One way to test the effectiveness of an unsupervised learning method U is to take a dataset D with known class labels, strip the labels, and provide the unlabeled set as input to U. The resulting clusters are then compared with the prior knowledge to judge the performance of U.

Two ways to evaluate performance:
1. Contingency Table
2. Ideal and Observed Matrices

Contingency Table

                                   Cluster
                                   Same Cluster     Different Cluster
  Class     Same Class             u11              u10
            Different Class        u01              u00

(A) To fill the table, initialize u11, u01, u10, u00 to 0.

(B) Then, for each pair of points (v,w):
1. if v and w belong to the same class and the same cluster, increment u11
2. if v and w belong to the same class but different clusters, increment u10
3. if v and w belong to different classes but the same cluster, increment u01
4. if v and w belong to different classes and different clusters, increment u00
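Steps (A) and (B) translate directly into a short base-R loop over all pairs of points. The class and cluster label vectors below are made-up toy data, used only to make the sketch runnable.

  class_labels   <- c(1, 1, 1, 2, 2, 3)    # known class labels
  cluster_labels <- c(1, 1, 2, 2, 2, 3)    # labels produced by the clustering

  u11 <- u10 <- u01 <- u00 <- 0
  n <- length(class_labels)
  for (v in 1:(n - 1)) {
    for (w in (v + 1):n) {                 # each unordered pair (v, w) counted once
      same_class   <- class_labels[v]   == class_labels[w]
      same_cluster <- cluster_labels[v] == cluster_labels[w]
      if (same_class  &&  same_cluster) u11 <- u11 + 1
      if (same_class  && !same_cluster) u10 <- u10 + 1
      if (!same_class &&  same_cluster) u01 <- u01 + 1
      if (!same_class && !same_cluster) u00 <- u00 + 1
    }
  }
  c(u11 = u11, u10 = u10, u01 = u01, u00 = u00)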

Contingency Table: Performance Metrics

1. Rand Statistic, also called the simple matching coefficient, is a measure in which placing a pair of points with the same class label in the same cluster and placing a pair of points with different class labels in different clusters are given equal importance, i.e., it accounts for both the specificity and the sensitivity of the clustering:

   Rand = (u11 + u00) / (u11 + u10 + u01 + u00)

2. Jaccard Coefficient can be utilized when placing a pair of points with the same class label in the same cluster is primarily important:

   Jaccard = u11 / (u11 + u10 + u01)

Example matrix:

                                   Cluster
                                   Same Cluster     Different Cluster
  Class     Same Class             9                4
            Different Class        3                12
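For the example matrix above, the two coefficients work out as follows (a base-R check of the arithmetic, independent of any package).

  u11 <- 9; u10 <- 4; u01 <- 3; u00 <- 12

  rand    <- (u11 + u00) / (u11 + u10 + u01 + u00)   # 21/28 = 0.75
  jaccard <- u11 / (u11 + u10 + u01)                 # 9/16  = 0.5625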

Ideal and Observed Matrices

Given that the number of points is T, the ideal matrix is a T x T matrix where cell (i,j) is 1 if points i and j belong to the same class and 0 if they belong to different classes. The observed matrix is a T x T matrix where cell (i,j) is 1 if points i and j belong to the same cluster and 0 if they belong to different clusters.

The Mantel Test is a statistical test of the correlation between two matrices of the same rank. The two matrices, in this case, are symmetric and, hence, it is sufficient to analyze the lower or upper triangle of each matrix.
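The construction of the two matrices, and the correlation the Mantel test is built around, can be sketched in base R as below. The label vectors are toy data, and the sketch stops at the raw correlation; a full Mantel test would additionally assess significance by permuting one of the matrices.

  class_labels   <- c(1, 1, 1, 2, 2, 3)
  cluster_labels <- c(1, 1, 2, 2, 2, 3)

  ideal    <- outer(class_labels,   class_labels,   "==") * 1   # 1 = same class
  observed <- outer(cluster_labels, cluster_labels, "==") * 1   # 1 = same cluster

  # Both matrices are symmetric, so it is enough to compare their lower triangles
  cor(ideal[lower.tri(ideal)], observed[lower.tri(observed)])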

Evaluation Using Prior Knowledge: R-code

  library(PerformanceMetrics)
  data(ContingencyTable)
  ContingencyTable
       [,1] [,2]
  [1,]    9    4
  [2,]    3   12
  contingencyTableMetrics(ContingencyTable)

Evaluation Using Cluster Properties

In the absence of prior knowledge, we have to rely on information from the clusters themselves to evaluate performance.

1. Cohesion measures how closely objects in the same cluster are related.

2. Separation measures how distinct or separated a cluster is from all the other clusters.

Here, g_i refers to cluster i, W is the total number of clusters, x and y are data points, and proximity can be any similarity measure (e.g., cosine similarity).

We want the cohesion to be close to 1 and the separation to be close to 0.


Optimizing Metrics

Performance metrics that act as optimization functions for a data mining algorithm

Outline:
- Sum of Squared Errors
- Preserved Variability

Sum of Squared Errors

Sum of squared errors (SSE) is typically used in clustering algorithms to measure the quality of the clusters obtained. It takes into consideration the distance between each point in a cluster and its cluster center (the centroid or some other chosen representative).

For d_j, a point in cluster g_i, where m_i is the cluster center of g_i and W is the total number of clusters, SSE is defined as follows:

   SSE = Σ_{i=1..W} Σ_{d_j ∈ g_i} dist(d_j, m_i)²

This value is small when points are close to their cluster center, indicating a good clustering. Similarly, a large SSE indicates a poor clustering. Thus, clustering algorithms aim to minimize SSE.
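A base-R sketch of the SSE computation for a toy 2-D dataset with given cluster assignments; cluster centers are taken to be the centroids, and both the points and the assignments are made-up illustration data.

  x       <- matrix(c(1, 1,   1.2, 0.8,   5, 5,   5.3, 4.9), ncol = 2, byrow = TRUE)
  cluster <- c(1, 1, 2, 2)                           # toy cluster assignments

  sse <- 0
  for (i in unique(cluster)) {
    pts    <- x[cluster == i, , drop = FALSE]        # points d_j in cluster g_i
    center <- colMeans(pts)                          # cluster center m_i (centroid)
    sse    <- sse + sum(sweep(pts, 2, center)^2)     # add squared distances to m_i
  }
  sse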

Preserved Variability

Preserved variability is typically used in eigenvector-based dimension reduction techniques to quantify the variance preserved by the chosen dimensions. The objective of the dimension reduction technique is to maximize this parameter.

Given that each point is represented in r dimensions, that k dimensions are retained (k << r), and that the eigenvalues are λ1 >= λ2 >= ... >= λ(r-1) >= λr, the preserved variability (PV) is calculated as follows:

   PV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λr)

The value of this parameter depends on the number of dimensions chosen: the more included, the higher the value. Choosing all the dimensions results in the perfect score of 1.
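Preserved variability can be sketched in base R from the eigenvalues of the data covariance matrix (as in PCA). The random data matrix and the choice k = 2 below are toy assumptions used only to make the sketch runnable.

  set.seed(1)
  X <- matrix(rnorm(200), ncol = 4)                  # 50 points in r = 4 dimensions

  lambda <- eigen(cov(X), symmetric = TRUE)$values   # eigenvalues, largest first
  k <- 2
  pv <- sum(lambda[1:k]) / sum(lambda)               # preserved variability for k dimensions
  pv                                                 # equals 1 when k = r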


Statistical Significance Techniques

Methods used to assess a p-value for the different performance metrics.

Scenario:
- We obtain, say, cohesion = 0.99 for clustering algorithm A. At first glance, 0.99 feels like a very good score.
- However, it is possible that the underlying data is structured in such a way that you would get 0.99 no matter how you cluster the data.
- In that case, 0.99 is not very significant. One way to decide is by using statistical significance estimation.
- We discuss the Monte Carlo procedure on the next slide.

Monte Carlo Procedure: Empirical p-value Estimation

The Monte Carlo procedure uses random sampling to assess whether a particular performance metric we obtain could have been attained at random.

For example, if we obtain a cohesion score of 0.99 for a cluster of size 5, we would be inclined to think that the cluster is very cohesive. However, this value could have resulted from the nature of the data and not from the algorithm. To test the significance of this 0.99 value, we:

1. Sample N (usually 1000) random sets of size 5 from the dataset
2. Recalculate the cohesion for each of the N sets
3. Count R: the number of random sets with value >= 0.99 (the original score of the cluster)
4. The empirical p-value for the cluster of size 5 with score 0.99 is given by R/N
5. We apply a cutoff, say 0.05, to decide whether 0.99 is significant

Steps 1-4 are the Monte Carlo method for p-value estimation.
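Steps 1-4 can be sketched in base R as below. The "cohesion" function here is a stand-in statistic (mean pairwise similarity over a made-up symmetric similarity matrix); any performance metric could take its place, and the dataset size, cluster size, and N = 1000 simply follow the example above.

  set.seed(1)
  n_points <- 100
  sim <- matrix(runif(n_points^2), n_points)         # toy similarity matrix
  sim <- (sim + t(sim)) / 2                           # make it symmetric

  cohesion <- function(idx) {                         # stand-in cluster statistic
    s <- sim[idx, idx]
    mean(s[lower.tri(s)])
  }

  observed <- cohesion(1:5)                           # score of the cluster being tested
  N <- 1000
  random_scores <- replicate(N, cohesion(sample(n_points, 5)))   # steps 1 and 2
  R <- sum(random_scores >= observed)                 # step 3
  p_value <- R / N                                    # step 4: empirical p-value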


Model Comparison

Metrics that compare the performance of different algorithms.

Scenario:
1) Model 1 provides an accuracy of 70% and Model 2 provides an accuracy of 75%
2) At first glance Model 2 seems better; however, it could be that Model 1 is predicting Class 1 better than Class 2
3) Class 1, moreover, is indeed more important than Class 2 for our problem
4) We can use model comparison methods to take this notion of "importance" into consideration when we pick one model over another

Cost-based analysis is an important model comparison method, discussed in the next few slides.

Cost-based Analysis

In real-world applications, certain aspects of model performance are considered more important than others. For example, if a person with cancer is diagnosed as cancer-free, or vice versa, then the prediction model should be especially penalized. This penalty can be introduced in the form of a cost matrix, whose cells are associated with the corresponding cells of the confusion (or contingency) matrix:

Cost Matrix

                            Predicted Class
                            +                          -
  Actual Class   +          c11 (with f11 or u11)      c10 (with f10 or u10)
                 -          c01 (with f01 or u01)      c00 (with f00 or u00)

Cost-based Analysis: Cost of a Model

The cost and confusion matrices for model M are given below.

Cost Matrix
                            Predicted Class
                            +        -
  Actual Class   +          c11      c10
                 -          c01      c00

Confusion Matrix
                            Predicted Class
                            +        -
  Actual Class   +          f11      f10
                 -          f01      f00

The cost of model M is given as:

  Cost(M) = c11*f11 + c10*f10 + c01*f01 + c00*f00

Cost-based Analysis: Comparing Two Models

This analysis is typically used to select one model when we have more than one choice, obtained by using different algorithms or different parameters for the learning algorithm.

Cost Matrix
                            Predicted Class
                            +        -
  Actual Class   +          -20      100
                 -          45       -10

Confusion Matrix of Mx
                            Predicted Class
                            +        -
  Actual Class   +          4        1
                 -          2        1

Confusion Matrix of My
                            Predicted Class
                            +        -
  Actual Class   +          3        2
                 -          2        1

Cost of Mx: 4(-20) + 1(100) + 2(45) + 1(-10) = 100
Cost of My: 3(-20) + 2(100) + 2(45) + 1(-10) = 220

Cost(Mx) < Cost(My), so based purely on cost, Mx is the better model.



Cost-based Analysis: R-code

  library(PerformanceMetrics)
  data(Mx)
  data(My)
  data(CostMatrix)
  Mx
       [,1] [,2]
  [1,]    4    1
  [2,]    2    1
  My
       [,1] [,2]
  [1,]    3    2
  [2,]    2    1
  costAnalysis(Mx, CostMatrix)
  costAnalysis(My, CostMatrix)
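If the PerformanceMetrics package is not at hand, the same cost comparison can be done directly in base R: multiply each confusion-matrix cell by its cost and sum. The matrices below restate the example from the previous slides.

  CostMatrix <- matrix(c(-20, 100,
                          45, -10), nrow = 2, byrow = TRUE)
  Mx <- matrix(c(4, 1,
                 2, 1), nrow = 2, byrow = TRUE)
  My <- matrix(c(3, 2,
                 2, 1), nrow = 2, byrow = TRUE)

  sum(CostMatrix * Mx)   # cost of Mx: 100
  sum(CostMatrix * My)   # cost of My: 220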