Performance Metrics for Graph Mining Tasks
Outline
• Introduction to Performance Metrics
• Supervised Learning Performance Metrics
• Unsupervised Learning Performance Metrics
• Optimizing Metrics
• Statistical Significance Techniques
• Model Comparison
Introduction to Performance Metrics
A performance metric measures how well your data mining algorithm is performing on a given dataset. For example, if we apply a classification algorithm to a dataset, we first check how many of the data points were classified correctly. This is a performance metric, and the formal name for it is "accuracy."
Performance metrics also help us decide whether one algorithm is better or worse than another. For example, suppose classification algorithm A classifies 80% of data points correctly and classification algorithm B classifies 90% correctly. We immediately see that algorithm B is doing better. There are some intricacies, however, that we will discuss in this chapter.
Supervised Learning Performance Metrics
Metrics that are applied when the ground truth is known (e.g., classification tasks)
Outline:
• 2x2 Confusion Matrix
• Multi-level Confusion Matrix
• Visual Metrics
• Cross-validation
2x2 Confusion Matrix
A 2x2 matrix is used to tabulate the results of a 2-class supervised learning problem; entry (i,j) represents the number of elements with class label i but predicted to have class label j. Here + and - are the two class labels.

                        Predicted Class
                        +                       -
Actual Class   +        f++ (True Positive)     f+- (False Negative)    C = f++ + f+-
               -        f-+ (False Positive)    f-- (True Negative)     D = f-+ + f--
                        A = f++ + f-+           B = f+- + f--           T = f++ + f+- + f-+ + f--
2x2 Confusion Matrix: Example
Results from a classification algorithm:

Vertex ID    Actual Class    Predicted Class
1            +               +
2            +               +
3            +               +
4            +               +
5            +               -
6            -               +
7            -               +
8            -               -

Corresponding 2x2 matrix for the given table:

                        Predicted Class
                        +          -
Actual Class   +        4          1        C = 5
               -        2          1        D = 3
                        A = 6      B = 2    T = 8

• True positives = 4
• False negatives = 1 (vertex 5)
• False positives = 2 (vertices 6 and 7)
• True negatives = 1
2x2 Confusion Matrix: Performance Metrics
Walk through the different metrics using the example above:
1. Accuracy is the proportion of correct predictions: (f++ + f--)/T = (4 + 1)/8 = 0.625
2. Error rate is the proportion of incorrect predictions: (f+- + f-+)/T = (1 + 2)/8 = 0.375
3. Recall is the proportion of "+" data points predicted as "+": f++/C = 4/5 = 0.8
4. Precision is the proportion of data points predicted as "+" that are truly "+": f++/A = 4/6 ≈ 0.67
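These four metrics follow directly from the confusion matrix counts. Below is a minimal plain-Python sketch (separate from the PerformanceMetrics R package shown later in the chapter; the function name is illustrative):

```python
def confusion_metrics(tp, fn, fp, tn):
    """Basic 2x2 confusion matrix metrics.

    Arguments follow the f++, f+-, f-+, f-- layout:
    rows are actual classes, columns are predicted classes.
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,    # proportion of correct predictions
        "error_rate": (fn + fp) / total,  # proportion of incorrect predictions
        "recall": tp / (tp + fn),         # "+" points predicted as "+"
        "precision": tp / (tp + fp),      # predicted "+" that are truly "+"
    }

# The 8-vertex example: 4 true positives, 1 false negative,
# 2 false positives, 1 true negative.
metrics = confusion_metrics(tp=4, fn=1, fp=2, tn=1)
print(metrics)
```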
Multi-level Confusion Matrix
An n x n matrix, where n is the number of classes, and entry (i,j) represents the number of elements with class label i but predicted to have class label j.
Multi-level Confusion Matrix: Example

                     Predicted Class                 Marginal Sum
                     Class 1   Class 2   Class 3     of Actual Values
Actual   Class 1        2         1         1              4
Class    Class 2        1         2         1              4
         Class 3        1         2         3              6
Marginal Sum
of Predictions          4         5         5            T = 14
Multi-level Confusion Matrix: Conversion to 2x2
Starting from the 3x3 matrix above, we can build a 2x2 matrix specific to Class 1 by treating Class 1 as "+" and all other classes as "-":

                          Predicted Class
                          Class 1 (+)    Not Class 1 (-)
Actual   Class 1 (+)          2               2            C = 4
Class    Not Class 1 (-)      2               8            D = 10
                            A = 4           B = 10         T = 14

We can now apply all the 2x2 metrics:
Accuracy = (2 + 8)/14 = 10/14
Error = (2 + 2)/14 = 4/14
Recall = 2/4
Precision = 2/4
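The class-specific collapse can be automated. A small Python sketch, assuming rows are actual classes and columns are predicted classes (function name illustrative):

```python
def collapse_to_2x2(matrix, cls):
    """Collapse an n x n confusion matrix into a 2x2 matrix that
    treats class index `cls` as "+" and every other class as "-"."""
    n = len(matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls][j] for j in range(n) if j != cls)   # actual cls, predicted other
    fp = sum(matrix[i][cls] for i in range(n) if i != cls)   # actual other, predicted cls
    tn = sum(matrix[i][j] for i in range(n)
             for j in range(n) if i != cls and j != cls)
    return [[tp, fn], [fp, tn]]

m3 = [[2, 1, 1],
      [1, 2, 1],
      [1, 2, 3]]
print(collapse_to_2x2(m3, 0))  # 2x2 matrix specific to Class 1
```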
Multi-level Confusion Matrix: Performance Metrics
Using the same 3x3 confusion matrix:

                     Predicted Class
                     Class 1   Class 2   Class 3
Actual   Class 1        2         1         1
Class    Class 2        1         2         1
         Class 3        1         2         3

1. Critical Success Index (CSI), or Threat Score, for a class L is the ratio of correct predictions for L to the number of vertices that either belong to L or were predicted as L, i.e., CSI(L) = f_LL / (actual_L + predicted_L - f_LL). For Class 1: 2/(4 + 4 - 2) = 1/3.
2. Bias, for each class L, is the ratio of the total points with class label L to the number of points predicted as L: Bias(L) = actual_L / predicted_L. Bias helps understand whether a model is over- or under-predicting a class. For Class 1: 4/4 = 1.
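Both per-class metrics can be sketched in a few lines of Python (illustrative names; this is not the R package used elsewhere in the chapter):

```python
def class_csi(matrix, cls):
    """Critical Success Index (Threat Score) for one class: correct
    predictions divided by the number of points that either belong
    to the class or were predicted as it (correct counted once)."""
    n = len(matrix)
    correct = matrix[cls][cls]
    actual = sum(matrix[cls])                          # points with label cls
    predicted = sum(matrix[i][cls] for i in range(n))  # points predicted cls
    return correct / (actual + predicted - correct)

def class_bias(matrix, cls):
    """Bias: total points with the class label over points predicted as it."""
    n = len(matrix)
    return sum(matrix[cls]) / sum(matrix[i][cls] for i in range(n))

m3 = [[2, 1, 1],
      [1, 2, 1],
      [1, 2, 3]]
print(class_csi(m3, 0), class_bias(m3, 0))
```

A bias above 1 means the class is under-predicted (fewer predictions than actual members); below 1 means it is over-predicted.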
Confusion Matrix Metrics: R code

library(PerformanceMetrics)
data(M)
M
     [,1] [,2]
[1,]    4    1
[2,]    2    1
twoCrossConfusionMatrixMetrics(M)

data(MultiLevelM)
MultiLevelM
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    1    2    1
[3,]    1    2    3
multilevelConfusionMatrixMetrics(MultiLevelM)
Visual Metrics
Metrics that are plotted on a graph to obtain a visual picture of the performance of two-class classifiers.

ROC Plot
An ROC plot shows the True Positive Rate (y-axis) against the False Positive Rate (x-axis), both ranging from 0 to 1.
• (0,1) is the ideal point
• (0,0) corresponds to a model that predicts the -ve class all the time
• (1,1) corresponds to a model that predicts the +ve class all the time
• The diagonal line corresponds to random guessing (AUC = 0.5)
Plot the performance of multiple models to decide which one performs best.

Understanding Model Performance based on the ROC Plot
• Models that lie below the diagonal (AUC = 0.5) perform worse than random. Note: such models can be negated to move them above the diagonal.
• Models that lie in the upper left have good performance. Note: this is where you aim to get the model.
• Models that lie in the lower left are conservative: they will not predict "+" unless there is strong evidence, so they have low false positives but high false negatives.
• Models that lie in the upper right are liberal: they will predict "+" with little evidence, so they have high false positives.
ROC Plot: Example
Three models are plotted at M1 = (0.1, 0.8), M2 = (0.5, 0.5), and M3 = (0.3, 0.5). M1's performance lies furthest in the upper-left direction (low false positive rate, high true positive rate) and hence it is considered the best model.
Cross-validation
Cross-validation, also called rotation estimation, is a way to analyze how a predictive data mining model will perform on an unknown dataset, i.e., how well the model generalizes.

Strategy:
1. Divide the dataset into two non-overlapping subsets
2. One subset is called the "test" set and the other the "training" set
3. Build the model using the "training" set
4. Obtain predictions for the "test" set
5. Use the "test" set predictions to calculate all the performance metrics

Typically cross-validation is performed for multiple iterations, selecting a different non-overlapping test and training set each time.
Types of Cross-validation
• Hold-out: a random 1/3 of the data is used as the test set and the remaining 2/3 as the training set
• k-fold: divide the data into k partitions; use one partition as the test set and the remaining k-1 partitions for training
• Leave-one-out: special case of k-fold where k equals the number of data points, so each test set contains exactly one point

Note: Selection of data points is typically done in a stratified manner, i.e., the class distribution in the test set is similar to that of the training set.
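The k-fold strategy above can be sketched in a few lines of Python. This is a plain, non-stratified sketch (honoring the stratification note would require shuffling within each class separately); all names are illustrative:

```python
import random

def k_fold_splits(n_points, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.
    The k folds are non-overlapping; each serves once as the test set."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)       # reproducible random assignment
    folds = [idx[i::k] for i in range(k)]  # k roughly equal partitions
    for i in range(k):
        test = folds[i]
        train = [j for f_id, f in enumerate(folds) if f_id != i for j in f]
        yield train, test

# Leave-one-out is the special case k == n_points: each test set has 1 point.
for train, test in k_fold_splits(9, k=3):
    print(sorted(train), sorted(test))
```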
Unsupervised Learning Performance Metrics
Metrics that are applied when the ground truth is not always available (e.g., clustering tasks)
Outline:
• Evaluation Using Prior Knowledge
• Evaluation Using Cluster Properties
Evaluation Using Prior Knowledge
One way to test the effectiveness of an unsupervised learning method is to take a dataset D with known class labels, strip the labels, and provide the set as input to the unsupervised learning algorithm U. The resulting clusters are then compared with the prior knowledge to judge the performance of U.

To evaluate performance, we use:
1. Contingency Table
2. Ideal and Observed Matrices
Contingency Table

                      Same Cluster    Different Cluster
Same Class                u11               u10
Different Class           u01               u00

(A) To fill the table, initialize u11, u10, u01, u00 to 0.
(B) Then, for each pair of points (v, w):
1. If v and w belong to the same class and the same cluster, increment u11
2. If v and w belong to the same class but different clusters, increment u10
3. If v and w belong to different classes but the same cluster, increment u01
4. If v and w belong to different classes and different clusters, increment u00
Contingency Table: Performance Metrics
• Rand Statistic, also called the simple matching coefficient, is a measure where placing a pair of points with the same class label in the same cluster and placing a pair of points with different class labels in different clusters are given equal importance, i.e., it accounts for both the specificity and sensitivity of the clustering:
Rand = (u11 + u00) / (u11 + u10 + u01 + u00)
• Jaccard Coefficient can be utilized when placing a pair of points with the same class label in the same cluster is primarily important:
Jaccard = u11 / (u11 + u10 + u01)

Example matrix:

                      Same Cluster    Different Cluster
Same Class                 9                 4
Different Class            3                12

For this example, Rand = (9 + 12)/28 = 0.75 and Jaccard = 9/16 ≈ 0.56.
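Both the table-filling procedure and the two coefficients are straightforward to sketch in Python (illustrative names, not the PerformanceMetrics package):

```python
from itertools import combinations

def contingency_counts(classes, clusters):
    """Tally u11, u10, u01, u00 over all point pairs:
    same/different class vs. same/different cluster."""
    u11 = u10 = u01 = u00 = 0
    for i, j in combinations(range(len(classes)), 2):
        same_class = classes[i] == classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            u11 += 1
        elif same_class:
            u10 += 1
        elif same_cluster:
            u01 += 1
        else:
            u00 += 1
    return u11, u10, u01, u00

def rand_statistic(u11, u10, u01, u00):
    return (u11 + u00) / (u11 + u10 + u01 + u00)

def jaccard_coefficient(u11, u10, u01, u00):
    return u11 / (u11 + u10 + u01)

# The example matrix from the text: u11=9, u10=4, u01=3, u00=12.
print(rand_statistic(9, 4, 3, 12), jaccard_coefficient(9, 4, 3, 12))
```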
Ideal and Observed Matrices
Given that the number of points is T, the ideal matrix is a T x T matrix where cell (i,j) is 1 if points i and j belong to the same class and 0 if they belong to different classes. The observed matrix is a T x T matrix where cell (i,j) is 1 if points i and j belong to the same cluster and 0 if they belong to different clusters.

• Mantel Test is a statistical test of the correlation between two matrices of the same rank. The two matrices, in this case, are symmetric, and hence it is sufficient to analyze the lower or upper triangle of each matrix.
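Building the two matrices is mechanical; the same helper serves both, since each is just a "shares a label" indicator matrix (helper name illustrative; diagonal cells are trivially 1 because every point shares its own label):

```python
def pair_matrix(labels):
    """T x T binary matrix: cell (i, j) is 1 iff points i and j share a
    label. Applied to class labels this gives the ideal matrix; applied
    to cluster labels it gives the observed matrix."""
    t = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(t)]
            for i in range(t)]

classes = ["a", "a", "b"]
clusters = [0, 1, 1]
ideal = pair_matrix(classes)      # from known class labels
observed = pair_matrix(clusters)  # from the clustering result
print(ideal)
print(observed)
```

A Mantel-style comparison would then correlate the upper triangles of `ideal` and `observed`.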
Evaluation Using Prior Knowledge: R code

library(PerformanceMetrics)
data(ContingencyTable)
ContingencyTable
     [,1] [,2]
[1,]    9    4
[2,]    3   12
contingencyTableMetrics(ContingencyTable)
Evaluation Using Cluster Properties
In the absence of prior knowledge we have to rely on the information from the clusters themselves to evaluate performance.
1. Cohesion measures how closely objects in the same cluster are related
2. Separation measures how distinct or separated a cluster is from all the other clusters
Here, g_i refers to cluster i, W is the total number of clusters, x and y are data points, and proximity can be any similarity measure (e.g., cosine similarity). We want the cohesion to be close to 1 and the separation to be close to 0.
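The cohesion and separation formulas on the original slide did not survive extraction, so the sketch below uses one common variant: average pairwise cosine similarity within a cluster (cohesion) and between two clusters (separation). Treat the exact normalization as an assumption, not the chapter's definition:

```python
def cosine(x, y):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x) ** 0.5
    ny = sum(b * b for b in y) ** 0.5
    return dot / (nx * ny)

def cohesion(cluster):
    """Average pairwise similarity of points within one cluster."""
    pairs = [(x, y) for i, x in enumerate(cluster) for y in cluster[i + 1:]]
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)

def separation(cluster_a, cluster_b):
    """Average similarity between points of two different clusters."""
    pairs = [(x, y) for x in cluster_a for y in cluster_b]
    return sum(cosine(x, y) for x, y in pairs) / len(pairs)

g1 = [(1.0, 0.0), (0.9, 0.1)]  # tight cluster near the x-axis
g2 = [(0.0, 1.0), (0.1, 0.9)]  # tight cluster near the y-axis
print(cohesion(g1), separation(g1, g2))  # cohesion near 1, separation near 0
```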
Optimizing Metrics
Performance metrics that act as optimization functions for a data mining algorithm
Outline:
• Sum of Squared Errors
• Preserved Variability
Sum of Squared Errors
Sum of squared errors (SSE) is typically used in clustering algorithms to measure the quality of the clusters obtained. It takes into consideration the distance between each point in a cluster and its cluster center (centroid or some other chosen representative).
For d_j, a point in cluster g_i, where m_i is the cluster center of g_i and W is the total number of clusters, SSE is defined as follows:

SSE = Σ_{i=1..W} Σ_{d_j ∈ g_i} dist(m_i, d_j)²

This value is small when points are close to their cluster center, indicating a good clustering. Similarly, a large SSE indicates a poor clustering. Thus, clustering algorithms aim to minimize SSE.
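The SSE definition can be sketched directly in Python, using the centroid as the representative m_i (function name illustrative):

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its
    cluster centroid, summed over all clusters."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # Centroid serves as the cluster representative m_i.
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        for p in points:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    return total

clusters = [[(0.0, 0.0), (2.0, 0.0)],   # centroid (1, 0)
            [(5.0, 5.0), (5.0, 7.0)]]   # centroid (5, 6)
print(sse(clusters))  # 2.0 + 2.0 = 4.0
```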
Preserved Variability
Preserved variability is typically used in eigenvector-based dimension reduction techniques to quantify the variance preserved by the chosen dimensions. The objective of the dimension reduction technique is to maximize this parameter.
Given that each point is represented in r dimensions and k dimensions are chosen (k << r), with eigenvalues λ1 >= λ2 >= ... >= λ(r-1) >= λr, the preserved variability (PV) is calculated as follows:

PV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λr)

The value of this parameter depends on the number of dimensions chosen: the more included, the higher the value. Choosing all the dimensions will result in the perfect score of 1.
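Preserved variability is a one-line computation once the eigenvalues are known (function name illustrative):

```python
def preserved_variability(eigenvalues, k):
    """Fraction of total variance captured by the top-k eigenvalues.
    `eigenvalues` must be sorted in non-increasing order."""
    return sum(eigenvalues[:k]) / sum(eigenvalues)

lams = [4.0, 3.0, 2.0, 1.0]
print(preserved_variability(lams, 2))  # (4+3)/10 = 0.7
print(preserved_variability(lams, 4))  # all dimensions -> 1.0
```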
Statistical Significance Techniques
• Methods used to assess a p-value for the different performance metrics
Scenario:
– We obtain, say, cohesion = 0.99 for clustering algorithm A. At first look, 0.99 feels like a very good score.
– However, it is possible that the underlying data is structured in such a way that you would get 0.99 no matter how you cluster the data.
– Thus, 0.99 is not very significant. One way to decide that is by using statistical significance estimation.
We will discuss the Monte Carlo procedure next.
Monte Carlo Procedure: Empirical p-value Estimation
The Monte Carlo procedure uses random sampling to assess whether a particular performance metric we obtain could have been attained at random. For example, if we obtain a cohesion score of 0.99 for a cluster of size 5, we would be inclined to think that the cluster is very cohesive. However, this value could have resulted from the nature of the data and not from the algorithm. To test the significance of this 0.99 value we:
1. Sample N (usually 1000) random sets of size 5 from the dataset
2. Recalculate the cohesion for each of the N sets
3. Count R: the number of random sets with value >= 0.99 (the original score of the cluster)
4. The empirical p-value for the cluster of size 5 with score 0.99 is given by R/N
5. We apply a cutoff, say 0.05, to decide if 0.99 is significant
Steps 1-4 are the Monte Carlo method for p-value estimation.
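Steps 1-4 can be sketched in Python. The `statistic` argument stands in for any metric such as cohesion; the mean is used below only as a stand-in, and all names are illustrative:

```python
import random

def empirical_p_value(score, dataset, set_size, statistic,
                      n_samples=1000, seed=0):
    """Monte Carlo p-value: the fraction of random same-size subsets
    whose statistic is at least as large (>=) as the observed score."""
    rng = random.Random(seed)
    r = 0
    for _ in range(n_samples):
        sample = rng.sample(dataset, set_size)  # step 1: random set
        if statistic(sample) >= score:          # steps 2-3: recompute, count
            r += 1
    return r / n_samples                        # step 4: R/N

# Toy example: how unusual is a subset of size 5 with mean >= 90?
data = list(range(100))
p = empirical_p_value(score=90.0, dataset=data, set_size=5,
                      statistic=lambda s: sum(s) / len(s))
print(p)  # small: few random 5-subsets have a mean that high
```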
Model Comparison
Metrics that compare the performance of different algorithms
Scenario:
1) Model 1 provides an accuracy of 70% and Model 2 provides an accuracy of 75%
2) At first look, Model 2 seems better; however, it could be that Model 1 predicts Class 1 better than Class 2
3) Class 1 may indeed be more important than Class 2 for our problem
4) We can use model comparison methods to take this notion of "importance" into consideration when we pick one model over another
Cost-based analysis is an important model comparison method, discussed in the next few slides.
Cost-based Analysis
In real-world applications, certain aspects of model performance are considered more important than others. For example, if a person with cancer is diagnosed as cancer-free, or vice versa, then the prediction model should be especially penalized. This penalty can be introduced in the form of a cost matrix:

Cost Matrix

                        Predicted Class
                        +        -
Actual Class   +        c11      c10
               -        c01      c00

Each cost c_ij is associated with the corresponding confusion matrix entry f_ij (or contingency table entry u_ij): c11 with f11 (or u11), c10 with f10 (or u10), c01 with f01 (or u01), and c00 with f00 (or u00).
Cost-based Analysis: Cost of a Model
The cost and confusion matrices for model M are given below.

Cost Matrix                            Confusion Matrix

              Predicted                              Predicted
              +        -                             +        -
Actual   +    c11      c10             Actual   +    f11      f10
         -    c01      c00                      -    f01      f00

The cost of model M is given as:

Cost(M) = c11*f11 + c10*f10 + c01*f01 + c00*f00
Cost-based Analysis: Comparing Two Models
This analysis is typically used to select one model when we have more than one choice, obtained through different algorithms or different parameters to the learning algorithms.

Cost Matrix

              Predicted
              +        -
Actual   +    -20      100
         -     45      -10

Confusion Matrix of Mx                 Confusion Matrix of My

              Predicted                              Predicted
              +        -                             +        -
Actual   +    4        1               Actual   +    3        2
         -    2        1                        -    2        1

Cost of Mx: 4(-20) + 1(100) + 2(45) + 1(-10) = 100
Cost of My: 3(-20) + 2(100) + 2(45) + 1(-10) = 220
Since C_Mx < C_My, purely based on the cost model, Mx is the better model.
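The cost computation can be checked with a short Python sketch (the book's R package exposes this as costAnalysis; this stand-alone version is only illustrative):

```python
def model_cost(cost_matrix, confusion_matrix):
    """Total cost of a model: sum of c_ij * f_ij over all cells
    (rows = actual class, columns = predicted class)."""
    return sum(c * f
               for c_row, f_row in zip(cost_matrix, confusion_matrix)
               for c, f in zip(c_row, f_row))

cost = [[-20, 100],
        [ 45, -10]]
mx = [[4, 1],
      [2, 1]]
my = [[3, 2],
      [2, 1]]
print(model_cost(cost, mx))  # 100
print(model_cost(cost, my))  # 220
```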
Cost-based Analysis: R code

library(PerformanceMetrics)
data(Mx)
data(My)
data(CostMatrix)
Mx
     [,1] [,2]
[1,]    4    1
[2,]    2    1
My
     [,1] [,2]
[1,]    3    2
[2,]    2    1
costAnalysis(Mx, CostMatrix)
costAnalysis(My, CostMatrix)