MCS 2004 – Multiple Classifier Systems, Cagliari 9–11 June 2004

Random aggregated and bagged ensembles of SVMs:
an empirical bias–variance analysis

Giorgio Valentini
e-mail: valentini@dsi.unimi.it
DSI – Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
Goals
• Developing methods and procedures to estimate the bias–variance decomposition of the error in ensembles of learning machines.
• A quantitative evaluation of the variance reduction property in random aggregated and bagged ensembles (Breiman, 1996).
• A characterization of the bias–variance (BV) decomposition of the error in bagged and random aggregated ensembles of SVMs, comparing the results with the BV decomposition in single SVMs (Valentini and Dietterich, 2004).
• Getting insights into the reasons why the ensemble method Lobag (Valentini and Dietterich, 2003) works.
• Getting insights into the reasons why random subsampling techniques work with large data mining problems (Breiman, 1999; Chawla et al., 2002).
Random aggregated ensembles
Let D = {(x_j, t_j)}, 1 ≤ j ≤ m, be a set of m samples drawn identically and independently from a population U according to P, where P(x, t) is the joint distribution of the data points in U.

Let L be a learning algorithm, and define f_D = L(D) as the predictor produced by L applied to a training set D. The model produces a prediction f_D(x) = y.

Suppose that a sequence of learning sets {D_k} is given, each drawn i.i.d. from the same underlying distribution P. Breiman proposed to aggregate the f_D trained with different samples drawn from U to get a better predictor f_A(x, P).

For classification problems t_j ∈ S ⊂ ℕ, and f_A(x, P) = arg max_j #{k | f_Dk(x) = j}.

As the training sets D are randomly drawn from U, we name the procedure to build f_A random aggregating.
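As a concrete illustration, here is a minimal Python sketch of random aggregating (the toy population, the nearest-centroid base learner, and all names are illustrative, not from the paper): each f_Dk is trained on an independent sample drawn directly from a synthetic population U, and the predictions are combined by majority vote, f_A(x) = arg max_j #{k | f_Dk(x) = j}.

```python
import random
from collections import Counter

random.seed(0)

def draw_from_U(m):
    """Draw m labelled points (x, t) from a toy population U:
    class 0 has x ~ N(-1, 1), class 1 has x ~ N(+1, 1)."""
    return [(random.gauss(2 * t - 1, 1.0), t)
            for t in (random.randint(0, 1) for _ in range(m))]

def train_centroid(D):
    """Base learner L: nearest-centroid classifier f_D = L(D)."""
    means = {}
    for label in (0, 1):
        xs = [x for x, t in D if t == label]
        means[label] = sum(xs) / len(xs) if xs else 0.0
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

# Random aggregating: each f_Dk is trained on a *fresh* sample from U.
predictors = [train_centroid(draw_from_U(50)) for _ in range(25)]

def f_A(x):
    """Majority vote: arg max_j #{k | f_Dk(x) = j}."""
    votes = Counter(f(x) for f in predictors)
    return votes.most_common(1)[0][0]

print(f_A(-2.0), f_A(2.0))
```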
Random aggregation reduces variance
Considering regression problems, if T and X are random variables having joint distribution P, the expected squared loss EL for the single predictor f_D(X) is:

    EL = E_D[E_{T,X}[(T − f_D(X))²]]

while the expected squared loss EL_A for the aggregated predictor is:

    EL_A = E_{T,X}[(T − f_A(X))²]

Breiman showed that EL ≥ EL_A. This inequality depends on the instability of the predictions, that is, on how unequal the two sides of the following inequality are:

    E_D[f_D(X)]² ≤ E_D[f_D²(X)]

There is a strict relationship between the instability and the variance of the base predictor. Indeed, the variance V(X) of the base predictor is:

    V(X) = E_D[(f_D(X) − E_D[f_D(X)])²] = E_D[f_D²(X)] − E_D[f_D(X)]²

Breiman also showed that in classification problems, as in regression, aggregating “good” predictors can lead to better performance, as long as the base predictor is unstable, whereas, unlike regression, aggregating poor predictors can lower performance.
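These identities can be checked numerically. The following sketch (a toy regression setup, not from the paper) estimates E_D[f_D(X)], E_D[f_D²(X)] and V(X) by Monte Carlo at a single test point, and verifies that EL − EL_A equals the variance of the base predictor, so that EL ≥ EL_A.

```python
import random

random.seed(1)

# Toy regression setup: target T = 1.0 at a fixed test point X.
# The base predictor f_D returns the mean target of a small noisy
# training sample D, so it varies with D (it is "unstable").
T = 1.0

def f_D():
    D = [T + random.gauss(0, 1) for _ in range(5)]
    return sum(D) / len(D)

preds = [f_D() for _ in range(20000)]

mean_f  = sum(preds) / len(preds)                  # E_D[f_D(X)]
mean_f2 = sum(p * p for p in preds) / len(preds)   # E_D[f_D^2(X)]
variance = mean_f2 - mean_f ** 2                   # V(X)

EL   = sum((T - p) ** 2 for p in preds) / len(preds)  # single predictor
EL_A = (T - mean_f) ** 2          # aggregated predictor f_A(X) = E_D[f_D(X)]

# Algebraically EL - EL_A = V(X) >= 0, hence EL >= EL_A.
print(round(variance, 3), round(EL - EL_A, 3))
```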
How much does the variance reduction property hold for bagging too?

Breiman theoretically showed that random aggregating reduces variance.

Bagging is an approximation of random aggregating, for at least two reasons:

1. Bootstrap samples are not “real” data samples: they are drawn from a data set D, which is in turn a sample from the population U. On the contrary, f_A uses samples drawn directly from U.
2. Bootstrap samples are drawn from D according to a uniform probability distribution, which is only an approximation of the unknown true distribution P.

This raises two questions:

1. Does the variance reduction property hold for bagging too?
2. Can we provide a quantitative estimate of the variance reduction both in random aggregating and bagging?
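The first reason above — that bootstrap samples recycle the finite data set D — can be made concrete with a quick simulation (illustrative, assuming nothing beyond the definition of a bootstrap replicate): in expectation only 1 − (1 − 1/m)^m ≈ 1 − 1/e ≈ 63.2% of the points of D appear in each bootstrap sample, whereas random aggregating keeps drawing fresh points from U.

```python
import random

random.seed(3)

m = 10_000
D = list(range(m))  # indices of the m points in the observed data set D

# One bootstrap replicate: m uniform draws *with replacement* from D.
replicate = [random.choice(D) for _ in range(m)]
unique_fraction = len(set(replicate)) / m

# In expectation 1 - (1 - 1/m)^m ~ 0.632 of the points of D appear in a
# replicate; the remaining draws are repeated information.
print(round(unique_fraction, 3))
```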
A quantitative estimate of the bias–variance decomposition of the error in random aggregated (RA) and bagged ensembles of learning machines

• We developed procedures to quantitatively evaluate the bias–variance decomposition of the error according to Domingos' unified bias–variance theory (Domingos, 2000).
• We proposed the following basic techniques (Valentini, 2003):
  1. Out-of-bag or cross-validation estimates (when only small samples are available)
  2. Hold-out techniques (when relatively large data sets are available)
• In order to get a reliable estimate of the error, we applied the second technique, evaluating the bias–variance decomposition using quite large test sets.
• We summarize here the two main experimental steps needed to perform bias–variance analysis with resampling-based ensembles:
  1. Procedures to generate data for ensemble training
  2. Bias–variance decomposition of the error on a separate test set
Procedure to generate training samples for random aggregated ensembles [pseudocode figure]

Procedure to generate training samples for bagged ensembles [pseudocode figure]
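Since the original pseudocode figures are not reproduced here, the two generation procedures can be sketched as follows (a hedged reconstruction: function names and the synthetic population are illustrative, not the paper's own code). Random aggregation samples each training set directly from the population; bagging bootstraps the single available data set D.

```python
import random

def ra_training_sets(population, n_sets, m, rng):
    """Random aggregated ensembles: each training set is a fresh
    sample of size m drawn directly from the population U."""
    return [rng.sample(population, m) for _ in range(n_sets)]

def bagged_training_sets(D, n_sets, rng):
    """Bagged ensembles: each training set is a bootstrap replicate
    of the single available data set D (draws with replacement)."""
    return [[rng.choice(D) for _ in D] for _ in range(n_sets)]

rng = random.Random(4)
U = list(range(100_000))   # stand-in for a large population
D = rng.sample(U, 200)     # the one observed data set

ra  = ra_training_sets(U, n_sets=10, m=200, rng=rng)
bag = bagged_training_sets(D, n_sets=10, rng=rng)

# Bagged sets never contain points outside D; RA sets can.
print(len(ra), len(bag))
```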
Procedure to estimate the bias–variance decomposition of the error in ensembles of learning machines [pseudocode figure]
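The estimation step can be sketched in Python (a simplified reading of Domingos' zero-one-loss decomposition, assuming noise-free targets so the optimal prediction coincides with the test label; all names are illustrative). Given the predictions of many models trained on resampled sets, it computes the bias of the main (majority) prediction and splits the variance into unbiased and biased components.

```python
from collections import Counter

def bias_variance_01(predictions, targets):
    """Estimate the bias-variance decomposition of the zero-one loss
    on a test set, assuming noise-free targets (optimal prediction
    y* = t). predictions[k][i] is model k's prediction on point i."""
    n_models = len(predictions)
    n_points = len(targets)
    bias = v_unbiased = v_biased = 0.0
    for i, t in enumerate(targets):
        votes = Counter(p[i] for p in predictions)
        y_main = votes.most_common(1)[0][0]     # main prediction
        b = 1.0 if y_main != t else 0.0         # bias at x_i (0 or 1)
        v = 1.0 - votes[y_main] / n_models      # P(f_D(x) != y_main)
        bias += b
        if b == 0.0:
            v_unbiased += v   # variance increases the error here
        else:
            v_biased += v     # variance *decreases* the error here
    bias /= n_points
    v_unbiased /= n_points
    v_biased /= n_points
    # average zero-one error = bias + unbiased var - biased var
    return bias, v_unbiased - v_biased, v_unbiased, v_biased

# Tiny worked example: 3 models, 2 test points.
preds = [[0, 1],
         [0, 0],
         [1, 0]]
print(bias_variance_01(preds, targets=[0, 1]))
```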
Comparison of bias–variance decomposition of the error in random aggregated (RA) and bagged ensembles of SVMs on 7 two-class classification problems

• Results represent changes relative to single SVMs (e.g. zero change means no difference). Square-labeled lines refer to random aggregated ensembles, triangle-labeled lines to bagged ensembles.
• In random aggregated ensembles the error decreases from 15 to 70% w.r.t. single SVMs, while in bagged ensembles the error decreases from 0 to 15%, depending on the data set.
• Variance is significantly reduced in RA ensembles (about 90%), while in bagging the variance reduction is quite limited compared to the RA decrement (between 0 and 35%). No substantial bias reduction is registered.

[Figures: Gaussian kernels; Linear kernels]
Characterization of bias–variance decomposition of the error in random aggregated ensembles of SVMs (Gaussian kernel)

• Lines labeled with crosses: single SVMs
• Lines labeled with triangles: RA SVM ensembles
Lobag works when unbiased variance is relatively high
• Lobag (Low bias bagging) is a variant of bagging that uses low-bias base learners selected through bias–variance analysis procedures (Valentini and Dietterich, 2003).
• Our experiments with bagging show the reasons why Lobag works: bagging lowers variance, but the bias remains substantially unchanged. Hence, by selecting low-bias base learners, Lobag reduces both bias (through bias–variance analysis) and variance (through classical aggregation techniques).
• Valentini and Dietterich experimentally showed that Lobag is effective, with SVMs as base learners, when small-sized samples are used, that is, when the variance due to the reduced cardinality of the available data is relatively high. But when we have relatively large data sets, we may expect that Lobag does not outperform bagging (because in this case, on average, the unbiased variance will be relatively low).
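A minimal sketch of the selection step behind Lobag (illustrative only: the toy data, the threshold-learner family, and the use of test labels as a stand-in for the optimal prediction are assumptions, not the authors' implementation): estimate the bias of each candidate base learner from the main prediction of bootstrap models, pick the lowest-bias candidate, then bag it as usual.

```python
import random
from collections import Counter

# Toy data: x ~ N(2t - 1, 1), label t in {0, 1}; Bayes threshold is 0.
def sample(m, rng):
    ts = [rng.randint(0, 1) for _ in range(m)]
    return [(rng.gauss(2 * t - 1, 1.0), t) for t in ts]

def train_threshold(D, shift):
    """Hypothetical base-learner family, indexed by the hyperparameter
    `shift`: classify by the midpoint of the class means, offset by
    `shift`. Large |shift| gives a strongly biased learner."""
    m0 = [x for x, t in D if t == 0]
    m1 = [x for x, t in D if t == 1]
    thr = (sum(m0) / len(m0) + sum(m1) / len(m1)) / 2 + shift
    return lambda x: int(x > thr)

def estimated_bias(shift, D, test, n_models, rng):
    """Bias of the main (majority) prediction over bootstrap models,
    as in the bias-variance estimation procedure."""
    models = [train_threshold([rng.choice(D) for _ in D], shift)
              for _ in range(n_models)]
    wrong = 0
    for x, t in test:
        main = Counter(f(x) for f in models).most_common(1)[0][0]
        wrong += main != t
    return wrong / len(test)

rng = random.Random(5)
D, test = sample(100, rng), sample(2000, rng)

# Lobag step 1: pick the lowest-bias base learner ...
shifts = [-1.5, 0.0, 1.5]
best = min(shifts, key=lambda s: estimated_bias(s, D, test, 31, rng))
# ... step 2 would then bag that selected learner as usual.
print(best)
```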
Why do random subsampling techniques work with large databases?

• Breiman proposed random subsampling techniques for classification in large databases, using decision trees as base learners (Breiman, 1999), and these techniques have also been successfully applied in distributed environments (Chawla et al., 2002).
• Random aggregating can also be interpreted as a technique to draw small subsamples from a large population to train the base learners, and then to aggregate them, e.g. by majority voting.
• Our experiments on random aggregated ensembles show that the variance component of the error is strongly reduced, while the bias remains unchanged or is lowered, getting insights into the reasons why random subsampling techniques work with large data mining problems. In particular, our experimental analysis suggests applying SVMs trained on small subsamples when large databases are available, or when the data are fragmented across distributed systems.
Conclusions
• We showed how to apply bias–variance decomposition techniques to the analysis of bagged and random aggregated ensembles of learning machines.
• These techniques have been applied to the analysis of bagged and random aggregated ensembles of SVMs, but can be directly applied to a large set of ensemble methods*.
• The experimental analysis shows that random aggregated ensembles significantly reduce the variance component of the error w.r.t. single SVMs, but this property only partially holds for bagged ensembles.
• The empirical bias–variance analysis also gives insights into the reasons why Lobag works, while highlighting some limitations of the Lobag approach.
• The bias–variance analysis of random aggregated ensembles also highlights the reasons for their successful application to large-scale data mining problems.

* The C++ classes and applications to perform BV analysis are freely available at: http://homes.dsi.unimi.it/~valenti/sw/NEURObjects
References
• Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123–140
• Breiman, L.: Pasting Small Votes for Classification in Large Databases and On-Line. Machine Learning 36 (1999) 85–103
• Chawla, N., Hall, L., Bowyer, K., Moore, T., Kegelmeyer, W.: Distributed pasting of small votes. In: MCS 2002, Cagliari, Italy. Vol. 2364 of Lecture Notes in Computer Science, Springer-Verlag (2002) 52–61
• Domingos, P.: A Unified Bias–Variance Decomposition for Zero-One and Squared Loss. In: Proc. 17th National Conference on Artificial Intelligence, Austin, TX, AAAI Press (2000) 564–569
• Valentini, G., Dietterich, T.G.: Low Bias Bagged Support Vector Machines. ICML 2003, pages 752–759, Washington D.C., USA (2003). AAAI Press
• Valentini, G.: Ensemble methods based on bias–variance analysis. PhD thesis, DISI, Università di Genova, Italy (2003), ftp://ftp.disi.unige.it/person/ValentiniG/Tesi/finalversion/vale-th-2003-04.pdf
• Valentini, G., Dietterich, T.G.: Bias–variance analysis of Support Vector Machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research (2004) (accepted for publication)