
MCS 2004 - Multiple Classifier Systems, Cagliari, 9-11 June 2004

Random aggregated and bagged ensembles of SVMs:
an empirical bias-variance analysis

Giorgio Valentini
e-mail: valentini@dsi.unimi.it

DSI - Dipartimento di Scienze dell’Informazione
Università degli Studi di Milano


Goals

- Developing methods and procedures to estimate the bias-variance decomposition of the error in ensembles of learning machines.

- A quantitative evaluation of the variance-reduction property in random aggregated and bagged ensembles (Breiman, 1996).

- A characterization of the bias-variance (BV) decomposition of the error in bagged and random aggregated ensembles of SVMs, comparing the results with the BV decomposition in single SVMs (Valentini and Dietterich, 2004).

- Getting insight into the reasons why the ensemble method Lobag (Valentini and Dietterich, 2003) works.

- Getting insight into the reasons why random subsampling techniques work with large data mining problems (Breiman, 1999; Chawla et al., 2002).



Random aggregated ensembles

Let D = {(x_j, t_j)}, 1 ≤ j ≤ m, be a set of m samples drawn independently and identically from a population U according to P, where P(x, t) is the joint distribution of the data points in U.

Let L be a learning algorithm, and define f_D = L(D) as the predictor produced by L applied to a training set D. The model produces a prediction f_D(x) = y.

Suppose that a sequence of learning sets {D_k} is given, each drawn i.i.d. from the same underlying distribution P.

Breiman proposed to aggregate the predictors f_D trained on different samples drawn from U, to get a better predictor f_A(x, P).

For classification problems t_j ∈ S ⊂ N, and

    f_A(x, P) = arg max_j |{ k : f_{D_k}(x) = j }|

As the training sets D are randomly drawn from U, we call the procedure used to build f_A random aggregating.
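The arg max rule above is simply a majority vote over the base classifiers. A minimal sketch of this aggregation step (NumPy-based; the function name and array layout are illustrative, assuming non-negative integer class labels):

```python
import numpy as np

def random_aggregate_predict(base_predictions):
    """Majority-vote aggregation f_A over base classifiers.

    base_predictions: int array of shape (n_classifiers, n_points),
    where base_predictions[k, i] = f_{D_k}(x_i).
    Returns, for each point, the class j maximizing |{k : f_{D_k}(x) = j}|.
    """
    n_classifiers, n_points = base_predictions.shape
    aggregated = np.empty(n_points, dtype=int)
    for i in range(n_points):
        votes = np.bincount(base_predictions[:, i])
        aggregated[i] = np.argmax(votes)  # ties broken toward the smaller label
    return aggregated
```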




Random aggregation reduces variance

Considering regression problems, if T and X are random variables having joint distribution P, the expected squared loss EL for the single predictor f_D(X) is:

    EL = E_D[ E_{T,X}[ (T - f_D(X))^2 ] ]

while the expected squared loss EL_A for the aggregated predictor is:

    EL_A = E_{T,X}[ (T - f_A(X))^2 ]

Breiman showed that EL ≥ EL_A. This inequality depends on the instability of the predictions, that is, on how unequal the two sides of the following relation are:

    E_D[f_D(X)]^2 ≤ E_D[f_D^2(X)]

There is a strict relationship between the instability and the variance of the base predictor. Indeed, the variance V(X) of the base predictor is:

    V(X) = E_D[(f_D(X) - E_D[f_D(X)])^2] = E_D[f_D^2(X)] - E_D[f_D(X)]^2

Breiman also showed that in classification problems, as in regression, aggregating good predictors can lead to better performance, as long as the base predictor is unstable; whereas, unlike regression, aggregating poor predictors can lower performance.
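The inequality EL ≥ EL_A can be checked empirically. Below is a small Monte Carlo sketch (a toy setup of my own, not the authors' experiment): many training sets D are drawn from a synthetic population, an intentionally unstable regressor is fit to each, and the loss of the averaged (aggregated) prediction is compared with the average single-predictor loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # deliberately unstable base regressor

rng = np.random.default_rng(0)

def sample_set(m=30):
    """Draw a training set D of size m from the 'population' P(x, t)."""
    x = rng.uniform(-3, 3, size=(m, 1))
    t = np.sin(x).ravel() + rng.normal(scale=0.3, size=m)
    return x, t

# A fixed test sample approximating the expectation E_{T,X}[.]
x_test = rng.uniform(-3, 3, size=(2000, 1))
t_test = np.sin(x_test).ravel() + rng.normal(scale=0.3, size=2000)

n_sets = 200
preds = np.empty((n_sets, len(x_test)))
single_losses = np.empty(n_sets)
for k in range(n_sets):
    x, t = sample_set()
    f_D = DecisionTreeRegressor().fit(x, t)
    preds[k] = f_D.predict(x_test)
    single_losses[k] = np.mean((t_test - preds[k]) ** 2)

EL = single_losses.mean()                            # E_D[E_{T,X}[(T - f_D(X))^2]]
EL_A = np.mean((t_test - preds.mean(axis=0)) ** 2)   # loss of the averaged (aggregated) predictor
print(f"EL = {EL:.3f}  >=  EL_A = {EL_A:.3f}")
```

For regression the aggregated predictor is the average of the base predictions, so the gap between EL and EL_A directly reflects the variance (instability) of the base regressor.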



How much does the variance-reduction property hold for bagging too?

Breiman theoretically showed that random aggregating reduces variance.

Bagging is an approximation of random aggregating, for at least two reasons:

1. Bootstrap samples are not "real" data samples: they are drawn from a data set D, which is in turn a sample from the population U. On the contrary, f_A uses samples drawn directly from U.

2. Bootstrap samples are drawn from D according to a uniform probability distribution, which is only an approximation of the unknown true distribution P.

Two questions follow:

1. Does the variance-reduction property hold for bagging too?

2. Can we provide a quantitative estimate of the variance reduction both in random aggregating and bagging?



A quantitative estimate of the bias-variance decomposition of the error
in random aggregated (RA) and bagged ensembles of learning machines

- We developed procedures to quantitatively evaluate the bias-variance decomposition of the error according to Domingos' unified bias-variance theory (Domingos, 2000).

- We proposed three basic techniques (Valentini, 2003):
  1. Out-of-bag or cross-validation estimates (when only small samples are available)
  2. Hold-out techniques (when relatively large data sets are available)

- In order to get a reliable estimate of the error, we applied the second technique, evaluating the bias-variance decomposition on quite large test sets.

- We summarize here the two main experimental steps needed to perform bias-variance analysis with resampling-based ensembles:
  1. Procedures to generate data for ensemble training
  2. Bias-variance decomposition of the error on a separate test set



Procedure to generate training samples for random aggregated ensembles

Procedure to generate training samples for bagged ensembles
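As a hedged sketch of the two generation schemes (names and parameters are illustrative, not the original pseudocode): random aggregation draws each training set of size m directly from a large available data set standing in for the population U, while bagging draws bootstrap replicates of a single training set D.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_aggregated_samples(X_U, y_U, n_sets, m):
    """Draw n_sets training sets of size m directly from the large
    data set (X_U, y_U), which plays the role of the population U."""
    for _ in range(n_sets):
        idx = rng.choice(len(X_U), size=m, replace=False)
        yield X_U[idx], y_U[idx]

def bagged_samples(X_D, y_D, n_sets):
    """Draw n_sets bootstrap samples (uniform sampling with replacement)
    from a single training set D = (X_D, y_D)."""
    m = len(X_D)
    for _ in range(n_sets):
        idx = rng.choice(m, size=m, replace=True)
        yield X_D[idx], y_D[idx]
```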



Procedure to estimate the bias-variance decomposition of the error
in ensembles of learning machines
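A hedged sketch of how such an estimate can be computed for the 0-1 loss on a hold-out test set, following Domingos' definitions (main prediction = most voted class over the resampled training sets, bias = disagreement between main prediction and target, unbiased/biased variance measured on unbiased/biased test points). Noise is assumed to be zero and all names are illustrative, assuming integer class labels:

```python
import numpy as np

def bias_variance_01(predictions, y_true):
    """Domingos-style bias-variance decomposition for the 0-1 loss.

    predictions: int array (n_models, n_test), predictions of models
                 trained on differently resampled training sets.
    y_true:      int array (n_test,), hold-out test labels (zero noise assumed).
    Returns average bias, unbiased variance, biased variance, net variance.
    """
    n_models, n_test = predictions.shape
    # Main prediction: the most voted class for each test point.
    y_main = np.array([np.argmax(np.bincount(predictions[:, i]))
                       for i in range(n_test)])
    bias = (y_main != y_true).astype(float)            # per-point bias in {0, 1}
    variance = np.mean(predictions != y_main, axis=0)  # P_D(f_D(x) != y_main)
    unbiased_var = np.mean(np.where(bias == 0, variance, 0.0))
    biased_var = np.mean(np.where(bias == 1, variance, 0.0))
    net_var = unbiased_var - biased_var
    return bias.mean(), unbiased_var, biased_var, net_var

# For two-class problems with zero noise, the average 0-1 error
# decomposes as: error = bias + unbiased variance - biased variance.
```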



Comparison of the bias-variance decomposition of the error in random aggregated (RA)
and bagged ensembles of SVMs on 7 two-class classification problems

- Results represent changes relative to single SVMs (e.g. zero change means no difference). Square-labeled lines refer to random aggregated ensembles, triangle-labeled lines to bagged ensembles.

- In random aggregated ensembles the error decreases from 15 to 70% w.r.t. single SVMs, while in bagged ensembles the error decreases from 0 to 15%, depending on the data set.

- Variance is significantly reduced in RA ensembles (about 90%), while in bagging the variance reduction is quite limited compared to the RA decrement (between 0 and 35%). No substantial bias reduction is registered.

[Plots: Gaussian kernels and linear kernels]



Characterization of the bias-variance decomposition of the error in
random aggregated ensembles of SVMs (Gaussian kernel)

- Lines labeled with crosses: single SVMs
- Lines labeled with triangles: RA SVM ensembles



Lobag works when unbiased variance is relatively high

- Lobag (Low bias bagging) is a variant of bagging that uses low-bias base learners selected through bias-variance analysis procedures (Valentini and Dietterich, 2003).

- Our experiments with bagging show the reasons why Lobag works: bagging lowers variance, but the bias remains substantially unchanged. Hence, by selecting low-bias base learners, Lobag reduces both bias (through bias-variance analysis) and variance (through classical aggregation techniques).

- Valentini and Dietterich experimentally showed that Lobag is effective, with SVMs as base learners, when small-sized samples are used, that is, when the variance due to the reduced cardinality of the available data is relatively high. But when we have relatively large data sets, we may expect that Lobag does not outperform bagging (because in this case, on average, the unbiased variance will be relatively low).
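A rough schematic of the Lobag idea, under the findings above (this is my own simplified sketch, not the authors' published procedure: estimate the bias of candidate SVM hyperparameters via a bootstrap-based main-prediction estimate, pick the lowest-bias setting, then bag it to cut the remaining variance):

```python
import numpy as np
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)

def estimate_bias(model, X, y, n_boot=20):
    """Rough bias estimate: train on bootstrap replicates of (X, y) and
    measure how often the main (majority) prediction misses the label.
    A simplification of the out-of-bag estimate used by Lobag."""
    preds = np.empty((n_boot, len(y)), dtype=int)
    for b in range(n_boot):
        idx = rng.choice(len(y), size=len(y), replace=True)
        preds[b] = clone(model).fit(X[idx], y[idx]).predict(X)
    y_main = np.array([np.argmax(np.bincount(preds[:, i])) for i in range(len(y))])
    return np.mean(y_main != y)

def lobag_like(X, y, candidate_params, n_estimators=50):
    """Schematic Lobag: select the lowest-bias SVM, then bag it."""
    best = min(candidate_params, key=lambda p: estimate_bias(SVC(**p), X, y))
    return BaggingClassifier(SVC(**best), n_estimators=n_estimators).fit(X, y)
```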



Why do random subsampling techniques work with large databases?

- Breiman proposed random subsampling techniques for classification in large databases, using decision trees as base learners (Breiman, 1999), and these techniques have also been successfully applied in distributed environments (Chawla et al., 2002).

- Random aggregating can also be interpreted as a technique to draw small subsamples from a large population, train the base learners on them, and then aggregate the learners, e.g. by majority voting.

- Our experiments on random aggregated ensembles show that the variance component of the error is strongly reduced, while the bias remains unchanged or is lowered, giving insight into the reasons why random subsampling techniques work with large data mining problems. In particular, our experimental analysis suggests applying SVMs trained on small subsamples when large databases are available or when they are fragmented in distributed systems.
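Tying these pieces together, a minimal sketch of this recommendation (train SVMs on small random subsamples of a large data set, then combine them by majority vote; scikit-learn based, all names and sizes are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def subsampled_svm_ensemble(X_large, y_large, X_test, n_models=25, m=500):
    """Train SVMs on small random subsamples of a large data set and
    aggregate their predictions by majority voting."""
    preds = np.empty((n_models, len(X_test)), dtype=int)
    for k in range(n_models):
        idx = rng.choice(len(X_large), size=m, replace=False)
        preds[k] = SVC(kernel="rbf").fit(X_large[idx], y_large[idx]).predict(X_test)
    # Majority vote over the n_models base SVMs.
    return np.array([np.argmax(np.bincount(preds[:, i])) for i in range(len(X_test))])
```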





Conclusions

- We showed how to apply bias-variance decomposition techniques to the analysis of bagged and random aggregated ensembles of learning machines.

- These techniques have been applied to the analysis of bagged and random aggregated ensembles of SVMs, but they can be directly applied to a large set of ensemble methods.*

- The experimental analysis shows that random aggregated ensembles significantly reduce the variance component of the error w.r.t. single SVMs, but this property only partially holds for bagged ensembles.

- The empirical bias-variance analysis also gives insight into the reasons why Lobag works, while highlighting some limitations of the Lobag approach.

- The bias-variance analysis of random aggregated ensembles also highlights the reasons for their successful application to large-scale data mining problems.

* The C++ classes and applications to perform BV analysis are freely available at:
http://homes.dsi.unimi.it/~valenti/sw/NEURObjects



References

- Breiman, L.: Bagging predictors. Machine Learning 24 (1996) 123-140
- Breiman, L.: Pasting Small Votes for Classification in Large Databases and On-Line. Machine Learning 36 (1999) 85-103
- Chawla, N., Hall, L., Bowyer, K., Moore, T., Kegelmeyer, W.: Distributed pasting of small votes. In: MCS 2002, Cagliari, Italy. Vol. 2364 of Lecture Notes in Computer Science, Springer-Verlag (2002) 52-61
- Domingos, P.: A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. In: Proc. 17th National Conference on Artificial Intelligence, Austin, TX, AAAI Press (2000) 564-569
- Valentini, G., Dietterich, T.G.: Low Bias Bagged Support Vector Machines. In: ICML 2003, pages 752-759, Washington D.C., USA (2003). AAAI Press
- Valentini, G.: Ensemble methods based on bias-variance analysis. PhD thesis, DISI, Università di Genova, Italy (2003), ftp://ftp.disi.unige.it/person/ValentiniG/Tesi/finalversion/vale-th-2003-04.pdf
- Valentini, G., Dietterich, T.G.: Bias-variance analysis of Support Vector Machines for the development of SVM-based ensemble methods. Journal of Machine Learning Research (2004) (accepted for publication)