Boosted Decision Trees, a Powerful Event Classifier


Byron Roe, Haijun Yang, Ji Zhu

University of Michigan


Outline



What is Boosting?

Comparisons of ANN and Boosting for the MiniBooNE experiment

Comparisons of Boosting and Other Classifiers

Some tested modifications to Boosting and miscellaneous


Training and Testing Events


Both the ANN and the boosting algorithms use a set of known events to train the algorithm.

It would be biased to use the same set to estimate the accuracy of the selection, since the algorithm has been trained on that specific sample.

A new set, the testing set of events, is used to test the algorithm.

All results quoted here are for the testing set.



Boosted Decision Trees


What is a decision tree?


What is “boosting the decision trees”?


Two algorithms for boosting.


Decision Tree


Go through all PID variables and find the best variable and value at which to split the events.

For each of the two subsets, repeat the process.

Proceeding in this way, a tree is built.

Ending nodes are called leaves.

(Figure: example tree, with leaves labeled Background/Signal.)


Select Signal and Background Leaves

Assume an equal weight of signal and background training events.

If more than 1/2 of the weight of a leaf corresponds to signal, it is a signal leaf; otherwise it is a background leaf.

Signal events on a background leaf or background events on a signal leaf are misclassified.


Criterion for “Best” Split


Purity, P, is the fraction of the weight of a leaf due to signal events.

Gini: note that the Gini index is 0 for a leaf that is all signal or all background.

The criterion is to minimize gini_left + gini_right of the two children from a parent node.
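The Gini expression itself appeared as an equation image on the slide; a standard form consistent with this description (and with the Roe et al. reference at the end), where W_i are the weights of the n events on the leaf and P is its purity, is

    Gini = P * (1 - P) * (W_1 + W_2 + ... + W_n)

which is 0 when P = 0 (all background) or P = 1 (all signal).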


Criterion for Next Branch to Split


Pick the branch to maximize the change in gini:

Criterion = gini_parent - gini_left_child - gini_right_child
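As an illustration of the splitting criterion on these two slides, here is a minimal Python sketch (not the MiniBooNE FORTRAN/C++ code) of a weighted Gini computation and a brute-force search for the best split of one node:

    import numpy as np

    def gini(weights, is_signal):
        """Weighted Gini index of one node: P * (1 - P) * (total weight)."""
        total = weights.sum()
        if total == 0.0:
            return 0.0
        purity = weights[is_signal].sum() / total
        return purity * (1.0 - purity) * total

    def best_split(x, weights, is_signal):
        """Brute-force search for the variable and cut value minimizing
        gini_left_child + gini_right_child (equivalently, maximizing the
        change in gini relative to the parent node)."""
        parent = gini(weights, is_signal)
        best_var, best_cut, best_children = None, None, parent
        for var in range(x.shape[1]):                  # loop over PID variables
            for cut in np.unique(x[:, var]):           # candidate cut values
                left = x[:, var] < cut
                children = (gini(weights[left], is_signal[left]) +
                            gini(weights[~left], is_signal[~left]))
                if children < best_children:
                    best_var, best_cut, best_children = var, cut, children
        return best_var, best_cut, parent - best_children   # last entry = change in gini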


Decision Trees


This is a decision tree.

Decision trees have been known for some time, but they are often unstable; a small change in the training sample can produce a large difference.


Boosting the Decision Tree


Give the training events misclassified under this procedure a higher weight.

Continuing in this way, build perhaps 1000 trees and take a weighted average of the results (+1 if the event lands on a signal leaf, -1 if on a background leaf).
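A minimal Python sketch of this loop (not the MiniBooNE code; train_tree is a hypothetical helper returning a fitted tree whose predict method gives the +1/-1 leaf labels, and the reweighting shown is the AdaBoost form defined a few slides below):

    import numpy as np

    def boost(x, y, n_trees=1000,
              alpha_fn=lambda err: 0.5 * np.log((1.0 - err) / err)):  # AdaBoost, beta = 1/2
        """Generic boosting loop: misclassified events get a higher weight after each tree.

        x: (n_events, n_variables) PID variables;  y: +1 for signal, -1 for background.
        """
        w = np.full(len(y), 1.0 / len(y))        # start from equal event weights
        trees, alphas = [], []
        for m in range(n_trees):
            tree = train_tree(x, y, w)           # hypothetical tree-training helper
            pred = tree.predict(x)               # +1 on signal leaves, -1 on background leaves
            wrong = pred != y
            err = w[wrong].sum() / w.sum()       # err_m = weight wrong / total weight
            alpha = alpha_fn(err)
            w[wrong] *= np.exp(alpha)            # boost the weight of misclassified events
            w /= w.sum()                         # renormalize
            trees.append(tree)
            alphas.append(alpha)
        # score an event by a weighted sum of the +/-1 outputs of all trees
        return lambda x_new: sum(a * t.predict(x_new) for a, t in zip(alphas, trees))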


Two Commonly Used Algorithms for Changing Weights


1. AdaBoost



2. Epsilon boost (shrinkage)


Definitions


x_i = set of particle ID variables for event i

y_i = 1 if event i is signal, -1 if background

T_m(x_i) = 1 if event i lands on a signal leaf of tree m, and -1 if the event lands on a background leaf.


AdaBoost


Define err_m = weight wrong/total weight


Increase weight for misidentified events
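The weight-update formulas were equation images on the slide; reconstructed from the worked example near the end of the talk (beta = 1/2, err = 0.4 giving alpha = 0.203), the AdaBoost update is

    alpha_m = beta * ln((1 - err_m) / err_m)
    w_i -> w_i * exp(alpha_m)      for each misclassified event i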


Scoring events with AdaBoost


Renormalize weights




Score by summing over trees
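The renormalization and scoring formulas were also equation images; in the notation above, the standard AdaBoost forms (a hedged reconstruction) are

    w_i -> w_i / (sum over j of w_j)           (weights again sum to 1)
    T(x) = sum over m of alpha_m * T_m(x)

so a large positive score T(x) is signal-like and a large negative score is background-like.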


Epsilon Boost (shrinkage)


After tree m, change the weight of the misclassified events; a typical epsilon is ~0.01 (0.03). For misclassified events:



Renormalize weights




Score by summing over trees
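The corresponding epsilon-boost equations were also images; reconstructed from the worked example near the end of the talk (wrong-event weights increased by exp(2 x 0.01)), the usual shrinkage form is

    w_i -> w_i * exp(2 * epsilon)      for each misclassified event i
    T(x) = sum over m of epsilon * T_m(x)

with the weights renormalized after each tree.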


Unweighted and Weighted Misclassified Event Rate vs. Number of Trees



Comparison of methods


Epsilon boost changes the weights a little at a time.

Let y = 1 for signal, -1 for background, and let F be the score summed over trees.

AdaBoost can be shown to optimize each change of weights so that exp(-yF) is minimized.

The optimum value is F = 1/2 log odds that y is 1 given x.
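Written out (following the Friedman, Hastie, and Tibshirani reference at the end), the minimizer of the expectation of exp(-yF) is

    F(x) = (1/2) * ln( P(y = 1 | x) / P(y = -1 | x) )

i.e. one half of the log odds that the event is signal given its PID variables x.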


The MiniBooNE Collaboration


A 40-foot-diameter tank of mineral oil, surrounded by about 1280 photomultipliers. Both Cherenkov and scintillation light are collected. The geometrical shape and timing of the light distinguish event types.


Tests of Boosting Parameters


45 leaves seemed to work well for our application.

1000 trees was sufficient (or over-sufficient).

AdaBoost with beta about 0.5 and epsilon boost with epsilon about 0.03 worked well, although small changes made little difference.

For other applications these numbers may need adjustment.

For MiniBooNE, around 100 variables were needed for best results. Too many variables degrades performance.

Relative ratio = constant * (fraction of background kept) / (fraction of signal kept). Smaller is better!


Effects of Number of Leaves and Number of Trees

Smaller is better! R = constant * (fraction of background kept) / (fraction of signal kept).


Number of Feature Variables in Boosting

In recent trials we have used 182 variables. Boosting worked well.

However, by looking at the frequency with which each variable was used as a splitting variable, it was possible to reduce the number to 86 without loss of sensitivity. Several methods for choosing variables were tried, but this worked as well as any.

After selecting by frequency of use as a splitting variable, some further improvement may be obtained by looking at the correlations between variables.
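As an illustration of this frequency-based selection, here is a minimal sketch using scikit-learn's generic AdaBoost and decision-tree classes on synthetic data (not the MiniBooNE code or variables); it counts how often each variable is chosen as a splitting variable across all trees:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    # synthetic stand-in for the PID variables, purely for illustration
    x, y = make_classification(n_samples=5000, n_features=30, n_informative=10,
                               random_state=0)

    model = AdaBoostClassifier(DecisionTreeClassifier(max_leaf_nodes=45),
                               n_estimators=200)
    model.fit(x, y)

    # count how often each variable is used to split a node, over all trees
    counts = np.zeros(x.shape[1], dtype=int)
    for tree in model.estimators_:
        feat = tree.tree_.feature                  # -2 marks leaf nodes
        counts += np.bincount(feat[feat >= 0], minlength=x.shape[1])

    ranked = np.argsort(counts)[::-1]              # most frequently used variables first
    print("variables ranked by split frequency:", ranked[:10])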


Effect of Number of PID Variables


Comparison of Boosting and ANN


The relative ratio here is (ANN background kept) / (boosting background kept). Greater than one means boosting wins!

A. All types of background events. Red is 21 and black is 52 training variables.

B. Background is pi0 events. Red is 22 and black is 52 training variables.

(Horizontal axis of the plot: percent of nue CCQE signal events kept.)


Numerical Results from sfitter (a second reconstruction program)

An extensive attempt was made to find the best variables for ANN and for boosting, starting from about 3000 candidates.

Training was against pi0 and related backgrounds, with 22 ANN variables and 50 boosting variables.

For the region near 50% of signal kept, the ratio of ANN background to boosting background was about 1.2.




Robustness


For either boosting or ANN, it is important to know how robust the method is, i.e., whether small changes in the model produce large changes in the output.

In MiniBooNE this is handled by generating many sets of events with parameters varied by about 1 sigma and checking the differences. This is not complete, but, so far, the selections look quite robust for boosting.


How did the sensitivities change with a new optical model?

In November 2004, a new, much-changed optical model of the detector was introduced for making MC events.

Both rfitter and sfitter needed to be changed to optimize fits for this model.

Using the SAME feature variables as for the old model:

For both rfitter and sfitter, the boosting results were about the same.

For sfitter, the ANN results became about a factor of 2 worse.



For ANN


For ANN one needs to set the temperature, hidden layer size, learning rate, and so on. There are lots of parameters to tune.

For ANN, if one

a. multiplies a variable by a constant, var(17) -> 2*var(17),

b. switches two variables, var(17) <-> var(18), or

c. puts a variable in twice,

the result is very likely to change.


For Boosting



There are only a few parameters, and once set they have been stable for all calculations within our experiment.

If a variable is replaced by y = f(x) such that x_1 > x_2 implies y_1 > y_2 (a monotonic transformation), the results are identical, since the method depends only on the ordering of the values.

Putting variables in twice or changing the order of variables has no effect.


Tests of Boosting Variants


None was clearly better than AdaBoost or EpsilonBoost.


Can Convergence Speed be Improved?


Removing correlations between variables helps.

Random Forest techniques help when combined with boosting: use a random fraction (1/2) of the training events per tree, drawn with replacement, and a random fraction of the PID variables per node (all PID variables were used per node for the test here).

Softening the step-function scoring also helps: y = 2*purity - 1; score = sign(y)*sqrt(|y|).
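A one-line sketch of this softened leaf score (replacing the +1/-1 step), where purity is the signal fraction of the leaf:

    import numpy as np

    def smooth_leaf_score(purity):
        """Softened leaf score: sign(2p - 1) * sqrt(|2p - 1|) instead of a +/-1 step."""
        y = 2.0 * purity - 1.0
        return np.sign(y) * np.sqrt(np.abs(y))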


Smooth Scoring and Step Function


Performance of AdaBoost with Step Function and Smooth Function


Post-Fitting

Post-fitting is an attempt to reweight the trees when summing the tree scores, after all the trees have been made.

Two attempts produced only a very modest gain (a few percent), if any.


Conclusions


Boosting is very robust. Given a sufficient number of leaves and trees, AdaBoost or EpsilonBoost reaches an optimum level, which is not bettered by any variant tried.

Boosting was better than ANN in our tests by a factor of 1.2-1.8.

There are ways (such as the smooth scoring function) to increase convergence speed in some cases.

Post-fitting makes only a small improvement.

Several techniques can be used for weeding out variables. Examining the frequency with which a given variable is used works reasonably well.

Downloads in FORTRAN or C++ are available at:

http://www.gallatin.physics.lsa.umich.edu/~roe/


References


R.E. Schapire, "The strength of weak learnability", Machine Learning 5 (2), 197-227 (1990). First suggested the boosting approach, with three trees taking a majority vote.

Y. Freund, "Boosting a weak learning algorithm by majority", Information and Computation 121 (2), 256-285 (1995). Introduced using many trees.

Y. Freund and R.E. Schapire, "Experiments with a new boosting algorithm", Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, pp. 148-156 (1996). Introduced AdaBoost.

J. Friedman, T. Hastie, and R. Tibshirani, "Additive logistic regression: a statistical view of boosting", Annals of Statistics 28 (2), 337-407 (2000). Showed that AdaBoost can be viewed as successive approximations to a maximum likelihood solution.

T. Hastie, R. Tibshirani, and J. Friedman, "The Elements of Statistical Learning", Springer (2001). Good reference for decision trees and boosting.

B.P. Roe et al., "Boosted decision trees as an alternative to artificial neural networks for particle identification", NIM A543, pp. 577-584 (2005).

Hai-Jun Yang, Byron P. Roe, and Ji Zhu, "Studies of Boosted Decision Trees for MiniBooNE Particle Identification", physics/0508045, submitted to NIM, July 2005.


Example


AdaBoost: suppose the weighted error rate is 40%, i.e., err = 0.4, and beta = 1/2.

Then alpha = (1/2)*ln((1 - 0.4)/0.4) = 0.203.

The weight of a misclassified event is multiplied by exp(0.203) = 1.225.

Epsilon boost: the weight of wrong events is increased by exp(2*0.01) = 1.02.
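A few lines of Python confirming this arithmetic (illustration only):

    import math

    err, beta = 0.4, 0.5
    alpha = beta * math.log((1 - err) / err)   # AdaBoost tree weight
    print(round(alpha, 3))                     # 0.203
    print(round(math.exp(alpha), 3))           # 1.225, factor applied to misclassified events

    epsilon = 0.01
    print(round(math.exp(2 * epsilon), 3))     # 1.02, the epsilon-boost factor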


AdaBoost Optimization


AdaBoost Fitting is Monotone


The MiniBooNE Experiment



Comparison of 21 (or 22) vs 52 Variables for Boosting

The vertical axis is the ratio of background kept for 21 (22) variables to that kept for 52 variables, both for boosting.

Red is for a cocktail training sample and black is for a pi0 training sample.

Error bars are MC statistical errors only.




Artificial Neural Networks



Used to classify events, for example into "signal" and "noise/background".

Suppose you have a set of "feature variables", obtained from the kinematic variables of the event.




Neural Network Structure


Combine the features in a non-linear way to a "hidden layer" and then to a "final layer".

Use a training set to find the best weights w_ik to distinguish signal and background.
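A minimal sketch of such a network's forward pass (one hidden layer with sigmoid activations; the sizes below are illustrative, not those used in MiniBooNE):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def ann_output(x, w_hidden, b_hidden, w_out, b_out):
        """Forward pass: feature variables -> hidden layer -> single output in (0, 1)."""
        hidden = sigmoid(x @ w_hidden + b_hidden)   # non-linear combination of the features
        return sigmoid(hidden @ w_out + b_out)      # final layer; near 1 = signal-like

    # example sizes: 22 feature variables, 30 hidden nodes
    rng = np.random.default_rng(0)
    w_hidden = rng.normal(size=(22, 30))
    b_hidden = np.zeros(30)
    w_out = rng.normal(size=30)
    print(ann_output(rng.normal(size=22), w_hidden, b_hidden, w_out, b_out=0.0))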


Feedforward Neural Network -- I


Feedforward Neural Network -- II


Determining the weights


Suppose we want signal events to give output = 1 and background events to give output = 0.

The mean square error is computed over N_p training events with desired outputs o_i (either 0 or 1) and ANN results t_i.
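The error formula itself was an equation image on the slide; in the notation just defined it is presumably the standard form

    E = (1/N_p) * sum over i from 1 to N_p of (o_i - t_i)^2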


Back Propagation to Determine Weights




AdaBoost vs Epsilon Boost and Differing Tree Sizes

A. Background kept for 8 leaves / background kept for 45 leaves. Red is AdaBoost, black is Epsilon Boost.

B. Background kept for AdaBoost / background kept for Epsilon Boost, with Nleaves = 45.


AdaBoost Output for Training and Test Samples