Byron Roe
1
Boosted Decision Trees, a
Powerful Event Classifier
Byron Roe, Haijun Yang, Ji Zhu
University of Michigan
Outline
• What is Boosting?
• Comparisons of ANN and Boosting for the MiniBooNE experiment
• Comparisons of Boosting and Other Classifiers
• Some tested modifications to Boosting and miscellaneous
Training and Testing Events
• Both ANN and boosting algorithms use a set of known events to train the algorithm.
• It would be biased to use the same set to estimate the accuracy of the selection; the algorithm has been trained for this specific sample.
• A new set, the testing set of events, is used to test the algorithm.
• All results quoted here are for the testing set.
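The separation into training and testing events can be sketched as follows. This is a minimal illustration, not the MiniBooNE procedure: the function name `split_train_test`, the 50% test fraction, and the fixed seed are all assumptions made for the example.

```python
import random


def split_train_test(events, test_fraction=0.5, seed=0):
    """Shuffle a copy of `events` and return (training, testing) lists.

    The two lists are disjoint, so the testing set is never seen in training.
    The 50% default is an arbitrary choice for this sketch.
    """
    rng = random.Random(seed)
    shuffled = list(events)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical events, represented here only by an index.
events = list(range(10))
train, test = split_train_test(events)
```

Accuracy figures would then be computed on `test` only.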
Boosted Decision Trees
• What is a decision tree?
• What is “boosting the decision trees”?
• Two algorithms for boosting.
Decision Tree
• Go through all PID variables and find the best variable and value to split events.
• For each of the two subsets, repeat the process.
• Proceeding in this way, a tree is built.
• Ending nodes are called leaves.
[Figure: decision tree diagram (Background/Signal)]
Select Signal and Background
Leaves
• Assume an equal weight of signal and background training events.
• If more than ½ of the weight of a leaf corresponds to signal, it is a signal leaf; otherwise it is a background leaf.
• Signal events on a background leaf or background events on a signal leaf are misclassified.
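The leaf rule above can be written out in a few lines. This is a sketch under assumed conventions: each event on a leaf is a `(weight, y)` pair with `y = +1` for signal and `-1` for background, and the function names are hypothetical.

```python
def leaf_label(events):
    """Return +1 (signal leaf) if more than half the leaf weight is signal, else -1."""
    signal_weight = sum(w for w, y in events if y == 1)
    total_weight = sum(w for w, _ in events)
    return 1 if signal_weight > 0.5 * total_weight else -1


def misclassified_weight(events):
    """Total weight of events whose true class disagrees with the leaf label."""
    label = leaf_label(events)
    return sum(w for w, y in events if y != label)


# Hypothetical leaf: two signal events and one background event, equal weights.
leaf = [(1.0, 1), (1.0, 1), (1.0, -1)]
label = leaf_label(leaf)            # signal leaf
wrong = misclassified_weight(leaf)  # weight of the one background event
```

A leaf with exactly half its weight in signal is labeled background here, following the "more than ½" wording.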
Criterion for “Best” Split
• Purity, $P$, is the fraction of the weight of a leaf due to signal events.
• Gini: Note that gini is 0 for all signal or all background.
• The criterion is to minimize $\text{gini}_{\text{left}} + \text{gini}_{\text{right}}$ of the two children from a parent node.
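A minimal sketch of the split search on a single variable follows. It assumes the weighted Gini form gini = W·P·(1 − P), with W the total weight on the node and P its purity; that form is an assumption of this example, chosen because it is 0 for all-signal or all-background nodes as stated above. The function names and the toy events are hypothetical, and a full tree builder would repeat the scan over every PID variable.

```python
def gini(events):
    """Weighted Gini W*P*(1-P) for a list of (x, weight, y) events (assumed form)."""
    total = sum(w for _, w, _ in events)
    if total == 0:
        return 0.0
    purity = sum(w for _, w, y in events if y == 1) / total
    return total * purity * (1.0 - purity)


def best_split(events):
    """Scan cuts midway between adjacent x values; return (cut, gini_left + gini_right)."""
    values = sorted({x for x, _, _ in events})
    best = None
    for lo, hi in zip(values, values[1:]):
        cut = 0.5 * (lo + hi)
        left = [e for e in events if e[0] < cut]
        right = [e for e in events if e[0] >= cut]
        total = gini(left) + gini(right)
        if best is None or total < best[1]:
            best = (cut, total)
    return best


# Toy sample: background (y = -1) at low x, signal (y = +1) at high x.
events = [(0.1, 1.0, -1), (0.2, 1.0, -1), (0.7, 1.0, 1), (0.9, 1.0, 1)]
cut, total = best_split(events)
```

For this toy sample the chosen cut falls between 0.2 and 0.7, where both children are pure.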
Criterion for Next Branch to Split
• Pick the branch to maximize the change in gini.
Criterion = $\text{gini}_{\text{parent}} - \text{gini}_{\text{right child}} - \text{gini}_{\text{left child}}$
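The choice of which branch to split next can be sketched as below. The Gini form W·P·(1 − P) is again an assumption of the example, and the leaf names and weights are invented for illustration; each node is summarized by its (signal weight, background weight).

```python
def gini(signal_weight, background_weight):
    """Weighted Gini W*P*(1-P) from the signal and background weight on a node."""
    total = signal_weight + background_weight
    if total == 0:
        return 0.0
    purity = signal_weight / total
    return total * purity * (1.0 - purity)


def split_gain(parent, left, right):
    """Criterion = gini(parent) - gini(right child) - gini(left child)."""
    return gini(*parent) - gini(*right) - gini(*left)


# Hypothetical candidate leaves: (parent, left child, right child) weights.
candidates = {
    "leaf_a": ((10.0, 10.0), (9.0, 1.0), (1.0, 9.0)),
    "leaf_b": ((10.0, 10.0), (6.0, 4.0), (4.0, 6.0)),
}
best_leaf = max(candidates, key=lambda name: split_gain(*candidates[name]))
```

Here `leaf_a` is split next because its best split separates signal from background far more cleanly.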
Decision Trees
• This is a decision tree.
• Decision trees have been known for some time, but are often unstable; a small change in the training sample can produce a large difference in the resulting tree.
Boosting the Decision Tree
• Give the training events misclassified under this procedure a higher weight.
• Continue building, perhaps 1000 trees, and take a weighted average of the results (1 if signal leaf, -1 if background leaf).
Two Commonly used Algorithms
for changing weights
• 1. AdaBoost
• 2. Epsilon boost (shrinkage)
Definitions
• $x_i$ = set of particle ID variables for event $i$
• $y_i$ = 1 if event $i$ is signal, $-1$ if background
• $T_m(x_i)$ = 1 if event $i$ lands on a signal leaf of tree $m$, and $-1$ if the event lands on a background leaf.
AdaBoost
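• Define $\mathrm{err}_m$ = (weight of misclassified events)/(total weight).
• Increase the weight of misidentified events: with $\alpha_m = \beta \ln[(1-\mathrm{err}_m)/\mathrm{err}_m]$, the weight of each misclassified event is multiplied by $e^{\alpha_m}$.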
Scoring events with AdaBoost
• Renormalize weights
• Score by summing over trees
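One AdaBoost step and the final score can be sketched as follows. The update uses alpha = beta·ln((1 − err)/err) and multiplies misclassified weights by exp(alpha), which matches the worked example later in the talk; weighting each tree's ±1 output by its alpha in the score is an assumption of this sketch, as are the function names and the toy inputs.

```python
import math


def adaboost_update(weights, y_true, tree_output, beta=0.5):
    """One AdaBoost step: return (alpha, renormalized weights).

    Assumes 0 < err < 1, i.e. the tree misclassifies some but not all weight.
    """
    wrong = [y != t for y, t in zip(y_true, tree_output)]
    err = sum(w for w, bad in zip(weights, wrong) if bad) / sum(weights)
    alpha = beta * math.log((1.0 - err) / err)
    boosted = [w * math.exp(alpha) if bad else w for w, bad in zip(weights, wrong)]
    norm = sum(boosted)
    return alpha, [w / norm for w in boosted]


def adaboost_score(alphas, tree_outputs):
    """Score of one event: sum over trees of alpha_m * T_m(x), with T_m = +1 or -1."""
    return sum(a * t for a, t in zip(alphas, tree_outputs))


# Toy step: four equally weighted events, the second one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
tree_output = [1, -1, -1, -1]
alpha, new_weights = adaboost_update(weights, y_true, tree_output)
```

After the step the misclassified event carries more weight than the others, so the next tree concentrates on it.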
Epsilon Boost (shrinkage)
• After tree $m$, change the weight of misclassified events by a small amount; typical $\epsilon$ ~0.01 (0.03). For misclassified events, the weight is multiplied by $e^{2\epsilon}$.
• Renormalize weights
• Score by summing over trees
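A sketch of the epsilon boost step follows. Misclassified weights are multiplied by exp(2·epsilon), as in the worked example later in the talk; scoring each tree's ±1 output with the same constant factor epsilon is an assumption of this sketch, and the function names are hypothetical.

```python
import math


def epsilon_boost_update(weights, y_true, tree_output, epsilon=0.01):
    """Multiply misclassified weights by exp(2*epsilon), then renormalize to sum 1."""
    boosted = [
        w * math.exp(2.0 * epsilon) if y != t else w
        for w, y, t in zip(weights, y_true, tree_output)
    ]
    norm = sum(boosted)
    return [w / norm for w in boosted]


def epsilon_boost_score(tree_outputs, epsilon=0.01):
    """Score of one event: epsilon times the sum of the +1/-1 tree outputs (assumed)."""
    return epsilon * sum(tree_outputs)


# Toy step: four equally weighted events, the second one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
new_weights = epsilon_boost_update(weights, [1, 1, -1, -1], [1, -1, -1, -1])
```

Unlike AdaBoost, the size of the change does not depend on the error rate of the tree.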
Unweighted and Weighted Misclassified Event Rate vs. Number of Trees
Comparison of methods
• Epsilon boost changes weights a little at a time.
• Let $y = 1$ for signal, $-1$ for background, and $F$ = score summed over trees.
• AdaBoost can be shown to try to optimize each change of weights: the expected value of $\exp(-yF)$ is minimized.
• The optimum value is $F$ = ½ of the log odds that $y$ is 1 given $x$, i.e. $F(x) = \tfrac{1}{2}\ln\frac{P(y=1\mid x)}{P(y=-1\mid x)}$.
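The optimum quoted in the last bullet can be checked with a short derivation. This is a sketch that assumes the minimized quantity is the conditional expectation of $\exp(-yF)$ at fixed $x$; the shorthand $p = P(y=1\mid x)$ is introduced only for this example.

```latex
\begin{align*}
E\left[e^{-yF}\mid x\right] &= p\,e^{-F} + (1-p)\,e^{F},\\
\frac{\partial}{\partial F}E\left[e^{-yF}\mid x\right] &= -p\,e^{-F} + (1-p)\,e^{F} = 0
\;\Rightarrow\; e^{2F} = \frac{p}{1-p},\\
F &= \tfrac{1}{2}\ln\frac{p}{1-p}.
\end{align*}
```

The second derivative, $p\,e^{-F} + (1-p)\,e^{F}$, is positive, so this stationary point is a minimum.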
The MiniBooNE Collaboration
40-foot-diameter tank of mineral oil, surrounded by about 1280 photomultipliers. Both Cherenkov and scintillation light are detected. Geometrical shape and timing distinguish events.
Tests of Boosting Parameters
• 45 leaves seemed to work well for our application.
• 1000 trees was sufficient (or over-sufficient).
• AdaBoost with $\beta$ about 0.5 and epsilon boost with $\epsilon$ about 0.03 worked well, although small changes made little difference.
• For other applications these numbers may need adjustment.
• For MiniBooNE, around 100 variables are needed for best results. Too many variables degrade performance.
• Relative ratio = const. × (fraction bkrd kept)/(fraction signal kept). Smaller is better!
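The relative ratio in the last bullet can be computed from classifier scores as sketched below. The function name, the convention that events scoring above `cut` are kept, and the toy scores are assumptions for illustration; `const` defaults to 1 because the talk does not give its value.

```python
def relative_ratio(signal_scores, background_scores, cut, const=1.0):
    """const * (fraction of background kept) / (fraction of signal kept).

    An event is kept when its score exceeds `cut`. Assumes some signal is kept.
    """
    signal_kept = sum(s > cut for s in signal_scores) / len(signal_scores)
    background_kept = sum(s > cut for s in background_scores) / len(background_scores)
    return const * background_kept / signal_kept


# Hypothetical scores on a testing sample.
signal_scores = [0.9, 0.8, 0.4, 0.7]
background_scores = [0.1, 0.6, 0.2, 0.3]
ratio = relative_ratio(signal_scores, background_scores, cut=0.5)
```

Here 3 of 4 signal events and 1 of 4 background events pass the cut, so the ratio is (1/4)/(3/4) = 1/3.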
Effects of Number of Leaves and
Number of Trees
Smaller is better! R = c × (frac. bkrd kept)/(frac. signal kept).
Number of feature variables in
boosting
• In recent trials we have used 182 variables. Boosting worked well.
• However, by looking at the frequency with which each variable was used as a splitting variable, it was possible to reduce the number to 86 without loss of sensitivity. Several methods for choosing variables were tried, but this worked as well as any.
• After using the frequency of use as a splitting variable, some further improvement may be obtained by looking at the correlations between variables.
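Ranking variables by how often they are used as a splitting variable can be sketched as follows. The representation of a tree as a plain list of the variable indices used at its split nodes, and the function names, are assumptions made for this example.

```python
from collections import Counter


def rank_variables_by_usage(trees):
    """Return variable indices ordered from most to least frequently used in splits."""
    counts = Counter(var for tree in trees for var in tree)
    return [var for var, _ in counts.most_common()]


def select_top_variables(trees, n_keep):
    """Keep the n_keep most frequently used splitting variables."""
    return rank_variables_by_usage(trees)[:n_keep]


# Hypothetical forest of three small trees; each list holds the split variables.
trees = [[3, 7, 3], [3, 1, 7], [7, 3, 5]]
kept = select_top_variables(trees, 2)
```

Variables 3 and 7 are kept because they account for seven of the nine splits; retraining on the reduced set would then show whether sensitivity is preserved.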
Effect of Number of PID Variables
Comparison of Boosting and ANN
• Relative ratio here is ANN bkrd kept/Boosting bkrd kept. Greater than one implies boosting wins!
• A. All types of background events. Red is 21 and black is 52 training variables.
• B. Bkrd is pi0 events. Red is 22 and black is 52 training variables.
[Figure axis: percent nue CCQE kept]
Numerical Results from sfitter (a
second reconstruction program)
• Extensive attempt to find the best variables for ANN and for boosting, starting from about 3000 candidates.
• Train against pi0 and related backgrounds: 22 ANN variables and 50 boosting variables.
• For the region near 50% of signal kept, the ratio of ANN to boosting background was about 1.2.
Robustness
• For either boosting or ANN, it is important to know how robust the method is, i.e., whether small changes in the model will produce large changes in output.
• In MiniBooNE this is handled by generating many sets of events with parameters varied by about 1 sigma and checking the differences. This is not complete, but, so far, the selections look quite robust for boosting.
How did the sensitivities change
with a new optical model?
• In November 2004, a new, much changed optical model of the detector was introduced for making MC events.
• Both rfitter and sfitter needed to be changed to optimize fits for this model.
• Using the SAME feature variables as for the old model:
• For both rfitter and sfitter, the boosting results were about the same.
• For sfitter, the ANN results became about a factor of 2 worse.
For ANN
• For ANN one needs to set temperature, hidden layer size, learning rate… There are lots of parameters to tune.
• For ANN, if one
  a. multiplies a variable by a constant, var(17) → 2·var(17),
  b. switches two variables, var(17) ↔ var(18), or
  c. puts a variable in twice,
  the result is very likely to change.
For Boosting
• Only a few parameters, and once set they have been stable for all calculations within our experiment.
• If a variable $x$ is replaced by $y = f(x)$ such that $x_1 > x_2$ implies $y_1 > y_2$, the results are identical, since the trees depend only on the ordering of values.
• Putting variables in twice or changing the order of variables has no effect.
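The invariance under monotonic transformations can be illustrated with a single split. The sketch below assumes equal event weights and the Gini form n·P·(1 − P); the function names and toy data are hypothetical. It finds the best cut on a variable and on its exponential, and compares which events go to the left child.

```python
import math


def gini(labels):
    """Gini n*P*(1-P) for equally weighted events with labels +1 or -1 (assumed form)."""
    n = len(labels)
    purity = sum(1 for y in labels if y == 1) / n
    return n * purity * (1.0 - purity)


def best_partition(values, labels):
    """Return the set of event indices sent left by the best single cut on `values`."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_sum, best_left = None, None
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue  # no cut can separate equal values
        left, right = order[:k], order[k:]
        total = gini([labels[i] for i in left]) + gini([labels[i] for i in right])
        if best_sum is None or total < best_sum:
            best_sum, best_left = total, frozenset(left)
    return best_left


x = [0.3, 2.5, 1.1, 4.0, 0.7, 3.2]
y = [-1, 1, -1, 1, -1, 1]
same = best_partition(x, y) == best_partition([math.exp(v) for v in x], y)
```

The cut value itself changes under the transformation, but the events on each side of it do not, which is all the tree depends on.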
Tests of Boosting Variants
• None clearly better than AdaBoost or EpsilonBoost.
Can Convergence Speed be
Improved?
• Removing correlations between variables helps.
• Random Forest (using a random fraction [1/2] of training events per tree, with replacement, and a random fraction of PID variables per node; all PID variables were used for the test here) helps WHEN combined with boosting.
• Softening the step function scoring: $y = 2 \times \text{purity} - 1$; score $= \operatorname{sign}(y)\sqrt{|y|}$.
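The two scoring choices can be compared in a few lines. This sketch takes the square root of the absolute value of y so that the score is defined for background-like leaves (purity below ½); the function names are hypothetical.

```python
import math


def step_score(purity):
    """Step-function score: +1 on a signal leaf (purity > 1/2), otherwise -1."""
    return 1.0 if purity > 0.5 else -1.0


def smooth_score(purity):
    """Smooth score sign(y) * sqrt(|y|) with y = 2*purity - 1."""
    y = 2.0 * purity - 1.0
    return math.copysign(math.sqrt(abs(y)), y)


# A leaf with purity 0.75 scores +1 with the step function,
# but sqrt(0.5) with the smooth function.
example = (step_score(0.75), smooth_score(0.75))
```

Both functions agree at purity 0 and 1; the smooth version gives leaves near purity ½ a score close to 0 instead of a full ±1.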
Smooth Scoring and Step Function
Performance of AdaBoost with
Step Function and Smooth
Function
Post-Fitting
• Post-fitting is an attempt to reweight the trees when summing tree scores after all the trees are made.
• Two attempts produced only a very modest (few %), if any, gain.
Conclusions
• Boosting is very robust. Given a sufficient number of leaves and trees, AdaBoost or EpsilonBoost reaches an optimum level, which is not bettered by any variant tried.
• Boosting was better than ANN in our tests by a factor of 1.2 to 1.8.
• There are ways (such as the smooth scoring function) to increase convergence speed in some cases.
• Post-fitting makes only a small improvement.
• Several techniques can be used for weeding variables. Examining the frequency with which a given variable is used works reasonably well.
• Downloads in FORTRAN or C++ available at: http://www.gallatin.physics.lsa.umich.edu/~roe/
References
• R.E. Schapire, “The strength of weak learnability”, Machine Learning 5 (2), 197-227 (1990). First suggested the boosting approach, for 3 trees taking a majority vote.
• Y. Freund, “Boosting a weak learning algorithm by majority”, Information and Computation 121 (2), 256-285 (1995). Introduced using many trees.
• Y. Freund and R.E. Schapire, “Experiments with a new boosting algorithm”, Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, pp. 148-156 (1996). Introduced AdaBoost.
• J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: a statistical view of boosting”, Annals of Statistics 28 (2), 337-407 (2000). Showed that AdaBoost could be looked at as successive approximations to a maximum likelihood solution.
• T. Hastie, R. Tibshirani, and J. Friedman, “The Elements of Statistical Learning”, Springer (2001). Good reference for decision trees and boosting.
• B.P. Roe et al., “Boosted decision trees as an alternative to artificial neural networks for particle identification”, NIM A543, pp. 577-584 (2005).
• Hai-Jun Yang, Byron P. Roe, and Ji Zhu, “Studies of Boosted Decision Trees for MiniBooNE Particle Identification”, Physics/0508045, submitted to NIM, July 2005.
Example
• AdaBoost: Suppose the weighted error rate is 40%, i.e., err = 0.4, and $\beta$ = 1/2.
• Then $\alpha = \tfrac{1}{2}\ln\left(\frac{1 - 0.4}{0.4}\right) = 0.203$.
• The weight of a misclassified event is multiplied by $\exp(0.203) = 1.225$.
• Epsilon boost: The weight of wrong events is increased by a factor of $\exp(2 \times 0.01) = 1.02$.
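The numbers in this example can be reproduced directly. The variable names are chosen for this sketch only.

```python
import math

# AdaBoost: weighted error rate 40%, beta = 1/2.
err, beta = 0.4, 0.5
alpha = beta * math.log((1.0 - err) / err)
adaboost_factor = math.exp(alpha)  # multiplier for a misclassified event

# Epsilon boost with epsilon = 0.01.
epsilon = 0.01
epsilon_factor = math.exp(2.0 * epsilon)  # multiplier for a misclassified event

print(round(alpha, 3), round(adaboost_factor, 3), round(epsilon_factor, 2))
# prints: 0.203 1.225 1.02
```

A single AdaBoost step at this error rate changes a misclassified weight by about 22%, against about 2% for one epsilon boost step.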
AdaBoost Optimization
AdaBoost Fitting is Monotone
The MiniBooNE Experiment
Comparison of 21 (or 22) vs 52
variables for Boosting
• Vertical axis is the ratio of bkrd kept for 21 (22) variables to that kept for 52 variables, both for boosting.
• Red is if the training sample is cocktail and black is if the training sample is pi0.
• Error bars are MC statistical errors only.
Artificial Neural Networks
• Use to classify events, for example into “signal” and “noise/background”.
• Suppose you have a set of “feature variables”, obtained from the kinematic variables of the event.
Neural Network Structure
Combine the features in a non-linear way to a “hidden layer” and then to a “final layer”.
Use a training set to find the best $w_{ik}$ to distinguish signal and background.
Feedforward Neural Network - I
Feedforward Neural Network - II
Determining the weights
• Suppose we want signal events to give output = 1 and background events to give output = 0.
• Mean square error given $N_p$ training events with desired outputs $o_i$ either 0 or 1, and ANN results $t_i$.
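A mean square error of this kind can be computed as sketched below. The normalization is an assumption: this example uses the plain mean of the squared differences, with no extra factor such as ½, and the function name and toy numbers are hypothetical.

```python
def mean_square_error(desired, outputs):
    """Mean of (o_i - t_i)^2 over the training events (assumed normalization 1/N_p)."""
    n_p = len(desired)
    return sum((o - t) ** 2 for o, t in zip(desired, outputs)) / n_p


# Hypothetical desired outputs o_i (1 = signal, 0 = background) and ANN results t_i.
desired = [1, 0, 1, 0]
outputs = [0.9, 0.2, 0.6, 0.1]
mse = mean_square_error(desired, outputs)
```

Training adjusts the network weights to make this quantity as small as possible on the training events.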
Back Propagation to Determine
Weights
AdaBoost vs Epsilon Boost and
differing tree sizes
• A. Bkrd for 8 leaves/bkrd for 45 leaves. Red is AdaBoost, black is Epsilon Boost.
• B. Bkrd for AdaBoost/bkrd for Epsilon Boost, Nleaves = 45.
AdaBoost Output for Training and Test Samples