Introduction to Statistics and Machine Learning
Helge Voss, GSI Power Week, Dec 5-9 2011
How do we:
• understand
• interpret
our measurements?
How do we get the data for our measurements?
Outline
Multivariate classification/regression algorithms (MVA)
• motivation
• another introduction/repeat of the ideas of hypothesis tests in this context
• Multidimensional Likelihood (kNN: k-Nearest Neighbour)
• Projective Likelihood (naïve Bayes)
• What to do with correlated input variables? Decorrelation strategies
MVA Literature / Software Packages... a biased selection
Software packages for Multivariate Data Analysis/Classification
individual classifier software
• e.g. “JETNET”: C. Peterson, T. Rognvaldsson, L. Loennblad
• and many other packages
attempts to provide “all inclusive” packages
• StatPatternRecognition: I. Narsky, arXiv: physics/0507143, http://www.hep.caltech.edu/~narsky/spr.html
• TMVA: Höcker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039, http://tmva.sf.net or every ROOT distribution (development moved from SourceForge to ROOT repository)
• WEKA: http://www.cs.waikato.ac.nz/ml/weka/
• “R”: a huge data analysis library: http://www.r-project.org/
Literature:
• T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, Springer 2001
• C. M. Bishop, “Pattern Recognition and Machine Learning”, Springer 2006
Conferences: PHYSTAT, ACAT, …
Event Classification
Suppose a data sample with two types of events, with class labels Signal and Background
(we will restrict ourselves here to two-class cases. Many classifiers can in principle be extended to several classes; otherwise, analyses can be staged)
we have discriminating variables x_1, x_2, …
how to set the decision boundary to select events of type S?
• A linear boundary?
• A nonlinear one?
• Rectangular cuts?
[Figure: three panels in the (x_1, x_2) plane, each showing S and B events separated by one of these boundary types]
How can we decide what to use?
Once decided on a class of boundaries, how to find the “optimal” one?
Low variance (stable), high bias methods
High variance, small bias methods
Regression
how to estimate a “functional behaviour” from a given set of “known measurements”?
assume for example “D” variables that somehow characterize the shower in your calorimeter
energy as function of the calorimeter shower parameters.
[Figure: three panels of f(x) versus x (Energy versus Cluster Size): constant? linear? non-linear?]
seems trivial? The human brain has very good pattern recognition capabilities!
what if you have many input variables?
if we had an analytic model (i.e. we know the function is an n-th order polynomial), then we know how to fit this (i.e. Maximum Likelihood Fit)
but what if we just want to “draw any kind of curve” and parameterize it?
Regression → model functional behaviour
Assume for example “D” variables that somehow characterize the shower in your calorimeter.
Monte Carlo or testbeam → data sample with measured cluster observables + known particle energy = calibration function (energy == surface in D+1 dimensional space)
[Figure: 1-D example, f(x) versus x; 2-D example, f(x,y) versus (x, y); events generated according to the underlying distribution]
better known: (linear) regression → fit a known analytic function
e.g. for the above 2-D example a reasonable function would be: f(x,y) = ax² + by² + c
what if we don’t have a reasonable “model”?
need something more general: e.g. piecewise defined splines, kernel estimators, decision trees to approximate f(x)
NOT in order to “fit a parameter”: provide a prediction of the function value f(x) for new measurements x (where f(x) is not known)
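The 2-D example can be sketched in a few lines. This is only an illustration under stated assumptions: the toy data, the noise level and the helper names `fit_paraboloid` and `predict` are made up here and are not taken from the lecture. Because f(x,y) = ax² + by² + c is linear in the parameters a, b, c, an ordinary least-squares fit is enough; the fitted function is then used to predict f at a new point.

```python
import numpy as np

def fit_paraboloid(x, y, f):
    """Least-squares fit of f(x, y) = a*x**2 + b*y**2 + c; returns (a, b, c)."""
    # The model is linear in (a, b, c), so a single design matrix suffices.
    design = np.column_stack([x**2, y**2, np.ones_like(x)])
    params, *_ = np.linalg.lstsq(design, f, rcond=None)
    return params

def predict(params, x, y):
    """Evaluate the fitted function at new measurements (x, y)."""
    a, b, c = params
    return a * x**2 + b * y**2 + c

# Toy "training" sample; all numbers are invented for illustration.
rng = np.random.default_rng(seed=1)
x_train = rng.uniform(-1.0, 1.0, size=500)
y_train = rng.uniform(-1.0, 1.0, size=500)
noise = rng.normal(0.0, 0.05, size=500)
f_train = 2.0 * x_train**2 + 0.5 * y_train**2 + 1.0 + noise

params = fit_paraboloid(x_train, y_train, f_train)
print("fitted (a, b, c):", params)
print("prediction at (0.3, -0.2):", predict(params, 0.3, -0.2))
```

If no such analytic model is available, the same `predict` role would be played by a spline, kernel estimator or decision tree.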
Event Classification
Each event, whether Signal or Background, has “D” measured variables.
Find a mapping from the D-dimensional input (observable = “feature”) space to a one-dimensional output: the class label
most general form: y = y(x); x ∈ R^D
x = {x_1, …, x_D}: input variables
y(x): R^D → R
[Figure: the D-dimensional “feature space” mapped by y(x) onto a single axis]
plotting (histogramming) the resulting y(x) values:
Who sees what this would look like for the regression problem?
Event Classification
Each event, whether Signal or Background, has “D” measured variables.
Find a mapping from the D-dimensional input/observable/“feature” space to a one-dimensional output: the class labels
y(B) → 0, y(S) → 1
y(x): R^D → R: “test statistic” in the D-dimensional space of input variables
y(x) = const: surface defining the decision boundary.
distributions of y(x): PDF_S(y) and PDF_B(y), used to set the selection cut!
• y(x) > cut: signal
• y(x) = cut: decision boundary
• y(x) < cut: background
→ efficiency and purity
overlap of PDF_S(y) and PDF_B(y) → separation power, purity
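As a small illustration of how a cut on y(x) translates into efficiency and purity, here is a minimal sketch. The function name `cut_performance` and the y values are hypothetical. Note that the purity obtained this way depends on the relative abundance of signal and background in the sample used.

```python
def cut_performance(y_signal, y_background, cut):
    """Signal efficiency, background efficiency and purity for the selection y(x) > cut."""
    n_sig_pass = sum(1 for y in y_signal if y > cut)
    n_bkg_pass = sum(1 for y in y_background if y > cut)
    eff_sig = n_sig_pass / len(y_signal)
    eff_bkg = n_bkg_pass / len(y_background)
    n_pass = n_sig_pass + n_bkg_pass
    # Purity: fraction of true signal among all selected events.
    purity = n_sig_pass / n_pass if n_pass > 0 else float("nan")
    return eff_sig, eff_bkg, purity

# Invented classifier outputs for five signal and five background events.
y_sig = [0.9, 0.8, 0.75, 0.6, 0.4]
y_bkg = [0.1, 0.2, 0.3, 0.55, 0.7]

eff_sig, eff_bkg, purity = cut_performance(y_sig, y_bkg, cut=0.5)
print("signal efficiency:", eff_sig)
print("background efficiency:", eff_bkg)
print("purity:", purity)
```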
Classification ↔ Regression
Classification:
Each event, whether Signal or Background, has “D” measured variables.
y(x): R^D → R: “test statistic” in the D-dimensional space of input variables
y(x) = const: surface defining the decision boundary.
Regression:
Each event has “D” measured variables + one function value (e.g. cluster shape variables in the ECAL + the particle’s energy)
y(x): R^D → R
find y(x) = const: hyperplanes where the target function is constant
Now, y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.
[Figure: D-dimensional “feature space”; target function f(x_1, x_2) over the (X_1, X_2) plane]
Event Classification
y(x): R^D → R: the mapping from the “feature space” (observables) to one output variable
PDF_B(y), PDF_S(y): normalised distributions of y = y(x) for background and signal events (i.e. the “function” that describes the shape of the distribution)
with y = y(x) one can also say PDF_B(y(x)), PDF_S(y(x)): probability densities for background and signal
now let’s assume we have an unknown event from the example above for which y(x) = 0.2
→ PDF_B(y(x)) = 1.5 and PDF_S(y(x)) = 0.45
let f_S and f_B be the fractions of signal and background events in the sample, then:
$\frac{f_S \, PDF_S(y(x))}{f_S \, PDF_S(y(x)) + f_B \, PDF_B(y(x))}$
is the probability of an event with measured x = {x_1, …, x_D} that gives y(x) to be of type signal
[Figure: PDF_S(y) and PDF_B(y) versus y(x), with the values 1.5 and 0.45 marked at y(x) = 0.2]
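The example values can be turned into a number once the fractions f_S and f_B are fixed. The sketch below assumes the usual form f_S·PDF_S / (f_S·PDF_S + f_B·PDF_B) for the signal probability; the function name and the two choices of f_S are made up for illustration and are not values from the lecture.

```python
def signal_posterior(pdf_s, pdf_b, f_s):
    """Probability that an event is signal, given the PDF values at its y(x)."""
    f_b = 1.0 - f_s  # two-class case: the fractions add up to one
    return f_s * pdf_s / (f_s * pdf_s + f_b * pdf_b)

# PDF values from the example: PDF_S = 0.45 and PDF_B = 1.5 at y(x) = 0.2.
p_equal = signal_posterior(0.45, 1.5, f_s=0.5)  # equal abundances (assumed)
p_rare = signal_posterior(0.45, 1.5, f_s=0.1)   # rare signal (assumed)

print("P(signal), f_S = 0.5:", round(p_equal, 4))
print("P(signal), f_S = 0.1:", round(p_rare, 4))
```

The same PDF values give a very different answer for a rare signal, which is why the prior abundances matter.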
Event Classification
P(Class=C|x) (or simply P(C|x)): probability that the event is of class C, given the measured observables x = {x_1, …, x_D} → y(x)
$P(\text{Class}=C \mid y) = \frac{P(y \mid C) \, P(C)}{P(y)}$
• P(Class=C|y): posterior probability
• P(y|C): probability density distribution according to the measurements x and the given mapping function
• P(C): prior probability to observe an event of “class C”, i.e. the relative abundance of “signal” versus “background”
• P(y): overall probability density to observe the actual measurement y(x), i.e. $P(y) = \sum_{\text{Classes}} P(y \mid \text{Class}) \, P(\text{Class})$
Any Decision Involves Risk!
Trying to select signal events:
(i.e. try to disprove the null hypothesis stating it were “only” a background event)
Decide to treat an event as “Signal” or “Background”:
  truly is      accepted as Signal    accepted as Background
  Signal        correct               Type-2 error
  Background    Type-1 error          correct
Type-1 error (false positive):
classify an event as Class C even though it is not
(accept a hypothesis although it is not true, i.e. false)
(reject the null hypothesis although it would have been the correct one)
→ loss of purity (e.g. accepting wrong events)
Type-2 error (false negative):
fail to identify an event from Class C as such
(reject a hypothesis although it would have been correct/true)
(fail to reject the null hypothesis / accept the null hypothesis although it is false)
→ loss of efficiency (e.g. missing true (signal) events)
“A”: region of the outcome of the test where you accept the event as Signal:
Significance (size) α: Type-1 error rate: $\alpha = \int_A P(x \mid B) \, dx$ = background selection “efficiency” (should be small)
β: Type-2 error rate (how often you miss the signal): $\beta = \int_{\text{not } A} P(x \mid S) \, dx$ (should be small)
Power: 1 − β = signal selection efficiency
most of the rest of the lecture will be about methods that try to make as few mistakes as possible
Neyman-Pearson Lemma
Neyman-Pearson: the likelihood ratio used as “selection criterion” y(x) gives for each selection efficiency the best possible background rejection,
$y(x) = \frac{P(x \mid S)}{P(x \mid B)}$
i.e. it maximises the area under the “Receiver Operating Characteristic” (ROC) curve
[Figure: ROC curve, 1 − ε_backgr. versus ε_signal, both axes from 0 to 1, compared with curves for other test statistics y’(x) and y’’(x)]
varying y(x) > “cut” moves the working point (efficiency and purity) along the ROC curve
how to choose the “cut”? → need to know prior probabilities (S, B abundances)
• measurement of signal cross section: maximum of S/√(S+B) or equiv. √(ε∙p)
• discovery of a signal (typically: S<<B): maximum of S/√(B)
• precision measurement: high purity (p) → large background rejection
• trigger selection: high efficiency (ε)
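The working-point choice can be illustrated with a toy model in which both densities are known exactly, which is rarely the case in practice. Everything specific below is an assumption made for the sketch: the two unit-width Gaussians, the expected yields of 100 signal and 1000 background events, and the scan range of the cut. The code scans a cut on the likelihood ratio, records the ROC points and keeps the cut that maximises S/√(S+B).

```python
import math
import random

def gauss_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood_ratio(x):
    # Toy assumption: signal ~ N(1, 1), background ~ N(0, 1), both known exactly.
    return gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)

random.seed(3)
y_sig = [likelihood_ratio(random.gauss(1.0, 1.0)) for _ in range(2000)]
y_bkg = [likelihood_ratio(random.gauss(0.0, 1.0)) for _ in range(2000)]

def efficiencies(cut):
    """Fraction of signal and of background events with y(x) > cut."""
    eff_s = sum(y > cut for y in y_sig) / len(y_sig)
    eff_b = sum(y > cut for y in y_bkg) / len(y_bkg)
    return eff_s, eff_b

n_sig_expected, n_bkg_expected = 100.0, 1000.0  # assumed prior abundances

roc = []  # points (signal efficiency, background rejection)
best_cut, best_fom = None, -1.0
for i in range(1, 101):
    cut = 0.1 * i
    eff_s, eff_b = efficiencies(cut)
    roc.append((eff_s, 1.0 - eff_b))
    s, b = n_sig_expected * eff_s, n_bkg_expected * eff_b
    if s + b > 0:
        fom = s / math.sqrt(s + b)  # figure of merit for a cross-section measurement
        if fom > best_fom:
            best_cut, best_fom = cut, fom

print("best cut on y(x):", best_cut, "with S/sqrt(S+B) =", round(best_fom, 2))
```

Replacing the figure of merit by S/√B or by a purity requirement moves the chosen working point along the same ROC curve.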
MVA and Machine Learning
The previous slide was basically the idea of “Multivariate Analysis” (MVA)
rem: What about “standard cuts” (event rejection in each variable separately with fixed conditions, i.e. if x_1 > 0 or x_2 < 3 then background)?
Finding y(x): R^D → R
• given a certain type of model class y(x)
• in an automatic way using “known” or “previously solved” events, i.e. learn from known “patterns”
• such that the resulting y(x) has good generalization properties when applied to “unknown” events (regression: fits the target function well “in between” the known training events)
that is what the “machine” is supposed to be doing: supervised machine learning
Of course… there’s no magic, we still need to:
• choose the discriminating variables
• choose the class of models (linear, non-linear, flexible or less flexible)
• tune the “learning parameters” → bias vs. variance trade-off
• check generalization properties
• consider the trade-off between statistical and systematic uncertainties
Event Classification
Unfortunately, the true probability density functions are typically unknown:
→ the Neyman-Pearson lemma doesn’t really help us directly
Monte Carlo simulation or, in general cases, a set of known (already classified) “events”
2 different ways: use these “training” events to:
• estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained
  e.g. D-dimensional histogram, kernel density estimators, …
• find a “discrimination function” y(x) and corresponding decision boundary (i.e. hyperplane* in the “feature space”: y(x) = const) that optimally separates signal from background
  e.g. Linear Discriminator, Neural Networks, …
* a hyperplane in the strict sense goes through the origin. Here I mean “affine set” to be precise
Unsupervised Learning
Just a short remark, as we talked about “supervised” learning before:
supervised: training with “events” for which we know the outcome (i.e. Signal or Background)
unsupervised: no prior knowledge about what is “Signal” or “Background” or … we don’t even know if there are different “event classes”; then you can for example do:
• cluster analysis: if different “groups” are found → class labels
• principal component analysis: find the basis in observable space with the biggest hierarchical differences in the variance → infer something about the underlying substructure
Examples:
• think about “success” or “not success” rather than “signal” and “background” (i.e. a robot achieves its goal or does not / falls or does not fall / …)
• market survey: if asked many different questions, maybe you can find “clusters” of people, group them together and test if there are correlations between these groups and their tendency to buy a certain product → address them specially
• medical survey: group people together and perhaps find common causes for certain diseases
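Nearest Neighbour and Kernel Density Estimator
estimate the probability density P(x) in D-dimensional space:
The only thing at our disposal is our “training data”: “events” distributed according to P(x)
Say we want to know P(x) at “this” point “x”
One expects to find in a volume V around point “x” N·∫_V P(x)dx events from a dataset with N events
For the chosen rectangular volume (a cube of edge length h centred on “x”) → K events:
$K = \sum_{n=1}^{N} k\left(\frac{x - x_n}{h}\right)$, with $k(u) = 1$ if $|u_i| \le 1/2$ for all $i = 1, \dots, D$ and $k(u) = 0$ otherwise
k(u): is called a Kernel function
K (from the “training data”) → estimate of the average P(x) in the volume V: ∫_V P(x)dx = K/N
$P(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D} \, k\left(\frac{x - x_n}{h}\right)$: Kernel Density estimator of the probability density
Classification: determine PDF_S(x) and PDF_B(x) → likelihood ratio as classifier!
[Figure: training events in the (x_1, x_2) plane with a square window of size h around the point “x”]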
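The counting estimate K/(N·V) for a rectangular window can be written down directly. This is a minimal sketch: the function name `parzen_density`, the uniform toy sample and the window size are assumptions for illustration. For events distributed uniformly on the unit square the true density is 1, which gives a simple cross-check.

```python
import random

def parzen_density(x, training, h):
    """Estimate P(x) as K / (N * h**D), K = number of training points in a cube of edge h around x."""
    dim = len(x)
    k_in_volume = sum(
        1
        for point in training
        if all(abs(p - xi) <= 0.5 * h for p, xi in zip(point, x))
    )
    return k_in_volume / (len(training) * h**dim)

# Toy training data: uniform on the unit square, so P(x) = 1 everywhere inside.
random.seed(7)
training = [(random.random(), random.random()) for _ in range(20000)]

estimate = parzen_density((0.5, 0.5), training, h=0.2)
print("estimated P(x) at (0.5, 0.5):", round(estimate, 3))
```

Applying the same function separately to a signal sample and a background sample gives the two densities whose ratio serves as classifier.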
Nearest Neighbour and Kernel Density Estimator
Regression: if each event with (x_1, x_2) carries a “function value” f(x_1, x_2) (e.g. energy of incident particle) → estimate f at the point “x” from the K training events in the volume V around it, i.e. the average function value
k(u): is called a Kernel function: rectangular → Parzen Window
K (from the “training data”) → estimate of the average P(x) in the volume V: ∫_V P(x)dx = K/N
[Figure: “events” distributed according to P(x) in the (x_1, x_2) plane with a window of size h around the point “x”]
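Nearest Neighbour and Kernel Density Estimator
determine K from the “training data” with signal and background mixed together
kNN: k-Nearest Neighbours → relative number of events of the various classes amongst the k nearest neighbours, e.g. y(x) = n_S/k for the signal class
[Figure: two panels in the (x_1, x_2) plane: a fixed window of size h around the point “x”, and the k nearest neighbours of “x”]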
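A brute-force version of the k-nearest-neighbour classifier fits in a few lines. This is a sketch under stated assumptions: a Euclidean metric, two Gaussian toy samples and k = 20 are choices made here for illustration, and a real implementation would use a search tree instead of sorting all distances.

```python
import math
import random

def knn_signal_fraction(x, signal, background, k):
    """Fraction of signal events among the k training events nearest to x."""
    labelled = [(math.dist(x, p), 1) for p in signal]
    labelled += [(math.dist(x, p), 0) for p in background]
    labelled.sort(key=lambda pair: pair[0])  # brute force: sort all distances
    nearest = labelled[:k]
    return sum(label for _, label in nearest) / k

# Toy training samples (invented): signal around (1, 1), background around (-1, -1).
random.seed(2)
signal = [(random.gauss(1.0, 1.0), random.gauss(1.0, 1.0)) for _ in range(500)]
background = [(random.gauss(-1.0, 1.0), random.gauss(-1.0, 1.0)) for _ in range(500)]

y_near_signal = knn_signal_fraction((1.0, 1.0), signal, background, k=20)
y_near_background = knn_signal_fraction((-1.0, -1.0), signal, background, k=20)
print("y(x) at (1, 1):", y_near_signal)
print("y(x) at (-1, -1):", y_near_background)
```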
Kernel Density Estimator
Parzen Window: “rectangular Kernel” → discontinuities at window edges
smoother model for P(x) when using smooth Kernel functions: e.g. Gaussian
$P(x) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left(-\frac{\|x - x_n\|^2}{2h^2}\right)$ ↔ probability density estimator
place a “Gaussian” around each “training data point” and sum up their contributions at arbitrary points “x” → P(x)
h: “size” of the Kernel → “smoothing parameter”
there is a large variety of possible Kernel functions
[Figure: individual kernels and averaged kernels]
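The Gaussian-kernel estimate and the role of the smoothing parameter h can be tried out in one dimension. This is a minimal sketch, assuming a standard-normal toy sample and two arbitrarily chosen values of h; none of these numbers come from the lecture.

```python
import math
import random

def gaussian_kde(x, training, h):
    """1-D kernel density estimate: average of Gaussians of width h placed on each training point."""
    norm = 1.0 / (len(training) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - t) / h) ** 2) for t in training)

# Toy training data drawn from a standard normal; the true density at 0 is about 0.399.
random.seed(4)
training = [random.gauss(0.0, 1.0) for _ in range(5000)]

p_moderate = gaussian_kde(0.0, training, h=0.3)  # reasonable smoothing
p_wide = gaussian_kde(0.0, training, h=3.0)      # oversmoothed: the peak is washed out

print("h = 0.3:", round(p_moderate, 3))
print("h = 3.0:", round(p_wide, 3))
```

Every evaluation loops over the full training sample, which is the cost that search trees are meant to reduce.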
Kernel Density Estimator
$P(x) = \frac{1}{N} \sum_{n=1}^{N} K_h(x - x_n)$: a general probability density estimator using kernel K
h: “size” of the Kernel → “smoothing parameter”
the chosen size of the “smoothing parameter” is more important than the kernel function (Christopher M. Bishop)
• h too small: overtraining
• h too large: not sensitive to features in P(x)
a drawback of Kernel density estimators: evaluation for any test event involves ALL TRAINING DATA → typically very time consuming
binary search trees (e.g. kd-trees) are typically used in kNN methods to speed up searching
which metric for the Kernel (window)?
• normalise all variables to the same range
• include correlations? → Mahalanobis metric: x·x → x^T V^{-1} x
“Curse of Dimensionality”
Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.
We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to lack of Monte Carlo events.
Shortcoming of nearest-neighbour strategies: in higher dimensional classification/regression cases the idea of looking at “training events” in a reasonably small “vicinity” of the space point to be classified becomes difficult:
consider: total phase space volume V = 1^D
for a cube of a particular fraction of the volume: edge length = (fraction of volume)^{1/D}
In 10 dimensions: in order to capture 1% of the phase space → 63% of the range in each variable is necessary → that’s not “local” anymore..
Therefore we still need to develop all the alternative classification/regression techniques
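The 63% figure follows from taking the D-th root of the volume fraction. A minimal sketch (the function name `edge_fraction` and the list of dimensions are chosen here for illustration):

```python
def edge_fraction(volume_fraction, dim):
    """Edge length (as a fraction of the full range) of a cube holding the given fraction of a unit hypercube."""
    return volume_fraction ** (1.0 / dim)

# Edge length needed to capture 1% of the phase space in various dimensions.
for dim in (1, 2, 3, 10, 100):
    print(f"D = {dim:3d}: edge = {edge_fraction(0.01, dim):.3f} of the range")
```

Already for D = 10 the window covers most of the range of every variable, and for D = 100 nearly all of it.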
Naïve Bayesian Classifier, often called “(projective) Likelihood”
Multivariate Likelihood (k-Nearest Neighbour) → estimate the full D-dimensional joint probability density
If correlations between variables are weak: $P(x) \approx \prod_{k=1}^{D} P_k(x_k)$ → product of marginal PDFs (1-dim “histograms”)
Likelihood ratio for event i_event:
$y(i_{event}) = \frac{\prod_{k \in \text{variables}} P_k^{\text{signal}}(x_k(i_{event}))}{\sum_{C \in \text{classes}} \prod_{k \in \text{variables}} P_k^{C}(x_k(i_{event}))}$
with PDFs P_k^C, discriminating variables x_k, and classes C: signal, background types
One of the first and still very popular MVA algorithms in HEP
No hard cuts on individual variables; allow for some “fuzziness”: one very signal-like variable may counterweigh another less signal-like variable
optimal method if correlations == 0 (Neyman-Pearson Lemma)
PDE introduces fuzzy logic
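A projective likelihood built from 1-dim histograms can be sketched as follows. The class `Histogram1D`, the binning, the small floor added to empty bins and the Gaussian toy samples are all assumptions made for this illustration and do not reproduce any particular package.

```python
import random

class Histogram1D:
    """Normalised 1-D histogram used as a simple estimate of a marginal PDF."""

    def __init__(self, values, lo, hi, n_bins):
        self.lo, self.n_bins = lo, n_bins
        self.width = (hi - lo) / n_bins
        counts = [0] * n_bins
        for v in values:
            counts[self._bin(v)] += 1
        # Small floor per bin so that an empty bin never gives zero likelihood.
        total = len(values) + 0.5 * n_bins
        self.density = [(c + 0.5) / (total * self.width) for c in counts]

    def _bin(self, v):
        i = int((v - self.lo) / self.width)
        return min(max(i, 0), self.n_bins - 1)  # clamp under- and overflow

    def pdf(self, v):
        return self.density[self._bin(v)]

def projective_likelihood(event, sig_pdfs, bkg_pdfs):
    """Likelihood ratio L_S / (L_S + L_B) from products of 1-D PDFs."""
    like_s, like_b = 1.0, 1.0
    for value, pdf_s, pdf_b in zip(event, sig_pdfs, bkg_pdfs):
        like_s *= pdf_s.pdf(value)
        like_b *= pdf_b.pdf(value)
    return like_s / (like_s + like_b)

# Toy samples with two uncorrelated variables (invented numbers).
random.seed(5)
signal = [(random.gauss(1.0, 1.0), random.gauss(1.0, 1.0)) for _ in range(5000)]
background = [(random.gauss(-1.0, 1.0), random.gauss(-1.0, 1.0)) for _ in range(5000)]

sig_pdfs = [Histogram1D([e[k] for e in signal], -4.0, 4.0, 40) for k in range(2)]
bkg_pdfs = [Histogram1D([e[k] for e in background], -4.0, 4.0, 40) for k in range(2)]

y_signal_like = projective_likelihood((1.5, 1.5), sig_pdfs, bkg_pdfs)
y_background_like = projective_likelihood((-1.5, -1.5), sig_pdfs, bkg_pdfs)
print("y at (1.5, 1.5):", round(y_signal_like, 4))
print("y at (-1.5, -1.5):", round(y_background_like, 4))
```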
Naïve Bayesian Classifier, often called “(projective) Likelihood”
How to parameterise the 1-dim PDFs?
• event counting (histogramming): automatic, unbiased, but suboptimal
• parametric (function) fitting: difficult to automate for arbitrary PDFs
• nonparametric fitting (i.e. splines, kernel): easy to automate, can create artefacts/suppress information
example: original (underlying) distribution is Gaussian
If the correlations between variables are really negligible, this classifier is “perfect” (simple, robust, performing)
If not, you seriously lose performance
How can we “fix” this?
What if there are correlations?
Typically correlations are present: C_ij = cov[x_i, x_j] = E[x_i x_j] − E[x_i]E[x_j] ≠ 0 (i ≠ j)
pre-processing: choose a set of linearly transformed input variables for which C_ij = 0 (i ≠ j)
Decorrelation
Determine the square root C’ of the covariance matrix C, i.e., C = C’C’
compute C’ by diagonalising C: $D = S^T C S \Rightarrow C' = S \sqrt{D} \, S^T$
transformation from the original (x) into the de-correlated variable space (x’) by: x’ = C’^{-1} x
Attention: eliminates only linear correlations!!
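The square-root decorrelation can be sketched with NumPy. This is an illustration under assumptions: the helper name `sqrt_decorrelate` and the correlated Gaussian toy sample are made up here, and the square root is taken of the sample covariance matrix via its eigen-decomposition.

```python
import numpy as np

def sqrt_decorrelate(data):
    """Transform rows x of data (n_events, n_vars) to x' = C'^{-1} x, with C' the square root of the covariance."""
    cov = np.cov(data, rowvar=False)
    eigenvalues, s = np.linalg.eigh(cov)  # cov = s @ diag(eigenvalues) @ s.T
    sqrt_cov = s @ np.diag(np.sqrt(eigenvalues)) @ s.T
    # Apply the inverse square root to every event (row vector convention).
    return data @ np.linalg.inv(sqrt_cov).T

# Toy sample of two linearly correlated Gaussian variables (invented numbers).
rng = np.random.default_rng(seed=6)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=2000)

decorrelated = sqrt_decorrelate(data)
print("covariance before:\n", np.cov(data, rowvar=False))
print("covariance after:\n", np.cov(decorrelated, rowvar=False))
```

After the transformation the sample covariance is the unit matrix, so only linear correlations are removed; any nonlinear dependence between the variables survives.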
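Decorrelation: Principal Component Analysis
Find the variable transformation that diagonalises the covariance matrix
Principal Component (PC) of variable k: $x_k^{PC}(i_{event}) = \sum_{v \in \text{variables}} \left( x_v(i_{event}) - \bar{x}_v \right) v_v^{(k)}$, with sample means $\bar{x}_v$ and eigenvector $v^{(k)}$
The matrix of eigenvectors V obeys the relation: $C \, V = V \, D$, with C the covariance matrix and D the diagonalised C (its eigenvalues on the diagonal)
PCA eliminates correlations!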
PCA (unsupervised learning algorithm)
• reduce dimensionality of a problem
• find most dominant features in a distribution
Eigenvectors of covariance matrix → “axis” in transformed variable space
• large eigenvalue → large variance along the axis (principal component)
• sort eigenvectors according to their eigenvalues
• transform dataset accordingly
→ diagonalised covariance matrix with first “variable” ↔ variable with largest variance
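The steps listed above can be sketched with NumPy. This is a minimal illustration, with the function name `pca_transform` and the toy sample assumed here: centre the data, diagonalise the sample covariance matrix, sort the eigenvectors by eigenvalue and project.

```python
import numpy as np

def pca_transform(data):
    """Return the principal components of data (n_events, n_vars) and the sorted eigenvalues."""
    centred = data - data.mean(axis=0)  # subtract the sample means
    cov = np.cov(centred, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    return centred @ eigenvectors, eigenvalues

# Toy sample of three correlated variables (invented covariance matrix).
rng = np.random.default_rng(seed=8)
true_cov = [[2.0, 1.2, 0.4], [1.2, 1.0, 0.3], [0.4, 0.3, 0.5]]
data = rng.multivariate_normal([1.0, -2.0, 0.5], true_cov, size=3000)

components, eigenvalues = pca_transform(data)
print("eigenvalues (sorted):", eigenvalues)
print("covariance of components:\n", np.cov(components, rowvar=False))
```

Dropping the last columns of `components` is the usual way to reduce the dimensionality of the problem.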
How to Apply the Pre-Processing Transformation?
• Correlation (decorrelation): different for signal and background variables
• we don’t know beforehand if it is signal or background. What do we do?
for the likelihood ratio, decorrelate signal and background independently: evaluate the signal PDFs on the variables after the signal transformation and the background PDFs on the variables after the background transformation
for other estimators, one needs to decide on one of the two… (or decorrelate on a mixture of signal and background events)
Decorrelation at Work
Example: linearly correlated Gaussians → decorrelation works to 100%
→ 1-D Likelihood on the decorrelated sample gives the best possible performance
compare also the effect on the MVA-output variable!
[Figure: MVA output for correlated variables and after decorrelation (note the different scale on the y-axis… sorry)]
Limitations of the Decorrelation
in cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care
How does linear decorrelation affect cases where correlations between signal and background differ?
[Figure: original correlations, Signal and Background panels]
Limitations of the Decorrelation
in cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care
How does linear decorrelation affect cases where correlations between signal and background differ?
[Figure: after SQRT decorrelation, Signal and Background panels]
Limitations of the Decorrelation
in cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care
How does linear decorrelation affect strongly nonlinear cases?
[Figure: original correlations, Signal and Background panels]
Limitations of the Decorrelation
in cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care
How does linear decorrelation affect strongly nonlinear cases?
[Figure: after SQRT decorrelation, Signal and Background panels]
Watch out before you use decorrelation “blindly”!!
Perhaps “decorrelate” only a subspace!
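“Gaussian-isation”
Improve decorrelation by pre-Gaussianisation of variables
First: transformation to achieve a uniform (flat) distribution:
$x_k^{flat}(i_{event}) = \int_{-\infty}^{x_k(i_{event})} p_k(x'_k) \, dx'_k, \quad \forall k \in \{\text{variables}\}$
(rarity transform of variable k; x_k(i_event): measured value; p_k: PDF of variable k)
Second: make Gaussian via the inverse error function:
$x_k^{Gauss}(i_{event}) = \sqrt{2} \cdot \text{erf}^{-1}\left( 2 x_k^{flat}(i_{event}) - 1 \right)$
The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see later for likelihood section)
Third: decorrelate (and “iterate” this procedure)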
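The first two steps can be sketched with the Python standard library only. Assumptions made for this illustration: the rarity integral is done by event counting with the (rank + 0.5)/N convention to stay away from 0 and 1, and the inverse error function step is written as the standard normal quantile via `statistics.NormalDist().inv_cdf`, which is equivalent to √2·erf⁻¹(2u − 1). The exponential toy sample is invented.

```python
import random
from statistics import NormalDist, mean, pstdev

def gaussianise(values):
    """Map a sample to an approximately standard-normal one, preserving the ordering of the events."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    flat = [0.0] * n
    for rank, index in enumerate(order):
        # Step 1: rarity (flat) transform by event counting.
        flat[index] = (rank + 0.5) / n
    # Step 2: standard normal quantile, i.e. sqrt(2) * erfinv(2u - 1).
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf(u) for u in flat]

# Toy variable with a strongly non-Gaussian (exponential) distribution.
random.seed(11)
raw = [random.expovariate(1.0) for _ in range(2000)]
gauss = gaussianise(raw)

print("mean after Gaussianisation:", round(mean(gauss), 4))
print("standard deviation after Gaussianisation:", round(pstdev(gauss), 4))
```

In practice the transformation would be derived from either the signal or the background training sample and then applied to all events.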
“Gaussian-isation”
[Figure: three panels: Original, Signal-Gaussianised, Background-Gaussianised]
We cannot simultaneously “Gaussianise” both signal and background!
Summary
Hope you are all convinced that Multivariate Algorithms are nice and powerful classification techniques
• Do not use hard selection criteria (cuts) on each individual observable
• Look at all observables “together”, e.g. combining them into 1 variable
Multidimensional Likelihood → PDF in D dimensions
Projective Likelihood (Naïve Bayesian) → PDFs in D times 1 dimension
How to “avoid” correlations