Introduction to Statistics and Machine Learning

Helge Voss

GSI Power Week, Dec 5-9, 2011

How do we understand and interpret our measurements?

How do we get the data for our measurements?
Outline


- Multivariate classification/regression algorithms (MVA): motivation
- Another introduction / a repeat of the ideas of hypothesis tests in this context
- Multidimensional likelihood (kNN: k-nearest neighbour)
- Projective likelihood (naïve Bayes)
- What to do with correlated input variables? Decorrelation strategies
- MVA literature / software packages ... a biased selection
Software Packages for Multivariate Data Analysis / Classification

- Individual classifier software:
  - e.g. "JETNET", C. Peterson, T. Rognvaldsson, L. Loennblad, and many other packages
- Attempts to provide "all inclusive" packages:
  - StatPatternRecognition: I. Narsky, arXiv: physics/0507143
    http://www.hep.caltech.edu/~narsky/spr.html
  - TMVA: Höcker, Speckmayer, Stelzer, Therhaag, von Toerne, Voss, arXiv: physics/0703039
    http://tmva.sf.net, or in every ROOT distribution (development moved from SourceForge to the ROOT repository)
  - WEKA: http://www.cs.waikato.ac.nz/ml/weka/
  - "R", a huge data analysis library: http://www.r-project.org/
- Literature:
  - T. Hastie, R. Tibshirani, J. Friedman, "The Elements of Statistical Learning", Springer 2001
  - C. M. Bishop, "Pattern Recognition and Machine Learning", Springer 2006
- Conferences: PHYSTAT, ACAT, ...


Event Classification

A linear boundary? A nonlinear one? Rectangular cuts?

[Figure: three example panels in the (x1, x2) plane showing signal (S) and background (B) events, separated by a linear boundary, a nonlinear boundary, and rectangular cuts.]

- Suppose we have a data sample with two types of events, carrying the class labels Signal and Background.
  (We restrict ourselves here to the two-class case; many classifiers can in principle be extended to several classes, otherwise analyses can be staged.)
- We have discriminating variables x1, x2, ...
- How do we set the decision boundary to select events of type S?
- How can we decide what to use?
  (low variance (stable), high bias methods vs. high variance, small bias methods)
- Once we have decided on a class of boundaries, how do we find the "optimal" one?


Regression

- How do we estimate a "functional behaviour" from a given set of known measurements?

[Figure: three panels of f(x) versus x, fitted with a constant, a linear, and a non-linear model.]

- Assume for example "D" variables that somehow characterize the shower in your calorimeter:
  the energy as a function of the calorimeter shower parameters.

[Figure: energy versus cluster size.]

- Seems trivial? The human brain has very good pattern recognition capabilities!
- But what if you have many input variables?



- If we had an analytic model (i.e. we knew the function is an n-th order polynomial), then we would know how to fit it (e.g. with a maximum likelihood fit).
- But what if we just want to "draw any kind of curve" and parameterize it?

Regression: model functional behaviour

- Assume for example "D" variables that somehow characterize the shower in your calorimeter.
- Monte Carlo or testbeam data sample with measured cluster observables + known particle energy
  → calibration function (the energy is a surface in D+1 dimensional space)

[Figures: a 1-D example, f(x) versus x, and a 2-D example, f(x,y) versus (x,y), with events generated according to the underlying distribution.]

- Better known: (linear) regression, i.e. fit a known analytic function
  - e.g. for the above 2-D example a reasonable function would be f(x,y) = ax^2 + by^2 + c
- What if we don't have a reasonable "model"? Then we need something more general:
  - e.g. piecewise defined splines, kernel estimators, or decision trees to approximate f(x)




- The goal is NOT to "fit a parameter", but to provide a prediction of the function value f(x) for new measurements x (where f(x) is not known).
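As a concrete illustration of the non-parametric options listed above (not part of the original slides), here is a minimal Python sketch that approximates an unknown f(x) with a smoothing spline and predicts it at new points; the toy data and the smoothing parameter s are made-up assumptions.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(42)

    # toy "training" measurements: x and noisy function values f(x)
    x_train = np.sort(rng.uniform(0.0, 10.0, 200))
    f_train = np.sin(x_train) + 0.1 * x_train + rng.normal(0.0, 0.2, x_train.size)

    # piecewise-polynomial (spline) approximation of f(x); s controls the smoothness
    spline = UnivariateSpline(x_train, f_train, k=3, s=len(x_train) * 0.04)

    # prediction of f(x) for new measurements x where f is not known
    x_new = np.array([1.5, 4.2, 7.7])
    print(spline(x_new))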

Event Classification

- Each event, whether Signal or Background, has "D" measured variables: a D-dimensional "feature space".
- Find a mapping from the D-dimensional input/observable ("feature") space to a one-dimensional output → the class label.
- Most general form: y = y(x), with x ∈ R^D, x = {x1, ..., xD} the input variables, i.e. y(x): R^D → R.
- Plot (histogram) the resulting y(x) values.
- Who sees what this would look like for the regression problem?


Event Classification

- Each event, whether Signal or Background, has "D" measured variables: a D-dimensional "feature space".
- Find a mapping y(x): R^D → R from the D-dimensional input/observable ("feature") space to a one-dimensional output → the class labels, e.g. y(B) → 0, y(S) → 1.
- y(x): a "test statistic" in the D-dimensional space of input variables.
- y(x) = const: a surface defining the decision boundary.
- Distributions of y(x): PDF_S(y) and PDF_B(y).
- The overlap of PDF_S(y) and PDF_B(y) determines the separation power and the purity.
- The distributions are used to set the selection cut:
  y(x) > cut: signal
  y(x) = cut: decision boundary
  y(x) < cut: background
  → efficiency and purity
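A minimal numerical sketch of this idea (not from the original slides): given y(x) values for known signal and background training events, compute the signal efficiency and purity as a function of the cut. The toy score distributions and the assumed expected yields are my own.

    import numpy as np

    rng = np.random.default_rng(0)

    # toy y(x) values for known training events: background near 0, signal near 1
    y_bkg = rng.normal(0.3, 0.15, 100_000)
    y_sig = rng.normal(0.7, 0.15, 100_000)

    for cut in (0.4, 0.5, 0.6):
        eff_sig = np.mean(y_sig > cut)                   # signal efficiency
        eff_bkg = np.mean(y_bkg > cut)                   # background "efficiency"
        n_sig, n_bkg = eff_sig * 1000, eff_bkg * 1000    # assume 1000 expected events per class
        purity = n_sig / (n_sig + n_bkg)
        print(f"cut={cut:.2f}  eff_S={eff_sig:.3f}  eff_B={eff_bkg:.3f}  purity={purity:.3f}")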


Classification ↔ Regression

Classification:
- Each event, whether Signal or Background, has "D" measured variables: a D-dimensional "feature space".
- y(x): R^D → R: a "test statistic" in the D-dimensional space of input variables.
- y(x) = const: a surface defining the decision boundary.

Regression:
- Each event has "D" measured variables + one function value (e.g. cluster shape variables in the ECAL + the particle's energy).
- y(x): R^D → R
- Find y(x) = const → hyperplanes where the target function is constant.
- Now y(x) needs to be built such that it best approximates the target, not such that it best separates signal from background.

[Figure: a surface f(x1, x2) over the (X1, X2) plane.]


Event Classification

- y(x): R^D → R: the mapping from the "feature space" (observables) to one output variable.
- PDF_B(y), PDF_S(y): the normalised distributions of y = y(x) for background and signal events (i.e. the "functions" that describe the shapes of the distributions).
- With y = y(x) one can also write PDF_B(y(x)), PDF_S(y(x)): the probability densities for background and signal.
- Let fS and fB be the fractions of signal and background events in the sample.
- Now assume we have an unknown event from the example above for which y(x) = 0.2, with PDF_B(y(x)) = 1.5 and PDF_S(y(x)) = 0.45:
  what is the probability that an event with measured x = {x1, ..., xD} that gives this y(x) is of type signal?

Event Classification

- P(Class = C | x) (or simply P(C | x)): the probability that the event class is C, given the measured observables x = {x1, ..., xD} → y(x).
- This posterior probability is built from:
  - the probability density of y(x) for events of class C,
  - the prior probability to observe an event of "class C", i.e. the relative abundance of "signal" versus "background",
  - the overall probability density to observe the actual measurement y(x), i.e. the probability density of the measurements x under the given mapping function.
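The formula itself did not survive the extraction; a standard Bayes'-theorem form consistent with the labels on the slide is

    P(S \mid y(\mathbf{x})) =
      \frac{f_S\, \mathrm{PDF}_S\bigl(y(\mathbf{x})\bigr)}
           {f_S\, \mathrm{PDF}_S\bigl(y(\mathbf{x})\bigr) + f_B\, \mathrm{PDF}_B\bigl(y(\mathbf{x})\bigr)} .

With the numbers from the previous slide, PDF_S(y(x)) = 0.45 and PDF_B(y(x)) = 1.5, and purely for illustration assuming equal fractions fS = fB = 0.5, this gives P(S | y(x) = 0.2) = 0.225 / (0.225 + 0.75) ≈ 0.23.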


Any Decision Involves Risk!

Decide to treat an event as "Signal" or "Background":

                          decide "Signal"    decide "Background"
  event is Signal         correct            Type-2 error
  event is Background     Type-1 error       correct

Trying to select signal events (i.e. trying to disprove the null hypothesis stating it were "only" a background event):

- Type-1 error (false positive): classify an event as class C even though it is not
  (accept a hypothesis although it is not true, i.e. reject the null hypothesis although it would have been the correct one)
  → loss of purity (e.g. accepting wrong events); should be small
- Type-2 error (false negative): fail to identify an event from class C as such
  (reject a hypothesis although it would have been correct/true, i.e. fail to reject / accept the null hypothesis although it is false)
  → loss of efficiency (e.g. missing true signal events); should be small

"A": the region of the outcome of the test where you accept the event as Signal.

- Significance α: the Type-1 error rate; α = background selection "efficiency".
- Size β: the Type-2 error rate (how often you miss the signal).
- Power: 1 - β = the signal selection efficiency.

Most of the rest of the lecture will be about methods that try to make as few mistakes as possible.
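As a compact summary in my own notation (consistent with the definitions above): with A the acceptance region in y and PDF_B, PDF_S the distributions of y(x) for background and signal,

    \alpha = \int_{A} \mathrm{PDF}_B(y)\,\mathrm{d}y \quad \text{(Type-1 error rate, background efficiency)}, \qquad
    1-\beta = \int_{A} \mathrm{PDF}_S(y)\,\mathrm{d}y \quad \text{(power, signal efficiency)} .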


Neyman
-
Pearson Lemma

13

Neyman
-
Peason:

The Likelihood ratio used as “selection criterium”
y(x) gives for each selection efficiency the best
possible background rejection.

i.e.

it maximises the area under the “
Receiver
Operation Characteristics
” (ROC) curve

0

1

1

0

1
-

e
backgr
.


e
signal

varying y(x)>“cut” moves the working point (efficiency and purity) along the ROC curve


how to choose “cut”


need to know prior probabilities (
S
,
B

abundances)



measurement of signal cross section:

maximum of S/
√(S+B) or equiv.
√(
e

p
)


discovery of a signal (typically: S<<B):

maximum of
S/√(B)


precision measurement:

high purity (p)


l慲g攠b慣歧round r敪散瑩on


trigger selection:

high efficiency (
e)
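A small Python sketch (not from the slides) that traces a ROC-like curve by scanning cuts on a toy classifier output y and picks the working point maximising S/√(S+B); the score distributions and the expected yields N_S, N_B are assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    y_sig = rng.normal(0.7, 0.15, 50_000)   # toy y(x) for signal events
    y_bkg = rng.normal(0.3, 0.15, 50_000)   # toy y(x) for background events

    N_S, N_B = 100.0, 10_000.0              # assumed expected yields before any cut
    cuts = np.linspace(0.0, 1.0, 201)

    eff_s = np.array([np.mean(y_sig > c) for c in cuts])   # signal efficiency
    eff_b = np.array([np.mean(y_bkg > c) for c in cuts])   # background efficiency
    roc = np.column_stack([eff_s, 1.0 - eff_b])            # ROC points (could be plotted)

    S, B = eff_s * N_S, eff_b * N_B
    signif = S / np.sqrt(S + B)             # figure of merit for a cross-section measurement
    best = np.argmax(signif)
    print(f"best cut = {cuts[best]:.2f}, eff_S = {eff_s[best]:.2f}, "
          f"1 - eff_B = {1 - eff_b[best]:.3f}, S/sqrt(S+B) = {signif[best]:.1f}")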



MVA and Machine Learning

- The previous slide was basically the idea of "Multivariate Analysis" (MVA).
  (Remark: what about "standard cuts", i.e. event rejection in each variable separately with fixed conditions, e.g. if x1 > 0 or x2 < 3 then background?)
- Finding y(x): R^D → R
  - given a certain type of model class y(x)
  - in an automatic way, using "known" or "previously solved" events, i.e. learning from known "patterns"
  - such that the resulting y(x) has good generalization properties when applied to "unknown" events
    (regression: it fits the target function well "in between" the known training events)
  - that is what the "machine" is supposed to be doing: supervised machine learning
- Of course there is no magic; we still need to (see the sketch after this list):
  - choose the discriminating variables
  - choose the class of models (linear, non-linear, flexible or less flexible)
  - tune the "learning parameters" → bias vs. variance trade-off
  - check the generalization properties
  - consider the trade-off between statistical and systematic uncertainties
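To make the last two points concrete, here is a minimal sketch of my own (using scikit-learn for brevity): train on "known" events and check generalization on an independent test sample. The toy dataset and the choice of classifier are assumptions, not the lecture's prescription.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(2)

    # toy "known" events: 2 discriminating variables, labels 1 = signal, 0 = background
    X_sig = rng.normal([1.0, 1.0], 1.0, size=(5000, 2))
    X_bkg = rng.normal([0.0, 0.0], 1.0, size=(5000, 2))
    X = np.vstack([X_sig, X_bkg])
    labels = np.concatenate([np.ones(5000), np.zeros(5000)])

    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=0)

    clf = GradientBoostingClassifier(max_depth=3, n_estimators=100)  # "learning parameters" to tune
    clf.fit(X_train, y_train)

    # compare performance on training vs. independent test events: a large gap signals overtraining
    auc_train = roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"ROC AUC train = {auc_train:.3f}, test = {auc_test:.3f}")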



Event Classification

- Unfortunately, the true probability density functions are typically unknown:
  → the Neyman-Pearson lemma doesn't really help us directly.
- → Use Monte Carlo simulation or, in general, a set of known (already classified) "events".
- Two different ways to use these "training" events:
  - estimate the functional form of p(x|C) (e.g. the differential cross section folded with the detector influences), from which the likelihood ratio can be obtained
    - e.g. D-dimensional histogram, kernel density estimators, ...
  - find a "discrimination function" y(x) and a corresponding decision boundary (i.e. a hyperplane* in the "feature space": y(x) = const) that optimally separates signal from background
    - e.g. linear discriminant, neural networks, ...

* A hyperplane in the strict sense goes through the origin; here "affine set" would be the precise term.



Unsupervised Learning

Just a short remark, as we talked about "supervised" learning before:

- supervised: training with "events" for which we know the outcome (i.e. Signal or Background)
- unsupervised: no prior knowledge about what is "Signal" or "Background", or ... we don't even know whether there are different "event classes"; then you can, for example, do:
  - cluster analysis: if different "groups" are found → class labels
  - principal component analysis: find the basis in observable space with the biggest hierarchical differences in the variance → infer something about the underlying substructure

Examples (see the clustering sketch after this list):
- think about "success" or "no success" rather than "signal" and "background" (i.e. a robot achieves its goal or does not / falls or does not fall / ...)
- market survey: if many different questions are asked, maybe you can find "clusters" of people, group them together and test whether there are correlations between these groups and their tendency to buy a certain product → address them specially
- medical survey: group people together and perhaps find common causes for certain diseases
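As an illustration of the cluster-analysis idea (my own sketch, not part of the lecture), a minimal k-means example; the two artificial "groups" are an assumption.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(3)

    # unlabeled "events": two unknown groups in a 2-D observable space
    group_a = rng.normal([0.0, 0.0], 0.5, size=(300, 2))
    group_b = rng.normal([3.0, 3.0], 0.5, size=(300, 2))
    X = np.vstack([group_a, group_b])

    # no class labels are used: the algorithm proposes its own "class labels"
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print("cluster sizes:", np.bincount(kmeans.labels_))
    print("cluster centres:\n", kmeans.cluster_centers_)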


Nearest Neighbour and Kernel Density Estimator

- Estimate the probability density P(x) in the D-dimensional space.
- The only thing at our disposal is our "training data".

[Figure: "events" in the (x1, x2) plane distributed according to P(x); a rectangular volume of side h around the point x.]

- Say we want to know P(x) at "this" point x.
- In a volume V around the point x one expects to find N·∫_V P(x)dx events from a dataset with N events.
- Count the K events (from the "training data") that fall inside the chosen rectangular volume of side h around x:
  this gives an estimate of the average P(x) in the volume V, since ∫_V P(x)dx ≈ K/N.
- The counting can be written with a kernel function k(u), here a rectangular one → a kernel density estimator of the probability density.
- Classification: determine PDF_S(x) and PDF_B(x) this way → use the likelihood ratio as classifier!
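A minimal numerical sketch of this counting estimate (my own, with toy data): estimate P(x) ≈ K/(N·V) with a rectangular window separately for signal and background training events and form the likelihood ratio.

    import numpy as np

    def box_density(x, data, h):
        """Rectangular (Parzen) window estimate: P(x) ~ K / (N * h^D),
        where K counts training events inside a hypercube of side h around x."""
        inside = np.all(np.abs(data - x) < h / 2.0, axis=1)
        K, N, D = inside.sum(), data.shape[0], data.shape[1]
        return K / (N * h**D)

    rng = np.random.default_rng(4)
    train_sig = rng.normal([1.0, 1.0], 0.7, size=(20_000, 2))   # toy signal training events
    train_bkg = rng.normal([0.0, 0.0], 0.7, size=(20_000, 2))   # toy background training events

    x = np.array([0.8, 0.6])      # point where we want P(x)
    h = 0.3                       # window size ("smoothing parameter")

    p_s = box_density(x, train_sig, h)
    p_b = box_density(x, train_bkg, h)
    print(f"PDF_S(x) = {p_s:.3f}, PDF_B(x) = {p_b:.3f}, likelihood ratio = {p_s / p_b:.3f}")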


Nearest Neighbour and Kernel Density Estimator

(Same setup as before: estimate P(x) from the "training data" by counting the K events inside the chosen rectangular volume of side h around the point x; the rectangular kernel function k(u) is called a Parzen window, and ∫_V P(x)dx ≈ K/N.)

- Regression: if each event with (x1, x2) carries a "function value" f(x1, x2) (e.g. the energy of the incident particle),
  estimate f at the point x as the average function value of the K training events inside the window.

[Figure: "events" in the (x1, x2) plane distributed according to P(x); a rectangular window of side h around the point x.]
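A short sketch of this local-average regression (my own, with made-up calorimeter-like toy data): the predicted f(x) is the mean of the function values of the training events that fall inside the window around x.

    import numpy as np

    def window_regression(x, data, f_values, h):
        """Predict f(x) as the average function value of the training
        events inside a hypercube of side h around x (rectangular kernel)."""
        inside = np.all(np.abs(data - x) < h / 2.0, axis=1)
        if not np.any(inside):
            return np.nan             # no training events in the window
        return f_values[inside].mean()

    rng = np.random.default_rng(5)
    # toy training sample: 2 cluster observables and a known "function value" (e.g. particle energy)
    obs = rng.uniform(0.0, 10.0, size=(50_000, 2))
    energy = 2.0 * obs[:, 0] + 0.5 * obs[:, 1] ** 2 + rng.normal(0.0, 1.0, 50_000)

    x_new = np.array([4.0, 6.0])
    print("predicted energy:", window_regression(x_new, obs, energy, h=0.5))
    print("true (noise-free) value:", 2.0 * 4.0 + 0.5 * 6.0 ** 2)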

Nearest Neighbour and Kernel Density Estimator

(Same setup as before: "events" distributed according to P(x), and we count the K training events around the point x.)

- Determine K from the "training data" with signal and background mixed together.
- kNN, k-nearest neighbours: the classifier output is the relative number of events of the various classes amongst the k nearest neighbours of x.

[Figure: the k nearest neighbours of the point x in the (x1, x2) plane, for signal and background training events.]
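A minimal kNN classification sketch (my own): for a test point x, find the k nearest training events in the mixed signal + background sample and return the signal fraction amongst them.

    import numpy as np

    def knn_signal_fraction(x, data, labels, k=20):
        """Fraction of signal events (label 1) amongst the k nearest
        training events of x, using Euclidean distance."""
        dist = np.linalg.norm(data - x, axis=1)
        nearest = np.argsort(dist)[:k]
        return labels[nearest].mean()

    rng = np.random.default_rng(6)
    sig = rng.normal([1.0, 1.0], 0.7, size=(10_000, 2))
    bkg = rng.normal([0.0, 0.0], 0.7, size=(10_000, 2))
    data = np.vstack([sig, bkg])
    labels = np.concatenate([np.ones(10_000), np.zeros(10_000)])   # 1 = signal, 0 = background

    x = np.array([0.9, 0.4])
    print("kNN signal fraction y(x) =", knn_signal_fraction(x, data, labels, k=20))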

Kernel Density Estimator

- Parzen window, i.e. the "rectangular kernel" → discontinuities at the window edges.
- A smoother model for P(x) is obtained by using smooth kernel functions, e.g. a Gaussian:
  place a "Gaussian" around each "training data point" and sum up their contributions at an arbitrary point x → P(x).
- h: the "size" of the kernel → the "smoothing parameter".
- There is a large variety of possible kernel functions.
- → a probability density estimator.

[Figure: the individual kernels and the averaged kernels.]
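A sketch of the Gaussian-kernel estimator in 1-D (my own toy example): each training point contributes a Gaussian of width h, and the contributions are averaged at the evaluation points.

    import numpy as np

    def gaussian_kde_1d(x_eval, train, h):
        """Sum a Gaussian of width h around each training point and average
        the contributions: P(x) = (1/N) * sum_i N(x; x_i, h)."""
        diff = (x_eval[:, None] - train[None, :]) / h
        kernels = np.exp(-0.5 * diff**2) / (np.sqrt(2.0 * np.pi) * h)
        return kernels.mean(axis=1)

    rng = np.random.default_rng(7)
    train = rng.normal(0.0, 1.0, 2_000)          # toy training data
    x_eval = np.linspace(-3.0, 3.0, 7)

    for h in (0.05, 0.3, 1.5):                   # too small / reasonable / too large
        print(f"h = {h}:", np.round(gaussian_kde_1d(x_eval, train, h), 3))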


Kernel Density Estimator

- h, the "size" of the kernel → the "smoothing parameter":
  the chosen size of the smoothing parameter is more important than the kernel function (Christopher M. Bishop).
  - h too small: overtraining
  - h too large: not sensitive to features in P(x)
  → a general probability density estimator using the kernel K.
- A drawback of kernel density estimators: the evaluation for any test event involves ALL TRAINING DATA → typically very time consuming.
  - Binary search trees (e.g. kd-trees) are typically used in kNN methods to speed up the searching.
- Which metric for the kernel (window)?
  - normalise all variables to the same range
  - include correlations? → Mahalanobis metric: x·x → x V⁻¹ x
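A short sketch combining both remarks (my own): build a scipy cKDTree on training data that has been transformed with the inverse covariance (a Mahalanobis-like metric), so that fast neighbour queries respect the variable scales and linear correlations.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(8)

    # toy training data with correlated variables on different scales
    cov = np.array([[4.0, 1.5],
                    [1.5, 1.0]])
    train = rng.multivariate_normal([0.0, 0.0], cov, size=50_000)

    # Mahalanobis-like metric: whiten the data with L, where V^-1 = L L^T,
    # so Euclidean distance in the transformed space equals the Mahalanobis distance
    V = np.cov(train, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(V))
    tree = cKDTree(train @ L)                    # kd-tree for fast neighbour searches

    x = np.array([1.0, 0.5])
    dist, _ = tree.query(x @ L, k=20)            # 20 nearest neighbours of x
    print("mean Mahalanobis distance of the 20 nearest neighbours:", dist.mean())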


"Curse of Dimensionality"

Bellman, R. (1961), Adaptive Control Processes: A Guided Tour, Princeton University Press.

- Shortcoming of nearest-neighbour strategies: in higher-dimensional classification/regression cases, the idea of looking at "training events" in a reasonably small "vicinity" of the space point to be classified becomes difficult.
- Consider a total phase-space volume V = 1^D; for a cube capturing a particular fraction of that volume, the required edge length grows quickly with D (see the worked example below):
  in 10 dimensions, in order to capture 1% of the phase space, 63% of the range in each variable is necessary → that's not "local" anymore.
- We all know: filling a D-dimensional histogram to get a mapping of the PDF is typically unfeasible due to the lack of Monte Carlo events.
- Therefore we still need to develop all the alternative classification/regression techniques.
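The formula lost in the extraction is presumably the edge length of a sub-cube containing a fraction f of the unit volume; reconstructed in my notation,

    \ell(f) = f^{1/D}, \qquad \ell(0.01)\big|_{D=10} = 0.01^{1/10} \approx 0.63 .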


Naïve Bayesian Classifier (often called: (projective) Likelihood)

- The multivariate likelihood (k-nearest neighbour) estimates the full D-dimensional joint probability density.
- If the correlations between variables are weak, the joint PDF can be approximated, for each class (signal, background types), by the product of the marginal PDFs (1-dim "histograms") of the discriminating variables, and the likelihood ratio for an event is built from these PDFs.
- One of the first and still very popular MVA algorithms in HEP.
- No hard cuts on individual variables; allow for some "fuzziness": one very signal-like variable may counterweigh another, less signal-like variable → PDE introduces fuzzy logic.
- Optimal method if the correlations are exactly 0 (Neyman-Pearson lemma).
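A minimal projective-likelihood sketch (my own): approximate each class PDF as a product of 1-D histograms of the input variables and form the ratio y = L_S / (L_S + L_B), a common convention. The toy data and the binning are assumptions.

    import numpy as np

    def marginal_pdfs(data, bins, ranges):
        """1-D histogram PDFs, one per input variable."""
        return [np.histogram(data[:, i], bins=bins, range=ranges[i], density=True)
                for i in range(data.shape[1])]

    def product_likelihood(x, pdfs):
        """Product of the marginal PDF values at x (naive Bayes assumption)."""
        like = 1.0
        for xi, (hist, edges) in zip(x, pdfs):
            j = np.clip(np.searchsorted(edges, xi) - 1, 0, len(hist) - 1)
            like *= hist[j]
        return like

    rng = np.random.default_rng(9)
    sig = rng.normal([1.0, 0.5], [0.8, 0.6], size=(50_000, 2))   # toy signal training events
    bkg = rng.normal([0.0, 0.0], [0.8, 0.6], size=(50_000, 2))   # toy background training events

    ranges = [(-4.0, 5.0), (-4.0, 4.0)]
    pdf_s = marginal_pdfs(sig, 50, ranges)
    pdf_b = marginal_pdfs(bkg, 50, ranges)

    x = np.array([0.8, 0.3])
    L_s, L_b = product_likelihood(x, pdf_s), product_likelihood(x, pdf_b)
    print("projective likelihood ratio y(x) =", L_s / (L_s + L_b))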


Naïve Bayesian Classifier (often called: (projective) Likelihood)

How to parameterise the 1-dim PDFs?
(example: the original, underlying distribution is Gaussian)

- event counting (histogramming): automatic and unbiased, but suboptimal
- parametric (function) fitting: difficult to automate for arbitrary PDFs
- nonparametric fitting (e.g. splines, kernels): easy to automate, but can create artefacts / suppress information

- If the correlations between the variables are really negligible, this classifier is "perfect" (simple, robust, performing).
- If not, you seriously lose performance → how can we "fix" this?


What if there are correlations?

- Typically correlations are present: Cij = cov[xi, xj] = E[xi xj] − E[xi] E[xj] ≠ 0 (i ≠ j).
- Pre-processing: choose a set of linearly transformed input variables for which Cij = 0 (i ≠ j).


Decorrelation

- Determine the square root C' of the correlation matrix C, i.e. C = C' C'.
- Compute C' by diagonalising C (C' = S √D Sᵀ, with S the matrix of eigenvectors and D the diagonal matrix of eigenvalues).
- The transformation from the original variables (x) into the decorrelated variable space (x') is: x' = C'⁻¹ x.
- Attention: this eliminates only linear correlations!!
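A numerical sketch of the square-root decorrelation (my own toy data): compute C, its square root via diagonalisation, and verify that x' = C'⁻¹ x has a diagonal covariance.

    import numpy as np

    rng = np.random.default_rng(10)

    # toy sample with linearly correlated input variables (rows = events)
    cov_true = np.array([[1.0, 0.7],
                         [0.7, 2.0]])
    x = rng.multivariate_normal([0.0, 0.0], cov_true, size=100_000)

    C = np.cov(x, rowvar=False)                   # covariance matrix of the sample
    eigval, S = np.linalg.eigh(C)                 # diagonalise: C = S diag(eigval) S^T
    C_sqrt = S @ np.diag(np.sqrt(eigval)) @ S.T   # C' with C = C' C'
    x_prime = x @ np.linalg.inv(C_sqrt).T         # x' = C'^-1 x, applied row-wise

    print("covariance after decorrelation:\n", np.round(np.cov(x_prime, rowvar=False), 3))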




Decorrelation: Principal Component Analysis

- Find the variable transformation that diagonalises the covariance matrix.
- The principal component (PC) of variable k is its projection onto the eigenvectors of the covariance matrix, after subtracting the sample means.
- The matrix of eigenvectors V obeys the relation C V = V D, with D diagonal → PCA eliminates correlations!
  (The same eigen-decomposition of the correlation matrix also yields the diagonalised square root of C used on the previous slide.)
- PCA (an unsupervised learning algorithm):
  - reduces the dimensionality of a problem
  - finds the most dominant features in a distribution
- Eigenvectors of the covariance matrix → the "axes" in the transformed variable space;
  a large eigenvalue → a large variance along that axis (principal component).
- Sort the eigenvectors according to their eigenvalues and transform the dataset accordingly
  → a diagonalised covariance matrix, with the first "variable" being the variable with the largest variance.
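A compact PCA sketch (my own): subtract the sample means, diagonalise the covariance matrix, sort the eigenvectors by decreasing eigenvalue and project the dataset onto them.

    import numpy as np

    rng = np.random.default_rng(11)

    # toy dataset with correlated variables (rows = events, columns = variables)
    cov = np.array([[3.0, 1.2],
                    [1.2, 1.0]])
    x = rng.multivariate_normal([2.0, -1.0], cov, size=50_000)

    x_centered = x - x.mean(axis=0)                # subtract the sample means
    C = np.cov(x_centered, rowvar=False)
    eigval, V = np.linalg.eigh(C)                  # C V = V D

    order = np.argsort(eigval)[::-1]               # sort by decreasing eigenvalue (variance)
    eigval, V = eigval[order], V[:, order]

    x_pc = x_centered @ V                          # principal components of each event
    print("eigenvalues (variances along the PCs):", np.round(eigval, 3))
    print("covariance of the PCs:\n", np.round(np.cov(x_pc, rowvar=False), 3))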

How to Apply the Pre-Processing Transformation?

- The correlations (and hence the decorrelation transformation) are different for the signal and background variables,
  but we don't know beforehand whether an event is signal or background. What do we do?
- For the likelihood ratio, decorrelate signal and background independently, using the signal transformation and the background transformation respectively.
- For other estimators, one needs to decide on one of the two ... (or decorrelate using a mixture of signal and background events).


Decorrelation at Work

- Example: linearly correlated Gaussians → the decorrelation works to 100%.
- A 1-D likelihood on the decorrelated sample gives the best possible performance.
- Compare also the effect on the MVA output variable!

[Figure: the correlated variables before and after decorrelation (note the different scale on the y-axis... sorry).]


Limitations of the Decorrelation

In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.

- How does linear decorrelation affect cases where the correlations between signal and background differ?

[Figure: the original correlations, for Signal and Background.]


Limitations of the Decorrelation

In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.

- How does linear decorrelation affect cases where the correlations between signal and background differ?

[Figure: after SQRT decorrelation, for Signal and Background.]


Limitations of the Decorrelation

In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.

- How does linear decorrelation affect strongly nonlinear cases?

[Figure: the original correlations, for Signal and Background.]


Limitations of the Decorrelation

In cases with non-Gaussian distributions and/or nonlinear correlations, the decorrelation needs to be treated with care.

- How does linear decorrelation affect strongly nonlinear cases?

[Figure: after SQRT decorrelation, for Signal and Background.]

- Watch out before you use decorrelation "blindly"!!
- Perhaps "decorrelate" only a subspace!



"Gaussian-isation"

Improve the decorrelation by a pre-Gaussianisation of the variables:

- First: a transformation to achieve a uniform (flat) distribution, the "rarity" transform of variable k, i.e. the cumulative integral of the PDF of variable k up to the measured value.
  The integral can be evaluated in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section later).
- Second: make the distribution Gaussian via the inverse error function.
- Third: decorrelate (and "iterate" this procedure).
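A minimal sketch of these three steps (my own; the empirical-CDF estimate via ranks and the toy data are assumptions): flatten each variable with its empirical CDF, map it to a Gaussian with the inverse error function, then apply the linear decorrelation from before.

    import numpy as np
    from scipy.special import erfinv
    from scipy.stats import rankdata

    def gaussianise(data):
        """Per variable: rarity (flat) transform via the empirical CDF,
        then z = sqrt(2) * erfinv(2u - 1) to obtain a standard Gaussian."""
        n = data.shape[0]
        u = np.column_stack([rankdata(col) / (n + 1.0) for col in data.T])   # flat in (0, 1)
        return np.sqrt(2.0) * erfinv(2.0 * u - 1.0)

    rng = np.random.default_rng(12)
    # toy non-Gaussian, correlated sample
    a = rng.exponential(1.0, 100_000)
    b = a + rng.exponential(0.5, 100_000)
    x = np.column_stack([a, b])

    z = gaussianise(x)                          # steps 1 + 2: Gaussianised variables
    C = np.cov(z, rowvar=False)                 # step 3: linear (square-root) decorrelation
    eigval, S = np.linalg.eigh(C)
    C_sqrt = S @ np.diag(np.sqrt(eigval)) @ S.T
    z_prime = z @ np.linalg.inv(C_sqrt).T

    print("covariance after Gaussianisation + decorrelation:\n",
          np.round(np.cov(z_prime, rowvar=False), 3))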


"Gaussian-isation"

[Figure: the original distributions, the signal-Gaussianised variables, and the background-Gaussianised variables.]

We cannot simultaneously "Gaussianise" both signal and background!

Summary

- Hope you are all convinced that multivariate algorithms are nice and powerful classification techniques:
  - do not use hard selection criteria (cuts) on each individual observable
  - look at all observables "together", e.g. by combining them into one variable
- Multidimensional likelihood → a PDF in D dimensions
- Projective likelihood (naïve Bayesian) → a PDF in D times 1 dimension
- How to "avoid" correlations