Introduction

Self-Organizing Map

Algorithms

Results

Practical Tips

Multivariate Analysis Techniques at the LHC

Eric Malmi

Helsinki Institute of Physics/Adaptive Informatics Research Centre,

Aalto University (Helsinki University of Technology)

January 6, 2010

Outline

Introduction

Self-Organizing Map

Algorithms

Neural Networks

Support Vector Machines

Gene Expression Programming

Multi-class classification

Results

Practical tips

Introduction

Mikael Kuusela, Jerry W. Lamsa, Eric Malmi, Petteri Mehtala, and Risto Orava. Multivariate techniques for identifying diffractive interactions at the LHC. International Journal of Modern Physics A, to appear

Data

Proton-proton scattering events in

Single diffractive (SD)

Double diffractive (DD)

Central diffractive (CD)

Non-diffractive processes (ND)

Generated by PYTHIA (SD, ND) and PHOJET (DD, CD) Monte Carlo generators

12,000 events of each category: 10,000 for training and 2,000 for testing (SD x2)

Data

23 variables => 23-dimensional data vectors to be classified

Instead of the usual signal/background separation, our task is to determine the diffraction types of the events, i.e. to give labels in the range 1-4 for the data vectors

This is a multi-class pattern recognition (= classification) problem

| Variable | Comments |
|---|---|
| E_zdcl | ZDC energy left |
| E_casl | CASTOR energy left |
| E_hfl | HF energy left |
| t2ml | T2 multiplicity left |
| t1ml | T1 multiplicity left |
| fwdm1l | FSC multiplicity left planes 1-2 |
| fwdm2l | FSC multiplicity left planes 3-8 |
| fwdm3l | FSC multiplicity left planes 9-10 |
| fwd1stl | 1st FSC plane hit left |
| fwdmaxl | FSC plane with the max. hits left |
| e_zdcr | ZDC energy right |
| e_casr | CASTOR energy right |
| e_hfr | HF energy right |
| t2mr | T2 multiplicity right |
| t1mr | T1 multiplicity right |
| fwdm1r | FSC multiplicity right planes 1-2 |
| fwdm2r | FSC multiplicity right planes 3-8 |
| fwdm3r | FSC multiplicity right planes 9-10 |
| fwd1str | 1st FSC plane hit right |
| fwdmaxr | FSC plane with the max. hits right |
| endc_l | CMS endcap energy left |
| endc_r | CMS endcap energy right |
| barrel | CMS barrel energy |

Exploratory data analysis by the self-organizing map

The self-organizing map (SOM) is a computational method which can be used, e.g., for dimensionality reduction and data visualization

SOM conducts a nonlinear mapping from the 23-dimensional space to a two-dimensional map

Gives us a qualitative view of the data

Which event types are easily distinguished and which are overlapping

Which are the relevant features (detectors) for distinguishing certain event types
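To make the mapping concrete, here is a minimal sketch of SOM training in NumPy. It is not the implementation used for the talk: the grid size, learning-rate schedule, neighbourhood width, and the function names `train_som` and `best_matching_unit` are all illustrative assumptions.

```python
import numpy as np


def best_matching_unit(weights, x):
    """Return the (row, col) of the map node whose weight vector is closest to x."""
    distances = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(distances), distances.shape)


def train_som(data, grid_shape=(10, 10), n_iter=1000, lr0=0.5, sigma0=3.0, seed=0):
    """Train a 2D self-organizing map on data of shape (n_samples, n_features).

    All hyperparameters are illustrative defaults, not values from the talk.
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    # Grid coordinates of every node, shape (rows, cols, 2).
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5      # neighbourhood shrinks over time
        x = data[rng.integers(len(data))]        # pick one random training vector
        bmu = np.array(best_matching_unit(weights, x))
        grid_dist2 = np.sum((grid - bmu) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))   # Gaussian neighbourhood
        weights += lr * h[..., None] * (x - weights)   # pull nodes toward x
    return weights


# Toy data: two well-separated clusters in 23 dimensions (synthetic, not LHC data).
rng = np.random.default_rng(1)
cluster_a = rng.normal(loc=3.0, scale=0.3, size=(200, 23))
cluster_b = rng.normal(loc=-3.0, scale=0.3, size=(200, 23))
som = train_som(np.vstack([cluster_a, cluster_b]))
bmu_a = best_matching_unit(som, cluster_a.mean(axis=0))
bmu_b = best_matching_unit(som, cluster_b.mean(axis=0))
```

Colouring each map node by the event type that most often selects it as best-matching unit gives the kind of qualitative view described above: classes that occupy disjoint regions of the map are easy to separate, while classes sharing nodes overlap.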

Neural Networks

We use the multi-layer perceptron (MLP) network:

MLP consists of an input layer, an output layer and one or more hidden layers of neurons

1. Data vector x is fed to the input layer consisting of 23 nodes.

2. From there it propagates to the hidden layer where we apply the transfer function $f(x) = \tanh(x)$.

3. Finally it goes to the output node(s), which define the event category

$y = B f(Ax + a) + b$

The network is trained by the back-propagation algorithm to give label 1 to the signal events and 0 to the background
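The forward pass y = B f(Ax + a) + b can be sketched in a few lines of NumPy. The hidden-layer size, the four output nodes, and the random weights below are assumptions made for illustration; in practice A, a, B and b come from back-propagation training.

```python
import numpy as np


def mlp_forward(x, A, a, B, b):
    """Single-hidden-layer MLP: y = B tanh(Ax + a) + b."""
    hidden = np.tanh(A @ x + a)   # hidden-layer activations
    return B @ hidden + b         # linear output layer


rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 23, 10, 4   # 23 inputs as on the slide; the rest is hypothetical

A = rng.normal(scale=0.1, size=(n_hidden, n_in))   # input-to-hidden weights
a = np.zeros(n_hidden)                             # hidden biases
B = rng.normal(scale=0.1, size=(n_out, n_hidden))  # hidden-to-output weights
b = np.zeros(n_out)                                # output biases

x = rng.normal(size=n_in)            # one synthetic 23-dimensional data vector
y = mlp_forward(x, A, a, B, b)
predicted_class = int(np.argmax(y))  # with several output nodes, take the largest
```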

Support Vector Machines

The idea in SVM is to find a hyperplane that separates two different data samples with the largest possible margin

Support Vector Machines

Usually the data vectors are first projected into a higher dimensional space

However, we only need to define the dot product, called the kernel function $K(x, y)$, in the high-dimensional space (the kernel trick)

We use the popular radial basis function:

$K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$

Finding the hyperplane is a quadratic optimization problem
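As a small illustration of the kernel trick, the sketch below evaluates the radial basis function kernel for all pairs of vectors without ever constructing the high-dimensional space. The function name `rbf_kernel` and the value of `gamma` are assumptions; the width parameter is normally tuned on the training data.

```python
import numpy as np


def rbf_kernel(X, Y, gamma=0.1):
    """Matrix of K(x, y) = exp(-gamma * ||x - y||^2) for rows of X and Y."""
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, computed for all pairs at once.
    sq_dist = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Y ** 2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 23))   # five synthetic 23-dimensional vectors
K = rbf_kernel(X, X)           # 5 x 5 kernel (Gram) matrix
```

The quadratic optimization problem is then expressed entirely in terms of entries of this matrix.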

Gene Expression Programming

Gene Expression Programming (GEP), introduced in 2001, is an evolutionary algorithm that has similarities with genetic algorithms (GA) and genetic programming (GP)

The main idea is to mimic biological evolution to evolve a population of simple text strings called chromosomes

The chromosomes, in turn, encode complex expression trees that can be used for classification

For each generation of chromosomes we select the best individuals and apply crossover and mutation to produce the offspring
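The select / crossover / mutate loop can be sketched on plain text strings as follows. This is a generic toy, not GEP proper: the fitness function (number of characters matching a fixed target string) is a stand-in for classification performance, the chromosomes are never decoded into expression trees, and the alphabet, rates and population size are arbitrary assumptions.

```python
import random

ALPHABET = "abcQ+-*/"   # hypothetical symbol set for the toy chromosomes


def fitness(chromosome, target):
    """Toy fitness: number of positions that match the target string."""
    return sum(1 for s, t in zip(chromosome, target) if s == t)


def crossover(parent1, parent2, rng):
    """One-point crossover of two equal-length strings."""
    cut = rng.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:]


def mutate(chromosome, rng, rate=0.1):
    """Replace each symbol by a random one with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else s for s in chromosome
    )


def evolve(target, pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [
        "".join(rng.choice(ALPHABET) for _ in target) for _ in range(pop_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=lambda c: fitness(c, target), reverse=True)
        history.append(fitness(population[0], target))
        parents = population[: pop_size // 3]   # select the best individuals
        offspring = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + offspring        # parents survive unchanged
    population.sort(key=lambda c: fitness(c, target), reverse=True)
    history.append(fitness(population[0], target))
    return population[0], history


best, history = evolve("-*/aQbcaacb")
```

Because the selected parents are carried over unchanged, the best fitness in `history` never decreases from one generation to the next.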

Gene Expression Programming

Nodes of the expression trees consist of mathematical functions, input variables and random constants. E.g.

-*/aQbcaacb
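The example string can be decoded into its expression tree with a short breadth-first routine. The sketch assumes the usual GEP conventions, which the slide does not spell out: the string is read level by level from the root, each function symbol takes as many children as its arity, and `Q` denotes the square root. The function names and test values are illustrative.

```python
import math
from collections import deque

# Assumed arities: binary arithmetic operators and unary square root (Q).
ARITY = {"+": 2, "-": 2, "*": 2, "/": 2, "Q": 1}


def decode(chromosome):
    """Build the expression tree breadth-first; a node is [symbol, children]."""
    symbols = iter(chromosome)
    root = [next(symbols), []]
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node[0], 0)):   # terminals have arity 0
            child = [next(symbols), []]
            node[1].append(child)
            queue.append(child)
    return root   # symbols left unread form the unused tail of the chromosome


def to_infix(node):
    """Render the tree as a readable infix expression."""
    symbol, children = node
    if not children:
        return symbol
    if symbol == "Q":
        return f"sqrt({to_infix(children[0])})"
    return f"({to_infix(children[0])} {symbol} {to_infix(children[1])})"


def evaluate(node, variables):
    """Evaluate the tree for given values of the input variables."""
    symbol, children = node
    args = [evaluate(child, variables) for child in children]
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "-":
        return args[0] - args[1]
    if symbol == "*":
        return args[0] * args[1]
    if symbol == "/":
        return args[0] / args[1]
    if symbol == "Q":
        return math.sqrt(args[0])
    return variables[symbol]


tree = decode("-*/aQbcaacb")
expression = to_infix(tree)
value = evaluate(tree, {"a": 4.0, "b": 6.0, "c": 3.0})
```

Under these assumptions only the first eight symbols are used, giving the expression `((a * sqrt(a)) - (b / c))`; the trailing `acb` is never read.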

Advantages of GEP are

Every chromosome encodes a valid expression tree => efficiency

It is not a black box in the same way as the NN

We get an idea of which are the important variables (detectors)

Mimics the natural evolution more consistently: chromosome <-> genotype, expression tree <-> phenotype

Multi-Class Classification

Ordered binarization

We train the following classifiers: ND vs. {CD, SD, DD}, CD vs. {SD, DD} and SD vs. DD

An event is fed to these classifiers one by one in the same order until one classifier outputs label 1

Gives good results in case some events are easily distinguished (in our case the ND events)

For the MLP network we can use several output nodes and see which one gives the largest value
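The ordered binarization cascade can be sketched as a loop over binary classifiers. The classifiers below are hypothetical stand-ins that simply threshold a precomputed score stored in the event; in the analysis each would be a trained NN, SVM or GEP classifier.

```python
def ordered_binarization(event, cascade, default_label):
    """Feed the event to the classifiers in order; the first to output 1 decides."""
    for label, classifier in cascade:
        if classifier(event) == 1:
            return label
    return default_label   # no classifier fired, so only the last class remains


# Hypothetical stand-in classifiers: each thresholds a score carried by the event.
def nd_vs_rest(event):
    return int(event["nd_score"] > 0.5)   # ND vs. {CD, SD, DD}


def cd_vs_sd_dd(event):
    return int(event["cd_score"] > 0.5)   # CD vs. {SD, DD}


def sd_vs_dd(event):
    return int(event["sd_score"] > 0.5)   # SD vs. DD


cascade = [("ND", nd_vs_rest), ("CD", cd_vs_sd_dd), ("SD", sd_vs_dd)]

events = [
    {"nd_score": 0.9, "cd_score": 0.8, "sd_score": 0.1},
    {"nd_score": 0.1, "cd_score": 0.7, "sd_score": 0.9},
    {"nd_score": 0.2, "cd_score": 0.3, "sd_score": 0.6},
    {"nd_score": 0.0, "cd_score": 0.1, "sd_score": 0.2},
]
labels = [ordered_binarization(e, cascade, default_label="DD") for e in events]
```

Placing the most easily distinguished class first means later classifiers see it only when the first one has failed to fire.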

Results

Average efficiencies of different algorithms (left) and the performance of the NN ordered binarization (right).

| Method | <Efficiency> (%) |
|---|---|
| GEP | 92.49 |
| SVM | 94.21 |
| NN | 94.54 |

| Real \ Pred | DD | SD | CD | ND |
|---|---|---|---|---|
| DD | 87.60 | 12.05 | 0.35 | 0.00 |
| SD | 2.15 | 95.20 | 2.58 | 0.07 |
| CD | 0.00 | 4.25 | 95.75 | 0.00 |
| ND | 0.15 | 0.25 | 0.00 | 99.60 |
| Purities | 97.44 | 85.19 | 97.03 | 99.93 |

The results have been obtained by optimizing the total accuracy (the probability that an event of a random category is classified correctly)
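The efficiencies and purities follow directly from the confusion matrix. The sketch below assumes that each row gives the percentage of true events of one class assigned to each predicted class, and that the four classes carry equal weight; under those assumptions it reproduces the quoted NN average efficiency and the purity row.

```python
import numpy as np

classes = ["DD", "SD", "CD", "ND"]

# Rows: true class, columns: predicted class, values in percent (from the table).
confusion = np.array([
    [87.60, 12.05, 0.35, 0.00],
    [2.15, 95.20, 2.58, 0.07],
    [0.00, 4.25, 95.75, 0.00],
    [0.15, 0.25, 0.00, 99.60],
])

# Efficiency: fraction of true events of a class that are labelled correctly.
efficiencies = np.diag(confusion)

# Purity: fraction of events labelled as a class that truly belong to it
# (equal class weights assumed).
purities = 100.0 * np.diag(confusion) / confusion.sum(axis=0)

average_efficiency = efficiencies.mean()

for name, eff, pur in zip(classes, efficiencies, purities):
    print(f"{name}: efficiency {eff:.2f} %, purity {pur:.2f} %")
print(f"average efficiency {average_efficiency:.2f} %")
```

The SD purity is the lowest because 12.05 % of the DD events are labelled SD, which inflates the SD column.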

Practical Tips for Data Analysis

1. Visualize with the self-organizing map

2. Normalize the data

3. Know your goals: do you want a high efficiency or a high purity?
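For tip 2, one common choice is to standardize every variable to zero mean and unit variance using statistics from the training set only. The slides do not say which normalization was used, so this is an assumed scheme and `standardize` is an illustrative name.

```python
import numpy as np


def standardize(train, test):
    """Scale each column to zero mean and unit variance using training statistics."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std = np.where(std > 0, std, 1.0)   # leave constant columns unscaled
    return (train - mean) / std, (test - mean) / std


rng = np.random.default_rng(0)
# Synthetic data: 23 variables with very different scales (energies vs. multiplicities).
scales = np.logspace(0, 3, 23)
train = rng.normal(size=(1000, 23)) * scales + scales
test = rng.normal(size=(200, 23)) * scales + scales

train_norm, test_norm = standardize(train, test)
```

Without such scaling, distance-based methods like the SOM and the RBF kernel are dominated by whichever variables have the largest numerical range.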
