PhD Day September 14, 2012




Doing statistics with homonuclear 2D-NMR spectra: handling and preliminary study of their repeatability



Baptiste FERAUD


Bernadette GOVAERTS (UCL, ISBA)


Michel VERLEYSEN (UCL, MLG)


PhD Day, September 14, 2012

Baptiste Feraud - UCL - ISBA / Machine Learning Group

OUTLINE



WHAT?
- Some definitions for a good start (metabolomics, 1D and 2D-NMR experiments)

WHY?
- Why use two-dimensional tools instead of « traditional » 1D spectra: benefits from a user's point of view

HOW?
- Statistics: how to handle 2D-NMR data and spectra?
- Example from a first 2D-COSY experimental design

NEED STATISTICAL GUARANTEES?
- A rigorous study of the repeatability and robustness of 2D-NMR tools is needed: clustering approaches and preliminary results


WHAT?

Metabolomics is the scientific study of chemical processes involving metabolites. Specifically, it is the systematic study of the unique chemical fingerprints that specific cellular processes leave behind.

Metabonomics is the study of biological responses to a stressor (drug, disease) at the level of metabolites.

Applications: pharmacology, pre-clinical drug trials, toxicology, newborn screening, clinical chemistry, quality control of food and medicinal plants, …

Data acquisition: Nuclear Magnetic Resonance (NMR) spectroscopy vs. mass spectrometry (mass-to-charge ratio)

1D-NMR (see Réjane Rousseau's thesis, 2011) vs. 2D-NMR



1D: mainly 1H-NMR (proton NMR or hydrogen-1 NMR) and carbon-13 NMR

2D (more recently):

Homonuclear experiments:

- COSY (COrrelation SpectroscopY): the first method for determining which signals arise from neighboring protons (usually up to four bonds). Correlations appear when there is spin-spin coupling between protons (i.e. a correlation between two or more nearby chemical processes).

- TOCSY (TOtal Correlation SpectroscopY): creates correlations between all protons within a given spin system, not just between geminal or vicinal protons as in COSY. Magnetization is transferred successively as long as successive protons are coupled, and the transfer is interrupted by small or zero proton-proton couplings.




- NOESY (Nuclear Overhauser Effect SpectroscopY): useful for determining which signals arise from protons that are close to each other in space even if they are not bonded. A NOESY spectrum yields through-space correlations.

(…)

Heteronuclear experiments:

Heteronuclear correlation is used to assign the spectrum of another nucleus once the spectrum of one nucleus is known. For small molecules, 1H is usually correlated with 13C, while for biomolecules 1H is also commonly correlated with 15N (HSQC, Heteronuclear Single Quantum Coherence).



SOME GRAPHICS…


WHY?

Biomarker? Or biomarkers?

1D protein spectra are often far too complex for interpretation:
- signals overlap heavily
- resonances are ambiguous or overlapping

An additional spectral dimension = extra information (obviously):
- it separates the contributions made by individual resonances
- it allows the analysis and quantification of off-diagonal peaks!

QUESTION: extra information = relevant information??


HOW?

Let's start with a first 1D and 2D COSY experimental design:

M1    M2    M3    M4

4 mixtures = 4 cell culture systems containing various metabolites (fetal bovine serum, GlutaMAX, amino acids, vitamins, inorganic salts, proteins, …)

Expected: M1, M2 and M4 quite close

(Data provided by Pascal de Tullio, Pharmaceutical Chemistry, ULg)


Sampling: 3 samples per mixture


Time: 3 repetitions per sample
- Samples are subject to freezing and thawing.
- Risks: degradation and bacterial contamination because of the duration of the 2D analysis.
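The experimental design described on these slides (4 mixtures, 3 samples per mixture, 3 time repetitions per sample) can be enumerated in a few lines. This is only an illustrative sketch: the labels and the tuple layout are assumptions, not the authors' notation.

```python
from itertools import product

# Factor levels taken from the slides; the label format is an assumption.
MIXTURES = ["M1", "M2", "M3", "M4"]
SAMPLES = [1, 2, 3]        # 3 samples per mixture
REPETITIONS = [1, 2, 3]    # 3 time repetitions per sample


def build_design():
    """Return one (mixture, sample, repetition) tuple per acquired spectrum."""
    return list(product(MIXTURES, SAMPLES, REPETITIONS))


design = build_design()
print(len(design))  # 4 * 3 * 3 acquisitions
```

The full crossing gives 4 × 3 × 3 = 36 acquisitions, which matches the 36 spectra reported in the talk.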


36 measurements = 36 spectra = 36 peak lists

From individual peak list … to global peak list

Individual peak list (all points in a specific spectrum):

C1    C2    INT
…     …     …

Global peak list:

C1    C2    INT1    P1    INT2    P2    …
…     …     +       1     0       0     …
…     …     0       0     +       1     …
…     …     +       1     +       1     …

The global peak list includes all pairs of coordinates (C1, C2) that appear in at least one of the 36 spectra.

INT: intensity vectors
P: position vectors (binary)
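The passage from individual peak lists to a global peak list can be sketched as follows. This is a minimal illustration under stated assumptions: each spectrum is represented as a dictionary mapping a coordinate pair (C1, C2) to an intensity, absent peaks get intensity 0 and position 0, and the function name `build_global_peak_list` is hypothetical. The toy values are made up and are not data from the study.

```python
def build_global_peak_list(peak_lists):
    """Merge individual peak lists into one global table.

    peak_lists: list of dicts mapping (c1, c2) -> intensity, one per spectrum.
    Returns (coords, INT, P) where coords is the sorted union of coordinate
    pairs, and INT[s][r] / P[s][r] give the intensity / presence flag of
    coordinate pair r in spectrum s.
    """
    # Union of all coordinate pairs seen in at least one spectrum.
    coords = sorted(set().union(*[pl.keys() for pl in peak_lists]))
    # Intensity vector per spectrum (0.0 where the peak is absent: assumption).
    INT = [[pl.get(c, 0.0) for c in coords] for pl in peak_lists]
    # Binary position vector per spectrum (1 = peak present, 0 = absent).
    P = [[1 if c in pl else 0 for c in coords] for pl in peak_lists]
    return coords, INT, P


# Two toy spectra (made-up values).
spectrum_1 = {(1.2, 3.4): 10.0, (2.0, 2.0): 5.0}
spectrum_2 = {(2.0, 2.0): 7.0, (4.1, 0.9): 3.0}
coords, INT, P = build_global_peak_list([spectrum_1, spectrum_2])
```

With real data, the same call would take the 36 individual peak lists and return 36 intensity vectors and 36 binary position vectors defined on a common set of coordinates.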

REPEATABILITY?

As for 1D tools, we need to verify the statistical performance and reliability of 2D data and spectra.

Some pre-processing:

- Symmetrisation: by removing negative intensities (or intensities too close to zero), which result from an inappropriate choice of baseline.

- Bucketing: by controlling the size of the database (via the chosen number of decimals of the coordinates).
    One decimal: 909 × 74
    Two decimals: 2348 × 74
    Three decimals: 3250 × 74

- Detection of outliers among spectra via the intensity vectors.
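The first two pre-processing steps above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the function name `preprocess`, the `min_intensity` threshold, and the choice to sum intensities that fall into the same bucket are all assumptions. Note that a table with 74 columns would be consistent with two coordinate columns plus one (INT, P) pair for each of the 36 spectra (2 + 2 × 36 = 74), although the slides do not state this explicitly.

```python
from collections import defaultdict


def preprocess(peaks, decimals=1, min_intensity=0.0):
    """Drop non-positive (or too small) intensities, then bucket by rounding.

    peaks: iterable of (c1, c2, intensity) tuples for one spectrum.
    decimals: number of decimals kept for the coordinates (bucket size).
    min_intensity: hypothetical threshold; peaks at or below it are discarded.
    Intensities falling into the same bucket are summed (an assumption).
    """
    buckets = defaultdict(float)
    for c1, c2, intensity in peaks:
        if intensity <= min_intensity:
            continue  # negative or near-zero intensity: removed
        buckets[(round(c1, decimals), round(c2, decimals))] += intensity
    return dict(buckets)


# Toy peak list (made-up values).
peaks = [(1.23, 3.41, 4.0), (1.18, 3.44, 6.0), (2.50, 2.50, -1.5), (0.71, 4.00, 2.0)]
bucketed = preprocess(peaks, decimals=1)
```

Keeping fewer decimals merges neighboring coordinates into the same bucket, which is why the number of rows reported on the slide shrinks from 3250 (three decimals) to 909 (one decimal).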




REPEATABILITY?

An intuitive way to evaluate the repeatability / reproducibility of 2D spectra is unsupervised (blind) multivariate clustering.

If we manage to separate and recover our 4 mixtures starting from the 36 spectra, we are done!

1) Clustering on position vectors

- Need specific distances or similarity measures adapted to binary vectors, such as Ochiai, Dice, Jaccard, Russell-Rao, Kulczynski

- Ward and K-means algorithms
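The binary similarity measures named above can be written from the usual 2 × 2 contingency counts of two position vectors: a (peak present in both), b (present in the first only), c (present in the second only) and d (absent from both). The sketch below uses the standard textbook definitions. For Kulczynski, it assumes the second Kulczynski coefficient, since the slides do not say which variant was used. It also assumes each vector contains at least one peak, otherwise some ratios are undefined.

```python
from math import sqrt


def counts(x, y):
    """Return (a, b, c, d): co-presences, x-only, y-only, co-absences."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)
    c = sum(1 for xi, yi in zip(x, y) if yi and not xi)
    d = len(x) - a - b - c
    return a, b, c, d


def ochiai(x, y):
    a, b, c, _ = counts(x, y)
    return a / sqrt((a + b) * (a + c))


def dice(x, y):
    a, b, c, _ = counts(x, y)
    return 2 * a / (2 * a + b + c)


def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return a / (a + b + c)


def russell_rao(x, y):
    a, b, c, d = counts(x, y)
    return a / (a + b + c + d)


def kulczynski(x, y):
    # Second Kulczynski coefficient (assumed variant).
    a, b, c, _ = counts(x, y)
    return 0.5 * (a / (a + b) + a / (a + c))


def to_distance(similarity):
    """Turn a similarity in [0, 1] into a dissimilarity (one common choice)."""
    return 1.0 - similarity


# Toy position vectors (made-up values).
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]
print(ochiai(x, y), dice(x, y), jaccard(x, y), russell_rao(x, y), kulczynski(x, y))
```

A matrix of such dissimilarities between the 36 position vectors could then be passed to a hierarchical clustering routine. Ward's criterion is formally defined for Euclidean distances, so combining it with these measures is a pragmatic choice rather than a guaranteed-optimal one.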


Example of result (Ochiai-Ward, 2 decimals)

In the vast majority of cases, we can already isolate mixture 3.



2) Clustering on intensity vectors

- Normalization of each vector such that its sum = 1

- Euclidean distance

- Ward and K-means algorithms
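The normalization and distance steps listed above can be sketched as follows. This is a minimal illustration with made-up intensity vectors. The function names are hypothetical, and raising an error on an all-zero vector is an assumption about how such a case should be handled.

```python
from math import sqrt


def normalize(vector):
    """Scale a non-negative intensity vector so that its entries sum to 1."""
    total = sum(vector)
    if total == 0:
        raise ValueError("cannot normalize an all-zero intensity vector")
    return [v / total for v in vector]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Pairwise Euclidean distances between sum-normalized vectors."""
    normalized = [normalize(v) for v in vectors]
    return [[euclidean(u, v) for v in normalized] for u in normalized]


# Toy intensity vectors (made-up values): the first two differ only by scale.
INT = [[2.0, 2.0, 0.0], [10.0, 10.0, 0.0], [0.0, 1.0, 3.0]]
D = distance_matrix(INT)
```

Normalizing to unit sum removes global scale differences between spectra, so two spectra with the same relative peak pattern end up at distance zero. The normalized vectors could then be fed to a Ward hierarchical clustering or a K-means routine, for example SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`, which is one possible tool and not necessarily the one used in the study.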





RESULTS:

- Generally, all mixtures are well recovered by the algorithms, in spite of the sampling procedure and time repetitions!

- Best result obtained with the one-decimal matrix (which shows the value of bucketing): just one error!





Example of result (Ward, 1 decimal)


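Validation: example of K-means

Number of clusters: from 2 to 6

Validation measure: Dunn index (ratio between the minimal inter-cluster distance and the maximal intra-cluster distance).

$$DI_m = \min_{1 \le i \le m} \left\{ \min_{1 \le j \le m,\ j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k} \right\} \right\}$$

where m is the number of clusters, \delta(C_i, C_j) is the distance between clusters C_i and C_j, and \Delta_k is the intra-cluster distance of cluster C_k.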
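The Dunn index can be computed directly from a set of points and their cluster labels. The sketch below is illustrative only and makes two assumptions that the slides do not specify: the inter-cluster distance is the smallest distance between two points of different clusters, and the intra-cluster distance is the largest distance between two points of the same cluster (its diameter). It also assumes at least one cluster contains two or more points.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dunn_index(points, labels, dist=euclidean):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter."""
    min_inter = float("inf")
    max_intra = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(points[i], points[j])
            if labels[i] == labels[j]:
                max_intra = max(max_intra, d)   # same cluster: diameter
            else:
                min_inter = min(min_inter, d)   # different clusters: separation
    return min_inter / max_intra


# Toy example (made-up points): two well-separated pairs.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])   # compact, well-separated clusters
bad = dunn_index(points, [0, 1, 0, 1])    # stretched, interleaved clusters
```

A higher value indicates more compact and better separated clusters, so comparing the index across 2 to 6 clusters gives a simple way to check whether 4 clusters (the 4 mixtures) is the best-supported partition.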

































3) 2D vs. 1D (current work)

Warning: be very careful to compare only what is objectively comparable! This implies the same pre-processing procedures in the 1D and 2D cases (very hard).

But we can:
- eliminate negative intensities,
- apply the same standards to the intensities,
- use the same number of decimals,
- remove outliers (PCA),
- choose a resolution proportional or equal to that of the 2D horizontal axis, etc.

By doing this, we can already see that the repeatability can be better in 2D than in 1D!


1D clustering (Ward)


CONCLUSION

It is commonly accepted by users (biologists, pharmacologists, healthcare professionals) that the recent introduction of 2D-NMR methods represents a huge qualitative leap for metabolomic investigations. For them, it is obvious and natural that more information = more power.

BUT, for the moment, no statistical study has clearly proved this.

So we are trying to fill this gap. We are working to show, with encouraging first results, that 2D-NMR tools (COSY, to begin with) are statistically robust tools and, moreover, that the 2D-COSY experiment seems to be more repeatable and reliable than the corresponding 1D methods!


Perspectives:
- go further into 1D vs. 2D comparisons
- improve 2D data pre-processing
- apply the same procedures to NOESY and heteronuclear methods (same conclusions?)
- implement supervised classification methods (such as SVM, Lasso) in order to make predictions and to identify discriminating zones (biomarkers)
- work with « challenging » real datasets (disease, drug)





THANK YOU FOR YOUR ATTENTION