Doing statistics with homonuclear 2D NMR spectra: handling and preliminary study of their repeatability
Baptiste FERAUD
Bernadette GOVAERTS (UCL, ISBA) – Michel VERLEYSEN (UCL, MLG)
PhD Day, September 14, 2012
Baptiste Feraud - UCL - ISBA / Machine Learning Group
OUTLINE
WHAT ?
Some definitions to get started (metabolomics, 1D and 2D NMR experiments)
WHY ?
Why use two-dimensional tools instead of « traditional » 1D spectra: benefits from a user's point of view
HOW ?
Statistics: how to handle 2D NMR data and spectra? Example from a first 2D COSY experimental design
NEED STATISTICAL GUARANTEES ?
A rigorous study of the repeatability and robustness of 2D NMR tools is needed: clustering approaches and preliminary results
WHAT ?
Metabolomics is the scientific study of chemical processes involving metabolites. Specifically, it is the systematic study of the unique chemical fingerprints that specific cellular processes leave behind.
Metabonomics is the study of biological responses to a stressor (drug, disease, …) at the level of metabolites.
Applications: pharmacology, pre-clinical drug trials, toxicology, newborn screening, clinical chemistry, quality control of food and medicinal plants, …
Data acquisition: Nuclear Magnetic Resonance (NMR) spectroscopy vs. mass spectrometry (mass-to-charge ratio)
1D NMR (see Réjane Rousseau's thesis, 2011) vs. 2D NMR
1D: mainly 1H NMR (proton NMR, or hydrogen-1 NMR) and carbon-13 NMR
2D (more recent):
• Homonuclear experiments:
  - COSY (COrrelation SpectroscopY): the first method for determining which signals arise from neighboring protons (usually up to four bonds apart). Correlations appear when there is spin-spin coupling between protons (i.e. correlation between two or more nearby chemical processes).
  - TOCSY (TOtal Correlation SpectroscopY): creates correlations between all protons within a given spin system, not just between geminal or vicinal protons as in COSY. Magnetization is transferred successively as long as successive protons are coupled, and the transfer is interrupted by small or zero proton-proton couplings.
  - NOESY (Nuclear Overhauser Effect SpectroscopY): useful for determining which signals arise from protons that are close to each other in space even if they are not bonded. A NOESY spectrum yields through-space correlations.
  (…)
• Heteronuclear experiments: heteronuclear correlation is used to assign the spectrum of another nucleus once the spectrum of one nucleus is known. For small molecules, 1H is usually correlated with 13C, while for biomolecules 1H is also commonly correlated with 15N (HSQC, for Heteronuclear Single Quantum Coherence).
SOME GRAPHICS…
WHY ?
Biomarker? Or biomarkers?
1D protein spectra are often far too complex for interpretation:
• Signals overlap heavily
• Ambiguous or overlapping resonances
• …
Additional spectral dimension = extra information (obvious):
• separate the contributions made by individual resonances
• analysis and quantification of off-diagonal peaks!
QUESTION: extra information = relevant information??
HOW ?
Let's start with a first 1D and 2D COSY experimental design:
M1, M2, M3, M4
4 mixtures = 4 cell culture systems containing various metabolites (fetal bovine serum, glutamax, amino acids, vitamins, inorganic salts, proteins, …)
Expected: M1, M2 and M4 quite close
(Data provided by Pascal de Tullio, Pharmaceutical Chemistry, ULg)
Sampling: 3 samples per mixture
Time: 3 repetitions per sample
• Samples are subject to freezing and thawing.
• Risks: degradation and bacterial contamination because of the duration of the 2D analysis.
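The design above (4 mixtures, 3 samples per mixture, 3 time repetitions per sample) can be enumerated as a small sketch. The field names and the numbering of samples and repetitions are illustrative assumptions; only the counts come from the slides.

```python
from itertools import product

# Labels and numbering are illustrative assumptions; only the counts
# (4 mixtures x 3 samples x 3 repetitions) come from the slides.
mixtures = ["M1", "M2", "M3", "M4"]
samples = [1, 2, 3]       # 3 samples per mixture
repetitions = [1, 2, 3]   # 3 time repetitions per sample

# One entry per measured spectrum.
design = [
    {"mixture": m, "sample": s, "repetition": r}
    for m, s, r in product(mixtures, samples, repetitions)
]

print(len(design))  # number of spectra in the full design
```

Each entry corresponds to one spectrum, so the full design yields 4 × 3 × 3 = 36 spectra.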
36 measurements = 36 spectra = 36 peak lists
From individual peak lists … to a global peak list

Individual peak list (all points in a specific spectrum):
C1    C2    INT
…     …     …

Global peak list (includes all pairs of coordinates that appear in at least one of the 36 spectra):
C1    C2    INT1    P1    INT2    P2    …
…     …     +       1     0       0     …
…     …     0       0     +       1     …
…     …     +       1     +       1     …
…     …     …       …     …       …     …

INT: intensity vectors (+ denotes a non-zero intensity)
P: position vectors (binary: 1 if the peak is present in the spectrum, 0 otherwise)
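A minimal sketch of how such a global peak list could be assembled, assuming each individual peak list is held as a dictionary mapping a coordinate pair (C1, C2) to an intensity. The function name and data layout are hypothetical, not the authors' implementation.

```python
def build_global_peak_list(peak_lists):
    """Merge individual peak lists into one global table.

    peak_lists: one dict per spectrum, mapping (c1, c2) -> intensity
    (a hypothetical layout chosen for this sketch).
    Returns rows of the form [c1, c2, INT1, P1, INT2, P2, ...].
    """
    # Union of all coordinate pairs seen in at least one spectrum.
    coords = sorted(set().union(*(pl.keys() for pl in peak_lists)))
    rows = []
    for c in coords:
        row = [c[0], c[1]]
        for pl in peak_lists:
            row.append(pl.get(c, 0.0))       # INT: intensity, 0 if the peak is absent
            row.append(1 if c in pl else 0)  # P: binary presence indicator
        rows.append(row)
    return rows


# Toy example with two spectra (made-up values).
spectra = [
    {(1.2, 3.4): 5.0, (2.0, 2.0): 1.5},
    {(2.0, 2.0): 2.5},
]
table = build_global_peak_list(spectra)
for row in table:
    print(row)
```

With this layout a table built from 36 spectra has 2 + 2 × 36 = 74 columns, which would be consistent with the 74 columns reported for the bucketed matrices elsewhere in the slides, although the slides do not state this explicitly.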
REPEATABILITY ?
As for 1D tools, we need to verify the statistical performance and reliability of 2D data and spectra.
Some pre-processing:
• Symmetrisation: by removing negative intensities (or those too close to zero), which result from an inappropriate choice of baseline.
• Bucketing: by controlling the size of the database (via the chosen number of decimals of the coordinates).
  One decimal → (909 × 74)
  Two decimals → (2348 × 74)
  Three decimals → (3250 × 74)
• Detection of outliers among spectra via the intensity vectors.
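The first two pre-processing steps can be illustrated with a short sketch. It assumes that bucketing means rounding both coordinates to the chosen number of decimals and that intensities falling into the same bucket are summed; the slides do not specify the aggregation rule, so the sum, the `threshold` parameter and the function name are assumptions.

```python
from collections import defaultdict


def bucket_peaks(peaks, decimals, threshold=0.0):
    """Bucket one peak list by rounding its coordinates.

    peaks: iterable of (c1, c2, intensity) tuples.
    Intensities <= threshold are dropped (negative or too close to zero).
    Returns a dict mapping (rounded c1, rounded c2) -> summed intensity.
    """
    buckets = defaultdict(float)
    for c1, c2, intensity in peaks:
        if intensity <= threshold:
            continue  # remove negative or near-zero intensities
        key = (round(c1, decimals), round(c2, decimals))
        buckets[key] += intensity  # assumption: sum the intensities within a bucket
    return dict(buckets)


# Toy peak list (made-up values).
peaks = [(1.21, 3.41, 2.0), (1.24, 3.43, 1.0), (1.28, 3.41, 4.0), (0.5, 0.5, -1.0)]
one_decimal = bucket_peaks(peaks, decimals=1)
two_decimals = bucket_peaks(peaks, decimals=2)
print(len(one_decimal), len(two_decimals))
```

Fewer decimals merge more neighboring peaks into the same bucket, which is why the number of rows grows from the one-decimal matrix to the three-decimal matrix.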
REPEATABILITY ?
An intuitive way to evaluate the repeatability / reproducibility of 2D spectra is unsupervised (blind) multivariate clustering.
If we manage to separate and recover our 4 mixtures starting from the 36 spectra → Done!
1) Clustering on position vectors
• Need specific distances or similarity measures adapted to binary vectors, such as Ochiai, Dice, Jaccard, Russell-Rao, Kulczynski…
• Ward and K-means algorithms
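As an illustration, four of the similarity measures listed above can be written from the usual 2 × 2 contingency counts of two binary vectors. The notation a, b, c, d is an assumption of this sketch (a = both 1, b = first only, c = second only, d = both 0), and Kulczynski is left out because several variants exist and the slides do not say which one was used.

```python
from math import sqrt


def counts(x, y):
    """Contingency counts (a, b, c, d) of two equal-length binary vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi and yi)      # present in both
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)  # present in x only
    c = sum(1 for xi, yi in zip(x, y) if not xi and yi)  # present in y only
    d = len(x) - a - b - c                               # absent from both
    return a, b, c, d


def _ratio(num, den):
    # Return 0.0 when the denominator is zero (e.g. two all-zero vectors).
    return num / den if den else 0.0


def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return _ratio(a, a + b + c)


def dice(x, y):
    a, b, c, _ = counts(x, y)
    return _ratio(2 * a, 2 * a + b + c)


def ochiai(x, y):
    a, b, c, _ = counts(x, y)
    return _ratio(a, sqrt((a + b) * (a + c)))


def russell_rao(x, y):
    a, b, c, d = counts(x, y)
    return _ratio(a, a + b + c + d)


# Two toy position vectors (made-up values).
p1 = [1, 1, 0, 0]
p2 = [1, 0, 1, 0]
print(jaccard(p1, p2), dice(p1, p2), ochiai(p1, p2), russell_rao(p1, p2))
```

A dissimilarity for clustering can then be taken as 1 minus the similarity, computed for every pair of the 36 position vectors.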
Example of result (Ochiai-Ward, 2 decimals)
In the vast majority of cases, we can already isolate mixture 3.
2) Clustering on intensity vectors
• Normalization of each vector such that sum = 1
• Euclidean distance
• Ward and K-means algorithms
RESULTS:
→ Generally, all mixtures are well recovered by the algorithms, in spite of the sampling procedure and time repetitions!
→ Best result obtained with the one-decimal matrix (which shows the value of bucketing): just one error!
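The steps above (normalization to unit sum, Euclidean distance, K-means) can be sketched as follows. This is a plain Lloyd-style K-means with a deterministic farthest-first initialization, chosen here only to make the example reproducible; the slides do not say which implementation or initialization was used, and the toy data are made up.

```python
from math import dist  # Euclidean distance, Python 3.8+


def normalize(v):
    """Scale an intensity vector so that its entries sum to 1."""
    total = sum(v)
    return [x / total for x in v]


def kmeans(points, k, n_iter=100):
    """Basic K-means with Euclidean distance; returns one label per point."""
    # Farthest-first initialization (an assumption, for reproducibility).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest center for each point.
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy intensity vectors: two spectra from one "mixture", two from another.
raw = [[8, 1, 1], [7, 2, 1], [1, 1, 8], [1, 2, 7]]
vectors = [normalize(v) for v in raw]
labels = kmeans(vectors, k=2)
print(labels)
```

In the setting of the slides, the input would be the 36 normalized intensity vectors and k = 4, and the recovered labels would be compared with the known mixtures.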
Example of result (Ward, 1 decimal)
Validation: example of K-means
Number of clusters: from 2 to 6
Validation measure: Dunn index (ratio between the minimal inter-cluster distance and the maximal intra-cluster distance).
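\[
DI_m = \min_{1 \le i \le m} \left\{ \min_{1 \le j \le m,\ j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k} \right\} \right\}
\]
where m is the number of clusters, \delta(C_i, C_j) is the distance between clusters C_i and C_j, and \Delta_k is the intra-cluster distance of cluster C_k.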
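A minimal sketch of the Dunn index, assuming the inter-cluster distance is the smallest Euclidean distance between points of two different clusters and the intra-cluster distance is the cluster diameter (largest distance between two of its points). Other definitions of both quantities exist, and the slides do not specify which were used.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def dunn_index(points, labels):
    """Dunn index: minimal inter-cluster distance / maximal cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    # Smallest distance between two points belonging to different clusters.
    min_inter = min(
        dist(p, q) for g, h in combinations(groups, 2) for p in g for q in h
    )
    # Largest diameter over all clusters (0 if every cluster is a single point).
    max_intra = max(
        (dist(p, q) for g in groups for p, q in combinations(g, 2)), default=0.0
    )
    return min_inter / max_intra if max_intra > 0 else float("inf")


# Toy one-dimensional example (made-up values).
points = [[0.0], [1.0], [10.0], [11.0]]
good = dunn_index(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = dunn_index(points, [0, 1, 0, 1])   # clusters mixed together
print(good, bad)
```

A higher value indicates compact, well-separated clusters, so one would compute the index for each number of clusters from 2 to 6 and look for the largest value.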
3) 2D vs. 1D (current work)
Warning: be very careful to compare only what is objectively comparable!
This implies the same pre-processing procedures in the 1D and 2D cases (very hard…).
But we can:
• eliminate negative intensities,
• apply the same standardization to the intensities,
• use the same number of decimals,
• remove outliers (PCA),
• choose a resolution proportional or equal to that of the 2D horizontal axis, etc.
By doing this, we can already see that repeatability can be better in 2D than in 1D!
1D clustering (Ward)
CONCLUSION
It is commonly accepted by users (biologists, pharmacologists, healthcare professionals…) that the recent introduction of 2D NMR methods represents a huge qualitative leap for metabolomic investigations. For them, it is obvious and natural that more information = more power.
BUT… for the moment, no statistical study has clearly proved this…
So we are trying to fill this gap.
We are working to show, with encouraging first results, that 2D NMR tools (at first, COSY) are statistically robust and, moreover, that the 2D COSY experiment seems to be more repeatable and reliable than the corresponding 1D methods!
Perspectives:
► go further into 1D vs. 2D comparisons
► improve 2D data pre-processing
► apply the same procedures to NOESY and heteronuclear methods (same conclusions?)
► implement supervised classification methods (such as SVM, Lasso…) in order to make predictions and to identify discriminating zones (biomarkers)
► work with « challenging » real datasets (disease, drug…)
THANK YOU FOR YOUR ATTENTION