Doing statistics with homonuclear 2D-NMR spectra :
handling and preliminary study of their
repeatability
Baptiste FERAUD
Bernadette GOVAERTS (UCL, ISBA) – Michel VERLEYSEN (UCL, MLG)
PhD Day
September 14, 2012
OUTLINE
• WHAT ?
Some definitions for a good start (metabolomics, 1D and 2D-NMR
experiments)
• WHY ?
Why use two-dimensional tools instead of « traditional » 1D spectra :
benefits from a user's point of view
• HOW ?
Statistics : how to handle 2D-NMR data and spectra ?
Example from a first 2D-COSY experimental design
• NEED STATISTICAL GUARANTEES ?
A rigorous study of the repeatability and robustness of 2D-NMR tools is
needed : clustering approaches and preliminary results
Baptiste Feraud - UCL - ISBA / Machine Learning Group
WHAT ?
Metabolomics is the scientific study of chemical processes involving
metabolites. Specifically, it represents the systematic study of the unique
chemical fingerprints that specific cellular processes leave behind.
Metabonomics is the study of biological responses to a stressor (drug,
disease…) at the level of metabolites.
Applications : pharmacology, pre-clinical drug trials, toxicology, newborn
screening, clinical chemistry, food and medicinal plants quality control, …
Data acquisition : Nuclear Magnetic Resonance (NMR) spectroscopy
vs. mass spectrometry (mass-to-charge ratio)
1D-NMR (see Réjane Rousseau’s thesis, 2011) vs. 2D-NMR
1D : Mainly 1H-NMR (Proton NMR or Hydrogen-1 NMR) and Carbon-13 NMR
2D (more recently) :
• Homonuclear experiments :
- COSY (COrrelated SpectroscopY) : first method for determining
which signals arise from neighboring protons (usually up to four bonds).
Correlations appear when there is spin-spin coupling between protons (i.e.
correlation between two or more nearby chemical processes).
- TOCSY (TOtal Correlation SpectroscopY) : creates correlations
between all protons within a given spin system, not just between geminal
or vicinal protons as in COSY. Magnetization is transferred successively as
long as successive protons are coupled, and is interrupted by small or zero
proton-proton couplings.
- NOESY (Nuclear Overhauser Effect SpectroscopY) : useful for
determining which signals arise from protons that are close to each other in
space even if they are not bonded. A NOESY spectrum yields through-space
correlations.
(…)
• Heteronuclear experiments :
Heteronuclear correlation is used to assign the spectrum of another nucleus
once the spectrum of one nucleus is known. For small molecules, 1H is
usually correlated with 13C while for biomolecules, 1H is also commonly
correlated to 15N (HSQC for Heteronuclear Single Quantum Coherence).
SOME GRAPHICS…
WHY ?
biomarker? or biomarkers?
1D protein spectra are often far too
complex for interpretation
• Signals overlap heavily
• Ambiguous or overlapping resonances
• …
Additional spectral dimension = extra information (obvious)
• separate the contributions made by individual resonances
• analysis and quantification of off-diagonal peaks !
QUESTION : extra information = relevant information ??
HOW ?
Let’s start with a first 1D and 2D COSY experimental design :
M1
M2
M3
M4
4 mixtures = 4 cell culture systems containing various metabolites
(fetal bovine serum, glutamax, amino acids, vitamins, inorganic salts,
proteins, …)
Expected : M1, M2 and M4 should be quite close
(Data provided by Pascal de Tullio, Pharmaceutical Chemistry, ULg)
Sampling : 3 samples per mixture
Time : 3 repetitions per sample
- Samples are subjected to freezing and thawing.
- Risks : degradation and bacterial contamination because of the duration of
the 2D analysis.
36 measurements (4 mixtures × 3 samples × 3 repetitions) = 36 spectra = 36 peak lists
From individual peak list … (all points in a specific spectrum)

C1    C2    INT
…     …     …

… to global peak list (includes all pairs of coordinates that appear in at
least one of the 36 spectra)

C1    C2    INT1    P1    INT2    P2    …
…     …     +       1     0       0     …
…     …     0       0     +       1     …
…     …     +       1     +       1     …
…     …     …       …     …       …     …

INT : intensity vectors
P : position vectors (binary, 0 or 1)
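
The merge from individual peak lists to a global peak list can be sketched as follows. This is only an illustration under assumptions: each spectrum is represented as a dictionary mapping a coordinate pair (C1, C2) to an intensity, and the function name `build_global_peak_list` and the toy values are hypothetical, not taken from the study.

```python
from typing import Dict, List, Tuple

Coord = Tuple[float, float]      # (C1, C2) coordinates of a peak
PeakList = Dict[Coord, float]    # one spectrum: coordinates -> intensity


def build_global_peak_list(spectra: List[PeakList]):
    """Merge individual peak lists into one global table (hypothetical helper)."""
    # Union of all coordinate pairs that appear in at least one spectrum.
    coords = sorted(set().union(*(s.keys() for s in spectra)))
    # INT vectors: intensity at each global coordinate, 0 where the peak is absent.
    intensities = [[s.get(c, 0.0) for c in coords] for s in spectra]
    # P vectors: 1 where the spectrum has a peak at that coordinate, 0 otherwise.
    positions = [[1 if c in s else 0 for c in coords] for s in spectra]
    return coords, intensities, positions


# Toy data (made up): two spectra sharing one peak.
toy_spectra = [
    {(1.2, 3.4): 5.0, (2.0, 2.0): 1.5},
    {(2.0, 2.0): 2.5, (4.1, 0.9): 0.7},
]
coords, INT, P = build_global_peak_list(toy_spectra)
print(coords)
print(INT)
print(P)
```

With 36 spectra, the same function would return 36 INT vectors and 36 P vectors defined on a common set of coordinates.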
REPEATABILITY ?
As for 1D tools, we need to verify the statistical performance and reliability
of 2D data and spectra.
Some pre-processing :
• Symmetrisation : removing negative intensities (or intensities too close to
zero), which result from an inappropriate choice of baseline.
• Bucketing : controlling the size of the database (via the chosen
number of decimals of the coordinates).
One decimal → (909 × 74)
Two decimals → (2348 × 74)
Three decimals → (3250 × 74)
• Detection of outliers among spectra via the intensity vectors.
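
The bucketing step can be illustrated with a short sketch: rounding the coordinates to fewer decimals merges neighbouring peaks into the same bucket, which is why the matrix has fewer rows with one decimal than with three. The 74 columns are consistent with 2 coordinate columns plus 36 INT and 36 P columns, although the slides do not state this explicitly. The function name `bucket_peaks`, the `threshold` parameter, and the choice to sum intensities within a bucket are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Tuple

Coord = Tuple[float, float]


def bucket_peaks(peaks: Dict[Coord, float], decimals: int,
                 threshold: float = 0.0) -> Dict[Coord, float]:
    """Round peak coordinates to `decimals` and merge peaks sharing a bucket."""
    buckets: Dict[Coord, float] = defaultdict(float)
    for (c1, c2), intensity in peaks.items():
        # Drop negative intensities (and those at or below the threshold).
        if intensity <= threshold:
            continue
        key = (round(c1, decimals), round(c2, decimals))
        # Assumption: intensities falling in the same bucket are summed.
        buckets[key] += intensity
    return dict(buckets)


# Toy peak list (made up): two close peaks and one negative artefact.
toy_peaks = {(1.234, 3.441): 2.0, (1.211, 3.412): 3.0, (5.0, 5.0): -1.0}
print(bucket_peaks(toy_peaks, decimals=1))  # the two close peaks share a bucket
print(bucket_peaks(toy_peaks, decimals=2))  # the two close peaks stay separate
```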
REPEATABILITY ?
An intuitive way to evaluate the repeatability / reproducibility of 2D spectra
is unsupervised (blind) multivariate clustering.
If we manage to separate and recover our 4 mixtures starting from the 36
spectra → Done !
1) Clustering on position vectors
• Need specific distances or similarity measures adapted to binary
vectors, such as Ochiai, Dice, Jaccard, Russell-Rao, Kulczynski …
• Ward and K-means algorithms
Example of result (Ochiai-Ward, 2 decimals) :
in the vast majority of cases, we
can already isolate mixture 3
2) Clustering on intensity vectors
• Normalization of each vector such that sum = 1
• Euclidean distance
• Ward and K-means algorithms
RESULTS :
→ Generally, all mixtures are well recovered by the algorithms, in
spite of the sampling procedure and time repetitions !
→ Best result obtained with the one-decimal matrix (which shows the value of
bucketing) : just one error !
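
The three ingredients above (normalization to unit sum, Euclidean distance, Ward's algorithm) can be sketched in plain Python. This is a minimal illustration, not the implementation used for the results: Ward's criterion is applied through the standard Lance-Williams update on squared Euclidean distances, K-means is not shown, and the function names and toy vectors are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale an intensity vector so that its entries sum to 1."""
    total = sum(vec)
    if total <= 0:
        raise ValueError("intensity vector must have a positive sum")
    return [v / total for v in vec]


def sq_euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def ward(vectors: Sequence[Sequence[float]], n_clusters: int) -> List[List[int]]:
    """Agglomerative clustering with Ward's criterion; returns index groups."""
    if not 1 <= n_clusters <= len(vectors):
        raise ValueError("n_clusters must be between 1 and the number of vectors")
    members = {i: [i] for i in range(len(vectors))}
    dist = {frozenset(p): sq_euclidean(vectors[p[0]], vectors[p[1]])
            for p in combinations(range(len(vectors)), 2)}
    next_id = len(vectors)
    while len(members) > n_clusters:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        i, j = tuple(pair)
        d_ij = dist.pop(pair)
        n_i, n_j = len(members[i]), len(members[j])
        for k in list(members):
            if k in (i, j):
                continue
            n_k = len(members[k])
            d_ki = dist.pop(frozenset((k, i)))
            d_kj = dist.pop(frozenset((k, j)))
            # Lance-Williams update for Ward's method (squared distances).
            dist[frozenset((k, next_id))] = (
                (n_i + n_k) * d_ki + (n_j + n_k) * d_kj - n_k * d_ij
            ) / (n_i + n_j + n_k)
        members[next_id] = members.pop(i) + members.pop(j)
        next_id += 1
    return sorted(sorted(m) for m in members.values())


# Toy intensity vectors (made up): two clearly different profiles.
raw = [[8, 1, 1], [9, 1, 2], [7, 2, 1], [1, 1, 8], [2, 1, 9], [1, 2, 7]]
clusters = ward([normalize(v) for v in raw], n_clusters=2)
print(clusters)  # [[0, 1, 2], [3, 4, 5]]
```

In the setting of the slides, one would call `ward` with the 36 normalized intensity vectors and `n_clusters=4`, then check whether each cluster matches one mixture.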
Example of result (Ward, 1 decimal)
Validation : example with K-means
Number of clusters : from 2 to 6
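Validation measure : Dunn index (ratio between minimal inter-cluster
distance and maximal intra-cluster distance).

\[
DI_m = \min_{1 \le i \le m} \; \min_{1 \le j \le m,\, j \ne i}
\left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k} \right\}
\]

where m is the number of clusters, δ(C_i, C_j) is the distance between clusters C_i and C_j, and Δ_k is the intra-cluster distance of cluster k.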
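
A small sketch of the Dunn index may help make the ratio concrete. The slides do not say which inter-cluster and intra-cluster distances were used, so this example assumes the smallest point-to-point Euclidean distance between two clusters and the largest pairwise distance inside a cluster (its diameter). The function name `dunn_index` and the toy points are hypothetical.

```python
import math
from itertools import combinations, product
from typing import List, Sequence

Point = Sequence[float]


def dunn_index(clusters: List[List[Point]]) -> float:
    """Minimal inter-cluster distance divided by maximal cluster diameter."""
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    # Assumption: inter-cluster distance = closest pair of points across clusters.
    min_between = min(
        math.dist(p, q)
        for ci, cj in combinations(clusters, 2)
        for p, q in product(ci, cj)
    )
    # Assumption: intra-cluster distance = diameter (farthest pair within a cluster).
    max_diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    if max_diameter == 0:
        raise ValueError("all clusters are single points; the index is undefined")
    return min_between / max_diameter


# Toy points (made up): a compact, well-separated partition and a poor one.
good = [[(0, 0), (0, 1)], [(3, 0), (3, 1)]]
poor = [[(0, 0), (3, 0)], [(0, 1), (3, 1)]]
print(dunn_index(good))  # larger value: better separated, more compact clusters
print(dunn_index(poor))
```

For validation, the index would be computed for each number of clusters from 2 to 6, and the partition with the largest value preferred.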
3) 2D vs. 1D (current work)
Warning : be very careful to compare only what is objectively comparable ! This
implies using the same pre-processing procedures in the 1D and 2D cases (very hard…).
But we can :
- eliminate negative intensities,
- apply the same standards to the intensities,
- use the same number of decimals,
- remove outliers (PCA),
- choose a resolution proportional or equal to that of the 2D
horizontal axis, etc…
By doing this, we can already see that repeatability can be better
in 2D than in 1D !
1D clustering (Ward)
CONCLUSION
It’s commonly accepted by users (biologists, pharmacologists, healthcare
professionals…) that the recent introduction of 2D-NMR methods represents
a huge qualitative leap for metabolomic investigations. For them, it’s obvious
and natural that more information = more power.
BUT… for the moment, no statistical study has clearly proved this …
So we are trying to fill this gap. Our first, encouraging results suggest
that 2D-NMR tools (starting with COSY) are statistically robust
and, moreover, that 2D-COSY experiments seem to be more repeatable
and reliable than the corresponding 1D methods !
CONCLUSION
Perspectives :
► continue to go further into 1D vs. 2D comparisons
► improve 2D data pre-processing
► apply the same procedures with NOESY and
heteronuclear methods (same conclusions ?)
► implement supervised classification methods (such as
SVM, Lasso…) in order to make predictions and to identify
discriminating zones (biomarkers)
► work with « challenging » real datasets (disease, drug…)
THANK YOU FOR
YOUR ATTENTION