ClinProt - Algorithmen
Download
Report
Transcript ClinProt - Algorithmen
Multivariate Data Analysis
for Metabolomics Data
generated by
MS / NMR Spectroscopy
Metabolomics Workshop
Research Triangle Park / NC
July 14th-15th
H. Thiele, Bruker Daltonik, Bremen
Why do Metabolic Profiling ?
Clinical Diagnostics
• find metabolic markers for disease progression (e.g. cancer)
• diagnose inborn errors or other diseases
• study of genetic differences
Toxicology
• markers for drug toxicity and drug efficacy
• analyze time course of toxicological response
Food Science
• quality control / classification of origin
• health/flavor enhancement of agrochemical products
MS + NMR together:
• Parallel Statistics >> more confidence
• Hyphenation >> ultimate characterization tool
Fundamental Issue in MVS: Dimension Reduction
Bucketing
• several bucketing techniques for optimum design of variables
Dynamic Peak Bucketing Scheme
a1
a2 a3
a4 a5
a1 a2 a3 a4 a5
b1
b2
b3 b 4
b5b6
a1 0 a2 a3 0 a4 a5
b1 b2 b3 0 b4 b5 b6
c1c2c3
c4c5
c6
0 0 a1 0 a2 a3 0 a4 a5
0 0 b1 b2 b3 0 b 4 b5 b6
c1 c2 c3 0 0 c4 c5 c6 0
• Spectra are bucketed one by one
• Bucket table gets a new column whenever a new peak occurs
• Spectra not having peaks at new positions get corresponding 0
Kernel Bucketing for LC-MS Data
Bucketing Parameter e.g.
m/z bucket width = 1, Kernel 0.3 Da
Time bucket width = 60s, Kernel 10s
LC-MS chromatograms of N samples
Bucketing
Intensity
Intensity
100
100
Time
m/z
1Da
100
0.3 Da
10s
100
30
100
Intensity
Rt-m/z
...
1min -
1min -
1 min -
pairs
535 m/z
536 m/z
537 m/z
Sample X
500
50
5
Sample Y
0
25
0
Sample Z
500
50
5
...
50
100
60s
Time [s]
Table of N samples
Which Bucketing Technique to be used ?
• Rectangular, equidistant bucketing
standard, good compromise if no a priori knowledge
• Variable sized bucketing
makes shifts ineffective, allows selective usage
• Point wise bucketing
often used for broad line spectra as a special case of
rectangular, equidistant bucketing
• Dynamic peak bucketing
allows very fine bucketing without getting huge
tables, requires stable shifts or masses
• Kernelized bucketing
variant of rectangle bucketing to reduce effect of shifts
Data Preprocessing : Spectral Background Subtraction
•
Measured Data are contaminated by solvents and chemical noise
•
Intensity of contaminants may dominate the relevant data
Chemical Noise
Solvent at m=75.2, Baseline and
Scaled Noise Estimate
Detection of Traces by
dynamic grouping
Data Preprocessing : Spectral Background Subtraction
•
Subtraction of spectral background makes relevant data visible
Hidden traces of m=180.2 and
m=208.2 in BPC
Intens.
x10 5
1.25
Intensity
Base Peak Chromatogram (BPC)
before and after Background
Subtraction
1.00
0.75
0.50
0.25
0.00
12.0
12.5
13.0
13.5
Intens.
x10 5
3
14.0
Intensity
tim
e
14.5
15.0
15.5
16.0
16.5 Time [min]
2
Visible traces of m=180.2 and
m=208.2 in BPC
1
0
0
2
4
6
8
10
14
16
18
Time [min]
Intens.
x10 4
6
Intensity
tim
e
12
4
2
0
12.0
12.5
13.0
13.5
14.0
tim
e
14.5
15.0
15.5
16.0
16.5 Time [min]
Peak Picking Tasks
• Find compounds defined by RT, m/z, z and area
• Take together isotopic peaks and charge states
Multivariate Statistics in Metabolomics Spectroscopy
How to analyze large numbers of complex LC-MS
chromatograms or NMR spectra with the target of simple
discrimination or grouping?
- healthy / non-healthy
- high / low quality
NMR or MS = sensor
LC-MS chromatogram
or NMR spectrum=
fingerprint
Use Pattern Recognition
Techniques!
Pattern Recognition (PR)
Objectives of PR:
• Statistical characterization
• Model building
• Classification
Methods of PR:
Exploratory Data Analysis
• Statistical Tests
• Principle Component Analysis (PCA) >> variance analysis
Unsupervised Pattern Recognition
• Cluster Analysis
Supervised Pattern Recognition
• Discriminant Analysis: LDA, PCA-DA
• Classification of samples by various means e.g. Genetic Algorithm,
SVM, …
Idea of Principal Component Analysis (PCA)
y
y
y
x
y
x
x
x
PC3
PC2
PC1
e.g. Principal Component Analysis
Classification using PCA
Input
spectra list
Bucketing
coordinate
transformation
distance
measures
comparison to
critical values
classification
model
Coordinate Transformation
PC2
ppm1
ppm2
PC1
Scores
Loadings
Baby Urine Samples : PCA - NMR
mevalonic
aciduria
maple syrup
disease
PC 1,2
orotic
aciduria
200 normal candidates
Pattern in PC1/PC2 scores plot reveals candidates
with inborn errors.
New Born Screening by NMR
> 400 baby urines, PCA,
disease vectors indicating strength of metabolic disorders
PC1/PC2
PC3/PC4
PC11/PC12
Bucket Analysis
from 9 to 0.4ppm
in 0.04ppm steps
Excluded:
6 to 4.5 ppm
residual water
and urea
Results of
BEST-NMR at
600 MHz
1D-spectra
Noesy presat
64 scans
6c
6
New Born Screening by NMR
Distance from normals distribution is a measure for
concentration of the molecule representing an inborn error
Hippuric acid
vector
CH2group
of
hippuric
acid
4c
4
PC2-Scores
PCA : NMR vs. LC-MS
PC1-Scores
Fig. 1: Scores plot of NMR data
from baby urines (born 2003).
Fig. 2: Scores plot of LC-MS data
from a subset of baby urines
(born 2003).
LC-MS data of Samples 114 and 94
Sample 114
BPC: -All MS
8
6
4
-MS, 5.5-6.8min
263.1037
2
264.1062
Intensity * 104
0
Sample 94
BPC: -All MS
8
263.1037
6
C13 H15 N2 O4 ,263.10
264.1068
4
260
261 262
263
Measured
Pattern
264
Calculated
Pattern
265 266
267
m/z
0
8
eXpose 94 vs. 114
BPC: -All MS
6
263.1
2
4
2
0
2
4
6
8
t [min]
Generate Molecular Formula of
mass 263.1037 m/z @ 5.9min.
The determined formula C13H15N2O4
corresponds to phenyl-acetylglutamate.
Combining Spectra and Statistical Data
Interpretation scores / loadings
Loadings in PCA
indicate the
importance of the
original variables
(buckets) in the
variance space.
In ideal cases a
set of loadings
refers to signals
of a compound.
Analysis of Bucket Variables
Menu bar / Options
Data Viewer
Loadings Plot
Bucket table
Covariance matrix
The covariance matrix looks like a TOCSY, cross - peaks indicate
correlated fluctuations. This includes multi molecular fluctuations.
Rows at cursor position are shown on top of the 2D matrix.
Covariance Analysis
row from covariance
matrix
reference spectrum from
spectra base
Interesting rows can be saved to disk
as 1D NMR spectra and used
for spectra base searching as any
other 1D spectrum.
Often, a small number of compounds
from the spectra base match well
while others do not.
PCA analysis of 69 newborn urine LC-MS spectra
1: selecting two LC-MS runs differing in the PC1 values from scores plot
2: selecting bucket (spectral region) from loadings plot with high PC1 value
Scores plot (PC1-PC2)
Loadings plot (PC1-PC2)
Generate
SumFormula
no peak
LC-MS (run 1)
peak
LC-MS (run 2)
Sum-Formula Generation
Electron Configuration
Intensity
M+
M+*
C/H Ratio, Elemental Limits
m/z
N-rule; isotope distribution
double bond equiv.
Isotope Masses,
Abundances
Experimental Peak
Fast, exact
Intensities & Masses
calculation
Molecular Constraints
Fast Formula Generator
using CHNO Algorithm
List of Hits &
Mass/Intensity Patterns
of isotopic Patterns
Formula Scoring: Isotopic pattern as additional
decision criteria for elemental composition
List of Hits &
Intensity
Mass/Intensity Patterns
Three independent Scores:
• Intensity Ratios
• Intensity weighted
mean Masses
• Intensity weighted
Peak Distances
m/z
Experimental Intensities & Masses
{ Theoretical Intensities & Masses }
Calculating the elemental composition
Simulated mass spectrum of Chlorpyriphos
12C
isotope peak
13C
isotope peak (11% int.)
3 x Cl isotope peaks
H3C
Cl
Cl
S
O
P
O
O
H3C
N
Cl
Clinical Proteomics
The samples are different
The experimental and
mathematical techniques are
similar
But the goal
is the same
Workflow Clinical Proteomics
Binding
Washing
Patients
Serum Samples
Isolation
Normal
Normal
Disease
* * *
Analysis
Elution
Detection
Cluster analysis
Normal
Disease
Clinical Results
MALDI-TOF MS
W. Pusch et. al., Pharmacogenomics (2003) 4(4), 463-476
Data Preparation
Aim: Extraction of the same set of features from each individual
spectrum. These features will be used for model generation
and later for classification of new spectra. As with
metabonomics the identification of the features is of large
interest
Steps:
• Quality Checks for spectra
• Recalibration, Baseline correction, Noise Reduction
• Peak detection and area calculation
• Normalization of peak areas
Tasks for Clicinal Proteomics Data Analysis
1. Data preprocessing
2. Peak annotation
3. Statistical characterization
4. Discriminance analysis
Most of the tasks are quite similar for both kinds of applications
except for the dimensionality of the original data:
1D MALDI
2D LCMS-ESI MS
Data Preprocessing : Recalibration
Problem: peaks are not aligned to
each other as in this example;
For LC-MS it is usually the
retention time
Solution: application of a
recalibration algorithm
Result:
peaks are aligned
to each other
Data Preprocessing : Recalibration
Intensity
Mass shift
Above mass tolerance
Prominent peak
shifted
In tol. but not prominent
Spec 1
Spec 2
linear mass shift
m/z
• Selection of prominent peaks (e.g. 30% occurrence)
• Use this peak list with average masses as calibrants
• Assignment of peaks and calibrants with a mass tolerance
• Recalibration of all spectra by solving least square
problems for a linear mass shift
Data Transformation Wavelet vs. Fourier
Aim: Transformation of spectrum from
• time-amplitude (mass spectra) domain
• into time-frequency (Fourier) or
• time-scale (Wavelet) representation
Benefit:
- decomposition into distinct frequency/scale bands
- significant features (peaks, patterns) occur on
specific frequencies/scales
Steps for wavelet decomposition:
low pass filter approximation coefficients
high pass filter detail coefficients
Wavelets for Feature Extraction
Aim: Determine features from the spectra
which are discriminant
for class separation
Method: Wavelet-Transformation
• gives information about the signal
• localized in time (m/z) and frequency
• lower freq. corresponds to raw
structural information of the spectrum
• higher freq. corresponds to
fine/detailed information
• in contrast to FFT we get knowledge were
the feature is located in time (m/z)
Feature selection:
• we get much features (dep. on time resolution)
• Brute force + sophisticated feature selection needed
Peak detection
Problem:
•Common sets of peaks needed for later model stage
•Different peaks vary to a different extent over all spectra
•Small peaks , nevertheless giving a good separation between
classes, might be overlooked only considering single spectra
average spectrum
single spectrum
Peak detection
Solution - ClinProTools:
•
Determination of peak positions by use of Average-Spectrum
•
Integration over start and end Masses for detected Peaks
Blue areas indicate picked peaks
average spectrum
Red area for picked peaks in Model (see later)
Average spectra per class
Peak at ca. 2022Da – Idx 24 in Model of
most imp. 15 Peaks for GA and SVM
Avg.-spec class 1
Avg.-spec class 2
Avg.-spec class 3
Avg.-spec class 4
Univariate Statistics: Getting a basic idea about the data
Calculation based on: peak intensities / peak areas
• descriptive / robust statistics
• Welch`s t-test / Wilcoxen test
Statistic peak area sorted according p-value
Algorithms for Discriminate Analysis
Some alternatives to classical linear DA:
• Feature selection:
Genetic Algorithms (GA) + Cluster Analysis
• Support Vector Machines (SVM)
GA: Application to MS data
• Solution = combinations of peaks
• Initial population: randomly generated solutions
• Start with multiple initial populations using a migration
schema
• Each solution is assigned a fitness value according to its ability
to separate two or more classes (using centroid or KNNclustering and by determination of between and within class
distances)
• New generations of population are formed using
– Selection: the fitter a solution, the higher the chance for
being selected as a parent
– Crossover: parents form new solutions by exchanging some
of their peaks, new solutions replace parents
– Mutation: random changes in solutions
• Result: combinations of peaks, which separate classes best
GA: Genetic Evolution
Chromosome 1
Start Set
Mutation
Chromosome 2
1000 1200 1700
2100
2500
1700 1800 2000
2200
2300
1000 1200 1500
2100
2500
1700 1800 2000
2150
2300
1000 1200 1500
2150
2300
1700 1800 2000
2100
2500
50-500
cycles
Cross Over
1000 1200 1500
Selection
2150
Discard disadvantages
2300
1700 1800 2000
Fitness Test
using k-NN
2100
2500
Keep advantages
KNN: k-nearest neighbor clustering
•
Spectrum = point in Rn (e.g. areas of selected peaks)
•
Determination of k nearest neighbors for each spectrum
•
Classification of all points using classes of neighboring points
•
Example: point A is classified as class 2, point B as class 1
•
Fitness value:
– percentage of correctly classified points
– calculation of between/within distances
A
B
Legend:
class 1
class 2
Centroid clustering
• Spectrum = point in Rn (e.g. areas of selected peaks)
• Spectrum by spectrum is analyzed (iterative process):
– if it is the first spectrum or too far away from all existing clusters, a
new cluster with just this spectrum is created
– otherwise it is assigned to the nearest cluster, the centroid is
recalculated
• Fitness value:
– pureness & #clusters (optimal: k clusters, with all spectra of one class)
– calculation of between/within distances
3
4
1
2
GA: Results
Prediction capability of GA (plot for best 2 peaks)
~76% pred.*
~91% pred.*
~66% pred.*
* Prediction acc. for a model with 25 peaks
SVM: Support Vector Machine
• SVM: Calculation of direction in Rn, which separates best
between two classes (supervised method)
• PCA (principal component analysis): calculation of
direction in Rn, which best explains variability (unsupervised method, i.e. without looking at class memberships)
PCA =
PCA
SVM
SVM
support vectors
What is SVM – Basic problem
• Assume we have two classes of data points with two peaks
• Now I look for that line which – optimal - separates these 2 classes
Class 2
Many
decision
boundaries
can separate these two
classes.
Which should be chosen?
Class 1
The green boundaries are valid but bad ones
What is SVM – Basic idea
The decision boundary should be as far away from the data of both
classes as possible. We should maximize the margin, m:
Class 2
Class 1
m
This problem can be solved by mathematical
optimization theory
SVM: Application to MS data
•SVM: quadratic optimization problem, solved by an iterative process using
Sequential Minimal Optimization
•In simplest case a hyperplane separating classes is calculated
•Therefrom contribution of individual peaks is calculated
From Spectra in 3 classes :
We get:
• Separating hyperplanes
• Recognition & Prediction accuracy
• Peak ranking highlighting potential biomarker-patterns
SVM: Application to MS data - results
Note its plotted in 2D
but in fact it is
high dimenensional
Rec.
Pred.
Class 1
90
69
Class 2
86
83
Class 3
90
76
Rank
1
2
3
…
Index
7
17
18
…
Mass
1212
1450
1469
…
SVM: Results
Prediction capability of SVM (plot for best 2 peaks)
~93% pred.*
~70% pred. *
~75% pred. *
* Prediction acc. for a model with 25 peaks