Large Two-way Arrays

Douglas M. Hawkins
School of Statistics
University of Minnesota
What are ‘large’ arrays?
• # of rows in at least hundreds
and/or
• # of columns in at least hundreds
Challenges/Opportunities
• Logistics of handling data more tedious
• Standard graphic methods work less well
• More opportunity for assumptions to fail
but
• Parameter estimates more precise
• Fewer model assumptions may be possible
Settings
• Microarray data
• Proteomics data
• Spectral data (fluorescence, absorption…)
Common problems seen
• Outliers/Heavy-tailed distributions
• Missing data
• Large # of variables hurts some methods
The ovarian cancer data
• Data set as I have it:
• 15154 variables (M/Z values), % relative
intensity recorded
• 91 controls (clinical normals)
• 162 ovarian cancer patients
The normals
• Give us an array of 15154 rows, 91
columns.
• Qualifies as ‘large’
• Spectrum very ‘busy’
[Figure: Controls - median relative intensity (relative intensity versus M/Z)]
not to mention outlier-prone
• Subtract off a median for each M/Z and
make a normal probability plot of the
residuals
[Figure: Probability plot of control residuals (normal score versus residual * 100)]
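The residual step above can be sketched as follows. This is a minimal illustration, assuming the controls sit in a NumPy array with one row per M/Z value and one column per subject. The function names, the synthetic heavy-tailed data, and the Blom plotting positions are assumptions for illustration, not necessarily what was used for the slide.

```python
import numpy as np
from statistics import NormalDist

def median_residuals(X):
    """Subtract each row's median (one row per M/Z value), ignoring NaNs."""
    return X - np.nanmedian(X, axis=1, keepdims=True)

def normal_scores(values):
    """Return the sorted values and their normal scores (Blom positions)."""
    v = np.sort(np.asarray(values, dtype=float).ravel())
    n = v.size
    probs = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    nd = NormalDist()
    scores = np.array([nd.inv_cdf(p) for p in probs])
    return v, scores

# Synthetic stand-in for the controls: heavy-tailed t(3) noise, 91 subjects.
rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=(200, 91))

resid = median_residuals(X)
sorted_resid, scores = normal_scores(resid)
# Plotting scores against sorted_resid gives the probability plot;
# heavy tails show up as residuals far beyond the extreme normal scores.
```

Plotting `scores` against `sorted_resid` with any plotting library reproduces the layout of the figure, with normal score on the vertical axis.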
Comparing cases, controls
• First pass at a rule to distinguish normal
controls from cancer cases:
• Calculate two-sample t between groups
for each distinct M/Z
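A minimal sketch of the per-M/Z two-sample t calculation, assuming each group is a NumPy array with one row per M/Z value and one column per subject. The pooled-variance form of the statistic, the function name, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def rowwise_t(A, B):
    """Pooled-variance two-sample t for each row of A versus B."""
    na, nb = A.shape[1], B.shape[1]
    mean_diff = A.mean(axis=1) - B.mean(axis=1)
    pooled_var = ((na - 1) * A.var(axis=1, ddof=1)
                  + (nb - 1) * B.var(axis=1, ddof=1)) / (na + nb - 2)
    return mean_diff / np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))

# Synthetic example: 500 M/Z values, 91 controls, 162 cases,
# with the controls shifted up by 3 standard deviations at row 10.
rng = np.random.default_rng(0)
controls = rng.normal(size=(500, 91))
cases = rng.normal(size=(500, 162))
controls[10] += 3.0

t = rowwise_t(controls, cases)   # sign convention: control - cancer
best = int(np.argmax(np.abs(t)))
```

Because the calculation is vectorised over rows, 15154 M/Z values are handled in a single call.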
[Figure: Student's t, control - cancer (t versus M/Z)]
Good news / bad news
• Several places in spectrum with large
separation (t=24 corresponds to around 3
sigma of separation)
• Visually seem to be isolated spikes
• This is due to large # of narrow peaks
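The "around 3 sigma" figure can be checked with a little arithmetic, assuming the pooled two-sample t with the group sizes given earlier (91 controls, 162 cases). The variable names are illustrative.

```python
import math

n_controls, n_cases = 91, 162
t_value = 24.0

# For a pooled two-sample t, t = (mean difference / sigma) / sqrt(1/n1 + 1/n2),
# so the separation in sigma units is t * sqrt(1/n1 + 1/n2).
se_factor = math.sqrt(1.0 / n_controls + 1.0 / n_cases)
separation_in_sigma = t_value * se_factor
print(round(separation_in_sigma, 2))
```

With these group sizes the separation comes out at about 3.1 standard deviations.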
[Figure: Close-up of Student t (t versus M/Z, 1500 to 3000)]
Variability also differs
[Figure: Log F ratio, controls to cases (log F versus M/Z)]
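A minimal sketch of a per-M/Z log variance ratio, assuming one row per M/Z value and one column per subject. The natural logarithm is an assumption, since the slide does not say which base was used, and the function name is illustrative.

```python
import numpy as np

def rowwise_log_f(A, B):
    """Log of the ratio of sample variances (A over B) for each row."""
    return np.log(A.var(axis=1, ddof=1) / B.var(axis=1, ddof=1))

# Tiny check: variances 1 and 4 give log(1/4).
example = rowwise_log_f(np.array([[1.0, 2.0, 3.0]]),
                        np.array([[2.0, 4.0, 6.0]]))
```

Values near zero indicate similar variability in the two groups; large positive or negative values flag M/Z positions where one group is much more variable.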
• Big differences in mean and variability
suggest conventional statistical tools such as
– Linear discriminant analysis
– Logistic regression
– Quadratic or regularized discriminant analysis
using a selected set of features.
• Off-the-shelf software doesn’t like 15K
variables, but the methods are very doable.
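One way the "selected features plus linear discriminant" idea could look in practice is sketched below. It assumes selection by the largest absolute two-sample t, equal prior probabilities, and a small ridge term for numerical stability; none of these choices is stated in the slides, and the function names and synthetic data are illustrative.

```python
import numpy as np

def top_features_by_t(X0, X1, k):
    """Indices of the k rows with the largest |pooled two-sample t|."""
    n0, n1 = X0.shape[1], X1.shape[1]
    pooled = ((n0 - 1) * X0.var(axis=1, ddof=1)
              + (n1 - 1) * X1.var(axis=1, ddof=1)) / (n0 + n1 - 2)
    t = (X0.mean(axis=1) - X1.mean(axis=1)) / np.sqrt(pooled * (1 / n0 + 1 / n1))
    return np.argsort(-np.abs(t))[:k]

def fit_lda(X0, X1, ridge=1e-6):
    """Fisher linear discriminant; inputs are features x cases.
    A case x is assigned to group 1 when w @ x - b > 0."""
    m0, m1 = X0.mean(axis=1), X1.mean(axis=1)
    n0, n1 = X0.shape[1], X1.shape[1]
    S = ((n0 - 1) * np.cov(X0) + (n1 - 1) * np.cov(X1)) / (n0 + n1 - 2)
    S = np.atleast_2d(S) + ridge * np.eye(m0.size)   # pooled covariance
    w = np.linalg.solve(S, m1 - m0)
    b = w @ (m0 + m1) / 2
    return w, b

# Synthetic stand-in: 300 features, three of which separate the groups.
rng = np.random.default_rng(1)
controls = rng.normal(size=(300, 91))
cases = rng.normal(size=(300, 162))
cases[[5, 50, 150]] += 3.0

keep = top_features_by_t(controls, cases, k=3)
w, b = fit_lda(controls[keep], cases[keep])
scores_controls = w @ controls[keep] - b
scores_cases = w @ cases[keep] - b
accuracy = (np.sum(scores_controls < 0) + np.sum(scores_cases > 0)) / (91 + 162)
```

The accuracy here is computed on the same data used to select features and fit the rule, so it is optimistic; an honest estimate would repeat the selection inside a cross-validation loop.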
Return to beginning
• Are there useful tools for extracting
information from these arrays?
• Robust singular value decomposition
(RSVD) is one that merits consideration (see
our two NISS tech reports)
Singular value approximation
• Some philosophy from Bradu (1984)
• Write X for the n×p data array.
• First remove structure you don’t want to see
• k-term SVD approximation is
$$x_{ij} = \sum_{t=1}^{k} r_{it} c_{jt} + e_{ij}$$
• The $r_{it}$ are ‘row markers’. You could use
them as plot positions for the proteins.
• The $c_{jt}$ are ‘column markers’. You could
use them as plot positions for the cases.
They match their corresponding row
markers.
• The $e_{ij}$ are error terms. They should
mainly be small.
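To make the row markers, column markers, and error terms concrete, here is a minimal sketch using the ordinary least-squares SVD from NumPy. This is the conventional fit, not the robust one discussed in this talk, and putting the singular values into the row markers is an arbitrary scaling choice made for illustration. The function name and synthetic data are assumptions.

```python
import numpy as np

def svd_markers(X, k):
    """k-term SVD approximation X = R @ C.T + E (least-squares fit)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    R = U[:, :k] * s[:k]     # row markers r_it, one column per term t
    C = Vt[:k].T             # column markers c_jt, one column per term t
    E = X - R @ C.T          # error terms e_ij
    return R, C, E

# Synthetic array: rank-2 structure plus small noise.
rng = np.random.default_rng(2)
X = (rng.normal(size=(50, 2)) @ rng.normal(size=(2, 8))
     + 0.01 * rng.normal(size=(50, 8)))

R, C, E = svd_markers(X, k=2)
```

Each row of `R` gives a plot position for one row of the array, and each row of `C` gives the matching plot position for one column.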
Fitting the SVD
• Conventionally done by principal
component analysis.
• We avoid this for three reasons:
– PCA is highly sensitive to outliers
– It requires complete data (an issue in many
large data sets, if not this one)
– Standard approach would use a 15K square
covariance matrix.
Alternating robust fit algorithm
• Take trial values for the column markers.
Fit the corresponding row markers using
robust regression on available data.
• Use resulting row markers to refine
column markers.
• Iterate to convergence.
• For robust regression we use least
trimmed squares (LTS) regression.
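The alternating scheme above can be sketched for a single SVD term as follows. This is not the implementation from the tech reports: the trimmed fit here is a simplified stand-in for LTS (concentration steps from a single robust start, regression through the origin), the column medians used as trial column markers are an assumption, and all names are illustrative. Missing cells are coded as NaN and skipped.

```python
import numpy as np

def trimmed_slope(x, y, keep=0.75, n_steps=10):
    """Approximate LTS slope for y ~ b * x using concentration steps."""
    ok = np.isfinite(x) & np.isfinite(y) & (x != 0)
    x, y = x[ok], y[ok]
    if x.size == 0:
        return 0.0
    h = max(1, int(np.ceil(keep * x.size)))       # number of points kept
    b = np.median(y / x)                          # robust starting value
    for _ in range(n_steps):
        idx = np.argsort((y - b * x) ** 2)[:h]    # h best-fitting points
        b_new = (x[idx] @ y[idx]) / (x[idx] @ x[idx])
        if np.isclose(b_new, b):
            break
        b = b_new
    return b

def robust_rank_one(X, keep=0.75, max_iter=50):
    """Alternate robust fits of row markers r and column markers c."""
    n, p = X.shape
    c = np.nanmedian(X, axis=0)                   # trial column markers
    if not np.any(c):
        c = np.ones(p)
    c = c / np.linalg.norm(c)
    r = np.zeros(n)
    for _ in range(max_iter):
        r_new = np.array([trimmed_slope(c, X[i], keep) for i in range(n)])
        c_new = np.array([trimmed_slope(r_new, X[:, j], keep) for j in range(p)])
        scale = np.linalg.norm(c_new)
        if scale == 0:
            break
        c_new, r_new = c_new / scale, r_new * scale   # keep c at length 1
        done = np.allclose(r_new, r) and np.allclose(c_new, c)
        r, c = r_new, c_new
        if done:
            break
    return r, c

# Synthetic rank-one array with noise, three gross outliers, two missing cells.
rng = np.random.default_rng(3)
r_true = rng.uniform(1.0, 2.0, size=40)
c_true = rng.uniform(0.5, 1.5, size=12)
clean = np.outer(r_true, c_true)
X = clean + 0.01 * rng.normal(size=clean.shape)
X[3, 4], X[10, 2], X[25, 7] = 1000.0, -500.0, 1000.0
X[5, 5], X[17, 0] = np.nan, np.nan

r, c = robust_rank_one(X)
fit = np.outer(r, c)
```

Further terms could be obtained by applying the same routine to the residuals `X - fit`, though how the reports handle later terms is not stated here.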
Result for the controls
• First run, I just removed a grand median.
• Plots of the first few row markers show
fine structure like that of mean spectrum
and of the discriminators
[Figure: Plot of first row marker (r_i1 versus M/Z)]

But the subsequent terms capture the
finer structure
[Figure: Plot of second row marker (P2 versus M/Z)]
[Figure: Plot of third row marker (P3 versus M/Z)]
Uses for the RSVD
• Instead of feature selection, we can use
cases’ c scores as variables in discriminant
rules. Can be advantageous in reducing
measurement variability and avoids
feature selection bias.
• Can use as the basis for methods like
cluster analysis.
Cluster analysis use
• Consider methods based on Euclidean
distance between cases (k-means / Kohonen
follow similar lines)


$$\sum_{i=1}^{n} (x_{ij} - x_{im})^2 = \sum_{i=1}^{n} \Big( \sum_{t=1}^{k} (r_{it} c_{jt} - r_{it} c_{mt}) + e_{ij} - e_{im} \Big)^2$$
$$= \sum_{t=1}^{k} (c_{jt} - c_{mt})^2 \sum_{i=1}^{n} r_{it}^2 + \sum_{i=1}^{n} (e_{ij} - e_{im})^2 + \text{cross-product}$$
• The first term is sum of squared difference
in column markers, weighted by squared
Euclidean norm of row markers.
• Second term is noise. It adds no information
and detracts from performance.
• Third term, the cross-product, is approximately
zero because of independence.
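The three-term split of the squared distance can be checked numerically. The sketch below uses the ordinary least-squares SVD, for which the row-marker columns are orthogonal and the cross-product vanishes exactly; with a robust fit it would only be approximately zero. The function name and the synthetic data are assumptions for illustration.

```python
import numpy as np

def distance_terms(X, k, j, m):
    """Split the squared distance between columns j and m into
    marker term, noise term, and leftover cross-product."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    R = U[:, :k] * s[:k]                 # row markers
    C = Vt[:k].T                         # column markers
    E = X - R @ C.T                      # error terms
    total = np.sum((X[:, j] - X[:, m]) ** 2)
    signal = np.sum((C[j] - C[m]) ** 2 * np.sum(R ** 2, axis=0))
    noise = np.sum((E[:, j] - E[:, m]) ** 2)
    cross = total - signal - noise
    return total, signal, noise, cross

rng = np.random.default_rng(4)
X = (rng.normal(size=(60, 2)) @ rng.normal(size=(2, 10))
     + 0.5 * rng.normal(size=(60, 10)))

total, signal, noise, cross = distance_terms(X, k=2, j=0, m=1)
```

Dropping the noise term is what turns the raw column distance into a distance between column markers.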
This leads to…
• The scale of r and c is arbitrary. Make the
row-marker columns length 1, absorbing the
eigenvalue into c
• Replace column Euclidean distance with
squared distance between column
markers. This removes random variability.
• Similarly, for k-means/Kohonen, replace
column profile with its SVD approximation.
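A minimal sketch of the marker-based distance, reading "column lengths 1" as scaling each column of row markers to unit length so that the scale moves into the column markers. It uses the ordinary least-squares SVD rather than the robust fit, and the function name and data are illustrative assumptions.

```python
import numpy as np

def marker_distances(X, k):
    """Squared distances between cases, computed from k column markers."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Row-marker columns U[:, :k] have length 1; scale is absorbed into C.
    C = Vt[:k].T * s[:k]
    diff = C[:, None, :] - C[None, :, :]
    return np.sum(diff ** 2, axis=2)

rng = np.random.default_rng(5)
X = (rng.normal(size=(60, 2)) @ rng.normal(size=(2, 10))
     + 0.5 * rng.normal(size=(60, 10)))

D2 = marker_distances(X, k=2)

# Raw squared Euclidean distances between columns, for comparison.
raw_diff = X[:, :, None] - X[:, None, :]
raw = np.sum(raw_diff ** 2, axis=0)
```

The matrix `D2` can be handed to any distance-based clustering routine in place of the raw distances; it is never larger than `raw` because the noise term has been dropped.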
Special case
• If a one term SVD suffices, we get an
ordination of the rows and columns.
• Row ordination doesn’t make much sense
for spectral data
• Column ordination orders subjects
‘rationally’.
The cancer group
• Carried out RSVD of just the cancer cases
• But this time removed row median first
• Corrects for overall abundance at each MZ
• Robust singular values are 2800, 1850,
1200,…
• suggesting more than one dimension.
[Figure: First component markers (P1 versus ordered cancer case)]
• No striking breaks in sequence.
• We can cluster, but get more of a partition
of a continuum.
• Suggests that severity varies smoothly
Back to the two-group setting
• An interesting question (suggested by the
Mahalanobis-Taguchi strategy): are the
cancer cases alike?
• Can address this by RSVD of cancer cases
and clustering on column markers
• Or use the controls to get multivariate
metric and place the cancers in this
metric.
Do a new control RSVD
• Subtract row medians.
• Get canonical variates for all versus just
controls
• (Or, as we have plenty of cancer cases,
conventionally, of cancer versus controls)
• Plot the two groups
• Supports earlier comment re lack of big
‘white space’ in the cancer group – a
continuum, not distinct subpopulations
• Controls look a lot more homogeneous
than cancer cases.
Summary
• Large arrays – challenge and opportunity.
• Hard to visualize or use graphs.
• Many data sets show outliers / missing
data / very heavy tails.
• Robust-fit singular value decomposition
can handle these; provides large data
condensation.
Some references
Bradu, D. (1984), ‘Response Surface Model Diagnosis in Two-way Tables’, Communications in Statistics, Part A -- Theory and Methods, 13, 3059-3106.
Hawkins, D. M. (2003), Discussion of ‘A review and analysis of the Mahalanobis-Taguchi system’, Technometrics, 45, 25-29.
Hawkins, D. M., Liu, L., and Young, S. S. (2001), Robust Singular Value Decomposition, Technical Report 122, National Institute for Statistical Sciences.
Liu, L., Hawkins, D. M., Ghosh, S., and Young, S. S. (2002), Robust Singular Value Decomposition Analysis of Microarray Data, Technical Report 123, National Institute for Statistical Sciences.