Targeted Projection Pursuit for Microarray Data Analysis

Joe Faith
Northumbria University
Targeted Projection Pursuit, Joe Faith, Northumbria University, v1.1,
Outline
1. Analysing High-Dimensional Array Data
2. Dimension-Reduction Techniques
3. Targeted Projection Pursuit
4. Experimental Results
Array Data
• New array technologies producing floods of quantitative data
– cDNA and oligonucleotide
– Protein arrays
– Combinatorial chemistry arrays
– Tissue arrays
• Typically dozens of samples x thousands of genes (or other attributes)
Array Analysis Tasks
• In case of classified data (samples of known diagnostic classes, e.g. cancer tumours)
– spot clusters in data
– spot outliers
– classify new cases into existing classes
– genetic profiles, feature selection, finding markers for particular conditions
• Similar problems with time series / sequential data
– Genome-wide study of transcription and regulation
Array Analysis Techniques
• Lots of techniques borrowed from statistics,
machine learning, data mining.
• Tend to be complicated and ‘opaque’
• Want to find ways to allow the experimenter to:
– Visualise and communicate results
– Explore the data
– Form hypotheses
Statistical Problems
• Nature of data presents many statistical problems:
– Normalisation
– Control of variance
– Determining significance
– Determining reliability
• ‘high p, low n’: many attributes (p), few samples (n)
• Will ignore all these!
Suppose we had just 2 genes…
[Figure: 2D scatter plot of samples, with axes Gene A and Gene B]
Clusters, classifications, outliers, correlations etc are then immediately obvious
3D Scatter Plots
[Figure: 3D scatter plot with axes Gene A, Gene B and Gene C]
‘Virtual Reality’ 3D Scatter Plots
Angelova et al, 2005
Dimension Reduction Techniques
• But what about p=4, 5, … 1000??
• Need some way of visualising and
exploring higher dimensional ‘space’ in 2D
Hierarchical Clustering
• Produce a dendrogram based on sample/gene
distances, and optimise order for display
• But single dimension obscures many
relationships
Eisen et al, 1998
Multi-Dimensional Scaling
• Finds a 2D layout of the data points that preserves the distances between them as closely as possible
• E.g. Sammon’s Mapping (Ewing et al, 2001)
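To make "preserve distances between points" concrete, the sketch below computes Sammon's stress, the error that Sammon's mapping minimises, for a given low-dimensional layout. It only scores a layout; the optimisation itself is omitted. The function names are illustrative, and the code assumes no two samples are identical in the original space (otherwise a division by zero occurs).

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all rows of `points` (n x d)."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def sammon_stress(high, low):
    """Sammon's stress between the data `high` (n x p) and a layout `low` (n x k).

    0 means every pairwise distance is preserved exactly.
    Assumes all pairwise distances in `high` are non-zero.
    """
    upper = np.triu_indices(len(high), k=1)      # count each pair i < j once
    d_high = pairwise_distances(high)[upper]
    d_low = pairwise_distances(low)[upper]
    return ((d_high - d_low) ** 2 / d_high).sum() / d_high.sum()

# Usage: score the layout obtained by simply keeping the first two attributes.
rng = np.random.default_rng(0)
data = rng.normal(size=(10, 5))
stress = sammon_stress(data, data[:, :2])
```

Because each squared error is divided by the original distance, small original distances are weighted more heavily than large ones.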
But…
• ‘curse of dimensionality’ spreads points
• Not projection based, so cannot visualise
position of new unclassified samples
• No indication of where the particular stresses (distortions of distance) fall
Linear Projection Based Methods
• Find a 2D ‘window’ through which we view the
multidimensional data
• The position of the window then contains useful
information about, eg, respective significance of
particular genes
• Principal Components Analysis (Yeung, 2001)
– Find view (window position) that best spreads the data
• Projection Pursuit
– Find projections best suited for particular purposes,
such as separating classifications (Lee, 2005)
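As an illustration of the 2D "window" idea, the sketch below computes a PCA view with NumPy: the projection matrix has one row per gene and one column per screen axis, so its entries show how strongly each gene contributes to the view. The function name `pca_view` is an assumption made for this example.

```python
import numpy as np

def pca_view(X, k=2):
    """Return (P, view) for data X (n samples x p genes).

    P is the p x k projection matrix (the 'window position');
    view is the n x k projected data.
    """
    Xc = X - X.mean(axis=0)                        # centre each gene
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                                   # top-k principal directions
    return P, Xc @ P

# Usage on synthetic data: 20 samples x 50 genes.
rng = np.random.default_rng(0)
P, view = pca_view(rng.normal(size=(20, 50)))
```

Any other linear projection method differs only in how P is chosen; the view is always the centred data multiplied by P.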
Grand Tours
• But each of these shows only a single view out of many
• So ‘Grand Tours’ show a video of all possible views (Asimov, 1985)
• But Grand Tours in high dimensions are mostly uninformative, and make the data hard to interpret
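A rough sketch of the idea of touring through views: the generator below yields a sequence of random orthonormal 2D projections. This is a simplification; a true Grand Tour moves smoothly between views rather than jumping, and the name `random_views` is an assumption made for this example.

```python
import numpy as np

def random_views(X, n_views, seed=0):
    """Yield (Q, view) pairs: Q is a random p x 2 orthonormal frame,
    view is the centred data projected onto it."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    for _ in range(n_views):
        # QR factorisation of a random p x 2 matrix gives orthonormal columns.
        Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1], 2)))
        yield Q, Xc @ Q

# Usage: three random views of 8 samples x 5 attributes.
X = np.arange(40, dtype=float).reshape(8, 5)
frames = list(random_views(X, 3))
```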
Manual Controls
• Try using manual controls
to alter projections?
• Controls are ‘opaque’:
user has no intuition about
the effect their actions will
have
• E.g. XGobi (Cook and Buja, 1997)
Targeted Projection Pursuit
• The intuition:
– Allow user to manipulate view of data directly
– Computer then tries to find view that best
matches ‘target’
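One simple way to read "find the view that best matches the target" is as a least-squares problem: given the 2D positions the user wants for each sample, solve for the projection whose view is closest to them. The slides do not say how the search is actually done, so least squares is a stand-in here, and the function name and the example target are assumptions.

```python
import numpy as np

def targeted_projection(X, target):
    """Least-squares projection P (p x 2) such that the centred data times P
    is as close as possible to the centred `target` (n x 2).

    Both are centred because a view is only defined up to translation.
    NOTE: least squares is an illustrative choice, not the author's method.
    """
    Xc = X - X.mean(axis=0)
    Tc = target - target.mean(axis=0)
    P, *_ = np.linalg.lstsq(Xc, Tc, rcond=None)
    return P, Xc @ P

# Usage: drag each of three classes towards its own corner of the screen.
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 100))                    # 12 samples x 100 genes
labels = np.repeat([0, 1, 2], 4)
corners = np.array([[0.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
P, view = targeted_projection(X, corners[labels])
```

With far more genes than samples, a linear projection can usually match almost any target exactly, so a well-separated view of the training samples is not by itself evidence of real structure.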
Quantitative Evaluation
• Task: find a view of a data set that best
shows classifications
• Data: publicly available gene expression
data sets of diagnosed cancer tissues
• Method: compare resulting views with
standard techniques for degree of class
separation
Data
• LEUK50: Gene expression in two types of acute leukemia: acute
lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML)
[Gol99]. 38 cases of B-cell ALL, 9 cases of T-cell ALL, and 25 cases
of AML. Expression levels of 7219 genes
• SRBCT50: cDNA microarray analysis of small, round blue cell
childhood tumors (SRBCT), including neuroblastoma (NB),
rhabdomyosarcoma (RMS), Burkitt Lymphoma (BL; a subset of non-Hodgkin
lymphoma) and members of the Ewing family of tumors (EWS).
6567 genes for 83 samples [Kha01].
• NCI50: 60 cell lines from the National Cancer Institute's anticancer
drug screen [Sch00]: 9 breast, 5 central nervous system (CNS), 7
colon, 6 leukemia, 8 melanoma, 9 non-small-cell lung carcinoma
(NSCLC), 6 ovarian, 2 prostate, 8 renal. 9703 cDNA sequences.
DR Techniques
• TPP: Targeted Projection Pursuit
• PP: Projection Pursuit (computer search for
optimal view)
• SAM: Sammon Mapping
• VS: VizStruct non-linear projection based
on radial coordinates
Metrics
• ILDA: Linear Discriminant Analysis Index (Lee et al, 2005)
• 5NN: Generalisation performance of K-Nearest Neighbours classifier (k=5)
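The sketch below shows how both metrics might be computed on a 2D view. It assumes the LDA index takes the form 1 - |W| / |W + B|, with W the within-class and B the between-class scatter of the projected points, and it assumes leave-one-out evaluation for the 5NN score; the slides state neither detail, so both are assumptions, as are the function names.

```python
import numpy as np

def lda_index(view, labels):
    """1 - |W| / |W + B| on the projected data `view` (n x 2, floats).

    Assumed form of the index; values near 1 mean well-separated classes.
    """
    labels = np.asarray(labels)
    centred = view - view.mean(axis=0)
    total = centred.T @ centred                    # total scatter = W + B
    within = np.zeros(total.shape)
    for c in np.unique(labels):
        group = view[labels == c]
        dev = group - group.mean(axis=0)
        within += dev.T @ dev                      # accumulate within-class scatter
    return 1.0 - np.linalg.det(within) / np.linalg.det(total)

def knn_loo_accuracy(view, labels, k=5):
    """Leave-one-out accuracy (%) of a k-nearest-neighbour classifier on `view`."""
    labels = np.asarray(labels)
    diff = view[:, None, :] - view[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)                 # a sample cannot vote for itself
    correct = 0
    for i in range(len(view)):
        nearest = np.argsort(dist[i])[:k]
        classes, votes = np.unique(labels[nearest], return_counts=True)
        correct += int(classes[np.argmax(votes)] == labels[i])
    return 100.0 * correct / len(view)
```

Both functions take only the projected points and the class labels, so the same code can score a view produced by any of the dimension-reduction techniques being compared.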
Results
DR technique   LEUK            SRBCT           NCI
               ILDA    5NN     ILDA    5NN     ILDA    5NN
TPP            .997    100     .999    100     .994    96.7
PP             .972    98.6    .988    100     .981    62.3
SAM            .959    97.2    .911    95.2    .927    67.2
VS             .952    95.8    .637    56.6    .838    32.8
• Joe Faith, Robert Mintram, Maia Angelova (2006), "Targeted Projection Pursuit for Visualising Gene Expression Data Classifications", Bioinformatics (forthcoming).
• Joe Faith, Michael Brockway (2006), "Targeted Projection Pursuit Tool for Gene Expression Visualisation", Journal of Integrative Biology (forthcoming).
LEUK Data Views
SRBCT Data Views
NCI Data Views
Classifier Construction
• Find a view in which all classes are clearly
separated:
– The components of the projection then define the combination of genes that determines the classification
– Genes can be ordered by significance to find a subset of relevant genes
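A minimal sketch of ordering genes by their contribution to a view: each gene corresponds to one row of the projection matrix, and the length of that row is used here as its "significance". That measure is an assumption (it presumes the genes are on comparable scales), and `rank_genes` is an illustrative name.

```python
import numpy as np

def rank_genes(P, gene_names, top=10):
    """Rank genes by the length of their row in the projection matrix P (p x 2).

    Returns up to `top` (name, weight) pairs, largest weight first.
    """
    weight = np.linalg.norm(P, axis=1)             # one weight per gene
    order = np.argsort(weight)[::-1][:top]
    return [(gene_names[i], float(weight[i])) for i in order]

# Usage with a toy 3-gene projection.
P = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
ranking = rank_genes(P, ["g0", "g1", "g2"])
```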
Outlier Detection
• See LEUK data, outliers between ALL/T
and ALL/B
• See which potential outliers move with the
rest of the samples of that class
Gene Identification
• Separate each class in turn from the remainder of the
data. The most significant genes in this separation can
then be found
• NCI data:
– Human melanoma antigen recognized by T-cells (MART-1)
mRNA Chr.9 selects for Melanoma samples [Coulie 94]
– Desmoplakin gene selects ovarian cancer cases [Adams 06]
• SRBCT data:
– CD83 selects Burkitt's Lymphoma samples [Dudziak 03]
Future Work
• Quantitatively evaluate TPP on other tasks
• Develop tool to:
– Handle wider range of data formats
– Display time series / sequential data
– Integrate with biological workflows:
• Standard gene lists
• Click-through to gene ontologies and DBs
• Work with biologists to trial tool and get
feedback
References
• Angelova,M., Ivanov,D. and Yasrebi,H. (2005) Classification and visualisation of E.coli genes from microarray experiments, poster presentation, MASAMB05, March, Rothamsted Research, Harpenden, UK. www.rothamsted.bbsrc.ac.uk/bab/masamb/posters/MAngelova.pdf
• Eisen,M.B., Spellman,P.T., Brown,P.O. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25), 14863-14868.
• Ewing,R.M. and Cherry,J.M. (2001) Visualisation of expression clusters using Sammon's non-linear mapping. Bioinformatics, 17, 658-659.
• Yeung,K.Y. and Ruzzo,W.L. (2001) Principal Components Analysis for clustering gene expression data. Bioinformatics, 17(9), 763-774.
• Lee,E.K., Cook,D., Klinke,S. and Lumley,T. (2005) Projection Pursuit for Exploratory Supervised Classification. Journal of Computational and Graphical Statistics, 14(4), 831-846.
• Asimov,D. (1985) The Grand Tour: A Tool for Viewing Multidimensional Data. SIAM Journal on Scientific and Statistical Computing, 6(1), 128-143.
• Cook,D. and Buja,A. (1997) Manual Controls for High-Dimensional Data Projections. Journal of Computational and Graphical Statistics, 6(4), 464-480.
• Golub,T.R., Slonim,D.K., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J.P., Coller,H., Loh,M.L., Downing,J.R., Caligiuri,M.A., Bloomfield,C.D. and Lander,E.S. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.
• Scherf,U., Ross,D.T., Waltham,M., Smith,L.H., Lee,J.K., Tanabe,L., Kohn,K.W., Reinhold,W.C., Myers,T.G., Andrews,D.T., Scudiero,D.A., Eisen,M.B., Sausville,E.A., Pommier,Y., Botstein,D., Brown,P.O. and Weinstein,J.N. (2000) A Gene Expression Database for the Molecular Pharmacology of Cancer. Nature Genetics, 24(3), 236-244.
• Khan,J., Wei,J.S., Ringnér,M., Saal,L.H., Ladanyi,M., Westermann,F., Berthold,F., Schwab,M., Antonescu,C.R., Peterson,C. and Meltzer,P.S. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673-679.
• Coulie,P.G. et al (1994) A new gene coding for a differentiation antigen recognized by autologous cytolytic T lymphocytes on HLA-A2 melanomas. J Exp Med, Jul 1, 180(1), 35-42.
• Dudziak et al (2003) Latent Membrane Protein 1 of Epstein-Barr Virus Induces CD83 by the NF-κB Signaling Pathway. J Virol, 77(15), 8290-8298.
• Adams et al (2006) Meningothelial meningioma in a mature cystic teratoma of the ovary. Pathologe, Mar 23.