Large-scale mining of gene
expression patterns
Paul Pavlidis
VanBUG September 2007
Students
Leon French
Meeta Mistry
Vaneet Lotay
Postdoc
Jesse Gillis
Undergraduates
Raymond Lim
Suzanne Lane
Programmers
Kelsey Hamer
Luke McCarthy
Injury
Stress
Disease
Aging
Development
Signal transduction
Synapse
Genome
Synaptic modulation
Topics
• Connectivity database and analysis
• Gene expression data re-use system
• Scaling up gene coexpression analysis
• Applications and ongoing work
Another ‘ome
Leon French, Suzanne Lane
Growth of GEO
[Figure: Submissions (0 to 120000) plotted against Date (Dec-99 to Feb-08)]
[Figure: two panels, each labeled Age, Genes, and Samples]
With JJ Mann, V Arango, E Sibille et al.
Data from http://national_databank.mclean.harvard.edu/
GEO
Goals for a system
• Researchers should be able to put their new
expression data in a wider context of previous
studies without extraordinary effort.
• Move the analysis of multiple microarray data sets
from a niche activity to the mainstream.
• Integrate other data types and domain-specific
information.
Public data
sources
Coexpression
Differential expression
Challenges to comparing data sets
• Need to match genes/transcripts across platforms
• Data from third parties not always easy to handle
• Varying scales, normalization, etc.
• Varying data quality
• Varying levels of “raw data” available
• Selecting appropriate data to compare
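A minimal sketch of two of the challenges above: matching probes to genes across platforms, and removing scale differences. It assumes probes that share a gene are averaged and each gene profile is z-scored. The probe IDs, the mapping, and both function names are hypothetical; the slides do not say how the actual system handles either step.

```python
from collections import defaultdict
from statistics import mean, pstdev


def collapse_to_genes(probe_values, probe_to_gene):
    """Average the profiles of all probes that map to the same gene."""
    grouped = defaultdict(list)
    for probe, profile in probe_values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:  # drop probes with no gene annotation
            grouped[gene].append(profile)
    return {
        gene: [mean(column) for column in zip(*profiles)]
        for gene, profiles in grouped.items()
    }


def standardize(profile):
    """Z-score one gene profile so platform-specific scales drop out."""
    mu, sd = mean(profile), pstdev(profile)
    if sd == 0:
        return [0.0 for _ in profile]  # flat profile: nothing to scale
    return [(x - mu) / sd for x in profile]


# Toy data (made up): two probes for GRIN1, one for PLD3, one unannotated.
probe_values = {
    "probe_a": [1.0, 2.0, 3.0],
    "probe_b": [3.0, 4.0, 5.0],
    "probe_c": [10.0, 10.0, 40.0],
    "probe_x": [7.0, 7.0, 7.0],
}
probe_to_gene = {"probe_a": "GRIN1", "probe_b": "GRIN1", "probe_c": "PLD3"}

gene_values = collapse_to_genes(probe_values, probe_to_gene)
gene_z = {gene: standardize(profile) for gene, profile in gene_values.items()}
```

Averaging is only one way to combine probes; it ignores the probe specificity problem, which would need a separate filter.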
With Cincinnati Children’s Hospital (D. Glass, M. Barnes et al.)
Probe specificity (or lack thereof)
[Figure: two histograms; y-axis: Frequency; x-axes: Fraction non-specific probes (0.0 to 1.0) and Fraction of probes with alignments (0.0 to 1.0)]
Which data sets are reasonable to compare?
Too general, but lots of power
All mouse data sets
Mouse brain data sets
Mouse neocortex data sets
Mouse neocortex data sets examining stress
Mouse neocortex data sets examining hypoxic stress
Mouse neocortex data sets examining hypoxic stress after 3 hours of hypoxia
Very specific, low power
Expression experiments: 519
  Mus musculus: 254
  Homo sapiens: 203
  Rattus norvegicus: 62
Array designs: 178
Assays (i.e., chips): 20837
Coexpression links (probe-level): >100 million
Scaling up analysis of gene
coexpression
Eisen et al., 1998 PNAS
Genes that are coexpressed tend to have related function
• Needed at the same place at the same time
• “Guilt by association”
• Reasonable to compare across studies
[Figure: expression profiles of two ribosomal protein genes; axes: Expression, Samples]
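To make "coexpressed" concrete: one common measure is the Pearson correlation between two genes' expression profiles across the same samples. The slides refer to correlation without naming the measure, so Pearson is an assumption here, and the profiles below are made-up numbers.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no coexpression signal
    return sxy / sqrt(sxx * syy)


# Toy profiles over five samples (hypothetical values).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.8]  # rises with gene_a
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]  # mirror image of gene_a

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
```

Here `r_ab` is close to 1 and `r_ac` is -1; a pair like `gene_a` and `gene_b` would count as a coexpression "link" if its correlation passed whatever threshold the analysis uses.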
Biological noise
• Induced gene expression effects are often small.
• Gene expression varies between “replicates” in
biologically-meaningful ways.
• Allows us to repurpose data
Sample type
Functional coexpression should be (somewhat) generalized
• If two genes are coexpressed under one condition, they will probably be
coexpressed under at least some other conditions (or data sets).
• Coexpression seen “only once” needs special care in interpretation.
• We shouldn’t expect coexpression to be perfectly reproducible (for biological
and technical reasons).
[Figure: two panels, each labeled Correlation]
A simple approach: count recurring patterns
Genome Research, June 2004
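Counting recurring patterns can be sketched as follows, assuming each data set has already been reduced to a list of coexpression links (gene pairs that passed some correlation threshold). The gene names, link lists, and function names are hypothetical.

```python
from collections import Counter


def count_support(datasets):
    """Return a Counter mapping each link to the number of data sets it occurs in."""
    support = Counter()
    for links in datasets:
        # Normalise pair order and count each link at most once per data set.
        unique_links = {tuple(sorted(pair)) for pair in links}
        support.update(unique_links)
    return support


def recurring(support, min_datasets):
    """Links seen in at least min_datasets data sets."""
    return {link for link, n in support.items() if n >= min_datasets}


# Toy link lists for three data sets (made up).
datasets = [
    [("g1", "g2"), ("g2", "g3")],
    [("g2", "g1"), ("g4", "g5")],
    [("g1", "g2"), ("g2", "g3"), ("g2", "g3")],
]

support = count_support(datasets)
replicated = recurring(support, min_datasets=2)
```

Sorting each pair treats links as undirected, so `("g2", "g1")` and `("g1", "g2")` count as the same link.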
Pipeline for one dataset
Proof of concept analysis
• 60 human data sets, 15700 RefSeq genes.
• 70% cancer data
• 11 million “links”
• About 9.7 million different links
Many links are replicated across studies
[Figure: log-log plot of Number of links (1.E+00 to 1.E+07) against Minimum number of data sets link is seen in (1 to 100); series: Observed, Shuffled database (mean)]
Evaluation on biological grounds
Cluster involving NMDAR1 (GRIN1)
GRIN1
ATP6V0A1
Allen Brain Institute
PLD3
Application: analysis of imprinted genes
Laurent Journot, INSERM – Universités Montpellier
Correlation p-value
LYAR interacting proteins
LYAR-interactors
Ewing et al, 2007 Molecular Systems Biology
Vote counting limitations
• Weak evidence distributed across data sets will not be picked up.
• This example meets strict “vote counting” criteria in only 2/23 data sets
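One standard way to pick up weak but consistent evidence is to combine the per-data-set correlations into a single global effect size. The sketch below uses a Fisher z-transformed weighted mean, a textbook meta-analysis technique; the slides do not state which method the speaker uses, so treat this as an assumption. The correlations, sample sizes, and threshold are made up, chosen only to echo the 2/23 situation.

```python
from math import atanh, tanh


def votes(correlations, threshold):
    """Number of data sets in which the pair passes a strict threshold."""
    return sum(1 for r in correlations if r >= threshold)


def global_correlation(correlations, sample_sizes):
    """Fisher z weighted mean of correlations (weight n - 3 per data set)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for r, n in zip(correlations, sample_sizes):
        if n <= 3:
            continue  # too few samples for the n - 3 weight
        r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
        weighted_sum += (n - 3) * atanh(r)
        total_weight += n - 3
    if total_weight == 0:
        raise ValueError("no data set has more than 3 samples")
    return tanh(weighted_sum / total_weight)


# Toy example: 23 data sets, only 2 pass a hypothetical threshold of 0.8.
correlations = [0.4] * 21 + [0.9, 0.9]
sample_sizes = [20] * 23

n_votes = votes(correlations, threshold=0.8)
r_global = global_correlation(correlations, sample_sizes)
```

Vote counting sees support in 2 of 23 data sets, while the combined estimate is a moderate positive correlation backed by all 23.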
[Figure: gene pairs by datasets; axes: Correlation (-1.0 to 1.0), Support (# of datasets, 2 to 14), and Global effect size (global correlation)]
Related work: Zhou XJ et al., Nat.Biotech 2005
Summary
• Reuse of public data: ‘adding value’
• Meta-analysis of coexpression
• Some applications
• Functional prediction
• Candidate identification
• Platform evaluation
Ongoing and future work
• Applications and analyses
• Protein interactions and hubs
• Prediction of gene function at the synapse
• Differential expression analysis
• Regionalization
• Mouse models of brain injury
• Mouse models of psychosis
• Expanding our public database and software
http://www.bioinformatics.ubc.ca/Gemma
Web-based tools for biologists; web services coming soon
• Integration with other information sources
Thanks
Gemma
Xiang Wan
Kelsey Hamer
Luke McCarthy
Kiran Keshav
Suzanne Lane
Meeta Mistry
Jesse Gillis
And to:
NCBI GEO team
Groups who made data available
Collaborators who provided data prior to
publication
Conrad Gilliam
Abraham Palmer
Joseph Santos
Gozde Cozen
David Quigley
Anshu Sinha
Spiro Pantazatos
Wei-Keat Lim
Tmm
Homin Lee
Amy Hsu
Jon Sajdak
Jie Qin
Tzu-Lin Hsaio
Andreas Kottmann
Etienne Sibille
Collaborators
Barclay Morrison
Joseph Gogos
Michael Hayden
Blair Leavitt
Tony Blau
Panos Papapanou
Answers to FAQs
• No, they don’t have to be time course experiments.
• Yes, we’re using cDNA as well as Affymetrix etc.
• Yes, we see reproducible negative correlations.
• Yes, we’re interested in finding differences as well as
similarities between data sets.
• No, we aren’t necessarily inferring regulatory relationships.
• Yes, we know that RNA is just one way of measuring cell
state.
• No, we don’t have {worm,fly,yeast…} data, but we’d like to.