Mike Langston’s Progress Report Fall, 2005

PROGRESS REVIEW
Mike Langston’s Research Team
Department of Computer Science
University of Tennessee
with collaborative efforts at
Oak Ridge National Laboratory
November 22, 2005
Team Members in Attendance
Bhavesh Borate, Elissa Chesler, John Eblen,
Roumyana Kirova, Mike Langston,
Andy Perkins, Yun Zhang
Team Members Absent
Xinxia Peng, Jon Scharff, Josh Steadmon
Mike Langston’s Progress Report
Fall, 2005
• Team Changes
New Students: Belma Ford (GST), Peter Shaw (Australia)
New Colleagues: Elissa Chesler & Roumyana Kirova (ORNL)
Graduating Soon: Xinxia Peng (December), Jon Scharff (May)
Moved Collaborators: Jay Snoddy (Vanderbilt)
• Recent Conferences/Talks
ACiD (England), Dagstuhl (Germany), COCOON (China),
Purdue, Supercomputing (Seattle)
• Upcoming Visits/Talks
RECOMB WS (San Diego), Texas A&M, Carleton (Canada),
Göteborg (Sweden), AICCSA-06 (UAE), ACM SAC (France)
• Support
NIH (John), ORNL (Yun), Science Alliance (Andy)
Proposals Outstanding
• Sample Projects
Eukaryotes: Allergy (Human), Diabetes (Mice), IR (Mice),
Neuroscience (Mice), others
Prokaryotes: Operon (R. palustris), Shock (Shewanella)
Yun Zhang
• Recent conferences/talks
– Prepared slides for COCOON05, China
– Presented at SC05 (Supercomputing), Seattle
• Upcoming events
– Cray MTA (Multithreaded Architecture) Workshop, ORNL
• Projects: maximal clique enumeration
– Comparisons of multithreaded implementations on
• Altix vs. Cray vs. IBM
• Cray: Vectorization of for-loops
– Implementations on distributed-memory machines
• Using MPI vs. Global Arrays
• Load-balancing using master/slave vs. peer-to-peer model
– Comparison of MPI vs. Multithreaded
Parallel Clique Enumeration
• Objective
– Minimize data communication vs. maximize balanced load
• Dynamic load balancing
– Data transfer: peer-to-peer
– DLB strategies: master/slave vs. peer-to-peer
Search tree
[Figure: search tree with levels k=1 through k=5 and node labels 1 through 6. A task needed to be transferred from slave1 to slave5.]
Clique Enumeration
• Methods to speed up the computation core
• Bit compression to save memory, and corresponding
bitwise operations on compressed bitmaps
Vertices: a, b, c, d, e, f, g
Adjacency bitmaps (one row per vertex):
     a  b  c  d  e  f  g
a    0  1  1  1  1  0  0
b    1  0  1  1  1  1  0
c    1  1  0  1  1  1  1
d    1  1  1  0  1  0  1
e    1  1  1  1  0  0  1
f    0  1  1  0  0  0  1
g    0  0  1  1  1  1  0
Cliques:
(a, b, c, d)   0  0  0  0  1  0  0
[Figure labels: dense, sparse]
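As a minimal sketch of the bitwise operations mentioned on this slide, the 7-vertex example can be encoded with one integer bitmap per vertex. ANDing the bitmaps of a clique's members gives the vertices adjacent to all of them. The function names are illustrative assumptions, and Python integers stand in for whatever compressed bitmap format the team's implementation uses.

```python
# Adjacency bitmaps for the 7-vertex example; bit order is a..g, leftmost bit = a.
VERTICES = "abcdefg"
ROWS = {
    "a": "0111100",
    "b": "1011110",
    "c": "1101111",
    "d": "1110101",
    "e": "1111001",
    "f": "0110001",
    "g": "0011110",
}
ADJ = {v: int(bits, 2) for v, bits in ROWS.items()}


def common_neighbors(clique):
    """AND together the adjacency bitmaps of every vertex in the clique."""
    mask = (1 << len(VERTICES)) - 1  # start with all bits set
    for v in clique:
        mask &= ADJ[v]
    return mask


def to_bits(mask):
    """Render a bitmap as a fixed-width string in a..g order."""
    return format(mask, "0{}b".format(len(VERTICES)))


def is_maximal(clique):
    """A clique is maximal when no vertex is adjacent to all of its members."""
    return common_neighbors(clique) == 0


print(to_bits(common_neighbors("abcd")))  # only the bit for vertex e is set
print(is_maximal("abcde"))
```

The bitmap for (a, b, c, d) has a single bit set, for vertex e, so the clique extends to (a, b, c, d, e), which has no common neighbors left.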
Andy Perkins
Projects
• Low dose
• Allergy
• Shewanella
• HRT
Microarray Data
• Normalization
• Filtering low or unchanging expression values
• Control spots
Differential Analysis
• Cliquification
– Present in a large percentage of cliques in one group and in few
in the other.
• Expression
– 2-fold change in expression between groups
• Correlation
– Correlation value >= 0.85 in one group and <=
0.25 in the other.
Differential Analysis
Red edge: >=0.85 in dose and <= 0.25 in control
Blue edge: >= 0.85 in control and <= 0.25 in dose
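The red/blue edge rule can be sketched as follows. This is an illustration only: the gene names and expression values are hypothetical toy data, the correlation is plain Pearson on raw values (the slides do not say whether absolute values are used), and `differential_edges` is an assumed helper name.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def differential_edges(dose, control, high=0.85, low=0.25):
    """Return {(gene1, gene2): color} for pairs correlated in one group only."""
    edges = {}
    for g1, g2 in combinations(sorted(dose), 2):
        r_dose = pearson(dose[g1], dose[g2])
        r_ctrl = pearson(control[g1], control[g2])
        if r_dose >= high and r_ctrl <= low:
            edges[(g1, g2)] = "red"    # strong in dose, weak in control
        elif r_ctrl >= high and r_dose <= low:
            edges[(g1, g2)] = "blue"   # strong in control, weak in dose
    return edges


# Hypothetical toy expression profiles (four samples per group).
dose = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [1, -1, -1, 1]}
control = {"g1": [1, 2, 3, 4], "g2": [1, -1, -1, 1], "g3": [2, 4, 6, 8]}
edges = differential_edges(dose, control)
print(edges)
```

Pairs that are strongly or weakly correlated in both groups get no edge, so the resulting graph contains only relationships that differ between conditions.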
Other research
• Thresholding
• Pearson’s vs Spearman’s
• Random graphs
Papers
• “Computational Analysis of Mass Spectrometry Data Using
Novel Combinatorial Methods,” Proceedings, ACS/IEEE
International Conference on Computer Systems and
Applications, Dubai, United Arab Emirates, March, 2006, with A.
Fadiel, M. A. Langston, F. Naftolin, X. Peng, P. Pevsner, H. S.
Taylor, O. Tuncalp, and D. Vitello.
• “Innovative Computational Methods for Transcriptomic Data
Analysis,” Proceedings, ACM Symposium on Applied
Computing, Dijon, France, April, 2006, with M. A. Langston, A.
M. Saxton, J. A. Scharff and B. H. Voy.
John Eblen
Clique Analysis Tool Chain
• Projects
– Gerling Data – NOD mice
– Shewanella Data
• Three Interesting Problems
– Aggregating Maximal Cliques
– Thresholding
– Biological Analysis of Clique Results
Aggregating Maximal Cliques
• The Problem
– A great deal of overlap among maximal cliques
– Many cliques differ by only a few nodes
• Solutions
– Paraclique (Dr. Langston)
– Nucleated Clique (Jon Scharff)
– Clique Difference or “Nonoverlap”
– Others
Direct Maximum Clique
• Parallel version scales well on Altix
supercomputer, shared memory machines
• Currently working on base serial code
efficiency
• Ultimate goal is speed
– Best algorithm possible
– Smart implementation(s)
Keller 7 Conjecture
• Goal is to find, or prove the nonexistence of, a 128-clique in the Keller 7 graph
• Current approach
– Found a set of 128 nonoverlapping independent sets (ISs)
– Currently searching for more
– Should greatly reduce search space
Bhavesh Borate
Thresholding
• GO Pairwise Similarity Analysis
• Percentage of Cliques with Biological Meaning at each threshold
• Confidence Intervals
• Graph Properties (Edge Density, Maximal Cliques, Maximum Clique)
• Spectral Graph Theory
• Bayesian Statistics
• Control Spot Threshold verification
• Utilization of Info from Pathway Databases
• Combinatorial Strategy
• Kentucky Windage ;)
Graph of GO-Pairwise Scores vs. Correlation Values
Shewanella data
[Figure: Avg functional similarity vs. Correlation. y-axis: Avg functional similarity, 0 to 0.14; x-axis: Correlation, 1.2 down to 0.]
GO Pairwise Similarity Analysis
For each pair of genes, we find the GO category X that covers both
genes and has the minimum number of total genes.
Get a GO score for each pair of genes.
Accumulate correlation scores in bins 1, 0.99, 0.98, …, 0.
Average the GO scores of the pairs in each bin.
Plot.
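The steps above might be sketched like this. The category names, gene names, and scores are hypothetical, and how a GO score is derived from the smallest shared category is not stated on the slide, so the scores are taken as given inputs here.

```python
from collections import defaultdict


def smallest_shared_category(g1, g2, categories):
    """Return (name, size) of the smallest category containing both genes, or None."""
    shared = [(len(genes), name) for name, genes in categories.items()
              if g1 in genes and g2 in genes]
    if not shared:
        return None
    size, name = min(shared)
    return name, size


def average_score_by_bin(pairs):
    """pairs: iterable of (correlation, go_score); bins are 0.01 wide."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for corr, score in pairs:
        b = round(corr * 100) / 100  # nearest bin: 1, 0.99, 0.98, ...
        sums[b] += score
        counts[b] += 1
    return {b: sums[b] / counts[b] for b in sorted(sums, reverse=True)}


# Hypothetical categories and per-pair (correlation, score) values.
categories = {"cat_A": {"x", "y", "z"}, "cat_B": {"x", "y"}, "cat_C": {"z"}}
print(smallest_shared_category("x", "y", categories))

pairs = [(0.991, 0.10), (0.99, 0.14), (0.502, 0.02)]
binned = average_score_by_bin(pairs)
print(binned)
```

The binned averages are what would be plotted against correlation.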
Pairwise Scores
Score for each Clique
Get P-value for each Clique
For each threshold 0.8:0.01:0.95
At each threshold calculate
% Cliques with P-value < 0.01
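A short sketch of the threshold sweep, assuming the per-clique P-values have already been computed at each threshold. The P-values below are made up for illustration, and `percent_significant` is an assumed helper name.

```python
def percent_significant(pvalues_by_threshold, alpha=0.01):
    """Map each threshold to the percentage of cliques with P-value < alpha."""
    result = {}
    for t, pvals in sorted(pvalues_by_threshold.items()):
        if pvals:
            hits = sum(1 for p in pvals if p < alpha)
            result[t] = 100.0 * hits / len(pvals)
        else:
            result[t] = 0.0  # no cliques found at this threshold
    return result


# Thresholds 0.80, 0.81, ..., 0.95 (the 0.8:0.01:0.95 range).
thresholds = [round(0.80 + 0.01 * i, 2) for i in range(16)]

# Hypothetical clique P-values at two of the thresholds.
example = {0.80: [0.001, 0.5, 0.2, 0.005], 0.81: [0.02]}
summary = percent_significant(example)
print(summary)
```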
Updates from Xinxia
• Kevin was born in May
• Defended in October
• Graduating in December
• Working on publications
• Starting a job in December
Thank you all and keep in touch!
Suman Duvvuru
Data analysis
• Effect of Strain: Currently working on Dr. Brynns mice strain
data and writing code in SAS to see which strain produces
strong correlation in the data.
• The problem with microarray data
1. The number of variables is much higher than the number
of observations. This causes many eigenvalues of the
covariance matrix to be 0, so the correlation matrix is
problematic.
2. Can be corrected using
• shrinkage-based correlation
• information-criteria-based methods (using smooth covariance
estimators)
(Implementation of these methods currently in progress)
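To illustrate the problem and one possible correction, the sketch below builds a correlation matrix from three variables with only two observations, which is singular, and then applies simple linear shrinkage toward the identity. This is an assumed stand-in: the slide does not name a specific shrinkage estimator, and the intensity `lam` is fixed arbitrarily here rather than estimated from the data.

```python
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(variables):
    """variables: list of variables, each a list of observations."""
    return [[pearson(x, y) for y in variables] for x in variables]


def shrink(R, lam):
    """Linear shrinkage toward the identity: (1 - lam) * R + lam * I."""
    p = len(R)
    return [[(1 - lam) * R[i][j] + (lam if i == j else 0.0)
             for j in range(p)] for i in range(p)]


def det3(M):
    """Determinant of a 3x3 matrix."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))


# Three variables, two observations: more variables than observations.
variables = [[1, 2], [3, 5], [4, 1]]
R = correlation_matrix(variables)
R_shrunk = shrink(R, lam=0.2)  # lam chosen arbitrarily for illustration
print(det3(R), det3(R_shrunk))
```

With two observations every pairwise correlation is exactly 1 or -1, so the determinant is zero. After shrinkage the determinant is positive and the matrix is invertible.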
Roumyana Kirova
mRNA expressions and Linkage
Gene expression data: N genes, K strains
Probe    BXD1   BXD2   BXD3   BXD5   BXD8   …
1        4.46   5.30   5.80   5.51   4.90   ...
2        4.10   4.49   4.24   4.06   4.46   ...
3        5.15   4.74   5.04   6.10   5.20   ...
4        6.45   6.03   5.79   6.56   7.32   ...
5        4.06   5.06   4.35   4.09   4.09   ...
…
12000    4.16   4.06   5.37   5.28   5.31   …
Polymorphisms
Marker   BXD1   BXD2   BXD3
M1       AA     aa     AA
M2       AA     AA     AA
M3       aa     AA     AA
M4       AA     aa     AA
M5       AA     AA     aa
M3000    aa     aa     AA
[Figure: histogram by genotype (aa vs. AA); x-axis 4 to 7.5 and “More”, y-axis 0 to 0.7]
Model: QTL mapping
y = \mu + \sum_{i=1}^{m} q_i x_i + \sum_{i,j}^{l} \gamma_{ij} x_i x_j + e
y = expression levels
x_i = 2 if AA, 0 if aa
e = error terms
LOD = \log_{10} [ P(Model, q_i \neq 0) / P(Model, q_i = 0) ]
Expressions: 0.46 0.30 0.80 1.51 0.90
[Figure: Paraclique1 and Paraclique2 diagrams; labels: Regulatory, C1, C2, Clique 2, Model 1, Model 2]
[Figure: Correlation histograms for Regulatory ID 2840 and Regulatory ID 267; axes: res, Frequency, Paraclique members (0 to 2000)]
[Figure: Paraclique1, Principal components (C1, C2), QTL Model 1; Paraclique2, Principal components (C1, C2), QTL Model 2]
[Figure: Principal components and QTL Model 1, Principal components and QTL Model 2, Meta component, Common QTL]
Open questions:
1. How stable are the paracliques and QTL models if we choose different
samples (not the average of the replicates)?
• Generate samples of the data by randomly choosing replicates, and
build confidence intervals.
• Fit a multi variance model: Expression ~ Strain + Sample + Strain:Sample
• Add covariates to the QTL model to adjust for the gender effect.
2. Power issues: how many strains, how many replicates, and how many
terms in the model?
• Simulate expression data and calculate power as a function of the
sample size.
3. Parametric vs. non-parametric analysis.
4. Multiple-test adjustments.
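The replicate-resampling idea in question 1 could be sketched as below. One replicate per strain is drawn at random, a statistic is recomputed, and a percentile interval is taken over the draws. The statistic used here (correlation between two genes) and all data values are assumptions chosen for illustration. The same procedure would apply to paraclique membership or QTL model terms.

```python
import random
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def resample_interval(gene_a, gene_b, n_draws=1000, level=0.95, seed=0):
    """gene_a, gene_b: per-strain lists of replicate values (same shape).

    Each draw picks one replicate index per strain and uses it for both genes.
    Returns a percentile interval (lo, hi) for the correlation.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_draws):
        xs, ys = [], []
        for reps_a, reps_b in zip(gene_a, gene_b):
            k = rng.randrange(len(reps_a))
            xs.append(reps_a[k])
            ys.append(reps_b[k])
        stats.append(pearson(xs, ys))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_draws)]
    hi = stats[int((1 + level) / 2 * n_draws) - 1]
    return lo, hi


# Hypothetical data: four strains, two replicates each.
gene_a = [[1.0, 1.2], [2.0, 2.1], [3.0, 3.3], [4.0, 3.9]]
gene_b = [[2.1, 2.0], [3.9, 4.2], [6.0, 6.1], [8.2, 7.9]]
lo, hi = resample_interval(gene_a, gene_b)
print(lo, hi)
```

A narrow interval suggests the result is stable with respect to which replicates are used. A wide one suggests that averaging the replicates is hiding variability.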
Ontological Discovery for Ethanol Research
(…the new acronym stinks)
Elissa J. Chesler
Department of Anatomy and Neurobiology
Center for Genomics and Bioinformatics
University of Tennessee-Memphis
Health Science Center
Ontological Discovery for Ethanol Research
SPECIFIC AIMS
• Aim 1: To develop a data archive of ethanol, brain and
behavior related gene sets that have been derived both
empirically and through literature review.
• Aim 2: To develop a tool that allows cross-species,
cross-molecule type gene set comparison.
• Aim 3: To develop a Web interface to the data archive
and analysis that is aimed toward behavioral
neuroscientists.
The Seizure Related Phenotype Landscape
[Figure: phenotype landscape; labels: Cocaine & PTZ, Audiogenic, ATPases, T4, Pressure, EtOH Withdrawal]
Highly related phenotypes share many common mRNA correlates
Ontological Discovery from
Phenotype Centered Gene Sets
Phenotypes are operationally defined, based on
phenomenology.
Gene sets can be empirically
associated with phenotypes.
But what “IS” the underlying construct, really?
Can we identify it by examining shared biological
substrates of related processes?
ERGO:
Ethanol Related Gene Ontology
AIM 1: Gene set assembly and archive
• Gene set is broadly defined.
– mRNA differential expression
– mRNA correlation
– Literature review
– KO, mutants with trait effects
• Search
– by gene
– by descriptor
– by set matching
• Attributes of each gene set include:
– Type (mRNA, lit, protein)
– Species
– Free text description
– Structured description, e.g. MPO
– Source DB (GO, KEGG, WebQTL100)
– Associated document (e.g. abstract, publication)
Aim 2: Analytic tool
• Translates gene sets to a
common reference species
via homology.
• Similar to existing tools, but
archives more information
about gene set
• Allows multiple set
comparisons (intersection
analyses are not limited to
two sets).
• Percent positive matching
allows estimation of the
relation of gene sets w/o
specific regard to identity of
genes. This allows a basis
for clustering phenotypes
based on gene annotation
GeneKeyDB can be used to generate translation tables
across species
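The Aim 2 operations can be sketched with plain sets. This assumes that "percent positive matching" corresponds to the Jaccard index, which the Future Directions slide mentions by name. The gene identifiers and the homology table are hypothetical placeholders, not GeneKeyDB output.

```python
def translate(gene_set, homolog_table):
    """Map genes to the reference species; genes with no homolog are dropped."""
    return {homolog_table[g] for g in gene_set if g in homolog_table}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_genes(*gene_sets):
    """Intersection of any number of gene sets (not limited to two)."""
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical mouse-to-human translation table and gene sets.
table = {"mA": "hA", "mB": "hB"}
mouse_set = {"mA", "mB", "mX"}        # mX has no known homolog
human_set = {"hB", "hC"}

translated = translate(mouse_set, table)
print(translated)
print(jaccard(translated, human_set))
print(shared_genes({1, 2, 3}, {2, 3}, {3, 4}))
```

Dropping genes without a homolog is only one option, and it lowers the match score. The Research challenges slide raises this question and leaves it open.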
Aim 3 Behavioral Neuroscience
Friendly interface
• Does the world need another boutique?
• Making genomics accessible to broader research
community.
• Text searching to retrieve, e.g. all gene sets related
to ‘stress’.
• Text mining
• Apparatus specific details
• OUR GOAL IS TO CREATE A TOOL FOR
PHENOTYPIC ANALYSIS. GENES CAN BE A
BLACK BOX THAT GETS US THERE!
Future Directions
Bleeding Edge
From a matrix of set-set correlations estimated
by Jaccard’s positive match, can we draw and analyze
graphs of gene set relations?
From a set of documents associated with overlapping
gene sets, can we mine text for frequently occurring
terms?
e.g. to answer “What term is most commonly occurring
in the set of sets extracted by match to expression
upregulation in response to handling stress?”
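A minimal sketch of the term-mining question: count how many associated documents each term appears in and report the most frequent. The documents and the stopword list are invented for illustration, and a real pipeline would need proper tokenization and a fuller stopword list.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "of", "in", "and", "a", "to", "is", "by"}


def most_common_terms(documents, top=3):
    """Count the number of documents each term occurs in; return the top terms."""
    counts = Counter()
    for text in documents:
        words = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        counts.update(words)  # each term counted once per document
    return counts.most_common(top)


# Hypothetical abstracts attached to overlapping gene sets.
docs = [
    "Stress response in the hippocampus",
    "Handling stress and corticosterone",
    "Corticosterone response to stress",
]
top_term = most_common_terms(docs, top=1)
print(top_term)
```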
Research challenges
• Translation of genes across species:
– Homology is not perfect. How do we match when
no homologues are found?
• Reference Set
– What is the “reference set” for category
representation analysis when gene sets are drawn
from diverse sources?
– Lack of comprehensiveness of reference sets, e.g. a
list of KO mice does not include all genes
screened.
• Generation and curation of gene sets:
establishing meaningful protocols and
definitions to increase the quality and utility
– Use GenMAPP or Stanford models.
Gene set overlap unites diverse phenomena
[Figure: overlapping gene sets (ontology); labels: Consumption Correlates in RI lines; Gene Expression Correlates of Htr1b; Upregulated in Social Isolation; Upregulated in P vs NP; Literature On Neuroactive Steroid Synthesis]
Induction of a research question:
“If I antagonize the gene product of consumption correlate in socially
isolated monkeys, consumption will decrease.”
“Hey, you put your social isolation in my NP mice!
Yeah, well you put your P mice in my binge drinking!”
