PROGRESS REVIEW
Mike Langston’s Research Team
Department of Computer Science
University of Tennessee
with collaborative efforts at
Oak Ridge National Laboratory
June 27, 2005
Team Members in Attendance
Bhavesh Borate, Suman Duvvuru, John Eblen,
Mike Langston, Xinxia Peng, Andy Perkins, Jon
Scharff, Henry Suters, Yun Zhang
Team Members Absent
Josh Steadmon
Mike Langston’s Progress Report
Summer, 2005
• Team Changes
– Graduating Soon: Xinxia Peng, Jon Scharff
– New Member: Andy Perkins
• Team Foci
– FPT Tools and Applications
– Computational Biology
• Recent Conference Talks
– AICCSA-05 (Egypt), RTST-05 (Lebanon), DIMACS (New Jersey)
• Recent Visits
– Cold Spring Harbor Lab (New York)
• Upcoming Conference Talks
– ACiD-05 (England), Dagstuhl-05 (Germany), COCOON-05 (China)
• Upcoming Major Program Committee Service
– AICCSA-06 (Program Chair), IWPEC-06 (Program Co-Chair)
John Eblen
Dr. Ivan Gerling’s Data
• Details
– Leukocyte data: 2 ages, 3 strains
– Islet data: 3 ages, 4 strains
• Current Project – Adding Proteins
– Add 60 proteins to the leukocyte data (22,690 probe sets)
– How can we improve correlation?
– What other types of analysis are possible?
General Clique Problem
• Specific Approaches
– “Biographs” or graphs created from correlation values
– Brock graphs
– An approach for Keller graphs?
• Information Gathering
– Markov chains
– General graph properties
• Combining Algorithms
Additional Projects
• Fast Direct Clique Codes
– Currently testing on DIMACS challenge
graphs
– Work continues
• Common Neighbor Preprocessing
Jon Scharff
Differential Expression
• Student’s t-test, comparing two normally distributed populations:
– Null hypothesis: the means are equal
– Variances assumed to be equal
• Applied on a gene-by-gene basis (see the sketch below)
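A minimal sketch of the gene-by-gene t-test idea, using SciPy; the expression matrix `expr` and the group mask are hypothetical toy data, not the datasets discussed in this report.

```python
# Minimal sketch: gene-by-gene two-sample t-test for differential expression.
# `expr` (genes x samples) and `in_group_a` are hypothetical toy inputs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 12))            # 100 genes, 12 samples (toy data)
in_group_a = np.array([True] * 6 + [False] * 6)

group_a = expr[:, in_group_a]
group_b = expr[:, ~in_group_a]

# equal_var=True matches the classic Student's t-test assumption of equal variances.
t_stats, p_values = stats.ttest_ind(group_a, group_b, axis=1, equal_var=True)

differential = np.where(p_values < 0.01)[0]  # indices of candidate genes
print(f"{len(differential)} genes flagged at p < 0.01")
```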
Differential Correlation
Differential Cliquification
• Cliques that appear in one graph but not in the comparison graph (a sketch follows below)
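A hedged illustration of the idea using networkx on two toy graphs; the graphs and the minimum clique size are made up for the example.

```python
# Sketch of "differential cliquification": maximal cliques found in one graph
# but absent from the comparison graph. The two toy graphs stand in for
# correlation graphs built under two conditions.
import networkx as nx

g_condition = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])
g_control = nx.Graph([(1, 2), (3, 4)])

def maximal_cliques(g, min_size=3):
    return {frozenset(c) for c in nx.find_cliques(g) if len(c) >= min_size}

def is_clique(g, nodes):
    # True when every pair of the given nodes is joined by an edge in g.
    return all(g.has_edge(u, v) for u in nodes for v in nodes if u < v)

differential = [c for c in maximal_cliques(g_condition)
                if not is_clique(g_control, c)]
print(differential)   # e.g. [frozenset({1, 2, 3})]
```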
Nucleus Cliques/Clique Nuclei
Yun Zhang
Clique Enumeration Problem (1)
• Proposed a new maximal clique enumeration
algorithm
– Inspired by the Kose et al. algorithm
– Enumerates cliques in non-decreasing order of size
– Uses bitwise operations to speed up computation and reduce space requirements (a generic bitset illustration follows below)
– The sequential algorithm is parallelizable
– Serial code is almost 400 times faster than the Kose RAM code on the 0.85 threshold MAS5.0 graph (size 12,422)
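The following is a generic bitset illustration, not the team's algorithm: a Bron-Kerbosch-style maximal clique enumerator in which each adjacency row is packed into a Python integer, so candidate-set intersection is a single bitwise AND. The toy adjacency list is hypothetical.

```python
# Hedged sketch (not the algorithm described above): maximal clique enumeration
# with adjacency rows packed into integers, so set intersection is one AND.
def enumerate_maximal_cliques(adj):
    """adj[v] is a bitmask of v's neighbours; yields maximal cliques as bitmasks."""
    n = len(adj)
    all_vertices = (1 << n) - 1

    def expand(clique, candidates, excluded):
        if candidates == 0 and excluded == 0:
            yield clique                       # nothing can extend it: maximal
            return
        cand = candidates
        while cand:
            v_bit = cand & -cand               # lowest set bit = next candidate
            v = v_bit.bit_length() - 1
            cand ^= v_bit
            yield from expand(clique | v_bit,
                              candidates & adj[v],   # only neighbours stay candidates
                              excluded & adj[v])
            candidates ^= v_bit                # move v from candidates to excluded
            excluded |= v_bit

    yield from expand(0, all_vertices, 0)

# Triangle 0-1-2 plus pendant edge 2-3: expect cliques {0,1,2} and {2,3}.
adj = [0b0110, 0b0101, 0b1011, 0b0100]
for c in enumerate_maximal_cliques(adj):
    print([v for v in range(4) if c >> v & 1])
```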
Clique Enumeration Problem (2)
• Space required to hold the cliques is enormous
[Figure: memory usage (GBytes) versus clique size, measured on a graph with 2895 vertices]
On the 0.7 threshold MAS5.0 graph, memory usage reached almost 1 terabyte after 12 hours of running.
Clique Enumeration Problem (3)
• Parallelism on a shared-memory machine
– SGI Altix, 256 processors, 2 Terabytes shared
memory, 8GB per CPU
– Uses a dynamic task scheduler (a toy scheduling sketch follows below) to
• Synchronize multiple threads
• Make load-balancing decisions
– Achieves a super-linear speedup on up to 64
processors
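Below is a toy sketch of the dynamic-scheduling idea only, not the SGI Altix implementation: a shared work queue hands per-vertex subproblems to worker threads as they become idle, which is what provides the load balancing. The subproblem function is a placeholder.

```python
# Toy illustration of dynamic load balancing: idle workers pull the next
# subproblem from a shared queue instead of receiving a static partition.
import queue
import threading

def process_subproblem(vertex):
    # Placeholder for "enumerate all maximal cliques containing `vertex`".
    return f"cliques rooted at vertex {vertex}"

def worker(tasks, results):
    while True:
        try:
            vertex = tasks.get_nowait()   # dynamic scheduling: pull next task
        except queue.Empty:
            return
        results.append(process_subproblem(vertex))
        tasks.task_done()

tasks = queue.Queue()
for v in range(100):                      # one subproblem per vertex
    tasks.put(v)

results = []
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(8)]             # 8 worker threads
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), "subproblems completed")
```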
Clique Enumeration Problem (4)
Run times with and without load balancing, using up to 64 processors, on a graph with 2895 vertices
[Figure: run time (seconds) versus number of processors (threads), with and without load balancing]
Maximum Common Subgraph
• Clique branch algorithm by Henry (COCOON-05)
– Takes advantage of the special structure of the association graph built from the two input graphs (a construction sketch follows below)
• Finished serial implementation
• Preliminary performance testing on small graphs
• Next step:
– Benchmarking
– Parallel implementation
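For reference, here is a hedged sketch of the standard association (modular product) graph construction, in which a maximum clique corresponds to a maximum common induced subgraph of the two inputs; it is not Henry's clique branch algorithm, and the toy graphs are hypothetical.

```python
# Sketch of the standard association-graph construction: vertices are pairs
# (u, v) with u from G1 and v from G2; two pairs are adjacent when they
# preserve adjacency/non-adjacency. A maximum clique here corresponds to a
# maximum common induced subgraph of G1 and G2.
import networkx as nx

def association_graph(g1, g2):
    a = nx.Graph()
    a.add_nodes_from((u, v) for u in g1 for v in g2)
    for (u1, v1) in a.nodes:
        for (u2, v2) in a.nodes:
            if u1 != u2 and v1 != v2:
                # adjacency in G1 must match adjacency in G2
                if g1.has_edge(u1, u2) == g2.has_edge(v1, v2):
                    a.add_edge((u1, v1), (u2, v2))
    return a

g1 = nx.path_graph(3)                 # path 0-1-2
g2 = nx.cycle_graph(3)                # triangle
assoc = association_graph(g1, g2)
best = max(nx.find_cliques(assoc), key=len)
print(best)   # pairs (G1 vertex, G2 vertex) of a maximum common subgraph
```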
Andy Perkins
• Working with Jon on Brynn Voy's low-dose IR mouse data
• Finding and examining paracliques in the low-dose data
• Thresholding via spectral graph theory
• Clique on MPSS mouse data
Bhavesh Borate
Thresholding in High-Throughput data
Ways of getting to the threshold
Graph & Statistical Analysis
• Graph features/characteristics
• Using confidence intervals with Bayesian statistics
• Random: 0.5% edges in graph
Utilizing Biological Info
• Gene Ontology
• Utilization of info from pathway databases
Normal distribution of no. of edges
[Figure: number of edges versus correlation threshold (0 to 1.2) for the Spleen, Skin, MAS5, RMA, and PDNN datasets]
Comparison with other datasets
Data                 No. of edges   Max. degree   Max. clique size   Avg. maximal clique size
Spleen data (0.85)   34753          349           39                 20.03229
Skin data (0.87)     32384          606           66                 48.009
MAS5 data (0.84)     3704           134           19                 10.35285
RMA data (0.92)      34814          698           116                --
PDNN data (0.87)     34225          678           88                 68.1974
Gene Ontology
[Figure: scores from GO plotted against correlation scores, both ranging from 0 to 1]
Limitations
• GO data are helpful, but relying on them blindly is questionable
• Only applicable to genes with GO annotation
• For Elissa's data, Bing obtained inexplicable results (a noticeably flatter curve)
Info from Pathway Databases
• What the graphs mean in a biological context
• Extrapolate info from what is "known" to the "unknown"
• Expression data from housekeeping genes are invaluable
Limitations
• Info in pathway databases is not arranged in a tissue-specific or condition-specific manner
A Combinatorial Strategy
• Gather the information and develop an algorithm to make sense of it all and suggest a threshold to the user (a sketch follows this list)
• Also suggest to the biologist the ideal threshold under each method
• Provide a facility for displaying the graph at each threshold
• Better still if it is interactive and dynamic (perhaps too ambitious?)
• In the end, user discretion determines the right threshold
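A minimal sketch of the graph-and-statistics side of such a helper, assuming a gene-expression matrix: sweep candidate thresholds, build the correlation graph at each one, and report simple features for the user to inspect. The matrix `expr` and the threshold grid are hypothetical.

```python
# Minimal sketch of a threshold-suggestion helper: sweep candidate correlation
# thresholds, build the graph at each one, and report simple features that a
# user (or a scoring rule) can inspect. `expr` is a toy expression matrix.
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
expr = rng.normal(size=(50, 20))              # 50 genes, 20 conditions (toy)
corr = np.corrcoef(expr)                      # gene-gene Pearson correlations

for t in (0.5, 0.6, 0.7, 0.8, 0.85, 0.9):
    # Edge whenever |correlation| clears the threshold (ignore self-correlation).
    mask = (np.abs(corr) >= t) & ~np.eye(len(corr), dtype=bool)
    g = nx.from_numpy_array(mask.astype(int))
    max_deg = max(dict(g.degree()).values()) if g.number_of_edges() else 0
    print(f"threshold {t:.2f}: {g.number_of_edges()} edges, max degree {max_deg}")
```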
Comparison of clustering algorithms
– Suman Duvvuru
What is clustering
• Clustering:
– Partitioning into dissimilar groups of similar objects (in our case, the objects are genes).
– Cluster analysis is used to identify genes that show
similar expression patterns over a wide range of
experimental conditions.
• Traditional definition of a “good” clustering:
– Points assigned to same cluster should be highly
similar.
– Points assigned to different clusters should be highly
dissimilar.
Overview of clustering algorithms
• K-cores (Implemented):
– A k-core of a graph is a maximal subgraph with minimum degree at least k
– The k-cores of a graph can be generated by
• repeatedly deleting vertices whose degree is less than k, and
• performing a DFS on the resulting graph to find all the cores (a peeling sketch follows below)
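A short peeling sketch of the k-core computation, using networkx for the graph bookkeeping; nx.k_core provides the same functionality directly. The example graph is hypothetical.

```python
# Sketch of the k-core peeling step: repeatedly delete vertices of degree < k
# until none remain, then take connected components of what is left.
# (networkx provides nx.k_core / nx.connected_components for the same job.)
import networkx as nx

def k_cores(g, k):
    h = g.copy()
    while True:
        low = [v for v, d in h.degree() if d < k]
        if not low:
            break
        h.remove_nodes_from(low)          # peeling may drop other degrees below k
    return [h.subgraph(c).copy() for c in nx.connected_components(h)]

g = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4)])          # triangle plus pendant edge
print([sorted(core.nodes) for core in k_cores(g, 2)])   # [[1, 2, 3]]
```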
• HCS (Highly Connected Subgraphs):
– The edge connectivity (or simply the connectivity) k(G) of a graph G is the minimum number k of edges whose removal results in a disconnected graph.
– A minimum cut, abbreviated mincut, is a cut with a minimum number of edges.
– A graph G with n vertices is called highly connected if k(G) > n/2.
– A highly connected subgraph (HCS) is an induced subgraph H such that H is highly connected.
– This algorithm identifies highly connected subgraphs as clusters.
HCS Algorithm
• Uses Dinic's algorithm to compute the mincut; the complexity of this computation is O(nm^(2/3)) (a high-level sketch of the recursion follows below).
• The required edge density is about half that of our clique method.
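A high-level sketch of the HCS recursion, using networkx min-cut routines in place of a hand-written Dinic's implementation; the minimum cluster size and the toy graph are assumptions made for the example.

```python
# Hedged sketch of the HCS recursion: if a subgraph is highly connected
# (edge connectivity > n/2) report it as a cluster, otherwise split it along
# a minimum cut and recurse on each side.
import networkx as nx

def hcs_clusters(g, min_size=3):
    if g.number_of_nodes() < min_size:
        return []
    if nx.edge_connectivity(g) > g.number_of_nodes() / 2:
        return [set(g.nodes)]                     # highly connected: one cluster
    cut = nx.minimum_edge_cut(g)                  # split along a mincut
    h = g.copy()
    h.remove_edges_from(cut)
    clusters = []
    for comp in nx.connected_components(h):
        clusters += hcs_clusters(g.subgraph(comp).copy(), min_size)
    return clusters

g = nx.complete_graph(4)
g.add_edge(3, 4)                                  # K4 with a pendant vertex
print(hcs_clusters(g))                            # [{0, 1, 2, 3}]
```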
HCS: An example
Other clustering methods
Using Cluster 3.0 software:
• K-means
• Hierarchical clustering
Disadvantages:
• None of these methods allows a single gene to be present in multiple clusters (see the sketch below).
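A brief sketch of the two methods using scikit-learn in place of Cluster 3.0; the toy expression matrix and the choice of five clusters are arbitrary. It also shows the limitation noted above: each gene receives exactly one label.

```python
# Sketch of K-means and hierarchical clustering via scikit-learn (standing in
# for Cluster 3.0). Each gene is assigned to exactly one cluster.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(2)
expr = rng.normal(size=(100, 12))                 # toy gene-expression matrix

kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(expr)
hier_labels = AgglomerativeClustering(n_clusters=5).fit_predict(expr)

print(np.bincount(kmeans_labels))                 # cluster sizes, one label per gene
print(np.bincount(hier_labels))
```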
Assessing quality
• Different measures of the quality of a clustering solution are applicable in different situations.
• The appropriate measure depends on the data and on the availability of the true solution.
• When the true solution is known and we wish to compare another solution to it, we can use the Minkowski measure or the Jaccard coefficient (sketched below).
• When the true solution is not known, edge density, homogeneity and separation, and average silhouette are used as evaluation criteria.
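A small sketch of the Jaccard coefficient between two clusterings, counted over gene pairs; the label vectors are hypothetical.

```python
# Jaccard coefficient between two clusterings, counted over gene pairs:
# |pairs clustered together in both| / |pairs clustered together in either|.
from itertools import combinations

def co_clustered_pairs(labels):
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def jaccard(labels_a, labels_b):
    a, b = co_clustered_pairs(labels_a), co_clustered_pairs(labels_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

true_solution = [0, 0, 0, 1, 1, 2]
candidate = [0, 0, 1, 1, 1, 2]
print(round(jaccard(true_solution, candidate), 3))   # 0.333
```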