Transcript PPT
Parallel Detection of Regulatory
Elements with gMP
Bertil Schmidt, Lin Feng, Amey Laud, Yusdi Santoso
Damayanti Gupta
CMSC 838 Presentation
Motivation
Fundamental question
How are expression levels of thousands of genes regulated ?
Very important
Understanding of gene function
Response to environment
Understand genetic causes of diseases
Evaluate effects of drus
Detect mutations
Remember
Sets of genes -> Pathways -> Genetic Networks
Gene regulation
Control decisions turn genes on/off
Gene Regulation Network
CMSC 838T – Presentation
Talk Overview
Overview of talk
Motivation
Technique
Experiment
Related work
Conclusions
CMSC 838T – Presentation
Technique
Motifs upstream of genes regulate gene expression
Motifs are sites of regulatory activity
Identify regulatory motifs by combining
Gene expression data
Detect common motifs occuring upstream of genes
Huge datasets
Utilise parallel computing
CMSC 838T – Presentation
Technique
gRNA
Java
development framework
gMP
Java
communication library
REDUCE
Algorithm to identify regulatory motifs
REDUCE parallelised with gMP
Increase computing power
Get motifs ranked in statistical significance
CMSC 838T – Presentation
gRNA framework
Consists of APIs
CMSC 838T – Presentation
gRNA - APIs
Interact with data sources
Provide functionality from biology
Pipelines tasks into unified process
Repository of resources
Distributed programming
CMSC 838T – Presentation
gRNA environment
gRNA Grid
Clustered computing environment
Application written for gRNA
Multiple-tier application
Applications operate from client computer
Communicates with cluster through single computer
Hosts EJB server
Server identifies processing nodes
each of these perform tasks
CMSC 838T – Presentation
gRNA Grid
CMSC 838T – Presentation
gMP
Java based message passing tool
Built on top of sockets
Manages virtual processors to run on available
machines
Scalable
Machines added/removed easily
CMSC 838T – Presentation
gMP
Processes are grouped
Communication primitives provided for sending and
receiving data
Collective communication to several nodes enabled
modularly and efficiently
Enables functions to be implemented on data
CMSC 838T – Presentation
REDUCE algorithm
Based on model
Upstream motifs contribute additively to expression level of
each gene
Quantify the extent to which these motifs contribute to
expression data
Fit log of expression ratio to sum of activating and
inhibitory terms
Find stastically most significant motifs
Plots of fitting parameters suggest biological function
CMSC 838T – Presentation
REDUCE algorithm
Terms
Occurence vector
Measure of how often a motif is found
Expression vector
Measure of gene expression
CMSC 838T – Presentation
REDUCE method
Consists of
1) Motif frequency counter
counts occurrences of DNA motifs upstream of each ORF
motifs are about 7~11 nucleotides in length
get occurence vectors
CMSC 838T – Presentation
REDUCE algorithm
2) Significant motif finder
Use
i) Normalised occurrence vector made for each motif nμ
ii) Normalised vector of logs of gene expression ratio vectorsa
Take dot product of these (a . nμ) ,and square.
Can be considered as frequency of occurence X expressive
power of regulatory motif
It is squared to get rid of negatives
Correlate gene expression with occurence of motif
Largest dot product is most significant motif
CMSC 838T – Presentation
....
a is modified to remove effect of this motif
residual gene expression vector
Process repeated until motifs are ranked
CMSC 838T – Presentation
Table: Finding significant motifs
Uses a - (.5816,.2522,.2886,-.5947, -.1595, -.3683)
CMSC 838T – Presentation
REDUCE parallelised with gMP...
Parallel motif frequency counter
Split set of ORFs equally
Distribute across available nodes
Each node calculates in parallel to get occurence vectors
Matrix transposition
Occurence vectors scattered across nodes
Advantageous to store each vector in single node
Transpose motif frequency matrix
For each ORF can only calculate fraction of occurence
frequencies for all motifs
But the entire occurence frequency is needed
CMSC 838T – Presentation
...
Parallel significant motif finder
Normalises occurence vector within each node
At each node, most significant motif calculated
Global most significant motif calculated
Process iterated to rank occurence vectors
Interface in gRNA allows ease of implementation
CMSC 838T – Presentation
Experiment
Use Compaq Alpha system
Consists of cluster of 8 AlphaServer SC/ES45
Connected by high-speed Alpha SC 16-Port switch and
ELAN PCI adapter cards.
Each server contains 4 Alpha EV68 processors
CMSC 838T – Presentation
Results
Use 7090 gene expressions of yeast
ORFs of length 600
Motifs upto length 7
Throughput (in MBytes/s) also shown
20 most significant motifs computed.
CMSC 838T – Presentation
Analysis
Runtime scales well with number of processing nodes
Frequency counter scales perfectly
Motif finder also scales
Cannot achieve perfect scaling because of communication
overhead.
CMSC 838T – Presentation
Related work
DiscoveryLink
Provides configurable wrappers as interfaces to multiple data
sources
Kleisli system
Systematically manages and integrates external databases
Uses functional query language to perform correlation
across databases
Toolkits designed with functionality for specialised
areas
BioJava, BioPerl, PAL
Sequence Analysis
Ensembl initiative, DAS
provide extensible approach to issue of annotating genomic
data
CMSC 838T – Presentation
Related work
Previous approaches using Java for high performance
computing
Bindings into native message-passing APIs(e.g.MPI)
Does not allow easy integration into larger Java
applications
Pure Java message passing interfaces
JMPI, CCJ
Both implemented on top of Java RMI
– Slower than using raw sockets
CCJ tries to overcome
– optimised RMI implementation
– not portable
Both cannot handle integration
CMSC 838T – Presentation
Comparison
According to authors ...
gRNA distinguishes itself
Uses whole range of requirements for applications in
computational biology
Provides decoupled, yet inter-related subsystems
Ease of 3rd party implementation
CMSC 838T – Presentation
Observations
REDUCE surpasses traditional clustering approach
REDUCE algorithm has high runtime
Complexity depends on product of number possible motifs and
that of genes.
Grows exponentially with length of sequences
So length of motif is restricted
REDUCE algorithm is greedy
suboptimal
REDUCE is simplistic
lacks parameters for interactions between motifs
does not consider impact of other biological knowledge
CMSC 838T – Presentation
...
Not clear that results of REDUCE are biologically
significant
Experiment does not effectively show how higher
computation power helps results
Only analysis from 9 to 16 processors, is this sufficient to
determine ‘good scaling’?
CMSC 838T – Presentation
Conclusions
Finally...
gRNA demonstrates efficient mechanism for
development of genome-centric applications
Further...
Extensions to REDUCE have been proposed
require higher computing power
more specialised programming interfaces required
Identifying communication patterns
Use of data structures e.g. sequences, trees, matrices
CMSC 838T – Presentation