Transcript PPT

Parallel Detection of Regulatory
Elements with gMP
Bertil Schmidt, Lin Feng, Amey Laud, Yusdi Santoso
Damayanti Gupta
CMSC 838 Presentation
Motivation

Fundamental question






How are expression levels of thousands of genes regulated ?
Very important
Understanding of gene function
 Response to environment
Understand genetic causes of diseases
 Evaluate effects of drus
Detect mutations
Remember


Sets of genes -> Pathways -> Genetic Networks
Gene regulation
 Control decisions turn genes on/off
 Gene Regulation Network
CMSC 838T – Presentation
Talk Overview

Overview of talk

Motivation

Technique

Experiment

Related work

Conclusions
CMSC 838T – Presentation
Technique

Motifs upstream of genes regulate gene expression



Motifs are sites of regulatory activity
Identify regulatory motifs by combining

Gene expression data

Detect common motifs occuring upstream of genes
Huge datasets

Utilise parallel computing
CMSC 838T – Presentation
Technique

gRNA
 Java

development framework
gMP
 Java
communication library

REDUCE

Algorithm to identify regulatory motifs
REDUCE parallelised with gMP



Increase computing power
Get motifs ranked in statistical significance
CMSC 838T – Presentation
gRNA framework

Consists of APIs
CMSC 838T – Presentation
gRNA - APIs

Interact with data sources

Provide functionality from biology

Pipelines tasks into unified process

Repository of resources

Distributed programming
CMSC 838T – Presentation
gRNA environment

gRNA Grid


Clustered computing environment
Application written for gRNA

Multiple-tier application

Applications operate from client computer

Communicates with cluster through single computer

Hosts EJB server
Server identifies processing nodes


each of these perform tasks
CMSC 838T – Presentation
gRNA Grid
CMSC 838T – Presentation
gMP

Java based message passing tool

Built on top of sockets

Manages virtual processors to run on available
machines

Scalable

Machines added/removed easily
CMSC 838T – Presentation
gMP

Processes are grouped

Communication primitives provided for sending and
receiving data

Collective communication to several nodes enabled
modularly and efficiently

Enables functions to be implemented on data
CMSC 838T – Presentation
REDUCE algorithm

Based on model

Upstream motifs contribute additively to expression level of
each gene

Quantify the extent to which these motifs contribute to
expression data
Fit log of expression ratio to sum of activating and
inhibitory terms
Find stastically most significant motifs



Plots of fitting parameters suggest biological function
CMSC 838T – Presentation
REDUCE algorithm

Terms

Occurence vector

Measure of how often a motif is found
Expression vector


Measure of gene expression
CMSC 838T – Presentation
REDUCE method
Consists of
1) Motif frequency counter

counts occurrences of DNA motifs upstream of each ORF

motifs are about 7~11 nucleotides in length

get occurence vectors
CMSC 838T – Presentation
REDUCE algorithm
2) Significant motif finder

Use

i) Normalised occurrence vector made for each motif nμ
ii) Normalised vector of logs of gene expression ratio vectorsa
Take dot product of these (a . nμ) ,and square.




Can be considered as frequency of occurence X expressive
power of regulatory motif
It is squared to get rid of negatives
Correlate gene expression with occurence of motif
Largest dot product is most significant motif
CMSC 838T – Presentation
....

a is modified to remove effect of this motif

residual gene expression vector
Process repeated until motifs are ranked

CMSC 838T – Presentation
Table: Finding significant motifs

Uses a - (.5816,.2522,.2886,-.5947, -.1595, -.3683)
CMSC 838T – Presentation
REDUCE parallelised with gMP...


Parallel motif frequency counter

Split set of ORFs equally

Distribute across available nodes

Each node calculates in parallel to get occurence vectors
Matrix transposition

Occurence vectors scattered across nodes

Advantageous to store each vector in single node



Transpose motif frequency matrix
For each ORF can only calculate fraction of occurence
frequencies for all motifs
But the entire occurence frequency is needed
CMSC 838T – Presentation
...

Parallel significant motif finder

Normalises occurence vector within each node

At each node, most significant motif calculated

Global most significant motif calculated

Process iterated to rank occurence vectors

Interface in gRNA allows ease of implementation
CMSC 838T – Presentation
Experiment

Use Compaq Alpha system

Consists of cluster of 8 AlphaServer SC/ES45

Connected by high-speed Alpha SC 16-Port switch and
ELAN PCI adapter cards.

Each server contains 4 Alpha EV68 processors
CMSC 838T – Presentation
Results

Use 7090 gene expressions of yeast

ORFs of length 600

Motifs upto length 7

Throughput (in MBytes/s) also shown

20 most significant motifs computed.
CMSC 838T – Presentation
Analysis

Runtime scales well with number of processing nodes

Frequency counter scales perfectly

Motif finder also scales

Cannot achieve perfect scaling because of communication
overhead.
CMSC 838T – Presentation
Related work

DiscoveryLink


Provides configurable wrappers as interfaces to multiple data
sources
Kleisli system



Systematically manages and integrates external databases
Uses functional query language to perform correlation
across databases
Toolkits designed with functionality for specialised
areas

BioJava, BioPerl, PAL


Sequence Analysis
Ensembl initiative, DAS

provide extensible approach to issue of annotating genomic
data
CMSC 838T – Presentation
Related work
Previous approaches using Java for high performance
computing

Bindings into native message-passing APIs(e.g.MPI)


Does not allow easy integration into larger Java
applications
Pure Java message passing interfaces

JMPI, CCJ



Both implemented on top of Java RMI
– Slower than using raw sockets
CCJ tries to overcome
– optimised RMI implementation
– not portable
Both cannot handle integration
CMSC 838T – Presentation
Comparison
According to authors ...

gRNA distinguishes itself

Uses whole range of requirements for applications in
computational biology

Provides decoupled, yet inter-related subsystems

Ease of 3rd party implementation
CMSC 838T – Presentation
Observations

REDUCE surpasses traditional clustering approach

REDUCE algorithm has high runtime


Complexity depends on product of number possible motifs and
that of genes.

Grows exponentially with length of sequences

So length of motif is restricted
REDUCE algorithm is greedy


suboptimal
REDUCE is simplistic

lacks parameters for interactions between motifs

does not consider impact of other biological knowledge
CMSC 838T – Presentation
...

Not clear that results of REDUCE are biologically
significant

Experiment does not effectively show how higher
computation power helps results

Only analysis from 9 to 16 processors, is this sufficient to
determine ‘good scaling’?
CMSC 838T – Presentation
Conclusions
Finally...

gRNA demonstrates efficient mechanism for
development of genome-centric applications
Further...

Extensions to REDUCE have been proposed

require higher computing power

more specialised programming interfaces required

Identifying communication patterns

Use of data structures e.g. sequences, trees, matrices
CMSC 838T – Presentation