Identify regulatory modules from gene expression data
Xu Ling
02/09/2005
Introduction
Much of a cell’s activity is organized as a network of
interacting modules: sets of genes coregulated to respond
to different conditions. Identifying this organization is
crucial for understanding cellular responses to internal and
external signals.
Genome-wide expression profiles (e.g., from DNA microarrays)
provide important information about regulatory
mechanisms.
With complete genome sequences now available,
identifying cis-regulatory elements genome-wide with
bioinformatics approaches has emerged as a
promising solution.
Tasks
What are the underlying mechanisms by
which genes are regulated?
Modules of coregulated genes?
Regulators (transcription factors)?
Regulation conditions (TFBSs/motifs,
positional and combinatorial constraints)?
General scheme (1)
Clustering-based approaches for finding motifs from
gene expression and sequence data
General scheme (2)
Sequence-based (or knowledge-based) approaches for finding
motifs from gene expression and sequence data
General scheme (3)
Comparative genomics has also been applied
to identify eukaryotic regulatory elements
(e.g., human-mouse comparisons), because functional
noncoding sequences may be conserved
across species by evolutionary constraints.
Finding a good pair of species to compare
and choosing a good sequence conservation
threshold are critical, but this information is
not available for most species.
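To make the role of the conservation threshold concrete, here is a minimal sketch that scans a pairwise alignment with a sliding window and reports the windows whose percent identity reaches a threshold. The function name conserved_windows, the window size, the thresholds, and the toy sequences are all assumptions chosen for illustration; real comparative-genomics pipelines use proper alignment tools and scoring.

```python
def conserved_windows(aligned_a, aligned_b, window=10, threshold=0.8):
    """Return (start, identity) for alignment windows at or above the threshold.

    Both inputs are rows of one pairwise alignment (equal length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(len(aligned_a) - window + 1):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        # A column counts as conserved only if both bases agree and neither is a gap.
        matches = sum(a == b and a != "-" for a, b in pairs)
        identity = matches / window
        if identity >= threshold:
            hits.append((start, identity))
    return hits


# Toy alignment (hypothetical): the first 10 columns agree, the rest do not.
human = "ACGTACGTAC" + "T" * 10
mouse = "ACGTACGTAC" + "G" * 10
print(conserved_windows(human, mouse, threshold=0.8))
print(conserved_windows(human, mouse, threshold=0.95))
```

The two calls differ only in the threshold, yet they report different sets of windows, which is why the choice of threshold matters.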
Related work
Predicting gene expression from sequence
Michael A. Beer and Saeed Tavazoie
Cell, 2004, 117: 185-198
A successful application of existing computational
approaches in studying the yeast transcriptional
regulation network
Approach
Clustering (k-means) – modules of
coregulated genes
Motif Finding (AlignACE) – putative
regulatory elements (TFBSs)
Bayesian network learning – regulation
conditions (motifs, positional and
combinatorial constraints)
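As an illustration of the clustering step, the sketch below is a minimal k-means in plain Python. It is not the pipeline used in the cited paper: the function name kmeans, the toy expression profiles, and the use of squared Euclidean distance are assumptions made for this example.

```python
import random


def kmeans(profiles, k, iterations=50, seed=0):
    """Cluster expression profiles (lists of floats) into k groups.

    Returns one cluster label per gene.
    """
    rng = random.Random(seed)
    # Start from k randomly chosen profiles as centroids.
    centroids = [list(p) for p in rng.sample(profiles, k)]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


# Toy data (hypothetical): two induced genes and two repressed genes
# measured under three conditions.
profiles = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [-1.0, -2.0, -3.0],
    [-0.9, -2.1, -3.1],
]
print(kmeans(profiles, k=2))
```

Genes that receive the same label form one candidate module; their upstream sequences would then be passed to the motif finder.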
Bayesian Network
Sequence features (x1,…,xn) → expression patterns (ei)
Sequence feature (xi): presence of motifs, positional constraints,
and combinatorial constraints
Expression pattern (ei): a binary variable in a one-layer network
Maximizing P(ei|x1,…,xn), the probability that genes with these
sequence features will participate in expression pattern i
Properties
Easy to integrate all kinds of sequence features
Explicit sequence features
To keep complex networks from overfitting the training data,
a parameter penalizing dense networks is used.
The “optimal” network is learned greedily.
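The greedy, penalized search can be sketched as follows. This is an illustration under stated assumptions, not the scoring used in the cited paper: the score here is the log-likelihood of the binary expression label under a conditional probability table, minus a penalty per table row, and the names log_likelihood, score, greedy_parents, and penalty are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def log_likelihood(features, labels, parents):
    """Log-likelihood of binary labels given the selected parent features."""
    counts = defaultdict(lambda: [0, 0])  # parent configuration -> [n(e=0), n(e=1)]
    for x, e in zip(features, labels):
        counts[tuple(x[j] for j in parents)][e] += 1
    return sum(n * math.log(n / (n0 + n1))
               for n0, n1 in counts.values() for n in (n0, n1) if n)


def score(features, labels, parents, penalty):
    # Assumed penalty: one unit per row of the conditional probability table,
    # so denser networks (more parents) pay exponentially more.
    return log_likelihood(features, labels, parents) - penalty * 2 ** len(parents)


def greedy_parents(features, labels, penalty=1.0):
    """Add one sequence feature at a time while the penalized score improves."""
    parents = []
    best = score(features, labels, parents, penalty)
    candidates = list(range(len(features[0])))
    while candidates:
        scored = [(score(features, labels, parents + [j], penalty), j)
                  for j in candidates]
        top_score, top = max(scored, key=lambda t: t[0])
        if top_score <= best:
            break  # no remaining feature is worth its penalty
        parents.append(top)
        candidates.remove(top)
        best = top_score
    return parents


# Toy data (hypothetical): three binary features per gene; a gene follows the
# expression pattern only when features 0 AND 1 are both present.
features = [list(bits) for bits in product([0, 1], repeat=3)] * 4
labels = [x[0] & x[1] for x in features]
print(greedy_parents(features, labels))
```

In the toy data the third feature carries no information, so adding it never repays the penalty and the search stops after the two informative features.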
Motif finding approaches
Explicit statistical modeling based
  Expectation maximization – MEME, …
  Gibbs sampling – AlignACE, Gibbs Motif Sampler, …
  Others – CONSENSUS, …
Word enumeration based – MDscan, …
MEME
Each sequence is broken up into all of its overlapping
subsequences of length W.
Two-component finite mixture model: “Motif” (a set of similar
subsequences of fixed width) & “Background” (all other
positions in the sequences)
Motif model: each example of the motif is assumed to be
generated by a sequence of independent, multinomial random
variables.
Background model: each position (which is not part of a motif)
is generated independently by a multinomial random variable.
Maximize the likelihood of the model M given the data D,
L(M|D) = p(D|M), using the EM algorithm.
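The two-component mixture and its EM fit can be sketched in a few lines. This is a simplified illustration, not MEME itself: overlapping subsequences are treated as independent, the motif model is initialised from a user-supplied seed word, and the names em_motif, seed_word, and pseudo as well as the toy sequences are assumptions made for the example.

```python
def em_motif(seqs, seed_word, iterations=30, pseudo=0.1):
    """Fit a motif/background mixture over all length-W words by EM.

    Assumes uppercase DNA sequences over the alphabet ACGT.
    """
    bases = "ACGT"
    w = len(seed_word)
    # All overlapping length-W subsequences of every input sequence.
    words = [s[i:i + w] for s in seqs for i in range(len(s) - w + 1)]
    # Motif model: one multinomial per column, initialised from the seed word.
    motif = [{b: (0.85 if b == c else 0.05) for b in bases} for c in seed_word]
    # Background model: a single multinomial shared by all positions.
    background = {b: 0.25 for b in bases}
    lam = 0.1  # mixing weight: prior probability that a word is a motif site
    for _ in range(iterations):
        # E-step: posterior probability that each word came from the motif.
        z = []
        for word in words:
            p_motif, p_bg = lam, 1.0 - lam
            for col, b in zip(motif, word):
                p_motif *= col[b]
                p_bg *= background[b]
            z.append(p_motif / (p_motif + p_bg))
        # M-step: re-estimate both models from posterior-weighted counts.
        m_counts = [{b: pseudo for b in bases} for _ in range(w)]
        b_counts = {b: pseudo for b in bases}
        for word, zi in zip(words, z):
            for j, b in enumerate(word):
                m_counts[j][b] += zi
                b_counts[b] += 1.0 - zi
        motif = [{b: col[b] / sum(col.values()) for b in bases} for col in m_counts]
        total = sum(b_counts.values())
        background = {b: n / total for b, n in b_counts.items()}
        lam = sum(z) / len(z)
    consensus = "".join(max(col, key=col.get) for col in motif)
    return consensus, motif, lam


# Toy promoters (hypothetical), each containing one copy of TGACTC.
seqs = [
    "GGATCTGACTCAAGCT",
    "CCATGAATGACTCGTA",
    "TGACTCGGCCAATTGG",
    "ACGTTGCTGACTCAGG",
    "AAGGCCTTTGACTCCA",
]
consensus, motif, lam = em_motif(seqs, seed_word="TGACTC")
print(consensus, round(lam, 3))
```

Because EM only climbs to a nearby local maximum, the result depends on the starting point; trying several seed words and keeping the best-scoring model is one common way to handle this.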
Gibbs motif sampler
Works with one specific alignment at a time rather than a weighted
average over alignments, as EM does.
Iteratively samples a motif model (or possibly the background model)
for each subsequence, thereby partitioning motif-encoding
regions into different motifs.
Iterative heuristic method that combines gradient search
steps with random jumps in the search space; it is not
guaranteed to reach the optimum, but it is less prone to getting
stuck at local maxima than EM.
Identifies the most probable motif models by locating the
optimum alignments, which maximize the ratios of the
corresponding target probabilities to the background
probabilities (the maximum a posteriori, or MAP, score).
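The sampling idea can be sketched with a simpler relative of the Gibbs Motif Sampler: a site sampler that assumes exactly one motif occurrence per sequence and a single motif. The function name gibbs_site_sampler, the pseudocount, the number of sweeps, and the toy sequences are assumptions, and the log-ratio score below is only a stand-in for the MAP score.

```python
import math
import random


def gibbs_site_sampler(seqs, w, sweeps=200, pseudo=0.5, seed=0):
    """Sample one motif start per sequence; return the best alignment seen.

    Assumes uppercase DNA sequences over ACGT, each at least w long.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    joined = "".join(seqs)
    # Background: overall base frequencies, smoothed with pseudocounts.
    bg = {b: (joined.count(b) + pseudo) / (len(joined) + 4 * pseudo) for b in bases}

    def profile(positions, skip=None):
        """Column-wise base frequencies of the aligned sites, omitting sequence `skip`."""
        cols = [{b: pseudo for b in bases} for _ in range(w)]
        for i, (s, p) in enumerate(zip(seqs, positions)):
            if i != skip:
                for j in range(w):
                    cols[j][s[p + j]] += 1
        return [{b: c[b] / sum(c.values()) for b in bases} for c in cols]

    def score(positions):
        """Sum of log motif/background probability ratios over the aligned sites."""
        prof = profile(positions)
        return sum(math.log(prof[j][s[p + j]] / bg[s[p + j]])
                   for s, p in zip(seqs, positions) for j in range(w))

    positions = [rng.randrange(len(s) - w + 1) for s in seqs]
    best, best_score = list(positions), score(positions)
    for _ in range(sweeps):
        for i, s in enumerate(seqs):
            # Build the motif model from every sequence except the current one.
            prof = profile(positions, skip=i)
            # Weight each candidate start by its motif/background likelihood ratio.
            weights = [math.prod(prof[j][s[p + j]] / bg[s[p + j]] for j in range(w))
                       for p in range(len(s) - w + 1)]
            # Random draw instead of argmax: this is what lets the sampler
            # leave a poor alignment.
            positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
        current = score(positions)
        if current > best_score:
            best, best_score = list(positions), current
    return best, best_score


# Toy promoters (hypothetical), each containing one copy of TGACTC.
seqs = [
    "GGATCTGACTCAAGCT",
    "CCATGAATGACTCGTA",
    "TGACTCGGCCAATTGG",
    "ACGTTGCTGACTCAGG",
    "AAGGCCTTTGACTCCA",
]
starts, value = gibbs_site_sampler(seqs, w=6)
print([s[p:p + 6] for s, p in zip(seqs, starts)], round(value, 2))
```

The run is stochastic, so different seeds can return different alignments; in practice the sampler is restarted several times and the highest-scoring alignment is kept.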
Future work
An ab initio motif-finding approach using
gene expression and sequence data,
built on a new heuristic or statistical
model.
Integrating prior knowledge (e.g., GO)
to facilitate identification of regulatory
elements and transcriptional networks.