MOPAC: Motif-finding by Preprocessing and Agglomerative Clustering from Microarrays
Thomas R. Ioerger (1), Ganesh Rajagopalan (1), Debby Siegele (2)
(1) Department of Computer Science
(2) Department of Biology
Texas A&M University
Analyzing Gene Expression Patterns
• DNA microarrays
• ~4000 genes for E. coli, ~6000 genes for yeast
• Compare expression levels between conditions
• Example: starvation response in E. coli
– starve cells for nutrient sources
– reintroduce => recovery => exponential growth
– which genes show changes in response?
• types of response:
– up-regulation
– down-regulation
– transient response (spike)
– (arbitrary temporal patterns)
• Problem: can cluster genes based on response pattern, but then what?
– not all genes in a cluster are regulated the same way
• Couple with genomic analysis
– search for common motifs in upstream regions
– subsets of co-regulated genes within clusters
• Assumptions:
1. regulation occurs by interaction of transcription factors with small motifs (~10-20bp) within several hundred bp of the transcription start site
2. among many motifs, the ones of interest will be common to some genes in a cluster, but not found in any genes outside (with different responses)
3. the motif does not have to be shared by all genes in the cluster, only a subset
Related Work
• Many algorithms exist for motif finding
– assume cluster (gene set) is already defined
– word/string analysis models
– probabilistic models
• Gibbs sampling (AlignACE, MotifSampler)
• Expectation Maximization (MEME)
• HMMs
– graph algorithms (e.g. clique)
• Pevzner and Sze
– what if motif only appears in a subset of genes?
• count as parameter in MotifSampler, MEME
Overview of Our Approach
1. Definition of regulation patterns
2. Extraction of upstream sequences (for up-regulated genes)
3. Define control set (genes with no change)
4. Make a list of all 12-mers in upstream regions
5. Find motifs that occur (more than once) in the up-regulated set, but not at all in the control set
6. Group the motifs using clustering, form consensus of patterns
Define Regulation Patterns
• measured at 0, 5, and 15 min after recovery
• discrete representation of changes in expression levels
• relative to exponential growth phase conditions
+1: >2-fold increase
-1: >2-fold decrease
0: otherwise (no significant change)
• up-regulation patterns:
(0,1,1) (0,1,0) (0,0,1) (-1,1,1) (-1,1,0) (-1,0,1)
• define control set: (0,0,0) (1,1,1) (-1,-1,-1)
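The discretization above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes each gene comes with three expression ratios (sample divided by the exponential-growth reference), and the names `discretize` and `classify` are hypothetical. Only the 2-fold threshold and the pattern lists come from the slide.

```python
# Pattern lists copied from the slide; function names are illustrative.
UP_PATTERNS = {(0, 1, 1), (0, 1, 0), (0, 0, 1),
               (-1, 1, 1), (-1, 1, 0), (-1, 0, 1)}
CONTROL_PATTERNS = {(0, 0, 0), (1, 1, 1), (-1, -1, -1)}


def discretize(ratio, fold=2.0):
    """Map an expression ratio (assumed: sample / reference) to +1, -1 or 0."""
    if ratio > fold:
        return 1           # more than 2-fold increase
    if ratio < 1.0 / fold:
        return -1          # more than 2-fold decrease
    return 0               # no significant change


def classify(ratios):
    """Label a gene from its ratios at the 0, 5 and 15 min time points."""
    pattern = tuple(discretize(r) for r in ratios)
    if pattern in UP_PATTERNS:
        return "up"
    if pattern in CONTROL_PATTERNS:
        return "control"
    return "other"
```

For example, `classify((0.4, 2.5, 1.0))` maps to the pattern (-1, 1, 0) and is labeled "up", while a gene that stays within 2-fold at every time point falls into the control set.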
Extraction of Upstream Sequences
• nominally, 600bp upstream of translation start site (i.e. ORF; not transcription start)
• If gene is a member of an operon:
– take 300bp upstream of gene
– plus 300bp upstream of translation start of first gene in operon
• databases:
– K12 sequence: GOLD
– operon relationships: E. coli Linkage Map (Berlyn et al.)
• use reverse complement if transcribed in reverse direction
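A minimal sketch of the extraction rules above, under several assumptions that the slide does not state: the genome is a plain string with 0-based forward-strand coordinates, a gene is given as a hypothetical `(start, strand)` tuple where `start` is the index of the first base of its start codon, the circular chromosome is not wrapped at the ends, and the two 300bp pieces for an operon member are kept as separate segments rather than concatenated.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream(genome, start, strand, length):
    """Sequence of up to `length` bases upstream of a translation start.

    Assumes `start` is the 0-based forward-strand index of the first base
    of the start codon; regions are clipped at the ends of the string.
    """
    if strand == "+":
        return genome[max(0, start - length):start]
    # Reverse-strand gene: upstream lies to the right on the forward strand.
    return reverse_complement(genome[start + 1:start + 1 + length])


def upstream_regions(genome, gene, operon_first=None, full=600, half=300):
    """Return the upstream segment(s) for a gene given as (start, strand).

    `operon_first` is the first gene of the operon, if the gene is an
    operon member (hypothetical argument, also a (start, strand) tuple).
    """
    if operon_first is None or operon_first == gene:
        return [upstream(genome, *gene, full)]
    return [upstream(genome, *gene, half),
            upstream(genome, *operon_first, half)]
```

Keeping the two operon segments separate avoids creating artificial 12-mers that span the junction; whether the original method joined them is not stated on the slide.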
Pre-processing
• extract all 12-mers (overlapping) from
upstream regions of up-regulated genes
• note: better than DFS
• remove those that appear in the control set
• remove those that are dissimilar to everything else (“de-noising”)
– score = mean distance to all motifs not in same upstream region or operon
– remove if score > ~9/12 mismatches
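The pre-processing steps above might look roughly like the following sketch. It rests on assumptions not given on the slide: distance is taken to be Hamming distance (number of mismatches), "appear in the control set" is read as an exact 12-mer match on the given strand, and each up-regulated region carries a hypothetical `region_id` that is shared by genes in the same operon.

```python
K = 12  # motif length used on the slides


def kmers(seq, k=K):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def candidate_motifs(up_regions, control_regions, k=K, max_mean_dist=9.0):
    """Filter k-mers from up-regulated regions.

    up_regions: dict mapping region_id -> upstream sequence.
    control_regions: iterable of upstream sequences from the control set.
    Returns a list of (region_id, kmer) pairs that survive both filters.
    """
    # Filter 1: drop any k-mer that occurs (exactly) in the control set.
    control = {w for seq in control_regions for w in kmers(seq, k)}
    cands = [(rid, w)
             for rid, seq in up_regions.items()
             for w in sorted(set(kmers(seq, k)))
             if w not in control]

    # Filter 2 ("de-noising"): drop k-mers whose mean distance to the
    # k-mers of all *other* regions exceeds the threshold.
    kept = []
    for rid, w in cands:
        others = [v for r, v in cands if r != rid]
        if not others:
            continue
        score = sum(hamming(w, v) for v in others) / len(others)
        if score <= max_mean_dist:
            kept.append((rid, w))
    return kept
```

The de-noising loop is quadratic in the number of candidates, which is acceptable for a few dozen upstream regions of 600bp each but would need indexing for much larger inputs.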
Clustering
• compute similarity matrix among motifs
• repeatedly merge closest neighbors
– minimum spanning tree
– single-linkage clustering
• Stop merging when distance > 3/12 mismatches
• Form consensus: relax constraints on nucleotides at each position by disjunction
– ACCATGGTATC
– ACGATGGTATT
– ACTATAGTATC
– AC(CTG)AT(AG)GTAT(TC)
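The clustering and consensus steps can be illustrated as follows. Single-linkage clustering cut at a distance threshold yields the connected components of the graph that links every pair of motifs within that threshold, so this sketch uses a small union-find rather than building the merge tree explicitly. Hamming distance and the function names are assumptions for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(words, max_dist=3):
    """Group words so that any pair within max_dist ends up in one cluster."""
    parent = list(range(len(words)))

    def find(i):
        # Find the cluster representative, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if hamming(words[i], words[j]) <= max_dist:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, w in enumerate(words):
        clusters.setdefault(find(i), []).append(w)
    return list(clusters.values())


def consensus(words):
    """Per-position disjunction of the nucleotides seen in a cluster."""
    out = []
    for column in zip(*words):
        letters = "".join(sorted(set(column)))
        out.append(letters if len(letters) == 1 else f"({letters})")
    return "".join(out)
```

Applied to the three sequences on the slide, `consensus` returns `AC(CGT)AT(AG)GTAT(CT)`, which is the same pattern as the slide's consensus with the alternatives inside each group listed alphabetically.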
Experiments
• Starvation of E. coli for glucose in medium
• 3 time-points: starved (0 min), 5 min, 15 min
• Data collected in Siegele lab
• up-regulated: 22 genes
• control set: 1361 genes
Motifs Found
ID  Motif         Gene name
1   AAsAAwTTmAwA  CmtB, ygjR, cysD
2   CmwTTkTTyTTC  CysH, B3914, MetR
3   TTCTwHTgAwAT  B1587, MetF, FliY
4   wTVAACwThCAA  B1587, asnB, cysA,P,W
5   rAkTTTwTTCAT  B3914, MetR, MetF
6   CAArTwTTTwTr  CmtB, yhaV, cysD
7   ATwAATAATksw  B1587, yhaV, CmtB
8   ACsdTTTTTmTw  CmtB, asnB, b3914, ygjR
9   rAAwTTmATAAT  MetF, CmtB, ygjR
10  vwTTAATAATkC  CmtB, b1587, yhaV, MetF
11  ATwTTGAATTww  AsnB, metR, metF
12  yTTTkhGATATT  YfiA, cysD, fliY
13  AkTTTwTTCATy  B3914, metR, metF
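To scan a sequence for one of the motifs listed above, the degenerate positions have to be expanded. The sketch below assumes that the non-ACGT letters are standard IUPAC nucleotide ambiguity codes (for example s = C or G, w = A or T), which the slide does not state explicitly, and it ignores letter case. The helper names are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_to_regex(motif):
    """Compile a degenerate motif into a regular expression."""
    parts = []
    for ch in motif.upper():
        bases = IUPAC[ch]
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


def find_sites(motif, sequence):
    """Start positions (0-based) of all matches, including overlapping ones."""
    # A lookahead is used so that overlapping occurrences are all reported.
    pattern = re.compile(f"(?=({motif_to_regex(motif).pattern}))")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

Under that assumption, motif 1 expands to `AA[CG]AA[AT]TT[AC]A[AT]A`, and `find_sites("AAsAAwTTmAwA", "GGAACAATTTCATAGG")` reports a single site at position 2.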
Sequence Logos
Distance to Transcription Start
Other Forms of Validation
• Palindromicity: 11/13 motifs have index>0.5
• TRANSFAC database:
– e.g. motif 2 matches pattern for MetJ-MetF site
– a number of other hits for known transcription factors
• biological verification awaits...
– role in regulation pathway for starvation response?
Conclusions
• Augment cluster-analysis of expression patterns with motif analysis
• Efficient method for generating candidates
– from 12-mers in upstream regions
• Efficient method for screening them
– empirically, against a control set, rather than probabilistic background model
• Advantage: Pattern does not have to be in all the genes in a set
• Challenges: defining appropriate upstream regions and the right control set (as filter)