Transcript Slide 1

Capstone Presentation
Motif Discovery from Large Number of Sequences:
A Case Study with Disease Resistance Genes
in Arabidopsis thaliana
by Irfan Gunduz
I.U. School of Informatics
04/25/04
INTRODUCTION
Motifs
• Highly conserved regions across a subset of proteins
that share the same function
>Seq A  YNEDSKH
>Seq B  YDDDSNH
>Seq C  YDNDSNH
>Seq D  YENDSKH
• Motifs can be used to predict
  - A molecule’s function
  - A structural feature
  - Family membership
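As an illustration of what "conserved region" means for the four example sequences above, the sketch below collapses each alignment column into a PROSITE-style pattern element. It assumes the sequences are already aligned, equal in length, and gap-free; the function name column_pattern is hypothetical and is not part of any tool named in this talk.

```python
def column_pattern(aligned):
    """Build a PROSITE-style pattern from equal-length, gap-free aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must have equal length")
    parts = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        # A fully conserved column is written as a single letter,
        # a variable column as a bracketed set of the observed residues.
        if len(residues) == 1:
            parts.append(residues[0])
        else:
            parts.append("[" + "".join(residues) + "]")
    return "-".join(parts)


# The four example sequences from the slide.
seqs = ["YNEDSKH", "YDDDSNH", "YDNDSNH", "YENDSKH"]
pattern = column_pattern(seqs)
print(pattern)  # Y-[DEN]-[DEN]-D-S-[KN]-H
```

Four of the seven columns (Y, D, S, H) are fully conserved, which is the kind of signal a motif finder looks for.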
INTRODUCTION
Current motif-finding software:
• MEME
• PROSITE
• PRATT, etc.
Do they work with a large number of sequences?
• Pattern discovery relies on statistical or combinatorial techniques
that look for signals
• The signal-to-noise ratio degrades as the
number of sequences increases
What to do?
Objective
• Develop a computational procedure to find functional motifs
from a large number of sequences
COMPUTATIONAL PROCEDURE
Tools
• BLAST (sequence alignment tool)
• BAG (sequence clustering package)
• CLUSTAL W (multiple sequence alignment)
• HMMER II (HMM-based software)
• Block Maker (block/motif finder)
• LAMA (block comparison tool)
• Perl
COMPUTATIONAL PROCEDURE
1 - Collecting and Clustering Sequences
• Extract well-annotated sequences of interest from the genome of interest
• Run an all-against-all pairwise comparison using BLAST
• Estimate the best bit score cutoff for clustering
• Cluster sequences using BAG
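The clustering idea in step 1 can be sketched as a graph problem: sequences are nodes, and two sequences are linked when a pairwise hit reaches the chosen bit score cutoff. The sketch below uses plain connected components (union-find) as a stand-in; BAG's actual algorithm is not described on these slides and may differ. The function name, the (query, subject, bit score) tuple layout, and the toy scores are all assumptions for illustration.

```python
from collections import defaultdict


def cluster_by_bit_score(hits, min_bits):
    """Group sequence IDs linked by at least one hit scoring >= min_bits."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, bits in hits:
        root_q, root_s = find(query), find(subject)  # registers both IDs
        if bits >= min_bits and root_q != root_s:
            parent[root_q] = root_s

    groups = defaultdict(set)
    for seq_id in list(parent):
        groups[find(seq_id)].add(seq_id)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy hits: (query, subject, bit score). The values are made up.
hits = [
    ("s1", "s2", 250.0),
    ("s2", "s3", 180.0),
    ("s3", "s4", 40.0),   # below the cutoff, so s3 and s4 stay apart
    ("s4", "s5", 300.0),
]
clusters = cluster_by_bit_score(hits, min_bits=100.0)
print(clusters)  # [['s1', 's2', 's3'], ['s4', 's5']]
```

Raising min_bits splits clusters and lowering it merges them, which is why the cutoff has to be estimated before clustering.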
COMPUTATIONAL PROCEDURE
2 - ENRICHMENT
• Align the sequences in each cluster (multiple sequence alignment)
• Build a profile for each cluster with an HMM-based program
• Search the genome of interest with the new profile
and extract more sequences if available
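The enrichment step reads as an iterate-until-stable loop: build a profile, search, add any new hits, repeat. The sketch below shows only that control flow. The callables build_profile and search_genome are hypothetical placeholders for whatever wraps the alignment and HMM tools, and the toy versions used here match on shared feature labels rather than real profile HMMs.

```python
def enrich(seed_ids, build_profile, search_genome, max_rounds=10):
    """Grow a cluster until a profile search returns no new sequence IDs."""
    members = set(seed_ids)
    for _ in range(max_rounds):
        profile = build_profile(sorted(members))
        new_ids = set(search_genome(profile)) - members
        if not new_ids:
            break  # converged: the search found nothing new
        members |= new_ids
    return members


# Toy stand-ins (NOT HMMs): each "sequence" is a set of feature labels.
GENOME = {"g1": {"x"}, "g2": {"x", "y"}, "g3": {"y"}, "g4": {"z"}}


def toy_build_profile(ids):
    # The toy "profile" is the union of features over all cluster members.
    return set().union(*(GENOME[i] for i in ids))


def toy_search_genome(profile):
    # A toy "hit" is any sequence sharing at least one feature with the profile.
    return [i for i, feats in GENOME.items() if feats & profile]


enriched = enrich({"g1"}, toy_build_profile, toy_search_genome)
print(sorted(enriched))  # ['g1', 'g2', 'g3']
```

In the toy run, g3 is only reachable after g2 has joined the cluster, which is the reason for iterating rather than searching once.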
COMPUTATIONAL PROCEDURE
3 – REFINEMENT
• Refine clusters by regrouping
4 – MOTIF FINDING
• Submit the sequences in each cluster to Block Maker
• Compare blocks using LAMA
• Cluster blocks using BAG
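The slides do not spell out how clusters are regrouped in step 3. One plausible reading, offered here only as an assumption, is that clusters which picked up the same sequences during enrichment are merged. The sketch below implements that reading; merge_overlapping and its min_shared parameter are hypothetical names.

```python
def merge_overlapping(clusters, min_shared=1):
    """Repeatedly merge any two clusters sharing at least min_shared members."""
    merged = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if len(merged[i] & merged[j]) >= min_shared:
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the list has changed
    return merged


# Toy clusters: the first, second and fourth are chained by shared members.
toy_clusters = [{"a", "b"}, {"b", "c"}, {"d"}, {"c", "e"}]
refined = merge_overlapping(toy_clusters)
print(sorted(sorted(c) for c in refined))  # [['a', 'b', 'c', 'e'], ['d']]
```

A stricter min_shared keeps clusters apart unless they overlap substantially. Which threshold is appropriate is not stated in the source.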
A Case Study with Disease Resistance Genes
in Arabidopsis thaliana
Why Disease Resistance Genes?
Background, Disease Resistance Genes
Domain   Probable Function
TIR
CC
KIN
LRR      Recognition of specificity
NB       ATP and GTP binding
Case Study, Arabidopsis thaliana
• 116 sequences annotated as disease resistance proteins or disease resistance protein-like
were extracted from the Arabidopsis thaliana genome
• Clustered into 32 groups
• 20 to 640 sequences were added to each cluster after HMM iterations
• After refinement, four clusters were formed for further analysis
            # of Sequences
Cluster 1   96
Cluster 2   45
Cluster 3   641
Cluster 4   11
Case Study, Arabidopsis thaliana
PFAM Search
            Domains
Cluster 1   NB-ARC, TIR, Kin, LRR
Cluster 2   NB-ARC, Kin, LRR
Cluster 3   Ser/Thr Kin
Cluster 4   Kin
Case Study, Arabidopsis thaliana
Results, Block Maker
15218608  YDVFLSFRGVDTRQTIVSHL
15218618  YDVFLSFRGEDTRKNIVSHL
15220795  YDVFLSFRGEDTRKTIVSHL
Cluster1  Cluster2
Case Study, Arabidopsis thaliana
Results, LAMA and BAG
Clusters at the whole gene level: Cluster1, Cluster2
Clusters at the block level: Cluster1, Cluster2, Cluster3
Case Study, Arabidopsis thaliana
[Figure: clusters at the whole gene level (Cluster1, Cluster2) and clusters at the block level (Cluster1, Cluster2, Cluster3). Genes shown: RPS4, RPP1, RPP5, RPP8, RPM1. Blocks shown: TIR-I, TIR-II, Kin1a, NBS-A, Kin2, NBS-B, NBS-C, GLPL, LRR.]
Case Study, Arabidopsis thaliana
Number of Disease Resistance Gene Candidates on each Chromosome
[Figure: counts per chromosome (CHR-I, CHR-II, CHR-III, CHR-IV, CHR-V) for Cluster 1 and Cluster 2. Values shown: 16, 20, 2, 0, 6, 6, 16, 4, 35, 9.]
Case Study, Arabidopsis thaliana
New Disease Resistance Gene Candidates
Cluster 1
Cluster 2
GI 15236505
GI 15242136
GI 15233862
GI 15221277
GI 15221280
GI 15217940
GI 15221744
Case Study, Arabidopsis thaliana
To test the effectiveness of the computational procedure:
792 unique sequences were merged and submitted to
MEME and PRATT to detect functional motifs.
• Time: more than 9000 minutes on a 1.7 GHz Pentium IV
machine running Linux
• Result: no known disease resistance gene motifs
were detected
Case Study, Arabidopsis thaliana
CONCLUSIONS:
• A sensible combination of tools provides an effective mechanism
for motif detection
• Clustering helps improve the performance of other well-known tools
ACKNOWLEDGEMENT
Motif Discovery from Large Number of Sequences:
A Case Study with Disease Resistance Genes
in Arabidopsis thaliana
Irfan Gunduz, Sihui Zhao, Mehmet Dalkilic and Sun Kim
will be presented at
The 2003 International Conference on Mathematics and
Engineering Techniques in Medicine and Biological Sciences
Case Study, Arabidopsis thaliana
Disease Resistance Mechanism
COMPUTATIONAL PROCEDURE
Refinement
[Figure: refinement diagram showing clusters labeled A, B, C, D, followed by clusters labeled B, C, D.]