MotifML
A Novel Ontology-based XML Model for Data Exchange of Regulatory DNA Motif Profiles
Eric Neumann, Beyond Genomics
Tian Niu, Harvard University
Ken Baclawski, Northeastern University
DNA Motifs
Alignment Profile
       ========== = ============ ===   = ===== ===    =  =  =====     == =======
human  GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse  GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
       |                         |                         |                   |
       -70                       -45                       -20                 +1
Motifs: Functional Significance?
Motif Finding Tools
AlignACE
 GIBBS
 Consensus
 BioProspector

The Need for motifML
Information resides at multiple sources
 Data follow multiple Structures
 Multiple Interfaces

Integrated XML view
MotifML
BioProspector
Gibbs
AlignACE
Consensus
Motif Function


Gene expression regulation that is dependent on activated transcriptional factors
Key element of gene networks: complex analysis of microarrays
Transcriptional Factors + Cis-Elements Associated with a Gene -> Regulated Gene Expression
motifML Goals
To allow the full specification of all experimental information known about motifs
To provide an extensible framework for this annotation and a common vehicle for exchanging the motif information
To provide a single document interface to integrate all project information, complete with protocols for network data retrieval

motifML Design
 formal and concise, ontology based
 motifML documents easy to create
 clarity more important than brevity
 use both XML schema and XML DTD
motifML Semantics

Annotation
» The collection of features for a given set of sequence(s) that have built-in semantics

Features
» Characteristics supported by analytic evidence

Analyses
» Computational
» Experimental
motifML Semantics
Property
Annotation
Semantically
Definable &
Searchable
Pragmatic
Ontology
Objects
Analyses
Features
Motifs
Intentional
Extraction
Results
motifML Sequence Item
<seq id="demo_seq" name="Human HAL Gene Exon 18">
  <dbxref>
    <database>GenBank</database>
    <unique_id>14588658</unique_id>
  </dbxref>
  <feature>
    <motif type="cis-regulatory" name="CBE" id="dm312"/>
    <description>CRX Binding Element</description>
    <position start="21" end="32"/>
    <evidence>
      <reference paper="Davies, J Mol Biol. 1993 296:1205-14"/>
    </evidence>
  </feature>
  <residues type="dna">
ATAATGTCCAAGATCTTCTGGAGAGTGTATCCCATGCTGTGGAGCACTCTGTGGAAGCCACGG
GTCCTTTAGACAGCTCATCCTATGAGGAGCACTTCTTAACTGGCACTGGTCTCTTGCAGTTTCT
GAGAACAAGGCTCTGTGCCATCCCTCGTCTGTTGACTCCCTCTCCACCAGCGCAGCCACGGA
GGACCACGTCTCCATGGGAGGATGGGCAGCAAGGAAAGCCCTCAGGGTCATCGAGCATGTGG
AGCAAGGTAATGCTGATGAGTTCGGGGTGGCGGGCCTGCCTGATAGACCACTGTGCCTGTGG
TTCTCAAGTGGGATCTCCCACCAGCAACATCAGCATCACCTGGAAAC
  </residues>
</seq>
Computational Analysis
<!ELEMENT computational_analysis (date?, program, version?, parameter*,
                                  database?, result_set+)>
<!ATTLIST computational_analysis seq IDREF #REQUIRED>
<!ELEMENT program (#PCDATA)>
<!ELEMENT result_set (score?, output*, result*)>
<!ELEMENT result (score, type, subtype?, seq_relationship+, output*)>
<!ATTLIST result id ID #IMPLIED>
<!ELEMENT seq_relationship (location, alignment?)>
<!ATTLIST seq_relationship seq  IDREF #REQUIRED
                           type (query | subject | peer) #REQUIRED>
<!ELEMENT alignment (#PCDATA)>
<!ELEMENT type      (#PCDATA)>
<!ELEMENT value     (#PCDATA)>
<!ELEMENT parameter (type, value)>
<!ELEMENT output    (type, value)>
<!ELEMENT database  (name, date?, version?)>
<!ELEMENT version   (#PCDATA)>
<!ELEMENT score     (#PCDATA)>
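The content models in the DTD above can be checked with a few lines of Python. This is only a minimal sketch, not part of motifML: it rewrites each content model by hand as a regular expression over child tag names and tests the order of children in a parsed document. The helper name `check_children` and the sample parameter name `numcols` are assumptions made for illustration, and attribute rules such as IDREF are not checked; a validating XML parser would be the proper tool.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, hand-translated into regular expressions over
# a sequence of child tag names, each followed by one space.
CONTENT_MODELS = {
    "computational_analysis":
        r"(date )?(program )(version )?(parameter )*(database )?(result_set )+",
    "result_set": r"(score )?(output )*(result )*",
    "result": r"(score )(type )(subtype )?(seq_relationship )+(output )*",
    "seq_relationship": r"(location )(alignment )?",
    "parameter": r"(type )(value )",
    "output": r"(type )(value )",
    "database": r"(name )(date )?(version )?",
}


def check_children(element):
    """Return the tags of all elements whose child order breaks its model."""
    problems = []
    for node in element.iter():
        model = CONTENT_MODELS.get(node.tag)
        if model is None:
            continue  # #PCDATA leaf or element not declared on the slide
        children = "".join(child.tag + " " for child in node)
        if re.fullmatch(model, children) is None:
            problems.append(node.tag)
    return problems


# Hypothetical document: the parameter name "numcols" is illustrative only.
doc = ET.fromstring(
    "<computational_analysis seq='s1'>"
    "<program>AlignACE</program>"
    "<parameter><type>numcols</type><value>10</value></parameter>"
    "<result_set><score>794.004</score></result_set>"
    "</computational_analysis>"
)
print(check_children(doc))
```

An empty list means every checked element follows its content model; a document that places `result_set` before `program` would be reported under `computational_analysis`.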
HSP and HSE



Heat shock and other environmental and pathophysiologic stresses stimulate synthesis of heat shock proteins (Hsps). These proteins enable the cell to survive and recover from stressful conditions by as yet incompletely understood mechanisms.
A conserved 14 base pair regulatory sequence, referred to as the heat shock element (HSE), is found in multiple imperfect copies upstream of the TATA box of all heat shock genes.
Genes with an HSE in the upstream region may be coregulated.
Dataset (Vertebrates)*

> gid 3004462, start=1, end=1027
> gid 7861931, start=1, end=666
> gid 7108904, start=1, end=1519
> gid 7739662, start=1, end=800
> gid 64795, start=1, end=487
> gid 64791, start=1, end=614
> gid 64789, start=1, end=1128
> gid 64786, start=1, end=374
> gid 32480, start=1, end=483
> gid 32484, start=1, end=711
> gid 7669470, start=1, end=424
> gid 5729878, start=1, end=313
> gid 5031770, start=1, end=760

> gid 1816451, start=1, end=2179
> gid 184422, start=1, end=2634
 > gid 184416, start=1, end=488
 > gid 188491, start=1, end=959
 > gid 4691417, start=1, end=2631
 > gid 188489, start=1, end=485
 > gid 188487, start=1, end=489
 > gid 184416, start=1, end=488
 > gid 211940, start=1, end=391
 > gid 63508, start=1, end=1421
 > gid 63512, start=1, end=2300
 > gid 409185, start=1, end=1231
 > gid 163160, start=1, end=491
 > gid 414974, start=1, end=426
*Data are from GenBank

AlignACE program



Uses a Gibbs sampling strategy similar to that described by Neuwald et al., 1995
An iterative masking procedure is used to allow multiple distinct motifs to be found within a single data set
Reference: Hughes et al., J Mol Biol. 2000;296:1205-14
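The iterative masking idea mentioned above can be illustrated with a toy loop. This is a sketch under loud assumptions, not AlignACE itself: the Gibbs sampler is swapped for a trivial stand-in finder (the most frequent k-mer), and the function names are hypothetical. Only the outer loop (find a motif, mask its sites, search again) reflects the procedure the slide describes.

```python
from collections import Counter


def most_common_kmer(seqs, k):
    """Stand-in motif finder (NOT Gibbs sampling): most frequent unmasked k-mer."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows that overlap masked sites
                counts[kmer] += 1
    if not counts:
        return None
    # sorted() gives a deterministic choice when counts are tied
    return max(sorted(counts), key=counts.__getitem__)


def find_motifs_with_masking(seqs, k, n_motifs):
    """Find up to n_motifs distinct k-mers, masking each one before the next pass."""
    seqs = list(seqs)
    found = []
    for _ in range(n_motifs):
        kmer = most_common_kmer(seqs, k)
        if kmer is None:
            break  # everything left is masked or too short
        found.append(kmer)
        # Mask the sites of the motif just found so it cannot be reported again.
        seqs = [seq.replace(kmer, "N" * k) for seq in seqs]
    return found


print(find_motifs_with_masking(["AAGGCCTT", "TTGGCCAA", "GGCCTTAA"], 4, 3))
```

Masking with `N` keeps sequence coordinates unchanged, so positions reported in later passes still refer to the original sequences.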
AlignACE Results
...
Columns: aligned site, input sequence number, position, strand
Motif 1
GGGGAGGGGGTGGGGGGGC      23   788  0
GGCGGGCGGGCGGCGGGGG      23   867  1
GGACAGCGGCGGCTGGCTG      11   107  0
GGGGTGCGGGGGCAGGCGC      23  1417  1
CCGCGGGGGCGGGCGGGGC      13  2034  1
...
** * ***** ** *** *
MAP Score: 794.004
Motif 2
GGGGAGGGGGTGGGGGGGCGGGG  23   784  0
GTGCGGGGGCAGGCGCGGAGAGC  23  1420  1
GCGGAGCGGGAGGGGGCGTGGCC  13  1932  1
GGGGTGCGGGAGGGCGGGCGGGC  23  1448  1
GGGCAGTGGGCGGCTGGCAGCTG  14  1452  1
...
Gibbs Motif Sampler Program



Uses stochastic iterative sampling
The Bernoulli motif sampler assumes that each sequence can contain zero or more ungapped motif elements of each motif type
References:
» Lawrence et al., Science 1993;262(5131):208-14
» Neuwald et al., Protein Sci. 1995 Aug;4(8):1618-32
Gibbs Results
...
4, 1  284 agtgc AGAGTCTGGAGAGC cgaat  271 0.87 R gid 7739662, start=1, end=800
4, 2  425 ggtat AGATGTCGGAGAGT cgttt  412 0.79 R gid 7739662, start=1, end=800
4, 3  643 atgga AGCCTCGGGAAACT tcggg  656 0.86 F gid 7739662, start=1, end=800
5, 1  239 atgga AGCCTCGGGAAACT tcggg  252 0.86 F gid 64795, start=1, end=487
7, 1  401 agtgt GGGTGCTGGAGGCT gacgg  388 0.99 R gid 64789, start=1, end=1128
9, 1   26 ggagt GGCGGTGGGAAGGG tgttg   13 0.99 R gid 32480, start=1, end=483
...
                **************
...
Consensus Program


Uses entropy-based scoring functions
References:
» Stormo and Hartzell, PNAS 1989;86:1183-1187
» Hertz et al., 1990, CABIOS, 6:81-92
Consensus Results
MATRIX 1
...
1|23 : 1/593 TGCAAGATTTTTAA
2|9  : 2/8   TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 2
...
1|23 : 1/593 TGCAAGATTTTTAA
2|9  : 2/8   TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 3
1|23 : 1/593 TGCAAGATTTTTAA
2|9  : 2/8   TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
MATRIX 4
1|21 : 1/38  GGGAAAGCTCGAGA
2|9  : 2/8   TGGAGGCTTCCAGA
3|10 : 3/889 TGGAGGCTTCCAGA
...
BioProspector Program




Examines the upstream region of genes in the same gene expression pattern group to search for regulatory sequence motifs
Uses zero to third-order Markov background models
Allows for the searching of gapped motifs and motifs with palindromic patterns
Reference: Liu et al., Pac Symp Biocomput. 2001:127-38
BioProspector Results
...
Motif #1:
...
Seq #1 seg 1 r998 TCATCCAATCAGAG
Seq #2 seg 1 f91  TCAACCGAACAGAA
Seq #3 seg 1 r638 TCGACCAATCAAAA
...
Motif #2:
...
Seq #1 seg 1 f38  GGGAAAGCTCGAGA
Seq #2 seg 1 r648 TGGAAGCCTCCAGT
Seq #3 seg 1 r620 TGGAAGCCTCCAGT
...
Motif #3:
...
Seq #1 seg 1 r997 CTCATCCAATCAGA
Seq #2 seg 1 f90  CTCAACCGAACAGA
Seq #3 seg 1 r637 TTCGACCAATCAAA
...
Concepts and Interactions of the Underlying Statistical Algorithms Used by the Motif Searching Programs
Gibbs
AlignACE
Gibbs Sampler; Iterative Updating Strategy
BioProspector
Two Block Motif Model
Information Content
CONSENSUS
$I_{seq} = \sum_{b=A}^{T} f_b \log_2 (f_b / p_b)$
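The information content formula above can be evaluated for a single motif position as follows. This is a minimal sketch: `f_b` is the observed frequency of base b at that position and `p_b` its background frequency. The uniform background of 0.25 per base and the function name are assumptions made for illustration; terms with `f_b = 0` are treated as contributing zero.

```python
import math


def information_content(freqs, background):
    """Sum of f_b * log2(f_b / p_b) over the bases of one motif position."""
    total = 0.0
    for base, f in freqs.items():
        if f > 0.0:  # by convention 0 * log2(0) contributes nothing
            total += f * math.log2(f / background[base])
    return total


# Assumed uniform background; real analyses would estimate p_b from the data.
UNIFORM = {"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25}

conserved = {"A": 0.0, "G": 0.0, "C": 1.0, "T": 0.0}  # a fully conserved position
print(information_content(conserved, UNIFORM))
```

Against a uniform background the value ranges from 0 bits (no preference among bases) to 2 bits (one base always present).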
Motif Data Representation
Common data representation for motif information.
Uses XML Schema to specify format.
Both human and machine readable.
Supports "knowledge mining".
Statements can be asserted about a motif, such as a role in gene regulation.
Example of a motif
Blk1     A     G     C     T
1     0.00  0.21  0.21  0.59
2     0.00  0.44  0.50  0.06
3     0.70  0.29  0.00  0.00
4     0.32  0.62  0.00  0.06
5     0.03  0.00  0.97  0.00
6     0.00  0.00  1.00  0.00
7     0.85  0.09  0.03  0.03
8     0.88  0.12  0.00  0.00
9     0.03  0.00  0.03  0.94
10    0.03  0.09  0.88  0.00
11    0.70  0.12  0.18  0.00
...
<motif id="GXY1">
<block>
<base type="G">0.21</base>
<base type="C">0.21</base>
<base type="T">0.59</base>
</block>
<block>
<base type="G">0.44</base>
<base type="C">0.50</base>
<base type="T">0.06</base>
</block>
<block>
<base type="A">0.70</base>
<base type="G">0.29</base>
</block>
...
</motif>
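A motif document of the shape shown above can be read back into a matrix with the Python standard library. This is a sketch, assuming that a base omitted from a block has probability 0, as the example table suggests; the function name `motif_to_matrix` is hypothetical and the sample contains only the three blocks visible on the slide.

```python
import xml.etree.ElementTree as ET

BASES = ("A", "G", "C", "T")

SAMPLE = """
<motif id="GXY1">
  <block>
    <base type="G">0.21</base>
    <base type="C">0.21</base>
    <base type="T">0.59</base>
  </block>
  <block>
    <base type="G">0.44</base>
    <base type="C">0.50</base>
    <base type="T">0.06</base>
  </block>
  <block>
    <base type="A">0.70</base>
    <base type="G">0.29</base>
  </block>
</motif>
"""


def motif_to_matrix(xml_text):
    """Return (motif id, list of per-block dicts mapping base -> probability)."""
    motif = ET.fromstring(xml_text.strip())
    rows = []
    for block in motif.findall("block"):
        row = dict.fromkeys(BASES, 0.0)  # assumption: omitted bases mean 0.0
        for base in block.findall("base"):
            row[base.get("type")] = float(base.text)
        rows.append(row)
    return motif.get("id"), rows


motif_id, matrix = motif_to_matrix(SAMPLE)
for position, row in enumerate(matrix, start=1):
    print(motif_id, position, row)
```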
XML Schema

Extends the XML document type language:
» Data format restrictions.
» Data value (min and max) restrictions.
» Element occurrence (min and max) restrictions.
No sophisticated restrictions, such as:
» Probability distribution (the values in a block summing to 1).
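Because the probability-distribution restriction noted above falls outside what the schema can state, it would have to be checked by the application. The sketch below shows one way to do that; the function name and the tolerance of 0.02 are assumptions, chosen because the two-decimal values in the example motif sum to 1.01 or 0.99 in some blocks.

```python
def is_probability_block(block, tolerance=0.02):
    """Check that base probabilities lie in [0, 1] and sum to 1 within tolerance."""
    values = list(block.values())
    if any(v < 0.0 or v > 1.0 for v in values):
        return False
    return abs(sum(values) - 1.0) <= tolerance


# First and third blocks of the example motif (omitted bases count as 0).
print(is_probability_block({"G": 0.21, "C": 0.21, "T": 0.59}))
print(is_probability_block({"A": 0.70, "G": 0.29}))
```

A tolerance is needed only because the published values are rounded; with exact probabilities it could be tightened considerably.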
XML Schema for MotifML
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="motif" type="MotifType"/>
  <!-- A motif consists of a sequence of blocks. -->
  <xsd:complexType name="MotifType">
    <xsd:sequence>
      <xsd:element name="block" minOccurs="0"
                   maxOccurs="unbounded" type="BlockType"/>
    </xsd:sequence>
  </xsd:complexType>
  <!-- A block specifies a probability for each DNA base type. -->
  <xsd:complexType name="BlockType">
    <xsd:sequence>
      <xsd:element name="base" minOccurs="1" maxOccurs="4">
      ...
Statements about motifs
<?xml version="1.0"?>
<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:mml="http://www.beyondgenomics.com/2001/07/motifml#"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway#">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>
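Statements like these can be pulled out as subject, predicate, object triples. The sketch below uses only the Python standard library on a well-formed copy of the statements above; it is not a general RDF parser, and the function name `statements` is an assumption. It handles just the `Description` / `rdf:resource` pattern used on the slide.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BP_NS = "http://www.beyondgenomics.com/2001/07/biopathway#"

SAMPLE_RDF = """<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
     xmlns:bp="http://www.beyondgenomics.com/2001/07/biopathway#">
  <Description about="http://www.beyondgenomics.com/motifdb/gxy1">
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/awy5"/>
    <bp:upregulate rdf:resource="http://www.beyondgenomics.com/motifdb/ftg6"/>
    <bp:downregulate rdf:resource="http://www.beyondgenomics.com/motifdb/bgt3"/>
  </Description>
</RDF>"""


def statements(rdf_text):
    """Return (subject, predicate, object) triples from Description elements."""
    root = ET.fromstring(rdf_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        # The slide writes 'about' without a prefix; accept rdf:about as well.
        subject = desc.get("about") or desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree tags look like '{namespace}local'; join them into a URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            obj = prop.get(f"{{{RDF_NS}}}resource")
            triples.append((subject, predicate, obj))
    return triples


for triple in statements(SAMPLE_RDF):
    print(triple)
```

Each triple reads as "motif gxy1 upregulates (or downregulates) motif X", which is the kind of assertion the slide has in mind for knowledge mining.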
The Need for Bio-Ontologies
How do biologists learn the element structure of a document describing the heterogeneous sequence alignment output?
How do biologists share the structure and meta-data on motif profiles efficiently and unambiguously?

A multiple sequence alignment linked with TRANSFAC/TRANSPATH
       ========== = ============ ===   = ===== ===    =  =  =====     == =======
human  GCTTGAATTAGACAGGATTAAAGGC TTACTGGAGCTGGAAGCCTTGCCCC -AACTCAGGAGTTTAGCCCCA
bovine GCTTGAATTAAATAGGATTAAAGGC TTATCAGGGCTGGGAGCTACACCCC -AACTCCTGAGTTTAGCCCCA
mouse  GCTTGAATTAGACAGGATTAAAGGC TTAGCAGAGCTGGAAGCCTCACATC TAACTCCCACATTGAGCCCCA
       |                         |                         |                   |
       -70                       -45                       -20                 +1
Marked elements: PCE-I, CBE, AP-4, S8 homeodomain (8888), cETS (two sites)
Alignment Profile
Shown here is the alignment from -70 to +1. The numbering shown corresponds to the mouse sequence. Identical bases are
shown by the = above each nucleotide. Consensus sequence matches conserved among all three species are: the Ret-1/PCE-I
element at -65 to -60, the CRX-binding element (CBE) at -55 to -50, an AP-4 consensus core sequence at -37 to -34, a cETS
consensus core at -35 to -31 and another at positions -57 to -54, and an S8 homeodomain is shown by "8888" at -64 to -61.
Only the core bases are marked. The criteria for searching the TRANSFAC Database by MatInspector were a match to the core
sequence of at least 80% and to the entire consensus sequence of at least 85%. The GenBank entries for human, bovine, and
mouse are X53044, M32733, and M32734, respectively. (Boatright, Mol Vis 1997; 3:15)
Transcriptional Factors
Ontology
Composite
Element
Upstream to
Site
Part of
•Disease
Gene
Within
•Tissue
•Stage
contains
Kind of
Context
produces
Transcript
•Env.Cond.
•Induced
Found in
Observation
Transcriptional
Motif Elements
Binds to
Transcriptional
Factors
MotifML Applications
Develop a data exchange format for DNA motif data
Handling output from motif analyses
Annotation and data mining of micro-array data
Important in modeling transcriptional regulatory networks in eukaryotes
Future Directions
Distributed Annotation System (Lincoln Stein, Open-Bio)
Exchange with Other XML Dialects
DAML development