tutorial6_13

Tutorial 5
Motif discovery and Protein Databases
Multiple sequence alignments and motif discovery
• Motif discovery
– MEME
– MAST
– TOMTOM
– GOMO
• Protein database
– UniProt
– Pfam
Motif discovery
Motif – definition
Motif
a widespread pattern with a biological significance.
Structural motif – Beta hairpin
Sequence motif
PTB (RNA binding protein)
UCUU
CAP (DNA binding protein)
TGTGAXXXXXXTCACAXT
Sequence motif – definition
Motif
a nucleotide or amino-acid sequence pattern that is widespread
and has a biological significance
PSSM - position-specific scoring matrix
..YDEEGGDAEE..
..YDEEGGDAEE..
..YGEEGADYED..
..YDEEGADYEE..
..YNDEGDDYEE..
..YHDEGAADEE..
Position  1    2    3    4    5    6    7    8    9    10
A         0    0    0    0    0    3/6  1/6  2/6  0    0
D         0    3/6  2/6  0    0    1/6  5/6  1/6  0    1/6
E         0    0    4/6  1    0    0    0    0    1    5/6
G         0    1/6  0    0    1    2/6  0    0    0    0
H         0    1/6  0    0    0    0    0    0    0    0
N         0    1/6  0    0    0    0    0    0    0    0
Y         1    0    0    0    0    0    0    3/6  0    0
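The matrix for this slide holds, for each position, the fraction of the six sequences that carry each residue. A minimal sketch of that calculation follows; the function name `frequency_matrix` is illustrative, and a real scoring matrix would usually add pseudocounts and convert the frequencies to log-odds scores, which this sketch does not do.

```python
from collections import Counter
from fractions import Fraction

# The six aligned motif instances from the slide.
SITES = [
    "YDEEGGDAEE",
    "YDEEGGDAEE",
    "YGEEGADYED",
    "YDEEGADYEE",
    "YNDEGDDYEE",
    "YHDEGAADEE",
]


def frequency_matrix(sites):
    """Return {residue: [frequency at position 1, 2, ...]} as exact fractions."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    residues = sorted(set("".join(sites)))
    matrix = {r: [] for r in residues}
    for pos in range(length):
        # Count how often each residue occurs in this column.
        counts = Counter(s[pos] for s in sites)
        for r in residues:
            matrix[r].append(Fraction(counts[r], len(sites)))
    return matrix


if __name__ == "__main__":
    for residue, row in frequency_matrix(SITES).items():
        print(residue, " ".join(str(f) for f in row))
```

Fractions are printed in reduced form, so 3/6 appears as 1/2. Every column sums to 1, which is a quick sanity check on any hand-built matrix.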
Can we find motifs using multiple
sequence alignment (MSA)?
YES!
NO
Using MSA for motif discovery
Works only if the motif instances line up well on their own
For most motifs this is not the case!
Motif search: from de-novo motifs to
motif annotation
gapped motifs
Large DNA data
MEME – Multiple EM* for Motif finding
http://meme.sdsc.edu/
• Motif discovery from unaligned sequences - genomic or
protein sequences
• Flexible model of motif presence (Motif can be absent in
some sequences or appear several times in one sequence)
*Expectation-maximization
MEME - Input
Email address
Input file (FASTA file)
How many times in each sequence?
Range of motif lengths
How many motifs?
How many sites?
MEME - Output
Motif E-value
MEME – Sequence logo
Motif E-value
Motif length
Number of appearances
A graphical representation of the sequence motif
MEME – Sequence logo
High information content = high confidence
The relative sizes of the letters indicate their frequency in the sequences.
The total height of the letters depicts the information content of the position, in bits.
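As a rough illustration of how the stack height relates to the residue frequencies, the sketch below computes the information content of one column as log2(alphabet size) minus the Shannon entropy. It assumes a uniform background and applies no small-sample correction, so the values may differ from those drawn by a given logo tool; the function names are illustrative.

```python
import math


def information_content(column, alphabet_size=20):
    """Information content (bits) of one motif column.

    `column` maps residue -> frequency (frequencies sum to 1).
    Uses IC = log2(alphabet_size) - H, with H the Shannon entropy,
    assuming a uniform background and no small-sample correction.
    """
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=20):
    """Height of each letter in the logo stack: frequency * column IC."""
    ic = information_content(column, alphabet_size)
    return {residue: p * ic for residue, p in column.items() if p > 0}
```

A fully conserved protein position such as `{"Y": 1.0}` reaches the maximum of log2(20), about 4.32 bits, while a position where all residues are equally likely has an information content of 0. For DNA motifs, pass `alphabet_size=4`, which gives a maximum of 2 bits.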
MEME – Sequence logo
Multilevel Consensus
Patterns can be presented as regular
expressions
[AG]-x-V-x(2)-{YW}
[] - Either residue
x - Any residue
x(2) - Any residue in the next 2 positions
{} - Any residue except these
Examples: AYVACM, GGVGAA
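To check sequences against a pattern like this one, the pattern can be translated into an ordinary regular expression. The sketch below handles only the four elements listed on the slide and is not a complete parser for the pattern syntax; `prosite_to_regex` is an illustrative name.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple pattern such as [AG]-x-V-x(2)-{YW} into a regex.

    Supported elements: [..] either residue, x any residue,
    x(n) any residue at n positions, {..} any residue except these.
    """
    parts = []
    for element in pattern.split("-"):
        repeat = ""
        # Split off a trailing repeat count such as the (2) in x(2).
        match = re.fullmatch(r"(.+?)\((\d+)\)", element)
        if match:
            element, repeat = match.group(1), "{" + match.group(2) + "}"
        if element == "x":
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element  # a literal residue or a [..] class
        parts.append(core + repeat)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("[AG]-x-V-x(2)-{YW}")
    for seq in ("AYVACM", "GGVGAA", "AYVACY"):
        print(seq, bool(re.fullmatch(regex, seq)))
```

Both examples from the slide match the translated pattern, whereas a sequence ending in Y or W, such as AYVACY, does not.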
MEME – motif alignment
Sequence names
Position in sequence
Strength of match
Motif within sequence
MEME – motif locations
Sequence names
Motif location in the input sequence
Overall strength of motif matches
What can we do with motifs?
• MAST - Search for them in unannotated sequence databases (protein and DNA).
• TOMTOM - Find the protein that binds the DNA motifs.
• GOMO - Find putative target genes (DNA) of motifs and analyze their associated annotation terms.
MAST
http://meme.sdsc.edu/meme4_4_0/cgi-bin/mast.cgi
• Searches for motifs (one or more) in sequence
databases:
– Like BLAST, but with motifs as input
– Similar to iterations of PSI-BLAST
• Profile defines strength of match
– Multiple motif matches per sequence
– Combined E value for all motifs
• MEME uses MAST to summarize results:
– Each MEME result is accompanied by the MAST result for
searching the discovered motifs on the given sequences.
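The idea that a profile defines the strength of a match can be sketched as a sliding-window scan: every window of a sequence is scored against a log-odds matrix, and the highest-scoring windows are the candidate sites. This is a simplified illustration, not MAST's algorithm; the uniform background, the flat pseudocount and the function names are all assumptions, and no E-values are computed.

```python
import math


def log_odds_matrix(sites, alphabet, pseudocount=1.0):
    """Build a log-odds profile from aligned sites.

    Assumes a uniform background and a flat pseudocount; both are
    illustrative choices, not MAST's actual parameters.
    """
    background = 1.0 / len(alphabet)
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(len(sites[0])):
        column = {}
        for letter in alphabet:
            count = sum(1 for s in sites if s[pos] == letter)
            freq = (count + pseudocount) / total
            column[letter] = math.log2(freq / background)
        matrix.append(column)
    return matrix


def scan(sequence, matrix):
    """Return (start, score) for every window, best score first."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Letters outside the alphabet raise KeyError in this sketch.
        score = sum(col[letter] for col, letter in zip(matrix, window))
        hits.append((start, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    sites = ["TGTGA", "TGTGA", "TGAGA", "TGTCA"]  # made-up example sites
    matrix = log_odds_matrix(sites, "ACGT")
    print(scan("CCTGTGACC", matrix)[:3])
```

Start positions are 0-based. A positive score means the window looks more like the motif than like background under these assumptions.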
MAST - Input
Email address
Input file (motifs)
Database
If you wish to use motifs discovered by MEME
Input motifs
MAST - Output
Presence of the motifs in a given database
TOMTOM
http://meme.sdsc.edu/meme/doc/tomtom.html
• Searches one or more query DNA motifs
against one or more databases of target
motifs, and reports for each query a list of
target motifs, ranked by p-value.
• The output contains results for each query, in
the order that the queries appear in the input
file.
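The comparison of a query motif against a database of target motifs can be sketched as follows: slide the query along each target, measure how different the overlapping columns are, and rank the targets by their best offset. The sketch uses the mean Euclidean distance between frequency columns and ranks by raw distance; it does not compute p-values, and all function names are illustrative.

```python
import math


def column_distance(col_a, col_b, alphabet="ACGT"):
    """Euclidean distance between two frequency columns (dicts)."""
    return math.sqrt(
        sum((col_a.get(x, 0.0) - col_b.get(x, 0.0)) ** 2 for x in alphabet)
    )


def best_offset_distance(query, target):
    """Slide `query` along `target`; return (smallest mean distance, offset).

    Only offsets where the query lies fully inside the target are tried.
    """
    if len(query) > len(target):
        raise ValueError("query is longer than target")
    best = None
    for offset in range(len(target) - len(query) + 1):
        overlap = target[offset:offset + len(query)]
        dist = sum(column_distance(q, t) for q, t in zip(query, overlap))
        dist /= len(query)
        if best is None or dist < best[0]:
            best = (dist, offset)
    return best


def rank_targets(query, database):
    """Return [(name, distance, offset)] sorted by increasing distance.

    `database` maps motif name -> list of frequency columns.
    Targets shorter than the query are skipped.
    """
    results = []
    for name, target in database.items():
        if len(target) < len(query):
            continue
        dist, offset = best_offset_distance(query, target)
        results.append((name, dist, offset))
    return sorted(results, key=lambda item: item[1])
```

A distance of 0 means the query columns are identical to a stretch of the target. Turning such raw distances into p-values requires a null model of motif columns, which is outside the scope of this sketch.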
TOMTOM - Input
Input motif
Background frequencies
Database
DNA IUPAC* code
A --> adenine
C --> cytosine
G --> guanine
T --> thymine
M --> A C (amino)
S --> G C (strong)
W --> A T (weak)
B --> G T C
R --> G A (purine)
Y --> T C (pyrimidine)
K --> G T (keto)
D --> G A T
H --> A C T
V --> G C A
N --> A G C T (any)
Example: YCAY = [TC]CA[TC]
*IUPAC = International Union of Pure and Applied Chemistry
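The IUPAC codes map directly onto regular-expression character classes, which makes it easy to search a DNA sequence for a degenerate motif. The following sketch reproduces the YCAY example; the dictionary lists the bases in the same order as the table, and the function names are illustrative.

```python
import re

# IUPAC nucleotide codes, with bases in the order used in the table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "S": "GC", "W": "AT", "R": "GA", "Y": "TC", "K": "GT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as YCAY into a regular expression."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # KeyError for letters that are not IUPAC codes
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_sites(motif, sequence):
    """Start positions (0-based) of all matches, including overlapping ones."""
    # A lookahead matches without consuming, so overlapping sites are found.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(iupac_to_regex("YCAY"))
    print(find_sites("YCAY", "GGTCATCCATAA"))
```

This searches one strand only; scanning the reverse complement as well would be needed to find sites on the opposite strand.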
TOMTOM - Output
Input motif
Matching motifs
TOMTOM – Output
Wrong input (RNA sequence of RNA binding protein NOVA1)
“OK” results
JASPAR
• Profiles
– Transcription factor binding sites
– Multicellular eukaryotes
– Derived from published collections of experiments
• Open data access
Name of gene/protein
Organism
Score
Logo
GOMO
• GOMO takes DNA binding motifs to find putative
target genes and analyze their associated GO
terms. A list of significant GO terms that can be
linked to the given motifs will be produced.
• GOMO returns a list of GO-terms that are
significantly associated with target genes of the
motif.
• Gene Ontology provides a controlled vocabulary
to describe gene and gene product attributes in
any organism.
GOMO - Input
Email address
Input file (motifs)
Database
Input motifs
GOMO - Output
MF - Molecular function
BP - Biological process
CC - Cellular component
GO annotation
Protein databases
Pfam
http://pfam.sanger.ac.uk/
Pfam is a database of multiple alignments of
protein domains or conserved protein
regions.
Glossary
Domain
A structural unit which can be found in
multiple protein contexts. Domains are
long motifs (30-100 aa).
Family
A collection of related proteins
What kind of domains can we find in Pfam?
Trusted Domains
Repeats
Fragment Domains
Nested Domains
Disulfide bonds
Important residues (e.g., active sites)
Transmembrane domains
Pfam input
Domains
Domain range and score
Description
Structure info
Gene Ontology
Links
Domain organization
HMM logo
Known structures for the domain
UniProt
http://www.uniprot.org/
The Universal Protein Resource (UniProt) is a central repository of protein sequences, functions, classification and cross-references.
It was created by joining the information contained in Swiss-Prot and TrEMBL.
UniProt input
Protein search
Reviewed protein
Sequence download
UniProt output
Accession number
Protein status
Organism
Length
Information for one protein
General information
General annotations
Keywords
GO annotation (MF, BP, CC)
Alternative splicing isoforms
Features in the sequence
Sequences
References
Alignment for two or more proteins
MSA
Blast
ID mapping
Retrieving sequences
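Sequences can also be retrieved programmatically rather than through the web page. The sketch below builds a download URL for one accession, fetches it, and parses the FASTA text. The `rest.uniprot.org` URL pattern is an assumption about the current UniProt web service and should be checked against its documentation; the function names are illustrative.

```python
from urllib.request import urlopen


def fasta_url(accession):
    """URL of a FASTA record; the rest.uniprot.org path is an assumption."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {h: "".join(chunks) for h, chunks in records.items()}


def fetch_sequences(accessions):
    """Download and parse each accession (needs network access)."""
    result = {}
    for accession in accessions:
        with urlopen(fasta_url(accession)) as response:
            result.update(parse_fasta(response.read().decode("utf-8")))
    return result
```

Calling `fetch_sequences` with a list of accession numbers, for example `["P04637"]`, would return the header and sequence of each entry. The parsed records can then be written to a single FASTA file and used as input for MEME.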
Multiple sequence alignments and motif discovery
• Motif discovery
– MEME
– MAST
– TOMTOM
– GOMO
• Protein database
– UniProt
– Pfam