pattern matching

Download Report

Transcript pattern matching

C
E
N
T
R
E
F
O
R
I
N
T
E
G
R
A
T
I
V
E
B
I
O
I
N
F
O
R
M
A
T
I
C
S
V
U
Master Course
Sequence Alignment
Lecture 11
Sequence Motif Searches
Pattern matching
• Functional genomics: finding out the
function of all genes (and other parts) in a
genome
• Ability to recognise protein function
paramount
• Database searching is crucial strategy
– trypsin has catalytic triad (His, Asp, Ser). How
to recognize this?
– (local) alignments not always suitable
• short patterns, too many ‘don’t cares’, etc.
Degenerate DNA codes
Four bases: A, C, G, T
Two-fold degenerate IUB codes:
• R=[AG]
• Y=[CT]
• K=[GT]
• M=[AC]
• S=[GC]
• W=[AT]
Four-fold degenerate: N=[AGCT]
Degenerate protein codes
20 bases: ACDEFGHIKLMNPQRSTVWY
Degenerate codes:
• X = unknown, all (20-fold degenerate)
• B = [DE]
• Z = [NQ]
Transcription factors
• Required but not a part of the RNA polymerase complex
• Many different roles in gene regulation
–
–
–
–
–
Binding
Interaction
Initiation
Enhancing
Repressing
• Various structural classes (e.g. zinc finger domains)
• Consist of both a DNA-binding domain and an interactive
domain
Transcription factors
Transcription
factor –
polymerase
interaction sets
off gene
transcription…
TF binding site
TF
mRNA
Pol II transcription
TATA
mRNA
transcription
TF binding site
(closed)
TATA
Nucleosomes (chromatin structures
composed of histones) are structures
round which DNA coils. This blocks
access of TFs
TF binding site
(open)
… many TFBSs are possible
Motifs
•
•
•
•
•
Short sequences of DNA (or RNA or amino acids)
Often consist of 5- 16 nucleotides
Protein motifs can be more variable
May contain gaps
Examples include:
–
–
–
–
–
–
–
Splice sites
Start/stop codons
Transmembrane domains
Centromeres
Phosphorylation sites
Coiled-coil domains
Transcription factor binding site (regulatory motifs)
pattern matching
This lecture:
• Regular expressions
• Pre-processing data for pattern matching:
suffix trees
• Hidden Markov models (brief)
Rationale for regular
expressions
• “I want to see all sequences that ...
– ... contain a C”
– ... contain a C or an F”
– ... contain a C and an F”
– ... contain a C immediately followed by an F”
– ... contain a C later followed by an F”
– ... begin with a C”
– ... do not contain a C”
– ... contain at least three Cs”
– ... contain exactly three Cs”
– ... has a C at the seventh position”
– ... either contain a C, an E, and an F in any order except CFE,
unless there are also at most three Ps, or there is a ....
regular expressions
• alphabet: set of symbols
– {A, C, T, G}
• string: sequence of symbols from
alphabet
– AACTG, CATG, GGA, ACFT, ε
• regex: formal method to define (sub)set
of strings
– [^C].AG?T*
– used for pattern matching
• check if database sequence ∈ regex
contruction of a regex
• regex contains:
– symbols from alphabet
• C
→
{C}
– operators
• operations on regex(es) yield new regex
• concatenation, union, repetition, ...
basic operators
r 1 r2
concatenation
AC
AAC
→
→
{ AC }
{ AAC }
[s1s2 ... sn]
union (of symbols)
r1|r2
union (of regexes)
[ACG]
[AC]G
A|CC
[AC]|AC
r+
→
→
→
→
{ A, C, G }
{ AG, CG }
{ A, CC }
{ A, C, AC }
repeat once or more
C+
A[AC]+
→
→
{ C, CC, CCC, CCCC, ... }
{ AA, AC, AAA, AAC, ACA, ACC, AAAA, AAAC, ... }
derived operators
r?
optional
C?
AC?G
r*
→
→
{ ε, C }
{ AG, ACG }
repeat zero or more times
C*
A*C
[AC]*
→
→
→
{ ε, C, CC, CCC, CCCC, ... }
{ C, AC, AAC, AAAC, ... }
{ ε, A, C, AA, AC, CA, CC,
AAA, AAC, ACA, ACC, ... }
repeat n – m times
rn-m
C4
C2-4
C-3
C3-
→
→
→
→
{ CCCC }
{ CC, CCC, CCCC }
{ ε, C, CC, CCC }
{ CCC, CCCC, CCCCC, ... }
miscellaneous
.
any symbol
.
A.C
.?
.*
[^s1s2 ... sn]
[^A]
[^AC]
→
→
→
→
{ A, C, G, T }
{ AAC, ACC, AGC, ATC }
{ ε, A, C, G, T }
{ ε, A, C, G, T, AA, AC, AG, AT,
CA, CC, CG, CT, GA, ... }
exclude symbols
→
→
(r)
{ C, G, T }
{ G, T }
grouping
(AC)?
AC?
(AC)*
AC*
→
→
→
→
{ ε, AC }
{ A, AC }
{ ε, AC, ACAC, ACACAC, ACACACAC, ... }
{ A, AC, ACC, ACCC, ... }
limitations
• regex cannot remember indeterminate
counts !!!
– “I want to see all sequences with ...
☺ ... six Cs followed by six Ts”
– C6T6
☺ ... any number of Cs followed by any number of Ts”
✰ C*T*
☹ ... Cs followed by an equal number of Ts”
✰ CnTn
✰ (CT|CCTT|CCCTTT|C4T4| ... )?
– use (context-free) grammar
regexes in pattern
matching
• pattern described by regex
• check if sequence ∈ regex
• matching done very efficiently
– O(n)
– using state machine
state machines
AC*T|GGC
• compile regex to state machine
• match sequence with regex
Example from BLAST:
Determining Query Words
• Given:
– query sequence: QLNFSAGW
– word length w = 3
– word score threshold T = 8
• Step 1: determine all words of length w in
query sequence
QLN LNF NFS FSA SAG AGW
Example from BLAST:
Determining Query Words
• Step 2: determine all words that score at least
T when compared to a word in the query
sequence:
words from
sequence
QLN
LNF
NFS
…
SAG
...
query words w/ T=8
QLN=11, QMD=9, HLN=8, ZLN=9,…
LNF=9, LBF=9, LBY=8, FNW=8,…
NFS=12, AFS=8, NYS=8, DFT=10,…
none
Example from BLAST:
Scanning the Database
• search database for all occurrences of
query words
• approach:
– build a DFA (deterministic finite-state
automaton) that recognizes all query words
– run DB sequences through DFA
– remember hits
Example from BLAST: Scanning the
Database
• consider a DFA to
recognize the query
words: QL, QM, ZL
• All that a DFA does is
read strings, and
output "accept" or
"reject."
• use Mealy paradigm
(accept on transitions)
to save space and
time
Moore paradigm: the alphabet is (a, b), the states
are q0, q1, and q2, the start state is q0 (denoted by
the arrow coming from nowhere), the only
accepting state is q2 (denoted by the double ring
around the state), and the transitions are the arrows.
The machine works as follows. Given an input
string, we start at the start state, and read in each
character one at a time, jumping from state to state
as directed by the transitions. When we run out of
input, we check to see if we are in an accept state. If
we are, then we accept. If not, we reject.
Moore paradigm: accept/reject states
Mealy paradigm: accept/reject transitions
Example from BLAST: a DFA to
recognize query words: QL, QM, ZL
Q
not (L or M or Q)
start
Q
L or M
Mealy
paradigm
Z
Z
L
not (L or Z)
not (Q or Z)
Accept on red transitions
(Mealy paradigm)
other uses
• many programs use regular
expressions
– command-line interpreter
• del *.*
– editor
• search
• replace
– compilers
– perl, grep, sed, awk
☹many different syntaxes
local vs. global matching
• global: regex describes entire string to be
matched
– ACCCCTG  C3-
• local: regex describes substring to be
matched
– ACCCCTG  C3-
– ^ matches start-of-string
• ^CG: match everything starting with CG
• ^[^CG]:
match everything not starting with C or G
– $ matches end-of-string
• AC$:
match everything ending with AC
Regular expressions
Alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
For short sequence
stretches, regular
expressions are often
more suitable to
describe the information
than alignments (or
profiles)
Regular expression
[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
{PG} = not (P or G)
Regular expressions
Regular expression
No. of exact
matches in DB
D-A-V-I-D
71
D-A-V-I-[DENQ]
252
[DENQ]-A-V-I-[DENQ]
925
[DENQ]-A-[VLI]-I-[DENQ]
2739
[DENQ]-[AG]-[VLI]2-[DENQ] 51506
D-A-V-E
1088
Motif-based function prediction
•
Prediction of protein functions based on identified
sequence motifs
•
PROSITE contains patterns specific for more than a
thousand protein families.

ScanPROSITE -- it allows to scan
a protein sequence for occurrence
of patterns and profiles stored in
PROSITE
http://www.expasy.org/prosite/
Prosite example:
Post-translational modification
ASN_GLYCOSYLATION, PS00001; N-glycosylation
site (PATTERN with a high probability of occurrence!)
Consensus
pattern:
N - {P} - [ST] - {P}
N is the glycosylation
site
Prosite example: extended profile
Acyl carrier protein phosphopantetheine domain profile.
/GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=71;
/DISJOINT: DEFINITION=PROTECT; N1=6; N2=66;
/NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=2.3; R2=.02281121; TEXT='-LogE';
/CUT_OFF: LEVEL=0; SCORE=271; N_SCORE=8.5; MODE=1; TEXT='!';
/CUT_OFF: LEVEL=-1; SCORE=184; N_SCORE=6.5; MODE=1; TEXT='?';
/DEFAULT: D=-20; I=-20; B1=-80; E1=-80; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-1; A B C D E F G H I K L
MNPQRSTVWYZ
/I: B1=0; BI=-105; BD=-105;
/M: SY='T'; M= -5,-15,-20,-17,-12,-10,-22,-18, 2,-13, -1, 0,-13, -6,-10,-13, -5, 4, 1,-23, -9,-12;
/M: SY='E'; M= -6, -6,-22, -6, 9,-13,-21, -9,-11, 0, -8, -7, -7,-13, 1, 1, -4, -3, -8,-24,-10, 4;
/M: SY='E'; M= -5, 9,-24, 11, 15,-24,-12, -3,-23, 3,-20,-15, 6, -9, 5, 1, 4, -2,-19,-29,-16, 9;
/M: SY='E'; M= -5, 2,-26, 4, 8,-22,-13, -7,-21, 7,-17,-12, 0,-13, 3, 7, -2, -6,-16,-22,-12, 5;
/M: SY='L'; M= -6,-27,-19,-30,-23, 4,-30,-23, 26,-25, 28, 17,-25,-27,-21,-20,-19, -5, 23,-23, -3,-23;
/M: SY='R'; M= -3,-10,-10,-11, 2,-16,-19,-11,-13, -1, -8, -7, -8,-17, -1, 3, -5, -6, -9,-26,-13, -1; /M: SY='E'; M= -1, 3,-23, 4,
9,-24,-11, -7,-22, 8,-19,-13, 2,-11, 5, 6, 2, -2,-17,-26,-15, 7; /M: SY='I'; M= -5,-22,-20,-27,-19, -4,-29,-20, 20,-19, 13,
10,-18,-21,-14,-17,-15, -6, 14,-20, -4,-18; /M: SY='I'; M= -8,-30,-24,-33,-27, 8,-29,-26, 19,-24, 15, 9,-28,-27,-23,-21,22,-10, 17, 9, 4,-25; /M: SY='A'; M= 11, -8, -8,-12, -5,-19,-11,-14,-14, -1,-14, -9, -6,-15, -4, -4, 2, -2, -6,-25,-15, -5;
/M: SY='E'; M= -5, 10,-26, 15, 22,-28,-12, -2,-26, 6,-21,-16, 4, -8, 10, 0, 2, -6,-23,-28,-16, 16; /M: SY='V'; M= -5,-14,15,-16, -6,-11,-23,-14, 4,-11, 0, 1,-13,-19, -5,-12, -7, -5, 6,-24, -8, -6; /M: SY='L'; M= -2,-24,-21,-26,-19, 5,-24,-20,
10,-23, 22, 7,-23,-24,-18,-19,-18, -7, 6, -7, 0,-18; /M: SY='G'; M= 3, -4,-25, -5, -4,-27, 24,-12,-29, -6,-25,-16, 1,-12, 4, -8, 5, -9,-23,-24,-22, -4; /M: SY='V'; M= -1,-12,-19,-14, -8,-11,-20,-14, 4,-12, 0, 1,-11,-18,-10,-13, -5, -2, 7,-25, -9,10; /I: I=-4; MI=0; MD=-15; IM=0; /M: M= -2, -6,-13, -6, -5, -9,-11,-10, -2, -8, 0, -2, -7, -7, -7, -8, -5, -3, -1,-18, -8, -7;
D=-3; /I: DM=-15;
.
.
.
Suffix Trees
•
A suffix tree (also called PAT tree or, in an earlier form, position tree) is a data
structure that presents the suffixes of a given string, allowing fast implementation of
many important string operations.
•
A suffix is defined as the shortest sub-sequence starting at a given position that is
unique in the complete sequence and can therefore be used to clearly identify that
position.
•
The suffix tree for a string S is a tree whose edges are labelled with strings, such
that each suffix of S corresponds to exactly one path from the tree's root to a leaf.
•
Constructing such a tree for the string S takes time and space linear in the length
of S.
•
Once constructed, several operations can be performed quickly, for instance
locating a substring in S, locating a substring if a certain number of mistakes are
allowed, locating matches for a regular expression pattern etc.
•
Suffix trees also provided one of the first linear-time solutions for the longest
common substring problem. These speedups come at a cost: storing a string's
suffix tree typically requires significantly more space than storing the string itself.
With what kind of strings would this not be the case?
Suffix Trees
Given the string `mississippi', `miss' is a prefix, `ippi' is a suffix, & `issi' is a
substring. Note that a substring is a prefix of a suffix.
If txt=t1t2...ti...tn is a string, then Ti=titi+1...tn is the suffix of txt that
starts at position i, e.g.
T1 = mississippi = txt
T2 = ississippi
T3 = ssissippi
T4 = sissippi
T5 = issippi
T6 = ssippi
T7 = sippi
T8 = ippi
T9 = ppi
T10 = pi
T11 = i
T12 = (empty)
Suffix Trees
11: i
8: ippi
5: issippi
2: ississippi
1: mississippi
10: pi
9: ppi
7: sippi
4: sissippi
6: ssippi
3: ssissippi
|(1:mississippi)|leaf
tree:|
|
|
|(6:ssippi)|leaf
|
|(3:ssi)|
|
|
|(9:ppi)|leaf
|(2:i)|
|
|(9:ppi)|leaf
|
|
|
|(6:ssippi)|leaf
|
|(4:si)|
|
|
|(9:ppi)|leaf
|(3:s)|
|
|
|(6:ssippi)|leaf
|
|(5:i)|
|
|
|(9:ppi)|leaf
|
|
|(10:pi)|leaf
|(9:p)|
|
|(11:i)|leaf
7 branching nodes
From
http://www.allisons.org/ll/AlgDS/Tree/Suffix/,
please study the information provided by this
link!
Building a Suffix Tree
Take a sequence:
1. Group all positions according to base
(nucleotide, a.a.) type leading to the first level
in the tree
(symbol ‘$’ is often included to indicate the end
of the string)
1. Regroup each group according to the following
base, giving the second row of the tree
2. Continue this process and stop when a
(sub)group only contains one sequence
position (but include complete suffix)
Finding motifs in sequences
• Many methods exist
– MEME, Gibbs
• Typically, these programs try and find
motifs of a given length W in a set of N
sequences.
• Using probabilistic formalisms, they
distinguish background residues, those not
in the pattern, and residues at specific
positions in the pattern.
Example of HMM repository:
The PFAM Database
Pfam is a large collection of multiple sequence
alignments and hidden Markov models covering
many common protein domains and families. For
each family in Pfam you can:
• Look at multiple alignments
• View protein domain architectures
• Examine species distribution
• Follow links to other databases
• View known protein structures
• Search with Hidden Markov Model (HMM) for
each alignment
The PFAM Database
Pfam is a database of two parts, the first is the
curated part of Pfam containing about 9000
protein families (Pfam-A). Pfam-A comprises
manually crafted multiple alignments and profileHMMs .
To give Pfam a more comprehensive coverage of
known proteins we automatically generate a
supplement called Pfam-B. This contains a large
number of small families taken from the PRODOM
database that do not overlap with Pfam-A.
Although of lower quality Pfam-B families can be
useful when no Pfam-A families are found.
The PFAM Database
Sequence coverage Pfam-A : 74% (Yellow)
Sequence coverage Pfam-B : 13% (Blue)
Other (Grey)
74% of proteins have at least one match with Pfam.
Version 21.0 - November 2006: Pfam-A contains 8957 families
Pfam Ig Family Alignment
Clan pages in Pfam. (A) A screen shot of a clan summary page, containing the description, annotation and membership of the clan.
From this page, the user can view the family relationship diagram (B). Each family in the clan is represented by a blue box and its
relationship to other families is represented by solid lines (significant profile–profile comparison score) or dashed lines (non-significant
profile-profile comparison score). Beside each line, the profile–profile comparison E-value score is presented. This score is also linked to
a visualization of the profile–profile comparison alignment (C). The clan summary page also provides a link to the clan alignment (D) (for
more details see text). The clan alignment is a multiple sequence alignment of all of the clan members seed alignments (each set of seed
sequences are separated by the alternate background shading). The alignments are coloured using Jalview.
HMM-based homology
searching
Transition probabilities and Emission probabilities
Gapped HMMs also have insertion and deletion
states
Profile HMM: m=match state, I-insert state, d=delete state; go from
left to right. I and m states output amino acids; d states are ‘silent”.
d1
d2
d3
d4
I0
I1
I2
I3
I4
m0
m1
m2
m3
m4
m5
End
Start
Transition probabilities and Emission probabilities
A hidden Markov model
accompanying a PFAM
alignment
HMMER2.0 [2.2g]
NAME cytochrome_b_N
ACC PF00033
DESC Cytochrome b(N-terminal)/b6/petB
LENG 222 ALPH Amino RF no CS no MAP yes COM hmmbuild -F HMM_ls.ann SEED.ann COM hmmcalibrate --seed 0
HMM_ls.ann NSEQ 8 DATE Thu Dec 12 02:48:53 2002
CKSUM 8731
GA -41.9 -41.9 TC -41.9 -41.9 NC -42.4 -42.4 XT -8455 -4 -1000 -1000 -8455 -4 -8455 -4 NULT -4 -8455 NULE 595 -1558 85 338 294 453 -1158 197 249 902 -1085 -142 -21 -313 45 531 201 384 -1998 -644 EVD -170.913223 0.138730
HMM A C D E F G H I K L M N P Q R S T V W Y
m->m m->i m->d i->m i->i d->m d->d b->m m->e -300 * -2414
1 -2605 -2478 -3823 -3810 -1719 -3245 -3021 -1267 -3347 -794 5096 -3495 -3580 -3266 -3203 -3106 -2761 -1627 -2687 -2398 1
--564 -3141 -2265 -289 -2463 -701 -1378 -1300 –8788
2 1405 -805 -1720 -1394 -2431 567 -1366 -2119 -1298 -2328 -1482 -1065 -1670 -1131 -1568 1592 2088 -1424 -2629 -2251 3
- -148 -500 233 43 -381 399 106 -626 210 -466 -720 275 394 45 96 359 117 -369 -294 -249 –
HMMs are good for profile searches… but optimising the many
parameters when using HMMs to do alignments from scratch is a
problem.