Biology 162: Computational Genetics

Download Report

Transcript Biology 162: Computational Genetics

Pattern and string matching
tools
Biology 162 Computational
Genetics
Todd Vision
9 Sep 2004
Some more pattern and string
matching tools
• Simple signatures
– Logos
– Position-specific Scoring Matrices
– PSI-BLAST
• Regular expressions
• Suffix trees
Sequence logos
• Entropy of column j denoted Hj
• Information content denoted Ij
H j   f ij log 2 ( f ij )
i
I j  log 2 20  H j
• How to draw a logo
– Height of column given by Ij
– Height of each symbol = fij x Ij

Information content
• Information/Uncertainty is expressed in bits
– There is a natural relationship to log base 2
• Imagine 64 shells, under one of which is a
ball.
– 6 guesses are required to find the ball
– In this case, maximal uncertainty is log264=6 bits
• In the case of 20 amino acids, maximal
uncertainty is log220=4.32 bits.
Position-Specific Scoring
Matrix
• Constructed from conserved columns of a
MSA
• Log odds scores for each residue in each
column, based on
– Frequency of residue within column
– Background frequency of residues
• Takes advantage of the fact that columns
differ in
– Composition
– Levels of conservation
Position Specific Scoring
Matrix
pos
1
2
3
4
5
con
M
W
I
L
A
A
-1
-3
-1
-2
4
R
-3
-3
-3
-3
-2
N
-3
-4
-2
-2
-2
D
-4
-5
-3
-3
-2
C
-1
-3
7
-3
-2
… A
… 0
… 0
… 0
… 0
… 56
R
0
0
0
0
0
N
0
0
0
0
0
D
0
0
36
0
0
C
0
0
0
0
0
PSI-BLAST PSSM for DSCAM
…
…
…
…
…
…
Inf
0.50
2.32
0.71
0.47
0.52
Pseu
0.16
0.26
0.26
0.35
0.35
Pseudocounts
• If a residue is never seen in a particular
column in of a MSA
– What is the probability of ever seeing it there?
– Not really zero…
• Pseudocounts are added to actual counts to
account for uncertaintly in column
frequencies
• Many methods
– Laplace’s Rule
• Add one to every count
• Psudocounts grow less important as sample size gets
large
– Methods related to Bayesian priors - we will see
Calculating scores in a PSSM
•
•
•
•
•
Sij is score for residue i at position j
xij is position-specific count of residue i
fi is background frequency of residue i
bij are pseudocounts
N sequences in alignment



1
Sij  log 2x ij  bij N   bij  f i 




i


PSI-BLAST
• Can identify more distant homologs
than possible via pairwise BLAST
• Iterative BLAST
– After 1st iteration, multiple alignment is
computed for query and top matches
– PSSM generated from alignment
– PSSM used for subsequent iterations
– PSSM refined each iteration
PSI-BLAST
• Once high-scoring words are generated
from PSSM, algorithm proceeds as
before
– Still very fast
• l and K must be recalculated for each
iteration
Regular Expressions (regex)
• Can be thought of as a non-probabilistic
rule for generating (or matching) a
pattern
• Used for
– DNA/Protein signatures (e.g. Prosite)
– Text parsing (e.g. in Perl)
Prosite regexes
ID
AC
DT
DE
PA
CBD_FUNGAL; PATTERN.
PS00562;
DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
Cellulose-binding domain, fungal type.
C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C
In Perl regex syntax:
CGG\w{4,7}G\w{3}C\w{5}C\w{3,5}[NHG]\w[FYWM]\w{2}QC
In words:
C followed by G followed by G followed by any 4 to 7 letters followed by G
followed by any 3 letters followed by C followed by any 5 letters
followed by C followed by an 3 to 5 letters followed by one of N, H or G,
followed by any letter followed by one of F, Y, W, or M followed by any
two letters followed by Q followed by C
Perl regex metacharacters
•
•
•
•
•
•
•
•
•
[ ] - character class (e.g. [abc] = a, b or c)
{min, max} - quantifiers
{exactly}
* - repetition, zero or more
+ - repetition, one or more
? - optional, zero or one
. - wildcard (any character)
( ) - capture or delimit substrings
| - alternation (e.g. (a|b) = either a or b)
Regular expressions
Pattern
a[bc]d
ab{2,5}c
ab*c
ab+c
ab?c
a(bc|de)
Matches
abd, acd
abc, abbc, … abbbbbc
ac, abc, abbc, …
abc, abbc, …
ac, abc
abc, ade
Regular expressions:
limitations
• Non-probabilistic: all matches match
equally well
– Hidden Markov models improve upon this
• Cannot model dependencies among
different positions
– Neither can HMMs
– For RNA matches, where dependencies
matter, we need to allow more complex
rules
Chomsky hierarchy of
transformational grammars:
a preview
• General theory for modelling strings of
symbols used in linguistics
–
–
–
–
Regular grammars
Context-free grammars
Context-sensitive grammars
Unrestricted grammars
• Regular grammars (like regexes) are easy to
parse, but are structurally limited
• We will see context sensitive grammars for
modelling RNA sequences
Suffix Trees
• Data structure used for fast matching of
sequence patterns
• Helps to explain how BLAST can find
word matches so fast
• Commonly used for
– Exact matching
– Identifying repeated sequences
Suffix Trees
•
•
•
•
Rooted, directed tree for string S
|S| = m leaves, labeled 1..m
Edges labelled with substrings of S
Internal node has at most one edge for
each symbol in alphabet
• Concatenation of edge labels on path
from root to leaf i equals suffix S[1..m]
Suffix Trees: An Example
S = ‘gatgac’
root
a
c
3
6
c
5
ga
tgac
c
2
4
tgac
1
Least common ancestor
• LCA corresponds to shared prefix of
suffix (e.g. path labeled ‘ga’ for nodes 1
and 4)
• LCA can be retrieved
in constant time
root
a
c
3
6
c
5
ga
tgac
c
2
4
tgac
1
If suffix trees are the answer,
what is the question?
• Rapid word matching
• Find all occurrences of ‘ga’ in S =
‘gatgac’
root
a
c
3
6
c
5
ga
tgac
c
2
4
tgac
1
If suffix trees are the answer,
what is the question?
• Longest common substring problem
• Find the starting positions, length and identity
of the longest substring that occurs in both S1
and S2
root
S1 = ‘gatgac’
S2 = ‘gatcac’
t
a
c
t
ac c gac cac
cac
3
ga
3
6 6 4
5 5
2 2
t
c
4
cac
gac
1
1
If suffix trees are the answer,
what is the question?
• Find all direct palindromes (a substring
concatenated with its reverse) in
S=‘agattagct’
• Observation
– Let Sr=‘tcgattaga’
– If a palindrome is centered between q and q+1 of
S, then it is also centered between m-q and m-q+1
of Sr.
• Solution
– Construct joint suffix tree for S and Sr, find least
common ancestor for all pairs q+1, n-q+1
Myriad uses for suffix trees
• Direct and inverted repeats
– Microsatellites
– Transposons
• Inverted palindromes
– Restriction enzyme recognition sites
• Imperfect matches
• Algorithmic efficiency
– Many efficient algorithms for traversing suffix trees
– The trees themselves can be constructed in O(m)
time
QuickTime™ and a
TIFF (LZW) decompressor
are needed to see this picture.
Reading assignment
(for Tuesday and Thursday)
• Durbin et al. (1998) pgs. 46-79 in
Biological Sequence Analysis.
– Markov chains
– Hidden Markov models