Sequential Data Mining - Computer Science
Download
Report
Transcript Sequential Data Mining - Computer Science
Sequential Data Mining
Jianlin Cheng
Department of Computer Science
University of Missouri, Columbia
2011
Sequential Data
• Sequence data: a series of alphabets or numbers
• Spatial sequence data: biological sequence, text
document, a computer program, web page, a
missile trajectory
• Temporal sequence data: speech, movie clips,
stock prices, a series of web clicks
Key Issue
• Measure the similarity between two
sequences, i.e. sequence comparison
• Sequence database search / retrieval
• Sequence classification
• Sequence clustering
• Sequence pattern recognition
• Sequence similarity definition and calculation
is a challenge compared to vector-based data
Sequence Alignment
•
•
•
•
Use biological sequences as an example
Optimal pairwise sequence alignment algorithm
Fast, heuristic sequence alignment (BLAST)
Statistical significance of sequence alignment
score
Sequence Data
• DNA Sequence (gene)
gaagagagcttcaggtttggggaagagaca
acaactcccgctcagaagcaggggccgata
• RNA Sequence
caaaucacuacacacaggguagaagguggaacgcacaggagcaugu
caacggggugc
• Protein Sequence
RELQVWGRDNNSLSEAGANRQGDVSFNLPQITLWQRPL
VTIKIGGQLKEALLDTGADDTVLEDIDLPGKWKP
Global Pairwise Sequence Alignment
ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL
ITAKPQWLKTSESVTFLSFLLPQTQGLYHL
Global Pairwise Sequence Alignment
ITAKPAKTTSPKEQAIGLSVTFLSFLLPAGVLYHL
ITAKPQWLKTSESVTFLSFLLPQTQGLYHL
ITAKPAKT-TSPKEQAIGLSVTFLSFLLPAG-VLYHL
ITAKPQWLKTSE-------SVTFLSFLLPQTQGLYHL
Alignment (similarity) score
Three Main Issues
1.Definition of alignment score
2.Algorithms of finding the optimal alignment
3.Evaluation of significance of alignment score
A Simple Scoring Scheme
• Score of a character pair: S(match)=1,
S(not_match) = -1, S(gap-char) = -1
n
• Score of an alignment =
S
i
1
ITAKPAKTPTSPKEQAIGLSVTFLSFLLPAGWVLYHL
ITAKPQWLKSTE-------SVTFLSFLLPQTQGLYHL
5 – 7 – 7 +10 -4 + 4= 1
Optimization Problem
• How many possible alignments exist for two
sequences with length m and n?
• How can we find the best alignment to
maximize alignment score?
Possible Alignments
Shortest
ACTG--ATGGATG
ACTG
ATGGATG
Intermediate
AC-TG--ATG-GATG
Longest
ACTG----------ATGGATG
Total Number of Possible Alignments
AGATCAGAAAT-G
--AT-AG-AATCC
A G A A T T C A A G GA A A A A T T C G C
m+n
Total Number of Alignments
Select m positions out of m+n possible positions:
m n (m n)!
m!n!
m
Exponential!
If m = 300, n = 300, total = 1037
Needleman and Wunsch Algorithm
• Given sequences P and Q, we use a matrix M to record the
optimal alignment scores of all prefixes of P and Q. M[i,j] is
the best alignment score for the prefixes P[1..i] and Q[1..j].
• M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
Needleman and Wunsch Algorithm
• Given sequences P and Q, we use a matrix M to record the
optimal alignment scores of all prefixes of P and Q. M[i,j] is
the best alignment score for the prefixes P[1..i] and Q[1..j].
• M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
Dynamic Programming
Dynamic Programming Algorithm
Three-Step Algorithm:
•Initialization
•Matrix filling (scoring)
•Trace back (alignment)
1. Initialization of Matrix M
_ A
T
_
0
i
A
G
A
-1
T
C
A
G
-4
A
A
A
-8
T
G
-11
-2
-3
-5
-6
-7
-9
-10
-12
-1
-2
j
A
-3
G
-4
A
-5
A
-6
T
-7
2. Fill Matrix
_
_
0
A
-1
A
G
A
-1
1
-2
0
-3
-1
T
C
A
G
-4
T
-2
A
-3
G
-4
A
-5
A
-6
T
-7
-2
-5
-3
-6
-4
-7
-5
A
A
A
-8
-6
-9
-7
-10
-8
T
G
-11
-9
-12
-10
M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
2. Fill Matrix
_
_
0
A
-1
T
-2
A
G
A
T
C
A
G
A
A
A
-1
1
0
-2
0
0
-3
-1
-1
T
G
-4
-2
A
-3
G
-4
A
-5
A
-6
T
-7
0
-5
-3
-1
-6
-4
-2
-7
-5
-3
-8
-6
-4
-9
-7
-5
-10
-8
-6
-11
-9
-7
-12
-10
-8
M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
2. Fill Matrix
_
_
0
A
-1
T
-2
A
-3
G
A
A
T
-4
-5
-6
-7
A
G
A
-1
1
0
-1
-2
-3
-4
-5
-2
0
0
-1
0
-1
-2
-3
-3
-1
-1
1
-1
1
0
-1
T
C
A
G
A
A
A
-4
0
0
0
0
0
1
T
G
-2
-5
-3
-1
-1
-1
-1
-1
0
-6
-4
-2
0
-1
0
0
-1
-7
-5
-3
-1
1
0
-1
-1
-8
-6
-4
-2
0
2
1
0
-9
-7
-5
-3
-1
1
3
2
-10
-8
-6
-4
-2
0
2
2
-11
-9
-7
-5
-3
-1
1
3
-12
-10
-8
-6
-4
-2
0
2
M[i,j] =
max [
M[i-1,j-1] + S(P[i],Q[j]),
M[i,j-1] + S(-, Q[j])
M[i-1,j] + S(P[i], -)
]
3. Trace Back
_
_
A
G
A
T
C
A
AGATCAGAAATG
G
--AT-AG-AATA
A
A
T
G
0
A
T
A
-1
-2
-3
G
A
A
T
-4
-5
-6
-7
-1
1
0
-1
-2
-3
-4
-5
-2
0
0
-1
0
-1
-2
-3
-3
-1
-1
1
-1
1
0
-1
-4
-2
0
0
0
0
0
1
-5
-3
-1
-1
-1
-1
-1
0
-6
-4
-2
0
-1
0
0
-1
-7
-5
-3
-1
1
0
-1
-1
-8
-6
-4
-2
0
2
1
0
-9
-7
-5
-3
-1
1
3
2
-10
-8
-6
-4
-2
0
2
2
-11
-9
-7
-5
-3
-1
1
3
-12
-10
-8
-6
-4
-2
0
2
Pairwise Alignment Algorithm Using Dynamic
Programming
• Initialization: Given two sequences with length m and n,
create a (m+1)×(n+1) matrix M. Initialize the first row and first
column according to scoring matrix.
• For j in 1..n (column)
for i in 1..m (row)
M[i,j] = max( (M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j), M[i1, j] + S(i,-) )
Record the selected path toward (i,j)
• Report alignment score M[m,n] and trace back to M[0,0] to
generate the optimal alignment.
What is the time complexity of this algorithm?
Local Sequence Alignment Using DP
• Biological sequences often only have local similarity.
• During evolution, only functional and structural
important regions are highly conserved.
• Global alignment sacrifices the local similarity to
maximize the global alignment score.
• Need to use alignment method to identify the local
similar regions disregard of other dissimilar regions.
Transcription binding site
Local Alignment Algorithm
Goal: find an alignment of the substrings of P and Q with
maximum alignment score.
Naïve Algorithm:
(m+1)*m/2 substrings of P, (n+1)*n/2 substrings of Q
Using DP for each substring pairs:
m2 * n2 * O(mn) = O(m3n3)
Smith-Waterman Algorithm
Same dynamic programming algorithm as global
alignment except for three differences.
1. All negative scores is converted to 0 (why?)
2. Alignment can start from anywhere in the
matrix
3. Alignment can end at anywhere in the
matrix
Local Alignment Algorithm
• Initialization: Given two sequences with length m and n,
create a (m+1)×(n+1) matrix M. Initialize the first row and first
column to 0s.
• For j in 1..n (column)
for i in 1..m (row)
M[i,j] = max( 0, M[i-1,j-1]+S(i,j), M[i,j-1]+S(-,j),
M[i-1, j] + S(i,-) )
Record the selected path.
• Find elements in matrix M with maximum values. Trace back
till 0 and report the alignment corresponding to the path.
1. Initialization
_
_
0
A
0
G
0
A
0
T
0
C
0
A
0
G
0
A
0
A
0
A
0
T
0
G
0
A
0
T
0
A
0
G
0
A
0
A
0
T
0
2. Fill the Matrix
_
A
T
A
G
A
A
T
_
0
0
0
0
0
0
0
0
A
0
1
0
1
0
1
1
0
G
0
0
0
0
2
1
0
0
A
0
1
0
1
0
3
2
1
T
0
0
2
1
0
2
2
3
C
0
0
1
0
0
1
1
2
A
0
1
0
2
1
1
2
1
G
0
0
0
1
3
2
1
1
A
0
1
0
1
2
4
3
2
A
0
1
0
1
1
3
5
4
A
0
1
0
1
0
2
4
4
T
0
0
2
1
0
1
3
5
G
0
0
1
1
2
1
2
4
3. Trace Back
_
A
T
A
G
A
A
T
_
0
0
0
0
0
0
0
0
A
0
1
0
1
0
1
1
0
G
0
0
0
0
2
1
0
0
A
0
1
0
1
0
3
2
1
T
0
0
2
1
0
2
2
3
C
0
0
1
0
0
1
1
2
A
0
1
0
2
1
1
2
1
G
0
0
0
1
3
2
1
1
A
0
1
0
1
2
4
3
2
A
0
1
0
1
1
3
5
4
A
0
1
0
1
0
2
4
4
T
0
0
2
1
0
1
3
5
G
0
0
1
1
2
1
2
4
Two Best Local Alignments
Local Alignment 1:
ATCAGAA
AT-AGAA
Local Alignment 2:
ATCAGAAAT
AT-AGA-AT
Scoring Matrix
• How to accurately measure the similarity
between amino acids (or nucleotides) is one
key issue of sequence alignment.
• For nucleotides, a simple identical / not
identical scheme is mostly ok.
• Due to various properties of amino acids, it is
hard and also critical to measure the similarity
between amino acids.
Evolutionary Substitution Approach
• During evolution, the substitution of similar
amino acids is more likely to be selected
within a protein family of similar sequences
than random substitutions (M. Dayhoff)
• The frequency / probability that one residue
substitutes another one is an indicator of their
similarity.
PAM Scoring Matrices
(M. Dayhoff)
• Select a number of protein families.
• Align sequences in each family and count the
frequency of amino acid substitution in each
column: Pij.
• Similarity score is logarithm of the ratio of
observed substitution probability over the
random substitution probability. S(i,j) = log(Pij /
(Pi * Pj)). Pi is the observed probability of residue
i and Pj is the observed probability of residue j
• PAM: Point Accepted Mutation
• Let data tell strategy! - NBA
A Simplified Example
ACGTCGAGT
ACCACGTGT
CACACTACT
ACCGCATGA
ACCCTATCT
TCCGTAACA
ACCATAAGT
AGCATAAGT
ACTATAAGT
ACGATAAGT
Chars
A
C
G
T
Prob.
6 / 10
1 / 10
2 / 10
1 / 10
P(A<->C) = 0.07+0.07=0.14
Substitution Frequency Table
A
C
G
T
A
30
6
12
6
C
6
0
2
1
G
12
2
1
2
T
6
1
2
0
Total number of substitutions: 90
A
C
G
T
A
.33
.07
.14
.07
C
.07
0
.02
.01
G
.14
.02
.01
.02
T
.07
.01
.02
0
A Simplified Example
ACGTCGAGT
ACCACGTGT
CACACTACT
ACCGCATGA
ACCCTATCT
TCCGTAACA
ACCATAAGT
AGCATAAGT
ACTATAAGT
ACGATAAGT
Chars
A
C
G
T
Prob.
6 / 10
1 / 10
2 / 10
1 / 10
Substitution Frequency Table
A
C
G
T
A
30
6
12
6
C
6
0
2
1
G
12
2
2
2
T
6
1
2
0
Total number of substitutions: 90
A
C
P(A<->C) = 0.07+0.07=0.14
S(A,C) = log(0.14/(0.6*0.1)) = 0.36 G
T
A
.33
.07
.14
.07
C
.07
0
.02
.01
G
.14
.02
.01
.02
T
.07
.01
.02
0
PAM250 Matrix (log odds multiplied by 10)
BLOSUM Matrices
(Henikoff and Henikoff)
• PAM calculation is based on global alignments
• PAM matrices don’t work well for aligning
sequences with little similarity.
• BLOSUM: BLOcks SUbstitution Matrix
• BLOSUM based on highly conserved local
regions /blocks without gaps.
Block 1
Block2
BLOSUM62 Matrix
Significance of Sequence Alignment Score
• Why do we need significant test?
• Mathematical view: unusual versus “by
chance”
• Biological view: evolutionary related or not?
Randomization Approach
• Randomization is a fundamental idea due to
a statistician Fisher.
• Randomly permute chars within sequence P
and Q to generate new sequences (P’ and
Q’). Align random sequences and record
alignment scores.
• Assuming these scores obey normal
distribution, compute mean (u) and standard
derivation (σ) of alignment scores
95%
Alignment score
Normal distribution of alignment scores of two sequences
•If S = u+2 σ, the probability of observing the alignment score equal to or more
extreme than this by chance is 2.5%, e.g., P(S>=u+2 σ) = 2.5%.
Thus we are 97.5% confident that the alignment score is not by chance, i.e. significant.
•For any score x, we can compute P(S >= x), which is called p-value.
Model-Based Approach
(Karlin and Altschul)
http://www.people.virginia.edu/~wrp/cshl02/Altschul/Altschul-3.html
• Extreme Value
Distribution
K and λ are statistical parameters depending
on substitution matrix. For BLOSUM62, λ =0.252, K=0.35
P-Value
• P(S≥x) is called p-value. It is the probability
that two random sequences has an alignment
score >= x.
• Smaller means more significant.
• E-value = p-value * # sequences
Problems of Using Dynamic Programming to
Search Large Sequence Databases
• Search homologs in DNA and protein database is
often the first step of a bioinformatics study.
• DP is too slow for large sequence database search
such as Genbank and UniProt. Each DP search can
take hours.
• Most DP search time is wasted on unrelated
sequences or dissimilar regions.
• Developing fast, practical sequence comparison
methods for database search is important.
Fast Sequence Search Methods
• All successful, rapid sequence comparison methods
are based on a simple fact: similar sequences share
some common words.
• First such method is FASTP (Pearson & Lipman, 1985)
• Most widely used methods are BLAST (Altschul et al., 1990)
and PSI-BLAST (Altschul et al., 1997).
Basic Local Alignment Search Tool
(S. Altschul, W. Gish, W. Miller, E. Meyer and D. Lipman)
1. Compile a list of words for
a query
2. Scan sequences in
database for word hits
3. Extending hits
David Lipman
Stephen Altschul
Step 1: Compile Word List
• Words: w-mer with length w.
• Protein 4-mer and DNA 12-mer
Query:
DSRSKGEPRDSGTLQSQEAKAVKKTSLFE
Words: DSRS, SRSK, RSKG, KGEP….
Step 2: Scan Database
Classic problem: find occurrence of a
list of words in a sequence.
•Integer indexing approach (hashing)
•Deterministic finite automaton or finite state machine (faster)
Finite State Machine
• A finite state machine for string matching:
(input alphabet: a,b,c)
Word: abab
a
a
a
0 a 1 b 2 a 3 b 4
Database sequence: bcabccaaababacababacabb
Step 3: Extension
• Extend words on both ends
• Terminate the process when we reach a segment pair
whose score falls a certain distance below the best
score found for shorter extensions.
• Depart from the ideal of finding a guaranteed
Maximum Segment Pair, but the added inaccuracy is
negligible.
• Report significant MSP according to extreme value
distribution
An Example of Extension
Query:
DSRSKGEPRDSGTLQSQEAKAVKKTSLFE
Words: DSRS, SRSK, RSKG, KGEP….
Database Sequence: PESRSKGEPRDSGKKQMDSOKPD
Maximum Segment Pair: ESRSKGEPRDSG
P-Value and E-Value
• P-value
• E-value = database
size * p-value
• Common threshold:
0.01
S
P-value = Prob(score >=S)
Why is BLAST so Successful?
•
•
•
•
Address a fundamental problem
Extremely fast
Simple, yet powerful idea
Well founded in computer theory (wordsstring matching, hashing, random process of
Maximum Segment Pair)
• Implementation tricks -> super speed software
• Sacrifice a little accuracy for speed practically
(good heuristics) Realism
NCBI Online Blast
DNA Blast
Protein
Blast
Matched sequences ranked by score and evalue
Output
Format
Significant local alignments
Application of Pairwise Sequence
Comparison
•
•
•
•
Database search
Clustering
Classification
Motif recognition