Sequence Alignment - Bilkent University
Download
Report
Transcript Sequence Alignment - Bilkent University
Sequence Alignment
K-tuple methods
Statistics of alignments
Phylogenetics
Database searches
What is the problem?
Large number of sequences to search your
query sequence against.
Various indexing schemes and heuristics are
used, one of which is BLAST.
heuristic is a technique to solve a problem that ignores
whether the solution can be proven to be correct, but
usually produces a good solution, are intended to gain
computational performance or conceptual simplicity
potentially at the cost of accuracy or precision.
http://en.wikipedia.org/wiki/Heuristics#Computer_science
Concepts of Sequence Similarity
Searching
The premise:
The sequence itself is not informative; it
must be analyzed by comparative
methods against existing databases to
develop hypothesis concerning relatives
and function.
Important Terms for Sequence
Similarity Searching with very different
meanings
Similarity
Identity
The extent to which nucleotide or protein
sequences are related. In BLAST similarity refers
to a positive matrix score.
The extent to which two (nucleotide or amino
acid) sequences are invariant.
Homology
Similarity attributed to descent from a common
ancestor.
Sequence Similarity Searching:
The Approach
Sequence similarity searching involves the
use of a set of algorithms (such as the
BLAST programs) to compare a query
sequence to all the sequences in a specified
database.
Comparisons are made in a pairwise
fashion. Each comparison is given a score
reflecting the degree of similarity between
the query and the sequence being
compared.
Blast
QUERY sequence(s)
BLAST results
BLAST program
BLAST
database
Topics:
There are different blast programs
Understanding the BLAST algorithm
BLAST program
Word size
HSPs (High Scoring Pairs)
Understanding BLAST statistics
The alignment score (S)
Scoring Matrices
Dealing with gaps in an alignment
The expectation value (E)
The BLAST algorithm
The BLAST programs (Basic Local
Alignment Search Tools) are a set of
sequence comparison algorithms
introduced in 1990 for optimal local
alignments to a query.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment
search tool.” J. Mol. Biol. 215:403-410.
Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ
(1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs.” NAR 25:3389-3402.
http://www.ncbi.nlm.nih.gov/BLAST
blastn
blastx
tblastn
tblastx
blastp
Other BLAST programs
BLAST 2 Sequences (bl2seq)
Aligns two sequences of your choice
Gives dot-plot like output
More BLAST programs
BLAST against genomes
Many available
BLAST parameters pre-optimized
Handy for mapping query to genome
Search for short exact matches
BLAST parameters pre-optimized
Great for checking probes and primers
How Does BLAST Work?
The BLAST programs improved the overall
speed of searches while retaining good
sensitivity (important as databases continue
to grow) by breaking the query and database
sequences into fragments ("words"), and
initially seeking matches between fragments.
Word hits are then extended in either
direction in an attempt to generate an
alignment with a score exceeding the
threshold of “T".
Picture used with permission from Chapter 11 of “Bioinformatics:
A Practical Guide to the Analysis of Genes and Proteins”
Each BLAST “hit” generates an
alignment that can contain one or more
high scoring pairs (HSPs)
Each BLAST “hit” generates an
alignment that can contain one or more
high scoring pairs (HSPs)
Where does the score (S)
come from?
The quality of each pair-wise
alignment is represented as a score
and the scores are ranked.
Scoring matrices are used to
calculate the score of the alignment
base by base (DNA) or amino acid by
amino acid (protein).
The alignment score will be the sum
of the scores for each position.
What’s a scoring matrix?
Substitution matrices are
used for amino acid
alignments. These are
matrices in which each
possible residue substitution
is given a score reflecting the
probability that it is related to
the corresponding residue in
the query.
PAM vs. BLOSUM scoring
matrices
BLOSUM 62 is the default matrix in BLAST
2.0. Though it is tailored for comparisons of
moderately distant proteins, it performs well
in detecting closer relationships. A search
for distant relatives may be more sensitive
with a different matrix.
PAM vs BLOSUM scoring
matrices
The PAM Family
The BLOSUM family
PAM matrices are
BLOSUM matrices are
based on global
based on local alignments.
alignments of closely
BLOSUM 62 is a matrix
related proteins.
calculated from
The PAM1 is the matrix
comparisons of sequences
calculated from
with no less than 62%
comparisons of
divergence.
sequences with no
All BLOSUM matrices are
more than 1%
based on observed
divergence.
alignments; they are not
Other PAM matrices
extrapolated from
are extrapolated from
comparisons of closely
PAM1.
related proteins.
What happens if you have a gap
in the alignment?
A gap is a position in the alignment at which
a letter is paired with a null
Gap scores are negative. Since a single
mutational event may cause the insertion or
deletion of more than one residue, the
presence of a gap is frequently ascribed
more significance than the length of the gap.
Hence the gap is penalized heavily, whereas a
lesser penalty is assigned to each subsequent
residue in the gap.
Percent Sequence Identity
The extent to which two nucleotide or
amino acid sequences are invariant
AC C TG A G – AG
AC G TG – G C AG
mismatch
indel
70% identical
BLAST algorithm
Keyword search of all words of length
w in the query of default length n in
database of length m with score above
threshold
w = 11 for nucleotide queries, 3 for
proteins
Do local alignment extension for each
hit of keyword search
Extend result until longest match
above threshold is achieved and
output
BLAST algorithm (cont’d)
keyword
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
GVK 18
GAK 16
Neighborhood
GIK 16
words
GGK 14
neighborhood
GLK 13
score threshold
GNK 12
(T = 13)
GRK 11
GEK 11
GDK 11
extension
Query: 22
VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
+++DN +G +
IR L
G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
High-scoring Pair (HSP)
Original BLAST
Dictionary
All words of length w
Alignment
Ungapped extensions until score falls
below statistical threshold T
Output
All local alignments with score >
statistical threshold
Original BLAST: Example
From lectures by Serafim Batzoglou
(Stanford)
A C G A A G T A A G G T C C A G T
C T G A T C C T G G A T T G C G A
• w = 4, T = 4
• Exact keyword
match of GGTC
• Extend
diagonals with
mismatches
until score falls
below a
threshold
• Output result
GTAAGGTCC
GTTAGGTCC
Gapped BLAST: Example
Original BLAST
exact keyword
search, THEN:
Extend with
gaps in a zone
around ends of
exact match
Output result
GTAAGGTCCAGT
GTTAGGTC-AGT
From lectures by Serafim Batzoglou
(Stanford)
C T G A T C C T G G A T T G C G A
A C G A A G T A A G G T C C A G T
Gapped BLAST : Example
(cont’d)
Original BLAST
exact keyword
search, THEN:
Extend with gaps
around ends of
exact match until
score <T, then
merge nearby
alignments
Output result
GTAAGGTCCAGT
GTTAGGTC-AGT
From lectures by Serafim Batzoglou
(Stanford)
C T G A T C C T G G A T T G C G A
A C G A A G T A A G G T C C A G T
Topics:
The different blast databases provided
by the NCBI
BLAST
databases
Protein databases
Nucleotide databases
Genomic databases
Considerations for choosing a BLAST
database
Custom databases for BLAST
BLAST protein databases available at
through blastp web interface @ NCBI
blastp db
Considerations for choosing a
BLAST database
First consider your research question:
Are you looking for an ortholog in a
particular species?
Are you looking for additional members of
a protein family across all species?
BLAST against the genome of that species.
BLAST against nr, if you can’t find hits check wgs,
htgs, and the trace archives.
Are you looking to annotate genes in your
species of interest?
BLAST against known genes (RefSeq) and/or ESTs
from a closely related species.
When choosing a database for
BLAST…
It is important to know your
reagents.
Changing your choice of database is
changing your search space completely
Database size affects the BLAST
statistics
record BLAST parameters, database choice, database
size in your bioinformatics lab book, just as you would
for your wet-bench experiments.
Databases change rapidly and are
updated frequently
It may be necessary to repeat your analyses
Topics: BLAST results
Choosing the right BLAST program
Running a blastp search
BLAST parameters and options to consider
Viewing BLAST results
Look at your alignments
Using the BLAST taxonomy report
BLAST parameters and
options to consider:
conserved domains
Entrez query
E-value cutoff
Word size
More BLAST parameters and
options to consider:
filtering
matrix
gap penalities
Run your BLAST search:
BLAST
The BLAST Queue:
click for more info
Note your RID
Formatting and Retrieving
your BLAST results:
Results
options
A graphical view of your BLAST
results:
The BLAST “hit” list:
Score
E-Value
GenBank
alignment
EntrezGene
The BLAST pairwise
alignments
Identity
Similarity
Sample BLAST output
• Blast of human beta globin protein against zebra fish
Sequences producing significant alignments:
Score
E
(bits) Value
gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio] >gi|147757...
gi|18858331|ref|NP_571096.1| ba2 globin; SI:dZ118J2.3 [Danio rer...
gi|37606100|emb|CAE48992.1| SI:bY187G17.6 (novel beta globin) [D...
gi|31419195|gb|AAH53176.1| Ba1 protein [Danio rerio]
171
170
170
168
ALIGNMENTS
>gi|18858329|ref|NP_571095.1| ba1 globin [Danio rerio]
Length = 148
Score = 171 bits (434), Expect = 3e-44
Identities = 76/148 (51%), Positives = 106/148 (71%), Gaps = 1/148 (0%)
Query: 1
Sbjct: 1
Query: 61
Sbjct: 61
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK 60
MV T E++A+ LWGK+N+DE+G +AL R L+VYPWTQR+F +FG+LS+P A+MGNPK
MVEWTDAERTAILGLWGKLNIDEIGPQALSRCLIVYPWTQRYFATFGNLSSPAAIMGNPK 60
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG 120
V AHG+ V+G
+ ++DN+K T+A LS +H +KLHVDP+NFRLL + +
A FG
VAAHGRTVMGGLERAIKNMDNVKNTYAALSVMHSEKLHVDPDNFRLLADCITVCAAMKFG 120
Query: 121 KE-FTPPVQAAYQKVVAGVANALAHKYH 147
+ F
VQ A+QK +A V +AL +YH
Sbjct: 121 QAGFNADVQEAWQKFLAVVVSALCRQYH 148
3e-44
7e-44
7e-44
3e-43
Sample BLAST output (cont’d)
• Blast of human beta globin DNA against human DNA
Sequences producing significant alignments:
Score
E
(bits) Value
gi|19849266|gb|AF487523.1| Homo sapiens gamma A hemoglobin (HBG1...
gi|183868|gb|M11427.1|HUMHBG3E Human gamma-globin mRNA, 3' end
gi|44887617|gb|AY534688.1| Homo sapiens A-gamma globin (HBG1) ge...
gi|31726|emb|V00512.1|HSGGL1 Human messenger RNA for gamma-globin
gi|38683401|ref|NR_001589.1| Homo sapiens hemoglobin, beta pseud...
gi|18462073|gb|AF339400.1| Homo sapiens haplotype PB26 beta-glob...
289
289
280
260
151
149
1e-75
1e-75
1e-72
1e-66
7e-34
3e-33
ALIGNMENTS
>gi|28380636|ref|NG_000007.3| Homo sapiens beta globin region (HBB@) on chromosome
11
Length = 81706
Score = 149 bits (75), Expect = 3e-33
Identities = 183/219 (83%)
Strand = Plus / Plus
Query: 267
ttgggagatgccacaaagcacctggatgatctcaagggcacctttgcccagctgagtgaa 326
|| ||| | ||
| || | |||||| ||||| |||||||||||
||||||||
Sbjct: 54409 ttcggaaaagctgttatgctcacggatgacctcaaaggcacctttgctacactgagtgac 54468
Query: 327
ctgcactgtgacaagctgcatgtggatcctgagaacttc 365
||||||||| |||||||||| ||||| ||||||||||||
Sbjct: 54469 ctgcactgtaacaagctgcacgtggaccctgagaacttc 54507
What do the Score and the evalue really mean?
The quality of the alignment is represented
by the Score.
Score (S)
The score of an alignment is calculated as the sum of
substitution and gap scores. Substitution scores are given
by a look-up table (PAM, BLOSUM) whereas gap scores are
assigned empirically .
The significance of each alignment is
computed as an E value.
E value (E)
Expectation value. The number of different alignments with
scores equivalent to or better than S that are expected to
occur in a database search by chance. The lower the E
value, the more significant the score.
E value
E value (E)
Expectation value. The number of different
alignments with scores equivalent to or better
than S expected to occur in a database search by
chance. The lower the E value, the more
significant the score.
Assessing sequence
homology
Need to know how strong an alignment
can be expected from chance alone
“Chance” is the comparison of
Real but non-homologous
sequences
Real sequences that are shuffled to
preserve compositional properties
Sequences that are generated
randomly based upon a DNA or
protein sequence model (favored)
High Scoring Pairs (HSPs)
All segment pairs whose scores can
not be improved by extension or
trimming
Need to model a random sequence to
analyze how high the score is in
relation to chance
Expected number of HSPs
Expected number of HSPs with score > S
E-value E for the score S:
E = Kmne-lS
Given:
Two sequences, length n and m
The statistics of HSP scores are
characterized by two parameters K and λ
K: scale for the search space size
λ: scale for the scoring system
BLAST statistics to record in
your bioinformatics labbook
Record the statistics
that are found at
bottom of your
BLAST results page
Scoring matrices
Amino acid substitution matrices
PAM
BLOSUM
Bit Scores
Normalized score to be able to
compare sequences
Bit score
S’ = lS – ln(K)
ln(2)
E-value of bit score
E = mn2-S’
Assessing the significance of an
alignment
How to assess the significance of an
alignment between the comparison of a
protein of length m to a database
containing many different proteins, of
varying lengths?
Calculate a "database search" E-value.
Multiply the pairwise-comparison E-value
by the number of sequences in the
database N divided by the length of the
sequence in the database n
Homology: Some Guidelines
Similarity can be indicative of homology
Generally, if two sequences are significantly
similar over entire length they are likely
homologous
Low complexity regions can be highly similar
without being homologous
Homologous sequences not always highly
similar
Homology: Some Guidelines
Suggested BLAST Cutoffs
(source: Chapter 11 – Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins)
For nucleotide based searches, one should look
for hits with E-values of 10-6 or less and
sequence identity of 70% or more
For protein based searches, one should look for
hits with E-values of 10-3 or less and sequence
identity of 25% or more
Contributors
http://creativecommons.org/licenses/by-sa/2.0/
Odds score in sequence
alignment
The chance of an aligned amino acid
pair being found in alignments of
related sequences compared to the
chance of that pair being found in
random alignments of unrelated
sequences.
Statistical significance of an
alignment
The probability that random or
unrelated sequences could be aligned
to produce the same score.
Smaller the probability is the better.
Alignment Statistics:
For two sequences of length n and m,
n times m comparisons are being
made; thus the longest length of the
predicted match would be log1/p(mn).
Alignment Statistics:
Expectation value or the mean longest
match would be
E(M) = log1/p(Kmn), where K is a constant
that depends on amino acid or base
composition and p is the probability of a
match.
This is only true for ungapped local alignments.
Distribution of alignment
scores
resembles Gumbel extreme value
distribution.
Extreme Value Distribution
Alignment Statistics
E(M)=log1/p(Kmn) means that match length gets
bigger as the log of the product of sequence
lengths. Amino acid substitution matrices will turn
match lengths into alignment scores (S).
More commonly l = ln(1/p) is used.
Number of longest run HSP will be estimated
E = Kmne-lS
How good a sequence score is evaluated based on
how many HSPs (i.e. E value) one would expect for
that score.
Alignment Statistics
Two ways to get K and l :
For 10000 random amino acid sequences
with various gap penalties, K and lambda
parameters have been tabulated.
Calculation of the distribution for two
sequences being aligned by keeping one
of them fixed and scrambling the other,
thus preserving both the sequence length
and amino acid composition.
Alignment Statistics
Alignment Statistics
Alignment Statistics
Alignment Statistics
Gene Structure
Mutation Rates
Functional Constraint
Synonymous vs
nonsynonymous substitutions
Synonymous vs
nonsynonymous substitutions
Synonymous vs
nonsynonymous substitutions
Mutation vs substitution
Estimating substitutions
Jukes-Cantor model
Transitions vs transversions
Kimura’s 2-parameter model
Kimura’s 2-parameter model
Kimura’s 2-parameter model
Functional Constraints
Molecular Clocks
Relative Rate
Distance based phylogenetics
Distance based phylogenetics
Distance based phylogenetics
Distance based phylogenetics
Distance based phylogenetics
Distance based phylogenetics
Phylogenetics Programs