global alignment - University of South Carolina
Download
Report
Transcript global alignment - University of South Carolina
CSCE555 Bioinformatics
Lecture 4 Sequence Alignment
(part1)
Meeting: MW 4:00PM-5:15PM SWGN2A21
Instructor: Dr. Jianjun Hu
Course page: http://www.scigen.org/csce555
University of South Carolina
Department of Computer Science and Engineering
2008 www.cse.sc.edu.
Roadmap
Sequence analysis applications
Sequence alignment application cases
Pairewise sequence alignment
Summary
4/8/2016
2
180 Genomes Sequenced
What does this mean?
What we can do with all these
sequence information?
Typical Sequence Analysis Tasks
Comparison of sequences in order to find
similar sequences (sequence alignment)
Identification of gene-structures, reading
frames, distributions of introns and exons
and regulatory elements
Prediction of protein structures
genome mapping
Comparison of homologous sequences to
construct a molecular phylogeny
Applications of Sequence Alignment
Prediction of functions of
(gene/protein/promoters) homology
Database search
◦ Find similar sequences that are similar to our
query sequence (e.g. new gene)
Gene finding by genome comparison
Sequence divergence/phylogeny
Sequence Assembly
What Biological Sequence Similarity
Tells Us
Homology is an evolutionary relationship
that either exists or does not. It cannot
be partial.
An ortholog is a homolog with shared
function.
A paralog is a homolog that arose through
a gene duplication event. Paralogs often
have divergent function.
Similarity is a measure of the quality of
alignment between two sequences. High
similarity is evidence for homology.
Similar sequences may be orthologs or
paralogs.
Timothy Bailey
6
What is Sequence Alignment?
Given two sequences, how to measure
their similarity?
ATAACTTTAATTAA
ATCC-TTTACTAA
ATAACTTTAATTAA
ATCC-TTTAC-TAA
What is Sequence Alignment?
Arranging the primary sequences of DNA,
RNA, or protein to identify regions of
similarity that may be a consequence of
functional, structural, or evolutionary
relationships between the sequences
What is Sequence Alignment?
sequence alignment of instances of the acidic ribosomal protein P0
(L10E) from several organisms
Tasks of Sequence Alignment
Pairwise alignment
Multiple sequence alignment
Global alignment
Local assignment
Approximate alignment algorithms versus
optimal/exact alignment algorithms
Web Sequence Alignment Tool
Web Tool to Align Two Sequences
http://www.ebi.ac.uk/emboss/align
Multiple Sequence Alignment
Sequence Blasting on Web
Roadmap
Sequence analysis applications
Sequence alignment application cases
Pairwise Sequence Alignment
Summary
15
Pairwise Sequence Alignment
Pairwise sequence alignment primary tool in
sequence analysis, used for database search
and as a component of other algorithms.
What? To detect similarity
Why?
◦
◦
◦
◦
Infer phylogeny (similarity ~ distance)
Predict function
Predict structure
Predict “signals” (binding sites, splicing signals etc.
16
Problem Definition:
(Optimal) pairwise alignment consists of
considering all possible alignments of two
sequences and choosing the optimal one.
Sub-optimal (heuristic) alignment
algorithms are also very important: eg
BLAST
We will focus on optimal alignment
methods in this class.
Key Issues
The key issues are:
Types of alignments (local vs. global)
The scoring system
The alignment algorithm
Measuring alignment significance
Example Alignment
True homology: alpha globin and beta globin
HBA_HUMAN
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
G+ +VK+HGKKV
HBB_HUMAN
A+++++AH+D++ +++++LS+LH
KL
GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL
Spurious “homology”: alpha globin protein with
different strucure and function
HBA_HUMAN
LHAHKL
GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD---GS+ + G
+
+D L
++ H+ D+
++AH+
F11G11.2
GSGYLVGDSLTFVDLL-VAQHTADLLAANAALLDEFPQFKAHQE
19
A +AL D
Types of Alignments
Global—sequences aligned from end-toend.
Local—alignments may start in the
middle of either sequence
Ungapped—no insertions or deletions
are allowed
Other types: overlap alignments, repeated
match alignments
20
Local vs. Global
Pairwise Alignments
A global alignment includes all elements of the
sequences and includes gaps.
◦ A global alignment may or may not include "end gap"
penalties.
◦ Global alignments are better indicators of homology
and take longer to compute.
A local alignment includes only subsequences,
and sometimes is computed without gaps.
◦ Local alignments can find shared domains in divergent
proteins and are fast to compute
21
Scoring systems
Match scores: s(a,b) assigns a score to
each combination of aligned letters.
Examples: transition/transversion, PAM,
Blosum.
Gap score: f(g) assigns a score to a gap
of length g. Examples: linear, affine.
Scoring systems are usually additive: the
total score is the sum of the substitution
scores and all the gap scores.
22
Match Scores
Some amino acids are more
“substitutable” for each other than
others. Serine and threonine are more
alike than tryptophan and alanine.
We can introduce "mismatch costs" for
handling different substitutions.
We don't usually use mismatch costs in
aligning nucleotide sequences, since often
no substitution is per se better than any
other.
23
Match Score Tables
Match scores are
computed using a lookup
table with an entry for
each possible pair of
letters.
Match scores are often
calculated on the basis of
the frequency of
particular mutations in
very similar sequences.
24
Log-odds Match Scores
Match scores are usually log-odds scores.
s(a,b) = log(Pr(a, b | M) / Pr(a, b |
R))
Pr(a, b | M) = pab is the probability of the
residues a and b appearing assuming the correct
positions are aligned and the sequences
descended from a common ancestor.
Pr(a, b | R) is the probability of a and b if we
pick two random sequence positions to align.
Assuming independence (null model) gives Pr(a,
b | R)=qaqb.
25
Adding log-odds substitution scores
gives the log-odds of the alignment
For ungapped alignments,
the log-odds alignment
score of two sequence
segments x and y is equal
to the log-odds ratio of
the segments.
Let x = x1 x2 ...xn and y =
y1 y2 ...yn.
S s ( xi , yi )
i
log
i
log
i
p xi yi
q xi q yi
p xi yi
q xi q yi
Pr( x, y | M )
log
Pr( x, y | R )
26
Gap Scores
Gap scores are negative (they are “costs”)
Linear: gap score is proportional to
length of gap.
f(g) = -g · d
Affine: gap score is a constant plus a
linear score. With affine scores, many
small gaps cost more than one large gap.
f(g) = -d -(g-1)e, for g>0
27
Gap Score Example
28
Summary
Sequence alignment application cases
Pairwise sequence alignment