Powerpoint Show on Dot Matrix

Pairwise Sequence
Alignments
Bioinformatics
Some Bioinformatics
Programming Terminology
Model
• A model is a set of propositions or
equations describing in simplified form
some aspects of experience.
• A valid model includes all essential
elements of the concept or system it
describes, together with their interactions.
Algorithm
• An algorithm is a complete, unambiguous
procedure for solving a specified problem
in a finite number of steps.
• Algorithms leave nothing undefined and
require no intuition to achieve their end.
Five Features of an Algorithm:
• An algorithm must stop after a finite number of
steps.
• All steps of the algorithm must be precisely
defined.
• Input to the algorithm must be specified.
• Output of the algorithm must be specified.
There must be at least one output.
• An algorithm must be effective - i.e. its
operations must be basic and doable.
Data Structures:
Foundation of an Algorithm
• Choosing a data structure is one of the most
important decisions in writing a program.
• For the same operation, different data
structures can lead to vastly more or less
efficient algorithms.
• The design of data structures and
algorithms goes hand in hand.
• Once the data structure is well defined,
usually the algorithm can be simple.
Data Structure Primitives:
• Strings
• Arrays
A String Is a Linear Sequence
of Characters.
• This implies several important properties:
• Finite strings have beginnings and ends.
Thus they also have a length.
• Strings imply an alphabet.
• The elements of a string are ordered.
• A string is a one dimensional array.
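These properties map directly onto how most languages store strings. A minimal Python sketch follows; the example sequence GATTACA and all variable names are illustrative assumptions, not taken from the slides.

```python
# A DNA string treated as a one-dimensional array of characters.
seq = "GATTACA"

length = len(seq)                # a finite string has a length
first, last = seq[0], seq[-1]    # ...and a beginning and an end
alphabet = sorted(set(seq))      # the letters this particular string uses
as_array = list(seq)             # the same data as an explicit 1-D array

# The elements are ordered: reversing them gives a different string.
is_order_sensitive = seq != seq[::-1]

print(length, first, last, alphabet, is_order_sensitive)
```

Indexing by position (`seq[i]`) is what later allows two strings to label the rows and columns of a two-dimensional array.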
Two Dimensional Arrays
Pairwise Sequence Alignment is
Fundamental to Bioinformatics
• It is used to decide if two proteins (or
genes) are related structurally or
functionally
• Two Dimensional Arrays are the basis of
Pairwise Alignments
• It is used to identify domains or motifs
that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
Are there other sequences like
this one?
• Huge public databases - GenBank,
Swissprot, etc.
• Sequence comparison is the most
powerful and reliable method to
determine evolutionary relationships
between genes
• Similarity searching is based on alignment
between two strings in a 2-D array
Why Search for Similarity?
1. I have just sequenced something. What is
known about the thing I sequenced?
2. I have a unique sequence. Is there similarity to
another gene that has a known function?
3. I found a new protein in a lower organism. Is
it similar to a protein from another species?
4. I have decided to work on a new gene. The
people in the field will not give me the plasmid.
I need the complete cDNA sequence to perform
RT-PCR or some other experiment.
Definitions
• Similarity: The extent to which nucleotide or
protein sequences are related. It is based upon
identity plus conservation.
• Identity: The extent to which two sequences
are invariant.
• Conservation: Changes at a specific position of
an amino acid (or, less commonly, DNA) sequence
that preserve the physico-chemical properties
of the original residue.
RBP:        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD- 84
               + K++ +  + +GTW++MA      + L   + A   V  T  +         +L+ W+
glycodelin: 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN 81
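Identity can be counted directly from two aligned rows. The sketch below is a minimal illustration under stated assumptions: the function name `alignment_stats` is hypothetical, the two rows are the RBP and glycodelin lines of the figure joined into single strings, and conservation is deliberately not computed because it would require a substitution matrix such as BLOSUM 62.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Count identical columns and gap columns in two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(row_a, row_b) if a == b and a != gap)
    gaps = sum(1 for a, b in zip(row_a, row_b) if a == gap or b == gap)
    columns = len(row_a)
    return {
        "columns": columns,
        "identical": identical,
        "gaps": gaps,
        "percent_identity": 100.0 * identical / columns,
    }

# Aligned rows from the RBP / glycodelin example ("-" marks a gap).
rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWD-"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKAPLRVHITSLLPTPEDNLEIVLHRWEN"

stats = alignment_stats(rbp, glycodelin)
print(stats)
```

Here percent identity is taken over all alignment columns, gaps included; other conventions (for example, dividing by the shorter sequence length) give slightly different numbers.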
Definitions
• Identical - When a corresponding
character is shared between two species
or populations, that character is said to
be identical.
• Similar - The degree to which two
species or populations share identities.
• Homologous - When characters are
similar due to common ancestry, they are
homologous.
Evolution and Alignment
• Homology - two (or more) sequences have
a common ancestor.
• This is a statement about evolutionary
history.
• Similarity - two sequences are similar, by
some criterion.
• It does not refer to any historical
process, just to a comparison of the
sequences by some method.
• It is a logically weaker statement.
Caution
• In bioinformatics these two terms are
often confused and used interchangeably.
• The reason is probably that significant
similarity is such a strong argument for
homology.
Similarity ≠ Homology
1) 25% similarity over ≥ 100 AAs is strong
evidence for homology
2) Since homology is an evolutionary
statement, there should be additional
evidence which indicates “descent from
a common ancestor”
– common 3D structure
– usually common function
3) Homology is all or nothing. You cannot
say "50% homologous"
Figure: retinol-binding protein (NP_006735) and
β-lactoglobulin (P02754). Page 42
Definitions, Cont’d.
• Analogous - Characters are similar due to
convergent evolution.
• Orthologous - Homologous sequences (or
characters) between different species that
descended from a common ancestral gene during
speciation; They may or may not be responsible for
a similar function.
• Paralogous - Homologous sequences within a single
species that arose by gene duplication.
• Homology is therefore NOT synonymous with
similarity.
• Homology is a judgment, similarity is a
measurement.
Proteins or Genes Related by
Evolution Share a Common Ancestor
• Random mutations in the sequences
accumulate over time, so that proteins or
genes that have a common ancestor far
back in time are not as similar as proteins
or genes that diverged from each other
more recently.
• Analysis of evolutionary relationships
between protein or gene sequences
depends critically on sequence alignments.
Function is Conserved
• Alignments can reveal which parts of the
sequences are likely to be important for the
function, if the proteins are involved in similar
processes.
• In parts of the sequence of a protein which
are not very critical for its function, random
mutations can easily accumulate.
• In parts of the sequence that are critical for
the function of the protein, hardly any
mutations will be accepted; nearly all changes
in such regions will destroy the function.
Sequence Alignments
• Comparing sequences provides information as to
which genes have the same function
• Sequences are compared by aligning them –
sliding them along each other to find the most
matches with a few gaps
• An alignment can be scored – count matches,
and can penalize mismatches and gaps
• It is much easier to align proteins. Why?
Why Search with Protein,
not DNA Sequences?
1) 4 DNA bases vs. 20 amino acids - less
chance similarity
2) can have varying degrees of similarity
between different AAs
- # of mutations, chemical similarity, PAM
matrix
3) protein databanks are much smaller than
DNA databanks
Similarity is Based on Dot Plots
1) two sequences on vertical and
horizontal axes of graph
2) put dots wherever there is a match
3) diagonal line is region of identity
(local alignment)
4) apply a window filter - look at a group
of bases, must meet % identity to get
a dot
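Steps 1 to 3 above can be sketched in a few lines of Python. This is an illustrative sketch, not the code of any particular dot plot program: the function names `dot_plot` and `render` and the example sequence are assumptions, and the window filter of step 4 is left out here.

```python
def dot_plot(s, t):
    """Return a len(s) x len(t) matrix that is True wherever s[i] == t[j]."""
    return [[a == b for b in t] for a in s]

def render(matrix, s, t):
    """Draw the matrix as text: t across the top, s down the side."""
    lines = ["  " + t]
    for ch, row in zip(s, matrix):
        lines.append(ch + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)

s = "GATTACA"
m = dot_plot(s, s)          # comparing a sequence with itself
print(render(m, s, s))
```

In the printed grid the main diagonal is fully marked (a region of identity), and the remaining stars are single-letter chance matches.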
Definition
• Pairwise alignment:
• The process of lining up two
sequences to achieve maximal levels of
identity (and conservation, in the case of
amino acid sequences) for the purpose of
assessing the degree of similarity and the
possibility of homology.
Dot Plots
A Simple Way to Measure
Similarity
Simple Dot Plot
Dot plot filtered with 4 base
window and 75% identity
Horizontal axis: GATCAACTGACGTA
Vertical axis: GTTCAGCTGCGTAC
Dot matrix provides visual picture
of alignment
• It is used to easily spot segments of good
sequence similarity.
• The two sequences are placed on each
side of 2-dimensional matrix, and each
cell in the matrix is then filled with a
value for how well a short window of the
sequences match at that point.
Simple Dot Plot
A Limitation to Dot Matrix
Comparison
• Where part of one sequence shares a long
stretch of similarity with the other sequence, a
diagonal of dots will be evident in the matrix.
• However, when single bases are compared at
each position, most of the dots in the matrix
will be due to background similarity.
• That is, for any two nucleotides compared
between the two sequences, there is a 1 in 4
chance of a match, assuming equal frequencies
of A,G,C and T.
A Solution
• This background noise can be filtered out by
comparing groups of l nucleotides, rather than
single nucleotides, at each position.
• For example, if we compare dinucleotides (l =
2), the probability of two dinucleotides chosen
at random from each sequence matching is 1/16,
rather than 1/4.
• Therefore, the number of background matches
will be lower:
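A quick numerical check of this argument, as a sketch under the same assumptions as the text (independent positions, equal frequencies of A, G, C and T, exact matching of whole windows). The function names and the 100 x 100 example size are illustrative.

```python
def background_match_probability(window, alphabet_size=4):
    """Chance that two random windows of this length are identical,
    assuming independent positions and equal letter frequencies."""
    return (1.0 / alphabet_size) ** window

def expected_background_dots(m, n, window):
    """Expected number of chance dots when every window of a length-m
    sequence is compared with every window of a length-n sequence."""
    cells = (m - window + 1) * (n - window + 1)
    return cells * background_match_probability(window)

for l in (1, 2, 3, 4):
    print(l, background_match_probability(l), expected_background_dots(100, 100, l))
```

For two random 100-base sequences this gives 2500 expected chance dots with single bases and about 613 with dinucleotides.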
A Filtered Dot Plot
The Dot Matrix Algorithm
• The dot-matrix algorithm can be generalized
for sequences s and t of sizes m and n,
respectively, and window size l.
• For each position in sequence s, compare a
window of l nucleotides centered at that
position with each window of l nucleotides in
sequence t.
• Conceptually, you can think of windows of length
l sliding along each axis, so that all possible
windows of l nucleotides are compared between
the two sequences.
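The sliding-window comparison can be sketched as follows. Two simplifying assumptions differ from the description above and are worth stating: each window is anchored at its start position rather than centered, and a dot is placed when at least `min_matches` of the `window` positions agree (for example 3 of 4 for a 75% identity filter). The function name and the test sequences are illustrative.

```python
def windowed_dot_matrix(s, t, window, min_matches):
    """Compare every window of s with every window of t.

    dots[i][j] is True when s[i:i+window] and t[j:j+window] agree in at
    least min_matches positions. Runs in O(m * n * window) time.
    """
    rows = len(s) - window + 1
    cols = len(t) - window + 1
    dots = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(1 for k in range(window) if s[i + k] == t[j + k])
            row.append(matches >= min_matches)
        dots.append(row)
    return dots

s = "ACGTACGT"
t = "ACGAACGT"
relaxed = windowed_dot_matrix(s, t, window=4, min_matches=3)  # 75% identity
strict = windowed_dot_matrix(s, t, window=4, min_matches=4)   # exact windows only
print(relaxed[0][0], strict[0][0])
```

With the relaxed threshold the first windows (ACGT vs. ACGA) still produce a dot despite one mismatch; with the strict threshold they do not.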
Dot Matrix Sequence
Comparison Examples
Examples
• These examples come from the webpage
www.bioinformaticsonline.org
• This has a nice discussion of results in
another package, DNA Strider.
• I used the COMPARE program in SeqWeb.
• I used the BLOSUM 62 scoring matrix.
Comparing a Protein with Itself
• Proteins can be compared with themselves to
show internal duplications or repeating
sequences.
• A self-matrix produces a central diagonal line
through the origin, indicating an exact match
between the x and y axes.
• The parallel diagonals that appear off the
central line are indicative of repeated sequence
elements in different locations of the same
protein.
Haptoglobin
• Haptoglobin is a protein that is secreted into
the blood by the liver. This protein binds free
hemoglobin.
• The concentration of "free" hemoglobin (that
is, outside red blood cells) in plasma (the fluid
portion of blood) is ordinarily very low.
• However, free hemoglobin is released when
red blood cells hemolyze for any reason.
• After haptoglobin binds hemoglobin, it is
taken up by the liver.
• The liver recycles the iron, heme, and amino
acids contained in the hemoglobin protein.
Our Comparison
• Files used
– 1006264A Haptoglobin H2
• DNA sequencing shows that the intragenic
duplication within the human haptoglobin Hp2
allele was formed by a non-homologous,
probably random, crossing-over within
different introns of two Hp1 genes.
• A repeated sequence (starting with
ADDGCP...) is observed beginning at positions
30-90 and 90-150 - probably due to a
duplication event in one of these locations.
Window: 30 Stringency: 3
Blosum 62 matrix
• One of the strengths of dot-matrix
searches is that they make repeats easy
to detect by comparing a sequence
against itself.
• In self comparisons, direct repeats
appear as diagonals parallel to the main
line of identity.
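A self-comparison can also be summarized numerically by counting matches on each off-diagonal. The sketch below uses exact window matches instead of a BLOSUM 62 score, and a made-up toy sequence in which the motif ADDGCP occurs twice; the function name `repeat_offsets` and the toy sequence are assumptions for illustration only, not haptoglobin data.

```python
from collections import Counter

def repeat_offsets(seq, window):
    """Self-comparison: count exact window matches on each off-diagonal.

    An offset d is counted once for every i such that
    seq[i:i+window] == seq[i+d:i+d+window]. The main diagonal (d == 0)
    is skipped because every window trivially matches itself.
    """
    counts = Counter()
    last_start = len(seq) - window
    for i in range(last_start + 1):
        for j in range(i + 1, last_start + 1):
            if seq[i:i + window] == seq[j:j + window]:
                counts[j - i] += 1
    return counts

# Toy sequence: the motif ADDGCP appears at positions 2 and 13 (0-based).
toy = "QW" + "ADDGCP" + "HIKLM" + "ADDGCP" + "NRS"
offsets = repeat_offsets(toy, window=4)
print(offsets.most_common(1))
```

An offset with many matches corresponds to a diagonal parallel to the main line of identity, displaced by the distance between the two copies of the repeat.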
Comparison of Two Similar
Sequences
Our Comparison
• Files Used:
– P03035
• Repressor protein from Salmonella phage P22
– RPBPL
• Repressor protein from E. coli phage Lambda
• Lambda phages infect E. coli. They can be lytic and
destroy the host cell, making hundreds of progeny.
• They can also be lysogenic, and live quietly within the
DNA of the bacteria.
• A gene makes the repressor protein that prevents
the phage from going destructively lytic.
• Phage P22 is a related phage that also makes a
repressor.
• Both proteins form a dimer and bind DNA to prevent
lysis.
Dot Matrix Sequence Comparison
• A row of dots represents a region of
sequence similarity.
• Background matching also appears as
scattered dots.
• There is a decrease in background noise
as window and stringency parameters
increase.
Window: 10 Stringency: 1
Blosum 62 matrix
Window: 10 Stringency: 3
Blosum 62 matrix
Window: 30 Stringency: 1
Blosum 62 matrix
Window: 30 Stringency: 3
Blosum 62 matrix
BLAST Sequence Alignment
• Perform a search of all sequences in a
database for a match to a query sequence
- BLAST search.
– BLAST is an acronym for Basic Local
Alignment Search Tool.
• Search for patterns or domains in a
sequence.
Disadvantages to Dot Plots
• While dot-matrix searches provide a
great deal of information in a visual
fashion, they can only be considered
semi-quantitative, and therefore do not
lend themselves to statistical analysis.
• Also, dot-matrix searches do not provide
a precise alignment between two
sequences.
Some Definitions for
Sequence Alignments
Gaps and Insertions
• In an alignment, much better
correspondence can be obtained between
two sequences if a gap can be introduced
in one sequence.
• Alternatively, an insertion could be
allowed in the other sequence.
• Biologically, this corresponds to a
mutation event that eliminates a part of a
gene, or introduces new DNA into a gene.
Gaps
• Positions at which a letter is paired with a
null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause
the insertion or deletion of more than one
residue, the presence of a gap is
considered more significant than the
length of the gap.
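One common way to make the presence of a gap cost more than its length is an affine gap penalty: a large charge for opening a gap and a small charge for each additional residue. The sketch below is illustrative; the penalty values and the convention "open covers the first residue" are assumptions, and real programs differ in both.

```python
def affine_gap_penalty(length, gap_open=-10.0, gap_extend=-1.0):
    """Score for a single gap spanning `length` residues.

    The first gapped residue costs gap_open; each further residue in the
    same gap costs gap_extend. A length of zero means no gap at all.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

one_long_gap = affine_gap_penalty(5)          # one event, five residues
five_short_gaps = 5 * affine_gap_penalty(1)   # five separate events
print(one_long_gap, five_short_gaps)
```

With these values one five-residue gap scores -14 while five single-residue gaps score -50, so the alignment with a single mutational event is preferred.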
Optimal Alignment
• The alignment that is the best, given a defined
set of rules and parameter values for comparing
different alignments.
• There is no such thing as the single best
alignment, since optimality always depends on
the assumptions one bases the alignment on.
• For example, what penalty should gaps carry?
• All sequence alignment procedures make some
such assumptions.
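The dependence on parameters is easy to demonstrate by scoring two candidate alignments of the same pair of sequences under two gap penalties. This is a toy sketch: the sequences, the scoring values and the function name `score_alignment` are illustrative assumptions, and a simple per-column gap penalty is used.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Sum column scores for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

# Two ways to align GATTACA with GATACAT.
ungapped = ("GATTACA", "GATACAT")
gapped = ("GATTACA-", "GAT-ACAT")

for gap in (-1, -5):
    print(gap, score_alignment(*ungapped, gap=gap), score_alignment(*gapped, gap=gap))
```

With a mild gap penalty (-1) the gapped alignment scores higher (4 vs. -1); with a harsh one (-5) the ungapped alignment wins (-1 vs. -4). Neither is "the" best alignment independent of the scoring rules.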
Global Alignment
• An alignment that assumes that the two strings
are basically similar over the entire length of
one another.
• The alignment attempts to match them to each
other from end to end, even though parts of
the alignment are not very convincing.
• A tiny example:
LGPSTKDFGKISESREFDN
|      ||||    |
LNQLERSFGKINMRLEDA
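A global alignment of this kind is usually computed by dynamic programming (the Needleman and Wunsch approach). The sketch below is a minimal version under simplifying assumptions: a flat match/mismatch score instead of a substitution matrix, a per-column gap penalty, and illustrative parameter values. It returns one optimal alignment; ties are broken arbitrarily.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_s, aligned_t). score[i][j] holds the best
    score for aligning the prefixes s[:i] and t[:j] end to end.
    """
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # s[:i] against nothing: all gaps
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # pair s[i-1] with t[j-1]
                              score[i - 1][j] + gap,       # gap in t
                              score[i][j - 1] + gap)       # gap in s
    # Trace back from the bottom-right corner to recover one optimal path.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and s[i - 1] == t[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return score[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))

total, top, bottom = needleman_wunsch("LGPSTKDFGKISESREFDN", "LNQLERSFGKINMRLEDA")
print(total)
print(top)
print(bottom)
```

Both sequences appear in full in the output, with gaps inserted wherever that improves the total score under the chosen parameters.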
Local Alignments
• An alignment that searches for segments of the
two sequences that match well.
• There is no attempt to force entire sequences
into an alignment, just those parts that appear
to have good similarity, according to some
criterion. Using the same sequences as above,
one could get:
----------FGKI----------
          ||||
----------FGKI----------
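A local alignment can be found with the same kind of dynamic programming, with two changes: cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero (the Smith-Waterman approach). The sketch below is a minimal version with assumed flat scores (match +2, mismatch -1, gap -2) rather than a substitution matrix.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-2):
    """Local alignment by dynamic programming.

    Returns (best_score, aligned_s, aligned_t) for one highest-scoring
    pair of segments. The first best cell in row-major order is used.
    """
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(0,                            # start a new segment
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    row_s, row_t = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))

local_score, seg_s, seg_t = smith_waterman("LGPSTKDFGKISESREFDN", "LNQLERSFGKINMRLEDA")
print(local_score, seg_s, seg_t)
```

With these parameters the two sequences from the example yield the shared segment FGKI; the flanking regions are simply left out of the alignment.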
Local Alignments
• It may seem that one should always use local
alignments.
• However, it may be difficult to spot an overall
similarity, as opposed to just a domain-to-domain
similarity, if one uses only local alignment.
• So global alignment is useful in some cases.
• The popular programs BLAST and FASTA for
searching sequence databases produce local
alignments.
Are there other sequences
like this one?
1) Huge public databases - GenBank, Swissprot,
etc.
2) Sequence comparison is the most powerful
and reliable method to determine
evolutionary relationships between genes
3) Similarity searching is based on alignment
4) BLAST and FASTA provide rapid similarity
searching
a. rapid = approximate (heuristic)
b. false positives and false negatives can occur
Global vs. Local similarity
1) Global similarity uses complete aligned
sequences - total % matches
– GCG GAP program, Needleman & Wunsch
algorithm
2) Local similarity looks for best internal
matching region between 2 sequences
– GCG BESTFIT program,
– Smith-Waterman algorithm,
– BLAST and FASTA
3) dynamic programming
– optimal computer solution, not approximate
What Program to use When?
1) BLAST is fastest and easily accessed on the
Web
– limited sets of databases
– nice translation tools (BLASTX, TBLASTN)
2) FASTA works best in GCG
– integrated with GCG
– precise choice of databases
– more sensitive for DNA-DNA comparisons
– FASTX and TFASTX can find similarities in sequences with
frameshifts
3) Smith-Waterman is slower, but more sensitive
– known as a “rigorous” or “exhaustive” search
– SSEARCH in GCG and standalone FASTA
Sequence Alignments
• Sometimes only parts of sequences match
e.g. domain (longer) or motif (shorter) of
a protein or a regulatory pattern in DNA
• Poor alignments can be misleading – you
have to learn to recognize and test the
significance of an alignment
Comparing the protein kinase KRAF_HUMAN and
the uncharacterized O22558 from Arabidopsis
using BLAST
546 AA
Score = 185 bits (464), Expect = 1e-45
Identities = 107/283 (37%), Positives = 172/283 (59%), Gaps = 15/283 (5%)
Query: 337 DSSYYWEIEASEVMLSTRIGSGSFGTVYKGKWHG-DVAVKILKVVDPTPEQFQAFRNEVA 395
           D +  WEI+ +++ +  ++ SGS+G +++G +   +VA+K LK      E  + F  EV
Sbjct: 274 DGTDEWEIDVTQLKIEKKVASGSYGDLHRGTYCSQEVAIKFLKPDRVNNEMLREFSQEVF 333
Query: 396 VLRKTRHVNILLFMGYMTKD-NLAIVTQWCEGSSLYKHLHVQETKFQMFQLIDIARQTAQ 454
           ++RK RH N++ F+G  T+   L IVT++    S+Y  LH Q+  F++  L+ +A   A+
Sbjct: 334 IMRKVRHKNVVQFLGACTRSPTLCIVTEFMARGSIYDFLHKQKCAFKLQTLLKVALDVAK 393
Query: 455 GMDYLHAKNIIHRDMKSNNIFLHEGLTVKIGDFGLATVKSRWSGSQQVEQPTGSVLWMAP 514
           GM YLH  NIIHRD+K+ N+ + E   VK+ DFG+A V+   SG    E  TG+  WMAP
Sbjct: 394 GMSYLHQNNIIHRDLKTANLLMDEHGLVKVADFGVARVQIE-SGVMTAE--TGTYRWMAP 450
Query: 515 EVIRMQDNNPFSFQSDVYSYGIVLYELMTGELPYSHINNRDQIIFMVGRGYASPDLSKLY 574
           EVI   ++ P++ ++DV+SY IVL+EL+TG++PY+ +      + +V +G   P + K
Sbjct: 451 EVI---EHKPYNHKADVFSYAIVLWELLTGDIPYAFLTPLQAAVGVVQKG-LRPKIPK-- 504
Query: 575 KNCPKAMKRLVADCVKKVKEERPLFPQILSSIELLQHSLPKIN 617
           K  PK +K L+  C  +  E+RPLF +I   IE+LQ  + ++N
Sbjct: 505 KTHPK-VKGLLERCWHQDPEQRPLFEEI---IEMLQQIMKEVN 543