Bioinformatics - Computer Science : The University of Akron
Roadmap
The topics:
basic concepts of molecular biology
more on Perl
overview of the field
biological databases and database searching
sequence alignments
phylogenetics
structure prediction
microarray data analysis
Sequence alignments
Introduction
What is an alignment?
Why do alignments?
A bit of history
Dot matrix comparison
Scoring alignments
Alignment methods
Significance of alignments
What is sequence alignment?
Sequence alignment is an arrangement of
two or more sequences, highlighting their
similarity.
Why do alignments?
Sequence alignment is useful for
discovering structural, functional, and
evolutionary information in biological
sequences.
Over time, genes accumulate mutations
Environmental factors
Radiation
Oxidation
Mistakes in replication/repair
Deletions, Duplications
Insertions
Inversions
Point mutations
Comparing two sequences
Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
Insertions/deletions, must align:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
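As a minimal sketch of the "easy" case above, two sequences of equal length can be compared position by position. The function name point_differences is hypothetical and only illustrates the idea; it refuses sequences of unequal length, which is exactly the situation where an alignment is needed first.

```python
def point_differences(seq_a, seq_b):
    """Return the 1-based positions where two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences differ in length; align them first")
    return [i + 1 for i, (x, y) in enumerate(zip(seq_a, seq_b)) if x != y]


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"

# The two sequences from the slide differ at three positions.
print(point_differences(reference, mutated))  # [10, 15, 19]
```

Passing the shorter sequence CTGATTCGCATCGTCTATCT raises the ValueError, since a position-by-position comparison is not meaningful until gaps have been placed.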
Sequence Alignment
Doolittle RF, Hunkapiller MW, Hood LE,
Devare SG, Robbins KC, Aaronson SA,
Antoniades HN. Science 221:275-277, 1983.
Russell F. Doolittle
The sequence of platelet-derived growth factor (PDGF) from mammalian cells was
found to be virtually identical to the sequence of the retrovirus-encoded
oncogene known as v-sis (a gene causing cancer in animals).
The retrovirus had apparently acquired the gene from the host cell in some kind
of genetic exchange event and had then produced a mutant that
could alter the function of the normal protein when it infected
another animal.
Dot Matrix Comparison
A: T C A G A G G T C T G
B: T C A G A G C T G
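B\A T C A G A G G T C T G
T   X . . . . . . X . X .
C   . X . . . . . . X . .
A   . . X . X . . . . . .
G   . . . X . X X . . . X
A   . . X . X . . . . . .
G   . . . X . X X . . . X
C   . X . . . . . . X . .
T   X . . . . . . X . X .
G   . . . X . X X . . . X
(X marks a position where the residue of A matches the residue of B)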
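The dot matrix for sequences A and B can be sketched in a few lines. This is an illustrative version only: the function names dot_matrix and render are hypothetical, sequence A is placed across the top and sequence B down the side, and "." is used for empty cells.

```python
def dot_matrix(seq_a, seq_b):
    """One row per residue of seq_b; True where it equals the residue of seq_a."""
    return [[x == y for x in seq_a] for y in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text, with seq_a across the top and seq_b down the side."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("TCAGAGGTCTG", "TCAGAGCTG"))
```

The run of X marks along the main diagonal corresponds to the shared prefix TCAGAG of the two sequences.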
Interpretation of dot matrix
Regions of similarity appear as diagonal runs of dots
Reverse diagonals (perpendicular to diagonal) indicate
inversions
Can link or "join" separate diagonals to form
alignment with "gaps"
More on Dot Matrix
Detection of matching regions can be improved by
filtering:
use a sliding window to compare the two
sequences. For example, print a dot at a matrix
position only if
7 out of the next 11 positions in the sequences
are identical, or
the similarity score of the next 11 positions in the
sequences is greater than 5.
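The identity-based filter can be sketched as follows. The function name windowed_dots is hypothetical, and "7 out of 11" is read here as "at least 7", which is an assumption about the intended threshold. Windows that would run past the end of either sequence are skipped.

```python
def windowed_dots(seq_a, seq_b, window=11, min_matches=7):
    """Return 0-based (i, j) pairs whose windows share at least min_matches identities."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical residues between the window at i and the window at j.
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                dots.append((i, j))
    return dots


seq_1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq_2 = "ACGTCTGATTCGCCCTATCGTCTATCT"
dots = windowed_dots(seq_1, seq_2)
print((0, 0) in dots)  # True: the first windows differ at only one position
```

Raising min_matches or widening the window suppresses more of the isolated chance matches, at the cost of missing short regions of similarity.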
Sequence repeats
Many sequences contain repetitive regions.
Figure: a retrovirus vector sequence compared against itself using a window size of 9 and a mismatch limit of 2
(http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)
More on Dot Matrix
Dot matrix graphically presents regions of identity or
similarity between two sequences
The use of windows and thresholds can reduce
“noise” in dot matrix
Inversions and duplications have unique “signatures”
in dot matrix
Software
Dotlet (Java applet): www.ch.embnet.org
Dnadot: arbl.cvmbs.colostate.edu/molkit/dnadot/
Dotter: www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
Dottup: www.emboss.org
How to measure the similarity
Basically, three kinds of changes can occur at any
given position within a sequence:
Mutation (substitution of one residue by another)
Insertion
Deletion
Insertions and deletions have been found to occur in
nature at a significantly lower frequency than
mutations.
Scoring Matrices for Aligning DNA Sequences
Identity matrix
    A   T   C   G
A   1   0   0   0
T   0   1   0   0
C   0   0   1   0
G   0   0   0   1

BLAST matrix
    A   T   C   G
A   5  -4  -4  -4
T  -4   5  -4  -4
C  -4  -4   5  -4
G  -4  -4  -4   5

Transition-Transversion matrix
    A   T   C   G
A   1  -5  -5  -1
T  -5   1  -1  -5
C  -5  -1   1  -5
G  -1  -5  -5   1
Transitions: substitutions in which a purine (A/G) is replaced by
another purine (A/G) or a pyrimidine (C/T) is replaced by
another pyrimidine (C/T).
Transversions: substitutions in which a purine (A/G) is replaced by a pyrimidine (C/T), or vice versa.
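The three DNA scoring matrices can be generated rather than typed in. The sketch below is illustrative: the names build_matrix and transition_transversion_score are hypothetical, and the scores (1/0, 5/-4, and 1/-1/-5) are the ones shown in the matrices above.

```python
BASES = "ATCG"
PURINES = {"A", "G"}  # C and T are the pyrimidines


def build_matrix(match, mismatch):
    """Uniform match/mismatch matrix as a dict keyed by (base, base)."""
    return {(x, y): match if x == y else mismatch for x in BASES for y in BASES}


def transition_transversion_score(x, y, match=1, transition=-1, transversion=-5):
    """Score a pair of bases, penalizing transversions more than transitions."""
    if x == y:
        return match
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = (x in PURINES) == (y in PURINES)
    return transition if same_class else transversion


IDENTITY = build_matrix(1, 0)
BLAST = build_matrix(5, -4)
TS_TV = {(x, y): transition_transversion_score(x, y) for x in BASES for y in BASES}

print(TS_TV[("A", "G")], TS_TV[("A", "C")])  # -1 -5
```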
Scoring a sequence alignment
Match score: +1
Mismatch score: +0
Gap penalty: –1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Gaps: 7 × (–1)
Score = +11
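The arithmetic on this slide can be reproduced with a short function. This is a sketch under the slide's scoring scheme (match +1, mismatch 0, gap -1 per gapped position); the function name score_alignment is hypothetical.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score two aligned rows of equal length with a per-position gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap          # every gapped position costs the same
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))  # 11
```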
Gap opening and extension penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to
you?
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT
ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT
We can achieve this by penalizing the opening of a new gap
more heavily than the extension of an existing gap
Scoring a sequence alignment
Match/mismatch score: +1/+0
Open/extension penalty: –2/–1
ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT
Matches: 18 × (+1)
Mismatches: 2 × 0
Open: 2 × (–2)
Extension: 5 × (–1)
Score = +9
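A sketch of scoring with separate opening and extension penalties follows. It uses the convention implied by the slide's arithmetic: the first position of each gap costs the opening penalty (-2) and every further position of the same gap costs the extension penalty (-1). The function name affine_score is hypothetical, and for simplicity the code does not distinguish a gap in the top row from a gap in the bottom row.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows, charging more to open a gap than to extend it."""
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
print(affine_score(top, "----CTGATTCGC---ATCGTCTATCT"))  # 9
print(affine_score(top, "ACGTCTGAT-------ATAGTCTATCT"))  # 12 (one long gap)
print(affine_score(top, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # 7  (six short gaps)
```

The last two alignments are the pair from the "Gap opening and extension penalties" slide. Both have 20 matches and 7 gapped positions, so a flat penalty of -1 per gapped position scores them equally (13), while the open/extension scheme prefers the single long gap.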
Amino Acid Substitution Matrices
PAM (point accepted mutation): based on
global alignments [evolutionary model]
BLOSUM (block substitutions): based on
local alignments [similarity among
conserved sequences]
Part of PAM 250 Matrix
     C   S   T   P   A   G
C   12   0  -2  -3  -2  -3
S        2   1   1   1   1
T            3   0   1   0
P                6   1  -1
A                    2   1
G                        5
(the matrix is symmetric; only the upper triangle is shown)
Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )
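The log-odds ratio can be illustrated numerically. The probabilities below are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from any real matrix, and the logarithm base and scaling factor are left as parameters because published matrices differ in both.

```python
import math


def log_odds(p_homologous, p_chance, base=2.0, scale=1.0):
    """Scaled log of the ratio between the observed and the chance pair frequency."""
    return scale * math.log(p_homologous / p_chance, base)


# Hypothetical pair seen four times more often in homologous proteins than by chance.
favoured = log_odds(0.02, 0.005)
# Hypothetical pair seen four times less often than expected by chance.
disfavoured = log_odds(0.005, 0.02)

print(round(favoured, 3), round(disfavoured, 3))
```

A positive score means the pair is seen more often in related proteins than chance predicts, a negative score means it is seen less often, and zero means the pair is uninformative.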
PAM matrices
The PAM 1 matrix reflects the amount of evolution
that produces, on average, one mutation per hundred
amino acids (1 unit of evolution).
PAM 250: 250 units of evolution
Amino acid change   Probability (PAM 1)   Probability (PAM 250)
Phe to Ala          0.0002                0.04
Phe to Arg          0.0001                0.01
Phe to Asn          0.0001                0.02
Phe to Asp          0.0000                0.01
Phe to Cys          0.0000                0.01
...                 ...                   ...
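In the PAM model, the probabilities for n units of evolution are obtained by applying the 1-unit mutation probability matrix n times, that is, by raising it to the n-th power. The sketch below shows this with a toy three-letter alphabet; TOY_PAM1 is a made-up matrix for illustration only and is not the real 20 x 20 PAM 1 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy 1-unit matrix (assumption): each letter stays unchanged with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print([round(p, 3) for p in toy_pam250[0]])
```

As in the table above, the probability of each change is much larger after 250 units than after 1 unit, while every row still sums to 1.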
Limitations of PAM Matrices
Constructed based on the phylogenetic
relationships prior to scoring mutations;
Difficulty of determining ancestral relationships
among sequences;
Based on a small set of closely related proteins;
…
BLOSUM Matrices
Based on the observed amino acid substitutions in a
large set of ~2000 conserved amino acid patterns
(blocks). The blocks are found in a database of protein
sequences representing more than 500 families of
related proteins and act as signatures of these protein
families.
The matrices are derived from the multiple alignments
of the blocks.
The entries of the matrices are computed on the
same principle used in PAM: the log of an odds ratio.
Part of BLOSUM 62 Matrix
BLOSUM62 was computed from blocks in which sequences at least 62% identical
were clustered and counted as a single sequence.
     C   S   T   P   A   G
C    9  -1  -1  -3   0  -3
S        4   1  -1   1   0
T            5  -1   0  -2
P                7  -1  -2
A                    4   0
G                        6
(the matrix is symmetric; only the upper triangle is shown)
Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )
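A substitution matrix is used by looking up one score per aligned pair and summing. The sketch below stores only the part of BLOSUM 62 shown on this slide (residues C, S, T, P, A, G), so it cannot score other amino acids; the names BLOSUM62_PART, pair_score, and ungapped_score are hypothetical.

```python
# Upper triangle of the partial BLOSUM 62 matrix from the slide.
BLOSUM62_PART = {
    ("C", "C"): 9, ("C", "S"): -1, ("C", "T"): -1, ("C", "P"): -3, ("C", "A"): 0, ("C", "G"): -3,
    ("S", "S"): 4, ("S", "T"): 1, ("S", "P"): -1, ("S", "A"): 1, ("S", "G"): 0,
    ("T", "T"): 5, ("T", "P"): -1, ("T", "A"): 0, ("T", "G"): -2,
    ("P", "P"): 7, ("P", "A"): -1, ("P", "G"): -2,
    ("A", "A"): 4, ("A", "G"): 0,
    ("G", "G"): 6,
}


def pair_score(x, y):
    """Look up a pair in either order, since the matrix is symmetric."""
    key = (x, y) if (x, y) in BLOSUM62_PART else (y, x)
    return BLOSUM62_PART[key]  # raises KeyError for residues outside the excerpt


def ungapped_score(seq_a, seq_b):
    """Sum the pair scores of two equal-length peptides aligned without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("peptides must have the same length")
    return sum(pair_score(x, y) for x, y in zip(seq_a, seq_b))


# C-C (9) + A-S (1) + G-G (6) + S-T (1)
print(ungapped_score("CAGS", "CSGT"))  # 17
```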
PAM vs. BLOSUM
PAM
Based on mutational model of evolution (Markov process)
PAM1 is based on sequences of 85% similarity
Designed to track the evolutionary origins
BLOSUM
Based on the multiple alignment of blocks
Well suited to comparing distantly related sequences
Designed to find proteins’ conserved domains
Gap Penalty
Optimal penalties vary from sequence to sequence, and
finding the most adequate value is a matter of empirical
trial and error.
When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty
often give better results
When comparing closely related sequences, gaps should
be penalized on both a gap-opening and gap-extension