Bioinformatics - Computer Science : The University of Akron

Roadmap
The topics:
basic concepts of molecular biology
more on Perl
overview of the field
biological databases and database searching
sequence alignments
phylogenetics
structure prediction
microarray data analysis
Sequence alignments

Introduction
  What is an alignment?
  Why do alignments?
  A bit of history
Dot matrix comparison
Scoring alignments
Alignment methods
Significance of alignments
What is sequence alignment?
A sequence alignment is an arrangement of
two or more sequences that highlights their
similarity.
Why do alignments?
Sequence alignment is useful for
discovering structural, functional, and
evolutionary information in biological
sequences.
Over time, genes accumulate mutations
Environmental factors
  Radiation
  Oxidation
Mistakes in replication/repair
  Deletions, duplications
  Insertions
  Inversions
  Point mutations
Comparing two sequences
Point mutations, easy:
ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGATTCGCCCTATCGTCTATCT
Insertions/deletions, must align:
ACGTCTGATACGCCGTATAGTCTATCT
CTGATTCGCATCGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
----CTGATTCGC---ATCGTCTATCT
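As a minimal sketch of the "easy" case, the point mutations between two equal-length sequences can be listed by comparing them position by position. The function name `point_differences` is illustrative and not part of the course material; positions are 0-based.

```python
def point_differences(seq_a, seq_b):
    """List (position, base_in_a, base_in_b) wherever two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        # Unequal lengths imply insertions/deletions, so the sequences must be aligned first.
        raise ValueError("sequences differ in length; align them first")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


original = "ACGTCTGATACGCCGTATAGTCTATCT"
mutated = "ACGTCTGATTCGCCCTATCGTCTATCT"
print(point_differences(original, mutated))
# [(9, 'A', 'T'), (14, 'G', 'C'), (18, 'A', 'C')]
```

Passing the shorter sequence CTGATTCGCATCGTCTATCT instead raises `ValueError`, which corresponds to the "must align" case on the slide.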
Sequence Alignment
Doolittle RF, Hunkapiller MW, Hood LE,
Devare SG, Robbins KC, Aaronson SA,
Antoniades HN. Science 221:275-277, 1983.
Russell F. Doolittle

The sequence of platelet-derived growth factor (PDGF) from mammalian cells
was found to be virtually identical to the sequence of the
retrovirus-encoded oncogene known as v-sis (a gene that causes cancer
in animals).

The retrovirus had apparently acquired the gene from the host cell through
some kind of genetic exchange event, and had then produced a mutant that
could alter the function of the normal protein when the virus infected
another animal.
Dot Matrix Comparison
A: T C A G A G G T C T G
B: T C A G A G C T G
[Figure: dot matrix comparing sequence A with sequence B; an X marks each position where the residues of A and B are identical.]
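A dot matrix for sequences A and B can be sketched in a few lines. This is only an illustration of the idea, assuming a dot is placed wherever the two residues are identical; the function names and the choice of A along the top and B down the side are assumptions, not taken from the slide.

```python
def dot_matrix(seq_a, seq_b):
    """Grid with one row per residue of seq_b and one column per residue of seq_a.

    A cell is True when the two residues are identical.
    """
    return [[a == b for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the grid as text, with X for a match and . for a mismatch."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("X" if hit else "." for hit in row))
    return "\n".join(lines)


A = "TCAGAGGTCTG"
B = "TCAGAGCTG"
print(render(A, B))
```

In the printed grid, the first six residues form a run on the main diagonal, and the final CTG of B lines up with the final CTG of A on a diagonal shifted by two columns.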
Interpretation of dot matrix

Regions of similarity appear as diagonal runs of dots

Reverse diagonals (perpendicular to diagonal) indicate
inversions

Can link or "join" separate diagonals to form
alignment with "gaps"
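The idea of reading diagonal runs off the matrix and joining them can be sketched as follows. This is a hedged illustration: `diagonal_runs` and its `min_len` parameter are hypothetical names, and a run is assumed to mean consecutive identical residues on one diagonal.

```python
def diagonal_runs(seq_a, seq_b, min_len=3):
    """Return (start_in_a, start_in_b, length) for each diagonal run of matches."""
    runs = []
    # offset = position in seq_a minus position in seq_b; one offset per diagonal.
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        i = max(offset, 0)    # position in seq_a
        j = max(-offset, 0)   # position in seq_b
        run_len = 0
        while i < len(seq_a) and j < len(seq_b):
            if seq_a[i] == seq_b[j]:
                run_len += 1
            else:
                if run_len >= min_len:
                    runs.append((i - run_len, j - run_len, run_len))
                run_len = 0
            i += 1
            j += 1
        if run_len >= min_len:  # run that reaches the edge of the matrix
            runs.append((i - run_len, j - run_len, run_len))
    return sorted(runs)


print(diagonal_runs("TCAGAGGTCTG", "TCAGAGCTG"))
# [(0, 0, 6), (8, 6, 3)]
```

The two runs sit on diagonals that are two positions apart, so joining them corresponds to the gapped alignment TCAGAGGTCTG over TCAGAG--CTG.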
More on Dot Matrix

Detection of matching regions can be improved by filtering:
use a sliding window to compare the two sequences. For example,
print a dot at a matrix position only if
  7 out of the next 11 positions in the sequences are identical, or
  the similarity score of the next 11 positions in the sequences is greater than 5.
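The first filtering rule can be sketched as below. The sketch assumes "7 out of the next 11" means at least 7 identical positions in a window of 11 starting at the current cell; `windowed_dots` and its parameter names are illustrative.

```python
def windowed_dots(seq_a, seq_b, window=11, min_matches=7):
    """Return (i, j) cells where the window starting at seq_a[i], seq_b[j] has enough matches."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= min_matches:
                dots.append((i, j))
    return dots


# The two sequences from the point-mutation example earlier in the lecture.
seq_1 = "ACGTCTGATACGCCGTATAGTCTATCT"
seq_2 = "ACGTCTGATTCGCCCTATCGTCTATCT"
dots = windowed_dots(seq_1, seq_2)
print(len(dots), "cells pass the filter")
```

Because the two sequences differ at only three positions, every window on the main diagonal still passes the filter, while isolated chance matches away from the diagonal are mostly suppressed.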
Sequence repeats

Many sequences contain repetitive regions.
[Figure: dot plot of a retrovirus vector sequence against itself, using a window size of 9 and a mismatch limit of 2
(http://arbl.cvmbs.colostate.edu/molkit/dnadot/bkg.html)]
More on Dot Matrix

Dot matrix graphically presents regions of identity or
similarity between two sequences

The use of windows and thresholds can reduce
“noise” in dot matrix

Inversions and duplications have unique “signatures”
in dot matrix
Software
Dotlet (Java applet): www.ch.embnet.org
Dnadot: arbl.cvmbs.colostate.edu/molkit/dnadot/
Dotter: www.cgr.ki.se/cgr/groups/sonnhammer/Dotter.html
Dottup: www.emboss.org
How to measure the similarity
Three kinds of changes can occur at any
given position within a sequence:
  Mutation (substitution of one residue for another)
  Insertion
  Deletion
Insertions and deletions have been found to occur in
nature at a significantly lower frequency than
mutations.
Scoring Matrices for Aligning DNA Sequences

Identity matrix
     A   T   C   G
A    1   0   0   0
T    0   1   0   0
C    0   0   1   0
G    0   0   0   1

BLAST matrix
     A   T   C   G
A    5  -4  -4  -4
T   -4   5  -4  -4
C   -4  -4   5  -4
G   -4  -4  -4   5

Transition-Transversion matrix
     A   T   C   G
A    1  -5  -5  -1
T   -5   1  -1  -5
C   -5  -1   1  -5
G   -1  -5  -5   1

Transition: a substitution in which a purine (A/G) is replaced by
another purine (A/G), or a pyrimidine (C/T) is replaced by
another pyrimidine (C/T).
Transversion: a substitution between a purine (A/G) and a pyrimidine (C/T).
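The transition-transversion matrix can be generated from the definitions rather than typed in by hand. The sketch below uses the slide's scores (match 1, transition -1, transversion -5); the function and variable names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
SCORES = {"match": 1, "transition": -1, "transversion": -5}


def substitution_type(x, y):
    """Classify a pair of bases as a match, a transition, or a transversion."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"    # purine <-> purine or pyrimidine <-> pyrimidine
    return "transversion"      # purine <-> pyrimidine


bases = "ATCG"
matrix = {x: {y: SCORES[substitution_type(x, y)] for y in bases} for x in bases}

for x in bases:
    print(x, [matrix[x][y] for y in bases])
```

Changing the three values in `SCORES` to 1, 0, 0 or to 5, -4, -4 reproduces the identity and BLAST matrices in the same way.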
Scoring a sequence alignment
Match score:    +1
Mismatch score: +0
Gap penalty:    -1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches:    18 × (+1)
Mismatches:  2 × 0
Gaps:        7 × (-1)
Score = +11
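The arithmetic on this slide can be checked with a short function. This is a sketch under the slide's scheme (match +1, mismatch 0, each gap position -1); `score_alignment` is an illustrative name.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column with a per-position gap penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # every gap position costs the same
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


top = "ACGTCTGATACGCCGTATAGTCTATCT"
bottom = "----CTGATTCGC---ATCGTCTATCT"
print(score_alignment(top, bottom))
# 11
```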
Gap opening and extension penalties
We want to find alignments that are evolutionarily likely.
Which of the following alignments seems more likely to
you?

ACGTCTGATACGCCGTATAGTCTATCT
ACGTCTGAT-------ATAGTCTATCT

ACGTCTGATACGCCGTATAGTCTATCT
AC-T-TGA--CG-CGT-TA-TCTATCT

A single long gap, which one insertion or deletion event can explain, is more
plausible than many short, scattered gaps. We can favor such alignments by
penalizing the opening of a new gap more heavily than the extension of an
existing gap.
Scoring a sequence alignment
Match/mismatch score:   +1/+0
Open/extension penalty: -2/-1

ACGTCTGATACGCCGTATAGTCTATCT
    ||||| |||   || ||||||||
----CTGATTCGC---ATCGTCTATCT

Matches:    18 × (+1)
Mismatches:  2 × 0
Open:        2 × (-2)
Extension:   5 × (-1)
Score = +9
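The open/extension scheme can be sketched as below. It follows the counting used on this slide, where the first position of each gap costs the opening penalty and every further position of the same gap costs the extension penalty; `affine_score` is an illustrative name.

```python
def affine_score(top, bottom, match=1, mismatch=0, gap_open=-2, gap_extend=-1):
    """Score two aligned rows with separate gap-opening and gap-extension penalties."""
    score = 0
    previous_gap_row = None  # row that held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row is an extension; otherwise a new gap opens.
            score += gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


reference = "ACGTCTGATACGCCGTATAGTCTATCT"
print(affine_score(reference, "----CTGATTCGC---ATCGTCTATCT"))  # 9
print(affine_score(reference, "ACGTCTGAT-------ATAGTCTATCT"))  # 12
print(affine_score(reference, "AC-T-TGA--CG-CGT-TA-TCTATCT"))  # 7
```

The last two calls score the two candidate alignments from the gap opening and extension slide. Both contain 20 matches and 7 gap positions, so a flat gap penalty cannot separate them, while the open/extension scheme prefers the single long gap.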
Amino Acid Substitution Matrices
PAM: point accepted mutation, based on global alignment [evolutionary model]
BLOSUM: block substitutions, based on local alignments [similarity among conserved sequences]
Part of PAM 250 Matrix
     C   S   T   P   A   G
C   12
S    0   2
T   -2   1   3
P   -3   1   0   6
A   -2   1   1   1   2
G   -3   1   0  -1   1   5

Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )
PAM matrices
The PAM 1 matrix reflects an amount of evolution
producing on average one mutation per hundred
amino acids (1 unit of evolution).
PAM 250: 250 units of evolution

Amino acid change   Probability
                    PAM 1    PAM 250
Phe to Ala          0.0002   0.04
Phe to Arg          0.0001   0.01
Phe to Asn          0.0001   0.02
Phe to Asp          0.0000   0.01
Phe to Cys          0.0000   0.01
...                 …        …
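In the PAM model, probabilities for longer evolutionary distances are obtained by applying the 1-unit matrix repeatedly, so a 250-unit matrix is the 1-unit matrix raised to the 250th power. The sketch below shows this on a made-up two-letter alphabet; the numbers in `toy_pam1` are hypothetical and are not taken from the real 20 × 20 PAM 1 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power //= 2
    return result


# Hypothetical alphabet {X, Y}: row i gives the probability that residue i
# is unchanged or replaced after 1 unit of evolution. Each row sums to 1.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

As in the table above, a replacement that is very unlikely after 1 unit becomes far more likely after 250 units, because many small steps accumulate.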
Limitations of PAM Matrices
Constructed based on the phylogenetic relationships prior to scoring mutations;
Difficulty of determining ancestral relationships among sequences;
Based on a small set of closely related proteins;
…
BLOSUM Matrices
Based on the observed amino acid substitutions in a
large set of ~2000 conserved amino acid patterns
(blocks). The blocks were found in a database of protein
sequences representing more than 500 families of
related proteins, and act as signatures of these protein
families.
The matrices are derived from the multiple alignments
of the blocks.
The entries of the matrices are computed on the
same principle used in PAM: the log-odds ratio.
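The log-odds principle can be sketched on a tiny invented block. The block below is hypothetical, the factor of 2 on the base-2 logarithm is an assumed scaling (half-bit units), and the sequence clustering step used when building real BLOSUM matrices is left out.

```python
import math
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Observed frequency of each unordered residue pair, counted column by column."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds_scores(block, scale=2.0):
    """Score each observed pair as scale * log2(observed / expected by chance)."""
    observed = pair_frequencies(block)
    # Background frequency of each residue implied by the observed pairs.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2
    scores = {}
    for (a, b), freq in observed.items():
        expected = background[a] ** 2 if a == b else 2 * background[a] * background[b]
        scores[(a, b)] = scale * math.log2(freq / expected)
    return scores


toy_block = ["AAAG", "AAAG", "AAAG", "GAAG"]  # invented ungapped block, one row per sequence
for pair, score in sorted(log_odds_scores(toy_block).items()):
    print(pair, round(score, 2))
```

Pairs seen more often than chance predicts get positive scores (here A with A and G with G), and pairs seen less often get negative scores (A with G). Pairs that never occur in the block get no score in this sketch.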
Part of BLOSUM 62 Matrix
BLOSUM62 was derived from blocks in which sequences sharing 62% or more
identical amino acids were clustered and counted as a single sequence.

     C   S   T   P   A   G
C    9
S   -1   4
T   -1   1   5
P   -3  -1  -1   7
A    0   1   0  -1   4
G   -3   0  -2  -2   0   6

Log-odds = log( (chance to see the pair in homologous proteins) / (chance to see the pair in unrelated proteins by chance) )
PAM vs. BLOSUM
PAM
  Based on a mutational model of evolution (Markov process)
  PAM1 is based on sequences of 85% similarity
  Designed to track evolutionary origins
BLOSUM
  Based on the multiple alignment of blocks
  Well suited to comparing distantly related sequences
  Designed to find conserved domains in proteins
Gap Penalty

Optimal penalties vary from sequence to sequence, and
finding the most adequate value is a matter of empirical
trial and error.

When comparing distantly related sequences, a high gap-opening penalty and a very low gap-extension penalty
often give better results.

When comparing closely related sequences, gaps should
be penalized on both a gap-opening and gap-extension