Bioinfo primer - part 4/6

Christophe Roos - MediCel ltd
Mutations change sequences
Function preserves sequences
Similarity is a tool in understanding
the information in a sequence
Sequence comparison - Why, how?
• Function by analogy: If sequences are conserved, their
function is probably also conserved.
• Functional domains: If some parts of the sequences are
more conserved than other parts, there must be an
underlying biological reason for it.
• Establishing relationships/differences in function: By
quantifying sequence relationships, it is possible to
estimate the function of novel genes.
• Establishing relationships between species.
• Compare two sequences of similar length
• Compare two sequences of very different length
• Compare several sequences
• Allow gaps or not?
• Scoring: yes-no or good-intermediate-bad
• The best or all above a threshold?
Christophe Roos - 4/6 Sequence comparison
Spring 2002
Sequence comparison – metrics
GA-CGGATTAG
GATCGGAATAG
Figure labels: gap, match, mismatch
• The scoring matrix
• The score for a match
• The penalty for a mismatch
• The penalty for the insertion of a gap (gap-open)
• The penalty for elongating a gap (gap-length)
• Local or global similarities?
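The metrics listed above can be combined into a single alignment score. The following is a minimal sketch, not part of the original slides: the function name and the numeric values (match +1, mismatch -1, gap-open -2, gap-length -1) are assumptions chosen only for illustration. It scores the aligned pair shown on the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Score two pre-aligned, equal-length strings; '-' marks a gap.

    The default values are illustrative assumptions, not taken from the slides.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    gap_side = None  # which sequence the currently open gap is in, if any
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            side = "top" if x == "-" else "bottom"
            # Elongating an open gap costs gap_extend; a new gap costs gap_open.
            score += gap_extend if side == gap_side else gap_open
            gap_side = side
        else:
            score += match if x == y else mismatch
            gap_side = None
    return score


# The slide's example: 9 matches, 1 mismatch, 1 gap of length 1.
print(score_alignment("GA-CGGATTAG", "GATCGGAATAG"))  # prints 6
```

Separating gap-open from gap-length lets one long gap cost less than several short ones, which is usually the biologically more plausible choice.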
Scoring matrices
• When evaluating the occurrence of a residue pair, one
weighs how meaningful it is to find that pair aligned.
• The matrix is a table of values that describe the
probability of a residue pair occurring in an alignment.
• Probabilities are derived from samples of alignments
known to be valid.
• They can then be used to evaluate the similarity of
sequences with unknown function to sequences with
known function.
     A   C   G   T
A    1
C    0   1
G    0   0   1
T    0   0   0   1

     A   C   G   T
A    5
C   -3   5
G   -3   0   5
T    0  -3  -3   5
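How such a table might be derived from trusted alignments can be sketched as follows. This is an assumption-laden illustration: the slides only say that probabilities come from valid alignments, while the log-odds form used here (observed pair frequency divided by the frequency expected by chance, on a log2 scale) is the common convention and is not stated in the original. The function name and the toy pair counts are hypothetical.

```python
import math
from collections import Counter


def log_odds_matrix(columns):
    """Build log2-odds scores from (a, b) residue pairs seen in trusted alignments.

    Log-odds scoring is an assumption here; the pair data passed in is up to the caller.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    for a, b in columns:
        pair_counts[tuple(sorted((a, b)))] += 1  # treat (A, G) and (G, A) alike
        residue_counts[a] += 1
        residue_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())
    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / n_pairs
        p_a = residue_counts[a] / n_residues
        p_b = residue_counts[b] / n_residues
        # An unordered mixed pair can arise in two ways, hence the factor 2.
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Hypothetical toy sample: identical pairs are over-represented.
pairs = [("A", "A")] * 8 + [("G", "G")] * 8 + [("A", "G")] * 2 + [("G", "A")] * 2
for pair, score in sorted(log_odds_matrix(pairs).items()):
    print(pair, round(score, 2))
```

Pairs seen more often than chance predicts get positive scores and pairs seen less often get negative ones, which is the pattern visible in the second matrix on the slide.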
Scoring matrices
• For DNA, they are usually binary: either
there is similarity or there is not.
• For proteins, they reflect the chemical
nature and frequencies of the amino acids,
and cover a larger range of values.
Commonly used matrices for proteins
• Blosum matrices are derived from the
Blocks database, which contains
ungapped alignments from families of
related proteins. The number indicates
the similarity threshold level (percent
identity) used when building the matrix:
Blosum62, Blosum45.
• PAM matrices are scaled according to a model of evolutionary distance derived from
alignments of closely related sequences. One PAM unit (PAM1) is a 1% change over all
positions.
Walking through an alignment matrix
• Start with a gap (-) against itself,
score it 0.
• Fill in one row at a time.
• At each position compute the scores
that result from each of the choices:
move one step in each sequence
(diagonal), skip one horizontal or
one vertical.
• Choose the best of the three values
and save it.
• Score +5 for a match, -4 for a mismatch
and -7 for a gap. If S < 0, write 0.
• Trace back along the highest-scoring
path.
Example:
10-4=6 (diagonal)
10-7=3 (gap, horizontal or vertical)
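The walk through the matrix can be sketched in code as below. This is a minimal illustration under stated assumptions: it uses the slide's scores (+5, -4, -7) and the write-0-if-negative rule, which makes it a local alignment. The function name and the two test sequences are hypothetical, and ties in the traceback are resolved in favour of the diagonal.

```python
def local_alignment(a, b, match=5, mismatch=-4, gap=-7):
    """Fill the alignment matrix row by row, then trace back from the best cell."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # cell [0][0] is gap against gap, scored 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # skip one residue of b
            left = H[i][j - 1] + gap    # skip one residue of a
            H[i][j] = max(0, diag, up, left)  # negative scores are written as 0
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback along the highest-scoring path until a 0 cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


# Hypothetical sequences sharing only the segment ACGT.
print(local_alignment("CCCACGTCCC", "TTACGTTT"))  # prints (20, 'ACGT', 'ACGT')
```

Storing the whole matrix keeps the traceback simple; for long sequences a real implementation would need to be more careful with memory.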
Global alignments and local ones
• Aligning 2 sequences along their whole length is done by stepping through
the matrix from top left to bottom right. The best-scoring path can be traced
through the matrix, resulting in an optimal alignment. The Needleman-Wunsch algorithm belongs to this class.
• Sequences are often modular, so similarities may be only local, and
global alignments will then fail. The Smith-Waterman algorithm is a dynamic programming
algorithm that performs local alignment of 2 sequences. If the cumulative
score up to some point in the sequence is negative, that path can be abandoned
and the alignment restarted. The alignment can also end anywhere in the matrix.
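The difference between the two classes comes down to a few lines. The sketch below is illustrative only: the function names are hypothetical, the scores (+5, -4, -7) are borrowed from the walkthrough slide, and only scores are computed, not the alignments themselves.

```python
def _pair(x, y, match, mismatch):
    return match if x == y else mismatch


def global_score(a, b, match=5, mismatch=-4, gap=-7):
    """Needleman-Wunsch style: the path must run from top left to bottom right."""
    prev = [j * gap for j in range(len(b) + 1)]  # leading gaps are penalised
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + _pair(a[i - 1], b[j - 1], match, mismatch),
                           prev[j] + gap,
                           cur[j - 1] + gap))
        prev = cur
    return prev[-1]  # score of the bottom-right cell


def local_score(a, b, match=5, mismatch=-4, gap=-7):
    """Smith-Waterman style: negative running scores restart at 0, best cell wins."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            cell = max(0,
                       prev[j - 1] + _pair(a[i - 1], b[j - 1], match, mismatch),
                       prev[j] + gap,
                       cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)  # the alignment may end anywhere
        prev = cur
    return best


# Hypothetical modular sequences: only the segment ACGT is shared.
a, b = "CCCACGTCCC", "TTACGTTT"
print("global:", global_score(a, b))
print("local: ", local_score(a, b))
```

For sequences that share only one module, the global score is dragged down by the unrelated flanks while the local score reports the shared segment alone.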
Example: local alignment of 2 sequences
Web page: enter two sequences and
search for local alignments. Two are
found.
Iterate: compare one against many
• By iterating pairwise comparisons, one can compare one
sequence against a database of many sequences.
• Algorithms such as Smith-Waterman are optimised for quality
and are too slow for this task.
• Multistep algorithms have been developed for this task
– Fasta: (i) search for short identical words of length k (k-tups; k is
usually 2 for proteins and 6 for DNA). (ii) score the 10
ungapped alignments with the most identical k-tups. (iii) try to merge them into a
gapped alignment without reducing the score below a threshold.
– Blast: (i) create a list of short words that score enough when compared
to the query (ii) search these words in a precomputed table of all words
and their positions in the database (iii) extend into ungapped or even
gapped local alignments.
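The word-lookup step that both heuristics rely on can be sketched as follows. This is a simplified illustration, not the actual Fasta or Blast implementation: it matches exact words only (Blast also accepts similar words that score above a threshold), and all function names and the toy database are hypothetical.

```python
from collections import Counter, defaultdict


def build_word_index(database, k):
    """Precompute a table of every k-letter word and where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Look up each query word; return (sequence name, query pos, database pos)."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((name, qpos, dpos))
    return hits


def best_diagonals(hits, top=10):
    """Count hits per diagonal; many hits on one diagonal suggest an ungapped match."""
    diagonals = Counter((name, dpos - qpos) for name, qpos, dpos in hits)
    return diagonals.most_common(top)


# Hypothetical toy database and query.
db = {"s1": "TTACGTTT", "s2": "GGGGGG"}
index = build_word_index(db, k=3)
seeds = find_seeds("CACGTC", index, k=3)
print(seeds)                  # prints [('s1', 1, 2), ('s1', 2, 3)]
print(best_diagonals(seeds))  # prints [(('s1', 1), 2)]
```

Because the index is built once per database, each query costs only a few table lookups before the slower extension step is run on the promising diagonals.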
Example: BLAST one against SwissProt
Web page: enter the sequence and search
for local alignments. Several are found
and listed both graphically and as text.
Note the modularity of the query: two
domains are apparent.
Iterate (2): compare many against many
• Multiple sequence alignments
• Example: The eyeless gene is also called PAX6 and can be found in
several species: birds, mammals, reptiles, fish, invertebrates
Multiple sequence alignments
CLUSTAL W (1.81) multiple sequence alignment
First all sequence pairs are aligned and
scored, then in a second round a
multiple sequence alignment is built up.
In this case (PAX6 proteins from
vertebrates and fruit fly), two domains
are more conserved than the rest of the
sequence.
Only the first domain is shown here.
PAX6_CHICK      ------------------------------------------------------------
PAX6_HUMAN      ------------------------------------MQNS----------------HSGV 8
PAX6_MOUSE      ------------------------------------MQNS----------------HSGV 8
PAX6_COTJA      ------------------------------------MQNS----------------HSGV 8
PAX6_BRARE      -----------------MPQKEYYNRATWESGVASMMQNS----------------HSGV 27
PAX6_ORYLA      -----------------MPQKEYHNQATWESGVASMMQNS----------------HSGV 27
PAX6_XENLA      ------------------------------------MQNS----------------HSGV 8
PAX6_DROME      MRNLPCLGTAGGSGLGGIAGKPSPTMEAVEASTASHRHSTSSYFATTYYHLTDDECHSGV 60

PAX6_CHICK      ------------------------------------------------------------
PAX6_HUMAN      NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_MOUSE      NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_COTJA      NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_BRARE      NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 87
PAX6_ORYLA      NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 87
PAX6_XENLA      NQLGGVFVNGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 68
PAX6_DROME      NQLGGVFVGGRPLPDSTRQKIVELAHSGARPCDISRILQVSNGCVSKILGRYYETGSIRP 120

PAX6_CHICK      ------------------------------------------------------------
PAX6_HUMAN      RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_MOUSE      RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_COTJA      RAIGGSKPRVATPEVVSKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_BRARE      RAIGGSKPRVATPEVVGKIAQYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 147
PAX6_ORYLA      RAIGGSKPRVATPEVVAKIAQYKRECPSIFAWEIRDRLLSEGICTNDNIPSVSSINRVLR 147
PAX6_XENLA      RAIGGSKPRVATPEVVNKIAHYKRECPSIFAWEIRDRLLSEGVCTNDNIPSVSSINRVLR 128
PAX6_DROME      RAIGGSKPRVATAEVVSKISQYKRECPSIFAWEIRDRLLQENVCTNDNIPSVSSINRVLR 180

PAX6_CHICK      ------------------------------------------------------------
PAX6_HUMAN      NLASEKQQMGA------------------------------------------------- 139
PAX6_MOUSE      NLASEKQQMGA------------------------------------------------- 139
PAX6_COTJA      NLASEKQQMGA------------------------------------------------- 139
PAX6_BRARE      NLASEKQQMGA------------------------------------------------- 158
PAX6_ORYLA      NLASEKQQMGA------------------------------------------------- 158
PAX6_XENLA      NLASDKQQMGS------------------------------------------------- 139
PAX6_DROME      NLAAQKEQQSTGSGSSSTSAGNSISAKVSVSIGGNVSNVASGSRGTLSSSTDLMQTATPL 240
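One simple way to spot conserved regions in such an alignment is to mark the columns where every sequence carries the same residue. The sketch below is an illustration only, with a hypothetical function name; it is not how CLUSTAL W computes its own consensus line. The input is the first 11 columns of the last alignment block for three of the sequences (human, Xenopus, fruit fly).

```python
def conservation_line(rows):
    """Return '*' for columns where all rows share one residue, ' ' otherwise."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must have equal length")
    marks = []
    for column in zip(*rows):
        fully_conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if fully_conserved else " ")
    return "".join(marks)


rows = [
    "NLASEKQQMGA",  # PAX6_HUMAN
    "NLASDKQQMGS",  # PAX6_XENLA
    "NLAAQKEQQST",  # PAX6_DROME (first 11 columns of the block)
]
for row in rows:
    print(row)
print(conservation_line(rows))  # prints "***  * *   "
```

Runs of marked columns correspond to the conserved domains mentioned on the slide, while unmarked stretches show where the sequences have diverged.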
Multiple sequences in phylogeny
• Once a multiple sequence alignment
is done, it can be used for finding
– Domains (previous slide)
– Relationships (evolutionary
distance)
• The distance is calculated as the
number of mutations needed to
evolve from a putative ancestor to
all the ‘present-day’ sequences used.
Then a path (tree) including all sequences
is computed. Different criteria can
be used (most parsimonious,
maximum likelihood, etc.).
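Counting the mutations a given tree requires can be sketched with Fitch's parsimony counting. This is an illustrative choice, not a method named in the slides beyond "most parsimonious": the tree is fixed in advance rather than searched for, and the taxon names a to d and their sequences are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    tree is a leaf name (str) or a 2-tuple of subtrees; states maps leaf -> residue.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra mutation at this node.
        return common, left_changes + right_changes
    # The children disagree: one mutation is needed on one of the branches.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences for four taxa.
alignment = {"a": "AAT", "b": "AAT", "c": "GAT", "d": "GAC"}
print(parsimony_score((("a", "b"), ("c", "d")), alignment))  # prints 2
print(parsimony_score((("a", "c"), ("b", "d")), alignment))  # prints 3
```

Under the parsimony criterion the first grouping would be preferred because it explains the same sequences with fewer mutations.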