3.blast-and-E-values..

Download Report

Transcript 3.blast-and-E-values..

Finding Sequence Similarities
There are many programs used to do this.
They range from relatively slow programs which find the exact best
matching alignment, through ones which take progressively inexact
shortcuts to speed things up. Of this latter class, the best known, and
easily most widely used is BLAST, developed by Stephen Altschul and
others, and continuously refined over the last 10-15 years.
The essential idea is to compare your query sequence against a collection
or ‘database’ of target sequences, looking for the one(s) that match the
query sequence the best.
database
query
>query
AGACGAACCTAGCACAAGCGCGTCTGGAAAGACCCGCCAGCTACGGTCACCGAG
CTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTCAGGA
GTATTTGGACTGCAATATTGGCCCTCGTTCAAGGGCGCCTACCATCACCCGACG
GTCATGCCGGTCCCCAGCAGCTGCTAATAACTTCCTTCGCTACTCAAGTTACCA
CGCTAGCAAAACCCACGGCATACCGTTTACCCTTTAAAATCAGCTTCAACCAGC
AACGAA
COMPARE
LIST MATCHES
>target1
AAAACAGGAATATTTACCGGGACCGGGTAATGATGCATCTCGAGGTACACAATATACCTG
GAGAACCGAATTATGAGTTGGCCACCTTACTTAACGAAACCAGCAGAGAAAATCCAACAT
GGCAACACCCCTCTGACTACACTAGAAGGAACTACTATGTAAGAAAACAGCCTGTCCCTT
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAG
>target2
CTCTTAATTTATTTCTCTTCCTGCAGCTCCCTCGCTTTTTCCTTTCCCTGTTACATTCAT
CTGACTTGAAGAGTTGCAAATTTTCAGTGTTTCTGTTTTTGTTGCTGATATGTTGTAAAC
TTTTTAATAAAATCTATTTCTATAG
>target3
GCAGTTTGAATGACTGGGTGATGCGAAATGGGGGTCCTGCCATAGAGCGCTTCCATGGTT
TACCTTGCACATTTCAGAGAAGTCCTATGCCAGGAGTCCTTCCTACAGGGCCTTCCTGAA
ACTATATATGTGCTTATTCTTGTTTGATTTGGCTTTGCAGCTAGGGTTTTCACCTTTTCT
GGAAAAAAAAATACTGGCTTCC
>target4
CTGCTATTAATGGGCAAAACAACTCAAATAAAGTCCCTCTGCCACCCTCAGACACTGCCC
CTGGCCCCCAGCTGCCCGCTGATCCTTGTAGCCAGAGCAGTAAAGTTTTGAAAGTGGAGC
CCAAGGAGAATAAAGTTATTAAAGAAACTGGCTTTGAACAAGGTGAAAAGTCTTGTGCAG
CACCTCTAGATCATACTGTGAAGGAAAATCTTGGACAAACTTCTAAAGAACAGGTGGTAG
Flavours of BLAST
BLAST can perform a number of similar tasks with different types of sequence:
BLASTn – comparing nucleotide sequence vs. nucleotide sequence database - FAST
BLASTp – comparing protein sequence vs. protein sequence database - FAST
BLASTx – comparing nucleotide sequence vs. protein sequence database by translating
the nucleotide sequence in all possible reading frames - SLOW
tBLASTn – comparing protein sequence vs. nucleotide sequence database translated
into all possible reading frames - SLOWER
tBLASTx – comparing nucleotide sequence vs. nucleotide sequence database
translating both into all possible reading frames – EXCRUCIATINGLY SLOW!
The amino acid sequence based programs use a substitution matrix to allow some
amino acids to count as effective matches with each other. These are the BLOSUM and
PAM matrices you may see referred to from time to time.
How does it work?
The main task of any sequence comparison program is to test all possible
mutual alignments of two sequence and see how good the match is:
query
1st database
sequence
CCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCGTTC
CCGAGCTTCTCATTGCTCTTCCTAACAGTG=TGATAGGCTAACCGTAATGGCGTTC
|||||||||||||||||||||||||
||||||||||||||||||||||||
||||||||||||||||||||||||
| ||
|
|||||||||||||||||||||||||
| | | ||
|| || || || || ||
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
CTTCTCATTGCTCTTCCTAACAGTGATGATAGGCTAACCGTAATGGCGTTCAGGAGT
BLAST achieves its speed through two strategies:
It ‘indexes’ the database sequences so it know where all the minor
subsequences are in each sequence, so it doesn’t have to look all the way
through each sequence each time, letter by letter.
It’s ‘word based’, so that it will only start looking for possible extensive
alignments once it’s found a seed alignment of an exact match. The default
seed lengths are 11 letters for BLASTn and 3 for BLASTp. This means that
some good alignments are un-findable, e.g. a 50% protein match with exactly
every second amino acid matching. It relies on these ‘uniformly distributed’
alignments being very rare occurrences.
BLAST –Typical Output
INPUT:
>partial cDNA sequence, Xenopus tropicalis
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCGTTCCCACCTCTCCTCTTTCACCATGAAGCTCAAGGACAAATTCCACTCCC
CCAAAATCAAGCGCACCCCGTCCAAGAAGGGGAAGCCGGCCGACCTCACCGTCAAAACAGAAGAGAAACCCGTCAACAAAACCTTAAGCCGCTTGGAGGAACAGGAGAAA
GAAGTCGTTAATGCCTTGCGTTACTTTAAGACAATTGTTGACAAGATGGCGGTGGACAAGATGGTGCTGGTGATGCTGCCAGGGTCGGCGA
OUTPUT:
Query= (311 letters)
Database: NCBI Protein Reference Sequences 954,378 sequences; 347,895,532 total letters
>gi|41055060|ref|NP_957420.1|
similar to guanine nucleotide-releasing factor 2 (specific
for crk proto-oncogene) [Danio rerio]
Length=691
Score
Expect
Identities
Positives
Gaps
Frame
Query
Sbjct
=
=
=
=
=
=
133 bits (335)
6e-31
76/98 (77%)
82/98 (83%)
4/98 (4%)
+2
26 MSGKIE-KADSQRSHLSSFTMKLKDKFHSPKIKRTPSKKGKPA--DLTVKTEEKPVNKTL 196
MSGKIE K +SQ+SHLSSFTMKL KFHSPKIKRTPSKKGK
+ VKT EKPVNK +
1 MSGKIESKHESQKSHLSSFTMKLM-KFHSPKIKRTPSKKGKQLQPEPAVKTPEKPVNKKV 59
Query 197 SRLEEQEKEVVNALRYFKTIVDKMAVDKMVLVMLPGSA 310
SRLEEQEK+VV+ALRYFKTIVDKM VD VL MLPGSA
Sbjct 60 SRLEEQEKDVVSALRYFKTIVDKMNVDTKVLQMLPGSA 97
When is a match significant?
Here is a ‘typical’ weak alignment from BLASTp:
RF---KISDCQHPCTYSH-NQYMTNHMREC----PYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
F
K S+ + C + + N Y N
+C
P+
+ +W
+P
+
D
I
N
M ++
NFSWKKTSEKETNCQFDYPNDY--NEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFN------MCWLEVNSS
In fact the sequences were randomly generated, so there
is no biologically significant alignment…
RFKISDCQHPCTYSHNQYMTNHMRECPYNGAATSIPSWHLIVHPSNGQSVSFPQSDPCQIKMNQNLHLVQMMYDMQTTHV
NFSWKKTSEKETNCQFDYPNDYNEQTQCQPMTPFKADVFDLWNWEFNANPKLENGIRDLIDDKHDILQIFNMCWLEVNSS
E-values
The number of matches like the discovered
match that I would expect to find by chance.
An E-value of 0.0 implies that I would expect no matches like
this to arise by chance, therefore…
An E-value of 1 implies I would expect 1 match like this to
arise by chance, so if I have a match with such an E-value…
E-values From First Principles
Notation:
1.2e-35 = 1.2 x 10-35
4.8 x 106 = 4,800,000
Some database statistics (23rd July 2005):
Database: NCBI RefSeq mRNA
272,619 sequences; 503,566,580 total letters (~5.0 x 108)
Database: NCBI nr
3,329,110 sequences; 14,601,814,750 total letters (~1.4 x 1010)
We will consider first searching a nucleotide sequence (‘ACGTAGACGT’)
against a nucleotide database, e.g. the RefSeq mRNA above.
Then we will consider the more complex case of amino acid sequence (protein)
searches. Which is of course what we mostly do.
Calculating an E-value
The RefSeq mRNA database has ~ 5.0 x 108 letters
There are 4 possible nucleotides - ACGT
How many matches do we expect to find by chance?
Query = ‘A’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
A
A
A
A
A
AA A
A A
AA
AA
Expected number of matches = (5.0 x 108) / 4 = ~1.2 x 108
Query = ‘AC’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
AC
AC
AC
AC
Expected number of matches = (5.0 x 108) / (4x 4) = ~3.1 x 107
Query = ‘ACG’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
ACG
Expected number of matches = (5.0 x 108) / (4 x 4 x 4) = ~8.1 x 106
Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer
Expected number of matches = (5.0 x 108) / (4 x 4 x 4 x 4 … 60 times )
= (5.0 x 108) / 1036
= 5.0 x 10-28
E-value = 5.0 x 10-28
E-values In Practice
So if I take a 60 nt sequence:
>sequence
ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA
and actually BLAST it against the RefSeq mRNA database, I get:
BLAST OUTPUT:
>gi|27469838|gb|BC041710.1|
Homo sapiens Rap guanine nucleotide exchange factor
(GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060
-28 - !?
Score = 119 bits (60), Expect = 2e-26 theoretical value was 5.0e
Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus
Query
1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036
What do I get if I BLAST it against the larger nr database?
BLAST OUTPUT:
>gi|27469838|gb|BC041710.1|
Homo sapiens Rap guanine nucleotide exchange factor
(GEF) 1, transcript variant 2, mRNA (cDNA clone MGC:49019 IMAGE:6051007), complete cds
Length=6060
Score = 119 bits (60), Expect = 6e-25
Identities = 60/60 (100%), Gaps = 0/60 (0%) Strand=Plus/Plus
Query
1 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 2977 ACAGCTCGTCCTCCTTCCGAGCCTACCGGGCCGCCCTCTCGGAGGTGGAACCGCCGTGCA 3036
E-values: Effect of Database Size
The nr mRNA database has ~ 1.4 x 1010 letters
There are 4 possible nucleotides - ACGT
How many matches do we expect to find by chance?
Query = ‘A’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
A
A
A
A
A
AA A
A A
AA
AA
Expected number of matches = (1.4 x 1010) / 4 = ~1.2 x 108
Query = ‘AC’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
AC
AC
AC
AC
Expected number of matches = (1.4 x 1010) / (4x 4) = ~3.1 x 107
Query = ‘ACG’
CCGCCAGCTACGGTCACCGAGCTTCTCATTGCTCTTCCTAACAGTGTGATAGGCTAACCGTAATGGCG
ACG
Expected number of matches = (1.4 x 1010) / (4 x 4 x 4) = ~8.1 x 106
Query = ‘ACGTCGA…..CTGATTCG’ - 60-mer
Expected number of matches = (1.4 x 1010) / (4 x 4 x 4 x 4 … 60 times )
= (1.4 x 1010) / 1036
= 1.4 x 10-26
E-value = 1.4 x 10-26
E-values: Effect of Database Size
5.0
x108 letters
1.4 x 1010 letters
RefSeq
30 x bigger
nr
E-value =
5.0e-28
BLAST the same sequence
against each
E-value = 1.4e-26
The database was ~30 times bigger and so the E-value was ~30 times bigger.
The E-value is simply dependent on database size.
Why were the values different?
Our calculated E-value for searching against the RefSeq mRNA database was
5.0 x 10-28.
But our actual BLAST search at NCBI gave a value of
2.0 x 10-26 - about 40x larger - why is this?
Gapped alignments
If we were expecting N matches for a query sequence ‘ACGTACGTACGT’,
imagine what would happen to N if we allowed gaps in our matches.
ACGTAC?GTACGT
This would now give us additional possible alignments that would meet our
‘match’ criteria:
ACGTACGTACGT
||||||||||||
ACGTACGTACGT
ACGTACAGTACGT
|||||| ||||||
ACGTAC-GTACGT
ACGTACCGTACGT
|||||| ||||||
ACGTAC-GTACGT
etc.
We will expect many more matches in a given database, if we allow our
alignments to have gaps. The E-value will be larger.
E-values: Effect of Query Length
BLAST 500 nt sequence against a database
>sequence
ACTAGTCTAGCTAGACATCG
ATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCG
ATCGATCGCGCATCGATCGT
CTAGATCGATCGCTCGCTGT
GTAGATAGATCGGCGATAGA
BLASTn
database
Get a full length match with sequence
XYZ at an E-value = 5.0e-160
BLAST half of the same sequence against the same database
>sequence
ACTAGTCTAGCTAGACATCG
ATCGATGATGCTACACAGAT
AGACGATAGATAGTAAGTCG
BLASTn
database
Get a match with sequence XYZ
again, but at an E-value = 5.0e-80
Biologically it’s the same match! Does it mean we are
any less sure that this match didn’t occur by chance?
The E-value is simply dependent on match length.
Why not just use % identity?
At some levels this a good question.
But consider two very different searches, both of which give a 75% identity match
Query1 was 60 nt long:
CGGAGCTCAGGGCTTAACGACTGATATCTCCGCGCATGTCGAGAAACGATACAGCCAGCG
||||||||||| || | || | || || |||| | | | |||||| | ||||||||||
CGGAGCTCAGGCCTCACCGGCGGACATGTCCGGGAAAATAGAGAAAGCAGACAGCCAGCG
Which would have an E-value ~ 5.0 x 10-19
And, Query2 only 16 nt long:
ACGTACGTACGTACGT
||| || | |||| ||
ACGCACCTTCGTAGGT
Which would have an E-value ~ 30
And intuitively we feel we would expect to see that sort of number of matches in
the database just by chance…
So what’s the real problem?
Basically you are usually trying to answer the question:
Can I find the ortholog of my gene in some other species, so that I can
work out what it might be doing in my organism?
And the difficulty is because BLAST does not set out to address questions like
orthology. BLAST only tells you about sequence similarity, with some notion of
how likely a similarity is to have arisen by chance, based on some general
biological principles.
You will always have to add in your own knowledge of biology, and exactly what
your query sequence was, and how it is related to your matching sequences. In
particular whether the degree of similarity matches up to the supposed
evolutionary distance between the two species. You will also need to take into
account the length of the reported match, compared to the lengths of your query
and matched sequences. And of course the size of the database.
Are there any useful guidelines though?
Rules of Thumb
How good does an E-value have to be before we might even think we have an
ortholog?
 larger/worse
E-values
10-5
10-10
smaller/better 
10-40
10-100
0.0
fantasy
borderline
encouraging
pretty good
can’t get better
But note that in some gene families with closely related members you can get an
E-value of 0.0 for several different matches, and then % identity may be more
sensitive. Also bear in mind, in cases like this, that ideas of ‘functional’ orthology
may break down, with more than one locus producing identical proteins which
share the same function…
Protein BLAST
It’s (nearly) always better to make comparisons at the amino acid level between
protein sequences than the DNA level, because there are many different DNA
sequences that can give exactly the same protein sequence.
Does this cause us to treat expected values any differently?
If we follow the argument as before then for an exact match of a 20 amino acid
sequence in the RefSeq protein database, each additional amino acid will reduce
the E-value by 1/20th (there are 20 different amino acids). And as there are
347,895,532 letters in that database,
E-value = ~3.5 x 108 / (20 x 20 x 20 …20 times) = ~3.5 x 10-18.
But this is what we get of we run the blast at NCBI:
Score = 43.1 bits (100), Expect = 8e-04
Identities = 20/20 (100%), Positives = 20/20 (100%), Gaps = 0/20 (0%)
Frame = +3
Query
3
Sbjct
972
SSSSFRAYRAALSEVEPPCI
SSSSFRAYRAALSEVEPPCI
SSSSFRAYRAALSEVEPPCI
62
991
Really too big a discrepancy to easily explain with hand waving…
Amino Acid Substitutions
In fact we need to take into
account both amino acid
substitutability, as well as, as
before, allowing gapped
alignments. On average any
residue can be substituted for
by about 2 others, so each
position has about 1/7th
chance of ‘matching’ rather
than 1/20th.
A
C
F
G
I
L
M
P
V
W
S
LWY
LMV
IMFV
ILV
ILM
FY
N
Q
S
T
Y
DHS
REHK
ANT
S
HFW
H
K
R
NQY
RQE
QK
D
E
NE
DQK
So now we get:
E-value = ~3.5 x 108 / (7 x 7 x 7 …20 times) = ~4.4 x 10-9,
which is much closer to the actual BLAST value.
Exercises
Go to the file random-DNA-sequences.html, randomly select one of the
20 randomly generated nucleotide sequences, and do a BLASTx
(translated DNA->protein) at NCBI against the nr protein database.
Did you find any ‘significant’ hits?
Repeat with a second sequence.
What conclusions might you draw from this exercise?
Try the same sequence(s) against the nr nucleotide database.
Is there any general difference?