search_2009

download report

Transcript search_2009

Sequence based searching
Lesson 7
Based on a presentation by Irit Gat-Viks
Based on presentation by Amir Mitchel,
Introduction to bioinformatics course,
Bioinformatics unit, Tel Aviv University.
Reminder – Importance of
Homology
Use a sequence as a search query in order to find homologous
sequences in a database.
Homology – similarity between sequences that results from a
common ancestor.
Basic Assumption:
Sequence homology → similar structure/function
Why ?
• Characterizing an ORF.
• Finding duplicate genes in the same organism
(known function, variants)
• Finding homologues genes in other organisms (phylogeny, known
function)
Study a sequence through homologs
Identity Similarity Homology
Query= uniprot|Q9UP52|TFR2_HUMAN Transferrin receptor protein 2
>gi|20140567|sp|Q07891|TFR1_CRIGR
(Trfr)
(TfR2).
Transferrin receptor protein 1 (TfR1) (TR) (TfR)
Length = 757
Score = 540 bits (1392), Expect = e-152
Identities = 305/727 (41%), Positives = 412/727 (56%), Gaps = 52/727 (7%)
Query: 87 LTALLIFTGAFLLGYVAF--RGSCQAC--------GDSVLVVSEDVNYEPDLDFHQGRLY 136
+ ++ F
F++GY+ + R
+ C
G+S ++ E++
RLY
Sbjct: 71 IAVVIFFLIGFMIGYLGYCKRTEQKDCVRLAETETGNSEIIQEENIP-------QSSRLY 123
Query: 137 WSDLQAMFLQFLGEGRLEDTIRQTSLRERVAGSAGMAALTQDIRAALSRQKLDHVWTDTH 196
W+DL+ + + L
DTI+Q S
R AGS
L
I
KL VW D H
Sbjct: 124 WADLKKLLSEKLDAIEFTDTIKQLSQTSREAGSQKDENLAYYIENQFRDFKLSKVWRDEH 183
Query: 197 YVGLQFPDPAHPNTLHWVDEAGKVGEQLPLEDPDVYCPYSAIGNVTGELVYAHYGRPEDL 256
YV +Q
A N + ++ G
+
+E+P Y YS
V+G+L++A++G +D
Sbjct: 184 YVKIQVKGSAAQNAVTIINVNG---DSDLVENPGGYVAYSKATTVSGKLIHANFGTKKDF 240
Query: 257 QDLRAXXXXXXXXXXXXXXXXISFAQKVTNAQDFGAQGVLIYPEPADFSQDPPKPSLSSQ 316
+DL+
I+FA+KV NAQ F A GVLIY +
F
P + ++
Sbjct: 241 EDLK---YPVNGSLVIVRAGKITFAEKVANAQSFNAIGVLIYMDQTKF------PVVEAE 291
Query: 317 QAVYGHVHLGTGDPYTPGFPSFNQTQFPPVASSGLPSIPAQPISADIASRLLRKLKGPVA 376
+++GH HLGTGDPYTPGFPSFN TQFPP SSGLPSIP Q IS
A +L + ++
Sbjct: 292 LSLFGHAHLGTGDPYTPGFPSFNHTQFPPSQSSGLPSIPVQTISRKAAEKLFQNMETNCP 351
• For Proteins, finding distant relatives is a difficult task.
• Distant protein family members, may share <20% amino acid identity(!).
>gi|3582021|emb|CAA70575.1| cytochrome P450 [Nepeta racemosa]
Length = 509
Score = 405 bits (1043), Expect = e-111
Identities = 94/479 (19%), Positives = 192/479 (40%), Gaps = 35/479 (7%)
Query: 61
NLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVIDPEDVALLFKSEGPNPER
NL+
G + H +
++YGP+ +
G+V +
PE
+ K++
Sbjct: 45 NLHQL----GLY-PHRYLQSLSRRYGPLMQLHFGSVPVLVASSPEAAREIMKNQDIVFSN
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Query: 297 -----DYRGMLYRLLGDSK----MSFEDIKANVTEMLAGGVDTTSMTLQWHLYEMARNLK
D+ +L +
++K
+ + +KA + +M
G DTT+ L+W + E+ +N +
Sbjct: 271 GDGALDFVDILLQFQRENKNRSPVEDDTVKALILDMFVAGTDTTATALEWAVAELIKNPR
120
99
347
330
Query: 348 VQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLH-PISVTLQRYLVNDLVLRD 406
L+ EV
L+ +P LKASIKE+LRLH P+ + + R
D +
Sbjct: 331 AMKRLQNEVREVAGSKAEIEEEDLEKMPYLKASIKESLRLHVPVVLLVPRESTRDTNVLG 390
Query: 407 YMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSK--DKNITYFRNLGFGWGVRQCLG 464
Y I + T V + +A+ R+P+ + +PE F P R+L
D
+F L FG G R C G
Sbjct: 391 YDIASGTRVLINAWAIARDPSVWENPEEFLPERFLDSSIDYKGLHFELLPFGAGRRGCPG 450
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Query types
DNA vs. Protein
• The sequence query can be a nucleotide sequence or an amino acid
sequence.
• The search is preformed against a nucleotide or amino acid database
Which search is preferable?
1. The genetic code is redundant. Some amino acids are coded by more
than one codon. Therefore, the DNA sequence can change while the
amino acid sequence will remain the same.
2. Nucleotides: a four letter alphabet. Amino acids: a twenty letter
alphabet. Two random DNA sequences will share on average 25% of
identity. Two random protein sequences will share on average 5% of
identity.
3. Protein comparison matrices are much more sensitive than those for
DNA, i.e., similarity relationships are defined between two amino
acids (PAM/Blosum).
4. DNA databases are much larger, meaning more random hits.
Using the amino acid sequence is
preferable for homology search.
1. Protein sequence comparisons typically double
the evolutionary look-back time over DNA
sequence comparisons.
2. Evolutionary distant proteins will exhibit a high
similarity rather than a high identity.
3. Hits can exhibit a long alignment (homology) or a
short alignment (conserved domains).
Why use a nucleotide sequence after all?
Query type
•
•
The sequence query can be a nucleotide sequence or an amino acid sequence.
But … we can translate the query sequence!
The search is performed against a nucleotide or amino acid database.
But … we can use translated databases! (e.g., trEMBL)
All types of searches
are possible.
•
Query:
DNA
Protein
Database:
DNA
Protein
Nucleotide query can be translated and searched against protein databases:
1. Translate all reading frames (3 + 3)
2. Find long ORF.
•
Amino acid query can be back-translated to and searched against nucleotide
databases?
1. During translation we lose information.
2. A single amino acid sequence can be back-translated to many possible nucleotide
sequences .
Query types
1. amino acid query against protein database (blastp)
– identifying a protein sequence
– finding similar sequences in protein databases.
2. nucleotide query against nucleotide database (blastn)
– In non-coding regions (no ORF found)- Identify the query sequence or find similar
sequences.
– Find primer binding sites or map short contiguous motifs
3. compares translated nucleotide query against protein database. (blastx)
– Useful when the query include a coding region, and we try to find homologous
proteins.
– Used extensively in analyzing EST sequences. This search is more sensitive than
nucleotide blast since the comparison is performed at the protein level.
4. protein query against translated nucleotide database (tblastn)
– useful for finding protein homologs in unnannotated nucleotide data of coding
regions (e.g., ESTs, draft genome records (HTG)).
5. translated nucleotide query against translated nucleotide database.
(tblastz)
– Useful for identifying novel genes in error prone query sequences.
– Used for identifying potential proteins encoded by single pass read ESTs.
* six-frame in all translations!!!
Why Heuristics ?
• Motivation:
– Dynamic programming guarantees an optimal
solution & is efficient, but
– Not fast enough when searching a database of size
~109, with a query of length 200-500bp
• Solutions:
– Implement on hardware. (COMPUGEN)
– Parallel hardware. (MASSPAR)
– Use faster heuristic algorithms.
• Common Heuristics: FASTA, BLAST
CG © Ron Shamir, 06
9
Key observations
• Even O(m+n) time would be problematic when
db size is huge
• Numerous queries are run on the same db
• Preprocessing of the db is desirable
• Substitutions are much more likely than indels
• Homologous sequences contain many matches
CG © Ron Shamir, 06
10
a
g
g
t
c
c
g
t
t
c
CG © Ron Shamir, 06
a
*
a
*
g
t
c
c
c
*
*
g
t
*
*
g
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
Alignment Dot-Plot Matrix
11
BLAST
• Preprocess
– Low complexity regions are removed
– A dictionary for K-tuple words is prepared for the query
sequence and the database. Protein 3 letter words, DNA
4-6 or even 11 letter words.
• Searches for K-tuple words and find database
records with common words. Words can be similar,
not only identical.
–
–
–
–
Identity - CAT : CAT
Similarity – CAT : CAT, CAR, HAT …
But even CAT : ZTX can be similar
For each three letter word there are at most 203 similar
words.
– Similar words are only the ones that have a minimum
cut-off score (T).
BLAST Stage I
• Find matching word pairs
• Extend word pairs as much as possible,
i.e., as long as the total weight increases
• Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHAVEADREAMESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEWASNINETEEN
BLAST Stage II
• Try to connect HSPs by aligning the
sequences in between them:
THEFIRSTLINIHAVEADREA____M_ESIRPATRICKREAD
INVIEIAMDEADMEATTNAMHEW___ASNINETEEN
BLAST
Versions of the program
[t]BLAST[x/n/p]
t
: Translate a DNA database in all 6 reading frames
for comparison with a Protein query.
x
: Translate a nucleotide query in all 6 reading frames
for comparison with a Protein database.
p
: Comparison is against a Protein database.
n
: Comparison is against a Nucleotide database.
Masking low complexity
• There is one frequent case where the random models and therefore the
statistics discussed here break down: regions with highly biased amino acid
composition ("low complexity" regions).
• Alignments of two regions with similarly biased composition may achieve
very high scores that owe virtually nothing to residue order but are due to
segment composition.
• Usually generated by slippage and thus not interesting. The BLAST
programs employ the SEG (protein) DUST(DNA) algorithm to filter low
complexity regions from proteins before executing a database search.
•Masking is practiced on the query sequence only, not on the database sequences!
BLAST
http://www.ncbi.nlm.nih.gov/
BLAST
http://www.ncbi.nlm.nih.gov/BLAST/
BLAST
…or even construct my own
BLAST
searchable database by an
Entrez query
Mask According to the case
within the query sequence.
I can limit my search to a
selected organism
Mask for lookup table hit
search stage, but NOT for the
hit extension stage.
Filter Low Complexity regions
by SEG or DUST
BLAST
Lineage Report
root
. Bilateria
[animals]
. . Coelomata
[animals]
. . . Euteleostomi [vertebrates]
. . . . Tetrapoda
[vertebrates]
. . . . . Eutheria
[mammals]
. . [mammals]
. . . Homo sapiens
(man) acid
-----------571
18 hitsacid
[mammals]
retino
18 .hits
retinoic
induced 3;
retinoic
responsive gene
[H
. . [mammals]
. . . Mus musculus
(mouse)
..........
15 hits
[mammals]
retino
15 .hits
retinoic
acid
inducible 432
protein
3 [Mus
musculus]
. . [mammals]
. . . Rattus norvegicus
(brown
rat) acid
. 411
5 hits protein
[mammals]
simila
5 .hits
similar to
retinoic
inducible
3 [Rattus
norv
. . [amphibians]
. . Xenopus laevis
(clawed
frog) [Xenopus
---- 216laevis]
1 hit [amphibians]
MGC687
1 .hit
MGC68729
protein
. . [bony
. Takifugu
rubripes
(torafugu)
-----40 4rubripes]
hits [bony fishes]
pherom
4 .hits
fishes]
pheromone
receptor
[Takifugu
. . [flies]
Drosophila melanogaster
------------48 4 hits [flies]
CG8285
4 .hits
CG8285-PA
[Drosophila melanogaster]
>gi|2827758|sp|P22815
. . [flies]
Drosophila virilis
..................
39 precursor
1 hit [flies]
Bride
1 .hit
Bride
of sevenless protein
>gi|1079166|pir||A47
. . [flies]
Anopheles gambiae
str. PEST .........
38 1 gambiae]
hit [flies]
ENSANG
1 .hit
ENSANGP00000013404
[Anopheles
>gi|21296536|gb|EA
. Caenorhabditis
---------------2 hitsto[nematodes]
calciu
2 .hits
[nematodes] elegans
calcium-sensing
receptor,41similar
human metabotropic
g
environmental
sequence
-----------------40 2 hits [unclassified] unknow
2 .hits
[unclassified]
unknown
[environmental sequence]
BLAST
BLAST
BLAST
BLAST statistics
• Theory of Karlin, Altschul, and Dembo on the
distribution of the MSP of score at random
• Define parameters K,  (depending on AA
distribution)
• Pr (finding a pair of score >S in comparing two
random seqs of length m, n) = 1 – e-y where
Y=Kmn e-s
• Extreme value dist
(or Gumbell dist)
• Allow the calculated
choice of smallest C
CG © Ron Shamir, 06
26
Improved BLAST - Altschul et al. 97
Two Hit Version:
• Need two non-overlapping w-long words with:
– score  t, each
– on same diagonal
– within distance  A
• Allow local alignment with indels (merging
local alignments from different diagonals)
• Extends less starting points!
CG © Ron Shamir, 06
27
The statistics of sequence
similarity scores
• Bits score – A score for the alignments according to the
number of similarities, identities, etc.
• Expected-score (E-value) (of an alignment having a
score S):
The number of times one expects to find alignments with
a score >= S of a random sequence Vs. a random
database. (having the same lengths and compositions).
The closer the e-value approaches zero, the greater the
confidence that the match is real (from zero to one).
BLAST
What about:
•Short sequences?
•large sequences and queries?
Short sequences:
Parameter settings for standard blastp and
"Search for short and nearly exact matches"
Word
Size
SEG
Filter
Standard Blastp
3
On
10
BLOSUM62
Search for short and
nearly exact matches
2
Off
20000
PAM30
Program
Expect
Score Matrix
Value
Parameter settings for standard blastn and
"Search for short and nearly exact matches"
Word
Size
DUST
Filter
Expect
Value
Standard blastn
11
On
10
Search for short nearly
exact matches
7
Off
1000
Program
PAM vs. BLUSOM- reminder
– Different BLOSUM matrices are derived from blocks with different identity
percentage. (e.g., blosum62 is derived from an alignment of sequences
that share at least 62% identity.) Larger n  smaller evolutionary distance.
– Single PAM was constructed from at least 85% identity dataset. Different
PAM matrices were computationally derived from it. Larger n  larger
evolutionary distance
Observed %
Difference
1
10
20
30
40
50
60
70
80
Evolutionary
distance (PAM)
1
11
23
38
56
80
120 120
159
250 250
BLOSUM
99
90
80
70
60
50
40
30
20
62
How to filter results from large
sequences and queries?
1. Some sequences contain large regions of ALU repeats. In this case
you can select the "Human Repeat" filtering option on the main BLAST
search page. This will mask repeat regions which generate a large
number of biologically uninteresting hits to the databases.
2. Increase the Word Size to 20 - 25. With a default Word Size of 7,
limiting the number small initial fragments to be extended to HSPs.
3. Decrease the Expect value to 1.0 or lower  eliminates many hits
and concentrate on results which are more likely to contain large
coding regions and genomic fragments.
4. Processing multiple query sequences in one run can be much faster
than processing them with separate runs because the database is
scanned only 1 time for the entire set of queries.
PSI-BLAST (Position-Specific Iterated (PSI)-BLAST )
Sensitive protein-protein similarity searches.
• The most sensitive BLAST program, making it useful for
finding very distantly related proteins.
• Use PSI-BLAST when your standard protein-protein BLAST
search failed to find significant hits.
Algorithm:
• The first round of PSI-BLAST is a standard protein-protein
BLAST search. The program builds a position-specific
scoring matrix (PSSM or profile) from an alignment of the
sequences returned with Expect values better (lower) than
the inclusion threshold (default=0.005).
• The PSSM will be used to evaluate the alignment in the next
iteration of search. Any new database hits below the
inclusion finding very distantly related proteins.