Database homology searching

Download Report

Transcript Database homology searching

School B&I TCD Bioinformatics
Database homology searching
May 2010
Why search a Database
• To find homologous sequences to your unknown
to determine function
• To find other related sequences to do
evolutionary studies (trees) or to make
specialised database (nematode 16sRNA)
• To find the mouse or E.coli homolog of your
gene of interest
• To find genes in a newly sequenced genome
• To predict 3-D structure (blast vs PDB)
BLAST
• For many molecular biologists
• a Blast search
– with a DNA sequence
– at NCBI
– accepting default parameters
• IS bioinformatics
– 70% of searches at NCBI at blastN
BLAST
• Basic local alignment search tool
• More or less unchanged for 20 years
• Extraordinarily quick:
– Submit seq via satellite to Bethesda MD
– Crunch thro 400 GB of sequence
– Return results via satellite
– ALL in 30 seconds
• Available at other sites (EBI, ExPaSy etc.)
• with better response times, fewer conditions
NCBI penalties
•
•
•
•
•
1st request: current time
2nd request: current time + 60 seconds
3rd request: current time + 120 seconds
4th request: current time + 180 seconds
5th request: current time + 240 second
• DON’T submit multiple simultaneous
queries
Reliable servers
• http://www.ch.embnet.org/software/bBLAST.html
• http://www.ch.embnet.org/software/aBLAST.html
– Basic and advanced SwissBlast
• http://www.ebi.ac.uk/blast2/index.html
– EBI (Cambridge, UK) NCBI Blast
• http://www.ncbi.nlm.nih.gov/blast/
– Not the only Blast server!
BLAST varieties
• blastn: searches a DNA sequence against a DNA database like
EMBL, Genbank, or dbEST. also megablast for exact matches
• blastp: searches a protein sequence against a protein database
such as UniProt, or "nr" a non-redundant database which ideally
contains one copy of every available sequence.
Then you have:
• blastx: searches a DNA sequence (translated in all six reading
frames) against a protein database.
• tblastn: searches a protein sequence against a DNA database
(translated in all six reading frames) – essential for searching EST
databases.
and in the interests of completeness there is:
• tblastx: searches a DNA sequence (translated in all six reading
frames) against a DNA database (translated in all six reading
frames).
finally
• Psi-blast an iterative process using position specific subst matrix
PSSM to find distantly related sequences
Algorithm
• Five steps
– Break the query seq into short “words” – typically 3
consecutive residues for protein
– Search the database for (nearly) exact matches to
these words
– Extend match “hits” out on either side until score
stops going up – these are HSPs (high scoring
segment pairs)
– Sort the HSPs by some “optimum” criterion
– Significant hits are then formally scored, aligned and
displayed
Alternatives to BLAST
• FASTA
– A little slower than Blast
– More sensitive (so recommended) for DNA to
DNA searches
• Smith-Waterman (Blitz)
– Much slower than Blast (20x slower)
– Much more sensitive
• But Blast is standard because it gives a
“good enough” answer quickly.
Blast Blast at NCBI
Paste your query sequence here
Parallel search in Conserved domain DB
Input sequence parameters
Optional parameters to change include
• Parameters that alter the hits found
–
–
–
–
–
Database searched
Word size
Substitution matrix
Gap penalties
Low complexity masking
• Parameters that alter the results delivered
–
–
–
–
Expectation cut-off
Limit by organism or taxonomic group
Number of hits reported
Number of alignments shown
Database
• If query is a coding gene: translate and search
protein database
• Search PDB if you want a 3-D structure
• Search NR if you want any hit
• Search UniProt to know what the hits are
• Search dbEST to know if your sequence is
expressed
• UniProt90: no seq is more than 90% ident to any
other (for an uncluttered tree) also UniProt50
Word size
• Default at NCBI Blast
– 11 for DNA; option of 7 and 5
– 3 for protein; option of 2
• Increasing wordsize will speed search
… but will lose sensitivity
– It will miss some useful but distant hits
• Decreasing wordsize will be more
sensitive … but take longer
Substitution matrix
• By default BLAST uses BLOSUM62
• Can change this
– Blosum90 mirrors changes in closely related
sequences
– Blosum30 changes in distant sequences
• Should run 3 blast searches with different
substitution matrices.
Changes 30-62-90 not linear
Substitution matrices
Top left part of a BLOSUM 90 matrix
A R N D C Q E G H
A 5 -2 -2 -3 -1 -1 -1 0 -2
R -2 6 -1 -3 -5 1 -1 -3 0
N -2 -1 7 1 -4 0 -1 -1 0
D -3 -3 1 7 -5 -1 1 -2 -2
C -1 -5 -4 -5 9 -4 -6 -4 -5
Q -1 1 0 -1 -4 7 2 -3 1
E -1 -1 -1 1 -6 2 6 -3 -1
G 0 -3 -1 -2 -4 -3 -3 6 -3
H -2 0 0 -2 -5 1 -1 -3 8
I -2 -4 -4 -5 -2 -4 -4 -5 -4
L -2 -3 -4 -5 -2 -3 -4 -5 -4
I
-2
-4
-4
-5
-2
-4
-4
-5
-4
5
1
L
-2
-3
-4
-5
-2
-3
-4
-5
-4
1
5
Positive offdiagonals are
similar
Reality of matrix change
• Query:
•
• Sbjct:
GHDEICI
GH + C
GHACNCG
– Scores 39 with Blosum30; 5 with Blosum90
• Query:
•
• Sbjct:
HEQCRLEN
+E
LEN
QENAHLEN
– Scores 19 with Blosum30; 24 with Blosum90
• So GHDEICIHEQCRLEN will find different hits
Matrix families
• PAM – point accepted mutation
– Margaret Dayhoff 1970s (before Genbank)
• BLOSUM – based on aligned blocks
– Henikoff and Henikoff 1992
– Blosum 62 default (based on aligned seqs
62% identical – don’t ask why 62: it works)
• Low PAM = High Blosum
– PAM 250 == Blosum 30
– PAM 30 == Blosum 90
Gap penalty
• Original Blast was gap-free
– Because gapped aligns much more CPU
• Now can insert “affine” gaps
– Gap open 10; gap extend 1
• Raise gap penalty to discourage gaps
– Preferentially gets closely related hits
• Lower gap penalty for sensitive search for
distant relatives
Low Complexity Masking
• Seg masks low-complexity regions
– “too many” serines, prolines etc.
• What a masked sequence looks like:
>P04729 Wheat gamma gliadin
MKTFLVFALIAVVATSAIAQMETSCISGLERPWQQQPLPPQQSFSQQPPFSQQQQQPLPQ
QPSFSQQQPPFSQQQPILSQQPPFSQQQQPVLPQQSPFSQQQQLVLPPQQQQQQLVQQQI
PIVQPSVLQQLNPCKVFLQQQCSPVAMPQRLARSQMWQQSSCHVMQQQCCQQLQQIPEQS
RYEAIRAIIYSIILQEQQQGFVQPQQQQPQQSGQGVSQSQQQSQQQLGQCSFQQPQQQLG
QQPQQQQQQQVLQGTFLQPHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG
VGAY*
and after low complexity masking:
>P04729 SEG low-complexity masked
MKTFLVFALIAVVATSAIAQMETSCISGLERPWXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXLNPCKVFLQQQCSPVAMPQRLARSQMWXXXXXXXXXXXXXXXXXXXXXXX
RYEAIRAIIYSIIXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXHQIAHLEAVTSIALRTLPTMCSVNVPLYSATTSVPFGVGTG
VGAY*
• Xnu masks repeats (from database of common
repeats)
Masking
• Check if masking on/off by default (differs at
sites)
• Run 2 searches with masking on, masking off
ANNYway
• Masking by hand:
– If you know about the DNA binding domain already
• Which is really common and will occupy the top 100 hits
against any database
– Replace the region with Xs
– Re-run blast to find other motifs/domains/information
– NCBI blast allows select subsequences
• DUST masks low info DNA sequences
– Like polyA tails
Advanced
Expectation cut-off
• E value
– Expected number of chance hits
• Given the database searched
• Query HSP length
• Related to probability
• Default is 10: “expect to find 10 hits as
good as this by chance alone”
• E values less than 10-4 unreliable
• So different from p < 0.05
• For short seqs crank E up to 100 or 1000
Data deluge – hits delivered
• Default suits general purpose
• V = 50 num of “hits” one line descriptions
• B = 50 num of alignments between query
and hit which are displayed
Swiss Blast
EBI blast
Limit by taxon/organism
Also search at www.ensembl.org for “genomed” organisms
Database choice
NR for getting aNNy hit
swissprot or refseq for getting hits that have annotation
Month for recent hits
The output
• Is in five parts
1. Admin – date, size of database, id of
query
2. Graphic display of query and hits
3. List of hits with links to database and
to…
4. …alignments (may be > 1 HSP per hit)
5. More admin, including errors/warnings
Notes!
• Record the top admin stuff. Your search will be
different
–
–
–
–
Next week
On a different server
With different DB
With different input parameters
• Mouse-over any graphic display
– Shows domain structure
– Shows if hit is global or local
• Read the bottom admin stuff
Blast output
sp|P06471|HOR3_HORVU B3-HORDEIN
Length = 264
Score = 62.5 bits (149), Expect = 1e-09
Identities = 32/63 (50%), Positives = 38/63 (59%)
Query: 131 LNPCARSQMWXXXXXXXXXXXXXXXXXXXXXXXRYEAIY
LNPCARSQM
R+EA+Y
Sbjct: 111 LNPCARSQMLQQSSCHVLQQQCCQQLPQIPEQLRHEAVY
Query: 191 SII 193
Low-complexity masked region
SI+
What sort of residues masked in
Sbjct: 171 SIV 173
The query sequence??
May be more HSPs for same hit
If big deletion in either seq
General blast protocol
1.
2.
3.
4.
5.
Find server, choose DB, paste seq, GO
Rerun search with/out masking
Rerun search with two diff subs matrix
2 x 3 for six searches
If top N hits all same family/domain then
XXX this region and resubmit
6. LOOK at the results; esp strange ones
7. Limit results by organism
Blast notes
• Twilight zone <25% protein <70% NA
• Because two sequences give high blast scores
to a third, doesn’t mean they are related
– NNNNDOMAINANNNNDOMAINBNNN
– NNNNDAMIANANNNNNNNNNNNNNN
– NNNNNNNNNNNNNNNDIMIAMBNNN
• E-value vs % ID. >10-4 unreliable
• Bit score <50 unreliable
PSI-BLAST
• Uses a PSSM rather than BloSum/PAM
• Iterative …
– Can find very distant relatives
– …so deep insight
• BUT can iterate off with the fairies