EMBL-EBI Powerpoint Presentation
Download
Report
Transcript EMBL-EBI Powerpoint Presentation
Sequence Searching Strategies
A guide to efficient
database searching
Jennifer McDowall
EMBL-EBI
EBI is an Outstation of the European Molecular Biology Laboratory.
Overview
• Know the data
• The Toolbox
• Search Guidelines
Sequence Searching Tools
Know the data
Sequence Searching Tools
Know the Data…
• Many databases, each getting bigger
• Efficient searching requires knowledge of
what data is stored in a database
Don’t assume annotation can be transferred because of
a good match
• Databases can contain errors
• Data can change
Deletions, sequence modifications
Daily updates, identifier changes…
Sequence Searching Tools
Know the Data…Nucleotides
EMBL-Bank
• Divided into classes and divisions...
• Release and updates
• Supplementary sets: EMBL-CDS, EMBL-MGA
Specialist databases
•
•
•
•
Immunoglobulins: IMGT/HLA, IMGT/LIGM…
Alternative splicing: ASTD…
Completed genomes: Ensembl, Integr8…
Variation: HGVBase, dbSNP…
Sequence Searching Tools
Know the Data…Proteins
UniProt
• Divided into 3 sections
• Release and updates
Specialist databases
•
•
•
•
•
•
Sequence from structure: PDB, SGT…
Immunoglobulins: IMGT/HLA…
Alternative splicing: ASTD…
Completed proteomes: Ensembl, Integr8…
Protein interactions: IntAct
Patent proteins: EPO, USTPO, JPO, KIPO
Sequence Searching Tools
Homology
• Homologous sequences
share a common origin
vs.
Similarity
• Similarity is a measure of the
“likeness” of 2 sequences
• Presence of similar features • Uses statistics to determine
because of common decent
‘significance’ of similarity
• Statistically significant
similar sequences are
considered ‘homologous’
If significant, considered to be
homologous
If not significant uncertain
• Similarity does not
• Homology is like pregnancy:
necessarily reflect homology
either one is or one isn’t!
(Gribskov – 1999)
Sequence Searching Tools
The Toolbox
Sequence Searching Tools
Sequence Similarity Search Tools
Sequence Searching Tools
Sequence Similarity Search Tools
BLAST
FASTA
Iterative
searches
Sequence Searching Tools
Sequence Similarity Search Tools
BLAST
NCBI-BLAST
Wu-BLAST
FASTA
FASTA
SSEARCH
GGSEARCH
GLSEARCH
Iterative
search
Sequence Searching Tools
PSI-BLAST
PSI-SEARCH
Tools: NCBI BLAST
• BLASTP:
protein
• BLASTX:
DNA
• BLASTN:
DNA
Sequence Searching Tools
Protein DB
translate
Protein DB
DNA DB
Tools: NCBI BLAST
Nucleotide search
Sequence Searching Tools
Protein search
Tools: Wu-BLAST
• BLASTP:
protein
• BLASTN:
DNA
• BLASTX:
DNA translate
• TBLASTN:
protein
• TBLASTX:
DNA translate
Sequence Searching Tools
Protein DB
DNA DB
Protein DB
Translated DNA DB
Translated DNA DB
Tools: Wu-BLAST
Nucleotide search
Sequence Searching Tools
Protein search
Tools: FASTA
• FASTA:
• FASTX/Y:
protein
Protein DB or DNA
DNA translate
• SSEARCH:
protein
Protein DB
Protein DB or DNA
• GLSEARCH:
protein
Protein DB
• GGSEARCH:
protein
Protein DB
Sequence Searching Tools
DNA DB
DNA DB
Tools: FASTA
Nucleotide search
Sequence Searching Tools
Protein search
Query length
When to use which search?
NCBI BLAST
WU-BLAST
FASTA
PSI-SEARCH
Database size
Sequence Searching Tools
Speed of search
When to use which search?
NCBI BLAST
WU-BLAST
FASTA
PSI-SEARCH
PDB Swiss-Prot UniRef50 UniRef 90 UniRef100 UniProtKB UniParc
Sequence Searching Tools
BLAST
v
FASTA
• Fast
• Slower
• Excels with proteins
• Excels with proteins and DNA
(better than BLASTN for DNA)
• Good local alignments +
short global alignments
• Produces S-W alignments
• Proteins: BLOSUM62(-11/-1)
alignments good at >85%
homology
• Proteins: BLOSUM50(-10/-2)
longer alignments good at
>70% homology
• Good at finding siblings
• Good at finding cousins
Sequence Searching Tools
GLSEARCH and GGSEARCH
GLSEARCH
Global (query) - Local (target DB) alignment
For global query alignments to domains/patterns in
target proteins
GGSEARCH
Global (query) – Global (target DB) alignment
Specific for searching short sequences against short targets
or for gene-to-gene comparisons
Sequence Searching Tools
What are global and local alignments?
Query
BLAST,
FASTA
Local - Local
|||||||| ||||||||||||||
Subject
Query
GLSEARCH
Global - Local
||||||||| |||||||||||||
Subject
Query
GGSEARCH
Global - Global
||||||||| |||||||||||||
Subject
Sequence Searching Tools
Tools: PSI
(Position Specific Iterated)
Search
Single Protein Sequence
Search Database
Estimate significance
iterate
Construct profile
Sequence Searching Tools
Generate Alignment
Tools: PSI Search
• PSI-BLAST
• Part of NCBI-BLAST package
• Automatic iteration service
• (PSSM = position specific scoring)
• Manually guided service
• PSI-SEARCH
• Combines: SSEARCH
(S&W algorithm)
• Manually guided service
Sequence Searching Tools
+
PSI-BLAST
(iterative strategy)
Let’s look at a FASTA search
Sequence Searching Tools
FASTA search
Step 1:
Select a database
Sequence Searching Tools
Which database to choose?
Database size is important
• ENA-Annotation >124 million
• UniParc (non-redundant) >24 million
• Databases grow every day
Sequence Searching Tools
How database size affects results
BLAST
>122M
sequence: gatctccatggg
>15M
>1.5M
0 hits
60 hits
489 hits
(>1000)
789.0
621.0
e-values of 100% matches
Sequence Searching Tools
>700,000
3 hits
0.96
How database size affects results
•
Search smallest database
likely to contain your sequence
•
Run multiple small searches
(can run all ENA/UniParc as well)
Sequence Searching Tools
Protein or nucleotide database search?
Two issues are worth considering…
Sequence Searching Tools
Protein or nucleotide database search?
Codon degeneracy
Amino acids
Ser
Ser
Nucleotides
UCU
AGC
Sequence Searching Tools
match
mismatch
Protein or nucleotide database search?
Over-simple match/mismatch scoring
Amino acids
Nucleotides
Sequence Searching Tools
highly
conserved
weakly
conserved
not
conserved
Ser
Ser
Ser
Ser
identical
Asn
similar
Leu
mismatch
UCU
UCU
UCU
AGC
mismatch
AAC
mismatch
CUC
mismatch
no
distinction
Protein or nucleotide database search?
Human CKS1B kinase
Protein
Nucleotide
Sequence Searching Tools
v
Zebra finch CDC28 kinase 1B
today
extinction of dinosaurs
Billions of years ago
Cambrian explosion
1
multicellular life
2
complex cells
3
photosynthesis
self-replicating cells
Protein comparisons
chemical evolution
4
identify homologues
formation of Earth
5-10x further back
Sequencein
Searching
Tools
evolution
prokaryotes
archaea
cyanobacteria
eukarytoes
plants
insects
land plants
fish
arthropods
reptiles
amphibians
flowers
birds
mammals
Protein or nucleotide search?
genus Homo
Identify homologs
searching:
Protein or nucleotide database search?
…therefore, searching a protein database could pull
out many more homologues than searching a
nucleotide database
…if you start with a nucleotide sequence, try
BLASTX or FASTX to translate your query sequence
and search a protein database
Sequence Searching Tools
FASTA search
Step 1:
Select a database
Step 2:
Paste sequence
Sequence Searching Tools
FASTA search
Step 1:
Select a database
Step 2:
Paste sequence
Step 3:
Choose parameters
Sequence Searching Tools
Choosing parameters
Sequence Searching Tools
Choosing parameters
User manual
provides help
Sequence Searching Tools
Which parameters to choose?
Matrix
Nucleotide search
‘simpler’ - only
match/mismatch
Protein search uses
substitution matrix tables
(based on amino acid
similarities and rate of change)
Sequence Searching Tools
Which parameters to choose?
Choice of
matrix
depends on:
1. strictness of search
2. length of query sequence
Sequence Searching Tools
QUERY LENGTH
>300
85-300
50-85
>300
85-300
35-85
<=35
<=10
MATRIX
BLOSUM50
BLOSUM62
BLOSUM80
PAM250
PAM120
MDM40
MDM20
MDM10
open
-10
-7
-16
-10
-16
-12
-22
-23
ext
-2
-1
-4
-2
-4
-2
-4
-4
Matrices - controlling search sensitivity
PAM (point accepted mutation)
• Based on global alignments of related proteins
• 1 substitution in 100 residues = PAM 1
• Other matrices extrapolated from PAM 1
• Model of evolutionary divergence
• Bias against rare substitutions (e.g. Cys → Tyr)
due to seed proteins
Sequence Searching Tools
Matrices - controlling search sensitivity
BLOSUM (BLOCKS amino-acid substitution)
• Based on protein domain alignments from the
BLOCKS database
• Observed substitutions in conserved domains
• Based on percentage identity, so BLOSUM50 is
deeper than BLOSUM80
Sequence Searching Tools
Effect of applying PAM10 -> 500 matrices to the human LDL receptor sequence
Sequence Searching Tools
10
100
200
300
400
500
Which parameters to choose?
Matrix - protein
Match/mismatch - nucleotide
FASTA
...instead have...
Sequence Searching Tools
BLAST
Match/mismatch scores
• “Reward” for match, “penalty” for mismatch
• Reward/penalty ratio:
Increase ratio to find more divergent sequences:
Ratio of 0.33 (1/-3) for 99% conserved
Ratio of 0.5 (1/-2) for 95% conserved
Ratio of 1 (1/-1) for 75% conserved
Sequence Searching Tools
Which parameters to choose?
gap penalties
Nucleotide search
gap open = -2 to -16
Gap extension = 0 to -4
Protein search
gap open = 0 to -23
Sequence Searching Tools
Gap extension = 0 to -8
Which parameters to choose?
Choice of
gap penalties
depends on:
1. strictness of search
• larger penalty fewer gaps
2. to match scoring matrix
Sequence Searching Tools
QUERY LENGTH
>300
85-300
50-85
>300
85-300
35-85
<=35
<=10
MATRIX
BLOSUM50
BLOSUM62
BLOSUM80
PAM250
PAM120
MDM40
MDM20
MDM10
open
-10
-7
-16
-10
-16
-12
-22
-23
ext
-2
-1
-4
-2
-4
-2
-4
-4
Which parameters to choose?
KTUP
(word length)
KTUP = ‘word-length’ of search
Large word-length less sensitive
faster
Nucleotide search
- fewer bases than amino
acids higher KTUP
Sequence Searching Tools
Which parameters to choose?
Do I mask my
sequence?
Low complexity regions should be
masked to avoid spurious results
•
CA repeats
•
poly-A tails
•
proline-rich regions
**Be careful you don’t mask what you are looking for
Sequence Searching Tools
Which parameters to choose?
What do I use
for short
sequences?
use strict matrices
use high gap penalties
avoid masking
allow high e-values
Sequence Searching Tools
Fasta results
Matches
section
Sequence Searching Tools
Fasta results
Do you use
e-values or
% identity?
Sequence Searching Tools
E-values or % identity?
E-value
Estimates statistical significance of matches
Default = 10 expect 10 matches found by chance
E() = <0.01 usually homologous
E() = 1-10 frequently related
% identity
% of positions identical between query
and match sequence
Sequence Searching Tools
E-values or % identity?
Sequence Searching Tools
Similar
Different
% identity scores
e-values
E-values or % identity?
Pattern of conservation
indicates homology
No evidence of
homology
Sequence Searching Tools
E-values or % identity?
Use e-values to estimate likelihood
two sequences are homologous
Sequence Searching Tools
Fasta results
Check length and
alignments in relation
to % identity
Sequence Searching Tools
Length of match
100% identity, but only over 124 / 663 (20%) of sequence
Sequence Searching Tools
Fasta results
Protein and
nucleotide
search results
have additional
annotation
Sequence Searching Tools
Fasta results
Related EMBL
nucleotide entries
Sequence Searching Tools
Fasta results
Related genomic
information
Sequence Searching Tools
Fasta results
Gene ontology (GO)
mapping for protein
Sequence Searching Tools
Fasta results
InterPro family/domain
classification
Sequence Searching Tools
Fasta results
Literature
Sequence Searching Tools
Fasta results
Functional
prediction on
ALL proteins
Sequence Searching Tools
Function Predictions using InterPro
Extract
information
Functional predictions:
InterPro family/domain
classifications
Visual comparison
find mis- or partial matches
Prioritize
results
Sequence Searching Tools
Function Predictions using InterPro
100% ID
• Matches:
• family signature
• 4 domain signatures
34% ID
• Matches:
• family signature
• 3 domain signatures
28% ID
• Matches:
• 1 domain signature
24% ID
Sequence Searching Tools
• Matches:
• No signatures
Sequence Search Summary
Navigate to search tools
Select search tool
Functional predictions
Result summary + annotation
Sequence Searching Tools
(1) Select database
(2) Copy/paste sequence
(4) Submit
(3) Set parameters
Search guidelines
Sequence Searching Tools
Search Guidelines: #1
• BEST:
•
2nd
•
3rd
AVTEGPIPEV
LFNYDAQYT
FGHKNSDKSS
BEST:
BEST:
• WORST:
Sequence Searching Tools
protein
ATGGCTAGC
TTCGACTAG
GCGATGCGA
ATGGCTAGC
TTCGACTAG
GCGATGCGA
AVTEGPIPEV
LFNYDAQYT
FGHKNSDKSS
Protein DB
DNA translate
DNA
protein
Protein DB
DNA DB
Translated DNA DB
FASTA
BLASTP
FASTX
BLASTX
FASTA
BLASTN
TFASTX
TBLASTN
Search Guidelines: #2
• Search smallest database likely to contain your
sequence
• Use sequence statistics (E-values) rather than
% identity or % similarity, as your primary criterion
for sequence homology
Sequence Searching Tools
Search Guidelines: #3
• Check statistics are likely to be accurate by
looking for highest scoring unrelated sequence
Examine the histograms
Use programs such as prss3 to confirm the E-values
Searching with shuffled sequences (use MLE/Shuffle in
FASTA) which should have an E-value ~1.0
Sequence Searching Tools
Search Guidelines: #4
• Consider searches with different gap penalties and
other scoring matrices
Use shallower matrices and/or more stringent gaps to
uncover or force out relationships in partial sequences
Adjust scoring matrix to suit length of query sequence
Adjust gap penalties to match scoring matrix
Sequence Searching Tools
QUERY LENGTH
>300
85-300
50-85
>300
85-300
35-85
<=35
<=10
MATRIX
BLOSUM50
BLOSUM62
BLOSUM80
PAM250
PAM120
MDM40
MDM20
MDM10
open
-10
-7
-16
-10
-16
-12
-22
-23
ext
-2
-1
-4
-2
-4
-2
-4
-4
Search Guidelines: #5
• Homology can be reliably inferred from
statistically significant similarity
Homology = common 3D structure
Homology - NOT common function
• Orthologous sequences have similar functions
• Paralogous sequences acquire very different
functional roles
Sequence Searching Tools
Search Guidelines: #6
• Consult motif or fingerprint databases to find
evidence for conservation critical for functional
residues
Motif identity in the absence of overall sequence
similarity is not a reliable indicator of homology!
• Try to produce multiple sequence alignments in
order to validate the relatedness of your sequence
data
ClustalW, MUSCLE, T-Coffee, Kalign, MAFFT
Mview, DBClustal (available form EBI FASTA & BLAST services)
Sequence Searching Tools
Search Guidelines: #7
• Low complexity regions (e.g. CA repeats, poly-A tails and
Proline-rich regions) give spuriously high scores that
reflect compositional bias rather than significant
position-by-position alignment
Use seg, xnu, dust, CENSOR, etc. BUT be careful
about what you filter!!!
Sequence Searching Tools
Search Guidelines: #8
• What about short sequences?
• Depends on their nature:
• Protein:
Reduce word length and/or increase e-value
Use shallow matrices
• DNA:
Reduce word length (but NOT to 1!)
Set Threshold for band optimisation (FASTA) to 0
Ignore gap penalties (force local alignments only)
Sequence Searching Tools
Accessing Tools at EBI
Access Sequence Similarity Search services over
various interfaces:
1) Using your browser
2) Over email
3) Using Web Services (SOAP/REST)
Perl, Python, C and Java clients available
Taverna & Triana workflows are fully supported
(See: http://www.ebi.ac.uk/Tools/webservices/)
(See: http://www.myexperiment.org/)
Sequence Searching Tools
Typical workflow
search
function
review
Check stats
evolution
compare
Sequence Searching Tools
Final remarks
• Don’t assume a single tool will cater for all your
search needs
• DO change the parameters of the tools
• Remember where the tool excels and what its
limitations are
• A tool intended for specific task A can also be
used for task B (and may be better than the tool
intended for task B specifically!)
• Crazy input will always give crazy results!
Sequence Searching Tools
Contacts:
http://www.ebi.ac.uk/support/
EBI is an Outstation of the European Molecular Biology Laboratory.