tutorial3_10

Tutorial 3
BLAST
What is BLAST?
• Basic Local Alignment Search Tool
• Is a set of similarity search programs designed to
explore sequence databases.
What are similarity searches good for?
• One sequence by itself is not informative; it must be
analyzed by comparative methods against existing
sequence databases to develop hypotheses concerning
relatives and function.
Query
BLAST program
Database
BLAST Databases
Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic
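The program table can be sketched as a small lookup. This is only an illustration: the helper name `programs_for` and its boolean arguments are hypothetical and not part of BLAST. It assumes that a nucleotide sequence can be used either as "genomic" or as "translated genomic".

```python
# Query and database types for each BLAST program, as listed in the table.
BLAST_PROGRAMS = {
    "blastn": ("genomic", "genomic"),
    "blastp": ("protein", "protein"),
    "blastx": ("translated genomic", "protein"),
    "tblastn": ("protein", "translated genomic"),
    "tblastx": ("translated genomic", "translated genomic"),
}


def programs_for(query_is_protein, database_is_protein):
    """Return the programs that fit the raw sequence types (hypothetical helper)."""
    def raw(kind):
        # "genomic" and "translated genomic" both start from nucleotide data.
        return "protein" if kind == "protein" else "nucleotide"

    wanted = (
        "protein" if query_is_protein else "nucleotide",
        "protein" if database_is_protein else "nucleotide",
    )
    return [
        name
        for name, (query_type, db_type) in BLAST_PROGRAMS.items()
        if (raw(query_type), raw(db_type)) == wanted
    ]


print(programs_for(query_is_protein=True, database_is_protein=False))
```

For a protein query against a nucleotide database the only fitting entry is tblastn; a nucleotide query against a nucleotide database fits both blastn and tblastx.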
http://www.ncbi.nlm.nih.gov/BLAST/
Place Query
Choose Database
BLASTN Databases
Gene collection        GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
Genomic + Transcript   Complete human and mouse genome + transcriptome
EST                    Expressed sequence tags
mito                   Mitochondrial sequences
vector                 Vector subset of GenBank
month                  GenBank, EMBL, DDBJ, PDB sequences from the last 30 days
Envi                   Environmental samples
http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml#nucleotide_databases
Place Query
Choose Database
Optimize similarity level of the search
Limit output size
Threshold for results significance
Primary word match (16-64 nt)
Reward and penalty for matching and
mismatching bases
Cost to create and extend a gap
Remove low information content
Limit search to specific organism
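To make the reward, penalty and gap-cost parameters concrete, here is a minimal sketch that scores an already-aligned pair of sequences. The function name and the default values are illustrative assumptions, not BLAST's actual defaults, and the convention that a gap of length k costs gap_open + k * gap_extend is also an assumption.

```python
def score_alignment(query, subject, reward=1, penalty=-2, gap_open=5, gap_extend=2):
    """Score a gapped pairwise alignment given as two equal-length strings.

    '-' marks a gap. A gap of length k costs gap_open + k * gap_extend
    (an assumed convention; the default values are illustrative only).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # which sequence the current gap run is in, if any
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            side = "query" if q == "-" else "subject"
            if side != gap_side:
                score -= gap_open  # a new gap starts here
            score -= gap_extend    # every gap position pays the extension cost
            gap_side = side
        else:
            score += reward if q == s else penalty
            gap_side = None
    return score


example = score_alignment("ACGTCGATCGAGCT", "AGGTCGTC-GAGGT")
print(example)  # 9 matches, 4 mismatches, 1 gap: 9*1 + 4*(-2) - (5 + 2) = -6
```

Raising the gap costs or the mismatch penalty lowers the score of divergent alignments, which is how these settings make a search stricter.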
Search for homologs of the chick
“olfactory receptor 6” gene
Global Alignments
Local Alignments
Query sequence
Matched areas of database sequences
Sequence description
Sequence identifier
Score (bits)
Coverage
E value
Identity
Score and E value
Identities and gaps
Strand
Multiple hits on the same subject
Design of the BLAST search
Consider your research question:
• Are you looking for a particular gene in a particular
species? BLAST against the genome of that species.
• Are you looking for additional members of a gene family
across all species? BLAST against the gene collection
database.
• Are you looking for exact motif matches? Increase the gap
penalty or use megablast.
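Score and E-value
Score (S): (identities + mismatches) - gaps
Bit Score (S’): $S' = \frac{\lambda S - \ln K}{\ln 2}$
E-value (E): $E = K \cdot m \cdot n \cdot e^{-\lambda S}$
$S$: score
$\lambda$, $K$: depend on the scoring system
$m \cdot n$: depends on the search space
$m$: query length (bp)
$n$: database length (bp)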
Score and E-value
•The score is a measure of the similarity of the query
to the sequence shown.
•The E-value is a measure of the reliability of the score.
•The definition of the E-value is: the number of alignments
with a similarity score equal to or greater than the given
score S that are expected to occur by chance.
Score and E-value
The Size of the E-value
•The typical threshold for a good E-value from a BLAST
search is $E = 10^{-6}$ (written 1e-6) or lower.
•The reason for such low values is that a per-entry chance
probability of 0.001 in a million-entry database would still
leave about 1000 entries due to chance, while $10^{-6}$ would
leave only about one.
Exercise
Calculate the S, S’ and E for the following BLAST hit:
ACGTCGATCGAGCT
|||||||| |||||
AGGTCGTC-GAGGT
Given the following parameters:
Query length: 150
=1.37
K=0.711
Average Sequence length in database: 270
Number of sequences in database: 4,554,026
S: (Id+MM)-GP
$S = 13 - 1 = 12$
$S' = \frac{1.37 \times 12 - \ln(0.711)}{\ln 2}$
$S' = \frac{16.44 + 0.341}{0.693}$
$S' \approx 24.2$
Exercise
Calculate the S, S’ and E for the following BLAST hit:
ACGTCGATCGAGCT
|||||||| |||||
AGGTCGTC-GAGGT
Given the following parameters:
Query length: 150
=1.37
K=0.711
Average Sequence length in database: 270
Number of sequences in database: 4,554,026
$E = 0.711 \times 150 \times 270 \times 4{,}554{,}026 \times e^{-1.37 \times 12}$
$E \approx 131135455683 \times 7.248 \times 10^{-8}$
$E \approx 9504.27$
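The same E-value can be reproduced numerically. In this sketch the database length is taken as the average sequence length times the number of sequences, as in the exercise; the function name `e_value` is illustrative.

```python
import math


def e_value(raw, lam, k, query_len, db_len):
    """E = K * m * n * exp(-lambda * S), with m = query_len and n = db_len."""
    return k * query_len * db_len * math.exp(-lam * raw)


# Database length: average sequence length times number of sequences.
db_len = 270 * 4_554_026
E = e_value(raw=12, lam=1.37, k=0.711, query_len=150, db_len=db_len)
print(round(E, 2))
```

An E-value in the thousands means a hit with this score is expected many times by chance alone, so this short alignment is not significant.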
Exercise
What is the minimal score needed to achieve a
significant E value ($10^{-6}$)?
$131135455683 \, e^{-1.37 S} = 10^{-6}$
$\ln(131135455683 \, e^{-1.37 S}) = \ln(10^{-6})$
$\ln(131135455683) + \ln(e^{-1.37 S}) = -13.81$
$25.6 - 1.37 S = -13.81$
$S = \frac{-13.81 - 25.6}{-1.37}$
$S \approx 28.77$
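Solving for S in general gives S = ln(K·m·n / E) / λ. The sketch below applies this to the exercise values; the function name `min_score` is illustrative.

```python
import math


def min_score(target_e, lam, k, query_len, db_len):
    """Smallest raw score S for which K * m * n * exp(-lambda * S) <= target_e."""
    return math.log(k * query_len * db_len / target_e) / lam


db_len = 270 * 4_554_026  # average sequence length times number of sequences
S_min = min_score(target_e=1e-6, lam=1.37, k=0.711, query_len=150, db_len=db_len)
print(round(S_min, 2))   # 28.77
print(math.ceil(S_min))  # 29
```

Since the slide's scoring only produces whole-number scores, a hit would need a score of at least 29 to reach this threshold.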
1. Search for sequences homologous to the CFTR gene in human
2. Additional family members of the CFTR gene found in other organisms
3. Search for additional genes that are members of the ABC transporters family