blast - Computer Science | Winona State University


Summer Bioinformatics Workshop 2008
BLAST
Chi-Cheng Lin, Ph.D., Professor
Department of Computer Science
Winona State University – Rochester Center
BLAST
• Introduction
– What is BLAST?
– Query Sequence in FASTA Format
– What does BLAST tell you?
• Choices
– BLAST Programs: Which One to Use?
– Commonly Used BLAST programs
– BLAST Databases: Which One to Search?
• Understanding the Output
• Database Search with BLAST
• Blast Steps – How It Works
Acknowledgement: The presentation includes adaptations from NCBI’s
Introduction to Molecular Biology Information Resources Modules
What is BLAST?
• Basic Local Alignment Search Tool
• The Google™ of bioinformatics
• query is a DNA or protein sequence, not a
text term
• character string comparison against all the
sequences in the target database
• rigorous statistics used to identify
statistically significant matches
Query Sequence in FASTA Format
• FASTA definition line ("def line") that
begins with a >, followed by some text that
briefly describes the query sequence on a
single line
• up to 80 nucleotide bases or amino acids
per line
• example and additional information
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
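The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name parse_fasta is hypothetical, and only the first two sequence lines of the example record are used to keep the block short.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (def_line, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new def line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined into one string
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
"""

records = parse_fasta(example)
def_line, sequence = records[0]
print(def_line)
print(len(sequence))
```

The def line and the sequence are kept separate because BLAST only aligns the sequence; the def line is carried along as a label for the results page.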
What does BLAST tell you?
• putative identity and function of your query
sequence
• helps to direct experimental design to prove the
function
• find similar sequences in model organisms (e.g.,
yeast, C. elegans, mouse), which can be used to
further study the gene
• compare complete genomes against each other
to identify similarities and differences among
organisms
BLAST Programs: Which One to Use?
Depends on:
• what type of query sequence you have
(nucleotide or protein)
• what type of database you will search
against (nucleotide or protein)
• Most commonly used BLAST programs
– blastn
– blastp
– blastx
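The choice rule above can be written as a small lookup. This is an illustrative sketch only: the function name choose_blast_program and the type labels "nucleotide" and "protein" are assumptions made for this example, and it covers just the three programs listed on this slide.

```python
# Maps (query type, database type) to the three programs named on the slide.
PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",  # query is translated before the search
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program for a query/database type pair."""
    key = (query_type.lower(), database_type.lower())
    if key not in PROGRAMS:
        raise ValueError(
            f"combination {key} is not covered by the three programs listed here"
        )
    return PROGRAMS[key]


print(choose_blast_program("nucleotide", "protein"))
```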
6
Summer Bioinformatics Workshop 2008
Commonly Used BLAST Programs
• BLASTN
– Nucleic acids against nucleic acids
• BLASTP
– Protein query against protein database
– usually better to use than nucleotide-nucleotide
BLAST
– ...but... if we don't have a protein query sequence,
what are our options?
• BLASTX
– Translated nucleic acids against protein database
– one way to do a protein BLAST search if you have a
nucleotide query sequence
– the BLAST program does the translating for you, in all
6 reading frames
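To make "all 6 reading frames" concrete, here is a minimal sketch of a six-frame translation using the standard genetic code. It is not BLASTX's own code; the function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example, and codons containing letters other than A, C, G, T are written as X.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(dna):
    """Translate both strands at offsets 0, 1 and 2, giving six frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Three frames come from the query as given and three from its reverse complement, because the coding strand and the codon phase of a nucleotide query are usually unknown in advance.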
Request ID: RID
• An RID is like a ticket number that allows
you to retrieve your search results and
format them in many different ways over
the next 24 hours.
• If you've saved RIDs from your recent
searches, you can enter the RIDs directly
using the Retrieve results with a Request
ID page, which is accessible from the
bottom of the BLAST home page
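A saved RID can also be turned into a retrieval link programmatically. The sketch below only builds the URL and makes no network request. It assumes the NCBI BLAST URL API endpoint Blast.cgi with its CMD, RID and FORMAT_TYPE parameters, which should be checked against the current NCBI documentation; the function name and the RID value are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_retrieve_url(rid, format_type="Text"):
    """Build a URL that asks the server for the results stored under an RID."""
    params = {"CMD": "Get", "RID": rid, "FORMAT_TYPE": format_type}
    return BLAST_URL + "?" + urlencode(params)


# "EXAMPLERID01" is a made-up RID used only for illustration.
url = build_retrieve_url("EXAMPLERID01")
print(url)
```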
Search Results:
Understanding the Output
• Reference to BLAST paper
• Reminders about your specific query
– RID
– query sequence reminder (contains the information from your
FASTA def line)
– what database you searched against
• Graphical summary
– shows where the hits aligned to your query
– colors indicate score range
– mouse over a colored bar to see info about that hit
• Text summary (GI numbers and Def lines)
– GI links to complete record in Entrez
– Score links to pairwise alignment between your query sequence
and the hit
• Pairwise alignments
• BLAST statistics for your search
Database Search w/ BLAST
Used most often!
Database Search w/ BLAST
• Selecting a BLAST program
• Insert sequence
• Hit “BLAST” near the end of the web page
In general, if you select blastn, select “Others” as your Database to search.
Database Search w/ BLAST
• RID and search status will appear
Database Search w/ BLAST
• Wait for your result (patiently …)
Database Search w/ BLAST
• Interpret the result
– Graphic result
– Black lines are the sequences that matched least well, while red
lines are the sequences that matched best. In the example below
there are no red lines, so the purple sequences are the best
matches available.
Source of the image: http://www.bio.davidson.edu/courses/genomics/2006/martens/favorite_gene.html
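The color coding of the graphic result can be expressed as a simple score-to-color lookup. The thresholds below (40, 50, 80, 200) are an assumption based on the color key commonly shown above the NCBI graphical summary and do not come from these slides; the function name score_color is hypothetical.

```python
def score_color(score):
    """Map an alignment score to the color of its bar in the graphic summary.

    The cutoffs are assumed from the usual NCBI color key, not from the slides.
    """
    if score < 40:
        return "black"    # weakest matches
    if score < 50:
        return "blue"
    if score < 80:
        return "green"
    if score < 200:
        return "magenta"  # often described as pink or purple
    return "red"          # strongest matches


for s in (35, 45, 60, 150, 250):
    print(s, score_color(s))
```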
Database Search w/ BLAST
• BLAST result
– Matching sequences w/ bit-score & E-value
– Hyperlinks to database entry for sequence
• Example
Note that 3e-188 means 3 × 10^-188.
BLAST – Statistical Evaluation
• E Value
– The number of different alignments with
scores equivalent to or better than a given
alignment score that are expected to occur
in a database search by chance.
– The lower the E value, the more significant
the score.
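The link between score and E value can be sketched with the commonly cited relation E = m × n × 2^(-S'), where S' is the bit score, m is the query length and n is the total length of the database. This is a simplified illustration: BLAST itself uses corrected (effective) lengths, which the sketch ignores, and the function name and the numbers used are hypothetical.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E value from a bit score: E = m * n * 2 ** (-S')."""
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)


# The same hypothetical search space, evaluated at increasing bit scores.
query_length = 250
database_length = 1_000_000_000
for bits in (30, 50, 100):
    print(bits, expected_hits(bits, query_length, database_length))
```

Each additional bit of score halves the E value, which is why high-scoring hits are reported with very small values in scientific notation.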
BLAST Steps – How It Works
1. Seeding
- Prepare a list of short, fixed-length segments (words)
from the query
2. Searching
- Find highly similar or exact matches for each word in the database
3. Extension
- Extend each match to (potentially) a longer match
4. Evaluation
- Evaluate the results using E values
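The four steps can be illustrated with a toy seed-and-extend search on two short DNA strings. This is a teaching sketch under simplifying assumptions, not the BLAST implementation: it uses exact word matches only, ungapped extension with a simple drop-off rule, a +1/-1 scoring scheme, and ranks hits by raw score instead of computing E values. All function names and parameter defaults are hypothetical.

```python
def seed_words(query, word_size):
    """Step 1 (seeding): map each fixed-length word to its positions in the query."""
    words = {}
    for i in range(len(query) - word_size + 1):
        words.setdefault(query[i:i + word_size], []).append(i)
    return words


def find_seeds(words, subject, word_size):
    """Step 2 (searching): exact word matches as (query_pos, subject_pos) pairs."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in words.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


def extend_one_way(query, subject, qi, sj, step, match, mismatch, x_drop):
    """Walk in one direction; return (best score gain, positions kept)."""
    score = best = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        sj += step
    return best, best_length


def extend(query, subject, i, j, word_size, match=1, mismatch=-1, x_drop=2):
    """Step 3 (extension): grow a seed to the right and left without gaps."""
    right, right_len = extend_one_way(
        query, subject, i + word_size, j + word_size, 1, match, mismatch, x_drop)
    left, left_len = extend_one_way(
        query, subject, i - 1, j - 1, -1, match, mismatch, x_drop)
    score = word_size * match + left + right
    # (score, query start, query end (exclusive), subject start)
    return score, i - left_len, i + word_size + right_len, j - left_len


def toy_blast(query, subject, word_size=3):
    """Step 4 (evaluation), simplified: keep the highest-scoring extended match."""
    words = seed_words(query, word_size)
    best = None
    for i, j in find_seeds(words, subject, word_size):
        hit = extend(query, subject, i, j, word_size)
        if best is None or hit[0] > best[0]:
            best = hit
    return best


result = toy_blast("ACGTTGCA", "TTACGTTGCATT")
print(result)
```

Seeding on short words keeps the search fast, because only positions that share a word with the query are ever extended.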