
Review Concepts 20040715
Chuong Huynh
NCBI
Pairwise Sequence Alignments
• Purpose:
• identification of sequences with significant similarity to (a)
sequence(s) in a sequence-repository
• identification of all homologous sequences in the repository
• identification of domains with sequence similarity
• Terminology
• Global alignment
• Local alignment
Terminology: Global Alignment
• Finds the optimal alignment over the
entire length of the two compared
sequences
• Unlikely to detect genes that have
evolved by recombination (e.g. domain
shuffling) or insertion/deletion of DNA
• Suitable for sequences of homologous
molecules
Terminology: Local Alignment
• Finds short regions of similarity between a pair
of sequences
• compared sequences can receive high local
similarity scores, without the need to
have high levels of similarity over their
entire length
• useful when looking for domains within
proteins or looking for regions of
genomic DNA that contain coding exons
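The contrast between the two alignment types can be sketched in code. This is a minimal illustration, not something taken from the slides: it assumes a toy scoring scheme (match +1, mismatch -1, linear gap -1), and the function names are hypothetical. The global score uses the Needleman-Wunsch recurrence; the local score uses the Smith-Waterman recurrence, which clamps every cell at zero and reports the best cell.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of sub-regions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


# Toy sequences (made up) that share one short "domain", GATTACA.
seq1 = "CCCCCGATTACACCCCC"
seq2 = "TTTGATTACATTT"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

With these toy inputs the local score is 7 (the shared GATTACA region), while the global score is -3, because the unrelated flanks must also be aligned and are penalized.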
An alignment that BLAST can’t find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
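The slide does not say why BLAST misses this alignment. One plausible reading, stated here as an assumption, is that nucleotide BLAST seeds alignments from short exact word matches (the classic blastn word size is 11), and this alignment contains no identical run that long. The sketch below counts the identities and the longest run of consecutive identical positions in the ungapped alignment shown on the slide; the function name is hypothetical.

```python
# The two aligned sequences from the slide (161 positions, no gaps).
TOP = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
BOTTOM = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)

WORD_SIZE = 11  # assumption: classic blastn seed length, not from the slide


def identity_stats(a, b):
    """Return (identical positions, longest identical run) for an ungapped alignment."""
    if len(a) != len(b):
        raise ValueError("ungapped alignment needs equal-length sequences")
    identical = longest = run = 0
    for x, y in zip(a, b):
        if x == y:
            identical += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return identical, longest


identical, longest = identity_stats(TOP, BOTTOM)
print(f"{identical}/{len(TOP)} identical positions")
print(f"longest exact run: {longest} (seed length assumed: {WORD_SIZE})")
```

For this alignment the sketch reports 98 identities out of 161 positions (about 61%) and a longest exact run of 6, well short of an 11-base seed.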
BLAST Selection Matrix
Choosing the Right BLAST Flavor for Proteins
What You Want to Do | The Right BLAST Flavor
Find out something about the function of the protein | Use blastp to compare your protein with other proteins contained in the databases.
Discover new genes encoding similar proteins | Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames.
Claverie & Notredame 2003
Choosing the Right BLAST Flavor for DNA
Question | Answer
Am I interested in non-coding DNA? | Yes: use blastn. Remember: blastn is only for closely related DNA sequences (more than 70% identical).
Do I want to discover new proteins? | Yes: use tblastx.
Do I want to discover proteins encoded in my query DNA sequences? | Yes: use blastx.
Am I unsure of the quality of my DNA? | Yes: use blastx, especially if you suspect your DNA sequence codes for a protein but may contain sequencing errors.
Claverie & Notredame 2003
Choosing the Right BLAST Flavor for DNA Sequences
Usage | Query | Database | Program
Find very similar DNA sequence | DNA | DNA | blastn
Protein discovery and ESTs | Translated DNA | Translated DNA | tblastx
Analysis of query DNA sequence | Translated DNA | Protein | blastx
Claverie & Notredame 2003
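The selection tables above amount to a lookup from query type and database type to a program name. A minimal sketch follows; the dictionary layout, the function name, and the `translate_both` flag are illustrative choices, not part of any BLAST interface.

```python
# Keyed by (query type, database type, both sides translated?).
BLAST_FLAVORS = {
    ("protein", "protein", False): "blastp",
    ("protein", "dna", False): "tblastn",   # database translated in 6 frames
    ("dna", "dna", False): "blastn",
    ("dna", "dna", True): "tblastx",        # query and database both translated
    ("dna", "protein", False): "blastx",    # query translated in 6 frames
}


def choose_blast(query, database, translate_both=False):
    """Pick a BLAST program name from the slide's selection tables."""
    key = (query.lower(), database.lower(), translate_both)
    if key not in BLAST_FLAVORS:
        raise ValueError(f"no BLAST flavor listed for {key}")
    return BLAST_FLAVORS[key]


print(choose_blast("protein", "protein"))
print(choose_blast("DNA", "protein"))
print(choose_blast("DNA", "DNA", translate_both=True))
```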
BLAST Tips
• It is faster and more accurate to BLAST
proteins (blastp) rather than nucleotides.
• If in doubt use blastp.
• When possible restrict to the subset of
the database you are interested in.
• Look around for the database you need or
create your own custom BLAST database.
BUT HOW???
• When is the best time to use the BLAST
server?
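The slide leaves "BUT HOW???" open. One possible answer, offered as an assumption rather than as the presenter's method: with the current NCBI BLAST+ command-line tools (which postdate these slides; the legacy equivalent was formatdb), a custom database is built with `makeblastdb` and then searched with `blastp`. The sketch below only assembles the argument lists; the file names are hypothetical and nothing is executed.

```python
def makeblastdb_cmd(fasta, db_name, dbtype="prot"):
    """Argument list for building a custom BLAST database from a FASTA file."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", db_name]


def blastp_cmd(query, db_name, out, evalue=1e-5):
    """Argument list for a blastp search with tabular (outfmt 6) output."""
    return [
        "blastp", "-query", query, "-db", db_name,
        "-evalue", str(evalue), "-outfmt", "6", "-out", out,
    ]


# Hypothetical file names, for illustration only.
build = makeblastdb_cmd("my_proteins.fasta", "my_db")
search = blastp_cmd("query.fasta", "my_db", "hits.tsv")
print(" ".join(build))
print(" ".join(search))
```

If BLAST+ is installed, each list can be passed to `subprocess.run(cmd, check=True)` to actually run the step.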
Asking Biological Problems with BLAST
What you want to do: Finding genes in a genome
• General (but more complicated) computational method: Run gene prediction software or an ORF finder (for bacteria).
• Using BLAST: Cut your genome sequence into small (2-5 kb) overlapping sequences. Use blastx to BLAST each piece of genome against NR (nonredundant protein database). Works better for sequences with no introns (bacteria).
What you want to do: Predicting protein function
• General (but more complicated) computational method: Domain analysis or wet-lab experimentation.
• Using BLAST: Use blastp to BLAST your protein sequence against SWISS-PROT (future = UniProt). If you get a good hit (more than 25% identity) over the complete length of the protein, then your protein likely has the same function as the SWISS-PROT protein.
What you want to do: Predicting protein 3-D structure
• General (but more complicated) computational method: Homology modeling, X-ray, or NMR analysis of the protein of interest.
• Using BLAST: Use blastp to BLAST your protein against PDB (protein structure database). If you get a hit with more than 25% identity, then your protein and the good hit(s) likely have a similar 3-D structure.
What you want to do: Finding protein family members
• General (but more complicated) computational method: Clone new family members using PCR techniques.
• Using BLAST: Use blastp (or better, PSI-BLAST) and run against NR (nonredundant protein database). After you have all members of the family, you can make a multiple sequence alignment and then a phylogenetic tree.
Claverie & Notredame 2003
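The "cut your genome into overlapping pieces" step for gene finding can be sketched as follows. The chunk size follows the 2-5 kb range on the slide; the 500-base overlap and the function name are assumptions made for illustration.

```python
def overlapping_chunks(seq, size=5000, overlap=500):
    """Split seq into windows of `size` bases; neighbors share `overlap` bases.

    Returns a list of (0-based start offset, chunk) pairs.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, seq[start:start + size]))
        if start + size >= len(seq):
            break  # the last window reached the end of the sequence
        start += step
    return chunks


# Made-up 12 kb "genome" for demonstration.
genome = "ACGT" * 3000
pieces = overlapping_chunks(genome)
for start, chunk in pieces:
    print(start, len(chunk))
```

Each chunk would then be submitted as a separate blastx query, and the stored start offset lets hit coordinates be mapped back to the full genome.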
BLAST and PSI-BLAST Servers on the Internet
Country | Program | URL
USA | BLAST/PSI-BLAST | http://www.ncbi.nlm.nih.gov/BLAST
USA | BLAST | http://genome.wustl.edu/gsc/BLAST
Europe | BLAST | http://www.ch.embnet.org/software/bBLAST.html
Europe | BLAST | http://www.ebi.ac.uk/blast2/
Japan | BLAST/PSI-BLAST | http://www.ddbj.nig.ac.jp/E-mail/homology.html
Common Mistake
Sequence 1: AAAAAABBBBBB
Sequence 2: AAAAAA
Sequence 3: BBBBBB
• Seq1 has domains A and B; Seq2 has domain A and Seq3
has domain B
• Use Seq1 as the query sequence
• What happens? Both hits may get very significant (very
low) E-values if domains A and B are long and well
conserved.
• Seq1 is homologous to Seq2 and Seq3, but remember that
Seq1 is not homologous to either over its entire length,
and Seq2 and Seq3 share no domain with each other
• Don’t depend on the E-value alone
• “BLAST hits are not transitive, unless the alignments
are overlapping”
• Most proteins have more than one domain, so
be careful when looking at BLAST results: not all
reported hits belong to the same big family.
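The overlap condition in the quoted rule can be checked directly from the query coordinates of two hits. A minimal sketch, assuming each hit is reduced to a 1-based inclusive (start, end) range on the query; the function name and the tuple representation are hypothetical.

```python
def query_overlap(hit_a, hit_b):
    """Number of query positions covered by both hits (1-based, inclusive)."""
    start = max(hit_a[0], hit_b[0])
    end = min(hit_a[1], hit_b[1])
    return max(0, end - start + 1)


# Query Seq1 = AAAAAABBBBBB. Seq2 aligns to the A domain, Seq3 to the B domain.
seq2_hit = (1, 6)
seq3_hit = (7, 12)

if query_overlap(seq2_hit, seq3_hit) == 0:
    print("Hits cover different query regions: do not infer that Seq2 and Seq3 are related.")
else:
    print("Hits overlap on the query: a shared region is possible.")
```

Two hits that both score well but cover disjoint regions of the query, as here, say nothing about the relationship between the hits themselves.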
Alternative Method for
Homology Searches
• Smith-Waterman (ssearch): slower but
more accurate
• FASTA: slower than BLAST, but more
accurate for DNA comparisons
• BLAT: for locating cDNA in a genome or
finding close proteins in a genome
Common Questions
• When I run a BLAST job using WU-BLAST vs. NCBI
BLAST with the same query sequence, I get
different results. Both are based on the same
algorithm, but they are different implementations. So why
the difference?
Usually this is due to slight variations in the
database version, but differences in BLAST
program version also play a minor role.
Usually the results do not change
dramatically, but they do change a bit.
Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein
sequence databases
2. Perform a database similarity search of the expressed sequence tag
(EST) database of the same organism, or of cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene
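Step 1 of the flow chart, translating in all six reading frames, can be sketched as below. This assumes the standard genetic code (NCBI translation table 1) and unambiguous A/C/G/T input; the function names are hypothetical, and a real pipeline would more likely let blastx do the translation internally.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frames("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six translations would then be compared against a protein database, which is what blastx does for a DNA query.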
The Annotation Process
[Diagram labels: DNA sequence, analysis software, annotator, useful information]
Annotation Process
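[Diagram: a DNA sequence is run through analysis tools to annotate its features]
• Tools: Blastn, Blastx, BlastP, Fasta, Gene finders, Halfwise, RepeatMasker, tRNA scan, Pfam, Prosite, Psort, SignalP, TMHMM
• Features: Repeats, Promoters, rRNA, tRNA, Genes, Pseudo-Genes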
How do I do large scale genome analysis?
• Read Koonin’s book on NCBI Bookshelf
Demo TaxPlot
TaxPlot is a tool for three-way comparisons of genomes
on the basis of the protein sequences they encode.
http://www.ncbi.nlm.nih.gov/sutils/taxik2.cgi
Demo - VecScreen
http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html