
Advanced BLAST Searching
Part 2 of 2
September 17, 2003
Copyright notice
Many of the images in this powerpoint presentation
are from Bioinformatics and Functional Genomics
by Jonathan Pevsner (ISBN 0-471-21004-8).
Copyright © 2003 by John Wiley & Sons, Inc.
These images and materials may not be used
without permission from the publisher. We welcome
instructors to use these powerpoints for educational
purposes, but please acknowledge the source.
The book has a homepage at http://www.bioinfbook.org,
including hyperlinks to the book chapters.
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 2
Score = 140 bits (353), Expect = 1e-32
Identities = 45/176 (25%), Positives = 78/176 (43%), Gaps = 33/176 (18%)
Query: 4   VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD 55
Sbjct: 2   VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN 60

Query: 56  NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF 112
Sbjct: 61  CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL--- 112

Query: 113 LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA 164
Sbjct: 113 --MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET 159
Page 142
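The Identities and Gaps figures in the iteration 2 summary line can be recounted from the aligned rows. The following is a minimal Python sketch, not BLAST's own code: the function name `alignment_stats` is hypothetical, and truncating percentages to whole numbers is an assumption that happens to reproduce the identity and gap percentages shown above. Positives are not recomputed, because they depend on the scoring matrix in use.

```python
def alignment_stats(query: str, sbjct: str) -> dict:
    """Count identical columns and gap columns in a pairwise alignment."""
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have equal length")
    length = len(query)
    # An identity is a column where both rows carry the same residue.
    identities = sum(q == s and q != "-" for q, s in zip(query, sbjct))
    # A gap column has a '-' in either row.
    gaps = sum(q == "-" or s == "-" for q, s in zip(query, sbjct))
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        # Assumption: percentages are truncated, not rounded.
        "identity_pct": identities * 100 // length,
        "gap_pct": gaps * 100 // length,
    }


# Aligned rows of the iteration 2 alignment (RBP query, beta-lactoglobulin subject).
QUERY = (
    "VWALLLLAAWAAAERDCRVSSF--------RVKENFDKARFSGTWYAMAKKDPEGLFLQD"
    "NIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMV---GTFTDTEDPAKFKMKYWGVASF"
    "LQKGNDDHWIVDTDYDTYAVQYSCR----LLNLDGTCADSYSFVFSRDPNGLPPEA"
)
SBJCT = (
    "VTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEI-EKIPASFEKGN"
    "CIQANYSLMENGNIEVLNKEL-----SPDGTMNQVKGEAKQSNVSEPAKLEVQFFPL---"
    "--MPPAPYWILATDYENYALVYSCTTFFWLFHVD------FFWILGRNPY-LPPET"
)

stats = alignment_stats(QUERY, SBJCT)
print(stats)
```

The alignment spans 176 columns, which is the denominator in each ratio of the summary line.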
PSI-BLAST alignment of RBP and β-lactoglobulin: iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Query: 3   WVWALLLLAAWAAAERD--------CRVSSFRVKENFDKARFSGTWYAMAKKDPEGLFLQ 54
Sbjct: 1   MVTMLMFLATLAGLFTTAKGQNFHLGKCPSPPVQENFDVKKYLGRWYEIEKIPASFE-KG 59

Query: 55  DNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTEDPAKFKMKYWGVASFLQ 114
Sbjct: 60  NCIQANYSLMENGNIEVLNKELSPDGTMNQVKGE--AKQSNVSEPAKLEVQFFPL----- 112

Query: 115 KGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADSYSFVFSRDPNGLPPEA 164
Sbjct: 113 MPPAPYWILATDYENYALVYSCTTFFWL--FHVDFFWILGRNPY-LPPET 159
Page 142
Iteration 1
Score = 46.2 bits (108), Expect = 2e-04
Identities = 40/150 (26%), Positives = 70/150 (46%), Gaps = 37/150 (24%)
Query: 27  VKENFDKARFSGTWYAMAKKDPEGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVC 86
Sbjct: 33  VQENFDVKKYLGRWYEI-EKIPASFEKGNCIQANYSLMENGNIEVLNK---------ELS 82

Query: 87  ADMVGTF---------TDTEDPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCR 137
Sbjct: 83  PD--GTMNQVKGEAKQSNVSEPAKLEVQFFPLMP-----PAPYWILATDYENYALVYSCT 135

Query: 138 ----LLNLDGTCADSYSFVFSRDPNGLPPE 163
Sbjct: 136 TFFWLFHVD------FFWILGRNPY-LPPE 158

Iteration 3
Score = 159 bits (404), Expect = 1e-38
Identities = 41/170 (24%), Positives = 69/170 (40%), Gaps = 19/170 (11%)
Alignment as shown for iteration 3 above.
Page 142
The universe of lipocalins (each dot is a protein)
Figure labels: retinol-binding protein, apolipoprotein D, odorant-binding protein
Page 143
Scoring matrices let you focus on the big (or small) picture
Figure labels: retinol-binding protein; your RBP query; PAM250; PAM30; BLOSUM80; BLOSUM45
Page 143
PSI-BLAST generates scoring matrices
more powerful than PAM or BLOSUM
Figure label: retinol-binding protein
Page 143
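A position-specific scoring matrix assigns each alignment column its own residue scores, instead of applying one PAM or BLOSUM table everywhere. The Python sketch below illustrates the idea under loud simplifications: a uniform background frequency, a single fixed pseudocount, and no sequence weighting. PSI-BLAST itself uses more elaborate weighting and pseudocounts, so treat this as an illustration, not its algorithm. The function names are hypothetical, and the three six-residue segments come from the lipocalin alignment (ecblc, vc, hsrbp) used in this lecture.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    n = len(aligned_seqs)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            # The pseudocount keeps unseen residues from scoring minus infinity.
            freq = (counts[aa] + pseudocount * background) / (n + pseudocount)
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score(pssm, peptide):
    """Sum the column-specific scores for an ungapped peptide."""
    return sum(col[aa] for col, aa in zip(pssm, peptide))


motif = ["GTWYEI", "GKWYEV", "GTWYAM"]
pssm = build_pssm(motif)
print(round(pssm[2]["W"], 2), round(pssm[2]["A"], 2))
print(round(score(pssm, "GTWYEI"), 2), round(score(pssm, "AAAAAA"), 2))
```

The invariant W in the third column scores well above any residue never seen there, and a peptide matching the conserved segment outscores an unrelated one. A single general-purpose matrix cannot express that kind of column-specific preference.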
PSI-BLAST: performance assessment
Evaluate PSI-BLAST results using a database in
which protein structures have been solved and all
proteins in a group share < 40% amino acid identity.
Page 143
PSI-BLAST: the problem of corruption
PSI-BLAST is useful to detect weak but biologically
meaningful relationships between proteins.
The main source of false positives is the spurious
amplification of sequences not related to the query.
For instance, a query with a coiled-coil motif may
detect thousands of other proteins with this motif
that are not homologous.
Once even a single spurious protein is included
in a PSI-BLAST search above threshold, it will not
go away.
Page 144
PSI-BLAST: the problem of corruption
Corruption is defined as the presence of at least one
false positive alignment with an E value < 10^-4
after five iterations.
Three approaches to stopping corruption:
[1] Apply filtering of biased composition regions
[2] Adjust E value from 0.001 (default) to a lower
value such as E = 0.0001.
[3] Visually inspect the output from each iteration.
Remove suspicious hits by unchecking the box.
Page 144
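Approaches [2] and [3] both narrow the set of hits allowed into the next iteration's profile. The Python sketch below shows that selection step. The hit names and E values are invented for illustration, `select_for_profile` is a hypothetical helper, and the strict "less than" comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    evalue: float


def select_for_profile(hits, inclusion_threshold, excluded=frozenset()):
    """Keep hits better than the threshold that were not manually unchecked."""
    return [
        h for h in hits
        if h.evalue < inclusion_threshold and h.name not in excluded
    ]


# Hypothetical hits from one iteration (not real search output).
hits = [
    Hit("lipocalin_A", 1e-32),
    Hit("lipocalin_B", 3e-5),
    Hit("borderline_C", 5e-4),
    Hit("unrelated_D", 0.2),
]

default_set = select_for_profile(hits, 0.001)    # default threshold
strict_set = select_for_profile(hits, 0.0001)    # approach [2]
curated_set = select_for_profile(hits, 0.001, excluded={"borderline_C"})  # approach [3]

print([h.name for h in default_set])
print([h.name for h in strict_set])
print([h.name for h in curated_set])
```

In this made-up example the borderline hit enters the profile at the default threshold. Either the stricter threshold or manual removal keeps it out.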
PHI-BLAST: Pattern hit initiated BLAST
Launches from the same page as PSI-BLAST
Combines matching of regular expressions
with local alignments surrounding the match.
Page 145
Given a protein sequence S and a regular expression
pattern P occurring in S, PHI-BLAST helps answer the
question: What other protein sequences both contain
an occurrence of P and are homologous to S in the vicinity
of the pattern occurrences? PHI-BLAST may be preferable
to just searching for pattern occurrences because it
filters out those cases where the pattern occurrence
is probably random and not indicative of homology.
Page 145
Align three lipocalins (RBP and two bacterial lipocalins)
        1                                                   50
ecblc   MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD
vc      MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD
hsrbp   ~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD
Page 145
Pick a small, conserved region and see which amino acid
residues are used
GTWYEI
 K  AV
     M
Page 145
Create a pattern using the appropriate syntax
GTWYEI
 K  AV
     M
GXW[YF][EA][IVLM]
Page 145
Syntax rules for PHI-BLAST
The syntax for patterns in PHI-BLAST follows the
conventions of PROSITE (protein lecture, Chapter 8).
When using the stand-alone program, it is permissible
to have multiple patterns. When using the Web page,
only one pattern is allowed per query.
[ ] means any one of the characters enclosed in the brackets
e.g., [LFYT] means one occurrence of L or F or Y or T
- means nothing (it is only a spacer between pattern positions)
x(5) means 5 positions in which any residue is allowed
x(2,4) means 2 to 4 positions where any residue is allowed
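The syntax rules above map closely onto ordinary regular expressions. The Python sketch below converts the listed subset of the syntax into a Python regex and applies the slide's pattern to the three lipocalin sequences. `prosite_to_regex` is a hypothetical helper, and treating an uppercase X like a lowercase x is an assumption. This covers only the pattern-matching step; PHI-BLAST additionally requires a significant local alignment around each match.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate the subset of PROSITE syntax listed above into a Python regex."""
    pattern = pattern.replace("-", "")  # '-' is only a spacer

    def repeat(match):
        low, high = match.group(1), match.group(2)
        return ".{" + low + ("," + high if high else "") + "}"

    # x(5) -> .{5} and x(2,4) -> .{2,4}
    pattern = re.sub(r"[xX]\((\d+)(?:,(\d+))?\)", repeat, pattern)
    # A bare x matches any single residue.
    return re.sub(r"[xX]", ".", pattern)


# Rows from the alignment slide; '.', '~' and spaces are alignment padding.
aligned = {
    "ecblc": "MRLLPLVAAA TAAFLVVACS SPTPPRGVTV VNNFDAKRYL GTWYEIARFD",
    "vc": "MRAIFLILCS V...LLNGCL G..MPESVKP VSDFELNNYL GKWYEVARLD",
    "hsrbp": "~~~MKWVWAL LLLAAWAAAE RDCRVSSFRV KENFDKARFS GTWYAMAKKD",
}

regex = prosite_to_regex("GXW[YF][EA][IVLM]")
matches = {}
for name, row in aligned.items():
    sequence = re.sub(r"[.~ ]", "", row)
    found = re.search(regex, sequence)
    matches[name] = found.group() if found else None

print(regex)
print(matches)
```

All three sequences contain one occurrence of the pattern, each at the conserved region it was built from.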
BLAST for gene discovery
You can use BLAST to find a “novel” gene
Note to students taking this class for credit:
You will need to do this for 40% of your grade.
In the first three years of this course,
everyone has succeeded at this exercise.
Page 147
Start with the sequence of a known protein.
tblastn: search a DNA database (e.g. HTGS, dbEST,
or genomic sequence from a specific organism).
Inspect the results. Find matches…
[1] to DNA encoding known proteins
[2] to DNA encoding related (novel!) proteins
[3] to false positives
blastx or blastp: search your DNA or protein against a
protein database (nr) to confirm you have identified a novel gene.
Page 148
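tblastn can compare a protein query with a DNA database because it translates each database sequence in all six reading frames, three on each strand. blastx does the same to a DNA query before searching a protein database. The Python sketch below shows a six-frame translation using the standard genetic code. The function names and the nine-base example sequence are hypothetical and serve only as illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")  # hypothetical DNA fragment
for name, protein in frames.items():
    print(name, protein)
```

A true match usually appears in only one of the six frames. This is one reason to inspect each tblastn hit before treating it as a candidate gene.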
(Page 150)
This is a good candidate
for a novel gene/protein.
A blastp nr search confirms that
the Salmonella query is closely
related to other lipocalins
(Page 150)
BLAST for gene discovery
You can use BLAST to find a “novel” gene
Note to students taking this class for credit:
You will need to do this for 40% of your grade.
Ideally, try to find a new gene this week. You
can discuss it anytime with me, Mayra, Hugh,
or Gek. You should have your novel protein
by October 13 (for the first phylogeny lecture)
so you can put your novel protein into a tree.
I will provide sample projects from last year.