Lecture_4_2005
Genome of the week
Bacillus subtilis
Gram-positive soil bacterium
Genetically tractable, well-studied
Developmental pathways (sporulation, genetic competence)
Industrial and agricultural importance
4.2 Mb genome (sequence completed 1997)
Close relative of Bacillus anthracis (Anthrax)
B. subtilis genome features
• 4,106 protein coding genes
• 10 rRNA operons
• Nearly 50% of the genome consists of paralogous
genes.
– 77 ABC transporter binding proteins
• 10 phage-like regions (horizontal transfer); low-GC
regions in the genome.
• 18 sigma factors - initiate transcription.
• 34 two-component regulatory systems.
Annotating genes
• How to assign preliminary functions to genes.
• Automated programs.
• Similarity searches
– BLAST and PSI-BLAST
– COGs, Pfam, CDD, other databases
– Only 50-75% of genes will have a predicted function.
Some have no known homologs in any other genome.
• Functional characterization (individual genes)
– Gene knockouts
– Overexpression
• In many cases computer annotation will
only be able to predict function - NOT
assign function!
– The biological function of many genes has not
been determined, even in model systems.
– As genomic characterization of gene function
continues, more and more computer-generated
annotations will be correct.
• Molecular function - activity of a protein at
the molecular level.
– Examples would be ATPase, metal binding,
converting glucose-6-phosphate to fructose-6-phosphate.
• Biological function - cellular role of the
protein.
– Examples would be translation initiation, DNA
replication, glycolysis.
Homologs, orthologs, and
paralogs.
• Homologous genes are genes that share a
common evolutionary ancestor.
– Orthologs are genes found in different
organisms that arose from a common ancestor
by speciation.
– Paralogs are genes found in the same organism
that arose from a common ancestor by gene duplication.
The duplication could have occurred in that species
or in an earlier ancestor.
Using BLAST to predict gene
function.
• BLAST the predicted protein sequence against
the non-redundant database.
• Determine best hits
• Automated annotation programs will often
assign the best hit function to the gene
being searched.
• Must manually confirm automated
annotations. (Final project).
Basic Local Alignment Search Tool
Calculates similarity for biological sequences
Finds best local alignments
Heuristic approach based on Smith-Waterman algorithm
Searches for matching “words” rather than individual
residues
Uses statistical theory to determine if a match might have
occurred by chance
NCBI Field Guide
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAA
Word Size = 11
Minimum word size = 7
blastn default = 11
megablast default = 28
Make a lookup table of words:
GTACTGGACAT
TACTGGACATG
ACTGGACATGG
CTGGACATGGA
TGGACATGGAC
GGACATGGACC
GACATGGACCC
ACATGGACCCT
...........
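The word lookup table on this slide can be sketched in a few lines. This is a minimal illustration, not BLAST's actual implementation; the function name `build_lookup` is hypothetical, and the query and word size are the ones shown on the slide.

```python
from collections import defaultdict


def build_lookup(query, word_size):
    """Map every overlapping word of length word_size to its start positions in the query."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return dict(table)


query = "GTACTGGACATGGACCCTACAGGAA"
lookup = build_lookup(query, 11)

# Print the words in query order, as on the slide.
for word, positions in sorted(lookup.items(), key=lambda item: item[1][0]):
    print(positions[0], word)
```

A query of length 25 yields 25 - 11 + 1 = 15 overlapping 11-letter words; the slide lists the first eight.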
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Word Size can be 2 or 3 (default = 3)
Make a lookup table of words (Word Size = 3):
GTQ
TQI
QIT
ITV    Neighborhood Words: LTV, MTV, ISV, LSV, etc.
TVE
VED
EDL
DLF
...
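Neighborhood words are words that score at or above some threshold when aligned to a query word with the substitution matrix. The sketch below enumerates them for the word ITV using only the I, T and V rows of BLOSUM62. The function name `neighborhood` and the threshold values are assumptions chosen for illustration; the slide does not state which threshold produced its list.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for I, T and V only (enough to score neighbors of "ITV").
# Each list follows the column order in AMINO_ACIDS.
BLOSUM62_ROWS = {
    "I": [-1, -3, -3, -3, -1, -3, -3, -4, -3, 4, 2, -3, 1, 0, -3, -2, -1, -3, -1, 3],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
}


def neighborhood(word, threshold):
    """Return every word of the same length that scores >= threshold against word."""
    rows = [dict(zip(AMINO_ACIDS, BLOSUM62_ROWS[res])) for res in word]
    found = set()
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(row[res] for row, res in zip(rows, candidate))
        if score >= threshold:
            found.add("".join(candidate))
    return found


# Lower (hypothetical) thresholds admit more neighbors.
for threshold in (11, 10, 9):
    print(threshold, sorted(neighborhood("ITV", threshold)))
```

Lowering the threshold makes the search more sensitive but produces more seed words to look up, which is the trade-off behind the heuristic.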
Minimum Requirements for a Hit
• Nucleotide BLAST requires one exact match
  ATCGCCATGCTTAATTGGGCTT
       CATGCTTAATT          exact word match (one match)
• Protein BLAST requires two neighboring matches within 40 aa
  GTQITVEDLFYNI
  neighborhood words: SEI, YYN (two matches)
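For the nucleotide case, finding a hit amounts to scanning the subject for any word present in the query's lookup table. A minimal sketch follows; `find_exact_hits` is a hypothetical helper, and because the slide shows only the matching word rather than the full query, the word itself stands in as the query here.

```python
def find_exact_hits(query, subject, word_size):
    """Return (query_pos, subject_pos) pairs where a query word occurs exactly in the subject."""
    lookup = {}
    for i in range(len(query) - word_size + 1):
        lookup.setdefault(query[i:i + word_size], []).append(i)

    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in lookup.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


subject = "ATCGCCATGCTTAATTGGGCTT"
print(find_exact_hits("CATGCTTAATT", subject, 11))
```

Protein BLAST applies a stricter rule on top of this idea: two neighborhood-word matches must fall within 40 residues of each other before an extension is attempted.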
Scoring Systems - Nucleotides
Identity matrix
      A    G    C    T
A    +1   –3   –3   –3
G    –3   +1   –3   –3
C    –3   –3   +1   –3
T    –3   –3   –3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19-9 = 10
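The raw score on this slide can be reproduced by walking the two aligned strings column by column. This is a sketch under one assumption: the slide's arithmetic (19 matches minus 9) implies that the single gap column costs 3, the same as a mismatch, so the `gap` parameter below is set to -3 to match. Real BLAST gap costs are configured separately.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum per-column scores for two aligned sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap          # assumed gap cost, inferred from the slide
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))
```

The alignment has 19 matches, 2 mismatches and 1 gap, giving 19 - 6 - 3 = 10.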
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
• Derived from observation; small dataset of alignments
• Implicit model of evolution
• All calculated from PAM1
• PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
• Derived from observation; large dataset of highly
conserved blocks
• Each matrix derived separately from blocks with a
defined percent identity cutoff
• BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
PSI- and RPS-BLAST
BLOSUM62
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
• Common amino acids have low weights
• Rare amino acids have high weights
• Positive for more likely substitutions
• Negative for less likely substitutions
Scores
Simply add the scores for each pair of aligned residues
            V    D    S    –    C    Y
            V    E    T    L    C    F    Total
BLOSUM62   +4   +2   +1  -12   +9   +3      7
PAM30      +7   +2    0  -10  +10   +2     11
Different matrices produce different scores!
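The totals on this slide are just column sums. The sketch below stores only the pair scores and gap costs printed on the slide (not the full matrices) and adds them up; the dictionary and function names are hypothetical.

```python
# Pair scores and gap costs copied from the slide; only the pairs needed here.
SLIDE_SCORES = {
    "BLOSUM62": {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1,
                 ("C", "C"): 9, ("Y", "F"): 3, "gap": -12},
    "PAM30":    {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0,
                 ("C", "C"): 10, ("Y", "F"): 2, "gap": -10},
}


def alignment_score(top, bottom, scores):
    """Add the score of every aligned pair; '-' in either row is charged the gap cost."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += scores["gap"]
        else:
            total += scores[(a, b)]
    return total


for name, scores in SLIDE_SCORES.items():
    print(name, alignment_score("VDS-CY", "VETLCF", scores))
```

Because each matrix assigns different values to the same residue pairs, raw scores from different matrices are not directly comparable.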
Local Alignment Statistics
High scores of local alignments between two random sequences
follow the Extreme Value Distribution
Expect Value
E = number of database hits you expect to find by chance
(Figure: number of alignments versus score, annotated with "size of database", "your score", and "expected number of random hits")
At low E values, E approximates a P value
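The relationship between score, database size, E and P can be made concrete with the standard Karlin-Altschul formulas: for a bit score S', E = m * n * 2^(-S'), and the chance of at least one random hit is P = 1 - e^(-E). The sketch below uses those formulas; the query length and database size are made-up illustrative values, and real BLAST uses effective (edge-corrected) lengths.

```python
import math


def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'): expected number of chance hits with bit score >= S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit."""
    return -math.expm1(-e_value)


# Hypothetical search: 300-residue query against a database of 1e9 residues.
for bits in (30, 40, 50, 60):
    e = expect_value(bits, 300, 10**9)
    print(f"{bits} bits  E = {e:.3g}  P = {p_value(e):.3g}")
```

For small E the two numbers are nearly identical, which is why E can be read as a P value at the low end; for large E, P saturates at 1 while E keeps growing. Doubling the database size doubles E for the same score.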
BLAST Databases for Proteins
nr (non-redundant protein sequences)
– GenBank CDS translations
– NP_ RefSeqs
– PIR, Swiss-Prot, PRF
– PDB (sequences from structures)
swissprot
pat – patents
pdb – sequences with 3D structures
month – sequences updated within 30 days
Assessment of BLAST output
• What is the level of identity and similarity of the
best hits?
– More identity - more likely the proteins may have
similar functions.
• Does the area of similarity occur over the entire
protein? Or just part of the protein? (fig. 2.19)
– Often you will find hits to only part of your protein. A
GTP-binding domain for example.
• Have any of the best hits been characterized
experimentally?
– With so many microbial genomes sequenced chances
are you will have to search extensively to find a hit that
has been characterized experimentally.
BLAST Formatting Page
BLAST Output: Graphic Overview
(Figure labels: SH3 and PX domains)
BLAST Output: Descriptions
E value = 4 × 10^-68
Links to Entrez
Default E value cutoff = 10
TaxBLAST: Taxonomy Reports
BLAST Output: Alignments
>gi|12643956|sp|Q9Y5X1|SNX9_HUMAN Sorting nexin 9 (SH3 and PX domain-containing protein 1) (SDP1 protein)
Length = 595
Score = 255 bits (652), Expect = 4e-68
Identities = 140/322 (43%), Positives = 185/322 (56%), Gaps = 7/322 (2%)

Query: 221 SSATVSRNLNRFSTFVKSGGEAFVLGEASGFVKDGDKLCVVLGPYGPEWQENPYPFQCTI 280
           SS+++   LN+F  F K G E ++L  A    K  +K+ +++G YGP W      F C +
Sbjct: 197 SSSSMKIPLNKFPGFAKPGTEQYLL--AKQLAKPKEKIPIIVGDYGPMWVYPTSTFDCVV 254

Query: 281 DDPTKQTKFKGMKSYISYKLVPTHTQVPVHRRYKHFDWLYARLAEKF-PVISVPHLPEKQ 339
            DP K +K  G+KSYI Y+L PT+T   V+ RYKHFDWLY RL  KF   I +P LP+KQ
Sbjct: 255 ADPRKGSKMYGLKSYIEYQLTPTNTNRSVNHRYKHFDWLYERLLVKFGSAIPIPSLPDKQ 314

Query: 340 ATGRFEEDFISKRRKGLIWWMNHMASHPVLAQCDVFQHFLTCPSSTDEKAWKQGKRKAEK 399
            TGRFEE+FI  R + L  WM  M  HPV+++ +VFQ FL   +  DEK WK GKRKAE+
Sbjct: 315 VTGRFEEEFIKMRMERLQAWMTRMCRHPVISESEVFQQFL---NFRDEKEWKTGKRKAER 371
Blink – Protein BLAST Alignments
• Lists only 200 hits
• List is nonredundant
Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana
        aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
Human:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
         +  +  V  +     V  L  G     Q  W  G  D  E  G
A.th.:   S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
        agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
BLASTp finds three matching words
BLASTn finds no match, because the two sequences share no 7 bp words
Protein searches are generally more sensitive than nucleotide searches.
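The claim about shared words can be checked directly by intersecting the word sets of the two sequences. This is a sketch that looks only for exact shared words on the given strand; BLASTp's own word count also depends on neighborhood scoring and extension rules, so the number it reports need not equal the number of exact shared 3-mers printed here. The helper name `words` is hypothetical.

```python
def words(seq, size):
    """Return the set of all overlapping words of the given size."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


human_nt = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
ath_nt = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_aa = "NRVTVVLGAQWGDEG"
ath_aa = "SQVSGVLGCQWGDEG"

shared_nt = words(human_nt, 7) & words(ath_nt, 7)   # minimum blastn word size
shared_aa = words(human_aa, 3) & words(ath_aa, 3)   # default blastp word size

print("shared 7 bp words:", sorted(shared_nt))
print("shared 3 aa words:", sorted(shared_aa))
```

Synonymous codon differences break up every run of identical nucleotides before it reaches 7 bp, while the encoded protein still shares several 3-residue words.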
Translated BLAST: Nucleotide to Protein
Particularly useful for nucleotide sequences without
protein annotations, such as ESTs or genomic DNA
N = nucleotide, P = protein, PPP PPP = six-frame translation
Program    Query           Database
blastx     N -> PPP PPP    P
tblastn    P               N -> PPP PPP
tblastx    N -> PPP PPP    N -> PPP PPP
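The six-frame translation that the translated BLAST programs rely on can be sketched as follows, using the standard genetic code. The function names are hypothetical, and the example sequence is the human ADSS fragment from the nucleotide versus protein comparison in this lecture.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3))


def reverse_complement(dna):
    """Return the reverse complement of an ACGT sequence."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(dna)
    return [translate(dna, f) for f in range(3)] + [translate(rc, f) for f in range(3)]


for frame in six_frames("aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"):
    print(frame)
```

blastx applies this to the query, tblastn to every database sequence, and tblastx to both, which is why tblastx is the most expensive of the three.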
Linking Protein Sequence,
Structure, and Function
Protein
Domains
Protein sequences
CDD: Conserved functional domains in
proteins represented by a PSSM
PSI-BLAST, RPS-BLAST, CDART
3D Domains
Position Specific Substitution Rates
Weakly conserved serine
Active site serine
Position Specific Score Matrix (PSSM)
Pos Res    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206  D     0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207  G    -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208  V    -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209  I    -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210  S    -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211  S     4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212  C    -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213  N    -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214  G    -2  -3  -3  -4  -4  -5  -4   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215  D    -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216  S    -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218  G    -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219  P    -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220  L    -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221  N    -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222  C     0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223  Q     0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224  A    -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine is scored differently in these two positions (weakly conserved serine versus the active site nucleophile).
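Scoring against a PSSM differs from scoring against BLOSUM62 only in that the lookup depends on the position as well as the residue. The sketch below uses the rows for positions 215 to 218 of the PSSM above; the function name `pssm_score` and the test peptides are hypothetical.

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows for positions 215-218, in the column order given by COLUMNS.
ROWS = {
    215: "-5 -5 -2 9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7",
    216: "-2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5",
    217: "-3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7",
    218: "-3 -6 -4 -5 -6 -5 -6 8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7",
}
PSSM = {pos: dict(zip(COLUMNS, map(int, row.split()))) for pos, row in ROWS.items()}


def pssm_score(peptide, start):
    """Score an ungapped peptide placed at PSSM position start."""
    return sum(PSSM[start + offset][res] for offset, res in enumerate(peptide))


print(pssm_score("DSGG", 215))  # matches the conserved motif
print(pssm_score("DAGG", 215))  # serine replaced by alanine at position 216
```

With BLOSUM62 an S-to-A substitution always costs the same amount; here the penalty is specific to position 216, where serine scores 7 and alanine scores -2.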
PSI-BLAST
Create your own PSSM:
Confirming relationships of purine
nucleotide metabolism proteins
query
PSSM
BLOSUM62
Alignment
PSI BLAST
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY
VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ
EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG
RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH
VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY
e value cutoff for PSSM
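The step from alignment to PSSM can be illustrated with a simple log-odds calculation. This is a deliberately simplified sketch, not PSI-BLAST's actual procedure (which weights sequences and uses data-dependent pseudocounts): it assumes a uniform background frequency, a fixed pseudocount, and a made-up toy alignment. `build_pssm` is a hypothetical name.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log2 odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumed uniform background
    n = len(aligned_seqs)
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount * background) / (n + pseudocount)
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Toy ungapped alignment (hypothetical sequences, for illustration only).
toy_alignment = ["GDSGGP", "GDSGGP", "GDSGGA", "GNSGGP"]
toy_pssm = build_pssm(toy_alignment)
for position, scores in enumerate(toy_pssm):
    best = max(scores, key=scores.get)
    print(position, best, round(scores[best], 2))
```

Invariant columns receive high scores for the conserved residue and negative scores for everything else, while variable columns spread the score across the residues observed. Each PSI-BLAST iteration rebuilds the matrix from the hits that pass the E value cutoff.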
PSI Results: Initial BLAST Run
First PSSM Search
Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence
Just below threshold, another
nucleotide metabolism enzyme
Domains
Entrez Domains (CDD): 16,482 records
A Database of Position Specific Score Matrices
Pfam 35%
• Sanger Center
• Pfam-A seeds: HMM based models representing a wide variety of functional domains derived from SWISS-PROT
COG 30%
• NCBI
• BLAST based alignments derived from complete proteomes of unicellular organisms
KOG 29%
• NCBI
• Eukaryotic COGs
SMART 4%
• EMBL
• HMM based models originally concentrating on eukaryotic signaling domains, now expanding
CDD 2%
• NCBI Curated Alignments
LOAD 0.3%
• NCBI
• Library of Ancient Domains