CS177
Review/Summary of the Madej lectures
Tom Madej 12.07.05
Overview
• Basic biology.
• Protein/DNA sequence comparison.
• Protein structure comparison/classification.
• NCBI databases overview.
• Miscellaneous topics.
Lodish et al. Molecular Cell Biology, W.H. Freeman 2000
Protein/DNA sequence comparison
• What is the meaning of a sequence alignment?
• Scoring methods; amino acid substitution matrices,
PSSMs.
• Basic computational methods; e.g. BLAST.
• Know how to run PSI-BLAST, interpret the results.
Homology
“… whenever statistically significant sequence or structural
similarity between proteins or protein domains is
observed, this is an indication of their divergent evolution
from a common ancestor or, in other words, evidence of
homology.”
E.V. Koonin and M.Y. Galperin, Sequence – Evolution – Function,
Kluwer 2003
A simple phylogenetic tree…
Human hemoglobin and more distantly
related globins
• Human and horse
• Human and fish
• Human and insect
• Human and bacteria
Alignment notation: different notations for
the same alignment!
VISDWNMPN-------MDGLE
CILVV----AANDGPMPQTRE
VISDWnm---pnMDGLE
CILVVaandgpmPQTRE
Computing sequence alignments
• You must be able to recognize the “answer” (correct
alignment) when you see it (scoring system).
• You must be able to find the answer; i.e. compute it
efficiently.
Scoring and computing alignments
• “Position independent” amino acid substitution tables;
e.g. BLOSUM62.
• Exact alignment algorithms based on dynamic programming,
such as Needleman-Wunsch (global) or Smith-Waterman
(local); or fast heuristics such as BLAST.
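The dynamic programming idea can be sketched in a few lines. This is a minimal illustration of Smith-Waterman local alignment scoring, not the lecture's own code: it assumes a toy match/mismatch score and a linear gap penalty (the parameter values are arbitrary) instead of BLOSUM62 with affine gaps, and it returns only the best score, without the traceback.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (score only)."""
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1];
    # the floor at 0 is what makes the alignment local.
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_score("TTACGTT", "GGACGGG"))  # 6: the shared "ACG"
```

Every cell of the table is filled in, so the running time grows with the product of the two sequence lengths. This is why heuristics such as BLAST are preferred for database searches.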
Score this alignment:
VISDWnm---pnMDGLE
CILVVaandgpmPQTRE
Use: BLOSUM62 matrix; gap opening penalty 10;
gap extension penalty 1
(-1 + 4 - 2 - 3 - 3) - 10 - 1*11 + (-2 + 0 - 2 - 2 + 5) = -27
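The arithmetic above can be checked with a short script. This is a sketch under stated assumptions: the dictionary holds only the ten BLOSUM62 entries needed for this example (not the full matrix), the function name is hypothetical, and, following the slide, the whole unaligned region is charged one gap opening plus one extension per gap column (11 in total). Charging each of the two gap runs its own opening penalty would give a different total.

```python
# BLOSUM62 entries for the ten aligned residue pairs in this example only.
BLOSUM62_SUBSET = {
    ("V", "C"): -1, ("I", "I"): 4, ("S", "L"): -2, ("D", "V"): -3, ("W", "V"): -3,
    ("M", "P"): -2, ("D", "Q"): 0, ("G", "T"): -2, ("L", "R"): -2, ("E", "E"): 5,
}


def score_alignment(top, bottom, gap_open=10, gap_extend=1):
    """Score two gapped rows, treating all gap columns as a single gap."""
    pair_score = 0
    gap_columns = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gap_columns += 1
        else:
            pair_score += BLOSUM62_SUBSET[(x, y)]
    gap_penalty = gap_open + gap_extend * gap_columns if gap_columns else 0
    return pair_score - gap_penalty


print(score_alignment("VISDWNMPN-------MDGLE",
                      "CILVV----AANDGPMPQTRE"))  # -27
```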
BLAST (Basic Local Alignment Search Tool)
• Extremely fast, can be on the order of 50-100 times
faster than Smith-Waterman.
• Method of choice for database searches.
• Statistical theory for significance of results (extreme
value distribution).
• Heuristic; does not guarantee optimal results.
• Many variants, e.g. PHI-, PSI-, RPS-BLAST.
Why database searches?
• Gene finding.
• Assigning likely function to a gene.
• Identifying regulatory elements.
• Understanding genome evolution.
• Assisting in sequence assembly.
• Finding relations between genes.
Issues in database searches
• Speed.
• Relevance of the search results (selectivity).
• Recovering all information of interest (sensitivity).
– The results depend on the search parameters, e.g. gap
penalty, scoring matrix.
– Sometimes searches with more than one matrix should be
performed.
E-values, P-values
• E-value (expectation value): the number of hits scoring at
least the given score that you would expect by random
chance in the search database.
• P-value (probability value): the probability that at least one
hit would attain at least the given score by random chance
in the search database.
• E-values are easier to interpret than P-values.
• If the E-value is small enough, e.g. no more than 0.10,
then it is essentially a P-value.
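The last point follows from the usual BLAST statistics, where the number of chance hits is treated as Poisson distributed, so P = 1 - exp(-E). A small sketch (the function name is an illustrative choice):

```python
import math


def evalue_to_pvalue(e_value):
    """P(at least one chance hit) when chance hits are Poisson with mean E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (0.001, 0.01, 0.10, 1.0, 10.0):
    print(f"E = {e:<6} P = {evalue_to_pvalue(e):.4f}")
```

For E = 0.10 this gives P of about 0.095, so the two values nearly coincide. For large E the P-value saturates near 1 while the E-value keeps growing, which is one reason E-values are easier to interpret.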
PSI-BLAST
• Position Specific Iterated BLAST
• As a first step runs a (regular) BLAST.
• Hits that cross the threshold are used to construct a
position specific score matrix (PSSM).
• A new search is done using the PSSM to find more
remotely related sequences.
• The last two steps are iterated until convergence.
PSSM (Position Specific Score Matrix)
• One column per residue in the query sequence.
• Per-column residue frequencies are computed so that
log-odds scores may be assigned to each residue type in
each column.
• There are difficulties; e.g. pseudo-counts are needed if
there are few sequences, and the sequences must be
weighted to compensate for redundancy.
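A toy version of the column-by-column computation is sketched below. It is a simplification, not how PSI-BLAST actually does it: it assumes a uniform background frequency of 1/20 for every amino acid, spreads a single pseudo-count over the residue types, omits sequence weighting altogether, and uses hypothetical function and parameter names.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned_seqs, pseudocount=1.0, background=None):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    if background is None:
        # Assumption: uniform background; real tools use observed frequencies.
        background = {aa: 1 / 20 for aa in AMINO_ACIDS}
    pssm = []
    for col in range(len(aligned_seqs[0])):
        # Start from pseudo-counts so unseen residues do not score -infinity.
        counts = {aa: pseudocount * background[aa] for aa in AMINO_ACIDS}
        for seq in aligned_seqs:
            if seq[col] in counts:  # skips gap characters
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2(counts[aa] / total / background[aa])
                     for aa in AMINO_ACIDS})
    return pssm


pssm = build_pssm(["GDSGG", "GDSGG", "GDAGG", "GDSGA"])
print(round(pssm[2]["S"], 2), round(pssm[2]["A"], 2), round(pssm[2]["W"], 2))
```

In the third column serine scores highest, alanine lower, and a residue never observed there (W) gets a negative score.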
Two key advantages of PSSMs
• More sensitive scoring because of improved estimates of
probabilities for a.a.’s at specific positions.
• Describes the important motifs that occur in the protein
family and therefore enhances the selectivity.
Position Specific Substitution Rates
Weakly conserved serine
Active site serine
Position Specific Score Matrix (PSSM)
Pos Res   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206  D    0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207  G   -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208  V   -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209  I   -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210  D   -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211  S    4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212  C   -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213  N   -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214  G   -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215  D   -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216  S   -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217  G   -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218  G   -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219  P   -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220  L   -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221  N   -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222  C    0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223  Q    0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224  A   -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3
Serine scored differently in these two positions: 211 (S = 4) and 216 (S = 7).
Active site nucleophile: serine 216.
PSI-BLAST key points
• The first PSSM is constructed from all hits that cross the
significance threshold using “standard” BLAST.
• The search is then carried out with the PSSM to draw in
new significant hits.
• If new hits are found then a new PSSM is constructed;
these last two steps are iterated.
• The computation terminates upon “convergence”, i.e.
when no new sequences are found to cross the
significance threshold.
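The control flow of the iteration can be sketched as follows. This shows only the loop structure, not PSI-BLAST itself: `search` and `build_profile` are hypothetical stand-ins for "run a search and keep the hits over the significance threshold" and "construct a PSSM from the current hits", and the demo replaces sequences with integers that count as "similar" when they differ by at most 1.

```python
def iterate_until_convergence(query, search, build_profile, max_iterations=10):
    """Repeat profile-building and searching until no new hits appear."""
    # First pass: a profile built from the query alone plays the role of
    # the initial standard BLAST search.
    hits = set(search(build_profile(query, set())))
    for _ in range(max_iterations):
        profile = build_profile(query, hits)
        new_hits = set(search(profile)) - hits
        if not new_hits:  # convergence: nothing new crossed the threshold
            break
        hits |= new_hits
    return hits


# Toy demo with hypothetical data: each round pulls in "sequences" one step
# further from the query, like increasingly remote homologs.
database = {0, 1, 2, 3, 7, 8}


def toy_search(profile):
    return {x for x in database if any(abs(x - m) <= 1 for m in profile)}


def toy_build_profile(query, hits):
    return {query} | hits


print(sorted(iterate_until_convergence(0, toy_search, toy_build_profile)))
# [0, 1, 2, 3]
```

A single search from the query finds only 0 and 1. Iterating reaches 2 and 3 as well, while 7 and 8 are never drawn in.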
Protein structure comparison/classification
• Protein secondary structure elements.
• Supersecondary structures (simple structure motifs).
• Folds and domains.
• Comparing structures (VAST).
• Superfolds.
• Fold classification (SCOP).
• Conserved Domain Database (CDD).
α-helix (3chy)
backbone atoms
with sidechains
Parallel β-strands (3chy)
Anti-parallel β-strands (1hbq)
Higher level organization
• A single protein may consist of multiple domains.
Examples: 1liy A, 1bgc A. The domains may or may not
perform different functions.
• Proteins may form higher-level assemblies. Useful for
complicated biochemical processes that require several
steps, e.g. processing/synthesis of a molecule.
Example: 1l1o chains A, B, C.
Supersecondary structures
• β-hairpin
• α-hairpin
• βαβ-unit
• β4 Greek key
• βα Greek key
Supersecondary structure: simple units
G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981
Supersecondary structure: Greek key motifs
G.M. Salem et al. J. Mol. Biol. (1999) 287 969-981
Protein folds
• There is a continuum of similarity!
• Fold definition: two folds are similar if they have a similar
arrangement of SSEs (architecture) and connectivity
(topology). Sometimes a few SSEs may be missing.
• Fold classification: To get an idea of the variety of
different folds, one must adjust for sequence redundancy
and also try to correctly assign homologs that have low
sequence identity (e.g. below 25%).
Vector Alignment Search Tool (VAST)
• Fast structure comparison based on representing SSEs
by vectors.
• A measure of statistical significance (VAST E-value) is
computed (very differently from a BLAST E-value).
• VAST structure neighbor lists useful for recognizing
structural similarity.
Superfolds (Orengo, Jones, Thornton)
• Distribution of fold types is highly non-uniform.
• There are about 10 types of folds, the superfolds, to which about
30% of the other folds are similar. There are many examples of
“isolated” fold types.
• Superfolds are characterized by a wide range of sequence diversity
and span a range of dissimilar functions.
• The evolutionary relationships of the superfolds are still a research
question, i.e. did they arise by divergent or convergent evolution?
Superfolds and examples
• Globin: 1hlm sea cucumber hemoglobin; 1cpcA phycocyanin;
1colA colicin
• α-up-down: 2hmqA hemerythrin; 256bA cytochrome B562; 1lpe
apolipoprotein E3
• Trefoil: 1i1b interleukin-1β; 1aaiB ricin; 1tie erythrina trypsin
inhibitor
• TIM barrel: 1timA triosephosphate isomerase; 1ald aldolase;
5rubA rubisco
• OB fold: 1quqA replication protein A 32kDa subunit; 1mjc major
cold-shock protein; 1bcpD pertussis toxin S5 subunit
• α/β doubly-wound: 5p21 Ras p21; 4fxn flavodoxin; 3chy CheY
• Immunoglobulin: 2rhe Bence-Jones protein; 2cd4 CD4; 1ten
tenascin
• UB αβ roll: 1ubq ubiquitin; 1fxiA ferredoxin; 1pgx protein G
• Jelly roll: 2stv tobacco necrosis virus; 1tnfA tumor necrosis
factor; 2ltnA pea lectin
• Plaitfold (Split αβ sandwich): 1aps acylphosphatase; 1fxd
ferredoxin; 2hpr histidine-containing phosphocarrier
Fold classification (when you have the
structure…)
• First, look up PubMed abstracts for any relevant papers.
E.g. if this is from a PDB file there will be references in it.
• Try checking SCOP or CATH.
• Look at VAST neighbors. See if the structure in question
is highly similar to another structure with a known fold.
SCOP (Structural Classification of Proteins)
• http://scop.mrc-lmb.cam.ac.uk/scop/
• Levels of the SCOP hierarchy:
– Family: clear evolutionary relationship
– Superfamily: probable common evolutionary origin
– Fold: major structural similarity
Bioinformatics databases
• Entrez is by far the most useful, because of the links
between the individual databases, e.g. literature,
sequence, structure, taxonomy, etc.
• Other specialty databases available on the internet can
also be very useful, of course!
Links Between and Within Nodes
(Diagram: the Entrez nodes and the computed links within them.
PubMed abstracts: word weight; 3-D structures: VAST; taxonomy:
phylogeny; nucleotide sequences: BLAST; protein sequences: BLAST;
genomes.)
Entrez queries
• Be able to formulate queries using index terms
(Preview/Index), and limits.
Exercises!
• How many protein structures are there that include DNA and are
from bacteria?
• In PubMed, how many articles are there from the journal Science
and have “Alzheimer” in the title or abstract, and “amyloid beta”
anywhere? How many since the year 2000?
• Notice that the results are not 100% accurate!
• In 3D Domains, how many domains are there with no more than two
helices and 8 to 10 strands and are from the mouse?
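As an illustration of index terms, the PubMed exercise above could be written as a fielded query string. The sketch below only assembles the text to paste into the PubMed search box. The helper name and its parameters are hypothetical, the field tags shown are standard PubMed ones, and the index terms for the Structure and 3D Domains databases are not shown here and should be looked up under Preview/Index.

```python
def pubmed_query(journal, title_abstract_term, anywhere_term, since_year=None):
    """Assemble a fielded PubMed query string (no network access)."""
    parts = [
        f'"{journal}"[Journal]',
        f"{title_abstract_term}[Title/Abstract]",
        f'"{anywhere_term}"[All Fields]',
    ]
    if since_year is not None:
        # Open-ended publication date range; 3000 serves as a far-future
        # upper bound.
        parts.append(f"{since_year}:3000[Date - Publication]")
    return " AND ".join(parts)


print(pubmed_query("Science", "Alzheimer", "amyloid beta"))
print(pubmed_query("Science", "Alzheimer", "amyloid beta", since_year=2000))
```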
P53 tumor suppressor protein
• Li-Fraumeni syndrome: having only one functional copy of
p53 predisposes to cancer.
• Mutations in p53 are found in most tumor types.
• p53 binds to DNA and stimulates another gene to
produce p21, which binds to another protein, cdk2. This
prevents the cell from progressing through the cell cycle.
G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21 217-228.
Exercise!
• Use Cn3D to investigate the binding of p53 to DNA.
• Formulate a query for Structure that will require the DNA
molecules to be present (there are 2 structures like this).
Miscellaneous topics
• BLAST a sequence against a genome; locate hits on
chromosomes with map viewer.
• Obtain genomic sequence with map viewer.
• Spidey to predict intron/exon structure (but we won’t use
spidey on the exam!).
• How sequence variations can affect protein
structure/function.
“EST exercise” summary
• BLAST the EST (or other DNA seq) against the genome.
• From the BLAST output you can get the genomic coordinates of any
nucleotide differences.
• Use map viewer to locate the hit on a chromosome; assume the hit
is in the region of a gene.
• By following the gene link you can get an accession for mRNA.
• By using the “dl” link you can get an accession for the genomic
sequence.
• Use “spidey” with the mRNA and genomic sequence to locate
changed residues in the protein.
“EST exercise” summary (cont.)
• From the gene report you can follow the protein link, and then
“Blink”.
• From the BLAST link page you can get to CDD and related
structures.
• Since you know where the changed residues are, you can use the
structures to study what effect the changes might have on the
function of the protein.
Gene variants that can affect protein
function
• Mutation to a stop codon; truncates the protein product!
• Insertion/deletion of multiple bases; changes the sequence of amino
acid residues.
• Single point change could alter folding properties of the protein.
• Single point change could affect the active site of the protein.
• Single point change could affect an interaction site with another
molecule.
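The first and third-to-fifth cases can be made concrete with a small translation sketch. The DNA sequence below is made up for illustration. The codon table is the standard genetic code, written compactly with bases in T, C, A, G order.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":  # stop codon: the product is truncated here
            break
        protein.append(aa)
    return "".join(protein)


wild_type = "ATGGCCTGGAAAGAT"  # hypothetical sequence
nonsense = "ATGGCCTGAAAAGAT"   # TGG -> TGA: tryptophan becomes a stop codon
missense = "ATGGTCTGGAAAGAT"   # GCC -> GTC: alanine becomes valine

print(translate(wild_type))  # MAWKD
print(translate(nonsense))   # MA
print(translate(missense))   # MVWKD
```

A single base change either truncates the product (nonsense) or swaps one residue (missense). Whether a missense change matters depends on where the residue sits in the structure, which is what the structure-based steps of the exercise are meant to answer.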
Important note!
• Most diseases (e.g. cancer) are complex and involve
multiple factors (not just a single malfunctioning protein!).