Finding Patterns in Protein Sequence and Structure

Bioinformatics
For MNW 2nd Year
Jaap Heringa
FEW/FALW
Integrative Bioinformatics Institute VU (IBIVU)
Current Bioinformatics Unit
• Jens Kleinjung (1/11/02)
• Victor Simossis – PhD (1/12/02)
• Radek Szklarczyk - PhD (1/01/03)
• John Romein (1/12/02, Henri Bal)
Bioinformatics course 2nd year
MNW spring 2003
• Pattern recognition
– Supervised/unsupervised learning
– Types of data, data normalisation, lacking data
– Search image
– Similarity tables
– Clustering
– Principal component analysis
– Discriminant analysis
• Protein
– Folding
– Structure and function
– Protein structure prediction
– Secondary structure
– Tertiary structure
– Function
– Post-translational modification
– Prot.-Prot. Interaction -- Docking algorithm
– Molecular dynamics/Monte Carlo
• Sequence analysis
– Pairwise alignment
– Dynamic programming (NW, SW, shortcuts)
– Multiple alignment
– Combining information
– Database/homology searching (Fasta, Blast, Statistical issues-E/P values)
• Gene structure and gene finding algorithm
• Omics
– DNA makes RNA makes protein
– Expression data, Nucleus to ribosome,
translation, etc.
– Metabolomics
– Physiomics
– Databases
• DNA, EST
• Protein sequence
• Protein structure
o Microarray data
o Protein structure (PDB)
o Proteomics
o Mass spectrometry/NMR/X-ray?
• Bioinformatics method development
• IPR issues
• Programming and scripting languages
• Web solutions
• Computational issues
– NP-complete problems
– CPU, memory, storage problems
– Parallel computing
• Bioinformatics method usage/application
• Molecular viewers (RasMol, MolMol, etc.)
Gathering knowledge
• Anatomy, architecture (Rembrandt, 1632)
• Dynamics, mechanics (Newton, 1726)
• Informatics (Cybernetics – Wiener, 1948)
(Cybernetics has been defined as the science of control in machines and
animals, and hence it applies to technological, animal and environmental
systems)
• Genomics, bioinformatics
Bioinformatics
[Figure: bioinformatics at the intersection of chemistry, biology, molecular biology, mathematics, statistics, computer science, informatics, medicine and physics]
Bioinformatics
“Studying informational processes in biological systems”
(Hogeweg, early 1970s)
• No computers necessary
• Back of envelope OK
“Information technology
applied to the management and
analysis of biological data”
(Attwood and Parry-Smith)
Applying algorithms with mathematical formalisms in
biology (genomics) -- USA
Bioinformatics in the olden days
• Close to Molecular Biology:
– (Statistical) analysis of protein and nucleotide
structure
– Protein folding problem
– Protein-protein and protein-nucleotide
interaction
• Many essential methods were created early
on (BG era)
– Protein sequence analysis (pairwise and
multiple alignment)
– Protein structure prediction (secondary, tertiary
structure)
Bioinformatics in the olden days
(Cont.)
• Evolution was studied and methods created
– Phylogenetic reconstruction (clustering – NJ method)
The Human Genome -- 26 June 2000
Dr. Craig Venter, Celera Genomics -- Shotgun method
Sir John Sulston, Human Genome Project
Human DNA
• There are about 3bn (3 × 10^9) nucleotides in the
nucleus of almost all of the trillions (3.5 × 10^12) of
cells of a human body (an exception is, for example,
red blood cells which have no nucleus and therefore
no DNA) – a total of ~10^22 nucleotides!
• Many DNA regions code for proteins, and are called
genes (1 gene codes for 1 protein in principle)
• Human DNA contains ~30,000 expressed genes
• Deoxyribonucleic acid (DNA) comprises 4 different
types of nucleotides: adenine (A), thymine (T),
cytosine (C) and guanine (G). These nucleotides are
sometimes also called bases
Human DNA (Cont.)
• All people are different, but the DNA of different
people varies by only 0.2% or less. So, only 2
letters in 1000 are expected to be different. Over
the whole genome of 3 × 10^9 letters, this means that
up to about 6 million letters would differ between individuals.
• The structure of DNA is the so-called double
helix, discovered by Watson and Crick in 1953,
where the two helices are cross-linked by A-T and
C-G base-pairs (nucleotide pairs – so-called
Watson-Crick base pairing).
Up to here 3/2 – 10.45-12.30
DNA compositional biases
• Base composition of genomes:
• E. coli: 25% A, 25% C, 25% G, 25% T
• P. falciparum (Malaria parasite): 82%A+T
• Translation initiation:
• ATG is the near universal motif indicating the
start of translation in DNA coding sequence.
Some facts about human genes
• Comprise about 3% of the genome
• Average gene length: ~ 8,000 bp
• Average of 5-6 exons/gene
• Average exon length: ~200 bp
• Average intron length: ~2,000 bp
• ~8% genes have a single exon
• Some exons can be as small as 1 or 3 bp.
• HUMFMR1S is not atypical: 17 exons 40-60 bp long,
comprising 3% of a 67,000 bp gene
Genetic diseases
• Many diseases run in families and are a result of
genes which predispose such family members to
these illnesses
• Examples are Alzheimer’s disease, cystic fibrosis
(CF), breast or colon cancer, or heart diseases.
• Some of these diseases can be caused by a problem
within a single gene, such as with CF.
Genetic diseases (Cont.)
• For other illnesses, like heart disease, at least 20-30
genes are thought to play a part, and it is still
unknown which combination of problems within
which genes are responsible.
• A “problem” within a gene means that a single
nucleotide, or a combination of nucleotides, within
the gene causes the disease (or prevents the body
from fighting the disease sufficiently).
• Persons with different combinations of these
nucleotides could then be unaffected by these
diseases.
Genetic diseases (Cont.)
Cystic Fibrosis
• Known since very early on (“Celtic gene”)
• Inherited autosomal recessive condition (Chr. 7)
• Symptoms:
– Clogging and infection of lungs (early death)
– Intestinal obstruction
– Reduced fertility and (male) anatomical anomalies
• CF gene CFTR has 3-bp deletion leading to Del508
(Phe) in 1480 aa protein (epithelial Cl- channel) –
protein degraded in ER instead of inserted into cell
membrane
Genomic Data Sources
• DNA/protein sequence
• Expression (microarray)
• Proteome (xray, NMR,
mass spectrometry)
• Metabolome
• Physiome (spatial,
temporal)
Integrative
bioinformatics
Genomic Data Sources
Vertical Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion: Integrative Bioinformatics & Genomics VU
A gene codes for a protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
Humans have
spliced genes…
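The DNA → mRNA → protein example above can be reproduced with a short sketch. It assumes that the DNA shown is the coding strand (so transcription amounts to replacing T by U) and that the standard genetic code applies; the function names are illustrative and not taken from the lecture.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna):
    """Coding-strand DNA to mRNA (assumption: no splicing, T simply becomes U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(mrna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(mrna))  # PEPTIDE
```

Spliced genes would need the introns removed before translation; that step is deliberately left out here.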
DNA makes RNA makes Protein
Remark
• The problem of identifying (annotating) human genes is
considerably harder than the early success story for β-globin might suggest.
• The human factor VIII gene (whose mutations cause
hemophilia A) is spread over ~186,000 bp. It consists
of 26 exons ranging in size from 69 to 3,106 bp, and its
25 introns range in size from 207 to 32,400 bp. The
complete gene is thus ~9 kb of exon and ~177 kb of
intron.
• The biggest human gene yet is for dystrophin. It has
> 30 exons and is spread over 2.4 million bp.
DNA makes RNA makes Protein:
Expression data
• More copies of mRNA for a gene lead to
more protein
• mRNA can now be measured for all the
genes in a cell at once through microarray
technology
• Can have 60,000 spots (genes) on a single
gene chip
• Colour change gives intensity of gene
expression (over- or under-expression)
Metabolic networks
Glycolysis and Gluconeogenesis
KEGG database (Japan)
High-throughput Biological Data
• Enormous amounts of biological data are
being generated by high-throughput
capabilities; even more are coming
– genomic sequences
– gene expression data
– mass spec. data
– protein-protein interaction
– protein structures
– ......
Protein structural data explosion
Protein Data Bank (PDB): 14500 Structures (6 March 2001)
10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...
Dickerson’s formula: equivalent to Moore’s law
$n = e^{0.19(y-1960)}$
with y the year.
On 27 March 2001 there were 12,123 3D protein
structures in the PDB: Dickerson’s formula predicts
12,066 (within 0.5%)!
Sequence versus structural data
• Despite structural genomics efforts, growth
of PDB slowed down in 2001-2002 (i.e. did
not keep up with Dickerson’s formula)
• More than 100 completely sequenced
genomes
Increasing gap between structural and
sequence data
Bioinformatics
Large - external
(integrative)
Science
Planetary Science
Population Biology
Sociobiology
Systems Biology
Biology
Human
Cultural Anthropology
Sociology
Psychology
Medicine
Molecular Biology
Chemistry
Physics
Small – internal (individual)
Bioinformatics
• Offers an ever more essential input to
– Molecular Biology
– Pharmacology (drug design)
– Agriculture
– Biotechnology
– Clinical medicine
– Anthropology
– Forensic science
– Chemical industries (detergent industries, etc.)
High-throughput Biological Data
The data deluge
• Hidden in these data is information that
reflects
– existence, organization, activity,
functionality …… of biological machineries
at different levels in living organisms
Most effectively utilising this information will prove
to be essential for Integrative Bioinformatics
Data Issues ……
• Data collection: getting the data
• Data representation: data standards, data normalisation …..
• Data organisation and storage: database issues …..
• Data analysis and data mining: discovering “knowledge”,
patterns/signals, from data, establishing associations among
data patterns
• Data utilisation and application: from data patterns/signals to
models for bio-machineries
• Data visualization: viewing complex data ……
• Data transmission: data collection, retrieval, …..
• ……
Up to here 5/2
Bioinformatics
“Nothing in Biology makes sense except in
the light of evolution” (Theodosius
Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense
except in the light of Biology”
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \cong \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
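The size of this search space can be checked numerically. The sketch below assumes the count of alignments is the central binomial coefficient (2n)!/(n!)^2 with Stirling's approximation 2^(2n)/sqrt(πn), and works with base-10 logarithms so that no huge integers are needed; the function names are illustrative.

```python
import math


def log10_alignments_exact(n):
    """log10 of (2n)! / (n!)^2, computed with log-gamma."""
    return (math.lgamma(2 * n + 1) - 2 * math.lgamma(n + 1)) / math.log(10)


def log10_alignments_approx(n):
    """log10 of the Stirling approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


if __name__ == "__main__":
    for n in (10, 100, 300, 1000):
        print(n, round(log10_alignments_exact(n), 2), round(log10_alignments_approx(n), 2))
```

The two columns agree closely even for small n, which is why the approximation is convenient for back-of-envelope estimates.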
Dynamic programming
Scoring alignments
$S_{a,b} = \sum_{l} s(a_i, b_j) + \sum_{k} N_k \cdot gp(k)$
$gp(k) = p_i + k \, p_e$ (affine gap penalties)
$p_i$ and $p_e$ are the penalties for gap initialisation
and extension, respectively
Dynamic programming
Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20×20)
Gap penalties (open, extension): 10, 1
Score: s(T,T)+s(D,D)+s(W,W)+s(V,L)+Po+2Px+s(L,I)+s(K,K)
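The score of the example alignment can be computed with a few lines of code. This is a minimal sketch under stated assumptions: the exchange values come from a toy identity function (1 for identical residues, 0 otherwise) rather than a real amino acid exchange matrix, and the gap penalties are passed as negative numbers that are added, matching the "+Po+2Px" form of the score above. The name `score_alignment` is hypothetical.

```python
def score_alignment(row_a, row_b, s, p_open, p_ext):
    """Sum exchange values over aligned pairs and add p_open + k * p_ext per gap of length k.

    Simplification: consecutive gap columns count as one gap, even if the gap
    switches from one sequence to the other.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    gap_len = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_len += 1
            continue
        if gap_len:  # a gap has just closed
            total += p_open + gap_len * p_ext
            gap_len = 0
        total += s(x, y)
    if gap_len:  # gap running to the end of the alignment
        total += p_open + gap_len * p_ext
    return total


def toy_identity(x, y):
    # Assumption for illustration only; a real search would use e.g. a PAM or BLOSUM matrix.
    return 1 if x == y else 0


if __name__ == "__main__":
    print(score_alignment("TDWVTALK", "TDWL--IK", toy_identity, p_open=-10, p_ext=-1))  # -8
```

With these toy values the four identical pairs contribute 4 and the single gap of length 2 contributes -12.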
Pairwise sequence alignment
Global dynamic programming
[Figure: search matrix with MDAGSTVILCFVG along the top and MDAASTILCGS down the side, filled using an Amino Acid Exchange Matrix (evolution) and gap penalties (open, extension)]
Resulting alignment:
MDAGSTVILCFVG
MDAAST-ILC--GS
Global dynamic programming
$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1) P_x \} \\ S_{i-1,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1) P_x \} \end{cases}$$
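A compact way to see global dynamic programming at work is the classic Needleman-Wunsch fill of the search matrix. The sketch below is a simplification of the recurrence above: it uses a single linear gap penalty per gapped position instead of separate initialisation and extension penalties, and a match/mismatch score instead of an amino acid exchange matrix. All parameter values are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):   # leading gaps in b
        S[i][0] = i * gap
    for j in range(1, cols):   # leading gaps in a
        S[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diagonal,            # align a[i-1] with b[j-1]
                          S[i - 1][j] + gap,   # gap in b
                          S[i][j - 1] + gap)   # gap in a
    return S[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch("MDAGSTVILCFVG", "MDAASTILCGS"))
```

Filling the matrix takes time proportional to the product of the two sequence lengths, in contrast to the combinatorial number of possible alignments.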
Global dynamic programming
Global dynamic programming
Up to here 17/02/03
Local dynamic programming
(Smith & Waterman, 1981)
[Figure: search matrix with LCFVMLAGSTVIVGTR along the top and EDASTILCGS down the side, filled using an Amino Acid Exchange Matrix with negative numbers and gap penalties (open, extension)]
Resulting local alignment:
AGSTVIVG
A-STILCG
Local dynamic programming
(Smith & Waterman, 1981)
$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_i - (i-x-1) P_x \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_i - (j-y-1) P_x \} \\ 0 \end{cases}$$
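The effect of the extra 0 in the local recurrence can be illustrated with a short Smith-Waterman sketch. As a simplification it uses a linear gap penalty and match/mismatch scores (assumed values) rather than an exchange matrix with negative numbers and affine gap penalties; it returns only the best local score, not the aligned segments.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (linear gap penalty)."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere, which makes the result local.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman("LCFVMLAGSTVIVGTR", "EDASTILCGS"))
```

Because scores never drop below zero, unrelated flanking regions do not penalise a well-matching core segment.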
Local dynamic programming
Sequence database searching –
Homology searching
DP too slow for repeated database searches
Fast heuristics:
• FASTA
• BLAST and PSI-BLAST
• QUEST
Hidden Markov modelling:
• HMMER
• SAM-T98
FASTA
• Compares a given query sequence with a library of
sequences and calculates for each pair the highest
scoring local alignment
• Speed is obtained by delaying application of the
dynamic programming technique to the moment
where the most similar segments are already
identified by faster and less sensitive techniques
• The FASTA routine operates in four steps:
1. Rapid searches for identical words of a user specified length
occurring in query and database sequence(s) (Wilbur and
Lipman, 1983, 1984). For each target sequence the 10 regions
with the highest density of ungapped common words are
determined.
2. These 10 regions are rescored using Dayhoff PAM-250 residue
exchange matrix (Dayhoff et al., 1983) and the best scoring
region of the 10 is reported under init1 in the FASTA output.
3. Regions scoring higher than a threshold value and being
sufficiently near each other in the sequence are joined, now
allowing gaps. The highest score of these new fragments can be
found under initn in the FASTA output.
4. Full dynamic programming alignment (Chao et al., 1992) is performed over the
final region, which is widened by 32 residues at either side; its
score is written under opt in the FASTA output.
FASTA output example
DE METAL RESISTANCE PROTEIN YCF1 (YEAST CADMIUM FACTOR 1). . . .
SCORES Init1: 161 Initn: 161 Opt: 162 z-score: 229.5 E(): 3.4e-06
Smith-Waterman score: 162; 35.1% identity in 57 aa overlap
test.seq    (ruler: 10 20 30)
            MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLE
            :| :|::| |:::||:|||::|: |
YCFI_YEAST  CASILLLEALPKKPLMPHQHIHQTLTRRKPNPYDSANIFSRITFSWMSGLMKTGYEKYLV
            (ruler: 180 190 200 210 220 230)

test.seq    (ruler: 40 50 60)
            LSDIYQIPSVDSADNLSEKLEREWDRE
            :|:|::| |:::||:|||::|: |
YCFI_YEAST  EADLYKLPRNFSSEELSQKLEKNWENELKQKSNPSLSWAICRTFGSKMLLAAFFKAIHDV
            (ruler: 240 250 260 270 280 290)
FASTA
(1) Rapid identical word searches:
• Searching for k-tuples of a certain size within a
specified bandwidth along search matrix diagonals.
• For not-too-distant sequences (> 35% residue
identity), little sensitivity is lost while speed is greatly
increased.
• Technique employed is known as hash coding or
hashing: a lookup table is constructed for all words in
the query sequence, which is then used to compare all
encountered words in each database sequence.
FASTA
• The k-tuple length is user-defined and is usually 1 or
2 for protein sequences (i.e. either the positions of
each of the individual 20 amino acids or the positions
of each of the 400 possible dipeptides are located).
• For nucleic acid sequences, the k-tuple is 5-20, and
should be longer because short k-tuples are much
more common due to the 4 letter alphabet of nucleic
acids. The larger the k-tuple chosen, the more rapid
but less thorough, a database search.
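The hashing step can be sketched as follows: a lookup table maps every k-tuple of the query to its positions, and each k-tuple of a database sequence is then looked up to count word hits per search-matrix diagonal. This is an illustration of the idea only, not the FASTA implementation; the function names and the use of a diagonal index (target position minus query position) are assumptions.

```python
from collections import Counter, defaultdict


def build_lookup(query, k):
    """Map each k-tuple in the query to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(query, target, k=2):
    """Count identical k-tuple hits per diagonal (target position - query position)."""
    table = build_lookup(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits[j - i] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("ABCDEF", "XXABCDEF", k=2)
    print(hits.most_common(1))  # [(2, 5)]
```

Diagonals with many hits are the candidate regions that the later, slower steps rescore and join.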
BLAST
• blastp compares an amino acid query sequence
against a protein sequence database
• blastn compares a nucleotide query sequence
against a nucleotide sequence database
• blastx compares the six-frame conceptual protein
translation products of a nucleotide query
sequence against a protein sequence database
• tblastn compares a protein query sequence against
a nucleotide sequence database translated in six
reading frames
• tblastx compares the six-frame translations of a
nucleotide query sequence against the six-frame
translations of a nucleotide sequence database.
BLAST
• Generates all tripeptides from the query sequence
and, for each of them, derives a table of
similar tripeptides; their number is only a fraction of
the total number possible.
• Quickly scans a database of protein sequences for
ungapped regions showing high similarity, which
are called high-scoring segment pairs (HSP),
using the tables of similar peptides. The initial
search is done for a word of length W that scores
at least the threshold value T when compared to
the query using a substitution matrix.
• Word hits are then extended in either direction in
an attempt to generate an alignment with a score
exceeding the threshold of S, and as far as the
cumulative alignment score can be increased.
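The table of similar words can be illustrated by enumerating all words of length W and keeping those that score at least T against a query word. The scoring function below is a toy assumption (4 for identity, -1 otherwise) standing in for a real substitution matrix, and the name `neighbourhood` is hypothetical.

```python
from itertools import product

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(x, y):
    # Assumption for illustration; BLAST itself uses a substitution matrix such as BLOSUM62.
    return 4 if x == y else -1


def neighbourhood(word, threshold, score=toy_score):
    """All words of the same length scoring >= threshold against the query word."""
    return ["".join(candidate)
            for candidate in product(AMINO, repeat=len(word))
            if sum(score(x, y) for x, y in zip(word, candidate)) >= threshold]


if __name__ == "__main__":
    similar = neighbourhood("PEP", threshold=7)
    print(len(similar), "of", len(AMINO) ** 3, "possible tripeptides")  # 58 of 8000
```

With these toy values only the word itself and its single-substitution variants pass the threshold, which shows why the table stays a small fraction of all possible words.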
BLAST
Extension of the word hits in each direction is halted
• when the cumulative alignment score falls off by the
quantity X from its maximum achieved value
• when the cumulative score goes to zero or below due to the
accumulation of one or more negative-scoring residue
alignments
• upon reaching the end of either sequence
• The T parameter is the most important for the speed and
sensitivity of the search resulting in the high-scoring
segment pairs
• A Maximal-scoring Segment Pair (MSP) is defined as
the highest scoring of all possible segment pairs
produced from two sequences.
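The halting rules for word-hit extension can be sketched for one direction. This is a simplified illustration: the extension starts with a score of zero at the given positions rather than from an actual word hit, only the rightward direction is shown, and the scoring function is a toy assumption.

```python
def extend_right(a, b, i, j, score, x_drop):
    """Ungapped extension to the right from a[i], b[j].

    Stops when the running score drops x_drop below its maximum, falls to zero
    or below, or when either sequence ends. Returns (best score, best length).
    """
    running = best = 0
    length = best_length = 0
    while i + length < len(a) and j + length < len(b):
        running += score(a[i + length], b[j + length])
        length += 1
        if running > best:
            best, best_length = running, length
        elif best - running >= x_drop or running <= 0:
            break
    return best, best_length


if __name__ == "__main__":
    identity = lambda x, y: 1 if x == y else -1
    print(extend_right("AAAAXXXX", "AAAAYYYY", 0, 0, identity, x_drop=2))  # (4, 4)
```

A larger X lets the extension ride through short runs of mismatches at the cost of more work per hit.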
PSI-BLAST
• Query sequences are first scanned for the presence of
so-called low-complexity regions (Wootton and
Federhen, 1996), i.e. regions with a biased composition
likely to lead to spurious hits; these are excluded from
alignment.
• The program then initially operates on a single query
sequence by performing a gapped BLAST search
• Then, the program takes significant local alignments
found, constructs a multiple alignment and abstracts a
position specific scoring matrix (PSSM) from this
alignment.
• The database is then rescanned in a subsequent round to find more
homologous sequences. Iteration continues until the user
decides to stop or the search has converged.
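How a position specific scoring matrix can be abstracted from a multiple alignment is sketched below. The weighting scheme is an assumption for illustration (simple residue counts with a uniform pseudocount and a uniform background, scored as log-odds in bits); PSI-BLAST itself uses more elaborate sequence weighting and pseudocounts.

```python
import math
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """One dict of log-odds scores (residue -> score) per alignment column."""
    background = 1.0 / len(AMINO)  # assumed uniform background frequency
    pssm = []
    for column in zip(*alignment):
        counts = Counter(r for r in column if r in AMINO)  # gap characters are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background)
                     for aa in AMINO})
    return pssm


def score_against_pssm(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


if __name__ == "__main__":
    pssm = build_pssm(["ACD", "ACE", "A-D"])
    print(round(score_against_pssm(pssm, "ACD"), 2))
    print(round(score_against_pssm(pssm, "WWW"), 2))
```

Residues seen often in a column score positively at that position and unseen residues score negatively, which is what makes the next search round position specific.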
PSI-BLAST iteration
[Figure: PSI-BLAST iteration. Query sequence Q → gapped BLAST search → database hits → PSSM (rows A, C, D, ..., Y, with gap penalties Pi and Px) → gapped BLAST search with the PSSM → database hits]
PSI-BLAST output example
Multiple alignment profiles
Gribskov et al. 1987
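[Figure: profile with one column per alignment position i, rows for the amino acids A, C, D, ..., W, Y and a row of gap penalties (example values 1.0, 0.3, 0.1, 0, ..., 0.3, 0.3, 0.5)]
Position dependent gap penalties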
Normalised sequence similarity
The p-value is defined as the probability of seeing at
least one unrelated score S greater than or equal to a
given score x in a database search over n sequences.
This probability follows the Poisson distribution
(Waterman and Vingron, 1994):
$P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$,
where n is the number of sequences in the database
Depending on x and n (fixed)
Normalised sequence similarity
Statistical significance
The E-value is defined as the expected number of non-homologous sequences with score greater than or equal
to a score x in a database of n sequences:
$E(x, n) = n \cdot P(S \ge x)$
If the E-value = 0.01, then the expected number of random
hits with score S ≥ x is 0.01, which means that this E-value is expected by chance only once in 100
independent searches over the database.
If the E-value of a hit is 5, then five fortuitous hits with S
≥ x are expected within a single database search, which
renders the hit not significant.
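The relation between the two statistics follows directly from the definitions: E = n·P(S ≥ x) and P = 1 - e^(-E). The sketch below assumes the per-sequence probability P(S ≥ x) is already known; the database size and probabilities used in the example are made-up values.

```python
import math


def e_value(n, p_single):
    """Expected number of unrelated sequences scoring >= x among n sequences."""
    return n * p_single


def p_value(n, p_single):
    """Probability of at least one unrelated sequence scoring >= x (Poisson)."""
    return 1.0 - math.exp(-e_value(n, p_single))


if __name__ == "__main__":
    n = 100_000  # hypothetical database size
    for p_single in (1e-7, 1e-5, 5e-5):
        print(f"E = {e_value(n, p_single):.3g}  P = {p_value(n, p_single):.3g}")
```

For small values the two numbers are nearly identical; for E = 5 the p-value is already above 0.99, which is another way of saying such a hit is not significant.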
Normalised sequence similarity
Statistical significance
• Database searching is commonly performed
using an E-value in between 0.1 and 0.001.
• Low E-values decrease the number of false
positives in a database search, but increase
the number of false negatives, thereby
lowering the sensitivity of the search.
HMM-based homology searching
• Most widely used HMM-based profile searching
tools currently are SAM-T98 (Karplus et al.,
1998) and HMMER2 (Eddy, 1998)
• formal probabilistic basis and consistent theory
behind gap and insertion scores
• HMMs good for profile searches, bad for
alignment
• HMMs are slow
The HMM algorithms
Forward:
$\alpha_t(i) = P(\text{observed sequence, ending in state } i \text{ at base } t)$
Backward:
$\beta_t(i) = P(\text{obs. after } t \mid \text{ending in state } i \text{ at base } t)$
Viterbi:
$\delta_t(i) = \max P(\text{obs., ending in state } i \text{ at base } t)$
Questions:
1. What is the most likely die (predicted) sequence? Viterbi
2. What is the probability of the observed sequence? Forward
3. What is the probability that the 3rd state is B, given the
observed sequence? Backward
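The first two questions can be answered with a small sketch of the Forward and Viterbi algorithms. It assumes a two-state "fair/loaded die" model, which the wording of question 1 suggests; all probabilities are made-up values for illustration, and the Backward algorithm is not shown.

```python
STATES = ("F", "L")  # F = fair die, L = loaded die (hypothetical model)
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {"F": {r: 1 / 6 for r in "123456"},
        "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}


def forward(obs):
    """Probability of the observed sequence, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: EMIT[s][o] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())


def viterbi(obs):
    """Most likely state path for the observed sequence."""
    delta = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    back_pointers = []
    for o in obs[1:]:
        previous, delta, pointers = delta, {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: previous[r] * TRANS[r][s])
            pointers[s] = best_prev
            delta[s] = previous[best_prev] * TRANS[best_prev][s] * EMIT[s][o]
        back_pointers.append(pointers)
    path = [max(STATES, key=delta.get)]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    rolls = "31245666664666153"
    print(viterbi(rolls))
    print(forward(rolls))
```

For long sequences the products underflow, so practical implementations work with logarithms or scaled probabilities.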
HMM-based homology searching
Transition probabilities and Emission probabilities
Gapped HMMs also have insertion and deletion
states
Profile HMM: m=match state, I=insert state, d=delete state; go from
left to right. I and m states output amino acids; d states are “silent”.
[Figure: profile HMM running from Start (m0) to End (m5), with match states m1-m4, insert states I0-I4 and delete states d1-d4]
Homology-derived Secondary Structure of Proteins
(HSSP)
Sander & Schneider, 1991
Up to here 17/02/03
Bio-Data Analysis and Data Mining
• Existing/emerging bio-data analysis and mining tools for
– DNA sequence assembly
– Genetic map construction
– Sequence comparison and database searching
– Gene finding
– ….
– Gene expression data analysis
– Phylogenetic tree analysis to infer horizontally-transferred genes
– Mass spec. data analysis for protein complex characterization
– ……
• Current mode of work:
Often enough: developing ad hoc tools
for each individual application
Bio-Data Analysis and Data Mining
• As the amount and types of data and their
cross connections increase rapidly
• the number of analysis tools needed will go up
“exponentially”
– blast, blastp, blastx, blastn, … from BLAST family
of tools
– gene finding tools for human, mouse, fly, rice,
cyanobacteria, …..
– tools for finding various signals in genomic
sequences, protein-binding sites, splice junction
sites, translation start sites, …..
Bio-Data Analysis and Data Mining
Many of these data analysis problems are
fundamentally the same problem(s) and can
be solved using the same set of tools: e.g.
clustering or optimal segmentation by
Dynamic Programming
Developing ad hoc tools for each application
(by each group of individual researchers)
may soon become inadequate as bio-data
production capabilities further ramp up
Bio-data Analysis, Data
Mining and Integrative
Bioinformatics
To have analysis capabilities covering wide range
of problems, we need to discover the common
fundamental structures of these problems;
HOWEVER in biology one size does NOT fit all…
Goal is development of a data analysis
infrastructure in support of Genomics and
beyond
Algorithms in bioinformatics
• string algorithms
• dynamic programming
• machine learning (NN, k-NN, SVM, GA, ..)
• Markov chain models
• hidden Markov models
• Markov Chain Monte Carlo (MCMC) algorithms
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…
• Monte Carlo
Application areas: sequence analysis and homology searching; finding genes and regulatory elements; expression data; functional genomics; protein translation
Example of algorithm reuse: Data
clustering
• Many biological data analysis problems can be
formulated as clustering problems
– microarray gene expression data analysis
– identification of regulatory binding sites (similarly, splice
junction sites, translation start sites, ......)
– (yeast) two-hybrid data analysis (for inference of protein
complexes)
– phylogenetic tree clustering (for inference of horizontally
transferred genes)
– protein domain identification
– identification of structural motifs
– prediction reliability assessment of protein structures
– NMR peak assignments
– ......
Data Clustering Problems
• Clustering: partition a data set into clusters so that data
points of the same cluster are “similar” and points of
different clusters are “dissimilar”
• cluster identification -- identifying clusters with
significantly different features than the background
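A minimal sketch of the partitioning idea: data points are merged into the same cluster whenever some pair of their members lies within a distance threshold (single linkage). The algorithm choice, the threshold and the function name are assumptions made for illustration; the lecture does not prescribe a particular clustering method here.

```python
def single_linkage(points, dist, threshold):
    """Merge clusters while any two of them contain points within the threshold."""
    clusters = [[p] for p in points]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if any(dist(p, q) <= threshold
                       for p in clusters[i] for q in clusters[j]):
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return clusters


if __name__ == "__main__":
    values = [1, 2, 10, 11, 3]
    print(single_linkage(values, lambda p, q: abs(p - q), threshold=1.5))
    # [[1, 2, 3], [10, 11]]
```

The same function works for any data type once a suitable distance is supplied, for example a distance between expression profiles or between aligned sequences.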
Application Examples
• Regulatory binding site identification: CRP (CAP) binding site
• Two hybrid data analysis
• Gene expression data analysis
Are all solvable by the same algorithm!
Other Application Examples
• Phylogenetic tree clustering analysis
• Protein sidechain packing prediction
• Assessment of prediction reliability of protein
structures
• Protein secondary structures
• Protein domain prediction
• NMR peak assignments
• ……
Integrative bioinformatics @ VU
Studying informational processes at biological system
level
• From gene sequence to intercellular processes
• Computers necessary
• We have biology, statistics, computational intelligence (AI),
HTC, ..
• VUMC: microarray facility
• Enabling technology: new glue to integrate
• New integrative algorithms
• Goals: understanding cells in terms of genomes,
fighting disease (VUMC)
Bioinformatics @ VU
Progression:
• DNA: gene prediction, predicting regulatory
elements
• mRNA expression
• Proteins: docking, domain prediction
• Metabolic pathways: metabolic control
• Cell-cell communication
Protein structure and function can be complex…
Pyruvate kinase
Phosphotransferase
β barrel regulatory domain
α/β barrel catalytic substrate binding domain
α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Bioinformatics @ VU
Qualitative challenges:
• High quality alignments (alternative splicing)
• In-silico structural genomics
• In-silico functional genomics: reliable annotation
• Protein-protein interactions.
• Metabolic pathways: assign the edges in the
networks
• Cell-cell communication: find membrane
associated components
• New algorithms
Bioinformatics @ VU
Quantitative challenges:
• Understanding mRNA expression levels
• Understanding resulting protein activity
• Time dependencies
• Spatial constraints, compartmentalisation
• Are classical differential equation models adequate or do
we need more individual modeling (e.g macromolecular
crowding and activity at oligomolecular level)?
• Metabolic pathways: calculate fluxes through time
• Cell-cell communication: tissues, hormones, innervations
Need ‘complete’ experimental data for good
biological model system to learn to integrate
Bioinformatics @ VU
VUMC
• Neuropeptide – addiction
• Oncogenes – disease patterns
• Rheumatic disease
CNCR
• From synapses to higher order behaviour
• Addiction
FPP
• Genetic psychology – twin data bank
Integrative bioinformatics
• Integrate data sources
• Integrate methods
• Integrate data through method
integration (biological model)
Bioinformatics tool
[Figure: data → algorithm (tool) → biological interpretation (model)]
Bioinformatics
“Nothing in Biology makes sense except in
the light of evolution” (Theodosius
Dobzhansky (1900-1975))
“Nothing in Bioinformatics makes sense
except in the light of Biology”
Pair-wise sequence alignment
(more than just string matching)
Global dynamic programming
[Figure: search matrix with MDAGSTVILCFVG along the top and MDAASTILCGS down the side, filled using an Amino Acid Exchange Matrix (evolution) and gap penalties (open, extension)]
Resulting alignment:
MDAGSTVILCFVG
MDAAST-ILC--GS
Pair-wise alignment search explosions
T D W V T A L K
T D W L - - I K
Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \cong \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: ~10^179 alignments
2 sequences of 1000 a.a.: ~10^600 alignments!
Global dynamic programming
This talk – own kitchen
Three integrative methods to predict protein structural
aspects:
• Iterative multiple alignment + protein secondary
structure (Praline)
Intermezzo: 2½-D structure prediction of
flavodoxin fold by hand
• Protein domain delineation based on consistency of
multiple ab initio model tertiary structures
(SnapDRAGON)
• Protein domain delineation based on combining
homology searching with domain prediction
(Domaination)
Comparing sequences - Similarity Score
Many properties can be used:
• Nucleotide or amino acid composition
• Isoelectric point
• Molecular weight
• Morphological characters
Multivariate statistics – Cluster analysis
[Figure: raw table (objects 1-5 × characters C1, C2, C3, C4, C5, C6, ..) → similarity criterion → scores → 5×5 similarity matrix → cluster criterion → phylogenetic tree]
Human Evolution
Comparing sequences - Similarity Score
Many properties can be used:
• Nucleotide or amino acid composition
• Isoelectric point
• Molecular weight
• Morphological characters
• But: molecular evolution through sequence
alignment
Multivariate statistics – Cluster analysis
[Figure: multiple alignment (sequences 1-5) → similarity criterion → scores → 5×5 similarity matrix → phylogenetic tree]
Lactate dehydrogenase multiple alignment
Human
Chicken
Dogfish
Lamprey
Barley
Maizey
Lacto_casei
Bacillus_stea
Lacto_plant
Therma_mari
Bifido
Thermus_aqua
Mycoplasma
-KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
-KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
-KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
-KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
-RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
-KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
-KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ
Distance Matrix
                      1     2     3     4     5     6     7     8     9    10    11    12    13
 1 Human          0.000 0.112 0.128 0.202 0.378 0.346 0.530 0.551 0.512 0.524 0.528 0.635 0.637
 2 Chicken        0.112 0.000 0.155 0.214 0.382 0.348 0.538 0.569 0.516 0.524 0.524 0.631 0.651
 3 Dogfish        0.128 0.155 0.000 0.196 0.389 0.337 0.522 0.567 0.516 0.512 0.524 0.600 0.655
 4 Lamprey        0.202 0.214 0.196 0.000 0.426 0.356 0.553 0.589 0.544 0.503 0.544 0.616 0.669
 5 Barley         0.378 0.382 0.389 0.426 0.000 0.171 0.536 0.565 0.526 0.547 0.516 0.629 0.575
 6 Maizey         0.346 0.348 0.337 0.356 0.171 0.000 0.557 0.563 0.538 0.555 0.518 0.643 0.587
 7 Lacto_casei    0.530 0.538 0.522 0.553 0.536 0.557 0.000 0.518 0.208 0.445 0.561 0.526 0.501
 8 Bacillus_stea  0.551 0.569 0.567 0.589 0.565 0.563 0.518 0.000 0.477 0.536 0.536 0.598 0.495
 9 Lacto_plant    0.512 0.516 0.516 0.544 0.526 0.538 0.208 0.477 0.000 0.433 0.489 0.563 0.485
10 Therma_mari    0.524 0.524 0.512 0.503 0.547 0.555 0.445 0.536 0.433 0.000 0.532 0.405 0.598
11 Bifido         0.528 0.524 0.524 0.544 0.516 0.518 0.561 0.536 0.489 0.532 0.000 0.604 0.614
12 Thermus_aqua   0.635 0.631 0.600 0.616 0.629 0.643 0.526 0.598 0.563 0.405 0.604 0.000 0.641
13 Mycoplasma     0.637 0.651 0.655 0.669 0.575 0.587 0.501 0.495 0.485 0.598 0.614 0.641 0.000
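A distance matrix of this kind can be derived from the rows of a multiple alignment. The sketch below assumes the simplest possible measure, the fraction of differing residues over the columns where neither sequence has a gap; the slides do not state which distance measure was actually used for the lactate dehydrogenase matrix, so the numbers it produces need not match the table.

```python
def p_distance(row_a, row_b):
    """Fraction of mismatching residues over columns without gaps."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(rows):
    """Symmetric matrix of pairwise distances between aligned sequences."""
    return [[p_distance(a, b) for b in rows] for a in rows]


if __name__ == "__main__":
    toy_alignment = ["ACDEFG", "ACDEYG", "A-DKYG"]  # hypothetical aligned sequences
    for row in distance_matrix(toy_alignment):
        print(" ".join(f"{d:.3f}" for d in row))
```

Such a matrix is the usual input for clustering methods like neighbour joining when building a phylogenetic tree.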
Multiple sequence alignment
Why?
• It is the most important means to assess
relatedness of a set of sequences
• Gain information about the structure/function of a
query sequence (conservation patterns)
• Construct a phylogenetic tree
• Putting together a set of sequenced fragments
(Fragment assembly)
• Comparing a segment sequenced by two different
labs
• Many bioinformatics methods depend on it (e.g.
secondary/tertiary structure prediction)
Flavodoxin fold: aligning 13 Flavodoxins + cheY
5(βα) fold
Flavodoxin-cheY multiple alignment
Praline with pre-processing
1fx1        -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE  MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH  MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA  MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI  MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr        --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI  -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG  MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP  SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI  -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn        -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL  MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB  -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM
1fx1
FLAV_DESDE
FLAV_DESVH
FLAV_DESSA
FLAV_DESGI
2fcr
FLAV_AZOVI
FLAV_ENTAG
FLAV_ANASP
FLAV_ECOLI
4fxn
FLAV_MEGEL
FLAV_CLOAB
3chy
GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI-------ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL-------GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI-------GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI-------GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV-------GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV-----GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L-----GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL-----GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI--------G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA--------STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF----------VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
Iteration 0
T
G
SP= 136944.00
AvSP= 10.675
SId= 4009
AvSId= 0.313
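The SP and SId figures quoted with the alignment can be illustrated with a small sketch. It assumes that SP is a sum-of-pairs score (every pair of aligned residues in every column is scored with an exchange matrix), that SId counts the identical pairs, and that the Av values are averages per scored pair. The scoring function `toy_score` and the choice to skip pairs that contain a gap are illustrative assumptions, not Praline's actual scheme.

```python
from itertools import combinations


def toy_score(a, b):
    """Hypothetical exchange score: +2 for an identity, -1 otherwise."""
    return 2 if a == b else -1


def sum_of_pairs(alignment, score=toy_score, gap="-"):
    """Return (SP, AvSP, SId, AvSId) for a list of equal-length aligned rows."""
    sp = sid = pairs = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == gap or b == gap:
                continue  # assumption: pairs involving a gap are not scored
            a, b = a.upper(), b.upper()
            sp += score(a, b)
            sid += int(a == b)
            pairs += 1
    if pairs == 0:
        return 0, 0.0, 0, 0.0
    return sp, sp / pairs, sid, sid / pairs


rows = ["MSKVLIV", "MPKALIV", "-AKIGLF"]
print(sum_of_pairs(rows))
```

With a real exchange matrix in place of `toy_score`, the same loop gives one number per alignment, which makes it easy to compare iterations.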
Flavodoxin-cheY NJ tree
Integrating secondary structure
prediction in multiple alignment
Victor Simossis
Praline multiple alignment method
(Heringa, Comp. Chem. 23, 341-364;1999, Comp. Chem., 26, 459-477;2002;
Kleinjung, Douglas & Heringa, Bioinformatics, in press;2002)
• Combining sequence data and secondary
structure prediction (Heringa, Curr. Prot. Pept. Sci., 1 (3),
273-301;2000)
• Secondary structure methods: PHD, PREDATOR,
PSIPRED, Jpred, SSPRED, ...
Using secondary structure in
multiple alignment
“Structure more conserved than
sequence”
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)
SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE (oligomers)
TERTIARY STRUCTURE (fold)
Secondary structure-induced
alignment
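Using secondary structure in
multiple alignment
[Figure: dynamic programming search matrix for two sequences, e.g. MDAGSTVILCFV with predicted secondary structure HHHCCCEEEEEE against MDAASTILCGS. Each cell is scored with an amino acid exchange weights matrix chosen by the predicted secondary structure of the two residues: a helix (H), strand (E) or coil (C) matrix where the predictions agree, and the default matrix otherwise.]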
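A minimal sketch of the idea behind the search matrix: the dynamic programming recurrence stays the same, but the exchange weight for each cell depends on the predicted secondary structure of the two residues. The match/mismatch weights in `exchange`, the linear gap penalty and the function names are hypothetical stand-ins; a real method would look up full 20x20 matrices per secondary structure class.

```python
def exchange(a, b, ss_a, ss_b):
    """Toy exchange weight chosen by secondary structure state."""
    # Hypothetical (match, mismatch) weights per class; not real matrices.
    weights = {"H": (4, -2), "E": (5, -3), "C": (3, -1)}
    default = (2, -1)
    match, mismatch = weights[ss_a] if ss_a == ss_b else default
    return match if a == b else mismatch


def align_score(seq1, ss1, seq2, ss2, gap=-2):
    """Global (Needleman-Wunsch) alignment score with a linear gap penalty."""
    n, m = len(seq1), len(seq2)
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m
        for j in range(1, m + 1):
            s = exchange(seq1[i - 1], seq2[j - 1], ss1[i - 1], ss2[j - 1])
            cur[j] = max(prev[j - 1] + s,   # align the two residues
                         prev[j] + gap,     # gap in seq2
                         cur[j - 1] + gap)  # gap in seq1
        prev = cur
    return prev[m]


# Same residues, different predicted states: the score changes with the matrix used.
print(align_score("MDA", "HHH", "MDA", "HHH"))
print(align_score("MDA", "HHH", "MDA", "CCC"))
```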
Flavodoxin-cheY predicted secondary structure
(PREDATOR)
1fx1
FLAV_DESVH
FLAV_DESGI
FLAV_DESSA
FLAV_DESDE
2fcr
FLAV_ANASP
FLAV_ECOLI
FLAV_AZOVI
FLAV_ENTAG
4fxn
FLAV_MEGEL
FLAV_CLOAB
3chy
1fx1
FLAV_DESVH
FLAV_DESGI
FLAV_DESSA
FLAV_DESDE
2fcr
FLAV_ANASP
FLAV_ECOLI
FLAV_AZOVI
FLAV_ENTAG
4fxn
FLAV_MEGEL
FLAV_CLOAB
3chy
-PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
e eeee b ssshhhhhhhhhhhhhhttt eeeee stt
tttttt seeee b ee sss
ee ttthhhhtt ttss tt eeeee
MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
e eeeeee
hhhhhhhhhhhhhhh
eeeeee
eeeeee
hhhhhh
eeeee
MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
e eeeeee
hhhhhhhhhhhhhh
eeeeee
hhhhhh eeeeeee
hhhhhh
eeeeee
MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
eeeeee
hhhhhhhhhhhhhh
eeeee
eeeee
hhhhhhh h
eeeee
MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
eeee
hhhhhhhhhhhhhh
eeeee
hhhhhhhhhhheeeee
hhhhhhh hh
eeeee
--K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
eeeee ssshhhhhhhhhhhhhggg
b
eeggg s gggggg seeeeeee stt s
s s sthhhhhhhtggg
tt eeeee
SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
eeeee
hhhhhhhhhhhh
eee
hhh hhhhhhheeeeee
hhhhhhhhh
eeeeee
-AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
eee
hhhhhhhhhhhh
eee
hhh hhhhhhheeeee
hhhhh
eeeeee
-AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
eee
hhhhhhhhhhhhh
hhh hhhhhhheeeee
hhhhhhhhh
eeeeee
MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
eeee
hhhhhhhhhhhh
hhh hhhhhhheeeee
hhhhh
eeeee
----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
eeeee ssshhhhhhhhhhhhhhhtt
eeeettt sttttt seeeeee btttb
ttthhhhhhh hst t tt eeeee
M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
hhhhhhhhhhhhhh
eeeee
hhhhhhhh eeeee
eeeee
M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
eee
hhhhhhhhhhhhhh
eeeeee
hhhhhhhhhh eeee
hhhhhhhhh
eeeee
ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV
tt eeee s hhhhhhhhhhhhhht
eeeesshh hhhhhhhh
eeeee
s sss
hhhhhhhhhh ttttt eeee
GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI-------eee s ss sstthhhhhhhhhhhttt ee s
eeees
gggghhhhhhhhhhhhhh
GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI-------eee
hhhhhhhhhhhh
eeeee
eeeee
hhhhhhhhhhhhhh
GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV-------eee
hhhhhhhhhhhh
eeeee
hhhhhhhhhhh
GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI-------hhhhhhhhhhhh
eeeee
e
eee
ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL-------e
hhhhhhhhhhhhhh
eeeee
ee
hhhhhhhhhhh
GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV-----eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht
GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL-----hhhhhhhhhhhhhh
eeee
hhhhhhhhhhhhhhhh
GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
hhhhhhhhhhhhhh
eeee
hhhhhhhhhhhhhhhhhh
GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-e
hhhhhhhhhhhhhh
eeeee
hhhhhhhhhhh
GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L-----hhhhhhhhhhhhhhh
eeee
hhhhhhh
hhhhhhhhhhhh
G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI--------e
eesss shhhhhhhhhhhhtt ee s
eeees
ggghhhhhhhhhhhht
G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA--------hhhhhhhhhhh
eeeee
eeee
h hhhhhhhh
STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF-hhhhhhhhhhhhhh eeeee
hhhh hhh
hhhhhhhhhhhh h
-----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM-----ess hhhhhhhhhtt see
ees
s
hhhhhhhhhhhhhhht
G
Enough to predict the 5(βα) topology
Secondary structure-induced
alignment
Flavodoxin-cheY multiple alignment/
secondary structure iteration
cheY SSEs
3chy-AA SEQUENCE||
3chy-ITERATION-0||
3chy-ITERATION-1||
3chy-ITERATION-2||
3chy-ITERATION-3||
3chy-ITERATION-4||
3chy-ITERATION-5||
3chy-ITERATION-6||
3chy-ITERATION-7||
3chy-ITERATION-8||
3chy-ITERATION-9||
AA
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
|ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP|
| EEEEEEE
HHHHHHHHHHHHHHHHH
E HHHHHHHHHH HHHEEE
|
| EEEEEEEE
HHHHHHHHHHHHHHH
HHHHHHHH
EEEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
HHHHHHHHH EEEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
EEE
HHHHHH
EEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
HHHHHHH
EEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
EEE
HHHHHH
EEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
HHHHHHHH EEEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
EEE
HHHHHH
EEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
HHHHHHH
EEEEEE
|
| EEEEEEEE
HHHHHHHHHHHHHH
HHHHHHHHHH
EEEEE
|
3chy-AA SEQUENCE||
3chy-ITERATION-0||
3chy-ITERATION-1||
3chy-ITERATION-2||
3chy-ITERATION-3||
3chy-ITERATION-4||
3chy-ITERATION-5||
3chy-ITERATION-6||
3chy-ITERATION-7||
3chy-ITERATION-8||
3chy-ITERATION-9||
AA
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
|NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM|
|
HHHHHHEEEEEE
HHHHHHHHHHHHHHHHH
HHHHHHHHHHHHHH
|
|
HHHHHHEEEEEE
HHH HHHHHHHHHHHHHHHHHH
EEE
HHHHHHHHHHHHHH
|
|
HHHHHHEEEEEE
HHHHHHHHHHHHHHHHHH
EEE
HHHHHHHHHHHHHH
|
| HHHHHHHHHHHH
HHHHHHHHHHHHHHHHHH
EEE
HHHHHHHHHHHHHH
|
|
HHHHH
EEEEE HHHHHHHHHHHHHHHHH
EEE
HHHHHHHHHHHHHH
|
|
HHHHHHHH
EEEEE
HHHHHHHHHHHHHHHH
EEE
HHHHHHHHHHHHHH
|
|
HHHHHHHH
EEEEE
HHHHHHHHHHHHHHHH
EEEE
HHHHHHHHHHHHHH
|
|
HHHHHHHH
EEEEEE
HHHHHHHHHHHHHHHH
EEE
HHHHHHHHHHHHHH
|
|
HHHHHHHH
EEEEE
HHHHHHHHHHHHHHHH
EEE
HHHHHHHHHHHHHH
|
|
HHHHHHHH
EEEEE
HHHHHHHHHHHHHHH
EEEE
HHHHHHHHHHHHHH
|
4fxn-AA SEQUENCE||
4fxn-ITERATION-0||
4fxn-ITERATION-1||
4fxn-ITERATION-2||
4fxn-ITERATION-3||
4fxn-ITERATION-4||
4fxn-ITERATION-5||
4fxn-ITERATION-6||
4fxn-ITERATION-7||
4fxn-ITERATION-8||
4fxn-ITERATION-9||
AA
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
|MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEV|
| EEEEE
HHHHHHHHHHHHHHH
EEE
EEEEE
|
| EEEEE
HHHHHHHHHHHHHHH
EEEE
EEEEE
|
| EEEEE
HHHHHHHHHHHHHHH
EEEE
EEEEE
|
| EEEEE
HHHHHHHHHHHHHHH
E
EEEEE
|
| EEEEEE
HHHHHHHHHHHHHHH
EEEE
EEEEE
|
| EEEEEE
HHHHHHHHHHHHHHH
EE
EEEEE
|
| EEEEEE
HHHHHHHHHHHHHHH
EEEE
EEEEE
|
| EEEEEE
HHHHHHHHHHHHHHH
EE
EEEEE
|
| EEEEEE
HHHHHHHHHHHHHHH
EEE
EEEEE
|
| EEEEE
HHHHHHHHHHHHHHH
EEE
EEEEE
|
4fxn-AA SEQUENCE||
4fxn-ITERATION-0||
4fxn-ITERATION-1||
4fxn-ITERATION-2||
4fxn-ITERATION-3||
4fxn-ITERATION-4||
4fxn-ITERATION-5||
4fxn-ITERATION-6||
4fxn-ITERATION-7||
4fxn-ITERATION-8||
4fxn-ITERATION-9||
AA
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
|LEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE|
|
EEEEE
HHHHHHHHHHHHHHHHH EEE
EEE
|
|
HHHH
EEEEE
HHHHHHHHHHHHHHH
EEE
EE
|
|
HHHHHHHHHHHH
EEEEEE
HHHHHHHHHHHHHHH
EEE
EE
|
|
HHHHHHHHHHHH
EEEEE
HHHHHHHHHHHHHHH
EEE
EE
|
|
HHHHHHHHHHHH
EEEEE
HHHHHHHHHHHHHHHHH
EEE
E
|
|
HHHHHHHHHHHH
EEEEE
HHHHHHHHHHHHHHHHH
EEE
E
|
|
HHHHHHHHHHHH
EEEEEE
HHHHHHHHHHHHHHHH
EEE
E
|
|
HHHHHHHHHHHH
EEEEE
HHHHHHHHHHHHHHHHH
EEE
E
|
|
HHHHHHHHHHHH
EEEEE
HHHHHHHHHHHHHHHHH
EEE
E
|
|
HHHHHHHHHHHH
EEEEEE
HHHHHHHHHHHHHHHH
EEE
E
|
4fxn-AA SEQUENCE||
4fxn-ITERATION-0||
4fxn-ITERATION-1||
4fxn-ITERATION-2||
4fxn-ITERATION-3||
4fxn-ITERATION-4||
4fxn-ITERATION-5||
4fxn-ITERATION-6||
4fxn-ITERATION-7||
4fxn-ITERATION-8||
4fxn-ITERATION-9||
AA
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
PHD
|PDEAEQDCIEFGKKIANI|
|
HHHHHHHHHHHHH |
|
HHHHHHHHHHHHH |
|
HHHHHHHHHHHHH |
|
HHHHHHHHHHHHH |
|
HHHHHHHHHHHH |
|
HHHHHHHHHHHHH |
|
HHHHHHHHHHHH |
|
HHHHHHHHHHHHH |
|
HHHHHHHHHHHHH |
|
HHHHHHHHHHHH |
Optimal segmentation of predicted secondary
structures by Dynamic Programming
[Figure: H, E, C and ? scores plotted per sequence position for a region, scanned with windows of varying size.]
The recorded values are used in a weighted function according to their secondary structure type, which gives each position a window-specific score. The more probable the secondary structure element, the higher the score.
Restrictions on window size (ws):
H only if ws >= 4
E only if ws >= 2
Segmentation score (total score of each path)
[Figure: dynamic programming table over sequence position, storing for each position the max score, the offset and the label (e.g. H).]
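The segmentation step can be sketched as a one-dimensional dynamic programme that stores, for every sequence position, the maximum score together with the offset and label of the last segment. The sketch assumes a segment's score is the plain sum of its per-position scores for its label and that coil segments may have length 1; the weighted window function and the "?" state of the original method are left out.

```python
MIN_LEN = {"H": 4, "E": 2, "C": 1}  # H only if ws >= 4, E only if ws >= 2


def segment(scores):
    """scores: one dict {'H': x, 'E': y, 'C': z} per position.

    Returns (total score, label string) of the best segmentation.
    """
    n = len(scores)
    best = [0.0] + [float("-inf")] * n  # best[i]: max score for positions 0..i-1
    back = [None] * (n + 1)             # back[i]: (offset, label) of last segment
    for i in range(1, n + 1):
        for label, min_len in MIN_LEN.items():
            run = 0.0
            for w in range(1, i + 1):   # segment covers positions i-w .. i-1
                run += scores[i - w][label]
                if w >= min_len and best[i - w] + run > best[i]:
                    best[i] = best[i - w] + run
                    back[i] = (w, label)
    labels, i = [], n
    while i > 0:                        # trace back through the stored offsets
        w, label = back[i]
        labels.append(label * w)
        i -= w
    return best[n], "".join(reversed(labels))


h = [0.9, 0.9, 0.9, 0.2, 0.1, 0.1]
e = [0.0, 0.0, 0.0, 0.5, 0.8, 0.8]
c = [0.1, 0.1, 0.1, 0.3, 0.1, 0.1]
profile = [{"H": x, "E": y, "C": z} for x, y, z in zip(h, e, c)]
print(segment(profile))
```

In this toy profile the fourth position scores higher for E than for H, yet it ends up in the helix because a helix segment must span at least four positions.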
Example of an optimally segmented secondary
structure prediction library for sequence 3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
3chy
<<<<<<<<<<<<<<-
1fx1
FLAV_DESDE
FLAV_DESVH
FLAV_DESGI
FLAV_DESSA
4fxn
FLAV_MEGEL
2fcr
FLAV_ANASP
FLAV_ECOLI
FLAV_AZOVI
FLAV_ENTAG
FLAV_CLOAB
3chy
---------------GYVV-----KPFTAATLEEKLNKIFEKLGM-----??????????????? ee ??
hhhhhhhhhhhhhh ????????
??????????????? ee ??
hhhhhhhhhhhhhhh ????????
??????????????? ee ??
hhhhhhhhhhhhhh ????????
??????????????? eee ??
??hhhhhhhhhhhhh ????????
??????????????? eee ??
??hhhhhhhhhhhhh ????????
??????????????? eee ??
hhhhhhhhhhhhh ?????????
????????????????eee ??
hh?hhhhhhhhhhh ?????????
e
? eeeeeee
hhhhhhhhhhhhhhh
??????
? eeeeeee
hhhhhhhhhhhhhhh
??????
eeeeeee
hhhhhhhhhhhhhhh hhhhh
? eeeeeee
hhhhhhhhhhhhhhh
????
e
eeeeeeee
hhhhhhhhhhhhhhhh? ??????
eeeeeee
hhhhhhhhhh ???????????
------------------hhhhhhhhhhhhhh
------
Consensus
Consensus-DSSP
---------------EEEE----HHHHHHHHHHHHH -----...............****.....****xx***************......
PHD
PHD-DSSP
------------------HHHHHHHHHHHHHH
-----...............xxxx.....******************x**......
DSSP
LumpDSSP
...............EEEE.....SS
...............EEEE.....
HHHHHHHHHHHHHHHT ......
HHHHHHHHHHHHHHH ......
What to do with a multiple alignment?
• Use it to eyeball and detect
structural/functional features
• Use it to make a profile and search a
database for homologs
• Give it to other bioinformatics methods and
predict secondary structure, functional
residues, correlated mutations, phylogenetic
trees, etc.
Rules of thumb when looking at a
multiple alignment (MA)
• Hydrophobic residues are internal
• Gly (Thr, Ser) in loops
• MA: hydrophobic block => internal β-strand
• MA: alternating (1-1) hydrophobic/hydrophilic => edge β-strand
• MA: alternating 2-2 (or 3-1) periodicity => α-helix
• MA: gaps in loops
• MA: conserved column => functional? => active site
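The periodicity rules of thumb can be tried out with a small sketch: reduce each alignment column to hydrophobic (h) or polar (p), then see which repeating template fits the pattern best. The hydrophobic residue set, the majority threshold and the three templates are simplifying assumptions made for illustration.

```python
HYDROPHOBIC = set("AVLIMFWC")  # assumption: one simple choice of residues

TEMPLATES = {
    "internal beta-strand": "h",  # hydrophobic block
    "edge beta-strand": "hp",     # alternating 1-1
    "alpha-helix": "hhpp",        # alternating 2-2
}


def column_pattern(alignment, threshold=0.5):
    """Mark a column 'h' if more than `threshold` of its residues are hydrophobic."""
    pattern = []
    for column in zip(*alignment):
        residues = [r.upper() for r in column if r != "-"]
        hydrophobic = sum(r in HYDROPHOBIC for r in residues)
        frac = hydrophobic / len(residues) if residues else 0.0
        pattern.append("h" if frac > threshold else "p")
    return "".join(pattern)


def classify(pattern):
    """Return the template that matches most positions at its best phase."""
    best_name, best_hits = None, -1
    for name, unit in TEMPLATES.items():
        for phase in range(len(unit)):
            hits = sum(ch == unit[(i + phase) % len(unit)]
                       for i, ch in enumerate(pattern))
            if hits > best_hits:
                best_name, best_hits = name, hits
    return best_name


print(column_pattern(["LIVK", "LAVE", "MIFD"]))
print(classify("hphphp"))
```

Such a classifier is only a first guess; the rules are heuristics and real alignments are noisy.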
Rules of thumb when looking at a
multiple alignment (MA)
• Active site residues are together in 3D structure
• Helices often cover up core of strands
• Helices less extended than strands => more
residues to cross protein
• b--b motif is right-handed in >95% of cases
(with parallel strands)
• MA: ‘inconsistent’ alignment columns and
match errors!
• Secondary structures have local anomalies, e.g.
b-bulges
Periodicity patterns
Buried β-strand
Edge β-strand
α-helix
Buried and edge strands
Parallel β-sheet
Anti-parallel β-sheet
β-α-β motif is right-handed in
>95% of cases
RH
LH
Flavodoxin-cheY example: 5(b)
1fx1
FLAV_DESDE
FLAV_DESVH
FLAV_DESSA
FLAV_DESGI
2fcr
FLAV_AZOVI
FLAV_ENTAG
FLAV_ANASP
FLAV_ECOLI
4fxn
FLAV_MEGEL
FLAV_CLOAB
3chy
-PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
--KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
-AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
-AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
-MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
-MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM
1fx1
FLAV_DESDE
FLAV_DESVH
FLAV_DESSA
FLAV_DESGI
2fcr
FLAV_AZOVI
FLAV_ENTAG
FLAV_ANASP
FLAV_ECOLI
4fxn
FLAV_MEGEL
FLAV_CLOAB
3chy
GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI-------ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL-------GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI-------GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD---------------------SLKIDGD--PE--RDEIVSwGSGIADKI-------GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS---------------------SLKIDGE--PD--SAEVLDwAREVLARV-------GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV-----GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L-----GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL-----GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
G-----SY-GWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI--------G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNA-PECKElGEAAAKA--------STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF----------VTAEAKK--ENIIAA---------AQAGAS-------------------------GYVV-----KPFTAATLEEKLNKIFEKLGM------
Iteration 0
T
G
SP= 136944.00
AvSP= 10.675
SId= 4009
AvSId= 0.313
Building flavodoxin
[Figure sequence: the five β-strands are placed one at a time, joined by right-handed (RH) β-α-β connections. The first attempt uses the strand order 1-2-3-4-5; after "try again" the strand order is 2-1-3-4-5.]
Flavodoxin family - TOPS diagrams
(Flores et al., 1994)
[Figure: TOPS topology diagrams with the strands numbered 1-5.]
Protein structure evolution
Insertion/deletion of secondary structural
elements can ‘easily’ be done at loop sites
Protein structure evolution
Insertion/deletion of structural domains can
‘easily’ be done at loop sites
N
C
Integrating protein multiple alignment,
secondary and tertiary structure
prediction to predict
structural domains in sequence data
SnapDRAGON
Richard A. George
George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
A domain is a:
• Compact, semi-independent unit
(Richardson, 1981).
• Stable unit of a protein structure that can
fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary
module (Bork, 1992).
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
The DEATH Domain
http://www.mshri.on.ca/pawson
• Present in a variety of Eukaryotic
proteins involved with cell death.
• Six helices enclose a tightly
packed hydrophobic core.
• Some DEATH domains form
homotypic and heterotypic dimers.
Delineating domains is essential for:
• Obtaining high resolution structures (X-ray, NMR)
• Sequence analysis
• Multiple sequence alignment methods
• Prediction algorithms (SS, Class, secondary/tertiary structure)
• Fold recognition and threading
• Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method)
• Structural/functional genomics
• Cross genome comparative analysis
Structural domain organisation can be nasty…
Pyruvate kinase
Phosphotransferase
β barrel regulatory domain
α/β barrel catalytic substrate binding
domain
α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)
SECONDARY STRUCTURE (helices, strands)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
QUATERNARY STRUCTURE
TERTIARY STRUCTURE (fold)
Domain prediction using DRAGON
Distance Regularisation Algorithm for
Geometry OptimisatioN
(Aszodi & Taylor, 1994)
• Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together.
• First constructs a random high dimensional Cα distance matrix.
• Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.
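The core distance geometry step, going from a matrix of distances to 3D coordinates, can be sketched with classical metric-matrix embedding. This is a generic textbook illustration under the assumption of exact, consistent distances; it is not DRAGON's own procedure, which starts in a higher dimensional space and has to reconcile imprecise target distances. The function name `embed` and the power-iteration eigen solver are choices made here to keep the example free of external libraries.

```python
import math
import random


def embed(dist, dim=3, iters=500):
    """Recover coordinates (up to rotation/reflection) from a distance matrix."""
    n = len(dist)
    d2 = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in d2]
    tot = sum(row) / n
    # Metric (Gram) matrix relative to the centroid (double centring).
    g = [[-0.5 * (d2[i][j] - row[i] - row[j] + tot) for j in range(n)]
         for i in range(n)]
    rng = random.Random(0)
    coords = [[0.0] * dim for _ in range(n)]
    for k in range(dim):
        v = [rng.random() for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):  # power iteration for the largest eigenvalue
            w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        gv = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(a * b for a, b in zip(v, gv))  # Rayleigh quotient
        if lam <= 0:
            continue  # no further real dimension can be extracted
        for i in range(n):
            coords[i][k] = math.sqrt(lam) * v[i]
            for j in range(n):
                g[i][j] -= lam * v[i] * v[j]     # deflate the found component
    return coords


points = [(0, 0, 0), (4, 0, 0), (0, 3, 0), (0, 0, 1), (1, 1, 1)]
target = [[math.dist(p, q) for q in points] for p in points]
model = embed(target)
print(round(math.dist(model[0], model[1]), 6))
```

The recovered coordinates reproduce the input distances but not the original orientation, since distances carry no information about rotation or mirror image.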
The DRAGON target matrix is inferred
from:
• A multiple sequence alignment of a protein (old)
– Conserved hydrophobicity
• Secondary structure information (SnapDRAGON)
– predicted by PREDATOR (Frishman & Argos, 1996).
– strands are entered as distance constraints from the N-terminal Cα to the C-terminal Cα.
[Figure: input data and flow. The multiple alignment and the predicted secondary structure (e.g. CCHHHCCEEE) define the target matrix; 100 randomised initial Cα distance matrices (size N) are embedded in 3D to give 100 predictions.]
•The C distance matrix is divided into smaller clusters.
•Seperately, each cluster is embedded into a local centroid.
•The final predicted structure is generated from full
embedding of the multiple centroids and their
corresponding local structures.
SnapDRAGON
Multiple alignment + predicted secondary structure (CCHHHCCEEE) => generated folds by DRAGON => boundary recognition => summed and smoothed boundaries
SnapDRAGON
1. Domains in structures assigned using method by Taylor (1997)
2. Domain boundary positions of each model against sequence
3. Summed and smoothed boundaries (biased window protocol)
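The last step, turning the boundaries found in many models into one consensus profile, can be sketched as follows. The sketch sums the boundary positions over all models and smooths the counts with a plain moving average; the actual biased window protocol is not specified on the slides, so the window shape and size used here are assumptions.

```python
def summed_boundaries(models, length):
    """models: one list of boundary positions (0-based) per predicted structure."""
    counts = [0] * length
    for boundaries in models:
        for pos in boundaries:
            counts[pos] += 1
    return counts


def smooth(counts, window=3):
    """Moving average over `window` positions, truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(counts)):
        lo, hi = max(0, i - half), min(len(counts), i + half + 1)
        out.append(sum(counts[lo:hi]) / (hi - lo))
    return out


models = [[10], [11], [9], [10], [30]]  # toy boundaries from five models
profile = smooth(summed_boundaries(models, 40))
peak = max(range(len(profile)), key=profile.__getitem__)
print(peak, profile[peak])
```

Boundaries that recur at nearly the same position across models reinforce each other after smoothing, while isolated ones stay low; this is the consistency criterion used in the absence of a standard of truth.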
SnapDRAGON
• Is very slow (can be hours for proteins>400
aa) – cluster computing implementation
• Uses consistency in the absence of standard
of truth
• Goes from primary+secondary to tertiary
structure to ‘just’ chop protein sequences
• SnapDRAGON webserver is underway
Integrating protein sequence database
searching and on-the-fly domain recognition
DOMAINATION
Richard A. George
Protein domain identification and improved sequence
searching using PSI-BLAST
(George & Heringa, Prot. Struct. Func. Genet., in press; 2002)
Domaination
• Current iterative homology search methods
do not take into account that:
– Domains may have different ‘rates of
evolution’.
– Common conserved domains, such as the
tyrosine kinase domain, can obscure weak but
relevant matches to other domain types
– Premature convergence (false negatives)
– Matrix migration / Profile wander (false
positives).
PSI-BLAST
• Query sequence is first scanned for the presence of so-called low-complexity regions (Wootton and Federhen,
1996), i.e. regions with a biased composition (e.g. TM
regions or coiled coils) likely to lead to spurious hits,
which are excluded from alignment.
• Initially operates on a single query sequence by
performing a gapped BLAST search
• Then takes significant local alignments found,
constructs a ‘multiple alignment’ and abstracts a
position specific scoring matrix (PSSM) from this
alignment.
• Rescans the database in a subsequent round to find
more homologous sequences -- Iteration continues until
user decides to stop or search converges
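How a position specific scoring matrix is abstracted from a multiple alignment can be sketched in a few lines. The sketch uses log-odds of column frequencies against a uniform background with a fixed pseudocount; PSI-BLAST itself uses sequence weighting, real background frequencies and data-dependent pseudocounts, so the numbers here are illustrative only, and the names `build_pssm` and `score` are hypothetical.

```python
import math

AMINO = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudo=1.0):
    """One dict per column: residue -> log2(observed frequency / background)."""
    background = 1.0 / len(AMINO)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        residues = [r for r in column if r in AMINO]  # gaps are ignored
        total = len(residues) + pseudo * len(AMINO)
        pssm.append({a: math.log2((residues.count(a) + pseudo) / total / background)
                     for a in AMINO})
    return pssm


def score(pssm, segment):
    """Ungapped score of a segment that is as long as the profile."""
    return sum(col[r] for col, r in zip(pssm, segment))


hits = ["GTGN", "GTGK", "GSGN"]  # toy alignment of significant hits
pssm = build_pssm(hits)
print(score(pssm, "GTGN") > score(pssm, "AAAA"))
```

Because each column has its own scores, a conserved glycine is rewarded strongly where it is conserved and not elsewhere, which is what lets the next search round find more distant homologs.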
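PSI-BLAST iteration
[Figure: the query sequence Q is used in a gapped BLAST search; the database hits are converted into a PSSM with one row per amino acid (A, C, D, ..., Y) and one column per query position (Pi ... Px); the PSSM is used in the next gapped BLAST search, and the new database hits yield an updated PSSM.]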
DOMAINATION
Chop and Join
Domains
Identifying domain boundaries
Sum N- and C-termini of
gapped local alignments
True N- and C- termini are
counted twice (within 10 residues)
Boundaries are smoothed using two
windows (15 residues long)
Combine scores using biased
protocol:
if Ni × Ci = 0
then Si = Ni + Ci
else Si = Ni + Ci + (Ni × Ci)/(Ni + Ci)
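The biased protocol can be written out directly. In this sketch `n_counts` and `c_counts` are assumed to hold the smoothed N- and C-terminus counts per query position (Ni and Ci); the function name is hypothetical.

```python
def combine(n_counts, c_counts):
    """Combine N- and C-terminus counts per position into boundary scores Si."""
    scores = []
    for n, c in zip(n_counts, c_counts):
        if n * c == 0:
            scores.append(n + c)
        else:
            # Positions supported by both kinds of termini get a bonus.
            scores.append(n + c + (n * c) / (n + c))
    return scores


print(combine([0, 2, 3], [4, 0, 3]))
```

The extra term is largest when Ni and Ci are equal, so positions where alignments both end and start score higher than positions supported by only one kind of terminus.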
Identifying domain deletions
• Deletions in the query (or insertion in the
DB sequences) are identified by
– two adjacent segments in the query align to the
same DB sequences (>70% overlap), which
have a region of >35 residues not aligned to the
query.
(remove N- and C- termini)
DB
Query
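A simplified sketch of the deletion test: for each database sequence, look at consecutive local alignments and report cases where the two query segments are adjacent but the matched database regions are separated by more than 35 unaligned residues. The `Hit` record, the `max_query_gap` tolerance used to decide that query segments are adjacent, and the omission of the 70% overlap criterion are assumptions made for brevity.

```python
from collections import namedtuple

# Hypothetical record for one gapped local alignment (1-based, inclusive).
Hit = namedtuple("Hit", "db_id q_start q_end db_start db_end")


def find_deletions(hits, min_insert=35, max_query_gap=10):
    """Return (db_id, query position, db insert start, db insert end) tuples."""
    by_db = {}
    for h in hits:
        by_db.setdefault(h.db_id, []).append(h)
    found = []
    for db_id, group in by_db.items():
        group.sort(key=lambda h: h.q_start)
        for left, right in zip(group, group[1:]):
            query_gap = right.q_start - left.q_end - 1
            db_gap = right.db_start - left.db_end - 1
            if query_gap <= max_query_gap and db_gap > min_insert:
                found.append((db_id, left.q_end, left.db_end + 1, right.db_start - 1))
    return found


hits = [
    Hit("P1", 1, 100, 1, 100), Hit("P1", 101, 200, 180, 280),  # 79 extra residues in P1
    Hit("P2", 1, 100, 5, 104), Hit("P2", 101, 200, 105, 204),  # collinear, no insert
]
print(find_deletions(hits))
```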
Identifying domain permutations
• A domain shuffling event is declared
– when two local alignments (>35 residues)
within a single DB sequence match two
separate segments in the query (>70% overlap),
but have a different sequential order.
b
a
a
b
DB
Query
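The permutation test can be sketched in the same style: two sufficiently long local alignments to the same database sequence that cover separate query segments but appear in the opposite order in the database sequence. The `Hit` record is a hypothetical container, and the 70% overlap criterion is again left out.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical record for one gapped local alignment (1-based, inclusive).
Hit = namedtuple("Hit", "db_id q_start q_end db_start db_end")


def find_permutations(hits, min_len=35):
    """Return (db_id, hit_a, hit_b) for pairs whose order differs in query and DB."""
    found = []
    for a, b in combinations(hits, 2):
        if a.db_id != b.db_id:
            continue
        shortest = min(a.q_end - a.q_start, b.q_end - b.q_start) + 1
        if shortest <= min_len:
            continue  # both local alignments must be longer than 35 residues
        separate = a.q_end < b.q_start or b.q_end < a.q_start
        query_order = a.q_start < b.q_start
        db_order = a.db_start < b.db_start
        if separate and query_order != db_order:
            found.append((a.db_id, a, b))
    return found


hits = [
    Hit("P1", 1, 80, 120, 199), Hit("P1", 90, 170, 10, 90),  # order a-b vs b-a
    Hit("P2", 1, 80, 1, 80), Hit("P2", 90, 170, 90, 170),    # same order
]
print([db_id for db_id, _, _ in find_permutations(hits)])
```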
Identifying continuous and discontinuous domains
•Each segment is assigned an independence score (In).
If In>10% the segment is assigned as a continuous domain.
•An association score is calculated between non-adjacent
fragments by assessing the shared sequence hits to the
segments. If score > 50% then segments are considered as
discontinuous domains and joined.
Create domain profiles
• A representative set of the database sequence fragments
that overlap a putative domain are selected for alignment
using OBSTRUCT (Heringa et al. 1992).
> 20% and < 60% sequence identity (including the query seq).
• A multiple sequence alignment is generated using
PRALINE (Heringa 1999, 2002; Kleinjung et al., 2002).
• Each domain multiple alignment is used as a profile in
further database searches using PSI-BLAST (Altschul et al
1997).
• The whole process is iterated until no new domains are
identified.
Significant sequences found in database searches
At an E-value cut-off of 0.1 the performance of DOMAINATION
searches with the full-length proteins is 15% better than PSI-BLAST
Summary
Algorithmic integration issues:
• Integrating data categories
• Integrating alternative methods (consensus)
• Making a web-integrated genomics pipeline that
combines it all
Big task ahead @ VU
Needs:
• People
• Teams with an interest in Integrative
Bioinformatics
• HTC/Dedicated cluster computing
Acknowledgements
VU CvB
FEW
FALW
Victor Simossis – NIMR to VU (1 November 2002)
Jens Kleinjung – NIMR to VU (1 December 2002)
Hans Westerhoff – FALW, VU
Henri Bal – CS, FEW, VU
Hans van Beek – VUMC/FALW, VU
View at NIMR (Mill Hill)