Needleman-Wunsch Algorithm

Allele Mining: with respect to Comparative Protein Structure Modelling
and Docking Study
Sunil Kumar
Institute of Life Sciences
Bhubaneswar
E-mail: [email protected]
Allele Mining: an Introduction
• Enormous sequence information is available in public databases
as a result of sequencing of diverse crop genomes.
• It is important to use this genomic information for the
identification and isolation of novel and superior alleles of
agronomically important genes from crop gene pools to suitably
deploy for the development of improved cultivars.
• Allele mining is a promising approach to dissect naturally
occurring allelic variation at candidate genes controlling key
agronomic traits, which has potential applications in crop
improvement programs.
• It helps in tracing the evolution of alleles, identification of new
haplotypes and development of allele-specific markers for use in
marker-assisted selection.
Allele Mining (cont.)
• Initial studies of allele mining focused only on
the identification of SNPs/InDels in coding sequences
or exons of the gene, since these variations were
expected to affect the encoded protein structure
and/or function.
• However, recent reports indicate that nucleotide
changes in non-coding regions (5’ UTR including
promoter, introns and 3’ UTR) also have a significant
effect on transcript synthesis and accumulation,
which in turn alters trait expression.
Information Transfer pathway within the cell
DNA (…ATGCATGCATGCATGCATGC…)
→ RNA (…CGUACGUACGUACGU…)
→ decoding mechanism
→ protein sequence
→ protein structure
→ biological function
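The DNA → RNA → protein steps can be sketched in a few lines. This is a minimal illustration, assuming the input is the coding strand read in frame from its first base and that the standard genetic code applies; the function names `transcribe` and `translate` are chosen here for illustration and do not come from the slides.

```python
# Standard genetic code, enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna):
    """Translate coding-strand DNA into a one-letter protein sequence.

    Reading starts at the first base (assumed frame) and stops at the
    first stop codon or at the last complete codon.
    """
    dna = coding_dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE[dna[pos:pos + 3]]
        if residue == "*":  # stop codon
            break
        protein.append(residue)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads the codons ATG, GCC and TAA, giving Met-Ala followed by a stop.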
Proteins
Proteins are the building blocks of life.
In a cell, 70% is water and 15%-20% are
proteins.
Examples:
hormones – regulate metabolism
structural – hair, wool, muscle,…
antibodies – immune response
enzymes – chemical reactions
Amino Acids
A protein is composed of a central backbone and a
collection of (typically) 50-2000 amino acids
There are 20 different kinds of amino acids
Name            3-letter code   1-letter code
Leucine         Leu             L
Alanine         Ala             A
Serine          Ser             S
Glycine         Gly             G
Valine          Val             V
Glutamic acid   Glu             E
Threonine       Thr             T
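As a small illustration of how the two naming schemes relate, the sketch below converts a list of 3-letter codes into a 1-letter sequence string. It covers only the seven amino acids listed above; the dictionary and function names are illustrative.

```python
# Only the seven amino acids listed in the table above are included.
THREE_TO_ONE = {
    "Leu": "L",
    "Ala": "A",
    "Ser": "S",
    "Gly": "G",
    "Val": "V",
    "Glu": "E",
    "Thr": "T",
}


def to_one_letter(residues):
    """Convert 3-letter codes (any capitalisation) to a 1-letter string."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in residues)
```

A call such as `to_one_letter(["Gly", "Ala", "Val"])` returns the compact form used in sequence files and alignments.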
Amino Acids
Side chain
Each amino acid is identified by its side chain,
which determines the properties of this amino acid.
Side Chain Properties
• Hydrophobic residues stay inside the protein, while
hydrophilic residues stay close to water
• Oppositely charged amino acids can form salt
bridges
• Polar amino acids can participate in hydrogen
bonding
Protein Folding
• Proteins must fold
to function
• Some diseases are
caused by
misfolding
e.g., mad cow disease
Three Structure Levels
Primary structure:
sequence of amino acids
– e.g., DRVYIHPF
Secondary structure:
local folding patterns
– e.g., alpha-helix,
beta-sheet, loop
Tertiary structure:
complete 3D fold
[Figure labels: helix, beta sheet, loop]
Beta Sheet Examples
Parallel beta sheet
Anti-parallel beta sheet
Helix Examples
Domain, Fold, Motif
• A protein chain can have several domains
▫ A domain is a discrete portion of a protein that can
fold independently and possesses its own function
• The overall shape of a domain is called a
fold. There are only a few thousand
possible folds.
• Sequence motif: highly conserved protein
subsequence
• Structure motif: highly conserved
substructure
Protein Data Bank
About 50,000 protein structures, solved using experimental
techniques; ~800 are unique structural folds.
[Figure: examples of proteins with the same structural fold and with different structural folds]
The Problem
• Protein functions are determined
by 3D structures
• ~50,000 protein structures
in PDB (Protein Data Bank)
• Experimental determination
of protein structures is time-consuming and expensive
• Many protein sequences
available
[Figure labels: sequence, protein structure, function, medicine]
Why Protein 3D Structures?
3D Structures of Proteins
Better Understanding of Protein Functions
“Three-dimensional protein structures are important in
understanding the mechanisms of human genetic diseases,
predicting the effect of non-synonymous single nucleotide
polymorphisms and developing new personalized medicines”
Xie and Bourne (2005) PLoS Comput. Biol. 1:e31
What is Homology Modeling?
An approach to predict a model of the three-dimensional
structure of a given protein sequence (TARGET) based
on an alignment to one or more known protein structures
(TEMPLATES)
The homology modeling method is based on the assumption
that the structure of an unknown protein is similar to
known structures of reference proteins
Why a Model?
A model is desirable when X-ray crystallography or NMR
spectroscopy cannot determine the structure of a protein in time
or at all.
The 3-D structure of proteins can be determined by X-ray
crystallography and NMR spectroscopy, but these experimental
techniques are time-consuming and not possible if a sufficient
quantity and quality of the protein is not available.
The built model provides a wealth of information on how the
protein functions, with information at the residue property level. This
information can then be used for mutational studies or for drug
design.
Protein Structure Determination
• High-resolution structure determination
▫ X-ray crystallography (~1Å)
▫ Nuclear magnetic resonance (NMR) (~1-2.5Å)
• Low-resolution structure determination
▫ Cryo-EM (electron microscopy) ~10-15Å
X-ray crystallography
• Most accurate
• An extremely pure protein sample is needed.
• The protein sample must form relatively large crystals
without flaws; this is generally the biggest problem.
• Many proteins aren’t amenable to crystallization at all (e.g.,
proteins that do their work inside a cell membrane).
• ~$100K per structure
Nuclear Magnetic Resonance
• Fairly accurate
• No need for crystals
• Limited to small, soluble proteins only.
Steps in homology modelling
Target’s sequence
1. Identification of structures that will form the template
for modelling
2. Sequence Alignment of the target with template
3. Transfer of the coordinates of structurally conserved regions
(SCRs) from the template(s) to the target
4. Modelling the missing regions
5. Refinement and validation of the model
Target’s structure
Template search
• Homology modeling is based on using similar structures,
i.e. no similar structures = no model
• 40% amino acid identity or higher is best; below that is
not advisable but examples of success do exist
• Need sequence similarity across the whole sequence,
not just in one part
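The identity threshold above can be checked directly from a target-template alignment. The sketch below is one possible definition, assuming percent identity is counted over aligned columns where neither sequence has a gap (other definitions divide by the shorter or longer sequence length); `percent_identity` is an illustrative name.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over aligned, non-gap columns.

    Both strings must come from the same alignment (equal length),
    with '-' marking gaps.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)
```

Under the guideline on this slide, a template would be considered a good candidate when this value is about 40 or higher across the whole sequence.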
Searching Databases
Query → Database search (BLAST, FASTA)
Key Step:
Sequence alignment of the target with the
basis structures
Good alignment → good model
• Sequence alignment is a basic technique in
homology modeling.
• It is used to establish a one-to-one
correspondence between the amino acids of the
reference protein (template) and those of the
unknown protein (target) in the structurally
conserved regions.
• The correspondence is the basis for transferring
coordinates from the reference to the model
protein
A sample alignment of two DNA sequences
Sequence A: GGTGGAC
Sequence B: AAAGGTGAC

(a) Un-gapped alignment
GGTGGAC
AAAGGTGAC

(b) Gapped alignment
   GGTGGAC
   |||| ||
AAAGGTG-AC

The “|” indicates matching nucleotides
Sequence Alignment
Global alignment
Local alignment
Applications:
Global alignment: essential for comparative modeling.
Local alignment: sufficient for functional domains.
N.B: Global alignment is computationally more time
consuming than the local alignment.
Sequence Homology Vs Sequence Similarity
Dotplot:
A dotplot gives an overview of all possible alignments
[Figure: dotplot with Sequence 1 (TACATTACGTAC) along the horizontal axis and Sequence 2 (ATTCACATA) along the vertical axis; a dot marks each position where the two residues are identical]
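A dotplot is simple to compute: every residue of one sequence is compared with every residue of the other, and identical pairs are marked. The sketch below is a minimal version with no window or threshold filtering; `dotplot` and `render` are illustrative names, and the sequences in the usage note are example inputs only.

```python
def dotplot(seq_x, seq_y):
    """Return a matrix with one row per residue of seq_y and one column
    per residue of seq_x; True marks identical residues."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the dotplot as text: '*' for a match, '.' otherwise."""
    matrix = dotplot(seq_x, seq_y)
    lines = ["  " + " ".join(seq_x)]
    for residue, row in zip(seq_y, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)
```

Calling `print(render("TACATTACGTAC", "ATTCACATA"))` shows diagonal runs of `*` wherever the two sequences share a stretch of identical residues; each such diagonal corresponds to a possible ungapped local alignment.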
Dynamic Programming
• Dynamic programming is a computational method used for aligning two
protein or nucleotide sequences. The method compares every pair of
residues/nucleotides in the two sequences and generates an alignment.
In the alignment, matches, mismatches and gaps in the two sequences
are positioned in such a way that the number of matches between
identical or similar residues is the maximum possible.
• Needleman and Wunsch algorithm: global alignment
• Smith and Waterman algorithm: local alignment
\[
F(i, j) = \max
\begin{cases}
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) - d \\
F(i, j-1) - d
\end{cases}
\]
[Figure: cell F(i, j) is reached from F(i-1, j-1) with score s(x_i, y_j), from F(i-1, j) with -d, or from F(i, j-1) with -d]
Steps
1. Initialization: the 1st row and 1st column are filled with multiples of the gap
penalty
2. Rest of the cells: each is filled with the maximum of the three candidate values
3. Generation of optimal path: through backtracking
4. Generation of optimal alignment: one for each optimal path (no. of
optimal paths = no. of optimal alignments)
Scoring Scheme
Given an alignment between two sequences, we can
compute its similarity by:
1) Rewarding for a match
2) Penalizing for a mismatch
3) Penalizing for a gap
Match => +1
Mismatch => -1
Gap or Indel => -2
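The four steps and the scoring scheme above can be put together as follows. This is a minimal sketch of Needleman-Wunsch global alignment with a linear gap penalty d, using the match/mismatch/gap values from this slide as defaults; it returns only one optimal alignment even when several optimal paths exist, and the function name is illustrative.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap_penalty=2):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    d = gap_penalty

    # Step 1: first row and first column hold multiples of the gap penalty.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d

    # Step 2: every other cell takes the maximum of its three candidates.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Steps 3 and 4: backtrack from the bottom-right cell to build one alignment.
    aligned_x, aligned_y = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                aligned_x.append(x[i - 1])
                aligned_y.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(aligned_x)), "".join(reversed(aligned_y))
```

With the default scores, `needleman_wunsch("GGTGGAC", "AAAGGTGAC")` gives a score of -2: six matches, three leading gaps and one internal gap.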
Smith and Waterman
(local alignment)
Two differences:
1.
\[
F(i, j) = \max
\begin{cases}
0 \\
F(i-1, j-1) + s(x_i, y_j) \\
F(i-1, j) - d \\
F(i, j-1) - d
\end{cases}
\]
2. An alignment can now end anywhere in the matrix
Example:
Sequence 1: HEAGAWGHEE
Sequence 2: PAWHEAE
Scoring parameters: BLOSUM
Gap penalty: linear gap penalty of 8
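The two differences from the global algorithm can be seen in code. The sketch below is a minimal Smith-Waterman implementation that, as a simplifying assumption, uses a plain match/mismatch score instead of a BLOSUM matrix, so its scores are not comparable with the BLOSUM example on this slide; the default values and the function name are illustrative.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap_penalty=2):
    """Local alignment; returns (best_score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    d = gap_penalty
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Difference 1: a cell never drops below 0.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Difference 2: backtracking starts at the best cell anywhere in the
    # matrix and stops as soon as a cell with score 0 is reached.
    aligned_x, aligned_y = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            aligned_x.append(x[i - 1])
            aligned_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            aligned_x.append(x[i - 1])
            aligned_y.append("-")
            i -= 1
        else:
            aligned_x.append("-")
            aligned_y.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(aligned_x)), "".join(reversed(aligned_y))
```

The function can be called on the example pair as `smith_waterman("HEAGAWGHEE", "PAWHEAE")`; to reproduce a BLOSUM-based result, the `match`/`mismatch` lookup would need to be replaced by a substitution-matrix lookup and the gap penalty set to 8.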
Comparative Modelling Methods
Restraint-based methods
- MODELLER (Sali and Blundell, 1993)
MODELLER
• MODELLER is a computer program that models
three-dimensional structures of proteins and their
assemblies by satisfaction of spatial restraints.
• MODELLER is most frequently used for homology or
comparative protein structure modeling.
• The user provides an alignment of a sequence to be
modeled with known related structures and
MODELLER will automatically calculate a model with
all non-hydrogen atoms.
• A 3D model is obtained by optimization of a
molecular probability density function (pdf).
Format for Modeller:
INCLUDE
SET ATOM_FILES_DIRECTORY = './:../'
SET PDB_EXT = '.atm'
SET STARTING_MODEL = 1
SET ENDING_MODEL = 20
SET MD_LEVEL = 'refine1'
SET DEVIATION = 4.0
SET KNOWNS = '1JKE'
SET HETATM_IO = off
SET WATER_IO = off
SET ALIGNMENT_FORMAT = 'PIR'
SET SEQUENCE = 'target1'
SET ALNFILE = 'multiple1.ali'
CALL ROUTINE = 'model'
Loop Modelling
Loop region → calculate distances between the anchor residues → fragment database → loop generation
Process:
1. Select a loop for each region
2. Fix the loop
Loop Library
• Loops extracted from PDB using high-resolution (<2 Å)
X-ray structures
• Typically thousands of loops in the database
• Includes loop coordinates, sequence, number of residues in loop,
Cα-Cα distance, preceding secondary structure and following secondary
structure (or their Cα coordinates)
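Selecting candidate loops from such a library amounts to comparing the anchor Cα-Cα distance and the loop length against each stored entry. The sketch below illustrates the idea only: the library records, the dictionary keys and the 0.5 Å tolerance are all made up for illustration and do not come from any real loop database or program.

```python
import math

# Made-up entries for illustration only (not taken from any real database).
EXAMPLE_LIBRARY = [
    {"sequence": "GSGD", "n_residues": 4, "ca_ca_distance": 6.1},
    {"sequence": "NGKT", "n_residues": 4, "ca_ca_distance": 9.8},
    {"sequence": "PDGSA", "n_residues": 5, "ca_ca_distance": 6.0},
]


def ca_distance(ca_a, ca_b):
    """Euclidean distance (in Å) between two Cα coordinates (x, y, z)."""
    return math.dist(ca_a, ca_b)


def select_loops(library, n_residues, anchor_distance, tolerance=0.5):
    """Return library loops of the requested length whose Cα-Cα distance
    is within `tolerance` Å of the anchor distance, best fit first."""
    hits = [
        loop
        for loop in library
        if loop["n_residues"] == n_residues
        and abs(loop["ca_ca_distance"] - anchor_distance) <= tolerance
    ]
    return sorted(hits, key=lambda loop: abs(loop["ca_ca_distance"] - anchor_distance))
```

In practice the anchor distance would be computed with `ca_distance` from the Cα coordinates of the residues flanking the missing region in the model, and further filters (sequence similarity, flanking secondary structure) would narrow the candidates.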
Structure Validation
(a) Stereochemical Quality Check
(b) Residue Environment Check
Stereochemical Quality Check
PROCHECK
(Thornton and Co-workers)
Following properties are calculated and analysed
in comparison with those of highly refined structures
solved at varying resolutions.
Torsional angles:
- (φ, ψ) combination
- χ1-χ2 combination
- χ1 torsion for those residues without χ2
- combined χ3 and χ4 angles
- ω angles
Covalent geometry:
- main-chain bond lengths
- main-chain bond angles
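All of the torsional angles listed above (φ, ψ, ω and the side-chain χ angles) are dihedral angles defined by four consecutive atoms. The sketch below shows one common way to compute such an angle from Cartesian coordinates; it is a generic geometric calculation, not PROCHECK's own code, and the function name is illustrative.

```python
import math


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points (x, y, z)."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    def cross(a, b):
        return (
            a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0],
        )

    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)

    # Unit vector along the central bond.
    length = math.sqrt(dot(b1, b1))
    b1 = (b1[0] / length, b1[1] / length, b1[2] / length)

    # Components of the outer bonds perpendicular to the central bond.
    k0 = dot(b0, b1)
    v = (b0[0] - k0 * b1[0], b0[1] - k0 * b1[1], b0[2] - k0 * b1[2])
    k2 = dot(b2, b1)
    w = (b2[0] - k2 * b1[0], b2[1] - k2 * b1[1], b2[2] - k2 * b1[2])

    x = dot(v, w)
    y = dot(cross(b1, v), w)
    return math.degrees(math.atan2(y, x))
```

For φ the four points would be C(i-1), N(i), Cα(i), C(i); for ψ they would be N(i), Cα(i), C(i), N(i+1). Plotting the two values for every residue gives the Ramachandran plot that the (φ, ψ) check is based on.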
Profiles-3D
• Amino acid residues in proteins can be classified
according to their local environments:
▫ solvent accessibility
▫ secondary structure
▫ polarity of other protein chemical groups in
contact with them
Refining the Model
• Energy minimize N- and C-termini.
• Repair spliced peptide bonds.
• Minimize loop regions.
• Energy minimize mutated side chains in SCRs.
• Minimize segments together.
Energy Minimization
• Energy minimization adjusts the structure of the
molecule in order to lower the energy of the system.
• For small molecules, a global minimum energy
configuration can often be found.
• For large macromolecular systems, energy minimization
allows one to examine the local minimum around a
particular conformation.
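The difference between a local and a global minimum can be shown with a toy example. The sketch below runs steepest descent on a one-dimensional double-well function; the function, step size and iteration count are arbitrary choices for illustration and have nothing to do with a real molecular force field.

```python
def energy(x):
    """Toy double-well 'energy' with a deeper well on the left."""
    return (x * x - 1.0) ** 2 + 0.3 * x


def gradient(x):
    """Derivative of the toy energy with respect to x."""
    return 4.0 * x * (x * x - 1.0) + 0.3


def minimize(x0, step=0.05, iterations=500):
    """Steepest descent: repeatedly move downhill along the gradient."""
    x = x0
    for _ in range(iterations):
        x -= step * gradient(x)
    return x
```

Starting from `x0 = 0.5` the minimizer settles in the shallower right-hand well, while starting from `x0 = -0.5` it reaches the deeper left-hand well. The result depends on the starting conformation, which is why minimization of a large model only explores the local minimum around that conformation.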
Modelling on the Web
• Prior to 1998 homology modelling could only be done
with commercial software or command-line freeware
• The process was time-consuming and labor-intensive
• The past few years have seen an explosion in
automated web-based homology modelling servers
• Now anyone can homology model!
Application of Comparative Modeling
- Comparative modeling is an efficient way to obtain
useful information about the proteins of interest.
For example, comparative modeling can be helpful
in
- Designing mutants to test hypotheses about the
protein's function.
- Identifying active and binding sites.
- Searching for designing and improving.
- Modeling substrate specificity.
- Predicting antigenic epitopes.
- Simulating protein-protein docking.
- Confirming a remote structural relationship.
What is docking?
Prediction of the optimal physical configuration and
energy between two molecules
The docking problem optimizes:
• Binding between two molecules such that their orientation
maximizes the interaction
• Evaluates the total energy of interaction such that for the best
binding configuration the binding energy is the minimum
• The resultant structural changes brought about by the
interaction
Molecular Docking
• The process of “docking” a ligand to a binding site
mimics the natural course of interaction of the ligand
and its receptor via a lowest energy pathway.
• Put a compound in the approximate area where binding
occurs and evaluate the following:
– Do the molecules bind to each other?
– If yes, how strong is the binding?
– What does the molecule or the protein-ligand complex
look like? (Understand the intermolecular interactions.)
– Quantify the extent of binding.
Few terms related to docking
• Receptor: The receiving molecule, most commonly a protein or other
biopolymer.
• Ligand: The complementary partner molecule which binds to the
receptor. Ligands are most often small molecules but could also be
another biopolymer.
• Docking: Computational simulation of a candidate ligand binding to a
receptor.
• Binding mode: The orientation of the ligand relative to the receptor as
well as the conformation of the ligand and receptor when bound to each
other.
• Pose: A candidate binding mode.
• Scoring: The process of evaluating a particular pose by counting the
number of favorable intermolecular interactions such as hydrogen
bonds and hydrophobic contacts.
• Ranking: The process of classifying which ligands are most likely to
interact favorably to a particular receptor based on the predicted
Classes of Docking
Protein-Protein docking
• Both molecules usually considered rigid.
• 6 degrees of freedom: 3 for rotation, 3 for translation
• First apply only steric constraints to limit the search space
• Then examine the energetics of possible binding conformations
Protein-Ligand docking
• Flexible ligand, rigid receptor.
• Search space much larger
• Either reduce the flexible ligand to rigid fragments
connected by one or several hinges (reduces
conformational space)
• Or search the conformational space using Monte Carlo methods or molecular dynamics.
[Figures: 1. Protein-protein docking; 2. Protein-ligand docking (optimized)]
Docking uses a “search and score” method
It involves:
• Finding useful ways of representing the molecules and molecular
properties.
• Exploration of the configuration spaces available for interaction
between ligand and receptor.
• Evaluating and ranking configurations using a scoring system, in this
case the binding energy
However, since it is difficult to evaluate the binding energy because
the binding sites may not be easily accessible, the binding energy is
modeled as follows:
\[
\Delta G_{bind} = \Delta G_{vdw} + \Delta G_{hbond} + \Delta G_{elect} + \Delta G_{conform} + \Delta G_{tor} + \Delta G_{sol}
\]
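The "score" half of search-and-score can be sketched as summing the six energy terms for each candidate pose and ranking poses from lowest (most favourable) to highest binding energy. In the sketch below the term names follow the equation above, but the data layout and function names are illustrative assumptions, and any numbers supplied to it would be placeholders rather than output of a real docking program.

```python
# Term names follow the binding-energy equation: van der Waals, hydrogen
# bonding, electrostatics, conformational, torsional and solvation terms.
TERMS = ("vdw", "hbond", "elect", "conform", "tor", "sol")


def binding_energy(terms):
    """Sum the six contributions to the modelled binding energy."""
    return sum(terms[name] for name in TERMS)


def rank_poses(poses):
    """Sort poses so that the lowest (most favourable) energy comes first.

    Each pose is a dict with a 'name' and a 'terms' dict holding one
    value per entry in TERMS (hypothetical layout for this example).
    """
    return sorted(poses, key=lambda pose: binding_energy(pose["terms"]))
```

A real scoring function differs in how each term is computed and weighted; only the final sum-and-rank step is shown here.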
3D Structure of the Complex
Experimental Information:
The active site can be
identified based on the
position of the ligand in the
crystal structures of the
protein-ligand complexes
What if the active site is not known?
Some Available Programs to Perform Docking
• Affinity
• AutoDock
• BioMedCAChe
• CAChe for Medicinal Chemists
• DOCK
• DockVision
• FlexX
• Glide
• GOLD
• Hammerhead
• PRO_LEADS
• SLIDE
• VRDD
Ligand in Active Site Region
Ligand
Active site residues
Histidine 6; Phenylalanine 5; Tyrosine 21; Aspartic acid 91; Aspartic acid 48; Tyrosine 51; Histidine 47;
Glycine 29; Leucine 2; Glycine 31; Glycine 22; Alanine 18; Cysteine 28; Valine 20; Lysine 62
Examples of Docked structures
HIV protease inhibitors
COX2 inhibitors
Rigid Docking
• Shape-complementarity method:
find binding mode(s) without any
steric clashes
• Only 6-degrees of freedom
(translations and rotations)
• Move ligand to binding site and
monitor the decrease in the
energy
• Only non-bonded terms remain in
the energy term
• Try to find a good steric match
between ligand and receptor
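The six degrees of freedom of rigid docking are three rotation angles and three translations applied to the ligand as a whole. The sketch below applies such a rigid-body move and counts steric clashes against receptor atoms; the Z-Y-X rotation order, the 3.0 Å clash cutoff and the function names are assumptions made for illustration, not the settings of any particular docking program.

```python
import math


def rotation_matrix(rx, ry, rz):
    """Rotation matrix R = Rz(rz) * Ry(ry) * Rx(rx), angles in radians."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def transform(coords, rotation, translation):
    """Rotate, then translate, every (x, y, z) coordinate of a rigid ligand."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            row[0] * x + row[1] * y + row[2] * z + shift
            for row, shift in zip(rotation, translation)
        ))
    return moved


def count_clashes(ligand, receptor, min_distance=3.0):
    """Number of ligand-receptor atom pairs closer than min_distance (Å)."""
    return sum(
        1
        for atom_l in ligand
        for atom_r in receptor
        if math.dist(atom_l, atom_r) < min_distance
    )
```

A rigid search would loop over many (rotation, translation) combinations, discard poses where `count_clashes` is non-zero, and pass the remaining poses on to an energy-based score.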
The DOCK algorithm in rigid-ligand mode
1. Define the target binding site points.
2. Match the distances.
3. Calculate the transformation matrix for the orientation.
4. Dock the molecule.
5. Score the fit.
Flexible Docking
• Dock flexible ligands into binding pocket of rigid
protein
• Binding site broken down into regions of possible
interactions
[Figure: binding site from X-ray; H-bonds; parameterised binding site]
Need for Scoring
Detailed calculations on all possibilities would be very
expensive.
The major challenge in structure-based drug design is
to identify the best position and orientation of
the ligand in the binding site of the target.
This is done by scoring or ranking the various
possibilities, based on empirical
parameters, knowledge-based approaches or rigorous
calculations.
Caspase Dependent Programmed Cell Death in Developing
Embryos: A potential Target for Therapeutic Intervention
against Pathogenic Nematodes
For the first time, we developed and evaluated flow cytometry based assays to
assess several conserved features of apoptosis in developing embryos of a
pathogenic filarial nematode Setaria digitata, in vitro.
We validated programmed cell death in developing embryos by using immunofluorescence microscopy and scoring expression profile of nematode specific
proteins related to apoptosis [e.g. CED-3, CED-4 and CED-9].
Mechanistically, apoptotic death of embryonic stages was found to be a caspase
dependent phenomenon mediated primarily through induction of intracellular
ROS. The apoptogenicity of some pharmacological compounds, viz. DEC,
Chloroquine, Primaquine and Curcumin, was also evaluated. Curcumin was found to
be the most effective pharmacological agent followed by Primaquine while
Chloroquine displayed minimal effect and DEC had no demonstrable effect.
Further, demonstration of induction of apoptosis in embryonic stages by lipid
peroxidation products [molecules commonly associated with inflammatory
responses in filarial disease] and demonstration of in-situ apoptosis of
developing embryos in adult parasites in a natural bovine model of filariasis have
offered a framework to understand anti-fecundity host immunity
operational against parasitic helminths.
PLoS NTD, 2011
Induction of apoptosis in developing embryos of a
pathogenic nematode
[Figure: CED-4 with labelled CARD domain, α/β (P-loop) domain, helical domain and winged helix domain; cytochrome-c]
PLoS NTD, 2011
Binding efficiencies of carbohydrate ligands with different
genotypes of cholera toxin B: molecular modeling, dynamics and
docking simulation studies
J Mol Model, 2011
Molecular interaction plots between carbohydrate ligand and
genotype 1. a) Galactose b) Sialic acid c) N-acetyl galactosamine
J Mol Model, 2011
Molecular interaction plots between carbohydrate
ligand and genotype 3. a) Galactose b) Sialic acid
c) N-acetyl galactosamine
J Mol Model, 2011
Molecular interaction plots between carbohydrate
ligand and genotype 5. a) Galactose b) Sialic acid c) N-acetyl galactosamine
Molecular interaction plots between carbohydrate
ligand and genotype 6. a) Galactose b) Sialic acid c) N-acetyl galactosamine
Molecular cloning of cDNA and peptide structure
prediction of Plzf expressed in the spermatogonial
cells of Labeo rohita
• The promyelocytic leukemia zinc finger (Plzf) gene
containing evolutionary conserved BTB domain plays a
key role in self-renewal of mammalian spermatogonial
stem cells.
• Little is known about the function of plzf in
vertebrates, especially in fish species.
• plzf was cloned from the testis of Labeo rohita (rohu), a
commercially important freshwater carp. It contains a
conserved N-terminal BTB domain and C-terminal
C2H2-zinc finger motifs.
Marine Genomics, 2010
• A 3D model of the BTB domain of Plzf protein was constructed
by a homology modeling approach.
• Molecular docking on this 3D structure established a homo-dimer
between two BTB domains, creating a charged pocket containing
conserved amino acid residues: L33, C34, D35 and R49.
• Thus, Plzf of SSC is structurally and possibly functionally conserved.
The identified plzf could be the first step towards exploring its role
in rohu SSC behavior.
Marine Genomics, 2010
Thank you
• Alok Das Mohapatra, Sunil Kumar, Ashok Kumar Satapathy and Balachandran Ravindran (2011). Apoptosis in a
pathogenic nematode involves mitochondrial pathway. PLoS Neglected Tropical Diseases (In Press).
• MHU Turabe Fazil, Sunil Kumar, Rohit Farmer, HP Pandey and DV Singh (2011). Binding efficiencies of
carbohydrate ligands with different genotypes of cholera toxin B: Molecular Modeling, dynamics and Docking
Simulation studies. J Mol Model, DOI 10.1007/s00894-010-0947-6 (Springer publication).
• Biswaranjan Paital, Sunil Kumar*, Rohit Farmer, Niraj Kanti Tripathy, Gagan Bihari Nityananda Chainy (2011).
In silico prediction and characterization of 3D structure and binding properties of catalase from the commercially
important crab, Scylla serrata. Interdiscip Sci Comput Life Sci 3: 110–120 (Springer
publication). *Corresponding author.
• Chinmayee Mohapatra, Hirak Kumar Barman, Rudra Prasanna Panda, Sunil Kumar, Varsha Das, Ramya
Mohanta, Shibani Mohapatra, Pallipuram Jayasankar (2010). Cloning of cDNA and prediction of peptide structure
of plzf expressed in the spermatogonial cells of Labeo rohita. Mar. Genomics, doi:
10.1016/j.margen.2010.09.002 (Elsevier publication).
• MHU Turabe Fazil*, Sunil Kumar*, N Subbarao, H P Pandey and Durg V. Singh (2010). Homology modeling of
a sensor histidine kinase from Aeromonas hydrophila. J Mol Model, 16: 1003-1009 (Springer publication).
*Equal contribution.
• Babu A Manjasetty, Sunil Kumar, Andrew P Turnbull and Niraj Kanti Tripathy (2009). Homology Modeling
and Analysis of Human Disease Proteins: Structural Investigations of Shwachman-Bodian-Diamond Syndrome
(SBDS) model through Bioinformatics Approach. InterJRI Science and Technology, Vol. 1, Issue 2, 97-104