Advanced Tools and Algorithms
in Bioinformatics
Chittibabu Guda
Summer, 2004
UCSD Extension, Department of Biosciences
Today’s Topics
• Hidden Markov Models (HMMs)
• Predicting sub-cellular localization of proteins
• Predicting post-translation modification sites
• Using Standalone tools
• Current Trends in Bioinformatics
Hidden Markov Models
HMMs for biological sequences
• A hidden Markov model (HMM) is a statistical model that was developed
largely for speech recognition.
• The most popular use of HMMs in molecular biology is as a
‘probabilistic profile’ of a protein family, which is called a profile
HMM.
• HMMs are also used for multiple sequence alignment, gene prediction
(ORF finding), and protein structure prediction.
• Advantages: statistically sound; no sequence ordering or gap
penalties are required.
• Limitations: a large number of similar sequences is required to get
good models.
Stochastic modeling of biological sequences
For example, a profile is a position-specific scoring matrix.
• Given this model, the probability of
CGGSV is:
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031
• Since repeated multiplication of fractions is
computationally expensive and prone to
floating-point errors (underflow), the
calculation is moved into log space.
• The score is calculated by taking the logs
of all amino acid probabilities and adding
them up.
ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2)
= -3.48
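A minimal Python sketch of the calculation above, using only the five probabilities quoted on the slide. The variable names are illustrative, and `math.prod` assumes Python 3.8 or later.

```python
import math

# Per-position probabilities of the residues in CGGSV, as quoted on the slide.
probs = [0.8, 0.4, 0.8, 0.6, 0.2]

# Direct product of the probabilities.
product = math.prod(probs)

# Same quantity in log space: sum of natural logs instead of a product.
log_score = sum(math.log(p) for p in probs)

print(round(product, 3))    # 0.031
print(round(log_score, 2))  # -3.48
```

Exponentiating `log_score` recovers the product, so no information is lost by working with sums of logs.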
Stochastic modeling of biological sequences
But with this expression it is not possible to distinguish between the highly
implausible sequence TGCT- - AGG and the consensus sequence ACAC - - ATC
The HMM architecture
• S-start; E-end
• m- main state (matches/mismatches)
• i - insert state
• d - delete state
[Figure: HMM architecture diagram with an example DNA multiple alignment]
Parameters used in HMM building
• Transition probability: Tij (average 0.333)
• Emission probability: Ei (average 0.05)
[Figure: example protein alignment annotated with HMM states: i (insert), m (match), d (delete)]
• Since the probabilities are very small numbers, they are converted to log-odds
scores, which are added to get the overall score
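A small sketch of a log-odds score, under stated assumptions: the background (null) model is taken to be the uniform average emission probability of 0.05 quoted above, natural logs are used, and the five emission probabilities are reused from the CGGSV example. The function name is hypothetical, and real tools use their own null models and log bases.

```python
import math

# Assumed background: the average emission probability from the slide
# (1 of 20 amino acids, uniform). Real null models are usually not uniform.
BACKGROUND = 0.05

def log_odds_score(emission_probs, background=BACKGROUND):
    """Sum of ln(p / background) over all emitted residues."""
    return sum(math.log(p / background) for p in emission_probs)

# Emission probabilities reused from the CGGSV example.
score = log_odds_score([0.8, 0.4, 0.8, 0.6, 0.2])
print(round(score, 2))
```

A positive score means the sequence is more probable under the model than under the background; a sequence that looks exactly like background scores zero.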
Markov modeling of biological sequences
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Markov modeling of biological sequences
Sequence      P(s)*100
ACA---ATG     3.3
TCAACTATC     0.0075
ACAC--AGC     1.2
AGA---ATC     3.3
ACCG--ATC     0.59
ACAC--ATC     4.7   (consensus)
P(ACACATC) = 0.047, obtained by taking the product of the probabilities
for the residues in each state and for the transitions between states.
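The product described above can be sketched as follows. The emission and transition values here are assumptions chosen for illustration so that the result lands near the quoted 0.047; the model's actual parameters appear only in the slide figure, which this transcript does not reproduce.

```python
import math

# Each step: (emission probability of the observed residue,
#             probability of the transition taken afterwards).
# All numbers are illustrative assumptions, not values from the transcript.
path = [
    (0.8, 1.0),  # A in a main state
    (0.8, 1.0),  # C in a main state
    (0.8, 0.6),  # A in a main state, then enter the insert state
    (0.4, 0.6),  # C in the insert state, then leave it
    (1.0, 1.0),  # A in a main state
    (0.8, 1.0),  # T in a main state
    (0.8, 1.0),  # C in a main state, then end
]

# Probability of the sequence along this path: product of every
# emission and every transition probability.
p_sequence = math.prod(e * t for e, t in path)
print(round(p_sequence, 3))  # 0.047
```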
Sequence Alignment and Database Search using HMMER
• Multiple Alignment
• Build a Profile HMM
• Database search
• Query against Profile HMM database (PFAM database)
• Multiple alignments
HMMSEARCH Results
(on voltage-gated ion channel proteins database)
PFAM http://pfam.wustl.edu
• Protein Family Database created using HMMs
• Pfam-A contains functionally annotated families (~7500)
• Pfam-B contains unannotated families (~107000)
• All protein sequences were clustered into families based on sequence
identity
• For each family, non-redundant, full-domain seed members were
selected to represent the family
• Seed multiple alignments were built using ClustalW and manual
checking
• HMMs were built using hmmbuild (part of the HMMER suite of
programs)
• Using these models, more family members were added iteratively:
new members are added to the multiple alignment and the HMM is
updated until no more new members are found
How to build and use Profile HMMs
• Get a family of seed sequences in multiple alignment
• Build a Hidden Markov Model using hmmbuild
• Use HMM as a query to find remote homologues in the
sequence database using hmmsearch
• Add new sequences to the seed alignment using hmmalign
and update the model, iteratively
• Get the consensus sequence of the model using hmmemit
• Search new query sequences against the HMM (or a database of
HMMs) using hmmpfam to find whether they are related to the model
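The steps above can be sketched as a list of command lines. This is a hedged outline rather than a tested pipeline: all file names are hypothetical, only the positional arguments are shown, and every option is omitted (for example, emitting a consensus rather than sampled sequences needs an extra hmmemit option; see the HMMER user guide for the installed version). Nothing is executed.

```python
import shlex

def hmmer_workflow(seed_aln, hmm_file, seq_db, new_seqs, hmm_db, query_seqs):
    """Return one command line per step of the workflow; nothing is run."""
    return [
        ["hmmbuild", hmm_file, seed_aln],    # build a profile HMM from the seed alignment
        ["hmmsearch", hmm_file, seq_db],     # search a sequence database with the HMM
        ["hmmalign", hmm_file, new_seqs],    # align new sequences to the HMM
        ["hmmemit", hmm_file],               # emit sequences from the HMM
        ["hmmpfam", hmm_db, query_seqs],     # search query sequences against an HMM database
    ]

# Hypothetical file names, for illustration only.
commands = hmmer_workflow(
    "seed.aln", "family.hmm", "proteins.fasta",
    "new_members.fasta", "Pfam_db", "queries.fasta",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping each command as a list of arguments makes it straightforward to pass to `subprocess.run` later without shell quoting problems.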
SledgeHMMER web server
• Accessible at
http://SledgeHMMER.sdsc.edu
• Pfam database is the largest
protein functional domain database
built by Hidden Markov Models
• This server provides quick access
to pre-calculated Pfam results for
1.2 million (entire SP+TrEMBL
databases) protein sequences
• Query sequences are matched to the
pre-calculated entries by comparing
MD5 hexadecimal digests (computed in Perl)
• Web server is implemented with a
Perl/CGI interface
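A minimal sketch of digest-based lookup, written in Python rather than the server's Perl. The normalization step, the function names, the sequence, and the stored result are all assumptions for illustration; the server's actual scheme is not described beyond the use of MD5 hexadecimal hashes.

```python
import hashlib

def sequence_key(seq):
    """MD5 hex digest of a sequence after removing whitespace and upper-casing."""
    normalized = "".join(seq.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()

# Hypothetical table of pre-calculated results, keyed by digest.
precalculated = {
    sequence_key("MKTAYIAKQR"): ["example_domain"],
}

def lookup(seq):
    """Return stored results if an identical sequence was pre-calculated, else None."""
    return precalculated.get(sequence_key(seq))

print(lookup("mktay iakqr"))  # same sequence, different formatting
print(lookup("AAAA"))         # not pre-calculated
```

Comparing fixed-length digests avoids comparing full sequences, which is why identical queries can be answered quickly from stored results.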
Predicting sub-cellular localization
of proteins
Different cellular compartments
(modified from Voet & Voet, Biochemistry; Weinheim, New York, Basel, Wiley-VCH 1992)
Methods to predict sub-cellular location
• Based on amino acid composition
• Based on signal or target peptides
• PSORT
• TargetP
• Based on domain occurrence patterns
• MITOPRED
• Based on lexical analysis
Amino acid compositional differences in different sub-cellular locations
PSORT (http://psort.ims.u-tokyo.ac.jp/)
• PSORT program works based on a comprehensive knowledge of protein sorting
• Different parameters relevant to different groups of species are determined
• Bacterial sequences
• N-terminal signal sequence (Positive - H region)/cleavage site
• Transmembrane segments
• Lipoprotein Analysis
• Amino Acid composition
PSORT continued …
• Eukaryotic sequences (Yeast/Animal/Plant)
• N-terminal signal sequence (Positive-H region)/cleavage site
• Transmembrane segments and Membrane topology
• Mitochondrial targeting signals and amino acid composition (AAC) of the N-terminal 20 amino acids
• Nuclear localization signals (NLS)
• Peroxisome matrix targeting sequences (PTSs): (S/A/C)(K/R/H)L
• Chloroplast targeting signals
• Endoplasmic reticulum signals (KDEL, or HDEL in yeast)
• Vesicular, lysosomal, vacuolar proteins, etc.
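Two of the signals listed above are short enough to express as regular expressions. The sketch below assumes both motifs are anchored at the C-terminus (the slide gives the patterns but not their position), and the function name and labels are hypothetical. A match is only a candidate; PSORT combines many more features.

```python
import re

# (S/A/C)(K/R/H)L at the very end of the sequence (assumed C-terminal).
PTS_PATTERN = re.compile(r"[SAC][KRH]L$")
# KDEL or HDEL at the very end of the sequence (assumed C-terminal).
ER_PATTERN = re.compile(r"[KH]DEL$")

def c_terminal_signals(seq):
    """Return labels for the C-terminal motifs found in a protein sequence."""
    seq = seq.upper()
    hits = []
    if PTS_PATTERN.search(seq):
        hits.append("peroxisome (PTS)")
    if ER_PATTERN.search(seq):
        hits.append("ER signal")
    return hits

print(c_terminal_signals("MAAAASKL"))  # ['peroxisome (PTS)']
print(c_terminal_signals("MKKKDEL"))   # ['ER signal']
print(c_terminal_signals("MSKLA"))     # []
```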
MITOPRED (http://mitopred.sdsc.edu)
• A new method based on Pfam domain occurrence patterns, amino acid
composition (AAC), and pI value differences between mitochondrial and non-mitochondrial proteins
• Eukaryotic cells have multiple compartments, and a given set of pathways is
localized to a specific compartment. Thus, a protein family involved in a specific
pathway is expected in a specific compartment
• A knowledge base is developed by studying the occurrence and co-occurrence
patterns of different Pfam domains in different cellular compartments
• The method compares the Pfam domains found in the query sequence against
the knowledge base and assigns a score, depending on the compartment to which
they belong
• Independent scores are calculated from the AAC and pI value of the query
sequence by comparing them to the average values in different locations
• The final prediction is based on the combined AAC, pI and Pfam scores
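The AAC comparison can be illustrated with a short sketch. It computes the amino acid composition of a query and its distance to a reference composition. The distance measure (sum of absolute differences), the uniform reference, and the function names are assumptions for illustration; they are not MITOPRED's actual scoring scheme or data.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Relative frequency of each of the 20 standard amino acids in seq."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    return {aa: (residues.count(aa) / n if n else 0.0) for aa in AMINO_ACIDS}

def composition_distance(comp, reference):
    """Sum of absolute frequency differences (hypothetical distance measure)."""
    return sum(abs(comp[aa] - reference[aa]) for aa in AMINO_ACIDS)

# Hypothetical reference: a uniform composition standing in for the
# average composition of one cellular location.
uniform_reference = {aa: 0.05 for aa in AMINO_ACIDS}

query_comp = aa_composition("AAKL")
print(query_comp["A"])                                    # 0.5
print(round(composition_distance(query_comp, uniform_reference), 2))
```

In practice one reference composition per location would be estimated from annotated proteins, and the query assigned a score for each location from its distance to that reference.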
Comparison of AA composition across mitochondrial and cytoplasmic sequences
[Figure: bar chart of relative frequencies (0 to 0.1) of each residue (A to Y) in M-sol and C-sol sequences, marking residues that are more frequent in cytoplasmic or in mitochondrial sequences]
pI value differences in different sub-cellular locations
[Figure: pI value (0 to 10) by cellular location: CYT, MIT, NUC, END, EXC, GOL, PLA, POX]
Flowchart showing MITOPRED procedure
MITOPRED Web Server
• Accessible at
http://mitopred.sdsc.edu
• Implemented using a Perl/CGI
interface
• Pre-calculated predictions are
available for all eukaryotic proteins
from the Swiss-Prot and TrEMBL databases
(~500000)
• Genome-scale predictions can be
downloaded for yeast, C. elegans,
Drosophila, human, mouse and
Arabidopsis species
• Provides data for the Mitoproteome
database accessible at
http://www.mitoproteome.org
Prediction of sub-cellular location by lexical analysis
• Separate Swiss-Prot (SP) proteins into different sub-cellular classes based on annotation
• In each class, extract all unique keywords for each sequence
• The total number of keywords in all classes is equal to the size of the feature space (N)
• Generate a binary vector of length N for each sequence in each class:
1 if the keyword is present and 0 if it is absent
• For the unknown protein, generate a binary vector in the same way, based on its
keywords. From this, generate 2^k - 1 sub-vectors (where k is the
number of keywords in the unknown) by flipping the 1s to 0s
• Based on the sub-vectors, retrieve all proteins with matching binary vectors from
all classes
• The unknown is assigned to the class that contributes the largest number of sequences to
the retrieved group
• The method works better when more keywords are available and the
family is larger
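The procedure above can be sketched on toy data. Keyword sets stand in for the binary vectors (a set lists the positions that hold a 1), and "matching" is assumed to mean that a protein's keyword set equals one of the sub-vectors. That interpretation, the function names, the keywords, and the class labels are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def sub_vectors(keywords):
    """Yield all 2^k - 1 non-empty subsets of the unknown's k keywords."""
    kws = sorted(keywords)
    for size in range(1, len(kws) + 1):
        for combo in combinations(kws, size):
            yield frozenset(combo)

def predict_class(unknown_keywords, annotated):
    """Vote among annotated proteins whose keyword set equals some sub-vector.

    annotated: list of (keyword_set, class_label) pairs.
    Returns the class with the most retrieved proteins, or None if none match.
    """
    subs = set(sub_vectors(unknown_keywords))
    votes = Counter(label for kws, label in annotated if frozenset(kws) in subs)
    return votes.most_common(1)[0][0] if votes else None

# Toy annotated set with hypothetical keywords and class labels.
annotated = [
    ({"Mitochondrion", "Transit peptide"}, "MIT"),
    ({"Mitochondrion"}, "MIT"),
    ({"Nuclear protein"}, "NUC"),
    ({"Nuclear protein", "DNA-binding"}, "NUC"),
]
unknown = {"Mitochondrion", "Transit peptide", "Nuclear protein"}

print(len(list(sub_vectors(unknown))))    # 7
print(predict_class(unknown, annotated))  # MIT
```

The number of sub-vectors grows as 2^k, so this brute-force enumeration is only practical when the unknown carries a modest number of keywords.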
Flow diagram of lexical analysis method
(From Nair R, Rost B, Bioinformatics 18:S78-S86, 2002)
Predicting Post-translational
Modification Sites of Proteins
General Method for PTM site Prediction
• PROSITE provides consensus patterns for many PTM sites; however, in most
cases these patterns are very short, and true modifications depend on
the structural or environmental context in the protein fold
• For this reason, methods based on regular expressions or local alignment
produce a large number of false positives
• Almost all PTM site prediction methods use artificial neural networks
(ANNs)
• General procedure:
• Prepare a dataset of sequences experimentally known to possess a type of PTM site
• Separate the dataset into training and testing data
• Train a network using the training data and test it with the test set. This
process is iterated until the model is well refined
• A sufficient number of training sequences and good-quality data are important
for the success of any neural network method
Different Post-translational modifications (PTMs)
• Glycosylation
• ASN(N)-glycosylation (NetNGlyc)
• O-glycosylation (NetOGlyc)
• Sulfation (Sulfinator)
• Phosphorylation (NetPhos)
• Myristoylation (NMT)
Prediction of Glycosylation Sites (NetNGlyc, NetOGlyc)
• Glycoproteins are formed by covalent attachment of oligosaccharides to
certain proteins at Asn (N-glycosylation) or Ser or Thr
(O-glycosylation) residues.
• These are usually exported to extra-cellular destinations; examples are mucin in
the alimentary tract and glycoprotein hormones in the anterior pituitary gland.
• N-glycosylation
• O-glycosylation
• No consensus pattern
• SEA domain is associated with it
Prediction of Sulfation Sites
• Protein tyrosine sulfation is an important post-translational modification for
proteins that go through the secretory pathway. It regulates several protein-protein interactions and modulates the binding affinity of TM peptide receptors
• Based on the rules described above, HMMs could be trained to build models
for predicting protein sequences with patterns that abide by these rules
Sulfinator Algorithm (http://us.expasy.org/tools/sulfinator/)
• Sulfinator employs four different HMMs to recognize sulfation sites that are N-terminal (HMM-N), internal (HMM-I), C-terminal (HMM-C) and in Y-clusters (HMM-Y)
Prediction of Phosphorylation Sites
(NetPhos, http://www.cbs.dtu.dk/services/NetPhos/)
• Protein kinases, a very large family of enzymes, catalyze phosphorylation
• NetPhos produces neural network predictions for serine (S), threonine
(T) or tyrosine (Y) phosphorylation sites in eukaryotic proteins that affect
a multitude of cellular signaling processes
• Y-kinase phosphorylation
• S or T phosphorylation (Casein Kinase II)
• Since these are very short patterns, the amino acids surrounding a
phosphorylated residue are significant in determining whether a particular
site is phosphorylated or not
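Because the surrounding residues matter, a predictor typically looks at a window of sequence around each candidate site. The sketch below extracts such windows for every S, T and Y. The window size, the padding character, and the function name are assumptions for illustration and are not NetPhos parameters.

```python
def candidate_windows(seq, flank=4, pad="X"):
    """Return (position, residue, window) for every S, T or Y in seq.

    position is 1-based; window holds the residue plus `flank` residues on
    each side, padded with `pad` where the window runs past either end.
    """
    seq = seq.upper()
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue in "STY":
            # In the padded string the residue sits at index i + flank,
            # so the window starts at index i.
            windows.append((i + 1, residue, padded[i:i + 2 * flank + 1]))
    return windows

for site in candidate_windows("MASKY", flank=2):
    print(site)
# (3, 'S', 'MASKY')
# (5, 'Y', 'SKYXX')
```

Each window could then be encoded numerically and passed to a trained classifier, which decides whether the central residue is likely to be phosphorylated.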
Standalone Tools
Local Installation of tools and databases
• NCBI-Toolkit
• Formatting and using BLAST
• CD-HIT
• CLUSTALW
• HMMER package
Current Trends in Bioinformatics
Reductionistic Approach
Components Biology
Cell, Structure, Function
Genomics, Transcriptomics, Proteomics, Metabolomics
Integrative Approach
Systems Biology
Highway network system in San Antonio