Various Career Options Available
Role of Computer and Information Science in
Biology
Presented By
Dr G. P. S. Raghava
Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India
&
Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea
Email: [email protected]
Web: http://www.imtech.res.in/raghava/
Major Applications & Challenges
Introduction to Biology
Genome Annotation: Gene Prediction
Analysis and Comparison of Sequences
Protein Structure Prediction
DNA Chip (Microarray) technology
Proteomics: Analysis of 2D gel
Fingerprinting Technique
Drug development
Computer-Aided Vaccine Design
Hierarchy in Biology
Atoms
Molecules
Macromolecules
Organelles
Cells
Tissues
Organs
Organ Systems
Individual Organisms
Populations
Communities
Ecosystems
Biosphere
Animal cell
Human Chromosomes
Genes are linearly arranged along chromosomes
Chromosomes and DNA
DNA can be simplified to a
string of four letters
GATTACA
(RT)
Sequence to Structure:
It’s a matter of dimensions!
1D Nucleic acid sequence
AGT-TTC-CCA-GGG…
1D Protein sequence
Met-Ala-Gly-Lys-His…
M – A – G – K – H…
3D Spatial arrangement of atoms
Genome Annotation
The Process of Adding Biology Information and
Predictions to a Sequenced Genome Framework
Importance of Sequence Comparison
Protein Structure Prediction
– Similar sequences have similar structure & function
– Phylogenetic Tree
– Homology based protein structure prediction
Genome Annotation
– Homology based gene prediction
– Function assignment & evolutionary studies
Searching drug targets
– Searching for sequences present or absent across genomes
Protein Sequence Alignment and Database
Searching
Alignment
of Two Sequences (Pair-wise Alignment)
– The Scoring Schemes or Weight Matrices
– Techniques of Alignments
– DOTPLOT
Multiple Sequence Alignment (Alignment of > 2 Sequences)
–Extending Dynamic Programming to more sequences
–Progressive Alignment (Tree or Hierarchical Methods)
–Iterative Techniques
Stochastic Algorithms (SA, GA, HMM)
Non Stochastic Algorithms
Database Scanning
– FASTA, BLAST, PSIBLAST, ISS
Alignment of Whole Genomes
– MUMmer (Maximal Unique Match)
Alignment of Two Sequences
Dealing with Gaps in Pair-wise Alignment
Sequence Comparison without Gaps
Sliding window method to get the maximum score
ALGAWDE
ALATWDE
Total score = 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5*100)/7 ≈ 71.4
Sequences of different lengths should be aligned with dynamic programming
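The ungapped comparison above can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function names `percent_identity` and `best_ungapped_offset` are hypothetical, and the score is a plain identity count (1 for a match, 0 otherwise).

```python
from typing import Tuple


def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity (PID) of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def best_ungapped_offset(a: str, b: str) -> Tuple[int, int]:
    """Slide b along a; return (offset, matches) with the most identities."""
    best = (0, -1)
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, ch in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == ch
        )
        if matches > best[1]:  # keep the first offset that reaches the maximum
            best = (offset, matches)
    return best


print(round(percent_identity("ALGAWDE", "ALATWDE"), 1))
print(best_ungapped_offset("ALGAWDE", "ALATWDE"))
```

For the slide's pair, the best window is the zero offset with 5 identities out of 7.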
Sequence Comparison with Gaps
•Insertions and deletions are common
•Sliding window method fails
•Generate all possible alignments
•A 100-residue alignment requires > 10^75 possible alignments
Alternate Dot Matrix Plot
Diagonal * shows aligned/identical regions
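A dot matrix plot is simple to build: put one sequence on each axis and mark every cell where the residues are identical. The sketch below is a minimal illustration under that assumption (no window or threshold filtering); `dotplot` and `render` are hypothetical names.

```python
def dotplot(a: str, b: str):
    """Return a matrix with '*' where a[i] == b[j] and '.' elsewhere."""
    return [["*" if x == y else "." for y in b] for x in a]


def render(a: str, b: str) -> str:
    """Format the dot matrix with b along the top and a down the side."""
    rows = ["  " + " ".join(b)]
    for ch, row in zip(a, dotplot(a, b)):
        rows.append(ch + " " + " ".join(row))
    return "\n".join(rows)


print(render("ALGAWDE", "ALATWDE"))
```

Runs of `*` along a diagonal correspond to aligned identical regions; a shift from one diagonal to another suggests an insertion or deletion.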
Dynamic Programming
Dynamic Programming allows Optimal Alignment between two sequences
Allows Insertion and Deletion or Alignment with gaps
Needleman and Wunsch Algorithm (1970) for global alignment
Smith & Waterman Algorithm (1981) for local alignment
Important Steps
– Create DOTPLOT between two sequences
– Compute SUM matrix
– Trace Optimal Path
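The steps above (score matrix, summed matrix, traceback) can be sketched as a small global aligner in the Needleman and Wunsch style. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration; the slides do not state a scoring scheme, and real tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace one optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ALGAWDE", "ALATWDE")
print(total)
print(top)
print(bottom)
```

With these assumed scores, allowing gaps recovers six identities for the pair ALGAWDE / ALATWDE, one more than the ungapped comparison. When several alignments tie, the traceback returns only one of them.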
Alignment of Multiple Sequences
Extending Dynamic Programming to more sequences
–Dynamic programming can be extended to more than two sequences
–In practice it requires too much CPU time and memory (Murata et al., 1985)
– MSA, limited to 8-10 sequences (1989)
–DCA (Divide and Conquer; Stoye et al., 1997), 20-25 sequences
–OMA (Optimal Multiple Alignment; Reinert et al., 2000)
–COSA (Althaus et al., 2002)
Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
–Practical approach for multiple alignment
–Compare all sequences pairwise
–Perform cluster analysis
–Generate a hierarchy for alignment
–First align the most similar pair of sequences
–Align each alignment with the next most similar alignment or sequence
Database scanning
Basic principles of Database searching
– Search query sequence against all sequences in the database
– Calculate score and select top sequences
– Dynamic programming is the most accurate, but slow
Approximation Algorithms
FASTA
Fast sequence search
Based on dotplot
Identify identical words (k-tuples)
Search significant diagonals
Use PAM 250 for further refinement
Dynamic programming for narrow region
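The first two FASTA steps (identical words, then significant diagonals) can be illustrated with a word index and a diagonal counter. This is a rough sketch of the idea only, not the FASTA program itself; `ktuple_diagonals` is a hypothetical name, and the later steps (PAM 250 rescoring and banded dynamic programming) are omitted.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query: str, target: str, k: int = 2) -> Counter:
    """Count shared k-tuples per diagonal (target position - query position)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)  # where each word occurs in the query
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1  # same diagonal = same relative offset
    return diagonals


hits = ktuple_diagonals("ALGAWDE", "XXALGAWDE", k=2)
print(hits.most_common(1))
```

The diagonal with the most word hits marks the region where a narrow-band dynamic programming step would then be run.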
Principles of FASTA Algorithms
Database Scanning or Fold Recognition
Concept of PSIBLAST
– Improve the sensitivity of BLAST
– Perform the BLAST search (gap handling)
– Generate the position-specific score matrix (PSSM)
– Use PSSM for next round of search
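To make the PSSM step concrete, here is a heavily simplified sketch that turns a set of already-aligned hits into log-odds scores per position. It assumes a uniform background frequency and a flat pseudocount, and it ignores sequence weighting; PSI-BLAST itself is considerably more elaborate. The names `build_pssm` and `score_peptide` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Per-column log2-odds scores from equal-length aligned sequences."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_peptide(pssm, peptide):
    """Sum the position-specific scores of a peptide of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


pssm = build_pssm(["ALGAWDE", "ALATWDE", "ALGTWDE"])
print(round(score_peptide(pssm, "ALGAWDE"), 2))
```

In the next search round the matrix replaces the single query sequence, so positions that are conserved among the hits weigh more heavily.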
Intermediate Sequence Search
– Search query against protein database
– Generate multiple alignment or profile
– Use profile to search against PDB
Comparison of Whole Genomes
MUMmer (Salzberg group, 1999, 2002)
– Pair-wise sequence alignment of genomes
– Assumes that the sequences are closely related
– Allows detection of repeats, inverse repeats and SNPs
– Domain inserted/deleted
– Identify the exact matches
How it works
– Identify the maximal unique matches (MUMs) in two genomes
– Because the two genomes are similar, large MUMs will be present
– Sort the MUMs found and extract the longest set of matches that occur in the same order (Ordered MUM)
– Suffix tree was used to identify MUMs
– Close the gaps by SNPs, large inserts
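A brute-force sketch can show what a maximal unique match is. It is for illustration on short strings only: MUMmer uses a suffix tree, whereas this version compares every pair of positions and is far too slow for genomes. The function names and the `min_len` cutoff are assumptions.

```python
def count_overlapping(text: str, word: str) -> int:
    """Number of (possibly overlapping) occurrences of word in text."""
    return sum(
        1 for i in range(len(text) - len(word) + 1) if text.startswith(word, i)
    )


def maximal_unique_matches(a: str, b: str, min_len: int = 4):
    """Return (pos_in_a, pos_in_b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # could be extended to the left, so not maximal here
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1  # extend to the right as far as possible
            if length < min_len:
                continue
            word = a[i:i + length]
            if count_overlapping(a, word) == 1 and count_overlapping(b, word) == 1:
                mums.append((i, j, length))
    return mums


print(maximal_unique_matches("CCGATTACAGG", "TTGATTACATT"))
```

The ordered-MUM step would then keep the longest chain of these matches whose positions increase in both sequences.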
Protein Structure Prediction
Experimental Techniques
– X-ray Crystallography
– NMR
Limitations of Current Experimental
Techniques
– Protein DataBank (PDB) -> 24000 protein structures
– SwissProt -> 100,000 proteins
– Non-Redundant (NR) -> 1,000,000 proteins
Importance of Structure Prediction
– Fill gap between known sequence and structures
– Protein engineering to alter the function of a protein
– Rational Drug Design
Protein Structures
Techniques of Structure Prediction
Computer simulation based on energy calculation
– Based on physico-chemical principles
– Thermodynamic equilibrium with a minimum free energy
– Global minimum free energy of protein surface
Knowledge Based approaches
– Homology Based Approach
– Threading Protein Sequence
– Hierarchical Methods
Energy Minimization Techniques
Energy minimization based methods, in their pure form, make no a priori assumptions and attempt to locate the global minimum.
Static Minimization Methods
– Many classical potentials can be constructed
– Assume that the atoms in the protein are static
– Problems (large number of variables and minima, and validity of potentials)
Dynamical Minimization Methods
– Motions of atoms are also considered
– Monte Carlo simulation (stochastic in nature, time is not considered)
– Molecular Dynamics (time, quantum mechanical, classical equations)
Limitations
– Large number of degrees of freedom, CPU power not adequate
– Interaction potentials are not good enough for modelling
Knowledge Based Approaches
Homology Modelling
– Need homologues of known protein structure
– Backbone modelling
– Side chain modelling
– Fail in absence of homology
Threading Based Methods
– New way of fold recognition
– Sequence is tried to fit in known structures
– Motif recognition
– Loop & Side chain modelling
– Fail in absence of known example
Hierarchical Methods
Intermediate structures are predicted, instead of predicting the tertiary structure of a protein directly from its amino acid sequence
Prediction of backbone structure
– Secondary structure (helix, sheet,coil)
– Beta Turn Prediction
– Super-secondary structure
Tertiary structure prediction
Limitation
Accuracy is only 75-80 %
Only three state prediction
[Figure: cDNA microarray workflow. cDNA clones (probes) -> PCR product amplification and purification -> printing onto the microarray (0.1 nl/spot) -> hybridise target (mRNA) to microarray -> scanning with laser 1 and laser 2 (excitation, emission) -> overlay images and normalise -> analysis]
Major Applications
Identification of differentially
expressed genes in diseased tissues
(in presence of drug)
Classification of differentially
expressed (genes) or clustering/
grouping of genes having similar
behaviour in different conditions
Use expression profiles of known diseases to diagnose and classify unknown genes
Terms/Jargons
Stanford/cDNA chip                         | Affymetrix/oligo chip
one slide/experiment                       | one chip/experiment
one spot                                   | one probe/feature/cell
1 gene => one spot or few spots (replica)  | 1 gene => many probes (20~25 mers)
control: control spots                     | control: match and mismatch cells
control: two fluorescent dyes (Cy3/Cy5)    |
Images : examples
Pseudo-colour overlay (Cy3, Cy5)
Spot colour | Signal strength     | Gene expression
yellow      | Control = perturbed | unchanged
red         | Control < perturbed | induced
green       | Control > perturbed | repressed
Processing of images
Addressing or gridding
– Assigning coordinates to each of the spots
Segmentation
– Classification of pixels either as foreground or as
background
Intensity determination for each spot
– Foreground fluorescence intensity pairs (R, G)
– Background intensities
– Quality measures
Management of Microarray Data
Magnitude of Data
– Experiments
50 000 genes in human
320 cell types
2000 compounds
3 time points
2 concentrations
2 replicates
– Data Volume
4*10^11 data-points
10^15 bytes = 1 petabyte of data
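The data-point figure follows from multiplying the experimental factors listed above. The short check below uses only the slide's numbers; the bytes-per-data-point value is merely what the two quoted totals imply when divided, since the slides do not say how the petabyte estimate was obtained.

```python
from math import prod

# Experimental factors as listed on the slide.
EXPERIMENT_FACTORS = {
    "genes": 50_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

data_points = prod(EXPERIMENT_FACTORS.values())
print(f"{data_points:.2e} data points")  # roughly 4*10^11

# Implied storage per data point if the total really is 10^15 bytes
# (an inference from the two slide figures, not a stated number).
bytes_per_point = 10**15 / data_points
print(round(bytes_per_point), "bytes per data point")
```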
Management of Microarray
Data
Major Issues
Large volume of microarray data in last few years
– Storage and efficient access
– Comparison and integration of data
Problem of data access and exchange
– Data scattered around Internet
– Supplementary material of publications
– Difficult for users to access relevant data
Problems with existing databases
– Diverse purpose
– Developed for specific purpose
Management of Microarray
Data
Specific Database
– Platform (eg.Stanford MA Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)
Problem with Supplement and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data
Pre-processed cDNA Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
                 Slides
Genes   slide 1  slide 2  slide 3  slide 4  slide 5  …
1         0.46     0.30     0.80     1.51     0.90   ...
2        -0.10     0.49     0.24     0.06     0.46   ...
3         0.15     0.74     0.04     0.10     0.20   ...
4        -0.45    -1.03    -0.79    -0.56    -0.32   ...
5        -0.06     1.06     1.35     1.09    -1.09   ...
Gene expression level of gene 5 in slide 4
=
Log2( Red intensity / Green intensity)
These values are conventionally displayed
on a red (>0) yellow (0) green (<0) scale.
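The expression value and its display colour can be computed directly from the two channel intensities. The sketch below is a minimal illustration; `log_ratio` and `display_colour` are hypothetical names, and the intensities in the example are made up.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Expression level as log2(red intensity / green intensity)."""
    return math.log2(red / green)


def display_colour(value: float) -> str:
    """Conventional colour: red (>0), yellow (0), green (<0)."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "yellow"


for red, green in [(800.0, 200.0), (100.0, 100.0), (100.0, 400.0)]:
    m = log_ratio(red, green)
    print(red, green, m, display_colour(m))
```

Taking the logarithm makes induction and repression symmetric: a fourfold increase gives +2 and a fourfold decrease gives -2.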
Analysis of Microarray Data
Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of Background Noise
– Global/local Normalization
– House keeping genes (or same gene)
– Expression as log ratio (test/reference)
Differential Gene expression
– Repeat experiments and calculate significance (t-test)
– Significance of fold change assessed with statistical methods
Clustering
– Supervised/Unsupervised (Hierarchical, K-means,
SOM)
Prediction or Supervised Machine Learning (SVM)
Normalization Techniques
Global normalization
– Divide channel value by means
Control spots
– Common spots in both channels
– House keeping genes
– Ratio of intensity of same gene in two channel is used for
correction
Iterative linear regression
Parametric nonlinear normalization
– log(CY3/CY5) vs log(CY5)
– Fitted log ratio – observed log ratio
General Non Linear Normalization
– LOESS
– curve between log(R/G) vs log(sqrt(R.G))
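Two of the simpler steps can be sketched with the standard library: global normalization (dividing a channel by its mean) and the transformation to log(R/G) versus log(sqrt(R*G)) that a LOESS fit would be applied to. The LOESS fit itself is not implemented here. The function names are hypothetical and base-2 logarithms are an assumption.

```python
import math
from statistics import mean


def global_normalize(channel):
    """Divide every intensity in a channel by the channel mean."""
    centre = mean(channel)
    return [value / centre for value in channel]


def ma_values(red, green):
    """Per spot: (log2(R/G), log2(sqrt(R*G))), the axes used for LOESS."""
    return [
        (math.log2(r / g), math.log2(math.sqrt(r * g)))
        for r, g in zip(red, green)
    ]


red = global_normalize([100.0, 200.0, 300.0])
green = global_normalize([50.0, 100.0, 150.0])
print(ma_values(red, green))
```

After global normalization both channels have mean 1, so a uniform difference in dye brightness no longer shows up as a log ratio.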
Classification
Task: assign objects to classes (groups) on
the basis of measurements made on the
objects
Unsupervised: classes unknown, want to
discover them from the data (cluster
analysis)
Supervised: classes are predefined, want to
use a (training or learning) set of labeled
objects to form a classifier for classification
of future observations
Issues in Clustering
Pre-processing (Image analysis and
Normalization)
Which genes (variables) are used
Which samples are used
Which distance measure is used
Which algorithm is applied
How to decide the number of clusters K
Unsupervised Learning
Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [It does not answer the question of “how many gene clusters are there?”]
K-means clustering: assuming there are K clusters. [What if this assumption is incorrect?]
Model-based clustering: the number of clusters is determined dynamically. [Could be one of the most promising methods.]
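As an illustration of the K-means idea, the sketch below clusters a few expression profiles with Euclidean distance. It assumes K is given, seeds the centroids with the first K profiles (a simplification; real implementations use random or smarter seeding), and uses made-up data. The function name `kmeans` is hypothetical.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means; returns (cluster index per point, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k as seeds
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each profile to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


profiles = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centres = kmeans(profiles, k=2)
print(labels)
```

The result depends on K and on the seeds, which is exactly the concern raised in the bracketed remark about the K-clusters assumption.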
Supervised Analysis
Fisher’s linear discriminant
analysis
Quadratic discriminant analysis
Logistic regression (a linear
discriminant analysis)
Neural networks
Support vector machine
Traditional Proteomics
1D gel electrophoresis (SDS-PAGE)
2D gel electrophoresis
Protein Chips
– Chips coated with proteins/Antibodies
– large scale version of ELISA
Mass Spectrometry
– MALDI: Mass fingerprinting
– Electrospray and tandem mass
spectrometry
Sequencing of Peptides (N->C)
Matching in Genome/Proteome Databases
Overview of 2D Gel
SDS-PAGE + Isoelectric focusing (IEF)
– Gene Expression Studies
– Medical Applications
– Sample Experiments
Capturing and Analyzing Data
– Image Acquisition
– Image Sizing & Orientation
– Spot Identification
– Matching and Analysis
Comparison/Matching of Gel Images
Compare 2 gel images
– Set X and Y axes
– Overlap matching spots
– Compare intensity of spots
Scan against database
– Compare query gel with all gels
– Calculate similarity score
– Sort based on score
Proteomics:
Fingerprints of
Disease
Normal Cells
Disease Cells
Phenotypic
Changes
•Differential protein expression
• Protein nitration patterns
•Altered phosphorylation
•Altered glycosylation profiles
Utility
•Target discovery
•Disease pathways
•Disease biomarkers
Fingerprinting Technique
What is fingerprinting
– It is a technique to create a specific pattern for a given organism/person
– To compare the patterns of query and target objects
– To create a phylogenetic tree/classification based on patterns
Type of Fingerprinting
– DNA Fingerprinting
– Mass/peptide fingerprinting
– Properties based (Toxicity, classification)
– Domain/conserved pattern fingerprinting
Common Applications
– Paternity and Maternity
– Criminal Identification and Forensics
– Personal Identification
– Classification/Identification of organisms
– Classification of cells
Fingerprinting Techniques: Principles & Applications
What is fingerprinting
Type of Fingerprinting
Common Applications
Role of Computer in DNA Fingerprinting
– Searching Restriction Enzymes
– Searching VNTRs
– Computation of size of DNA fragments
– Optimization of gels
– Comparison of patterns
– Creation of Phylogenetic tree
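Two of these tasks, locating restriction sites and computing fragment sizes, reduce to string searching. The sketch below assumes a linear DNA molecule and a single enzyme. It uses EcoRI (site GAATTC, cut after the first G) as the example; the function name `fragment_sizes` and the test sequence are made up.

```python
def fragment_sizes(dna: str, site: str, cut_offset: int):
    """Sizes of fragments after cutting linear DNA at every site occurrence.

    cut_offset is the number of bases from the start of the recognition
    site to the cut position on the given strand.
    """
    cuts = []
    start = dna.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = dna.find(site, start + 1)
    bounds = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# EcoRI recognises GAATTC and cuts between G and AATTC.
sequence = "AAAAGAATTCCCCCGAATTCTT"
print(fragment_sizes(sequence, "GAATTC", cut_offset=1))
```

Comparing the sorted fragment sizes of two samples is then one simple way to compare their banding patterns.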
Drug Design
History of Drug/Vaccine development
– Plants or Natural Product
Plants and natural products were sources of medicinal substances
Example: foxglove used to treat congestive heart failure
Foxglove contains digitalis and cardiotonic glycosides
Identification of active component
– Accidental Observations
Penicillin is one good example
Alexander Fleming observed the effect of mold
Mold (Penicillium) produces the substance penicillin
Discovery of penicillin led to large-scale screening
Soil microorganisms were grown and tested
Streptomycin, neomycin, gentamicin, tetracyclines etc.
Drug Design
Chemical Modification of Known Drugs
– Drug improvement by chemical modification
– Penicillin G -> Methicillin; morphine -> nalorphine
Receptor Based drug design
– Receptor is the target (usually a protein)
– Drug molecule binds to cause biological effects
– It is also called lock and key system
– Structure determination of receptor is important
Ligand-based drug design
– Search a lead compound or active ligand
– Structure of ligand guide the drug design process
Drug Design based on Bioinformatics Tools
Detect the Molecular Bases for Disease
– Detection of drug binding site
– Tailor drug to bind at that site
– Protein modeling techniques
– Traditional Method (brute force testing)
Rational drug design techniques
– Screen likely compounds built
– Modeling large number of compounds (automated)
– Application of Artificial intelligence
– Limitation of known structures
Important Points in Drug Design based on
Bioinformatics Tools
Application of Genome
– 3 billion base pairs
– 30,000 unique genes
– Any gene may be a potential drug target
– ~500 unique targets
– There may be 10 to 100 variants at each target gene
– 1.4 million SNPs
– 10^200 potential small molecules
Concept of Drug and Vaccine
Concept of Drug
– Kill invading foreign pathogens
– Inhibit the growth of pathogens
Concept of Vaccine
– Generate memory cells
– Train the immune system to face various existing disease agents
VACCINES
A. SUCCESS STORY:
• COMPLETE ERADICATION OF SMALLPOX
• WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
• SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES: DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS
B. NEED OF THE HOUR
1) SEARCH FOR EFFECTIVE VACCINES, NOT YET AVAILABLE, FOR DISEASES LIKE MALARIA, TUBERCULOSIS AND AIDS
2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT
VACCINES
3) LOW COST
4) EFFICIENT DELIVERY TO NEEDY
5) REDUCTION OF ADVERSE SIDE EFFECTS
Computer Aided Vaccine Design
Whole Organism of Pathogen
– Consists of more than 4000 genes and proteins
– Genomes have millions of base pairs
Target antigen to recognise pathogen
– Search vaccine target (essential and non-self)
– Consists of amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T ……)
Search antigenic region (peptide of length 9 amino acids)
Major steps of endogenous antigen processing
Computer Aided Vaccine Design
Problem of Pattern Recognition
– ATGGTRDAR   Epitope
– LMRGTCAAY   Non-epitope
– RTTGTRAWR   Epitope
– EMGGTCAAY   Non-epitope
– ATGGTRKAR   Epitope
– GTCVGYATT   Epitope
Commonly used techniques
– Statistical (Motif and Matrix)
– AI Techniques
Why computational tools are required for prediction
200 aa protein
-> chopped into overlapping peptides of 9 amino acids (192 peptides)
-> Bioinformatics Tools
-> 10-20 predicted peptides
-> in vitro or in vivo experiments for detecting which snippets of protein will spark an immune response
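The count of 192 peptides comes from sliding a 9-residue window along a 200-residue protein one position at a time: 200 - 9 + 1 = 192. A minimal sketch follows; the function name `overlapping_peptides` is hypothetical, and the 200-residue sequence is a made-up repeat of the slide's example residues.

```python
def overlapping_peptides(protein: str, length: int = 9):
    """All overlapping peptides of the given length, one residue apart."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


# Made-up 200-residue sequence, used only to check the count.
protein = ("AVLGYRGCT" * 23)[:200]
peptides = overlapping_peptides(protein)
print(len(protein), len(peptides))
print(peptides[:3])
```

A prediction tool then scores each of these candidates so that only the top 10-20 need to go to the bench.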
Thanks