
C2D Cheminformatics: Methods, Tools and Results
By the OSDD-Cheminformatics team
The burden of TB
• About 9 million people were infected with TB in 2009, and 1.7 million died.
• India is the world's TB capital, with an estimated 1.9 million cases reported every year.
• India has the second-largest estimated number of MDR-TB cases (99,000 in 2008).
• By July 2010, 58 countries had reported at least one case of XDR-TB.
Cheminformatics : What?
• Computers have been applied to solve problems almost everywhere. When we use them in chemistry, we call it cheminformatics.
• Cheminformatics is applied mostly to large numbers of molecules.
• Deals with
– Storage, retrieval and crosslinking of chemical structures
and associated data.
– Prediction of physical, chemical and biological properties
of compounds.
– Analysis and prediction of reactions.
– Drug Design...
Steps in drug development
Disease selection
Target hypothesis
Lead compound identification (screening)
Lead optimization
Pre-clinical trial
Clinical trial
Pharmacogenomic optimization
Cheminformatics in drug design
[Diagram: Target → Hit identification → Lead identification → Lead optimization, supported by virtual screening, data mining, and building computational models for the drug discovery process.]
Aim of Cheminformatics Project
• To screen molecules that interact with the potential TB targets using classifiers.
• To dock the selected molecules against the targets and so screen them further for leads.
• To use cheminformatics techniques such as QSAR, 3D QSAR and ADMET to look for potential leads, and to design drugs from those leads by building combinatorial libraries.
Ways to perform Virtual screening
• Use a previously derived mathematical model that
predicts the biological activity of each structure
• Run substructure queries to eliminate molecules
with undesirable functionality
• Use a docking program to identify structures
predicted to bind strongly to the active site of a
protein (if target structure is known)
• Apply a succession of screening methods as filters, each removing unwanted structures
Main Classes of Virtual Screening Methods
• Depend on the amount of structural and bioactivity
data available
– One active molecule known: perform similarity search
(ligand-based virtual screening)
– Several active molecules known: try to identify a common
3D pharmacophore, then do a 3D database search
– Reasonable number of active and inactive structures
known: train a machine learning technique (with the help
of Molecular descriptors or Molecular properties)
– 3D structure of the protein known: use protein-ligand
docking
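The simplest case above, a similarity search around one known active, can be sketched as follows. This is a minimal illustration, assuming fingerprints are represented as sets of "on" bit indices and that similarity is scored with the Tanimoto coefficient; the compound names, bit values and threshold are hypothetical and not taken from the project.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(query, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints: one known active and a three-compound library.
known_active = frozenset({1, 4, 7, 9, 12})
library = {
    "cmpd_A": frozenset({1, 4, 7, 9, 13}),   # shares 4 of 6 distinct bits
    "cmpd_B": frozenset({2, 3, 5}),          # shares nothing
    "cmpd_C": frozenset({1, 4, 7, 9, 12}),   # identical fingerprint
}

hits = similarity_search(known_active, library, threshold=0.6)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

In practice the fingerprints would come from a descriptor tool rather than being typed in by hand; only the ranking logic is shown here.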
Molecule Properties
SPC : Structure Property Correlation
CHEMICAL PROPERTIES
pKa
Log P
Solubility
Stability
INTRINSIC PROPERTIES
Molar Volume
Connectivity Indices
Charge Distribution
Molecular Weight
Polar surface Area
BIOLOGICAL PROPERTIES
Activity
Toxicity
Biotransformation
Pharmacokinetics
Molecular descriptors used for machine learning
Molecular descriptors are numerical values that
characterize properties of molecules.
The descriptors fall into four classes:
a) Topological
b) Geometrical
c) Electronic
d) Hybrid or 3D Descriptors
Descriptors Used For Classification
Name of descriptors used        Number of descriptors
Pharmacophore Fingerprints      147
Weighted Burden Number          24
Properties                      8
Data mining
According to David Hand et al. (MIT Press, 2001):
“Data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are both
understandable and useful to the data owner.”
Data mining ... but why?
Data → Information → Knowledge
• The main aim of a user is always to extract knowledge from the information obtained from data.
• Data mining is one of the key steps in the knowledge discovery process, although it is sometimes confused with knowledge discovery itself!
• A user always wants to find more information while spending the least possible time exploring resources.
Data mining in Cheminformatics
• Data mining approaches are an integral part of
cheminformatics and pharmaceutical
research.
• Their role will tend to grow as computational methods for biology and
chemistry become more widespread.
• Data mining has found major use in the virtual
screening process of cheminformatics.
Data Mining Taxonomy
Classifier algorithms used
• Bayes classifier
Naïve Bayes
• Trees
J48
Random Forest
• Functions
SMO
WORKFLOW
• PubChem: access the HTS bioassay data; download the sdf file of all compounds and the bioassay results (all).
• PowerMV: upload the sdf file and generate the descriptor file.
• Excel: open the CSV file in Excel, append the bioassay result corresponding to each compound, and select the active and inactive compounds.
• WEKA: remove the useless attributes, split the file into training and testing sets, and apply the classifier algorithms.
• Selection of the best classifier model: TP %, FP < 20%, accuracy > 70%.
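The labelling and file-splitting steps of the workflow above were done by hand in Excel and Weka. The sketch below shows one way the same steps could be scripted, assuming each descriptor row carries a compound identifier (`cid`) and that outcomes are the strings "active", "inactive" or "inconclusive". The column names, the 25% test fraction and the sample data are all hypothetical.

```python
import random


def attach_outcomes(descriptors, outcomes):
    """Append the bioassay outcome to each descriptor row, keeping only actives and inactives."""
    labelled = []
    for row in descriptors:
        outcome = outcomes.get(row["cid"])
        if outcome in ("active", "inactive"):
            labelled.append({**row, "outcome": outcome})
    return labelled


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle within each class, then split, so both sets keep roughly the same class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in ("active", "inactive"):
        group = [r for r in rows if r["outcome"] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: 20 compounds with two made-up descriptors each.
descriptors = [{"cid": str(i), "desc_1": i % 3, "desc_2": i * 0.5} for i in range(1, 21)]
outcomes = {str(i): ("active" if i % 4 == 0 else "inactive") for i in range(1, 19)}
outcomes["19"] = "inconclusive"   # dropped: neither active nor inactive
# Compound 20 has no bioassay result at all, so it is dropped too.

rows = attach_outcomes(descriptors, outcomes)
train, test = split_train_test(rows, test_fraction=0.25)
print(len(rows), len(train), len(test))
```

Splitting per class is a design choice made here because actives are much rarer than inactives in HTS data; the slides do not say how the original split was done.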
Molecular Descriptor generation
• Chemistry Development Kit (CDK) –
http://rguha.net/code/java/cdkdesc.html
• PowerMV
http://nisla05.niss.org/PowerMV/?q=PowerMV
PowerMV
• A Software Environment for Molecular Viewing, Descriptor
Generation, Data Analysis and Hit Evaluation.
• An operating environment for biologists and statisticians for
viewing or browsing medium to large molecular SD files,
computing descriptors.
Features
• Importing, viewing and sorting SD files.
• Capacity is limited only by available memory.
• Compound structures and attributes can be easily exported to
Microsoft Excel.
Pre-requisites
• Requires .NET framework.
Limitation
• Windows based
Weka - toolkit
• Collection of machine learning algorithms for data analysis
and classification experiments.
• Tools available for data pre-processing, classification,
regression, clustering, association rules, and visualization.
Weka – on GARUDA
The Script file
• RemoveUselessAttributes
java <CLASSPATH> -Xmx4000m weka.filters.unsupervised.attribute.RemoveUseless -i <in.csv> -o <out.csv>
• Using cost-sensitive classification
java <CLASSPATH> -Xmx4000m weka.classifiers.meta.CostSensitiveClassifier -cost-matrix "[0.0 10.0; 1.0 0.0]" -t AID1626train.arff -x 5 -d smo.model -W weka.classifiers.functions.SMO -i -- -M
Case Study: AID899
To get trained in using different classifiers in Weka and in analyzing the results
CYP450: a novel target against Mycobacterium tuberculosis
Why CYP450
• The P450s are mono-oxygenase enzymes that generally interact with flavoprotein and/or iron–sulphur centre redox partners for catalysis.
• The Mtb genome sequence: a plethora of P450s. It is "P450 dense" by comparison with eukaryotic genomes.
• The most effective azoles have extremely tight binding constants for one of the Mtb P450s (CYP121).
• Analysis of Mtb CYP51 revealed that P420 is an irreversibly inactivated and structurally disrupted species.
Organism           P450s   Genome size       Ratio (P450s : genome)
Humans             57      3.3 billion bp    1 : 58 million bp
D. melanogaster    84      123 million bp    1 : 1.5 million bp
A. thaliana        249     115 million bp    1 : 462,000 bp
M. tuberculosis    20      4.4 million bp    1 : 220,000 bp
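The ratio column is simply the genome size divided by the number of P450s. A short sketch of that arithmetic, using the P450 counts and genome sizes listed in the table (the variable and function names are illustrative only):

```python
# (number of P450s, genome size in base pairs), as listed in the table
genomes = {
    "Humans": (57, 3.3e9),
    "D. melanogaster": (84, 123e6),
    "A. thaliana": (249, 115e6),
    "M. tuberculosis": (20, 4.4e6),
}


def bp_per_p450(p450_count, genome_bp):
    """Average number of base pairs of genome per P450 gene."""
    return genome_bp / p450_count


density = {org: bp_per_p450(*values) for org, values in genomes.items()}

# Print the densest genome first (fewest base pairs per P450).
for org, bp in sorted(density.items(), key=lambda item: item[1]):
    print(f"{org}: one P450 per {bp:,.0f} bp")
```

Sorted this way, M. tuberculosis comes out first, which is the sense in which its genome is "P450 dense".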
Mutations were largely located not in the active site itself, but in conformationally mobile regions
that facilitate the entry and exit of substrate to and from the active site.
Thus, acquired resistance could be mediated by mutations that enhance flexibility and conformational rearrangement,
leading to increased activity.
Objectives
To develop a model from the AID 899 HTS data to study compound/drug
interaction with human CYP450.
Why
1) A lead molecule should not interact with human CYP450
a) Drug metabolism
b) Affecting CYP450
2) It should work against the CYP450 of M. tuberculosis
Work plan
Select active/inactive compounds against human
CYP450 from PubChem HTS data
Generate a model for lead compound screening
(Current work)
Screen the compounds via the model
Select the inactives
Test against mycobacterial CYP450 (model)
Select active lead compound
(To be done)
Go for in silico drug design
In vitro and in vivo studies
Confusion Matrix
             Classified as active    Classified as inactive
Active       TP                      FN
Inactive     FP                      TN
TP: active classified as active
FN: active classified as inactive
FP: inactive classified as active
TN: inactive classified as inactive
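The rates used to compare classifiers follow directly from these four cells. A minimal sketch, assuming the usual definitions (TP rate = TP / (TP + FN), FP rate = FP / (FP + TN)) and using the FP < 20% and accuracy > 70% cut-offs quoted in the workflow; the counts are hypothetical, not results from AID 899.

```python
def confusion_metrics(tp, fn, fp, tn):
    """TP rate, FP rate and accuracy from the four confusion-matrix cells.

    Assumes at least one active and one inactive compound, so no denominator is zero.
    """
    return {
        "tp_rate": tp / (tp + fn),
        "fp_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


def passes_selection(metrics, max_fp_rate=0.20, min_accuracy=0.70):
    """Apply the selection cut-offs: FP rate below 20% and accuracy above 70%."""
    return metrics["fp_rate"] < max_fp_rate and metrics["accuracy"] > min_accuracy


# Hypothetical counts for one classifier on a 1000-compound test set.
metrics = confusion_metrics(tp=150, fn=50, fp=100, tn=700)
print(metrics, passes_selection(metrics))
```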
Base Classifier and Cost Sensitive Classifier (CSC)
CSC: set a cost factor for false negatives
→ both the TP rate and the FP rate increase
So a FN is treated as more important than a FP
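Why a higher false-negative cost raises both rates can be seen from a small sketch of the minimum-expected-cost decision rule. This is an illustration under stated assumptions, not necessarily what the Weka command does: Weka's CostSensitiveClassifier can also apply the matrix by reweighting training instances. The class order (active first) and the example probability are assumptions; the matrix values are the ones from the script slide.

```python
def min_expected_cost_class(probs, cost):
    """Pick the predicted class with the lowest expected misclassification cost.

    probs[i]   = estimated probability that the true class is i
    cost[i][j] = cost of predicting class j when the true class is i
    """
    n = len(probs)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    best = min(range(n), key=expected.__getitem__)
    return best, expected


ACTIVE, INACTIVE = 0, 1                  # assumption: active is the first class
uniform_cost = [[0.0, 1.0], [1.0, 0.0]]  # every error costs the same
fn_heavy_cost = [[0.0, 10.0], [1.0, 0.0]]  # missing an active costs 10x more

# Hypothetical compound that the base model thinks is only 20% likely to be active.
probs = [0.2, 0.8]

base_pred, _ = min_expected_cost_class(probs, uniform_cost)
csc_pred, csc_expected = min_expected_cost_class(probs, fn_heavy_cost)
print(base_pred, csc_pred, csc_expected)
```

With uniform costs the compound is called inactive; with the false-negative cost at 10 it is called active. More borderline compounds are therefore predicted active, which lifts the TP rate and the FP rate together.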
Problems Faced
Data redundancy
Computational power
Communication: need an alternative to Skype
Institutional limitations: ban on media streaming,
social networks, chat, etc.
Data Redundancy
We tried two approaches for processing the AID to obtain the training and test data sets.
Method 1: We downloaded the sdf file containing all tested compounds and the bioassay data files for the same assay.
We then matched them in MS Excel.
The matched data contained active, inactive, inconclusive and discrepancy outcomes.
We selected only the active and inactive compounds and ran them through PowerMV to get a CSV file.
After converting it to ARFF, we derived the test and training files from it.
We loaded the two files in Weka and tried different algorithms to build the best model.
Method 2: We downloaded the active and inactive SDF files separately from the same PubChem page.
After processing in PowerMV, both files were combined into one.
The remaining steps were the same as in Method 1.
Problem: The final numbers of active and inactive compounds differ between the two methods.
             Active   Inactive   Discrepancy   Inconclusive
Method I     1767     6255       230           1127
Method II    1901     6441       Nil           1279
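One way to check a data set for this kind of redundancy is to group the bioassay records by compound and flag any compound that appears with more than one outcome. The sketch below assumes records are (compound ID, outcome) pairs and that conflicting outcomes are what the "discrepancy" column counts; that interpretation, the record layout and the sample data are all assumptions, not something the slides confirm.

```python
from collections import Counter, defaultdict


def consolidate_outcomes(records):
    """Collapse repeated compound records into one outcome per compound.

    records: iterable of (cid, outcome) pairs.
    A compound seen with conflicting outcomes is labelled "discrepancy".
    """
    seen = defaultdict(set)
    for cid, outcome in records:
        seen[cid].add(outcome)
    return {
        cid: next(iter(outcomes)) if len(outcomes) == 1 else "discrepancy"
        for cid, outcomes in seen.items()
    }


# Hypothetical records: compound 101 is duplicated, compound 103 conflicts with itself.
records = [
    ("101", "active"),
    ("102", "inactive"),
    ("101", "active"),
    ("103", "active"),
    ("103", "inactive"),
    ("104", "inconclusive"),
]

final = consolidate_outcomes(records)
counts = Counter(final.values())
print(dict(counts))
```

Running the same consolidation on the data from both download routes would show whether the two methods disagree because of duplicated or conflicting records.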
AID 899 is not curated. The problem has been reported to PubChem, and the director will be looking at it.
Progress & Results
1) We understood the basics of working with Weka
2) We learned how to derive results from the confusion matrix
3) A commonly ignored classifier family (lazy) gives good results
4) We got good results with Random Forest, etc., unlike those reported in the
virtual bioassay paper
5) Maximum accuracy of 86.16%
Strategy followed
From the preliminary investigation it is clear that AID 899 is not a
properly curated dataset
In Method I, many classifiers were applied and the results are
represented below
In Method II, many classifiers can still be run and their results
generated.
List of best classifiers: FP < 20, Accuracy > 75
[Bar chart: TP and Accuracy (scale 0 to 100) for each classifier]
Classifiers shown: CSC SMO; Naïve Bayes; meta Decorate with Decision Stump, REP Tree, Random Tree, Random Forest and AD Tree; meta Decorate; AD Tree; rules.PART; meta Bagging with SimpleCart, ADTree, J48graft, LADTree, Random Forest, Random Tree and REP Tree; MetaCost Random Forest; Random Forest; CSC Ridor; Ridor; CSC lazy IBk; lazy.IBk; MetaCost lazy IB1; lazy.IB1
Sincere thanks to
OSDD