In the name of GOD

Basic Steps of QSAR/QSPR
Investigations
M.H. FATEMI
Mazandaran University
QSAR
• Quantitative Structure-Activity Relationships
• Can one predict activity (or properties, in
QSPR) simply on the basis of knowledge of
the structure of the molecule?
• In other words, if one systematically
changes a component, will it have a
systematic effect on the activity?
What is QSAR?
A QSAR is a mathematical relationship
between the biological activity of a
molecular system and its geometric and
chemical characteristics.
QSAR attempts to find consistent
relationships between biological activity
and molecular properties, so that these
“rules” can be used to evaluate the
activity of new compounds.
Why QSAR?
The number of compounds that must be
synthesized in order to place 10 different
groups in 4 positions of a benzene ring is
10^4 (10 choices at each of 4 positions).
Solution: synthesize a small number of
compounds and from their data derive
rules to predict the biological activity of
other compounds.
QSXR
X = A: Activity
X = P: Property
X = R: Retention
$X = b_0 + b_1 D_1 + b_2 D_2 + \dots + b_n D_n$
$b_i$: regression coefficient
$D_i$: descriptors
$n$: number of descriptors
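As a minimal sketch of how the coefficients b_0 ... b_n in the equation above could be estimated, the following uses ordinary least squares from NumPy. The function names `fit_linear_model` and `predict` and the small data set are hypothetical and serve only as an illustration.

```python
import numpy as np

def fit_linear_model(D, x):
    """Least-squares estimate of b0..bn in X = b0 + b1*D1 + ... + bn*Dn."""
    D = np.asarray(D, dtype=float)
    x = np.asarray(x, dtype=float)
    # A leading column of ones makes the first coefficient the intercept b0.
    design = np.column_stack([np.ones(len(D)), D])
    coeffs, *_ = np.linalg.lstsq(design, x, rcond=None)
    return coeffs

def predict(coeffs, D):
    """Evaluate the fitted linear model for a descriptor matrix D."""
    D = np.asarray(D, dtype=float)
    return coeffs[0] + D @ coeffs[1:]

# Hypothetical data: 6 compounds, 2 descriptors, generated from known coefficients.
D = np.array([[1.0, 0.5], [2.0, 1.5], [3.0, 0.2],
              [4.0, 2.2], [5.0, 1.1], [6.0, 3.0]])
x = 0.5 + 1.2 * D[:, 0] - 0.7 * D[:, 1]
b = fit_linear_model(D, x)
print(b)  # estimated [b0, b1, b2]
```

Because the illustrative data are noise-free, the fit recovers the generating coefficients; real QSAR data will leave a residual.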
History
Early Examples
• Hammett (1930s-1940s)
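[Scheme: ionization of benzoic acid ($K_0$) and of its para- and meta-X-substituted derivatives ($K_p$, $K_m$): ArCOOH ⇌ ArCOO⁻ + H⁺]
$\sigma_{para} = \log_{10}(K_p / K_0)$
$\sigma_{meta} = \log_{10}(K_m / K_0)$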
Hammett (cont.)
• Now suppose we have a related series
[Scheme: X-C6H4-CH2COOH ⇌ X-C6H4-CH2COO⁻ + H⁺, equilibrium constant $K'_x$]
$\log_{10}(K'_x / K'_0) = \rho\sigma$
σ reflects sensitivity to the substituent
ρ reflects sensitivity to a different system
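The two Hammett relations can be illustrated with a few lines of arithmetic. This is only a sketch: the function names are hypothetical and the equilibrium constants are made-up round numbers, not measured values.

```python
import math

def hammett_sigma(K_x, K_0):
    """Substituent constant: sigma = log10(K_x / K_0) for the benzoic acid series."""
    return math.log10(K_x / K_0)

def hammett_rho(K_x_prime, K_0_prime, sigma):
    """Reaction constant: rho = log10(K'_x / K'_0) / sigma for a related series."""
    return math.log10(K_x_prime / K_0_prime) / sigma

# Hypothetical equilibrium constants, chosen only to give round results.
K_0, K_p = 1.0e-5, 1.0e-4                 # reference series
sigma_para = hammett_sigma(K_p, K_0)

K_0_prime = 2.0e-5                        # related series, unsubstituted
K_x_prime = K_0_prime * 10 ** 0.5         # related series, same substituent
rho = hammett_rho(K_x_prime, K_0_prime, sigma_para)
print(sigma_para, rho)
```

A value of rho below 1 in this sketch means the related series responds less strongly to the substituent than the reference series does.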
Free-Wilson Analysis
• $\log 1/C = \sum a_i + \mu$
where log 1/C = predicted activity,
$a_i$ = contribution per group, and μ = activity
of the reference
Free-Wilson example
[Structure: parent compound containing Br and N, as the HCl salt, with substituents X and Y]
Activity of analogs:
Log 1/C = -0.30 [m-F] + 0.21 [m-Cl] + 0.43 [m-Br]
+ 0.58 [m-I] + 0.45 [m-Me] + 0.34 [p-F] + 0.77 [p-Cl]
+ 1.02 [p-Br] + 1.43 [p-I] + 1.26 [p-Me] + 7.82
Problems: at least two substituent positions are necessary,
and the model can only predict new combinations of the
substituents used in the analysis.
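The Free-Wilson equation above is additive, so a prediction is a table lookup plus a sum. The sketch below copies the group contributions and the reference value 7.82 from the equation; the function name `free_wilson_log_inv_c` is hypothetical.

```python
# Group contributions a_i and the reference value, copied from the equation above.
CONTRIBUTIONS = {
    "m-F": -0.30, "m-Cl": 0.21, "m-Br": 0.43, "m-I": 0.58, "m-Me": 0.45,
    "p-F": 0.34, "p-Cl": 0.77, "p-Br": 1.02, "p-I": 1.43, "p-Me": 1.26,
}
REFERENCE = 7.82

def free_wilson_log_inv_c(substituents):
    """Predicted log 1/C: reference value plus the sum of group contributions."""
    return REFERENCE + sum(CONTRIBUTIONS[s] for s in substituents)

unsubstituted = free_wilson_log_inv_c([])
m_cl_p_br = free_wilson_log_inv_c(["m-Cl", "p-Br"])  # 7.82 + 0.21 + 1.02
print(unsubstituted, m_cl_p_br)
```

A substituent that was not in the analysis (for example a nitro group) has no entry in the table, so the lookup fails. That is the second limitation listed above.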
Hansch Analysis
$\log 1/C = a\pi + b\sigma + c$
where
$\pi(X) = \log P_{RX} - \log P_{RH}$
and P is the octanol/water partition coefficient.
This is also a linear free energy relationship.
Applications of QSAR
• 1-Drug design
• 2-Prediction of chemical toxicity
• 3-Prediction of environmental activity
• 4-Prediction of molecular properties
• 5-Investigation of retention mechanism
Steps in QSPR/QSAR
QSAR STEPS
• Structure Entry & Molecular Modeling
• Descriptor Generation
• Feature Selection
• Construct Model (MLRA or CNN)
• Model Validation
Data set selection
• 1-Structural similarity of the studied molecules
• 2-Data collected under the same conditions
• 3-Data set should be as large as possible
INTRODUCTION to Molecular
Descriptors
• Molecular descriptors are numerical values that
characterize properties of molecules
• Molecular descriptors encode structural
features of molecules as numerical values
• They vary in the complexity of the encoded information and in
compute time
• Examples:
– Physicochemical properties (empirical)
– Values from algorithms, such as 2D fingerprints
Classical Classification of Molecular Descriptors
• Constitutional, Topological: 2-D structural formula
• Geometrical: 3-D shape and structure
• Quantum Chemical
• Physicochemical
• Hybrid descriptors
Topological Indexes: Example:
• Wiener Index
• Counts the number of bonds between pairs of atoms and sums the
distances between all pairs
• Molecular Connectivity Indexes
– Randić branching index
• Defines a “degree” of an atom as the number of adjacent
non-hydrogen atoms
• Bond connectivity value is the reciprocal of the square root of
the product of the degree of the two atoms in the bond.
• Branching index is the sum of the bond connectivities over all
bonds in the molecule.
– Chi indexes – introduces valence values to encode sigma, pi,
and lone pair electrons
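The Wiener and Randić definitions above can be sketched directly on a hydrogen-suppressed molecular graph. In this illustration the graph is a plain adjacency dictionary, and the function names and the two small example molecules are choices made here, not part of the original slides.

```python
from collections import deque
from math import sqrt

def wiener_index(adj):
    """Sum of shortest-path bond counts over all unordered pairs of atoms."""
    total = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from `start`
            atom = queue.popleft()
            for nbr in adj[atom]:
                if nbr not in dist:
                    dist[nbr] = dist[atom] + 1
                    queue.append(nbr)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends

def randic_index(adj):
    """Sum over bonds of 1/sqrt(degree_a * degree_b), non-hydrogen atoms only."""
    total = 0.0
    for a in adj:
        for b in adj[a]:
            if a < b:  # visit each bond once
                total += 1.0 / sqrt(len(adj[a]) * len(adj[b]))
    return total

# Hydrogen-suppressed carbon skeletons.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

print(wiener_index(n_butane), wiener_index(isobutane))
print(randic_index(n_butane), randic_index(isobutane))
```

Both indexes come out smaller for the branched isomer, which is why they are used as measures of branching.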
Electronic descriptors
• Electronic interactions play a very important
role in controlling molecular properties.
• Electronic descriptors are calculated to
encode aspects of the structures that are
related to the electrons
• Electronic interaction is a function of
charge distribution on a molecule
Physicochemical Properties
Used in this QSAR
1. Liquid solubility S_w,L in mg/L and mmol/m³
2. Octanol-water partition coefficient K_ow
3. Liquid vapor pressure P_v,L in Pa
4. Henry’s Law constant H_c in Pa·m³/mol
5. Boiling point
Feature Selection
• E.g. comparing faces first requires the
identification of key features.
• How do we identify these?
• The same applies to molecules.
Objective feature selection
• After descriptors have been calculated for each
compound, this set must be reduced to a set of
descriptors that is as information-rich but as
small as possible
1- Deleting constant or near-constant
descriptors
2- Pair correlation cut-off selection
3- Cluster analysis
4- Principal component analysis
5- K correlation analysis
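The first two objective filters in the list above can be sketched with NumPy as follows. The thresholds (`min_std`, `cutoff`), the function names, the rule of keeping the first descriptor of a correlated pair, and the toy data matrix are all assumptions made for this illustration.

```python
import numpy as np

def drop_near_constant(X, names, min_std=1e-6):
    """Remove descriptors whose standard deviation is at or below min_std."""
    X = np.asarray(X, dtype=float)
    keep = X.std(axis=0) > min_std
    return X[:, keep], [n for n, k in zip(names, keep) if k]

def drop_correlated(X, names, cutoff=0.9):
    """Keep a descriptor only if |r| with every already-kept descriptor is below cutoff."""
    X = np.asarray(X, dtype=float)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < cutoff for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Hypothetical descriptor matrix: 5 compounds, 4 descriptors.
names = ["mw", "const", "mw_x2", "dip"]
X = np.array([[100.0, 1.0, 200.0, 0.5],
              [120.0, 1.0, 240.0, 2.0],
              [140.0, 1.0, 280.0, 0.1],
              [160.0, 1.0, 320.0, 1.7],
              [180.0, 1.0, 360.0, 0.9]])

# Constant columns are removed first because their correlation is undefined.
X1, names1 = drop_near_constant(X, names)
X_reduced, kept_names = drop_correlated(X1, names1, cutoff=0.9)
print(kept_names)
```

Which member of a highly correlated pair to keep is a judgment call; in practice the more interpretable descriptor is often preferred.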
Descriptive Statistics
Variable             N    Minimum   Maximum   Mean       Std. Deviation
homo                 55   .01       9.44      .6524      1.66861
lumo                 55   .02       708.00    13.2664    95.41298
dip                  55   .00       7.35      2.7035     2.06794
mw                   55   123.11    307.99    192.4207   42.41658
mia                  55   .02       .19       .0580      .03451
mib                  55   .00       .23       .0270      .03070
mic                  55   .00       312.00    5.6900     42.06771
polar                54   63.45     153.63    95.7878    23.58493
x0                   55   4.07      9.13      5.9576     1.24159
x1p                  55   2.20      4.68      3.1949     .76452
x2p                  55   1.41      4.56      2.3626     .74960
x3p                  55   .79       2.71      1.4072     .49032
x3c                  55   .10       1.14      .2799      .16722
x4p                  55   .43       1.90      .8358      .38795
x4c                  55   .14       1.79      .4958      .27697
noa                  55   12.00     28.00     17.3091    4.11804
pcpa                 55   .05       .58       .3319      .19432
pcna                 55   -.45      -.05      -.2652     .11673
edn                  55   4.05      6.37      5.2470     .99529
edp                  55   .75       6.95      2.5227     1.99339
dspn                 55   .98       6.94      2.2400     1.62828
shape                55   1.42      3.93      2.6579     .43353
volm                 55   106.12    218.34    146.2387   25.62153
surf                 55   129.62    262.24    175.1636   28.52871
s1zy                 55   44.02     80.88     57.0065    8.44310
s2zx                 55   22.66     56.08     31.9507    7.16801
s3xy                 55   18.74     38.74     25.0053    4.42347
ss1                  55   .57       .80       .7089      .05104
ss2                  55   .65       .92       .8291      .07153
ss3                  55   .64       .90       .8080      .05988
logp                 55   1.49      6.63      3.6971     1.19562
bcf                  54   1.02      5.62      3.1893     .84204
number               55   1.00      110.00    37.5636    33.22246
Valid N (listwise)   53
Variable reduction
• Principal Component Analysis
• $PC_1 = a_{1,1}x_1 + a_{1,2}x_2 + \dots + a_{1,n}x_n$
• $PC_2 = a_{2,1}x_1 + a_{2,2}x_2 + \dots + a_{2,n}x_n$
• Keep only those components that possess
the largest variance
• PCs are orthogonal to each other
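A minimal sketch of principal component analysis, using the singular value decomposition from NumPy, is given here. The rows of `loadings` play the role of the coefficients a_{i,j} above. The function name `pca`, the random data, and the choice to mean-center without autoscaling are assumptions of this illustration; descriptors measured in very different units are usually autoscaled first.

```python
import numpy as np

def pca(X, n_components):
    """Return scores, loadings and variances of the leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center each descriptor
    # Rows of Vt are unit-length, mutually orthogonal loading vectors.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S ** 2 / (len(X) - 1)   # sorted from largest to smallest
    loadings = Vt[:n_components]
    scores = Xc @ loadings.T                     # PC values for every compound
    return scores, loadings, explained_variance[:n_components]

# Hypothetical data: 20 compounds, 4 descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
scores, loadings, variance = pca(X, n_components=2)
print(variance)
```

Keeping only the first few components, which carry the largest variance, gives a small set of uncorrelated variables for model building.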
Subjective Feature Selection
• The aim is to reach an optimal model
• 1-Search all possible models (Best MLR)
• 2-Forward, Backward & Stepwise methods
• 3-Genetic algorithm
• 4-Mutation and selection uncover models
• 5-Cluster significance analysis
• 6-Leaps & bounds regression
Feature Selection:
Most existing feature selection algorithms
consist of:
• Starting point in the feature space
• Search procedure
• Evaluation function
• Criterion for stopping the search
Feature Selection:
• Starting point in the feature space
- no features
- all features
- random subset of features
Forward Selection
• 1- Variables are sequentially entered into the
model. The first variable considered for entry into the equation
is the one with the largest positive or negative correlation with
the dependent variable. This variable is entered into the equation
only if it satisfies the criterion for entry.
• 2- If the first variable is entered, the independent
variable not in the equation that has the largest
partial correlation is considered next.
• 3- The procedure stops when there are no
variables that meet the entry criterion.
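The forward selection steps above can be sketched as follows. Two simplifications are assumed here: the candidate with the largest partial correlation is found as the one giving the largest increase in R², which is equivalent, and the entry criterion is a minimum R² gain (`min_gain`) instead of the F-probability test used by statistical packages. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coeffs
    return 1.0 - (residual @ residual) / ((y - y.mean()) ** 2).sum()

def forward_selection(X, y, min_gain=0.01):
    """Add descriptors one at a time until no candidate meets the entry criterion."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Gain in R^2 for every descriptor not yet in the model.
        gains = {c: r_squared(X, y, selected + [c]) - best_r2 for c in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:   # simplified entry criterion (assumption)
            break
        selected.append(best)
        remaining.remove(best)
        best_r2 += gains[best]
    return selected, best_r2

# Synthetic data: only descriptors 1 and 3 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 2.0 * X[:, 1] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=40)
selected, r2 = forward_selection(X, y)
print(selected, r2)
```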
Forward Selection example
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .704a   .496       .486                .59485
2       .762b   .581       .564                .54785
3       .810c   .655       .634                .50184
4       .834d   .695       .670                .47674
a. Predictors: (Constant), log p
b. Predictors: (Constant), log p, mw
c. Predictors: (Constant), log p, mw, dip
d. Predictors: (Constant), log p, mw, dip, mia
Backward Elimination
• 1- All variables are entered into the equation and
then sequentially removed.
• 2-The variable with the smallest partial
correlation with the dependent variable is
considered first for removal. If it meets the
criterion for elimination, it is removed.
• 3- After the first variable is removed, the variable
remaining in the equation with the smallest
partial correlation is considered next.
• 4-The procedure stops when there are no
variables in the equation that satisfy the removal
criteria.
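Backward elimination can be sketched in the same spirit. As an assumption of this illustration, the variable with the smallest partial correlation is identified as the one whose removal costs the least R², and the removal criterion is a maximum tolerated R² loss (`max_loss`) in place of an F-probability test. Function names and data are hypothetical.

```python
import numpy as np

def r_squared(X, y, cols):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coeffs
    return 1.0 - (residual @ residual) / ((y - y.mean()) ** 2).sum()

def backward_elimination(X, y, max_loss=0.01):
    """Start from all descriptors and remove the weakest one while the loss is small."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    selected = list(range(X.shape[1]))
    current = r_squared(X, y, selected)
    while len(selected) > 1:
        # Loss in R^2 caused by dropping each descriptor still in the model.
        losses = {c: current - r_squared(X, y, [k for k in selected if k != c])
                  for c in selected}
        weakest = min(losses, key=losses.get)
        if losses[weakest] > max_loss:   # simplified removal criterion (assumption)
            break
        selected.remove(weakest)
        current -= losses[weakest]
    return selected, current

# Synthetic data: only descriptors 1 and 3 influence the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 2.0 * X[:, 1] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=40)
selected, r2 = backward_elimination(X, y)
print(selected, r2)
```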
Stepwise
• Stepwise. At each step, the independent
variable not in the equation that has the
smallest probability of F is entered, if that
probability is sufficiently small. Variables
already in the regression equation are
removed if their probability of F becomes
sufficiently large. The method terminates
when no more variables are eligible for
inclusion or removal.
Stepwise Example
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .704a   .496       .486                .59485
2       .762b   .581       .564                .54785
3       .810c   .655       .634                .50184
4       .834d   .695       .670                .47674
5       .824e   .679       .660                .48403
a. Predictors: (Constant), log p
b. Predictors: (Constant), log p, mw
c. Predictors: (Constant), log p, mw, dip
d. Predictors: (Constant), log p, mw, dip, mia
e. Predictors: (Constant), log p, dip, mia
Forward, Backward & Stepwise
variable selection methods
• Advantages
– Fast and simple
– Available in many software packages
• Limitation
– Risk of local minima
Genetic algorithm
[Figure: genetic algorithm search space]
Definition
A genetic algorithm is a general-purpose
search and optimization method, based on
genetic principles and Darwin’s law, that is
applicable to a wide variety of problems.
Darwin’s rules
Survival of fittest individuals
Recombination
Mutation
Biological background
• Chromosome
• Gene
• Reproduction
• Mutation
• Fitness
GA basic operation
• Population generation (chromosome)
• Selection (according to fitness)
• Recombination and mutation (offspring)
• Repetition
GA flow chart
• Initialize: population generation
• Evaluate: compute fitness for each chromosome
• Exploit: perform natural selection
• Explore: recombination & mutation operation
Binary Encoding
Every chromosome is a string of bits, 0 or 1
Chromosome A 1 0 1 1 0 0 1 1 1 0 0 0 0 1
Chromosome B 0 0 1 0 0 1 1 1 0 1 0 0 1 1
Selection
The best chromosome should
survive and create new offspring.
• Roulette wheel selection
• Rank selection
• Steady state selection
Roulette wheel selection
[Figure: roulette wheel with four slices sized by fitness, fitness 1 > 2 > 3 > 4]
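Roulette wheel selection can be sketched as drawing a random point on a wheel whose slices are proportional to fitness. The function name, the chromosome labels, and the fitness values 4, 3, 2, 1 are hypothetical; they only mirror the ordering 1 > 2 > 3 > 4 of the figure.

```python
import random
from collections import Counter

def roulette_wheel_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0, total)          # where the wheel stops
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if pick <= running:
            return individual
    return population[-1]                 # guard against floating-point round-off

# Hypothetical population of four chromosomes with decreasing fitness.
population = ["chrom1", "chrom2", "chrom3", "chrom4"]
fitness = [4.0, 3.0, 2.0, 1.0]

rng = random.Random(0)
counts = Counter(roulette_wheel_select(population, fitness, rng)
                 for _ in range(10000))
print(counts)
```

Fitter chromosomes are selected more often, but weaker ones still have a chance, which keeps diversity in the population.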
Crossover (binary encoding)
• Single point
11001011 + 11011111 = 11001111
• Two point
11001011 + 11011111 = 11011111
Mutation
• Bit inversion (binary encoding)
11001001 => 10001001
• Ordering change (permutation encoding)
(1 2 3 4 5 6 8 9 7) => (1 8 3 4 5 6 2 9 7)
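The two mutation examples correspond to flipping one bit and swapping two positions. A small sketch, with hypothetical function names and zero-based positions chosen to reproduce the examples:

```python
def bit_inversion(chromosome, position):
    """Flip the bit at the given zero-based position of a binary string."""
    flipped = "1" if chromosome[position] == "0" else "0"
    return chromosome[:position] + flipped + chromosome[position + 1:]

def swap_order(permutation, i, j):
    """Exchange the elements at zero-based positions i and j."""
    result = list(permutation)
    result[i], result[j] = result[j], result[i]
    return tuple(result)

mutated_bits = bit_inversion("11001001", 1)                          # second bit flipped
mutated_order = swap_order((1, 2, 3, 4, 5, 6, 8, 9, 7), 1, 6)       # 2 and 8 exchanged
print(mutated_bits, mutated_order)
```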
GA flow chart
Start
Population generation
Fitness
Selection
Replace
Crossover
Mutation
Test
End
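Putting the flow chart together for descriptor selection, the sketch below evolves binary chromosomes in which bit j switches descriptor j on or off. Several choices here are assumptions made for illustration and are not taken from the slides: adjusted R² of an MLR fit as the fitness, rank-based selection, single-point crossover, elitism, a fixed number of generations as the stopping test, and all parameter values.

```python
import numpy as np

def fitness(mask, X, y):
    """Adjusted R^2 of an MLR model built from the descriptors switched on in mask."""
    cols = np.flatnonzero(mask)
    n = len(y)
    if len(cols) == 0 or len(cols) >= n - 1:
        return -np.inf
    design = np.column_stack([np.ones(n), X[:, cols]])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coeffs
    r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return 1.0 - (1.0 - r2) * (n - 1) / (n - len(cols) - 1)

def genetic_selection(X, y, pop_size=30, generations=40,
                      crossover_rate=0.8, mutation_rate=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]
    population = rng.integers(0, 2, size=(pop_size, n_desc))  # population generation
    # Rank selection weights: the best-ranked chromosome gets the largest weight.
    weights = np.arange(pop_size, 0, -1, dtype=float)
    weights /= weights.sum()
    for _ in range(generations):                               # test: fixed generation count
        scores = np.array([fitness(ind, X, y) for ind in population])
        population = population[np.argsort(scores)[::-1]]      # best first
        children = [population[0].copy()]                      # elitism
        while len(children) < pop_size:
            pa, pb = population[rng.choice(pop_size, size=2, p=weights)]
            child = pa.copy()
            if rng.random() < crossover_rate:                  # single-point crossover
                point = rng.integers(1, n_desc)
                child[point:] = pb[point:]
            flips = rng.random(n_desc) < mutation_rate         # bit-inversion mutation
            child[flips] = 1 - child[flips]
            children.append(child)
        population = np.array(children)                        # replace
    scores = np.array([fitness(ind, X, y) for ind in population])
    return np.flatnonzero(population[np.argmax(scores)]), scores.max()

# Synthetic data: only descriptors 1 and 3 influence the response.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = 2.0 * X[:, 1] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=40)
selected, score = genetic_selection(X, y)
print(selected, score)
```

The result may include an extra descriptor that improves the adjusted R² by chance, which is one reason the selected model still needs validation.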
Parameters of GA
• Crossover rate
• Mutation rate
• Population size
• Selection type
• Encoding
• Crossover and mutation type
Advantages of GA
• Parallelism
• Provides a group of potential solutions
• Easy to implement
• Can locate global optima, with less risk of being trapped in local minima
How many descriptors can be used in a
QSAR model?
Rule of thumb:
- At least 5 data points (molecules) must
be present in the model per descriptor.
Otherwise the possibility of finding a
coincidental correlation is too high.
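As a small worked example of the rule of thumb, the largest admissible number of descriptors is the number of molecules divided by 5, rounded down. The function name is hypothetical; the value 55 is the number of compounds listed in the descriptive statistics.

```python
def max_descriptors(n_molecules, points_per_descriptor=5):
    """Largest number of descriptors allowed by the data-points-per-descriptor rule."""
    return n_molecules // points_per_descriptor

limit = max_descriptors(55)          # 55 compounds, as in the descriptive statistics
limit_listwise = max_descriptors(53)  # 53 compounds with no missing values
print(limit, limit_listwise)
```

The four-descriptor model from the forward selection example stays well inside this limit.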
Questions?