Analysis of High-Throughput Screening Data
I647
Fall 2006
Drug Discovery Process
• The key steps of drug discovery are:
– research - average 2 to 3 years
– pre-clinical testing - average 1 year
– clinical trial testing (involving human patients) - average 10 years
– regulatory approval - average 2 years
Drug Discovery Process:
Web Sites
• http://akosgmbh.de/Drug_discovery_process.htm
• http://www.ppdi.com/PPD_U7.htm
INTRODUCTION
• HTS allows hundreds of thousands of
compounds to be assayed very quickly
• HTS data characterized by:
– High volume
– High level of noise
– Diverse nature of the chemical classes
involved
– Possible presence of multiple binding modes
INTRODUCTION
• Select the most potent compounds to
progress to the next stage
• Problems:
– Functional groups that interfere with the assay
(e.g., fluoresce)
– Functional groups that react with biological
systems
• Catch these with substructure and “drug-likeness” filters
Techniques for Analysis of HTS
Data
• Standard statistical methods such as multiple linear regression or
partial least squares are generally not applicable
– Data sets are too large
• Data visualization
• Data reduction
• Data mining (if activity data is known)
HTS Methodology
• Procedure:
– Measure activity at different concentrations for
a subset of compounds
– Define IC50 (Inhibitory Concentration 50): the
concentration of a material estimated to inhibit
the biological endpoint of interest (e.g., cell
growth, ATP levels) by 50%
– A pure solid sample that tests positive has its
structure determined (hit-to-lead phase)
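The IC50 definition above can be illustrated with a minimal sketch. It assumes percent inhibition has been measured at a few increasing concentrations and simply interpolates on a log10 concentration axis between the two points that bracket 50%. The function name, the concentrations, and the inhibition values are all hypothetical; in practice a sigmoidal dose-response curve is usually fitted instead.

```python
import math

def estimate_ic50(concentrations, percent_inhibition):
    """Estimate IC50 by linear interpolation on a log10 concentration axis.

    Assumes concentrations are positive and sorted in ascending order.
    Returns None if the measurements never bracket 50% inhibition.
    """
    points = list(zip(concentrations, percent_inhibition))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        brackets_50 = (y_lo - 50.0) * (y_hi - 50.0) <= 0
        if brackets_50 and y_lo != y_hi:
            # Fraction of the way from the lower to the upper point
            frac = (50.0 - y_lo) / (y_hi - y_lo)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None

# Hypothetical dose-response data (concentration in micromolar)
conc = [0.01, 0.1, 1.0, 10.0, 100.0]
inhibition = [5.0, 20.0, 40.0, 60.0, 90.0]
ic50 = estimate_ic50(conc, inhibition)
print(f"Estimated IC50: {ic50:.2f} uM")
```

Interpolating on a log scale reflects the fact that dose-response concentrations are usually spaced in powers of ten.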
DATA VISUALIZATION
• Need to display simultaneously large data
sets with many thousands of molecules
and their properties
• Typical software packages:
– Draw various kinds of graphs
– Color selected properties
– Calculate simple statistics
• HTS data sets may be divided into subsets
to aid navigation
SpotFire DecisionSite
• DecisionSite Examples
http://www.spotfire.com/
Features of Data Visualization
• Often combined with structure searching
to find compounds with certain features
• Unsupervised methods – don’t use activity
data
• Supervised methods – incorporate activity
data
• Use of molecular descriptors
Non-Linear Mapping
• Descriptors:
– Physicochemical properties
– Fingerprints: a Boolean array in which the
meaning of each bit is not predefined
• A list of patterns is generated for:
– Each atom
– Each pair of adjacent atoms and the bond connecting them
– Each group of atoms joined by longer pathways
– Substructural fragments
– Known activity against related targets
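A hashed fingerprint of the kind described above can be sketched as follows. This is an illustration only: the patterns are given as hypothetical hand-written strings rather than being enumerated from a real structure, and CRC32 is an arbitrary choice of hash function. Because each pattern is hashed to a bit position, no bit has a predefined meaning, and two patterns may share a bit.

```python
import zlib

def hashed_fingerprint(patterns, n_bits=64):
    """Map each pattern string to one position in a Boolean array.

    n_bits and the CRC32 hash are illustrative assumptions.
    """
    bits = [False] * n_bits
    for pattern in patterns:
        position = zlib.crc32(pattern.encode("utf-8")) % n_bits
        bits[position] = True
    return bits

# Hypothetical patterns for ethanol: atoms, adjacent pairs, one longer path
ethanol_patterns = ["C", "O", "C-C", "C-O", "C-C-O"]
fp = hashed_fingerprint(ethanol_patterns)
print(sum(fp), "bits set out of", len(fp))
```

A molecule whose patterns are a subset of another's will have a fingerprint whose set bits are a subset too, which is what makes such fingerprints useful as a fast screen before substructure searching.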
Non-Linear Mapping (cont’d)
• Non-Linear Mapping takes
multidimensional data to a lower-dimensional space (2- or 3-dimensional)
• Multidimensional scaling
– Generate initial set of coordinates in the low-dimensional space
– Modify the coordinates using optimization
procedures
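The two steps above (initial coordinates, then optimization) can be sketched in plain Python. This is a minimal version under stated assumptions: it minimizes an unweighted squared-difference stress by gradient descent with step halving, which is one simple choice and not necessarily the procedure any particular package uses. The descriptor values are hypothetical.

```python
import math
import random

def pairwise_distances(points):
    return [[math.dist(p, q) for q in points] for p in points]

def stress(target, coords):
    """Sum of squared differences between target and mapped distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def stress_gradient(target, coords):
    n, dim = len(coords), len(coords[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # coincident points: direction undefined, skip
            factor = 2.0 * (d - target[i][j]) / d
            for k in range(dim):
                grad[i][k] += factor * (coords[i][k] - coords[j][k])
    return grad

def mds(target, dim=2, iterations=200, step=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial coordinates in the low-dimensional space
    coords = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in target]
    current = stress(target, coords)
    # Step 2: modify the coordinates, keeping only moves that lower the stress
    for _ in range(iterations):
        grad = stress_gradient(target, coords)
        trial = [[x - step * g for x, g in zip(row, g_row)]
                 for row, g_row in zip(coords, grad)]
        trial_stress = stress(target, trial)
        if trial_stress < current:
            coords, current = trial, trial_stress
        else:
            step *= 0.5
    return coords, current

# Hypothetical data: 5 molecules described by 4 scaled descriptors
data = [[0.1, 0.9, 0.3, 0.5], [0.2, 0.8, 0.4, 0.4], [0.9, 0.1, 0.7, 0.2],
        [0.8, 0.2, 0.6, 0.3], [0.5, 0.5, 0.5, 0.9]]
target = pairwise_distances(data)
coords, final_stress = mds(target)
print("final stress:", round(final_stress, 4))
```

Accepting a move only when it lowers the stress keeps the sketch stable without tuning the step size.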
DATA MINING METHODS
• Construct models that enable the
establishment of relationships between the
structures and the observed activity
• Simple division of structures is desirable:
– Active vs. inactive
– High, medium, or low activity classes
Data Mining Methods: Techniques
• Substructural analysis: weight each substructural fragment
according to how often it occurs in molecules designated active or inactive
$$W_i = \frac{\mathrm{act}_i}{\mathrm{act}_i + \mathrm{inact}_i}$$
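As a small worked example of the weight formula, take act_i as the number of active molecules that contain fragment i and inact_i as the number of inactive molecules that contain it. The sketch below computes the weights from a hypothetical labelled set. Scoring a new molecule by summing the weights of its fragments is a common convention, treated here as an assumption; the fragment names are made up.

```python
from collections import Counter

def fragment_weights(molecules):
    """molecules: list of (set of fragment names, is_active) pairs."""
    active, inactive = Counter(), Counter()
    for fragments, is_active in molecules:
        counts = active if is_active else inactive
        counts.update(set(fragments))  # count each fragment once per molecule
    all_fragments = set(active) | set(inactive)
    return {f: active[f] / (active[f] + inactive[f]) for f in all_fragments}

def score(fragments, weights):
    """Sum of weights over the fragments seen in the training set."""
    return sum(weights[f] for f in set(fragments) if f in weights)

# Hypothetical training set
training = [
    ({"nitro", "phenyl"}, True),
    ({"nitro"}, True),
    ({"phenyl", "amide"}, False),
    ({"amide"}, False),
    ({"nitro", "amide"}, False),
]
weights = fragment_weights(training)
print(sorted(weights.items()))
print("score:", score({"nitro", "phenyl"}, weights))
```

Here "nitro" appears in 2 active and 1 inactive molecule, so its weight is 2/3, while "amide" appears only in inactives and gets weight 0.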
Data Mining Techniques
• Discriminant Analysis: aims to separate
the molecules into constituent classes
– In the simplest case, linear discriminant analysis works
with two variables and two activity classes
• A straight line separates the data into two regions so that
the maximum number of molecules is classified correctly
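The two-variable, two-class case can be sketched with Fisher's linear discriminant. Note the assumption: Fisher's criterion chooses the line from the class means and the pooled within-class scatter, which is one standard way to obtain the separating line, not a direct maximization of the number of correct classifications. The descriptor values are hypothetical.

```python
def mean(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, centre):
    """2x2 scatter matrix of the points around their class mean."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - centre[0], p[1] - centre[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fit_lda(actives, inactives):
    m_act, m_inact = mean(actives), mean(inactives)
    s_a, s_i = scatter(actives, m_act), scatter(inactives, m_inact)
    sw = [[s_a[a][b] + s_i[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m_act[0] - m_inact[0], m_act[1] - m_inact[1]]
    # Direction of the discriminant: inverse scatter times mean difference
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Place the boundary at the midpoint between the two class means
    mid = [(m_act[0] + m_inact[0]) / 2, (m_act[1] + m_inact[1]) / 2]
    threshold = w[0] * mid[0] + w[1] * mid[1]
    return w, threshold

def predict_active(point, w, threshold):
    return w[0] * point[0] + w[1] * point[1] > threshold

# Hypothetical two-descriptor data
actives = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (3.5, 4.5)]
inactives = [(0.0, 0.5), (1.0, 1.0), (0.5, 0.0), (1.5, 1.5)]
w, threshold = fit_lda(actives, inactives)
print(predict_active((3.0, 4.0), w, threshold))
```

The line w·x = threshold is the straight-line boundary; points on one side are predicted active and points on the other inactive.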
Data Mining Techniques
• Neural Networks – need a training set of
data
• Once trained, the program predicts values
for new molecules
• Examples: feed-forward network and
Kohonen network (self-organizing map)
• Problem: over-training gives excellent
results on the training data, but poor results on
unseen data
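To make "once trained, the program predicts values" concrete, here is a minimal sketch of the forward pass of a feed-forward network with one hidden layer. The weights, biases, and descriptor values are hypothetical placeholders standing in for the result of training; the training step itself is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(descriptors, hidden_weights, hidden_biases, output_weights, output_bias):
    """Predict an activity score in (0, 1) for one molecule."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, descriptors)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical "trained" parameters: 3 descriptors, 2 hidden units, 1 output
hidden_weights = [[0.8, -0.4, 0.1], [-0.3, 0.9, 0.5]]
hidden_biases = [0.0, -0.2]
output_weights = [1.2, -0.7]
output_bias = 0.1

new_molecule = [0.5, 0.1, 0.9]  # hypothetical scaled descriptor values
prediction = forward(new_molecule, hidden_weights, hidden_biases,
                     output_weights, output_bias)
print(round(prediction, 3))
```

Over-training would show up as predictions that match the training molecules closely while scoring held-out molecules poorly, which is why performance is normally checked on data not used for fitting.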
Data Mining Techniques
• Decision Trees
– Rules associate specific molecular features and/or
descriptor values with the activity or property
of interest
– Start with the entire data set and identify the
descriptor or variable that gives the best split
– Follow the procedure until no more splits are
possible or desirable
– Some methods consider multiple splits at each node
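The "best split" step can be sketched as a search over every descriptor and every candidate threshold. The slides do not say how split quality is measured; Gini impurity is assumed here as one common choice. The descriptor values (molecular weight, logP) and the activity labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of Boolean activity labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Return (weighted impurity, descriptor index, threshold) of the best binary split."""
    n = len(rows)
    best = None
    for d in range(len(rows[0])):
        values = sorted(set(row[d] for row in rows))
        # Candidate thresholds lie midway between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [lab for row, lab in zip(rows, labels) if row[d] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[d] > threshold]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[0]:
                best = (impurity, d, threshold)
    return best

# Hypothetical data: [molecular weight, logP] per molecule, with activity labels
rows = [[250, 1.0], [300, 3.5], [350, 1.5], [400, 4.0], [450, 2.0], [500, 4.5]]
labels = [False, True, False, True, False, True]
impurity, descriptor, threshold = best_split(rows, labels)
print(descriptor, threshold, impurity)
```

A full tree would apply `best_split` recursively to the two resulting subsets until no further split is possible or desirable.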
SUMMARY
• Much interest and research on HTS
analysis
• New techniques being applied (e.g.,
support vector machines)
• Analysis of large diverse data sets needs
the most work
• Results need to feed into subsequent
analysis