Probability Density Based Indexing for High

Math Models for Learning
and Discovery
Kristin P. Bennett
Mathematical Sciences Department
Rensselaer Polytechnic Institute
1/14/03
The Learning Problem
The problem of understanding intelligence
is said to be the greatest problem in
science today and “the” problem for
this century – as deciphering the genetic
code was for the second half of the last
one…the problem of learning represents a
gateway to understanding intelligence in
man and machines.
-- Tomaso Poggio and Steve Smale, 2003
What do these problems
have in common?
Design and Discovery of Pharmaceuticals
Target Marketing in Business
Diagnosis of Breast Cancer
Discovery of Novel Superconductors
Detection of Anthrax using TZ spectroscopy
Modeling and predicting global trade
RNA Transcription
DRUG TRIVIA
• In the USA, $25B/yr is spent on R&D of pharmaceuticals (33% on clinicals)
• Worth their weight in gold
• 10-15 years from conception to market for a drug
• Development cost: $0.5B/drug
• First-year sales > $1B/drug
• 1 drug approved per 5,000 compounds tested
• 1 out of 100 drugs succeeds to market
• 19 Alzheimer’s drugs in development
• 20,000,000 Americans with Alzheimer’s by 2050
TOWARDS TREATING THE HIV EPIDEMIC
HIV Reverse-Transcriptase Inhibition modeling:
Have a few Molecules that have been tested:
[Figure: chemical structure diagrams of the tested molecules]
Can we predict whether a new molecule will inhibit HIV?
What do we know?
• The bioactivities of a small set of molecules
• Many possible descriptors for each molecule:
Molecular Weight
Electrostatic Potential
Ionization Potential
• Can we predict a molecule’s bioactivity?
Database Marketing
A bank has a $1.7 billion portfolio
of home mortgages.
When a customer refinances,
the bank may lose that customer.
Question: will a customer
refinance?
If so, offer that customer a
good deal on refinancing.
What do we know?
For many customers, we know if they
refinanced or not.
We know attributes of customer:
Income
Age
Residential Area
Payment History
Can we predict behavior of future customers?
Breast Cancer Diagnosis
Fine needle aspirate of breast tumor.
Is tumor benign or malignant?
What do we know?
For patients in initial study, we know
whether tumor was benign or malignant.
Have a digital image of tumor aspirate.
Know characteristics doctors look at:
Uniformity of cell shape
Uniformity of cell size
Cell Mitosis
Superconductivity
Superconductivity is the ability of a
material to conduct current with no
resistance and extremely low loss.
A few high temperature
superconductors have been found.
What other compounds are
superconductors?
Applications of
Superconductivity:
Magnetic Resonance Imaging
Applications of
Superconductivity
Maglev Trains
Applications of
Superconductivity
Very small and efficient motors
Better power transmission cables
Better cellular phone service
Find a cheap high-temperature superconductor
and you will get the NOBEL PRIZE.
What do we know?
Many compounds have been tested to see
if they are superconductors.
Many descriptors exist for these
compounds based on molecular
properties.
What do all these problems
have in common?
Each problem
Can be posed as a “yes” or “no” question.
Has examples known to be of the “yes”
type or the “no” type.
Each example has an associated set of
descriptors.
Learn a classification function!
Data Mining
Each problem has data.
Our job is to “mine” information from this
data.
Information depends on the question
asked.
In this case we must produce a predictive
yes/no model (a.k.a. a classification
model) based on the data.
Mathematical Model
Have data: $(x_1, y_1), \ldots, (x_m, y_m)$
Construct a predictive function $f(x) \approx y$
Solve a mathematical model to find $f$:
$$\min_f \; \sum_{i=1}^{m} \left( f(x_i) - y_i \right)^2 + \lambda \, \|f\|_K^2$$
Want $f$ to generalize well on future data
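One way to make this model concrete is the sketch below. It assumes the objective is a sum of squared errors plus a weighted kernel-norm penalty on f, that f is written as a kernel expansion f(x) = sum_i alpha_i K(x, x_i), and that K is a Gaussian kernel. Under those assumptions the minimizer satisfies the linear system (K + lam * I) alpha = y. The names `fit`, `predict`, `lam`, and `sigma`, the choice of kernel, and the toy data are all illustrative and do not come from the slides. Only the standard library is used so the block runs anywhere; in practice a numerical library would do the linear solve.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Assumed kernel: K(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve_linear_system(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def fit(X, y, lam=0.1, sigma=1.0):
    """Return alpha solving (K + lam * I) alpha = y."""
    m = len(X)
    K = [[gaussian_kernel(X[i], X[j], sigma) for j in range(m)] for i in range(m)]
    for i in range(m):
        K[i][i] += lam  # the regularization term makes the system well conditioned
    return solve_linear_system(K, y)

def predict(X, alpha, x_new, sigma=1.0):
    """Evaluate f(x_new) = sum_i alpha_i K(x_new, x_i)."""
    return sum(a * gaussian_kernel(x_new, xi, sigma) for a, xi in zip(alpha, X))

# Invented toy data: y = sin(x) sampled at five points.
X = [[-2.0], [-1.0], [0.0], [1.0], [2.0]]
y = [math.sin(x[0]) for x in X]
alpha = fit(X, y, lam=0.01, sigma=1.0)
print([round(predict(X, alpha, x), 3) for x in X])
```

A larger `lam` pulls f toward zero and smooths it. A smaller `lam` fits the training data more closely, which may hurt generalization on future data.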
Types of Learning Problems
• Classification: $y_i = 1$ or $-1$
• Regression: $y_i \in \mathbb{R}$
• Clustering: $y_i$ unknown
• Ranking: $y_1 > y_2,\ y_k > y_j,\ \ldots$
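The four problem types differ mainly in what is known about the targets y. The sketch below uses invented toy values to show one plausible encoding of each. The helper `ranking_pairs` is hypothetical, and it assumes ranking information arrives as strict pairwise orderings.

```python
def ranking_pairs(scores):
    """Return index pairs (i, j) meaning item i should rank above item j."""
    return [
        (i, j)
        for i, si in enumerate(scores)
        for j, sj in enumerate(scores)
        if si > sj
    ]

# Invented toy targets for three examples.
classification_labels = [1, -1, 1]    # each y_i is +1 ("yes") or -1 ("no")
regression_targets = [0.7, 2.3, -1.1]  # each y_i is a real number
clustering_labels = None               # y_i unknown: only the descriptors x_i are given
preference_scores = [3, 1, 2]          # used only to derive the order constraints

print(ranking_pairs(preference_scores))
```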
Data Mining
Classification = yes/no models
Start with examples of yes and no.
Associate a set of descriptors with each
example. Descriptors must be
appropriate for the question you are
asking.
Construct a model to split the two sets
Use the model to predict new examples.
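The workflow above (labeled examples, descriptors, a model that splits the two sets, prediction on new examples) can be sketched in a few lines. The slides do not name an algorithm at this point, so the perceptron below is only a simple stand-in for "a model to split the two sets". The descriptor values are invented toy data.

```python
def train_perceptron(examples, labels, epochs=100):
    """Find weights w and bias b so that sign(w.x + b) matches the +1/-1 labels."""
    w = [0.0] * len(examples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(examples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # the two sets are split; stop early
            break
    return w, b

def classify(w, b, x):
    """Return +1 ("yes") or -1 ("no") for a new example x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Invented toy data: two descriptors per example, "yes" = +1 and "no" = -1.
examples = [[2.0, 1.0], [3.0, 2.5], [2.5, 3.0],
            [-1.0, -2.0], [-2.0, -1.5], [-3.0, -0.5]]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(examples, labels)
print(classify(w, b, [2.0, 2.0]), classify(w, b, [-2.0, -2.0]))
```

The perceptron only stops making mistakes when a straight line (or hyperplane) can separate the two sets. That is one reason the descriptors must suit the question being asked.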
Learning Model
• What kind of learning task is it?
• What sort of $f$ should we use?
  • Kernel function: $f(x) = \sum_i \alpha_i K(x, x_i)$
• What loss function to use?
• What regularization function?
• How can we solve this learning model?
• How well will the model predict new points?
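To show how these choices fit together, here is a minimal sketch that evaluates a kernel expansion and scores it under two candidate loss functions plus a kernel-norm regularizer. The Gaussian kernel, the squared and hinge losses, the weight `lam`, and the toy data are all assumptions chosen for illustration. The slides pose these as open design questions and do not prescribe answers.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    """One possible kernel: K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def f(alpha, X, x, kernel=gaussian_kernel):
    """Kernel expansion f(x) = sum_i alpha_i K(x, x_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X))

def squared_loss(fx, y):
    """Penalizes any deviation of f(x) from y; common for regression."""
    return (fx - y) ** 2

def hinge_loss(fx, y):
    """Zero once y * f(x) >= 1; common for +1/-1 classification."""
    return max(0.0, 1.0 - y * fx)

def kernel_norm_squared(alpha, X, kernel=gaussian_kernel):
    """Regularizer: sum_i sum_j alpha_i alpha_j K(x_i, x_j)."""
    return sum(
        ai * aj * kernel(xi, xj)
        for ai, xi in zip(alpha, X)
        for aj, xj in zip(alpha, X)
    )

def objective(alpha, X, y, loss, lam=0.1, kernel=gaussian_kernel):
    """Total loss on the data plus lam times the regularizer."""
    data_fit = sum(loss(f(alpha, X, xi, kernel), yi) for xi, yi in zip(X, y))
    return data_fit + lam * kernel_norm_squared(alpha, X, kernel)

# Invented toy data and coefficients.
X = [[0.0], [1.0], [3.0]]
y = [1, 1, -1]
alpha = [0.5, 0.5, -1.0]
for name, loss in [("squared", squared_loss), ("hinge", hinge_loss)]:
    print(name, round(objective(alpha, X, y, loss), 4))
```

Swapping the `loss` argument changes the learning task without touching the form of f. How to solve the resulting problem then depends on the loss: squared loss leads to a linear system, while hinge loss needs an optimization method.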
Class information
See course web page
http://www.rpi.edu/~bennek/class/mmld/index.htm