Transcript Lecture 4

CSC 599: Computational
Scientific Discovery
Lecture 4: Machine
Learning and Model Search
Outline

Computational Reasoning in Science, cont'd
  Computer Algebra
  Bayesian nets
Brief introduction to Artificial Intelligence
  Search space and search operators
  Newell's models of intelligence
Brief introduction to Machine Learning
  Error, precision and accuracy
  Overfitting
Computational scientific discovery vs. Machine Learning
  Importance of sticking to paradigm
CSD vs ML: The take-home message
Computer Algebra
Forget numbers!
Q: Have an ungodly amount of algebra to do?

Physics, engineering
A: Try a Computer Algebra System (CAS)!
For algebraic symbol manipulation

Examples:


Mathematica
Maple
(Compare: Numerical methods & stats packages)


Do “number crunching”
Examples:


Matlab, Mathematica
SAS, SPSS
Bayesian Networks
Idea

Complexity:
  Lots of variables
  Non-deterministic environment
Simplicity:
  Patterns of influence between variables
  Bayesian net encodes influence patterns
Example:

Variables:
a) Prof assigns homework? (true or false)
b) TA assigns homework? (true or false)
c) Will your weekend be busy? (true or false)
Bayesian Networks (2)
Example:
pr=prof, ta=TA, b=busy
p(pr) = 0.6           p(-pr) = 0.4
p(ta|pr) = 0.1        p(-ta|pr) = 0.9
p(ta|-pr) = 0.9       p(-ta|-pr) = 0.1
p(b|ta,pr) = 0.99     p(-b|ta,pr) = 0.01
p(b|-ta,pr) = 0.8     p(-b|-ta,pr) = 0.2
p(b|ta,-pr) = 0.9     p(-b|ta,-pr) = 0.1
p(b|-ta,-pr) = 0.1    p(-b|-ta,-pr) = 0.9
Bayesian Networks (3)
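P(pr=T | b=T)
= P(b=T, pr=T) / P(b=T)
= [sum over ta of P(b=T, ta, pr=T)] / [sum over ta and pr of P(b=T, ta, pr)]
where each joint term is P(b|ta,pr) * P(ta|pr) * P(pr):
  b=T, ta=T, pr=T: 0.99 * 0.1 * 0.6 = 0.0594
  b=T, ta=F, pr=T: 0.8 * 0.9 * 0.6 = 0.432
  b=T, ta=T, pr=F: 0.9 * 0.9 * 0.4 = 0.324
  b=T, ta=F, pr=F: 0.1 * 0.1 * 0.4 = 0.004
= (0.0594 + 0.432) / (0.0594 + 0.432 + 0.324 + 0.004)
= 0.4914 / 0.8194
= 0.599707103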
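The hand calculation above is inference by enumeration: sum the joint probability over the unobserved variable. A minimal Python sketch of the same computation follows; the table values come from the example, while the names `P_PR`, `P_TA`, `P_B`, `joint` and `posterior_prof_given_busy` are made up for illustration.

```python
from itertools import product

# Conditional probability tables from the example (pr = prof, ta = TA, b = busy).
P_PR = 0.6                       # p(pr)
P_TA = {True: 0.1, False: 0.9}   # p(ta | pr), keyed by pr
P_B = {                          # p(b | ta, pr), keyed by (ta, pr)
    (True, True): 0.99,
    (False, True): 0.8,
    (True, False): 0.9,
    (False, False): 0.1,
}


def joint(b, ta, pr):
    """P(b, ta, pr) = P(b | ta, pr) * P(ta | pr) * P(pr)."""
    p_pr = P_PR if pr else 1 - P_PR
    p_ta = P_TA[pr] if ta else 1 - P_TA[pr]
    p_b = P_B[(ta, pr)] if b else 1 - P_B[(ta, pr)]
    return p_b * p_ta * p_pr


def posterior_prof_given_busy():
    """P(pr=T | b=T), summing out ta in the numerator and ta, pr in the denominator."""
    num = sum(joint(True, ta, True) for ta in (True, False))
    den = sum(joint(True, ta, pr) for ta, pr in product((True, False), repeat=2))
    return num / den


result = posterior_prof_given_busy()
```

Here `result` comes out at about 0.5997, in agreement with the value worked out by hand.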
Bayesian Networks (4)
Q: That's a lot of work! Can't we get the network
to simplify things?
A: Yes, D-separation!
Two sets of nodes X,Y are d-separated given Z if every path between
a node i in X and a node j in Y contains a node M such that one of
these holds:
1. M is in Z and is the middle node (chain):
i --> M --> j
Intuition: if I know M, knowing i doesn't tell me any more about j
2. M is in Z and is the middle node (fork):
i <-- M --> j
Intuition: if I know common cause M, knowing 1st result i doesn't
tell me any more about 2nd result j
3. M is NOT in Z (and neither are any of its descendants) and is the
common result (collider):
i --> M <-- j
Intuition: without knowing M, i tells me nothing about j. If I did
know i and the common result M, then i would already explain M,
which would justify believing less in j.
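The third (common result) case is the least intuitive, so a small numeric check may help. The sketch below uses a hypothetical three-node net i --> m <-- j with made-up probabilities (`P_I`, `P_J`, `P_M` are assumptions, not values from the lecture) and computes conditionals by brute-force enumeration.

```python
from itertools import product

# Hypothetical numbers: i and j are independent causes of a common result m.
P_I = 0.3
P_J = 0.2
P_M = {  # P(m=True | i, j), keyed by (i, j)
    (True, True): 0.95,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.05,
}


def joint(i, j, m):
    """P(i, j, m) = P(i) * P(j) * P(m | i, j)."""
    p_i = P_I if i else 1 - P_I
    p_j = P_J if j else 1 - P_J
    p_m = P_M[(i, j)] if m else 1 - P_M[(i, j)]
    return p_i * p_j * p_m


def prob(query, evidence):
    """P(query | evidence); both are dicts over the names 'i', 'j', 'm'."""
    num = den = 0.0
    for i, j, m in product((True, False), repeat=3):
        world = {"i": i, "j": j, "m": m}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(i, j, m)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


p_j = prob({"j": True}, {})
p_j_given_i = prob({"j": True}, {"i": True})                  # m not observed
p_j_given_m = prob({"j": True}, {"m": True})                  # m observed
p_j_given_m_i = prob({"j": True}, {"m": True, "i": True})     # m and i observed
```

With m unobserved, `p_j_given_i` equals `p_j`, so i and j are d-separated. Once m is observed, also learning i lowers the belief in j (`p_j_given_m_i` is smaller than `p_j_given_m`), which is the "explaining away" effect the intuition describes.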
An A.I. researcher's worldview
Problems are divided into
1. Those solvable by “algorithms”


Algorithm = do these steps and you are guaranteed
to get the answer in a “reasonable” time
Classic examples: searching and sorting
2. Those that aren't



No way to guarantee you will get an answer (in
polynomial time)
Q: What do you do?
A: Search for one!
A.I. Worldview (2)
Example of an “A.I.” problem: Chess





Can you guarantee that you will always win at
chess?
Can you guarantee that you will (at least) never
lose?
No?
Well, that makes it interesting!
Compare with Tic-Tac-Toe


You can guarantee that you will never lose
(That's why only children play it)
A.I. Worldview (3)
A.I. paradigm for searching for a solution
Remember: no “algorithm” for obtaining answer

Need to search for one:
States:

Configurations of the world
Operators:

Define legal transitions from one state to another

Example:
  white knight g1->f3
  white pawn c2->c4
A.I. Worldview (4)
State space (or search space)

Space of states reachable by operators from
initial state
A.I. Worldview (5)
Goal state


One or more states that have the configuration
that you want
In Chess: Checkmate!
A.I. Worldview (6)
A.I. pioneer Allen Newell's view of intelligence

A given level of “intelligence” achievable with
a) Lots of knowledge and little search (Chess grandmaster)
b) Little knowledge and lots of search (“stupid” program)
c) Some knowledge and some search (“smart” program)
A.I. Worldview (7)
Idea
1. Start at initial state
2. Apply operators to traverse search space
3. Hope to arrive at goal state
4. Issues:



How quickly can you find the answer? (time!)
How much memory do you need? (space!)
How good is your goal state?


Optimal = shortest path (fewest steps)?
Optimal = least total arc cost?
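The states/operators/goal picture can be sketched in a few lines. The example below is an illustrative toy, not a chess engine: it assumes a lone knight on an otherwise empty board, treats each square as a state and each legal knight jump as an operator, and uses breadth-first search to reach a goal square in the fewest moves. The names `successors` and `breadth_first` are made up for this sketch.

```python
from collections import deque

FILES = "abcdefgh"
KNIGHT_JUMPS = [(1, 2), (2, 1), (2, -1), (1, -2),
                (-1, -2), (-2, -1), (-2, 1), (-1, 2)]


def successors(square):
    """Operators: every legal knight move from a square such as 'g1'."""
    f, r = FILES.index(square[0]), int(square[1]) - 1
    for df, dr in KNIGHT_JUMPS:
        nf, nr = f + df, r + dr
        if 0 <= nf < 8 and 0 <= nr < 8:
            yield FILES[nf] + str(nr + 1)


def breadth_first(start, goal):
    """Return a shortest list of states from start to goal, or None."""
    frontier = deque([start])
    parent = {start: None}          # also serves as the set of visited states
    while frontier:
        state = frontier.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


path = breadth_first("g1", "e5")
```

The `parent` dictionary is the memory cost and the number of expanded states is the time cost; both grow quickly with the size of the state space, which is why the issues listed above matter.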
A.I. Worldview (8)
Tools

Uninformed search
  Depth 1st
  Breadth 1st
  Uniform cost (best 1st where best = least cost so far)
  Iterative deepening depth 1st
Informed search
  Heuristic function tells “desirability” of each node
  Greedy (best 1st where best = least estimated cost to goal)
  A* (best 1st where best = cost so far + estimated cost to goal)
Search from:
  Initial state to goal state(s)
  Goal state to initial state(s)
  Both directions
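Uniform cost, greedy and A* differ only in how they rank the frontier, which a single best-first routine with a pluggable priority can show. The graph `GRAPH` and heuristic `H` below are hypothetical values chosen for illustration (the heuristic never overestimates the true cost to G), and `best_first` is an assumed name.

```python
import heapq

# Hypothetical graph: node -> list of (neighbor, arc cost).
GRAPH = {
    "S": [("A", 1), ("B", 4)],
    "A": [("B", 2), ("G", 6)],
    "B": [("G", 2)],
    "G": [],
}
# Hypothetical heuristic: estimated remaining cost to the goal G.
H = {"S": 4, "A": 3, "B": 2, "G": 0}


def best_first(start, goal, priority):
    """Best-first search; priority(g, node) ranks frontier entries (lowest first)."""
    frontier = [(priority(0, start), 0, start, [start])]
    expanded = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in expanded:
            continue
        expanded.add(node)
        for nxt, cost in GRAPH[node]:
            if nxt not in expanded:
                g2 = g + cost
                heapq.heappush(frontier, (priority(g2, nxt), g2, nxt, path + [nxt]))
    return None


uniform_cost = best_first("S", "G", lambda g, n: g)         # least cost so far
greedy = best_first("S", "G", lambda g, n: H[n])            # least estimated cost to goal
a_star = best_first("S", "G", lambda g, n: g + H[n])        # sum of the two
```

On this toy graph greedy search heads straight for B and returns a path of cost 6, while uniform cost and A* both return the cheaper path S, A, B, G of cost 5.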
Machine Learning and A.I.
ML goals
Find some data structure that permits better
performance on some set of problems
  Prediction
  Conciseness
  Some combination thereof
What about coefficient-finding numerical
methods?
They're “algorithms” (in the A.I. sense)!
1. Stuff in the data
2. Turn the crank
3. O(n^3) time later, out comes the answer
ML example Decision Tree
learning:
Task: Build a decision tree that predicts a class
Leaves = guessed class
Non-leaf nodes = tests on attribute variables
Each edge to child represents one or more attr. values
ML example Decision Tree
learning (2)
Approach
Greedy search
1. Use information theory to find best attribute to split
data
2. Split data on that attribute
3. Recursively continue until either:
a) No more attributes to split on (label with majority class)
b) All instances are in same class (label with that class)
ML example Decision Tree
learning (3)
A bit of information theory:




Ci = some class value to guess
S = some set of examples
freq(Ci,S) = how many Ci's are in S
size(S) = size of S
Intuition
k choices C1 . . Ck
How much information needed to specify one Ci from S?
Not many Ci's (≈ 0)? On average few bits
Each occurrence costs more than 1 bit
but not many occurrences
Lots of Ci's (≈ size(S))? Not many bits
Each occurrence less than 1 bit (good default guess)
Some Ci's (≈ size(S)/2)? About 1 bit
About 1 bit each, occur about ½ the time
ML example Decision Tree
learning (4)
Prob choose one class value from set:
freq(Ci,S)/size(S)
Information to specify one Ci in S:
-lg( freq(Ci,S)/size(S) ) bits
For expected information multiply by class
proportions
info(S) =
- sum(i=1 to k): freq(Ci,S)/size(S) * lg(freq(Ci,S)/size(S))
ML example Decision Tree
learning (5)
Let's get an intuition:
Case 1: Every member of S is in C1, none in C2
size(S) = 10, freq(C1,S) = 10, freq(C2,S) = 0:
Therefore:
info(S)
= - sum(i=1,2): freq(Ci,S)/size(S) * lg(freq(Ci,S)/size(S))
= - [ (10/10) * lg(10/10)] - [(0/10) * lg(0/10)]
= -0 - 0 = 0 (taking 0 * lg(0) as 0 by convention)
Intuition:
“If we know that we're dealing with S, then we know that all of
its members are in C1. No need to specify which is C1
and which is C2”
ML example Decision Tree
learning (6)
Let's get an intuition (cont'd):
Case 2: Half the members of S are in C1, half in C2
size(S) = 10, freq(C1,S) = 5, freq(C2,S) = 5:
Therefore:
info(S)
= - sum(i=1,2): freq(Ci,S)/size(S) * lg(freq(Ci,S)/size(S))
= - [ (5/10) * lg(5/10)] - [(5/10) * lg(5/10)]
= -2 * (0.5 * -1) = 1
Intuition:
“If we know that we're dealing with S, then it's a 50-50 guess
which members belong to C1 and which to C2. Need to
specify which (no compression possible)”
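The info(S) formula and the two cases above can be checked with a short function. This is a minimal sketch: the function takes the class counts freq(Ci,S) as a list, and it assumes the usual convention that a class with zero members contributes zero bits.

```python
from math import log2


def info(class_counts):
    """Expected bits needed to specify the class of a member of S.

    class_counts holds freq(Ci, S) for each class Ci; their sum is size(S).
    """
    size = sum(class_counts)
    bits = 0.0
    for freq in class_counts:
        if freq > 0:                 # convention: 0 * lg(0) counts as 0
            p = freq / size
            bits -= p * log2(p)
    return bits


case1 = info([10, 0])   # every member in C1
case2 = info([5, 5])    # half in C1, half in C2
```

`case1` is 0 bits and `case2` is 1 bit, as in the two worked cases.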
ML example Decision Tree
learning (7)
Recall the plan: select “best” attr to partition on
“best” = best separates the classes
Information gain for some attribute:
gain(attr)
= (ave info needed to spec. a class) - (ave info needed to spec. a class after partition by attr)
= info(T) - info_attr(T)
When info_attr(T) is small, classes are well separated (big gain!)
where:
n = number of attribute values
Ti = subset of T whose members all have the same attr value vi
info_attr(T) = sum(i=1,n): size(Ti)/size(T) * info(Ti)
ML example Decision Tree
learning (8)
Example data (should we play tennis?)
Outlook   Temp  Humidity  Windy  PlayTennis?
sunny     75    70        true   yes
sunny     80    90        true   no
sunny     85    85        false  no
sunny     72    95        false  no
sunny     69    70        false  yes
overcast  72    90        true   yes
overcast  83    78        false  yes
overcast  64    65        true   yes
overcast  81    75        false  yes
rain      71    80        true   no
rain      65    70        true   no
rain      75    80        false  yes
rain      68    80        false  yes
rain      70    96        false  yes
ML example Decision Tree
learning (9)
info(PlayTennis):
= -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits
info_outlook(PlayTennis):
= 5/14 * (-2/5 * lg(2/5) - 3/5 * lg(3/5)) +
4/14 * (-4/4 * lg(4/4) - 0/4 * lg(0/4)) +
5/14 * (-3/5 * lg(3/5) - 2/5 * lg(2/5))
= 0.694 bits
gain(outlook) = 0.940 – 0.694 = 0.246 bits
ML example Decision Tree
learning (10)
info(PlayTennis):
= -9/14 * lg(9/14) - 5/14 * lg(5/14) = 0.940 bits
info_windy(PlayTennis):
= 6/14 * (-3/6 * lg(3/6) - 3/6 * lg(3/6)) +
8/14 * (-6/8 * lg(6/8) - 2/8 * lg(2/8))
= 0.892 bits
gain(windy) = 0.940 – 0.892 = 0.048 bits
gain(outlook) > gain(windy)
Test on outlook!
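The two gain calculations can be reproduced from class counts alone. In the sketch below the (yes, no) counts per attribute value are read off the calculations above; the names `info_after_split`, `gain`, `PLAY`, `OUTLOOK` and `WINDY` are chosen for illustration.

```python
from math import log2


def info(class_counts):
    """Expected bits to specify a class, given per-class counts."""
    size = sum(class_counts)
    return -sum(c / size * log2(c / size) for c in class_counts if c > 0)


def info_after_split(partitions):
    """info_attr(T): size-weighted average of info(Ti) over the subsets Ti."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(p) for p in partitions)


def gain(class_counts, partitions):
    """gain(attr) = info(T) - info_attr(T)."""
    return info(class_counts) - info_after_split(partitions)


PLAY = [9, 5]                          # 9 yes, 5 no overall
OUTLOOK = [[2, 3], [4, 0], [3, 2]]     # (yes, no) for sunny, overcast, rain
WINDY = [[3, 3], [6, 2]]               # (yes, no) for windy = true, false

gain_outlook = gain(PLAY, OUTLOOK)
gain_windy = gain(PLAY, WINDY)
best = "outlook" if gain_outlook > gain_windy else "windy"
```

At full precision `gain_outlook` is about 0.247 bits and `gain_windy` about 0.048 bits; subtracting the rounded figures 0.940 and 0.694 gives 0.246. Either way outlook is the better attribute to test first.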
ML example Decision Tree
learning (11)
Guarding against overfitting:
Cross-validation
Want to use all data, but using test data to train is cheating
Split data into k sets:
for (i = 0; i < k; i++)
{
model = train_with_everything_but(i);
test_with(model,i);
}
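The loop above can be made concrete with a short Python sketch. It assumes a deliberately trivial stand-in learner (always predict the majority class of the training labels) so that the fold bookkeeping is the only thing on show; `k_fold_indices`, `train_majority` and `cross_validate` are hypothetical names, not part of any library.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k disjoint folds of near-equal size."""
    return [list(range(i, n, k)) for i in range(k)]


def train_majority(labels):
    """Stand-in learner: the 'model' is just the most common training label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(labels, k):
    """Mean held-out accuracy over k folds."""
    folds = k_fold_indices(len(labels), k)
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Train with everything but the current fold.
        train = [labels[j] for j in range(len(labels)) if j not in held_out]
        model = train_majority(train)
        # Test only on the fold that was held out.
        correct = sum(labels[j] == model for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


labels = ["yes"] * 9 + ["no"] * 5
mean_accuracy = cross_validate(labels, 7)
```

Every example is used for testing exactly once and never appears in the training set of the fold that tests it, which is what keeps the estimate honest.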
Tenets of Machine Learning
Choose appropriate:
Training experience
Ex: Good to have about equal number of cases of each
class, even if some classes are more probable in real
data
Think about how you'll test too!
Target function:
Decision tree? Neural Net?
Representation:
Ex: how much data:
Windy in {true,false} vs. wind_speed in mph
Learning algorithm:
Ex: Greedy search? Genetic algorithm? Backpropagation?
Our Tenets of Scientific
Discovery
1. Play to computers' strengths:
1. Speed
2. Accuracy (fingers crossed)
3. Don't get bored
  Do exhaustive search!
Q: Hey, doesn't that ignore all that AI heuristic function
research?
2. Use background knowledge



Predictive accuracy is not everything!
Normal science ==> dominant paradigm
Revolutionary science ==> ?
What are the Differences?
1. Background knowledge

CSD values background knowledge

ML considers background knowledge
What are the Differences? (cont)
2. The process of knowledge discovery
The ML Process is iterative:
But the CSD is iterative, and starts all over again:
1. Exhaustive Search
Tell computers to consider everything!


Search space systematically
Simplest --> increasingly more complex
Issues:
1. How do you search systematically?
States: models
Initial state = simplest model
Goal state = solution model
Operators: Go from one model to a marginally more
complex one
2. What is “everything”?
Q: With floating pt values every different coefficient
could be a new model (x, x+dx, x+2dx, etc.)
A: Generate next qualitative state, use numerical
methods to find best coefficients in that state
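One way to picture this split between qualitative search and numerical fitting is sketched below. It is only an illustration under stated assumptions: the candidate forms in `MODEL_FORMS` are a made-up list ordered from simplest to more complex, each with a single coefficient `a`, and the data are synthetic. The outer loop walks through the qualitative states; a closed-form least-squares step finds the best coefficient within each.

```python
# Qualitative model forms, ordered from simplest to more complex.
# Each has the shape y = a * f(x); the list itself is an illustrative assumption.
MODEL_FORMS = [
    ("y = a", lambda x: 1.0),
    ("y = a*x", lambda x: x),
    ("y = a*x^2", lambda x: x * x),
    ("y = a*x^3", lambda x: x ** 3),
]


def fit_coefficient(f, xs, ys):
    """Least-squares value of a in y = a * f(x) (closed form)."""
    num = sum(f(x) * y for x, y in zip(xs, ys))
    den = sum(f(x) ** 2 for x in xs)
    return num / den


def squared_error(f, a, xs, ys):
    """Sum of squared residuals for the fitted model."""
    return sum((y - a * f(x)) ** 2 for x, y in zip(xs, ys))


def search_models(xs, ys, tolerance=1e-6):
    """Visit forms in order of complexity; return the first that fits the data."""
    for name, f in MODEL_FORMS:
        a = fit_coefficient(f, xs, ys)
        if squared_error(f, a, xs, ys) <= tolerance:
            return name, a
    return None


xs = [1.0, 2.0, 3.0, 4.0]
ys = [4.9 * x * x for x in xs]      # synthetic data, made up for this sketch
found = search_models(xs, ys)
```

Because the coefficient is fitted rather than enumerated, the search space contains one state per model form instead of one per floating-point value.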
2. Background knowledge as
inductive bias (1)
Inductive bias is necessary
  N training cases
  But the (N+1)th test case could be anything
  Want to assume something about target function
  Inductive Bias = what you've assumed
Common inductive biases in ML:
  Minimal cross-validation error (e.g. decision tree learning)
  Maximal conditional independence (Bayes nets)
  Maximal boundary size between classes (Support vector machines)
  Minimal description length (Occam's razor)
  Minimal feature usage (Ignore extraneous data)
  Same class as nearest neighbor (Locality)
2. Background knowledge as
inductive bias (2)
Biases we can add/refine in CSD
1. Expressible in same language as paradigm?

Re-use paradigm elements instead of inventing
something “brand new”





Penalty for new objects
Penalty for new attributes
Penalty for new processes
Penalty for new relations/operations (?)
Penalty for new types of assertions (?)
2. Uses same reasoning as done in paradigm

Penalty for new types of reasoning, even with old
assertions
Q: Does this mean we can never introduce a
new thing?
Penalty for new objects:
polywater
Polymer: a long molecule in a repetitive chain
Nikolai Fedyakin (1962 USSR)
H2O condensed in and forced thru narrow quartz
capillary tubes
Measured boiling pt, freezing pt and viscosity
Similar to syrup
Boris Derjaguin
Popularized results (Moscow, then UK 1966)
In West
Some could replicate findings
Some could not
Penalty for new objects:
polywater (2)
People concerned with contamination of H2O
But precautions taken against this
Denis Rousseau (Bell Labs)
Did same tests with his sweat
Had same properties as “polywater”
Easier to believe in an old thing (water +
organic pollutants) rather than new thing
("polywater")
Penalty for new things:
Piltdown Man
Circa 1900: looking for early human fossils



Neanderthals in Germany (1863)
Cro-Magnon in France (1868)
What about England??
Charles Dawson (1912)
“I was given a skull by men at the Piltdown gravel pit”
Later, got skull fragments and lower jaw
Excavating Piltdown gravels:
Dawson (r)
Smith Woodward (center)
Penalty for new things:
Piltdown Man (2)
Royal College of Surgeons (soon after discovery)
“Brain looks like modern man”
French paleontologist Marcellin Boule (1915)
“Jaw from ape”
American zoologist Gerrit Smith Miller (1915)
“Jaw from fossil ape”
German anatomist Franz Weidenreich (1923)
“Modern human cranium + orangutan jaw w/filed
teeth”
Oxford anthropologist Kenneth Page Oakley
(1953)
“Skull is medieval human, lower jaw is Sarawak
orangutan, teeth are fossil chimpanzee”
Penalty for new attributes:
Inertial vs. gravitational mass
Inertial mass:
Resistance to acceleration
m in F = ma
Active gravitational mass:
Ability to attract other masses
M in F = GMm/r^2
Passive gravitational mass:
Ability to be attracted by other masses
m in F = GMm/r^2
Penalty for new attributes (2)
Conceptually they are three different types of
mass
No experiment has ever distinguished between
them

From Newton onward, people have tried experiments
Assume they are all the same!
Penalty for new processes:
cold-fusion
Cold fusion
Novel combo of old processes: catalysis + fusion
Catalysis:
Hard:
A + B -> D
Easier (C = catalyst):
A + C -> AC (activated catalyst)
B + AC -> ABC (ready to go)
ABC -> CD (easier reaction)
CD -> C + D
(catalyst ready to do another reaction)
Penalty for new processes:
cold-fusion (2)
Fusion: how it works


Get lots of energy fusing neutron-rich atoms
Need a lot of energy in to get more out
Penalty for new processes:
cold-fusion (3)
Fusion: Overcoming electrostatic force is hard:
Current technology: need a fission bomb to do it
This is the result:
Penalty for new processes:
cold-fusion (4)
Martin Fleischmann & Stanley Pons (1989)
“We can do fusion at room temperature!”
(No initiating nuclear bomb needed)
Electrolysis of heavy water (D2O)
“Excess heat” observed
Proposed mechanism
Palladium is catalyst
Pd + D -> Pd-D
Pd-D + D -> D-Pd-D
D-Pd-D -> He-Pd + energy!
He-Pd -> He + Pd
Penalty for new processes:
cold-fusion (5)
Reported in New York Times
Instantly a worldwide story among scientists
Replication
Some can
Others can't
Results:
Energy:
Some get excess energy
Others claim the experimenters didn't calibrate/account for everything
Helium:
Not enough observed for energy said to be produced
(there is background Helium in the air)
Ramifications
1. Science is conservative
Use the current paradigm to guide thinking
2. Accuracy is not everything
Assertion has to “fit in” current model


Be explainable by model
Use same terms as model
ML and CSD?
From ML we can get:
Idea of learning as model search:
  Training experience
  Target function
  Representation
  Learning algorithm
Extra considerations for CSD:
Use computers' strengths:
  Speed + Accuracy + Don't Get Bored
  Simulation + Exhaustive search
Use of background knowledge
  Downright conservative about introducing new terms
Not just iterative, never ends