Learning I: Introduction, Parameter Estimation


An introduction to machine learning and probabilistic graphical models
Kevin Murphy
MIT AI Lab
Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003
Overview
- Supervised learning
- Unsupervised learning
- Graphical models
- Learning relational models
Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides
Supervised learning
Color  Shape   Size   Output
Blue   Torus   Big    Y
Blue   Square  Small  Y
Blue   Star    Small  Y
Red    Arrow   Small  N
Learn to approximate the function F(x1, x2, x3) -> t from a training set of (x, t) pairs.
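To make this concrete, here is a minimal sketch (not from the slides) of fitting one possible hypothesis, a decision tree, to the toy table above; the ordinal encoding and the use of scikit-learn are assumptions made purely for illustration.

```python
# Minimal sketch (assumes scikit-learn is available): learn F(color, shape, size) -> t
# from the toy training table above, then predict the label of a new example.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X = [["Blue", "Torus", "Big"],
     ["Blue", "Square", "Small"],
     ["Blue", "Star", "Small"],
     ["Red", "Arrow", "Small"]]
t = ["Y", "Y", "Y", "N"]

enc = OrdinalEncoder()                       # map the categorical attributes to numbers
Xe = enc.fit_transform(X)

clf = DecisionTreeClassifier(random_state=0).fit(Xe, t)          # the learned hypothesis
print(clf.predict(enc.transform([["Blue", "Torus", "Small"]])))  # -> ['Y']
```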
Supervised learning
[Figure: abbreviated training data (attributes X1, X2, X3 and target T) is fed to a Learner, which outputs a Hypothesis; the hypothesis is then used to predict the unknown targets T = ? for the testing data.]
Key issue: generalization
[Figure: training examples labeled “yes”/“no”, and new examples marked “?” that must be classified.]
Can’t just memorize the training set (overfitting).
Hypothesis spaces
- Decision trees
- Neural networks
- K-nearest neighbors
- Naïve Bayes classifier
- Support vector machines (SVMs)
- Boosted decision stumps
- …
Perceptron
(neural net with no hidden layers)
Linearly separable data
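As a hedged illustration (not part of the slides), the perceptron learning rule fits such a linear separator in a few lines of NumPy; the toy data and learning-rate choice below are assumptions.

```python
# Minimal sketch of the perceptron learning rule (illustration, not from the slides).
import numpy as np

def perceptron(X, y, epochs=100, lr=1.0):
    """X: (n, d) inputs; y: labels in {-1, +1}. Returns weights w and bias b."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified: nudge the hyperplane
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Linearly separable toy data
X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # matches y on this separable data
```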
Which separating hyperplane?

The linear separator with the largest margin is the best one to pick.
[Figure: the margin is the gap between the separating hyperplane and the closest points of either class.]

What if the data is not linearly separable?
Kernel trick
Feature map: Φ(x, y) = (x², √2·x·y, y²)
[Figure: data in the 2D input space (x1, x2) is mapped by the kernel into a 3D feature space (z1, z2, z3).]
The kernel implicitly maps from 2D to 3D, making the problem linearly separable.
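A quick numerical check of this idea (an illustration, not from the slides): the 3D feature map Φ(x, y) = (x², √2·xy, y²) turns the quadratic kernel k(a, b) = (a·b)² into an ordinary inner product.

```python
# Sketch: the 2D -> 3D feature map behind the quadratic kernel (illustration only).
import numpy as np

def phi(p):
    x, y = p
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# Inner product in the 3D feature space equals the kernel k(a, b) = (a . b)^2
print(phi(a) @ phi(b), (a @ b) ** 2)   # both print 1.0
```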
Support Vector Machines (SVMs)
Two key ideas:
- Large margins
- Kernel trick
Boosting
Simple classifiers (weak learners) can have their performance boosted by taking weighted combinations.
Boosting maximizes the margin.
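For concreteness, here is a hedged sketch of boosting decision stumps with AdaBoost; the synthetic dataset, the use of scikit-learn, and the parameter values are assumptions for illustration (older scikit-learn versions call the `estimator` argument `base_estimator`).

```python
# Sketch: boosting decision stumps with AdaBoost (assumes scikit-learn; toy data).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)                     # a weak learner
boosted = AdaBoostClassifier(estimator=stump, n_estimators=50,  # weighted combination
                             random_state=0)
print(boosted.fit(X, y).score(X, y))
```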
Supervised learning success stories
- Face detection
- Steering an autonomous car across the US
- Detecting credit card fraud
- Medical diagnosis
- …
Unsupervised learning
What if there are no output labels?
K-means clustering
1. Guess number of clusters, K
2. Guess initial cluster centers, μ1, μ2
3. Assign data points xi to nearest cluster center
4. Re-compute cluster centers based on assignments
Reiterate steps 3–4 (see the sketch below).
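A minimal NumPy sketch of these steps (an illustration, not the slide's own code; the toy data and iteration count are assumptions):

```python
# Minimal K-means sketch in NumPy (toy data; illustration only).
import numpy as np

def kmeans(X, K, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]     # step 2: initial guesses
    for _ in range(iters):                                 # reiterate steps 3-4
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                          # step 3: nearest center
        centers = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                            else centers[k] for k in range(K)])   # step 4: recompute
    return centers, assign

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, assign = kmeans(X, K=2)
print(centers)
```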
AutoClass (Cheeseman et al, 1986)
- EM algorithm for mixtures of Gaussians
- “Soft” version of K-means
- Uses a Bayesian criterion to select K
- Discovered new types of stars from spectral data
- Discovered new classes of proteins and introns from DNA/protein sequence databases
Hierarchical clustering
Principal Component Analysis (PCA)
PCA seeks a projection that best represents the data in a least-squares sense.
PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest.
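A hedged sketch of PCA via the eigenvectors of the sample covariance (illustration only; the toy data and NumPy usage are assumptions):

```python
# Sketch: PCA via eigenvectors of the sample covariance (illustration only).
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                    # center the cloud
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(evals)[::-1][:k]        # directions of greatest scatter
    return Xc @ evecs[:, order]                # least-squares-best k-D projection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
print(pca(X, k=2).shape)                                   # (100, 2)
```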
Discovering nonlinear manifolds

Combining supervised and unsupervised learning
Discovering rules (data mining)
Occup.   Income  Educ.  Sex  Married  Age
Student  $10k    MA     M    S        22
Student  $20k    PhD    F    S        24
Doctor   $80k    MD     M    M        30
Retired  $30k    HS     F    M        60

Find the most frequent patterns (association rules):
Num in household = 1 ^ num children = 0 => language = English
Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}
Unsupervised learning: summary
- Clustering
- Hierarchical clustering
- Linear dimensionality reduction (PCA)
- Non-linear dimensionality reduction
- Learning rules
Discovering networks
From data visualization to causal discovery
Networks in biology
- Most processes in the cell are controlled by networks of interacting molecules:
  - Metabolic networks
  - Signal transduction networks
  - Regulatory networks
- Networks can be modeled at multiple levels of detail/realism (in order of decreasing detail):
  - Molecular level
  - Concentration level
  - Qualitative level
Molecular level: Lysis-Lysogeny circuit in Lambda phage
Arkin et al. (1998), Genetics 149(4):1633-48
5 genes, 67 parameters based on 50 years of research
Stochastic simulation required supercomputer
Concentration level: metabolic pathways
Usually modeled with differential equations.
[Figure: a small gene network g1…g5 with weighted interactions (w12, w23, w55, …).]
Qualitative level: Boolean Networks
Probabilistic graphical models
- Supports graph-based modeling at various levels of detail
- Models can be learned from noisy, partial data
- Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
- But can also model deterministic, causal processes.

"The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful. Therefore the true logic for this world is the calculus of probabilities." -- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to calculation." -- Pierre Simon Laplace
Graphical models: outline
- What are graphical models?
- Inference
- Structure learning
Simple probabilistic model: linear regression
Y = α + β·X + noise
Deterministic (functional) relationship
[Figure: (X, Y) data points scattered around the regression line.]
Simple probabilistic model: linear regression
Y = α + β·X + noise
Deterministic (functional) relationship
“Learning” = estimating the parameters α, β, σ from (x, y) pairs:
α is the empirical mean, β can be estimated by least squares, and σ² is the residual variance.
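A minimal sketch of this estimation step (illustration only; the synthetic data and the "true" parameter values are assumptions):

```python
# Sketch: least-squares estimates of alpha, beta and the residual sigma (toy data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, 200)        # assumed "true" alpha, beta, sigma

A = np.column_stack([np.ones_like(x), x])           # design matrix [1, x]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
alpha_hat, beta_hat = coef
sigma_hat = (y - A @ coef).std()                    # residual standard deviation
print(alpha_hat, beta_hat, sigma_hat)               # close to 1.5, 2.0, 0.5
```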
Piecewise linear regression
Latent “switch” variable – hidden process at work

Probabilistic graphical model for piecewise linear regression
[Figure: a graphical model over the input X, a hidden switch Q, and the output Y.]
- Hidden variable Q chooses which set of parameters to use for predicting Y.
- Value of Q depends on value of input X.
- This is an example of “mixtures of experts”.
Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; can be solved with EM (c.f., K-means).
Classes of graphical models
[Figure: taxonomy. Graphical models are a subset of probabilistic models; they divide into directed graphical models (Bayes nets, DBNs) and undirected graphical models (MRFs).]
Bayesian Networks
Compact representation of probability distributions via conditional independence.

Qualitative part: a directed acyclic graph (DAG)
- Nodes: random variables
- Edges: direct influence
[Figure: Earthquake -> Alarm <- Burglary, Earthquake -> Radio, Alarm -> Call.]

Quantitative part: a set of conditional probability distributions, e.g. for the family of Alarm, P(A | E, B):

E   B    P(a | E, B)   P(¬a | E, B)
e   b    0.9           0.1
e   ¬b   0.2           0.8
¬e  b    0.9           0.1
¬e  ¬b   0.01          0.99

Together: they define a unique distribution in a factored form:
P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
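To make the factored form concrete, here is a hedged sketch that builds this joint and answers a query by brute-force enumeration; only P(A | E, B) comes from the table above, and every other number is an invented placeholder.

```python
# Sketch: the Alarm network's factored joint, with inference by enumeration.
# P(A | E, B) is taken from the CPT above; the other CPT numbers are made-up placeholders.
from itertools import product

P_B = {1: 0.01, 0: 0.99}                       # assumed prior on Burglary
P_E = {1: 0.02, 0: 0.98}                       # assumed prior on Earthquake
P_A = {(1, 1): 0.9, (1, 0): 0.2, (0, 1): 0.9, (0, 0): 0.01}   # P(A=1 | E, B) from the slide
P_R = {1: 0.95, 0: 0.001}                      # assumed P(R=1 | E)
P_C = {1: 0.7, 0: 0.05}                        # assumed P(C=1 | A)

def bern(p, v):                                 # P(V=v) for a Bernoulli with P(V=1)=p
    return p if v == 1 else 1 - p

def joint(b, e, a, c, r):                       # P(B)P(E)P(A|B,E)P(R|E)P(C|A)
    return (bern(P_B[1], b) * bern(P_E[1], e) * bern(P_A[(e, b)], a)
            * bern(P_R[e], r) * bern(P_C[a], c))

# Posterior P(B=1 | C=1) by summing the joint over all the other variables
num = sum(joint(1, e, a, 1, r) for e, a, r in product([0, 1], repeat=3))
den = sum(joint(b, e, a, 1, r) for b, e, a, r in product([0, 1], repeat=4))
print(num / den)
```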
Example: “ICU Alarm” network
Domain: Monitoring Intensive-Care Patients
 37 variables
 509 parameters
…instead of 254
MINVOLSET
PULMEMBOLUS
PAP
KINKEDTUBE
INTUBATION
SHUNT
VENTMACH
VENTLUNG
DISCONNECT
VENITUBE
PRESS
MINOVL
ANAPHYLAXIS
SAO2
TPR
HYPOVOLEMIA
LVEDVOLUME
CVP
PCWP
LVFAILURE
STROEVOLUME
FIO2
VENTALV
PVSAT
ARTCO2
EXPCO2
INSUFFANESTH
CATECHOL
HISTORY
ERRBLOWOUTPUT
CO
HR
HREKG
ERRCAUTER
HRSAT
HRBP
BP
37
Success stories for graphical models
- Multiple sequence alignment
- Forensic analysis
- Medical and fault diagnosis
- Speech recognition
- Visual tracking
- Channel coding at Shannon limit
- Genetic pedigree analysis
- …
Graphical models: outline
- What are graphical models? ✓
- Inference
- Structure learning
Probabilistic Inference
- Posterior probabilities: probability of any event given any evidence, P(X|E)
[Figure: the Earthquake/Burglary/Alarm/Radio/Call network.]
Viterbi decoding
Compute the most probable explanation (MPE) of observed data.
Hidden Markov Model (HMM):
[Figure: hidden chain X1 -> X2 -> X3 with observations Y1, Y2, Y3, e.g. the acoustic signal for the word “Tomato”.]
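A hedged sketch of Viterbi decoding for a tiny HMM (the transition, emission, and initial probabilities below are made-up toy numbers, not from the slides):

```python
# Sketch of Viterbi decoding for a tiny HMM (toy probabilities; illustration only).
import numpy as np

pi = np.array([0.6, 0.4])                  # initial state distribution
A = np.array([[0.7, 0.3],                  # A[i, j] = P(X_{t+1}=j | X_t=i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],             # B[i, k] = P(Y_t=k | X_t=i)
              [0.1, 0.3, 0.6]])
obs = [0, 1, 2]                            # observed symbols Y1..Y3

T, S = len(obs), len(pi)
delta = np.zeros((T, S))                   # best path probability ending in each state
psi = np.zeros((T, S), dtype=int)          # back-pointers
delta[0] = pi * B[:, obs[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] * A * B[:, obs[t]][None, :]
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0)

path = [int(delta[-1].argmax())]           # backtrack the MPE state sequence
for t in range(T - 1, 0, -1):
    path.append(int(psi[t, path[-1]]))
print(path[::-1])
```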
Inference: computational issues
Easy: chains, trees. Hard: grids; dense, loopy graphs.
[Figure: example graphs of increasing difficulty, from chains and trees to grids and a dense, loopy network such as the ICU Alarm network.]
Many different inference algorithms exist, both exact and approximate.
Bayesian inference
- Bayesian probability treats parameters as random variables
- Learning/parameter estimation is replaced by probabilistic inference, P(θ | D)
- Example: Bayesian linear regression; the parameters are θ = (α, β, σ)
- Parameters are tied (shared) across repetitions of the data
[Figure: plate-style model in which θ is a parent of each Yi, with Xi -> Yi, for data cases i = 1…n.]
Bayesian inference
+ Elegant – no distinction between parameters and other hidden variables
+ Can use priors to learn from small data sets (c.f., one-shot learning by humans)
- Math can get hairy
- Often computationally intractable
Graphical models: outline
- What are graphical models? ✓
- Inference ✓
- Structure learning
Why Struggle for Accurate Structure?
[Figure: the true network Earthquake -> Alarm Set <- Burglary, Alarm Set -> Sound, compared with a version missing an arc and a version with an added arc.]
Missing an arc:
- Cannot be compensated for by fitting parameters
- Wrong assumptions about domain structure
Adding an arc:
- Increases the number of parameters to be estimated
- Wrong assumptions about domain structure
Score-based Learning
Define a scoring function that evaluates how well a structure matches the data.
[Figure: a dataset of (E, B, A) samples, e.g. <Y,N,N>, <Y,Y,Y>, <N,N,Y>, <N,Y,Y>, …, together with several candidate network structures over E, B, A.]
Search for a structure that maximizes the score.
Learning Trees
- Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree (see the sketch below)
- If some of the variables are hidden, the problem becomes hard again, but can use EM to fit mixtures of trees
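A hedged sketch of the max-weight spanning tree step in the Chow-Liu style, scoring edges by empirical mutual information; the toy data and the use of NetworkX are assumptions for illustration.

```python
# Sketch: learn a tree structure as a max-weight spanning tree over pairwise
# mutual information (Chow-Liu style). Assumes NumPy and NetworkX; toy data only.
import numpy as np
import networkx as nx

def mutual_info(x, y):
    """Empirical mutual information between two discrete data columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 4))                 # 4 binary variables
X[:, 1] = X[:, 0] ^ (rng.random(500) < 0.1)           # variable 1 ~ noisy copy of variable 0

G = nx.Graph()
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        G.add_edge(i, j, weight=mutual_info(X[:, i], X[:, j]))

tree = nx.maximum_spanning_tree(G)    # max-weight spanning tree over all candidate edges
print(sorted(tree.edges()))           # skeleton of the learned tree
```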
Heuristic Search
Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search.
- Define a search space:
  - search states are possible structures
  - operators make small changes to structure
- Traverse space looking for high-scoring structures
- Search techniques:
  - Greedy hill-climbing
  - Best-first search
  - Simulated annealing
  - ...
Local Search Operations
Typical operations on the current structure: add an arc, delete an arc, or reverse an arc.
[Figure: a small network over S, C, E, D, with the result of each operation.]
For example, adding the arc C -> D changes the score by
Δscore = S({C, E} -> D) - S({E} -> D)
Problems with local search
Easy to get stuck in local optima.
[Figure: the score landscape S(G|D), with the “truth” at the global optimum and “you” stuck at a local optimum.]
Problems with local search II
Picking a single best model can be misleading.
[Figure: the posterior P(G|D) over structures, with a single network over E, R, B, A, C picked out.]
[Figure: with a small sample, the posterior P(G|D) is spread over many different structures over E, R, B, A, C.]
- Small sample size ⇒ many high-scoring models
- Answer based on one model often useless
- Want features common to many models
Bayesian Approach to Structure Learning
- Posterior distribution over structures
- Estimate probability of features:
  - Edge X -> Y
  - Path X -> … -> Y
  - …

P(f | D) = Σ_G f(G) P(G | D)

where P(G | D) is the Bayesian score for G and f(G) is the indicator function for the feature f (e.g., the edge X -> Y).
Bayesian approach: computational issues
- Posterior distribution over structures:
P(f | D) = Σ_G f(G) P(G | D)
How do we compute the sum over a super-exponential number of graphs?
- MCMC over networks
- MCMC over node-orderings (Rao-Blackwellisation)
Structure learning: other issues
- Discovering latent variables
- Learning causal models
- Learning from interventional data
- Active learning
Discovering latent variables
[Figure: two models over the same observed variables, (a) with 17 parameters and (b) with 59 parameters; introducing a latent variable gives the much more compact model.]
There are some techniques for automatically detecting the possible presence of latent variables.
Learning causal models
- So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.
- However, we often want to interpret directed arrows causally.
- This is uncontroversial for the arrow of time.
- But can we infer causality from static observational data?
Learning causal models
- We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.
- See books by Pearl and Spirtes et al.
- However, we can only learn up to Markov equivalence, no matter how much data we have.
[Figure: four structures over X, Y, Z; the chains X -> Y -> Z and X <- Y <- Z and the fork X <- Y -> Z are Markov equivalent, unlike the v-structure X -> Y <- Z.]
Learning from interventional data
- The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
- We need to (slightly) modify our learning algorithms: cut arcs coming into nodes which were set by intervention.
[Figure: smoking -> yellow fingers. P(smoker | observe(yellow)) >> prior, but P(smoker | do(paint yellow)) = prior.]
Active learning
- Which experiments (interventions) should we perform to learn structure as efficiently as possible?
- This problem can be modeled using decision theory.
- Exact solutions are wildly computationally intractable.
- Can we come up with good approximate decision-making techniques?
- Can we implement hardware to automatically perform the experiments?
  - “AB: Automated Biologist”
Learning from relational data
Can we learn concepts from a set of relations between objects, instead of / in addition to just their attributes?
Learning from relational data: approaches
- Probabilistic relational models (PRMs)
  - Reify a relationship (arcs) between nodes (objects) by making it into a node (hypergraph)
- Inductive Logic Programming (ILP)
  - Top-down, e.g., FOIL (generalization of C4.5)
  - Bottom-up, e.g., PROGOL (inverse deduction)
ILP for learning protein folding: input
[Figure: a positive (“yes”) and a negative (“no”) example protein structure.]
TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …
100 conjuncts describing the structure of each pos/neg example
ILP for learning protein folding: results
- PROGOL learned the following rule to predict if a protein will form a “four-helical up-and-down bundle”:
- In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3 and h1 is next to a second helix”
ILP: Pros and Cons
+
Can discover new predicates (concepts)
automatically
 + Can learn relational models from relational (or
flat) data
 - Computationally intractable
 - Poor handling of noise
67
The future of machine learning for bioinformatics?
Oracle
The future of machine learning for bioinformatics
[Figure: a closed loop: a Learner combines prior knowledge and the biological literature to produce hypotheses, which drive experiment design and replicated experiments in the real world, whose results feed back into the learner.]
“Computer assisted pathway refinement”
The end
Decision trees
[Figure: a small decision tree that tests “blue?”, then “oval?”, then “big?”, with yes/no branches leading to the leaves.]
Decision trees
[Figure: the same decision tree (blue? / oval? / big?).]
+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power
Feedforward neural network
[Figure: input layer -> hidden layer -> output, with weights on each arc.]
Each node applies a sigmoid function to a weighted sum of its inputs:
f(Σ_i J_i s_i), where f(x) = 1 / (1 + e^(-c·x))
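A minimal sketch of the forward pass of such a network (illustration only; the layer sizes and random weights are assumptions):

```python
# Sketch: forward pass of a tiny feedforward net with sigmoid units (illustration only).
import numpy as np

def sigmoid(x, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * x))       # f(x) = 1 / (1 + e^(-cx))

def forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)                   # hidden layer: sigmoid of weighted sums
    return sigmoid(W2 @ h + b2)                # output layer

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)  # 3 hidden -> 1 output
print(forward(np.array([0.5, -1.0]), W1, b1, W2, b2))
```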
Feedforward neural network
[Figure: input layer -> hidden layer -> output.]
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predicts poorly
Nearest Neighbor
- Remember all your data
- When someone asks a question (see the sketch below):
  - find the nearest old data point
  - return the answer associated with it
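A hedged one-nearest-neighbour sketch (toy data; illustration only, not from the slides):

```python
# Sketch: 1-nearest-neighbour classification in NumPy (illustration only).
import numpy as np

def nearest_neighbor(X_train, y_train, x_query):
    d = np.linalg.norm(X_train - x_query, axis=1)   # distance to every stored point
    return y_train[d.argmin()]                       # answer of the nearest old data point

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array(["no", "no", "yes"])
print(nearest_neighbor(X_train, y_train, np.array([4.0, 4.5])))   # -> "yes"
```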
Nearest Neighbor
[Figure: a query point “?” among labeled examples.]
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
Support Vector Machines (SVMs)
Two key ideas:
- Large margins are good
- Kernel trick
SVM: mathematical details
- Training data: l-dimensional vectors with a true/false flag: {x_i, y_i}, x_i ∈ R^l, y_i ∈ {−1, 1}
- Separating hyperplane: w·x + b = 0
- Margin: d = 2 / ||w||
- Inequalities: y_i (x_i·w + b) − 1 ≥ 0, ∀i
- Support vector expansion: w = Σ_i α_i x_i, where the support vectors are the training points with non-zero coefficients
- Decision: based on the sign of w·x + b
[Figure: the separating hyperplane and its margin.]
(A small sketch follows below.)
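As a hedged illustration (the toy data, the hard-margin-like C value, and the use of scikit-learn are assumptions), a linear SVM can be fit and its w, b, margin 2/||w||, and support vectors read off:

```python
# Sketch: fitting a linear SVM and reading off w, b, the margin 2/||w||, and the
# support vectors (assumes scikit-learn; toy data; illustration only).
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],      # class +1
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)             # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```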
Replace all inner products with kernels
[Figure: each inner product x_i·x_j is replaced by a kernel function k(x_i, x_j).]
SVMs: summary
- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

General lessons from SVM success:
- The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
- Large margin classifiers are good
Boosting: summary
- Can boost any weak learner
- Most commonly: boosted decision “stumps”
+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power
Supervised learning: summary
- Learn mapping F from inputs to outputs using a training set of (x, t) pairs
- F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
- Algorithms offer a variety of tradeoffs
- Many good books, e.g.,
  - “The Elements of Statistical Learning”, Hastie, Tibshirani, Friedman, 2001
  - “Pattern Classification”, Duda, Hart, Stork, 2001
Inference
- Posterior probabilities: probability of any event given any evidence
- Most likely explanation: scenario that explains evidence
- Rational decision making: maximize expected utility; value of information
- Effect of intervention
[Figure: the Earthquake/Burglary/Alarm/Radio/Call network.]
Assumption needed to make learning work
- We need to assume “Future futures will resemble past futures” (B. Russell)
- Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.
Structure learning success stories: gene regulation network (Friedman et al.)
Yeast data [Hughes et al 2000]
- 600 genes
- 300 experiments
Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)
Input: biological sequences
Human  CGTTGC…
Chimp  CCTAGG…
Orang  CGAACG…
…
Output: a phylogeny, with the observed sequences at the leaves
Uses structural EM, with max-spanning-tree in the inner loop
Instances of graphical models
[Figure: taxonomy of graphical models within probabilistic models: directed models (Bayes nets and DBNs, with instances such as the naïve Bayes classifier, mixtures of experts, the hidden Markov model (HMM) and the Kalman filter model) and undirected models (MRFs, such as the Ising model).]
ML enabling technologies
- Faster computers
- More data
  - The web
  - Parallel corpora (machine translation)
  - Multiple sequenced genomes
  - Gene expression arrays
- New ideas
  - Kernel trick
  - Large margins
  - Boosting
  - Graphical models
  - …