
Graphical Models
David Heckerman
Microsoft Research
Overview

• Intro to graphical models
  – Application: Data exploration
  – Dependency networks → undirected graphs
  – Directed acyclic graphs ("Bayes nets")
• Applications
  – Clustering
  – Evolutionary history/phylogeny
Using classification/regression for data exploration

These models estimate p(target | inputs).

Decision tree: [Figure: a tree that splits on gender (male/female), then age (young/old); the leaves give p(cust) = 0.2, 0.7, 0.8]

Logistic regression: log p(cust)/(1 - p(cust)) = 0.2 - 0.3*male + 0.4*old

Neural network: [Figure]
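Not from the slides, but as a rough sketch of the idea: fit two of these conditional models with scikit-learn and read off p(cust | gender, age). The data, column meanings, and coefficients below are made up for illustration.

```python
# Hypothetical sketch: estimate p(cust | gender, age) with two of the model
# classes mentioned on the slide. The data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
male = rng.integers(0, 2, n)          # 1 = male, 0 = female
old = rng.integers(0, 2, n)           # 1 = old, 0 = young
logit = 0.2 - 0.3 * male + 0.4 * old  # same form as the slide's example
cust = rng.random(n) < 1 / (1 + np.exp(-logit))

X = np.column_stack([male, old])
tree = DecisionTreeClassifier(max_depth=2).fit(X, cust)
lr = LogisticRegression().fit(X, cust)

# p(cust | male, young) under each model:
print(tree.predict_proba([[1, 0]])[0, 1])
print(lr.predict_proba([[1, 0]])[0, 1])
```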
Conditional independence

Decision tree: [Figure: the same tree splitting on gender and age]

p(cust | gender, age, month born) = p(cust | gender, age)
p(target | all inputs) = p(target | some inputs)
Learning conditional independence from data: Model Selection

• Cross validation
• Bayesian methods
• Penalized likelihood
• Minimum description length
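As an illustrative sketch of the first option (cross-validation), on made-up data: compare a tree that sees all inputs with one that drops an irrelevant "month born" input; roughly equal scores suggest the extra input carries no predictive information, i.e. a conditional independence.

```python
# Hypothetical sketch: use cross-validation to decide whether an input
# (here, "month born") helps predict the target. The data is synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 2000
gender = rng.integers(0, 2, n)
age = rng.integers(0, 2, n)
month = rng.integers(1, 13, n)   # irrelevant input by construction
p = 1 / (1 + np.exp(-(0.2 - 0.3 * gender + 0.4 * age)))
cust = rng.random(n) < p

X_all = np.column_stack([gender, age, month])
X_small = np.column_stack([gender, age])

model = DecisionTreeClassifier(max_depth=3)
print(cross_val_score(model, X_all, cust, cv=5, scoring="neg_log_loss").mean())
print(cross_val_score(model, X_small, cust, cv=5, scoring="neg_log_loss").mean())
# Similar scores suggest p(cust | gender, age, month) = p(cust | gender, age).
```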
Using classification/regression for data exploration

• Suppose you have thousands of variables and you're not sure about the interactions among those variables
• Build a classification/regression model for each variable, using the rest of the variables as inputs
Example with three variables X, Y, and Z

[Figure: three decision trees]
• Target: X, Inputs: Y, Z — split on Y: p(x|y=0), p(x|y=1)
• Target: Y, Inputs: X, Z — split on X, then on Z when X=0: p(y|x=1), p(y|x=0,z=0), p(y|x=0,z=1)
• Target: Z, Inputs: X, Y — split on Y: p(z|y=0), p(z|y=1)
Summarize the trees with a single graph

[Figure: the three trees above, summarized by a graph over nodes X, Y, Z with arcs Y → X, X → Y, Z → Y, and Y → Z]
Dependency Network

• Build a classification/regression model for every variable given the other variables as inputs
• Construct a graph where
  – Nodes correspond to variables
  – There is an arc from X to Y if X helps to predict Y
• The graph along with the individual classification/regression models is a "dependency network"
  (Heckerman, Chickering, Meek, Rounthwaite, Kadie 2000)
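A minimal sketch of this construction on synthetic binary data (my own illustration, not the authors' code): fit one tree per variable and add an arc X → Y whenever Y's tree actually uses X.

```python
# Sketch: build a dependency network by fitting a decision tree for each
# variable given all the others, then adding an arc X -> Y whenever
# variable X is used in Y's tree. The data is synthetic.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
n = 5000
Y = rng.integers(0, 2, n)
X = (Y ^ (rng.random(n) < 0.1)).astype(int)   # X is a noisy copy of Y
Z = (Y ^ (rng.random(n) < 0.2)).astype(int)   # Z is a noisier copy of Y
data = {"X": X, "Y": Y, "Z": Z}

arcs = []
for target, y in data.items():
    inputs = [v for v in data if v != target]
    features = np.column_stack([data[v] for v in inputs])
    tree = DecisionTreeClassifier(max_depth=2, min_impurity_decrease=0.01)
    tree.fit(features, y)
    for v, importance in zip(inputs, tree.feature_importances_):
        if importance > 0:            # the input actually appears in the tree
            arcs.append((v, target))

# Expect arcs like Y -> X, Y -> Z, and X (possibly also Z) -> Y.
print(sorted(arcs))
```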
Example: TV viewing

Nielsen data: 2/6/95-2/19/95

            Age   Show1   Show2   Show3   ...
viewer 1     73     y       n       n
viewer 2     16     n       y       y
viewer 3     35     n       n       n
etc.

~400 shows, ~3000 viewers
Goal: exploratory data analysis (acausal)
A bit of history

• Julian Besag (and others) invented dependency networks (under another name) in the mid 1970s
• But they didn't like them, because they could be inconsistent
A consistent dependency network

[Figure: the three trees for X, Y, and Z from before, whose conditional distributions all agree with a single joint distribution, together with the summary graph over X, Y, Z]
An inconsistent dependency network

[Figure: a similar set of per-variable trees whose conditional distributions cannot all arise from any single joint distribution]
A bit of history

• Julian Besag (and others) invented dependency networks (under the name "Markov graphs") in the mid 1970s
• But they didn't like them, because they could be inconsistent
• So they used a property of consistent dependency networks to develop a new characterization of them
Conditional independence

[Figure: the three trees and the summary graph over X, Y, Z]

X ⊥ Z | Y  (X is independent of Z given Y)
Conditional independence in a
dependency network
Each variable is independent of all other
variables given its immediate neighbors
Hammersley-Clifford Theorem (Besag 1974)

• Given a set of variables which has a positive joint distribution
• Where each variable is independent of all other variables given its immediate neighbors in some graph G
• It follows that

  p(x) ∝ ∏_{i=1}^{N} f_i(c_i)    ("clique potentials")

  where c_1, c_2, …, c_N are the maximal cliques in the graph G.
Example

Graph: X — Y — Z

p(x, y, z) ∝ f_1(x, y) · f_2(y, z)
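A small numeric check of this example (my own illustration): build p(x, y, z) ∝ f1(x, y) · f2(y, z) from two arbitrary positive potentials and confirm the implied independence X ⊥ Z | Y.

```python
# Sketch: construct p(x, y, z) proportional to f1(x, y) * f2(y, z) for binary
# variables and check the implied conditional independence X ⊥ Z | Y.
import numpy as np

rng = np.random.default_rng(3)
f1 = rng.random((2, 2)) + 0.1     # positive clique potential on (X, Y)
f2 = rng.random((2, 2)) + 0.1     # positive clique potential on (Y, Z)

p = np.einsum("xy,yz->xyz", f1, f2)
p /= p.sum()                      # normalize to a joint distribution

for y in (0, 1):
    p_xz_given_y = p[:, y, :] / p[:, y, :].sum()
    p_x = p_xz_given_y.sum(axis=1, keepdims=True)   # p(x | y)
    p_z = p_xz_given_y.sum(axis=0, keepdims=True)   # p(z | y)
    # X and Z are independent given Y: p(x, z | y) = p(x | y) * p(z | y)
    assert np.allclose(p_xz_given_y, p_x * p_z)
print("X is independent of Z given Y")
```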
Consistent dependency networks: Directed arcs not needed

[Figure: the dependency network X ↔ Y ↔ Z and the equivalent undirected graph X — Y — Z]

p(x, y, z) ∝ f_1(x, y) · f_2(y, z)
A bit of history

• Julian Besag (and others) invented dependency networks (under the name "Markov graphs") in the mid 1970s
• But they didn't like them, because they could be inconsistent
• So they used a property of consistent dependency networks to develop a new characterization of them
• "Markov Random Fields", aka "undirected graphs", were born
Inconsistent dependency networks aren't that bad

• They are *almost consistent* because each classification/regression model is learned from the same data set (this can be formalized)
• They are easy to learn from data (build a separate classification/regression model for each variable)
• Conditional distributions (e.g., trees) are easier to understand than clique potentials
• Over the last decade, they have proven to be a very useful tool for data exploration
Shortcomings of undirected graphs

• Lack a generative story (e.g., Latent Dirichlet Allocation)
• Lack a causal story

[Figure: a causal DAG over cold, lung cancer, sore throat, cough, and weight loss]
Solution: Build trees in some order

1. Target: X, Inputs: none — a single leaf: p(x)
2. Target: Y, Inputs: X — split on X: p(y|x=0), p(y|x=1)
3. Target: Z, Inputs: X, Y — split on Y: p(z|y=0), p(z|y=1)
Solution: Build trees in some order

[Figure: the same three trees; the resulting directed graph is X → Y → Z]
Some orders are better than others

[Figure: two different orders over X, Y, Z yield different graphs]

Ways to choose an order:
• Random orders
• Greedy search
• Monte-Carlo methods
Joint distribution is easy to obtain

[Figure: the three trees and the graph X → Y → Z]

p(x) · p(y|x) · p(z|x, y) = p(x, y, z)
Directed Acyclic Graphs (aka Bayes Nets)

p(x_1, …, x_n) = ∏_i p(x_i | x_1, …, x_{i-1})
               = ∏_i p(x_i | parents(x_i))

Many inventors: Wright 1921; Good 1961; Howard & Matheson 1976; Pearl 1982
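A tiny sketch of this factorization with made-up conditional tables for the graph X → Y, with both X and Y feeding Z: multiplying p(x), p(y|x), and p(z|x,y) yields a valid joint that sums to 1.

```python
# Sketch: build a joint distribution from the conditional tables of a small
# DAG X -> Y, (X, Y) -> Z. The numbers are made up for illustration.
import numpy as np

p_x = np.array([0.6, 0.4])                            # p(x)
p_y_given_x = np.array([[0.7, 0.3],                   # p(y | x=0)
                        [0.2, 0.8]])                  # p(y | x=1)
p_z_given_xy = np.array([[[0.9, 0.1], [0.5, 0.5]],
                         [[0.4, 0.6], [0.1, 0.9]]])   # p(z | x, y)

# p(x, y, z) = p(x) * p(y | x) * p(z | x, y)
joint = np.einsum("x,xy,xyz->xyz", p_x, p_y_given_x, p_z_given_xy)

print(joint.sum())        # 1.0: a valid joint distribution
print(joint[1, 0, 1])     # p(x=1, y=0, z=1)
```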
The power of graphical models

• Easy to understand
• Useful for adding prior knowledge to an analysis (e.g., causal knowledge)
• The conditional independencies they express make inference more computationally efficient
Inference

[Figure: the three trees and the graph X → Y → Z]

What is p(z | x=1)?
Inference: Example ("Elimination Algorithm")

Graph: W → X → Y → Z

p(z) = Σ_{w,x,y} p(w) p(x|w) p(y|x) p(z|y)
     = Σ_{w,x} p(w) p(x|w) [ Σ_y p(y|x) p(z|y) ]
     = Σ_w p(w) [ Σ_x p(x|w) [ Σ_y p(y|x) p(z|y) ] ]
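A numeric sketch of the elimination idea for this chain, with arbitrary made-up tables: compute p(z) by the brute-force triple sum and again by pushing the sums inward, and check that the two agree.

```python
# Sketch of the elimination algorithm on the chain W -> X -> Y -> Z,
# with arbitrary made-up conditional tables (all variables binary).
import numpy as np

rng = np.random.default_rng(4)
def random_cpt(shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)   # last axis sums to 1

p_w = random_cpt((2,))           # p(w)
p_x_w = random_cpt((2, 2))       # p(x | w)
p_y_x = random_cpt((2, 2))       # p(y | x)
p_z_y = random_cpt((2, 2))       # p(z | y)

# Brute force: sum over w, x, y of p(w) p(x|w) p(y|x) p(z|y)
brute = np.einsum("w,wx,xy,yz->z", p_w, p_x_w, p_y_x, p_z_y)

# Elimination: push each sum inside as far as it will go
m_y = p_y_x @ p_z_y              # sum_y p(y|x) p(z|y)    -> message over (x, z)
m_x = p_x_w @ m_y                # sum_x p(x|w) m_y(x, z) -> message over (w, z)
p_z = p_w @ m_x                  # sum_w p(w) m_x(w, z)   -> p(z)

print(np.allclose(brute, p_z))   # True
print(p_z)
```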
Inference

• Inference is also important because it is the E step of the EM algorithm (when learning with missing data and/or hidden variables)
• Exact methods for inference that exploit conditional independence are well developed (e.g., Shachter; Lauritzen & Spiegelhalter; Dechter)
• Exact methods fail when there are many cycles in the graph, so approximate methods are used instead:
  – MCMC (e.g., Geman and Geman 1984)
  – Loopy propagation (e.g., Murphy et al. 1999)
  – Variational methods (e.g., Jordan et al. 1999)
Applications of Graphical Models

DAGs and UGs:
• Data exploration
• Density estimation
• Clustering

UGs:
• Spatial processes

DAGs:
• Expert systems
• Causal discovery
Applications

• Clustering
• Evolutionary history/phylogeny
Clustering

Example: msnbc.com

[Table: users 1-8 (etc.) and the sequence of page categories each visited — frontpage, news, weather, health, sports, travel, business, and so on]

Millions of users per day
Goal: understand what is and isn't working on the site
Solution

[Diagram: data → cluster → user clusters]

• Cluster users based on their behavior on the site
• Display clusters somehow
Generative model for clustering
(e.g., AutoClass, Cheeseman & Stutz 1995)

[Figure: a discrete, hidden Cluster node with an arc to each observed page variable (1st page, 2nd page, 3rd page, …)]
Sequence Clustering
(Cadez, Heckerman, Meek, & Smyth, 2000)

[Figure: as above, but with arcs from each page to the next (1st page → 2nd page → 3rd page → …), so each cluster defines a Markov chain over pages; the Cluster node is discrete and hidden]
Learning parameters (with missing data)

Principles:
• Find the parameters that maximize the (log) likelihood of the data
• Find the parameters whose posterior probability is a maximum
• Find distributions for quantities of interest by averaging over the unknown parameters

Gradient methods or the EM algorithm are typically used for the first two.
Expectation-Maximization (EM) algorithm
(Dempster, Laird, Rubin 1977)

Initialize parameters (e.g., at random)

Expectation step: compute probabilities for the values of the unobserved variables using the current values of the parameters and the incomplete data [THIS IS INFERENCE]; reinterpret the data as a set of fractional cases based on these probabilities

Maximization step: choose parameters so as to maximize the log likelihood of the fractional data

Parameters will converge to a local maximum of log p(data)
E-step

Suppose the cluster model has 2 clusters, and that
  p(cluster=1 | case, current params) = 0.7
  p(cluster=2 | case, current params) = 0.3

Then, write
  q(case) = 0.7 log p(case, cluster=1 | params) + 0.3 log p(case, cluster=2 | params)

Do this for each case and then find the parameters that maximize Q = Σ_case q(case). These parameters are guaranteed not to decrease the log likelihood.
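An illustrative EM loop for a 2-cluster mixture of independent binary features (a generic toy example, not the msnbc model): the E-step computes exactly the fractional memberships described on this slide, and the M-step re-estimates the parameters from the fractional data.

```python
# Toy EM for a 2-cluster mixture of independent Bernoulli features.
# A generic illustration only; the data and parameters are made up.
import numpy as np

rng = np.random.default_rng(5)
true_theta = np.array([[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]])   # per-cluster p(feature=1)
z = rng.integers(0, 2, 500)
data = (rng.random((500, 3)) < true_theta[z]).astype(float)

pi = np.array([0.5, 0.5])                 # mixing weights
theta = rng.random((2, 3)) * 0.5 + 0.25   # random initial parameters

for _ in range(50):
    # E-step: p(cluster | case, current params), one row per case
    like = np.prod(theta[None] ** data[:, None] *
                   (1 - theta[None]) ** (1 - data[:, None]), axis=2)
    resp = like * pi
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize the expected log likelihood of the fractional data
    pi = resp.mean(axis=0)
    theta = resp.T @ data / resp.sum(axis=0)[:, None]

print(np.round(theta, 2))   # close to true_theta (up to cluster relabeling)
```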
Demo: SQL Server 2005
Example: msnbc.com

[Table repeated from earlier: users 1-8 and their sequences of page-category visits]
Sequence clustering

Other applications at Microsoft:
• Analyze how people use programs (e.g., Office)
• Analyze web traffic for intruders (anomaly detection)
Computational biology applications

• Evolutionary history/phylogeny
• Vaccine for AIDS
Evolutionary History/Phylogeny
(Jojic, Jojic, Meek, Geiger, Siepel, Haussler, and Heckerman 2004)

[Figure: a phylogenetic tree over roughly 55 mammal species (Donkey, Horse, …, Opossum, Platypus), with the leaves grouped into clades: Perissodactyla, Carnivora, Cetartiodactyla, Chiroptera, Moles+Shrews, Afrotheria, Xenarthra, Lagomorpha + Scandentia, Primates, Rodentia 1, Rodentia 2, Hedgehogs]
Probabilistic Model of Evolution

[Figure: a tree-structured graphical model with hidden ancestral nodes and observed nodes for species 1, species 2, species 3, …]
Learning phylogeny from data

• For a given tree, find maximum likelihood parameters
• Search over structures to find the best likelihood (penalized to avoid overfitting)
Strong simplifying assumption

• Evolution at each DNA nucleotide is independent
  → EM is computationally efficient

[Figure: one tree per nucleotide position 1, 2, …, N, each with its own hidden ancestors (h) and observed species nucleotides (x)]
Relaxing the assumption

[Figure: the per-position trees as above, with added arcs connecting corresponding nodes at adjacent positions]

• Each substitution depends on the substitution at the previous position
• This structure captures context-specific effects during evolution
• EM is computationally intractable
Variational approximation for inference

ln p(o | θ) = ln Σ_h p(h, o | θ)
            = ln Σ_h q(h | o, θ) · p(h, o | θ) / q(h | o, θ)
            ≥ Σ_h q(h | o, θ) ln [ p(h, o | θ) / q(h | o, θ) ]

Lower bound good enough for EM-like algorithm
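A quick numeric check of this inequality on a toy discrete model (my own illustration): for an arbitrary q(h | o), the quantity on the right never exceeds ln p(o), and the gap is KL(q || p(h | o)).

```python
# Sketch: verify the variational lower bound on a toy discrete model.
# h and o are single discrete variables; all distributions are made up.
import numpy as np

rng = np.random.default_rng(6)
p_ho = rng.random((4, 3))
p_ho /= p_ho.sum()                       # joint p(h, o) over 4 x 3 values

o = 1                                    # an arbitrary observation
log_evidence = np.log(p_ho[:, o].sum())  # ln p(o) = ln sum_h p(h, o)

q = rng.random(4)
q /= q.sum()                             # an arbitrary q(h | o)

elbo = np.sum(q * (np.log(p_ho[:, o]) - np.log(q)))  # sum_h q ln[p(h,o)/q(h)]

print(elbo <= log_evidence + 1e-12)      # True: the bound holds
print(log_evidence - elbo)               # gap = KL(q || p(h | o))
```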
Two simple q distributions

[Figure: two factored forms for q(h | o) over the hidden nodes — a product of trees and a product of chains]
Things I didn't have time to talk about

• Factor graphs, mixed graphs, etc.
• Relational learning: PRMs, Plates, PERs
• Bayesian methods for learning
• Scalability
• Causal modeling
• Variational methods
• Non-parametric distributions
To learn more

Main conferences:
• Uncertainty in Artificial Intelligence (UAI)
• Neural Information Processing Systems (NIPS)