Learning Structure in Bayes Nets (Typically also learn CPTs here)

Download Report

Transcript Learning Structure in Bayes Nets (Typically also learn CPTs here)

Learning Structure in Bayes Nets
(Typically also learn CPTs here)
• Given the set of random variables
(features), the space of all possible networks
is well-defined and finite.
• Unfortunately, it is super-exponential in the
number of variables.
• We can define a transition function between
states (network structures), such as adding
an arc, deleting an arc, or changing the
Learning Structure (Continued)
• (Continued)… direction of an arc.
• For each state (structure), we take our best
guess of the CPTs given the data as before.
• We define the score of the network to be
either the probability of the data given the
network (maximum likelihood framework)
or the posterior probability of the network
(the product of the prior probability of the
Learning Structure (Continued)
• (Continued)… network and the probability
of the data given the network, normalized
over all possible networks).
• Given a state space with transition function
and scoring function, we now have a
traditional AI search space to which we can
apply greedy hill-climbing, randomized
walks with multiple restarts, or a variety of
Learning Structure (Continued)
• (Continued)… other heuristic search
techniques.
• The balance of opinion currently appears to
favor greedy hill-climbing search for this
applications, but search techniques for
learning Bayes Net structure are wide open
for further research -- nice thesis task.
Structural Equivalence
• Independence Equivalent: 2 structures are
independence equivalent if they encode the
same conditional independence relations.
• Distribution Equivalent with respect to a
family of CPT formats: 2 structures are
equivalent if they represent the same sets of
possible distributions.
• Likelihood Equivalent:the data does not
help discriminate between the 2 structures.
One Other Key Point
• The previous discussion assumes we are
going to make a prediction based on the
best (e.g., MAP or maximum likelihood)
single hypothesis.
• Alternatively, we could make avoid
committing to a single Bayes Net. Instead
we could compute all Bayes Nets, and have
a probability for each. For any new query
One Other Key Point (Continued)
• (Continued)… we could calculate the
prediction of every network. We could then
weigh each network’s prediction by the
probability that it is the correct network
(given our previous training data), and go
with the highest scoring prediction. Such a
predictor is the Bayes-optimal predictor.
Problem with Bayes Optimal
• Because there are a super-exponential
number of structures, we don’t want to
average over all of them. Two options are
used in practice:
• Selective model averaging:just choose a
subset of “best” but “distinct” models
(networks) and pretend it’s exhaustive.
• Go back to MAP/ML (model selection).
Example of Structure Learning:
Modeling Gene Expression Data
• Expression of a gene: making from the gene
the protein for which it codes (involves
transcription and translation).
• Can estimate expression by transcription
(amount of mRNA made from the gene).
• DNA hybridization arrays: “chips” that
simultaneously measure the levels at which
all genes in a sample are expressed.
Importance of Expression Data
• Often the best clue to a disease or measurement of
successful treatment is the degree to which genes
are expressed. Such data also gives insight into
regulatory networks among genes (one gene may
code for a protein that affects another’s expression
rate).
• Can get snapshots of global expression levels.
Modeling Expression Data by
Learning a Bayes Net
• We can model the effects of genes on others
by learning a Bayes Net (both structure and
CPTs. Friedman et al. do so.
• See associated figure. Expression of gene E
might promote expression of B but
expression of A might inhibit B. The facts
that E and A directly influence B are
captured by the network structure, and the
Modeling Gene Expression Data
(Continued)
• (Continued)… fact that E promotes while A
inhibits is captured in the CPT for B given
its parents.
• B directly influences C according to the
network, but E and A influence C only
indirectly via B.
Decisions Made in this
Application
• Use selective model averaging. Learn
multiple models via bootstrapping.
(Randomly sample with replacement from
the original data set -- each run is with a
different random sample of the original
data.) Average simply by extracting
common features: Is Y in the Markov
Blanket of X? Is Y an ancestor of X?
Decisions (Continued)
• Use Independence Equivalence (two models are
equivalent if and only if they encode the same
conditional independence information.
• Must choose a prior distribution over structures.
Choose one such that (1) equivalent structures
have equivalent scores, and (2) scores are
decomposable (score is the sum of the scores of
each node, which depend only on node and its
parents).
Decisions (Continued)
• Use arc insertion, arc deletion, or arc
direction switch as the transition function.
• Use greedy hill-climbing as the search
strategy, with the following (also greedy)
additional heuristic.
• Sparse candidate algorithm: consider only
networks in which the parents of X are
nodes that have a “high” correlation with X
Decisions (Continued)
• (Continued)… when taken individually.
Note the similarity of this heuristic with the
greedy heuristic in decision tree learning.
This means that probabilistic versions of
exclusive-OR (for example) will cause
problems.