Statistical Inference, Multiple Comparisons, Random Field Theory
Download
Report
Transcript Statistical Inference, Multiple Comparisons, Random Field Theory
Study of Bayesian network classifier
Huang Kaizhu
Supervisors: Prof. Irwin King
Prof. Lyu Rung Tsong Michael
Markers:
Prof. Chan Lai Wan
Prof. Wong Kin Hong
Outline
• Background
–
–
–
–
What is Bayesian network?
How Bayesian networks can be used as classifiers?
Why choose Bayesian network?
What is problem of Learning Bayesian network ?
• My main works
– Large-Node Chow-Liu-tree
– Maximum likelihood Large-Node-Bounded semi-Naïve BN
• Future work
• Conclusion
Background
• What is Bayesian Network(BN)?
– Composed of a “structure” component G and a “parameter” component
.
– G=(V,E) is a directed acyclic graph. nodes set :V and its edge set is E.
And the nodes represent the attributes, the edges between the nodes
represent the dependence relationship between the nodes.
– is a conditional probability table.
– It encodes the following joint probability among the nodes
(X1,X2,…,Xn):
n
P ( X 1,..., Xn) P ( Xi | Xi )...(1)
i 1
Xi is the parents of Xi
Background(con’t)
Example of Bayesian network:
Structure component
Parameter component
The Bayesian network above encodes the
following probability relationship.
P(F, B, L, D , H)
=P(F) P(B)P(L |F) P(D |F,B)P(H|D)
Background (con’t)
• How can BN be a classifier?
– Firstly use BN to model the dataset
– Then use the distribution encoded in BN to do classification
c( x1 , x2 ,..., xn ) arg max P(c | x1 , x2 ,..., xn )
c
arg max P( x1 , x2 ,..., xn , c)
c
Background (con’t)
•
Why choose Bayesian network?
–
Bayesian network represents some inner relationship
between the attributes
–
The joint probability based on BN can be written as a
decomposable form
n
P ( X 1,..., Xn) P ( Xi | Xi )...(1)
i 1
Xi is the parents of Xi
Background (con’t)
• What is the problem of learning Bayesian network?
– Given a training data set D={u1, u2 , u3 …uN } of instances of attributes U,
find a network B that best matches D.
• What’s the difficulty in learning BN?
– Generally speaking, BN optimization problem is intractable.
• Two Approaches
– Either we constrain the searching in a certain restricted class of
networks (Naïve BN, Semi-Naïve BN, CL-tree etc)
Q1: Are these restricted class enough to represent the data?
– Or we adopt some heuristic methods on general networks (K2 etc)
• Q2: Are the heuristic methods on general network efficient ?
Q3: Are the heuristic methods on general network redundant to represent the
data?
Background (con’t)
• Problems in two approaches
– Q1: Are these restricted class enough to represent the data?
• No, in many cases, they are really too limited in expression ability to
model the data.
– Q2: Are the heuristic methods on general network efficient ?
• No, they have a big search space, which will be greatly timeconsuming
– Q3: is it possible that general networks obtained by the
heuristic methods are redundant to represent the data?
• Yes, sometimes, these methods favors more complex structure, which
will really increase the risk of over-fitting problem.
Possible solutions
– Upgrading Solution
• How about we obtain a restricted BN firstly, then we aim at solving
the shortcomings of this network caused by the restriction and
upgrade it into a not so simple structure?
– Bound Solution
• Can we take some strategies to bound the complexity of networks,
then we find the best structure in this bound. The final network can be
controlled by a bound parameter.
Work1:Large node Chow-Liu tree
• Upgrade Chow-Liu tree(CLT) into Large node ChowLiu tree(LNCLT)
– What is the restriction of CLT?
• CLT restricts the network in a tree structure among the variables
– Shortcomings caused by the restriction
• CLT can not represent many dataset with a non-tree underlying
structure .
– Observations:
• A “large node tree” may partially
solve this shortcoming.
– Example:
• Right figure
P( A, B, C , D, E ) P( A) P( BC | A) P( DE | BC )
Work1:Large node Chow-Liu tree
• Large-node, which is a combination of several nodes,
may partially relax the tree restriction.
In forming a large node,There are two requirements.
– Requirement 1
• Large-node must be really like a single node which means the nodes
in a Large node are really more dependent on each other.
– Requirement 2
• Large-node can not be too “large” or the probability estimation of this
large node will not be not reliable
– An extreme situation is that we combine all the nodes into a large
node. This situation will lost all the advantages of Bayesian
network.
Upgrade CLT into Large-Node-CLT
• A bounded Frequent itemset can satisfy the
Requirement 1 & 2
– What is Frequent itemset?
• It is the set of attributes that come together with each other frequently.
• Example: Food store---{bread}, {button}, {bread, button}
– An frequent itemset with high frequency is more like a “large node”.
---Requirement 1
– We restrict that the the number of nodes involved in a large node is no
greater than a K threshold
---Requirement 2
– Frequent itemset can be obtained according to the algorithm Apriori in
[AS1994]
Upgrade CLT into Large-Node-CLT
•
The construction algorithm
1.
Call Apriori[AS94] to generate the frequent itemsets, which have the size less
than k. Record all the frequent itemsets together with their supports into list L.
2. Draft the CL-tree of the dataset according to the CLT algorithm
3. Until L is null
4. Iteratively combine the frequent itemsets which satisfy the combination
conditions: father-son or sibling relationship
1.{A,C}
does not satisfy the
2.f{B,C}
is the condition,
biggest andfilter
satisfies
combination
out
3..Filter
the frequent itemsets
which
P( A, B, C , D, E ) P( A) P( BC | A) P( DE | BC )
combination
condition,
{A,C}
4..{D,
Ecoverage
} is
frequent
have them
with
{B,C}itemset
,
combine
intothe
(c)
satisfies
the and
{D,E}
is left. the combination
Example:
condition, combine them into
(d) the k is 2, after step 1, we get the
We assume
frequent itemsets {A, B} {A, C},{B, C}, {B, E},
{B, D}, {D, E}. And f({B, C})>f({A, B})> f({B,
E}) >f({B, D})>f({D, E}) (s represents the
frequency of frequent itemsets). (b) is the CLT in
step2.
Experiment
• Database
–
The experiments are conducted on MNIST handwritten digit database.
– MNIST consists of :
• a 60000-digit training dataset
• a 10000-digit testing dataset.
• Both are 28*28 gray-level digit datasets
Experiment
• Preprocessing of MNIST database
– Binarization :We use a global binarization method to binarize
MNIST datasets.
– Feature Extraction[Bakis68]: 4*4*6=96 dimension feature
Experiment
• We build 10 LNCLTs for 10 digits, we give out the
classification result by selecting the LNCLT which has
a maximum probability output.
• We compare LNCLT with CLT in :
– Data fitness---log likelihood
– Recognition rate
Experimental results
Data fitness---Log likelihood testing
Experimental results
Recognition rate
We randomly selected 1000 digits as test datasets from the
10000-digit testing dataset. We do the testing 10 times
Work2: Bound approach in semi-Naïve
Bayesian network
1.A bounded Semi-Naïve
Bayesian network(SNB).
2. We reduced the SNB into
a network which has the
same number K of nodes
in every subset, K is the
bound parameter.
3. We use Linear
programming to do the
optimization.
4. Our solution is shown
sub-optimal
Comparison between our model and traditional SNB
• Time cost
– Our model can be solved in a polynomial time
– Traditional SNB has an exponential time cost
• Structure
– Each large node in our model has the same number of nodes K, K is a
bound parameter
– The number of nodes in subsets of traditional SNB are not same and some
values of this number may be very large.
• Performance
– Our model is shown to be a sub-optimal in the bound restriction
– In traditional SNB , there is no evidence that show it is optimal or suboptimal
Experimental results
• We evaluate our approach on Tic and vote dataset from UCI machine learning
repository
Future work
• Evaluate our approaches based on a large number of
datasets in Machine Learning repository from UCI
• Build a Bayesian network which combine the
upgrading strategy and bound strategy
– In fact we are considering if we can upgrade our boundedSNB into a mixture model of bounded-SNB.
Conclusion
• A dilemma between simple structure and complex
structure seems to exist in learning Bayesian network
classifiers .
• In this presentation, we test two approaches to deal
with this problem. One is the Large-node Chow-Liu
tree approach which is based on upgrading idea and the
other is bounded semi-Naïve Bayesian network.
• The experimental results show that these two
approaches are promising and encouraging.
Main Reference
•
•
•
•
•
•
•
•
•
•
•
[AS1994] R. Agrawal, R. Srikant, 1994,“Fast algorithms for mining association rules”, Proc. VLDB-94
1994.
[Chow, Liu1968] Chow, C.K. and Liu, C.N. (1968). Approximating discrete probability distributions
with dependence trees. IEEE Trans. on Information Theory, 14,(pp462-467)
[Friedman1997] Friedman, N., Geiger, D. and Goldszmidt, M. (1997). Bayesian Network Classifiers.
Machine Learning, 29,(pp.131-161)
[Kononenko1991] Kononenko, I. (1991). Semi-naive Bayesian classifier. In Y. Kodratoff (Ed.),
Proceedings of sixth European working session on learning (pp.206-219). Springer-Verlag
[Maxwell1995] Learning Bayesian Networks is NP-Complete
[Pearl1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: networks of plausible
inference, Morgan Kaufmann.
[Cheng1997] Cheng, J. Bell, D.A. Liu, W. 1997, Learning Belief Networks from Data: An Information
Theory Based Approach. In Proceedings of ACM CIKM’97
[Cheng2001] Cheng, J. and Greiner, R. 2001,
Learning Bayesian Belief Network Classifiers: Algorithms and System, E.Stroulia and S.
Matwin(Eds.): AI 2001, LNAI 2056, (pp.141-151),
[Meretakis, Wuthrich1999] Meretakis, D. and Wuthrich, B. Extending Naive Bayes Classifiers using
long Itemsets. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, San Diego, (pp. 165—174)
[Srebro2000] Artificial Intelligence Laboratory, Massachusetts Institute of Technology
Cambridge, Massachusetts 02139, http://www.ai.mit.edu
Q& A
Thanks!
Work2: Bound strategy on Semi-Naïve BN
• We restrict the semi-naïve network into not too complex
structure.
Large Node Bounded
semi-Naïve
BN Model
Bounded-SNB
MODEL
DEFINITION
Reduce Bounded-SNB MODEL
According to Lemma 1, given a bound K, we should not separate the variables set into too many small subsets. Or it
is more possible that we can combine some of the subsets into a new subset whose cardinality is no greater than K,
thus the new SNB will be coarser than the old one.
From this viewpoint, we reduce the searching space of BLN-SNB into a K-regular SNB space since there are no
possibility that a SNB coarser than K-regular SNB exists in the K-bound. Even though it is reasonable to search the
maximum likelihood SNB in the K-regular-SNB space, we won't say that: a K-regular SNB is absolutely better than
a non-K-regular SNB with the biggest cardinality no more than K . It is obvious some non-K-regular SNBs can not
be combined into a K-regular SNB. Thus in such a way, we reduce the searching space into a sub-space of K-bound
SNB.
Difference between our model & traditional SNB
1. Different approach
•
•
Traditional SNB employs independence testing to find the
semi structure,which will cause an exponential
computational cost.
Our approach employs the linear programming method to
find the semi structure, which is polynomial in
computational complexity.
2. Different performance
•
•
There are no evidence that shows traditional SNB can find
an optimal or sub-optimal structure.
Our approach can maintain a sub-optimal structure.
K-Bounded-SNB Problem
K-Bounded-SNB Problem:
Finding the m= [n/K ] K-cardinality subsets from attributes set
which satisfy the SNB conditions to maximize the log likelihood (3).
[x] means rounding the x to the nearest integer
Transforming into Integer Programming Problem
Model definition
If we relax the (6) into 0x 1, IP is
transformed into a Linear Programming
problem which can be solved in a
polynomial time.
Computational complexity analysis
Traditional SNB time cost is
exponential cost
Our model is polynomial time cost