Transcript (presentation slides)
Hierarchical Sampling for Active Learning
Sanjoy Dasgupta and Daniel Hsu
University of California, San Diego
Session: Active Learning and Experimental Design, ICML 2008
2010-08-27
Presented by Yongjin Kwon
Active Learning
Nowadays huge amounts of data are cheaply available and are
used to find patterns or to extract information.
Many supervised learning systems require a large amount of
(labeled) training data to perform well.
e.g. email spam filtering, handwritten digit recognition, movie rating prediction, etc.
In many applications, however, labeled training data are very
difficult, time-consuming, or expensive to obtain.
e.g. speech recognition, text classification, biological research, etc.
Problem: Given a pool of unlabeled data and an oracle that provides
labels on request, how can the machine learn an accurate model while
requesting as few labels as possible?
Active Learning (Cont’d)
If the machine actively selects “curious” or “informative” data
points to learn from, it can perform better with less training!
[Figure: (a) Passive learning: one-way teaching, everything should be prepared in advance. (b) Active learning: the learner queries “curious” points only and receives answers.]
Active Learning (Cont’d)
What are “curious” or “informative” points?
If the learner is already sure about the label of a point, then the
point is less curious; points whose labels are uncertain are more curious.
[Figure: examples of less curious and more curious points.]
Typical Active Learning Approach
Start by querying the labels of a few randomly-chosen points.
Repeat the following process (see the sketch below):
Determine the decision boundary on the current set of labeled points.
Choose the unlabeled point closest to the current decision boundary
(i.e. the most “uncertain” or “informative” point).
Query that point and obtain its label.
[Figure: binary classification example with the current decision boundary.]
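A minimal sketch of this margin-based query loop, assuming a binary linear classifier and hypothetical helper names (not code from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def margin_based_active_learning(X_pool, oracle, n_seed=5, n_queries=50, seed=None):
    """Typical margin-based active learning loop (sketch).

    X_pool : unlabeled points, shape (n, d)
    oracle : callable index -> label, simulating a human annotator
    Assumes the random seed set contains both classes.
    """
    rng = np.random.default_rng(seed)
    n = len(X_pool)

    # 1. Start by querying the labels of a few randomly chosen points.
    labeled = list(rng.choice(n, size=n_seed, replace=False))
    labels = {i: oracle(i) for i in labeled}

    clf = LogisticRegression()
    for _ in range(n_queries):
        # 2. Determine the decision boundary on the current labeled set.
        clf.fit(X_pool[labeled], [labels[i] for i in labeled])

        # 3. Choose the unlabeled point closest to the decision boundary,
        #    i.e. the smallest |decision function| value (most uncertain point).
        margins = np.abs(clf.decision_function(X_pool))
        margins[labeled] = np.inf          # never re-query labeled points
        i = int(np.argmin(margins))

        # 4. Query that point and obtain its label.
        labels[i] = oracle(i)
        labeled.append(i)

    return clf, labels
```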
Typical Active Learning Approach (Cont’d)
Missed Cluster Effect
The initial random sampling may fail to explore some regions of the data,
called missed clusters.
After the initial sampling and learning, only points in the center
group are queried, due to this sampling bias.
The learner is NOT consistent even with infinitely many labels!
[Figure: data in four groups carrying 45%, 5%, 5%, and 45% of the mass; only points near the center are selected, so one of the 5% groups becomes a missed cluster that is never queried.]
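A small sketch of this effect: it generates a 1-D dataset with the 45/5/5/45 structure (the group positions are illustrative assumptions, not the paper's exact figure) and shows that a margin-based rule with a boundary near the center queries only the central groups:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Four groups on a line carrying 45%, 5%, 5%, 45% of the data
# (positions are assumed for illustration).
centers = np.array([-3.0, -1.0, 1.0, 3.0])
weights = np.array([0.45, 0.05, 0.05, 0.45])

group = rng.choice(4, size=n, p=weights)
x = rng.normal(centers[group], 0.2)

# Suppose the initial random sample (mostly from the two big groups) has led
# the learner to place its boundary near x = 0.  Margin-based sampling then
# queries only the points closest to that boundary:
queried = np.argsort(np.abs(x))[:20]
print("how many queries fall in each group:", np.bincount(group[queried], minlength=4))
# All queries land in the central groups; the outer regions are never revisited,
# so a cluster whose label disagrees with the current boundary stays "missed".
```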
Cluster-Adaptive Sampling
How can adaptive sampling really be helpful?
Exploiting (cluster) structure in data
Efficient search through hypothesis space
Probably just four labels are needed!
Cluster-Adaptive Sampling (Cont’d)
Query points in order to recognize the label of each cluster,
rather than the label of each individual point.
Every point will be assigned the label of the corresponding cluster.
How to obtain the label of a cluster?
Sample some points from the cluster and find the majority label.
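A minimal sketch of this majority-vote idea, with hypothetical names and a simulated oracle (not code from the paper):

```python
import random
from collections import Counter

def estimate_cluster_label(cluster_points, oracle, n_samples=5, rng=random):
    """Estimate a cluster's label by querying a few random points
    and taking the majority vote among the returned labels."""
    sample = rng.sample(cluster_points, min(n_samples, len(cluster_points)))
    votes = Counter(oracle(p) for p in sample)
    majority_label, _ = votes.most_common(1)[0]
    return majority_label

# Every point in the cluster is then assigned this majority label, so a few
# queries label the whole cluster (at the cost of errors on minority points).
```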
Cluster-Adaptive Active Learning
Algorithm
Unsupervised hierarchical clustering of the unlabeled data
Cluster-adaptive sampling (a condensed sketch is given after this list)
– Idea: find a pruning of the tree (a set of nodes that forms a partition of the set of leaves) consisting of “pure” nodes (or clusters).
– Choose a node v in the current pruning with probability prob[v], choose a random point p within node v, and then query p's label.
– Assess the “purity” of each node and change the pruning to a better one.
– After finding the best pruning, assign each point the label of its corresponding cluster.
Supervised learning on the resulting fully labeled data
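A condensed sketch of the cluster-adaptive sampling step, under simplifying assumptions (binary labels, a tree given as parent/child links, and a crude purity rule that splits a node once the labels sampled inside it disagree). The actual algorithm uses confidence bounds on label proportions and a cost-based choice of prob[v]; all names below are hypothetical.

```python
import random
from collections import Counter, defaultdict

def cluster_adaptive_sampling(tree, root, leaves_under, oracle, budget, rng=random):
    """Sketch of cluster-adaptive sampling on a hierarchical clustering.

    tree         : dict node -> (left_child, right_child); leaves are absent
    root         : root node id
    leaves_under : dict node -> list of data-point ids under that node
    oracle       : callable point id -> label
    budget       : number of label queries allowed
    """
    pruning = {root}                      # current pruning: starts as the root alone
    votes = defaultdict(Counter)          # labels observed inside each node

    for _ in range(budget):
        # Choose a node v in the current pruning; here simply proportional to its
        # size (the real prob[v] also accounts for the node's estimated error).
        nodes = list(pruning)
        weights = [len(leaves_under[v]) for v in nodes]
        v = rng.choices(nodes, weights=weights, k=1)[0]

        # Choose a random point p within node v and query its label.
        p = rng.choice(leaves_under[v])
        votes[v][oracle(p)] += 1          # the full algorithm also propagates
                                          # this observation through the tree

        # Crude purity check: if v looks impure and has children, refine the
        # pruning by replacing v with its children.
        if len(votes[v]) > 1 and v in tree:
            pruning.remove(v)
            pruning.update(tree[v])

    # After sampling, assign each point the majority label of its pruning node
    # (None for nodes that were never sampled).
    labels = {}
    for v in pruning:
        majority = votes[v].most_common(1)[0][0] if votes[v] else None
        for p in leaves_under[v]:
            labels[p] = majority
    return labels
```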
Cluster-Adaptive Active Learning (Cont’d)
Hierarchical clustering
[Figure: a hierarchical clustering tree over nine points; internal nodes group the leaves (points) 1-9.]
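The tree can be built with an off-the-shelf agglomerative clustering routine. A sketch using SciPy (the paper's exact clustering procedure may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(0).normal(size=(9, 2))   # nine unlabeled points

# Bottom-up (agglomerative) clustering; 'average' linkage is one common choice.
Z = linkage(X, method="average")

# Row i of Z merges two existing clusters into a new node with id n_points + i,
# so the last row is the root of the hierarchy.
n_points = X.shape[0]
children = {n_points + i: (int(a), int(b)) for i, (a, b, _, _) in enumerate(Z)}
root = n_points + len(Z) - 1
print("root node:", root, "children of root:", children[root])
```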
Cluster-Adaptive Active Learning (Cont’d)
Cluster-adaptive sampling
[Figure: the same tree with a pruning highlighted; the selected nodes cover every leaf (point) exactly once.]
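A pruning, then, is a set of tree nodes whose leaf sets partition all of the data points. A small helper to check this, using the children map from the SciPy sketch above (hypothetical names):

```python
def leaves_under(node, children, n_points):
    """Return the set of original points (leaf ids < n_points) under a node."""
    if node < n_points:
        return {node}
    left, right = children[node]
    return leaves_under(left, children, n_points) | leaves_under(right, children, n_points)

def is_pruning(nodes, children, n_points):
    """True if the nodes' leaf sets are disjoint and cover every point."""
    covered = []
    for v in nodes:
        covered.extend(leaves_under(v, children, n_points))
    return len(covered) == n_points and len(set(covered)) == n_points

# e.g. the root alone is always a valid pruning, and so is the set of all leaves.
```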
Cluster-Adaptive Active Learning (Cont’d)
Labeling
[Figure: each point (leaf) receives the label of its node in the chosen pruning.]
Cluster-Adaptive Active Learning (Cont’d)
Comparison between sampling methods
Random Sampling
– If there is a pruning with m nodes and error ε, then only O(m/ε) labels are needed before finding a pruning with error O(ε).
Active Sampling
– Never worse than random sampling by more than a constant factor, so consistency is preserved.
– Converges more quickly than random sampling.
Experiments
Hierarchical Clustering
Ward’s average linkage clustering
Baseline Algorithms
Random sampling (passive learning)
Margin-based sampling (query points closest to the current classifier)
Model
Regularized logistic regression, choosing the trade-off
parameter by 10-fold cross-validation
Classification
OCR digit images (multi-class classification)
Newsgroup text (binary classification)
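A sketch of this experimental setup with SciPy and scikit-learn; the exact features, regularization penalty, and parameter grid are not given on the slide, so the choices below are assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.linear_model import LogisticRegressionCV

X = np.random.default_rng(0).normal(size=(200, 30))   # stand-in for OCR/newsgroup features
y = (X[:, 0] > 0).astype(int)                          # stand-in labels

# Hierarchical clustering of the (unlabeled) pool, e.g. Ward linkage.
Z = linkage(X, method="ward")

# Regularized logistic regression; the regularization strength is chosen
# by 10-fold cross-validation over a default grid of trade-off parameters.
clf = LogisticRegressionCV(cv=10, max_iter=1000).fit(X, y)
print("training accuracy of the selected model:", clf.score(X, y))
```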
Experiments (Cont’d)
OCR Digit Images
[Figures: errors of the best prunings in the OCR digits tree; test error curves on the classification task.]
Experiments (Cont’d)
Newsgroup Text
[Figure: test error curves on the classification task.]
Conclusions
Cluster-Adaptive Active Learning
Exploits cluster structure inherent in the data.
Manages sampling bias so that the learner remains consistent.
Empirically outperforms random sampling and is competitive with
inconsistent active learning methods (e.g. margin-based sampling).
Discussions
Cluster-adaptive active learning introduces another
approach to active learning.
It looks at the structure of the large pool of unlabeled data.
The choice of hierarchical clustering method may affect the total cost.
It is unclear whether this method can cope with very large
amounts of unlabeled data.