Transcript (PPT)

Hierarchical Sampling for Active Learning
Sanjoy Dasgupta and Daniel Hsu
University of California, San Diego
Session: Active Learning and Experimental Design, ICML 2008
2010-08-27
Presented by Yongjin Kwon
Active Learning
 Nowadays, huge amounts of data are cheaply available and are used to find patterns or to extract information.
 Many supervised learning systems require a large amount of (labeled) training data to perform well.

e.g. email spam, handwritten digits, movie ratings, etc.
 In many applications, however, labeled training data are very
difficult, time-consuming, or expensive to obtain.

e.g. speech recognition, text classification, biological research, etc.
 Problem: Given a pool of unlabeled data and an oracle, how can the machine learn accurately while requesting as few labels as possible?
Active Learning (Cont’d)
 If the machine actively chooses “curious” or “informative” points to learn from, it will perform better with less training!
[Figure: (a) Passive Learning, one-way teaching where everything should be prepared in advance, vs. (b) Active Learning, where the learner queries “curious” points only]
Active Learning (Cont’d)
 What are “curious” or “informative” points?
 If the learner is already sure about the label of a point, then that point is less curious; points with uncertain labels are more curious.
[Figure: sample points annotated as “less curious” and “more curious”]
Typical Active Learning Approach
 Start by querying the labels of a few randomly-chosen points.
 Repeat the following process (a minimal sketch in code appears below):
– Determine the decision boundary on the current set of labeled points.
– Choose the next unlabeled point closest to the current decision boundary (i.e. the most “uncertain” or “informative” point).
– Query that point and obtain its label.
[Figure: binary classification example showing the decision boundary]
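The sketch below illustrates this loop. The pool `X`, the oracle `query_label(i)`, and scikit-learn's LogisticRegression are all illustrative assumptions; the slides do not prescribe a particular classifier.

```python
# A minimal sketch of margin-based active learning, assuming a pool X of
# points and an oracle query_label(i); LogisticRegression is an
# illustrative choice of classifier, not prescribed by the slides.
import numpy as np
from sklearn.linear_model import LogisticRegression

def margin_based_active_learning(X, query_label, seed_size=5, budget=50):
    rng = np.random.default_rng(0)
    # Start by querying the labels of a few randomly-chosen points
    # (assumed to contain both classes, so the classifier can be fit).
    labeled = list(rng.choice(len(X), size=seed_size, replace=False))
    y = {i: query_label(i) for i in labeled}
    for _ in range(budget):
        # Determine the decision boundary on the current labeled set.
        clf = LogisticRegression().fit(X[labeled], [y[i] for i in labeled])
        # Choose the unlabeled point closest to the boundary (smallest margin).
        unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
        i = unlabeled[np.argmin(np.abs(clf.decision_function(X[unlabeled])))]
        # Query that point and obtain its label.
        y[i] = query_label(i)
        labeled.append(i)
    return clf, labeled
```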
Typical Active Learning Approach (Cont’d)
 Missed Cluster Effect
 Random sampling may fail to explore some regions; such an unexplored region is called a missed cluster.
 After the initial sampling and learning, only points in the center group are queried, due to the sampling bias.
 The learner is NOT consistent even with infinitely many labels! (A small simulation follows below.)
[Figure: data in four groups (45%, 5%, 5%, 45%); only the central points are ever selected, so the outer groups become missed clusters]
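A small simulation (a hypothetical setup, not from the slides) makes the effect concrete: with 1-D data in four groups of 45%, 5%, 5%, and 45% and a simple threshold classifier, margin-based querying never leaves the two small central groups.

```python
# A hypothetical simulation of the missed cluster effect: 1-D data in four
# groups (45% / 5% / 5% / 45%) and a threshold classifier; margin-based
# querying stays inside the two small central groups.
import numpy as np

rng = np.random.default_rng(0)
centers = [-6.0, -1.0, 1.0, 6.0]   # two big outer and two small inner groups
sizes = [900, 100, 100, 900]
X = np.concatenate([c + 0.3 * rng.standard_normal(n) for c, n in zip(centers, sizes)])
y = (X > 0).astype(int)            # labels consistent with a boundary at 0

labeled = list(rng.choice(len(X), size=10, replace=False))  # initial random sample
for _ in range(100):
    # Threshold classifier: midpoint between the labeled class means.
    m0 = X[[i for i in labeled if y[i] == 0]].mean()
    m1 = X[[i for i in labeled if y[i] == 1]].mean()
    boundary = (m0 + m1) / 2.0
    # Margin rule: query the unlabeled point closest to the boundary.
    unlabeled = np.setdiff1d(np.arange(len(X)), labeled)
    labeled.append(unlabeled[np.argmin(np.abs(X[unlabeled] - boundary))])

queried = X[labeled[10:]]
print("queried points lie in [%.2f, %.2f]" % (queried.min(), queried.max()))
# The queried range stays near the center, so if an outer 45% group carried
# unexpected labels, this learner would never find out.
```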
Cluster-Adaptive Sampling
 How can adaptive sampling really be helpful?
 Exploiting (cluster) structure in data
 Efficient search through the hypothesis space
[Figure: data falling into a few clusters, where probably just four labels are needed!]
Cluster-Adaptive Sampling (Cont’d)
 Query points in order to recognize the label of each cluster, rather than the label of each individual point.
 Every point will be assigned the label of the corresponding cluster.
 How to obtain the label of a cluster?
 Sample some points from the cluster and find the majority label (a sketch follows below).
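A minimal sketch of this majority-vote step, assuming `cluster` is a list of point indices and `query_label(i)` is the labeling oracle (both hypothetical names):

```python
# A minimal sketch: estimate a cluster's label by sampling points from it
# and taking the majority vote; `cluster` and `query_label` are assumptions.
import random
from collections import Counter

def cluster_majority_label(cluster, query_label, budget=10, rng=random.Random(0)):
    """Query up to `budget` random points in the cluster and return the
    majority label together with its observed fraction (a purity estimate)."""
    sample = rng.sample(cluster, min(budget, len(cluster)))
    counts = Counter(query_label(i) for i in sample)
    label, freq = counts.most_common(1)[0]
    return label, freq / len(sample)
```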
Cluster-Adaptive Active Learning
 Algorithm
 Unsupervised hierarchical clustering of the unlabeled data
 Cluster-adaptive sampling (a rough sketch in code follows below)
– Idea: find a pruning of the tree (a set of nodes in the tree that forms a partition of the set of leaves) consisting of “pure” nodes (or clusters).
– Choose a node v in the current pruning with probability prob[v] first, choose a random point p within the node v, and then query p’s label.
– Assess the “purity” of each node and change the pruning to a better one.
– After finding the best pruning, assign each point the label of the corresponding cluster.
 Supervised learning on the resulting fully labeled data
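A rough sketch of this loop follows, assuming a binary tree whose nodes carry `left`/`right` children and a `points` list (hypothetical names). The paper's actual purity estimates, admissibility rules, and choice of prob[v] are more careful than the crude 90% test used here.

```python
# A rough sketch of cluster-adaptive sampling on a hierarchical clustering.
# Assumed interface: each node has .left/.right (None at leaves) and .points,
# and query_label(p) is the oracle. The 90% purity test is a crude stand-in
# for the paper's statistical estimates.
import random
from collections import Counter

def cluster_adaptive_labels(root, query_label, budget, rng=random.Random(0)):
    pruning = [root]                      # current partition of the leaves
    counts = {}                           # observed label counts per node
    for _ in range(budget):
        # Choose a node v in the pruning; here prob[v] is proportional to size.
        v = rng.choices(pruning, weights=[len(u.points) for u in pruning])[0]
        p = rng.choice(v.points)          # choose a random point within v
        counts.setdefault(v, Counter())[query_label(p)] += 1
        # Assess purity; if v looks impure, refine the pruning at v.
        c = counts[v]
        impure = len(c) > 1 and c.most_common(1)[0][1] < 0.9 * sum(c.values())
        if impure and v.left is not None:
            pruning.remove(v)
            pruning.extend([v.left, v.right])
    # Assign every point the majority label of its cluster in the pruning.
    labels = {}
    for v in pruning:
        top = counts.get(v, Counter()).most_common(1)
        for p in v.points:
            labels[p] = top[0][0] if top else None
    return labels
```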
Cluster-Adaptive Active Learning (Cont’d)
 Hierarchical clustering
[Figure: a hierarchical clustering tree built over nine points (numbered 1–9), with the leaves (points) at the bottom]
Cluster-Adaptive Active Learning (Cont’d)
 Cluster-adaptive sampling
[Figure: the same tree with a pruning highlighted, i.e. a set of nodes that together cover all nine leaves (points)]
Cluster-Adaptive Active Learning (Cont’d)
 Labeling
[Figure: the same tree after labeling; each leaf (point) receives the label of the pruning node above it]
Cluster-Adaptive Active Learning (Cont’d)
 Comparison between sampling methods
 Random Sampling
– If there is a pruning with m nodes and error ε, then only O(m/ε) labels are needed before finding a pruning with error O(ε).
 Active Sampling
– Never more than a constant factor worse than random sampling.
– Converges more quickly than random sampling.
Experiments
 Hierarchical Clustering
 Ward’s average linkage clustering
 Baseline Algorithms
 Random sampling (passive learning)
 Margin-based sampling (queries points closest to the current classifier)
 Model
 Regularized logistic regression, with the trade-off parameter chosen by 10-fold cross-validation (see the sketch below)
 Classification
 OCR digit images (multi-class classification)
 Newsgroup text (binary classification)
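A plausible reconstruction of this setup in Python (stand-in data; the exact features, preprocessing, and the norm of the regularizer are not specified in the transcript):

```python
# A plausible reconstruction of the experimental setup, not the authors' code:
# Ward-linkage hierarchical clustering via SciPy, then regularized logistic
# regression with the trade-off parameter chosen by 10-fold cross-validation.
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
X = rng.random((200, 16))      # stand-in for OCR digit features
y = np.arange(200) % 10        # stand-in for the ten digit classes

# Each row of Z records one merge; together the rows encode the cluster tree
# whose prunings the sampling procedure searches over.
Z = linkage(X, method="ward")

# Final step: supervised learning on the (fully labeled) data.
clf = LogisticRegressionCV(Cs=10, cv=10, max_iter=1000).fit(X, y)
print("training accuracy: %.3f" % clf.score(X, y))
```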
Experiments (Cont’d)
 OCR Digit Images
[Figures: errors of the best prunings in the OCR digits tree; test error curves on the classification task]
Experiments (Cont’d)
 Newsgroup Text
[Figure: test error curves on the classification task]
Conclusions
 Cluster-Adaptive Active Learning
 Exploits cluster structure inherent in the data.
 Manages sampling bias so that the learner remains consistent.
 Empirically outperforms random sampling and is competitive with inconsistent active learning methods.
Discussions
 Cluster-adaptive active learning introduces another approach to active learning.
 It looks at the structure of a large amount of data.
 The choice of hierarchical clustering method may affect the total cost.
 It is unclear whether this method can cope with very large amounts of unlabeled data.