Introduction to
Predictive Learning
LECTURE SET 9
Nonstandard Learning Approaches
Electrical and Computer Engineering
1
OUTLINE
• Motivation for non-standard approaches
- Learning with sparse high-dimensional data
- Formalizing application requirements
- Philosophical motivation
• New Learning Settings
- Transduction
- Universum Learning
- Learning Using Privileged Information
- Multi-Task Learning
• Summary
2
Sparse High-Dimensional Data
• Recall standard inductive learning
• High-dimensional, low sample size (HDLSS) data:
• Gene microarray analysis
• Medical imaging (e.g., sMRI, fMRI)
• Object and face recognition
• Text categorization and retrieval
• Web search
• Sample size is smaller than dimensionality of the
input space, d ~ 10K–100K, n ~ 100’s
• Standard learning methods usually fail for such
HDLSS data.
3
Sparse High-Dimensional Data
• HDLSS data looks like a porcupine: the volume of
a sphere inscribed in a d-dimensional cube becomes a
vanishing fraction of the cube's volume as d grows!
• A point is closer to an edge than to another point
• Pairwise distances between points are nearly the same
4
How to improve generalization for HDLSS?
Conventional approaches use
Standard inductive learning + a priori knowledge:
• Preprocessing and feature selection (preceding learning)
• Model parameterization (~ selection of good kernels)
• Informative prior distributions (in statistical methods)
Non-standard learning formulations
• Seek new generic formulations (not methods!) that better
reflect application requirements
• A priori knowledge + additional data are used to derive
new problem formulations
5
Formalizing Application Requirements
• Classical statistics: parametric model is given (by experts)
• Modern applications: complex iterative process
→ Non-standard (alternative) formulation may be better!
[Diagram: APPLICATION NEEDS (input, output, and other variables;
training/test data; loss function; admissible models) → FORMAL PROBLEM
STATEMENT → LEARNING THEORY]
6
Philosophical Motivation
• Philosophical view 1 (Realism):
Learning ~ search for the truth (estimation of
the true dependency from available data)
→ System identification ~ Inductive Learning,
where a priori knowledge is about the true model
7
Philosophical Motivation (cont’d)
• Philosophical view 2 (Instrumentalism):
Learning ~ search for instrumental knowledge
(estimation of a useful dependency from available data)
→ VC-theoretical approach ~ focus on the learning formulation
8
VC-theoretical approach
• Focus on the learning setting
(formulation), not on a learning method
• Learning formulation depends on:
(1) available data
(2) application requirements
(3) a priori knowledge (assumptions)
• Factors (1)-(3), combined using Vapnik's
Keep-It-Direct (KID) Principle, yield a
learning formulation
9
Contrast these two approaches
• Conventional (statistics, data mining):
a priori knowledge typically reflects properties of a
true (good) model, i.e.,
a priori knowledge ~ parameterization f(x, w)
But why should a priori knowledge be about the true model?
• VC-theoretic approach:
a priori knowledge ~ how to use/incorporate
available data into the problem formulation;
often a priori knowledge ~ available data samples
of a different type → new learning settings
10
Examples of Nonstandard Settings
• Standard inductive setting, e.g., digits 5 vs. 8:
Finite training set (x_i, y_i)
Predictive model derived using only training data
Prediction for all possible test inputs
• Possible modifications
- Transduction: Predict only for given test points
- Universum Learning: available labeled data ~ examples
of digits 5 and 8 and unlabeled examples ~ other digits
- Learning using Privileged Information:
training data provided by t different persons. Group label is
known only for training data, but not available for test data.
- Multi-Task Learning:
training data ~ t groups (from different persons)
test data ~ t groups (group label is known)
11
SVM-style Framework for New Settings
• Conceptual Reasons:
Additional info/data → new type of SRM structure
• Technical Reasons:
New knowledge encoded as additional constraints
on complexity (in SVM-like setting)
• Practical Reasons:
- new settings directly comparable to standard SVM
- standard SVM is a special case of a new setting
- optimization s/w for new settings may require
minor modification of (existing) standard SVM s/w
12
OUTLINE
• Motivation for non-standard approaches
• New Learning Settings
- Transduction
- Universum Learning
- Learning Using Privileged Information
- Multi-Task Learning
Note: all settings assume binary classification
• Summary
13
Transduction (Vapnik, 1982, 1995)
• How to incorporate unlabeled test data into the
learning process? Assume binary classification
• Estimating a function at given points
Given: labeled training data (x_i, y_i), i = 1, ..., n,
and unlabeled test points x*_j, j = 1, ..., m
Estimate: class labels y* = (y*_1, ..., y*_m) at these test points
Goal of learning: minimization of risk on the test set:
$$ R(\mathbf{y}^*) = \frac{1}{m} \sum_{j=1}^{m} \int L\!\left(y_j^*,\, y\right) dP(y \mid \mathbf{x}_j^*) $$
where the predicted labels are y*_j = f(x*_j, ω), j = 1, ..., m
14
Transduction vs Induction
[Diagram: induction goes from training data (plus a priori knowledge /
assumptions) to an estimated function; deduction then goes from the
estimated function to predicted outputs; transduction goes directly from
training data to predicted outputs.]
15
Transduction based on size of margin
• Binary classification, linear parameterization,
joint set of (training + working) samples
Note: working sample = unlabeled test point
• Simplest case: single unlabeled (working) sample
• Goal of learning
(1) explain well available data (~ joint set)
(2) achieve max falsifiability (~ large margin)
~ Classify all (working + training) samples by the
largest possible margin hyperplane (see below)
16
Margin-based Local Learning
• Special case of transduction: single working point
• How to handle many unlabeled samples?
17
Transduction based on size of margin
• Transductive SVM learning has two objectives:
(TL1) separate labeled data using a large-margin hyperplane
~ as in standard SVM
(TL2) separate working data using a large-margin hyperplane
18
Loss function for unlabeled samples
• Non-convex loss function for unlabeled samples (zero outside the margin, positive inside it):
• Transductive SVM constructs a large-margin
hyperplane for labeled samples AND forces this
hyperplane to stay away from unlabeled samples
19
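The loss described above can be sketched in a few lines (function names are illustrative): an unlabeled sample is penalized whenever it projects inside the margin, whichever side it falls on, which is exactly what makes the transductive objective non-convex.

```python
def hinge_loss(y, f):
    """Standard SVM hinge loss for a labeled sample (y in {-1, +1})."""
    return max(0.0, 1.0 - y * f)

def unlabeled_loss(f):
    """Symmetric loss for an unlabeled sample: zero outside the margin
    (|f| >= 1), positive inside it, regardless of the predicted side.
    Non-convex: it is the minimum of the two labeled hinge losses."""
    return min(hinge_loss(+1, f), hinge_loss(-1, f))
```

Note that `unlabeled_loss(f)` equals max(0, 1 - |f|): the hyperplane is penalized for passing close to unlabeled samples.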
Optimization formulation for SVM transduction
• Given: joint set of (training + working) samples
• Denote slack variables ξ_i for training samples, ξ*_j for working samples
• Minimize
$$ R(\mathbf{w}, b) = \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi_j^* $$
subject to
$$ y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \ldots, n $$
$$ y_j^*[(\mathbf{w} \cdot \mathbf{x}_j^*) + b] \ge 1 - \xi_j^*, \qquad \xi_j^* \ge 0, \quad j = 1, \ldots, m $$
where y*_j = sign((w · x*_j) + b), j = 1, ..., m
→ Solution (~ decision boundary): f(x) = (w* · x) + b*
• Unbalanced situation (small training set / large test set)
→ all unlabeled samples get assigned to one class
• Additional balancing constraint:
$$ \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{m} \sum_{j=1}^{m} [(\mathbf{w} \cdot \mathbf{x}_j^*) + b] $$
20
Optimization formulation (cont’d)
• Hyperparameters C and C* control the trade-off
between explanation and falsifiability
• Soft-margin inductive SVM is a special case of
soft-margin transduction with zero slacks ξ*_j = 0
• Dual + kernel version of SVM transduction
• Transductive SVM optimization is not convex (due to the
non-convexity of the loss for unlabeled data): each unlabeled
sample may end up on either side of the hyperplane, so the
objective has many local minima
→ different optimization heuristics ~ different solutions
• Exact solution (via exhaustive search over all 2^m labelings)
is possible for a small number of test samples (m) – but such
a solution is NOT very useful (~ inductive SVM).
21
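The exhaustive-search idea on this slide can be illustrated concretely (all names and the crude subgradient solver below are illustrative assumptions, not the lecture's algorithm): enumerate all 2^m labelings of a tiny working set, refit a linear soft-margin SVM on the joint set for each labeling, and keep the labeling with the smallest combined objective.

```python
import itertools
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200, seed=0):
    """Crude primal solver for the soft-margin linear SVM objective
    0.5*(w.w) + C * sum_i max(0, 1 - y_i*((w.x_i) + b)),
    via per-sample subgradient descent. A sketch, not production code."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            gw, gb = w.copy(), 0.0          # gradient of the 0.5*(w.w) term
            if margin < 1.0:                # hinge subgradient is active
                gw -= C * y[i] * X[i]
                gb -= C * y[i]
            w -= lr * gw
            b -= lr * gb
    return w, b

def joint_objective(w, b, X, y, X_work, y_work, C, C_star):
    """0.5*(w.w) + C*(training slacks) + C_star*(working slacks)."""
    train_slack = np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
    work_slack = np.maximum(0.0, 1.0 - y_work * (X_work @ w + b)).sum()
    return 0.5 * (w @ w) + C * train_slack + C_star * work_slack

def exhaustive_transduction(X, y, X_work, C=1.0, C_star=1.0):
    """Exact (but exponential-cost) transduction: try all 2^m labelings of
    the working samples; keep the one with the smallest joint objective."""
    best_labels, best_obj = None, np.inf
    for labels in itertools.product([-1.0, 1.0], repeat=len(X_work)):
        y_work = np.asarray(labels)
        w, b = train_linear_svm(np.vstack([X, X_work]),
                                np.concatenate([y, y_work]), C=C)
        obj = joint_objective(w, b, X, y, X_work, y_work, C, C_star)
        if obj < best_obj:
            best_labels, best_obj = y_work, obj
    return best_labels

# Toy example: two well-separated labeled clusters, two working points
X_train = np.array([[2., 0.], [3., 1.], [-2., 0.], [-3., -1.]])
y_train = np.array([1., 1., -1., -1.])
X_work = np.array([[2.5, 0.5], [-2.5, -0.5]])
labels = exhaustive_transduction(X_train, y_train, X_work)
```

Each working point near a cluster inherits that cluster's label, since any other labeling forces a large slack and a much larger objective.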
Many applications for transduction
• Text categorization: classify word documents
into a number of predetermined categories
• Email classification: Spam vs non-spam
• Web page classification
• Image database classification
• All these applications:
- high-dimensional data
- small labeled training set (human-labeled)
- large unlabeled test set
22
Example application
• Prediction of molecular bioactivity for drug
discovery
• Training data~1,909; test~634 samples
• Input space ~ 139,351-dimensional
• Prediction accuracy:
SVM induction ~74.5%; transduction ~ 82.3%
Ref: J. Weston et al, KDD cup 2001 data analysis: prediction
of molecular bioactivity for drug design – binding to
thrombin, Bioinformatics 2003
23
Semi-Supervised Learning (SSL)
• SSL assumes availability of labeled + unlabeled
data (similar to transduction)
• SSL has a goal of estimating an inductive model
for predicting new (test) samples – different from
transduction
• In machine learning, SSL and transduction are
often used interchangeably, e.g., transduction may
be used to estimate an SSL model.
• SSL methods usually combine supervised and
unsupervised learning methods into one algorithm
24
SSL and Cluster Assumption
• Cluster Assumption: real-life application data often
has clusters, due to (unknown) correlations between
input variables. Discovering these clusters using
unlabeled data helps supervised learning
• Example: document classification and info retrieval
- individual words ~ input features (for classification)
- uneven co-occurrence of words implies clustering
of the documents in the input space
- unlabeled documents can be used to identify this
cluster structure, so that just a few labeled examples
are sufficient for constructing a good decision rule
25
Toy Example for Text Classification
• Data Set: 5 documents that need to be classified
into 2 topics ~ 'Economy' and 'Entertainment'
• Each document is defined by 6 features (words)
- two labeled documents (shown in color)
- need to classify three unlabeled documents
• Apply clustering → 2 clusters (x1, x3) and (x2, x4, x5)
[Table: word-occurrence matrix; rows ~ documents L1, L2 (labeled) and U3, U4,
U5 (unlabeled); columns ~ word features Budget deficit, Music, Seafood, Sex,
Interest Rates, Movies; entries of 1 mark which words occur in which document.]
26
Self-Learning Method (example of SSL)
Given initial labeled data set L and unlabeled set U
• Repeat:
(1) estimate a classifier using L
(2) classify a randomly chosen unlabeled sample
using the decision rule estimated in Step (1)
(3) move this newly labeled sample to L
Iterate steps (1)-(3) until all unlabeled samples are
classified
27
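Steps (1)-(3) above can be sketched directly for the 1-NN case used in the illustrations that follow (function and variable names are illustrative):

```python
import numpy as np

def self_train_1nn(X_lab, y_lab, X_unlab, rng=None):
    """Self-learning with a 1-nearest-neighbor classifier: repeatedly pick
    a random unlabeled sample, label it by its nearest labeled neighbor,
    and move it into the labeled set, until none remain unlabeled."""
    rng = np.random.default_rng(rng)
    X_lab = [np.asarray(x, dtype=float) for x in X_lab]
    y_lab = list(y_lab)
    X_unlab = [np.asarray(x, dtype=float) for x in X_unlab]
    while X_unlab:
        k = int(rng.integers(len(X_unlab)))         # step (2): random pick
        x = X_unlab.pop(k)
        dists = [np.linalg.norm(x - xl) for xl in X_lab]
        y_lab.append(y_lab[int(np.argmin(dists))])  # 1-NN decision rule
        X_lab.append(x)                             # step (3): move to L
    return np.array(X_lab), np.array(y_lab)

# Toy example: two labeled seeds, three unlabeled points near the clusters
X_all, y_all = self_train_1nn(
    X_lab=[[0., 0.], [10., 0.]],
    y_lab=[-1, 1],
    X_unlab=[[0.5, 0.], [9.5, 0.], [1.0, 0.]],
    rng=0)
```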
Illustration (using 1-nearest neighbor classifier)
Hyperbolas data:
- 10 labeled and 100 unlabeled samples (green)
[Figure: "Iteration 1" scatter plot over [0.1, 0.9] x [0.1, 0.9] ~ most
samples still unlabeled (green), a few labeled as class +1 / class -1.]
28
Illustration (after 50 iterations)
[Figure: "Iteration 50" scatter plot ~ about half of the unlabeled samples
have been assigned to class +1 or class -1.]
29
Illustration (after 100 iterations)
All samples are labeled now:
[Figure: "Iteration 100" scatter plot ~ every sample is now labeled
class +1 or class -1.]
30
Comparison: SSL vs T-SVM
• Comparison 1 for low-dimensional data:
- Hyperbolas data set (10 labeled, 100 unlabeled)
- 10 random realizations of training data
• Comparison 2 for high-dimensional data:
- Digits 5 vs 8 (100 labeled, 1,866 unlabeled)
- 10 random realizations of training/validation data
Note: validation data set for SVM model selection
• Methods used
- Self-learning algorithm (using 1-NN classification)
- Nonlinear T-SVM (needs parameter tuning)
31
Comparison 1: SSL vs T-SVM and SVM
Methods used
- Self-learning algorithm (using 1-NN classification)
- Nonlinear T-SVM (Poly kernel d=3)
HYPERBOLAS DATA SET
Setting: σ = 0.025; no. of training and validation samples = 10 (5 per class);
no. of test samples = 100 (50 per class).

Method                  Prediction error (%)   Typical parameters
Polynomial SVM (d=3)    11.8 (2.78)            C = 700-7000
T-SVM (d=3)             10.9 (2.60)            C*/C = 10^-6
Self-Learning           4.0 (3.53)             N/A
• Self-learning method is better than SVM or T-SVM
- Why?
32
Comparison 2: SSL vs T-SVM and SVM
Methods used
- Self-learning algorithm (using 1-NN classification)
- Nonlinear T-SVM (RBF kernel)
MNIST DATA SET
Setting: no. of training and validation samples = 100;
no. of test samples = 1,866.

Method          Test error (%)   Typical selected parameter values
RBF SVM         5.79 (1.80)      C ≈ 0.3-7, γ ≈ 5^-3 - 5^-2
RBF T-SVM       4.33 (1.34)      C*/C ≈ 0.01-1
Self-Learning   7.68 (1.26)      N/A
• SVM or T-SVM is better than self-learning method
- Why?
33
Explanation of T-SVM for digits data set
Histogram of projections of labeled + unlabeled data:
- for standard SVM (RBF kernel) ~ test error 5.94%
- for T-SVM (RBF kernel) ~ test error 2.84%
Histogram for RBF SVM (with optimally tuned parameters):
[Figure: histogram of projections for RBF SVM; horizontal axis ~ projection
values from -2.5 to 2.]
34
Explanation of T-SVM (cont’d)
Histogram for T-SVM (with optimally tuned parameters)
Note: (1) test samples are pushed outside the margin borders
(2) most labeled samples project away from margin
[Figure: histogram of projections for T-SVM; horizontal axis ~ projection
values from -3 to 3.]
35
Universum Learning (Vapnik 1998, 2006)
• Motivation: what is a priori knowledge?
- info about the space of admissible models
- info about admissible data samples
• Labeled training samples (as in inductive setting) +
unlabeled samples from the Universum
• Universum samples encode info about the region of
input space (where application data lives):
- from a different distribution than training/test data
- U-samples ~ neither positive nor negative class
• Examples of the Universum data
• Large improvement for small sample size n
36
Cultural Interpretation of the Universum
• Absurd examples, jokes, some art forms
neither Hillary nor Obama but looks like both
37
Cultural Interpretation of the Universum
Marc Chagall:
FACES
38
Cultural Interpretation of the Universum
• Some art forms: surrealism, dadaism
Marcel Duchamp (1919)
Mona Lisa with Mustache
39
More on Marcel Duchamp
Rrose Sélavy (Marcel Duchamp), 1921, Photo by Man Ray.
40
Main Idea of Universum Learning
• Handwritten digit recognition: digit 5 vs 8
Fig. courtesy of J. Weston (NEC Labs)
41
Learning with the Universum
• Inductive setting for binary classification
Given: labeled training data (x_i, y_i), i = 1, ..., n,
and unlabeled Universum samples x*_j, j = 1, ..., m
Goal of learning: minimization of prediction risk (as in
the standard inductive setting)
• Two goals of Universum Learning:
(UL1) separate/explain labeled training data using a
large-margin hyperplane (as in standard SVM)
(UL2) maximize the number of contradictions on the
Universum, i.e., Universum samples inside the margin
Goal (UL2) is achieved by using a special loss function for
Universum samples
42
Inference through contradictions
43

Universum SVM Formulation (U-SVM)
• Given: labeled training + unlabeled Universum samples
• Denote slack variables ξ_i for training, ξ*_j for Universum samples
• Minimize
$$ R(\mathbf{w}, b) = \frac{1}{2}(\mathbf{w} \cdot \mathbf{w}) + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{m} \xi_j^*, \qquad C, C^* \ge 0 $$
subject to
$$ y_i[(\mathbf{w} \cdot \mathbf{x}_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n \quad \text{(labeled data)} $$
$$ |(\mathbf{w} \cdot \mathbf{x}_j^*) + b| \le \varepsilon + \xi_j^*, \quad \xi_j^* \ge 0, \quad j = 1, \ldots, m \quad \text{(Universum)} $$
where the Universum samples use the ε-insensitive loss
• Convex optimization
• Hyper-parameters C, C* ≥ 0 control the trade-off between
minimization of errors and maximizing the number of contradictions
• When C* = 0 → standard soft-margin SVM
44
ε-insensitive loss for Universum samples
[Figure: the loss is zero inside the band [-ε, +ε] around the hyperplane and
grows linearly (slack ξ*) outside it.]
45
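The ε-insensitive loss above can be written directly (a sketch; the function name and the default ε value are illustrative):

```python
def universum_loss(f, eps=0.1):
    """eps-insensitive loss for a Universum sample: zero when the sample
    projects inside the band [-eps, +eps] around the hyperplane (a
    'contradiction'), growing linearly outside the band."""
    return max(0.0, abs(f) - eps)
```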
Application Study (Vapnik, 2006)
• Binary classification of handwritten digits 5 and 8
• For this binary classification problem, the following
Universum sets were used:
U1: randomly selected other digits (0,1,2,3,4,6,7,9)
U2: randomly mixing pixels from images 5 and 8
U3: average of randomly selected examples of 5 and 8
Training set size tried: 250, 500, … 3,000 samples
Universum set size: 5,000 samples
• Prediction error: improved over standard SVM, e.g.,
for 500 training samples: 1.4% vs. 2% (SVM)
46
Universum U3 via random averaging (RA)
[Figure: a Universum sample generated by averaging one example from class +1
and one from class -1, shown relative to the separating hyperplane.]
47
Random Averaging for digits 5 and 8
• Two randomly selected examples
• Universum sample:
48
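Random Averaging can be sketched in a few lines of numpy (the function name is illustrative): each Universum sample is the mean of one randomly chosen example from each class.

```python
import numpy as np

def random_averaging_universum(X_pos, X_neg, n_samples, seed=0):
    """U3-style Universum via Random Averaging (RA): each Universum sample
    is the average of one randomly chosen positive and one randomly chosen
    negative training example."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_pos), size=n_samples)
    j = rng.integers(0, len(X_neg), size=n_samples)
    return 0.5 * (X_pos[i] + X_neg[j])

# Toy example: averaging an all-zeros class with an all-ones class
U = random_averaging_universum(np.zeros((3, 4)), np.ones((2, 4)), n_samples=5)
```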
Application Study: gender of human faces
• Binary classification setting
• Difficult problem:
dimensionality ~ large (10K - 20K)
labeled sample size ~ small (~ 10 - 20)
• Humans perform very well for this task
• Issues:
- possible improvement (vs standard SVM)
- how to choose Universum?
- model parameter tuning
49
Male Faces: examples
50
Female Faces: examples
51
Universum Faces:
neither male nor female
52
Empirical Study
(Bai and Cherkassky 2008)
• Gender classification of human faces (male/ female)
• Data: 260 pictures of 52 individuals (20 females and 32
males, 5 pictures for each individual) from Univ. of Essex
• Data representation and pre-processing: image size 46x59 –
converted into gray-scale image, following standard image
processing (histogram equalization)
• Training data: 5 female and 8 male photos
• Test data: remaining 39 photos (of other people)
• Experimental procedure: randomly select 13 training
samples (and 39 test samples). Estimate and compare
the inductive SVM classifier with the U-SVM classifier using N
Universum samples (where N=100, 500, 1000).
- Report results for 4 partitions (of training and test data)
53
Empirical Study (cont’d)
• Universum generation:
U1 Average: average of a male and a female sample randomly
selected from the training set
U2 Empirical Distribution: estimate pixel-wise
distribution of the training data. Generate a new
picture from this distribution
U3 Animal faces:
54
Universum generation: examples
• U1 Averaging:
• U2 Empirical Distribution:
55
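The U2 (empirical distribution) generation above can be sketched as per-column resampling of the training data (the function name is illustrative): each pixel of a new sample is drawn independently from that pixel's empirical distribution.

```python
import numpy as np

def empirical_distribution_universum(X_train, n_samples, seed=0):
    """U2-style Universum: each feature (pixel) of a new sample is drawn
    independently from the empirical distribution of that feature in the
    training data, i.e., resampled per column with replacement."""
    rng = np.random.default_rng(seed)
    n, d = X_train.shape
    row_idx = rng.integers(0, n, size=(n_samples, d))
    return X_train[row_idx, np.arange(d)]

# Toy example: 3 'images' with 2 pixels each
X = np.array([[0., 1.], [2., 3.], [4., 5.]])
U = empirical_distribution_universum(X, n_samples=10)
```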
Results of gender classification
• Classification accuracy: improves vs standard SVM by ~
2% with U1, and ~ 1% with U2 Universum
• Universum by averaging gives better results for this
problem, when number of Universum samples N = 500 or
1,000 (shown above for 4 partitionings of the data)
56
Results of gender classification
• Universum ~ Animal Faces:
• Degrades classification accuracy by 2-5%
(vs standard SVM)
• Animal faces are not relevant to this problem
57
Random Averaging Universum
• How to come up with a good Universum?
- usually application-specific
• But the RA Universum is generated from the training data
• Under what conditions is RA U-SVM effective
(~ better than standard SVM)?
• Solution approach:
analyze the histogram of projections of training
samples onto the normal direction vector w of the SVM
model (for high-dimensional data)
58
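The solution approach above (projecting training samples onto the SVM normal direction and histogramming the values) is a one-liner in numpy; the model values below are hypothetical:

```python
import numpy as np

def projections(X, w, b):
    """Project samples onto the normal direction of a (linear) SVM model:
    f(x) = (w . x) + b, so the margin borders sit at -1 and +1."""
    return X @ w + b

# Hypothetical trained model and data
w, b = np.array([1.0, 0.0]), 0.0
X = np.array([[2.0, 5.0], [-1.0, 3.0], [0.5, -2.0]])
proj = projections(X, w, b)
counts, edges = np.histogram(proj, bins=12, range=(-3.0, 3.0))
```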
Histogram of projections and RA Universum
• Typical histogram (for high-dimensional data)
Case 1:
[Figure: histogram of projections (Case 1); vertical axis log-scale
(10^0 to 10^3), horizontal axis from -1.5 to 1.5.]
59
Histogram of projections and RA Universum
• Typical histogram (for high-dimensional data)
Case 2:
[Figure: histogram of projections (Case 2); counts up to 16, horizontal
axis from -1.5 to 1.5.]
60
Histogram of projections and RA Universum
• Typical histogram (for high-dimensional data)
Case 3:
[Figure: histogram of projections (Case 3); counts up to 250, horizontal
axis from -3 to 3.]
61
Example: Handwritten digits 5 vs 8
• MNIST data set: each digit ~ 28x28 = 784 pixels
Training / validation set ~ 1,000 samples each
Test set ~ 1,866 samples
• For U-SVM ~ 1,000 Universum samples (via RA)
• Comparison of test error rates for SVM and U-SVM:

Kernel                  SVM             U-SVM
MNIST (RBF kernel)      1.37% (0.22%)   1.20% (0.19%)
MNIST (Linear kernel)   4.58% (0.34%)   4.62% (0.37%)
62
Histogram of projections for MNIST data
for RBF SVM: [Figure: distribution of RBF kernel SVM output; horizontal
axis from -2.5 to 2.5]
for Linear SVM: [Figure: distribution of linear SVM output; horizontal
axis from -6 to 6]
63
Analyzing ‘other digits’ Universum
The same set-up but using digits ‘1’ or ‘3’ Universa
[Figures: histograms of projections for the Digit 1 Universum and the
Digit 3 Universum; horizontal axes from -3 to 3.]
Test error: RBF SVM ~ 1.47%, Digit 1 ~ 1.31%, Digit 3 ~ 1.01%
64
Discussion
• Technical aspects: why does the Universum data
improve generalization when d >> n?
- SVM is solved in a low-dimensional subspace
implicitly defined by the Universum data
• How to specify good Universum?
- no formal procedure exists
- conditions for the effectiveness of a particular
Universum data set: see
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5936738
• Philosophical aspects:
- Relation to human learning (cultural Universum)?
- New type of inference? Impact on psychology?
65
Learning Using Privileged Info (LUPI)
Given: training data (x, x*, y)
where x* is privileged info only for training data
Estimate: predictive model y = f(x)
This additional info x* is common in many apps:
Handwritten digit recognition: ~ person’s label
Brain imaging (fMRI): human subject label (in a group of
subjects performing the same cognitive task)
Medical diagnosis: ~ medical history after initial diagnosis
Time series prediction: ~ future values of the training data
66
Learning Using Privileged Info (LUPI)
Given: training data (x, x*, y)
where x* is privileged info only for training data
Estimate: predictive model y = f(x)
Main Idea:
standard learning uses only x;
LUPI uses x + x* (but only during training)
67
LUPI (Vapnik, 2006)
• Traditional setting: training set (x_i, y_i), i = 1, ..., n
• LUPI setting: training set (x_i, x*_i, y_i), i = 1, ..., n,
where x*_i is additional privileged info
• Privileged info: in the form of additional
constraints on errors (slack variables)
• Goal of learning for both settings:
find a function f(x) providing small error for
test inputs
68
Many Application Settings for LUPI
• Medical diagnosis: privileged info ~ expensive
test results, patient history after diagnosis, etc.
• Time series prediction: privileged info ~ future
values of the training time series
• Additional group information (~ structured data)
• Feature selection, etc.
SVM+ is a formal mechanism for LUPI training
69
LUPI and SVM+ (Vapnik 2006)
SVM+ is a new learning technology for LUPI
• Improved prediction accuracy (proven
both theoretically and empirically)
• Similar to learning with a human teacher:
‘privileged info’ ~ feedback from an
experienced teacher (can be very beneficial)
• SVM+ is the most promising new
technology for ‘difficult’ problems
• Computationally complex: no public
domain software currently available
70
LUPI Formalization and Challenges
• LUPI training data in the form (x_i, x*_i, y_i), i = 1, 2, ..., n
• Privileged info x*_i:
- assumed to have informative value (for prediction)
- is not available for the predictive model ŷ = f(x)
• Interpretation of privileged info:
- additional feedback from an expert teacher,
different from 'correct answers' or y-labels
- this feedback is provided in a different space x*
• Challenge: how to handle/combine learning in two
different feature spaces?
71
SVM+ Technology for LUPI
• LUPI training data in the form (x_i, x*_i, y_i), i = 1, 2, ..., n
• SVM+ learning takes place in two spaces:
- decision space Z, where the decision rule ŷ = f(z) is estimated
- correction space Z*, where the privileged info x*_i is encoded
in the form of additional constraints on errors (slack variables)
in the decision space
• The SVM+ approach implements two (kernel) mappings:
- into the decision space: φ(x): x → Z
- into the correction space: φ*(x*): x* → Z*
72
Illustration of SVM+ (hidden info = group label)
[Diagram: input data mapped into one decision space (a single decision
function for all samples) and into correcting spaces (one correcting
function per group r, modeling the slack variables ξ^r); groups 1 and 2
and classes +1 / -1 are shown.]
73
Two Objectives of SVM+ Learning
• LUPI training data in the form (x_i, x*_i, y_i), i = 1, 2, ..., n
SVM+ learning achieves two goals:
(SVM+ 1) separate labeled training data in the decision space Z
using a large-margin hyperplane (as in standard SVM)
(SVM+ 2) incorporate privileged info x*_i in the form of
additional constraints on errors (slack variables) in
the decision space. These constraints are modeled
as (linear) functions in the correction space Z*
74
SVM+ Formulation
Decision space: z_i = φ_z(x_i) ∈ Z
Correcting space: z_i^r = φ_zr(x_i) ∈ Z_r
Decision rule: y = sign[f(x)] = sign[(w, φ_Z(x)) + b]
$$ \min_{\mathbf{w}, \mathbf{w}_1, \ldots, \mathbf{w}_t, b, d_1, \ldots, d_t} \; \frac{1}{2}(\mathbf{w}, \mathbf{w}) + \frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r, \mathbf{w}_r) + C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r $$
subject to:
$$ y_i((\mathbf{w}, \mathbf{z}_i) + b) \ge 1 - \xi_i^r, \quad \xi_i^r \ge 0, \quad i \in T_r, \; r = 1, \ldots, t $$
$$ \xi_i^r = (\mathbf{z}_i^r, \mathbf{w}_r) + d_r, \quad i \in T_r, \; r = 1, \ldots, t $$
75
SVM+ Formulation
• Map the input vectors simultaneously into:
- Decision space (standard SVM classifier)
- Correcting space (where correcting functions
model slack variables for different groups)
• Decision space/function ~ the same for all groups
• Correcting functions ~ different for each group
(but correcting space may be the same)
• SVM+ optimization formulation incorporates:
- the capacity (w, w) of the decision function
- the capacity (w_r, w_r) of the correcting function for each group r
- the relative importance (weight) γ of these two capacities
76
Application Study
(Liang and Cherkassky, 2007)
• SVM+ technology is very new.
Just a few empirical comparison studies
• fMRI data analysis problem (CMU data set)
- six subjects presented with {picture or sentence}
- fMRI data is recorded over 16 time intervals
- need to learn a binary classifier: fMRI image → class
• Data preprocessing (Wang et al, 2004)
- extract 7 input features (Regions of Interest) from the high-dim.
fMRI image → input vector has 16x7 = 112 components
• Comparison between
standard SVM (data from all 6 subjects is pooled together)
SVM+ method (6 groups where each group has 80 samples)
77
fMRI Application Study
(cont’d)
• Experimental protocol: (randomly) split the data
- 60% ~ training
- 20% ~ validation (for tuning parameters)
- 20% ~ test (for estimating prediction error)
• Details of methods used:
- linear SVM classifier (single parameter C)
- SVM  classifier (3 parameters: linear kernel for decision
space, RBF kernel for correcting space, and parameter  )
• Comparison results (over 10 trials):
- standard SVM ~ ave test error: 22.2 % (st. dev. 3.8%)
- SVM  ~ ave test error: 20.2% (st.dev. 3.1%)
78
Multi-Task Learning (MTL)
• Application: Handwritten digit recognition
Labeled training data provided by t persons (t >1)
Goal 1: estimate a single classifier that will generalize well
for future samples generated by these persons ~ LUPI
Goal 2: estimate t related classifiers (for each person) ~ MTL
• Application: Medical diagnosis
Labeled training data provided by t groups of patients (t >1),
say men and women (t = 2)
Goal 1: estimate a classifier to predict/diagnose a disease
using training data from t groups of patients ~ LUPI
Goal 2: estimate t classifiers specialized for each group of
patients ~ MTL
Note: for MTL the group info is known for training and test data
79
Multi-task Learning
[Diagram: (a) Single-task learning ~ each task's training data → its own
predictive model, independently; (b) Multi-task learning (MTL) ~ training
data for tasks 1, ..., t feed a shared task-relatedness model that produces
one predictive model per task.]
80
Contrast Inductive Learning, MTL and LWSD
[Diagram: Inductive Learning ~ pooled training data → a single model f;
test data has no group info. Multi-Task Learning ~ training data for groups
1, ..., t → models f_1, ..., f_t; test data carries group info. Learning
with Structured Data ~ training data with group info → a single model f;
test data has no group info.]
81
Different Ways of Using Group Information
[Diagram: sSVM ~ pooled data → SVM → a single f(x); SVM+ ~ data with group
info → SVM+ → a single f(x); mSVM ~ a separate SVM per group → f1(x), f2(x);
svm+MTL ~ joint svm+MTL training → f1(x), f2(x).]
82
Problem setting for MTL
• Assume binary classification problem
• Training data ~ a union of t related groups (tasks).
Each group r has n_r i.i.d. samples (x_i^r, y_i^r), i = 1, 2, ..., n_r,
r = 1, 2, ..., t, generated from a joint distribution P_r(x, y)
• Test data: task label for test samples is known
• Goal of learning: estimate t models {f_1, f_2, ..., f_t} such that
the sum of expected losses over all tasks is minimized:
$$ R(\omega) = \sum_{r=1}^{t} \int L(y, f_r(\mathbf{x}, \omega)) \, dP_r(\mathbf{x}, y) $$
83
SVM+ for Multi-Task Learning
• New learning formulation: SVM+MTL
• Define the decision function for each group as
f_r(x) = (φ_z(x), w) + b + (φ_zr(x), w_r) + d_r, r = 1, ..., t
• The common decision function (φ_z(x), w) + b
models the relatedness among groups
• Correcting functions fine-tune the model for
each group (task)
84
SVM+MTL Formulation
Decision space: z_i = φ_z(x_i) ∈ Z
Correcting space: z_i^r = φ_zr(x_i) ∈ Z_r
$$ f_r(\mathbf{x}) = \mathrm{sign}\big( (\varphi_z(\mathbf{x}), \mathbf{w}) + b + (\varphi_{zr}(\mathbf{x}), \mathbf{w}_r) + d_r \big), \quad r = 1, \ldots, t $$
$$ \min_{\mathbf{w}, \mathbf{w}_1, \ldots, \mathbf{w}_t, b, d_1, \ldots, d_t} \; \frac{1}{2}(\mathbf{w}, \mathbf{w}) + \frac{\gamma}{2} \sum_{r=1}^{t} (\mathbf{w}_r, \mathbf{w}_r) + C \sum_{r=1}^{t} \sum_{i \in T_r} \xi_i^r $$
subject to:
$$ y_i^r((\mathbf{w}, \mathbf{z}_i) + b + (\mathbf{w}_r, \mathbf{z}_i^r) + d_r) \ge 1 - \xi_i^r, \quad \xi_i^r \ge 0, \quad i \in T_r, \; r = 1, \ldots, t $$
85
Empirical Validation
• Different ways of using group info →
different learning settings:
- which one yields better generalization?
- how is performance affected by sample size?
• Empirical comparisons:
- synthetic data set
86
Different Ways of Using Group Information
[Diagram: sSVM ~ pooled data → SVM → a single f(x); SVM+ ~ data with group
info → SVM+ → a single f(x); mSVM ~ a separate SVM per group → f1(x), f2(x);
svm+MTL ~ joint svm+MTL training → f1(x), f2(x).]
87
Synthetic Data
• Generate x ∈ R^20 with each component x_i ~ Uniform(-1, 1), i = 1, ..., 20
• The coefficient vectors of the three tasks are specified as
β1 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
β2 = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
β3 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
• For each task and each data vector, y = sign(β_i · x + 0.5)
• Details of methods used:
- linear SVM classifier (single parameter C)
- SVM+, SVM+MTL classifiers (3 parameters: linear kernel for the
decision space, RBF kernel for the correcting space, and parameter γ)
- Independent validation set for model selection
88
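The synthetic-data generation described above can be sketched as follows. Note this is an assumption-laden reconstruction: the transcript's formula is garbled, so the +0.5 threshold and all-positive coefficient signs used here are best-effort readings, and the function name is illustrative.

```python
import numpy as np

def make_task_data(beta, n, rng):
    """One task's synthetic data: x ~ Uniform(-1, 1)^d and
    y = sign(beta . x + 0.5). The +0.5 threshold and coefficient signs
    are reconstructed from a garbled slide (assumptions)."""
    X = rng.uniform(-1.0, 1.0, size=(n, len(beta)))
    y = np.sign(X @ beta + 0.5)
    return X, y

rng = np.random.default_rng(0)
beta1 = np.array([1.0] * 10 + [0.0] * 10)   # task 1: first 10 features active
X1, y1 = make_task_data(beta1, 100, rng)
```

The three tasks share most active coefficients, which is exactly the "relatedness" that SVM+MTL exploits.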
Experimental Results
• Comparison results (over 10 trials): ave test accuracy (%) ± std. dev.

# per group   sSVM          SVM+          mSVM          svm+MTL
n = 100       88.11±0.65    88.31±0.84    91.18±1.26    91.47±1.03
n = 50        85.74±1.36    86.49±1.69    85.09±1.40    87.39±2.29
n = 15        80.10±3.42    80.84±3.16    70.73±2.69    79.24±2.81

Note 1: relative performance depends on sample size
→ the 'correct' problem setting depends on the sample size
Note 2: SVM+ is always better than SVM;
SVM+MTL is always better than mSVM
89
OUTLINE
• Motivation for non-standard approaches
• New Learning Settings
• Summary
- Advantages/limitations of non-standard settings
- Vapnik's Imperative
91
Advantages/Limitations of nonstandard settings
• Advantages
- makes common sense
- follows methodological framework (VC-theory)
- yields better generalization (but not always)
- new directions for research
• Limitations
- need to formalize application requirements
→ need to understand the application domain
- generally more complex learning formulations
- more difficult model selection (more parameters)
- few known empirical comparisons (to date)
92
Vapnik’s Imperative
• Vapnik's Imperative:
asking the right question (with finite data)
→ the most direct learning problem formulation
• ~ Controlling the goals of inference, in order to
produce a better-posed problem
• Three types of such restrictions (Vapnik, 2006)
- regularization (constraining function smoothness)
- SRM puts constraints on VC-dimension of models
- constraints on the mode of inference
• Philosophical interpretation (of restrictions):
Find a model that explains well available data and
has large falsifiability
93
Course Summary: Knowledge Discovery
and Philosophical Connections
• Aristotle: All men by nature desire knowledge
• Einstein: Man has an intense desire for assured knowledge
• Sources of assured knowledge:
- religion
- belief in reason
- belief in science (e.g., genetic causes)
94
Scientific Knowledge
• Scientific Knowledge (last 3-4 centuries):
- objective
- recurrent events (repeatable by others)
- quantifiable (described by math models)
• Knowledge ~ causal, deterministic, logical
→ humans cannot reason well about
- noisy/random data
- multivariate (high-dimensional) data
95
The Nature of Knowledge
• Idealism: knowledge is pure thought
• Realism: knowledge comes from experience
• Western history of knowledge:
- Ancient Greece ~ idealism
- Classical science ~ realism
• But it is not possible to obtain assured
knowledge from empirical data (Hume):
"Whatever in knowledge is of empirical
nature is never certain"
96
Digital Age
• Most knowledge is now in the form of data from
sensors (not human sense perceptions)
• How to get assured knowledge from data?
• Naïve realism: data → knowledge?
Wired Magazine, 16/07: We can stop looking for
(scientific) models. We can analyze the data
without hypotheses about what it might show. We
can throw the numbers into the biggest computing
clusters the world has ever seen and let statistical
algorithms find patterns where science cannot
97
(Over) Promise of Science
Archimedes: Give me a place to stand, and a
lever long enough, and I will move the world
Laplace: Present events are connected with
preceding ones by a tie based upon the
evident principle that a thing cannot occur
without a cause that produces it.
Digital Age:
more data → more knowledge
more connectivity → new knowledge
98
Classical Statistics
How to obtain assured knowledge from data?
• Classical approach:
(1) specify a parametric model
(2) estimate its parameters from data
more data → a better (more accurate) model
• Parametric form is based on first-principle
knowledge and therefore correct
99
Philosophical Views on Science
• Karl Popper: Science starts from
problems, and not from observations
• Werner Heisenberg: What we observe is
not nature itself, but nature exposed to
our method of questioning
• Albert Einstein:
- God is subtle but he is not malicious.
- Reality is merely an illusion, albeit a very
persistent one.
100
Predictive Data Modeling Philosophy
• A single 'correct' causal model cannot be
estimated from data.
• Good predictive models are:
- not unique
- dependent on the problem setting (~ our
method of questioning)
• The Predictive Learning methodology is the
most important aspect of learning from data.
101
Predictive Modeling: Technical Aspects
• The same methodology applies to
financial, bio/medical, weather, marketing etc
• Interpretation of predictive models:
- big problem (not addressed in this course)
- reflects the need for assured knowledge
- requires understanding of the predictive
methodology and its assumptions, rather
than just interpretable parameterization
102
Non-Technical Aspects
• Predictive methodology + philosophy can be
applied to other areas
• Examples: writing assignments, etc.
• Lessons from philosophy and history:
- Amount of information grows exponentially, but
the number of ‘big ideas’ remains small
- Critical thinking ability becomes even more
important in the digital age
- Freedom of information does not guarantee
freedom of thought
103
Predictive Learning and Humanities
• Specialization/separation between science and
humanities is relatively recent ~ only about 200 years old
• Predictive modeling and ethical principles:
- both advocate constraints/restrictions
104
References
• Vapnik, V., Estimation of Dependences Based on Empirical Data. Empirical
Inference Science: Afterword of 2006, Springer, 2006
• Cherkassky, V. and F. Mulier, Learning from Data, second edition, Wiley, 2007
• Cherkassky, V., Predictive Learning, available at http://vctextbook.com/, 2013
• Cherkassky, V., S. Dhar, and W. Dai, Practical Conditions for Effectiveness of
the Universum Learning, IEEE Trans. on Neural Networks, 22, 8, 1241-1255, 2011
• Chapelle, O., Schölkopf, B., and A. Zien, Eds., Semi-Supervised Learning, MIT
Press, 2006
• Weston, J., Collobert, R., Sinz, F., Bottou, L., and V. Vapnik, Inference with
the Universum, Proc. ICML, 2006
• Wang, X., Hutchinson, R., and Mitchell, T., Training fMRI classifiers to
discriminate cognitive states across multiple subjects, Proc. NIPS, 2004
• Liang, L. and Cherkassky, V., Learning using structured data: application to
fMRI data analysis, Proc. IJCNN, 2007
• Ando, R. and Zhang, T., A framework for learning predictive structures from
multiple tasks and unlabeled data, JMLR, 2005
• Evgeniou, T. and Pontil, M., Regularized multi-task learning, Proc. of 10th
Conf. on Knowledge Discovery and Data Mining, 2004
105