From Feature Construction, to
Simple but Effective
Modeling, to Domain Transfer
Wei Fan
IBM T.J.Watson
www.cs.columbia.edu/~wfan
www.weifan.info
[email protected], [email protected]
Feature Vector

Most data mining and machine learning models assume the following structured data:

(x1, x2, ..., xk) -> y

where the xi's are independent variables and y is the dependent variable.

• y drawn from a discrete set: classification
• y drawn from a continuous range: regression
(A small example of both follows.)
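As a minimal sketch of this setup (hypothetical numbers; scikit-learn is assumed to be available), the same feature-vector representation feeds a classifier when y is discrete and a regressor when y is continuous:

```python
# Minimal sketch: the same (x1, ..., xk) -> y representation serves both tasks.
# Hypothetical numbers; scikit-learn is assumed to be available.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[5.1, 3.5], [6.2, 2.9], [4.8, 3.1], [6.7, 3.0]])   # (x1, x2) per example

y_class = np.array([0, 1, 0, 1])            # y from a discrete set  -> classification
y_value = np.array([1.4, 4.3, 1.6, 5.0])    # y from a continuous range -> regression

clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(random_state=0).fit(X, y_value)
print(clf.predict([[6.0, 3.0]]), reg.predict([[6.0, 3.0]]))
```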
Frequent Pattern-Based
Feature Construction

• Data not in pre-defined feature vectors:
  • Transactions
  • Biological sequences
  • Graph databases
• Frequent patterns are good candidates for discriminative features.
• So, how do we mine them?
[Figure: an example frequent subgraph (FP: sub-graph) discovered across molecules NSC 4960, NSC 699181, NSC 40773, NSC 191370 and NSC 164863; example borrowed from George Karypis' presentation.]
Computational Issues

• Frequent patterns are measured by their "frequency" or support, e.g., frequent subgraphs with sup ≥ 10%.
• Cannot enumerate patterns at sup = 10% without first enumerating all patterns with support > 10%.
• Random sampling does not work, since it is not exhaustive.
• NP-hard problem.
Conventional Procedure
Two-Step Batch Method

1. Mine frequent patterns (support > sup) from the dataset.
2. Select the most discriminative patterns (e.g., F1, F2, F4).
3. Represent the data in the feature space using the selected patterns (e.g., Data1 = 110, Data2 = 101, Data3 = 110, Data4 = 001, ...).
4. Build classification models: DT, SVM, LR, any classifier you can name.

Feature construction followed by selection; a small sketch of this pipeline follows.
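A rough sketch of the two-step batch method, not the author's code: a naive itemset enumerator stands in for a real frequent-pattern miner, scikit-learn provides the selection score and the classifier, and the data is hypothetical.

```python
# Sketch of the two-step batch method: mine frequent itemsets, keep the most
# discriminative ones, re-encode data as binary pattern features, train a classifier.
from itertools import combinations
from collections import Counter
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.tree import DecisionTreeClassifier

transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
labels = np.array([1, 1, 0, 0])
min_sup = 0.5                       # Step 1: mine all itemsets with support >= min_sup

counts = Counter()
for t in transactions:
    for r in range(1, len(t) + 1):
        for itemset in combinations(sorted(t), r):
            counts[itemset] += 1
frequent = [p for p, c in counts.items() if c / len(transactions) >= min_sup]

# Representation (Step 3) is built first so that the selection score can be computed.
X = np.array([[1 if set(p) <= t else 0 for p in frequent] for t in transactions])

# Step 2: keep the top-k patterns by estimated mutual information with the label.
k = 3
scores = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
top = np.argsort(scores)[::-1][:k]

# Step 4: build a classifier on the selected pattern features.
clf = DecisionTreeClassifier(random_state=0).fit(X[:, top], labels)
print([frequent[i] for i in top], clf.predict(X[:, top]))
```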
Two Problems

• Mine step: combinatorial explosion
  1. Exponential explosion in the number of frequent patterns mined from the dataset.
  2. Patterns are not considered at all if min_support isn't small enough.

Two Problems

• Select step: issue of discriminative power
  3. InfoGain is evaluated against the complete dataset, NOT on subsets of examples.
  4. Correlation among the mined discriminative patterns is not directly evaluated on their joint predictability.
Direct Mining & Selection via Model-based Search Tree
(the classifier is the feature miner)

Basic Flow:
• At the root (node 1), "mine & select" on the full dataset with relative support P (e.g., P = 20%): mine frequent patterns and keep the most discriminative feature F based on InfoGain.
• Split the data on F (Y/N branches) and repeat "mine & select" on each subset (nodes 2, 3, 4, 5, 6, 7, ...).
• Stop a branch when few data remain and label it (+/-).
• The union of the selected features is a compact set of highly discriminative patterns.

Divide-and-Conquer Based Frequent Pattern Mining

• Support is relative to the node: with P = 20% at a node holding 10 examples out of a dataset of 10,000, the effective global support is 10*20%/10000 = 0.02%.
• The patterns selected along the way form the mined discriminative patterns; a recursive sketch follows.
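A rough recursive sketch of the model-based search tree, under my reading of the slides and restricted to itemset data; the mining and gain helpers are simplistic stand-ins, not the author's implementation.

```python
# Sketch of the model-based search tree (MbT): at each node, mine frequent itemsets
# on that node's data with relative support P, keep the single most discriminative
# one by information gain, split on it and recurse. Helpers are simplistic stand-ins.
from itertools import combinations
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def info_gain(pattern, data, labels):
    yes = [y for t, y in zip(data, labels) if pattern <= t]
    no = [y for t, y in zip(data, labels) if not pattern <= t]
    n = len(labels)
    return entropy(labels) - (len(yes) / n) * entropy(yes) - (len(no) / n) * entropy(no)

def mine_itemsets(data, min_sup):
    counts = Counter(frozenset(c) for t in data
                     for r in range(1, len(t) + 1) for c in combinations(sorted(t), r))
    return [p for p, c in counts.items() if c / len(data) >= min_sup]

def build_mbt(data, labels, P=0.2, min_node=3, found=None):
    found = [] if found is None else found
    patterns = mine_itemsets(data, P) if data else []
    if len(data) <= min_node or len(set(labels)) <= 1 or not patterns:
        leaf = Counter(labels).most_common(1)[0][0] if labels else None   # few data: label
        return leaf, found
    best = max(patterns, key=lambda p: info_gain(p, data, labels))        # most discriminative
    found.append(best)                                                    # compact pattern set
    yes = [(t, y) for t, y in zip(data, labels) if best <= t]
    no = [(t, y) for t, y in zip(data, labels) if not best <= t]
    node = {"pattern": set(best),
            "Y": build_mbt([t for t, _ in yes], [y for _, y in yes], P, min_node, found)[0],
            "N": build_mbt([t for t, _ in no], [y for _, y in no], P, min_node, found)[0]}
    return node, found

tree, patterns = build_mbt(
    [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}, {"c", "d"}],
    [1, 1, 0, 0, 0])
print(patterns)
```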
Analyses (I)

1. Scalability of pattern enumeration
  • Upper bound (Theorem 1)
  • "Scale down" ratio
2. Bound on the number of returned features

Analyses (II)

3. Subspace pattern selection
  • Original set vs. subset
4. Non-overfitting
5. Optimality under exhaustive search
Experimental Studies:
Itemset Mining (I)

• Scalability Comparison

[Figure/table: scalability on Adult, Chess, Hypo, Sick and Sonar. Panels compare Log(DT #Pat) vs Log(MbT #Pat) and Log(DT Abs Support) vs Log(MbT Abs Support). A table lists, per dataset, the MbT sup reached (values such as 0.41%, ~0%, 0.0035%, 0.00032%, 0.00775%), the #Pat a conventional miner would need at that support (252,809; 423,439; 4,818,391; 95,507; and even +∞), and the ratio of MbT #Pat to that number.]
Experimental Studies:
Itemset Mining (II)

• Accuracy of Mined Itemsets

[Chart: DT accuracy vs MbT accuracy (70%–100%) on Adult, Chess, Hypo, Sick and Sonar: 4 wins and 1 loss for MbT, but with a much smaller number of patterns (Log(DT #Pat) vs Log(MbT #Pat)).]
Experimental Studies:
Itemset Mining (III)

• Convergence
Experimental Studies:
Graph Mining (I)

• 9 NCI anti-cancer screen datasets
  • The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov
  • Active (positive) class: around 1% - 8.3%
• 2 AIDS anti-viral screen datasets
  • URL: http://dtp.nci.nih.gov
  • H1: CM+CA – 3.5%
  • H2: CA – 1%
Experimental Studies:
Graph Mining (II)

• Scalability

[Chart: number of patterns (DT #Pat vs MbT #Pat, 0 to 1800) and Log(DT Abs Support) vs Log(MbT Abs Support) on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1 and H2.]
Experimental Studies:
Graph Mining (III)

• AUC and Accuracy

[Charts: AUC (0.5 to 0.8) of DT vs MbT on NCI1, NCI33, NCI41, NCI47, NCI81, NCI83, NCI109, NCI123, NCI145, H1 and H2: 11 wins for MbT. Accuracy (0.88 to 1.0) on the same datasets: 10 wins, 1 loss.]
Experimental Studies:
Graph Mining (IV)

• AUC of MbT and DT; MbT vs. benchmarks: 7 wins, 4 losses.
Summary

• Model-based Search Tree
  • Integrated feature mining and construction
  • Dynamic support
  • Can mine patterns with extremely small support
  • Both a feature constructor and a classifier
  • Not limited to one type of frequent pattern: plug-and-play
• Experiment results
  • Itemset mining
  • Graph mining
  • New: found a DNA sequence not previously reported, but which can be explained in biology
• Code and dataset available for download
How to train models?

List of methods:
• Logistic Regression
• Probit models
• Naïve Bayes
• Kernel Methods
• Linear Regression
• RBF
• Mixture models

• Even though the true distribution is unknown, we still assume that the data is generated by some known function.
• Estimate the parameters inside the function via:
  • the training data
  • CV on the training data
• After the structure is prefixed, learning becomes optimization to minimize errors:
  • quadratic loss
  • exponential loss
  • slack variables
[Figure: some unknown distribution, with candidate Models 1 through 6 fitted to it.]

There probably will always be mistakes unless:
1. The chosen model indeed generates the distribution
2. Data is sufficient to estimate those parameters
But what if you don't know which model to choose, or you use the wrong one?
How to train models II

List of methods:
• Decision Trees
• RIPPER rule learner
• CBA: association rule
• clustering-based methods
• ......

• Not quite sure of the exact function, but use a family of "free-form" functions given some "preference criteria".
• Preference criteria: the simplest hypothesis that fits the data is the best.
• Heuristics:
  • info gain, gini index, Kearns-Mansour, etc.
  • pruning: MDL pruning, reduced-error pruning, cost-based pruning
• Truth: none of the purity check functions guarantees accuracy on unseen test data; they only try to build a smaller model.

There probably will always be mistakes unless:
• the training data is sufficiently large, and
• the free-form function/criteria is appropriate.
A small sketch of one such purity heuristic follows.
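For concreteness, a minimal sketch of the information-gain heuristic mentioned above, i.e., the drop in label entropy produced by a candidate split; the labels and splits are hypothetical.

```python
# Minimal sketch of the information-gain purity heuristic for a binary split.
# Labels and candidate splits are hypothetical; nothing here is tied to a specific learner.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def information_gain(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["+", "+", "+", "-", "-", "-"]
print(information_gain(parent, ["+", "+", "+"], ["-", "-", "-"]))   # 1.0: a pure split
print(information_gain(parent, ["+", "+", "-"], ["+", "-", "-"]))   # ~0.08: a weak split
```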
Can Data Speak for Themselves?

• Make no assumption about the true model, neither a parametric form nor a free form.
• "Encode" the data in some rather "neutral" representation:
  • Think of it like encoding numbers in a computer's binary representation.
  • Some numbers cannot be represented exactly, but overall it is accurate enough.
• Main challenge:
  • Avoid "rote learning": do not remember all the details.
  • Generalization.
  • "Evenly" representing "numbers", i.e., "evenly" encoding the "data".
Potential Advantages

• If the accuracy is quite good, then:
  • the method is quite "automatic and easy" to use
  • no brainer: DM can be everybody's tool
Encoding Data for Major Problems

• Classification:
  • Given a set of labeled data items, such as (amt, merchant category, outstanding balance, date/time, ......), where the label is whether the transaction is a fraud or non-fraud.
  • Label: a set of discrete values.
  • Classifier: predict if a transaction is a fraud or non-fraud.
• Probability Estimation:
  • Similar to the above setting: estimate the probability that a transaction is a fraud.
  • Difference: no truth is given, i.e., no true probability.
• Regression:
  • Given a set of valued data items, such as (zipcode, capital gain, education, ...), the value of interest is annual gross income.
  • Target value: continuous values.
• Several other on-going problems.
Encoding Data in Decision Trees

• Think of each tree as a way to "encode" the training data.
• Obviously, each tree encodes the data differently.
• Subjective criteria that prefer some encodings over others are always ad hoc.
• Why a tree? A decision tree records some common characteristic of the data, but not every piece of trivial detail.
[Figure: iris scatter plot, petal width vs. petal length (1 to 7), with setosa, versicolor and virginica regions.]
• Do not prefer anything then: just do it randomly.
• Minimize the difference by multiple encodings, and then "average" them.
Random Decision Tree to Encode Data
- classification, regression, probability estimation

• At each node, an unused feature is chosen randomly:
  • A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
  • A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.

Continued

• We stop when one of the following happens:
  • A node becomes too small (<= 3 examples), or
  • the total height of the tree exceeds some limit, such as the total number of features.
A minimal sketch of this construction follows; the illustration after it walks through a concrete path.
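A minimal sketch of the construction rules above, in my own words rather than the author's code; binary discrete features and the leaf statistics are illustrative assumptions, and the B1/B2/B3 names anticipate the illustration that follows.

```python
# Sketch of random decision tree (RDT) construction: at each node pick an unused
# feature at random; a discrete feature is used at most once per path, a continuous
# feature may recur with a fresh random threshold. Stop when a node has <= 3
# examples or the depth reaches the number of features. Leaves keep class counts.
import random
from collections import Counter

def build_rdt(rows, labels, features, used=frozenset(), depth=0, rng=random.Random(0)):
    if len(rows) <= 3 or depth >= len(features):
        return {"leaf": Counter(labels)}            # "node statistics", not a random guess
    candidates = [f for f, kind in features.items()
                  if kind == "continuous" or f not in used]
    if not candidates:
        return {"leaf": Counter(labels)}
    f = rng.choice(candidates)                      # the structure is what is random
    if features[f] == "continuous":
        lo, hi = min(r[f] for r in rows), max(r[f] for r in rows)
        thr = rng.uniform(lo, hi)                   # fresh random threshold each time
        node, new_used = {"split": (f, "<", thr)}, used
        test = lambda r, f=f, thr=thr: r[f] < thr
    else:
        node, new_used = {"split": (f, "==", 0)}, used | {f}   # assumes binary {0,1} features
        test = lambda r, f=f: r[f] == 0
    left = [(r, y) for r, y in zip(rows, labels) if test(r)]
    right = [(r, y) for r, y in zip(rows, labels) if not test(r)]
    if not left or not right:
        return {"leaf": Counter(labels)}
    node["Y"] = build_rdt([r for r, _ in left], [y for _, y in left], features, new_used, depth + 1, rng)
    node["N"] = build_rdt([r for r, _ in right], [y for _, y in right], features, new_used, depth + 1, rng)
    return node

features = {"B1": "discrete", "B2": "discrete", "B3": "continuous"}
rows = [{"B1": 0, "B2": 1, "B3": 0.2}, {"B1": 1, "B2": 0, "B3": 0.7},
        {"B1": 0, "B2": 0, "B3": 0.5}, {"B1": 1, "B2": 1, "B3": 0.9},
        {"B1": 0, "B2": 1, "B3": 0.8}]
tree = build_rdt(rows, [1, 0, 1, 0, 0], features)
```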
Illustration of RDT

Features: B1: {0,1}, B2: {0,1}, B3: continuous.
• At the root, B1 is chosen randomly: test "B1 == 0?" (Y/N).
• At one child node, B2 is chosen randomly: test "B2 == 0?".
• At another node, B3 is chosen randomly with a random threshold of 0.3: test "B3 < 0.3?".
• Deeper on the same path, B3 can be chosen again with a new random threshold of 0.6: test "B3 < 0.6?" ......
Classification

Petal.Length < 2.45
|-- setosa (50/0/0)
|-- Petal.Width < 1.75
    |-- versicolor (0/49/5)
    |-- virginica (0/1/45)

At the versicolor leaf:
P(setosa | x, θ) = 0
P(versicolor | x, θ) = 49/54
P(virginica | x, θ) = 5/54
Regression

Petal.Length < 2.45
|-- setosa: Height = 10 in
|-- Petal.Width < 1.75
    |-- versicolor: Height = 15 in
    |-- virginica: Height = 12 in

15 in is the average value of all examples in that leaf node.
Prediction

• Simply averaging over multiple trees (a short sketch follows).
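A short sketch of the averaging step, reusing the hypothetical tree structure from the construction sketch above (leaves hold class counts, so each tree yields a posterior that is then averaged across the ensemble):

```python
# Sketch: RDT prediction averages the per-tree leaf posteriors over the ensemble.
def tree_posterior(node, row, classes):
    if "leaf" in node:
        counts = node["leaf"]
        total = sum(counts.values()) or 1
        return {c: counts.get(c, 0) / total for c in classes}
    f, op, thr = node["split"]
    branch = "Y" if (row[f] < thr if op == "<" else row[f] == thr) else "N"
    return tree_posterior(node[branch], row, classes)

def rdt_predict(trees, row, classes):
    probs = {c: 0.0 for c in classes}
    for t in trees:                                 # simple average over the ensemble
        for c, p in tree_posterior(t, row, classes).items():
            probs[c] += p / len(trees)
    return probs
```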
Potential Advantage

• Training can be very efficient, particularly for very large datasets:
  • no cross-validation-based estimation of parameters, as some parametric methods require.
• Natural multi-class probability.
• Natural multi-label classification and probability estimation.
• Imposes very little about the structure of the model.
Reasons

• The true distribution P(y|X) is never known:
  • Is it an elephant?
• Every random tree is not a random guess of this P(y|X):
  • Its structure is random, but the "node statistics" are not.
  • Every random tree is consistent with the training data.
  • Each tree is quite strong, not weak: in other words, if the distribution is the same, each random tree itself is a rather decent model.
Expected Error Reduction

• Proven for quadratic loss, such as:
  • probability estimation: ( P(y|X) - P(y|X, θ) )^2
  • regression problems: ( y - f(x) )^2
• General theorem: the "expected quadratic loss" of RDT (and any other model averaging) is less than that of any combined model chosen "at random". A sketch of the argument follows.
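A sketch of the standard model-averaging argument behind this claim (my own hedged reconstruction rather than the slide's theorem; it follows from the convexity of the quadratic loss via Jensen's inequality):

```latex
% P_i(y|x): the estimate of tree i;  \bar{P}(y|x) = \frac{1}{n}\sum_i P_i(y|x): the RDT average.
% By Jensen's inequality applied to the convex quadratic loss,
\Bigl(P(y\mid x)-\bar{P}(y\mid x)\Bigr)^2
  = \Bigl(\tfrac{1}{n}\textstyle\sum_{i=1}^{n}\bigl(P(y\mid x)-P_i(y\mid x)\bigr)\Bigr)^2
  \;\le\; \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\bigl(P(y\mid x)-P_i(y\mid x)\bigr)^2 .
% Taking expectations, the averaged model's quadratic loss is at most the average loss
% of its members, i.e., at most the expected loss of a tree chosen uniformly at random.
```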
Theorem Summary

Number of trees

• Sampling theory:
  • A random decision tree can be thought of as a sample from a large (infinite, when continuous features exist) population of trees.
  • Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 is usually enough.
Variance Reduction
Optimal Decision Boundary
from Tony Liu’s thesis (supervised by Kai Ming Ting)
• RDT looks like the optimal boundary.

Regression Decision Boundary (GUIDE)

Properties:
• Broken and discontinuous
• Some points are far from the truth
• Some wrong ups and downs

RDT Computed Function

Properties:
• Smooth and continuous
• Close to the true function
• All ups and downs caught
Hidden Variable

• Limitation of GUIDE:
  • Need to decide grouping variables and independent variables, a non-trivial task.
  • If all variables are categorical, GUIDE becomes a single CART regression tree.
  • Strong assumptions and greedy-based search: sometimes this can lead to very unexpected results.
• It grows like ...
ICDM’08 Cup Crown Winner

• Nuclear ban monitoring
• The RDT-based approach is the highest award winner.

Ozone Level Prediction (ICDM’06 Best Application Paper)

• Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ)
• Baselines compared: SVM and AdaBoost, under the 1-hr and 8-hr criteria (CV).
Other Applications

• Credit Card Fraud Detection
• Late and Default Payment Prediction
• Intrusion Detection
• Semiconductor Process Control
• Trading anomaly detection
Conclusion

• Imposing a particular form of model may not be a good idea for training highly accurate models for general-purpose DM.
• It may not even be efficient for some forms of models.
• RDT has been shown to solve all three major problems in data mining (classification, probability estimation and regression) simply, efficiently and accurately.
• When the physical truth is unknown, RDT is highly recommended.
• Code and dataset are available for download.
Standard Supervised Learning

• Training (labeled) and test (unlabeled) data come from the same domain: a classifier trained on New York Times articles and tested on New York Times articles reaches 85.5%.

In Reality......

• Labeled data from the test domain is not available: the classifier is trained on labeled Reuters articles and tested on New York Times articles, dropping to 64.1%.

Domain Difference -> Performance Drop

• Ideal setting: train on NYT, classify NYT: 85.5%
• Realistic setting: train on Reuters, classify NYT: 64.1%
A Synthetic Example

• Training (source domains, which have conflicting concepts) and test (target domain) only partially overlap.

Goal

• To unify the knowledge that is consistent with the test domain from multiple source domains (models).
Summary

• Transfer from one or multiple source domains:
  • The target domain has no labeled examples.
• Do not need to re-train:
  • Rely on base models trained from each domain.
  • The base models are not necessarily developed for transfer learning applications.
Locally Weighted Ensemble

• Base models M1, M2, ..., Mk are trained on training sets 1, 2, ..., k; x is the feature value and y the class label.
• Each base model outputs f_i(x, y) = P(Y = y | x, M_i).
• For a test example x, the ensemble output is

  f^E(x, y) = sum_{i=1..k} w_i(x) * f_i(x, y),   with   sum_{i=1..k} w_i(x) = 1

• Predict y | x = argmax_y f^E(x, y). (A small sketch follows.)
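A minimal sketch of this combination rule; the base-model interface and the per-example weighting function are hypothetical stand-ins (how the weights are actually obtained is the subject of the graph-based heuristics below):

```python
# Sketch of locally weighted ensemble prediction:
# f^E(x, y) = sum_i w_i(x) * P(y | x, M_i), with the weights summing to 1 per example.
import numpy as np

def lwe_predict(models, weight_fn, x, classes):
    # models: objects with a hypothetical posterior(x) -> {class: probability} method
    # weight_fn(model, x): hypothetical per-example weight, e.g. from graph-based heuristics
    w = np.array([weight_fn(m, x) for m in models], dtype=float)
    w = w / w.sum() if w.sum() > 0 else np.full(len(models), 1.0 / len(models))
    f_E = {y: sum(wi * m.posterior(x)[y] for wi, m in zip(w, models)) for y in classes}
    return max(f_E, key=f_E.get), f_E       # argmax_y f^E(x, y), plus the combined scores
```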
Modified Bayesian Model Averaging

• Bayesian model averaging: models M1, ..., Mk are weighted by their posterior given the training data D,

  P(y | x) = sum_{i=1..k} P(M_i | D) * P(y | x, M_i)

• Modified for transfer learning: the model weights depend on the test example x itself,

  P(y | x) = sum_{i=1..k} P(M_i | x) * P(y | x, M_i)
Global versus Local Weights

[Table: for a series of test examples x (rows of feature values such as 2.40, -2.69, ..., 5.23, ...) with true label y, the table contrasts each model's global weight wg, which is constant across all examples (e.g., 0.3 for M1 and 0.7 for M2), with its local weight wl, which varies from example to example (values between 0 and 1). A single global weight per model cannot reflect that a model may be reliable in some regions of the feature space and unreliable in others.]
Locally weighting scheme

• The weight of each model is computed per example.
• Weights are determined according to the models' performance on the test set, not the training set.
Synthetic Example Revisited

• Base models M1 and M2 are trained on the training sets (which have conflicting concepts) and applied to the partially overlapping test set.
Optimal Local Weights

• For a test example x, suppose classifier C1 outputs (0.9, 0.1) over the two classes, classifier C2 outputs (0.4, 0.6), and the true posterior H is (0.8, 0.2).
• The optimal weights w1, w2 satisfy, with sum_i w_i(x) = 1,

  [0.9 0.4] [w1]   [0.8]
  [0.1 0.6] [w2] = [0.2]   =>   w1 = 0.8, w2 = 0.2

  so the classifier closer to the truth (C1) receives the higher weight.
• Finding the optimal weights is the solution to a regression problem.
Approximate Optimal Weights

• Optimal weights:
  • M should be assigned a higher weight at x if P(y|M, x) is closer to the true P(y|x).
  • Impossible to get directly, since the true P(y|x) is unknown!
• How to approximate the optimal weights:
  • If some labeled examples are available in the target domain, use these examples to compute the weights.
  • If none of the examples in the target domain are labeled, we need to make some assumptions about the relationship between feature values and class labels.
Clustering-Manifold Assumption
Test examples that are closer in feature space are more likely to share the same class label.
Graph-based Heuristics

• Graph-based weight approximation:
  • Map the structures of the models onto the test domain, i.e., compare each model's neighborhood graph with the clustering structure around x to obtain a weight on x.

Graph-based Heuristics

• Local weight calculation:
  • The weight of a model is proportional to the similarity between its neighborhood graph and the clustering structure around x; the model whose structure matches the local clustering gets the higher weight. (A hedged sketch follows.)
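One rough way to realize this heuristic, based on my reading of the slide rather than the paper's exact formula: cluster the test set, look at the neighbors of x, and score each model by how well "predicted to share x's label" agrees with "falls in x's cluster" among those neighbors; all names below are hypothetical.

```python
# Hedged sketch of graph-based local weights: a model gets a high weight at x when
# "neighbors the model labels like x" agrees with "neighbors in x's cluster".
# The clustering, the k-NN graph and the models' predictions are all stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def local_weights(X_test, preds_per_model, n_clusters=2, k=5):
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_test)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_test).kneighbors(X_test)
    weights = np.zeros((len(X_test), len(preds_per_model)))
    for j, pred in enumerate(preds_per_model):        # pred: predicted label per test example
        pred = np.asarray(pred)
        for i in range(len(X_test)):
            neigh = idx[i, 1:]                        # idx[:, 0] is x itself
            same_cluster = clusters[neigh] == clusters[i]      # clustering structure around x
            same_label = pred[neigh] == pred[i]                # the model's neighborhood graph
            weights[i, j] = np.mean(same_cluster == same_label)   # structural agreement
    return weights / np.clip(weights.sum(axis=1, keepdims=True), 1e-12, None)
```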
Local Structure Based Adjustment

• Why is adjustment needed?
  • It is possible that no model's structure is similar to the clustering structure at x.
  • This simply means that the training information conflicts with the true target distribution at x.

Local Structure Based Adjustment

• How to adjust? (A short sketch follows.)
  • Check whether the (similarity-based) weight is below a threshold.
  • If so, ignore the training information and propagate the labels of the neighbors of x in the test set to x.
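Continuing the hypothetical sketch above, the adjustment might look like this; the threshold and the neighbor-vote fallback are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of the adjustment: where no model structurally matches the local
# clustering, fall back to propagating (predicted) labels of x's test-set neighbors.
def adjusted_prediction(i, weights, preds_per_model, idx, threshold=0.5):
    # weights, idx: as produced/used by the local_weights sketch above
    if weights[i].max() < threshold:               # no model fits the local structure at x
        neigh = idx[i, 1:]
        votes = [preds_per_model[j][n] for j in range(len(preds_per_model)) for n in neigh]
        return max(set(votes), key=votes.count)    # label propagated from neighbors
    combined = {}
    for j, pred in enumerate(preds_per_model):     # otherwise: locally weighted vote
        combined[pred[i]] = combined.get(pred[i], 0.0) + weights[i, j]
    return max(combined, key=combined.get)
```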
Verify the Assumption

• Need to check the validity of this assumption:
  • Still, P(y|x) is unknown.
  • How to choose the appropriate clustering algorithm?
• Findings from real data sets:
  • This property is usually determined by the nature of the task.
  • Positive cases: document categorization.
  • Negative cases: sentiment classification.
• Could validate this assumption on the training set.
Algorithm
1. Check assumption
2. Neighborhood graph construction
3. Model weight computation
4. Weight adjustment
Data Sets

• Different applications:
  • Synthetic data sets
  • Spam filtering: public email collection -> personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
  • Text classification: same top-level classification problems with different sub-fields in the training and test sets (Newsgroup, Reuters)
  • Intrusion detection data: different types of intrusions in training and test sets
Baseline Methods

• Baseline methods:
  • One source domain: single models
    • Winnow (WNN), Logistic Regression (LR), Support Vector Machine (SVM)
    • Transductive SVM (TSVM)
  • Multiple source domains:
    • SVM on each of the domains
    • TSVM on each of the domains
  • Merge all source domains into one: ALL
    • SVM, TSVM
  • Simple averaging ensemble: SMA
  • Locally weighted ensemble without local structure based adjustment: pLWE
  • Locally weighted ensemble: LWE
• Implementation packages:
  • Classification: SNoW, BBR, LibSVM, SVMlight
  • Clustering: CLUTO package
Performance Measure

• Prediction accuracy:
  • 0-1 loss: accuracy
  • Squared loss: mean squared error
• Area Under ROC Curve (AUC):
  • Tradeoff between true positive rate and false positive rate
  • Should ideally be 1
A Synthetic Example
• Training and test sets have conflicting concepts and only partially overlap (as before).
Experiments on Synthetic Data
Spam Filtering

• Problems:
  • Training set: public emails.
  • Test sets: personal emails from three users (U00, U01, U02).
[Charts: accuracy and MSE of WNN, LR, SVM, SMA, TSVM, pLWE and LWE on the three users.]
20 Newsgroup
• Problems: C vs S, R vs T, R vs S, S vs T, C vs R, C vs T.
[Charts: accuracy and MSE of WNN, LR, SVM, SMA, TSVM, pLWE and LWE on each problem.]
Reuters

• Problems:
  • Orgs vs People (O vs Pe)
  • Orgs vs Places (O vs Pl)
  • People vs Places (Pe vs Pl)
[Charts: accuracy and MSE of WNN, LR, SVM, SMA, TSVM, pLWE and LWE on each problem.]
Intrusion Detection

• Problems (normal vs intrusions):
  • Normal vs R2L (1)
  • Normal vs Probing (2)
  • Normal vs DOS (3)
• Tasks:
  • 2 + 1 -> 3 (DOS)
  • 3 + 1 -> 2 (Probing)
  • 3 + 2 -> 1 (R2L)
Conclusions

• Locally weighted ensemble framework:
  • transfers useful knowledge from multiple source domains
• Graph-based heuristics to compute weights:
  • make the framework practical and effective
• Code and dataset available for download
More information

• www.weifan.info or www.cs.columbia.edu/~wfan
• For code, datasets and papers