New Directions in Data Mining

Ramakrishnan Srikant
Directions
• Electronic Commerce & Web
– Catalog Integration (WWW 2001, with R. Agrawal)
– Searching with Numbers (WWW 2002, with R. Agrawal)
• Security
• Privacy
Catalog Integration
• B2B electronics portal: 2000 categories, 200K datasheets
[Figure: Master Catalog (Cat1) with category ICs and subcategories DSP, Mem., and Logic, holding documents a, b, c, d, e, f; New Catalog (Cat2) with category ICs, holding documents w, x, y, z]
Catalog Integration (cont.)
• After integration:
[Figure: Unified catalog after integration: ICs with subcategories DSP, Mem., and Logic, now holding both the master documents (a, b, c, d, e, f) and the new documents (x, y, z)]
Goal
• Use affinity information in new catalog.
– Products in same category are similar.
• Accuracy boost depends on match between two
categorizations.
Intuition behind Algorithm
Standard Algorithm:
  P1: Computer Peripheral 20%, Digital Camera 80%
  P2: Computer Peripheral 40%, Digital Camera 60%
  P3: Computer Peripheral 60%, Digital Camera 40%
Enhanced Algorithm:
  P1: Computer Peripheral 15%, Digital Camera 85%
  P2: Computer Peripheral 30%, Digital Camera 70%
  P3: Computer Peripheral 45%, Digital Camera 55%
Because P1, P2, and P3 share a category in the new catalog, the enhanced algorithm shifts all three toward Digital Camera, the category most of them already lean to.
Problem Statement
• Given
– master categorization M: categories C1, C2, …, Cn
• set of documents in each category
– new categorization N: set of categories S1, S2, …, Sm
• set of documents in each category
• Standard Alg: Compute Pr(Ci | d)
• Enhanced Alg: Compute Pr(Ci | d, S)
Enhanced Naïve Bayes classifier
$$\Pr(C_i \mid d, S) = \frac{\Pr(C_i \mid S)\,\Pr(d \mid C_i)}{\Pr(d \mid S)}$$

$$\Pr(C_i \mid S) = \frac{|C_i| \cdot (\#\text{ docs in } S \text{ predicted to be in } C_i)^w}{\sum_j |C_j| \cdot (\#\text{ docs in } S \text{ predicted to be in } C_j)^w}$$
• Use tuning set to determine w.
– Defaults to standard Naïve Bayes if w = 0.
• Only affects classification of borderline documents.
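A minimal sketch of this computation, assuming log-likelihoods log Pr(d | Ci) from a standard Naïve Bayes model are already available; the function and argument names are illustrative, not from the paper.

```python
import math

def enhanced_posteriors(log_likelihoods, cat_sizes, predicted_counts, w):
    """Pr(Ci | d, S): combine the standard likelihoods with the
    affinity-based prior Pr(Ci | S).

    log_likelihoods[i]  -- log Pr(d | Ci) from standard Naive Bayes
    cat_sizes[i]        -- |Ci|, # docs in master category Ci
    predicted_counts[i] -- # docs of new category S predicted to be in Ci
    w                   -- tuning weight; w = 0 recovers standard Naive Bayes
    """
    # Pr(Ci | S) is proportional to |Ci| * (# docs in S predicted in Ci)^w
    prior = [n * (c ** w) for n, c in zip(cat_sizes, predicted_counts)]
    z = sum(prior)
    scores = [math.log(p / z) + ll for p, ll in zip(prior, log_likelihoods)]
    # Pr(d | S) is the shared denominator, so normalizing suffices.
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [v / total for v in weights]
```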
Yahoo & Google
• 5 slices of the hierarchy: Autos, Movies, Outdoors,
Photography, Software
– Typical match: 69%, 15%, 3%, 3%, 1%, ….
• Merging Yahoo into Google
– 30% fewer errors (14.1% absolute difference in accuracy)
• Merging Google into Yahoo
– 26% fewer errors (14.3% absolute difference)
• Open Problems: SVM, Decision Tree, ...
Directions
• Electronic Commerce & Web
– Catalog Integration (WWW 2001, with R. Agrawal)
– Searching with Numbers (WWW 2002, with R. Agrawal)
• Security
• Privacy
Data Extraction is hard
• Synonyms for attribute names and
units.
– "lb" and "pounds", but no "lbs" or
"pound".
• Attribute names are often
missing.
– No "Speed", just "MHz Pentium III"
– No "Memory", just "MB SDRAM"
• 850 MHz Intel Pentium III
• 192 MB RAM
• 15 GB Hard Disk
• DVD Recorder: Included;
• Windows Me
• 14.1 inch display
• 8.0 pounds
Searching with Numbers
[Figure: Queries "800 200" and "800 200 3 lb" run against a Catalog Database containing "IBM Thinkpad: 750 Mhz Pentium 3, 196 Mb RDRAM, ..." and "Dell Computer: 700 MHz Celeron, 256 Mb SDRAM, ..."; the answers are IBM Thinkpad (750 Mhz, 196 Mb) and Dell (700 Mhz, 256 Mb)]
Why does it work?
• Conjecture: If we get a close match on numbers, it is likely
that we have correctly matched attribute names.
• Non-overlapping attributes:
– Memory: 64 - 512 Mb, Disk: 10 - 40 Gb
• Correlations:
– Memory: 64 - 512 Mb, Disk: 10 - 100 Gb still fine.
• [Screen shot]
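The conjecture suggests a simple scoring rule: assign the query numbers to a record's attribute values so that total relative error is minimized. The sketch below is an illustrative toy, not the paper's algorithm; brute force over assignments is fine for the two or three numbers such queries contain.

```python
from itertools import permutations

def match_cost(query_nums, record_nums):
    """Best assignment of query numbers to distinct record values,
    scored by summed relative error (lower is better). Returns
    infinity if the record has fewer numbers than the query."""
    best = float("inf")
    for assigned in permutations(record_nums, len(query_nums)):
        cost = sum(abs(q - r) / max(abs(q), abs(r), 1e-9)
                   for q, r in zip(query_nums, assigned))
        best = min(best, cost)
    return best

# Query "800 200" against the two laptops from the figure:
print(match_cost([800, 200], [750, 196]))  # IBM Thinkpad: ~0.08
print(match_cost([800, 200], [700, 256]))  # Dell: ~0.34, ranked lower
```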
Empirical Results
Dataset   Records   Fields
LCD         1,700       12
Proc        1,100       12
Trans      22,273       24
Auto          205       16
Credit        666        6
Glass         214       10
Housing       506       14
Wine          179       14
DRAM        3,800       10

[Chart: Accuracy (0-100%) vs. query number (1-5) for each dataset: DRAM, Trans, Glass, LCD, Auto, Housing, Proc, Credit, Wine]
Incorporating Hints
• Use simple data extraction techniques to get hints, e.g. for "256 MB SDRAM memory":
– Unit Hint: MB
– Attribute Hint: SDRAM, Memory
• Names/Units in query matched against Hints.
• Open Problem: Rethink data extraction in this context.
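As a rough illustration of how lightweight such hints can be, here is a toy extractor; the pattern and the hint field names are assumptions for the sketch, not the paper's method.

```python
import re

# "number, then a unit token, then optional attribute words" -- deliberately
# simplistic, in the spirit of "simple data extraction techniques".
HINT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)((?:\s+[A-Za-z]+)*)")

def extract_hints(text):
    hints = []
    for value, unit, rest in HINT_RE.findall(text):
        hints.append({
            "value": float(value),
            "unit_hint": unit,              # e.g. "MB"
            "attr_hints": rest.split(),     # e.g. ["SDRAM", "memory"]
        })
    return hints

print(extract_hints("256 MB SDRAM memory"))
# [{'value': 256.0, 'unit_hint': 'MB', 'attr_hints': ['SDRAM', 'memory']}]
```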
Other Problems
• P. Domingos & M. Richardson, Mining the Network Value
of Customers, KDD 2001
• Polling over the Web
Directions
• Electronic Commerce & Web
• Security
• Privacy
Security
"
F
r
e
q
u
e
n
t
T
r
a
v
e
l
e
r
"
R
a
t
i
n
g
M
o
d
e
l
E
m
a
i
l
C
r
e
d
i
t
g
e
n
c
i
e
s
D
e
m
o
- C
r
i
m
i
n
a
l A
P
h
o
n
eg
r
a
p
h
i
cR
e
c
o
r
d
s
S
t
a
t
e
B
i
r
t
hM
a
r
r
i
a
g
e
L
o
c
a
l
Some Hard Problems
• Past may be a poor predictor of future
– Abrupt changes
• Reliability and quality of data
– Wrong training examples
• Simultaneous mining over multiple data types
• Richer patterns
Directions
• Electronic Commerce & Web
• Security
• Privacy
– Motivation
– Randomization Approach
– Cryptographic Approach
– Challenges
Growing Privacy Concerns
• Popular Press:
– Economist: The End of Privacy (May 99)
– Time: The Death of Privacy (Aug 97)
• Govt. directives/commissions:
– European directive on privacy protection (Oct 98)
– Canadian Personal Information Protection Act (Jan 2001)
• Special issue on internet privacy, CACM, Feb 99
• S. Garfinkel, "Database Nation: The Death of Privacy in the 21st
Century", O'Reilly, Jan 2000
Privacy Concerns (2)
• Surveys of web users
– 17% privacy fundamentalists, 56% pragmatic majority, 27%
marginally concerned (Understanding net users' attitude about
online privacy, April 99)
– 82% said having privacy policy would matter (Freebies & Privacy:
What net users think, July 99)
Technical Question
• Fear:
– "Join" (record overlay) was the original sin.
– Data mining: new, powerful adversary?
• The primary task in data mining: development of models
about aggregated data.
• Can we develop accurate models without access to precise
information in individual data records?
Directions
• Electronic Commerce & Web
• Security
• Privacy
– Motivation
– Randomization Approach
• R. Agrawal and R. Srikant, “Privacy Preserving Data Mining”,
SIGMOD 2000.
– Cryptographic Approach
– Challenges
Web Demographics
• Volvo S40 website targets people in 20s
– Are visitors in their 20s or 40s?
– Which demographic groups like/dislike the website?
Randomization Approach Overview
[Figure: Records (30 | 70K | ...) and (50 | 40K | ...) pass through randomizers, yielding (65 | 20K | ...) and (25 | 60K | ...); the distributions of Age and Salary are reconstructed from the randomized data and fed to data mining algorithms to produce a model]
Reconstruction Problem
• Original values x1, x2, ..., xn
– from probability distribution X (unknown)
• To hide these values, we use y1, y2, ..., yn
– from probability distribution Y
• Given
– x1+y1, x2+y2, ..., xn+yn
– the probability distribution of Y
Estimate the probability distribution of X.
Intuition (Reconstruct single point)
• Use Bayes' rule for density functions
[Figure: Age axis from 10 to 90, with a randomized value V marked; curves show the original distribution for Age and the probabilistic estimate of the original value of V]
Reconstructing the Distribution
• Combine estimates of where point came from for all the
points:
– Gives estimate of original distribution.
[Figure: Estimated original distribution over Age, 10 to 90]
Reconstruction: Bootstrapping
fX0 := Uniform distribution
j := 0 // Iteration number
repeat
    $$f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^j(a)}{\int f_Y\big((x_i + y_i) - z\big)\, f_X^j(z)\, dz} \quad \text{(Bayes' rule)}$$
    j := j + 1
until (stopping criterion met)
• Converges to maximum likelihood estimate.
– D. Agrawal & C.C. Aggarwal, PODS 2001.
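A discretized sketch of this bootstrapping loop, with the integral in the denominator replaced by a sum over grid cells; the names and the stopping rule (a fixed iteration count) are illustrative.

```python
import numpy as np

def reconstruct(w, noise_pdf, grid, iters=100):
    """Iterative Bayes reconstruction of f_X from w_i = x_i + y_i.

    w         -- observed randomized values
    noise_pdf -- density f_Y of the added noise
    grid      -- discretization points covering the domain of X
    Returns estimated probability mass of X per grid cell."""
    w = np.asarray(w)
    # f_Y((x_i + y_i) - a) for every observation i and grid point a.
    fy = noise_pdf(w[:, None] - grid[None, :])      # shape (n, k)
    fx = np.full(len(grid), 1.0 / len(grid))        # f_X^0: uniform
    for _ in range(iters):
        post = fy * fx                              # Bayes numerator
        post /= post.sum(axis=1, keepdims=True)     # normalize per point
        fx = post.mean(axis=0)                      # average over points
    return fx

# Example: ages ~ N(40, 10) hidden by uniform noise on [-30, +30].
rng = np.random.default_rng(0)
x = rng.normal(40, 10, size=5000)
y = rng.uniform(-30, 30, size=5000)
grid = np.linspace(0, 90, 91)
uniform_pdf = lambda t: ((t >= -30) & (t <= 30)) / 60.0
fx_hat = reconstruct(x + y, uniform_pdf, grid)
```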
Seems to work well!
[Chart: Number of People (0-1200) vs. Age (20-60) for the Original, Randomized, and Reconstructed distributions; the reconstructed curve closely tracks the original]
Recap: Why is privacy preserved?
• Cannot reconstruct individual values accurately.
• Can only reconstruct distributions.
Classification
• Naïve Bayes
– Assumes independence between attributes.
• Decision Tree
– Correlations are weakened by randomization, not destroyed.
Algorithms
• “Global” Algorithm
– Reconstruct for each attribute once at the beginning
• “By Class” Algorithm
– For each attribute, first split by class, then reconstruct separately
for each class.
• Associate each randomized value with a reconstructed
interval & run decision-tree classifier.
– See SIGMOD 2000 paper for details.
Experimental Methodology
• Compare accuracy against
– Original: unperturbed data without randomization.
– Randomized: perturbed data but without making any corrections
for randomization.
• Test data not randomized.
• Synthetic data benchmark from [AGI+92].
• Training set of 100,000 records, split equally between the
two classes.
Synthetic Data Functions
• F3
((age < 40) and
(((elevel in [0..1]) and (25K <= salary <= 75K)) or
((elevel in [2..3]) and (50K <= salary <= 100K))) or
((40 <= age < 60) and ...
• F4
(0.67 x (salary+commission) - 0.2 x loan - 10K) > 0
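For concreteness, the two predicates as straightforward Python; the age >= 40 branches of F3 are elided on the slide, so they are elided here as well.

```python
def f3(age, elevel, salary):
    """Class predicate F3 from [AGI+92], as far as the slide shows it."""
    if age < 40:
        return ((elevel in (0, 1) and 25_000 <= salary <= 75_000) or
                (elevel in (2, 3) and 50_000 <= salary <= 100_000))
    raise NotImplementedError("age >= 40 branches elided on the slide")

def f4(salary, commission, loan):
    """Class predicate F4: a linear threshold function."""
    return 0.67 * (salary + commission) - 0.2 * loan - 10_000 > 0
```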
Quantifying Privacy
• Add a random value between -30 and +30 to age.
• If randomized value is 60
– know with 90% confidence that age is between 33 and 87.
• Interval width ⇒ amount of privacy.
– Example: (Interval Width: 54) / (Range of Age: 100) ⇒ 54%
randomization level @ 90% confidence
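The arithmetic as a one-liner, for uniform noise on [-h, +h]; the function name is illustrative.

```python
def randomization_level(noise_half_width, attr_range, confidence=0.90):
    """Width of the confidence interval around a randomized value,
    as a fraction of the attribute's range, for U[-h, +h] noise."""
    interval_width = confidence * 2 * noise_half_width  # 0.9 * 60 = 54
    return interval_width / attr_range

print(randomization_level(30, 100))  # 0.54, i.e. 54% @ 90% confidence
```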
Acceptable loss in accuracy
[Chart: Accuracy (40-100%) at 100% randomization level for functions Fn 1-5, comparing Original, Randomized, ByClass, and Global]
Accuracy vs. Randomization Level
[Chart: Accuracy (40-100%) vs. randomization level (10-200%) for Fn 3, comparing Original, Randomized, and ByClass]
Directions
• Electronic Commerce & Web
• Security
• Privacy
– Motivation
– Randomization Approach
– Cryptographic Approach
• Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”,
Crypto 2000.
– Challenges
Inter-Enterprise Data Mining
• Problem: Two parties owning confidential databases wish
to build a decision-tree classifier on the union of their
databases, without revealing any unnecessary information.
• Horizontally partitioned.
– Records (users) split across companies.
– Example: Credit card fraud detection model.
• Vertically partitioned.
– Attributes split across companies.
– Example: Associations across websites.
Cryptographic Adversaries
• Malicious adversary: can alter its input, e.g., define input
to be the empty database.
• Semi-honest (or passive) adversary: Correctly follows the
protocol specification, yet attempts to learn additional
information by analyzing the messages.
Yao's two-party protocol
• Party 1 with input x.
• Party 2 with input y.
• Wish to compute f(x,y) without revealing x, y.
• Yao, "How to generate and exchange secrets", FOCS 1986.
Private Distributed ID3
• Key problem: find attribute with highest information gain.
• We can then split on this attribute and recurse.
– Assumption: Numeric values are discretized, with n-way split.
Information Gain
• Let
– T = set of records (dataset),
– T(ci) = set of records in class i,
– T(ci,aj) = set of records in class i with value(A) = aj.
– Entropy(T) = $-\sum_i \frac{|T(c_i)|}{|T|} \log \frac{|T(c_i)|}{|T|}$
– Gain(T, A) = Entropy(T) $- \sum_j \frac{|T(a_j)|}{|T|}$ Entropy(T(a_j))
• Need to compute
– $\sum_j \sum_i |T(a_j, c_i)| \log |T(a_j, c_i)|$
– $\sum_j |T(a_j)| \log |T(a_j)|$
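Those two sums suffice because Entropy(T) is the same for every attribute, so maximizing Gain(T, A) means minimizing the weighted split entropy, which expands into exactly these sums. A small sketch (the function name is illustrative):

```python
import math

def split_score(counts):
    """counts[j][i] = |T(aj, ci)|.  Returns |T| times the weighted
    entropy of the split (natural log); the attribute minimizing this
    maximizes information gain."""
    xlogx = lambda v: v * math.log(v) if v > 0 else 0.0
    s1 = sum(xlogx(c) for row in counts for c in row)  # Sj Si |T(aj,ci)| log|T(aj,ci)|
    s2 = sum(xlogx(sum(row)) for row in counts)        # Sj |T(aj)| log|T(aj)|
    return s2 - s1
```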
Selecting the Split Attribute
• Given v1 known to party 1 and v2 known to party 2,
compute (v1 + v2) log (v1 + v2) and output random shares.
– Party 1 gets Answer - d
– Party 2 gets d, where d is a random number
• Given random shares for each attribute, use Yao's protocol
to compute information gain.
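A toy illustration of the random-shares idea, with an explicit trusted helper standing in for the secure computation; in the actual protocol the shares come out of Yao's garbled-circuit construction and no helper exists.

```python
import math, random

def shared_xlogx(v1, v2):
    """Split (v1 + v2) * log(v1 + v2) into two random additive shares.
    v1 is known only to party 1, v2 only to party 2; neither share
    alone reveals the answer."""
    answer = (v1 + v2) * math.log(v1 + v2)
    d = random.uniform(0, 1e6)       # party 2's random share
    return answer - d, d             # (party 1's share, party 2's share)

s1, s2 = shared_xlogx(120, 80)
assert abs((s1 + s2) - 200 * math.log(200)) < 1e-6
```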
Summary (Cryptographic Approach)
• Solves different problem (vs. randomization)
– Efficient with semi-honest adversary and small number of parties.
– Gives the same solution as the non-privacy-preserving
computation (unlike randomization).
– Will not scale to individual user data.
• Can we extend the approach to other data mining
problems?
– J. Vaidya and C.W. Clifton, “Privacy Preserving Association Rule
Mining in Vertically Partitioned Data”, KDD 2002
Directions
• Electronic Commerce & Web
• Security
• Privacy
– Motivation
– Randomization Approach
– Cryptographic Approach
– Challenges
Challenges
• Application: Privacy-Sensitive Security Profiling
• Privacy Breaches
• Clustering & Associations
Privacy-sensitive Security Profiling
• Heterogeneous,
distributed data.
• New domains: text,
graph
"
F
r
e
q
u
e
n
t
T
r
a
v
e
l
e
r
"
R
a
t
i
n
g
M
o
d
e
l
E
m
a
i
l
C
r
e
d
i
t
g
e
n
c
i
e
s
D
e
m
o
- C
r
i
m
i
n
a
l A
P
h
o
n
eg
r
a
p
h
i
cR
e
c
o
r
d
s
S
t
a
t
e
B
i
r
t
hM
a
r
r
i
a
g
e
L
o
c
a
l
Potential Privacy Breaches
• Distribution is a spike.
– Example: Everyone is of age 40.
• Some randomized values are only possible from a given
range.
– Example: Add U[-50,+50] to age and get 125 ⇒ True age is ≥ 75.
– Not an issue with Gaussian.
Potential Privacy Breaches (2)
• Most randomized values in a given interval come from a
given interval.
– Example: 60% of the people whose randomized value is in
[120,130] have their true age in [70,80].
– Implication: Higher levels of randomization will be required.
• Correlations can make previous effect worse.
– Example: 80% of the people whose randomized value of age is in
[120,130] and whose randomized value of income is [...] have their
true age in [70,80].
• Challenge: How do you limit privacy breaches?
Clustering
• Classification: ByClass partitioned the data by class & then
reconstructed attributes.
– Assumption: Attributes independent given class attribute.
• Clustering: Don’t know the class label.
– Assumption: Attributes independent.
• Global (latter assumption) does much worse than ByClass.
• Can we reconstruct a set of attributes together?
– Amount of data needed increases exponentially with number of
attributes.
Associations
• Very strong correlations ⇒ privacy breaches are a major issue.
• Strawman Algorithm: Replace 80% of the items with other randomly
selected items.
– 10 million transactions, 3 items/transaction, 1000 items
– <a, b, c> has 1% support = 100,000 transactions
– <a, b>, <b, c>, <a, c> each have 2% support
• 3% combined support excluding <a, b, c>
– Probability of retaining pattern = 0.2³ = 0.8%
• 800 occurrences of <a, b, c> retained.
– Probability of generating pattern = 0.8 * 0.001 = 0.08%
• 240 occurrences of <a, b, c> generated by replacing one item.
– Estimate with 75% confidence that pattern was originally present!
• Ack: Alexandre Evfimievski
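The strawman arithmetic, spelled out; all numbers are the slide's, and the resulting confidence comes to roughly the 75% quoted.

```python
n = 10_000_000                          # transactions
support_abc = 0.01 * n                  # <a,b,c>: 100,000 transactions
support_pairs = 0.03 * n                # <a,b>,<b,c>,<a,c> combined, excl. <a,b,c>

p_retain = 0.2 ** 3                     # all 3 items kept (each kept w.p. 0.2)
retained = support_abc * p_retain       # 800 occurrences survive

p_generate = 0.8 * (1 / 1000)           # a replaced item becomes the missing one
generated = support_pairs * p_generate  # 240 spurious occurrences

confidence = retained / (retained + generated)
print(retained, generated, round(confidence, 2))  # 800.0 240.0 0.77
```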
Associations (cont.)
• "Where does a wise man hide a leaf? In the forest. But
what does he do if there is no forest?" ... "He grows a
forest to hide it in." -- G.K. Chesterton
• A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke,
“Privacy Preserving Mining of Association Rules”, KDD
2002.
• S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule
Mining”, VLDB 2002.
Summary (Privacy)
• Have your cake and mine it too!
– Preserve privacy at the individual level, but still build accurate
models.
• Challenges
– Privacy Breaches, Security Applications, Clustering &
Associations
• Opportunities
– Web Demographics, Inter-Enterprise Data Mining, Security
Applications
Slides available from ...
www.almaden.ibm.com/cs/people/srikant/talks.html
Backup
Classification Example
Age   Salary   Repeat Visitor?
23    50K      Repeat
17    30K      Repeat
43    40K      Repeat
68    50K      Single
32    70K      Single
20    20K      Repeat

Decision tree:
Age < 25?
  Yes → Repeat
  No  → Salary < 50K?
          Yes → Repeat
          No  → Single
Work in Statistical Databases
• Provide statistical information without compromising
sensitive information about individuals (surveys: AW89,
Sho82)
• Techniques
– Query Restriction
– Data Perturbation
• Negative Results: cannot give high quality statistics and
simultaneously prevent partial disclosure of individual
information [AW89]
Statistical Databases: Techniques
• Query Restriction
– restrict the size of query result (e.g. FEL72, DDS79)
– control overlap among successive queries (e.g. DJL79)
– suppress small data cells (e.g. CO82)
• Output Perturbation
– sample result of query (e.g. Den80)
– add noise to query result (e.g. Bec80)
• Data Perturbation
– replace db with sample (e.g. LST83, LCL85, Rei84)
– swap values between records (e.g. Den82)
– add noise to values (e.g. TYW84, War65)
Statistical Databases: Comparison
• We do not assume original data is aggregated into a single
database.
• Concept of reconstructing original distribution.
– Adding noise to data values problematic without such
reconstruction.