Data Mining over the Deep Web - Computer Science and Engineering

Download Report

Transcript Data Mining over the Deep Web - Computer Science and Engineering

Data Mining over the Deep Web
Tantan Liu, Gagan Agrawal
{liut,agrawal}@cse.ohio-state.edu
Ohio State University
April 12, 2011
1
Outline
• Introduction
– Deep Web
– Data Mining on the deep web
• Frequent itemset mining over the deep web
– Bayesian network
– Active learning based sampling method
• Experiment Result
• Conclusion
2
Deep Web
• Data sources hidden from the Internet
– Online query interface vs. Database
– Database accessible through online Interface
– Input attribute vs. Output attribute
• An example of Deep Web
3
Data Mining over the Deep Web
• High level summary of data
– Scenario 1: A student wants to find a job as a software
Engineer
•
•
•
•
Will a master degree help?
Which language to learn: Java, C, or C#?
Try MSN careers – to much information!
Frequent itemset mining!
4
Challenges
• Databases cannot be accessed directly
– Sampling method for Deep web mining
• Obtaining data is time consuming
– Efficient sampling method
– High accuracy with low sampling cost
5
Roadmap
• Introduction
– Deep Web
– Data Mining
• Frequent Itemset mining over the deep web
– Bayesian Network
– Active learning based sampling method
• Experiment Result
• Conclusion
6
Frequent Itemset Mining
•
•
•
Itemset: a set of attributes with instantiations, e.g I={Brand=benz, Age>5}
Support(Brand=Benz, Age>5)=2/8=0.25
Frequent Itemset: Support is larger than a threshold
7
Frequent Itemset Mining on Deep Web
• Challenges
– Support of itemsets is unavailable
– The size of itemsets could be huge
• Considering 1-itemsets
– Simple random sample – Inefficient
• Support of itemsets of input attributes is known
– # of data records satisfying the query is provided
8
Main Idea
• Task: Estimating the support of itemsets of output
attributes
• Questions
– Can we use information about input attributes?
• Bayesian Network
– Relation between input attributes and output attributes
– Compute support for itemsets of output attributes
– How to quickly build the model
• Active learning based sample method
9
Bayesian Network
• Relation between input and output attributes
• Graphical model
– Random variables
• Input and output attributes
– Conditional dependencies
• Output attributes depend on input attributes
10
Active Learning
• In machine learning
– Passive learning: data are randomly chosen
– Active Learning
• Certain data are selected, to help build a better model
• Active Learning
– Obtaining data is costly and/or time-consuming
• Frequent Itemset Mining on Deep Web
11
An Example of Bayesian Network
Age
Brand
<=5, >5
H, B
known
Mileage
Estimate!
Price
Price
Brand
• Support of Itemsets depends on parameters in the Bayesian network
H
• Parameters are estimated based on Sample
H
‒ Parameter: p(price<=5000|H,<=5)
• 2 data records satisfying brand=H, Age<=5
• 1 data records satisfying brand=H, Age<=5, price<=5000
B
B
Age
p
<=5000
>5000
<=5 0.25
>5 0.25
0.5
0.5
0.5
0.5
<=5 0.25
>5 0.25
0.0
1.0
0.0
1.0
[0.125 0.125 0.0 0.0]
‒ p(price<=5000|H,<=5)=1/2=0.5
Support(Price<=5000)= 0.25
12
Example of Active learning on Deep
Web
Deep Web Data Source
Qi, i=1,…, 4
Q1
B=H&
Age<=5
Sampled Data
Q2
Q3
B=H&
Age>5
Q4
B=B&
Age>5
B=B&
Age<=5
Price
<=5000
>5000
Price
Q1
Q2
Q3
Q4
p11
p21
p31
p41
p12
p22
p32
p42
13
An Example of Active Learning Based
Sampling
•
Hidden idea
– Sampling heavily on query spaces with high impurity
Deep Web
Data Source
Q1(B=H)
Q1
0.5
0.5
Q2(B=B)
Price
Q2
0.01
0.99
14
Detailed Formulation
• Support for output attributes
‒
–
: an instantiation of input attributes, or a query
: prior probability
• Known
–
,
• Conditional probability
• Parameters in conditional table
• Unknown, need to estimate
15
Parameters in Bayesian Network
•
are estimated based on a sample
• Difference between estimated values and true values
• Consider
as statistical variables
• Conjugate distribution
– After observing data D,
is in the same family with
• Hyper parameter
– Expectation:
, where
• Estimation for support of output attributes
– Expectation on the distribution
16
Active Learning on Deep Web
•
Risk Function
– Risk with the estimation for 1-itemsets composed of output attributes
– Based on the hyper parameter in the Bayesian Network,
•
Data Selection
– Data are obtained by queries
: query selection
– Data records are selected step by step
– Choosing the query with most reduction on risk function
•
Updating Model
– For
where
, and sample
denotes the number of data records containing
17
Support for n-itemsets(n>1)
• Estimation based on the Bayesian network
–
• Support value of
in the query space
18
Roadmap
• Introduction
– Deep Web
– Data Mining
• Frequent Itemset mining over the deep web
– Bayesian Network
– Active learning based sampling method
• Experiment Result
• Conclusion
19
Experiment Result
• Data set: US census
– 2008 US Census on the income of US households
– 40,000 data records
• Three Methods
– Dir:
• Random Sample
• Direct Computation
– Bay
• Random Sample
• Computation Based on Bayesian Network
– Act: our proposed method
• Active Learning based Sample
• Computation Based on Bayesian Network
20
US census
• Square Error Rate:
• Absolute Error Rate (AER):
21
Conclusion
•
•
•
•
•
Data mining on the deep web is challenging
Frequent itemset mining over the deep web
Bayesian network is used to model the deep web
A active learning based sampling method
The experiment results show the efficiency of our work
22
Questions?
23