SEEDEEP: A System for Exploring and Querying Deep Web Data Sources
Fan Wang
Advisor: Prof. Gagan Agrawal
Ohio State University
PhD Defense
The Ohio State University
Summer 2010
The Deep Web
• The definition of “the deep web” from
Wikipedia
The deep Web refers to World Wide
Web content that is not part of the
surface web, which is indexed by
standard search engines.
• Some Examples: Expedia, Priceline
The Deep Web is Huge and Informative
• 500 times larger than the surface
web
• 7500 terabytes of information (19
terabytes in the surface web)
• 550 billion documents (1 billion in
the surface web)
• More than 200,000 deep web
sites
• Relevant to every domain:
scientific, e-commerce, market
• 95 percent of the deep web is
publicly accessible (with access
limitations)
How to Access Deep Web Data
1. A user issues query through input interfaces of deep web data sources
2. Query is translated into SQL style query
3. Trigger search on
backend database
Select price
From Expedia
Where depart=CMH and arrive=SEA
and dedate=“7/13/10” and redate=“7/16/10”
4. Answers returned
through network
Drawbacks
• Constrained flexibility
– Types of queries: aggregation query, nested query,
queries with grouping requirement
– User specified predicates
• Inter-dependent data sources
– Users may want data from multiple correlated data
sources
• Long latency
– Network transmission time
– Denial of service
Goal
Develop a deep web search tool
which could support
online (real time)
structured,
and high level queries
(semi)automatically
Challenges
• Challenges for Integration
– Self-maintained and created
– Heterogeneous, hidden and dynamically updated
metadata
• Challenges for Searching
– Limited data access pattern
– Data redundancy and data quality
– Data source dependency
• Challenges for Performance
– Network latency
– Fault tolerance issue
Our Contributions (1)
• Support online aggregation over deep web
– Answer deep web aggregation search in a OLAP
fashion
– Propose novel sampling techniques to find
accurate approximate answers in a timely manner
• Support low selectivity query over deep web
– Answer low selectivity query in presence of
limited data access
– Propose a novel Bayesian method to find optimal
stratification for hidden selective attribute
Our Contributions (2)
• Support structured SQL queries over the
deep web
– Support SPJ, aggregation, and nested queries
• Automatic hidden schema mining
– A statistic framework to discover hidden metadata
from deep web data sources
• Novel query caching mechanism for query
optimization
• Effective fault tolerance handling mechanism
System Overview
[Architecture diagram. The Exploring part of SEEDEEP covers sampling the deep web, hidden schema discovery, and data source integration, and builds the system models: a data source model (source, input, output, constraint; e.g., S1: A1; B1,B2; C2 and S2: A1; B2,B3; C1) and a data source dependency model (D1–D4). The Querying part accepts complex structured queries and aggregation/low selectivity queries, and comprises query planning, approximate query answering, query optimization, and fault tolerance.]
Outline
• Introduction
• Sampling methods for online aggregation query
– Motivation
– ANS and TPS methods
– Evaluation
• Stratification methods for low selectivity query
– Motivation
– Harmony search and Bayesian based adaptation
– Evaluation
• Future work and conclusion
Online Aggregation Motivation
• Aggregation queries requiring data enumeration
I want to know the average airfare from US to Europe
across all major US airline flights in the next week
A deep web data source backed by a relational database would answer:

Select AVG(airfare)
From AirTable AT
Where AT.depart=any US city and AT.arrive=any European city

But the input interface forces enumeration of every (depart, arrive) pair:
NYC–London: AA: 500, UA: 550
Boston–Paris: USAir: 450, Delta: 400
LA–Rome: UA: 600, AA: 650
Need enumeration! Where do you get these city names? How long can you wait? What if the data is updated dynamically?
Initial Thoughts
• Sampling: Approximate answers
• Simple random sampling (SRS)
– Every data record has the same probability to
be selected
• Drawbacks of SRS
– Bad performance on skewed data
– High sampling cost to perform SRS on deep
web (Dasgupta et al, HDSampler)
We Want To Achieve
• Handle data with (probably high) skew
– Top 20 IT companies account for 80% of the sales
among all top 100 IT companies in 2005
– Hidden data (hard to gather statistical information)
– Has skew or not?
– Unknown data distribution
– Pilot sample, how much can you trust your pilot
sample?
• Lower sampling cost for sampling deep web
data
Our Contributions
• Two Sampling Algorithms
– ANS (Adaptive Neighborhood Sampling): handling skewed
data, sample skew causing data easier
– TPS (Two Phase adaptive Sampling): lower sampling cost
• Performance
– Accurate estimates without prior knowledge
– ANS and TPS outperform HDSampler by a factor of 4 on
skewed data
– TPS has one-third of the sampling cost of HDSampler
Background Knowledge
A survey on a type of rare monkey, which only lives in a small but
dangerous area in southern China
Associated Samples
Why Is This Good, and Can We Use It?
• Sample more rare but significant data records
– Good for handling skewed data
• Associated samples have relatively low sampling
cost
– Cheaper than SRS with the same sample size
• Yes, we can use it! ~~ with modification
– Much real world data has skew (IT company, household income)
– Rare data often form clusters
– Deep web data sources often return multiple records w.r.t. one
input sample
Drawbacks
• Performance depends on the initial sample
• Initial sample is simple random sample
• No cost limit explicitly considered
– What is the size of the initial sample?
– How many associated samples should be added?
The ANS Sampling Algorithm
• Select a random sample
– Stop random sampling if either of two termination rules applies
• We have sampled k units of interest
• We have reached the cost limit
– Take the sampled data record and add it to our sample
– If this data record is a unit of interest (aggressively sample skew-causing data)
• Obtain its neighbors (neighborhood sampling)
• For each data record obtained from neighborhood sampling
– Add it to our sample
– Perform recursive neighborhood sampling if necessary
– If neighborhoods are too large (to control sampling cost)
• Increase the unit-of-interest threshold
– If neighborhoods are too small
• Decrease the unit-of-interest threshold
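The loop above can be sketched in Python. The numeric data set, the neighbor function, and the fixed threshold below are hypothetical stand-ins, and the threshold-adaptation steps are omitted for brevity:

```python
import random

def ans_sample(data, neighbors, threshold, k=3, cost_limit=60, seed=0):
    """Sketch of Adaptive Neighborhood Sampling (ANS).

    data: list of numeric records; neighbors(i): indices adjacent to record i.
    A record is a "unit of interest" when it exceeds `threshold`.
    """
    rng = random.Random(seed)
    order = list(range(len(data)))
    rng.shuffle(order)
    sample, seen, cost, units = [], set(), 0, 0
    for i in order:
        if units >= k or cost >= cost_limit:   # the two termination rules
            break
        seen.add(i)
        sample.append(data[i])
        cost += 1
        if data[i] > threshold:                # unit of interest:
            units += 1                         # sample its neighborhood
            stack = [i]
            while stack and cost < cost_limit:
                for j in neighbors(stack.pop()):
                    if j not in seen:
                        seen.add(j)
                        sample.append(data[j])
                        cost += 1
                        if data[j] > threshold:
                            stack.append(j)    # recurse on skew-causing data
    return sample
```

A skewed input (a few large sales records clustered together) lets the neighborhood step pick up the cluster once any member is hit.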
ANS Example
• Estimate the total sale of IT companies in 2005
• Each point represents a company’s sale record
• Color shows the scale of the sale value, the
darker, the higher
• Neighborhood of data records is defined
according to some rules
Unit of interest: sales larger than a threshold
1. Select initial random samples sequentially until we have k units of interest (k=3)
2. Explore the neighborhood recursively for all units of interest until the total number of samples reaches a limit
3. If too many neighbors are included, we increase the unit-of-interest threshold
Estimators and Analysis for ANS
• Estimator for AVG with a fixed unit-of-interest threshold
– β is the percentage of units of interest w.r.t. the entire data set, estimated as β = (k-1)/(n-1)
– Lemma 1: the estimator is biased, but when k is small the bias is very small
• We also propose an estimator for a variable unit-of-interest threshold using post-stratification
– Each stratum estimate is the average value from the hth stratum (all samples corresponding to one specific unit-of-interest threshold value)
Drawbacks of ANS
• Initial samples are simple random samples
• SRS: one input search only gets one
sample from the output page
• High cost
The TPS Sampling Algorithm
• Partition data set D into M sub-spaces
– According to combinations of input attribute values
• Randomly select m sub-spaces
• Select a sample of size n1 from each of the m selected sub-spaces (first-phase sampling)
– For a selected sub-space, if any selected data record is a unit of interest, proceed
• Select a sample of size n2 from the corresponding sub-spaces (second-phase sampling; aggressively draws skew-causing data at low sampling cost)
– Sub-spaces containing units of interest in the first sampling phase may yield more units of interest in the second sampling phase
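A minimal sketch of the two phases, assuming hypothetical sub-spaces of numeric records and a fixed unit-of-interest threshold:

```python
import random

def tps_sample(subspaces, threshold, m=3, n1=4, n2=3, seed=None):
    """Sketch of Two-Phase adaptive Sampling (TPS).

    subspaces: list of lists of numeric records (one list per input-attribute
    combination); a record is a unit of interest when it exceeds `threshold`.
    """
    rng = random.Random(seed)
    sample = []
    # Phase 1: randomly pick m sub-spaces and draw n1 records from each.
    for sp in rng.sample(subspaces, min(m, len(subspaces))):
        first = rng.sample(sp, min(n1, len(sp)))
        sample.extend(first)
        # Phase 2: if any first-phase record is a unit of interest,
        # draw n2 more records from the same sub-space.
        if any(v > threshold for v in first):
            sample.extend(rng.sample(sp, min(n2, len(sp))))
    return sample
```

Because a whole sub-space corresponds to one input search, the second phase reuses an input that already returned a unit of interest, which is the source of the cost saving.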
Unit of interest: sales larger than a threshold
1. Randomly select the sub-spaces (input search)
2. Randomly select N1 samples in each selected sub-space (N1=4)
3. If any sample selected in a sub-space is a unit of interest, select N2 samples from the sub-space again (N2=3)
Evaluation
• Data sets
– Synthetic data sets: generated using MINITAB, varying data
skew from 1 to 9
– US Census data: 2002 US economic census data on wholesale
trade product lines listed by the kind of business (skew=8)
– Yahoo! Auto: prices of used Ford cars from 2000 to 2009 located
within 50 miles of a zipcode (skew=0.7)
• Metrics
– AER: Absolute Error Rate
– Sampling cost: number of input samples needed
• Methods
– ANS
– TPS
– SRS
ANS Performance w.r.t. Data Skew
1. AER increases moderately with the increase of data skew
2. When k=8, AER is consistently smaller than 19%
3. A larger k does not improve accuracy much
TPS Performance w.r.t. Data Skew
1. AER increases moderately as data skew increases
2. For a sub-space sample size of 30%, AER is always smaller than 17%
AER Comparison on Synthetic Data
1. All methods work well on small skew data
2. HDSampler has bad performance on data with skew>2
3. Our two methods outperform HDSampler by a factor of 5
AER Comparison on US Census Data
AER Comparison on Yahoo! Data
1. For AVG, three methods are comparable in terms of accuracy
2. For MAX, our methods are better (by a factor of 4)
Sampling Cost Comparison on Yahoo Data
1. To achieve a low AER, TPS has one-third of the sampling cost of HDSampler
2. The total number of samples TPS obtained (with same cost in time) is twice
the number of samples HDSampler obtained
Outline
• Introduction
• Sampling methods for online aggregation query
– Motivation
– ANS and TPS methods
– Evaluation
• Stratification methods for low selectivity query
– Motivation
– Harmony search and Bayesian based adaptation
– Evaluation
• Future work and conclusion
Motivating Example: Low Selectivity
• Random Sampling
– None of the selected records satisfies the low selectivity predicate
• Stratified Sampling
– Partitioning attribute, selective attribute
– How to perform stratification
• Distance based methods (clustering, outlier indexing)
• Selective attribute is not queryable
– Auxiliary attribute based stratification
» Dalenius and Hodges’s method, Ekman’s method, Gunning and Horgan’s
geometric method
– Require strong correlation between auxiliary and selective attribute
Our Contributions
• We focus on queries in the following format
• Propose a Bayesian adaptive harmony search stratification
method to stratify a hidden selective attribute based on an
auxiliary attribute
• The stratification accurately reflects the distribution of the
hidden selective attribute when correlation is weak
• Estimates from our stratification outperform existing methods by a factor of 5
• Estimation accuracy obtained from our method is higher than
95% for 0.01% selectivity queries
Background: Stratification
• Purpose
– Within stratum homogeneity
• Partition data set R into k strata
– The partitioning attribute x has a value range; find k-1 breaking points
• Sampling allocation
– Neyman allocation
– Bayesian Neyman allocation
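Classical Neyman allocation can be computed directly; the Bayesian variant mentioned above is not shown here. A sketch with hypothetical stratum sizes and standard deviations:

```python
def neyman_allocation(total_n, strata):
    """Neyman allocation sketch: sample sizes proportional to N_h * sigma_h.

    strata: list of (N_h, sigma_h) pairs (stratum size, within-stratum
    standard deviation); returns the per-stratum sample sizes.
    """
    weights = [n_h * sigma_h for n_h, sigma_h in strata]
    total = sum(weights)
    return [round(total_n * w / total) for w in weights]
```

For example, a small but highly variable stratum (100 records, sigma 9) receives as many samples as a large homogeneous one (900 records, sigma 1), which is exactly the behavior wanted for skewed, low-selectivity data.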
Background: Harmony Search (1)
• A phenomenon-mimicking meta-heuristic algorithm
inspired by the improvisation process of musicians
• Optimize an objective function
• Initialize the harmony memory, M random guesses of the
decision variable vector
Background: Harmony Search (2)
• Improvise a new harmony from the harmony memory
– A new harmony vector is generated using two parameters, HMCR and PAR
• Update harmony memory
• Termination
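A minimal harmony-search sketch for a generic minimization problem; the memory size, HMCR, PAR, and bandwidth values here are illustrative defaults, not the talk's settings:

```python
import random

def harmony_search(objective, dim, bounds, M=10, iters=200,
                   hmcr=0.9, par=0.3, bw=0.05, seed=0):
    """Minimal harmony search (minimizing `objective`).

    M: harmony memory size; HMCR: prob. of drawing a component from memory;
    PAR: prob. of pitch-adjusting a drawn component by up to +/- bw.
    """
    rng = random.Random(seed)
    lo, hi = bounds
    memory = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(M)]
    for _ in range(iters):
        new = []
        for d in range(dim):
            if rng.random() < hmcr:                  # memory consideration
                x = rng.choice(memory)[d]
                if rng.random() < par:               # pitch adjustment
                    x += rng.uniform(-bw, bw)
            else:                                    # random selection
                x = rng.uniform(lo, hi)
            new.append(min(hi, max(lo, x)))
        worst = max(memory, key=objective)
        if objective(new) < objective(worst):        # update harmony memory
            memory.remove(worst)
            memory.append(new)
    return min(memory, key=objective)

# Toy run: minimize the 2-D sphere function over [-1, 1]^2.
best = harmony_search(lambda v: sum(x * x for x in v), dim=2, bounds=(-1.0, 1.0))
```

In the stratification setting, the decision vector would instead hold candidate breaking points and the objective the stratification quality described on the next slides.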
Algorithm Overview (1)
• Find the best stratification of the selective attribute
[Figure: two histograms, "Stratification of An Auxiliary Attribute" (breaking points 0.45, 0.75, 1.1, 2.1) and "Stratification of A Selective Attribute" (breaking points 0.45, 0.85, 1.4, 2.1), showing how a stratification of the auxiliary attribute induces one on the selective attribute.]
Algorithm Overview (2)
• Consider an auxiliary attribute as the partitioning
attribute
• Harmony memory
– A list of breaking points vectors of the auxiliary attribute
Harmony Objective Function
• What is a good stratification?
– Condition 1: homogenous data within stratum
• Small sample variance summation across strata
– Condition 2: stratification with high precision, i.e.,
the low selective region of the distribution is
exclusively covered by some strata
Sample Allocation
• Determine the number of samples assigned to
each stratum
• Which strata should be given heavier sampling weight?
– Stratum with diversity (more heterogeneous)
• High sample variance
– Stratum which covers large percentage of the low selective
region
• High precision
Parameter Adaptation: Overview
• Two parameters: HMCR and PAR
X: value of the HMCR parameter
Y: percentage of cases in which we obtain a better harmony vector
Bayesian Method Overview
• Estimate unknown parameter (posterior distribution)
based on prior knowledge or belief (prior
distribution)
• Observed data y, unknown parameter θ
• In our scenario
– Observed data: the harmony parameter values which yield
better harmony vector
– Unknown parameter: adaptation pattern of harmony
parameters
Bayesian Adaptation (1)
• Assume the adaptation patterns of the parameters HMCR and PAR follow probability functions
• Represent our belief in θ as a prior probability distribution
• Observe the data: the HMCR and PAR parameter values which yield the best new harmony vector based on the current harmony memory
• Based on the observed data, compute the posterior distribution of θ
• Compute the adapted parameter value from the posterior distribution
Bayesian Adaptation (2)
• Probability function
• Prior distribution of θ
Evaluation
• Data sets
– Synthetic data sets: generated using MINITAB, varying
correlation between auxiliary attribute and selective attribute
from 1 to 0.3
– US Census data, correlation (number of settlements and sale)
0.56
– Yahoo! Auto, correlation (year and mileage) 0.7
• Metric
– AER: Absolute Error Rate
• Methods
– Leaps and Bounds (L&B)
– Dalenius and Hodges (D&H)
– Random Sampling (No Stratification)
– Ours (HarmonyAdp)
HarmonyAdp Performance (1)
1. Higher accuracy with
more iterations
2. Iteration>40 (2% of total
data size), low AER for all
selectivity values
3. Robust with respect to
data correlation
HarmonyAdp Performance (2)
Four Methods Comparison (1)
1. All methods work well on
easy queries
2. When queries get harder,
D&H, L&B and Random
degrade severely. Ours has
good performance
3. For 0.1% queries, our
method outperforms others by
a factor of 5
4. For extreme low selectivity
queries, the error rate from
our method is always lower
than 18%
Four Methods Comparison (2)
1. Similar pattern
2. The performance degradation of the other three algorithms on the Auto data is small, because the data correlation is high
Conclusion
• Propose a system for exploring and querying deep
web data sources
• Propose novel sampling methods to handle online
aggregation and low selectivity queries
• Propose novel query planning algorithms to answer
structured SQL queries over the deep web
• System also has the following features
– Self-healing from unavailable/inaccessible data sources
– Query optimization
– Automatic hidden schema mining and integration
Future Work
• Better understanding the deep web data
– Data quality
• Reliability, authority, data provenance models
• Data source correlation
– Data distribution
• Learning approximate data distribution with a bounded confidence interval
• Handling consistency issue
• Better querying the deep web
– Structured query
• Existence (Is-there) queries, data mining queries
– Semantic query, incorporate semantic web with deep web
• Automatic ontology generating from deep web
• Semantic query answering
– Result ranking
• Ranking results in heterogeneous formats
• Combining results from different/correlated data sources
Thank You!
Backup
• ANS Estimator
• TPS Estimator
• SPJ NP Hard
• Aggregation NP Hard
• Bayesian Inference
• Other System Modules
Bayesian Adaptation (2)
• Probability function
– HMCR ranges from 0 to 1
– HMCR varies from small values to large values
– Beta distribution best mimics this adaptation pattern
Bayesian Adaptation (3)
• Prior distribution of θ
– Belief about θ before any data is observed
– Belief about θ before the execution of harmony search
algorithm
– What is our belief?
• For a Beta distribution with θ<β, the peak occurs before HMCR=0.5; with θ>β, the peak occurs after HMCR=0.5
• At the beginning, HMCR is small
Lemma on Biasness
TPS Estimator
Details on the theorem, sufficient statistics, etc.
Details on max
SPJ NP hard
Let C={C1,C2,…,Cm} be an instance of the set cover problem.
The node set V of G is given by V={C1,…Cm}
Aggregation NP hard
Let C={C1,C2,…,Cm} be an instance of the set cover problem.
The node set V of G is given by V={C1,…Cm, W}
The grouping attribute is s*, and node W covers s*
ANS Estimator Bias Evaluation
1. k=8, we achieve the best estimation
2. k>10, AER increases moderately, bias occurs
3. k=10, we reach the cost limit, bias occurs
Harmony Objective Function
• What is a good stratification?
– Condition 1: homogenous data within stratum
• Small sample variance summation across strata
[Figure: two histograms over the selective attribute, "A Homogenous But Not Precise Stratification" and the "Ideal Stratification", with breaking points marked on each.]
– Condition 2: stratification with high precision, i.e., the low
selective region of the distribution is exclusively covered by
some strata
Outline
• Introduction
• Sampling methods for online aggregation query
– Motivation
– ANS and TPS methods
– Evaluation
• Stratification methods for low selectivity query
– Motivation
– Harmony search and Bayesian based adaptation
– Evaluation
• Answering complex structured query
• Query optimization and fault tolerance
• Future work and conclusion
Motivating Example: Structured Query
Biologists have identified that gene X and protein Y are contributors to a disease. They want to examine the SNPs (Single Nucleotide Polymorphisms) located in the genes that share the same functions as either X or Y. In particular, for all SNPs located in each such gene, with a heterozygosity value greater than 0.01, biologists want to know the maximal SNP frequency in the Asian population.
[Figure: the three-part query plan. Sub-query 1 finds genes that share functions with gene X via NCBI GENE and GO; sub-query 2 finds genes that share functions with protein Y via Human Protein, NCBI GENE, and GO; the main query retrieves frequency information for the SNPs located in these genes from SNP500Cancer and dbSNP, filtered by heterozygosity.]
Query Planning Algorithm
• We propose query planning algorithms to support
– Select-Project-Join query
– Aggregation query
– Uncorrelated nested query
• We also extend our algorithm to the following SQL
operators
– Union and OR
– Order by
– Having
• Planning problem: NP hard sub-graph search problem
– We propose heuristic graph search algorithms
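The talk's heuristic graph search algorithms are not detailed here; as an illustration of the underlying set-cover flavor of the planning problem, a greedy sketch with hypothetical source names and attributes:

```python
def greedy_cover(universe, sources):
    """Greedy set cover: repeatedly pick the data source covering the most
    still-uncovered query attributes. A stand-in illustration, not the
    talk's actual planning algorithm.

    sources: dict source_name -> set of attributes it provides.
    """
    uncovered = set(universe)
    plan = []
    while uncovered:
        # Pick the source covering the most still-uncovered attributes.
        best = max(sources, key=lambda s: len(sources[s] & uncovered))
        gain = sources[best] & uncovered
        if not gain:
            raise ValueError("query not answerable from available sources")
        plan.append(best)
        uncovered -= gain
    return plan
```

Greedy selection gives the classical logarithmic approximation guarantee for set cover, which is why such heuristics are a natural fit for an NP-hard source-selection problem.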
Query Optimization Techniques
• Data redundancy across data sources
• “All-In-One” result pages of deep web data sources
– Deep web data sources return everything in one page, whether you want it or not
• Query plan driven caching mechanism to optimize query execution
– Reuse similar previous query plans
– Increase the possibility of reusing previously fetched data
• Results
– Our method achieves twice the speed of a baseline method for
half the cases
Fault Tolerance Techniques
• Inaccessibility and unavailability of deep web data
sources
• Self-healing approach for deep web query execution
• Data redundancy based fault tolerance techniques
– Hide unavailable and/or inaccessible data sources
• Steps
– Find minimal impacted subplan
– Find maximal fixable query
– Generate a replacement plan
• Result
– For half the cases, our method achieves 50% speedup
compared to a baseline method
Initial Thoughts
• Sampling: Approximate answers
• Simple random sampling (SRS)
– Every data record has the same probability to be
selected
• Drawbacks of SRS
– Bad performance on skewed data
D=(1,1,1,1,1,1,1,1,10,1000)
True average: 101.8
SRS samples of size 2:
S1(1,1)=1, S2(1,10)=5.5, S3(1,1000)=500.5, S4(10,1000)=505
– High sampling cost to perform SRS on deep web
(Dasgupta et al, HDSampler)
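The example can be checked exhaustively; this snippet enumerates every possible size-2 simple random sample of D:

```python
import itertools
import statistics

# Enumerate every possible size-2 simple random sample of the skewed data set.
D = [1, 1, 1, 1, 1, 1, 1, 1, 10, 1000]
true_avg = statistics.mean(D)  # 101.8
estimates = [statistics.mean(pair) for pair in itertools.combinations(D, 2)]
# Count how many size-2 SRS estimates land within 50 of the true average.
close = sum(abs(e - true_avg) < 50 for e in estimates)
```

Every one of the 45 possible size-2 estimates (1, 5.5, 500.5, or 505) misses the true average 101.8 by more than 50, which is the skew problem in miniature.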
The Deep Web is Informative
• Structured Data
– Surface web: text format
– Deep web: relational data in backend
relational databases
• Topic Specific Data
– Biology, Chemistry, Medical, Travel, Business,
Academia, and many more…
• Publicly Accessible Data
– 95 percent of the deep web is publicly
accessible (with access limitations)
Sufficient Statistics
No other statistic which can be calculated from the same
sample provides any additional information as to the value of
the parameter
The concept is most general when defined as follows: A
statistic T(X) is sufficient for underlying parameter θ precisely
if the conditional probability distribution of the data X, given
the statistic T(X), is not a function of the parameter θ
This is why the final sample si is a sufficient statistic for the estimator (all information must be contained in the final sample)
Rao–Blackwell theorem
• In statistics, the Rao–Blackwell theorem is a result which characterizes the transformation of an arbitrarily crude estimator into an estimator that is optimal by the mean-squared-error criterion or any of a variety of similar criteria.
• The Rao–Blackwell theorem states that if g(X) is any kind of
estimator of a parameter θ, then the conditional expectation of
g(X) given T(X), where T is a sufficient statistic, is typically a
better estimator of θ, and is never worse.
• The mean squared error of the Rao–Blackwell estimator does
not exceed that of the original estimator.
Conditional Expectation
• Let X and Y be discrete random variables, then the
conditional expectation of X given the event Y=y
is a function of y over the domain of Y
Chebyshev Inequality
Bayesian Computation (1)
• The posterior estimation for θ can be computed using
• Using Monte Carlo sampling
– Generate a sequence of random variables having a common density
– By the strong law of large numbers, the posterior expectation can be approximated by the sample average over these draws
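A sketch of this Monte Carlo scheme, using the prior as the proposal; the Beta(2, 2) prior and the coin-flip likelihood below are hypothetical examples, not the talk's harmony parameters:

```python
import math
import random

def posterior_mean_mc(loglik, prior_sample, n=20000, seed=1):
    """Monte Carlo sketch of a posterior mean: draw theta from the prior
    and weight each draw by its likelihood (self-normalized importance
    sampling with the prior as the proposal density)."""
    rng = random.Random(seed)
    thetas = [prior_sample(rng) for _ in range(n)]
    weights = [math.exp(loglik(t)) for t in thetas]
    z = sum(weights)
    return sum(t * w for t, w in zip(thetas, weights)) / z

# Hypothetical example: Beta(2, 2) prior on a coin's bias, 7 heads in 10 flips;
# the exact posterior is Beta(9, 5) with mean 9/14.
est = posterior_mean_mc(
    loglik=lambda t: 7 * math.log(t) + 3 * math.log(1 - t),
    prior_sample=lambda rng: rng.betavariate(2, 2),
)
```

Taking the proposal equal to the prior (as the next slide states) makes the importance weights exactly the likelihood values, which is what makes this approximation simple.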
Bayesian Computation (2)
• We choose the proposal density to be equal to the prior distribution
Strong Law of Large Numbers
• In probability theory, the law of large numbers
(LLN) is a theorem that describes the result of
performing the same experiment a large number
of times. According to the law, the average of the
results obtained from a large number of trials
should be close to the expected value, and will
tend to become closer as more trials are
performed.
Neyman Allocation
Leaps and Bounds
• Assumptions
– Strong correlation
– Uniform distribution within stratum (auxiliary attribute)
– Equal coefficients of variance in each stratum
(auxiliary attribute)
• Breaks follow a geometric progression
Dalenius and Hodges
• Assumptions
– Strong correlation
– Auxiliary attribute is uniformly distributed within each stratum
• Breaks are placed at equal intervals of the cumulative √f (the cum √f rule)
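The cumulative √f rule can be sketched as follows; the bin count and the data set below are illustrative:

```python
import math

def dalenius_hodges_breaks(values, bins=20, k=4):
    """Dalenius-Hodges cum sqrt(f) sketch: stratify `values` into k strata
    by placing breaks at equal steps of the cumulative sqrt-frequency."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    freq = [0] * bins
    for v in values:
        freq[min(int((v - lo) / width), bins - 1)] += 1
    # Cumulative sqrt-frequency over the histogram bins.
    cum, total = [], 0.0
    for f in freq:
        total += math.sqrt(f)
        cum.append(total)
    step = total / k
    breaks, target = [], step
    for i, c in enumerate(cum):
        if c >= target and len(breaks) < k - 1:
            breaks.append(lo + (i + 1) * width)  # right edge of bin i
            target += step
    return breaks
```

On uniform data the rule places the k-1 breaks at roughly equal quantiles; on skewed data it allocates more strata to the dense region, which is the property the comparison methods above rely on.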