Transcript ppt
Index Structures for Querying the
Deep Web
Jian Qiu, Feng Shao, Jayavel Shanmugasundaram
Cornell Universersity
Misha Zatsman
Google
Deep Web
Keyword queries
Static web pages
Surface web
Deep Web
Keyword queries
Static web pages
Surface web
400-500
times the
size of
surface
web!
www.ebay.com
Deep web
Ebay
databases
CNN
databases
Cars.com
databases
…
…
Amazon
databases
Deep Glue
Query results
Structured queries
400-500
times the
size of
surface
web!
Deep web
Ebay
database
CNN
databases
Cars.com
database
…
Amazon
database
Deep Glue System
Find textbooks
with price<$50
Our
focus
Indexer
Index
structures
Query
Query Engine
Superset of relevant
data sources
Internet
Half.com
databases
Database Concepts @
half.com…
…
Index structure for deep web: Challenges
Deal with structured data
Underlying
databases are structured
Surface web typically unstructured
Deal
with large volumes
Orders
of magnitude larger than the
size of surface web
Our approach
Understand the structure/typing of the data
Support equality and range queries
Heavily compress the index
Achieve a factor of 10 compression
Tradeoff between compression factor and
the number of false positives
Compression factor 10 with only ~10 false
positives for 1000 data sources.
Outline
Query model
Index Structures
Experimental Evaluation
Related work and conclusion
Assumptions
Data sources are classified into domains
Online car dealers, online auctions, online travel agents, …
Data sources in the same domain use same logical
relational schema
Indexing attributes
Price, date, make, model, isbn,…
Indexed by Deep Glue system
Indexing data can be obtained via
Crawling the deep web [Raghavan 01 ]
Previously agreed-upon protocols [Froogle]
Query Model
Support equality and range queries
currently
on a single indexing attribute
Schema: Car(Id,Make,Model,Year,Price)
Queries:
Find
all year 2003 cars, year = 2003
Find all cars that cost less than $1000,
price < 1000
Outline
Query model
Index Structures
Experimental Evaluation
Related work and conclusion
Overview
Uncompressed Index
Compressed Index, still support equality
and range queries
Value
Clustered Index (VCI)
DataSource Clustered Index (DCI)
Value DataSource Clustered Index (VDCI)
Histogram Based Index (HBI)
Uncompressed Index (UI)
For each distinct value v for an indexing attribute, stores the
list of data sources
data source
UI:
value
d d d d d d d d
1 2 3 4 5 6 7 8
value
data sources
1
X X X
2
X X X
X
3
X
4 X
X X
5
X X
X
6 X
X X
d1: ebay.com , d2: amazon.com …
B+tree
1
2
3
4
5
6
d6,d7,d8
d2,d3,d4,d6
d4
d1,d4,d5
d2,d3,d6
d1,d7,d8
Problems
A huge number of values and data
sources in deep web !!
Indexing
space
every indexing attribute requires
Need to compress UI !
Use
gzip?
Have
to uncompress the index index lookup too
expensive!
Need
new compression techniques
Overview
Uncompressed Index
Compressed Index, still support equality
and range queries
Value
Clustered Index (VCI)
DataSource Clustered Index (DCI)
Value DataSource Clustered Index (VDCI)
Histogram Based Index (HBI)
Value Clustered Index (VCI)
Intuition:
“closely
related” values are stored in “closely
related” data sources
ISBN
numbers of antique books in the online book
retailers specializing in antique books.
Cluster “closely related” values
Stores the list of data sources only for
each cluster
VCI Example
data source
value
d d d d d d d d
1 2 3 4 5 6 7 8
1
XXX
2 XXX X
3
X
4X
XX
5 XX
X
6X
XXX
value Cluster
id
1
2
3
4
5
6
B+tree
False
c1
c2
c3
c3
c2
c1
Cluster 2: { 2, 5}
Cluster 3: { 3, 4}
Cluster
id
data sources
c1
d1,d6,d7,d8
c2
d2,d3,d4,d6
c3
d1,d4,d5
positives
value
1 data source d1
Tradeoff between
Cluster 1: { 1, 6}
Union
VCI structures:
Mapping
Mapping
space and accuracy
all values in one cluster
each distinct value into a
separate cluster
VCI Implementation
Use existing scalable algorithm
Scales to large data sets: Birch Framework [Zhang96]
Minimize the number of false positives
Specify the parameters for Birch
Centroid, the mid-point of a cluster
Radius, a measure of quality for
Centroid
cluster1
a cluster
Distance between clusters
Radius
Distance
cluster2
VCI formulae
For a cluster having the set of values V
ds(v): the set of data sources for value v
ds(v)
centroid(V) =
vV
radius(V) =
centroid (V ) ds(v)
Data sources associated with
the cluster
Sum of number of
false positives
vV
V
distance(V1, V2)
Additional number of false positives when merging two clusters
V 1 centroid (V 2) centroid (V 1) V 2 centroid (V 1) centroid (V 2)
Overview
Uncompressed Index
Compressed Index:
Value
Clustered Index (VCI)
DataSource
Clustered Index (DCI)
Value DataSource Clustered Index (VDCI)
Histogram Based Index (HBI)
DataSource Clustered Index (DCI)
Intuition: “closely related” data sources may have “closely related”
sets of values
Amazon and b&n has similar sets of ISBN numbers
In the data graph, VCI clusters rows and DCI clusters columns
data source
Cluster 1: {d2,d3,d6}
d d d d d d d d
value
Cluster 2: { d4, d5}
1 2 3 4 5 6 7 8
1
X X X
2
X X X
X
3
X X X
X
4 X
X X
5
X X
X
6 X
X X
Cluster 3: { d1, d7, d8}
Table structures are
similar to VCI.
See paper for other
details
Overview
Uncompressed Index
Compressed Index:
Value
Clustered Index (VCI)
DataSource Clustered Index (DCI)
Value
DataSource Clustered Index (VDCI)
Histogram Based Index (HBI)
Value-DataSource Clustered Index
(VDCI)
VCI, DCI: clusters in 1 dimension
VDCI: clusters in 2 dimensions, generalizes VCI/DCI
Cluster: a set of values and a set of data sources
data source
value
d
1
d
2
d
3
d
4
d
5
d
6
d
7
d
8
1
X X X
2
X X X
X X X
3
X X X
X
4 X
X X X
5
X X X X X
6 X
X X
Cluster
1:{ {2,3}, {d2,d3,d4}}
Cluster
2:{ {4,5}, {d4,d5,d6} }
Cluster
3:{ {1,2}, {d6,d7,d8} }
Data
source d4 is in two clusters
Value
2 is in two clusters
Table
structures are
similar to VCI.
See
paper for other
details
Overview
Uncompressed Index
Compressed Index:
Value
Clustered Index (VCI)
DataSource Clustered Index (DCI)
Value DataSource Clustered Index (VDCI)
Histogram
Based Index (HBI)
Histogram Based Index (HBI)
VCI/VDCI don’t consider the ordering among
values
Range queries implies this need
HBI groups adjacent values in the same cluster
Also need to ensure the accuracy
Use threshold to determine the boundary of a cluster
Threshold: average number of false positives in a cluster
HBI Example
data source
Threshold: 2
d1 d2 d3 d4 d5 d6 d7 d8
value
1
2
3
4
5
6
X X X
X
X X X
X
X
X X
X X
X
X
X
X
Cluster adjacent values
Cluster 1: {1}
Cluster 2: {2,3,4}
Cluster 3: {5,6}
Outline
Query model
Index Structures
Experimental Evaluation
Related work and conclusion
Experimental setup
Synthetic
data
1000
data sources, 100,000 values, 4,000,000
(value,data source) pairs
Other
parameters are in the paper
Metrics
Index
creation time
Compression factor
False positives
Setup
2.8GHz Pentium IV, 1GB memory, 80GB disk
C++
Index creation time
Index structure
Time(min)
UI
0.25
VCI
15
DCI
3
VDCI
180
HBI
2.5
Equality queries (1000 data sources)
VCI
DCI
VDCI
HBI
false positives
50
40
30
20
10
0
0
5
10
compression factor
15
20
Range Queries (1000 data sources)
VCI
DCI
VDCI
HBI
false positives
50
45
40
35
30
25
20
15
10
5
0
0
5
10
compression factor
15
20
Outline
Query model
Index Structures
Experimental Evaluation
Related work and conclusion
Related work
Distributed database & information integration
Niagara system [Naughton01]
GlOSS [Gravano99]
…
Database/Inverted list compression
Query Optimization in Compressed Databases [Chen 01]
Compressing the Relations and Index [Goldstein 98]
Improved Query Performance with Variant Indices [O’Neill 97]
Implementation and Performance of Compressed Databases
[Westmann 00]
Size Reduction of Inverted Files [Weiss 90]
…
Conclusion
Space-efficient index structures for querying the
deep web
Support equality and range queries
A factor of 10 compression with a little loss in
precision
Future work
Combine cluster-based and histogram-based
Multiple attributes queries
Joins
Incremental index maintenance
Questions?
Experimental setup
Other parameters:
Number of
groups
The data sources in the same group use same distribution to generate the
values
Default 20
Group
mode
How many groups a data source belongs to
Default 1
Value
correlation
How the orders in the value space maps to the value ordering over which
Gaussian distribution is used.
Default 0.2