Transcript ppt

Index Structures for Querying the
Deep Web
Jian Qiu, Feng Shao, Jayavel Shanmugasundaram
Cornell Universersity
Misha Zatsman
Google
Deep Web
Keyword queries
Static web pages
Surface web
Deep Web
Keyword queries
Static web pages
Surface web
400-500
times the
size of
surface
web!
www.ebay.com
Deep web
Ebay
databases
CNN
databases
Cars.com
databases
…
…
Amazon
databases
Deep Glue
Query results
Structured queries
400-500
times the
size of
surface
web!
Deep web
Ebay
database
CNN
databases
Cars.com
database
…
Amazon
database
Deep Glue System
Find textbooks
with price<$50
Our
focus
Indexer
Index
structures
Query
Query Engine
Superset of relevant
data sources
Internet
Half.com
databases
Database Concepts @
half.com…
…
Index structure for deep web: Challenges

Deal with structured data
Underlying
databases are structured
Surface web typically unstructured
Deal
with large volumes
Orders
of magnitude larger than the
size of surface web
Our approach

Understand the structure/typing of the data


Support equality and range queries
Heavily compress the index

Achieve a factor of 10 compression
Tradeoff between compression factor and
the number of false positives

Compression factor 10 with only ~10 false
positives for 1000 data sources.

Outline
Query model
 Index Structures
 Experimental Evaluation
 Related work and conclusion

Assumptions

Data sources are classified into domains

Online car dealers, online auctions, online travel agents, …

Data sources in the same domain use same logical
relational schema

Indexing attributes



Price, date, make, model, isbn,…
Indexed by Deep Glue system
Indexing data can be obtained via


Crawling the deep web [Raghavan 01 ]
Previously agreed-upon protocols [Froogle]
Query Model

Support equality and range queries
 currently
on a single indexing attribute
Schema: Car(Id,Make,Model,Year,Price)
 Queries:

 Find
all year 2003 cars, year = 2003
 Find all cars that cost less than $1000,
price < 1000
Outline
Query model
 Index Structures
 Experimental Evaluation
 Related work and conclusion

Overview

Uncompressed Index

Compressed Index, still support equality
and range queries
 Value
Clustered Index (VCI)
 DataSource Clustered Index (DCI)
 Value DataSource Clustered Index (VDCI)
 Histogram Based Index (HBI)
Uncompressed Index (UI)
For each distinct value v for an indexing attribute, stores the
list of data sources
data source
UI:
value
d d d d d d d d
1 2 3 4 5 6 7 8
value
data sources

1
X X X
2
X X X
X
3
X
4 X
X X
5
X X
X
6 X
X X
d1: ebay.com , d2: amazon.com …
B+tree
1
2
3
4
5
6
d6,d7,d8
d2,d3,d4,d6
d4
d1,d4,d5
d2,d3,d6
d1,d7,d8
Problems

A huge number of values and data
sources in deep web !!
 Indexing
space

every indexing attribute requires
Need to compress UI !
 Use
gzip?
 Have
to uncompress the index  index lookup too
expensive!
 Need
new compression techniques
Overview

Uncompressed Index

Compressed Index, still support equality
and range queries
 Value
Clustered Index (VCI)
 DataSource Clustered Index (DCI)
 Value DataSource Clustered Index (VDCI)
 Histogram Based Index (HBI)
Value Clustered Index (VCI)

Intuition:
 “closely
related” values are stored in “closely
related” data sources
 ISBN
numbers of antique books in the online book
retailers specializing in antique books.

Cluster “closely related” values

Stores the list of data sources only for
each cluster
VCI Example
data source
value
d d d d d d d d
1 2 3 4 5 6 7 8
1
XXX
2 XXX X
3
X
4X
XX
5 XX
X
6X
XXX
value Cluster
id
1
2
3
4
5
6
B+tree
False
c1
c2
c3
c3
c2
c1
Cluster 2: { 2, 5}
Cluster 3: { 3, 4}
Cluster
id
data sources
c1
d1,d6,d7,d8
c2
d2,d3,d4,d6
c3
d1,d4,d5
positives
value
1  data source d1
Tradeoff between
Cluster 1: { 1, 6}
Union
VCI structures:
Mapping
Mapping
space and accuracy
all values in one cluster
each distinct value into a
separate cluster
VCI Implementation

Use existing scalable algorithm


Scales to large data sets: Birch Framework [Zhang96]
Minimize the number of false positives

Specify the parameters for Birch
Centroid, the mid-point of a cluster
 Radius, a measure of quality for
Centroid


cluster1
a cluster
Distance between clusters
Radius
Distance
cluster2
VCI formulae


For a cluster having the set of values V
ds(v): the set of data sources for value v
 ds(v)
centroid(V) =
vV

radius(V) =
 centroid (V )  ds(v)
Data sources associated with
the cluster
Sum of number of
false positives
vV
V

distance(V1, V2)

Additional number of false positives when merging two clusters

V 1  centroid (V 2)  centroid (V 1)  V 2  centroid (V 1)  centroid (V 2)
Overview

Uncompressed Index

Compressed Index:
 Value
Clustered Index (VCI)
 DataSource
Clustered Index (DCI)
 Value DataSource Clustered Index (VDCI)
 Histogram Based Index (HBI)
DataSource Clustered Index (DCI)

Intuition: “closely related” data sources may have “closely related”
sets of values

Amazon and b&n has similar sets of ISBN numbers
In the data graph, VCI clusters rows and DCI clusters columns
data source
Cluster 1: {d2,d3,d6}
d d d d d d d d
value
Cluster 2: { d4, d5}
1 2 3 4 5 6 7 8

1
X X X
2
X X X
X
3
X X X
X
4 X
X X
5
X X
X
6 X
X X
Cluster 3: { d1, d7, d8}

Table structures are
similar to VCI.

See paper for other
details
Overview

Uncompressed Index

Compressed Index:
 Value
Clustered Index (VCI)
 DataSource Clustered Index (DCI)
 Value
DataSource Clustered Index (VDCI)
 Histogram Based Index (HBI)
Value-DataSource Clustered Index
(VDCI)



VCI, DCI: clusters in 1 dimension
VDCI: clusters in 2 dimensions, generalizes VCI/DCI
Cluster: a set of values and a set of data sources
data source
value
d
1
d
2
d
3
d
4
d
5
d
6
d
7
d
8
1
X X X
2
X X X
X X X
3
X X X
X
4 X
X X X
5
X X X X X
6 X
X X
Cluster
1:{ {2,3}, {d2,d3,d4}}
Cluster
2:{ {4,5}, {d4,d5,d6} }
Cluster
3:{ {1,2}, {d6,d7,d8} }
Data
source d4 is in two clusters
Value
2 is in two clusters
Table
structures are
similar to VCI.
See
paper for other
details
Overview

Uncompressed Index

Compressed Index:
 Value
Clustered Index (VCI)
 DataSource Clustered Index (DCI)
 Value DataSource Clustered Index (VDCI)
 Histogram
Based Index (HBI)
Histogram Based Index (HBI)

VCI/VDCI don’t consider the ordering among
values

Range queries implies this need

HBI groups adjacent values in the same cluster

Also need to ensure the accuracy

Use threshold to determine the boundary of a cluster

Threshold: average number of false positives in a cluster
HBI Example
data source
Threshold: 2
d1 d2 d3 d4 d5 d6 d7 d8
value
1
2
3
4
5
6
X X X
X
X X X
X
X
X X
X X
X
X
X
X
Cluster adjacent values
Cluster 1: {1}
Cluster 2: {2,3,4}
Cluster 3: {5,6}
Outline
Query model
 Index Structures
 Experimental Evaluation
 Related work and conclusion

Experimental setup
Synthetic
data
1000
data sources, 100,000 values, 4,000,000
(value,data source) pairs
Other
parameters are in the paper
Metrics
Index
creation time
Compression factor
False positives

Setup


2.8GHz Pentium IV, 1GB memory, 80GB disk
C++
Index creation time
Index structure
Time(min)
UI
0.25
VCI
15
DCI
3
VDCI
180
HBI
2.5
Equality queries (1000 data sources)
VCI
DCI
VDCI
HBI
false positives
50
40
30
20
10
0
0
5
10
compression factor
15
20
Range Queries (1000 data sources)
VCI
DCI
VDCI
HBI
false positives
50
45
40
35
30
25
20
15
10
5
0
0
5
10
compression factor
15
20
Outline
Query model
 Index Structures
 Experimental Evaluation
 Related work and conclusion

Related work

Distributed database & information integration




Niagara system [Naughton01]
GlOSS [Gravano99]
…
Database/Inverted list compression






Query Optimization in Compressed Databases [Chen 01]
Compressing the Relations and Index [Goldstein 98]
Improved Query Performance with Variant Indices [O’Neill 97]
Implementation and Performance of Compressed Databases
[Westmann 00]
Size Reduction of Inverted Files [Weiss 90]
…
Conclusion




Space-efficient index structures for querying the
deep web
Support equality and range queries
A factor of 10 compression with a little loss in
precision
Future work




Combine cluster-based and histogram-based
Multiple attributes queries
Joins
Incremental index maintenance
Questions?
Experimental setup
Other parameters:
Number of
groups
 The data sources in the same group use same distribution to generate the
values

Default 20
Group
mode
 How many groups a data source belongs to

Default 1
Value
correlation
 How the orders in the value space maps to the value ordering over which
Gaussian distribution is used.

Default 0.2