Query-Based Data Pricing - Computer Sciences User Pages

Download Report

Transcript Query-Based Data Pricing - Computer Sciences User Pages

TOWARD PRACTICAL QUERY
PRICING WITH QUERYMARKET
Paraschos Koutris
Prasang Upadhyaya
Magdalena Balazinska
Bill Howe
Dan Suciu
University of Washington
SIGMOD 2013
MOTIVATION
• Data is increasingly sold and bought on the web
• Websites that sell data:
– Xignite (financial)
– Gnip (social)
• Data marketplace services:
–
–
–
–
Windows Azure Marketplace
Infochimps
Factual
DataMarket
2
A PRICING SCENARIO (1)
English-German dictionary T
english
german
thanks
Danke
car
Auto
day
Tag
road
Strasse
Road
Weg
…
…
PRICING SCHEMES
Sell the whole table T for a fixed price
• Q: translate only the word “thanks”
• The user pays for redundant information
Price per output tuple
• Q: Does the word “thanks” translate to “Auto” ?
• An empty result still carries information
3
A PRICING SCENARIO (2)
English-German dictionary T
english
german
thanks
Danke
car
Auto
day
Tag
road
Strasse
Road
Weg
…
…
Word Frequency Stats UF
word
frequency genre
rank
rock
0.025
music
20
pop
0.030
music
10
database
0.001
science
1453
…
…
..
…
Q1: Return all translations to German of
top 10 words in the genre “music”
Q2: Return all translations to German of
top 20 words in the genre “music”
• Current systems do not sell queries that combine datasets
• Queries issued by a user may have overlapping content
4
HOW TO PRICE DATA
English-German dictionary T
english
german
p(σT.english=‘thanks’)=$0.1
thanks
Danke
p(σT.english=‘car’)=$0.1
car
Auto
p(σT.english=‘day’)=$0.1
day
Tag
road
Strasse
p(σT.english=‘road’)=$0.15
road
Weg
p(σT.english=‘cat’)=$0.05
…
…
p(σT.german=‘Auto’)=$0.5
…
Price points
• selection queries on single table
• exhaust the possible values (ColA) of some attribute A
• may select on values not in the active domain
5
QUERYMARKET: CONTRIBUTIONS
• A formal pricing framework where:
– sellers specify a set of price points as selection queries
– buyers can purchase any query on the database
– the system automatically computes the price of the query
• Support efficient computation of prices for a large class of
SQL queries
• Support the necessary functionality for a marketplace:
– Pricing queries with overlapping information content
– Database updates
– Revenue sharing among different sellers?
6
OUTLINE
1.
2.
3.
4.
The Pricing Framework
Computing the Price
Query History
Revenue Sharing
7
THE PRICING FRAMEWORK
[Koutris et al., PODS 2012]
• The seller defines price points (view-price pairs):
S = { (V1,p1), (V2,p2), … }
• A buyer can buy any query Q
• The system will compute priceDS(Q)
Buyer Q(D) ?
Seller
Price points
Pricing System
+
Database D
priceDS(Q)
8
PROPERTIES OF PRICES
We say that the views V1,…, Vk determine Q if one can compute
Q(D) from V1(D),…, Vk(D) without access to D
Arbitrage-free: Given D, priceD(Q) is arbitrage-free if for all
views V1, …, Vk that determine Q:
priceD(Q) ≤ priceD(V1) + … + priceD(Vk)
Discount-free: priceD(Q) must not offer additional discounts
except for the explicit price points defined by the seller
9
THE PRICING FORMULA
Arbitrage-Price:
• The price of the cheapest set of views from price points
S that determine the query Q
• unique + arbitrage-free + discount-free + agrees with
price points
Table R
price = $1
A
a1
Table S
price = $2
A
B
a1
b
a2
b
ColA = { a1, a2, a3 }
ColB = { b }
Q(y) = R(x),S(x,y)
price = $3
• {σ[R.A=a1], σ[S.B=b] } determines Q
• cost = 1 + 3 = 4
• {σ[R.A=a1], σ[S.A=a1] } also determines Q
• cost = 1 + 2 = 3 (cheapest possible)
10
OUTLINE
1.
2.
3.
4.
The Pricing Framework
Computing the Price
Query History
Revenue Sharing
11
COMPUTING THE PRICE
• The problem of computing the arbitrage price even for
SELECT-PROJECT-JOIN queries is coNP-complete
• For some queries, the price can be computed fast:
• Selections, joins w/o projection
• We describe pricing as an Integer Linear Program (ILP)
and then use fast ILP solvers (e.g. GLPK, CPLEX)
• Classes of queries supported:
•
•
•
•
Selections/Projections/Joins
Unions
User-Defined Functions (UDF)
Bundles of queries
12
ILP CONSTRUCTION (1)
Table R
A
Table S
a1
price = $1
price = $2
A
B
a1
b
a2
b
ColA = { a1, a2, a3 }
ColB = { b }
price = $3
• Price the query Q(x,y) = R(x), S(x,y)
• Introduce a {0/1} variable x[attribute,value] for each
price point:
x[R.A, a2], x[S.A, a1], x[S.B, b], …
13
ILP CONSTRUCTION (2)
Table R
A
Table S
a1
A
B
a1
b
a2
b
ColA = { a1, a2, a3 }
ColB = { b }
Q(x,y) = R(x), S(x,y)
• Minimize (independent of the query):
price = x[R.A,a1] + x[R.A,a2] + x[R.A,a3] +2x[S.A,a1] +
2x[S.A,a2] + 2x[S.A,a3] +3x[S.B,b]
• Constraints:
• (a1,b) in Q:
x[R.A,a1]
≥1
x[S.A,a1] + x[S.B,b] ≥ 1
• (a2,b) not in Q: x[R.A,a2]
≥1
• (a3,b) not in Q: x[R.A,a3] + x[S.A,a3] + x[S.B,b] ≥ 1
14
ILP CONSTRUCTION (3)
Table R
Table S
A
B
a1
a1
b
a2
a2
b
A
ColA = { a1, a2, a3}
ColB = { b}
New variable for each
tuple in Qfull
• Projection: Q(y) = R(x), S(x,y)
• Constraints:
• (a1,b) in Qfull: x[R.A,a1]
• (a2,b) in Qfull: x[R.A,a2]
• (b) in Q
:
≥ z1
x[S.A,a1] + x[S.B,b] ≥ z1
≥ z2
x[S.A,a2] + x[S.B,b] ≥ z2
z 1 + z2 ≥ 1
15
QUERYMARKET SYSTEM
• Runs on top of any SQL database
• Information stored in the database:
– Price points are stored in the database in price tables
– Keeping track of price tables with an index table
• The dataset:
–
–
–
–
English-german translation:
English-french translation :
UDF to find hashtags
:
Word frequency stats
:
Ten,gr(w, w’)
Ten,fr(w, w’)
IsHashtag(w)
WF(w, genre, frequency, rank)
16
PRICE COMPUTATION (1)
• Small dataset where columns have size ~ 102
10
ILP solving time
ILP construction time
Time in seconds
8
6
4
2
0
1
2
3
4
selections
5
6
7
8
Query
9
10
11
12
13
14
15
3-way join
2-way joins
2-way joins
w/o projections with projections
17
PRICE COMPUTATION (2)
• Larger dataset where columns have size ~ 103
80
ILP solving time
70
ILP construction time
Time in seconds
60
50
40
30
20
10
0
1
2
3
4
selections
5
6
7
8
Query
9
10
11
12
13
14
15
3-way join
2-way joins
2-way joins
w/o projections with projections
18
OUTLINE
1.
2.
3.
4.
The Pricing Framework
Computing the Price
Query History
Revenue Sharing
19
QUERY HISTORY
A user asks a sequence of queries over time of varying
information overlap Q = Q1, Q2, …, Qk
Experiment with 30 selection/join queries
•
•
Oblivious pricing: each query
priced independently
18
16
price in dollars
14
High Overlap
12
Bundle pricing: each query Qi
priced p(Q1,…,Qi)- p(Q1,…,Qi-1)
10
8
6
4
2
0
1 2 3 4 5 6 7 8 9 101112131415161718192021222324252627282930
query
View pricing: when a query is purchased, the
purchased views are free for later queries
20
QUERY HISTORY (2)
25
price in dollars
20
Moderate Overlap
Oblivious pricing
15
View pricing
10
Bundle pricing
5
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
16
14
query
Weak Overlap
price in dollars
12
10
Oblivious pricing
8
Bundle pricing
6
View pricing
4
2
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
query
21
VIEW PRICING
•
View Pricing is our proposed strategy:
–
–
–
•
Computationally efficient
Low storage overhead
Close to optimal (bundle) price
View Pricing can be used for dynamic databases: if view V
is purchased at some point and then updated, the user
pays only an update price
22
OUTLINE
1.
2.
3.
4.
The Pricing Framework
Computing the Price
Query History
Revenue Sharing
23
REVENUE SHARING
• How is the revenue shared between sellers if several
datasets contribute to the answer?
• What if the cheapest set of views to determine a query is
not unique ?
• Example:
– Q(‘sigmod13’) = isHashtag(‘sigmod13’), isNoun(‘sigmod13’)
– Seller 1 prices $1 per entry for isHashtag, so does seller 2
– If both isHashtag, isNoun are false and each costs $1, purchasing
either of the entries answers Q
24
REVENUE SHARING: SOLUTION
• For a seller s, share(s, Q) is the maximum revenue of s over
all minimum-cost set of price points that determine Q
• share(s, Q) can be computed in our framework
• Solution: split price(Q) among sellers proportionally to
their shares
• Example:
– Both shares are $1
– The revenue of each seller will be $0.5, since their shares are equal
25
CONCLUSIONS
• QueryMarket: the first system that supports pricing a large
class of SQL queries within a formal framework
• We presented solutions to address the requirements of a
real-world marketplace
• Future work includes:
– Scaling the price computation (bucketization)
– Full SQL Support (aggregates, negation)
– Query answering under limited budget
26
Thank you !
27