Data Generation for Application-Specific Benchmarking
Y.C. Tay
National University of Singapore
Background
benchmarks help research and development
--- the dominant database benchmark is TPC
SIGMOD Conference 2011
research track: 87 papers, 17 use TPC (20%)
industry track: 14 papers, 6 use TPC (43%)
Problem :
a few TPC benchmarks
but many, many applications
TPC becoming irrelevant?
Vision
a paradigm shift in database benchmark development
from                                  to
top-down                              bottom-up
committee consensus                   community collaboration
domain-specific                       application-specific
package (data generator + queries)    tools (dataset scaling)
tools synthetically scale up/down application data
application already has queries
Challenge
Dataset Scaling Problem :
Given a set of relational tables D and a scale factor s,
generate a database state D’ that is similar to D but s times its size.
E.g. What would DBLP look like in 2020?
s>1
why: scalability testing
difficulty: copying doesn’t work (e.g. social network data)
s<1
why: application testing
difficulty: sampling not straightforward (similar to web crawling)
s=1
why: privacy/proprietary reasons
difficulty: encryption is risky
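A minimal sketch of why copying fails for s>1, assuming a toy friendship table. The function names and the data are illustrative and not taken from UpSizeR. Copying the table s times with fresh user ids gives s times as many rows, but also s communities that cannot reach one another, which is unlikely to resemble the same network grown to s times its size.

```python
from collections import defaultdict

def naive_copy_scale(edges, s):
    """Scale an edge list by an integer factor s by copying it s times.

    Each copy gets its own user ids (copy index, original id), as a
    naive generator might do to keep primary keys unique.
    """
    return [((k, u), (k, v)) for k in range(s) for (u, v) in edges]

def count_components(edges):
    """Count connected components of an undirected graph given as an edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:                      # depth-first traversal of one component
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return components

# Hypothetical friendship table D: one connected community of four users.
friends = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("dan", "ann")]
scaled = naive_copy_scale(friends, 3)     # candidate D' for s = 3

print(len(friends), len(scaled))          # row counts of D and D'
print(count_components(friends))          # communities in D
print(count_components(scaled))           # communities in D'
```

The row count scales as requested, yet any query that follows friendship links (friends of friends, shortest paths) would behave differently on the copied data.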
Challenge
Dataset Scaling Problem :
D’ must be similar to D, where similarity is judged by query results
difficulty: data correlation
E.g. database = {photos, owners, comments, tags}
kinds of correlation: inter-column; inter-row; inter-column + inter-row
examples:
• foreign keys
• photo dimensions (same camera)
• 2 users comment on each other’s photos (social network)
• age and gender
• user likely to comment on own photos
• gardener likely to tag photos of flowers
• tags used by gardener (“rose”, “bee”, “beetle”)
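Two of the correlations listed above can be phrased as queries, which is how similarity between D and a scaled D’ would be checked. The sketch below uses Python's standard sqlite3 module. The schema (photos, comments), the column names, and the rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photos   (photo_id INTEGER PRIMARY KEY, owner TEXT);
    CREATE TABLE comments (comment_id INTEGER PRIMARY KEY,
                           photo_id INTEGER REFERENCES photos(photo_id),
                           author TEXT);
    INSERT INTO photos VALUES (1, 'x'), (2, 'y'), (3, 'z');
    INSERT INTO comments VALUES
        (1, 1, 'x'),   -- x comments on own photo
        (2, 1, 'y'),   -- y comments on x's photo
        (3, 2, 'x'),   -- x comments on y's photo (reciprocates row 2)
        (4, 3, 'x');   -- x comments on z's photo (not reciprocated)
""")

# "user likely to comment on own photos": comments whose author owns the photo.
self_comments = conn.execute("""
    SELECT COUNT(*)
    FROM comments c JOIN photos p ON c.photo_id = p.photo_id
    WHERE c.author = p.owner
""").fetchone()[0]

# "2 users comment on each other's photos": unordered pairs of distinct users
# with a comment in both directions.
reciprocal_pairs = conn.execute("""
    WITH edges AS (
        SELECT DISTINCT c.author AS src, p.owner AS dst
        FROM comments c JOIN photos p ON c.photo_id = p.photo_id
        WHERE c.author <> p.owner
    )
    SELECT COUNT(*)
    FROM edges e1 JOIN edges e2
      ON e1.src = e2.dst AND e1.dst = e2.src
    WHERE e1.src < e1.dst
""").fetchone()[0]

print(self_comments, reciprocal_pairs)
```

A synthetic D’ that preserved only column-wise value distributions would typically return very different counts for both queries, since each depends on how rows in different tables relate to one another.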
Challenge
scaling a social network:
1. extract: D (empirical dataset) → G (empirical social graph)
   use join query
2. scale by s: G → G~ (synthetic social graph)
   use graph theory: #edges? #triangles? path lengths?
3. inject: G~ → D~ (synthetic dataset)
   any database theory?
E.g. how to inject into D~
* correlation from G~ indicating X and Y comment on each other’s photos
* correlation between Alice’s birthday and wall posts by her classmates
* correlation among tags used by bird watchers
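The extract step and the graph properties asked about above (#edges, #triangles, path lengths) can be sketched as follows. This is an illustration under assumed table layouts (photos as (photo_id, owner) and comments as (photo_id, author)), with hypothetical function names. It covers only extraction and measurement. The scale and inject steps are the open problems the slide poses and are not attempted here.

```python
from collections import defaultdict, deque
from itertools import combinations

def extract_graph(photos, comments):
    """Join comments with photos on photo_id and link commenter to photo owner."""
    owner_of = dict(photos)                   # photo_id -> owner
    adj = defaultdict(set)
    for photo_id, author in comments:
        owner = owner_of[photo_id]
        if author != owner:                   # skip comments on one's own photos
            adj[author].add(owner)
            adj[owner].add(author)
    return adj

def graph_stats(adj):
    """Return (#edges, #triangles, average shortest-path length) of an undirected graph."""
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    # Each triangle is found once from each of its three vertices.
    triangles = sum(
        1 for u in adj for v, w in combinations(adj[u], 2) if w in adj[v]
    ) // 3
    total, pairs = 0, 0
    for src in adj:                           # breadth-first search from every node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1                # reachable nodes other than src
    avg_path = total / pairs if pairs else 0.0
    return edges, triangles, avg_path

# Hypothetical empirical dataset D.
photos = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
comments = [(1, "b"), (2, "c"), (3, "a"), (4, "a")]

G = extract_graph(photos, comments)
n_edges, n_triangles, avg_path = graph_stats(G)
print(n_edges, n_triangles, round(avg_path, 3))
```

These statistics are candidates for what a synthetic graph G~ should preserve or grow in a controlled way when G is scaled by s.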
Challenge
* online social networks are here to stay
* their datasets can be huge
* their datasets have commercial value
where is the database theory?
Attribute Value Correlation Problem for Social Networks :
Suppose a dataset D records data from a social network.
How do the social interactions affect the correlation
among attribute values in D ?
Vision (for the next 25 years):
a paradigm shift from a top-down design of domain-specific
benchmarks by committee consensus to a bottom-up collaborative
development of tools for application-specific dataset scaling
Challenges:
• Dataset Scaling Problem
• Attribute Value Correlation Problem for Social Networks
Payoff:
• commercial value in dataset scaling tools
• new database research areas (social network data, schema design,
vertical/horizontal partition, query optimization, business intelligence, …)
Start:
UpSizeR (http://www.comp.nus.edu.sg/~upsizer)
• single-server version
• Hadoop version