The Researcher’s Guide to the Data Deluge:
Querying a Scientific Database
in Just a Few Seconds
Martin L. Kersten
Stratos Idreos
Stefan Manegold
Erietta Liarou
(and members of the CWI database group)
Science Feb’11 Data
http://www.sciencemag.org/site/special/data/
Science Feb’11 Data
…. We have recently passed the point where
more data is being collected than we can
physically store. This storage gap will widen
rapidly in data-intensive fields. Thus, decisions
will be needed on which data to archive and which to
discard. A separate problem is how to access and use
these data. Many data sets are becoming too large to
download. Even fields with well-established data
archives, such as genomics, are facing new and growing
challenges in data volume and management. And even
where accessible, much data in many fields is too
poorly organized to enable it to be efficiently used….
Database research vision
• Throwing away data before harvesting is the worst
ROI one can imagine.
• LSST budget is 100 M$
– During its ten-year survey, LSST will acquire 5.6
million 15-second images, spread over 2.8 million
pointings.
– 20 billion rows in the Object table, 3 trillion rows
in the Source table
Database technology is not designed for these challenges
All sizes don’t fit
The Dawn of a new Database Era
Capture the query intent!
FIVE STEPS INTO THE FUTURE
• One-minute DBMS for real-time performance.
• Multi-scale query processing for gradual exploration.
• Post-processing for conveying meaningful data.
• Query morphing to steer towards relevant, nearby results.
• Query alternatives to cope with imprecise questions.
One-minute database kernels
Step 1: Do the BEST you can within a given time frame!
• Research how to …
– organize query evaluation around what is available
at low cost
– redesign algorithms and operators such that they
adaptively avoid expensive steps normally needed
for correctness and completeness
– stop processing after the agreed-upon time
– ensure continuation upon request.
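A genuine one-minute kernel hands back the best partial answer it has when the clock runs out. With stock SQL one can only approximate that behaviour; the sketch below (PostgreSQL-style syntax, column names borrowed from the SDSS PhotoObj example later in this deck) combines a hard deadline with a cheap sample so that a useful answer is ready well inside the budget.

-- A rough stand-in for a one-minute kernel with today's SQL.
-- A real one-minute kernel would return its partial result at the
-- deadline instead of simply aborting the statement.
SET statement_timeout = '60s';              -- hard one-minute budget

-- Evaluate over a cheap 1% sample first: low-cost data that is readily at hand.
SELECT type,
       COUNT(*)        AS n_objects,
       AVG(intensity1) AS avg_intensity
FROM PhotoObj TABLESAMPLE SYSTEM (1)
GROUP BY type;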
Multi-scale query processing
Step 2: Use a staging scheme for query evaluation!
• Research how to …
– partition the database to produce incrementally
valuable results
D => D1 union (D2.1 union (D2.2 union (D2.3 union ...)))
– avoid harmful SELECT * FROM table queries
– break a query into a converging query sequence
Q => Q1 union Q2
=> Q1 union Q2.1 union Q2.2
=> Q1 union Q2.1 union Q2.2.1 union Q2.2.2 => ...
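As an illustration of such a converging sequence (the split points on ra are invented), a sky-region query can be answered partition by partition; every later query only adds a partition, so earlier answers remain valid on screen.

-- The query the scientist posed:
SELECT ra, dec, type FROM PhotoObj
WHERE dec BETWEEN 80 AND 82;

-- Q1: answer the cheapest partition first (hypothetical split on ra):
SELECT ra, dec, type FROM PhotoObj
WHERE dec BETWEEN 80 AND 82 AND ra < 90;

-- Q2.1, Q2.2, ...: union in further partitions only while the user keeps exploring:
SELECT ra, dec, type FROM PhotoObj
WHERE dec BETWEEN 80 AND 82 AND ra >= 90 AND ra < 180;

-- The sequence converges to the complete answer, but may stop at any point.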
Result-set post processing
Step 3: Use meaningful compression to convey more!
• Research how to …
– post-process result sets statistically
– prepare for faceted query answers
– show the result boundaries first
• Min/max domain enclosures for all attributes
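A minimal sketch of such a compressed answer, reusing the PhotoObj attributes from the SDSS example later in the deck: instead of shipping millions of tuples, return per-facet counts plus min/max enclosures for each attribute of interest.

SELECT type,                                 -- facet
       COUNT(*)        AS n_objects,
       MIN(ra)         AS ra_min,  MAX(ra)         AS ra_max,
       MIN(dec)        AS dec_min, MAX(dec)        AS dec_max,
       MIN(intensity1) AS int_min, MAX(intensity1) AS int_max
FROM PhotoObj
GROUP BY type
ORDER BY n_objects DESC;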
Query morphing
Step 4: Bend the search towards interesting areas!
• Research how to …
– explore the query expression space
– transform a query with a small result set such that it
produces relevant, nearby answers
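A sketch of the idea (the coordinates and the widening factor are invented): when the box the user asked for returns next to nothing, the system morphs the predicates outward around the same centre until a relevant, nearby answer of useful size appears.

-- The query as posed: an (almost) empty box.
SELECT ra, dec, type FROM PhotoObj
WHERE ra  BETWEEN 53.20 AND 53.21
  AND dec BETWEEN 81.00 AND 81.01;

-- The morphed query: the same box, widened around its centre.
SELECT ra, dec, type FROM PhotoObj
WHERE ra  BETWEEN 53.1 AND 53.3
  AND dec BETWEEN 80.9 AND 81.1;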
Query alternatives
Step 5: Ignore stupid questions, give hints instead!
• Research how to …
– find alternative queries in terms of expressiveness
and performance
– better exploit the query log for hints
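One possible source of hints is the query log itself. The sketch below assumes a hypothetical query_log table (its name and columns are not part of SDSS) and mines it for popular queries on the same table that returned answers of human-digestible size.

-- Hypothetical log: query_log(table_name, query_text, result_size, runtime_ms)
SELECT query_text,
       COUNT(*)        AS popularity,
       AVG(runtime_ms) AS avg_runtime_ms
FROM query_log
WHERE table_name = 'PhotoObj'
  AND result_size BETWEEN 100 AND 100000    -- skip empty and overwhelming answers
GROUP BY query_text
ORDER BY popularity DESC
LIMIT 5;

The SDSS PhotoObj example below shows three such system-generated alternatives, starting from the naive query: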
SELECT *
FROM PhotoObj;
-- Q1: Using the time budget. (36291322 tuples)
SELECT ra, dec, band1, intensity1, type
FROM PhotoObj;
-- Q2: Using data statistics. (879300 tuples)
SELECT * FROM PhotoObj
WHERE ra BETWEEN 53 AND 54
AND dec BETWEEN 80 AND 82;
-- Q3: Using query statistics. (899 tuples)
SELECT * FROM PhotoObj
WHERE ra BETWEEN 53 AND 54
AND dec BETWEEN 80 AND 82
AND distance(ra,dec,radius) < 10;
The Dawn of a new Database Era
Brought to you by the CWI database research group