The Third Extended File System with Copy-on-Write

Download Report

Transcript The Third Extended File System with Copy-on-Write

Processing Data Intensive Queries
in Scientific Database Federations
Xiaodan Wang
Department of Computer Science
Johns Hopkins University
Problem

Data avalanche in scientific databases
–
–

Exploring massive, widely distributed data
–
–
–

Exponential growth in data size (Pan-STARRS)
Accumulation of data at multiple data sources (clustered and
federated databases)
Joins to find correlations across multiple databases
Queries are data intensive: large transfers over the network,
and scan large portions of the data
Query throughput limits scale of exploration
To improve overall query throughput but potentially
sacrifice performance of individual queries
Processing Data Intensive Queries in Scientific Database Federations
Target Application

SkyQuery Federation of Astronomy Databases
–
–
–
–

Dozens of multi-terabyte databases across three Continents
Queries that perform full db scans lasting hours or days
Intermediate join results that are hundreds of MBs
Scalability concerns both in data size and number of sites
Cross-match
–
–
Probabilistic spatial join across multiple databases
Join results are accumulated, shipped from site to site, and
delivered to scientists
Processing Data Intensive Queries in Scientific Database Federations
Cross-Match Workload


A forward looking analysis
shows that network
dominates 90% of
performance
A quarter of the crossmatch queries execute for
minutes to several hours
Processing Data Intensive Queries in Scientific Database Federations
Incorporating Network Structure
Processing Data Intensive Queries in Scientific Database Federations
Network-Aware Join Processing

Capture heterogeneity in global-scale
federations
–
–
–
–
–

Metric to exploit high throughput paths
Decentralized, local optimizations using aggregate stats
Routing at the application layer
Two-approximate, MST-based solution with extensions
that employ semi-joins and explore bushy plans
Clustering to explore trade-offs with computation cost
Over a ten-fold reduction in network utilization
for large joins
Processing Data Intensive Queries in Scientific Database Federations
A Case for Batch Processing


Top ten buckets accessed
by 61% of queries and
reuse occur close
temporally
2% of buckets capture more
than half of the workload
and should be cached
Processing Data Intensive Queries in Scientific Database Federations
LifeRaft: Data-Driven Batch Proc.


Eliminate redundant I/O
to improve query
throughput
Batch queries with that
exhibit data sharing
–
–
–
–
Pre-process queries to
identify data sharing
Co-schedule queries that
access the same data
Access contentious data
first to maximize sharing
Improves performance by
two-fold
Processing Data Intensive Queries in Scientific Database Federations
Discussion

Cache replacement for LifeRaft
–
–

Buffering and workload overflow
–
–

Benefits contentions data regions that experience reuse
(Cache hit for LifeRaft is 40% compared with 7% for arrival
order processing)
Evaluate strategies that exploit I/O behavior of batch
workloads (segmented strategy)
Large intermediate join results
Migrate pairs of workload and bucket
Better support for interactive queries
–
–
Short and selective queries that focus on small region
Indefinite queuing times in presence of batch workloads
Processing Data Intensive Queries in Scientific Database Federations
Discussion (cont.)

Batch processing in a distributed environment
–
–

Network-aware scheduling does not consider
computation cost
Batch processing for a single system environment
Federating LifeRaft
–
–
–
Coordinate exec. of query that join multiple DBs
Batch proc. requires databases to buffer results
Maximize overall batch size while alleviating
memory used for buffering and network cost
Processing Data Intensive Queries in Scientific Database Federations
Exploring Alt. Join Schedules
Processing Data Intensive Queries in Scientific Database Federations
Discussion (cont.)

Explore both join schedules and opportunities for
batching simultaneously
–
–
–


Bushy and semi-join plans increase computation while
clustering decrease computation
Skew in join workload (ie. sites close to end user)
Quantify trade-offs with computation cost (ie. number of
buckets in batch processing)
Users submit cross-match queries in batches
Applying LifeRaft to other data-intensive, temporalspatial data such as Turbulence database
Processing Data Intensive Queries in Scientific Database Federations
Supplementary Slides
Processing Data Intensive Queries in Scientific Database Federations
Cross-Match Queries

Join by increasing cardinality (count *)
–
–
Minimal I/O
Fewer bytes on the network
Mediator
Query
Probe Query
Result
Result
Count: 800
Result
Count: 100
Processing Data Intensive Queries in Scientific Database Federations
Count: 30
Spanning Tree Approximation
(STA)
G
H
A
F
B
E
C
D
Processing Data Intensive Queries in Scientific Database Federations
STA: Find MST
G
H
A
F
B
E
C
D
Processing Data Intensive Queries in Scientific Database Federations
STA: Join Using Paths on the MST
7
G
A
4
5
6
H
9
1
F
10
8
B
E
2
3
C
13
12
11
D
Processing Data Intensive Queries in Scientific Database Federations
Filter and refine

Partition data into buckets
Processing Data Intensive Queries in Scientific Database Federations
Scheduling Behavior
Qi
B1
Qj
B2
B3
B4
Sub-divide queries by bucket:
Qi – Qi1, Qi2, Qi3
Qj – Qj3, Qj4, Qj5, Qj6 , Qj7, Qj8
Qj – Qj5, Qj6 , Qj7, Qj8
Qk
B5
B6
B7
B8
Assumptions:
- Inter-query time of 1 sec
- I/O for each bucket of 1
sec
- Cache size of 2
- Join cost is negligible
Processing Data Intensive Queries in Scientific Database Federations
Arrival order with no sharing
Qi
B1
Qj
B2
B3
B4
Qk
B5
B6
Qi Arr Qj Arr Qk Arr Qi End
B7
B8
Qj End
Qk End
Qi1
Qi2 Qi3
Qj1 Qj3 Qj4 Qj6 Qj7 Qj8 Qk1 Qk4
B1
B2
B1
B3
Completion Times:
Qi – 3 sec
Qj – 8 sec
B3
B4
B6
Qk – 13 sec
B7
B8
B1
Avg – 8 sec
Processing Data Intensive Queries in Scientific Database Federations
B4
Qk8
… B8
Tp – .2 qry/sec
Age based scheduling (bias 1)
Qi
B1
Qj
B2
Qi Arr Qj Arr Qk Arr
Qi1 Qi2
B1
B2
B3
B4
Qk
B5
B6
B7
B8
Qj End
Qk End
Qi End
Qi3Qj3 Qj1Qk1 Qj4Qk4 Qj6Qk6 Qj7Qk7 Qj8Qk8 Qi5
B3
B1
Completion Times:
Qi – 3 sec
Qj – 7 sec
B4
Qk – 7 sec
B6
B7
B8
B5
Avg – 5.6 sec Tp – .33 qry/sec
Processing Data Intensive Queries in Scientific Database Federations
Contention based scheduling
(bias 0)
Qi
B1
Qi Arr Qj Arr
Qj
B2
B3
Qk
B4
B5
B6
B7
B8
Qj End
Qk Arr
Qi End
Qk End
Qi1 Qi3Qj3 Qj1Qk1Qj4Qk4 Qj6Qk6 Qj7Qk7 Qj8Qk8 Qi2
Qk5
B1
B5
B3
B1 B4
Completion Times:
Qi – 7 sec
Qj – 5 sec
B6
Qk – 6 sec
B7
B8
B2
Avg – 6 sec Tp – .38 qry/sec
(5.6)
(.33)
Processing Data Intensive Queries in Scientific Database Federations
Parameter tuning using
trade-off curves
Processing Data Intensive Queries in Scientific Database Federations
Tuning the
age bias


Throughput performance
gap grows while
response time gap is
insensitive to saturation
Increasing age bias is
more attractive at low
saturation
Processing Data Intensive Queries in Scientific Database Federations