DMG 2007
OLAP Query Processing in Grids
Nelson Kotowski
Federal University of Rio de Janeiro, Brazil
Alexandre A. B. Lima
University of Grande Rio, Brazil
Esther Pacitti, Patrick Valduriez
INRIA and University of Nantes, France
Marta Mattoso
Federal University of Rio de Janeiro, Brazil
Agenda
• OLAP in Grids
• Database clusters
• GParGRES
• Preliminary experimental results
• Conclusion
2
OLAP using Grids
• Problem
  - How can OLAP needs be fulfilled within the current grid software infrastructure?
    - Grid services?
    - Adapting database cluster techniques to grids?
[Figure: Grid. Figure thanks to Peter Kacsuk and Gergely Sipos]
3
Using Database Clusters in Grids
[Figure labels: Clients, Middleware, PC Cluster with one DBMS per node]
• A sequential “black-box” DBMS runs at each node
• It is based on database replication
• The middleware coordinates parallel query execution
• Applications and databases are easily migrated from sequential environments
• Both inter- and intra-query parallelism can be exploited
4
Inter-query Parallelism
• Improves overall system throughput
• Good for OLTP applications
• Not adequate for OLAP
[Figure labels: queries Q1, Q2, Q3 and Q4 distributed over four DBMS nodes, Node 1 to Node 4]
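As a rough illustration of the idea above, the sketch below assigns each whole query to exactly one replica. The round-robin policy and the names `assign_queries`, `nodes` and `queries` are assumptions made for this example; the slides do not say how queries are assigned to nodes.

```python
from itertools import cycle


def assign_queries(queries, nodes):
    """Assign each whole query to a single replica, in round-robin order.

    Hypothetical helper: every node holds a full copy of the database,
    so any node can answer any query on its own.
    """
    assignment = {node: [] for node in nodes}
    for query, node in zip(queries, cycle(nodes)):
        assignment[node].append(query)
    return assignment


nodes = ["Node 1", "Node 2", "Node 3", "Node 4"]
queries = ["Q1", "Q2", "Q3", "Q4", "Q5"]
plan = assign_queries(queries, nodes)
```

Each query still runs sequentially on one node, which is why this form of parallelism raises throughput but does not shorten a single heavy OLAP query.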
5
Intra-query Parallelism
• Reduces individual query execution time
• Required for high-performance OLAP
[Figure labels: Virtual Partitioning; query Q1 split into subqueries Q11, Q12, Q13 and Q14 over four DBMS nodes, Node 1 to Node 4; Q2, Q3 and Q4 also shown]
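A minimal sketch of how virtual partitioning might derive one key range per subquery, assuming the partitioning attribute is an integer key with known minimum and maximum values. The function name `virtual_partitions` and the even-split policy are assumptions for illustration, not the ParGRES implementation.

```python
def virtual_partitions(min_key, max_key, n_parts):
    """Split the inclusive key interval [min_key, max_key] into n_parts
    contiguous half-open ranges (lo, hi), each meaning lo <= key < hi."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    total = max_key - min_key + 1
    base, extra = divmod(total, n_parts)
    ranges = []
    lo = min_key
    for i in range(n_parts):
        # The first `extra` ranges take one additional key each.
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


ranges = virtual_partitions(1, 100, 4)
```

Each `(lo, hi)` pair would then be bound to a range predicate on the key, so every node scans a different slice of the same replicated table.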
6
ParGRES
• Database cluster middleware developed by our research group
• Optimized for OLAP support
• Provides inter- and intra-query parallelism
• Offers high performance for heavy-weight query processing over large databases
  - using inexpensive components
  - in a non-intrusive way
    - making no changes to database applications
    - keeping the same DBMS
    - keeping the same logical database schema
• Shows super-linear speedup
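For reference, the usual definition of speedup is assumed here, since the slides do not spell it out: T_1 is the execution time of a query on one node and T_n its execution time on n nodes.

```latex
S(n) = \frac{T_1}{T_n}, \qquad
\text{linear: } S(n) = n, \qquad
\text{super-linear: } S(n) > n
```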
7
GParGRES
GParGRES: a Database Grid Middleware
• Middleware that provides
  - Transparent access to distributed databases in a grid
  - Intra-query parallelism during heavy-weight query processing
• Based on ParGRES
  - Assumes that grid nodes are PC clusters running ParGRES instances
• Intra-query parallelism is achieved through virtual partitioning
• Two levels of query splitting
  - Grid-level splitting: implemented by GParGRES
  - Node-level splitting: implemented by ParGRES
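The two levels of splitting can be sketched as a range split applied twice: once across clusters, then once across the nodes of each cluster. The cluster names, the equal share per cluster, and the helpers `split_range` and `two_level_split` are all assumptions for this example; the slides do not describe how the ranges are sized.

```python
def split_range(lo, hi, n_parts):
    """Split the half-open range [lo, hi) into n_parts contiguous ranges."""
    base, extra = divmod(hi - lo, n_parts)
    parts, start = [], lo
    for i in range(n_parts):
        end = start + base + (1 if i < extra else 0)
        parts.append((start, end))
        start = end
    return parts


def two_level_split(lo, hi, nodes_per_cluster):
    """Grid-level split across clusters, then node-level split inside each.

    nodes_per_cluster maps a (hypothetical) cluster name to its node count.
    """
    clusters = list(nodes_per_cluster)
    grid_ranges = split_range(lo, hi, len(clusters))
    plan = {}
    for cluster, (c_lo, c_hi) in zip(clusters, grid_ranges):
        plan[cluster] = split_range(c_lo, c_hi, nodes_per_cluster[cluster])
    return plan


plan = two_level_split(0, 120, {"cluster_a": 2, "cluster_b": 3})
```

In this sketch the first call plays the role of the grid-level split and the inner call the role of the node-level split; a real deployment might weight the grid-level ranges by cluster capacity instead.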
9
GParGRES: Architecture
10
GParGRES: Architecture
Concentrates metadata concerning GParGRES services, such as the state of each FS and DQS instance, and ParGRES execution in the nodes
11
GParGRES: Architecture
GParGRES entry point, responsible for creating new instances of DQS
12
GParGRES: Architecture
Manages global query execution. Receives the query and splits it into subqueries by using virtual partitioning to implement intra-query parallelism. It also performs final result composition
13
GParGRES: Architecture
Grid Local Query Service (GLQS) – local component responsible for receiving subqueries from DQS and passing them to the local ParGRES instance
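The interaction between a component that splits and composes and the local components that execute subqueries might look like the sketch below. The classes `DistributedQueryService` and `LocalQueryService` are hypothetical in-memory stand-ins written for this example; they are not the GParGRES services and do not use any grid or WSRF API.

```python
from concurrent.futures import ThreadPoolExecutor


class LocalQueryService:
    """Stand-in for a local query service: counts rows per priority
    within a key range over its own full replica of the data."""

    def __init__(self, rows):
        self.rows = rows  # list of (orderkey, priority) tuples

    def run(self, lo, hi):
        counts = {}
        for key, priority in self.rows:
            if lo <= key < hi:
                counts[priority] = counts.get(priority, 0) + 1
        return counts


class DistributedQueryService:
    """Stand-in for the global side: split, dispatch in parallel, compose."""

    def __init__(self, services):
        self.services = services

    def execute(self, lo, hi):
        n = len(self.services)
        step = -(-(hi - lo) // n)  # ceiling division
        ranges = [(lo + i * step, min(lo + (i + 1) * step, hi)) for i in range(n)]
        with ThreadPoolExecutor(max_workers=n) as pool:
            partials = list(
                pool.map(lambda job: job[0].run(*job[1]), zip(self.services, ranges))
            )
        total = {}
        for partial in partials:  # final result composition
            for priority, count in partial.items():
                total[priority] = total.get(priority, 0) + count
        return total


rows = [(1, "1-URGENT"), (2, "2-HIGH"), (3, "1-URGENT"),
        (4, "5-LOW"), (5, "2-HIGH"), (6, "1-URGENT")]
dqs = DistributedQueryService([LocalQueryService(rows) for _ in range(3)])
result = dqs.execute(1, 7)
```

Because every stand-in holds the same replicated rows, correctness depends only on the key ranges being disjoint and covering the whole interval.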
14
GParGRES: a Database Grid Middleware
select o_orderpriority, count(*)
from orders
where o_orderdate >= date '1993-07-01'
group by o_orderpriority;
20
GParGRES: a Database Grid Middleware
create table temp_result_1 (
  o_orderpriority varchar(15),
  order_count integer
);
21
GParGRES: a Database Grid Middleware
select o_orderpriority, count(*)
from orders
where o_orderdate >= date '1993-07-01'
  and o_orderkey >= ?
  and o_orderkey < ?
group by o_orderpriority;
22
GParGRES: a Database Grid Middleware
insert into temp_result_1 values (?, ?);
26
GParGRES: a Database Grid Middleware
select o_orderpriority, sum(order_count)
from temp_result_1
group by o_orderpriority;
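The query sequence on these slides can be tried end to end on a toy table. The sketch below uses Python's built-in `sqlite3` module instead of PostgreSQL, so it is only an approximation: dates are stored as ISO text and compared as strings (SQLite has no `date '...'` literal), the six sample rows and the three key ranges are invented, and an `order by` is added only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table orders (o_orderkey integer, o_orderdate text, o_orderpriority text)"
)
sample_rows = [  # invented data, not TPC-H
    (1, "1993-06-30", "1-URGENT"),
    (2, "1993-07-01", "1-URGENT"),
    (3, "1994-01-15", "2-HIGH"),
    (4, "1995-03-02", "1-URGENT"),
    (5, "1996-08-20", "2-HIGH"),
    (6, "1997-11-05", "5-LOW"),
]
conn.executemany("insert into orders values (?, ?, ?)", sample_rows)

# Subquery with the added key-range predicate.
subquery = """
    select o_orderpriority, count(*)
    from orders
    where o_orderdate >= '1993-07-01'
      and o_orderkey >= ? and o_orderkey < ?
    group by o_orderpriority
"""
conn.execute("create table temp_result_1 (o_orderpriority text, order_count integer)")

# One subquery per virtual partition; partial results go to the temp table.
for lo, hi in [(1, 3), (3, 5), (5, 7)]:
    partial = conn.execute(subquery, (lo, hi)).fetchall()
    conn.executemany("insert into temp_result_1 values (?, ?)", partial)

# Final composition: partial counts are summed per group.
composed = conn.execute("""
    select o_orderpriority, sum(order_count)
    from temp_result_1
    group by o_orderpriority
    order by o_orderpriority
""").fetchall()

# The original, unpartitioned query for comparison.
direct = conn.execute("""
    select o_orderpriority, count(*)
    from orders
    where o_orderdate >= '1993-07-01'
    group by o_orderpriority
    order by o_orderpriority
""").fetchall()
```

The composition step uses `sum` rather than `count` because each row of the temporary table already holds a partial count for its group.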
27
GParGRES: Preliminary Experimental Results
• A preliminary GParGRES prototype has been implemented in Java
  - Simple versions of DQS and GLQS (using ParGRES components) were implemented
• Experimental setup
  - Two clusters from Grid’5000
    - Parasol cluster: 64 nodes, each with 2 Opteron 2.2 GHz CPUs, 2 GB RAM and a 73 GB HD
    - Paraquad cluster: 64 nodes, each with 2 dual-core Xeon 2.33 GHz CPUs, 4 GB RAM and a 160 GB HD
  - Kadeploy
    - Used to generate customized images of operating systems and applications
  - PostgreSQL 8.2.4
  - ParGRES
  - TPC-H database and queries
    - SF = 1
29
GParGRES: Preliminary Experimental Results (cont.)
• Two kinds of experiments
  - Isolated clusters
  - Mixed configuration
30
GParGRES: Preliminary Experimental Results (cont.)
• Isolated cluster - Parasol
31
GParGRES: Preliminary Experimental Results (cont.)
• Isolated cluster - Paraquad
32
GParGRES: Preliminary Experimental Results (cont.)
• Mixed Configuration
33
GParGRES – Implementation Issues
• Goals
  - To implement all components as grid services
  - WSRF-compliant components: RS, FS and GLQS
• When running in a grid managed by Globus Toolkit 4, RS can be implemented by the Web Service Monitoring and Discovery Service (WS MDS)
• Techniques employed in OGSA-DAI will help in implementing some components (e.g. FS)
34
Related Work
• OGSA-DAI
  - Open Grid Services Architecture - Data Access and Integration
• OGSA-DQP
  - Open Grid Services Architecture - Distributed Query Processing
• New data models for grid warehouses
  - Wehrle et al. propose a data model for distributing and querying a data warehouse in computing grids
    - The warehouse is formed by data “chunks”
    - Special structures are needed (e.g. X-Tree)
35
Conclusion
• GParGRES is a grid service for OLAP query processing
  - It provides transparent inter- and intra-query parallelism with
    - No need for application migration
    - No need for database schema migration
    - DBMS independence
• GParGRES exploits successful techniques implemented in ParGRES
• Two levels of query splitting
  - Grid-level splitting: implemented by GParGRES
  - Node-level splitting: implemented by ParGRES
• Components are WSRF-compliant, easing compatibility with existing grid solutions
• Preliminary results obtained in Grid’5000 show good performance
36
Future Work
• Integration with OGSA-DAI
• Support for partial database replication
• Support for top-k queries
  - Extension of best position algorithms
37
Thanks!
DMG 2007
[Image: “A different view of the Grid”: Kandinsky, the Grid, 1923, Albertina Museum, Vienna]