MATE-EC2: A Middleware for Processing Data with AWS


Middleware Solutions for Data-Intensive (Scientific) Computing on Clouds
Gagan Agrawal
Ohio State University
(Joint Work with Tekin Bicer, David Chiu, Yu Su, ..)
Motivation
• Cloud Resources
– Pay-as-you-go
– Elasticity
– Black boxes from a performance viewpoint
• Scientific Data
– Specialized formats, like NetCDF, HDF5, etc.
– Very Large Scale
Ongoing Work at Ohio State
• MATE-EC2: Middleware for Data-Intensive Computing on EC2
– Alternative to Amazon Elastic MapReduce
• Data Management Solutions for Scientific Datasets
– Target NetCDF and HDF5
• Accelerating Data Mining Computations Using Accelerators
• Resource Allocation Problems on Clouds
MATE-EC2: Motivation
• MATE – MapReduce with an Alternate API
• MATE-EC2: Implementation for AWS Environments
• Cloud resources are black boxes
• Need for services and tools that can…
– get the most out of cloud resources
– help their users with easy APIs
MATE vs. Map-Reduce
Processing Structure
• The Reduction Object represents the intermediate state of the execution
• The reduce function is commutative and associative
• Sorting and grouping overheads are eliminated by the reduction function/object (see the sketch below)
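
To make this processing structure concrete, here is a minimal Python sketch of a MATE-style generalized reduction for KMeans. It is illustrative only (the actual middleware is implemented in C, and ReductionObject/kmeans_step are hypothetical names): each point is folded directly into a reduction object, and per-worker objects are merged with a commutative/associative operation, so no intermediate (key, value) pairs ever need sorting or grouping.

import numpy as np

class ReductionObject:
    def __init__(self, k, dim):
        self.sums = np.zeros((k, dim))        # per-cluster coordinate sums
        self.counts = np.zeros(k, dtype=int)  # per-cluster point counts

    def reduce(self, point, centroids):
        # Local reduction: fold the point into its nearest cluster's slot.
        c = np.argmin(((centroids - point) ** 2).sum(axis=1))
        self.sums[c] += point
        self.counts[c] += 1

    def merge(self, other):
        # Global reduction: merging is commutative and associative, so
        # per-thread/per-node objects can be combined in any order.
        self.sums += other.sums
        self.counts += other.counts

def kmeans_step(chunks, centroids):
    k, dim = centroids.shape
    objs = []
    for chunk in chunks:                  # one reduction object per worker
        obj = ReductionObject(k, dim)
        for point in chunk:
            obj.reduce(point, centroids)
        objs.append(obj)
    final = ReductionObject(k, dim)
    for obj in objs:                      # combine the partial results
        final.merge(obj)
    used = final.counts > 0
    centroids[used] = final.sums[used] / final.counts[used, None]
    return centroids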
MATE-EC2 Design
• Data organization
– Three levels: Buckets, Chunks and Units
– Metadata information
• Chunk Retrieval
– Threaded Data Retrieval (sketched below)
– Selective Job Assignment
• Load Balancing and handling heterogeneity
– Pooling mechanism
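
A minimal sketch of the threaded retrieval idea, assuming boto3 and hypothetical bucket/key names: a chunk is split into equal byte ranges (the "units"), and each retrieval thread issues one S3 ranged GET.

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")

def fetch_unit(bucket, key, start, end):
    # S3 honors HTTP Range requests; "bytes=start-end" is inclusive.
    resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
    return start, resp["Body"].read()

def fetch_chunk(bucket, key, offset, size, n_threads=16):
    # Split the chunk into n_threads byte ranges (the "units").
    unit = size // n_threads
    ranges = []
    for i in range(n_threads):
        start = offset + i * unit
        end = offset + size - 1 if i == n_threads - 1 else start + unit - 1
        ranges.append((start, end))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        parts = list(pool.map(lambda r: fetch_unit(bucket, key, *r), ranges))
    # Reassemble the units in offset order into one chunk buffer.
    return b"".join(data for _, data in sorted(parts))

# e.g. chunk = fetch_chunk("mate-ec2-data", "dataset/object0", 0, 128 << 20)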
MATE-EC2 Processing Flow
[Figure: MATE-EC2 processing flow. A slave node requests a job from the Job Scheduler on the EC2 Master Node, which consults the Metadata File; a chunk (e.g., C5) is assigned as the job. Retrieval threads (T0–T3) on the EC2 Slave Node fetch the chunk's pieces from the S3 data object, write them into a buffer, and pass the retrieved chunk to the Computing Layer for processing while the node requests another job.]
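
The pull-based assignment in this flow can be sketched as a simple job pool (illustrative Python; threads stand in for slave nodes, and the chunk count is made up): an idle slave requests the next chunk, so faster nodes naturally take more jobs, which is also how the pooling mechanism absorbs heterogeneity.

import queue
import threading

job_pool = queue.Queue()
for chunk_id in range(16):            # e.g. 16 chunks awaiting processing
    job_pool.put(chunk_id)

def slave(process_chunk):
    while True:
        try:
            chunk_id = job_pool.get_nowait()   # request a job from the pool
        except queue.Empty:
            return                             # no jobs left: done
        process_chunk(chunk_id)                # retrieve the chunk, process it

workers = [threading.Thread(target=slave, args=(print,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()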
Experiments
• Goals:
– Finding the most suitable settings for AWS
– Performance of MATE-EC2 on heterogeneous and homogeneous environments
– Performance comparison of MATE-EC2 and Map-Reduce
• Applications: KMeans and PCA
• Used Resources:
– 4 Large EC2 instances for processing, 1 Large instance for the Master
– 16 data objects on S3 (8.2 GB total dataset for both applications)
Diff. Data Chunk Sizes
• KMeans
• 16 retrieval threads
• Performance increase:
– 8 MB chunks vs. others: 1.13x to 1.30x
– 1 thread vs. 16 threads: 1.24x to 1.81x
Diff. Number of Threads
• 128 MB chunk size
• Performance increase in the figure (KMeans): 1.37x to 1.90x
• Performance increase for PCA: 1.38x to 1.71x
Selective Job Assignment
• Performance increase in the figure (KMeans): 1.01x to 1.14x
• For PCA: 1.19x to 1.68x
Heterogeneous Env.
• L: Large instances, S: Small instances
• 128 MB chunk size
• Overheads in the figure (KMeans): under 1%
• Overheads for PCA: 1.1% to 11.7%
MATE-EC2 vs. Map-Reduce
• Scalability (MATE): 90% efficiency
• Scalability (Map-Reduce): 74% efficiency
• Speedups, MATE vs. MR: 3.54x to 4.58x
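
(For reference: parallel efficiency here is presumably speedup divided by the number of instances, Efficiency = T_1 / (p × T_p); at 90% efficiency, 4 instances would give roughly a 3.6x speedup over one.)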
MATE-EC2: Continuing Directions
• Cloud Bursting
– Cloud as a Complement or On-Demand Alternative to Local Resources
• Autotuning for a New Cloud Environment
– Data Storage can be a black box
• Data-Intensive Applications on Clusters of GPUs
– Programming Model, System Design
Outline
• MATE-EC2: Middleware for Data-Intensive Computing on EC2
– Alternative to Amazon Elastic MapReduce
• Data Management Solutions for Scientific Datasets
– Target NetCDF and HDF5
• Accelerating Data Mining Computations Using Accelerators
• Resource Allocation Problems on Clouds
Data Management: Motivation
• Datasets are becoming extremely large
• Scientific datasets are in formats like NetCDF and HDF5
• Existing database solutions are not scalable
– Can't help with native data formats
Data Management: Use Scenarios
• Data Dissemination Efforts
– Support User-Defined Subsetting and Data Aggregation
• Implementing Data Processing Applications
– Higher-level API than NetCDF/HDF5 libraries
• Visualization Tools (ParaView etc.)
– Data Format Conversion on Large Datasets
Initial Prototype: Data Subsetting with a Relational View on NetCDF
• Processing flow (the first stage is sketched below):
– Parse the SQL expression (using the metadata for the NetCDF dataset)
– Filter dimensions
– Generate data access code
– Partition tasks and assign them to slave processes
– Execute the query
– Filter variable values
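
As a sketch of the first stage, a restricted SQL grammar like the one used by the SQL1–SQL5 queries later in this talk can be parsed into per-dimension conditions in a few lines of Python (hypothetical code, not the actual prototype):

import re

# Restricted grammar: SELECT <var> FROM dataset [WHERE <dim> <op> <int> [AND ...]];
QUERY = re.compile(r"SELECT\s+(\w+)\s+FROM\s+dataset(?:\s+WHERE\s+(.+?))?\s*;?\s*$", re.I)
COND = re.compile(r"(\w+)\s*(<=|>=|<|>|=)\s*(\d+)")

def parse_query(sql):
    m = QUERY.match(sql.strip())
    var, where = m.group(1), m.group(2) or ""
    conds = {}
    for dim, op, val in COND.findall(where):
        conds.setdefault(dim, []).append((op, int(val)))
    return var, conds

# parse_query("SELECT pressure FROM dataset WHERE cells<=20481;")
#   -> ("pressure", {"cells": [("<=", 20481)]})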
Metadata Descriptor
• Dataset Storage Description
– Lists the nodes and the directories where the data resides
• Dataset Layout Description
– The header part of each NetCDF file
• Naturally included in the NetCDF dataset
• Avoids the effort of generating the metadata separately
– Describes the layout of each NetCDF file (see the sketch below)
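
A sketch of what the layout half of the descriptor might look like when extracted directly from a NetCDF header via the Python netCDF4 library (the actual prototype's format is not shown here; layout_descriptor is a hypothetical name):

from netCDF4 import Dataset

def layout_descriptor(path):
    # Read the header that is already part of the NetCDF dataset.
    ds = Dataset(path)
    return {
        "dimensions": {name: len(dim) for name, dim in ds.dimensions.items()},
        "variables": {
            name: {"dims": var.dimensions, "shape": var.shape, "dtype": str(var.dtype)}
            for name, var in ds.variables.items()
        },
    }

# The full metadata descriptor would pair this layout description with
# the storage description: which node/directory holds each file.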
Pre-filter and Post-filter
• Pre-filter:
– Takes the SQL expression and metadata as input
– Filters based on the dimensions of a variable
– Supports both direct dimensions and coordinate variables
• Post-filter:
– Filters based on variable values (see the sketch below)
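
A minimal illustration of the two filters on a GCRM-like variable, using the Python netCDF4 library (the file name, variable layout, and threshold are hypothetical, and cells is assumed to be a 0-based index dimension):

from netCDF4 import Dataset

ds = Dataset("gcrm.nc")                  # hypothetical GCRM file
pressure = ds.variables["pressure"]      # assume dims (time, cells, layers)

# Pre-filter: dimension conditions such as "cells <= 20481 AND layers < 250"
# become index slices applied before the read, so only the selected
# subarray is ever fetched from storage.
subset = pressure[:, :20482, :250]

# Post-filter: a condition on the variable's value can only be applied
# after the data has been read into memory.
result = subset[subset > 1000.0]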
Query Partition
• Partition the current query into several sub-queries and assign each sub-query to a slave process
• Two partition criteria:
– Memory contiguity (sketched below)
– Data aggregation (future work)
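
A sketch of the contiguity criterion: since NetCDF variables are stored in row-major order, splitting along the outermost (slowest-varying) dimension gives each slave process a contiguous range to read (illustrative Python, not the actual prototype):

def partition(n_outer, n_procs):
    # Split outer-dimension indices [0, n_outer) into n_procs contiguous
    # sub-ranges; each becomes one sub-query for one slave process.
    base, extra = divmod(n_outer, n_procs)
    ranges, start = [], 0
    for i in range(n_procs):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# partition(10, 4) -> [(0, 3), (3, 6), (6, 8), (8, 10)]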
Experiment Setup
• Application:
– Global Cloud Resolving Model and Data (GCRM)
• Environment:
– The Glenn system at the Ohio Supercomputer Center
SQL Queries

No.     Query                                                                              Data selected
SQL 1   SELECT pressure FROM dataset;                                                      100%
SQL 2   SELECT pressure FROM dataset WHERE cells<=20481;                                   50%
SQL 3   SELECT pressure FROM dataset WHERE cells>20481 AND layers>330;                     25%
SQL 4   SELECT pressure FROM dataset WHERE cells<=20481 AND layers<250;                    10%
SQL 5   SELECT pressure FROM dataset WHERE cells<=20481 AND time<=781710 AND layers<250;   1%
Scalability with Different Data Sizes

[Figure: average time (sec) vs. dataset size (100M, 500M, 1G, 2G, 4G), one line per query (SQL1–SQL5).]

• 8 processes
• Execution time scaled almost linearly within each query
Time Improvement from Using the Pre-filter

[Figure: average time (sec) vs. amount of data processed (100M, 500M, 1G, 2G, 4G), with and without the pre-filter.]

• 4 processes
• SQL5 (queries only 1% of the data)
• The pre-filter efficiently decreases the query size and improves performance
Scalability with Increasing No. of Sources

[Figure: average time (sec) vs. number of processes (2, 4, 8, 16).]

• 4G dataset
• SQL1 (full scan of the data table)
• Execution time scaled almost linearly
Data Management: Continuing Work
• Similar Prototype with HDF5 under Implementation
• Consider processing, not just subsetting/aggregation
– Map-Reduce-like Processing for NetCDF/HDF5 datasets?
• Consider Format Conversion for Existing Tools
Outline
• MATE-EC2: Middleware for Data-Intensive Computing on EC2
– Alternative to Amazon Elastic MapReduce
• Data Management Solutions for Scientific Datasets
– Target NetCDF and HDF5
• Accelerating Data Mining Computations
• Resource Allocation Problems on Clouds
System for Mapping to Heterogeneous Configurations

[Figure: system architecture. The application developer provides user input as simple C code with annotations. A compilation phase with a code generator produces multi-core middleware API code and GPU code for CUDA; the run-time system handles worker thread creation and management, mapping computation to CPU and GPU, and dynamic work distribution.]
K-Means on GPU + Multi-Core CPUs
Summary
• Dataset sizes are increasing
• Clouds add many challenges
• Data processing on clouds raises many challenges of its own