gates-hpdc-2004 - Computer Science and Engineering

GATES: A Grid-Based
Middleware for Processing
Distributed Data Streams
Liang Chen, Kolagatla Reddy, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
{chenlia, reddyk, agrawal}@cis.ohio-state.edu
1
Streaming Data Model
Continuous data arrival and processing
Emerging model for data processing
Sources that produce data continuously: sensors, long-running simulations
WAN bandwidths growing faster than disk bandwidths
Active topic in many computer science
communities
Databases
Data Mining
Networking ….
2
Summary/Limitations of Current
Work
Focus on
centralized processing of stream from a single
source (databases, data mining)
communication only (networking)
Many applications involve
distributed processing of streams
streams from multiple sources
3
Motivating Application
Network Fault Management System
[Figure: switch network and network fault management system]
4
Motivating Application (2)
Computer Vision Based Surveillance
5
Motivating Application (3)
Tatebe et al., CCGRID 2002
6
Features of Distributed Streaming
Processing Applications
Data sources could be distributed
Over a WAN
Continuous data arrival
Enormous volume
Probably can’t communicate it all to one site
Results from analysis may be desired at multiple
sites
Real-time constraints
A real-time, high-throughput, distributed processing
problem
7
Motivation
Challenges & Possible Solutions
Challenge 1: Data-, communication-, and/or compute-intensive
[Figure: switch network]
8
Motivation
Challenges & Possible Solutions
Challenge 1: Data- and/or computation-intensive
Solution: Grid computing technologies
[Figure: switch network]
9
Motivation
Challenges & Possible Solutions
Challenge 1: Data- and/or computation-intensive
Solution: Grid computing technologies
Challenge 2: Real-time analysis is required
Solution: Self-adaptation functionality is desired
10
Need for a Grid-Based Stream
Processing Middleware
Application developers interested in data stream
processing
Would like to have abstracted away
• Grid standards and interfaces
• Adaptation function
Would like to focus on algorithms only
GATES is a middleware for
Grid-based
Self-adapting
Data Stream Processing
11
Roadmap
GATES Architecture and API
Adaptation algorithm
Evaluation
Related work
Conclusion
On-going & Future work
12
GATES
Grid-based AdapTive Execution on Streams
Targets (distributed) processing of (distributed) data streams
Built on the OGSA model
Self-adaptation to meet real-time constraints on processing
13
GATES and Grid-Standards
Applications
GATES
Globus-OGSA
Web service
Internet
14
Using GATES
Break down the analysis into several sub-tasks that
make a pipeline
Implement each sub-task in Java
Write an XML configuration file so that the sub-tasks are deployed
automatically
Launch the application by running the Java program
(StreamClient.class) provided by GATES
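The first two steps above (split the analysis into sub-tasks, implement each in Java) can be pictured with a minimal, self-contained sketch. It does not use GATES or any Grid service: the class name PipelineSketch, the methods startStage and run, and the END sentinel are all hypothetical names chosen for illustration, and the stages run as local threads connected by in-memory queues rather than being deployed from an XML file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from GATES.
final class PipelineSketch {
    // Sentinel object that marks the end of the stream.
    private static final Object END = new Object();

    // Starts one stage: take an item, apply the sub-task, pass the result on.
    static Thread startStage(Function<Object, Object> task,
                             BlockingQueue<Object> in,
                             BlockingQueue<Object> out) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Object item = in.take();
                    if (item == END) {
                        out.put(END); // propagate end-of-stream downstream
                        return;
                    }
                    out.put(task.apply(item));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Chains the stages with queues, feeds the inputs, and collects the outputs.
    static List<Object> run(List<Function<Object, Object>> stages, List<?> inputs) {
        BlockingQueue<Object> first = new LinkedBlockingQueue<>();
        BlockingQueue<Object> current = first;
        for (Function<Object, Object> stage : stages) {
            BlockingQueue<Object> next = new LinkedBlockingQueue<>();
            startStage(stage, current, next);
            current = next;
        }
        try {
            for (Object x : inputs) {
                first.put(x);
            }
            first.put(END);
            List<Object> results = new ArrayList<>();
            for (Object r = current.take(); r != END; r = current.take()) {
                results.add(r);
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the pipeline", e);
        }
    }
}
```

Each stage only sees its input and output queue, which is the property that lets a middleware place the stages on different machines without changing the sub-task code.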
15
System Architecture
16
Adaptation for Real-time Processing
Analysis on streaming data is approximate
Accuracy and execution rate trade-off can
be captured by certain parameters
(Adaptation parameters)
Sampling Rate
Size of summary structure
Application developers can expose these
parameters and a range of values
17
API for Adaptation
public class Sampling-Stage implements StreamProcessing {
    …
    void init() {…}
    …
    void work(buffer in, buffer out) {
        …
        // Expose the adjustment parameter and its range of values to GATES
        GATES.Information-About-Adjustment-Parameter(min, max, 1);
        while (true) {
            Image img = get-from-buffer-in-GATES(in);
            Image img-sample = Sampling(img, sampling-ratio);
            put-to-buffer-in-GATES(img-sample, out);
            // Pick up the value GATES suggests for the next iteration
            sampling-ratio = GATES.getSuggestedParameter();
        }
        …
    }
}
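The slide's pseudocode cannot be compiled as written (hyphenated names, elided bodies), so here is a small runnable analogue of the same loop. It is only a sketch under stated assumptions: SamplingStageSketch is a hypothetical class, the DoubleSupplier stands in for the middleware call that returns a suggested parameter, integers stand in for images, and the deterministic "credit" sampling scheme is my own choice, not something the slides specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoubleSupplier;

// Hypothetical stand-in for a sampling stage; not the GATES API.
final class SamplingStageSketch {
    private final double min;
    private final double max;
    private final DoubleSupplier suggested; // plays the role of the middleware's suggestion
    private double ratio;                   // current sampling ratio, kept within [min, max]
    private double credit = 0.0;            // accumulates until one item is due for output

    SamplingStageSketch(double min, double max, DoubleSupplier suggested) {
        this.min = min;
        this.max = max;
        this.suggested = suggested;
        this.ratio = max; // start at full accuracy
    }

    // Forwards roughly ratio * batch.size() items, then refreshes the ratio.
    List<Integer> work(List<Integer> batch) {
        List<Integer> out = new ArrayList<>();
        for (Integer item : batch) {
            credit += ratio;
            if (credit >= 1.0) {
                credit -= 1.0;
                out.add(item);
            }
        }
        // Clamp the suggestion to the range the developer exposed.
        ratio = Math.max(min, Math.min(max, suggested.getAsDouble()));
        return out;
    }

    double currentRatio() {
        return ratio;
    }
}
```

The stage never decides the ratio itself; it only declares the legal range and applies whatever value it is handed, which mirrors the division of labor shown on the slide.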
18
Self-Adaptation Approach
[Figure: pipeline of Stage A, Stage B, and Stage C. Legend: buffers; queues; Grid services of GATES; stages of an application]
19
Adaptation algorithm
Goal
[Figure: pipeline of stages A, B, C]
Issues
No specific information about applications
Filter out short-term bursts while staying sensitive to long-term behavior
Quickly find converged values of adjustment parameters
Basic idea
Queueing theory and a heuristic algorithm
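The equations of the adaptation algorithm are not recoverable from this transcript, so the following is not the GATES algorithm. It is a generic illustration of the stated issues (ignore short bursts, react to sustained load, use only queue observations) under my own assumptions: an exponentially weighted average of queue fill, fixed thresholds of 0.25 and 0.75, and multiplicative steps. The class name AdaptationSketch and all constants are hypothetical.

```java
// Hypothetical queue-driven controller; an illustration, not the GATES algorithm.
final class AdaptationSketch {
    private static final double LOW = 0.25;  // below this smoothed fill, raise the parameter
    private static final double HIGH = 0.75; // above this smoothed fill, lower the parameter

    private final double min;
    private final double max;
    private final double alpha;    // weight of the newest observation (0..1)
    private double smoothed = 0.0; // exponentially weighted queue fill (0..1)
    private double value;          // current value of the adjustment parameter

    AdaptationSketch(double min, double max, double alpha) {
        this.min = min;
        this.max = max;
        this.alpha = alpha;
        this.value = max; // start at the most accurate setting
    }

    // Feeds one queue observation and returns the (possibly updated) parameter.
    double observe(int queueLength, int capacity) {
        double fill = (double) queueLength / capacity;
        // Smoothing damps short bursts; sustained load still moves the average.
        smoothed = alpha * fill + (1.0 - alpha) * smoothed;
        if (smoothed > HIGH) {
            value = Math.max(min, value * 0.5);  // overloaded: do less work per item
        } else if (smoothed < LOW) {
            value = Math.min(max, value * 1.25); // underloaded: recover accuracy
        }
        return value;
    }

    double current() {
        return value;
    }
}
```

With alpha = 0.25, a single full-queue observation leaves the parameter unchanged, while five consecutive full-queue observations push the smoothed fill past the high threshold and halve it. The controller needs no knowledge of what the stage computes, only its queue.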
20
Adaptation algorithm
Equations
21
Evaluation
Two applications
A counting sample application
A computational steering application
Three experiments were conducted
The first ran the counting sample application on GATES
The other two ran the computational steering application
22
Experiment One: Non-adaptive vs. Adaptive Version

Performance comparison

Network Bandwidth    40 (sec.)   80 (sec.)   120 (sec.)   160 (sec.)   Adaptive Version
(Kilo-Byte/Sec.)                                                       (Kilo-Byte/Sec.)
1                    462.3       612.9       459.9        671          463.5
10                   187.7       193.3       509.1        302.1        234.9
100                  246.4       466.7       296.2        371.6        387.1
1000                 240.4       298.8       307.7        478          399.9

Accuracy comparison

Network Bandwidth    40 (sec.)   80 (sec.)   120 (sec.)   160 (sec.)   Adaptive Version
(Kilo-Byte/Sec.)                                                       (Kilo-Byte/Sec.)
1                    0.891       0.962       0.981        0.987        0.986
10                   0.896       0.963       0.983        0.992        0.986
100                  0.887       0.957       0.979        0.988        0.974
1000                 0.879       0.963       0.983        0.989        0.988
23
Self-Adaptation with Different Processing
Requirements
24
Self-Adaptation with Different Data
Generation Rates
25
Related work
dQUOB (dynamic QUery Objects)
DataCutter
A lot of work on adaptation
Adaptation for real-time processing of streams
Streaming database systems
Support DB Operations, usually centralized
26
Conclusion
High-volume, distributed, stream processing is in
our future
Grid computing could be an effective solution for
distributed data stream processing
GATES
Distributed processing
Exploit grid web services
Self-adaptation to meet the real-time constraints
27
On-going and Future Work
Continuous (dynamic) resource discovery &
monitoring
Resource Reallocation (self-mobility)
Larger application (time-varying
visualization)
Generalize Adaptation Algorithm
More evaluation studies
28