research presentation - Computer Science and Engineering


High-level Interfaces and Abstractions for
Data-Driven Applications in a Grid
Environment
Gagan Agrawal
Department of Computer Science and Engineering
(joint work with Liang Chen, Wei Du, Leo Glimcher,
Ruoming Jin, Xiaogang Li, Swarup Sahoo, Li Weng,
and Xuan Zhang)
(Funded by ACI-9733520, EIA-9703088, ACI-9982087, ACI-0130437, EIA-0203846, ACI-0234273,
DoD PET program)
Data-Driven Applications




Becoming increasingly important
Can be extremely hard to develop for a grid environment
We need to focus on end users who have used
Matlab- or SQL-like systems for data retrieval and
analysis
Some issues to consider



Different data layouts and formats
Flexibly exploit different forms of parallelism
Adapting to available resources
Research Projects

Automatic Data Virtualization
    XML-based high-level abstractions and use of XQuery
    SQL-based front-end for the STORM system
FREERIDE (Framework for Rapid Implementation of Datamining Engines)
    High-level specification of a parallel data mining algorithm
    Flexibly exploit different forms of parallelism
GATES (Grid-based AdapTive Execution on Streams)
    OGSA based
    Support for processing distributed streams in a grid environment
    Self adaptation to meet real-time constraints
Compiler-based front-end to DataCutter
    Includes support for program adaptation
More details through four student posters this afternoon
Automatic Data Virtualization




Data virtualization refers to an abstract view of data for access
and processing
Data Services are methods that implement a virtual view of data
Our focus: using compiler techniques to automatically generate
data services to support data virtualization
Two separate ongoing implementations


Using XML Schema based high-level abstractions and XQuery (ICS
2003, LCPC 2003, DBPL 2003, prior compiler work in ICS 2002,
PACT 2001)
Supporting SQL front-end for data subsetting operations (jointly
with Saltz, Kurc, Catalyurek, et al.)
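As a rough illustration of the idea, the sketch below hand-writes the kind of data service that the compiler work aims to generate automatically. The flat text layout, the column names, and the `virtual_view` function are all hypothetical and are not taken from the actual system.

```python
import io

# Hypothetical physical layout: one whitespace-separated record per line,
# with columns stored in the order time, x, y, value.
PHYSICAL_COLUMNS = ["time", "x", "y", "value"]


def virtual_view(stream, wanted):
    """Data service: yield records as dicts holding only the wanted fields.

    The caller works with named fields (the virtual view) and never sees
    the column order or the text encoding (the physical layout).
    """
    index = {name: i for i, name in enumerate(PHYSICAL_COLUMNS)}
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        yield {name: float(parts[index[name]]) for name in wanted}


raw = io.StringIO("0 1 2 0.5\n1 1 2 0.75\n")
records = list(virtual_view(raw, ["time", "value"]))
print(records)
```

If the physical layout changed (for example, to a different column order), only `PHYSICAL_COLUMNS` and the parsing code would need to change, while code written against the virtual view would stay the same.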
Project Overview
[Figure: XQuery, a component marked "???", and the underlying data formats HDF5, NetCDF, TEXT, XML, RMDB, ….]
System Architecture
[Figure: External Schema, XML Mapping Service, logical XML schema, physical XML schema, Compiler, XQuery/XPath, C++/C]
SQL-Based Front-end
System Architecture
SELECT * FROM IPARS
WHERE RID in (0,6,26,27) AND TIME>1000 AND TIME<1100
AND SOIL>0.7 AND SPEED(OILVX, OILVY, OILVZ)<30.0;
Common operations: Subsetting, filtering, user defined filtering
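The example query combines subsetting (on RID and TIME), filtering (on SOIL), and a user-defined filter (SPEED). A minimal sketch of the same predicate over in-memory rows follows. It assumes that SPEED is the magnitude of the oil velocity vector, which the slide does not state, and the sample rows are made up for illustration.

```python
import math


def speed(vx, vy, vz):
    # Assumption: SPEED is the magnitude of the oil velocity vector.
    return math.sqrt(vx * vx + vy * vy + vz * vz)


def select_ipars(rows):
    """Apply the predicate of the example query to an iterable of dict rows."""
    return [
        r for r in rows
        if r["RID"] in (0, 6, 26, 27)
        and 1000 < r["TIME"] < 1100
        and r["SOIL"] > 0.7
        and speed(r["OILVX"], r["OILVY"], r["OILVZ"]) < 30.0
    ]


# Made-up rows, for illustration only.
rows = [
    {"RID": 6, "TIME": 1050, "SOIL": 0.8, "OILVX": 3.0, "OILVY": 4.0, "OILVZ": 0.0},
    {"RID": 6, "TIME": 1050, "SOIL": 0.8, "OILVX": 30.0, "OILVY": 40.0, "OILVZ": 0.0},
    {"RID": 5, "TIME": 1050, "SOIL": 0.8, "OILVX": 3.0, "OILVY": 4.0, "OILVZ": 0.0},
    {"RID": 0, "TIME": 1200, "SOIL": 0.9, "OILVX": 1.0, "OILVY": 1.0, "OILVZ": 1.0},
]
selected = select_ipars(rows)
print(selected)
```

Only the first row passes: the second fails the SPEED test, the third has an RID outside the list, and the fourth falls outside the TIME range.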
FREERIDE Overview






Framework for Rapid Implementation of Datamining Engines
Demonstrated for a variety of standard mining algorithms
Targets distributed memory parallelism, shared memory parallelism, and their combination
Can be used as a basis for scalable grid-based data mining implementations
Developed on top of Active Data Repository (ADR) from Saltz’s group at Maryland
Publications: SDM 01, 02, 03, SIGMETRICS 02, IPDPS 04, TKDE 04
Key Observation from Mining Algorithms



Most popular algorithms have a common canonical loop
Can be used as the basis for supporting a common middleware
Parallelism of different forms and execution on disk-resident datasets
While( ) {
    forall(data instances d) {
        I = process(d)
        R(I) = R(I) op d
    }
    …….
}
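One way to read the canonical loop is as a reduction: each data instance selects an element of a reduction object R and updates it with an operator `op`. The sketch below instantiates that pattern with a one-dimensional k-means, one of the algorithms listed for FREERIDE. It is a plain sequential illustration, not the FREERIDE API; the function names and the sample data are hypothetical.

```python
def nearest(centers, d):
    # process(d): pick the element of the reduction object that d updates.
    return min(range(len(centers)), key=lambda i: abs(centers[i] - d))


def kmeans_1d(data, centers, iterations=10):
    centers = list(centers)
    for _ in range(iterations):              # outer While loop
        sums = [0.0] * len(centers)          # reduction object R
        counts = [0] * len(centers)
        for d in data:                       # forall(data instances d)
            i = nearest(centers, d)          # I = process(d)
            sums[i] += d                     # R(I) = R(I) op d
            counts[i] += 1
        # Work outside the forall: recompute the centers from R.
        centers = [
            sums[i] / counts[i] if counts[i] else centers[i]
            for i in range(len(centers))
        ]
    return centers


result = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(result)
```

Because the update here is an addition, the order in which instances are processed does not change R. That property is what would let separate workers accumulate partial copies of R over different parts of the data and merge them afterwards.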
Applications of FREERIDE

Apriori and FP-tree based association mining


K-means and EM Clustering



distributed memory, shared memory, combination
Nearest-neighbor search
RainForest-based decision tree construction


distributed memory, shared memory, combination
shared memory
A new decision tree algorithm: Statistical Pruning
of Intervals for Enhanced Scalability (SPIES)

distributed memory, shared memory, combination
Applying FREERIDE for Scientific Data Mining




Joint work with Machiraju and Parthasarathy
Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
A feature is a region of interest in a dataset
A suite of algorithms for extracting and tracking features

FREERIDE forms a basis for supporting high-level interfaces
Data Parallel Java: LCPC 2002, IPDPS 2003
Matlab / mining operators: planned in the future
GATES




Grid-based AdapTive Execution on Streams
Targets (distributed) processing of (distributed) data streams
Built on the OGSA model
Self adaptation to meet real-time constraints on processing
GATES: Motivation

Many applications involve high-volume data streams





Data from large scale experiments / simulations
Digitized images from a movie camera
Network traffic
Data may arise from distributed sources
Analysis / consumption of results may be distributed


Many users wanting different analyses/results
Insufficient compute power at one site
Self Adaptation in GATES


Goal: Achieve the best accuracy with available resources,
subject to real-time constraint
GATES approach:




Programmer exposes certain parameters in the processing of each stage
Examples include: rate of sampling, size of summary structure
Programmer also specifies the direction of sensitivity, e.g. a larger
summary structure means more computation/communication
Parameters adjusted at runtime


Currently based on the size of buffers: signal the previous stage to become
faster if the buffer is too small, or slower if it is too large
Future possibilities: use profiling / performance models …
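The buffer-based rule can be sketched as a small feedback step. This is only an illustration under stated assumptions: the exposed parameter is taken to be a sampling rate, a larger rate is assumed to send more data downstream, and the thresholds, step size, and function name are hypothetical rather than part of GATES.

```python
def adjust_rate(rate, buffer_len, low=10, high=100, step=0.1,
                min_rate=0.1, max_rate=1.0):
    """Return the new sampling rate for the previous stage.

    Assumption: a larger sampling rate means more data sent downstream,
    so a long buffer calls for a lower rate and a short buffer allows a
    higher one. All thresholds and the step size are hypothetical.
    """
    if buffer_len > high:
        rate -= step      # buffer too large: ask the previous stage to slow down
    elif buffer_len < low:
        rate += step      # buffer too small: ask the previous stage to speed up
    # Clamp to the allowed range and round away floating-point noise.
    return round(min(max_rate, max(min_rate, rate)), 2)


rate = 0.5
history = []
for buffer_len in [150, 150, 50, 5, 5]:
    rate = adjust_rate(rate, buffer_len)
    history.append(rate)
print(history)
```

The rate drops while the buffer is above the high threshold, holds steady in between, and recovers once the buffer falls below the low threshold.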
Summary




Application development in a grid environment is hard
Need novel runtime techniques and middleware
Innovative applications of compiler technology can help
Equipment needs:


Need a controlled distributed environment
Need high-bandwidth connectivity in order to simulate:



High rate of data arrival
External clients with ability to receive data at high rates
Scaling work to systems with terabytes of storage