Compiler Techniques for Data Parallel Applications With Very Large

Research Overview
Gagan Agrawal
Associate Professor
Personnel Involved
Ph.D. students: Liang Chen, Wei Du, Ruoming Jin, Feng Li (jointly with Joel Saltz), Xiaogang Li
Masters (thesis) student: Ge Yang
Undergrad student: Leo Glimcher
Faculty collaborations: Joel Saltz, Tahsin Kurc, Umit Catalyurek, Srini Parthasarathy, Raghu Machiraju
An Overall Vision
Our world will be full of distributed and dynamic data sources
    High-speed networking (Grid computing)
    Sensor networks, mobile systems, embedded devices
Processing this information involves many challenges
    A lot of data, distributed
    Often, continuous data streams (can’t store all data; real-time processing constraint)
    Complex interplay of communication and computational costs
    Application programmers want more transparency
Research Projects
Compilers: compiling XQuery (query language for XML data), compiling for a distributed heterogeneous (grid) environment, parallelizing scientific data-intensive and data mining codes
Middleware and Runtime Support: FREERIDE (Framework for Rapid Implementation of Datamining Engines), ongoing work on distributed processing of data streams
Data mining and OLAP algorithms: mining for streaming data, parallel and scalable mining algorithms, OLAP algorithms
Compiling Data Intensive Applications for a Grid Environment
Compiling XQuery
Vision: XML has become an accepted standard for distribution of datasets
XQuery is the well-accepted high-level query language for querying and processing XML datasets
Compiling complex data-intensive reduction operations written in XQuery
    Reductions written using recursion
    Data-centric execution strategies
    Using XML Schemas to describe the datasets -
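As a language-neutral illustration of the bullets on recursion and data-centric execution, the sketch below writes the same sum reduction twice: once recursively over a sequence, and once as a single pass that updates an accumulator as each element is read. It is in Python rather than XQuery and is not the output of the compiler described here; the function names and sample data are hypothetical.

```python
from typing import Iterable, Sequence


def recursive_sum(values: Sequence[float]) -> float:
    """Reduction in recursive style: first element plus the sum of the rest."""
    if not values:
        return 0.0
    return values[0] + recursive_sum(values[1:])


def streaming_sum(values: Iterable[float]) -> float:
    """Same reduction as one pass over the data with a single accumulator."""
    total = 0.0
    for v in values:
        total += v
    return total


readings = [1.5, 2.0, 3.5, 4.0]
assert recursive_sum(readings) == streaming_sum(readings) == 11.0
```

The single-pass form touches each element once and never builds intermediate slices, which is the property that matters when the dataset is too large to hold in memory.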
System Support for Data Mining in a Parallel Environment
[Figure labels, in slide order: Data Parallel Java; Compiler Techniques; FREERIDE (middleware); Runtime Techniques; MPI + POSIX Threads + File I/O; Clusters of SMPs]
Distributed Processing of Data Streams
Processing continuous data streams arising from distributed sources
A number of system and algorithmic challenges
    Real-time requirement on processing rate: tradeoffs between accuracy of analysis and efficiency
    Placement of data: want to process an individual stream close to the source of data
    Feedback-based control of accuracy: cannot allow any computational or communication stage to become the bottleneck
    Performance modeling: impact of output size, level of sampling, etc. on performance
Recently started work in this area
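One way to picture feedback-based control of accuracy is a stage that lowers its sampling rate when its input backlog grows and raises it again when it has spare capacity. The sketch below is only an assumed illustration of that idea, not the project's design; `adjust_sampling_rate`, the water marks, and the step sizes are all hypothetical.

```python
def adjust_sampling_rate(rate: float, backlog: int,
                         high_water: int = 100, low_water: int = 20,
                         min_rate: float = 0.05) -> float:
    """Return the next sampling rate given the current input backlog."""
    if backlog > high_water:
        # Stage is falling behind: halve the fraction of items analyzed.
        rate = rate / 2
    elif backlog < low_water:
        # Stage has spare capacity: raise accuracy in small steps.
        rate = rate + 0.1
    # Keep the rate within [min_rate, 1.0].
    return max(min_rate, min(1.0, rate))


rate = 1.0
history = []
for backlog in [10, 150, 300, 60, 5]:  # hypothetical queue lengths
    rate = adjust_sampling_rate(rate, backlog)
    history.append(rate)
```

Halving on overload and recovering in small increments trades accuracy for throughput quickly while avoiding oscillation; other control policies would fit the same interface.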
Algorithms for Mining and OLAP
Decision tree construction for streaming data: new one-pass algorithm with statistical accuracy bound
Parallel and scalable decision tree construction: use sampling, but without losing accuracy
Data cube construction:
    Parallel algorithms with optimal communication volume
    Tiling-based algorithms for scaling output sizes
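For readers unfamiliar with the term, a data cube holds the aggregate of a measure for every subset of the dimensions. The sequential sketch below shows only that definition, computing each group-by from an already aggregated parent with one more dimension. It does not attempt the parallel or tiling-based algorithms listed above, and the function names and sample facts are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(table, keep):
    """Sum the measure, keeping only the key positions listed in `keep`."""
    out = defaultdict(int)
    for key, value in table.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)


def data_cube(facts, n_dims):
    """Return {dimensions kept: aggregated table} for every dimension subset."""
    cube = {tuple(range(n_dims)): dict(facts)}
    for size in range(n_dims - 1, -1, -1):
        for dims in combinations(range(n_dims), size):
            # Reuse any already computed parent with exactly one extra dimension.
            parent = next(p for p in cube
                          if len(p) == size + 1 and set(dims) <= set(p))
            positions = [parent.index(d) for d in dims]
            cube[dims] = aggregate(cube[parent], positions)
    return cube


# Dimensions: 0 = state, 1 = year, 2 = category; the value is a sales count.
facts = {
    ("ohio", "2003", "books"): 4,
    ("ohio", "2004", "books"): 6,
    ("texas", "2003", "music"): 5,
}
cube = data_cube(facts, 3)
assert cube[(0,)] == {("ohio",): 10, ("texas",): 5}
assert cube[()] == {(): 15}
```

With three dimensions the cube contains 2^3 = 8 group-bys, which is why output size, and not just input size, becomes a scaling concern.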