UPPSALA DATABASE LABORATORY
Managing Scientific Queries over
Distributed Data in a Grid
Environment
Ruslan Fomkin
UU- IT - UDBL
Ruslan Fomkin
Uppsala DataBase Laboratory (UDBL)
Supervisor
• prof. T. Risch
Database research
• How to build extensible middleware query
processing that allows scalable, application-oriented
search over different kinds of wrapped
information sources
http://www.it.uu.se/research/group/udbl/
January 20, 2006
NGN workshop
Uppsala
AMOS II
[Figure: AMOS II architecture]
Applications
• Simulation
• Visualization
• Analysis
Virtual Mediator Database
• queries and views
• plug-ins
• queries and continuous queries to data sources through wrappers
Data sources
• Relational Databases
• Patient Monitoring
• GRID hist.
• Measurements
Ongoing Research at UDBL
• Mediating Web Services: Manivasakan Sabesan, BSc
• Stream Queries on BlueGene: Erik Zeitler, MSc
• Stream Data Manager: Milena Ivanova, PhD
• Semantic Web Queries to Hidden Web: Johan Petrini, MSc
• FEM Databases: Kjell Orsborn, PhD
• Expensive GRID Queries: Ruslan Fomkin, MSc
Outline
Introduction
The project
Test application
Developed framework
Conclusion
Future work
Scientific Applications, Grid and Databases
A lot of scientific data
• Complex structure
• Stored in files distributed in Grid
Scientific analyses can be represented as
declarative queries
• Complex queries with numerical computations
• Long running or batch queries
Utilization of computational resources of Grid
Parallel Object Query System for Expensive
Computations (POQSEC)
Query processor for scientific applications
• high-level interface to specify the analyses
• automatically generates execution plans and
evaluates them
Requirements
• Scalable, efficient, flexible, transparent
Properties
• Distributed and parallel
Layered Architecture of the System
POQSEC provides
• scientific query management
Grid (NorduGrid middleware) provides
• computation management
• file management
Application area (ROOT library) provides
• computational libraries
• data management libraries
[Figure labels: User, POQSEC, Application libraries (ROOT), Grid (NorduGrid), Data, Clusters]
Our Test Application
From Particle Physics
Analysis of collision events for presence of
Higgs particles
Data produced by ATLAS simulation software
• stored in files
• distributed in the Grid (e.g. NorduGrid)
• managed by ROOT library
Object-Relational Schema of
the Application Data
[Figure: schema diagram]
Event
• attributes: PxMiss, PyMiss
• relationship particles (1 to n) to Particle
Particle
• attributes: Px, Py, Pz, Kf, Ee
• subtypes (inheritance): Lepton, Jet
Lepton
• subtypes (inheritance): Muon, Electron
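The schema can be read as a small class hierarchy. The Python sketch below is illustrative only: the attribute names come from the diagram, but the Python types, the lower-case spellings, the reading of Ee as an energy, and the assumption that Lepton and Jet specialize Particle while Muon and Electron specialize Lepton are interpretations, not part of the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Particle:
    # Attribute names follow the diagram; the float/int types are assumed.
    px: float
    py: float
    pz: float
    kf: int     # particle code; the z-veto cut compares Kf values of opposite sign
    ee: float   # presumably the energy (assumption)


@dataclass
class Lepton(Particle):
    pass


@dataclass
class Muon(Lepton):
    pass


@dataclass
class Electron(Lepton):
    pass


@dataclass
class Jet(Particle):
    pass


@dataclass
class Event:
    px_miss: float
    py_miss: float
    # One event relates to n particles (the "particles" relationship).
    particles: List[Particle] = field(default_factory=list)


def leptons(ev: Event) -> List[Lepton]:
    """Select the particles of an event that are leptons of any subtype."""
    return [p for p in ev.particles if isinstance(p, Lepton)]
```

Modelling inheritance with subclasses lets a query over Lepton also range over Muon and Electron instances, which is the behaviour an object-relational type hierarchy would be expected to give.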
General Query of the Analysis
Selection of those events
that satisfy predicates
containing numerical operations
SELECT ev FROM Event ev
WHERE jetvetocut(ev) AND zvetocut(ev) AND
      topcut(ev) AND misseecuts(ev) AND
      leptoncuts(ev) AND threeleptoncut(ev);
Each predicate is called a cut in the application area
Predicates are defined as queries
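The general query is a conjunction of boolean predicates applied to every event. A rough Python analogue follows, assuming each cut is an ordinary function from an event to a boolean. The two toy cuts and the dictionary fields are hypothetical placeholders, not the physics cuts of the application.

```python
from typing import Any, Callable, Iterable, List

Cut = Callable[[Any], bool]


def select_events(events: Iterable[Any], cuts: List[Cut]) -> List[Any]:
    """Keep the events that satisfy every cut (logical AND of the predicates)."""
    return [ev for ev in events if all(cut(ev) for cut in cuts)]


# Placeholder cuts over toy events represented as dicts (hypothetical fields).
def has_three_leptons(ev: dict) -> bool:
    return ev["n_leptons"] >= 3


def low_jet_activity(ev: dict) -> bool:
    return ev["n_jets"] <= 1


toy_events = [
    {"id": 1, "n_leptons": 3, "n_jets": 0},
    {"id": 2, "n_leptons": 2, "n_jets": 0},
    {"id": 3, "n_leptons": 4, "n_jets": 5},
]
selected = select_events(toy_events, [has_three_leptons, low_jet_activity])
```

Because `all` stops at the first failing cut, the order of the cuts affects cost but not the result. A declarative query processor is free to choose that order itself.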
Example of a predicate:
Z-veto cut
Either the event does not have a pair of oppositely charged
leptons
or the invariant mass of the pair is not close to the mass of a
Z particle
CREATE FUNCTION zvetocut(Event ev) -> Event AS
  SELECT ev
  WHERE NOTANY(oppositeLeptons(ev)) OR
        abs(invMass(oppositeLeptons(ev)) - zMass) >= minZMass;
CREATE FUNCTION oppositeLeptons(Event ev) -> bag of <Lepton, Lepton> AS
  SELECT l1, l2 FROM Lepton l1, Lepton l2
  WHERE l1 = particles(ev) AND l2 = particles(ev) AND
        Kf(l1) = -Kf(l2);
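The numerical part of this cut is the invariant mass of a lepton pair, m = sqrt((E1 + E2)^2 - |p1 + p2|^2) in natural units. The Python sketch below mirrors the two functions under stated assumptions: Ee is taken to be the energy in GeV, `Z_MASS` is the approximate Z boson mass, and `MIN_Z_MASS` is an arbitrary placeholder because the talk gives no value. The `Lepton` tuple and function names are hypothetical. The use of `any` follows an existential reading of the query as written, which is an interpretation.

```python
import math
from itertools import permutations
from typing import List, NamedTuple, Tuple


class Lepton(NamedTuple):
    px: float
    py: float
    pz: float
    kf: int    # particle code; opposite sign marks the opposite lepton
    ee: float  # assumed to be the energy, in GeV


Z_MASS = 91.19     # GeV, approximate Z boson mass
MIN_Z_MASS = 10.0  # GeV, placeholder window; not a value from the talk


def opposite_leptons(leptons: List[Lepton]) -> List[Tuple[Lepton, Lepton]]:
    """All ordered pairs (l1, l2) with Kf(l1) = -Kf(l2), as in the query."""
    return [(a, b) for a, b in permutations(leptons, 2) if a.kf == -b.kf]


def inv_mass(a: Lepton, b: Lepton) -> float:
    """Invariant mass sqrt((E1+E2)^2 - |p1+p2|^2) in natural units."""
    e = a.ee + b.ee
    px, py, pz = a.px + b.px, a.py + b.py, a.pz + b.pz
    # Clamp at zero to guard against small negative values from rounding.
    return math.sqrt(max(e * e - (px * px + py * py + pz * pz), 0.0))


def z_veto_cut(leptons: List[Lepton]) -> bool:
    """True if no opposite pair exists or some pair lies outside the Z window."""
    pairs = opposite_leptons(leptons)
    if not pairs:
        return True
    return any(abs(inv_mass(a, b) - Z_MASS) >= MIN_Z_MASS for a, b in pairs)
```

For example, two back-to-back leptons with 45 GeV each have an invariant mass of 90 GeV, which is inside the placeholder window, so the event is rejected by this sketch.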
Current Framework
Basic tool for utilizing NorduGrid through
Advanced Resource Connector (ARC)
Submission mechanism
• submit query
• parallelize query into several subqueries
• generate job scripts (one per subquery)
Babysitter functionality
Data exchange mechanism through files
Client and Coordinator Part
POQSEC client
• personal database with application schema
• ROOT wrapper
Coordinator server
• receives queries
• creates jobs
Grid Meta-Database
• computational resources
• data files
Submission Database
• received submissions
• created jobs
Babysitter
• interactions with ARC
[Figure labels: Grid Client Node, POQSEC Client, Coordinator server, Query Coordinator, Job queue, Babysitter, ARC Client, Grid Meta-Database, Submission Database, Local Storage]
Query Submission
Query submission
• query
• file name selection
• degree of parallelism
• CPU time for each job
Coordinator server creates jobs
• same query
• partitions of data with equal size
• same CPU time provided by user
• corresponding job script files
Submission and its jobs saved in Submission Database
Created jobs added to Job queue
Script files saved to Local Storage
[Figure labels: Grid Client Node, POQSEC Client, Coordinator server, Query Coordinator, Job queue, Babysitter, ARC Client, Grid Meta-Database, Submission Database, Local Storage]
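The job-creation step can be sketched as follows. This is a minimal illustration, not the POQSEC implementation: the `Job` record, the function names, and the reading of "partitions of data with equal size" as splitting the selected file list into contiguous chunks of near-equal length are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    query: str         # every job carries the same query text
    files: List[str]   # this job's partition of the selected data files
    cpu_minutes: int   # same CPU time for each job, provided by the user
    status: str = "created"


def partition(files: List[str], parts: int) -> List[List[str]]:
    """Split files into `parts` contiguous chunks whose sizes differ by at most one."""
    if parts < 1:
        raise ValueError("degree of parallelism must be at least 1")
    base, extra = divmod(len(files), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


def create_jobs(query: str, files: List[str], parallelism: int,
                cpu_minutes: int) -> List[Job]:
    """One job per non-empty partition, all sharing the query and CPU time."""
    return [Job(query, chunk, cpu_minutes)
            for chunk in partition(files, parallelism) if chunk]
```

Splitting by file count only balances the work when files hold similar numbers of events; a partitioning by event count would need metadata about each file.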
Jobs Submission
Babysitter
• takes jobs from Job queue
• submits each job to ARC client
• changes status of submitted jobs in Submission DB
ARC client
• finds Computing Element (CE)
• submits job to corresponding ARC Grid Manager
[Figure labels: Grid Client Node, POQSEC Client, Coordinator server, Query Coordinator, Job queue, Babysitter, ARC Client, Grid Meta-Database, Submission Database, Local Storage, two CEs each with an ARC Grid Manager]
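The submission loop of the Babysitter might look roughly like the Python sketch below. The `submit` callable stands in for the ARC client and is a hypothetical interface, as are the status strings and the dictionary used in place of the Submission Database; no real ARC command is invoked.

```python
from collections import deque
from typing import Callable, Deque, Dict


def submit_queued_jobs(job_queue: Deque[str],
                       submission_db: Dict[str, dict],
                       submit: Callable[[str], str]) -> None:
    """Drain the job queue, submit each job, and record the new status.

    `submit` stands in for the ARC client: it takes a job script path and
    returns a grid job identifier (hypothetical interface).
    """
    while job_queue:
        script = job_queue.popleft()
        try:
            grid_id = submit(script)
        except RuntimeError as err:
            # Record the failure instead of silently losing the job.
            submission_db[script] = {"status": "submit_failed", "error": str(err)}
            continue
        submission_db[script] = {"status": "submitted", "grid_id": grid_id}
```

Keeping the status in a database rather than in memory means the state of a long-running submission survives a restart of the coordinator.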
Job Execution
ARC Grid Manager
• downloads input files
• submits job to Local Batch System (LBS)
After some delay LBS starts Executor on an allocated CE node
Executor during execution
• executes given subquery
• accesses data through ROOT wrapper
• saves result to files on CE Storage
[Figure labels: SE, SE, CE, ARC Grid Manager, CE Storage, LBS Queue, CE node, Executor, wrapper]
Downloading Result
Babysitter
• polls ARC client for job statuses
• requests to download results for finished jobs
Results downloaded to Local Storage
User can retrieve result when all jobs are ready
[Figure labels: Grid Client Node, POQSEC Client, Coordinator server, Query Coordinator, Job queue, Babysitter, ARC Client, Grid Meta-Database, Submission Database, Local Storage, two CEs each with an ARC Grid Manager and CE Storage]
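The polling behaviour can be illustrated with a short Python sketch. The `get_status` and `download` callables stand in for calls to the ARC client and are hypothetical, as are the status strings "FINISHED" and "downloaded". The sketch also assumes every job eventually finishes; it does not handle failed jobs.

```python
import time
from typing import Callable, Dict


def babysit(jobs: Dict[str, str],
            get_status: Callable[[str], str],
            download: Callable[[str], None],
            poll_seconds: float = 60.0,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until the results of every job have been downloaded.

    `jobs` maps grid job ids to a local status and is updated in place.
    """
    while any(state != "downloaded" for state in jobs.values()):
        for job_id, state in jobs.items():
            if state == "downloaded":
                continue
            if get_status(job_id) == "FINISHED":
                download(job_id)            # results go to local storage
                jobs[job_id] = "downloaded"
        # Wait before the next round only if something is still outstanding.
        if any(state != "downloaded" for state in jobs.values()):
            sleep(poll_seconds)
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. The user-visible result becomes available once the function returns, matching the rule that results are retrieved when all jobs are ready.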
Conclusion
We provide
• declarative query interface for representing
scientific queries
• parallel query execution in Grid
(generating scripts)
• babysitter to keep track of job execution
Query parallelization is important
                      Standalone desktop   Grid, one job   Grid, four jobs
Response time         190 min              225 min         24 min
Requested CPU time    -                    200 min         20 min
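The effect of parallelization can be expressed as ratios of the response times reported in the conclusion. The short Python calculation below only restates those figures; the dictionary keys are labels chosen for this sketch.

```python
# Response times in minutes, copied from the reported measurements.
response_min = {
    "standalone desktop": 190,
    "grid, one job": 225,
    "grid, four jobs": 24,
}

baseline = response_min["grid, four jobs"]
# Ratio of each configuration's response time to the four-job response time.
speedup = {name: t / baseline for name, t in response_min.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x the four-job response time")
```

With these numbers the four-job run responds about 9.4 times faster than the single Grid job and about 7.9 times faster than the standalone desktop run.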
Future work
Estimating query execution time
Dealing with underestimation of execution time
Automatic decisions on degree of
parallelism and resource brokering
• adaptive
• based on current load and job statistics
Dealing with failures in Grid
POOL wrapper
Thank you for your attention! Questions?