UPPSALA DATABASE LABORATORY
Managing Scientific Queries over Distributed Data in a Grid Environment
Ruslan Fomkin
UU- IT - UDBL
Ruslan Fomkin
Uppsala DataBase Laboratory (UDBL)
▪ Supervisor
• Prof. T. Risch
▪ Database research
• How to build extensible middleware query processing that allows scalable, application-oriented search over different kinds of wrapped information sources
▪ http://www.it.uu.se/research/group/udbl/
January 20, 2006
NGN workshop
Uppsala
AMOS II
[Figure: AMOS II architecture. Labels: Applications (Simulation, Visualization, Analysis); Queries and views; Virtual Mediator Database with Plug-ins; Queries and Continuous Queries; Wrappers; Data sources (Relational Databases, Patient Monitoring, GRID hist., Measurements).]
Ongoing Research at UDBL
▪ Mediating Web Services: Manivasakan Sabesan, BSc
▪ Stream Queries on BlueGene: Erik Zeitler, MSc
▪ Semantic Web Queries to Hidden Web: Johan Petrini, MSc
▪ Stream Data Manager: Milena Ivanova, PhD
▪ FEM Databases: Kjell Orsborn, PhD
▪ Expensive GRID Queries: Ruslan Fomkin, MSc
Outline
▪ Introduction
▪ The project
▪ Test application
▪ Developed framework
▪ Conclusion
▪ Future work
Scientific Applications, Grid and Databases
▪ A lot of scientific data
• Complex structure
• Stored in files distributed in the Grid
▪ Scientific analyses can be represented as declarative queries
• Complex queries with numerical computations
• Long-running or batch queries
▪ Utilization of the computational resources of the Grid
Parallel Object Query System for Expensive Computations (POQSEC)
▪ Query processor for scientific applications
• high-level interface to specify the analyses
• automatically generates execution plans and evaluates them
▪ Requirements
• Scalable, efficient, flexible, transparent
▪ Properties
• Distributed and parallel
Layered Architecture of the System
▪ POQSEC provides
• scientific query management
▪ Grid provides (NorduGrid middleware)
• computation management
• file management
▪ Application area provides (ROOT library)
• computational libraries
• data management libraries
[Figure: layered architecture. Labels: User; POQSEC; Application libraries (ROOT); Grid (NorduGrid); Data; Clusters.]
Our Test Application
▪ From particle physics
▪ Analysis of collision events for the presence of Higgs particles
▪ Data produced by ATLAS simulation software
• stored in files
• distributed in the Grid (e.g. NorduGrid)
• managed by the ROOT library
Object-Relational Schema of the Application Data
[Figure: schema diagram. Event (PxMiss, PyMiss) has a 1:n relationship "particles" to Particle (Px, Py, Pz, Kf, Ee). Inheritance: Lepton and Jet are subtypes of Particle; Muon and Electron are subtypes of Lepton.]
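The schema on this slide can be mirrored in ordinary classes. The following is a minimal Python sketch, not the POQSEC schema definition itself. The class and attribute names come from the diagram labels; the reading of Px, Py, Pz as momentum components, Ee as energy and Kf as a signed particle code, and the placement of PxMiss and PyMiss on Event, are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Particle:
    # Attribute names follow the diagram; their physical meaning is assumed.
    Px: float
    Py: float
    Pz: float
    Kf: int     # assumed: signed particle code, sign flips for the antiparticle
    Ee: float   # assumed: energy


@dataclass
class Lepton(Particle):
    pass


@dataclass
class Muon(Lepton):
    pass


@dataclass
class Electron(Lepton):
    pass


@dataclass
class Jet(Particle):
    pass


@dataclass
class Event:
    PxMiss: float
    PyMiss: float
    # The 1:n "particles" relationship from the diagram.
    particles: List[Particle] = field(default_factory=list)


def leptons(ev: Event) -> List[Lepton]:
    """Return the particles of an event that are leptons, including subtypes."""
    return [p for p in ev.particles if isinstance(p, Lepton)]
```

The helper `leptons` is hypothetical; it only illustrates how the inheritance hierarchy lets a query range over one subtype of the particles of an event.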
General Query of the Analysis
▪ Selection of those events that satisfy predicates containing numerical operations
SELECT ev FROM Event ev
WHERE jetvetocut(ev) AND zvetocut(ev)
  AND topcut(ev) AND misseecuts(ev)
  AND leptoncuts(ev) AND threeleptoncut(ev);
▪ Each predicate is called a cut in the application area
▪ Predicates are defined as queries
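The general query is a conjunction of cuts applied to every event. As a rough illustration outside the query language, the same selection can be sketched in Python. The event representation, the two toy cuts and their thresholds are invented for this example and do not correspond to the actual cut definitions.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]          # stand-in event representation (assumption)
Cut = Callable[[Event], bool]


def select_events(events: Iterable[Event], cuts: List[Cut]) -> List[Event]:
    """Keep the events for which every cut holds (the AND of all predicates)."""
    return [ev for ev in events if all(cut(ev) for cut in cuts)]


# Two toy cuts; the thresholds are invented for illustration only.
def toy_missing_energy_cut(ev: Event) -> bool:
    return ev["missing_energy"] >= 40.0


def toy_three_lepton_cut(ev: Event) -> bool:
    return ev["n_leptons"] == 3


events = [
    {"missing_energy": 55.0, "n_leptons": 3},
    {"missing_energy": 10.0, "n_leptons": 3},
    {"missing_energy": 60.0, "n_leptons": 2},
]
selected = select_events(events, [toy_missing_energy_cut, toy_three_lepton_cut])
```

Because `all` stops at the first failing cut, the order of the list affects the cost of the selection but not its result.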
Example of a predicate: Z-veto cut
▪ Either the event does not have a pair of oppositely charged leptons
▪ or the invariant mass of the pair is not close to the mass of a Z particle
CREATE FUNCTION zvetocut(Event ev) -> Event AS
  SELECT ev
  WHERE NOTANY(oppositeLeptons(ev)) OR
        abs(invMass(oppositeLeptons(ev)) - zMass) >= minZMass;
CREATE FUNCTION oppositeLeptons(Event ev) -> bag of <Lepton, Lepton> AS
  SELECT l1, l2 FROM Lepton l1, Lepton l2
  WHERE l1 = particles(ev) AND l2 = particles(ev) AND
        Kf(l1) = -Kf(l2);
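To make the numerical part of this cut concrete, here is a hedged Python sketch of the same predicate. The invariant mass uses the standard formula sqrt(E^2 - |p|^2) on the summed four-momenta. The value of MIN_Z_MASS is hypothetical, since the slide gives none. The OR branch is read as "some opposite pair is far from the Z mass"; that reading is an assumption, and a stricter veto that requires every pair to be far from the Z mass would replace `any` with `all`.

```python
import math
from itertools import combinations
from typing import List, NamedTuple, Tuple

Z_MASS = 91.19       # GeV, approximate Z boson mass
MIN_Z_MASS = 10.0    # GeV, hypothetical threshold; the slide gives no value


class Lepton(NamedTuple):
    px: float
    py: float
    pz: float
    e: float
    kf: int   # signed particle code


def opposite_leptons(leptons: List[Lepton]) -> List[Tuple[Lepton, Lepton]]:
    """Pairs whose codes are equal in magnitude and opposite in sign."""
    return [(a, b) for a, b in combinations(leptons, 2) if a.kf == -b.kf]


def inv_mass(a: Lepton, b: Lepton) -> float:
    """Invariant mass of a pair: sqrt(E^2 - |p|^2) of the summed four-momenta."""
    e = a.e + b.e
    px, py, pz = a.px + b.px, a.py + b.py, a.pz + b.pz
    # Clamp at zero to guard against small negative values from rounding.
    return math.sqrt(max(e * e - (px * px + py * py + pz * pz), 0.0))


def z_veto_cut(leptons: List[Lepton]) -> bool:
    """True if there is no opposite pair, or some pair is far from the Z mass."""
    pairs = opposite_leptons(leptons)
    return not pairs or any(
        abs(inv_mass(a, b) - Z_MASS) >= MIN_Z_MASS for a, b in pairs
    )
```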
Current Framework
▪ Basic tool for utilizing NorduGrid through the Advanced Resource Connector (ARC)
▪ Submission mechanism
• submit a query
• parallelize the query into several subqueries
• generate job scripts (one per subquery)
▪ Babysitter functionality
▪ Data exchange mechanism through files
Client and Coordinator Part
▪ POQSEC client
• personal database with application schema
• ROOT wrapper
▪ Coordinator server
• receives queries
• creates jobs
▪ Grid Meta-Database
• computational resources
• data files
▪ Submission Database
• received submissions
• created jobs
▪ Babysitter
• interactions with ARC
[Figure: Grid Client Node. Labels: POQSEC Client; Coordinator server; Query Coordinator; Job queue; Grid Meta-Database; Submission Database; Babysitter; ARC Client; Local Storage.]
Query Submission
▪ Query submission contains
• query
• file name selection
• degree of parallelism
• CPU time for each job
▪ Coordinator server creates jobs with
• the same query
• data partitions of equal size
• the same CPU time, as provided by the user
• corresponding job script files
▪ The submission and its jobs are saved in the Submission Database
▪ Created jobs are added to the Job queue
▪ Script files are saved to Local Storage
[Figure: Grid Client Node. Labels: POQSEC Client; Coordinator server; Query Coordinator; Job queue; Grid Meta-Database; Submission Database; Babysitter; ARC Client; Local Storage.]
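The job creation step on this slide can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the coordinator's actual code: "equal size" is taken to mean an equal number of files (differing by at most one when the count does not divide evenly), and the `Job` record and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    query: str
    files: List[str]
    cpu_minutes: int
    status: str = "created"


def create_jobs(query: str, files: List[str], parallelism: int,
                cpu_minutes: int) -> List[Job]:
    """Create one job per partition; partition sizes differ by at most one file."""
    if parallelism < 1:
        raise ValueError("degree of parallelism must be at least 1")
    # Never create more jobs than there are files, but always at least one.
    n = min(parallelism, len(files)) or 1
    base, extra = divmod(len(files), n)
    jobs: List[Job] = []
    start = 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        # Every job carries the same query and the same requested CPU time.
        jobs.append(Job(query, files[start:start + size], cpu_minutes))
        start += size
    return jobs
```

For example, ten files with a degree of parallelism of four yield partitions of three, three, two and two files.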
Jobs Submission
▪ Babysitter
• takes jobs from the Job queue
• submits each job to the ARC client
• changes the status of submitted jobs in the Submission Database
▪ ARC client
• finds a Computing Element (CE)
• submits the job to the corresponding ARC Grid Manager
[Figure: Grid Client Node (POQSEC Client, Coordinator server, Query Coordinator, Job queue, Grid Meta-Database, Submission Database, Babysitter, ARC Client, Local Storage) connected to two CEs, each with an ARC Grid Manager.]
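The babysitter's submission step (take a job from the queue, hand it to the ARC client, record the new status) might look roughly like the Python sketch below. The `ArcClient` protocol, the `FakeArcClient` class, the status string and the dictionary used as a submission database are all hypothetical stand-ins; none of them is the real ARC client interface.

```python
from collections import deque
from typing import Deque, Dict, Protocol


class ArcClient(Protocol):
    """Hypothetical stand-in; this is not the real ARC client interface."""

    def submit(self, script_path: str) -> str: ...


class FakeArcClient:
    """In-memory substitute that hands out made-up job identifiers."""

    def __init__(self) -> None:
        self.counter = 0

    def submit(self, script_path: str) -> str:
        self.counter += 1
        return f"grid-job-{self.counter}"


def submit_queued_jobs(queue: Deque[str], arc: ArcClient,
                       submission_db: Dict[str, Dict[str, str]]) -> None:
    """Drain the queue, submit each job script, and record the new status."""
    while queue:
        script_path = queue.popleft()
        grid_id = arc.submit(script_path)
        submission_db[script_path] = {"status": "submitted", "grid_id": grid_id}
```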
Job Execution
▪ ARC Grid Manager
• downloads input files
• submits the job to the Local Batch System (LBS)
▪ After some delay, the LBS starts the Executor on an allocated CE node
▪ During execution, the Executor
• executes the given subquery
• accesses data through the ROOT wrapper
• saves the result to files on CE Storage
[Figure: job execution. Labels: SE (two); CE with ARC Grid Manager, CE Storage and LBS Queue; CE node with Executor and wrapper.]
Downloading Result
▪ Babysitter
• polls the ARC client for job statuses
• requests download of results for finished jobs
▪ Results are downloaded to Local Storage
▪ The user can retrieve the result when all jobs are ready
[Figure: Grid Client Node (POQSEC Client, Coordinator server, Query Coordinator, Job queue, Grid Meta-Database, Submission Database, Babysitter, ARC Client, Local Storage) connected to two CEs, each with an ARC Grid Manager and CE Storage.]
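The polling behaviour described on this slide can be sketched as a simple loop. This is an illustrative Python sketch only: the status strings "finished" and "failed", the callback signatures and the polling interval are assumptions, not ARC's actual status vocabulary or interface.

```python
import time
from typing import Callable, Dict, Iterable


def babysit(job_ids: Iterable[str],
            get_status: Callable[[str], str],
            download: Callable[[str], None],
            poll_seconds: float = 60.0,
            sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Poll until every job is finished or failed; download each finished job once."""
    final: Dict[str, str] = {}
    pending = list(job_ids)
    while pending:
        still_pending = []
        for job_id in pending:
            status = get_status(job_id)
            if status == "finished":
                download(job_id)          # fetch results to local storage
                final[job_id] = status
            elif status == "failed":
                final[job_id] = status    # no results to download
            else:
                still_pending.append(job_id)
        pending = still_pending
        if pending:
            sleep(poll_seconds)           # wait before the next polling round
    return final
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; the result for the whole submission is available once the returned dictionary covers every job.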
Conclusion
▪ We provide
• a declarative query interface for representing scientific queries
• parallel query execution in the Grid (generating scripts)
• a babysitter to keep track of job execution
▪ Query parallelization is important

                     Standalone desktop   Grid, one job   Grid, four jobs
Response time        190 min              225 min         24 min
Requested CPU time   -                    200 min         20 min
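The response times reported on this slide can be compared with a few lines of arithmetic. The Python snippet below only restates the three reported numbers and divides them; the dictionary keys and the helper name are chosen for this example.

```python
# Response times in minutes, as reported on the slide.
response_minutes = {
    "standalone desktop": 190,
    "grid, one job": 225,
    "grid, four jobs": 24,
}


def ratio(slow: str, fast: str) -> float:
    """How many times shorter the `fast` response time is than the `slow` one."""
    return response_minutes[slow] / response_minutes[fast]


vs_desktop = ratio("standalone desktop", "grid, four jobs")
vs_one_job = ratio("grid, one job", "grid, four jobs")
# Extra response time of a single Grid job compared with the desktop run.
one_job_overhead = (response_minutes["grid, one job"]
                    - response_minutes["standalone desktop"])
```

With these numbers the four-job run responds about 7.9 times faster than the standalone desktop and about 9.4 times faster than the single Grid job, while the single Grid job takes 35 minutes longer than the desktop run.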
Future work
▪ Estimating query execution time
▪ Dealing with underestimation of execution time
▪ Automatically deciding on the degree of parallelism and on resource brokering
• adaptive
• based on current load and job statistics
▪ Dealing with failures in the Grid
▪ POOL wrapper
Thank you for your attention! Questions?