Transcript General07v4

Enabling Grids for E-sciencE
The Grid Observatory: goals and challenges
C. Germain-Renaud (CNRS/LRI & LAL)
EGEE’07 Conference
Budapest, Hungary
1-5 October 2007
www.eu-egee.org
EGEE-II INFSO-RI-031688
EGEE and gLite are registered trademarks
Overview
• NA4 cluster in EGEE-III proposal
• Integrate the collection of data on the
behaviour of the EGEE grid and users with the
development of models and of an ontology for
the domain knowledge
Application Track - Grid Observatory
Some immediate questions
• Scheduling
– Performance of the gLite scheduling hierarchy
– Published waiting time
– Reactive grids
• Dimensioning
– Patterns and trends in requests and usage
– Anticipate peaks
• On-line fault management
– Detection
– Diagnosis
The big picture
"Considering current technologies, we expect that the total number of
device administrators will exceed 220 million by 2010" (Gartner, June 2001)
Autonomic Computing
"Computing systems that manage themselves in accordance with
high-level objectives from humans." Kephart & Chess, The Vision of
Autonomic Computing, IEEE Computer, 2003
– Self-*: configuration, optimization, healing, protection
– Of open, non-steady-state dynamic systems
– Academia and industry involved
Autonomic Computing
[Figure: autonomic control loop with monitor, analyze, plan and execute stages around shared knowledge]
– Statistical analysis
– Data mining
– Machine learning
DATA REQUIRED
Data Collection and Publication
• Acquisition, consolidation, long-term conservation of
traces of EGEE activities
– Permanent storage of reliable, exhaustive, filtered information:
from operational to structured
– No monitoring development: rich ecosystem of sources, with
very different scopes, deployment and institutional status
  • CIC tools (GOCDB, SAM, SFT,…)
  • core gLite (L&B, BDII,…)
  • sites (Maui/PBS logs)
  • gLite integrators (R-GMA, Job Provenance)
  • experiment integrators (DashBoard)
  • external software (MonALISA)
• The major challenge is exhaustiveness
– Privacy-related legal constraints
– Scientific usage will help
– Interaction with EGI
Data Collection and Publication
• Publication service: navigation and querying
– Integration of independent sources: need a snapshot
of the inputs and grid state, e.g. workload and
available services, during a relevant time range
– Indexing according to the needs of the user communities
→ Ongoing work with CoreGrid
• Ontology
– The Glue Information Model: an ontology of the
resources
– Concepts for the grid dynamics, e.g. job lifecycle or
user relations
– Expert concepts as prior knowledge of non-trivial
correlations: workflows, failure modes,…
[Figure labels: Job, Resource]
Models
• Intrinsic characterizations of "grid traffic": distributions
of e.g. job arrival rate, running time, application data
locality
– Likely to be similar to IP traffic: many short jobs, and a significant
number of long ones, at all scales
– Long-range dependencies
• Characterizations of middleware-dependent metrics,
e.g. queuing delays, overhead, SE load
• Inference of models for middleware components and
applications, users and usage profiles, user
interactions
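As an illustration of what characterizing a "many short, a significant number of long" distribution could look like, here is a minimal Python sketch. It is not taken from the talk: the Hill estimator is one standard way to estimate a heavy-tail index, chosen here as an assumption, and the running times are synthetic Pareto samples rather than EGEE traces. The function name hill_tail_index and the parameter k are hypothetical.

```python
import math
import random


def hill_tail_index(samples, k):
    """Hill estimate of the tail index alpha, using the k largest samples.

    A small alpha (roughly below 2) indicates a heavy tail: a few very
    long jobs account for a large share of the total running time.
    """
    if not 0 < k < len(samples):
        raise ValueError("k must satisfy 0 < k < len(samples)")
    ordered = sorted(samples, reverse=True)
    threshold = ordered[k]  # the (k+1)-th largest value
    mean_log_excess = sum(math.log(x / threshold) for x in ordered[:k]) / k
    return 1.0 / mean_log_excess


# Synthetic stand-in for job running times (assumption: Pareto, alpha = 1.5).
rng = random.Random(0)
runtimes = [rng.paretovariate(1.5) for _ in range(20000)]

alpha_hat = hill_tail_index(runtimes, k=2000)
print(f"estimated tail index: {alpha_hat:.2f}")
```

On real traces the choice of k matters, and the estimate would normally be checked for stability over a range of k values.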
Autonomic dependability
• On-line failure detection and anticipation
• Passive vs. active probing: a lot of information
is available from user work
• Black-box
– On-line statistics from "similar" actions (executions,
data access, middleware modules)
Abrupt changepoint detection
• Page-Hinkley statistics
• Time-sequential version of Wald's statistics, also known
as CUSUM
• "Intelligent threshold" test that minimizes the expected
delay before a change is detected, for a fixed
false-positive rate
• State of the art in quality control, clinical trials
[Figure annotations: VO software bug, Blackhole]
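A minimal Python sketch of the Page-Hinkley test for an upward shift in the mean of a stream is given below. It uses the textbook formulation; the parameter values (delta, threshold) and the input stream, a per-job failure indicator that switches from all successes to all failures, are illustrative assumptions and not the settings used in the talk.

```python
def page_hinkley(stream, delta=0.05, threshold=5.0):
    """Return the 0-based index of the first alarm, or None if no alarm.

    delta     : tolerated drift, subtracted from each deviation
    threshold : alarm level (lambda); higher means fewer false alarms
                but a longer detection delay
    """
    mean = 0.0      # running mean of the observations so far
    cum = 0.0       # cumulative sum of (x - mean - delta)
    cum_min = 0.0   # smallest cumulative sum seen so far
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t
        cum += x - mean - delta
        cum_min = min(cum_min, cum)
        if cum - cum_min > threshold:
            return t - 1
    return None


# Illustrative stream: 1.0 means a failed job, 0.0 a successful one.
# Jobs succeed until index 200, then every job fails (for example a
# site that has turned into a black hole).
stream = [0.0] * 200 + [1.0] * 50
alarm = page_hinkley(stream)
print("alarm raised at job index:", alarm)
```

Raising threshold trades detection delay for fewer false positives, which is the trade-off the "intelligent threshold" bullet refers to.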
Autonomic dependability
• Supervised and unsupervised learning
Mining the L&B logs
Constructive induction
Double clustering
Autonomic dependability
• Active probing
– Adaptive on-line test selection for best coverage of
possibly faulty components
– Experiment planning
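One simple way to think about selecting tests for best coverage of possibly faulty components is greedy set cover, sketched below in Python. This is an assumption made for illustration, not the method presented in the talk, and the probe names, component names and the select_probes function are all hypothetical.

```python
def select_probes(probes, suspects, budget):
    """Greedily pick up to `budget` probes covering the suspect components.

    probes   : dict mapping a probe name to the set of components it exercises
    suspects : set of components currently considered possibly faulty
    Returns (chosen probe names in order, suspects left uncovered).
    """
    uncovered = set(suspects)
    chosen = []
    while uncovered and len(chosen) < budget:
        best = max(
            (name for name in probes if name not in chosen),
            key=lambda name: len(probes[name] & uncovered),
            default=None,
        )
        if best is None or not probes[best] & uncovered:
            break  # no remaining probe adds coverage
        chosen.append(best)
        uncovered -= probes[best]
    return chosen, uncovered


# Hypothetical probes and the components each one exercises.
probes = {
    "submit_job_site_a": {"wms", "ce_a", "wn_a"},
    "copy_file_se_a": {"se_a", "lfc"},
    "submit_job_site_b": {"wms", "ce_b"},
    "query_bdii": {"bdii"},
}
suspects = {"wms", "ce_a", "se_a", "ce_b", "bdii"}

chosen, uncovered = select_probes(probes, suspects, budget=3)
print("probes to run:", chosen)
print("still uncovered:", uncovered)
```

An adaptive version would rerun the selection after each probe result, shrinking or growing the suspect set according to what the probe revealed.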
Conclusion
• The Grid Observatory targets
– Contributions to a quantitative approach to grid middleware and
architecture, in the RISC sense, through data collection,
publication and characterization
– Operational impacts on EGEE: evaluation, autonomic
dependability
– Basic research in autonomic computing
– Collaboration between EGEE and national research initiatives
and other EU projects: DEMAIN, PASCAL, KD-Ubiq, CoreGrid,
and hopefully more