Transcript ppt

Grid in action: from EasyGrid
to LCG testbed and
gridification techniques.
James Cunha Werner
University of Manchester
Christmas Meeting - 2005
Going to the grid
Conventional way:
• Usual code (your cuts)
• Run BetaMiniApp on several data files, one after the other.
• When all the data is done, you have results!
Grid way:
• Same usual code (your cuts)
• Run several copies of BetaMiniApp, each running independently on one data file.
• At the end, join all the results!
EasyGrid does it for you!
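The grid way is a scatter/gather pattern: identical, independent jobs (one per data file) whose partial results are joined at the end. Below is a minimal local sketch of that idea, assuming a hypothetical `analyse_file` function that stands in for one BetaMiniApp run with the user's cuts (here it only counts events above an illustrative cut). It uses a thread pool on one machine instead of grid submission, so it shows the pattern, not how EasyGrid submits jobs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

def analyse_file(events: List[float]) -> Dict[str, int]:
    """Hypothetical stand-in for one BetaMiniApp job on one data file:
    apply the user's cut and return partial counts."""
    passed = sum(1 for energy in events if energy > 1.0)  # illustrative cut
    return {"events": len(events), "passed": passed}

def join_results(partials: List[Dict[str, int]]) -> Dict[str, int]:
    """Join the independent per-file results once every job has finished."""
    total = {"events": 0, "passed": 0}
    for part in partials:
        for key in total:
            total[key] += part[key]
    return total

def grid_way(data_files: List[List[float]], max_jobs: int = 4) -> Dict[str, int]:
    # One independent job per data file; no job depends on another.
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        partials = list(pool.map(analyse_file, data_files))
    return join_results(partials)

def conventional_way(data_files: List[List[float]]) -> Dict[str, int]:
    # Same code, run over the files one after the other.
    return join_results([analyse_file(f) for f in data_files])

files = [[0.5, 1.5, 2.0], [3.0, 0.2], [1.1]]
print(grid_way(files))  # {'events': 6, 'passed': 4}
```

Because the per-file jobs share nothing, both ways give the same answer; only the wall-clock time differs.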
General overview
[Diagram: Users’ software; EasyGrid for datasets; EasyTau for selected events; Gridification algorithms for generic software; Grid testbed]
EasyGrid: an overview
• Prototype for future development.
RPA = guarantee of useful software
• Provides full support for the job submission system:
– Recovers results into the user’s directory.
– Generates reports for further analysis (aborts and abends) in a single history file.
• It is a framework that users can adapt to their own needs and applications.
• Fully operational and integrated with LCG.
./easygrid dataset_name
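To make the "one history file" idea concrete, here is a hedged sketch of how per-job outcomes could be logged and summarised. The tab-separated line format, the status strings (`done`, `aborted`, `abend`) and the functions `record_job` and `report` are hypothetical illustrations, not EasyGrid's actual file format or code.

```python
import os
import tempfile
from collections import Counter
from typing import Dict, List, Tuple

def record_job(history: str, job_id: str, dataset: str, status: str) -> None:
    """Append one job outcome to the history file (hypothetical format)."""
    with open(history, "a", encoding="utf-8") as fh:
        fh.write(f"{job_id}\t{dataset}\t{status}\n")

def report(history: str) -> Tuple[Dict[str, int], List[str]]:
    """Count outcomes and list datasets whose latest job did not finish cleanly."""
    counts: Counter = Counter()
    latest: Dict[str, str] = {}
    with open(history, encoding="utf-8") as fh:
        for line in fh:
            _, dataset, status = line.rstrip("\n").split("\t")
            counts[status] += 1
            latest[dataset] = status  # a later resubmission overrides an earlier failure
    to_resubmit = sorted(d for d, s in latest.items() if s != "done")
    return dict(counts), to_resubmit

with tempfile.TemporaryDirectory() as tmp:
    hist = os.path.join(tmp, "history.txt")
    record_job(hist, "job1", "dataset_a", "done")
    record_job(hist, "job2", "dataset_b", "aborted")
    record_job(hist, "job3", "dataset_c", "abend")
    record_job(hist, "job4", "dataset_b", "done")  # resubmitted job succeeded
    print(report(hist))
```

Keeping every outcome in one append-only file means the summary can always be rebuilt after the fact, which is what makes later analysis of aborts and abends possible.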
Christmas 2004: My goals were…
• Develop a fail-proof submission system.
• Write web pages covering all the elementary tasks in HEP/BaBar, to help students and newcomers.
• Understand the q-qbar interaction through the Pi0.
What I have achieved in 2005…
Achievements with EasyGrid
• User-friendly framework, flexible and reliable. It provides users with results, or with the information needed for further analysis.
• Tutorial web pages for PhD students and new researchers.
http://www.hep.man.ac.uk/u/jamwer
• Pi0 Project: analysis of 500 million events and generation of 5 million Monte Carlo events in 5 weeks.
http://www.hep.man.ac.uk/u/jamwer/pi0alg5.html
• Anti-deuteron project: 1,500 million events in 1 week, running at several sites in the UK, with more than 200 jobs in parallel.
http://www.hep.man.ac.uk/u/jamwer/deutdesc.html
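A quick back-of-the-envelope check of what these figures mean as average rates. It assumes processing was spread evenly over the whole calendar period and, for the per-job figure, exactly 200 jobs running throughout (the slide says "more than 200", so the real per-job rate would be lower). The Pi0 rate covers the 500 million analysed events only, not the Monte Carlo generation.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 s

def events_per_second(events: float, weeks: float) -> float:
    """Average rate over the whole period, assuming continuous processing."""
    return events / (weeks * SECONDS_PER_WEEK)

pi0_rate = events_per_second(500e6, 5)       # Pi0 data analysis only
antid_rate = events_per_second(1_500e6, 1)   # anti-deuteron project
per_job = antid_rate / 200                   # assumes exactly 200 parallel jobs

print(f"Pi0: {pi0_rate:.0f} events/s")                 # Pi0: 165 events/s
print(f"Anti-deuteron: {antid_rate:.0f} events/s")     # Anti-deuteron: 2480 events/s
print(f"Per job (200 jobs): {per_job:.1f} events/s")   # Per job (200 jobs): 12.4 events/s
```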
LCG Installation and debug
• There are several problems in the LCG grid:
– a high number of jobs fail when more than 200 jobs are running.
– installation issues.
– performance issues.
• Installation of a complete testbed from
scratch using 10 obsolete computers:
http://www.hep.man.ac.uk/u/jamwer/#sec0
Testbed stress test
Processing time is effectively zero: BetaMiniApp is replaced by a program that prints the dataset name and waits for a fixed time (e.g. 300 s).
Each test submits 1,000 jobs to the 6-WN testbed.
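A minimal sketch of such a stress-test payload, assuming Python (the slides do not say what language the original program used); `dummy_job` and its `wait_s` parameter are hypothetical names. It does no physics work, so any failures or delays seen in the test come from the grid middleware and batch system rather than from the application.

```python
import time

def dummy_job(dataset: str, wait_s: float = 300.0) -> str:
    """Stress-test payload: report the dataset name, then hold the
    worker node for wait_s seconds without doing any processing."""
    message = f"dataset: {dataset}"
    print(message, flush=True)  # ends up in the job's standard output
    time.sleep(wait_s)          # stands in for the job's running time
    return message
```

On a worker node this would be called once per submitted job, e.g. `dummy_job(dataset_name, 300)`.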
Number of jobs per WN (1,000 jobs submitted in each test):

              T0     T1       T2
Sub Fail       0      0        0
Aborts (1)    84    122        0
Bf33         296    144        6
Bf34         306    148      161
Bf35         314    156      195
Bf36           0    165      211
Bf37           0    172      213
Bf38           0     91 (2)  214

• T0 and T1: time between submissions is zero (continuous flow).
• T0: WNs bf36, bf37 and bf38 were running without pbs_mom started.
• T1: 1 WN crashed during the test (2).
• T2: time between submissions: 30 s.
CE (bf32) CPU use was >90%.
(1) Cannot plan: BrokerHelper: no compatible resources
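The counts in the table above can be cross-checked: in each test, jobs that ran on a WN plus aborted jobs should add up to the 1,000 submitted. The sketch below transcribes the table (the Sub Fail row is 0 in every test, so it is left out) and computes the abort rate and any shortfall. In T1 the counts add up to 998; the slides do not explain the two-job difference.

```python
SUBMITTED = 1000  # jobs submitted in each test

# Per-test counts from the table: aborts, then jobs run on bf33..bf38.
TABLE = {
    "T0": {"aborts": 84, "per_wn": [296, 306, 314, 0, 0, 0]},
    "T1": {"aborts": 122, "per_wn": [144, 148, 156, 165, 172, 91]},
    "T2": {"aborts": 0, "per_wn": [6, 161, 195, 211, 213, 214]},
}

def summarise(test: str) -> dict:
    """Jobs that reached a WN, abort rate, and jobs not accounted for."""
    row = TABLE[test]
    ran = sum(row["per_wn"])
    return {
        "ran": ran,
        "abort_rate": row["aborts"] / SUBMITTED,
        "unaccounted": SUBMITTED - ran - row["aborts"],
    }

for name in TABLE:
    s = summarise(name)
    print(f"{name}: ran={s['ran']} abort_rate={s['abort_rate']:.1%} "
          f"unaccounted={s['unaccounted']}")
```

Read this way, the two continuous-flow tests (T0, T1) lose 8.4% and 12.2% of jobs to aborts, while the test with a 30 s gap between submissions (T2) loses none.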
Recommendations
CEs are heavily loaded in the grid (>90% CPU load!), and this affects grid performance:
• The number of WNs for each CE can be determined from the minimum submission delay and the minimum queue time.
• Running a single CE for a large farm is a limiting factor. More matched CEs per RB would reduce failures and increase performance.
• A file system study will provide more information soon.
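One possible reading of the first recommendation, offered as an assumption and not as the author's stated method, is Little's law (L = λW). If a CE can accept a new job at most once every *d* seconds (minimum submission delay) and each job holds a WN for at least *q* seconds (minimum queue time), then about q/d jobs are in flight at once, and WNs beyond that number stay idle. The function name `max_useful_wns` is hypothetical. The example reuses the 30 s submission gap of test T2 and the 300 s wait of the dummy job.

```python
import math

def max_useful_wns(submission_delay_s: float, queue_time_s: float) -> int:
    """Little's law: jobs arrive at rate 1/submission_delay_s and each
    occupies a WN for queue_time_s, so about queue_time_s / submission_delay_s
    jobs run concurrently; more WNs than that per CE would sit idle."""
    if submission_delay_s <= 0:
        raise ValueError("submission delay must be positive")
    return math.ceil(queue_time_s / submission_delay_s)

print(max_useful_wns(30, 300))  # 10
```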
Research in Gridification technologies for
conventional software
• Users spend years developing their source code, and they will not throw it away just to use web services.
• I developed an algorithm that will allow users to use their own software on top of a web service layer with LCG middleware.
• Preliminary tests using “fake” web services (simulated with PVM) show that it is a viable and flexible approach.
Gridification algorithm
• Creates parallel processes using PVM, with ssh as the remote shell.
• A central job distributes tasks over the parallel processes as the slave processes return results. No need for load balancing!
• Handles slave failures and resubmits tasks to available slaves. There is no checkpoint system (not worth it).
• Transfer time can be a bottleneck, so task streams were implemented. Results with 300 empty processes on one laptop show a transfer time of 185 ms/process.
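The master/slave scheme described above can be illustrated locally. The slides use PVM processes over ssh; the sketch below swaps in Python threads and a shared queue, so it shows the scheduling idea, not the author's PVM code. `run_master`, `flaky_square` and the `max_attempts` retry limit are hypothetical. Each slave takes the next task as soon as it has returned its previous result, so faster slaves simply do more work and no explicit load balancing is needed. A failed task goes back into the queue for any available slave, and there is no checkpointing.

```python
import queue
import threading
from typing import Callable, Dict, List, Tuple

def run_master(tasks: List[int], work: Callable[[int], int], n_slaves: int = 4,
               max_attempts: int = 3) -> Tuple[Dict[int, int], List[int]]:
    """Hand tasks to slaves on demand; resubmit failed tasks up to max_attempts."""
    pending: queue.Queue = queue.Queue()
    for task in tasks:
        pending.put((task, 1))               # (task, attempt number)
    results: Dict[int, int] = {}
    gave_up: List[int] = []
    lock = threading.Lock()

    def slave() -> None:
        while True:
            try:
                task, attempt = pending.get_nowait()
            except queue.Empty:
                return                       # nothing left to hand out
            try:
                value = work(task)
            except Exception:
                if attempt < max_attempts:
                    pending.put((task, attempt + 1))  # resubmit to any free slave
                else:
                    with lock:
                        gave_up.append(task)
            else:
                with lock:
                    results[task] = value

    slaves = [threading.Thread(target=slave) for _ in range(n_slaves)]
    for s in slaves:
        s.start()
    for s in slaves:
        s.join()
    return results, sorted(gave_up)

# Usage: tasks 3 and 7 fail once (simulated slave crash); task 5 always fails.
seen = set()
seen_lock = threading.Lock()

def flaky_square(x: int) -> int:
    with seen_lock:
        first_try = x not in seen
        seen.add(x)
    if x == 5 or (x in (3, 7) and first_try):
        raise RuntimeError("simulated slave failure")
    return x * x

results, gave_up = run_master(list(range(10)), flaky_square)
print(sorted(results))  # [0, 1, 2, 3, 4, 6, 7, 8, 9]
print(gave_up)          # [5]
```

A slave that re-queues a failed task loops back to the queue itself, so a resubmitted task is never stranded even if every other slave has already finished.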
Conclusion
• EasyGrid is operational. Benchmarks were a proof of concept under real conditions.
• The LCG testbed is operational, providing results and supporting performance analysis and tuning.
• The gridification algorithm is running on one laptop with Genetic Programming/AI.
New year resolutions
• Analysis of Linux kernel related file server issues.
• LCG performance study and Linux kernel tuning.
• Implementation of EasyTau: a submission module for the TauUser package using EasyGrid (running on ntuples).
• Gridification algorithm running with LCG and commercial applications (WebSphere, Tivoli, Symphony, etc.)
• EasyGrid Product development and startup.
• Run the pi0 project again with the EasyGrid Product and maybe … publish a paper about gridification!
Happy new year!