WMS_Area_Report_20110209_1


OSG
Area Coordinator’s Report:
Workload Management
February 9th, 2011
Maxim Potekhin
BNL
631-344-3621
[email protected]
Workload Management: Panda
• Panda Monitoring:

Closer integration of the existing Panda Monitoring System with the Global Dashboard

Upgrade lowered in priority due to existing functionality in the Dashboard (ATLAS decision)
• Scalability of Panda:

Typical throughput almost doubled in the past 12 months, from about 250k jobs run globally per day
to almost 500k per day, with a peak count of 713k in the final days of data reprocessing in Nov’10
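
To put the daily counts above in terms of sustained load, the short sketch below converts them to average jobs per second. It assumes jobs are spread evenly over 24 hours, which the report does not state; real load is likely burstier. The function and label names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def jobs_per_second(jobs_per_day: int) -> float:
    """Average rate implied by a daily job count (assumes an even spread over the day)."""
    return jobs_per_day / SECONDS_PER_DAY


# Daily job counts quoted in the report; the labels are ours.
daily_counts = {
    "typical, about a year ago": 250_000,
    "typical, now": 500_000,
    "peak, Nov'10": 713_000,
}

rates = {label: jobs_per_second(count) for label, count in daily_counts.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f} jobs/s")
```

Every job also generates several state updates and monitoring reads, so the actual database request rate is a multiple of these figures.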

That puts more pressure on the database (Oracle), which is used for keeping complete state of
the system, monitoring and data mining for performance analysis
▪ Data is heavily indexed, and indexes can block while data is copied across tables
▪ The DB engine sometimes makes suboptimal choices when confronted with multiple indexes

In the fall of 2010, there were a few problem days after a series of network outages:
▪ the resulting imbalance of data distribution across tables left a large backlog to be copied, hence
decreased performance

Multiple DB optimizations have been implemented since, notably table partitions
▪ Demonstrated increase in performance
▪ Some queries are still problematic and require workarounds
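
As a rough illustration of why table partitions can help here, the sketch below models a job table that is range-partitioned by day. This is a toy in-memory model, not the Panda schema or Oracle's implementation: the class, its methods, and the choice of a per-day partition key are all assumptions made for the example. It shows two effects: a date-range query reads only the matching partitions, and a whole day of finished jobs can be detached in one step instead of being copied row by row.

```python
from collections import defaultdict
from datetime import date, timedelta


class PartitionedJobTable:
    """Toy table range-partitioned by day (hypothetical; not the Panda schema)."""

    def __init__(self):
        self._partitions = defaultdict(list)  # day -> rows stored for that day

    def insert(self, job_id: int, day: date, status: str) -> None:
        self._partitions[day].append({"job_id": job_id, "day": day, "status": status})

    def select(self, first_day: date, last_day: date):
        """Return rows in [first_day, last_day] and the number of partitions read."""
        wanted = [d for d in self._partitions if first_day <= d <= last_day]
        rows = [row for d in sorted(wanted) for row in self._partitions[d]]
        return rows, len(wanted)

    def detach(self, day: date):
        """Remove a whole day's partition at once, e.g. to move it to an archive."""
        return self._partitions.pop(day, [])


# Fill 30 daily partitions with 100 finished jobs each (made-up data).
table = PartitionedJobTable()
start = date(2010, 11, 1)
for offset in range(30):
    for n in range(100):
        table.insert(offset * 100 + n, start + timedelta(days=offset), "finished")

# A two-day query touches 2 of the 30 partitions.
rows, partitions_read = table.select(date(2010, 11, 29), date(2010, 11, 30))

# Archiving the oldest day is a single partition-level operation.
archived = table.detach(date(2010, 11, 1))
```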
• Scalability of Panda, cont’d:

Along with DB optimization, alternatives are being considered for storage of finalized job data
(the archive), where Oracle is redundant. NoSQL solutions such as Cassandra and HBase are of
particular interest

Advantages of NoSQL systems such as Cassandra:
▪ Compared to a traditional RDBMS, more cost-effective horizontal scaling with
commodity hardware and media
▪ Load-balanced, redundant, truly distributed system
▪ Extremely fast data ingestion (writes) with proper configuration (important)
▪ Demonstrated performance of NoSQL solutions in industry (Amazon, Facebook, Twitter,
Google, etc.)
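
The load balancing and redundancy mentioned above are commonly achieved with consistent hashing, where each key is mapped to a position on a ring of node tokens and stored on the next few distinct nodes. The sketch below is a minimal toy version of that idea, not Cassandra's actual implementation: the `HashRing` class, the node names, the replica count of 2, and the use of 64 virtual tokens per node are all assumptions chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: each key is stored on `replicas` distinct nodes."""

    def __init__(self, nodes, replicas=2, vnodes=64):
        self.replicas = replicas
        self.nodes = set(nodes)
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (self._token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _token(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def owners(self, key: str):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_right(self._tokens, self._token(key))
        needed = min(self.replicas, len(self.nodes))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == needed:
                break
        return found


# Three nodes, as in a small test cluster; the key names are made up.
ring = HashRing(["node-a", "node-b", "node-c"])
placement = {f"job-{i}": ring.owners(f"job-{i}") for i in range(1000)}

# Adding a node moves only the keys that the new node takes over.
bigger = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = [k for k in placement if bigger.owners(k)[0] != placement[k][0]]
```

The point of the design is that growing the cluster relocates only a fraction of the keys, all of them to the new node, which is what makes adding commodity hardware cheap.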

In December 2010, an evaluation of Cassandra with a real Panda job data feed was started
▪ Test cluster (3 nodes) located at CERN
▪ Data repository at Amazon S3
▪ First round of testing is encouraging; data design is ongoing
▪ To be evaluated at the ATLAS Software Week at CERN in April
Workload Management: Engagement
• CHARMM:

Thanks to the 17+ active sites used, the recent run went quickly, according to the team

Resource requirements turned out to be pretty precise (encouraging)

The last wave of jobs is finishing right now, and the data goes to the experimental group; only 408
jobs were submitted in the past month
• LBNE/Daya Bay:

Jobs ran at PDSF and BNL (J. Caballero); a number of issues were discovered and resolved, such as:
▪ Peculiarities of WN configuration at PDSF (version of curl)
▪ Suboptimal job configuration resulted in some jobs running out of memory, which is now fixed
▪ Additional software optimization was done by the researchers (MC)

An announcement went out on the Daya Bay mailing list that the initial production run will start in
a few days

An additional cluster at IIT (Illinois) is under construction

Panda user documentation is being reviewed at the researchers’ request