Transcript Slides

Building a Virtual Scientific Computing Environment with OpenStack
Yaodong Cheng, CC-IHEP, CAS
[email protected]
ISGC 2015
Contents
- Requirements of scientific computing
- IHEP cloud platform
- Virtual machine types
- Virtual computing cluster
- DIRAC distributed computing
- Conclusion
Large science facilities
- IHEP: the largest fundamental research center in China
- IHEP serves as the backbone of China's large science facilities
  - Beijing Electron Positron Collider BEPCII/BESIII
  - Yangbajing Cosmic Ray Observatory: ASγ & ARGO
  - Daya Bay Neutrino Experiment
  - China Spallation Neutron Source (CSNS)
  - Hard X-ray Modulation Telescope (HXMT)
  - Accelerator-driven Sub-critical System (ADS)
  - Jiangmen Underground Neutrino Observatory (JUNO)
  - Under planning: BAPS, LHAASO, XTP, HERD, …
BEPCII/BESIII
- 36 institutions from China, US, Germany, Russia, Japan, …
- > 5 PB in the next 5 years
- ~5,000 CPU cores
- simulation, reconstruction, analysis, …
- long-term data preservation
- data sharing between partners
Other experiments
- Daya Bay Neutrino Experiment: ~200 TB per year
- JUNO (Jiangmen Underground Neutrino Observatory): ~500 TB per year
- LHAASO: 2 PB per year after 2017, accumulating 20+ PB in 10 years
- ATLAS and CMS Tier-2 site: 940 TB disk, 1,088 CPU cores
- CSNS, HXMT, …
In total: ~5 PB of new data per year!!
Computing resources status
- ~12,000 CPU cores
  - ~50 queues, managed by Torque/PBS
  - difficult to share
- ~5 PB disk
  - Lustre, Gluster, dCache/DPM, …
- ~5 PB LTO4 tape
  - two IBM 3584 tape libraries
  - modified CERN CASTOR 1.7
[Photos: tape libraries; PC farm built with blades]
In the future, …
- More HEP experiments: we will need to manage twice as many servers as today, or more
- but there is no possibility of a significant increase in staff numbers
- Is cloud a good solution? Is cloud suitable for scientific computing?
- Time to change the IT strategy!!
What is Cloud?
- NIST gives the standard definition
  (http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf)
  - Essential characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service
  - Service models: IaaS, PaaS, SaaS
  - Deployment models: public, private, hybrid
Is cloud beneficial to scientific computing?
Easy to Maintain
- Hardware: services become independent of the underlying physical machine
- Cloud services: a single set of services for managing access to computing resources
- Scientific platforms: become a separate layer deployed, controlled and managed by domain experts
Customized Environment
- Operating systems suited to your application
- Your applications preinstalled and preconfigured
- CPU, memory, and swap sized for your needs
Dynamic Provisioning
- New storage and compute resources in minutes (or less)
- Resources freed just as quickly to facilitate sharing
- Create temporary platforms for variable workloads
IHEPCloud: a private IaaS platform
- Launched in May 2014: http://cloud.ihep.ac.cn
- Three use scenarios
  - User self-service virtual machine platform: users create and destroy VMs on demand
  - Virtual computing cluster: jobs are allocated to a virtual queue automatically when the physical queue is busy
  - Distributed computing system: works as a cloud site; DIRAC calls the cloud interface to start or stop virtual worker nodes
IHEPCloud services
- Who can use it?
  - any user with an IHEP email account
- How many resources per user?
  - by default, each user gets 3 CPU cores and 15 GB memory (see the quota sketch below)
- VM types
  - testing machine: full root privilege, no public storage
  - UI node: AFS authentication, no root privilege, public storage, and none of the login-node limits on memory, CPU time, processes, …
- OS types
  - SL 5.5, SL 5.8, SL 6.5, SL 7 (64-bit), SL 6.5 (32-bit), Win7 (64-bit)
  - new types are added depending on user requirements
- VM IP address
  - an internal IP address (192.168.*.*) is allocated automatically
  - an external IP address (202.122.35.*) needs administrator approval
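Below is a minimal sketch of what this self-service workflow looks like through the OpenStack (Icehouse-era) novaclient Python API. Only the 3-core/15 GB default quota comes from the slide; the credentials, endpoint port, tenant ID, image and flavor names are illustrative assumptions.

```python
from novaclient import client

# Placeholders: user, password, tenant and Keystone endpoint are illustrative.
nova = client.Client("2", "admin", "secret", "admin",
                     "http://cloud.ihep.ac.cn:5000/v2.0")

# Default per-user allowance quoted on the slide: 3 CPU cores, 15 GB memory.
nova.quotas.update("user-tenant-id", cores=3, ram=15 * 1024)

# Boot a testing VM from a prebuilt image (image and flavor names assumed).
image = nova.images.find(name="SL65-x86_64")
flavor = nova.flavors.find(name="m1.small")
vm = nova.servers.create(name="test-vm", image=image, flavor=flavor)
print(vm.id, vm.status)
```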
Why does an end user need IHEPCloud?
- Virtual testing machine
  - develop programs or run tests
  - a VM is generated in a few minutes
  - log in to the VM via ssh, VNC or remote desktop
- Virtual UI node
  - debug programs in the computing environment
  - on the shared login nodes (lxslcxx.ihep.ac.cn) there are limits on memory, CPU time, user processes, …: if cputime > 45m && %CPU > 60%, the process is KILLed! You are also affected by other users.
  - a VM is owned by a single user and has no such limits (a sketch of the kill rule follows)
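To make the login-node limit concrete, here is a hypothetical sketch of that kill rule using the psutil library. The actual watchdog on the shared login nodes is not described in the talk, so apart from the two thresholds quoted above, everything here is illustrative.

```python
import psutil

CPU_TIME_LIMIT = 45 * 60       # "cputime > 45m", in seconds
CPU_PERCENT_LIMIT = 60.0       # "%CPU > 60%"

for p in psutil.process_iter():
    try:
        t = p.cpu_times()                       # accumulated user+system time
        busy = p.cpu_percent(interval=1.0)      # instantaneous CPU usage
        if (t.user + t.system) > CPU_TIME_LIMIT and busy > CPU_PERCENT_LIMIT:
            p.kill()   # the shared UI node kills it; a personal VM would not
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass           # process exited, or it belongs to another user
```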
Virtual computing cluster
- If a job queue is busy, its jobs can be allocated to a virtual queue
- plan to put the service into production this year
[Diagram: a Cloud Scheduler checks the load of the physical queue (junoq, 128 CPU cores), starts or stops VMs on IHEPCloud, and forwards jobs to the virtual queue; a sketch of this loop follows]
- Details: see Haibo's talk
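A rough sketch of the cloud-scheduler loop in the diagram, under stated assumptions: queue occupancy is read with Torque's qstat (the queue name junoq and its role come from the slide), and virtual worker nodes are started and stopped through novaclient. The worker-naming scheme, thresholds, credentials, image and flavor names are invented for illustration.

```python
import subprocess
import time

from novaclient import client

# Placeholders: credentials and endpoint are illustrative.
nova = client.Client("2", "scheduler", "secret", "cloud",
                     "http://cloud.ihep.ac.cn:5000/v2.0")

MAX_WORKERS = 32    # assumed ceiling for virtual worker nodes

def queued_jobs(queue="junoq"):
    """Count jobs waiting in state 'Q' in the physical Torque/PBS queue."""
    out = subprocess.check_output(["qstat", queue]).decode()
    # Default qstat format: Job ID, Name, User, Time Use, S, Queue.
    return sum(1 for line in out.splitlines()
               if len(line.split()) >= 6 and line.split()[4] == "Q")

while True:                                      # run as a daemon
    waiting = queued_jobs()
    workers = [s for s in nova.servers.list()
               if s.name.startswith("vq-worker")]
    if waiting > 0 and len(workers) < MAX_WORKERS:
        # Physical queue is busy: start one more virtual worker node.
        nova.servers.create(name="vq-worker-%02d" % len(workers),
                            image=nova.images.find(name="SL65-worker"),
                            flavor=nova.flavors.find(name="m1.medium"))
    elif waiting == 0:
        # Queue drained: stop the workers and give the resources back.
        for s in workers:
            s.delete()
    time.sleep(60)
```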
Distributed computing
- Distributed computing has integrated cloud resources based on the pilot schema, implementing dynamic scheduling
- The cloud resources in use can shrink and grow dynamically according to job requirements
[Diagram: users submit jobs (Job1, Job2, Job3, …) to the distributed computing system; while jobs are waiting, it creates VMs (VM1, VM2, …) on the cloud; the VMs pull jobs until they are finished; when no jobs remain, the cloud VMs are deleted]
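The pilot schema can be condensed into one VM-side loop. This is a hypothetical sketch, not DIRAC's actual API: get_job(), run() and request_vm_shutdown() are stand-ins for matching a job from the central task queue, executing it, and asking the cloud to delete the now-idle VM.

```python
import time

IDLE_LIMIT = 3                 # empty polls tolerated before giving the VM back

def get_job():
    """Hypothetical stand-in for matching a job from the central task queue."""
    return None                # pretend the queue is empty

def run(job):
    """Hypothetical stand-in for executing the payload and uploading output."""
    print("running", job)

def request_vm_shutdown():
    """Hypothetical stand-in for asking the cloud to delete this VM."""
    print("no more jobs: VM can be deleted")

def pilot_loop():
    idle = 0
    while idle < IDLE_LIMIT:
        job = get_job()
        if job is None:
            idle += 1          # nothing matched: count an idle poll
            time.sleep(1)      # a real pilot would wait minutes here
            continue
        idle = 0
        run(job)
    request_vm_shutdown()      # shrink the cloud resources when work dries up

pilot_loop()
```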
Cloud sites
- 5 cloud sites from Torino, JINR, CERN and IHEP have been set up and connected to the distributed computing system
- about 320 CPU cores, 400 GB memory, 10 TB disk
Cloud tests
- More than 4,500 jobs have been run with a 96% success rate
- the failure cause was lack of disk space
- disk space will be extended in the IHEP cloud
Performance and Physics validation
- Performance tests have shown that running times on the cloud sites (simulation, reconstruction, downloading random trigger data) are comparable with other production sites
- Physics validation has proved that physics results are highly consistent between clusters and cloud sites
Architecture
[Diagram: users interact with the Dashboard, DIRAC and the Virtual Cluster, which drive OpenStack through its APIs. Authentication goes through LDAP and the UMT (IHEP EMAIL) user registry, which interoperates with UMT (CAS CLOUD). New VMs are registered in DNS and the NetworkDB and configured by Puppet. Nagios service monitoring, a host monitor and log analysis gather VM info and push it to the Dashboard. CEPH provides the backend storage.]
IHEPCloud components
- Core middleware: OpenStack
  - open-source cloud management system
  - the most popular one
- Configuration management tool: Puppet
  - creates VM images
  - manages applications in the VMs
  - keeps VMs consistent with the computing environment
- Authentication
  - IHEP email account and password
  - AFS authentication for UI nodes
- Network management
  - a central NetworkDB records MAC, IP, hostname, user, … (see the sketch below)
  - each VM has a hostname under *.v.ihep.ac.cn
  - network traffic accounting
- External storage
  - currently, images and instances are stored on local disk
  - evaluating CEPH to back Glance, Nova and Cinder
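The NetworkDB registration step might look like the sketch below. The slide only says that the central database records MAC, IP, hostname and user, and that hostnames live under *.v.ihep.ac.cn; the schema, the SQLite backend and the function names are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for the central NetworkDB.
db = sqlite3.connect("networkdb.sqlite")
db.execute("""CREATE TABLE IF NOT EXISTS vm_registry (
                  mac      TEXT PRIMARY KEY,
                  ip       TEXT NOT NULL,
                  hostname TEXT NOT NULL,
                  owner    TEXT NOT NULL)""")

def register_vm(mac, ip, owner, name):
    """Record a new VM's MAC, IP, hostname and owner in the NetworkDB."""
    hostname = "%s.v.ihep.ac.cn" % name          # naming rule from the slide
    db.execute("INSERT OR REPLACE INTO vm_registry VALUES (?, ?, ?, ?)",
               (mac, ip, hostname, owner))
    db.commit()
    return hostname

print(register_vm("fa:16:3e:12:34:56", "192.168.1.10", "someuser", "vm001"))
```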
Network in IHEPCloud
- multiple IP subnets on one physical machine
- VLAN mode (L2 only, no router) in OpenStack Neutron (see the sketch below)
  - IP gateway and 802.1Q tagging in the hardware switch
  - problems: trunk configuration, big MAC tables, pre-configuration, risk of broadcast storms
- future network
  - core layer: VXLAN (hardware-based)
  - access layer: OpenStack VLAN mode
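A sketch of how such a VLAN-mode network could be declared through the Neutron API (python-neutronclient), assuming the setup above: the L3 gateway stays in the hardware switch, so no Neutron router is created. The physical-network label, VLAN ID, gateway address and credentials are placeholders; the 202.122.35.0/24 range echoes the public addresses mentioned earlier.

```python
from neutronclient.v2_0 import client

# Placeholders: credentials and endpoint are illustrative.
neutron = client.Client(username="admin", password="secret",
                        tenant_name="admin",
                        auth_url="http://cloud.ihep.ac.cn:5000/v2.0")

# A provider VLAN network: L2 only, tagged in the hardware switch.
net = neutron.create_network({"network": {
    "name": "vm-net-35",
    "provider:network_type": "vlan",
    "provider:physical_network": "physnet1",   # assumed label
    "provider:segmentation_id": 35}})          # assumed 802.1Q tag

# The subnet's gateway lives in the hardware switch, not in Neutron.
neutron.create_subnet({"subnet": {
    "network_id": net["network"]["id"],
    "ip_version": 4,
    "cidr": "202.122.35.0/24",
    "gateway_ip": "202.122.35.1",              # assumed switch-side gateway
    "enable_dhcp": True}})
```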
IHEPCloud Current status
- Released on 18 November 2014
- Built on OpenStack Icehouse
- 1 control node
- 7 computing nodes: 112 physical CPU cores / 224 virtual CPU cores, 896 GB memory in total
- 96 active VMs with 172 CPU cores and 628 GB memory as of 11 March
[Diagram: the control node managing VMs across the computing nodes]
Conclusion
- Cloud computing is widely accepted in both industry and the scientific domain
- Scientific computing is preparing the move to the cloud
- IHEPCloud aims at providing a self-service virtual machine platform for IHEP users
- IHEPCloud also supports the virtual cluster and distributed computing
- A small cloud platform has been built and opened to IHEP users free of charge
- More resources (1000+) will be added to IHEPCloud this year
- Investigating Shibboleth to build a federated cloud
Thank you!
Any Questions?