Distributed Build/Test Infrastructure Challenges

International Infrastructure and Testbeds
Miron Livny, University of Wisconsin-Madison
www.eu-etics.org
INFSOM-RI-026753
Outline (per Alberto)
• Part I: Distributed Build/Test Infrastructure Challenges
– resources, administration, security, policies, etc.
• Part II: ETICS Build/Test Infrastructure Status
– what has been done to set up such an infrastructure
– about resources and their management, not the tools
• “Should be done from the WP2 point of view, therefore it should mainly be concerned with the physical infrastructure (managing a grid infrastructure, the resource pools, platforms, package installation, cross-site job submission, network-related issues, virtualization and so on) more than higher-level issues of the software tools deployed on that infrastructure.”
Continuous Integration
• Frequent building and testing of software, known as continuous integration (C.I.), improves quality and reduces development costs
  – What’s true for individual components is even more true for entire software stacks.
  – A common C.I. framework and common facilities are necessary to build these stacks and perform real interoperability testing.
• To do this, two things are essential:
  1. a software framework to implement the continuous integration process.
  2. a well-managed infrastructure where the framework runs, including all of the platforms necessary for portability and testing.
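The interface between these two is a stream of individual build and test jobs. A minimal sketch of one such job, written as a Condor submit description (the script name, arguments, and file names are illustrative, not taken from the ETICS configuration):

    # build-job.sub -- hypothetical submit description for a single build
    universe     = vanilla
    executable   = run_build.sh              # wrapper: check out and build one component
    arguments    = --component example-component --tag HEAD
    output       = build.out
    error        = build.err
    log          = build.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

Submitted with condor_submit build-job.sub; the framework’s real work is generating and tracking many such jobs across components, versions, and platforms.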
Common Infrastructure
• A common C.I. infrastructure frees developers to focus
on their mission, not the build/test process
– It’s unreasonable, impractical, and wasteful to expect that every
software development team first develop a system for managing
continuous integration for their own project, and then maintain a
facility of machines to actually perform the builds and tests.
– Software developers should focus on the problems and
challenges that are unique to their scientific domain or area of
expertise, instead of spending precious (and limited) resources
implementing a continuous integration infrastructure on their
own.
• A common C.I. infrastructure enables sharing of software stacks and improves interoperability.
Challenges
• Distributed Build/Test Infrastructure
Challenges
– Heterogeneity
– Administration
– Scalability
– Policy
– Security
– Data
– Communication
Heterogeneity
• Even given a robust framework for managing
continuous integration for a given software project, the
facilities necessary to run such a system require
significant resources (in terms of hardware and labor).
• Extremely Heterogeneous Resources == Extremely
Heterogeneous Problems
• The opposite extreme from a typical “cluster” -- instead of 1000s of identical CPUs, we have a few CPUs for each of many platforms:
  • Windows x N versions
  • Linux x 10…20…30…(?)
    • Scientific Linux (CERN, Fermi) x N, RedHat x N, SuSE x N, Fedora x N, Ubuntu x N, CentOS x N, etc.
  • Vendor Unix (Solaris, AIX, HPUX, Irix, Tru64, etc.) x N versions
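In Condor terms, this heterogeneity surfaces directly in the requirements expression every build or test job must carry to land on the right platform. A few illustrative examples using standard Condor machine attributes (the specific platform mix is hypothetical):

    # 32-bit Linux workers:
    requirements = (OpSys == "LINUX") && (Arch == "INTEL")
    # Windows XP workers:
    requirements = (OpSys == "WINNT51") && (Arch == "INTEL")
    # Solaris 9 on SPARC:
    requirements = (OpSys == "SOLARIS29") && (Arch == "SUN4u")

Note that every Linux flavor reports OpSys == "LINUX", so telling SLC 3 apart from SLES 9 requires additional, site-defined machine attributes advertised by each worker -- one more cost of heterogeneity.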
Administration Challenges
• Much harder to manage! Try finding a modern systems
administration tool that works on 50+ diverse
Windows/Linux/Unix platforms for:
  • OS deployment
  • Configuration management
  • User authentication (e.g., passwd file management)
  • Software distribution
• Most of these problems have mature solutions for homogeneous platforms -- but they’re all different!
• Heterogeneity == Lowest common denominator tools
or custom solutions == lots of work.
• Admin burden per CPU much higher for build/test
clusters than typical clusters.
Demand for Scalability
• Some production software packages contain
thousands of regression tests that require hundreds of
CPU hours to execute
• Multiple customers x Multiple components x Multiple
versions x Multiple tests == Lots of cycles!
• Quantitative increases in available computing power
can bring qualitative changes to build and test
practices
• More cycles == new testing possibilities
– Interoperability testing
• Customers adapt development practices to consume
additional build/test cycles as they become available
– “If we can rebuild after every code change, why not?”
– “If we can retest interoperability after each build, why not?”
Policy & Security
• Multiple customers bring multiple task priorities
– Between and within projects
– Project A vs. Project B
– User A1 vs. User A2
• Multiple institutions bring multiple resource priorities
– Between and within sites
– CERN admins may prefer CERN users over UW-Madison users
– gLite CPUs may only allow gLite jobs (see the policy sketch below)
• Distributed computing opens new “doors” between
computing centers
– These doors require locks & keys.
– Need to adapt to security policies of multiple institutions in
multiple countries.
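As a hedged sketch of how such per-resource policy can be expressed in Condor, a worker node’s startd policy might look like this (the account names are hypothetical):

    # condor_config.local on a gLite-only worker node:
    START = (Owner == "glitebuild")       # accept jobs only from the gLite build account

    # condor_config.local on a CERN worker that accepts anyone but prefers local submitters:
    START = True
    RANK  = (Owner == "cernbuild")

Fair-share between competing users within a pool is handled separately by the negotiator’s user priorities (condor_userprio).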
Data challenges
• Large projects can generate 100s of GB per day of build and regression-test output, and developers may want to review it for days or weeks.
  – E.g., Condor
• Three types of ETICS data:
– ETICS Database: build/test specifications and results
– Temporary build/test archive: recent output workspaces, artifacts
– ETICS repository: selected successful builds, important results
• The ETICS Database is the least of our problems.
• All three scale with quantity and workload of users.
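At the level of a single job, the mechanics are simple: each build or test job declares what should come back to the submit host, e.g. (file names illustrative):

    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_output_files   = build-artifacts.tar.gz, test-report.xml

The hard part is the aggregate: keeping hundreds of GB per day of such artifacts available long enough for developers to inspect them.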
Communication Challenges
• Intercontinental Debugging!
– Difficult to coordinate work when sites have only a one-hour overlap in a typical workday
– Lots of high-latency email communication
– Worst-case: 1 day per round-trip exchange
• Solutions
– late workdays @ CERN, early workdays @ UW-Madison
– Check email from home!
– Commitment to making it work.
– All-Hands meetings have been invaluable.
ETICS Architecture
• From a hardware perspective, each ETICS site
deployment consists of:
– a small number of fixed hosts needed to support central services
– a larger, dynamic pool of worker nodes (WNs) to execute
dynamic build and test operations
• Most worker nodes in the current deployment are
dedicated, but not all are or need to be.
• Using NMI/Condor distributed computing capabilities,
the three production sites are federated
– jobs can migrate between different sites automatically based on
local resource availability
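One standard Condor mechanism for this kind of federation is flocking; whether ETICS relies on flocking or another Condor capability is not detailed in these slides, but a configuration sketch (with placeholder host names) looks like:

    # On the submitting site's schedd (e.g., at CERN):
    FLOCK_TO   = cm.infn.example.it, cm.nmi.example.edu
    # On each receiving pool, to accept flocked jobs from that schedd:
    FLOCK_FROM = schedd.cern.example.ch

Jobs that find no matching worker locally are then automatically considered by the remote pools, which is how migration based on local resource availability can be realized.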
Distributed Infrastructure
• Managing multiple sites adds some administrative and
technical overhead
• In practice, however, it streamlines the operation of ETICS:
– delegates much of the local policy and service administration to
each site
– enables deployments to better reflect the local priorities,
constraints, and administrative processes of local resource
owners, and be more accountable to their needs.
– provides fault tolerance against:
  – technical failures (hardware, software, network)
  – social/political issues (unresponsive systems administrators, insufficient partner resources, etc.)
ETICS Heterogeneity
• Leverage Mature, Successful Middleware Tools
– NMI Build and Test System
– Condor, DAGMan
• Leverage Distributed Resources
– NMI and Condor enable ETICS to harness cycles from
both local and distributed computational grids and
opportunistic resources
– CERN + INFN + UW-Madison + ???
– No one site need have every platform.
– Be opportunistic; expect and tolerate failures
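A small DAGMan sketch of the “expect and tolerate failures” attitude: build and test are chained, and transient failures are retried rather than treated as fatal (file names are illustrative):

    # build-test.dag
    JOB  build  build.sub
    JOB  test   test.sub
    PARENT build CHILD test
    RETRY build 2
    RETRY test  3

Run with condor_submit_dag build-test.dag; DAGMan resubmits failed nodes up to the given retry counts before declaring the DAG failed.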
ETICS Administration & Policy
• Decentralized Administration
• Focus on Cross-Site Interoperability and Correctness,
not Processes/Methods
– Leverage existing staff expertise, processes, and tools, e.g.:
– Hardware and OS deployment methods
– Backups
– System/Network Monitoring
• Common NMI build/test foundation
– Highly-managed, reproducible build/test environment
– Mechanisms to share jobs & resources across administrative
boundaries
– Without requiring common priorities or access policies
ETICS Administration & Policy
• Diversity of Local Admin Processes Has Advantages
– we learn from each other -- “cherry pick” best practices
– For example, the UW-Madison NMI-Lab is now exploring the use of
a Linux system imaging solution developed to manage the CERN
NMI-Lab.
– Local processes tailored to local expertise and requirements
– INFN’s user account management system doesn’t fit UW-Madison’s, and vice versa
– Limits ETICS scope: lets ETICS focus on common build and test
infrastructure, not systems management tools
• Each site sets its own policy:
– Resource priorities
– User access
Incremental Deployment
• Focus on quick delivery of a functioning infrastructure
– The deployed architecture is now evolving from the initial rapid
deployment of many of the core services into a more robust,
long-term production deployment
• There have been several key benefits to this aggressive
approach
– Initial, basic deployment was more manageable than a more
complex production deployment
– simplified the challenge of understanding the components and their
operation for ETICS developers and administrators
– facilitated the quick training of ETICS staff in operational issues.
– important that a basic infrastructure be available early in the
project
– Delivering a functioning NMI/Condor infrastructure in February 2006 facilitated higher-level ETICS service development
– Early feedback led to major improvements in NMI software
Advanced Deployment
• Once the initial infrastructure was functional,
we focused on delivering advanced features
– Root-level testing
  – safely run tests as a privileged user without corrupting the system
  – new capability to “freeze” machines for post-mortem analysis
– Parallel testing capability (see the co-scheduling sketch below)
  – Co-scheduling multiple resources
  – Coordination between co-scheduled resources
– Dynamic testbed deployment tools
  – Advertising runtime build and test artifacts
• Driven by customer requirements (e.g., gLite)
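For the parallel-testing capability above, one way to co-schedule several workers for a single distributed test is Condor’s parallel universe; whether ETICS uses this or its own co-scheduling layer is not specified here. A sketch (script name and node count are illustrative):

    # interop-test.sub
    universe      = parallel
    executable    = run_interop_test.sh     # node 0 drives the test, other nodes host services
    machine_count = 3
    output        = interop.$(NODE).out
    error         = interop.$(NODE).err
    log           = interop.log
    queue

The parallel universe requires dedicated-scheduler configuration on the pool, which fits the dedicated worker nodes described earlier.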
ETICS Sites
• CERN
• INFN
• UW-Madison
CERN Site
• Identified early in the project as the first site to make fully
operational
– most important to treat as a production facility
– geographical co-location of key ETICS customers and staff
• CERN hosts three ETICS pools
  – Production Pool
    – largest and most stable pool, for use by end users
    – 3 servers (one primary, one backup/test, one install), total of 21 worker nodes
  – Test Pool
    – verification testbed for the production pool
    – 1 server, total of 3 worker nodes
  – Development Pool
    – less-controlled environment for unscheduled, experimental testing by ETICS developers and administrators
    – 1 server, total of 7 worker nodes
• Also hosts ETICS repository and CVS service
CPUs & Platforms at CERN

  CPUs  Operating System                         Hardware
  24    Scientific Linux (CERN) 3                Intel (x86)
  13    Scientific Linux (CERN) 4                Intel (x86)
   4    Red Hat Enterprise Linux Adv. Server 4   Intel (x86)
   4    Scientific Linux (CERN) 4                Intel Itanium (ia64)
   2    Microsoft Windows XP Server              Intel (x86)
   2    Red Hat Enterprise Linux Adv. Server 3   Intel (x86)
   2    Fedora Core 6                            Intel (x86)
   2    Fedora Core 5                            Intel (x86)
   2    Fedora Core 4                            Intel (x86)
   2    Debian 3.1                               Intel (x86)
   2    Mac OS X 10.4                            Apple PowerPC
   2    Scientific Linux (CERN) 4                Intel Itanium (ia64)
   2    Scientific Linux (CERN) 3                Intel Itanium (ia64)
INFN ETICS Pool
• Production pool
– installed in the first months of the project with few available
platforms
– increasing number of worker nodes and supported platforms
during the life of the project
• Deployed later than other sites, not yet widely used
– Developers and users are accustomed to connecting to CERN
resources where they have their data already configured
– cross-site job migration capability can now address this
– Jobs submitted at CERN may run at INFN and vice versa
• hosts an additional ETICS web service deployment for
development purposes
CPUs/Platforms at INFN

  CPUs  Operating System             Hardware
   4    Scientific Linux (CERN) 3    Intel (x86)
   2    Scientific Linux (CERN) 3    AMD64 / Intel EMT-64
   2    Scientific Linux (CERN) 4    Intel (x86)
   2    CentOS 4.3                   Intel (x86)
   1    Mac OS X 10.4                Apple PowerPC
UW-Madison NMI-Lab
• Unique among the three sites
– Condor and NMI resources not dedicated to ETICS
– Operational as a production facility two years prior to ETICS start
– Much more heterogeneous collection of build and test resources than
the other sites, but few instances of any one platform.
• Directly supports large NSF projects such as TeraGrid, the Open
Science Grid, Globus and Condor, and EU projects like OMII-UK,
OMII-Europe, and ETICS
• Hosts over 25 development projects that launch an average of
1400 builds and tests per month
• Two classes of hardware
– a pool of >100 heterogeneous worker CPUs dedicated to build and test
tasks
– supported by 5 service nodes dedicated to batch system, database,
and administrative services.
– Both groups are co-located in one data center.
– Approx. 2 TB of build/test results storage
UW-Madison NMI-Lab Resources

  Operating System           Versions  Architectures  CPUs
  Apple Macintosh                   2              2     8
  Debian Linux                      1              1     2
  Fedora Core Linux                 4              2    20
  FreeBSD                           1              1     4
  Hewlett Packard HPUX              1              1     3
  IBM AIX                           2              1     6
  Linux (Other)                     3              2     9
  Microsoft Windows                 2              2     3
  OSF1                              1              1     2
  Red Hat Linux                     3              2    13
  Red Hat Enterprise Linux          2              3    19
  Scientific Linux                  3              2    11
  SGI Irix                          1              1     4
  Sun Solaris                       2              1     6
  SuSE Enterprise Linux             3              3    15
UW-Madison NMI-Lab Resources (detail)

  CPUs  Operating System                         Hardware                Location
   2    CentOS 4.2                               Intel (x86)             UW
   2    Debian 3.1                               Intel (x86)             UW
   2    Fedora Core 2                            Intel (x86)             UW
   2    Fedora Core 3                            Intel (x86)             UW
   2    Fedora Core 4                            Intel (x86)             UW
   2    RedHat 7.2                               Intel (x86)             UW
   2    RedHat 8                                 Intel (x86)             UW
   5    RedHat 9                                 Intel (x86)             UW
   2    Red Hat Enterprise Linux Adv. Server 3   Intel (x86)             UW
   2    Red Hat Enterprise Linux Adv. Server 4   Intel (x86)             UW
   2    Scientific Linux (CERN) 3                Intel (x86)             UW
   2    Scientific Linux (Fermi) 3               Intel (x86)             UW
   2    SuSE Linux Enterprise Server 8           Intel (x86)             UW
   2    SuSE Linux Enterprise Server 9           Intel (x86)             UW
   2    SuSE Linux Enterprise Server 10          Intel (x86)             UW
   2    Tao Linux 1                              Intel (x86)             UW
   2    Microsoft Windows XP Server              Intel (x86)             UW
   2    Red Hat Enterprise Linux Adv. Server 3   AMD64 / Intel EMT-64    UW
   2    Red Hat Enterprise Linux Adv. Server 4   AMD64 / Intel EMT-64    UW
   2    SuSE Linux Enterprise Server 8           AMD64 / Intel EMT-64    UW
   6    Red Hat Enterprise Linux Adv. Server 3   Intel Itanium (ia64)    NCSA, UW
   2    Red Hat Enterprise Linux Adv. Server 4   Intel Itanium (ia64)    UW
   2    Scientific Linux (CERN) 3                Intel Itanium (ia64)    UW
   6    SuSE Linux Enterprise Server 8           Intel Itanium (ia64)    TeraGrid, UW
   2    SuSE Linux Enterprise Server 9           Intel Itanium (ia64)    UW
   2    HPUX 11                                  HP PA-RISC              UW
   2    OSF 5.1                                  Compaq Alpha            UW
   2    RedHat 7.2                               Compaq Alpha            UW
   4    Irix 6.5                                 SGI MIPS                UW
   2    AIX 5.2                                  IBM PowerPC             UW
   4    Solaris 8                                Sun SPARC               UW
   5    Solaris 9                                Sun SPARC               UW
   2    Solaris 10                               Sun SPARC               UW
   1    Mac OS X 10.3                            Apple PowerPC           UW
   2    Mac OS X 10.4                            Apple PowerPC           UW
   2    Yellow Dog Linux 3.0                     Apple PowerPC           UW
Results & Future Work
• Results
  – Delivered reliable infrastructure for ETICS services
  – Delivered new build and test capabilities not previously available
  – Fed back improvements and changes to core tools
  – Established productive collaboration between sites
• Future Work
– Engagement with more users
– New capabilities
– Integration with virtualization technology (Xen, VMWare)
– Enhancing cross-site capabilities