The Datacenter Needs an Operating System

Matei Zaharia, Benjamin Hindman, Andy
Konwinski, Ali Ghodsi, Anthony Joseph,
Randy Katz, Scott Shenker, Ion Stoica
Background
• Clusters of commodity servers have become a
major computing platform in industry and
academia
• Driven by data volumes outpacing the
processing capabilities of single machines
• Democratized by cloud computing
Background
• Some have declared that “the datacenter is
the new computer”
• Claim: this new computer increasingly needs
an operating system
• Not necessarily a new host OS, but a common
software layer that manages resources and
provides shared services for the whole
datacenter, like an OS does for one host
Why Datacenters Need an OS
• Growing number of applications
– Parallel processing systems: MapReduce, Dryad,
Pregel, Percolator, Dremel, MR Online
– Storage systems: GFS, BigTable, Dynamo, SCADS
– Web apps and supporting services
• Growing number of users
– 200+ for Facebook’s Hadoop data warehouse,
running near-interactive ad hoc queries
What Operating Systems Provide
• Resource sharing across applications & users
• Data sharing between programs
• Programming abstractions (e.g. threads, IPC)
• Debugging facilities (e.g. ptrace, gdb)
Result: OSes enable a highly interoperable
software ecosystem that we now take for granted
An Analogy
• Today, a scientist analyzing data on a single
machine can pipe it through a variety of tools,
write new tools that interface with these through
standard APIs, and trace across the stack
• In the future, the scientist should be able to fire
up a cloud on EC2 and do the same thing:
– Intermix a variety of apps & programming models
– Write new parallel programs that talk to these
– Get a unified interface for managing the cluster
– Debug and trace across all these components
Today’s Datacenter OS
• Hadoop MapReduce as common execution
and resource sharing platform
• Hadoop InputFormat API for data sharing
• Abstractions for productivity programmers,
but not for system builders
• Very challenging to debug across all the layers
Tomorrow’s Datacenter OS
• Resource sharing:
– Lower-level interfaces for fine-grained sharing
(Mesos is a first step in this direction)
– Optimization for a variety of metrics (e.g. energy)
– Integration with network scheduling mechanisms
(e.g. Seawall [NSDI ’11], NOX, Orchestra)
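To make "fine-grained sharing" concrete, here is a minimal sketch of a cluster manager that hands out task slots one at a time to whichever framework currently holds the fewest (simple max-min fairness). It is an illustration only, not the Mesos interface: the function name `allocate_slots`, the framework names, and the slot counts are all assumptions made for this example.

```python
def allocate_slots(total_slots, demands):
    """Hand out slots one at a time to the least-served framework.

    `demands` maps a (hypothetical) framework name to the number of
    slots it wants. Returns a dict of slots granted per framework.
    """
    allocation = {name: 0 for name in demands}
    for _ in range(total_slots):
        # Frameworks whose demand is not yet satisfied.
        wanting = [n for n in demands if allocation[n] < demands[n]]
        if not wanting:
            break  # everyone is satisfied; leave remaining slots idle
        # Pick the framework holding the fewest slots; break ties by
        # name so the result is deterministic.
        chosen = min(wanting, key=lambda n: (allocation[n], n))
        allocation[chosen] += 1
    return allocation


if __name__ == "__main__":
    print(allocate_slots(10, {"mapreduce": 8, "mpi": 2, "spark": 5}))
```

Allocating at the granularity of single slots, rather than giving each framework a static partition of the cluster, is what lets a small job (here `mpi`) be fully served while larger ones split the remainder evenly.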
Tomorrow’s Datacenter OS
• Data sharing:
– Standard interfaces for cluster file systems, key-value stores, etc.
– In-memory data sharing (e.g. Spark, DFS cache),
and a unified system to manage this memory
– Streaming data abstractions (analogous to pipes)
– Lineage instead of replication for reliability (RDDs)
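The idea of lineage instead of replication can be sketched in a few lines: each derived dataset remembers its parent and the function that produced it, so a lost in-memory partition is recomputed rather than restored from a replica. This is a toy illustration under stated assumptions, not Spark's RDD API. The class `LineageDataset` and its methods are hypothetical, and only a `map` transformation is modeled.

```python
class LineageDataset:
    """Toy partitioned dataset that recovers lost partitions via lineage."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source  # base partitions, assumed to be in stable storage
        self._parent = parent  # dataset this one was derived from
        self._fn = fn          # per-element function applied to the parent
        self._cache = {}       # in-memory partitions; may be lost at any time

    @classmethod
    def from_partitions(cls, partitions):
        return cls(source=[list(p) for p in partitions])

    def map(self, fn):
        # Record lineage only; nothing is computed until a partition is read.
        return LineageDataset(parent=self, fn=fn)

    def num_partitions(self):
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def partition(self, i):
        if i in self._cache:
            return self._cache[i]
        if self._parent is None:
            data = list(self._source[i])
        else:
            # Rebuild this partition from the parent's matching partition.
            data = [self._fn(x) for x in self._parent.partition(i)]
        self._cache[i] = data
        return data

    def lose_partition(self, i):
        """Simulate a node failure that drops one cached partition."""
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions()) for x in self.partition(i)]


if __name__ == "__main__":
    squares = LineageDataset.from_partitions([[1, 2], [3, 4]]).map(lambda x: x * x)
    print(squares.collect())
    squares.lose_partition(1)    # the "failure"
    print(squares.partition(1))  # recomputed from lineage, not from a replica
```

Only the lost partition is recomputed. The trade-off is recomputation time on failure in exchange for not paying the memory and network cost of keeping replicas.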
Tomorrow’s Datacenter OS
• Programming abstractions:
– Tools that can be used to build the next
MapReduce / BigTable in a week (e.g. BOOM)
– Efficient implementations of communication
primitives (e.g. shuffle, broadcast)
– New distributed programming models
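As an illustration of one communication primitive named above, the following is a minimal in-process sketch of a hash-partitioned shuffle: every (key, value) pair emitted by any map task is routed to the reducer that owns that key. The function names `partition_for` and `shuffle` are assumptions for this example. A real implementation would move data over the network and spill to disk, which this sketch does not model.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Map a key to a reducer index using a hash that is stable across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def shuffle(map_outputs, num_reducers):
    """Group (key, value) pairs from all map tasks by owning reducer.

    `map_outputs` is a list with one entry per map task, each a list of
    (key, value) pairs. Returns one {key: [values]} dict per reducer.
    """
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for key, value in output:
            reducers[partition_for(key, num_reducers)][key].append(value)
    return [dict(r) for r in reducers]


if __name__ == "__main__":
    outputs = [[("a", 1), ("b", 1)], [("a", 1), ("c", 1)]]
    for index, groups in enumerate(shuffle(outputs, 2)):
        print(index, groups)
```

`zlib.crc32` is used instead of Python's built-in `hash` because string hashing is randomized per process, and every node must agree on which reducer owns a key.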
Tomorrow’s Datacenter OS
• Debugging facilities:
– Tracing and debugging tools that work across the
cluster software stack (e.g. X-Trace, Dapper)
– Replay debugging that takes advantage of limited
languages / computational models
– Unified monitoring infrastructure and APIs
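Cross-stack tracing generally relies on every layer passing along a small piece of trace metadata with each request. The sketch below shows that idea in miniature: a scheduler, an executor, and a storage layer each record an event linked to the event that caused it. It does not reproduce the X-Trace or Dapper interfaces. The `Trace` class, the three layer functions, and the path used are all hypothetical.

```python
import itertools


class Trace:
    """Collects causally linked events for one request across layers."""

    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.events = []
        self._next_id = itertools.count(1)

    def record(self, layer, message, parent=None):
        """Log an event and return its id so callees can point back to it."""
        event_id = next(self._next_id)
        self.events.append(
            {"id": event_id, "parent": parent, "layer": layer, "message": message}
        )
        return event_id


def storage_read(trace, parent, path):
    trace.record("storage", f"read {path}", parent)
    return ["line one", "line two"]  # stand-in for file contents


def run_task(trace, parent, path):
    event_id = trace.record("executor", f"task on {path}", parent)
    return len(storage_read(trace, event_id, path))


def submit_job(trace, path):
    event_id = trace.record("scheduler", f"job for {path}")
    return run_task(trace, event_id, path)


if __name__ == "__main__":
    trace = Trace("job-1")
    submit_job(trace, "/logs/part-0")
    for event in trace.events:
        print(event)
```

Because each event names its parent, the recorded events form a tree that can be reassembled after the fact, even when the layers are separately developed systems running on different machines.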
Putting it All Together
• A successful datacenter OS might let users:
– Build a Hadoop-like software stack in a week
using the OS’s abstractions, while gaining other
benefits (e.g. cross-stack replay debugging)
– Share data efficiently between independently
developed programming models and applications
– Understand cluster behavior without having to
log into individual nodes
– Dynamically share the cluster with other users
Conclusion
• Datacenters need an OS-like software stack
for the same reasons single computers did:
manageability, efficiency & programmability
• An OS is already emerging in an ad-hoc way
• Researchers can help by taking a long-term
approach towards these problems
How Researchers can Help
• Focus on paradigms, not performance
– Industry is tackling performance but lacks the luxury
of taking a long-term view of abstractions
• Explore clean-slate approaches
– Likelier to have impact here than in a “real” OS
because datacenter software changes quickly!
• Bring cluster computing to non-experts
– Much harder and more rewarding than serving big users