Transcript Document

PlanetLab: Evolution vs
Intelligent Design in Global
Network Infrastructure
Larry Peterson
Princeton University
Case for PlanetLab
[Figure: Maturity vs. Time, with Foundational Research, Simulation and Research Prototypes, and Small Scale Testbeds on one side of a chasm and the Deployed Future Internet on the other]
This chasm is a major barrier to realizing the Future Internet.
PlanetLab
• 637 machines spanning 302 sites and 35 countries
– nodes within a LAN-hop of > 2M users
• Supports distributed virtualization
– each of 350+ network services running in its own slice
[Figure: slices spanning multiple PlanetLab nodes]
User Opt-in
[Figure: Client connecting via NAT to a Server running in a slice]
Per-Node View
[Figure: Node Mgr and Local Admin VMs alongside VM1, VM2, …, VMn, all running on the Virtual Machine Monitor (VMM)]
Long-Running Services
• Content Distribution
– CoDeeN: Princeton
– Coral: NYU
– Cobweb: Cornell
• Internet Measurement
– ScriptRoute: Washington, Maryland
• Anomaly Detection & Fault Diagnosis
– PIER: Berkeley, Intel
– PlanetSeer: Princeton
• DHT
– Bamboo (OpenDHT): Berkeley, Intel
– Chord (DHash): MIT
Services (cont)
• Routing
– i3: Berkeley
– Virtual ISP: Princeton
• DNS
– CoDNS: Princeton
– CoDoNs: Cornell
• Storage & Large File Transfer
– LOCI: Tennessee
– CoBlitz: Princeton
– Shark: NYU
• Multicast
– End System Multicast: CMU
– Tmesh: Michigan
Usage Stats
• Slices: 350 - 425
• AS peers: 6000
• Users: 1028
• Bytes-per-day: 2 - 4 TB
• IP-flows-per-day: 190M
• Unique IP-addrs-per-day: 1M
Architectural Questions
• What is the PlanetLab architecture?
– more a question of synthesis than cleverness
• Why is this the right architecture?
– non-technical requirements
– technical decisions that influenced adoption
• What is a system architecture anyway?
– how does it accommodate change (evolution)
Requirements
1) Global platform that supports both short-term
experiments and long-running services.
– services must be isolated from each other
• performance isolation
• name space isolation
– multiple services must run concurrently
Distributed Virtualization
– each service runs in its own slice: a set of VMs
Requirements
2) It must be available now, even though no one
knows for sure what “it” is.
– deploy what we have today, and evolve over time
– make the system as familiar as possible (e.g., Linux)
Unbundled Management
– independent mgmt services run in their own slice
– evolve independently; best services survive
– no single service gets to be “root” but some services
require additional privilege
Requirements
3) Must convince sites to host nodes running code
written by unknown researchers.
– protect the Internet from PlanetLab
Chain of Responsibility
– explicit notion of responsibility
– trace network activity to responsible party
Requirements
4) Sustaining growth depends on support for
autonomy and decentralized control.
– sites have the final say about the nodes they host
– sites want to provide “private PlanetLabs”
– regional autonomy is important
Federation
– universal agreement on minimal core (narrow waist)
– allow independent pieces to evolve independently
– identify principals and trust relationships among them
Requirements
5) Must scale to support many users with minimal
resources available.
– expect under-provisioned state to be the norm
– shortage of logical resources too (e.g., IP addresses)
Decouple slice creation from resource allocation
Overbook with recovery
– support both guarantees and best effort
– recover from wedged states under heavy load
Tension Among Requirements
• Distributed Virtualization / Unbundled Management
– isolation vs one slice managing another
• Federation / Chain of Responsibility
– autonomy vs trusted authority
• Under-provisioned / Distributed Virtualization
– efficient sharing vs isolation
• Other tensions
– support users vs evolve the architecture
– evolution vs clean slate
Synergy Among Requirements
• Unbundled Management
– third party management software
• Federation
– independent evolution of components
– support for autonomous control of resources
Architecture (1)
• Node Operating System
– isolate slices
– audit behavior
• PlanetLab Central (PLC)
– remotely manage nodes
– bootstrap service to instantiate and control slices
• Third-party Infrastructure Services
– monitor slice/node health
– discover available resources
– create and configure a slice
– resource allocation
Trust Relationships
[Figure: PLC as a trusted intermediary, replacing N×N pairwise trust between sites (Princeton, Berkeley, Washington, MIT, Brown, CMU, NYU, ETH, Harvard, HP Labs, Intel, NEC Labs, Purdue, UCSD, SICS, Cambridge, Cornell, …) and slices (princeton_codeen, nyu_d, cornell_beehive, att_mcash, cmu_esm, harvard_ice, hplabs_donutlab, idsl_psepr, irb_phi, paris6_landmarks, mit_dht, mcgill_card, huji_ender, arizona_stork, ucb_bamboo, ucsd_share, umd_scriptroute, …)]
Trust Relationships (cont)
[Figure: trust relationships 1–4 among Node Owner, PLC, and Service Developer (User)]
1) PLC expresses trust in a user by issuing it credentials to access a slice
2) Users trust PLC to create slices on their behalf and to inspect credentials
3) Owner trusts PLC to vet users and to map network activity to the responsible user
4) PLC trusts owner to keep nodes physically secure
Trust Relationships (cont)
[Figure: trust relationships 1–6 among Node Owner, Management Authority (MA), Slice Authority (SA), and Service Developer (User)]
1) PLC expresses trust in a user by issuing credentials to access a slice
2) Users trust PLC to create slices on their behalf and to inspect credentials
3) Owner trusts PLC to vet users and to map network activity to the responsible user
4) PLC trusts owner to keep nodes physically secure
5) MA trusts SA to reliably map slices to users
6) SA trusts MA to provide working VMs
Architecture (2)
[Figure: Service Developers (USERS) learn about nodes and request slices from the Slice Authority, which creates slices, issues new slice IDs, and lets users access their slices on PlanetLab Nodes hosted by Owner 1 … Owner N; the Management Authority pushes software updates to nodes, collects auditing data, and identifies slice users to resolve abuse]
Architecture (3)
[Figure: each Node runs NM + VMM, an Owner VM, an SCS, and service VMs; the MA maintains the node database and works with the Node Owner; the SA maintains the slice database used by Service Developers]
Per-Node Mechanisms
[Figure: Node Mgr (SliverMgr, Proper), Owner VM, and slice VMs VM1, VM2, …, VMn (hosting PlanetFlow, SliceStat, pl_scs, pl_mom)]
Virtual Machine Monitor (VMM):
Linux kernel (Fedora Core)
+ Vservers (namespace isolation)
+ Schedulers (performance isolation)
+ VNET (network virtualization)
VMM
• Linux
– significant mind-share
• Vserver
– scales to hundreds of VMs per node (12MB each)
• Scheduling
– CPU
• fair share per slice (guarantees possible)
– link bandwidth
• fair share per slice
• average rate limit: 1.5Mbps (24-hour bucket size)
• peak rate limit: set by each site (100Mbps default)
– disk
• 5GB quota per slice (limits runaway log files)
– memory
• no limit
• pl_mom resets biggest user at 90% utilization
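As a rough illustration of the bandwidth policy above, here is a hedged Python sketch that models the per-slice average-rate limit as a token bucket with a 24-hour burst capacity; the class and its structure are illustrative assumptions, not PlanetLab's actual scheduler code.

```python
import time

class SliceRateLimiter:
    """Token bucket approximating the per-slice average rate limit:
    tokens accrue at avg_rate_bps and may accumulate for up to
    bucket_seconds (24 hours), allowing bursts up to the peak rate."""

    def __init__(self, avg_rate_bps=1_500_000, bucket_seconds=24 * 3600):
        self.avg_rate_bps = avg_rate_bps
        self.capacity_bits = avg_rate_bps * bucket_seconds
        self.tokens = self.capacity_bits          # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self, packet_bits):
        """Return True if the slice may send this packet now."""
        now = time.monotonic()
        self.tokens = min(self.capacity_bits,
                          self.tokens + (now - self.last_refill) * self.avg_rate_bps)
        self.last_refill = now
        if self.tokens >= packet_bits:
            self.tokens -= packet_bits
            return True
        return False    # over the long-term average; defer or drop
```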
VMM (cont)
• VNET
– socket programs “just work”
• including raw sockets
– slices should be able to send only…
• well-formed IP packets
• to non-blacklisted hosts
– slices should be able to receive only…
• packets related to connections that they initiated (e.g., replies)
• packets destined for bound ports (e.g., server requests)
– essentially a switching firewall for sockets
• leverages Linux's built-in connection tracking modules
– also supports virtual devices
• standard PF_PACKET behavior
• used to connect to a “virtual ISP”
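To make the filtering rules above concrete, here is a hedged Python sketch of the policy decision; it is purely illustrative, since the real VNET enforcement lives in the kernel and relies on Linux connection tracking, and the Packet fields below are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    # minimal stand-in for a parsed packet header (illustrative only)
    src_addr: str
    dst_addr: str
    dst_port: int
    well_formed: bool = True

def allow_outbound(pkt: Packet, blacklist: set[str]) -> bool:
    """A slice may send only well-formed IP packets to non-blacklisted hosts."""
    return pkt.well_formed and pkt.dst_addr not in blacklist

def allow_inbound(pkt: Packet, initiated_flows: set[tuple],
                  bound_ports: set[int]) -> bool:
    """A slice may receive only packets related to connections it initiated
    (e.g., replies) or destined for ports it has bound (server requests)."""
    flow = (pkt.src_addr, pkt.dst_addr, pkt.dst_port)
    return flow in initiated_flows or pkt.dst_port in bound_ports
```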
Node Manager
• SliverMgr
– creates VM and sets resource allocations
– interacts with…
• bootstrap slice creation service (pl_scs)
• third-party slice creation & brokerage services (using tickets)
• Proper: PRivileged OPERations
– grants unprivileged slices access to privileged info
– effectively “pokes holes” in the namespace isolation
– examples (see the sketch below)
• files: open, get/set flags
• directories: mount/unmount
• sockets: create/bind
• processes: fork/wait/kill
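A hedged sketch of what a Proper interaction might look like from inside an unprivileged slice; the ProperClient wrapper, socket path, wire format, and operation names are hypothetical illustrations, not Proper's actual API.

```python
import json
import socket

# Hypothetical client for a Proper-like privileged-operations daemon.
PROPER_SOCKET = "/var/run/proper.sock"   # assumed path, not necessarily Proper's

class ProperClient:
    """Ask a privileged daemon to perform an operation the slice's own
    namespace does not permit (e.g., mounting a directory)."""

    def _call(self, op: str, **args) -> dict:
        request = json.dumps({"op": op, "args": args}).encode()
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(PROPER_SOCKET)
            s.sendall(request)
            return json.loads(s.recv(65536).decode())

    def open_file(self, path: str, flags: str) -> dict:
        return self._call("file_open", path=path, flags=flags)

    def mount_dir(self, source: str, target: str) -> dict:
        return self._call("mount", source=source, target=target)
```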
Auditing & Monitoring
• PlanetFlow
– logs every outbound IP flow on every node
• accesses ulogd via Proper
• retrieves packet headers, timestamps, context ids (batched)
– used to audit traffic
– aggregated and archived at PLC
• SliceStat
– has access to kernel-level / system-wide information
• accesses /proc via Proper
– used by global monitoring services
– used for performance debugging of services
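For concreteness, a hedged Python sketch of what a PlanetFlow-style per-node collector might batch up for PLC; the field names, file path, and batching function are assumptions, not PlanetFlow's actual schema.

```python
from dataclasses import dataclass, asdict
import json
import time

@dataclass
class FlowRecord:
    # assumed fields mirroring the slide: packet headers, timestamps,
    # and the context (slice) id, read from ulogd via Proper
    timestamp: float
    src_ip: str
    dst_ip: str
    proto: int
    dst_port: int
    context_id: int

def batch_records(records: list[FlowRecord], path: str) -> None:
    """Append a batch of flow records for PLC to pull and archive later."""
    with open(path, "a") as out:
        for rec in records:
            out.write(json.dumps(asdict(rec)) + "\n")

# example: one outbound flow attributed to slice (context id) 1234
batch_records([FlowRecord(time.time(), "10.0.0.5", "203.0.113.7", 6, 80, 1234)],
              "/tmp/planetflow-batch.log")
```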
Infrastructure Services
• Brokerage Services
– Sirius: Georgia
– Bellagio: UCSD, Harvard, Intel
– Tycoon: HP
• Environment Services
– Stork: Arizona
– AppMgr: MIT
• Monitoring/Discovery Services
– CoMon: Princeton
– PsEPR: Intel
– SWORD: Berkeley
– IrisLog: Intel
Evolution vs Intelligent Design
• Favor evolution over clean slate
• Favor design principles over a fixed architecture
• Specifically…
– leverage existing software and interfaces
– keep VMM and control plane orthogonal
– exploit virtualization
• vertical: mgmt services run in slices
• horizontal: stacks of VMs
– give no one root (least privilege + level playing field)
– support federation (decentralized control)
Other Lessons
• Inferior tracks lead to superior locomotives
• Empower the user: yum
• Build it and they (research papers) will come
• Overlays are not networks
• PlanetLab: We debug your network
• From universal connectivity to gated communities
• If you don’t talk to your university’s general counsel, you aren’t doing network research
• Work fast, before anyone cares
Collaborators
• Andy Bavier
• Marc Fiuczynski
• Mark Huang
• Scott Karlin
• Aaron Klingaman
• Martin Makowiecki
• Reid Moran
• Steve Muir
• Stephen Soltesz
• Mike Wawrzoniak
• David Culler, Berkeley
• Tom Anderson, UW
• Timothy Roscoe, Intel
• Mic Bowman, Intel
• John Hartman, Arizona
• David Lowenthal, UGA
• Vivek Pai, Princeton
• David Parkes, Harvard
• Amin Vahdat, UCSD
• Rick McGeer, HP Labs
Available CPU Capacity
Feb 1-8, 2005 (week before the SIGCOMM deadline)
[Figure: percentage of 360 nodes (y-axis, 0-100) vs. percentage of CPU available (x-axis, 10-80)]
Node Boot/Install
[Figure: boot sequence among the Node, its Boot Manager, and the PLC (MA) Boot Server]
1. Node boots from BootCD (Linux loaded)
2. Hardware initialized
3. Network config read from floppy
4. Node contacts PLC (MA)
5. PLC sends the boot manager
6. Node executes the boot manager
7. Node key read into memory from floppy
8. Boot manager invokes the Boot API
9. PLC verifies the node key and sends the current node state
10. If state = “install”, run the installer
11. Boot manager updates the node state via the Boot API
12. PLC verifies the node key and changes the state to “boot”
13. Chain-boot the node (no restart)
14. Node booted
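A hedged Python sketch of the boot-manager control flow described above; the server URL, endpoint names, state strings, and helper functions are assumptions for illustration, not the actual Boot API.

```python
import requests  # HTTPS client used to stand in for the Boot API calls

BOOT_SERVER = "https://boot.planet-lab.example"   # assumed URL

def read_node_key(path="/mnt/floppy/node.key") -> str:
    """Step 7: read the per-node key from removable media into memory."""
    with open(path) as f:
        return f.read().strip()

def boot_node() -> None:
    node_key = read_node_key()

    # Steps 8-9: authenticate to the (assumed) Boot API and fetch node state.
    state = requests.post(f"{BOOT_SERVER}/bootapi/get_state",
                          data={"key": node_key}).json()["state"]

    if state == "install":
        run_installer()                                              # step 10
        requests.post(f"{BOOT_SERVER}/bootapi/set_state",
                      data={"key": node_key, "state": "installed"})  # step 11

    # Steps 12-14: PLC flips the state to "boot"; chain-boot without restart.
    chain_boot()

def run_installer() -> None: ...
def chain_boot() -> None: ...
```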
Chain of Responsibility
Join Request: PI submits Consortium paperwork and requests to join
PI Activated: PLC verifies PI, activates account, enables site (logged)
User Activated: Users create accounts with keys, PI activates accounts (logged)
Slice Created: PI creates slice and assigns users to it (logged)
Nodes Added to Slices: Users add nodes to their slice (logged)
Slice Traffic Logged: Experiments generate traffic (logged by PlanetFlow)
Traffic Logs Centrally Stored: PLC periodically pulls traffic logs from nodes
Network Activity → Slice → Responsible Users & PI
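The chain above is what makes a complaint actionable. The hedged Python sketch below shows the kind of join the logged records support (walking from network activity to slice to responsible users and PI); the record types and field names are illustrative assumptions, not PLC's actual database schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowLog:
    node: str
    dst_ip: str
    slice_name: str          # context recorded by PlanetFlow

@dataclass
class SliceRecord:
    slice_name: str
    users: list[str]         # accounts the PI assigned to the slice (logged)
    pi_email: str            # PI who created the slice (logged)

def responsible_parties(flow_logs: list[FlowLog],
                        slice_records: list[SliceRecord],
                        dst_ip: str) -> Optional[tuple[list[str], str]]:
    """Walk the logged chain: network activity -> slice -> responsible users & PI."""
    slices = {s.slice_name: s for s in slice_records}
    for flow in flow_logs:
        if flow.dst_ip == dst_ip and flow.slice_name in slices:
            s = slices[flow.slice_name]
            return s.users, s.pi_email
    return None
```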
Slice Creation
[Figure: PI and User interact with PLC (SA); each node runs NM and VMs on a VMM]
• PI → PLC (SA): SliceCreate( ), SliceUsersAdd( )
• User → PLC (SA): SliceAttributeSet( ), SliceGetTicket( )
• PLC (SA) distributes the ticket to the slice creation service (pl_scs)
• pl_scs → NM on each node: SliverCreate(rspec)
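A hedged Python sketch of the call sequence implied above, using the operation names from the slide; the client objects, argument shapes, and return values are assumptions, not the actual PLC API.

```python
# Hypothetical wrappers around the slide's operations (not the real PLC API).

def create_slice(plc, pi_auth, user_auth, nm, slice_name, users, rspec):
    # PI creates the slice at the Slice Authority and adds its users.
    plc.SliceCreate(pi_auth, slice_name)
    plc.SliceUsersAdd(pi_auth, slice_name, users)

    # A user tags the slice and obtains a signed ticket from the SA.
    plc.SliceAttributeSet(user_auth, slice_name, {"initscript": "none"})
    ticket = plc.SliceGetTicket(user_auth, slice_name)

    # The ticket reaches the per-node slice creation service (pl_scs),
    # which asks the Node Manager to instantiate a sliver with the rspec.
    return nm.SliverCreate(ticket, rspec)
```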
Brokerage Service
[Figure: the Broker runs in its own slice alongside other VMs on each node (NM, VMM) and interacts with PLC (SA)]
• rcap = PoolCreate(rspec) at the Node Manager reserves a pool of resources for the broker
• SliceAttributeSet( ), SliceGetTicket( ) at PLC (SA)
• (distribute ticket to brokerage service)
Brokerage Service (cont)
[Figure: User, Broker, PLC (SA), and nodes (NM, VMs, VMM)]
• User → Broker: BuyResources( )
• Broker → NM: PoolSplit(rcap, slice, rspec)
• (broker contacts the relevant nodes)
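A hedged Python sketch of the brokerage flow named on these two slides; the Broker class, its bookkeeping, and the node-manager interface are illustrative assumptions built around the slide's PoolCreate, PoolSplit, and BuyResources operations.

```python
class Broker:
    """Illustrative broker: holds resource capabilities (rcaps) for pools
    it reserved on nodes, and carves slices out of them on request."""

    def __init__(self, node_managers):
        self.node_managers = node_managers      # NM proxies, one per node
        self.pools = {}                         # node -> rcap

    def acquire_pools(self, rspec):
        # Reserve a pool of resources on every node the broker manages.
        for node, nm in self.node_managers.items():
            self.pools[node] = nm.PoolCreate(rspec)

    def buy_resources(self, slice_name, rspec, nodes):
        # BuyResources( ): a user asks the broker for resources; the broker
        # contacts the relevant nodes and splits its pools for that slice.
        for node in nodes:
            nm = self.node_managers[node]
            nm.PoolSplit(self.pools[node], slice_name, rspec)
```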