Transcript Document

Global Overlay Network : PlanetLab
Claudio E. Righetti
October, 2006
(some slides taken from Larry Peterson)
• “PlanetLab: An Overlay Testbed for Broad-Coverage Services”, Bavier, Bowman, Chun, Culler, Peterson, Roscoe, Wawrzoniak. ACM SIGCOMM Computer Communications Review, Volume 33, Number 3, July 2003
• “Overcoming the Internet Impasse through Virtualization”, Anderson, Peterson, Shenker, Turner. IEEE Computer, April 2005
• “Towards a Comprehensive PlanetLab Architecture”, Larry Peterson, Andy Bavier, Marc Fiuczynski, Steve Muir, and Timothy Roscoe, June 2005. http://www.planet-lab.org/PDN/PDN-05-030
Overview
1. What is PlanetLab?
2. Architecture
   1. Local: Nodes
   2. Global: Network
3. Details
   1. Virtual Machines
   2. Maintenance
What Is PlanetLab?
• Geographically distributed
overlay network
• Testbed for broad-coverage
network services
PlanetLab Goal
“…to support seamless migration of an application from
an early prototype,
through multiple design iterations,
to a popular service that continues to evolve.”
PlanetLab Goal
“…support distributed virtualization – allocating a
widely distributed set of virtual machines to a user or
application, with the goal of supporting broad-coverage
services that benefit from having multiple
points-of-presence on the network. This is exactly the
purpose of the PlanetLab slice abstraction.”
Slices
User Opt-in
(diagram labels: Client, NAT, Server)
Challenge of PlanetLab
“ The central of challenge PlanetLab is ti provide
decentralized control of distributed virtualization.”
Long-Running Services
• Content Distribution
– CoDeeN: Princeton
– Coral: NYU
– Cobweb: Cornell
• Storage & Large File Transfer
– LOCI: Tennessee
– CoBlitz: Princeton
• Anomaly Detection & Fault Diagnosis
– PIER: Berkeley, Intel
– PlanetSeer: Princeton
• DHT
– Bamboo (OpenDHT): Berkeley, Intel
– Chord (DHash): MIT
Services (cont)
• Routing / Mobile Access
– i3: Berkeley
– DHARMA: UIUC
– VINI: Princeton
• DNS
– CoDNS: Princeton
– CoDoNs: Cornell
• Multicast
– End System Multicast: CMU
– Tmesh: Michigan
• Anycast / Location Service
– Meridian: Cornell
– Oasis: NYU
Services (cont)
• Internet Measurement
– ScriptRoute: Washington, Maryland
• Pub-Sub
– Corona: Cornell
• Email
– ePost: Rice
• Management Services
– Stork (environment service): Arizona
– Emulab (provisioning service): Utah
– Sirius (brokerage service): Georgia
– CoMon (monitoring service): Princeton
– PlanetFlow (auditing service): Princeton
– SWORD (discovery service): Berkeley, UCSD
PlanetLab Today
www.planet-lab.org
PlanetLab Today
• Global distributed systems infrastructure
– platform for long-running services
– testbed for network experiments
• 583 nodes around the world
– 30 countries
– 250+ institutions (universities, research labs, gov’t)
• Standard PC servers
– 150–200 users per server
– 30–40 active per hour, 5–10 at any given time
– memory, CPU both heavily over-utilised
Usage Stats
• Slices: 600+
• Users: 2500+
• Bytes-per-day: 3–4 TB
• IP-flows-per-day: 190M
• Unique IP-addrs-per-day: 1M
Priorities
• Diversity of Network
– Geographic
– Links
• Edge-sites, co-location and routing centers, homes (DSL,
cable-modem)
• Flexibility
– Allow experimenters maximal control over PlanetLab nodes
– Securely and fairly
Key Architectural Ideas
• Distributed virtualization
– slice = set of virtual machines
• Unbundled management
– infrastructure services run in their own slice
• Chain of responsibility
– account for behavior of third-party software
– manage trust relationships
Architecture Overview
• Slice : horizontal cut of global PlanetLab resources
• Service : set of distributed and cooperating programs
delivering some higher-level functionality
• Each service runs in a slice of PlanetLab’s global
resources
• Multiple slices run concurrently
“… slices act as network-wide containers that isolate
services from each other.”
Architecture Overview (main principals)
• Owner : an organization that hosts (owns) one or
more PlanetLab nodes
• User : a researcher that deploys a service on a set
of PL nodes
• PlanetLab Consortium (PLC) : a trusted intermediary
that manages nodes on behalf of a set of owners, and
creates slices on those nodes on behalf of a set of
users
Trust Relationships
(diagram: node-owning sites and slices interact only through PLC as a
trusted intermediary, avoiding N×N pairwise trust relationships)
Sites: Princeton, Berkeley, Washington, MIT, Brown, CMU, NYU, EPFL,
Harvard, HP Labs, Intel, NEC Labs, Purdue, UCSD, SICS, Cambridge,
Cornell, …
Slices: princeton_codeen, nyu_d, cornell_beehive, att_mcash, cmu_esm,
harvard_ice, hplabs_donutlab, idsl_psepr, irb_phi, paris6_landmarks,
mit_dht, mcgill_card, huji_ender, arizona_stork, ucb_bamboo, ucsd_share,
umd_scriptroute, …
Trust Relationships (cont)
(diagram: Node Owner, PLC, and Service Developer (User) connected by
trust edges 1–4)
1) PLC expresses trust in a user by issuing it credentials to access a slice
2) Users trust PLC to create slices on their behalf and inspect credentials
3) Owner trusts PLC to vet users and map network activity to the right user
4) PLC trusts owner to keep nodes physically secure
Principals ( PLC = MA + SA )
• Node Owners
– host one or more nodes (retain ultimate control)
– select an MA and approve one or more SAs
• Service Providers (Developers)
– implement and deploy network services
– are responsible for the service’s behavior
• Management Authority (MA)
– installs and maintains software on nodes
– creates VMs and monitors their behavior
• Slice Authority (SA)
– registers service providers
– creates slices and binds them to a responsible provider
Trust Relationships (PLC decoupling)
(diagram: Owner, Provider, MA, and SA connected by trust edges 1–6)
(1) Owner trusts MA to map network activity to responsible slice
(2) Owner trusts SA to map slice to responsible providers
(3) Provider trusts SA to create VMs on its behalf
(4) Provider trusts MA to provide working VMs & not falsely accuse it
(5) SA trusts provider to deploy responsible services
(6) MA trusts owner to keep nodes physically secure
SA is analogous to a virtual organization
Architectural Elements
(diagram: the MA with its node database, the SA with its slice database,
the Node Owner, and the Service Provider all interact with a node
running the NM + VMM, an Owner VM, the SCS, and service VMs)
Services Run in Slices
(diagram: PlanetLab nodes host virtual machines; Service / Slice A,
Service / Slice B, and Service / Slice C each run in a set of VMs
spread across the nodes)
“… to view a slice as a network of
Virtual Machines, with a set of
local resources bound to each
VM.”
Architectural Components
• Node
• Virtual Machine (VM)
• Node Manager (NM)
• Slice
• Slice Creation Service (SCS)
• Auditing Service (AS)
• Slice Authority (SA)
• Management Authority (MA)
• Owner Script
• Resource Specification (RSpec)
Per-Node View
(diagram: Node Mgr, Local Admin, and VM1, VM2, … VMn all run on top of
the Virtual Machine Monitor (VMM))
Node Architecture Goals
• Provide a virtual machine for each service running on
a node
• Isolate virtual machines
• Allow maximal control over virtual machines
• Fair allocation of resources
– Network, CPU, memory, disk
Global View
(diagram: PLC manages nodes spread across many sites)
Node
Machine capable of hosting one or more VMs
• Unique node_id (bound to a set of attributes)
• Must have at least one non-shared IP address
Virtual Machine (VM)
Execution environment in which a slice runs on a
particular node. VMs are typically implemented by a
Virtual Machine Monitor (VMM).
• Multiple VMs run on each PlanetLab node
• VMM arbitrates the node’s resources among them
• A VM is specified by a set of attributes (resource
specification, RSpec)
• The RSpec defines how much of the node’s resources are
allocated to the VM; it also specifies the VM’s type
Virtual Machine (cont)
• PlanetLab currently supports a single Linux-based
VMM
• It defines a single VM type (linux-vserver-x86)
• The most important property today is that VMs are
homogeneous
Node Manager (NM)
Program running on each node that creates VMs on
that node, and controls the resources allocated to
those VMs
• All operations that manipulate VMs on a node are
made through the NM
• Provides an interface by which infrastructure
services running on the node create VMs and bind
resources to them
Slice
Set of VMs, with each element of the set running on a
unique node
• The individual VMs that make up a slice contain no
information about the other VMs in the set, except as
managed by the service running in the slice
• Slices are uniquely identified by name
• Interpretation depends on the context (there is no single
name resolution service)
• Slice names are hierarchical, with each level denoting a
slice authority (see the sketch below)
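As a rough illustration of the hierarchical naming above, the sketch below splits a dotted slice name into its authority levels and the final slice name. It is a minimal sketch in Python: the dot-separated format and the example name plc.princeton.codeen are assumptions for illustration, not the actual PlanetLab name syntax.

# Minimal sketch: interpreting a hierarchical slice name.
# The dotted format and the example "plc.princeton.codeen" are
# illustrative assumptions only.
def parse_slice_name(name: str):
    """Split a slice name into (slice_authority_path, slice_name)."""
    parts = name.split(".")
    if len(parts) < 2:
        raise ValueError("expected at least <authority>.<slice>")
    # every level but the last names a slice authority in the hierarchy
    return parts[:-1], parts[-1]

if __name__ == "__main__":
    authorities, slice_name = parse_slice_name("plc.princeton.codeen")
    print(authorities, slice_name)   # ['plc', 'princeton'] codeen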
Slice Creation Service (SCS)
The SCS is an infrastructure service running on each node
• Typically responsible, on behalf of PLC, for creation
of the local instantiation of a slice, which it
accomplishes by calling the local NM to create a VM
on the node
• Users may also contact the SCS directly if they wish
to synchronously create a slice on a particular node
• To do so the user presents a cryptographically signed
ticket (essentially a signed RSpec), as sketched below
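A hedged sketch of that ticket path: the user hands a signed ticket (essentially a signed RSpec) to the node's SCS, which checks the signature and asks the local NM to create the VM. Every name here (the ticket fields, scs_verify_signature, nm_create_vm) is a hypothetical placeholder, not the real PlanetLab interface.

# Sketch only: ticket-based, synchronous sliver creation on one node.
# All names below are hypothetical placeholders.
def create_sliver(ticket, scs_verify_signature, nm_create_vm):
    """User -> SCS: present a signed ticket; SCS -> NM: create the VM."""
    if not scs_verify_signature(ticket["signature"], ticket["rspec"]):
        raise PermissionError("ticket signature rejected")
    # the ticket carries the slice name and the RSpec to bind to the VM
    return nm_create_vm(slice_name=ticket["slice_name"],
                        rspec=ticket["rspec"])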
Auditing Service (AS)
PLC audits the behavior of slices, and to aid in this
process, each node runs an AS. The AS records
information about packets transmitted from the node,
and is responsible for mapping network activity to the
slice that generates it.
• Trustworthy audit chain: packet signature --> slice
name --> users (sketched below)
• packet signature (source, destination, time)
• AS offers a public, web-based interface on each node
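To make the audit chain concrete, this sketch resolves a packet signature (source, destination, time) against a node's flow log to recover the slice name, then maps the slice to its users. The log layout and user table are invented for illustration; they are not the AS's actual data structures.

# Illustration only: trustworthy audit chain
#   packet signature --> slice name --> users
# The flow-log structure and user table are assumptions.
def audit(packet_sig, flow_log, slice_users):
    """packet_sig = (src, dst, time); flow_log maps flows to slice names."""
    src, dst, when = packet_sig
    for (f_src, f_dst, start, end), slice_name in flow_log.items():
        if (src, dst) == (f_src, f_dst) and start <= when <= end:
            return slice_name, slice_users.get(slice_name, [])
    return None, []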
Slice Authority (SA)
PLC, acting as an SA, maintains state for the set of
system-wide slices for which it is responsible
• There may be multiple SAs, but this section focuses on
the one managed by PLC
• We use SA to refer to both the principal and the server
that implements it
Management Authority (MA)
Owner Script
Resource Specification (Rspec)
PlanetLab’s design philosophy
• Application Programming Interface used by typical
services
• Protection Interface implemented by the VMM
PlanetLab’s node virtualization mechanisms are
characterized by where these two interfaces are drawn
PlanetLab Architecture
• Node-level
– Several virtual machines on each node, each running a
different service
• Resources distributed fairly
• Services are isolated from each other
• Network-level
– Node managers, agents, brokers, and service managers
provide interface and maintain PlanetLab
One Extreme: Software Runtimes (e.g.,
Java Virtual Machine, MS CLR)
• Very High level API
• Depend on OS to provide protection and resource
allocation
• Not flexible
Other Extreme: Complete Virtual Machine
(e.g., VMware)
• Very Low level API (hardware)
– Maximum flexibility
• Excellent protection
• High CPU/Memory overhead
– Cannot share common resources among virtual machines
• OS, common filesystem
• High-end commercial server: 10s of VMs
Mainstream Operating System
• API and protection at same level (system calls)
• Simple implementation (e.g., Slice = process group)
• Efficient use of resources (shared memory, common
OS)
• Bad protection and isolation
• Maximum Control and Security?
PlanetLab Virtualization: VServers
• Kernel patch to mainstream OS (Linux)
• Gives appearance of separate kernel for each virtual
machine
– Root privileges restricted to activities that do not affect other
vservers
• Some modification: resource control (e.g., File
handles, port numbers) and protection facilities
added
Node Software
• Linux Fedora Core 2
– kernel being upgraded to FC4
– always up-to-date with security-related patches
• VServer patches provide security
– each user gets own VM (‘slice’)
– limited root capabilities
• CKRM/VServer patches provide resource mgmt
– proportional share CPU scheduling
– hierarchical token bucket controls network Tx bandwidth
– physical memory limits
– disk quotas
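The kernel-side mechanisms listed above are not reproduced here, but the rate/burst idea behind the hierarchical token bucket is easy to sketch. This toy Python class only illustrates how a per-slice Tx limit with a base rate and a burst allowance behaves; it is not the CKRM/HTB implementation.

import time

# Toy token bucket: illustrates the rate/burst idea behind per-slice
# Tx bandwidth limits. Not the kernel HTB implementation.
class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0        # refill rate in bytes/sec
        self.capacity = burst_bytes       # maximum burst size
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return True                   # send now
        return False                      # over the limit: queue or drop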
Issues
• Multiple VM Types
– Linux vservers, Xen domains
• Federation
– EU, Japan, China
• Resource Allocation
– Policy, markets
• Infrastructure Services
– Delegation
Need to define the PlanetLab Architecture
Narrow Waist
• Name space for slices
< slice_authority, slice_name >
• Node Manager Interface
rspec = < vm_type = linux_vserver,
cpu_share = 32,
mem_limit = 128MB,
disk_quota = 5GB,
base_rate = 1Kbps,
burst_rate = 100Mbps,
sustained_rate = 1.5Mbps >
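Expressed as data, the narrow waist is just this name pair plus the RSpec. The sketch below restates the same example as a Python dictionary and hands it to a stand-in Node Manager call; nm_create_vm and the "plc" authority prefix are hypothetical, only the field names and values come from the slide.

# The same narrow-waist example as data. nm_create_vm is a hypothetical
# stand-in for the Node Manager interface.
slice_name = ("plc", "princeton_codeen")   # <slice_authority, slice_name>

rspec = {
    "vm_type":        "linux_vserver",
    "cpu_share":      32,
    "mem_limit":      "128MB",
    "disk_quota":     "5GB",
    "base_rate":      "1Kbps",
    "burst_rate":     "100Mbps",
    "sustained_rate": "1.5Mbps",
}

def nm_create_vm(slice_name, rspec):
    # a real NM would create a VM bound to exactly these resources
    print("create VM for", slice_name, "with", rspec)

nm_create_vm(slice_name, rspec)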
Node Boot/Install Process
(sequence: Node, Boot Manager, PLC Boot Server)
1. Boots from BootCD (Linux loaded)
2. Hardware initialized
3. Read network config from floppy
4. Contact PLC (MA)
5. PLC sends boot manager
6. Execute boot manager
7. Node key read into memory from floppy
8. Invoke Boot API
9. PLC verifies node key, sends current node state
10. State = “install”: run installer
11. Update node state via Boot API
12. PLC verifies node key, changes state to “boot”
13. Chain-boot node (no restart)
14. Node booted
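A compressed sketch of the boot manager's side of steps 6-13: ask PLC for the node's state, install if told to, then chain-boot. The boot_api_* callables and state names are placeholders paraphrased from the sequence above, not the real Boot API.

# Sketch of the boot manager's decision loop (steps 6-13 above).
# The boot_api_* callables are hypothetical placeholders.
def boot_manager(node_key, boot_api_get_state, boot_api_set_state,
                 run_installer, chain_boot):
    state = boot_api_get_state(node_key)      # PLC verifies key, returns state
    if state == "install":
        run_installer()                        # write the node's filesystem
        boot_api_set_state(node_key, "boot")   # update node state via Boot API
    chain_boot()                               # boot installed kernel, no restart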
PlanetFlow
• Logs every outbound IP flow on every node
– accesses ulogd via Proper
– retrieves packet headers, timestamps, context ids (batched)
– used to audit traffic
• Aggregated and archived at PLC
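A rough sketch of the aggregation step: each node's flow records, tagged with the context id of the sending slice, are reduced to per-slice byte counts before being archived at PLC. The record layout here is an assumption, not ulogd's or PlanetFlow's real schema.

from collections import defaultdict

# Illustration only: per-slice aggregation of outbound flow records.
# Records are assumed to be (context_id, src, dst, bytes, timestamp).
def aggregate(flow_records, context_to_slice):
    per_slice_bytes = defaultdict(int)
    for context_id, src, dst, nbytes, ts in flow_records:
        slice_name = context_to_slice.get(context_id, "unknown")
        per_slice_bytes[slice_name] += nbytes
    return dict(per_slice_bytes)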
Chain of Responsibility
Join Request: PI submits Consortium paperwork and requests to join
PI Activated: PLC verifies PI, activates account, enables site (logged)
User Activated: Users create accounts with keys, PI activates accounts (logged)
Slice Created: PI creates slice and assigns users to it (logged)
Nodes Added to Slices: Users add nodes to their slice (logged)
Slice Traffic Logged: Experiments run on nodes and generate traffic (logged by Netflow)
Traffic Logs Centrally Stored: PLC periodically pulls traffic logs from nodes
Network Activity → Slice → Responsible Users & PI
Slice Creation
(diagram: PI and User call PLC (SA); each node runs the NM, VMs, and the VMM)
PI → PLC (SA): SliceCreate( ), SliceUsersAdd( )
User → PLC (SA): SliceNodesAdd( ), SliceAttributeSet( ), SliceInstantiate( )
Node → PLC (SA): SliceGetAll( ), which returns slices.xml
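Only the call names in this sequence come from the slide; the sketch below simply strings them together in order, with a generic plc object standing in for whatever client library is actually used.

# Sketch of the PLC-centric slice creation path shown above.
# `plc` is a generic stand-in client; only the call names come from the slide.
def create_and_instantiate_slice(plc, slice_name, users, nodes, attrs):
    plc.SliceCreate(slice_name)                       # PI
    plc.SliceUsersAdd(slice_name, users)              # PI
    plc.SliceNodesAdd(slice_name, nodes)              # User
    for key, value in attrs.items():
        plc.SliceAttributeSet(slice_name, key, value) # User
    plc.SliceInstantiate(slice_name)                  # User
    # each node's slice creation service later pulls the full slice
    # description (slices.xml) from PLC via SliceGetAll()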
Slice Creation
PI → PLC (SA): SliceCreate( ), SliceUsersAdd( )
User → PLC (SA): SliceAttributeSet( ), SliceGetTicket( )
User → node NM: SliverCreate(ticket)
(distribute ticket to slice creation service)
Brokerage Service
PI → PLC (SA): SliceCreate( ), SliceUsersAdd( ), SliceAttributeSet( ), SliceGetTicket( )
Broker → node NM: rcap = PoolCreate(ticket)
(distribute ticket to brokerage service)
Brokerage Service (cont)
User → Broker: BuyResources( )
Broker → node NM: PoolSplit(rcap, slice, rspec)
(broker contacts relevant nodes)
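Pulling the two brokerage slides together: the broker first obtains a resource capability on each node with PoolCreate(ticket), then serves a user's BuyResources() request by splitting pieces off those pools with PoolSplit. The call names come from the slides; the objects, arguments, and return values below are illustrative assumptions.

# Sketch of the brokerage flow from the two slides above.
# Call names are from the slides; everything else is assumed.
def broker_acquire(nodes, ticket):
    """Broker sets up a resource pool on each node it manages."""
    return {node: node.nm.PoolCreate(ticket) for node in nodes}

def buy_resources(broker_pools, buyer_slice, rspec, wanted_nodes):
    """Handle a user's BuyResources() request by splitting node pools."""
    for node in wanted_nodes:
        rcap = broker_pools[node]
        node.nm.PoolSplit(rcap, buyer_slice, rspec)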
VIRTUAL MACHINES
PlanetLab Virtual Machines:
VServers
• Extend the idea of chroot(2)
– New vserver created by system call
– Descendent processes inherit vserver
– Unique filesystem, SYSV IPC, UID/GID space
– Limited root privilege
• Can’t control host node
– Irreversible
Scalability
• Reduce disk footprint using copy-on-write
– Immutable flag provides file-level CoW
– Vservers share 508MB basic filesystem
• Each additional vserver takes 29MB
• Increase limits on kernel resources (e.g., file
descriptors)
– Is the kernel designed to handle this? (inefficient data
structures?)
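A quick back-of-the-envelope check of the copy-on-write savings, using the 508 MB shared base and 29 MB per additional vserver figures from the slide:

# Disk footprint with the file-level copy-on-write sharing described above.
SHARED_BASE_MB = 508      # basic filesystem shared by all vservers
PER_VSERVER_MB = 29       # additional private data per vserver

def footprint_mb(n_vservers):
    return SHARED_BASE_MB + PER_VSERVER_MB * n_vservers

print(footprint_mb(100))   # 3408 MB with sharing
print(508 * 100)           # 50800 MB if every vserver had a private copy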
Protected Raw Sockets
• Services may need low-level network access
– Cannot allow them access to other services’ packets
• Provide “protected” raw sockets
– TCP/UDP bound to local port
– Incoming packets delivered only to service with corresponding port
registered
– Outgoing packets scanned to prevent spoofing
• ICMP also supported
– 16-bit identifier placed in ICMP header
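A minimal sketch of the delivery rule described above: an incoming packet is handed only to the slice that registered the matching local port, and an outgoing packet is accepted only if its source port belongs to the sending slice. The registry and packet layout are invented for illustration, not the kernel mechanism.

# Illustration of the "protected raw socket" delivery rule.
# port_registry maps (protocol, local_port) -> slice name (assumed layout).
def deliver_incoming(packet, port_registry):
    key = (packet["proto"], packet["dst_port"])
    return port_registry.get(key)          # only this slice sees the packet

def check_outgoing(packet, slice_name, port_registry):
    key = (packet["proto"], packet["src_port"])
    # scanned to prevent spoofing: the source port must belong to the sender
    return port_registry.get(key) == slice_name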
Resource Limits
• Node-wide cap on outgoing network bandwidth
– Protect the world from PlanetLab services
• Isolation between vservers: two approaches
– Fairness: each of N vservers gets 1/N of the resources during
contention
– Guarantees: each slice reserves certain amount of resources (e.g.,
1Mbps bandwidth, 10Mcps CPU)
• Left-over resources distributed fairly
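The two policies can be combined as sketched below: reserved guarantees are honored first, and whatever is left over is split evenly among the contending vservers. The function and numbers are illustrative only.

# Illustration of "guarantees first, left-over shared fairly".
# Units are arbitrary (e.g. Mbps of the node-wide bandwidth cap).
def allocate(total, guarantees, contenders):
    """guarantees: slice -> reserved amount; contenders: slices wanting more."""
    allocation = dict(guarantees)
    leftover = total - sum(guarantees.values())
    if contenders and leftover > 0:
        fair_share = leftover / len(contenders)     # each of N gets 1/N
        for s in contenders:
            allocation[s] = allocation.get(s, 0) + fair_share
    return allocation

# allocate(10, {"slice_a": 1}, ["slice_b", "slice_c"])
#   -> {'slice_a': 1, 'slice_b': 4.5, 'slice_c': 4.5}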
Linux and CPU Resource
Management
• The scheduler in Linux provides fairness by process,
not by vserver
– Vserver with many processes hogs CPU
• No current way for scheduler to provide guaranteed
slices of CPU time
MANAGEMENT SERVICES
PlanetLab Network Management
1. PlanetLab Nodes boot a small Linux OS from CD, run on RAM disk
2. Contacts a bootserver
3. Bootserver sends a (signed) startup script
• Boot normally or
• Write new filesystem or
• Start sshd for remote PlanetLab Admin login
• Nodes can be remotely power-cycled
Dynamic Slice Creation
1. Node Manager verifies tickets from service manager
2. Creates a new vserver
3. Creates an account on the node and on the vserver
User Logs in to PlanetLab Node
• /bin/vsh immediately:
1. Switches to the account’s associated vserver
2. Chroot()s to the associated root directory
3. Relinquishes true root privileges
4. Switches UID/GID to the account on the vserver
– Transition to vserver is transparent: it appears the user just
logged into the PlanetLab node directly (see the sketch below)
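The sketch below mirrors those steps with Python's process API purely to show the order in which privileges are narrowed; the real /bin/vsh is part of the VServer patch, the vserver context switch itself has no stdlib equivalent, and the paths and IDs here are placeholders.

import os

# Conceptual mirror of the /bin/vsh steps above; not the real vsh.
def enter_vserver(vserver_root, uid, gid):
    # 2. chroot() to the vserver's root directory
    os.chroot(vserver_root)
    os.chdir("/")
    # 3-4. relinquish true root by switching to the account's GID/UID
    os.setgid(gid)
    os.setuid(uid)
    # the transition is transparent: the shell started next looks like
    # a normal login on the node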
PlanetLab - Globus