Transcript Csci8211

Future Nets: Beyond IP
Networking
Building Large Networks (at the edge)…
• Large-scale Ethernets and enterprise networks – scaling Ethernet to millions of nodes
• Building networks for the backend of the Internet –
networks for cloud computing and data centers
Slides by Prof. Zhi-Li Zhang, UMN Advanced
Networking Course CSci5221
1
Even within a Single Administrative Domain
• Large ISPs and enterprise
networks
• Large data centers with thousands or tens of thousands of machines
• Metro Ethernet
• More and more devices are
“Internet-capable” and
plugged in
• Likely rich and more diverse
network topology and
connectivity
2
Data Center Networks
• Data centers
– Backend of the Internet
– Mid- (most enterprises) to mega-scale (Google, Yahoo, MS, etc.)
• E.g., A regional DC of a major on-line service provider consists of
25K servers + 1K switches/routers
• To ensure business continuity, and to lower operational
cost, DCs must
– Adapt to varying workloads → Breathing
– Avoid/minimize service disruption (during maintenance or failures) → Agility
– Maximize aggregate throughput → Load balancing
3
Challenges posed by These Trends
• Scalability: capability to connect tens of thousands,
millions or more users and devices
– routing table size, constrained by router memory, lookup speed
• Mobility: hosts are more mobile
– need to separate location (“addressing”) and identity (“naming”)
• Availability & Reliability: must be resilient to failures
– need to be “proactive” instead of reactive
– need to localize effect of failures
• Manageability: ease of deployment, “plug-&-play”
– need to minimize manual configuration
– self-configure, self-organize, while ensuring security and trust
• …
4
Quick Overview of Ethernet
• Dominant wired LAN technology
– Covers the first IP-hop in most enterprises/campuses
• First widely used LAN technology
• Simpler, cheaper than token LANs, ATM, and IP
• Kept up with the speed race: from 10 Mbps to 40 Gbps
– soon 100 Gbps will be widely available
[Figure: Metcalfe’s original Ethernet sketch]
5
Ethernet Frame Structure
• Addresses: source and destination MAC addresses
– Flat, globally unique, and permanent 48-bit value
– Adapter passes the frame to the network-level protocol
• if the destination address matches the adapter’s address
• or the destination address is the broadcast address
– Otherwise, the adapter discards the frame
• Type: indicates the higher layer protocol
– Usually IP
6
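To make the header layout concrete, here is a minimal Python sketch (the function name and the sample frame bytes are illustrative, not from the slides) that unpacks the destination MAC, source MAC, and Type fields from the first 14 bytes of a frame:

```python
import struct

def parse_ethernet_header(frame: bytes):
    """Unpack destination MAC, source MAC, and EtherType from a raw frame."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    as_str = lambda mac: ":".join(f"{b:02x}" for b in mac)
    return as_str(dst), as_str(src), ethertype

# Example frame bytes (made up): broadcast destination, arbitrary source,
# EtherType 0x0800 = IPv4 ("usually IP", as noted above)
frame = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"payload"
print(parse_ethernet_header(frame))
# ('ff:ff:ff:ff:ff:ff', '00:11:22:33:44:55', 2048)
```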
Interaction w/ the Upper Layer (IP)
• Bootstrapping end hosts by automating host configuration (e.g., IP address assignment)
– DHCP (Dynamic Host Configuration Protocol)
– Broadcast DHCP discovery and request messages
• Bootstrapping each conversation by enabling resolution from IP to MAC address
– ARP (Address Resolution Protocol)
– Broadcast ARP requests
• Both protocols work via Ethernet-layer broadcasting (i.e., shouting!)
– Ethernet broadcast domain: a group of hosts and switches to which the same broadcast or flooded frame is delivered
• Too large a broadcast domain leads to
– Excessive flooding and broadcasting overhead
– Insufficient security/performance isolation
7
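A toy Python sketch of the “shouting” point above (the BroadcastDomain and Host classes are made up for illustration): an ARP-style query is delivered to every host in the domain and only the owner of the queried IP answers, so the cost of each resolution grows with the size of the broadcast domain.

```python
class Host:
    def __init__(self, ip, mac):
        self.ip, self.mac = ip, mac

    def handle_arp_request(self, target_ip):
        # Only the owner of the queried IP replies with its MAC
        return self.mac if self.ip == target_ip else None

class BroadcastDomain:
    """Every ARP request is flooded to all attached hosts (the 'shouting')."""
    def __init__(self, hosts):
        self.hosts = hosts

    def arp_resolve(self, target_ip):
        replies = [r for h in self.hosts
                   if (r := h.handle_arp_request(target_ip)) is not None]
        return replies[0] if replies else None

lan = BroadcastDomain([Host("10.0.0.1", "aa:aa:aa:aa:aa:01"),
                       Host("10.0.0.2", "aa:aa:aa:aa:aa:02")])
print(lan.arp_resolve("10.0.0.2"))   # aa:aa:aa:aa:aa:02
print(lan.arp_resolve("10.0.0.9"))   # None -- but the request still reached every host
```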
State of the Practice:
A Hybrid Architecture
Enterprise networks comprised of Ethernet-based IP subnets (broadcast domains, i.e., LANs or VLANs) interconnected by routers
[Figure: broadcast domains joined by routers (R)]
Ethernet Bridging
- Flat addressing
- Self-learning
- Flooding
- Forwarding along a tree
IP Routing (e.g., OSPF)
- Hierarchical addressing
- Subnet configuration
- Host configuration
- Forwarding along shortest paths
8
Ethernet Bridging: “Routing” at L2
• Routing determines paths to destinations through
which traffic is forwarded
• Routing takes place at any layer (including L2) where
devices are reachable across multiple hops
[Figure: routing occurs at multiple layers]
- App layer: P2P or CDN routing, overlay routing
- IP layer: IP routing
- Link layer: Ethernet bridging
9
Ethernet (Layer-2) “Routing”
• Self-learning algorithm for dynamically building switch
(forwarding) tables
– “Eavesdrop” on source MACs of data packets
– Associate source MACs with port # (cached, “soft-state”)
• Forwarding algorithm
– If the dst MAC is found in the switch table, send to the corresponding port
– Otherwise, flood to all ports (except the one it comes from)
• Dealing with “loopy” topologies
– Running (periodically) spanning tree algorithm to convert it
into a tree (rooted at an “arbitrary” node)
• 802.11 Wireless LANs use somewhat similar methods
– Use the same 48-bit MAC addresses, but more complex frame structures
– End hosts need to explicitly associate with APs
10
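A minimal Python sketch of the self-learning and flooding behavior described above (the class, port IDs, and the soft-state timeout value are illustrative):

```python
import time

class LearningSwitch:
    """Sketch of Ethernet self-learning: eavesdrop on source MACs, cache
    MAC-to-port mappings as soft state, flood when the destination is unknown."""
    def __init__(self, ports, ttl=300):
        self.ports = ports          # list of port IDs on this switch
        self.table = {}             # src MAC -> (port, last_seen)  "soft state"
        self.ttl = ttl              # entries expire so stale locations age out

    def handle_frame(self, src_mac, dst_mac, in_port):
        # Learn: associate the source MAC with the ingress port
        self.table[src_mac] = (in_port, time.time())
        # Forward: use the table if the destination is known and fresh, else flood
        entry = self.table.get(dst_mac)
        if entry and time.time() - entry[1] < self.ttl:
            return [entry[0]]
        return [p for p in self.ports if p != in_port]   # flood (all but ingress)

sw = LearningSwitch(ports=[1, 2, 3, 4])
print(sw.handle_frame("aa:01", "aa:02", in_port=1))  # unknown dst -> flood to [2, 3, 4]
print(sw.handle_frame("aa:02", "aa:01", in_port=2))  # aa:01 already learned -> [1]
```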
Layer 2 vs. Layer 3 Again
Neither bridging nor routing is satisfactory.
Can’t we take only the best of each?
Features                   Ethernet Bridging   IP Routing   SEATTLE
Ease of configuration      ✓                   ✗            ✓
Optimality in addressing   ✓                   ✗            ✓
Host mobility              ✓                   ✗            ✓
Path efficiency            ✗                   ✓            ✓
Load distribution          ✗                   ✓            ✓
Convergence speed          ✗                   ✓            ✓
Tolerance to loop          ✗                   ✓            ✓
11
SEATTLE
(Scalable Ethernet ArchiTecTure for Larger Enterprises)
• Plug-and-playable enterprise architecture ensuring both scalability and efficiency
• Objectives
– Avoiding flooding
– Restraining broadcasting
– Keeping forwarding tables small
– Ensuring path efficiency
• SEATTLE architecture – design principles
– Hash-based location management
– Shortest-path forwarding
– Responding to network dynamics (reactive location resolution and caching)
• Lessons
– Trading a little data-plane efficiency for huge control-plane scalability makes a qualitatively different system
12
SEATTLE: Illustration
[Figure: the entire enterprise forms a single large IP subnet; switches form a link-state (LS) core. Control flow: when host x is discovered at (or registers with) its egress switch A, the mapping <x, A> is published at the relay switch chosen by hashing x (F(x) = B) and stored at B. Data flow: when host y attached to switch D sends traffic to x, D computes F(x) = B, tunnels the traffic to relay switch B, and B forwards it to egress switch A while notifying D of <x, A>; subsequent traffic is then forwarded directly (optimized forwarding) from D to A.]
14
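A simplified Python sketch of the hash-based location management shown above (the SeattleControlPlane class is made up for illustration; the real design uses consistent hashing over the live set of switches and a link-state core, which this sketch glosses over):

```python
import hashlib

class SeattleControlPlane:
    """Simplified sketch of SEATTLE-style hash-based location management."""
    def __init__(self, switches):
        self.switches = sorted(switches)
        self.store = {s: {} for s in switches}   # relay switch -> {host MAC: egress switch}

    def relay_for(self, host_mac):
        # F(x): hash the host MAC onto one of the switches
        h = int(hashlib.sha1(host_mac.encode()).hexdigest(), 16)
        return self.switches[h % len(self.switches)]

    def register(self, host_mac, egress_switch):
        # The egress switch publishes <x, A> at relay B = F(x)
        self.store[self.relay_for(host_mac)][host_mac] = egress_switch

    def resolve(self, host_mac):
        # An ingress switch asks relay B for <x, A>; the result is cached so
        # later traffic can be tunneled directly to the egress switch
        return self.store[self.relay_for(host_mac)].get(host_mac)

cp = SeattleControlPlane(["A", "B", "C", "D", "E"])
cp.register("mac-of-x", "A")        # host x attaches to switch A
print(cp.resolve("mac-of-x"))       # 'A' -- traffic to x is tunneled to A
```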
Cloud Computing and Data Centers:
• What’s Cloud Computing?
• Data Centers and “Computing at Scale”
• Case Studies:
– Google File System
– Map-Reduce Programming Model
– Optional material: Google Bigtable

Cloud Computing and Data Centers
Why Study this:
• they represent part of current and “future” trends
– how applications will be serviced, delivered, …
– what are important “new” networking problems?
• more importantly, what lessons can we learn in terms of
(future) networking design?
– closely related, and there are many similar issues/challenges
(availability, reliability, scalability, manageability, ….)
– (but of course, there are also unique challenges in networking)
16
Internet and Web
• Simple client-server model
– a number of clients served by a single server
– performance determined by “peak load”
– doesn’t scale well (e.g., the server crashes) when the # of clients suddenly increases -- a “flash crowd”
• From single server to blade server to server farm (or
data center)
17
Internet and Web …
• From “traditional” web to “web service” (or SOA)
– no longer simply “file” (or web page) downloads
• pages often dynamically generated, more complicated “objects”
(e.g., Flash videos used in YouTube)
– HTTP is used simply as a “transfer” protocol
• many other “application protocols” layered on top of HTTP
– web services & SOA (service-oriented architecture)
• A schematic representation of “modern” web services
– front-end: web rendering, request routing, aggregators, …
– back-end: database, storage, computing, …
18
Data Center and Cloud Computing
• Data center: large server farms + data warehouses
– not simply for web/web services
– managed infrastructure: expensive!
• From web hosting to cloud computing
– individual web/content providers: must provision for peak load
• Expensive, and typically resources are under-utilized
– web hosting: a third party provides and owns the (server farm) infrastructure, hosting web services for content providers, under client web service control
– “server consolidation” via virtualization
[Figure: virtualized server stack (App / Guest OS / VMM)]
19
Cloud Computing
• Cloud computing and cloud-based services:
– beyond web-based “information access” or “information
delivery”
– computing, storage, …
• Cloud Computing: NIST Definition
"Cloud computing is a model for enabling convenient, on-demand network
access to a shared pool of configurable computing resources (e.g.,
networks, servers, storage, applications, and services) that can be rapidly
provisioned and released with minimal management effort or service
provider interaction."
• Models of Cloud Computing
– “Infrastructure as a Service” (IaaS), e.g., Amazon EC2, Rackspace
– “Platform as a Service” (PaaS), e.g., Microsoft Azure
– “Software as a Service” (SaaS), e.g., Google
20
Data Centers: Key Challenges
With thousands of servers within a data center,
• How to write applications (services) for them?
• How to allocate resources, and manage them?
– in particular, how to ensure performance, reliability, availability, …
• Scale and complexity bring other key challenges
– with thousands of machines, failures are the default case!
– load-balancing, handling “heterogeneity,” …
• data center (server cluster) as a “computer”
• “super-computer” vs. “cluster computer”
– A single “super-high-performance” and highly reliable computer
– vs. a “computer” built out of thousands of “cheap & unreliable” PCs
– Pros and cons?
21
Case Studies
• Google File System (GFS)
– a “file system” (or “OS”) for “cluster computer”
• An “overlay” on top of “native” OS on individual machines
– designed with certain (common) types of applications in
mind, and designed with failures as default cases
• Google MapReduce (cf. Microsoft Dryad)
– MapReduce: a new “programming paradigm” for certain
(common) types of applications, built on top of GFS
• Other examples (optional):
– BigTable: a (semi-)structured database for efficient key-value queries, etc., built on top of GFS
– Amazon Dynamo: a distributed <key, value> storage system; high availability is a key design goal
– Google’s Chubby, Sawzall, etc.
– Open source systems: Hadoop, …
22
Google Scale and Philosophy
• Lots of data
– copies of the web, satellite data, user data, email and
USENET, Subversion backing store
• Workloads are large and easily parallelizable
• No commercial system big enough
– couldn’t afford it if there was one
– might not have made appropriate design choices
– But truckloads of low-cost machines
• 450,000 machines (NYTimes estimate, June 14th 2006)
• Failures are the norm
– Even reliable systems fail at Google scale
• Software must tolerate failures
– Which machine an application is running on should not matter
– Firm believers in the “end-to-end” argument
• Care about perf/$, not absolute machine perf
Typical Cluster at Google
[Figure: a typical cluster. Cluster-wide services include a Cluster Scheduling Master, a Lock Service, a GFS Master, and a BigTable Master (themselves hosted on machines in the cluster); each machine runs Linux, a GFS chunkserver, and a scheduler slave, plus user tasks and BigTable servers.]
Google: System Building Blocks
• Google File System (GFS):
– raw storage
• (Cluster) Scheduler:
– schedules jobs onto machines
• Lock service: Chubby
– distributed lock manager
– also can reliably hold tiny files (100s of bytes)
w/ high availability
– 5 replicas (need majority vote)
• Bigtable:
– a multi-dimensional database
• MapReduce:
– simplified large-scale data processing
Google File System
Key Design Considerations
• Component failures are the norm
– hardware component failures, software bugs, human errors, power supply issues, …
– Solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery
• Files are huge by traditional standards
– multi-GB files are common, billions of objects
– most writes (modifications or “mutations”) are “append”
– two types of reads: large # of “stream” (i.e., sequential)
reads, with small # of “random” reads
• High concurrency (multiple “producers/consumers” on a file)
– atomicity with minimal synchronization
• Sustained bandwidth more important than latency
GFS Architectural Design
• A GFS cluster:
– a single master + multiple chunkservers per master
– running on commodity Linux machines
• A file: a sequence of fixed-sized chunks (64 MBs)
– labeled with 64-bit unique global IDs,
– stored at chunkservers (as “native” Linux files, on local disk)
– each chunk mirrored across (default 3) chunkservers
• master server: maintains all metadata
– name space, access control, file-to-chunk mappings, garbage
collection, chunk migration
– why only a single master? (with read-only shadow masters)
• simple, and it only answers chunk-location queries from clients!
• chunkservers (“slaves” or “workers”):
– interact directly with clients, perform reads/writes, …
GFS Architecture: Illustration
[Figure: separation of control flow (client ↔ master, metadata only) and data flow (client ↔ chunkservers)]
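A toy Python sketch of the control/data separation in the figure: the client contacts the master only for chunk locations, then reads bytes directly from a chunkserver replica. The 64 MB chunk size follows the slides; the class and method names are illustrative.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB chunks, as described above

class GFSMaster:
    """Holds metadata only: (file name, chunk index) -> (chunk ID, replica locations)."""
    def __init__(self):
        self.chunk_map = {}

    def lookup(self, filename, offset):
        return self.chunk_map[(filename, offset // CHUNK_SIZE)]

class Chunkserver:
    def __init__(self):
        self.chunks = {}   # chunk ID -> bytes (stored as local Linux files in real GFS)

    def read(self, chunk_id, chunk_offset, length):
        return self.chunks[chunk_id][chunk_offset:chunk_offset + length]

def gfs_read(master, filename, offset, length):
    # Control flow: fetch metadata from the master...
    chunk_id, replicas = master.lookup(filename, offset)
    # ...data flow: fetch bytes directly from one of the (typically 3) replicas
    return replicas[0].read(chunk_id, offset % CHUNK_SIZE, length)

master, cs = GFSMaster(), Chunkserver()
cs.chunks["chunk-0001"] = b"x" * 1024
master.chunk_map[("bigfile", 0)] = ("chunk-0001", [cs])
print(gfs_read(master, "bigfile", offset=0, length=10))   # b'xxxxxxxxxx'
```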
GFS: Summary
• GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware
– GFS has different points in the design space
• Component failures as the norm
• Optimize for huge files
– Success: used actively by Google to support search service and
other applications
– But performance may not be good for all apps
• assumes read-once, write-once workload (no client caching!)
• GFS provides fault tolerance
– Replicating data (via chunk replication), fast and automatic recovery
• GFS has the simple, centralized master that does not become a
bottleneck
• Semantics not transparent to apps (“end-to-end” principle?)
– Must verify file contents to avoid inconsistent regions, repeated
appends (at-least-once semantics)
29
Google MapReduce
• The problem
– Many simple operations at Google
• Grep over data, compute indexes, compute summaries, etc.
– But the input data is large, really large
• The whole Web, billions of pages
– Google has lots of machines (clusters of ~10K machines, etc.)
– Many computations over VERY large datasets
– Question is: how do you use large # of machines efficiently?
• Can reduce the computational model down to two steps
– Map: take one operation, apply it to many, many data tuples
– Reduce: take the results, aggregate them
• MapReduce
– A generalized interface for massively parallel cluster processing
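A minimal in-memory Python sketch of the two-step model just described, using the canonical word-count example; the tiny driver below stands in for the cluster framework (partitioning, scheduling, and fault tolerance are omitted):

```python
from collections import defaultdict

# User-supplied functions for the word-count example
def map_fn(doc_id, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def mapreduce(inputs, map_fn, reduce_fn):
    """Map every record, group intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items()) for kv in reduce_fn(k, vs))

docs = [("d1", "the web is large"), ("d2", "the web is the web")]
print(mapreduce(docs, map_fn, reduce_fn))
# {'is': 2, 'large': 1, 'the': 3, 'web': 3}
```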
Data Center Networking
Major Theme:
What are new networking issues posed by
large-scale data centers?
• Network architecture?
• Topology design?
• Addressing?
• Routing?
• Forwarding?
31
Data Center Interconnection
Structure
• Nodes in the system: racks of servers
• How are the nodes (racks) inter-connected?
– Typically a hierarchical inter-connection structure
• Today’s typical data center structure (Cisco-recommended), starting from the bottom level:
– rack switches
– 1-2 layers of (layer-2) aggregation switches
– access routers
– core routers
• Is such an architecture good enough?
32
Cisco Recommended DC Structure: Illustration
[Figure: racks of servers connect to top-of-rack switches, which connect through layer-2 switches (S) and load balancers (LB) to layer-3 access routers (AR) and core routers (CR) toward the Internet.]
Key:
• CR = L3 Core Router
• AR = L3 Access Router
• S = L2 Switch
• LB = Load Balancer
• A = Rack of 20 servers with Top-of-Rack switch
33
Data Center Design Requirements
• Data centers typically run two types of applications
– outward facing (e.g., serving web pages to users)
– internal computations (e.g., MapReduce for web indexing)
• Workloads often unpredictable:
– Multiple services run concurrently within a DC
– Demand for new services may spike unexpectedly
• A spike in demand for a new service means success!
• But this is when success spells trouble (if not prepared)!
• Failures of servers are the norm
– Recall that GFS, MapReduce, etc., resort to dynamic reassignment of chunkservers, jobs/tasks (worker servers) to
deal with failures; data is often replicated across racks, …
– The “traffic matrix” between servers is constantly changing
34
Data Center Costs
Amortized Cost*   Component              Sub-Components
~45%              Servers                CPU, memory, disk
~25%              Power infrastructure   UPS, cooling, power distribution
~15%              Power draw             Electrical utility costs
~15%              Network                Switches, links, transit
*3-year amortization for servers, 15-year for infrastructure; 5% cost of money
• Total cost varies
– upwards of $1/4 B for mega data center
– server costs dominate
– network costs significant
• Long provisioning timescales:
– new servers purchased quarterly at best
Source: Greenberg, Hamilton, Maltz, and Patel, “The Cost of a Cloud: Research Problems in Data Center Networks,” ACM SIGCOMM CCR, 2009.
36
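The footnote’s assumptions can be turned into a monthly figure with the standard annuity formula; a small Python sketch where the dollar inputs are hypothetical and only the 3-year/15-year periods and the 5% cost of money come from the table:

```python
def monthly_amortized(capital_cost, years, annual_rate=0.05):
    """Standard annuity formula: convert a capital outlay into an equal monthly
    payment over its amortization period at the given cost of money."""
    r = annual_rate / 12
    n = years * 12
    return capital_cost * r / (1 - (1 + r) ** -n)

# Hypothetical inputs, just to show how the footnote's assumptions are applied
servers = monthly_amortized(100e6, years=3)    # servers amortized over 3 years
facility = monthly_amortized(80e6, years=15)   # power/cooling infrastructure over 15 years
print(f"servers ~${servers/1e6:.1f}M/month, facility ~${facility/1e6:.1f}M/month")
```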
Overall Data Center Design Goal
Agility – Any service, Any Server
• Turn the servers into a single large fungible pool
– Let services “breathe” : dynamically expand and contract
their footprint as needed
• We already see how this is done in terms of Google’s GFS,
BigTable, MapReduce
• Benefits
– Increase service developer productivity
– Lower cost
– Achieve high performance and reliability
These are the three motivators for most data center
infrastructure projects!
37
Achieving Agility …
• Workload Management
– means for rapidly installing a service’s code on a server
– dynamic cluster scheduling and server assignment ✓ (e.g., MapReduce, Bigtable, …)
– virtual machines, disk images ✓
• Storage Management
– means for a server to access persistent data
– distributed file systems (e.g., GFS) ✓
• Network Management
– means for communicating with other servers, regardless of where they are in the data center
– achieve high performance and reliability
38
Networking Objectives
1. Uniform high capacity
– Capacity between servers limited only by their NICs
– No need to consider topology when adding servers
=> In other words, high capacity between any two servers, no matter which racks they are located in!
2. Performance isolation
– Traffic of one service should be unaffected by others
3. Ease of management: “Plug-&-Play” (layer-2 semantics)
– Flat addressing, so any server can have any IP address
– Server configuration is the same as in a LAN
– Legacy applications depending on broadcast must work
39
Is Today’s DC Architecture Adequate?
• Hierarchical network; 1+1 redundancy
• Equipment higher in the hierarchy handles more traffic
• more expensive, with more effort spent on availability → a scale-up design
• Servers connect via 1 Gbps UTP to Top-of-Rack switches
• Other links are mix of 1G, 10G; fiber, copper
• Uniform high capacity?
• Performance isolation?
– typically via VLANs
• Agility in terms of dynamically adding or shrinking servers?
• Agility in terms of adapting to failures, and to traffic dynamics?
• Ease of management?
[Figure: the hierarchical Internet / CR / AR / LB / S topology from the earlier illustration. Key: CR = L3 Core Router, AR = L3 Access Router, S = L2 Switch, LB = Load Balancer, A = Top-of-Rack switch]
40
Case Studies
• A Scalable, Commodity Data Center Network Architecture
– a new fat-tree “inter-connection” structure (topology) to increase “bisection” bandwidth
• needs “new” addressing, forwarding/routing
• VL2: A Scalable and Flexible Data Center Network
– consolidate layer-2/layer-3 into a “virtual layer 2”
– separating “naming” and “addressing”, also deal with
dynamic load-balancing issues
Other Approaches:
• PortLand: A Scalable Fault-Tolerant Layer 2 Data Center
Network Fabric
• BCube: A High-Performance, Server-centric Network
Architecture for Modular Data Centers
41
A Scalable, Commodity Data Center
Network Architecture
• Main Goal: addressing the limitations of today’s
data center network architecture
– single point of failure
– oversubscription of links higher up in the topology
• trade-offs between cost and provisioned capacity
• Key Design Considerations/Goals
– Allows host communication at line speed
• no matter where they are located!
– Backwards compatible with existing infrastructure
• no changes in application & support of layer 2 (Ethernet)
– Cost effective
• cheap infrastructure
• and low power consumption & heat emission
42
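To see why oversubscription higher up in the topology is a cost/capacity trade-off, here is a small back-of-the-envelope Python sketch; the port counts and link speeds are hypothetical but typical of the hierarchy described earlier.

```python
def oversubscription(ports_down, speed_down_gbps, ports_up, speed_up_gbps):
    """Ratio of worst-case demand entering a switch tier to the capacity leaving it."""
    return (ports_down * speed_down_gbps) / (ports_up * speed_up_gbps)

# Hypothetical numbers: 40 x 1G servers per ToR with 2 x 10G uplinks,
# then 20 such ToR uplink pairs aggregated into 4 x 10G links toward the core.
print(oversubscription(40, 1, 2, 10))       # 2.0 : 1 at the rack layer
print(oversubscription(20 * 2, 10, 4, 10))  # 10.0 : 1 at the aggregation layer
```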
Fat-Tree Based DC Architecture
• Inter-connect racks (of servers) using a fat-tree topology
• Fat-Tree: a special type of Clos Networks (after C. Clos)
K-ary fat tree: three-layer topology (edge, aggregation and core)
– each pod consists of (k/2)2 servers & 2 layers of k/2 k-port switches
– each edge switch connects to k/2 servers & k/2 aggr. switches
– each aggr. switch connects to k/2 edge & k/2 core switches
– (k/2)2 core switches: each connects to k pods
Fat-tree
with K=2
43
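The counts in the bullets above can be computed directly; a small Python sketch (the function name is illustrative):

```python
def fat_tree_sizes(k):
    """Counts implied by the k-ary fat-tree description above (k even)."""
    assert k % 2 == 0
    servers_per_pod = (k // 2) ** 2
    return {
        "pods": k,
        "edge_switches": k * (k // 2),
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
        "servers": k * servers_per_pod,      # = k**3 / 4
    }

print(fat_tree_sizes(4))   # 16 servers, 4 core switches, 4 pods
print(fat_tree_sizes(48))  # 27,648 servers from 48-port commodity switches
```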
Fat-Tree Based Topology …
• Why Fat-Tree?
– Fat tree has identical bandwidth at any bisections
– Each layer has the same aggregated bandwidth
• Can be built using cheap devices with uniform capacity
– Each port supports same speed as end host
– All devices can transmit at line speed if packets are distributed uniformly along the available paths
• Great scalability
Fat tree network with K = 3 supporting 54 hosts
44
Cost of Maintaining Switches
45
Fat-tree Topology is Great, But …
Is using a fat-tree topology to inter-connect racks of servers in itself sufficient?
• What routing protocols should we run on these
switches?
• Layer 2 switch algorithm: data plane flooding!
• Layer 3 IP routing:
– shortest path IP routing will typically use only one path
despite the path diversity in the topology
– if equal-cost multi-path (ECMP) routing is used at each switch independently and blindly, packet re-ordering may occur; further, load may not necessarily be well balanced
– Aside: control plane flooding!
46
FAT-Tree Modified
• Enforce a special (IP) addressing scheme in the DC
– unused.PodNumber.switchnumber.Endhost
– allows hosts attached to the same switch to route only through that switch
– allows intra-pod traffic to stay within the pod
• Use two level look-ups to distribute traffic and
maintain packet ordering
• First level is a prefix lookup
– used to route down the topology to servers
• Second level is a suffix lookup
– used to route up towards the core
– maintains packet ordering by using the same ports for the same server
47
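A toy Python sketch of the two-level lookup idea: prefix entries route down toward servers in this pod, and a suffix entry on the host byte routes up toward the core so that packets to the same server always take the same uplink. The concrete table entries and the pod.switch.host address layout (10.x.y.z form) are illustrative.

```python
import ipaddress

class TwoLevelTable:
    """Prefix entries route 'down' toward servers in this pod; if none match,
    a suffix entry (keyed on the host byte) routes 'up' toward a core switch,
    so all packets to the same destination host take the same uplink."""
    def __init__(self, prefixes, suffixes):
        self.prefixes = prefixes   # list of (network, out_port)
        self.suffixes = suffixes   # host-ID byte -> uplink port

    def lookup(self, dst_ip):
        addr = ipaddress.ip_address(dst_ip)
        for net, port in self.prefixes:
            if addr in ipaddress.ip_network(net):
                return port
        return self.suffixes[int(dst_ip.split(".")[-1]) % len(self.suffixes)]

# Illustrative table for one aggregation switch in pod 2 of a k=4 fat tree
table = TwoLevelTable(
    prefixes=[("10.2.0.0/24", 0), ("10.2.1.0/24", 1)],   # down to edge switches
    suffixes={0: 2, 1: 3},                               # up to core, spread by host ID
)
print(table.lookup("10.2.1.2"))   # stays in the pod -> port 1
print(table.lookup("10.0.1.2"))   # other pod -> uplink chosen by the host suffix
```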
More on Fat-Tree DC Architecture
Diffusion Optimizations
• Flow classification
– Eliminates local congestion
– Assign traffic to ports on a per-flow basis instead of a per-host basis
• Flow scheduling
– Eliminates global congestion
– Prevent long lived flows from sharing the same links
– Assign long lived flows to different links
What are potential drawbacks of this architecture?
48
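A minimal Python sketch of per-flow (rather than per-packet or per-host) port assignment: hashing the 5-tuple keeps all packets of a flow on one uplink, which avoids reordering while spreading distinct flows across links. The paper’s flow classifier additionally reassigns flows over time to relieve congestion, which this static hash does not capture.

```python
import hashlib

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto, uplinks):
    """Hash the 5-tuple so every packet of a flow uses the same uplink
    (no reordering), while different flows spread across the uplinks."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = int(hashlib.md5(key).hexdigest(), 16)
    return uplinks[digest % len(uplinks)]

uplinks = ["core-0", "core-1", "core-2", "core-3"]
print(pick_uplink("10.0.1.2", "10.2.0.3", 51000, 80, "tcp", uplinks))
print(pick_uplink("10.0.1.2", "10.2.0.3", 51001, 80, "tcp", uplinks))  # may differ
```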