Lecture Note


Cloud Service Infrastructure
ECE7650
Key Components
 Datacenter Architecture
 Cluster: compute, storage
 Network for DC
 Storage
 Virtualization
 Cloud Management SW
 Resource Usage Metering
 Automated Systems Management
 Privacy and Security Measures
 Cloud Programming Env
2
What’s a Cluster?
 Broadly, a group of networked autonomous computers that work together to form a single machine in many respects:
 To improve performance (speed)
 To improve throughput
 To improve service availability (high-availability clusters)
 Based on commercial off-the-shelf components, the system is often more cost-effective than a single machine with comparable speed or availability
Highly Scalable Clusters
 High Performance Cluster (aka Compute Cluster)
 A form of parallel computer, which aims to solve problems faster by using multiple compute nodes
 For parallel efficiency, the nodes are often closely coupled in a high-throughput, low-latency network
 Server Cluster and Datacenter
 Aims to improve the system's throughput, service availability, power consumption, etc. by using multiple nodes
Top500 Installation of
Supercomputers
Top500.com
Clusters in Top500
An Example of Top500 Submission (F’08)
Location: Tukwila, WA
Hardware – Machines: 256 dual-CPU, quad-core Intel 5320 Clovertown 1.86GHz CPUs and 8GB RAM
Hardware – Networking: Private & Public: Broadcom GigE; MPI: Cisco InfiniBand SDR, 34 IB switches in leaf/node configuration
Number of Compute Nodes: 256
Total Number of Cores: 2048
Total Memory: 2 TB of RAM
Particulars of the current Linpack runs:
Best Linpack Result: 11.75 TFLOPS
Best Cluster Efficiency: 77.1%
For Comparison…
Linpack rating from the June 2007 Top500 run (#106) on the same hardware: 8.99 TFLOPS
Cluster efficiency from the June 2007 Top500 run (#106) on the same hardware: 59%
Typical Top500 efficiency for Clovertown motherboards w/ IB, regardless of operating system: 65-77% (2 instances of 79%)
Note: ~30% improvement in efficiency on the same hardware; about one hour to deploy
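The 77.1% efficiency figure can be checked against the theoretical peak. A minimal sketch, assuming 4 double-precision flops per cycle per Clovertown core (an assumption, not stated on the slide):

```python
# Rough Linpack efficiency check (assumes 4 DP flops/cycle for Clovertown cores).
cores = 2048
clock_ghz = 1.86
flops_per_cycle = 4              # assumption for Intel Clovertown

rpeak_tflops = cores * clock_ghz * flops_per_cycle / 1000.0   # ~15.2 TFLOPS
rmax_tflops = 11.75              # best Linpack result from the slide

efficiency = rmax_tflops / rpeak_tflops
print(f"Rpeak ~ {rpeak_tflops:.2f} TFLOPS, efficiency ~ {efficiency:.1%}")   # ~77.1%
```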
Beowulf Cluster
A cluster of inexpensive PCs for low-cost personal supercomputing
 Based on commodity off-the-shelf components:
 PC computers running a Unix-like OS (BSD, Linux, or OpenSolaris)
 Interconnected by an Ethernet LAN
 A head node, plus a group of compute nodes
 The head node controls the cluster, and serves files to the compute nodes
 Standard, free and open-source software
 Programming in MPI
 MapReduce
Why Clustering Today
 Powerful nodes (CPU, mem, storage)
 Today's PC is yesterday's supercomputer
 Multi-core processors
 High-speed networks
 Gigabit Ethernet (56% in Top500 as of Nov 2008)
 InfiniBand System Area Network (SAN) (24.6%)
 Standard tools for parallel/distributed computing & their growing popularity
 MPI, PBS, etc.
 MapReduce for data-intensive computing
Major issues in Cluster Design
 Programmability
 Sequential vs parallel programming
 MPI, DSM, DSA: hybrid of multithreading and MPI (see the MPI sketch after this list)
 MapReduce
 Cluster-aware resource management
 Job scheduling (e.g. PBS)
 Load balancing, data locality, communication optimization, etc.
 System management
 Remote installation, monitoring, diagnosis
 Failure management, power management, etc.
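A minimal sketch of the message-passing style referenced above, assuming the mpi4py bindings are installed; launch it under an MPI runtime (e.g. mpiexec). The computation is illustrative.

```python
# Minimal MPI example (assumes mpi4py; run with e.g.  mpiexec -n 4 python mpi_sum.py).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # this process's id
size = comm.Get_size()          # total number of processes

# Each rank computes a partial sum; reduce combines the partial sums on rank 0.
partial = sum(range(rank, 1000, size))
total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print("total =", total)     # 499500
```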
Cluster Architecture
 Multi-core node architecture
 Cluster Interconnect
(Figure: a single-core computer with a single-core CPU chip)
Multicore Architecture
 Combine 2 or more
independent cores (normally
CPU) into a single package
 Support multitasking and
multithreading in a single
physical package
Multicore is Everywhere
 Dual-core commonplace in laptops
 Quad-core in desktops
 Dual quad-core in servers
 All major chip manufacturers produce
multicore CPUs
SUN Niagara (8 cores, 64 concurrent threads)
 Intel Xeon (6 cores)
 AMD Opteron (4 cores)

Multithreading on multi-core
David Geer, IEEE Computer, 2007
Interaction with the OS
 OS perceives each core as a separate
processor
 OS scheduler maps threads/processes
to different cores
 Most major OS support multi-core today:
Windows, Linux, Mac OS X, …
Cluster Interconnect
 Network fabric connecting the compute
nodes
 Objective is to strike a balance between
 Processing power of the compute nodes
 Communication ability of the interconnect
 A more specialized LAN, providing many opportunities for perf. optimization
 Switches in the core
 Latency vs bandwidth
(Figure: switch internals showing input ports/receivers, input buffers, a cross-bar, output buffers/transmitters at the output ports, and control logic for routing and scheduling)
Goal: Bandwidth and Latency
(Figure: latency and delivered bandwidth vs offered bandwidth; both latency and delivered bandwidth saturate as the offered bandwidth approaches the network's capacity)
Ethernet Switch: allows multiple simultaneous transmissions
 Hosts have a dedicated, direct connection to the switch
 Switches buffer packets
 Ethernet protocol is used on each incoming link, but no collisions; full duplex
 Each link is its own collision domain
 Switching: A-to-A' and B-to-B' simultaneously, without collisions
 Not possible with a dumb hub
(Figure: switch with six interfaces (1,2,3,4,5,6) connecting hosts A, A', B, B', C, C')
Switch Table
 Q: how does the switch know that A' is reachable via interface 4, and B' via interface 5?
 A: each switch has a switch table; each entry:
 (MAC address of host, interface to reach host, time stamp)
 looks like a routing table!
 Q: how are entries created and maintained in the switch table?
 something like a routing protocol?
(Figure: switch with six interfaces (1,2,3,4,5,6) connecting hosts A, A', B, B', C, C')
Switch: self-learning
 Switch learns which hosts can be reached through which interfaces
 When a frame is received, the switch "learns" the location of the sender: the incoming LAN segment
 It records the sender/location pair in its switch table
(Figure: a frame with Source A, Dest A' arrives on interface 1; the switch table, initially empty, now holds the entry: MAC addr A, interface 1, TTL 60)
Self-learning, forwarding: example
 Frame destination unknown: flood
 Destination location known: selective send
(Figure: a frame with Source A, Dest A' enters on interface 1 and is flooded; the reply from A' enters on interface 4 and is sent selectively toward A. The switch table, initially empty, ends up with the entries (A, interface 1, TTL 60) and (A', interface 4, TTL 60))
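A small Python sketch of the self-learning/forwarding behavior above. The table layout (MAC, interface, TTL) follows the slide; the class and method names are illustrative.

```python
import time

# Toy self-learning Ethernet switch: learn (source MAC -> interface),
# flood unknown destinations, selectively forward known ones.
class LearningSwitch:
    def __init__(self, num_ports, ttl=60):
        self.ports = range(1, num_ports + 1)
        self.ttl = ttl
        self.table = {}                    # MAC -> (interface, expiry time)

    def on_frame(self, src_mac, dst_mac, in_port):
        # Learn: record the sender/location pair with a fresh TTL.
        self.table[src_mac] = (in_port, time.time() + self.ttl)

        entry = self.table.get(dst_mac)
        if entry and entry[1] > time.time():
            out_port, _ = entry
            return [out_port] if out_port != in_port else []    # selective send
        return [p for p in self.ports if p != in_port]           # flood

sw = LearningSwitch(num_ports=6)
print(sw.on_frame("A", "A'", in_port=1))   # A' unknown -> flood to [2, 3, 4, 5, 6]
print(sw.on_frame("A'", "A", in_port=4))   # A already learned -> [1]
```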
Interconnecting switches
 Switches can be connected together
(Figure: switches S1-S4 interconnecting hosts A-I)
 Q: sending from A to G - how does S1 know to forward a frame destined to G via S4 and S3?
 A: self-learning! (works exactly the same as in the single-switch case!)
 Q: latency and bandwidth for a large-scale network?
What characterizes a network?
 Topology (what)
 Physical interconnection structure of the network graph
 Regular vs irregular
 Routing Algorithm (which)
 Restricts the set of paths that msgs may follow
 Table-driven, or routing-algorithm based
 Switching Strategy (how)
 How data in a msg traverses a route
 Store-and-forward vs cut-through
 Flow Control Mechanism (when)
 When a msg or portions of it traverse a route
 What happens when traffic is encountered?
 Interplay of all of these determines performance
Tree: An Example
 Diameter and average distance are logarithmic
 k-ary tree, height d = log_k N
 Address specified as a d-vector of radix-k coordinates describing the path down from the root
 Fixed degree
 Route up to the common ancestor and down:
 R = B xor A
 Let i be the position of the most significant 1 in R; route up i+1 levels
 Then down in the direction given by the low i+1 bits of B
 Bandwidth and bisection BW?
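A sketch of the up/down routing rule above for a binary tree (k = 2) with d-bit leaf addresses; the function and variable names are illustrative.

```python
# Up/down routing in a binary tree (k = 2) with d-bit leaf addresses.
def tree_route(a: int, b: int, d: int):
    """Return (levels_up, down_directions) for routing from leaf a to leaf b."""
    r = a ^ b                              # R = B xor A
    if r == 0:
        return 0, []                       # same node
    i = r.bit_length() - 1                 # position of the most significant 1 in R
    up = i + 1                             # climb i+1 levels to the common ancestor
    # Going down, follow bits i..0 of B (0 = left child, 1 = right child).
    down = [(b >> bit) & 1 for bit in range(i, -1, -1)]
    return up, down

# Example: 16 leaves (d = 4), route from leaf 5 (0101) to leaf 9 (1001).
print(tree_route(5, 9, d=4))   # (4, [1, 0, 0, 1]) -> up to the root, then down
```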
Bandwidth
 Bandwidth
 Point-to-point bandwidth
 Bisection bandwidth of the interconnect fabric: the rate of data that can be sent across an imaginary line dividing the cluster into two halves, each with an equal number of nodes
 For a switch with N ports:
 If it is non-blocking, the bisection bandwidth = N * the point-to-point bandwidth
 An oversubscribed switch delivers less bisection bandwidth than a non-blocking one, but is cost-effective. It scales the bandwidth per node up to a point, after which an increase in the number of nodes decreases the available bandwidth per node
 Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth
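A small numeric sketch of the definitions above, using the slide's counting convention for a non-blocking switch (N ports times per-port bandwidth); all port counts and rates are illustrative.

```python
# Illustrative bisection-bandwidth / oversubscription arithmetic.
def bisection_bw_nonblocking(ports: int, port_gbps: float) -> float:
    # Counting convention from the slide: N ports x per-port bandwidth.
    return ports * port_gbps

def oversubscription(worst_case_aggregate_gbps: float, bisection_gbps: float) -> float:
    # Ratio of worst-case achievable aggregate host bandwidth to bisection bandwidth.
    return worst_case_aggregate_gbps / bisection_gbps

full = bisection_bw_nonblocking(ports=48, port_gbps=1.0)                         # 48 Gbps
print(oversubscription(worst_case_aggregate_gbps=full, bisection_gbps=full))     # 1.0 (non-blocking)
print(oversubscription(worst_case_aggregate_gbps=12.0, bisection_gbps=full))     # 0.25, i.e. 4:1 oversubscribed
```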
How to Maintain Constant BW per
Node?
 Limited ports in a single switch
 Multiple switches
 The link between a pair of switches can become a bottleneck
 Fast uplinks
 How to organize multiple switches?
 Irregular topology
 Regular topologies: ease of management
Scalable Interconnect: Examples
Fat Tree
(Figure: a fat tree and a 16-node butterfly network built from 2x2 switch building blocks)
Multidimensional Meshes and Tori
(Figure: 2D mesh, 2D torus, 3D cube)
 d-dimensional array
 N = k_{d-1} x ... x k_0 nodes
 Each node described by a d-vector of coordinates (i_{d-1}, ..., i_0)
 d-dimensional k-ary mesh: N = k^d
 k = N^{1/d} (the d-th root of N)
 Each node described by a d-vector of radix-k coordinates
 d-dimensional k-ary torus (or k-ary d-cube)?
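A short sketch of the node-count and coordinate arithmetic above for a k-ary d-cube, including torus neighbors with wrap-around; the function names are illustrative.

```python
# Coordinates and neighbors in a d-dimensional k-ary torus (k-ary d-cube).
def coords(node: int, k: int, d: int):
    """Radix-k digits (i_{d-1}, ..., i_0) of a node id in [0, k**d)."""
    out = []
    for _ in range(d):
        out.append(node % k)
        node //= k
    return tuple(reversed(out))

def torus_neighbors(node: int, k: int, d: int):
    """The 2d neighbors of a node, with wrap-around in every dimension."""
    c = list(coords(node, k, d))
    nbrs = []
    for dim in range(d):
        for step in (-1, +1):
            cc = c[:]
            cc[dim] = (cc[dim] + step) % k       # wrap-around in this dimension
            nbrs.append(sum(v * k**(d - 1 - i) for i, v in enumerate(cc)))
    return nbrs

k, d = 4, 3                  # 4-ary 3-cube: N = k**d = 64 nodes
print(k ** d)                # 64
print(coords(27, k, d))      # (1, 2, 3)
print(torus_neighbors(0, k, d))   # [48, 16, 12, 4, 3, 1]
```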
Packet Switching Strategies
 Store and Forward (SF)
 move entire packet one hop toward destination
 buffer till next hop permitted
 Virtual Cut-Through and Wormhole
 pipeline the hops: switch examines the header,
decides where to send the message, and then
starts forwarding it immediately
 Virtual Cut-Through: buffer on blockage
 Wormhole: leave message spread through
network on blockage
SF vs WH (VCT) Switching
(Figure: store-and-forward routing vs cut-through routing, showing a 4-flit packet (3 2 1 0) moving from source to destination over several hops; cut-through pipelines the hops, while store-and-forward waits for the whole packet at each hop)
 Unloaded latency: h(n/b + D) for store-and-forward vs n/b + hD for cut-through
 h: distance (number of hops)
 n: size of message
 b: bandwidth
 D: additional routing delay per hop
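A quick numeric comparison of the two unloaded-latency formulas above; the message size, bandwidth, hop count, and per-hop delay are made-up values.

```python
# Unloaded latency: store-and-forward vs cut-through (values are illustrative).
def store_and_forward(h, n, b, D):
    return h * (n / b + D)

def cut_through(h, n, b, D):
    return n / b + h * D

h = 5                # hops
n = 8_000 * 8        # message size: 8 KB in bits
b = 1e9              # 1 Gbps link bandwidth
D = 1e-6             # 1 us routing delay per hop

print(f"store-and-forward: {store_and_forward(h, n, b, D) * 1e6:.1f} us")   # ~325 us
print(f"cut-through:       {cut_through(h, n, b, D) * 1e6:.1f} us")         # ~69 us
```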
Conventional Datacenter Network
Problems with the Cluster Arch
 Resource fragmentation:
 If an application grows and requires more servers, it cannot use available servers in other layer-2 domains, resulting in fragmentation and underutilization of resources
 Poor server-to-server connectivity:
 Servers in different layer-2 domains have to communicate through the layer-3 portion of the network
 See the papers in the reading list on Datacenter Network Design for proposed approaches
Datacenter as a Computer
1. Overview
2. Workloads and SW infrastructure
3. HW building blocks
4. Datacenter basics
5. Energy and power efficiency
6. Dealing with Failures and Repairs
35
Datacenter
 Datacenters are buildings where servers and
comm. gear are co-located because of their
common environmental requirements and physical
security needs, and for ease of maintenance.
 Traditional DCs typically host a large number of
relatively small- or medium-sized applications,
each running on a dedicated hw infra that is decoupled and protected from other systems in the
same facility
36
Recent Data Center Projects
37
Advances in Deployment of DC
 Conquering complexity.
Building racks of servers &
complex cooling systems all
separately is not efficient.
 Package and deploy into
bigger units:

Microsoft Generation 4 Data Center
http://www.youtube.com/watch?v=PPnoKb9fTkA
38
Microsoft Generation 4 Datacenter
39
Typical Elements of a DC
 Often low-end servers are used
 Connected in multi-tier networks
40
Key Aspects
 Storage
 Network Fabric
 Storage Hierarchy
 Quantify Latency, BW, and Capacity
 Power Usage
 Handling Failures
41
Storage
 Global distributed file systems (e.g. Google's GFS)
 Harder to implement at the cluster level, but with lower hw costs and better networking fabric utilization
 GFS implements replication across different machines (for fault tolerance); Google deploys desktop-class disk drives instead of enterprise-grade disks
 Network Attached Storage (NAS), directly connected to the cluster-level switching fabric
 Simple to deploy, because it pushes the responsibility for data management and integrity to the NAS appliance
42
Amazon’s Simple Storage Service (S3)
 Write, read, and delete objects containing from 1 byte to 5 terabytes of data each. The number of objects you can store is unlimited.
 Each object is stored in a bucket and retrieved via a unique, developer-assigned key.
 A bucket can be stored in one of several Regions: the US Standard, EU (Ireland), US West (Northern California) and Asia Pacific (Singapore) Regions.
 Built to be flexible so that protocol or functional layers can easily be added. The default download protocol is HTTP. A BitTorrent™ protocol interface is provided to lower costs for high-scale distribution.
 Standard storage is designed to provide 99.999999999% durability and 99.99% availability of objects over a given year, and to sustain the concurrent loss of data in two facilities.
 Reduced-redundancy storage is designed to provide 99.99% durability and 99.99% availability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.01% of objects. Designed to sustain the loss of data in a single facility.
43
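A minimal sketch of the put/get-by-key model described above, assuming the boto3 AWS SDK, valid credentials, and an existing bucket; the bucket name and key are placeholders.

```python
# Minimal S3 object put/get sketch (assumes boto3 and valid AWS credentials;
# "my-example-bucket" is a placeholder bucket name).
import boto3

s3 = boto3.client("s3")

# Store an object under a developer-assigned key.
s3.put_object(Bucket="my-example-bucket", Key="notes/lecture1.txt",
              Body=b"datacenter as a computer")

# Retrieve it by the same (bucket, key) pair.
resp = s3.get_object(Bucket="my-example-bucket", Key="notes/lecture1.txt")
print(resp["Body"].read())
```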
Amazon’s Elastic Block Storage (EBS)
 EBS provides block level storage volumes for use
with EC2 instances. It allows you to create storage
volumes from 1 GB to 1 TB that can be mounted as
devices by Amazon EC2 instances.
 Each storage volume is automatically replicated
within the same Availability Zone. This prevents
data loss due to failure of any single hardware
component.
 The latency and throughput of Amazon EBS
volumes is designed to be significantly better than
the Amazon EC2 instance stores in nearly all
cases.
44
Network Fabric
 Tradeoff between speed, scale, and cost
 Two-level hierarchy (rack and cluster levels)
 A switch having 10 times the bisection bw costs about 100 times as much
 A rack with 40 servers, each with a 1-Gbps port, may have between 4 and 8 1-Gbps uplinks to the cluster switch: an oversubscription factor between 5 and 10 for communication across racks
 "Fat-tree" networks built of lower-cost commodity Ethernet switches
45
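The rack-level oversubscription factor quoted above follows directly from the port counts; a quick check with the slide's numbers:

```python
# Rack-level oversubscription factor = downlink bandwidth / uplink bandwidth.
servers_per_rack = 40
server_port_gbps = 1.0

for uplinks in (4, 8):
    factor = (servers_per_rack * server_port_gbps) / (uplinks * 1.0)
    print(f"{uplinks} x 1-Gbps uplinks -> oversubscription {factor:.0f}:1")
# 4 uplinks -> 10:1, 8 uplinks -> 5:1
```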
Next Generation of Network Fabric
 Monsoon
 Work by Albert Greenberg, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta
 Designed to scale to 100K+ servers
 Flat server address space instead of dozens of VLANs
 Valiant Load Balancing
 Allows a mix of apps and dynamic scaling
 Strong fault tolerance characteristics
46
Storage Hierarchy
47
Latency, BW, and Capacity
 Assume: 2000 servers (8GB mem and 1TB disk each), 40 per rack, connected by a 48-port 1Gbps switch (8 uplinks)
 Arch: bridge the gap in a cost-efficient manner
 SW: hide the complexity, exploit data locality
48
DC vs Supercomputers
 Scale
 Blue Waters = 40K 8-core "servers"
 Roadrunner = 13K Cell + 6K AMD servers
 MS Chicago Data Center = 50 containers = 100K 8-core servers
 Network Architecture
 Supercomputers: CLOS "fat tree", InfiniBand; low-latency, high-bandwidth protocols
 Data Center: IP-based fat-tree network; standard data center design, optimized for Internet access
 Data Storage
 Supers: separate data farm
• GPFS or other parallel file system
 DCs: use disks on the nodes + memcache
49
Power Usage
 Distribution of peak power usage of a
Google DC (circa 2007)
50
Workload and SW Infra
 Platform-level SW: present in all individual servers, providing basic server-level services
 Cluster-level infra: collection of distributed systems sw that manages resources and provides services at the cluster level
 MapReduce, Hadoop, etc.
 Application-level sw: implements a specific service
 Online services like web search, Gmail
 Offline computations, e.g. data analysis or generating data used for online services, such as building the index
51
Examples of Appl-SW
 Web 2.0 applications
 Provide a rich user experience including real-time global collaboration
 Enable rapid software development
 Software to scan voluminous Wikipedia edits to identify spam
 Organize global news articles by geographic location
 Data-intensive workloads based on scalable architectures, such as Google's MapReduce framework
 Financial modeling, real-time speech translation, Web search
 Next-generation rich media, such as virtual worlds, streaming video, Web conferencing, etc.
52
Characteristics
 Ample parallelism at both the data and request levels
 The key is not to find parallelism, but to manage and efficiently harness the explicit parallelism
 Data parallelism arises from the large data sets of relatively independent records to be processed
 Request-level parallelism comes from the hundreds or thousands of requests per second to be served; the requests rarely involve read-after-write sharing of data or synchronization
53
Characteristics (cont’)
 Workload churn: isolation from the users of Internet services makes it easy to deploy new sw quickly
 Google's front-end web server binaries are released on a weekly cycle
 The core of its search services has been reimplemented from scratch every 2~3 years!!
 New products and services frequently emerge, and their success with users directly affects the resulting workload mix in the DC
 Hard to develop a meaningful benchmark
• Not too much for HW architects to do
• Count on sw rewrites to take advantage of new hw capabilities?!
 Fault-free operation is challenging, but possible
54
Basic Programming Concepts
 Data replication for both perf and availability
 Data partitioning (sharding) for both perf & availability (see the sketch after this slide)
 Load balancing:
 Sharded vs replicated services
 Health checking & watchdog timers for availability
 No operation should rely on a given server responding in order to make forward progress
 Integrity checks for availability
 Application-specific compression
 Eventual consistency w.r.t. replicated data
 When no updates occur for a long period of time, eventually all updates will propagate through the system and all replicas will be consistent
55
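A toy sketch of sharding with replication as described above: objects are hash-partitioned across shards, and each shard is placed on several replica servers. The shard count, replication factor, and server names are all illustrative.

```python
import hashlib

# Toy sharded + replicated key/value placement (illustrative only).
NUM_SHARDS = 8
REPLICATION = 3
SERVERS = [f"server-{i}" for i in range(12)]

def shard_of(key: str) -> int:
    # Stable hash -> shard id (hashlib, not hash(), so placement survives restarts).
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def replicas_of(shard: int) -> list:
    # Each shard is served by REPLICATION consecutive servers.
    return [SERVERS[(shard + r) % len(SERVERS)] for r in range(REPLICATION)]

key = "user:42:profile"
s = shard_of(key)
print(f"{key} -> shard {s} on {replicas_of(s)}")
```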
Cluster-level SW
 Resource Management
 Map user tasks to hw resources, enforce priorities and quotas, provide basic task management services
 Simple allocation; or automated allocation of resources; fair sharing of resources at a finer level of granularity; power/energy considerations
 Related Work at Wayne State
• Wei, et al, Resource management for end-to-end quality of service assurance [Wei's PhD dissertation '06]
• Zhou, et al, Proportional resource allocation in web servers, streaming servers, and e-commerce servers; see cic.eng.wayne.edu for related publications (02-05)
 HW Abstraction and Basic Services
 E.g. reliable distributed storage, message passing, cluster-level synchronization (GFS, Dynamo)
56
Cluster-level SW (cont’)
 Deployment and maintenance
 Sw image distribution, configuration management, monitoring service perf and quality, alarm triggers for operators in emergency situations, etc.
 E.g. Microsoft's Autopilot, Google's Health Infrastructure
 Related Work at Wayne State
• Jia, et al, Measuring machine capacity, ICDCS'08
• Jia, et al, Autonomic VM configuration, ICAC'09
• Bu, et al, Autonomic Web apps configuration, ICDCS'09
 Programming Frameworks
 Tools like MapReduce improve programmer productivity by automatically handling data partitioning, distribution, and fault tolerance (see the MapReduce sketch after this slide)
57
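A tiny, single-process sketch of the MapReduce programming model referenced above (map, shuffle/group by key, reduce); a real framework would additionally partition the data across machines and handle faults.

```python
from collections import defaultdict

# Word count in the MapReduce style: the user supplies map_fn and reduce_fn;
# a real framework handles partitioning, shuffling, and fault tolerance.
def map_fn(document: str):
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts):
    return word, sum(counts)

def mapreduce(documents):
    # Shuffle phase: group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the datacenter as a computer", "the cluster as a computer"]
print(mapreduce(docs))   # {'the': 2, 'datacenter': 1, 'as': 2, 'a': 2, 'computer': 2, 'cluster': 1}
```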
Monitoring Infra: An Example
 Service-level dashboards
 Keep track of service quality (w.r.t. the target level)
 Info must be fresh so that corrective actions can avoid significant disruption; within seconds, not minutes
 E.g. how to measure user-perceived pageview response time? (multiple objects, end-to-end)
(Figure: a web page composed of 18 objects)
58
Client-Experienced QoS
(Figure: timeline of a pageview: the client sets up a connection, requests the base page and its embedded objects over the Internet, and the server waits for new requests after the last object; request-based QoS vs client-perceived pageview QoS. Measurement path: mirrored HTTPS traffic, TCP packets, packet capture, packet analyzer, HTTPS transaction perf analyzer)
Wei/Xu, sMonitor for Measurement of User-Perceived Latency, USENIX'2006
59
Perf Debugging Tools
 Help operators and service designers develop an understanding of the complex interactions between programs, often running on hundreds of servers, so as to determine the root cause of perf anomalies and identify bottlenecks
 Black-box monitoring: observing network traffic among system components and inferring causal relationships through statistical inference methods, assuming no knowledge of or assistance from the appl or sw
 But the inferred info may not be accurate
 Appl/middleware instrumentation systems, like Google's Dapper, require modifying applications or middleware libraries to pass tracing information across machines and across module boundaries. The annotated modules log tracing info to local disks for subsequent analysis
60
HW Building Blocks
 Cost Effectiveness of Low-end Servers
61
Performance of Parallel Apps
 Under a model of fixed local computation time,
plus the latency penalty of access to global data
structure
62
Parallel Apps Perf
 Perf advantage of a cluster of high-end nodes (128 cores each) over a cluster with the same number of cores built with low-end servers (4 cores each) [4 to 20 times difference in price]
63
How Small a Cluster Node should Be?
 Other factors need to be considered:
 Amount of parallelism
 Network requirements
 Smaller servers may lead to low utilization
 etc.
64
Datacenter as a Computer
1. Overview
2. Workloads and SW infrastructure
3. HW building blocks
4. Datacenter basics
5. Energy and power efficiency
6. Dealing with Failures and Repairs
65
Power/Energy Issues in DC
 Main components
66
UPS Systems
 A transfer switch that chooses the active power input, either utility power or generator power
 Typically, a generator takes 10-15 s to start and assume the full rated load
 Batteries to bridge the time of a utility power blackout
 AC-DC-AC double conversion
 When utility power fails, the UPS loses input AC power but retains internal DC power, and thus the AC output power
 Removes voltage spikes and harmonic distortions in the AC feed
 Sizes range from 100s of kW up to 2 MW
67
Power Distribution Units
 Takes the UPS output (typically 200~480V) and breaks it up into many 110V or 220V circuits that feed the actual servers on the floor
 Each circuit is protected by its own breaker
 A typical PDU handles 75~225 kW of load, whereas a typical circuit handles 20 or 30 A at 110-220V (a max of ~6 kW)
 The PDU provides additional redundancy (in circuits)
68
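A quick check of the per-circuit and per-PDU figures above; illustrative arithmetic only.

```python
# Per-circuit power and rough number of circuits per PDU.
amps, volts = 30, 220
circuit_kw = amps * volts / 1000.0          # 6.6 kW, close to the "max of ~6 kW" per circuit

for pdu_kw in (75, 225):
    print(f"{pdu_kw} kW PDU -> about {pdu_kw / circuit_kw:.0f} fully loaded circuits")
# 75 kW -> ~11 circuits, 225 kW -> ~34 circuits
```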
Datacenter Cooling Systems
 CRAC Units: computer room air
conditioning
 Water-based free cooling
69
Energy Efficiency
 DCPE (DC perf efficiency): ratio of the amount of computational work to the total energy consumed
 Total energy consumed:
 Power usage effectiveness (PUE): ratio of total building power to IT power (currently 1.5 to 2.0)
70
Breakdown of DC Energy Overhead
71
Energy Efficiency (cont’)
 Power usage effectiveness (PUE)
 Server PUE (SPUE): ratio of total server input power to its useful power (consumed by components like the motherboard, disks, CPUs, DRAM, I/O boards, etc.)
 Useful power excludes losses in power supplies, fans, etc.
 Currently, SPUE is between 1.6 and 1.8
 With efficient VRMs (voltage regulator modules), SPUE can be reduced to 1.2
72
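The two ratios compose: the facility power drawn per watt of useful server power is roughly PUE times SPUE. A quick illustration using the ranges quoted above:

```python
# End-to-end power overhead: facility watts per watt of useful server power.
def total_overhead(pue: float, spue: float) -> float:
    return pue * spue

print(total_overhead(pue=2.0, spue=1.8))   # 3.6 W of facility power per useful W
print(total_overhead(pue=1.5, spue=1.2))   # 1.8 W per useful W
```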
Measuring Power Efficiency
 Benchmarks
 Green500 in high-performance computing
 JouleSort:
• measures the total system energy to perform an out-of-core sort
 SPECpower_ssj2008:
• computes the performance-to-power ratio of a system running a typical business application on an enterprise Java platform
73
Power Efficiency: An Example
 SPECpower_ssj2008 on a server with a single-chip 2.83GHz quad-core Intel Xeon, 4GB mem, and one 7.2k RPM, 3.5'' SATA disk
74
Activity Profile of Google Servers
 A sample of 5000 servers over a period of
6 months

Most of the time, 10-50% utilization
75
Energy-Proportional Computing
 Humans at rest consume as little as 70W, while being able to sustain peaks of 1kW+ for tens of minutes
 For an adult male:
76
Causes of Poor Energy Proportionality
 CPU used to be the dominant factor (60%); currently slightly lower than 50% at peak, dropping to 30% at low activity levels
77
Energy Proportional Computing: How
 Hardware components, for example:
 CPU: dynamic voltage scaling
 High-speed disk drives spend 70% of their total power simply keeping the platters spinning
• need smaller rotational speeds, smaller platters
 Power distribution and cooling
78
SW role: management
 Smart use of the power management features in existing hw, low-overhead inactive or active low-power modes, as well as power-friendly scheduling of tasks to enhance the energy proportionality of hw systems
 Two challenges:
 Encapsulation in lower-level modules to hide the additional infra complexity
 Performance robustness: minimizing perf variability caused by power management tools
 Related Work at Wayne State
 Zhong, et al, "System-wide energy minimization for hard real-time tasks," TECS'08
 Zhong, et al, "Energy-aware modeling and scheduling for dynamic voltage scaling with statistical real-time guarantee," TC'07; Zhong's PhD dissertation, 2007
 Gong, Power/Performance Optimization (ongoing)
79
Datacenter as a Computer
1. Overview
2. Workloads and SW infrastructure
3. HW building blocks
4. Datacenter basics
5. Energy and power efficiency
6. Dealing with Failures and Repairs
80
Basic Concepts
 Failure: A system failure occurs when the delivered service
deviates from the specified service, where the service
specification is an agreed description of the expected
service. [Avizienis & Laprie 1986]
 Fault: the root cause of failures, defined as a defective
state in materials, design, or implementation. Faults may
remain undetected for some time. Once a fault becomes
visible, it is called an error.
 Faults are unobserved defective states
 An error is the "manifestation" of a fault
(Source: Salfner’08)
81
Challenges of High Service
Availability
 High service availability expectations translate into a high-reliability requirement for the DC
 Faults in hw, sw, and operations are inevitable
 At Google, about 45% of servers need to reboot at least once over a 6-month window; 95%+ reboot less often than once a month, but the tail is relatively long
 The average downtime is ~3 hours, implying 99.85% availability
 Determining the appropriate level of reliability is fundamentally a trade-off between the cost of failures and the cost of preventing them
82
Availability and Reliability
 Availability: a measure of the time that a system was actually usable, as a fraction of the time that it was intended to be usable (x nines)
 Yield: the ratio of requests satisfied by the service to the total number of requests
 Reliability Metrics:
 Time to failure (TTF)
 Time to repair (TTR)
 Mean time to failure (MTTF)
 Mean time to repair (MTTR)
 Mean time between failures (MTBF)
83
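A small sketch relating the metrics above: steady-state availability is commonly computed as MTTF / (MTTF + MTTR), and the "number of nines" follows from it. The MTTF and MTTR values below are illustrative, not the slide's.

```python
import math

# Steady-state availability from MTTF and MTTR, and the "number of nines".
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

def nines(avail: float) -> float:
    return -math.log10(1.0 - avail)

# Illustrative: a server that fails about twice a year (MTTF ~ 4380 h)
# and takes 3 hours to repair each time.
a = availability(mttf_hours=4380, mttr_hours=3)
print(f"availability = {a:.5f}  (~{nines(a):.1f} nines)")   # ~0.99932, ~3.2 nines
```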
Interplay of HW, SW, Operation
 Faults in hw and sw are inevitable, but the effort to mask them never halts: e.g. RAID disk drives, ECC memory
 A fault-tolerant sw infrastructure helps hide much of the failure complexity from application-level sw
 E.g. sw-based RAID across disk drives residing in multiple machines (as in GFS); MapReduce
 Flexibility in choosing the level of hw reliability that maximizes overall system cost efficiency (e.g. inexpensive PC-class hw)
 Simplification of common operational procedures (e.g. hw/sw upgrades)
 With FT sw, it is not necessary to keep a server running at all costs. This changes every aspect of the system, from design to operation, and opens opportunities for optimization
 As long as hw faults can be detected and reported to sw in a timely manner
84
Fault Characterization
 FT sw should be based on fault sources, their statistical characteristics, and the corresponding recovery behavior
 Fault Severity
 Corrupted: committed data are lost or corrupted
 Unreachable: service is down
• A Google service is no better than 99.9% available when its servers are one of the end points
• Amazon's Service Level Agreement is 99.95%
 Degraded: service is available but in degraded mode
 Masked: faults occur but are masked from users by FT hw/sw mechanisms
85
Causes of Service-Level Failures
 Field data study I on Internet services: operator-caused or misconfiguration errors are the largest contributors; hw-related faults (server or networking) account for about 10-25% [Oppenheimer'2003]
 Field data study II on early Tandem systems: hw faults (<10%), sw faults (~60%), op/maintenance (~20%) [Gray'90]
 Google's observation over a period of 6 weeks:
86
Observations
 It seems sw/hw-based FT techniques do well for independent faults
 SW-, op-, and maintenance-induced faults have a higher impact on outages, possibly because they are most likely to affect multiple systems at once, creating a correlated failure scenario that is hard to overcome
87
Proactive Failure Management?
 Failure Prediction?
 Predict future machine failures with low false-positive rates in a short time horizon
 Develop models with a good trade-off between accuracy (both in false-positive rates and time horizon) and the penalties involved in failure occurrence and recovery
 In DCs, where the penalty of a failure is low, the prediction model must be highly accurate to be economically competitive
 In systems where a crash is disruptive to operations, less accurate prediction models can still be beneficial
 Related Work at Wayne State
 Fu/Xu, Exploring spatial/temporal event correlation for failure prediction, SC'07
 Fu's PhD dissertation, 2008
88
In Summary
 Hardware
 Building blocks are commodity server-class machines, consumer- or enterprise-grade disk drives, and Ethernet-based networking fabrics
 Perf of the network fabric and storage subsystems is increasingly relevant relative to CPU and mem
 Software
 FT sw for high service availability (99.99%)
 Programmability, parallel efficiency, manageability
 Economics: cost effectiveness
 Power and energy factors
 Utilization characteristics require systems and components to be energy efficient across a wide load spectrum, particularly at low utilization levels
89
In Summary: Key Challenges
 Rapidly changing workloads
 New applications with a large variety of computational characteristics emerge at a fast pace
 Need creative solutions from both hw and sw; but few benchmarks available
 Building balanced systems from imbalanced components
 Processors have outpaced memory and magnetic storage in perf and power efficiency; more research should shift to the non-CPU subsystems
 Curbing energy usage
 Power becomes a first-order resource, like speed
 Performance under a power/energy budget
90
Key Challenges (cont’)
 Amdahl's cruel law
 Speedup = 1 / (f_seq + f_par/n) on an n-node parallel system. The sequential part f_seq limits parallel efficiency, no matter how large n is
 Future perf gains will continue to be delivered mostly by more cores or threads, not so much by faster CPUs
 Is data-level or request-level parallelism enough? Parallel computing beyond MapReduce!!
91
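A quick illustration of the speedup formula above: even a small sequential fraction caps the achievable speedup. The sequential fraction and node counts below are made-up.

```python
# Amdahl's law: speedup = 1 / (f_seq + f_par / n).
def speedup(f_seq: float, n: int) -> float:
    f_par = 1.0 - f_seq
    return 1.0 / (f_seq + f_par / n)

for n in (16, 256, 4096):
    print(n, round(speedup(f_seq=0.05, n=n), 1))
# 16 -> 9.1x, 256 -> 18.6x, 4096 -> 19.9x; the limit is 1/0.05 = 20x
```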
Reading List
 Barroso and Holzle, “The datacenter as a
computer,” Morgan & Claypool, 2009.
92