Transcript sun-s00

Research in Internet Scale Systems
Katherine Yelick
U.C. Berkeley, EECS
http://iram.cs.berkeley.edu/istore
With Jim Beck, Aaron Brown, Daniel Hettena,
David Oppenheimer, Randi Thomas, Noah Treuhaft,
David Patterson, John Kubiatowicz
http://www.cs.berkeley.edu/project/titanium
With Greg Balls, Dan Bonachea, David Gay, Ben Liblit,
Chang-Sun Lin, Peter McCorquodale,
Carleton Miyamoto, Geoff Pike, Alex Aiken,
Phil Colella, Susan Graham, Paul Hilfinger
Slide 1
Ubiquitous Computing
• Computing everywhere:
– Desktop, Laptop, Palmtop, Cars, Cellphones
• Input devices everywhere:
– Sensors, cameras, microphones
• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser
• Rise of the thin-client metaphor:
– Services provided by interior of network
– Incredibly thin clients on the leaves
» MEMS devices -- sensors+CPU+wireless net in 1 mm³
Slide 2
The problem space: big data
• Big demand for enormous amounts of data
– today: enterprise and internet applications
» online applications: e-commerce, mail, web, archives
» enterprise decision-support, data mining databases
– future: richer data and more of it
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Two key application domains:
– storage: public, private, and institutional data
– search: building static indexes, dynamic discovery
• Today’s SMP server designs can’t easily scale
– Bigger scaling problems than performance!
Slide 3
The Real Scalability Problems: AME
• Availability
– systems should continue to meet quality of service
goals despite hardware and software failures and
extreme load
• Maintainability
– systems should require only minimal ongoing human
administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of
performance, maintainability, and availability as
they are grown/upgraded/expanded
• These are problems at today’s scales, and will only
get worse as systems grow
Slide 4
Research Principles
• Redundancy everywhere, no single point of failure
• Performance secondary to AME
– Performance robustness over peak performance
– Dedicate resources to AME
» biological systems use > 50% of resources on maintenance
– Optimizations viewed as AME-enablers
» e.g., use of (slower) safe languages like Java with static
and dynamic optimizations
• Introspection
– reactive techniques to detect and adapt to failures,
workload variations, and system evolution
– proactive techniques to anticipate and avert
problems before they happen
Slide 5
Outline
• Motivation
• Hardware Techniques
– general techniques
– ISTORE projects
• Software Techniques
• Availability Benchmarks
• Conclusions
Slide 6
Hardware techniques
• Fully shared-nothing cluster organization
– truly scalable architecture, automatic redundancy
– tolerates partial hardware failure
• No Central Processor Unit:
distribute processing with storage
– Most storage servers limited by speed of CPUs;
why does this make sense?
– Amortize sheet metal, power, cooling infrastructure
for disk to add processor, memory, and network
• On-demand network partitioning/isolation
– Applications must tolerate these anyway
– Allows testing, repair of online system
Slide 7
Hardware techniques
• Heavily instrumented hardware
– sensors for temp, vibration, humidity, power
• Independent diagnostic processor on each node
– remote control of power, console, boot code
– collects, stores, processes environmental data
– connected via independent network
• Built-in fault injection capabilities
– Used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– Important for AME benchmarking
Slide 8
ISTORE-1 Hardware Platform
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– 2-node system running now; full system in next quarter
ISTORE Chassis:
– 80 nodes, 8 per tray
– 2 levels of switches: twenty 100 Mbit/s, two 1 Gbit/s
– Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
Intelligent Disk “Brick”:
– Portable PC CPU: Pentium II/266 + DRAM
– Redundant NICs (four 100 Mb/s links)
– Diagnostic Processor
– Disk, in a half-height canister
Slide 9
A glimpse into the future?
• System-on-a-chip enables computer, memory,
redundant network interfaces without significantly
increasing size of disk
• ISTORE HW in 5-7 years:
– building block: 2006 MicroDrive integrated with IRAM
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
– 10,000 nodes fit into one rack!
• O(10,000) scale is our ultimate design point
Slide 10
ISTORE-2 Hardware Proposal
• Smaller disks
– replace 3.5” disks with 2.5” or 1” drives
» 340MB available now in 1”, 1 GB next year (?)
• Smaller, more highly integrated processors
– E.g., Transmeta Crusoe includes processor and
Northbridge (interface) functionality in 1 Watt
– Xilinx FPGA for Southbridge, diagnostic proc, etc.
• Larger scale
– Roughly 1000 nodes, depending on support
» ISTORE-1 built with donated disks, memory, processors
» Paid for network, board design, enclosures (discounted)
Slide 11
Outline
• Motivation
• Hardware Techniques
• Software Techniques
– general techniques
– Titanium: a high performance Java dialect
– Sparsity: using dynamic information
– Virtual streams: performance robustness
• Availability Benchmarks
• Conclusions
Slide 12
Software techniques
• Fault tolerant data structures
– Application controls replication, checkpointing, and
consistency policy
– Self-scrubbing used to identify software errors
that have corrupted application state
• Encourage use of safe languages
– Type safety and automatic memory management
avoid a host of application errors
– Use of static and dynamic information to meet
performance needs
• Runtime adaptation to performance heterogeneity
– e.g., outer vs. inner track (1.5X), fragmentation
– Evolution of systems adds to this problem
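A minimal sketch of the self-scrubbing idea above, assuming a simple checksum-per-record design (the class and its names are hypothetical, not ISTORE code); a background scrubber periodically re-verifies each record to catch silent corruption of application state:

    import java.util.zip.CRC32;

    // Hypothetical self-scrubbing record (illustrative only): each payload
    // carries a CRC32 checksum, and a background scrubber thread re-verifies
    // it periodically to catch silent corruption of application state.
    class ScrubbedRecord {
        private final byte[] payload;
        private final long checksum;

        ScrubbedRecord(byte[] data) {
            payload = data.clone();
            checksum = crc(payload);
        }

        // Called periodically by the scrubber; false means the stored bytes
        // no longer match their checksum and the record should be repaired
        // from another replica (replication policy is controlled by the app).
        boolean scrub() {
            return crc(payload) == checksum;
        }

        private static long crc(byte[] data) {
            CRC32 c = new CRC32();
            c.update(data);
            return c.getValue();
        }
    }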
Slide 13
Software Techniques
• Reactive introspection
– Use statistical techniques to identify normal
behavior and detect deviations from it
» e.g., network activity, response time, program counter (?)
– Semi-automatic response to abnormal behavior
» initially, rely on human administrator
» eventually, system learns to set response parameters
• Proactive introspection
– Continuous online self-testing
» in deployed systems!
» goal is to shake out bugs in failure-response code on an isolated subset
» use of fault-injection and stress testing
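One way the statistical detection of abnormal behavior could work, sketched under the assumption of a simple running mean/standard-deviation model of “normal” (the slides do not specify the actual statistics or thresholds):

    // Hypothetical deviation detector: learns the mean and variance of one
    // metric (e.g., per-request response time) online with Welford's update
    // and flags samples that fall far outside the learned "normal" range.
    class DeviationDetector {
        private double mean = 0.0;
        private double m2 = 0.0;        // sum of squared deviations from the mean
        private long n = 0;
        private final double threshold; // e.g., 4.0 standard deviations

        DeviationDetector(double threshold) { this.threshold = threshold; }

        // Returns true if the sample looks abnormal; the response to it
        // (alert an administrator, adapt parameters, etc.) is separate.
        boolean observe(double sample) {
            n++;
            double delta = sample - mean;
            mean += delta / n;
            m2 += delta * (sample - mean);
            if (n < 30) return false;   // wait for a stable estimate
            double stddev = Math.sqrt(m2 / (n - 1));
            return stddev > 0 && Math.abs(sample - mean) > threshold * stddev;
        }
    }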
Slide 14
Techniques for Safe Languages
Titanium: A high performance dialect of Java
• Scalable parallelism
– A global address space, but not shared memory
– For tightly-coupled applications, e.g., mining
– Safe, region-based memory management
• Scalar performance enhancements, some specific
to application domain
– immutable classes (avoids indirection)
– multidimensional arrays with subarrays
• Application domains
– scientific computing on grids
» typically within +/-20% of C++/Fortran in this domain
– data mining in progress
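For illustration, a small Titanium-style fragment (syntax approximate, not taken from the slides): an immutable class is treated as an unboxed value type, so arrays of it avoid per-element indirection, and multidimensional arrays are first-class objects defined over explicit rectangular domains:

    // Titanium-style sketch (approximate syntax; Titanium is a dialect of Java).
    // An immutable class behaves as an unboxed value type, so an array of
    // Complex stores the values inline rather than pointers to heap objects.
    immutable class Complex {
        public double real, imag;
        public Complex(double r, double i) { real = r; imag = i; }
    }

    class GridExample {
        // Multidimensional array defined over a rectangular domain; foreach
        // visits the domain's points in an unspecified order, which leaves
        // the compiler free to reorder and tile the loop.
        static void clear(double [2d] grid) {
            foreach (p in grid.domain()) {
                grid[p] = 0.0;
            }
        }
    }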
Slide 15
Use of Static Information
• Titanium compiler performs parallel optimizations
– communication overlap (40%) and aggregation
• Uses two new analyses
– synchronization analysis: the parallel analog to
control flow analysis
» identifies code segments that may execute in
parallel
– shared variable analysis: the parallel analog to
dependence analysis
» recognize when reordering can be observed by
another processor
» necessary for any code motion or use of relaxed hardware memory models; without it, optimizations are either missed or illegal
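A tiny illustrative example (ordinary Java, not from the slides) of why shared variable analysis matters: if the two writes in producer() are reordered, another processor spinning in consumer() can observe the flag before the data, so the compiler may only reorder accesses it can prove are unobservable:

    // Illustrative fragment showing observable reordering across processors.
    class SharedFlags {
        static int data;
        static boolean ready;

        // Runs on processor 0.
        static void producer() {
            data = 42;
            ready = true;   // moving this write above the write to 'data' is observable
        }

        // Runs on processor 1: if the writes above are reordered by the
        // compiler or by relaxed hardware, this can return stale 'data'
        // even after seeing ready == true. Shared variable analysis marks
        // 'data' and 'ready' as shared and suppresses such code motion.
        static int consumer() {
            while (!ready) { /* spin */ }
            return data;
        }
    }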
Slide 16
Use of Dynamic Information
• Several data mining or web search algorithms use
sparse matrix-vector multiplication
– use for documents, images, video, etc.
– irregular, indirect memory patterns perform poorly
on memory hierarchies
• Performance improvements possible, but depend on:
– sparsity structure, e.g., keywords within documents
– machine parameters without analytical models
• Good news:
– operation repeated many times on similar matrix
– Sparsity: automatic code generator based on
runtime information
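For reference, the kernel being tuned is ordinary sparse matrix-vector multiply; a minimal compressed-sparse-row (CSR) version is sketched below. Sparsity generates specialized variants of this loop (e.g., register-blocked) once the nonzero structure is known at runtime; the plain version here is only a baseline sketch:

    // Minimal CSR (compressed sparse row) matrix-vector multiply: y = A*x.
    // Sparsity-style tuning would generate specialized variants of this loop
    // (e.g., register-blocked) based on the runtime nonzero structure.
    class SpMV {
        static void multiply(int nRows,
                             int[] rowStart,   // length nRows+1
                             int[] colIndex,   // column of each nonzero
                             double[] values,  // value of each nonzero
                             double[] x, double[] y) {
            for (int i = 0; i < nRows; i++) {
                double sum = 0.0;
                for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                    sum += values[k] * x[colIndex[k]];
                }
                y[i] = sum;
            }
        }
    }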
Slide 17
Using Dynamic Information:
Sparsity Performance
Slide 18
Use of Dynamic Information:
Virtual Streams
• System performance limited by the weakest link
• NOW Sort experience: performance heterogeneity is the norm
– disks: inner vs. outer track (50%), fragmentation
– processors: load (1.5-5x)
• Virtual Streams: dynamically off-load I/O work from slower
disks to faster ones
[Figure: minimum per-process bandwidth (MB/sec, 0-6) vs. efficiency of the single slow disk (100%, 67%, 39%, 29%), comparing Ideal, Virtual Streams, and Static allocation]
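A rough sketch of the load-balancing idea, not the actual Virtual Streams implementation: assuming each block is replicated on two disks, a reader can steer each request to whichever replica currently has the shorter queue, so a slow disk sheds work to its faster mirror:

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical replica selection for performance robustness: each block
    // lives on two disks, and requests go to the replica with the shorter
    // outstanding-request queue, so slower disks automatically shed work.
    class ReplicaSelector {
        static class Disk {
            final String name;
            final AtomicInteger pending = new AtomicInteger(0);
            Disk(String name) { this.name = name; }
        }

        // Pick the less-loaded replica for the next read of a block.
        static Disk choose(Disk primary, Disk mirror) {
            return primary.pending.get() <= mirror.pending.get() ? primary : mirror;
        }

        // Caller pattern: increment before issuing the read, decrement on completion.
        static void issueRead(Disk primary, Disk mirror) {
            Disk target = choose(primary, mirror);
            target.pending.incrementAndGet();
            // ... issue asynchronous read to 'target', then on completion:
            // target.pending.decrementAndGet();
        }
    }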
Slide 19
Applications
• ISTORE is not one super-system that demonstrates
all these techniques!
– Initially provide library to support AME goals
• Initial application targets
– cluster web/email servers
» self-scrubbing data structures, online self-testing
» statistical identification of normal behavior
– decision-support database query execution system
» River-based storage, replica management
– information retrieval for multimedia data
» self-scrubbing data structures, structuring performance-robust distributed computation
Slide 20
Outline
• Motivation
• Hardware Techniques
• Software Techniques
• Availability Benchmarks
– methodology
– example
• Conclusions
Slide 21
Availability Benchmark Methodology
• Goal: quantify variation in QoS metrics as events
occur that affect system availability
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality of service metrics
• Use fault injection to compromise system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error
returns)
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and macro-benchmarks
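As a hedged illustration of the fault-injection side (the actual injection harness is not described here), a hypothetical wrapper around a block read that can be told to return transient, uncorrectable, or timeout faults while the performance benchmark drives the workload:

    import java.io.IOException;

    // Hypothetical fault-injecting wrapper around a block-device read; the
    // injected fault type is chosen by the benchmark harness, not by the
    // system under test.
    class FaultInjectingDisk {
        enum Fault { NONE, TRANSIENT, UNCORRECTABLE, TIMEOUT }

        private volatile Fault injected = Fault.NONE;
        private boolean failOnce = true;

        void inject(Fault f) { injected = f; failOnce = true; }

        byte[] read(long block) throws IOException {
            switch (injected) {
                case TRANSIENT:            // correctable: fail once, succeed on retry
                    if (failOnce) { failOnce = false; throw new IOException("transient error"); }
                    break;
                case UNCORRECTABLE:        // sticky media error
                    throw new IOException("uncorrectable media error");
                case TIMEOUT:              // simulate a hung disk
                    try { Thread.sleep(Long.MAX_VALUE); } catch (InterruptedException e) { }
                    break;
                default:
                    break;
            }
            return readFromRealDisk(block);
        }

        private byte[] readFromRealDisk(long block) {
            return new byte[512];          // stub for the real device read
        }
    }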
Slide 22
Methodology: reporting results
• Results are most accessible graphically
– plot change in QoS metrics over time
– compare to “normal” behavior?
» 99% confidence intervals calculated from no-fault runs
[Figure: performance (roughly 160-210) vs. time in 2-minute intervals (0-60), showing the normal-behavior band (99% confidence), the injected disk failure, and the subsequent reconstruction period]
• Can graphs be distilled into useful numbers?
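A minimal sketch of how the “normal behavior” band might be derived, assuming the 99% interval is taken as mean +/- 2.576 standard deviations of the per-interval samples from the no-fault runs (the slides do not spell out the exact statistics):

    // Hypothetical computation of the "normal behavior" band from no-fault
    // runs: mean +/- 2.576 standard deviations of the per-interval samples
    // (the two-sided 99% normal quantile). Samples from a faulted run are
    // then compared against [low, high].
    class NormalBand {
        final double low, high;

        NormalBand(double[] noFaultSamples) {      // assumes several samples
            int n = noFaultSamples.length;
            double mean = 0.0;
            for (double s : noFaultSamples) mean += s;
            mean /= n;
            double var = 0.0;
            for (double s : noFaultSamples) var += (s - mean) * (s - mean);
            double stddev = Math.sqrt(var / (n - 1));
            double z99 = 2.576;
            low = mean - z99 * stddev;
            high = mean + z99 * stddev;
        }

        boolean isNormal(double sample) { return sample >= low && sample <= high; }
    }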
Slide 23
Example results: software RAID-5
• Test systems: Linux/Apache and Win2000/IIS
– SpecWeb ’99 to measure hits/second as QoS metric
– fault injection at disks based on empirical fault data
» transient, correctable, uncorrectable, & timeout faults
• 15 single-fault workloads injected per system
– only 4 distinct behaviors observed
(A) no effect
(B) system hangs
(C) RAID enters degraded mode
(D) RAID enters degraded mode &
starts reconstruction
– both systems hung (B) on simulated disk hangs
– Linux exhibited (D) on all other errors
– Windows exhibited (A) on transient errors and (C) on
uncorrectable, sticky errors
Slide 24
Example results: multiple-faults
[Figure: two panels of hits per second (roughly 140-220) vs. time in 2-minute intervals (0-110), each with its normal-behavior band (99% confidence). Windows 2000/IIS: spare and data disk faulted, reconstruction (manual), disks replaced. Linux/Apache: data disk and spare faulted, reconstruction (automatic) after each fault, disks replaced.]
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application performance,
while Linux reconstruction does not
Slide 25
Availability Benchmark Summary
• Linux and Windows take opposite approaches
to managing benign and transient faults
– Linux is paranoid and stops using a disk on any error
– Windows ignores most benign/transient faults
– Windows is more robust except when disk is truly
failing
• Linux and Windows have different reconstruction
philosophies
– Linux uses idle bandwidth for reconstruction
– Windows steals app. bandwidth for reconstruction
– Windows rebuilds fault-tolerance more quickly
• Win2k favors fault-tolerance over performance; Linux
favors performance over fault-tolerance
Slide 26
Conclusions
• Two key application domains
– Storage: loosely coupled
– Search: tightly coupled, computation important
• Key challenges to future servers are:
– Availability, Maintainability, and Evolutionary growth
• Use of self-monitoring to satisfy AME goals
– Proactive and reactive techniques
• Use of static techniques for high performance and
reliable software
– Titanium extension of Java
• Use of dynamic information for performance robustness
– Sparsity and Virtual Streams
• Availability benchmarks: a powerful tool?
Slide 27