
ISTORE-1 Update
David Patterson
University of California at Berkeley
[email protected]
UC Berkeley IRAM Group
UC Berkeley ISTORE Group
[email protected]
July 2000
Slide 1
Perspective on Post-PC Era
• Post-PC Era will be driven by 2 technologies:
1) “Gadgets”: Tiny Embedded
or Mobile Devices
– ubiquitous: in everything
– e.g., successor to PDA,
cell phone,
wearable computers
2) Infrastructure to Support such Devices
– e.g., successor to Big Fat Web Servers,
Database Servers
Slide 2
Outline
• Motivation for the ISTORE project
– AME: Availability, Maintainability, Evolutionary growth
• ISTORE’s research principles & techniques
– Introspection
– SON: Storage-Oriented Node In Cluster
– RAIN: Redundant Array of Inexpensive Network
switches
– Benchmarks for AME
• A Case for SON vs. CPUs
• Applications, near term and future
• Conclusions and future work
Slide 3
Lampson: Systems Challenges
• Systems that work
– Meeting their specs
– Always available
– Adapting to changing environment
– Evolving while they run
– Made from unreliable components
– Growing without practical limit
• Credible simulations or analysis
• Writing good specs
• Testing
• Performance
– Understanding when it doesn’t matter
“Computer Systems Research - Past and Future”
Keynote address, 17th SOSP, Dec. 1999
Butler Lampson, Microsoft
Slide 4
Hennessy: What Should the “New World” Focus Be?
• Availability
– Both appliance & service
• Maintainability
– Two functions:
» Enhancing availability by preventing failure
» Ease of SW and HW upgrades
• Scalability
– Especially of service
• Cost
– per device and per service transaction
• Performance
– Remains important, but it’s not SPECint
“Back to the Future: Time to Return to Longstanding Problems in Computer Systems?”
Keynote address, FCRC, May 1999
John Hennessy, Stanford
Slide 5
The real scalability problems: AME
• Availability
– systems should continue to meet quality of service
goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human
administration, regardless of scale or complexity:
Today, cost of maintenance = 10X cost of purchase
• Evolutionary Growth
– systems should evolve gracefully in terms of
performance, maintainability, and availability as they
are grown/upgraded/expanded
• These are problems at today’s scales, and will
only get worse as systems grow
Slide 6
Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important
than peak performance
– “performance robustness” implies that real-world
performance is comparable to best-case performance
• Performance can be sacrificed for
improvements in AME
– resources should be dedicated to AME
» compare: biological systems spend > 50% of resources
on maintenance
– can make up performance by scaling system
Slide 7
Principles for achieving AME (2)
• Introspection
– reactive techniques to detect and adapt to failures,
workload variations, and system evolution
– proactive techniques to anticipate and avert problems
before they happen
Slide 8
Hardware Techniques (1): SON
• SON: Storage Oriented Nodes (in clusters)
• Distribute processing with storage
– If AME really important, provide resources!
– Most storage servers limited by speed of CPUs!!
– Amortize sheet metal, power, cooling, network for
disk to add processor, memory, and a real network?
– Embedded processors 2/3 perf, 1/10 cost, power?
– Serial lines, switches also growing with Moore’s Law;
less need today to centralize vs. bus oriented systems
• Advantages of cluster organization
– Truly scalable architecture
– Architecture that tolerates partial failure
– Automatic hardware redundancy
Slide 9
Hardware techniques (2)
• Heavily instrumented hardware
– sensors for temp, vibration, humidity, power, intrusion
– helps detect environmental problems before they can
affect system integrity
• Independent diagnostic processor on each node
– provides remote control of power, remote console
access to the node, selection of node boot code
– collects, stores, processes environmental data for
abnormalities
– non-volatile “flight recorder” functionality
– all diagnostic processors connected via independent
diagnostic network
Slide 10
Hardware techniques (3)
• On-demand network partitioning/isolation
– Internet applications must remain available despite
failures of components, therefore can isolate a subset
for preventative maintenance
– Allows testing, repair of online system
– Managed by diagnostic processor and network
switches via diagnostic network
Slide 11
Hardware techniques (4)
• Built-in fault injection capabilities
– Power control to individual node components
– Injectable glitches into I/O and memory busses
– Managed by diagnostic processor
– Used for proactive hardware introspection
» automated detection of flaky components
» controlled testing of error-recovery mechanisms
– Important for AME benchmarking (see next slide)
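To make the hooks listed above concrete, here is a minimal Python sketch of what a fault-injection interface along these lines might look like. The DiagnosticProcessor class and its methods are hypothetical illustrations, not the actual ISTORE diagnostic-processor API.

```python
import random
import time

class DiagnosticProcessor:
    """Hypothetical per-node fault-injection interface (illustration only)."""

    def __init__(self, node_id):
        self.node_id = node_id

    def power_cycle(self, component):
        # power control to an individual node component, e.g. "disk", "nic0", "dram"
        print(f"node {self.node_id}: power-cycling {component}")

    def inject_bus_glitch(self, bus, duration_us):
        # inject a transient glitch on an I/O or memory bus
        print(f"node {self.node_id}: glitching {bus} bus for {duration_us} us")

def proactive_self_test(nodes, period_s=3600):
    """Exercise error-recovery paths on randomly chosen (isolated) nodes."""
    while True:
        victim = random.choice(nodes)
        victim.inject_bus_glitch(bus="io", duration_us=10)
        victim.power_cycle(component="disk")
        # ...check that RAID / failover machinery recovered as expected...
        time.sleep(period_s)
```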
Slide 12
ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– more CPU than NAS; fewer disks/node than cluster
ISTORE Chassis: 80 nodes, 8 per tray; 2 levels of switches (20 100 Mbit/s, 2 1 Gbit/s); environment monitoring: UPS, redundant power supplies, fans, heat and vibration sensors...
Intelligent Disk “Brick”: portable PC CPU (Pentium II/266) + DRAM, redundant NICs (4 100 Mb/s links), diagnostic processor, and disk in a half-height canister
Slide 13
ISTORE-1 Status
• 10 nodes manufactured; 60 boards fabbed, 25
to go
• Boots OS
• Diagnostic Processor Interface SW complete
• PCB backplane: not yet designed
• Finish 80-node system: Summer 2000
Slide 14
A glimpse into the future?
• System-on-a-chip enables computer, memory,
redundant network interfaces without
significantly increasing size of disk
• ISTORE HW in 5-7 years:
– building block: 2006 MicroDrive
integrated with IRAM
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
– If low power, 10,000 nodes fit
into one rack!
• O(10,000) scale is our
ultimate design point
Slide 15
Hardware Technique (5): RAIN
• Switches for ISTORE-1 are a substantial fraction of
space, power, and cost, and that is with just 80 nodes!
• Redundant Array of Inexpensive Disks (RAID):
replace large, expensive disks by many small,
inexpensive disks, saving volume, power, cost
• Redundant Array of Inexpensive Network
switches: replace large, expensive switches by
many small, inexpensive switches, saving volume,
power, cost?
– ISTORE-1: Replace 2 16-port 1-Gbit switches by fat
tree of 8 8-port switches, or 24 4-port switches?
Slide 16
“Hardware” techniques (6)
• Benchmarking
– One reason for 1000X processor performance was
ability to measure (vs. debate) which is better
» e.g., Which most important to improve: clock rate,
clocks per instruction, or instructions executed?
– Need AME benchmarks
“what gets measured gets done”
“benchmarks shape a field”
“quantification brings rigor”
Slide 17
Availability benchmark methodology
• Goal: quantify variation in QoS metrics as
events occur that affect system availability
• Leverage existing performance benchmarks
– to generate fair workloads
– to measure & trace quality of service metrics
• Use fault injection to compromise system
– hardware faults (disk, memory, network, power)
– software faults (corrupt input, driver error returns)
– maintenance events (repairs, SW/HW upgrades)
• Examine single-fault and multi-fault workloads
– the availability analogues of performance micro- and
macro-benchmarks
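A rough sketch of the measurement loop this methodology implies, assuming hypothetical run_workload_step() and inject_fault() hooks supplied by the performance benchmark and the system under test (names are illustrative only):

```python
import time

def availability_benchmark(run_workload_step, inject_fault, duration_s, fault_at_s):
    """Trace a QoS metric (e.g. hits/sec) over time while injecting a single fault."""
    trace = []                               # (elapsed seconds, QoS sample)
    start = time.time()
    fault_injected = False
    while (elapsed := time.time() - start) < duration_s:
        qos = run_workload_step()            # one step of a standard perf benchmark
        trace.append((elapsed, qos))
        if not fault_injected and elapsed >= fault_at_s:
            inject_fault("disk")             # or memory, network, power, SW fault...
            fault_injected = True
    return trace
```

Multi-fault workloads would simply schedule several inject_fault() calls over the run.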
Slide 18
Benchmark Availability?
Methodology for reporting results
• Results are most accessible graphically
– plot change in QoS metrics over time
– compare to “normal” behavior?
» 99% confidence intervals calculated from no-fault runs
[Figure: performance (QoS metric) plotted against time; a band marks normal behavior (99% confidence); an injected disk failure is followed by a reconstruction period before the metric returns to the band.]
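One way the “normal behavior” band could be computed, as a sketch: take the mean of the no-fault QoS samples and a 99% interval around it, then flag trace points that fall outside. The function names are illustrative, not from the ISTORE tools.

```python
import statistics

def normal_band(no_fault_samples, z=2.576):
    """99% interval for 'normal' QoS, estimated from no-fault benchmark runs."""
    mean = statistics.mean(no_fault_samples)
    sem = statistics.stdev(no_fault_samples) / len(no_fault_samples) ** 0.5
    return mean - z * sem, mean + z * sem    # z = 2.576 gives a 99% normal interval

def out_of_band(trace, band):
    """Points of the QoS-vs-time trace that fall outside normal behavior."""
    lo, hi = band
    return [(t, q) for t, q in trace if not lo <= q <= hi]
```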
Slide 19
Example single-fault result
[Figure: two panels, Solaris and Linux, plotting hits per second (left axis) and number of failures tolerated (right axis) against time in minutes; in each, an injected disk failure triggers software RAID reconstruction.]
• Compares Linux and Solaris reconstruction
– Linux: minimal performance impact but longer window of
vulnerability to second fault
– Solaris: large perf. impact but restores redundancy fast
Slide 20
Software techniques
• Fully-distributed, shared-nothing code
– centralization breaks down as systems scale up to O(10,000) nodes
– avoids single-point-of-failure front ends
• Redundant data storage
– required for high availability, simplifies self-testing
– replication at the level of application objects
» application can control consistency policy
» more opportunity for data placement optimization
Slide 21
Software techniques (2)
• “River” storage interfaces
– NOW Sort experience:
performance heterogeneity is the norm
» e.g., disks: outer vs. inner track (1.5X), fragmentation
» e.g., processors: load (1.5-5x)
– So demand-driven delivery of data to apps
» via distributed queues and graduated declustering
» for apps that can handle unordered data delivery
– Automatically adapts to variations in performance of
producers and consumers
– Also helps with evolutionary growth of cluster
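A toy sketch of demand-driven delivery through a shared queue, in the spirit of River: consumers pull records at their own pace, so a slow disk or a loaded CPU simply takes less work. Graduated declustering and the distributed (cross-node) queue itself are not shown; all names here are illustrative, assuming unordered delivery is acceptable to the app.

```python
import queue
import threading

def producer(q, records):
    """Push records as fast as the source (e.g. a disk) can supply them."""
    for rec in records:
        q.put(rec)
    q.put(None)                          # sentinel: end of stream

def consumer(q, results, process):
    """Pull records on demand; a fast consumer naturally takes more of the work."""
    while (rec := q.get()) is not None:
        results.append(process(rec))
    q.put(None)                          # pass the sentinel on to other consumers

if __name__ == "__main__":
    q, out = queue.Queue(maxsize=64), []     # bounded queue gives backpressure
    workers = [threading.Thread(target=consumer, args=(q, out, str)) for _ in range(4)]
    for w in workers:
        w.start()
    producer(q, range(1000))
    for w in workers:
        w.join()
```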
Slide 22
Software techniques (3)
• Reactive introspection
– Use statistical techniques to identify normal behavior
and detect deviations from it
– Policy-driven automatic adaptation to abnormal
behavior once detected
» initially, rely on human administrator to specify policy
» eventually, system learns to solve problems on its own
by experimenting on isolated subsets of the nodes
• one candidate: reinforcement learning
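One simple way such statistical identification of normal behavior could be realized, as a sketch: keep a running mean/variance per monitored metric (Welford's method), flag samples several standard deviations out, and dispatch to an administrator-specified policy. This is an illustration, not the ISTORE implementation; metric names and policies are invented.

```python
class RunningStats:
    """Incremental mean/variance for one monitored metric (Welford's method)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def is_anomalous(self, x, k=4.0):
        if self.n < 30:                          # need a baseline before judging
            return False
        std = (self.m2 / (self.n - 1)) ** 0.5
        return abs(x - self.mean) > k * std

# policy table: initially specified by a human administrator
POLICY = {
    "disk_latency_ms": lambda node: print(f"isolating node {node} for self-test"),
}

def monitor(stats, metric, node, value):
    """Flag a deviation and dispatch the administrator-specified reaction."""
    if stats.is_anomalous(value):
        POLICY.get(metric, lambda n: None)(node)
    stats.update(value)
```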
Slide 23
Software techniques (4)
• Proactive introspection
– Continuous online self-testing of HW and SW
» in deployed systems!
» goal is to shake out “Heisenbugs” before they’re
encountered in normal operation
» needs data redundancy, node isolation, fault injection
– Techniques:
» fault injection: triggering hardware and software
error handling paths to verify their integrity/existence
» stress testing: push HW/SW to their limits
» scrubbing: periodic restoration of potentially
“decaying” hardware or software state
• self-scrubbing data structures (like MVS)
• ECC scrubbing for disks and memory
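A minimal sketch of the scrubbing idea: periodically re-read stored data, recompute checksums, and repair from a replica when they disagree. The block layout and the fetch_replica() hook are invented for illustration.

```python
import hashlib

def scrub(blocks, fetch_replica):
    """blocks: iterable of (block_id, data_bytes, stored_sha1_hexdigest)."""
    repaired = 0
    for block_id, data, stored in blocks:
        if hashlib.sha1(data).hexdigest() != stored:
            good = fetch_replica(block_id)   # redundant copy on another node
            # ...write `good` back over the decayed local copy...
            repaired += 1
    return repaired
```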
Slide 24
A Case for
Storage Oriented Nodes
Advantages of SON:
• 1 v. 2 Networks
• Physical
Repair/Maintenance
• Die size vs. Clock rate,
Complexity
• Silicon die cost ~ Area^4
• Cooling cost ~ (Watts/chip)^N
• Size, Power Cost of
System v. Cost of Disks
• Cluster advantages:
dependability, scalability
Advantages of CPU:
• Apps don’t parallelize, so
1 very fast CPU much
better in practice than
N fast CPUs
• Leverage Desktop MPU
investment
• Software Maintenance:
1 Large system with
several CPUs easier to
install SW than several
small computers
Slide 25
SON: 1 vs. 2 networks
• Current computers all have LAN + Disk
interconnect (SCSI, FCAL)
– LAN is improving fastest, most investment, most
features
– SCSI, FCAL poor network features, improving slowly,
relatively expensive for switches, bandwidth
– Two sets of cables, wiring?
• Why not single network based on best
HW/SW technology?
Slide 26
SON: Physical Repair
• Heterogeneous system with server components
(CPU, backplane, memory cards, interface
cards, power supplies, ...) and disk array
components (disks, cables, controllers, array
controllers, power supplies, ... )
– Keep all components available somewhere as FRUs
• Homogeneous modules based on a hot-pluggable
interconnect (LAN) with Field Replaceable Units:
node, power supplies, network cables
– Replace node (disk, CPU, memory, NI) if any fail
– Preventative maintenance via isolation, fault insertion
Slide 27
SON: Complexity v. Perf
• Complexity increase:
– HP PA-8500: issues 4 instructions per clock cycle, out-of-order
execution over a 56-instruction window, 4-Kbit branch
predictor, 9-stage pipeline, 512 KB I-cache, 1024 KB D-cache
(> 80M transistors just in caches)
– Intel SA-110: 16 KB I$, 16 KB D$, single-issue, in-order
execution, no branch prediction, 5-stage pipeline
• Complexity costs in development time,
development power, die size, cost
– 440 MHz HP PA-8500: 477 mm², 0.25 micron/4M, $330, > 40 Watts
– 233 MHz Intel SA-110: 50 mm², 0.35 micron/3M, $18, 0.4 Watts
Slide 28
Cost of System v. Disks
• Examples show the cost of the way we build current
systems (CPU, 2 networks, many disks/CPU …)
– NCR WorldMark (10/97): $8.3M, 1312 disks, 10.2 disks/CPU
– Sun Enterprise 10k (3/98): $5.2M, 668 disks, 10.4 disks/CPU
– Sun Enterprise 10k (9/99): $6.2M, 1732 disks, 27.0 disks/CPU
– IBM Netfinity Cluster (7/00): $7.8M, 7040 disks, 55.0 disks/CPU
• And these Data Base apps are CPU bound!!!
• Also potential savings in space, power
– ISTORE-1: with big switches, it’s 2-3 racks for 80
CPUs/disks (3/8 rack unit per CPU/disk themselves)
– ISTORE-2: 4X density improvement?
Slide 29
SON: Cluster Advantages
• Truly scalable architecture
• Architecture that tolerates partial failure
• Automatic hardware redundancy
Slide 30
SON: Cooling cost
v. Peak Power
• What is relationship?
– Feet per second of air flow?
– Packaging costs?
– Fan failure?
Slide 31
The Case for CPU
Advantages of CPU:
• Apps don’t parallelize, so N very fast CPUs much better in practice than 2N fast CPUs
• Leverage Desktop MPU investment
• Software Installation: 1 large system with several CPUs easier to keep SW up-to-date than several small computers
But:
• Assume apps that parallelize: WWW services, Vision, Graphics
• Leverage investment in Embedded MPU, System on a Chip
• Improved maintenance is research target: e.g., many disks lower reliability, but RAID is better
Slide 32
Initial Applications
• ISTORE is not one super-system that
demonstrates all these techniques!
– Initially provide middleware, library to support AME
goals
• Initial application targets
– cluster web/email servers
» self-scrubbing data structures, online self-testing
» statistical identification of normal behavior
– information retrieval for multimedia data
» self-scrubbing data structures, structuring
performance-robust distributed computation
Slide 33
ISTORE Successor
does Human Quality Vision?
• Malik at UCB thinks vision research is at a critical
juncture: we have about the right algorithms, and are
awaiting faster computers to test them
• 10,000 nodes with System-On-A-Chip +
Microdrive + network
– 1 to 10 GFLOPS/node => 10,000 to 100,000 GFLOPS
– High Bandwidth Network
– 1 to 10 GB of Disk Storage per Node
=> can replicate images per node
– Need AME advances to keep 10,000 nodes useful
Slide 34
Conclusions: ISTORE
• Availability, Maintainability, and Evolutionary
growth are key challenges for server systems
– more important even than performance
• ISTORE is investigating ways to bring AME to
large-scale, storage-intensive servers
– via clusters of network-attached, computationally-enhanced storage nodes running distributed code
– via hardware and software introspection
– we are currently performing application studies to
investigate and compare techniques
• Availability benchmarks a powerful tool?
– revealed undocumented design decisions affecting
SW RAID availability on Linux and Windows 2000
• Exciting applications for large systems that
can be maintained
Slide 35
Backup Slides
Slide 36
State of the art Cluster: NCR WorldMark
• TPC-D, TD V2, 10/97
[Diagram: BYNET switched network connecting 32 nodes; each node has 4 processors and memory on a bus, PCI bridges, and strings of SCSI disks.]
– 32 nodes x 4 200 MHz CPUs, 1 GB DRAM, 41 disks
(128 cpus, 32 GB, 1312 disks, 5.4 TB)
– CPUs, DRAM, encl., boards, power: $5.3M
– Disks + cntlr: $2.2M
– Disk shelves: $0.7M
– Cables: $0.1M
– HW total: $8.3M
source: www.tpc.org
Slide 37
State of the Art SMP: Sun E10000
• TPC-D, Oracle 8, 3/98
[Diagram: 4 address buses plus a data crossbar switch connecting processors and memory; bus bridges lead to strings of SCSI disks.]
– SMP: 64 336 MHz CPUs, 64 GB DRAM, 668 disks (5.5 TB)
– Disks, shelf: $2.1M
– Boards, encl.: $1.2M
– CPUs: $0.9M
– DRAM: $0.8M
– Power: $0.1M
– Cables, I/O: $0.1M
– HW total: $5.2M
source: www.tpc.org
Slide 38
State of the Art SMP: Sun E10000
• TPC-C, Oracle 8i, 9/99
[Diagram: 4 address buses plus a data crossbar switch connecting processors and memory; bus bridges lead to strings of FC-AL disks.]
– SMP: 64 400 MHz CPUs, 64 GB DRAM, 1732 disks (15.5 TB)
– Disks, shelf: $3.6M
– Boards, encl.: $0.9M
– CPUs: $0.9M
– DRAM: $0.6M
– Power: $0.1M
– Cables, I/O: $0.1M
– HW total: $6.2M
source: www.tpc.org
Slide 39
State of the art Cluster: IBM Netfinity
• TPC-C, DB2, 7/00
[Diagram: Giganet 1 Gbit switched Ethernet connecting 32 nodes; each node has 4 processors and memory on a bus, PCI bridges, and strings of SCSI disks.]
– 32 nodes x 4 700 MHz CPUs, 0.5 GB DRAM, 220 disks
(128 cpus, 16 GB, 7040 disks, 116 TB)
– CPUs: $0.6M
– Caches: $0.5M
– DRAM: $0.6M
– Disks: $3.8M
– Disk shelves: $1.6M
– Disk cntrl.: $0.4M
– Racks: $0.1M
– Cables: $0.1M
– Switches: $0.1M
– HW total: $7.8M
source: www.tpc.org
Slide 40
Attacking Computer Vision
• Analogy: Computer Vision Recognition in 2000
like Computer Speech Recognition in 1985
– Pre 1985 community searching for good algorithms:
classic AI vs. statistics?
– By 1985 reached consensus on statistics
– Field focuses and makes progress, uses special
hardware
– Systems become fast enough that one can train systems
rather than preload information, which accelerates
progress
– By 1995 speech recognition systems starting to deploy
– By 2000 widely used, available on PCs
Slide 41
Computer Vision at Berkeley
• Jitendra Malik believes he has an approach that
is very promising
• 2 step process:
1) Segmentation: Divide image into regions of
coherent color, texture and motion
2) Recognition: combine regions and search
image database to find a match
• Algorithms for 1) work well, just slowly
(300 seconds per image using PC)
• Algorithms for 2) being tested this summer
using hundreds of PCs; will determine accuracy
Slide 42
Human Quality Computer Vision
• Suppose Algorithms Work: What would it take
to match Human Vision?
• At 30 images per second: segmentation
– Convolution and Vector-Matrix Multiply of Sparse
Matrices (10,000 x 10,000, 10% nonzero/row)
– 32-bit Floating Point
– 300 seconds on PC (assuming 333 MFLOPS)
=> 100 GFLOPs/image
– 30 Hz => 3000 GFLOPs machine to do segmentation
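The arithmetic behind the 3000 GFLOPS figure, written out as a back-of-the-envelope check (all inputs are the slide's own assumptions):

```python
pc_mflops = 333                # assumed sustained rate of a circa-2000 PC
seconds_per_image = 300        # measured segmentation time on that PC
frames_per_second = 30         # human-like frame rate

flops_per_image = pc_mflops * 1e6 * seconds_per_image      # ~1e11 = 100 GFLOPs
required_gflops = flops_per_image * frames_per_second / 1e9
print(round(required_gflops))  # ~3000 GFLOPS to segment in real time
```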
Slide 43
Human Quality Computer Vision
• At 1 / second: object recognition
– Human can remember 10,000 to 100,000 objects per
category (e.g., 10k faces, 10k Chinese characters, high
school vocabulary of 50k words, ..)
– To recognize a 3D object, need ~10 2D views
– 100 x 100 x 8 bit (or fewer bits) per view
=> 10,000 x 10 x 100 x 100 bytes, or 10^9 bytes
– Pruning using color and texture and by organizing
shapes into an index reduces shape matches to 1000
– Compare 1000 candidate merged regions with 1000
candidate object images
– If 10 hours on PC (333 MFLOPS) => 12000 GFLOPS
– Use storage to reduce computation?
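The same back-of-the-envelope arithmetic for the storage and compute figures above (again, all inputs are the slide's assumptions):

```python
objects = 10_000               # objects remembered per category
views_per_object = 10          # 2D views needed per 3D object
bytes_per_view = 100 * 100     # 100 x 100 pixels at <= 8 bits each

print(objects * views_per_object * bytes_per_view)    # 1e9 bytes of stored views

pc_hours = 10                  # time for the 1000 x 1000 comparison on a PC
pc_mflops = 333
work_gflops = pc_hours * 3600 * pc_mflops * 1e6 / 1e9
print(round(work_gflops))      # ~12,000 GFLOPs of work, i.e. ~12,000 GFLOPS to finish in 1 s
```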
Slide 44