Transcript: IRAM

Computers for the Post-PC Era
Aaron Brown, Jim Beck, Rich Martin,
David Oppenheimer, Kathy Yelick, and
David Patterson
http://iram.cs.berkeley.edu/istore
2000 Grad Visit Day
Slide 1
Berkeley Approach to Systems
• Find an important problem crossing the HW/SW
interface, with a HW/SW prototype at the end,
typically as part of graduate courses
• Assemble a band of 3-6 faculty, 12-20 grad
students, 1-3 staff to tackle it over 4 years
• Meet twice a year for 3-day retreats with
invited outsiders
– Builds team spirit
– Get advice on direction, and change course
– Offers milestones for project stages
– Grad students give 6 to 8 talks => great speakers
• Write papers, go to conferences, get PhDs, jobs
• End-of-project party, reshuffle faculty, go to step 1
Slide 2
For Example, Projects I Have Worked On
• RISC I,II
– Sequin, Ousterhout (CAD)
• SOAR (Smalltalk On A RISC)
– Ousterhout (CAD)
• SPUR (Symbolic Processing Using RISCs)
– Fateman, Hilfinger, Hodges, Katz, Ousterhout
• RAID I,II (Redundant Array of Inexp. Disks)
– Katz, Ousterhout, Stonebraker
• NOW I,II (Network of Workstations), Tertiary Disk (TD)
– Culler, Anderson
• IRAM I (Intelligent RAM)
– Yelick, Kubiatowicz, Wawrzynek
• ISTORE I,II (Intelligent Storage)
– Yelick, Kubiatowicz
Slide 3
Symbolic Processing Using RISCs: ‘85-’89
• Before Commercial RISC chips
• Built Workstation Multiprocessor and
Operating System from scratch(!)
• Sprite Operating System
• 3 chips: Processor, Cache Controller, FPU
– Coined the term “snooping cache protocol”
– 3 C’s cache-miss model: compulsory, capacity, conflict
Slide 4
Group Photo (in souvenir jackets)
[Photo labels: David Wood, Wisconsin; Jim Larus, Wisconsin, M/S; George Taylor, Founder; Dave Lee, Founder, Si. Image; John Ousterhout, Founder, Scriptics; Ben Zorn, Colorado, M/S; Mark Hill, Wisconsin; Mendel Rosenblum, Stanford, Founder, VMware; Susan Eggers, Washington; Garth Gibson, CMU, Founder; Shing Kong, Transmeta; Brent Welch, Founder, Scriptics]
• See www.cs.berkeley.edu/Projects/ARC to
learn more about Berkeley Systems
Slide 5
SPUR 10 Year Reunion, January ‘99
• Everyone from North America came!
• 19 PhDs: 9 to Academia
– 8/9 got tenure, 2 full professors (already)
– 2 Romnes fellows (3rd, 4th at Wisconsin)
– 3 NSF Presidential Young Investigator winners
– 2 ACM Dissertation Awards
– They in turn produced 30 PhDs (1/99)
• 10 to Industry
– Founders of 5 startups, (1 failed)
– 2 Department heads (AT&T Bell Labs, Microsoft)
• Very successful group; the SPUR Project “gave
them a taste of success, lifelong friends”
Slide 6
Network of Workstations (NOW) ‘94 -’98
• Leveraging commodity workstations and OSes to
harness the power of clustered machines connected
via high-speed switched networks
• Construction of HW/SW prototypes: NOW-1 with 32
SuperSPARCs, and NOW-2 with 100 UltraSPARC 1s
• NOW-2 cluster held the world record for the
fastest disk-to-disk sort for 2 years, 1997-1999
• NOW-2 cluster 1st to crack the 40-bit key as part of
a key-cracking challenge offered by RSA, 1997
• NOW-2 made the list of Top 200 supercomputers, 1997
• NOW a foundation of the Virtual Interface (VI)
Architecture, a standard that allows protected, direct
user-level access to the network, backed by Compaq, Intel, & M/S
• NOW technology led directly to one Internet startup
company (Inktomi), + many other Internet companies
use cluster technology
Slide 7
Network of Workstations (NOW) ‘94 -’98
• 12 PhDs. Note that 3/4 of them went into academia, and that
1/3 are female:
– Andrea Arpaci-Dusseau, Asst. Professor, Wisconsin, Madison
– Remzi Arpaci-Dusseau, Asst. Professor, Wisconsin, Madison
– Mike Dahlin, Asst. Professor, University of Texas, Austin
– Jeanna Neefe Matthews, Asst. Professor, Clarkson Univ.
– Douglas Ghormley, Researcher, Los Alamos National Labs
– Kim Keeton, Researcher, Hewlett-Packard Labs
– Steve Lumetta, Asst. Professor, Illinois
– Alan Mainwaring, Researcher, Sun Microsystems Labs
– Rich Martin, Asst. Professor, Rutgers University
– Nisha Talagala, Researcher, Network Storage, Sun Micro.
– Amin Vahdat, Asst. Professor, Duke University
– Randy Wang, Asst. Professor, Princeton University
Slide 8
Research in Berkeley Courses
• RISC, SPUR, RAID, NOW, IRAM, ISTORE all
started in advanced graduate courses
• Make transition from undergraduate student to
researcher in first-year graduate courses
– First year architecture, operating systems courses:
select topic, do research, write paper, give talk
– Prof meets each team 1-on-1 ~3 times, + TA help
– Some papers get submitted and published
• Requires class size < 40 (e.g., Berkeley)
– If 1st year course size ~100 students
=> cannot do research in grad courses 1st year or so
– If school offers combined BS/MS (e.g., MIT) or
professional MS via TV broadcast (e.g., Stanford),
then effective class size ~150-250
Slide 9
Outline
• Background: Berkeley Approach to
Systems
• PostPC Motivation
• PostPC Microprocessor: IRAM
• PostPC Infrastructure Motivation
• PostPC Infrastructure: ISTORE
• Hardware Architecture
• Software Architecture
• Conclusions and Feedback
Slide 10
Perspective on Post-PC Era
• PostPC Era will be driven by 2 technologies:
1) “Gadgets”: Tiny Embedded
or Mobile Devices
– ubiquitous: in everything
– e.g., successor to PDA,
cell phone,
wearable computers
2) Infrastructure to Support such Devices
– e.g., successor to Big Fat Web Servers,
Database Servers
Slide 11
Intelligent RAM: IRAM
Microprocessor & DRAM on a single chip:
– 10X capacity vs. SRAM
– on-chip memory latency 5-10X, bandwidth 50-100X
– improve energy efficiency 2X-4X (no off-chip bus)
– serial I/O 5-10X v. buses
– smaller board area/volume
[Diagram: conventional system (Proc, $, L2$, bus, I/O, DRAM as separate logic-fab and DRAM-fab chips) vs. IRAM (Proc, $, I/O, and DRAM on one chip)]
IRAM advantages extend to:
– a single chip system
– a building block for larger systems
Slide 12
Revive Vector Architecture
• Cost: $1M each? => Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? => IRAM
• Code density? => Much smaller than VLIW
• Compilers? => For sale, mature (>20 years);
we retarget Cray compilers
• Performance? => Easy to scale speed with technology
• Power/Energy? => Parallel to save energy, keep performance
• Limited to scientific applications? => Multimedia apps
vectorize too: N*64b, 2N*32b, 4N*16b (see the sketch below)
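
To make the multimedia-vectorization point concrete, here is a minimal sketch in plain portable C (not VIRAM intrinsics, and not code from the project): a saturating add over 16-bit samples, the kind of loop a vectorizing compiler could map onto vector registers holding 4N 16-bit elements rather than N 64-bit ones.

/* Illustrative only: a 16-bit multimedia kernel of the sort the slide
 * says vectorizes well.  Plain C, not VIRAM code; a vectorizing
 * compiler could execute it on 4N x 16-bit vector elements. */
#include <stdint.h>

void saturating_add_i16(int16_t *dst, const int16_t *a,
                        const int16_t *b, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];  /* widen to avoid overflow */
        if (sum >  32767) sum =  32767;               /* clamp to int16_t range */
        if (sum < -32768) sum = -32768;
        dst[i] = (int16_t)sum;
    }
}
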
Slide 13
VIRAM-1: System on a Chip
Prototype scheduled for end of Summer 2000:
• 0.18 um EDL process
• 16 MB DRAM, 8 banks
• MIPS scalar core and caches @ 200 MHz
• 4 64-bit vector unit pipelines @ 200 MHz
• 4 100 MB/s parallel I/O lines
• 17x17 mm, 2 Watts
• 25.6 GB/s memory bandwidth (6.4 GB/s per direction per Xbar)
• 1.6 Gflops (64-bit), 6.4 GOPS (16-bit)
• 140 M transistors (> Intel?)
[Die plan: two DRAM halves (64 Mbits / 8 MBytes each), 4 vector pipes/lanes, crossbar (Xbar), MIPS CPU + caches ($), I/O]
(A back-of-the-envelope check of the peak rates follows below.)
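
As a rough sanity check (an editor's illustration, not from the slides): the quoted peaks follow from the lane count and clock if each 64-bit lane completes one multiply-add per cycle, counted as two operations, and 16-bit operations pack four subwords per 64-bit lane. A minimal C sketch of that arithmetic, under those assumptions:

/* Back-of-the-envelope check of the quoted peaks.  Assumes one
 * multiply-add (= 2 ops) per lane per cycle and 4 x 16-bit subwords
 * per 64-bit lane; these conventions are assumptions, not VIRAM-1
 * documentation. */
#include <stdio.h>

int main(void)
{
    const double lanes = 4.0, clock_hz = 200e6, ops_per_lane_cycle = 2.0;

    double gflops_64bit = lanes * clock_hz * ops_per_lane_cycle / 1e9;
    double gops_16bit   = gflops_64bit * (64.0 / 16.0);

    printf("64-bit peak: %.1f Gflops\n", gflops_64bit);  /* 1.6 */
    printf("16-bit peak: %.1f GOPS\n",   gops_16bit);    /* 6.4 */
    return 0;
}
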
Slide 14
Outline
• PostPC Infrastructure Motivation and
Background: Berkeley’s Past
• PostPC Motivation
• PostPC Device Microprocessor: IRAM
• PostPC Infrastructure Motivation
• ISTORE Goals
• Hardware Architecture
• Software Architecture
• Conclusions and Feedback
Slide 15
Background: Tertiary Disk (part of NOW)
• Tertiary Disk (1997)
– cluster of 20 PCs hosting 364 3.5” IBM disks
(8.4 GB each) in 7 19” x 33” x 84” racks, or 3 TB.
The 200 MHz, 96 MB P6 PCs run FreeBSD, and a
switched 100 Mb/s Ethernet connects the hosts.
Also 4 UPS units.
– Hosts the world’s largest art database: 80,000 images, in
cooperation with the San Francisco Fine Arts Museum:
try www.thinker.org
Slide 16
Tertiary Disk HW Failure Experience
Reliability of hardware components (20 months):

7  IBM SCSI disk failures          (out of 364, or 2%)
6  IDE (internal) disk failures    (out of 20, or 30%)
1  SCSI controller failure         (out of 44, or 2%)
1  SCSI cable failure              (out of 39, or 3%)
1  Ethernet card failure           (out of 20, or 5%)
1  Ethernet switch failure         (out of 2, or 50%)
3  enclosure power supply failures (out of 92, or 3%)
1  short power outage              (covered by UPS)

Did not match expectations:
SCSI disks more reliable than SCSI cables!
Difference between simulation and prototypes.
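
For scale, the 20-month counts above can be turned into rough annualized failure rates; the sketch below assumes failures are spread evenly over the period and is an editor's illustration, not a calculation from the original slides.

/* Rough annualized failure rates implied by the 20-month counts,
 * assuming failures were spread evenly over the period. */
#include <stdio.h>

static double annual_rate(int failures, int population, double months)
{
    return 100.0 * failures / (population * (months / 12.0));
}

int main(void)
{
    const double months = 20.0;
    printf("SCSI disks: %.1f%% per year\n", annual_rate(7, 364, months));  /* ~1.2% */
    printf("IDE disks:  %.1f%% per year\n", annual_rate(6, 20, months));   /* ~18%  */
    return 0;
}
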
Slide 17
SCSI Time Outs
+ Hardware Failures (m11)
[Charts: SCSI time outs and hardware failures for the disks on
SCSI Bus 0 of m11, plotted daily from 8/15/98 through 8/31/98]
Slide 18
Can we predict a disk failure?
• Yes, look for Hardware Error messages
– These messages lasted for 8 days between:
»8-17-98 and 8-25-98
– On disk 9 there were:
»1763 Hardware Error Messages, and
»297 SCSI Timed Out Messages
• On 8-28-98, Disk 9 on SCSI Bus 0 of
m11 was “fired”, i.e., it appeared to be
about to fail, so it was swapped
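
A minimal sketch of this kind of log scan, in C; the log file path, message strings, and warning threshold are assumptions for illustration, not the Tertiary Disk project’s actual tools.

/* Sketch only: count SCSI "Hardware Error" and "Timed Out" messages
 * from a syslog-style file and warn when the count looks like the
 * pattern seen before disk 9 failed.  Format and threshold assumed. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    FILE *log = fopen(argc > 1 ? argv[1] : "/var/log/messages", "r");
    if (!log) { perror("fopen"); return 1; }

    char line[1024];
    long hw_errors = 0, timeouts = 0;
    while (fgets(line, sizeof line, log)) {
        if (strstr(line, "Hardware Error")) hw_errors++;
        if (strstr(line, "SCSI") && strstr(line, "Timed Out")) timeouts++;
    }
    fclose(log);

    printf("Hardware Error messages: %ld\n", hw_errors);
    printf("SCSI Timed Out messages: %ld\n", timeouts);
    /* Heuristic threshold (assumption): a sustained stream of hardware
     * error messages is treated as a prediction of imminent failure. */
    if (hw_errors > 100)
        printf("WARNING: disk may be about to fail; consider swapping it.\n");
    return 0;
}
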
Slide 19
Lessons from Tertiary Disk Project
• Maintenance is hard on current systems
– Hard to know what is going on, who is to blame
• Everything can break
– It’s not what you expect in advance
– Follow rule of no single point of failure
• Nothing fails fast
– Eventually behaves badly enough that the operator
“fires” the poor performer, but it doesn’t “quit”
• Most failures may be predicted
Slide 20
Outline
• Background: Berkeley Approach to
Systems
• PostPC Motivation
• PostPC Microprocessor: IRAM
• PostPC Infrastructure Motivation
• PostPC Infrastructure: ISTORE
• Hardware Architecture
• Software Architecture
• Conclusions and Feedback
Slide 21
The problem space: big data
• Big demand for enormous amounts of data
– today: high-end enterprise and Internet
applications
» enterprise decision-support, data mining databases
» online applications: e-commerce, mail, web, archives
– future: infrastructure services, richer data
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Today’s SMP server designs can’t easily scale
• Bigger scaling problems than performance!
Slide 22
The real scalability problems:
AME
• Availability
– systems should continue to meet quality of service
goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human
administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of
performance, maintainability, and availability as
they are grown/upgraded/expanded
• These are problems at today’s scales, and will
only get worse as systems grow
Slide 23
Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important
than peak performance
– “performance robustness” implies that real-world
performance is comparable to best-case
performance
• Performance can be sacrificed for
improvements in AME
– resources should be dedicated to AME
» compare: biological systems spend > 50% of resources on
maintenance
– can make up performance by scaling system
Slide 24
Principles for achieving AME (2)
• Introspection
– reactive techniques to detect and adapt to
failures, workload variations, and system evolution
– proactive (preventative) techniques to anticipate
and avert problems before they happen
Slide 25
Hardware techniques (2)
• No Central Processor Unit:
distribute processing with storage
– Serial lines, switches also growing with Moore’s
Law; less need today to centralize vs. bus-oriented systems
– Most storage servers limited by speed of CPUs;
why does this make sense?
– Why not amortize sheet metal, power, cooling
infrastructure for disk to add processor, memory,
and network?
– If AME is important, must provide resources to be
used to help AME: local processors responsible for
health and maintenance of their storage
Slide 26
ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached
storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– more CPU than NAS; fewer disks/node than cluster
ISTORE Chassis:
• 80 nodes, 8 per tray
• 2 levels of switches: 20 100 Mbit/s, 2 1 Gbit/s
• Environment monitoring: UPS, redundant PS,
fans, heat and vibration sensors...
Intelligent Disk “Brick”:
• Portable PC CPU: Pentium II/266 + DRAM
• Redundant NICs (4 100 Mb/s links)
• Diagnostic Processor
• Disk
• Half-height canister
Slide 27
A glimpse into the future?
• System-on-a-chip enables computer, memory,
redundant network interfaces without
significantly increasing size of disk
• ISTORE HW in 5-7 years:
– building block: 2006
MicroDrive integrated with
IRAM
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
– 10,000 nodes fit into one
rack!
• O(10,000) scale is our
ultimate design point
Slide 28
Development techniques
• Benchmarking
– One reason for 1000X processor performance was
ability to measure (vs. debate) which is better
» e.g., which is most important to improve: clock rate, clocks
per instruction, or instructions executed?
– Need AME benchmarks (see the sketch below):
» “what gets measured gets done”
» “benchmarks shape a field”
» “quantification brings rigor”
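
A minimal sketch of the analysis an availability benchmark implies: establish a normal-behavior band from a fault-free run (99% confidence, assuming roughly normal variation), then flag post-fault intervals that fall outside it. The throughput numbers and constants below are illustrative, not measurements from the ISTORE experiments.

/* Sketch: compute the "normal behavior (99% conf)" band from a
 * fault-free baseline of per-interval throughput, then report
 * intervals after fault injection that fall outside the band. */
#include <stdio.h>
#include <math.h>

static void band(const double *x, int n, double *lo, double *hi)
{
    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += x[i];
    mean /= n;
    for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
    var /= (n - 1);
    double half = 2.576 * sqrt(var);   /* ~99% band, assuming normality */
    *lo = mean - half;
    *hi = mean + half;
}

int main(void)
{
    double baseline[] = {205, 208, 203, 207, 206, 204, 209, 205};  /* fault-free run */
    double faulted[]  = {206, 204, 172, 168, 181, 195, 203, 207};  /* after fault injection */
    double lo, hi;
    band(baseline, 8, &lo, &hi);
    printf("normal behavior: %.1f - %.1f hits/sec\n", lo, hi);
    for (int i = 0; i < 8; i++)
        if (faulted[i] < lo || faulted[i] > hi)
            printf("interval %d: %.0f hits/sec is outside the band\n", i, faulted[i]);
    return 0;
}
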
Slide 29
Example results: multiple-faults
[Graphs: hits per second vs. time (2-minute intervals, 0-110) for
SW RAID under multiple faults, on Windows 2000/IIS and on
Linux/Apache. Annotations mark the normal-behavior band (99%
conf.), “spare faulted”, “data disk faulted”, reconstruction
(manual on Windows, automatic on Linux), and “disks replaced”.]
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application
performance, while Linux reconstruction does not
Slide 30
Software techniques (1)
• Proactive introspection
– Continuous online self-testing of HW and SW
» in deployed systems!
» goal is to shake out “Heisenbugs” before they’re
encountered in normal operation
» needs data redundancy, node isolation, fault injection
– Techniques:
» fault injection: triggering hardware and software error
handling paths to verify their integrity/existence
» stress testing: push HW/SW to their limits
» scrubbing: periodic restoration of potentially “decaying”
hardware or software state (a sketch follows below)
• self-scrubbing data structures (like MVS)
• ECC scrubbing for disks and memory
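
A minimal sketch of software scrubbing, assuming a simple block-plus-checksum layout and a redundant replica to restore from; the layout, checksum, and repair path are illustrative, not the ISTORE implementation.

/* Sketch of "scrubbing": periodically re-read stored blocks, verify a
 * checksum, and restore any block that has silently decayed from a
 * redundant copy.  Block layout and checksum are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define BLOCK_SIZE 4096

struct block {
    uint8_t  data[BLOCK_SIZE];
    uint32_t checksum;            /* stored when the block was written */
};

static uint32_t checksum(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (sum << 1 | sum >> 31) ^ p[i];   /* simple rotating XOR */
    return sum;
}

/* Returns the number of blocks repaired from the replica. */
int scrub(struct block *primary, const struct block *replica, int nblocks)
{
    int repaired = 0;
    for (int i = 0; i < nblocks; i++) {
        if (checksum(primary[i].data, BLOCK_SIZE) != primary[i].checksum) {
            primary[i] = replica[i];   /* restore decayed state */
            repaired++;
        }
    }
    return repaired;
}
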
Slide 31
Conclusions (1): ISTORE
• Availability, Maintainability, and Evolutionary
growth are key challenges for server systems
– more important even than performance
• ISTORE is investigating ways to bring AME to
large-scale, storage-intensive servers
– via clusters of network-attached, computationally
enhanced storage nodes running distributed code
– via hardware and software introspection
– we are currently performing application studies to
investigate and compare techniques
• Availability benchmarks a powerful tool?
– revealed undocumented design decisions affecting
SW RAID availability on Linux and Windows 2000
Slide 32
Conclusions (2)
• IRAM attractive for two Post-PC applications
because of low power, small size, high memory
bandwidth
– Gadgets: Embedded/Mobile devices
– Infrastructure: Intelligent Storage and Networks
• PostPC infrastructure requires
– New Goals: Availability, Maintainability, Evolution
– New Principles: Introspection, Performance
Robustness
– New Techniques: Isolation/fault insertion, Software
scrubbing
– New Benchmarks: measure, compare AME metrics
Slide 33
Berkeley Future work
• IRAM: fab and test chip
• ISTORE
– implement AME-enhancing techniques in a variety
of Internet, enterprise, and info retrieval
applications
– select the best techniques and integrate into a
generic runtime system with “AME API”
– add maintainability benchmarks
» can we quantify administrative work needed to maintain a
certain level of availability?
– Perhaps look at data security via encryption?
– Even consider denial of service?
Slide 34