IRAM
Computers for the Post-PC Era
Aaron Brown, Jim Beck, Rich Martin,
David Oppenheimer, Kathy Yelick, and
David Patterson
http://iram.cs.berkeley.edu/istore
2000 Grad Visit Day
Slide 1
Berkeley Approach to Systems
• Find an important problem crossing HW/SW
Interface, with HW/SW prototype at end,
typically as part of graduate courses
• Assemble a band of 3-6 faculty, 12-20 grad
students, 1-3 staff to tackle it over 4 years
• Meet twice a year for 3-day retreats with
invited outsiders
– Builds team spirit
– Get advice on direction, and change course
– Offers milestones for project stages
– Grad students give 6 to 8 talks => Great Speakers
• Write papers, go to conferences, get PhDs, jobs
• End of project party, reshuffle faculty, go to step 1
Slide 2
For Example, Projects I Have Worked On
• RISC I,II
– Sequin, Ousterhout (CAD)
• SOAR (Smalltalk On A RISC)
– Ousterhout (CAD)
• SPUR (Symbolic Processing Using RISCs)
– Fateman, Hilfinger, Hodges, Katz, Ousterhout
• RAID I,II (Redundant Array of Inexp. Disks)
– Katz, Ousterhout, Stonebraker
• NOW I,II (Network of Workstations), Tertiary Disk (TD)
– Culler, Anderson
• IRAM I (Intelligent RAM)
– Yelick, Kubiatowicz, Wawrzynek
• ISTORE I,II (Intelligent Storage)
– Yelick, Kubiatowicz
Slide 3
Symbolic Processing Using RISCs: ‘85-’89
• Before Commercial RISC chips
• Built Workstation Multiprocessor and
Operating System from scratch(!)
• Sprite Operating System
• 3 chips: Processor, Cache Controller, FPU
– Coined term “snooping cache protocol”
– 3C’s cache miss: compulsory, capacity, conflict
Slide 4
Group Photo (in souvenir jackets)
[Photo labels:]
David Wood, Wisconsin
Jim Larus, Wisconsin, M/S
George Taylor, Founder, ?
Dave Lee, Founder, Si. Image
John Ousterhout, Founder, Scriptics
Ben Zorn, Colorado, M/S
Mark Hill, Wisc.
Mendel Rosenblum, Stanford, Founder VMware
Susan Eggers, Washington
Garth Gibson, CMU, Founder ?
Shing Kong, Transmeta
Brent Welch, Founder, Scriptics
• See www.cs.berkeley.edu/Projects/ARC to
learn more about Berkeley Systems
Slide 5
SPUR 10 Year Reunion, January ‘99
• Everyone from North America came!
• 19 PhDs: 9 to Academia
– 8/9 got tenure, 2 full professors (already)
– 2 Romnes fellows (3rd, 4th at Wisconsin)
– 3 NSF Presidential Young Investigator Winners
– 2 ACM Dissertation Awards
– They in turn produced 30 PhDs (1/99)
• 10 to Industry
– Founders of 5 startups, (1 failed)
– 2 Department heads (AT&T Bell Labs, Microsoft)
• Very successful group; SPUR Project “gave
them a taste of success, lifelong friends”
Slide 6
Network of Workstations (NOW) ‘94 -’98
Leveraging commodity workstations and OSes to
harness the power of clustered machines connected
via high-speed switched networks
Construction of HW/SW prototypes: NOW-1 with 32
SuperSPARCs, and NOW-2 with 100 UltraSPARC 1s
NOW-2 cluster held the world record for the
fastest Disk-to-Disk Sort for 2 years, 1997-1999
NOW-2 cluster 1st to crack the 40-bit key as part of
a key-cracking challenge offered by RSA, 1997
NOW-2 made list of Top 200 supercomputers 1997
NOW a foundation of the Virtual Interface (VI)
Architecture, a standard by Compaq, Intel, & M/S that allows protected, direct user-level access to the network
NOW technology led directly to one Internet startup
company (Inktomi), + many other Internet companies use cluster technology
Slide 7
Network of Workstations (NOW) ‘94 -’98
12 PhDs. Note that 3/4 of them went into academia, and that
1/3 are female:
Andrea Arpaci-Dusseau, Asst. Professor, Wisconsin, Madison
Remzi Arpaci-Dusseau, Asst. Professor, Wisconsin, Madison
Mike Dahlin, Asst. Professor, University of Texas, Austin
Jeanna Neefe Matthews, Asst. Professor, Clarkson Univ.
Douglas Ghormley, Researcher, Los Alamos National Labs
Kim Keeton, Researcher, Hewlett Packard Labs
Steve Lumetta, Assistant Professor, Illinois
Alan Mainwaring, Researcher, Sun Microsystems Labs
Rich Martin, Assistant Professor, Rutgers University
Nisha Talagala, Researcher, Network Storage, Sun Micro.
Amin Vahdat, Assistant Professor, Duke University
Randy Wang, Assistant Professor, Princeton University
Slide 8
Research in Berkeley Courses
• RISC, SPUR, RAID, NOW, IRAM, ISTORE all
started in advanced graduate courses
• Make transition from undergraduate student to
researcher in first-year graduate courses
– First year architecture, operating systems courses:
select topic, do research, write paper, give talk
– Prof meets each team 1-on-1 ~3 times, + TA help
– Some papers get submitted and published
• Requires class size < 40 (e.g., Berkeley)
– If 1st year course size ~100 students
=> cannot do research in grad courses 1st year or so
– If school offers combined BS/MS (e.g., MIT) or
professional MS via TV broadcast (e.g., Stanford),
then effective class size ~150-250
Slide 9
Outline
• Background: Berkeley Approach to
Systems
• PostPC Motivation
• PostPC Microprocessor: IRAM
• PostPC Infrastructure Motivation
• PostPC Infrastructure: ISTORE
• Hardware Architecture
• Software Architecture
• Conclusions and Feedback
Slide 10
Perspective on Post-PC Era
• PostPC Era will be driven by 2 technologies:
1) “Gadgets”: Tiny Embedded
or Mobile Devices
– ubiquitous: in everything
– e.g., successor to PDA,
cell phone,
wearable computers
2) Infrastructure to Support such Devices
– e.g., successor to Big Fat Web Servers,
Database Servers
Slide 11
Intelligent RAM: IRAM
Microprocessor & DRAM on a
single chip:
– 10X capacity vs. SRAM
– on-chip memory latency
5-10X,
bandwidth 50-100X
– improve energy efficiency
2X-4X (no off-chip bus)
– serial I/O 5-10X v. buses
– smaller board area/volume
[Diagram: conventional organization with processor, caches ($, L2$), buses, DRAM, and I/O on separate chips, versus IRAM with processor, logic, and DRAM integrated on one chip]
IRAM advantages extend to:
– a single chip system
– a building block for larger systems
Slide 12
Revive Vector Architecture
• Cost: $1M each? => Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? => IRAM
• Code density? => Much smaller than VLIW
• Compilers? => For sale, mature (>20 years); we retarget Cray compilers
• Performance? => Easy to scale speed with technology
• Power/Energy? => Parallel to save energy, keep performance
• Limited to scientific applications? => Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b (see the sketch below)
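The last point is easiest to see in code. The following is a minimal, hypothetical C sketch (not from the project) of a typical multimedia kernel, a saturating 16-bit add over two pixel buffers; because the loop has no carried dependences, a vectorizing compiler can map it onto packed 16-bit vector operations (the 4N*16b case) instead of executing one element per scalar instruction.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical multimedia kernel: saturating 16-bit add of two buffers.
 * No loop-carried dependences, so a vectorizing compiler can execute it
 * on packed 16-bit vector elements rather than one element at a time. */
void add_sat16(int16_t *dst, const int16_t *a, const int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        if (sum > INT16_MAX) sum = INT16_MAX;
        if (sum < INT16_MIN) sum = INT16_MIN;
        dst[i] = (int16_t)sum;
    }
}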
Slide 13
VIRAM-1: System on a Chip
Prototype scheduled for end of Summer 2000
• 0.18 um EDL process
• 16 MB DRAM, 8 banks
• MIPS scalar core and caches @ 200 MHz
• 4 64-bit vector unit pipelines @ 200 MHz
• 4 100 MB parallel I/O lines
• 17x17 mm, 2 Watts
• 25.6 GB/s memory (6.4 GB/s per direction per Xbar)
• 1.6 Gflops (64-bit), 6.4 GOPs (16-bit)
• 140 M transistors (> Intel?)
[Floorplan: two Memory blocks (64 Mbits / 8 MBytes each), 4 Vector Pipes/Lanes, crossbar (Xbar), CPU + $, I/O]
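A quick sanity check on the peak numbers, assuming each 64-bit lane can complete one multiply-add (counted as 2 operations) per cycle and that 16-bit data packs four subwords per lane:

\[
4\ \text{lanes} \times 200\ \text{MHz} \times 2\ \tfrac{\text{ops}}{\text{multiply-add}} = 1.6\ \text{Gflops (64-bit)}
\]
\[
4\ \text{lanes} \times 4\ \tfrac{16\text{-bit subwords}}{\text{lane}} \times 200\ \text{MHz} \times 2 = 6.4\ \text{GOPs (16-bit)}
\]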
Slide 14
Outline
• PostPC Infrastructure Motivation and
Background: Berkeley’s Past
• PostPC Motivation
• PostPC Device Microprocessor: IRAM
• PostPC Infrastructure Motivation
• ISTORE Goals
• Hardware Architecture
• Software Architecture
• Conclusions and Feedback
Slide 15
Background: Tertiary Disk (part of NOW)
• Tertiary Disk (1997)
– cluster of 20 PCs hosting 364 3.5” IBM disks (8.4 GB) in 7 19”x 33”x 84” racks, or 3 TB. The 200 MHz, 96 MB P6 PCs run FreeBSD and a switched 100 Mb/s Ethernet connects the hosts. Also 4 UPS units.
– Hosts world’s largest art database: 80,000 images in cooperation with San Francisco Fine Arts Museum: Try www.thinker.org
Slide 16
Tertiary Disk HW Failure Experience
Reliability of hardware components (20 months)
7 IBM SCSI disk failures (out of 364, or 2%)
6 IDE (internal) disk failures (out of 20, or 30%)
1 SCSI controller failure (out of 44, or 2%)
1 SCSI cable failure (out of 39, or 3%)
1 Ethernet card failure (out of 20, or 5%)
1 Ethernet switch failure (out of 2, or 50%)
3 enclosure power supply failures (out of 92, or 3%)
1 short power outage (covered by UPS)
Did not match expectations:
SCSI disks more reliable than SCSI cables!
Difference between simulation and prototypes
Slide 17
SCSI Time Outs
+ Hardware Failures (m11)
[Charts: “SCSI Bus 0 Disk Hardware Failures” and “SCSI Time Outs, SCSI Bus 0 Disks”, counts per disk (0-10) vs. time, 8/15/98 0:00 through 8/31/98 0:00]
Slide 18
Can we predict a disk failure?
• Yes, look for Hardware Error messages
– These messages lasted for 8 days between:
»8-17-98 and 8-25-98
– On disk 9 there were:
»1763 Hardware Error Messages, and
»297 SCSI Timed Out Messages
• On 8-28-98: Disk 9 on SCSI Bus 0 of
m11 was “fired”, i.e., it appeared to be
about to fail, so it was swapped
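The prediction rule above is simple enough to automate. Below is a minimal C sketch, not the tool used on the project, that scans syslog-style text on stdin, counts “Hardware Error” and “SCSI Timed Out” messages per disk, and flags disks whose counts cross a threshold; the log format (“disk <N>” in each line) and the threshold are assumptions for illustration.

#include <stdio.h>
#include <string.h>

#define MAX_DISKS      16
#define WARN_THRESHOLD 100   /* hypothetical cutoff, tuned from experience */

/* Count worrisome log messages per disk and flag likely failures. */
int main(void)
{
    long hw_err[MAX_DISKS] = {0}, timeouts[MAX_DISKS] = {0};
    char line[1024];

    while (fgets(line, sizeof line, stdin)) {
        char *p = strstr(line, "disk ");
        int d;
        if (!p || sscanf(p, "disk %d", &d) != 1 || d < 0 || d >= MAX_DISKS)
            continue;
        if (strstr(line, "Hardware Error")) hw_err[d]++;
        if (strstr(line, "SCSI Timed Out")) timeouts[d]++;
    }

    for (int d = 0; d < MAX_DISKS; d++)
        if (hw_err[d] + timeouts[d] > WARN_THRESHOLD)
            printf("disk %d: %ld hardware errors, %ld timeouts -> "
                   "candidate for replacement\n", d, hw_err[d], timeouts[d]);
    return 0;
}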
Slide 19
Lessons from Tertiary Disk Project
• Maintenance is hard on current systems
– Hard to know what is going on, who is to blame
• Everything can break
– It’s not what you expect in advance
– Follow rule of no single point of failure
• Nothing fails fast
– Eventually behaves badly enough that operator
“fires” poor performer, but it doesn’t “quit”
• Most failures may be predicted
Slide 20
Outline
• Background: Berkeley Approach to
Systems
• PostPC Motivation
• PostPC Microprocessor: IRAM
• PostPC Infrastructure Motivation
• PostPC Infrastructure: ISTORE
• Hardware Architecture
• Software Architecture
• Conclusions and Feedback
Slide 21
The problem space: big data
• Big demand for enormous amounts of data
– today: high-end enterprise and Internet
applications
» enterprise decision-support, data mining databases
» online applications: e-commerce, mail, web, archives
– future: infrastructure services, richer data
» computational & storage back-ends for mobile devices
» more multimedia content
» more use of historical data to provide better services
• Today’s SMP server designs can’t easily scale
• Bigger scaling problems than performance!
Slide 22
The real scalability problems:
AME
• Availability
– systems should continue to meet quality of service
goals despite hardware and software failures
• Maintainability
– systems should require only minimal ongoing human
administration, regardless of scale or complexity
• Evolutionary Growth
– systems should evolve gracefully in terms of
performance, maintainability, and availability as
they are grown/upgraded/expanded
• These are problems at today’s scales, and will
only get worse as systems grow
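The slides do not quantify availability, but the standard steady-state definition in terms of mean time to failure (MTTF) and mean time to repair (MTTR) is a useful reference point:

\[
\text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
\]

It makes explicit why fast, automatic repair (reducing MTTR) matters as much as making failures rare.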
Slide 23
Principles for achieving AME (1)
• No single points of failure
• Redundancy everywhere
• Performance robustness is more important
than peak performance
– “performance robustness” implies that real-world
performance is comparable to best-case
performance
• Performance can be sacrificed for
improvements in AME
– resources should be dedicated to AME
» compare: biological systems spend > 50% of resources on
maintenance
– can make up performance by scaling system
Slide 24
Principles for achieving AME (2)
• Introspection
– reactive techniques to detect and adapt to
failures, workload variations, and system evolution
– proactive (preventative) techniques to anticipate
and avert problems before they happen
Slide 25
Hardware techniques (2)
• No Central Processor Unit:
distribute processing with storage
– Serial lines, switches also growing with Moore’s
Law; less need today to centralize vs. bus oriented
systems
– Most storage servers limited by speed of CPUs;
why does this make sense?
– Why not amortize sheet metal, power, cooling
infrastructure for disk to add processor, memory,
and network?
– If AME is important, must provide resources to be
used to help AME: local processors responsible for
health and maintenance of their storage
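A minimal sketch, not the ISTORE software itself, of what “local processors responsible for the health and maintenance of their storage” could look like: each brick’s CPU runs a loop that probes its own disk and network interfaces and reports status. The probe functions and the one-minute interval are assumptions.

#include <stdio.h>
#include <unistd.h>

/* Hypothetical per-brick health monitor: the brick's local CPU
 * periodically checks its own disk and NICs and reports the result.
 * check_disk() and check_nics() stand in for real device probes
 * (e.g., a SMART query or a link-status read). */
static int check_disk(void) { return 1; }
static int check_nics(void) { return 1; }

static void report(int disk_ok, int nics_ok)
{
    printf("brick status: disk=%s nics=%s\n",
           disk_ok ? "ok" : "FAIL", nics_ok ? "ok" : "FAIL");
}

int main(void)
{
    for (;;) {
        report(check_disk(), check_nics());
        sleep(60);   /* assumed monitoring interval */
    }
}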
Slide 26
ISTORE-1 hardware platform
• 80-node x86-based cluster, 1.4TB storage
– cluster nodes are plug-and-play, intelligent, network-attached storage “bricks”
» a single field-replaceable unit to simplify maintenance
– each node is a full x86 PC w/256MB DRAM, 18GB disk
– more CPU than NAS; fewer disks/node than cluster
ISTORE Chassis:
• 80 nodes, 8 per tray
• 2 levels of switches: 20 100 Mbit/s, 2 1 Gbit/s
• Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors...
Intelligent Disk “Brick”:
• Portable PC CPU: Pentium II/266 + DRAM
• Redundant NICs (4 100 Mb/s links)
• Diagnostic Processor
• Disk
• Half-height canister
Slide 27
A glimpse into the future?
• System-on-a-chip enables computer, memory,
redundant network interfaces without
significantly increasing size of disk
• ISTORE HW in 5-7 years:
– building block: 2006
MicroDrive integrated with
IRAM
» 9GB disk, 50 MB/sec from disk
» connected via crossbar switch
– 10,000 nodes fit into one
rack!
• O(10,000) scale is our
ultimate design point
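Taking the slide’s per-node numbers at face value, a fully populated rack at that scale would offer roughly

\[
10{,}000 \times 9\ \text{GB} \approx 90\ \text{TB of storage}, \qquad
10{,}000 \times 50\ \text{MB/s} \approx 500\ \text{GB/s of aggregate disk bandwidth.}
\]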
Slide 28
Development techniques
• Benchmarking
– One reason for 1000X processor performance was
ability to measure (vs. debate) which is better
» e.g., which is most important to improve: clock rate, clocks
per instruction, or instructions executed?
– Need AME benchmarks
“what gets measured gets done”
“benchmarks shape a field”
“quantification brings rigor”
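An availability benchmark of the kind behind the next slide can be sketched as: run a steady workload, sample a quality-of-service metric (e.g., hits per second) at fixed intervals, inject a fault partway through, and log the metric so the deviation from the fault-free “normal” band and the recovery time can be measured. A hypothetical outline in C, with the measurement and injection hooks left as placeholders:

#include <stdio.h>

/* Sketch of an availability benchmark: sample QoS every interval,
 * inject a fault at a known time, and log the samples so deviation
 * and recovery can be compared against the fault-free band.
 * measure_qos() and inject_disk_fault() are placeholders. */
static double measure_qos(void)       { return 200.0; /* e.g., hits/sec */ }
static void   inject_disk_fault(void) { /* e.g., fail a disk in the SW RAID */ }

int main(void)
{
    const int intervals = 110;   /* 2-minute intervals, as in the graphs */
    const int fault_at  = 20;    /* assumed injection point */

    for (int t = 0; t < intervals; t++) {
        if (t == fault_at)
            inject_disk_fault();
        printf("%d %.1f\n", t, measure_qos());   /* time, QoS sample */
    }
    return 0;
}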
Slide 29
Example results: multiple-faults
[Graphs: Hits per second vs. Time (2-minute intervals), roughly 140-220 hits/s over ~110 intervals, for Windows 2000/IIS and Linux/Apache under software RAID with multiple injected faults. Marked events: data disk faulted, spare faulted, reconstruction (manual on Windows, automatic on Linux), disks replaced; normal behavior shown as a 99% confidence band]
• Windows reconstructs ~3x faster than Linux
• Windows reconstruction noticeably affects application
performance, while Linux reconstruction does not
Slide 30
Software techniques (1)
• Proactive introspection
– Continuous online self-testing of HW and SW
» in deployed systems!
» goal is to shake out “Heisenbugs” before they’re
encountered in normal operation
» needs data redundancy, node isolation, fault injection
– Techniques:
» fault injection: triggering hardware and software error
handling paths to verify their integrity/existence
» stress testing: push HW/SW to their limits
» scrubbing: periodic restoration of potentially “decaying”
hardware or software state
• self-scrubbing data structures (like MVS)
• ECC scrubbing for disks and memory
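As one concrete illustration of scrubbing (a sketch under assumed primitives, not the ISTORE implementation), the loop below walks an in-memory stand-in for a disk, verifies every block against a stored checksum, and repairs decayed blocks from a redundant copy before normal operation would ever read them. A real scrubber would read from disk and repair from RAID or a replica instead of the mirror[] array.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS    8
#define BLOCK_SIZE 64

static uint8_t  blocks[NBLOCKS][BLOCK_SIZE];   /* primary copy (may decay) */
static uint8_t  mirror[NBLOCKS][BLOCK_SIZE];   /* redundant copy */
static uint32_t sums[NBLOCKS];                 /* stored checksums */

static uint32_t checksum(const uint8_t *buf, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s = s * 31 + buf[i];                   /* simple rolling checksum */
    return s;
}

static void scrub_pass(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        if (checksum(blocks[i], BLOCK_SIZE) != sums[i]) {
            printf("block %d decayed: repairing from replica\n", i);
            memcpy(blocks[i], mirror[i], BLOCK_SIZE);
        }
}

int main(void)
{
    for (int i = 0; i < NBLOCKS; i++) {
        memset(blocks[i], i, BLOCK_SIZE);
        memcpy(mirror[i], blocks[i], BLOCK_SIZE);
        sums[i] = checksum(blocks[i], BLOCK_SIZE);
    }
    blocks[3][7] ^= 0x40;   /* simulate a silently flipped bit */
    scrub_pass();           /* detects and repairs block 3 */
    return 0;
}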
Slide 31
Conclusions (1): ISTORE
• Availability, Maintainability, and Evolutionary
growth are key challenges for server systems
– more important even than performance
• ISTORE is investigating ways to bring AME to
large-scale, storage-intensive servers
– via clusters of network-attached, computationally-enhanced storage nodes running distributed code
– via hardware and software introspection
– we are currently performing application studies to
investigate and compare techniques
• Availability benchmarks a powerful tool?
– revealed undocumented design decisions affecting
SW RAID availability on Linux and Windows 2000
Slide 32
Conclusions (2)
• IRAM attractive for two Post-PC applications
because of low power, small size, high memory
bandwidth
– Gadgets: Embedded/Mobile devices
– Infrastructure: Intelligent Storage and Networks
• PostPC infrastructure requires
– New Goals: Availability, Maintainability, Evolution
– New Principles: Introspection, Performance
Robustness
– New Techniques: Isolation/fault insertion, Software
scrubbing
– New Benchmarks: measure, compare AME metrics
Slide 33
Berkeley Future work
• IRAM: fab and test chip
• ISTORE
– implement AME-enhancing techniques in a variety
of Internet, enterprise, and info retrieval
applications
– select the best techniques and integrate into a
generic runtime system with “AME API”
– add maintainability benchmarks
» can we quantify administrative work needed to maintain a
certain level of availability?
– Perhaps look at data security via encryption?
– Even consider denial of service?
Slide 34