Trends and IDISK - University of California, Berkeley


Introduction to Hardware/Architecture
David A. Patterson
http://cs.berkeley.edu/~patterson/talks
{patterson,kkeeton}@cs.berkeley.edu
EECS, University of California
Berkeley, CA 94720-1776
1
What is a Computer System?
Levels of abstraction (top to bottom):
  Application (Netscape)
  Operating System (Windows 98)
  Compiler, Assembler                         -- Software
  Instruction Set Architecture                -- Software/Hardware interface
  Processor, Memory, I/O system               -- Hardware
  Datapath & Control
  Digital Design
  Circuit Design
  transistors

Coordination of many levels of abstraction
2
Levels of Representation
High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

        | Compiler
        v
Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)

        | Assembler
        v
Machine Language Program (MIPS):
    [32-bit binary encodings of the four instructions above]

        | Machine Interpretation
        v
Control Signal Specification
3
The Instruction Set: a Critical Interface
[Diagram: software above, hardware below, with the instruction set as the interface between them]
4
Instruction Set Architecture (a subset of Computer Architecture)

"... the attributes of a [computing] system as seen by the programmer,
i.e. the conceptual structure and functional behavior, as distinct from
the organization of the data flows and controls, the logic design, and
the physical implementation."
– Amdahl, Blaauw, and Brooks, 1964

The ISA specifies:
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Set
-- Instruction Formats
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
5
Anatomy: 5 components of any Computer

Personal Computer:
  Processor (active)
    Control ("brain")
    Datapath ("brawn")
  Memory (passive) -- where programs and data live when running
  Devices
    Input:  Keyboard, Mouse
    Output: Display, Printer
    Disk -- where programs and data live when not running

Processor often called (IBMese) "CPU" for "Central Processor Unit"
6
Technology Trends: Microprocessor Capacity

[Chart: transistors per chip vs. year, 1970-2000, for i4004, i8080, i8086,
i80286, i80386, i80486, Pentium, and recent MPUs]
  Alpha 21264: 15 million transistors
  Alpha 21164: 9.3 million
  PowerPC 620: 6.9 million
  Pentium Pro: 5.5 million
  Sparc Ultra: 5.2 million

2X transistors/chip every 1.5 years: called "Moore's Law"
7
Technology Trends: Processor Performance

[Chart: performance of successive processors, 1987-1997: Sun-4/260,
MIPS M/120, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500,
IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500,
DEC Alpha 21264/600; growth of 1.54X/yr]

Processor performance increase/yr is mistakenly referred to as Moore's Law
(which is about transistors/chip)
8
Computer Technology => Dramatic Change

Processor
  2X in speed every 1.5 years; 1000X performance in last 15 years
Memory
  DRAM capacity: 2X / 1.5 years; 1000X size in last 15 years
  Cost per bit: improves about 25% per year
Disk
  Capacity: > 2X in size every 1.5 years
  Cost per bit: improves about 60% per year
  120X size in last decade

State-of-the-art PC "when you graduate" (1997-2001):
  Processor clock speed:  1500 MegaHertz  (1.5 GigaHertz)
  Memory capacity:         500 MegaBytes  (0.5 GigaBytes)
  Disk capacity:           100 GigaBytes  (0.1 TeraBytes)
  New units! Mega => Giga, Giga => Tera
9
Integrated Circuit Costs

    Die cost = Wafer cost / (Dies per Wafer * Die yield)

Die cost goes roughly with the cube of the die area:
fewer dies per wafer, and yield worsens with die area
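A minimal C sketch of this formula (not from the slides), plugging in the 1993
Pentium figures quoted two slides later ($1500 wafer, 40 dies/wafer, 9% yield):

    #include <stdio.h>

    /* Die cost = wafer cost / (dies per wafer * die yield).
       Inputs below are the 1993 Pentium figures from the table later in
       this deck: $1500 wafer, 40 dies/wafer, 9% yield. */
    int main(void) {
        double wafer_cost     = 1500.0;  /* dollars */
        double dies_per_wafer = 40.0;
        double die_yield      = 0.09;

        double die_cost = wafer_cost / (dies_per_wafer * die_yield);
        printf("die cost = $%.0f\n", die_cost);   /* ~ $417 */
        return 0;
    }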
10
Die Yield (1993 data)

Raw Dies Per Wafer
  die area (mm2):     100   144   196   256   324   400
  6"/15cm wafer:      139    90    62    44    32    23
  8"/20cm wafer:      265   177   124    90    68    52
  10"/25cm wafer:     431   290   206   153   116    90

  die yield:          23%   19%   16%   12%   11%   10%

typical CMOS process: alpha = 2, wafer yield = 90%,
defect density = 2/cm2, 4 test sites/wafer

Good Dies Per Wafer (Before Testing!)
  die area (mm2):     100   144   196   256   324   400
  6"/15cm wafer:       31    16     9     5     3     2
  8"/20cm wafer:       59    32    19    11     7     5
  10"/25cm wafer:      96    53    32    20    13     9

typical cost of an 8", 4 metal layers, 0.5um CMOS wafer: ~$2000
11
1993 Real World Examples

Chip          Metal   Line     Wafer   Defect   Area    Dies/   Yield   Die
              layers  width    cost    /cm2     (mm2)   wafer           cost
386DX           2     0.90um    $900    1.0       43     360     71%     $4
486DX2          3     0.80um   $1200    1.0       81     181     54%    $12
PowerPC 601     4     0.80um   $1700    1.3      121     115     28%    $53
HP PA 7100      3     0.80um   $1300    1.0      196      66     27%    $73
DEC Alpha       3     0.70um   $1500    1.2      234      53     19%   $149
SuperSPARC      3     0.70um   $1700    1.6      256      48     13%   $272
Pentium         3     0.80um   $1500    1.5      296      40      9%   $417

From "Estimating IC Manufacturing Costs," by Linley Gwennap,
Microprocessor Report, August 2, 1993, p. 15
12
Other Costs

    IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield

Packaging Cost: depends on pins, heat dissipation

Chip           Die     Package               Test &     Total
               cost    pins   type   cost    Assembly
386DX            $4    132    QFP     $1       $4         $9
486DX2          $12    168    PGA    $11      $12        $35
PowerPC 601     $53    304    QFP     $3      $21        $77
HP PA 7100      $73    504    PGA    $35      $16       $124
DEC Alpha      $149    431    PGA    $30      $23       $202
SuperSPARC     $272    293    PGA    $20      $34       $326
Pentium        $417    273    PGA    $19      $37       $473
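A short C sketch of the IC cost formula above (not from the slides), using the
Pentium row; the table gives no final test yield, so 100% is assumed here,
which reproduces the $473 total:

    #include <stdio.h>

    /* IC cost = (die cost + testing cost + packaging cost) / final test yield.
       Pentium row from the table above; final test yield is not listed,
       so 1.0 (100%) is an assumption. */
    int main(void) {
        double die_cost = 417.0, package_cost = 19.0, test_assembly = 37.0;
        double final_test_yield = 1.0;   /* assumed */

        double ic_cost = (die_cost + test_assembly + package_cost) / final_test_yield;
        printf("IC cost = $%.0f\n", ic_cost);   /* $473 */
        return 0;
    }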
13
System Cost: 1995-96 Workstation

System        Subsystem                  % of total cost
Cabinet       Sheet metal, plastic             1%
              Power supply, fans               2%
              Cables, nuts, bolts              1%
              (Subtotal)                      (4%)
Motherboard   Processor                        6%
              DRAM (64MB)                     36%
              Video system                    14%
              I/O system                       3%
              Printed Circuit board            1%
              (Subtotal)                     (60%)
I/O Devices   Keyboard, mouse                  1%
              Monitor                         22%
              Hard disk (1 GB)                 7%
              Tape drive (DAT)                 6%
              (Subtotal)                     (36%)
14
COST v. PRICE

Q: What % of company income goes to Research and Development (R&D)?

Building up from component cost to list price (ranges are WS-PC):
  Component cost (25-31% of list price)
    Input: chips, displays, ...
  + 33% Direct Costs (8-10% of list price)
    Making it: labor, scrap, returns, ...
  + 25-100% Gross Margin (33-14% of list price)
    Overhead: R&D, rent, marketing, profits, ...
  = Average selling price
  + 50-80% (Average Discount, 33-45% of list price)
    Commission: channel profit, volume discounts, ...
  = List price
15
Outline

Review of Five Technologies: Processor, Memory, Disk, Network, Systems
  Description / History / Performance Model
  State of the Art / Trends / Limits / Innovation

Common Themes across Technologies
  Performance: per access (latency) + per byte (bandwidth)
  Fast: Capacity, BW, Cost;  Slow: Latency, Interfaces
  Moore's Law affecting all chips in system
16
Processor Trends / History

Microprocessor: main CPU of "all" computers
  < 1986: +35%/yr. performance increase (2X / 2.3 yr)
  > 1987 (RISC): +60%/yr. performance increase (2X / 1.5 yr)
  Cost fixed at ~$500/chip, power whatever can be cooled

CPU time = Seconds/Program
         = (Instructions/Program) x (Clocks/Instruction) x (Seconds/Clock)
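A worked example of the CPU time equation as a small C program; the
instruction count, CPI, and clock rate below are made-up illustrative
values, not figures from the talk:

    #include <stdio.h>

    /* CPU time = (instructions/program) x (clocks/instruction) x (seconds/clock).
       Illustrative inputs: a 1-billion-instruction program, CPI of 1.5,
       on a 600 MHz clock. */
    int main(void) {
        double instructions = 1e9;
        double cpi          = 1.5;       /* clocks per instruction */
        double clock_rate   = 600e6;     /* Hz, so seconds/clock = 1/clock_rate */

        double cpu_time = instructions * cpi / clock_rate;
        printf("CPU time = %.2f seconds\n", cpu_time);   /* 2.50 s */
        return 0;
    }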
History of innovations to keep the 2X / 1.5 yr pace:
  Pipelining (helps seconds/clock, i.e. clock rate)
  Out-of-Order Execution (helps clocks/instruction)
  Superscalar (helps clocks/instruction)
  Multilevel Caches (helps clocks/instruction)
17
Pipelining is Natural!

° Laundry Example
° Ann, Brian, Cathy, Dave each have one load of clothes
  to wash, dry, fold, and put away
° Washer takes 30 minutes
° Dryer takes 30 minutes
° "Folder" takes 30 minutes
° "Stasher" takes 30 minutes to put clothes into drawers
18
Sequential Laundry

[Timeline: each of the 4 loads goes through washer, dryer, folder, and
stasher in turn, 30 minutes per stage, from 6 PM to 2 AM]

Sequential laundry takes 8 hours for 4 loads
19
Pipelined Laundry: Start work ASAP

[Timeline: load B enters the washer as soon as load A moves to the dryer,
and so on; all four stages busy at once]

Pipelined laundry takes 3.5 hours for 4 loads!
20
Pipeline Hazard: Stall

[Timeline: a bubble appears in the pipeline while the folder is busy]

A depends on D; stall since folder tied up
21
Out-of-Order Laundry: Don't Wait

[Timeline: the dependent load waits while later loads proceed around it]

A depends on D; rest continue; need more resources to allow out-of-order
22
Superscalar Laundry: Parallel per stage

[Timeline: multiple loads (light, dark, very dirty clothing) run in
parallel within each stage, finishing the six loads much sooner]

More resources, HW match mix of parallel tasks?
23
Superscalar Laundry: Mismatch Mix

[Timeline: three light loads and one dark load; the parallel hardware for
the other clothing types sits idle]

Task mix underutilizes extra resources
24
State of the Art: Alpha 21264

  15M transistors
  Two 64KB caches on chip; 16MB L2 cache off chip
  Clock < 1.7 nsec, or > 600 MHz
  (Fastest Cray Supercomputer: T90 is 2.2 nsec)
  90 watts
  Superscalar: fetches up to 6 instructions/clock cycle,
  retires up to 4 instructions/clock cycle
  Execution out-of-order
25
Today's Situation: Microprocessor

MIPS MPUs                 R5000       R10000         10k/5k
Clock Rate                200 MHz     195 MHz         1.0x
On-Chip Caches            32K/32K     32K/32K         1.0x
Instructions/Cycle        1 (+ FP)    4               4.0x
Pipe stages               5           5-7             1.2x
Model                     In-order    Out-of-order    --
Die Size (mm2)            84          298             3.5x
  without cache, TLB      32          205             6.3x
Development (man yr.)     60          300             5.0x
SPECint_base95            5.7         8.8             1.6x
26
Memory History / Trends / State of the Art

DRAM: main memory of all computers
  Commodity chip industry: no company > 20% share
  Packaged in SIMM or DIMM (e.g., 16 DRAMs/SIMM)

State of the Art: $152, 128 MB DIMM
  (16 64-Mbit DRAMs), 10 ns x 64b (800 MB/sec)

Capacity: 4X / 3 yrs (60%/yr.)
  Moore's Law
  MB/$: +25%/yr.
  Latency: -7%/year;  Bandwidth: +20%/yr. (so far)

source: www.pricewatch.com, 5/21/98
27
Memory Summary

  DRAM: rapid improvements in capacity, MB/$, bandwidth;
  slow improvement in latency
  Processor-memory interface (cache + memory bus) is the bottleneck
  to delivered bandwidth
  Like a network, the memory "protocol" is major overhead
28
Processor Innovations / Limits

Low cost, low power embedded processors
  Lots of competition, innovation
  Integer perf. of embedded proc. ~ 1/2 desktop processor
  StrongARM 110: 233 MHz, 268 MIPS, 0.36 W typ., $49

Very Long Instruction Word (Intel/HP IA-64 "Merced")
  multiple ops/instruction, compiler controls parallelism

Consolidation of desktop industry? Innovation?
  SPARC, x86, IA-64
29
Processor Summary

SPEC performance doubling / 18 months
  Growing CPU-DRAM performance gap & tax
  Running out of ideas, competition? Back to 2X / 2.3 yrs?

Processor tricks not as useful for transactions?
  Clock rate increase compensated by CPI increase?
  When > 100 MIPS on TPC-C?

Cost fixed at ~$500/chip, power whatever can be cooled

Embedded processors promising
  1/10 cost, 1/100 power, 1/2 integer performance?
30
Processor Limit: DRAM Gap

[Chart, 1980-2000, log performance scale: CPU ("Moore's Law") improves at
60%/yr. while DRAM improves at 7%/yr.; the Processor-Memory Performance Gap
grows 50% / year]

• Alpha 21264 full cache miss, in instructions executed:
  180 ns / 1.7 ns = 108 clks x 4, or 432 instructions
• Caches in Pentium Pro: 64% of area, 88% of transistors
31
The Goal: Illusion of large, fast, cheap memory

  Fact: Large memories are slow, fast memories are small
  How do we create a memory that is large, cheap,
  and fast (most of the time)?
  Answer: a Hierarchy of Levels
  Similar to the Principle of Abstraction:
  hide details of multiple levels
32
Library

Working on a paper in the library at a desk
Option 1: Every time you need a book
  Leave desk to go to shelves (or stacks)
  Find the book
  Bring one book back to desk
  Read the section you are interested in
  When done with the section, leave desk and go to shelves carrying the book
  Put the book back on the shelf
  Return to desk to work
  Next time you need a book, go to the first step
33
Memory Hierarchy Analogy: Library

Option 2: Every time you need a book
  Leave some books on the desk after fetching them
  Only go to shelves when you need a new book
  When you go to the shelves, bring back related books in case you need
  them; sometimes you'll need to return books not used recently to make
  space for new books on the desk
  Return to desk to work
  When done, replace books on shelves, carrying as many as you can per trip

Illusion: whole library on your desktop
Buzzword "cache" comes from the French for hidden treasure
34
Why Hierarchy Works: Natural Locality

The Principle of Locality:
  Programs access a relatively small portion of the address space
  at any instant of time.

[Plot: probability of reference vs. address (0 to 2^n - 1); references
cluster in a few small regions of the address space]

What programming constructs lead to the Principle of Locality?
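One possible answer, as a minimal C sketch (mine, not from the slides):
ordinary loops over arrays produce both kinds of locality.

    #include <stdio.h>

    #define N 1000

    /* A plain loop over an array: the same few instructions are re-executed
       every iteration (temporal locality), and v[i] walks through contiguous
       words of memory (spatial locality); `sum` is reused every iteration. */
    int main(void) {
        static int v[N];
        long sum = 0;
        for (int i = 0; i < N; i++) {
            v[i] = i;          /* sequential accesses: spatial locality */
            sum += v[i];       /* immediate reuse:     temporal locality */
        }
        printf("sum = %ld\n", sum);
        return 0;
    }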
35
Memory Hierarchy: How Does it Work?

Temporal Locality (Locality in Time):
  Keep most recently accessed data items closer to the processor
  Library Analogy: recently read books are kept on the desk
  Block is the unit of transfer (like a book)

Spatial Locality (Locality in Space):
  Move blocks consisting of contiguous words to the upper levels
  Library Analogy: bring back nearby books on the shelves when you fetch
  a book; hope that you might need them later for your paper
36
Memory Hierarchy Pyramid

Central Processor Unit (CPU) at the top ("Upper")
  Level 1
  Level 2
  Level 3
  ...
  Level n  ("Lower")

Moving down the pyramid: increasing distance from the CPU, decreasing
cost/MB, increasing size of memory at each level
(data cannot be in level i unless it is also in level i+1)
37
Hierarchy

  Temporal locality: keep recently accessed data items closer to the processor
  Spatial locality: move contiguous words in memory to upper levels of the hierarchy
  Use smaller and faster memory technologies close to the processor
    Fast hit time in the highest level of the hierarchy
    Cheap, slow memory furthest from the processor
  If the hit rate is high enough, the hierarchy has an access time close to
  that of the highest (and fastest) level and a size equal to that of the
  lowest (and largest) level
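The last point is usually quantified with the average-memory-access-time
formula; the formula and the numbers below are a standard illustration,
not taken from the slide:

    #include <stdio.h>

    /* Average access time = hit_time + miss_rate * miss_penalty.
       Illustrative values only: a 2 ns top level, a 100 ns lower level,
       and a 98% hit rate. */
    int main(void) {
        double hit_time     = 2.0;     /* ns, highest (fastest) level      */
        double miss_penalty = 100.0;   /* ns, to reach the lower level     */
        double miss_rate    = 0.02;    /* 98% of accesses hit the top level */

        double avg = hit_time + miss_rate * miss_penalty;
        printf("average access time = %.1f ns\n", avg);   /* 4.0 ns */
        return 0;
    }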
38
Recall: 5 components of any Computer -- Focus on I/O

Computer
  Processor (active)
    Control ("brain")
    Datapath ("brawn")
  Memory (passive) -- where programs and data live when running
  Devices
    Input:  Keyboard, Mouse
    Output: Display, Printer
    Disk, Network -- where programs and data live when not running
39
Disk Description / History

[Diagram: platters, cylinders, tracks, sectors, arm, head, track buffer,
embedded processor (ECC, SCSI)]

1973: 1.7 Mbit/sq. in,   140 MBytes
1979: 7.7 Mbit/sq. in, 2,300 MBytes

source: New York Times, 2/23/98, page C3,
"Makers of disk drives crowd even more data into even smaller spaces"
40
Disk History

[Chart: areal density, 1970-2000, log scale from 1 to 10,000 Mbit/sq. in.]

1989:     63 Mbit/sq. in,   60,000 MBytes
1997:  1,450 Mbit/sq. in,    2,300 MBytes (2.5" diameter)
1997:  3,090 Mbit/sq. in,    8,100 MBytes (3.5" diameter)
2000: 10,100 Mbit/sq. in,   25,000 MBytes
2000: 11,000 Mbit/sq. in,   73,400 MBytes

source: N.Y. Times, 2/23/98, page C3
41
State of the Art: Ultrastar 72ZX

[Diagram: platters, cylinders, tracks, sectors, arm, head, track buffer,
embedded processor]

Latency = Queuing Time + Controller time + Seek Time + Rotation Time
          + Size / Bandwidth
  (per access: queuing + controller + seek + rotation; per byte: size/bandwidth)

  73.4 GB, 3.5 inch disk
  2¢/MB
  16 MB track buffer
  11 platters, 22 surfaces
  15,110 cylinders
  7 Gbit/sq. in. areal density
  17 watts (idle)
  0.1 ms controller time
  5.3 ms avg. seek (seek 1 track => 0.6 ms)
  3 ms = 1/2 rotation
  37 to 22 MB/s to media

source: www.ibm.com; www.pricewatch.com; 2/14/00
42
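A small C sketch applying the latency model above to this drive; the 64 KB
request size and the 25 MB/s media rate (within the quoted 22-37 MB/s range)
are assumptions for illustration, and queuing time is ignored:

    #include <stdio.h>

    /* latency = controller + seek + rotation + size/bandwidth
       (queuing time omitted). Controller, seek, and rotation numbers are
       the Ultrastar 72ZX figures above; request size and media rate are
       assumed for illustration. */
    int main(void) {
        double controller_ms = 0.1;
        double seek_ms       = 5.3;             /* average seek            */
        double rotation_ms   = 3.0;             /* half rotation           */
        double size_bytes    = 64.0 * 1024.0;   /* assumed 64 KB request   */
        double bw_bytes_ms   = 25.0e6 / 1000.0; /* ~25 MB/s media rate     */

        double latency = controller_ms + seek_ms + rotation_ms
                       + size_bytes / bw_bytes_ms;
        printf("latency for a 64 KB read = %.1f ms\n", latency); /* ~11.0 ms */
        return 0;
    }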
Disk Limit

  Continued advance in capacity (60%/yr) and bandwidth (40%/yr)
  Slow improvement in seek, rotation (8%/yr)

Time to read the whole disk:
  Year    Sequentially    Randomly
  1990    4 minutes       6 hours
  2000    12 minutes      1 week

Dynamically change data layout to reduce seek, rotation delay?
Leverage space vs. spindles?
43
A glimpse into the future?

IBM microdrive for digital cameras: 340 MBytes

Disk target in 5-7 years?
  building block: 2006 MicroDrive
    » 9 GB disk, 50 MB/sec from disk
  10,000 nodes fit into one rack!
44
Disk Summary

  Continued advance in capacity, cost/bit, BW;
  slow improvement in seek, rotation
  External I/O bus is bottleneck to transfer rate, cost?
  => move to fast serial lines (FC-AL)?
  What to do with the increasing speed of the embedded processor inside the disk?
45
Connecting to Networks (and Other I/O)

  Bus: a shared medium of communication that can connect to many devices
  Hierarchy of Buses in a PC
46
Buses in a PC

[Diagram: CPU and Memory on the memory bus; a bridge to the internal
(backplane) PCI I/O bus; SCSI interface to an external I/O bus with 1 to 15
disks; Ethernet interface to the Local Area Network]

Data rates:
  Memory bus: 100 MHz, 8 bytes wide  => 800 MB/s (peak)
  PCI:         33 MHz, 4 bytes wide  => 132 MB/s (peak)
  SCSI: "Ultra2" (40 MHz), "Wide" (2 bytes)  =>  80 MB/s (peak)
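A tiny C sketch of how these peak numbers come about (peak = clock rate x
bytes per transfer); this is arithmetic on the figures above, not a
measurement:

    #include <stdio.h>

    /* Peak bus bandwidth = clock rate x bytes transferred per clock.
       Treating every cycle as a data transfer is what makes these "peak"
       numbers; delivered bandwidth is lower. */
    int main(void) {
        double mem_peak  = 100e6 * 8;   /* 100 MHz x 8 bytes = 800 MB/s */
        double pci_peak  =  33e6 * 4;   /*  33 MHz x 4 bytes = 132 MB/s */
        double scsi_peak =  40e6 * 2;   /*  40 MHz x 2 bytes =  80 MB/s */

        printf("memory bus peak: %.0f MB/s\n", mem_peak  / 1e6);
        printf("PCI peak:        %.0f MB/s\n", pci_peak  / 1e6);
        printf("SCSI peak:       %.0f MB/s\n", scsi_peak / 1e6);
        return 0;
    }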
47
Why Networks?

  Originally: sharing I/O devices between computers (e.g., printers)
  Then: communicating between computers (e.g., file transfer protocol)
  Then: communicating between people (e.g., email)
  Then: communicating between networks of computers
    Internet, WWW
48
Types of Networks

Local Area Network (Ethernet)
  Inside a building: up to 1 km
  (peak) Data Rate: 10 Mbits/sec, 100 Mbits/sec, 1000 Mbits/sec
  Run, installed by network administrators

Wide Area Network
  Across a continent (10 km to 10,000 km)
  (peak) Data Rate: 1.5 Mbits/sec to 2,500 Mbits/sec
  Run, installed by telephone companies
49
ABCs of Networks: 2 Computers

  Starting Point: send bits between 2 computers
  Queue (First In First Out) on each end
  Can send both ways ("Full Duplex")
  Information sent is called a "message"
    Note: messages are also called packets
50
A Simple Example: 2 Computers

What is the Message Format?
  (Similar in idea to Instruction Format)
  Fixed size? Number of bits?

  Request/Response: 1 bit     Address/Data: 32 bits
    0: Please send data from the address in your memory
    1: Packet contains data corresponding to the request

• Header (Trailer): information to deliver the message
• Payload: data in the message (1 word above)
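A hypothetical C rendering of this simple message format; the struct, field,
and value names are illustrative, not part of the slides:

    #include <stdint.h>
    #include <stdio.h>

    /* 1 request/response bit plus 32 bits of address or data,
       as in the format above. Names are illustrative. */
    struct simple_msg {
        uint8_t  is_response;   /* 0 = request for data at address, 1 = reply with data */
        uint32_t addr_or_data;  /* address in a request, data word in a reply */
    };

    int main(void) {
        struct simple_msg req = { 0, 0x1000 };      /* "please send word at 0x1000" */
        struct simple_msg rep = { 1, 0xCAFE };      /* "here is the requested data" */
        printf("request:  type=%u addr=0x%X\n", req.is_response, (unsigned)req.addr_or_data);
        printf("response: type=%u data=0x%X\n", rep.is_response, (unsigned)rep.addr_or_data);
        return 0;
    }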
51
Questions About Simple Example

What if more than 2 computers want to communicate?
  Need a computer "address field" in the packet to know which computer
  should receive it (destination), and which computer it came from for the
  reply (source)

  Header:  Req./Resp. (1 bit) | Dest. Net ID (5 bits) | Source Net ID (5 bits)
  Payload: Address/Data (32 bits)
52
Questions About Simple Example

What if the message is garbled in transit?
  Add redundant information that is checked when the message arrives to be
  sure it is OK
  8-bit sum of the other bytes: called a "checksum"; upon arrival, compare
  the checksum to the sum of the rest of the information in the message

  Header:  Req./Resp. (1 bit) | Dest. Net ID (5 bits) | Source Net ID (5 bits)
  Payload: Address/Data (32 bits)
  Trailer: Checksum (8 bits)
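A minimal C sketch of the 8-bit checksum described above (names and the
sample message bytes are illustrative):

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* 8-bit checksum: sum all the other bytes of the message modulo 256.
       The sender appends the sum as the trailer; the receiver recomputes
       it and compares before acknowledging. */
    uint8_t checksum8(const uint8_t *bytes, size_t len) {
        uint8_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = (uint8_t)(sum + bytes[i]);
        return sum;
    }

    int main(void) {
        uint8_t msg[6] = { 0x01, 0x0A, 0x03, 0x12, 0x34, 0x56 }; /* header + payload */
        uint8_t sent = checksum8(msg, sizeof msg);               /* appended trailer */
        printf("checksum match: %s\n",
               checksum8(msg, sizeof msg) == sent ? "yes" : "no");
        return 0;
    }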
53
Questions About Simple Example

What if the message never arrives?
  If we tell the sender it has arrived (and tell the receiver the reply has
  arrived), we can resend upon failure
  Don't discard the message until you get an "ACK" (acknowledgment);
  (Also, if the checksum fails, don't send an ACK)

  Fields: Req./Resp. (2 bits) | Dest. Net ID (5 bits) | Source Net ID (5 bits)
          | Address/Data (32 bits) | Checksum (8 bits)

  00: Request - please send data from Address
  01: Reply - message contains data corresponding to request
  10: Acknowledge (ACK) request
  11: Acknowledge (ACK) reply
54
Observations About Simple Example

  Simple questions such as those above lead to more complex procedures to
  send/receive messages and more complex message formats
  Protocol: algorithm for properly sending and receiving messages (packets)
55
Ethernet (popular LAN) Packet Format

  Preamble         8 Bytes
  Dest Addr        6 Bytes
  Src Addr         6 Bytes
  Length of Data   2 Bytes
  Data             0-1500 Bytes
  Pad              0-46 Bytes
  Check            4 Bytes

  Preamble to recognize the beginning of a packet
  Unique address per Ethernet Network Interface Card, so you can just plug
  in & use (privacy issue?)
  Pad ensures minimum packet is 64 bytes
    Easier to find packet on the wire
  Header + Trailer: 24B + Pad
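A hypothetical C view of the fixed header fields listed above (the
variable-length Data, Pad, and 4-byte Check follow on the wire; names are
mine, not from the slides):

    #include <stdint.h>
    #include <stdio.h>

    /* Fixed-size fields at the front of the Ethernet packet format above.
       The variable-length Data, the Pad, and the 4-byte Check (trailer)
       come after these fields on the wire. */
    struct ethernet_header {
        uint8_t  preamble[8];   /* marks the start of the packet              */
        uint8_t  dest_addr[6];  /* unique per Ethernet network interface card */
        uint8_t  src_addr[6];
        uint16_t length;        /* length of the Data field in bytes          */
    };

    int main(void) {
        /* 8 + 6 + 6 + 2 = 22 bytes of leading fields in this sketch. */
        printf("leading header bytes: %zu\n", sizeof(struct ethernet_header));
        return 0;
    }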
56

Software Protocol to Send and Receive

SW Send steps
  1: Application copies data to OS buffer
  2: OS calculates checksum, starts timer
  3: OS sends data to network interface HW and says start

SW Receive steps
  3: OS copies data from network interface HW to OS buffer
  2: OS calculates checksum; if OK, send ACK; if not, delete message
     (sender resends when timer expires)
  1: If OK, OS copies data to user address space, & signals application to continue
57
Protocol for Networks of Networks (WAN)?

Internetworking: allows computers on independent and incompatible networks
to communicate reliably and efficiently

  Enabling technologies: SW standards that allow reliable communications
  without reliable networks
  Hierarchy of SW layers, giving each layer responsibility for a portion of
  the overall communications task, called protocol families or protocol suites
  Abstraction to cope with the complexity of communication vs. abstraction
  for the complexity of computation
58
Protocol for Network of Networks

Transmission Control Protocol / Internet Protocol (TCP/IP)
  This protocol family is the basis of the Internet, a WAN protocol
  IP makes best effort to deliver
  TCP guarantees delivery
  TCP/IP is so popular it is used even when communicating locally:
  even across a homogeneous LAN
59
FTP From Stanford to Berkeley

[Diagram: a file travels from Hennessy (Stanford) to Patterson (Berkeley)
over Ethernet and FDDI LANs joined by a T3 link]

  BARRNet is the WAN for the Bay Area
  T3 is a 45 Mbit/s leased line (WAN); FDDI is a 100 Mbit/s LAN
  IP sets up the connection, TCP sends the file
60
Protocol Family Concept

[Diagram: at each level, peers exchange a logical Message; the actual path
wraps the message in a header (H) and trailer (T) at each lower level before
it crosses the physical link]
61
Protocol Family Concept

  Key to protocol families is that communication occurs logically at the
  same level of the protocol, called peer-to-peer, but is implemented via
  services at the lower level
  Danger is lower performance at each level if the family is implemented
  as a hierarchy (e.g., multiple checksums)
62


TCP/IP packet, Ethernet packet, protocols

  Application sends message
  TCP breaks it into 64 KB segments, adds 20 B header
  IP adds 20 B header, sends to network
  If Ethernet, broken into 1500 B packets with headers, trailers (24 B)
  All headers, trailers have a length field, destination, ...

[Diagram: Ethernet Hdr | IP Header | TCP Header | message data | Ethernet trailer]
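A small C sketch of how the layers nest for a short message that fits in one
Ethernet packet, using the per-layer sizes above; the 1000-byte message is an
assumption, and real stacks add options and per-fragment headers:

    #include <stdio.h>

    /* Header nesting for one short message, per the sizes above:
       20 B TCP header, 20 B IP header, 24 B Ethernet header + trailer. */
    int main(void) {
        int message = 1000;                 /* assumed application message, bytes */
        int tcp_seg = message + 20;         /* TCP data      = message       */
        int ip_pkt  = tcp_seg + 20;         /* IP data       = TCP segment   */
        int eth_pkt = ip_pkt  + 24;         /* Ethernet data = IP packet     */

        printf("message=%d TCP=%d IP=%d Ethernet=%d bytes\n",
               message, tcp_seg, ip_pkt, eth_pkt);  /* 1000 -> 1020 -> 1040 -> 1064 */
        return 0;
    }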
63
Networks

  Shared Media vs. Switched: in a switched network, pairs of nodes
  communicate at the same time over "point-to-point" connections
  Aggregate BW in a switched network is many times that of a shared one
    Point-to-point is faster since there is no arbitration and the
    interface is simpler

[Diagram: nodes on a shared medium vs. nodes connected through a crossbar switch]
64
Heart of Today's Data Switch

[Diagram: incoming serial bit streams are converted to, say, 128-bit words,
buffered in memory, then converted back to serial bit streams on the way out]

Unpack the header to find the destination and place the message into the
memory of the proper outgoing port; OK as long as memory is much faster than
the switch rate
65
Network Media (if time)

Twisted Pair: copper, 1 mm thick, twisted to avoid antenna effect (telephone)

Coaxial Cable: copper core, insulator, braided outer conductor, plastic
  covering; used by cable companies: high BW, good noise immunity

Fiber Optics: 3 parts are the cable (silica), the light source, and the
  light detector
  Transmitter: L.E.D. or Laser Diode
  Receiver: Photodiode
  Light stays in the fiber by total internal reflection (vs. air)
66
Rates

  Using the peak transfer rate of a portion of the I/O system to make
  performance projections or performance comparisons is misleading
  Peak bandwidth measurements are often based on unrealistic assumptions
  about the system, or are unattainable because of other system limitations
    In the example, peak bandwidth of FDDI vs. 10 Mbit Ethernet = 10:1,
    but the delivered BW ratio (due to software overhead) is 1.01:1
    Peak PCI BW is 132 MByte/sec, but combined with memory it is often < 80 MB/s
67
Network Description / Innovations

  Shared Media vs. Switched: pairs communicate at the same time
  Aggregate BW in a switched network is many times that of a shared one
    Point-to-point is faster: only a single destination, simpler interface
    Serial line: 1 - 5 Gbit/sec
  Moore's Law for switches, too
    1 chip: 32 x 32 switch, 1.5 Gbit/sec links, $396
    48 Gbit/sec aggregate bandwidth (AMCC S2025)
68
Network History / Limits

  TCP/UDP/IP protocols for WAN/LAN in 1980s
  Lightweight protocols for LAN in 1990s
  Limit is standards and efficient SW protocols
    10 Mbit Ethernet in 1978 (shared)
    100 Mbit Ethernet in 1995 (shared, switched)
    1000 Mbit Ethernet in 1998 (switched)
    FDDI; ATM Forum for scalable LAN (still meeting)
  Internal I/O bus limits delivered BW
    32-bit, 33 MHz PCI bus = 1 Gbit/sec
    future: 64-bit, 66 MHz PCI bus = 4 Gbit/sec
69
Network Summary

  Fast serial lines and switches offer high bandwidth, low latency over
  reasonable distances
  Protocol software development and standards committee bandwidth limit
  the innovation rate
    Ethernet forever?
  Internal I/O bus interface to the network is the bottleneck to delivered
  bandwidth, latency
70
Network Summary

  Protocol suites allow heterogeneous networking
    Another use of the principle of abstraction
    Protocols => operation in presence of failures
    Standardization is key for LAN, WAN
  Integrated circuits are revolutionizing network switches as well as
  processors
    A switch is just a specialized computer
  High bandwidth networks with slow SW overheads don't deliver their promise
71
Systems: History, Trends, Innovations

  Cost/Performance leaders come from the PC industry
  Transaction processing, file service based on Symmetric Multiprocessor
  (SMP) servers
    4 - 64 processors
    Shared memory addressing
  Decision support based on SMP and Cluster (Shared Nothing)
  Clusters of low-cost, small SMPs getting popular
72
1997 State of the Art System: PC

  $1140 OEM
  1 266 MHz Pentium II
  64 MB DRAM
  2 UltraDMA EIDE disks, 3.1 GB each
  100 Mbit Ethernet Interface
  (PennySort winner)

source: www.research.microsoft.com/research/barc/SortBenchmark/PennySort.ps
73
1997 State of the Art SMP: Sun E10000

[Diagram: 4 address buses and a data crossbar switch connect 16
processor/memory boards; bus bridges fan out to strings of SCSI disks]

  TPC-D, Oracle 8, 3/98
  SMP: 64 336 MHz CPUs, 64 GB DRAM, 668 disks (5.5 TB)

  Disks, shelves    $2,128k
  Boards, encl.     $1,187k
  CPUs                $912k
  DRAM                $768k
  Power                $96k
  Cables, I/O          $69k
  HW total          $5,161k

source: www.tpc.org
74
State of the Art Cluster: Tandem/Compaq SMP

  ServerNet switched network
  Rack mounted equipment
  SMP: 4 PPro, 3 GB DRAM, 3 disks (6/rack)
  10 disk shelves/rack @ 7 disks/shelf
  Total: 6 SMPs (24 CPUs, 18 GB DRAM), 402 disks (2.7 TB)
  TPC-C, Oracle 8, 4/98

  CPUs             $191k
  DRAM             $122k
  Disks + cntlr    $425k
  Disk shelves      $94k
  Networking        $76k
  Racks             $15k
  HW total         $926k
75
1997 Berkeley Cluster: Zoom Project

  3 TB storage system
    370 8 GB disks,
    20 200 MHz PPro PCs,
    100 Mbit Switched Ethernet
    System cost small delta (~30%) over raw disk cost
  Application: San Francisco Fine Arts Museum Server
    70,000 art images online
    Zoom in 32X; try it yourself!
    www.Thinker.org (statue)
76
User Decision Support Demand vs. Processor Speed

[Chart, 1996-2000, log scale: database demand grows 2X / 9-12 months
("Greg's Law"), CPU speed grows 2X / 18 months ("Moore's Law"); the
Database-Processor Performance Gap widens]
77
Berkeley Perspective on Post-PC Era

PostPC Era will be driven by 2 technologies:

1) "Gadgets": Tiny Embedded or Mobile Devices
   ubiquitous: in everything
   e.g., successor to PDA, cell phone, wearable computers

2) Infrastructure to Support such Devices
   e.g., successor to Big Fat Web Servers, Database Servers
78
Intelligent RAM: IRAM

Microprocessor & DRAM on a single chip:
  10X capacity vs. SRAM
  on-chip memory latency 5-10X, bandwidth 50-100X
  improve energy efficiency 2X-4X (no off-chip bus)
  serial I/O 5-10X vs. buses
  smaller board area/volume

[Diagram: conventional system (processor with L1/L2 caches, buses, separate
DRAM and I/O chips) vs. IRAM (processor, DRAM, and I/O on one chip)]

IRAM advantages extend to:
  a single chip system
  a building block for larger systems
79
Other examples: IBM "Blue Gene"

  1 PetaFLOPS in 2005 for $100M?
  Application: Protein Folding
  Blue Gene Chip
    32 Multithreaded RISC processors + ??MB Embedded DRAM + high speed
    Network Interface on a single 20 x 20 mm chip
    1 GFLOPS / processor
  2' x 2' Board = 64 chips (2K CPUs)
  Rack = 8 Boards (512 chips, 16K CPUs)
  System = 64 Racks (512 boards, 32K chips, 1M CPUs)
  Total 1 million processors in just 2000 sq. ft.
80
Other examples: Sony Playstation 2

  Emotion Engine: 6.2 GFLOPS, 75 million polygons per second
  (Microprocessor Report, 13:5)
    Superscalar MIPS core + vector coprocessor + graphics/DRAM
    Claim: "Toy Story" realism brought to games
81

The problem space: big data

Big demand for enormous amounts of data
  today: high-end enterprise and Internet applications
    » enterprise decision-support, data mining databases
    » online applications: e-commerce, mail, web, archives
  future: infrastructure services, richer data
    » computational & storage back-ends for mobile devices
    » more multimedia content
    » more use of historical data to provide better services

Today's SMP server designs can't easily scale
Bigger scaling problems than performance!
82

The real scalability problems: AME

Availability
  systems should continue to meet quality of service goals despite
  hardware and software failures
Maintainability
  systems should require only minimal ongoing human administration,
  regardless of scale or complexity
Evolutionary Growth
  systems should evolve gracefully in terms of performance,
  maintainability, and availability as they are grown/upgraded/expanded

These are problems at today's scales, and will only get worse as systems grow
83
ISTORE-1 hardware platform

80-node x86-based cluster, 1.4 TB storage
  cluster nodes are plug-and-play, intelligent, network-attached storage "bricks"
    » a single field-replaceable unit to simplify maintenance
  each node is a full x86 PC w/ 256 MB DRAM, 18 GB disk
  more CPU than NAS; fewer disks/node than cluster

ISTORE Chassis
  80 nodes, 8 per tray
  2 levels of switches
    • 20 100 Mbit/s
    • 2 1 Gbit/s
  Environment Monitoring: UPS, redundant PS, fans, heat and vibration sensors...

Intelligent Disk "Brick"
  Portable PC CPU: Pentium II/266 + DRAM
  Redundant NICs (4 100 Mb/s links)
  Diagnostic Processor
  Disk
  Half-height canister
84
Conclusion

IRAM attractive for two Post-PC applications because of low power,
small size, high memory bandwidth
  Gadgets: Embedded/Mobile devices
  Infrastructure: Intelligent Storage and Networks

PostPC infrastructure requires
  New Goals: Availability, Maintainability, Evolution
  New Principles: Introspection, Performance Robustness
  New Techniques: Isolation/fault insertion, Software scrubbing
  New Benchmarks: measure, compare AME metrics
85
Questions?
Contact us if you’re interested:
email: [email protected]
http://iram.cs.berkeley.edu/
86