IRAM and ISTORE Projects
Aaron Brown, James Beck, Rich Fromm, Joe Gebis,
Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton,
Christoforos Kozyrakis, David Martin, Rich Martin,
Thinh Nguyen, David Oppenheimer, Steve Pope,
Randi Thomas, Noah Treuhaft, Sam Williams,
John Kubiatowicz, Kathy Yelick, and David Patterson
http://iram.cs.berkeley.edu/[istore]
Winter 2000 IRAM/ISTORE Retreat
Slide 1
IRAM Vision: Intelligent PDA
Pilot PDA
+ gameboy, cell phone, radio, timer, camera, TV remote, am/fm radio, garage door opener, ...
+ Wireless data (WWW)
+ Speech, vision, video
+ Voice output for conversations
+ Speech control
+ Vision to see, scan documents, read bar codes, ...
Slide 3
ISTORE Hardware Vision
• System-on-a-chip enables computer, memory, without significantly increasing size of disk
• 5-7 year target:
• MicroDrive: 1.7” x 1.4” x 0.2”
  – 1999: 340 MB, 5400 RPM, 5 MB/s, 15 ms seek
  – 2006: 9 GB, 50 MB/s? (1.6X/yr capacity, 1.4X/yr BW)
• Integrated IRAM processor
  – 2x height
• Connected via crossbar switch
  – growing like Moore’s law
• 10,000+ nodes in one rack!
Slide 4
VIRAM: System on a Chip
Prototype scheduled for tape-out 1H 2000
• 0.18 um EDL process
• 16 MB DRAM, 8 banks (2 x 64 Mbits / 8 MBytes)
• MIPS scalar core and caches @ 200 MHz
• 4 64-bit vector unit pipelines (4 vector pipes/lanes) @ 200 MHz
• 4 100 MB parallel I/O lines
• 17x17 mm, 2 Watts
• 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar)
• 1.6 Gflops (64-bit), 6.4 GOPs (16-bit)
[Floorplan: CPU + caches and vector pipes connected through the Xbar to two 8-MByte DRAM halves and the I/O lines]
Slide 5
IRAM Architecture Update
• ISA mostly frozen since 6/99
– better fixed-point model and instructions
» gained some experience using them over past year
– better exception model
– better support for short vectors
» auto-increment memory addressing
» instructions for in-register reductions & butterfly permutations
– memory consistency model spec refined (poster)
• Suite of simulators actively used and maintained
– vsim-isa (functional), vsim-p (performance), vsim-db
(debugger), vsim-sync (memory synchronization)
Slide 6
IRAM Software Update
• Vectorizing Compiler for VIRAM
– retargeting CRAY vectorizing compiler (talk)
» Initial backend complete: scalar and vector instructions
» Extensive testing for correct functionality
» Instruction scheduling and performance tuning begun
• Applications using compiler underway
– Speech processing (talk)
– Small benchmarks; suggestions welcome
• Hand-coded fixed point applications
– Video encoder application complete (poster)
– FFT: floating point done, fixed point started (talk)
Slide 7
IRAM Chip Update
•IBM to supply embedded DRAM/Logic (98%)
–DRAM macro added to 0.18 micron logic process
–DRAM specs under NDA; final agreement in UCB bureaucracy
•MIPS to supply scalar core (99%)
–MIPS processor, caches, TLB
•MIT to supply FPU (100%)
–single precision (32 bit) only
•VIRAM-1 Tape-out scheduled for mid-2000
–Some updates of micro-architecture based on benchmarks (talk)
–Layout of multiplier (poster), register file nearly complete
–Test strategy developed (talk)
–Demo system high level hardware design complete (talk)
–Network interface design complete (talk)
Slide 8
VIRAM-1 block diagram
Slide 9
Microarchitecture configuration
• 2 arithmetic units
  – both execute integer operations
  – one executes FP operations
  – 4 64-bit datapaths (lanes) per unit
• 2 flag processing units
  – for conditional execution and speculation support
• 1 load-store unit
  – optimized for strides 1, 2, 3, and 4
  – 4 addresses/cycle for indexed and strided operations
  – decoupled indexed and strided stores
• Memory system
  – 8 DRAM banks
  – 256-bit synchronous interface
  – 1 sub-bank per bank
  – 16 Mbytes total capacity
• Peak performance
  – 3.2 GOPS64, 12.8 GOPS16 (w. madd)
  – 1.6 GOPS64, 6.4 GOPS16 (wo. madd)
  – 0.8 GFLOPS64, 3.2 GFLOPS32 (w. madd)
  – 6.4 Gbyte/s memory bandwidth
Slide 10
Media Kernel Performance
Kernel                 Peak Perf.    Sustained Perf.   % of Peak
Image Composition      6.4 GOPS      6.40 GOPS         100.0%
iDCT                   6.4 GOPS      1.97 GOPS          30.7%
Color Conversion       3.2 GOPS      3.07 GOPS          96.0%
Image Convolution      3.2 GOPS      3.16 GOPS          98.7%
Integer MV Multiply    3.2 GOPS      2.77 GOPS          86.5%
Integer VM Multiply    3.2 GOPS      3.00 GOPS          93.7%
FP MV Multiply         3.2 GFLOPS    2.80 GFLOPS        87.5%
FP VM Multiply         3.2 GFLOPS    3.19 GFLOPS        99.6%
AVERAGE                                                 86.6%
Slide 11
Base-line system comparison
Kernel              VIRAM    MMX             VIS             TMS320C82
Image Composition   0.13     -               2.22 (17.0x)    -
iDCT                1.18     3.75 (3.2x)     -               -
Color Conversion    0.78     8.00 (10.2x)    -               5.70 (7.6x)
Image Convolution   5.49     5.49 (4.5x)     6.19 (5.1x)     6.50 (5.3x)

• All numbers in cycles/pixel
• MMX and VIS results assume all data in L1 cache
Slide 12
Scaling to 10K Processors
• IRAM + micro-disk offer huge scaling opportunities
• Still many hard system problems, SAM AME (talk)
– Availability
» 24x7 databases without human intervention
» Discrete vs. continuous model of machine being up
– Maintainability
» 42% of system failures are due to administrative errors
» self-monitoring, tuning, and repair
– Evolution
» Dynamic scaling with plug-and-play components
» Scalable performance, gracefully down as well as up
» Machines become heterogeneous in performance at scale
Slide 13
ISTORE-1: Hardware for AME
Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault injection hardware
–intelligence used to collect and filter monitoring data
–diagnostics and fault injection enhance robustness
–networked to create a scalable shared-nothing cluster
Intelligent Chassis
• 80 nodes, 8 per tray
• 2 levels of switches
  – 20 100 Mb/s
  – 2 1 Gb/s
• Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors, ...
Intelligent Disk “Brick”
• Portable PC processor: Pentium II + DRAM
• Redundant NICs (4 100 Mb/s links)
• Diagnostic processor
• Disk
• Half-height canister
Slide 14
ISTORE Brick Block Diagram
[Block diagram: Mobile Pentium II module (CPU, North Bridge, South Bridge, 256 MB DRAM, BIOS, Super I/O), SCSI disk (18 GB), 4x100 Mb/s Ethernets, and PCI bus; a Diagnostic Processor (monitor & control, Flash, RAM, RTC) is linked to the diagnostic net through a dual UART]
• Sensors for heat and vibration
• Control over power to individual nodes
Slide 15
ISTORE Software Approach
• Two-pronged approach to providing reliability:
1) reactive self-maintenance: dynamic reaction to
exceptional system events
» self-diagnosing, self-monitoring hardware
» software monitoring and problem detection
» automatic reaction to detected problems
2) proactive self-maintenance: continuous online self-testing and self-analysis
» automatic characterization of system components
» in situ fault injection, self-testing, and scrubbing to detect
flaky hardware components and to exercise rarely-taken
application code paths before they’re used
Slide 16
ISTORE Applications
• Storage-intensive, reliable services for ISTORE-1
– infrastructure for “thin clients,” e.g., PDAs
– web services, such as mail and storage
– large-scale databases (talk)
– information retrieval (search and on-the-fly indexing)
• Scalable memory-intensive computations for ISTORE in 2006
– Performance estimates through IRAM simulation + model
» not major emphasis
– Large-scale defense and scientific applications enabled by
high memory bw and arithmetic performance
Slide 17
Performance Availability
• System performance limited by the weakest link
• NOW Sort experience: performance heterogeneity is the norm
  – disks: inner vs. outer track (50%), fragmentation
  – processors: load (1.5-5x) and heat
• Virtual Streams: dynamically off-load I/O work from slower disks to faster ones
[Graph: minimum per-process bandwidth (MB/sec, 0-6) vs. efficiency of a single slow disk (100%, 67%, 39%, 29%) for Ideal, Virtual Streams, and Static]
Slide 18
ISTORE Update
• High level hardware design by UCB complete (talk)
– Design of ISTORE boards handed off to Anigma
  » First run complete; SCSI problem to be fixed
  » Testing of UCB design (DP) to start ASAP
  » 10 nodes by end of 1Q 2000, 80 by 2Q 2000
  » Design of BIOS handed off to AMI
– Most parts donated or discounted
  » Adaptec, Andataco, IBM, Intel, Micron, Motorola, Packet Engines
• Proposal for Quantifying AME (talk)
• Beginning work on short-term applications that will be used to drive principled system design
  » Mail server
  » Web server
  » Large database
  » Decision support primitives
Slide 19
Conclusions
• IRAM attractive for two Post-PC applications because
of low power, small size, high memory bandwidth
– Mobile consumer electronic devices
– Scaleable infrastructure
• IRAM benchmarking result: faster than DSPs
• ISTORE: hardware/software architecture for large
scale network services
• Scaling systems requires
– new continuous models of availability
– performance not limited by the weakest link
– self* systems to reduce human interaction
Slide 20
Backup Slides
Slide 21
Introduction and Ground Rules
• Who is here?
– Mixed IRAM/ISTORE “experience”
• Questions are welcome during talks
• Schedule: lecture from Brewster Kahle during
Thursday’s Open Mic Session.
• Feedback is required (Fri am)
– Be careful, we have been known to listen to you
• Mixed experience: please ask
• Time for skiing and talking tomorrow afternoon
Slide 22
2006 ISTORE
• ISTORE node
– Add 20% pad to MicroDrive size for packaging,
connectors
– Then double thickness to add IRAM
– 2.0” x 1.7” x 0.5” (51 mm x 43 mm x 13 mm)
• Crossbar switches growing by Moore’s Law
  – 2x/1.5 yrs → 4X transistors/3 yrs
  – Crossbars grow by N² → 2X switch/3 yrs
  – 16 x 16 in 1999 → 64 x 64 in 2005
• ISTORE rack (19” x 33” x 84”)
  – 1 tray (3” high) → 16 x 32 → 512 ISTORE nodes / tray
  – 20 trays + switches + UPS → 10,240 ISTORE nodes / rack (!)
Slide 23
IRAM/VSUIF Decryption (IDEA)
[Graph: GOP/s (0-8) vs. virtual processor width (16, 32, 64) for 2, 4, and 8 lanes]
• IDEA decryption operates on 16-bit ints
• Compiled with IRAM/VSUIF
• Note scalability of both # lanes and data width
• Some hand-optimizations (unrolling) will be automated by Cray compiler
Slide 24
1D FFT on IRAM
FFT study on IRAM
– bit-reversal time included; cost hidden using indexed store
– Faster than DSPs on floating point (32-bit) FFTs
– CRI Pathfinder does 24-bit fixed point, 1K points in 28 usec
(2 Watts without SRAM)
Slide 25
3D FFT on ISTORE 2006
• Performance of large 3D FFTs depends on 2 factors
  – speed of 1D FFT on a single node (next slide)
  – network bandwidth for “transposing” data
[Graph: FFT performance (GFlops, 0-90) for n x n x n problems on 64 nodes vs. log n (1-11), comparing UCB NOW, ISTORE-1 w/ IRAM (4 x 1 Gb/s links), and a scaled net (8 x 1 Gb/s links)]
• 1.3 Tflop FFT possible w/ 1K IRAM nodes, if network bisection bandwidth scales (!)
Slide 26
ISTORE-1 System Layout
[Layout diagram: 8 brick shelves]
Slide 27
V-IRAM1: 0.18 µm, Fast Logic, 200 MHz
1.6 GFLOPS(64b)/6.4 GOPS(16b)/32MB
[Block diagram: a 2-way superscalar MIPS processor with 16K I-cache and 16K D-cache, a vector instruction queue, vector registers, and vector pipelines for add (+), multiply (x), divide (÷), and load/store, organized as 4 x 64-bit lanes (usable as 8 x 32-bit or 16 x 16-bit); a memory crossbar switch connects the lanes to the DRAM macros (M) and to I/O lines at 100 MB each]
Slide 28
Fixed-point multiply-add model
[Datapath diagram: half-words x (n/2 bits) and y (n/2 bits) are multiplied, shifted, and rounded to an n-bit value z; z is then added to a (n bits) with saturation, producing the n-bit result w]
• Same basic model, different set of instructions
– fixed-point: multiply & shift & round, shift
right & round, shift left & saturate
– integer saturated arithmetic: add or sub &
saturate
– added multiply-add instruction for improved
performance and energy consumption
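The datapath above can be summarized in a few lines of C. This is a minimal sketch assuming 16-bit half-word operands, a 32-bit accumulator, a programmable right shift with round-to-nearest, and saturation on the final add; the function name and exact widths are illustrative, not the VIRAM ISA definition.

  #include <stdint.h>

  /* saturate a 64-bit intermediate to the 32-bit result range */
  static int32_t sat32(int64_t v) {
      if (v > INT32_MAX) return INT32_MAX;
      if (v < INT32_MIN) return INT32_MIN;
      return (int32_t)v;
  }

  /* w = sat(a + round((x * y) >> shift)): multiply half-words, shift &
     round the product, then add & saturate (arithmetic shift assumed). */
  int32_t fx_madd(int16_t x, int16_t y, int32_t a, unsigned shift) {
      int64_t prod = (int64_t)x * (int64_t)y;
      if (shift > 0)
          prod += (int64_t)1 << (shift - 1);   /* round to nearest */
      prod >>= shift;
      return sat32((int64_t)a + prod);
  }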
Slide 29
Other ISA modifications
• Auto-increment loads/stores
– a vector load/store can post-increment its base
address
– added base (16), stride (8), and increment (8)
registers
– necessary for applications with short vectors or
scaled-up implementations
• Butterfly permutation instructions
  – perform a step of a butterfly permutation within a vector register
  – used for FFT and reduction operations (see the sketch below)
• Miscellaneous instructions added
– min and max instructions (integer and FP)
– FP reciprocal and reciprocal square root
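To make the reduction use concrete, here is a scalar C sketch of the halving pattern that an in-register butterfly permutation accelerates: each step adds every element to its partner half a vector length away. The element type and power-of-two vector length are assumptions for the example, not part of the ISA.

  #include <stddef.h>

  /* Sum-reduce v[0..vl-1] (vl assumed a power of two, vl >= 1) by
     repeatedly adding each element to its butterfly partner half a
     stride away; a vector unit can do each step as one permute + add. */
  float butterfly_sum(float v[], size_t vl) {
      for (size_t half = vl / 2; half >= 1; half /= 2)
          for (size_t i = 0; i < half; i++)
              v[i] += v[i + half];
      return v[0];
  }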
Slide 30
Major architecture updates
• Integer arithmetic units support multiply-add
instructions
• 1 load store unit
– complexity vs. benefit
• Optimize for strides 2, 3, and 4
– useful for complex arithmetic and image processing
functions
• Decoupled strided and indexed stores
  – memory stalls due to bank conflicts do not stall the arithmetic pipelines
  – allows scheduling of independent arithmetic operations in parallel with stores that experience many stalls
  – implemented with address, not data, buffering
  – currently examining a similar optimization for loads
Slide 31
Micro-kernel results: simulated systems

Parameter                                      1 Lane   2 Lanes   4 Lanes   8 Lanes
# of 64-bit lanes                              1        2         4         8
Addresses/cycle for strided-indexed accesses   1        2         4         8
Crossbar width                                 64b      128b      256b      512b
Width of DRAM bank interface                   64b      128b      256b      512b
DRAM banks                                     8        8         8         8

• Note: simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4
Slide 32
Micro-kernels
Benchmark                        Data Type     Width   Memory Accesses        Other Comments
Image Composition (Blending)     Integer       16b     Unit-stride
2D iDCT (8x8 image blocks)       Integer       16b     Unit-stride, Strided
Color Conversion (RGB to YUV)    Integer       32b     Unit-stride
Image Convolution                Integer       32b     Unit-stride
Matrix-vector Multiply (MV)      Integer, FP   32b     Unit-stride            Uses reductions
Vector-matrix Multiply (VM)      Integer, FP   32b     Unit-stride

• Vectorization and scheduling performed manually
Slide 33
Scaled system results
[Bar chart: speedup (0-8) with 1, 2, 4, and 8 lanes for Compositing, iDCT, Color Conversion, Convolution, MxV INT (32), VxM INT (32), MxV FP (32), and VxM FP (32)]
• Near linear speedup for all applications apart from iDCT
• iDCT bottlenecks
  – large number of bank conflicts
  – 4 addresses/cycle for strided accesses
Slide 34
iDCT scaling with sub-banks
[Bar chart: iDCT speedup (0-8) with 1, 2, 4, and 8 lanes for 1, 2, 4, and 8 sub-banks per bank]
• Sub-banks reduce bank conflicts and increase
performance
• Alternative (but not as effective) ways to reduce
conflicts:
– different memory layout
– different address interleaving schemes
Slide 35
Compiling for VIRAM
• Long-term success of DIS technology depends on
simple programming model, i.e., a compiler
• Needs to handle significant class of applications
– IRAM: multimedia, graphics, speech and image
processing
– ISTORE: databases, signal processing, other DIS
benchmarks
• Needs to utilize hardware features for performance
– IRAM: vectorization
– ISTORE: scalability of shared-nothing programming
model
Slide 36
IRAM Compilers
• IRAM/Cray vectorizing compiler [Judd]
– Production compiler
» Used on the T90, C90, as well as the T3D and T3E
» Being ported (by SGI/Cray) to the SV2 architecture
– Has C, C++, and Fortran front-ends (focus on C)
– Extensive vectorization capability
» outer loop vectorization, scatter/gather, short loops, …
– VIRAM port is under way
• IRAM/VSUIF vectorizing compiler [Krashinsky]
– Based on VSUIF from Corinna Lee’s group at Toronto which
is based on MachineSUIF from Mike Smith’s group at
Harvard which is based on SUIF compiler from Monica
Lam’s group at Stanford
– This is a “research” compiler, not intended for compiling large complex applications
– It has been working since 5/99
Slide 37
IRAM/Cray Compiler Status
[Diagram: C, C++, and Fortran front-ends feed the vectorizer (PDGCS), which feeds the C90 and IRAM code generators]
• MIPS backend developed this year
– Validated using a commercial test suite for code
generation
• Vector backend recently started
– Testing with simulator under way
• Leveraging from Cray
– Automatic vectorization
Slide 38
VIRAM/VSUIF Matrix/Vector Multiply
• VIRAM/VSUIF does reasonably well on long loops
[Bar chart: Mflop/s (0-1200) for dot, saxpy, mvm, vmm, padded, and hand-optimized versions]
• 256x256 single matrix
• Compare to 1600 Mflop/s (peak without multadd)
• Note BLAS-2 (little reuse); ~350 Mflop/s on Power3 and EV6
• Problems specific to VSUIF
  – hand strip-mining results in short loops
  – reductions
  – no multadd support
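For reference, the loop nest being compiled here is ordinary BLAS-2 matrix-vector multiply; a plain C version (row-major layout and sizes assumed for illustration) is shown below. The inner loop is what gets strip-mined and vectorized, and the lack of reuse of the matrix is why sustained Mflop/s sit well below peak.

  /* y = A*x for an n x n row-major matrix: the inner (j) loop vectorizes
     but is a reduction, and each element of a is touched exactly once,
     so the kernel is memory-bandwidth bound (BLAS-2). */
  void mvm(int n, const float *a, const float *x, float *y) {
      for (int i = 0; i < n; i++) {
          float sum = 0.0f;
          for (int j = 0; j < n; j++)
              sum += a[i * n + j] * x[j];
          y[i] = sum;
      }
  }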
Slide 39
Reactive Self-Maintenance
• ISTORE defines a layered system model for
monitoring and reaction:
[Diagram: layered model showing problem detection, SW monitoring, policies, reaction mechanisms, coordination of reaction, and self-monitoring hardware, split between what is provided by the application and what is provided by the ISTORE runtime system, with the ISTORE API between the layers]
• ISTORE API defines interface between runtime
system and app. reaction mechanisms
• Policies define system’s monitoring, detection, and
reaction behavior
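A minimal sketch of the monitor/detect/react loop this layered model implies, written as plain C; the structure name, threshold-style policy, and callback interface are assumptions made for illustration and are not the ISTORE API.

  #include <stdbool.h>

  /* One monitored metric, its policy threshold, and the reaction hook
     the application registers (all names hypothetical). */
  struct policy {
      const char *metric;                 /* e.g. "disk_latency_ms" */
      double      threshold;              /* policy: when to react */
      double    (*sample)(void);          /* runtime-system monitoring */
      void      (*react)(const char *);   /* application reaction mechanism */
  };

  /* Runtime-system side: sample each metric, detect problems, and
     coordinate the reaction by invoking the application's hook. */
  void monitor_step(struct policy *p, int n) {
      for (int i = 0; i < n; i++) {
          double v = p[i].sample();
          bool problem = (v > p[i].threshold);
          if (problem)
              p[i].react(p[i].metric);
      }
  }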
Slide 40
Proactive Self-Maintenance
• Continuous online self-testing of HW and SW
– detects flaky, failing, or buggy components via:
» fault injection: triggering hardware and software error
handling paths to verify their integrity/existence
» stress testing: pushing HW/SW components past normal
operating parameters
» scrubbing: periodic restoration of potentially “decaying”
hardware or software state
– automates preventive maintenance
• Dynamic HW/SW component characterization
– used to adapt to heterogeneous hardware and
behavior of application software components
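As one concrete illustration of scrubbing, a sketch of a routine that periodically re-reads stored blocks and verifies a checksum so that decayed state is found (and can be repaired from a replica) before an application needs it; the checksum and block layout are assumptions, not an ISTORE mechanism.

  #include <stddef.h>
  #include <stdint.h>

  /* Toy additive checksum; a real scrubber would use ECC or a CRC. */
  static uint32_t checksum(const uint8_t *p, size_t len) {
      uint32_t s = 0;
      while (len--) s += *p++;
      return s;
  }

  /* Walk every block, recompute its checksum, and count mismatches so
     decayed blocks can be repaired before an application reads them. */
  int scrub(const uint8_t *blocks, size_t nblocks, size_t blksz,
            const uint32_t *stored_sums) {
      int bad = 0;
      for (size_t b = 0; b < nblocks; b++)
          if (checksum(blocks + b * blksz, blksz) != stored_sums[b])
              bad++;
      return bad;
  }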
Slide 41
ISTORE-0 Prototype and Plans
• ISTORE-0: testbed for early experimentation with
ISTORE research ideas
• Hardware: cluster of 6 PCs
– intended to model ISTORE-1 using COTS
components
– nodes interconnected using ISTORE-1 network
fabric
– custom fault-injection hardware on subset of nodes
• Initial research plans
– runtime system software
– fault injection
– scalability, availability, maintainability
benchmarking
– applications: block storage server, database, FFT
Slide 42
Runtime System Software
• Demonstrate simple policy-driven adaptation
– within context of a single OS and application
– software monitoring information collected and
processed in realtime
» e.g., health & performance parameters of OS, application
– problem detection and coordination of reaction
» controlled by a stock set of configurable policies
– application-level adaptation mechanisms
» invoked to implement reaction
• Use experience to inform ISTORE API design
• Investigate reinforcement learning as technique to infer
appropriate reactions from goals
Slide 43
Record-breaking performance is
not the common case
• NOW-Sort records demonstrate peak performance
• But perturb just 1 of 8 nodes and...
[Bar chart: slowdown (0-5) for best case, bad disk layout, busy disk, light CPU load, heavy CPU load, and paging]
Slide 44
Virtual Streams: Dynamic Load Balancing for I/O
• Replicas of data serve as second sources
• Maintain a notion of each process’s progress
• Arbitrate use of disks to ensure equal progress
• The right behavior, but what mechanism?
[Diagram: processes reach each disk through the Virtual Streams software layer, which interposes an arbiter in front of the disk]
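One possible arbitration mechanism, sketched in C: always issue the next I/O on behalf of the process that is furthest behind, and serve it from whichever replica's disk has the least outstanding work. The data structures and selection rule are invented for illustration; Graduated Declustering (next slide) is the mechanism actually studied.

  #include <stddef.h>

  struct stream { double progress; };   /* bytes delivered so far */
  struct disk   { double queued;   };   /* outstanding work on this disk */

  /* Pick the process that has made the least progress ... */
  size_t pick_stream(const struct stream *s, size_t n) {
      size_t best = 0;
      for (size_t i = 1; i < n; i++)
          if (s[i].progress < s[best].progress) best = i;
      return best;
  }

  /* ... and serve it from whichever of its replicas sits on the disk
     with the least queued work, so slow disks are off-loaded. */
  size_t pick_replica(const struct disk *d, const size_t *replica_disks,
                      size_t r) {
      size_t best = replica_disks[0];
      for (size_t i = 1; i < r; i++)
          if (d[replica_disks[i]].queued < d[best].queued)
              best = replica_disks[i];
      return best;
  }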
Slide 45
Graduated Declustering:
A Virtual Streams implementation
• Clients send progress, servers schedule in response
[Diagram: 4 clients read from 4 servers, with each client striped across two servers. Before slowdown, every server delivers B/2 to each of its two clients, so each client receives bandwidth B. After Server1 slows to B/2, it delivers only B/4 to each of its clients; the other servers shift their shares (5B/8 to the penalized clients, 3B/8 and B/2 elsewhere) so that every client still receives an equal 7B/8]
Slide 46
Read Performance:
Multiple Slow Disks
[Graph: minimum per-process bandwidth (MB/sec, 0-6) vs. number of slow disks out of 8 (0-8) for Ideal, Virtual Streams, and Static]
Slide 47
Storage Priorities: Research v. Users
Traditional Research Priorities         ISTORE Priorities
1) Performance     } easy to            1) Maintainability  } hard to
1’) Cost           } measure            2) Availability     } measure
3) Scalability                          3) Scalability
4) Availability                         4) Performance
5) Maintainability                      5) Cost
Slide 48
Intelligent Storage Project Goals
• ISTORE: a hardware/software architecture
for building scaleable, self-maintaining
storage
– An introspective system: it monitors itself
and acts on its observations
• Self-maintenance: does not rely on
administrators to configure, monitor, or tune
system
Slide 49
Self-maintenance
• Failure management
– devices must fail fast without interrupting service
– predict failures and initiate replacement
– failures ≠ immediate human intervention
• System upgrades and scaling
– new hardware automatically incorporated without
interruption
– new devices immediately improve performance or
repair failures
• Performance management
– system must adapt to changes in workload or
access patterns
Slide 50
ISTORE-I: 2H99
• Intelligent disk
– Portable PC Hardware: Pentium II, DRAM
– Low Profile SCSI Disk (9 to 18 GB)
– 4 100-Mbit/s Ethernet links per node
– Placed inside Half-height canister
– Monitor Processor/path to power off components?
• Intelligent Chassis
– 64 nodes: 8 enclosures, 8 nodes/enclosure
» 64 x 4 or 256 Ethernet ports
– 2 levels of Ethernet switches: 14 small, 2 large
» Small: 20 100-Mbit/s + 2 1-Gbit; Large: 25 1-Gbit
» Just for prototype; crossbar chips for real system
– Enclosure sensing, UPS, redundant PS, fans, ...
Slide 51
Disk Limit
• Continued advance in capacity (60%/yr) and
bandwidth (40%/yr)
• Slow improvement in seek, rotation (8%/yr)
• Time to read whole disk

Year   Sequentially   Randomly (1 sector/seek)
1990   4 minutes      6 hours
1999   35 minutes     1 week (!)

• Does the 3.5” form factor make sense in 5-7 years?
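A back-of-the-envelope check of the 1999 row, using the Seagate Cheetah 36 figures quoted later in these slides (36.4 GB, roughly 17 MB/s of user data, 5.2 ms average seek, 10,000 RPM, about 71 million sectors). The exact drive behind the table is not stated, so treat this as an approximation.

  #include <stdio.h>

  int main(void) {
      double capacity_mb  = 36.4e3;                 /* 36.4 GB */
      double seq_rate_mbs = 17.0;                   /* ~user-data media rate */
      double sectors      = 71.1e6;
      double seek_s       = 5.2e-3;                 /* average seek */
      double half_rot_s   = 0.5 * 60.0 / 10000.0;   /* avg rotational latency */

      double seq_min   = capacity_mb / seq_rate_mbs / 60.0;
      double rand_days = sectors * (seek_s + half_rot_s) / 86400.0;

      printf("sequential: %.0f minutes\n", seq_min);   /* ~36 minutes */
      printf("random: %.1f days\n", rand_days);        /* ~6.8 days, i.e. ~1 week */
      return 0;
  }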
Slide 52
Related Work
• ISTORE adds to several recent research efforts
• Active Disks, NASD (UCSB, CMU)
• Network service appliances (NetApp, Snap!, Qube, ...)
• High availability systems (Compaq/Tandem, ...)
• Adaptive systems (HP AutoRAID, M/S AutoAdmin, M/S Millennium)
• Plug-and-play system construction (Jini, PC Plug&Play, ...)
Slide 53
Other (Potential) Benefits of ISTORE
• Scalability: add processing power, memory, and network bandwidth as disks are added
• Smaller footprint vs. traditional server/disk
• Less power
– embedded processors vs. servers
– spin down idle disks?
• For decision-support or web-service applications,
potentially better performance than traditional servers
Slide 54
Disk Limit: I/O Buses
• Multiple copies of data, SW layers
• Cannot use 100% of bus
  – Queuing theory (< 70%)
  – Command overhead (effective size = size x 1.2)
• Bus rate vs. disk rate
  – SCSI: Ultra2 (40 MHz), Wide (16 bit): 80 MByte/s
  – FC-AL: 1 Gbit/s = 125 MByte/s (single disk in 2002)
[Diagram: CPU and memory on the memory bus; controllers (C) bridge to an internal (PCI) I/O bus and an external (SCSI) I/O bus whose controllers attach 15 disks]
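Combining the two bullets above gives a rough estimate of how much user data one such bus can move, and how few disks it takes to saturate it; applying the 70% queuing limit and the 1.2x command overhead multiplicatively is an assumption for illustration, not a measured result.

  #include <stdio.h>

  int main(void) {
      double bus_mbs      = 80.0;   /* Ultra2 Wide SCSI peak */
      double queue_limit  = 0.70;   /* queuing theory: stay under ~70% */
      double cmd_overhead = 1.2;    /* effective size = size x 1.2 */
      double disk_mbs     = 17.0;   /* ~one Cheetah's user-data rate */

      double usable = bus_mbs * queue_limit / cmd_overhead;      /* ~47 MB/s */
      printf("usable bus bandwidth: %.0f MB/s\n", usable);
      printf("disks to saturate it: %.1f\n", usable / disk_mbs); /* ~2.8 */
      return 0;
  }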
Slide 55
State of the Art: Seagate Cheetah 36
– 36.4 GB, 3.5 inch disk
– 12 platters, 24 surfaces
– 10,000 RPM
– 18.3 to 28 MB/s internal
media transfer rate
(14 to 21 MB/s user data)
– 9772 cylinders (tracks),
(71,132,960 sectors total)
– Avg. seek: read 5.2 ms, write 6.0 ms (max. seek: 12/13 ms; 1 track: 0.6/0.9 ms)
– $2100 or 17MB/$ (6¢/MB)
(list price)
– 0.15 ms controller time
source: www.seagate.com
Slide 56
User Decision Support Demand
vs. Processor speed
• Database demand: 2X / 9-12 months (“Greg’s Law”)
• CPU speed (“Moore’s Law”): 2X / 18 months
• The difference is a growing database-processor performance gap
[Graph: relative growth (1 to 100, log scale) from 1996 to 2000, with database demand pulling away from CPU speed]
Slide 57