Introduction

Graduate Computer Architecture I
Lecture 1: Introduction
Young Cho
Computer Architecture
[Figure: forces that shape computer architecture. Labels: Applications, Operating System, Compiler, Firmware, Programming Languages, Technology, History; Instruction Set Architecture (Instr. Set Processing, I/O system); Datapath & Control; Digital Design; Circuit Design; Layout & Fab; Semiconductor Materials]
2 - CSE/ESE 560M – Graduate Computer Architecture I
Instruction Set Architecture
Q. What is an ISA?
A. Set of elementary instructions
A “Good” ISA …
• Lasts through many generations (PORTABILITY)
• Used in many different ways (GENERALITY)
• Provides CONVENIENT functionality to higher levels
• Permits an EFFICIENT implementation at lower levels
[Figure: cartoon with the captions “ROBOT!” and “Draw me a Basketball Player!”]
Instruction Set Architecture
• Programmer’s Point of View
– Organization of Programmable Storage
– Data Types & Structures: Encodings & Representations
– Instruction Formats
– Instruction (or Operation Code) Set
– Modes of Addressing and Accessing Data Items and Instructions
– Exceptional Conditions
• Logic Designer’s Point of View
– Capabilities & Performance Characteristics of Principal Functional
Units (e.g., Registers, ALU, Shifters, Logic Units, ...)
– Ways in which these components are interconnected
– Information flows between components
– Logic and means by which such information flow is controlled.
– Choreography of FUs to realize the ISA
– Register Transfer Level (RTL) Description
Execution Cycle
• Instruction Fetch: Obtain instruction from storage
• Instruction Decode: Determine required functionality
• Operand Fetch: Locate and obtain operand data
• Execute: Find result or status
• Result Store: Deposit results in storage for later use
• Next Instruction: Determine next instruction
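The execution cycle can be sketched as a small interpreter loop. This is only an illustration: the three opcodes (LOADI, ADD, HALT), the tuple encoding, and the register names are hypothetical and do not come from the lecture.

```python
def run(program):
    """Run a list of (opcode, dst, src1, src2) tuples on a toy register machine."""
    regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 0}  # programmable storage (hypothetical)
    pc = 0                                        # program counter
    while True:
        instr = program[pc]                       # instruction fetch
        op, dst, a, b = instr                     # instruction decode
        if op == "HALT":
            return regs
        if op == "LOADI":                         # operand is a literal
            result = a
        elif op == "ADD":                         # operand fetch from registers
            result = regs[a] + regs[b]            # execute
        else:
            raise ValueError(f"unknown opcode {op!r}")
        regs[dst] = result                        # result store
        pc += 1                                   # next instruction


program = [
    ("LOADI", "r1", 5, None),
    ("LOADI", "r2", 7, None),
    ("ADD", "r3", "r1", "r2"),
    ("HALT", None, None, None),
]
final_regs = run(program)
print(final_regs["r3"])
```

Each comment marks the step of the cycle that the line stands in for; a real processor overlaps these steps rather than running them one at a time.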
Elements of ISA
[Figure: processor (regs, F.U.s) connected to storage (program, data)]
• Set of machine-recognized data types
– bytes, words, integers, floating point, strings, . . .
• Operations performed on those data types
– Add, sub, mul, div, xor, move, ….
• Programmable storage
– regs, PC, memory
• Methods of identifying and obtaining data
referenced by instructions (addressing modes)
– Literal, reg., absolute, relative, reg + offset, …
• Format (encoding) of the instructions
– Op code, operand fields, …
Current Logical State of the Machine → Next Logical State of the Machine
Example: MIPS R3000
[Figure: registers r0 (always 0), r1 ... r31, plus PC, lo, hi]
• Programmable storage
– 2^32 x bytes
– 31 x 32-bit GPRs (R0=0)
– 32 x 32-bit FP regs (paired DP)
– HI, LO, PC
• Data types? Format? Addressing Modes?
• Arithmetic logical
– Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
SLL, SRL, SRA, SLLV, SRLV, SRAV
• Memory Access
– LB, LBU, LH, LHU, LW, LWL, LWR
– SB, SH, SW, SWL, SWR
• Control
– J, JAL, JR, JALR
– BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
• 32-bit instructions on word boundary
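The slide asks about the instruction format. As one hedged illustration, the sketch below splits a 32-bit word into the standard MIPS R-type fields (6-bit opcode, 5-bit rs, rt, rd, shamt, and 6-bit funct). The function name `decode_r_type` is hypothetical, and the other formats (I-type, J-type) are not covered.

```python
def decode_r_type(word):
    """Split a 32-bit MIPS R-type instruction word into its six fields."""
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31..26
        "rs":     (word >> 21) & 0x1F,  # bits 25..21, first source register
        "rt":     (word >> 16) & 0x1F,  # bits 20..16, second source register
        "rd":     (word >> 11) & 0x1F,  # bits 15..11, destination register
        "shamt":  (word >> 6) & 0x1F,   # bits 10..6, shift amount
        "funct":  word & 0x3F,          # bits 5..0, selects the ALU operation
    }


# Encode "add r3, r1, r2": opcode 0, rs=1, rt=2, rd=3, shamt=0, funct=0x20
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x20
fields = decode_r_type(word)
print(hex(word), fields)
```

Fixed field positions are what keep decode simple: the register numbers can be read out before the opcode has been fully interpreted.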
History of ISA
• Single Accumulator (EDSAC 1950)
• Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
• Separation of Programming Model from Implementation
– High-level Language Based (Stack) (B5000 1963)
– Concept of a Family (IBM 360 1964)
• General Purpose Register Machines
– Complex Instruction Sets (Vax, Intel 432 1977-80)
– Load/Store Architecture (CDC 6600, Cray 1 1963-76)
• Intel X86?
• RISC (MIPS, Sparc, HP-PA, IBM RS6000, 1987)
Technology Trends: Microprocessor Capacity
[Figure: transistors per chip (log scale, 1,000 to 100,000,000) vs. year (1970-2000), following Moore’s Law: i4004, i8080, i8086, i80286, i80386, i80486, Pentium]
• Itanium II: 241 million
• Pentium 4: 55 million
• Alpha 21264: 15 million
• Pentium Pro: 5.5 million
• PowerPC 620: 6.9 million
• Alpha 21164: 9.3 million
• Sparc Ultra: 5.2 million
CMOS improvements:
• Die size: 2X every 3 yrs
• Line width: halve / 7 yrs
Memory Capacity (Single Chip DRAM)
[Figure: bits per chip (log scale, 1,000 to 1,000,000,000) vs. year (1970-2000)]
year    size (Mb)   cyc time
1980    0.0625      250 ns
1983    0.25        220 ns
1986    1           190 ns
1989    4           165 ns
1992    16          145 ns
1996    64          120 ns
2000    256         100 ns
2003    1024        60 ns
Technology Trends
• Clock Rate: ~30% per year
• Transistor Density: ~35% per year
• Chip Area: ~15% per year
• Transistors per chip: ~55% per year
• Total Perf Capability: ~100% per year
• Storage
– DRAMs ~4x every 3-4 years
– Disk Density (60% per year)
• Network bandwidth
– ~2x per year for the last decade
Performance Trends based on Types
[Figure: performance (log scale, 0.1 to 100) vs. year (1965-1995) for Supercomputers, Mainframes, Minicomputers, and Microprocessors]
Processor Performance
[Figure: processor performance (0 to 1200) vs. year (1987-1997), growing about 1.54X/yr. Machines plotted: Sun-4/260, MIPS M/120, MIPS M/2000, IBM RS/6000, HP 9000/750, DEC AXP/500, IBM POWER 100, DEC Alpha 4/266, DEC Alpha 5/300, DEC Alpha 5/500, DEC Alpha 21164/600]
Definition: Performance
• Performance is in units of things per second
– bigger is better
• If we are primarily concerned with response time
\[ \text{performance}(x) = \frac{1}{\text{exec\_time}(x)} \]
• "X is n times faster than Y" means
\[ n = \frac{\text{performance}(x)}{\text{performance}(y)} = \frac{\text{exec\_time}(y)}{\text{exec\_time}(x)} \]
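A short numeric check of the "n times faster" definition. The execution times below (10 s and 15 s) are made-up values, and the function names are illustrative only.

```python
def performance(exec_time):
    """Performance is the reciprocal of execution time."""
    return 1.0 / exec_time


def times_faster(exec_time_x, exec_time_y):
    """Return n such that X is n times faster than Y."""
    return performance(exec_time_x) / performance(exec_time_y)


# Hypothetical example: X takes 10 s, Y takes 15 s on the same program.
n = times_faster(exec_time_x=10.0, exec_time_y=15.0)
print(f"X is {n:.2f} times faster than Y")
```

The ratio of performances equals the inverse ratio of execution times, so the slower machine's time goes in the numerator.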
Metrics of Performance
[Figure: levels of the system, each paired with its performance metric]
• Application: Answers per day/month
• Programming Language, Compiler, ISA: (millions) of Instructions per second: MIPS; (millions) of (FP) operations per second: MFLOP/s
• Datapath, Control: Megabytes per second
• Function Units, Transistors Wires Pins: Cycles per second (clock rate)
Components of Performance
\[ \text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}} \]

                Inst Count    CPI    Clock Rate (Cycle Time)
Program         X
Compiler        X             (X)
Inst Set        X             X
Organization                  X      X
Technology                           X
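The CPU time relation can be exercised directly. In this sketch the instruction count, CPI values, and clock rate are invented numbers chosen only to make the arithmetic easy to follow.

```python
def cpu_time(inst_count, cpi, clock_rate_hz):
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    cycle_time = 1.0 / clock_rate_hz
    return inst_count * cpi * cycle_time


# Hypothetical machine: 1e9 instructions at 500 MHz.
base = cpu_time(inst_count=1_000_000_000, cpi=2.0, clock_rate_hz=500e6)
# Same program and clock, but a better organization lowers CPI to 1.5.
improved = cpu_time(inst_count=1_000_000_000, cpi=1.5, clock_rate_hz=500e6)

print(f"base: {base:.2f} s, improved: {improved:.2f} s, speedup: {base / improved:.2f}x")
```

Only one of the three factors changed here. In practice a change such as a new instruction set can move instruction count and CPI in opposite directions, so all three factors have to be evaluated together.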
What’s a Clock Cycle?
[Figure: latch or register, followed by combinational logic]
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight
issues + gate delays
– clock propagation, wire lengths, drivers
Architecture as a Whole
• Complete System Integration
– i.e. hardware, runtime system, compiler, and operating
system
– In networking, this is called the “End to End argument”
• Computer Architecture is Not Just About
– Transistors
– Individual Instructions
– Particular Implementations
• Original RISC projects replaced
– Exotic and complex instructions
– With a compiler + simple instructions
True Speed Measurement
Sail Boat Speed: 10 miles/hour
100 miles
Race Car: 100 miles/hour
Running: 5 miles/hour
0.1 mile
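Amdahl’s Law
\[ \text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right] \]
\[ \text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}} \]
Best you could ever hope to do:
\[ \text{Speedup}_{\text{maximum}} = \frac{1}{1 - \text{Fraction}_{\text{enhanced}}} \]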
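Amdahl’s Law is easy to check numerically. The sketch below assumes a made-up case where 40% of execution time is sped up by a factor of 10; the function names are illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only part of the execution time is improved."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def max_speedup(fraction_enhanced):
    """Limit of the overall speedup as the enhancement becomes infinitely fast."""
    return 1.0 / (1.0 - fraction_enhanced)


# Hypothetical case: 40% of the time is enhanced, and that part runs 10x faster.
overall = amdahl_speedup(fraction_enhanced=0.4, speedup_enhanced=10.0)
limit = max_speedup(fraction_enhanced=0.4)
print(f"overall speedup: {overall:.4f}, upper bound: {limit:.4f}")
```

Even a 10x improvement on 40% of the work yields well under 2x overall, because the untouched 60% dominates.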
Graduate Computer Architecture I
Administrative
Course Information
• Course Web Site
– http://www.arl.wustl.edu/~young/cse560m
• Prerequisites
– CSE 361S/CS 306S and CSE 260M
• Times and Location
– Lecture: Tuesday and Thursday 2:30 PM - 4:00 PM
– In Cupples II - Room 200
• Text book
– J. Hennessy and D. Patterson, Computer Architecture: A
Quantitative Approach, Third edition, Morgan-Kaufmann, 2003.
(ISBN: 1-55860-724-2).
– (Optional) D. Patterson and J. Hennessy, Computer Organization
and Design: The Hardware/Software Interface, Third Edition.
– (Optional) P. Ashenden, The Student's Guide to VHDL, Morgan-Kaufmann, 2003. (ISBN: 1-55860-520-7).
Course Goal
• General Overview of the Course: Three Parallel Tracks
– Lectures – Fast paced
– Literature Survey – Write exam quality question and answer related to the literature
– Project – Leading up to Final Project with Conference format document and PowerPoint slide presentation
• Grading
– Class Participation: 10%
– Survey Homework: 20%
– Exams/Quizzes: 20%
– Final Project: 50%
• Late Policy
– No late work unless special situations
• Academic Integrity
– Highest standards of academic integrity
– Refrain from the forms of misconduct
– Violations will lead to disciplinary action
– Uphold the highest standards of scholarship
– Plagiarism or other forms of cheating are not tolerated
Course Assignments
• Literature
– First two papers on-line: “What’s Next?” and “System Research”
– One Q&A for assigned paper is due at the start of the class
– Will use the best Questions for in-class Quizzes/Exams
• Project
– Will be using Xilinx ISE
– Available in Computer Lab
– May want to register and download WebPACK
– http://www.xilinx.com/ise/logic_design_prod/webpack.htm
– Limited, but may suffice for smaller modules
– Complete Tutorials and Start building 16-bit Structural VHDL ALU
• Diagnostic Quizzes
– Quiz A and B are take home quizzes designed to help you
– May want to refer to “Computer Organization and Design: The
Hardware/Software Interface”
– Attempt all the problems to receive full score
– You should know most of their contents by the end of the 3rd week
General Tips
• Try not to miss a Lecture
• Get to know the Instructor
• Don’t worry about the Grades
• Help from Other Students and Professor
• Produce Professional Results
• Quality Documentation and Presentation
Graduate Computer Architecture I
Course Preview
Pipelined Instruction Execution
[Figure: instructions in program order (vertical axis) vs. time in clock cycles, Cycle 1 to Cycle 7 (horizontal axis). Each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous instruction]
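The overlap in a pipeline diagram can be generated programmatically. This sketch assumes an ideal five-stage pipeline with no hazards or stalls; the function name and the choice of four instructions are illustrative.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1, or ''."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i occupies stage s in cycle i + s + 1
        rows.append(row)
    return rows


schedule = pipeline_schedule(4)
print("          " + " ".join(f"C{c + 1:<6}" for c in range(len(schedule[0]))))
for i, row in enumerate(schedule):
    print(f"instr {i + 1}:  " + " ".join(f"{cell:7}" for cell in row))
```

With k stages and n instructions the ideal pipeline finishes in k + n - 1 cycles, compared with k * n cycles if each instruction ran to completion before the next one started.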
Limits to Pipelining
• Maintain the von Neumann “illusion” of one
instruction at a time execution
• Hazards prevent next instruction from executing
during its designated clock cycle
– Structural hazards: attempt to use the same hardware
to do two different things at once
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
– Control hazards: Caused by delay between the fetching
of instructions and decisions about changes in control
flow (branches and jumps).
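As a rough illustration of data hazards, the sketch below scans a short instruction sequence for read-after-write dependences between nearby instructions. The (destination, sources) encoding and the `window` parameter are assumptions made for the example; real hazard detection depends on the pipeline's forwarding paths, and this simplified check ignores intervening writes to the same register.

```python
def raw_hazards(instructions, window=2):
    """List (producer_index, consumer_index, register) read-after-write pairs.

    Each instruction is (dest_register, [source_registers]). `window` is the
    assumed number of following instructions that could still read a stale value.
    """
    hazards = []
    for i, (dest, _) in enumerate(instructions):
        if dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(instructions))):
            if dest in instructions[j][1]:
                hazards.append((i, j, dest))
    return hazards


code = [
    ("r1", ["r2", "r3"]),  # add r1, r2, r3
    ("r4", ["r1", "r5"]),  # sub r4, r1, r5  (reads r1 right after it is written)
    ("r6", ["r7", "r8"]),  # and r6, r7, r8
    ("r9", ["r1", "r6"]),  # or  r9, r1, r6  (reads r6 right after it is written)
]
hazards = raw_hazards(code)
print(hazards)
```

The last instruction also reads r1, but it sits three instructions after the producer, which is outside the assumed window, so it is not reported.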
Instruction Level Parallelism
• 1st generation RISC - pipelined
– Full 32-bit processor fit on a chip => issue almost 1 IPC
• Need to access memory 1+x times per cycle
– Floating-Point unit on another chip
– Cache controller a third, off-chip cache
– 1 board per processor => multiprocessor systems
• 2nd generation: superscalar
– Processor and floating point unit on chip (and some cache)
– Issuing only one instruction per cycle uses at most half
– Fetch multiple instructions, issue a couple
• Grows from 2 to 4 to 8 …
– How to manage dependencies among all these instructions?
– Where does the parallelism come from?
• VLIW
– Expose some of the ILP to compiler, allow it to schedule
instructions to reduce dependences
Modern ILP
• Deep Pipelines
• Dynamically Scheduled
• Out-of-Order Execution
• Fetch Many (10’s) Instructions per cycle
• Automatically Resolve Dependencies
• Retains Sequential Instruction Execution
• Precise Interrupt or Exception Handling
• Huge Complexity – Is it still RISC?
How is ILP going to be Exploited?
• Multi-Threading
– Thread: locus of control, execution context
– Fetch instructions from multiple threads at once, throw them all into the execution unit
– Concept has existed in high performance computing for 20 years (or is it 40? CDC6600)
• Vector processing (SIMD)
– Each instruction processes many data
– Ex: MMX, 3DNow!, AltiVec, etc.
• Multiple Processors per Chip
– Higher Level of Architecture
– Interconnects
– Memory Architecture
[Figures: Multi-processor Chip; Tensilica Configurable Proc]
Performance Boost: Speculation
• Programs make decisions as they go
– Conditionals, loops, calls
– Translate into branches and jumps (1 of 5 instructions)
• How do you determine which instructions to fetch when the ones before them haven’t executed?
– Branch prediction
– Lots of clever machine structures to predict future based on history
– Machinery to back out of mis-predictions
• Execute all the possible branches
– Likely to hit additional branches, perform stores
– Speculative threads
– What can hardware do to make programming (with
performance) easier?
The Memory Abstraction
• Association of <name, value> pairs
– typically named as byte addresses
– often values aligned on multiples of size
• Sequence of Reads and Writes
• Write binds a value to an address
• Read of Addr returns most recently written value
bound to that address
[Figure: memory interface signals: command (R/W), address (name), data (W), data (R), done]
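The memory abstraction amounts to a map from addresses to values with read and write operations. A minimal sketch follows; the class name and the choice to return 0 for never-written addresses are assumptions, not part of the lecture.

```python
class Memory:
    """Byte-addressed store of <address, value> pairs."""

    def __init__(self):
        self._cells = {}

    def write(self, addr, value):
        """Bind a byte value to an address, replacing any earlier binding."""
        self._cells[addr] = value & 0xFF

    def read(self, addr):
        """Return the most recently written value (0 if never written: an assumption)."""
        return self._cells.get(addr, 0)


mem = Memory()
mem.write(0x1000, 0x11)
mem.write(0x1000, 0x22)  # a later write rebinds the same address
print(hex(mem.read(0x1000)))
```

Caches and parallel memory systems have to preserve exactly this behavior: a read returns the latest write to that address, however many copies exist underneath.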
Processor-DRAM Memory Gap (latency)
[Figure: performance (log scale, 1 to 1000) vs. time (1980-2000) for CPU and DRAM]
• µProc (CPU, “Joy’s Law”): 60%/yr. (2X/1.5yr)
• DRAM: 9%/yr. (2X/10 yrs)
• Processor-Memory Performance Gap: (grows 50% / year)
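The gap growth rate follows from dividing the two growth rates on the slide. This small calculation uses only the 60% and 9% figures; the 10-year horizon is an arbitrary choice for illustration.

```python
cpu_growth = 1.60   # µProc performance: 60% per year
dram_growth = 1.09  # DRAM performance: 9% per year

gap_growth = cpu_growth / dram_growth   # yearly growth of the CPU/DRAM ratio
gap_after_10_years = gap_growth ** 10   # compounded over an assumed 10 years

print(f"gap grows about {100 * (gap_growth - 1):.0f}% per year")
print(f"after 10 years the ratio is about {gap_after_10_years:.0f}x")
```

The computed yearly rate comes out slightly below 50%, which is consistent with the rounded figure on the slide.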
Levels of the Memory Hierarchy
(Upper levels are faster; lower levels are larger.)

Level           Capacity           Access Time              Cost
CPU Registers   100s Bytes         << 1s ns
Cache           10s-100s K Bytes   ~1 ns                    $1s/ MByte
Main Memory     M Bytes            100ns- 300ns             $< 1/ MByte
Disk            10s G Bytes        10 ms (10,000,000 ns)    $0.001/ MByte
Tape            infinite           sec-min                  $0.0014/ MByte

Staging                 Xfer Unit         Managed by       Size
Registers <-> Cache     Instr. Operands   prog./compiler   1-8 bytes
Cache <-> Memory        Blocks            cache cntl       8-128 bytes
Memory <-> Disk         Pages             OS               512-4K bytes
Disk <-> Tape           Files             user/operator    Mbytes
Locality
• The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
• Two Different Types of Locality:
– Temporal Locality (Locality in Time): If an item is referenced, it will
tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): If an item is referenced, items
whose addresses are close by tend to be referenced soon
(e.g., straightline code, array access)
• Last 30 years, HW relied on locality for speed
[Figure: P (processor), $ (cache), MEM (memory)]
Cache Design
• Several interacting dimensions
– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back
• The optimal choice is a
compromise
– depends on access characteristics
• workload
• use (I-cache, D-cache, TLB)
– depends on technology / cost
• Simplicity often wins
[Figure: design space with axes Cache Size, Associativity, Block Size]
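To see how locality and block size interact, here is a small direct-mapped cache simulation. It is a sketch under simplifying assumptions (no associativity, no write policy, hit rate only), and the cache sizes and address traces are invented for the example.

```python
def hit_rate(addresses, cache_bytes=1024, block_bytes=16):
    """Simulate a direct-mapped cache and return the fraction of accesses that hit."""
    num_blocks = cache_bytes // block_bytes
    tags = [None] * num_blocks        # one tag per cache line
    hits = 0
    for addr in addresses:
        block = addr // block_bytes   # block number in memory
        index = block % num_blocks    # cache line the block maps to
        tag = block // num_blocks     # identifies which block is resident
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag         # miss: fetch the block, replacing the old one
    return hits / len(addresses)


# Sequential 4-byte word accesses over 4 KB (spatial locality).
sequential = list(range(0, 4096, 4))
small_blocks = hit_rate(sequential, block_bytes=4)
large_blocks = hit_rate(sequential, block_bytes=16)

# Ten passes over the same 256 bytes (temporal locality).
looping = list(range(0, 256, 4)) * 10
loop_rate = hit_rate(looping)

print(small_blocks, large_blocks, loop_rate)
```

Larger blocks help the sequential trace because each miss brings in neighboring words, while the looping trace hits on every pass after the first because its working set fits in the cache.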
Role of Memory
• Modern microprocessors are almost all cache
Memory in Parallel Systems
• Maintaining the Illusion of Seq Mem Access
• Multiple Processors Accessing Same Memory
– Consistency in Distributed and Shared Memory
[Figure: two multiprocessor organizations, each with processors P1 ... Pn, caches ($), memories (Mem), and an interconnection network]
Inter-Communication
[Figure: Pentium III Chipset. Labels: Proc, Caches, Busses, adapters, Memory, Controllers]
I/O Devices: Disks, Displays, Keyboards, Networks
Merging of HW/SW Design
• Moore’s law (more and more transistors) is all about volume and regularity
• What if you could pour nano-acres of unspecific digital
logic “stuff” onto silicon
– Do anything with it. Very regular, large volume
• Field Programmable Gate Arrays
– Chip is covered with logic blocks w/ FFs, RAM blocks, and
interconnect
– All three are “programmable” by setting configuration bits
– These are huge?
• Can each program have its own instruction set?
• Do we compile the program entirely into hardware?
Tiny and Low Power System
• Tiny and Cheap System
• System on a chip
• Resource efficiency
• Real-estate, power, pins, …
Conclusion
• A lot of work, but Good Results
• Should be a Dynamic Course
• Get Introductory and Book Done ASAP
• Hope to have Progressive and Fun Course