PPT - Florida State University
Download
Report
Transcript PPT - Florida State University
CDA 5155 and 4150
Computer Architecture
Week 2:
2 September 2014
Goals of the course
• Advanced coverage of computer
architecture – general purpose processors,
embedded processors,historically
significant processors, design tools.
– Instruction set architecture
– Processor microarchitecture
– Systems architecture
• Memory systems
• I/O systems
2/49
Teaching Staff
• Professor Gary Tyson
– PhD: University of California – Davis
– Faculty jobs:
•
•
•
•
•
California State University Sacramento: 1987 - 1990
University of California – Davis: 1993 - 1994
University of California – Riverside: 1995 - 1996
University of Michigan: 1997 - 2003
Florida State University: 2003 – present
3/49
Grading in 5155 (Fall’14)
Programming assignments
Exams (2 @ 25% each)
In class, 75 minutes
Team Project (20%)
In-order pipeline simulation (10%)
Out-of-order pipeline simulation (10%)
3 or 4 students per team
Class Participation (10%)
4/49
Time Management
• 3 hours/week lecture
– This is probably the most important time
• 2 hours/week reading
– Hennessy/Patterson:
– Computer Architecture: A Quantitative Approach
• 3-5 hours/week exam prep
• 5+ hours/week Project (1/3 semester)
Total: ~10-15 hours per week.
5/49
Tentative Course Timeline
Week
Date
Topic
1
Aug 28
Performance, ISA, Pipelining
2
Sept 2
Pipelining, Branch Prediction
3
Sept 9
Superscalar, Exceptions
4
Sept 16
Compilers, VLIW
5
Sept 23
Dynamic Scheduling
6
Sept 30
Dynamic Scheduling
7
Oct 7
Advanced pipelines
8
Oct 14
Advanced pipelines
9
Oct 21
Cache design
10
Oct 28
Cache design, VM
11
Nov 4
Multiprocessor, Multithreading
12
Nov 11
Embedded processors
13
Nov 18
Embedded processors
14
Nov 25
Research Topics
15
Dec 2
Research Topics
Holidays
Due Dates
Notes
Exa
m
Exam
project
6/49
Web Resources
Course Web Page:
http://www.cs.fsu.edu/~tyson/courses/CDA5155
Wikipedia:
http://en.wikipedia.org/wiki/Microprocessor
Wisconsin Architecture Page:
http://arch-www.cs.wisc.edu/home
7/49
Levels of Abstraction
•
•
•
•
•
•
•
•
•
Problem/Idea (English?)
Algorithm (pseudo-code)
High-Level languages (C, Verilog)
Assembly instructions (OS calls)
Machine instructions (I/O interfaces)
Microarchitecture/organization (block diagrams)
Logic level: gates, flip-flops (schematic, HDL)
Circuit level: transistors, sizing (schematic, HDL)
Physical: VLSI layout, feature size, cabling, PC boards.
What are the abstractions at each level?
8/49
Levels of Abstraction
•
•
•
•
•
•
•
•
•
Problem/Idea (English?)
Algorithm (pseudo-code)
High-Level languages (C, Verilog)
Assembly instructions (OS calls)
Machine instructions (I/O interfaces)
Microarchitecture/organization (block diagrams)
Logic level: gates, flip-flops (schematic, HDL)
Circuit level: transistors, sizing (schematic, HDL)
Physical: VLSI layout, feature size, cabling, PC boards.
At what level do I perform a square root? Recursion?
9/49
Levels of Abstraction
•
•
•
•
•
•
•
•
•
Problem/Idea (English?)
Algorithm (pseudo-code)
High-Level languages (C, Verilog)
Assembly instructions (OS calls)
Machine instructions (I/O interfaces)
Microarchitecture/organization (block diagrams)
Logic level: gates, flip-flops (schematic, HDL)
Circuit level: transistors, sizing (schematic, HDL)
Physical: VLSI layout, feature size, cabling, PC boards.
Who/what translates from one level to the next?
10/49
Role of Architecture
• Responsible for hardware specification:
– Instruction set design
• Also responsible for structuring the overall
implementation
– Microarchitectural design.
• Interacts with everyone
– mainly compiler and logic level designers.
• Cannot do a good job without knowledge of both
sides
11/49
Design Issues: Performance
• Get acceptable performance out of system.
– Scientific: floating point throughput, memory&disk
intensive, predictable
– Commercial: string handling, disk (databases),
predictable
– Multimedia: specific data types (pixels), network?
Predictable?
– Embedded: what do you mean by performance?
– Workstation: Maybe all of the above, maybe not
12/49
Calculating Performance
•
•
•
Execution time is often the best metric
Throughput (tasks/sec) vs. latency (sec/task)
Benchmarks: what are the tasks?
–
–
–
–
–
What I care about!
Representative programs (SPEC, Linpack)
Kernels: representative code fragments
Toy programs: useful for testing end-conditions
Synthetic programs: does nothing but with a
representative instruction mix.
13/49
Design Issues: Cost
• Processor
– Die size, packaging, heat sink? Gold connectors?
– Support: fan, connectors, motherboard specifications, etc.
• Calculating processor cost:
– Cost of device = (die + package + testing) / yield
– Die cost = wafer cost / good die yield
• Good die yield related to die size and defect density
– Support costs: direct costs (components, labor),
indirect costs ( sales, service, R&D)
– Total costs amortized over number of systems sold(PC vs NASA)
14/49
Other design issues
• Some applications care about other design issues.
• NASA deep space mission
– Reliability: software and hardware (radiation
hardening)
• AMD
– Code compatibility
• ARM
– Power
15/49
A Quantitative Approach
• Hardware systems performance is
generally easy to quantify
– Machine A is 10% faster than Machine B
– Of course Machine B’s advertising will show
the opposite conclusion
• Many software systems tend to have much
more subjective performance evaluations.
16/49
Measuring Performance
• Total Execution Time:
– A is 3 times faster than B for programs P1,P2
1
n
n
Σ Time
i
i=1
– Issue: Emphasizes long running programs
17/49
Measuring Performance
• Weighted Execution Time:
n
∑
Weighti X Timei
i=1
– What if P1 is executed far more frequently?
18/49
Measuring Performance
• Normalized Execution Time:
– Compare machine performance to a reference
machine and report a ratio.
• SPEC ratings measure relative performance to a
reference machine.
19/49
Amdahl’s Law
• Rule of Thumb: Make the common case faster
http://en.wikipedia.org/wiki/Amdahl's_law
(Attack longest running part until it is no longer) repeat
20/49
Instruction Set Design
• Software Systems: named variables;
complex semantics.
• Hardware systems: tight timing
requirements; small storage structures;
simple semantics
• Instruction set: the interface between very
different software and hardware systems
21/49
Design decisions
• How much “state” is in the
microarchitecture?
– Registers; Flags; IP/PC
• How is that state accessed/manipulated?
– Operand encoding
• What commands are supported?
– Opcode; opcode encoding
22/49
Design Challenges: or why is
architecture still relevant?
• Clock frequency is increasing
– This changes the number of levels of gates that
can be completed each cycle so old designs
don’t work.
– It also tend to increase the ratio of time spent
on wires (fixed speed of light)
• Power
– Faster chips are hotter; bigger chips are hotter
23/49
Design Challenges (cont)
• Design Complexity
– More complex designs to fix frequency/power issues
leads to increased development/testing costs
– Failures (design or transient) can be difficult to
understand (and fix)
• We seem far less willing to live with hardware
errors (e.g. FDIV) than software errors
– which are often dealt with through upgrades – that we
pay for!)
24/49
Techniques for Encoding
Operands
• Explicit operands:
– Includes a field to specify which state data is
referenced
– Example: register specifier
• Implicit operands:
– All state data can be inferred from the opcode
– Example: function return (CISC-style)
25/49
Accumulator
• Architectures with one implicit register
– Acts as source and/or destination
– One other source explicit
• Example: C = A + B
– Load A
– Add B
– Store C
// (Acc)umulator = A
// Acc = Acc + B
// C = Acc
Ref: “Instruction Level Distributed Processing:
Adapting to Shifting Technology” 26/49
Stack
• Architectures with implicit “stack”
– Acts as source(s) and/or destination
– Push and Pop operations have 1 explicit operand
• Example: C = A + B
–
–
–
–
Push A
Push B
Add
Pop C
// Stack = {A}
// Stack = {A, B}
// Stack = {A+B}
// C = A+B ; Stack = {}
Compact encoding; may require more instructions though
27/49
Registers
• Most general (and common) approach
– Small array of storage
– Explicit operands (register file index)
• Example: C = A + B
Register-memory
load/store
Load R1, A
Load R1, A
Load R2, B
Add R3, R1, B Add R3, R1, R2
Store R3, C
Store R3, C
28/49
Memory
• Big array of storage
– More complex ways of indexing than registers
• Build addressing modes to support efficient
translation of software abstractions
• Uses less space in instruction than 32-bit
immediate field
A[i]; use base (i) + displacement (A) (scaled?)
a.ptr; use base (a) + displacement (ptr)
29/49
Addressing modes
Register
Immediate
Base/Displacement
Register Indirect
Indexed
Direct
Memory Indirect
Autoincrement
Add R4, R3
Add R4, #3
Add R4, 100(R1)
Add R4, (R1)
Add R4, (R1+R2)
Add R4, (1001)
Add R4, @(R3)
Add R4, (R2)+
30/49
Other Memory Issues
What is the size of each element in memory?
0x000
0-255
Byte
0x000
0 - 65535
Half word
0x000
0 - ~4B
Word
31/49
Other Memory Issues
Big-endian or Little-endian? Store 0x114488FF
Points to most significant byte
0x000
11
Points to least significant byte
0x000
FF
44
88
88
44
FF
11
32/49
Other Memory Issues
Non-word loads? ldb R3, (000)
0x000
11
00 00 00 11
44
88
FF
33/49
Other Memory Issues
Non-word loads? ldb R3, (003)
11
FF FF FF FF
44
Sign extended
88
0x003
FF
34/49
Other Memory Issues
Non-word loads? ldbu R3, (003)
11
00 00 00 FF
44
Zero filled
88
0x003
FF
35/49
Other Memory Issues
Alignment? Word accesses only address ending in 00
Half-word accesses only ending in 0
Byte accesses any address
11
44
0x002
ldw R3, (002) is illegal!
88
FF
Why is it important to be aligned?
How can it be enforced?
36/49
Techniques for Encoding
Operators
• Opcode is translated to control signals that
– direct data (MUX control)
– select operation for ALU
– Set read/write selects for register/memory/PC
• Tradeoff between how flexible the control is and
how compact the opcode encoding.
– Microcode – direct control of signals (Improv)
– Opcode – compact representation of a set of control
signals.
• You can make decode easier with careful opcode selection
37/49
Handling Control Flow
•
•
•
•
•
•
Conditional branches (short range)
Unconditional branches (jumps)
Function calls
Returns
Traps (OS calls and exceptions)
Predicates (conditional retirement)
38/49
Encoding branch targets
• PC-relative addressing
– Makes linking code easier
• Indirect addressing
– Jumps into shared libraries, virtual functions,
case/switch statements
• Some unusual modes to simplify target
address calculation
– (segment offset) or (trap number)
39/49
Condition codes
• Flags
– Implicit: flag(s) specified in opcode (bgt)
– Flag(s) set by earlier instructions (compare, add, etc.)
• Register
– Uses a register; requires explicit specifier
• Comparison operation
– Two registers with compare operation specified in
opcode.
40/49
Higher Level Semantics:
Functions
•
Function call semantics
–
–
–
–
•
Simple approach:
–
•
Save PC + 1 instruction for return
Manage parameters
Allocate space on stack
Jump to function
Use a jump instruction + other instructions
Complex approach:
–
Build implicit operations into new “call” instruction
41/49
Role of the Compiler
• Compilers make the complexity of the ISA
(from the programmers point of view) less
relevant.
– Non-orthogonal ISAs are more challenging.
– State allocation (register allocation) is better
left to compiler heuristics
– Complex Semantics lead to more global
optimization – easier for a machine to do.
People are good at optimizing 10 lines of code.
Compilers are good at optimizing 10M lines.
42/49
LC processor
• Little Computer Fall 2011
– For programming projects
• Instruction Set Design
opcode regA
regB
destReg
43/49
LC processor
R-type instructions
opcode regA
regB
24- 22 21- 19 18 –16
add:
destReg
15 – 3
2-0
destReg = regA + regB
nand: destReg = regA & regB
44/49
LC processor
I-type instructions
opcode regA
24- 22
regB
21- 19 18 –16
offsetField
15 – 0
lw:
regB = Memory[regA + offsetField]
sw:
Memory[regA +offsetField] = regB
beq:
if (regA= = regB) PC = PC + 1 + offsetField
45/49
LC processor
O-type instructions
opcode
unused
24- 22
21 – 0
noop:
do nothing
halt:
halt the simulation
46/49
LC assembly example
lw 0
1
lw 1
2
start
add 1
beq 0
1
beq 0
0
noop
done
halt
five
.fill
neg1
.fill
stAddr
.fill
five
3
2
1
2
start
load reg1 with 5 (uses symbolic address)
load reg2 with -1 (uses numeric address)
decrement reg1
goto end of program when reg1==0
go back to the beginning of the loop
end of program
5
-1
start
will contain the address of start (2)
47/49
LC machine code example
(address
(address
(address
(address
(address
(address
(address
(address
(address
(address
0):
1):
2):
3):
4):
5):
6):
7):
8):
9):
8454151 (hex 0x810007)
9043971 (hex 0x8a0003)
655361 (hex 0xa0001)
16842754 (hex 0x1010002)
16842749 (hex 0x100fffd)
29360128 (hex 0x1c00000)
25165824 (hex 0x1800000)
5 (hex 0x5)
-1 (hex 0xffffffff)
2 (hex 0x2)
Input for simulator:
8454151
9043971
655361
16842754
16842749
29360128
25165824
5
-1
2
48/49