
EE/CE/CS 6304 Computer Architecture
Lecture #2
(8/27/15)
Yiorgos Makris
Professor
Department of Electrical Engineering
University of Texas at Dallas
Course Web-site:
http://www.utdallas.edu/~gxm112130/EE6304FA15
The Instruction Set: a Critical Interface
[Figure: the instruction set is the critical interface between software (above) and hardware (below)]
• Properties of a good abstraction
– Lasts through many generations (portability)
– Used in many different ways (generality)
– Provides convenient functionality to higher levels
– Permits an efficient implementation at lower levels
Instruction Set Architecture
... the attributes of a [computing] system as seen by
the programmer, i.e. the conceptual structure and
functional behavior, as distinct from the organization
of the data flows and controls, the logic design, and
the physical implementation.
– Amdahl, Blaauw, and Brooks, 1964
The ISA specifies:
-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions
Example: MIPS R3000
[Figure: programmer-visible register file r0 ... r31, plus PC, HI, LO]
Programmable storage:
– 2^32 x bytes of memory
– 31 x 32-bit GPRs (R0 = 0)
– 32 x 32-bit FP regs (paired DP)
– HI, LO, PC
Data types? Format? Addressing modes?
Arithmetic/logical:
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU,
AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI,
SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory access:
LB, LBU, LH, LHU, LW, LWL, LWR,
SB, SH, SW, SWL, SWR
Control (32-bit instructions on word boundary):
J, JAL, JR, JALR,
BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL
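As a hedged illustration (not from the slides), here is a small C function annotated with the MIPS instructions listed above that a simple compiler might emit; the register assignments in the comments are hypothetical.

```c
/* Summing an array: comments show a plausible (hypothetical)
 * mapping onto the MIPS instructions listed above. */
int sum(const int *a, int n) {
    int s = 0;                  /* AddU $t0, $zero, $zero       */
    for (int i = 0; i < n; i++) /* SLT $t3, $t1, $a1; BEq exit  */
        s += a[i];              /* LW $t2, 0($t4)               */
                                /* AddU $t0, $t0, $t2           */
                                /* AddIU $t4, $t4, 4            */
    return s;                   /* JR $ra                       */
}
```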
ISA vs. Computer Architecture
• Old definition of computer architecture
= instruction set design
– Other aspects of computer design called implementation
– Insinuates implementation is uninteresting or less challenging
• Our view is computer architecture >> ISA
• Architect’s job much more than instruction set
design; technical hurdles today more challenging
than those in instruction set design
• Since instruction set design not where action is,
some conclude computer architecture (using old
definition) is not where action is
– We disagree on conclusion
– Agree that ISA not where action is
Execution is not just about hardware
[Figure: the execution stack; source-to-source program transformations → compiler, libraries, linker → application binary → library services → OS services → hypervisor → hardware]
• The VAX fallacy
– Produce one instruction for every high-level concept
– Absurdity: Polynomial Multiply
» A single hardware instruction
» But why? Is this really faster??? (see the sketch below)
• RISC Philosophy
– Full system design
– Hardware mechanisms viewed in context of the complete system
– Cross-boundary optimization
• The modern programmer does not see assembly language
– Many do not even see “low-level” languages like “C”.
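To make the VAX-fallacy point concrete, here is a minimal C sketch (not from the lecture) of polynomial evaluation by Horner's rule; the VAX's POLY instruction did this in one complex instruction, while the loop below compiles to a handful of simple multiplies and adds, which is the RISC-style alternative. The function name and signature are illustrative.

```c
#include <stddef.h>

/* Horner's rule: p(x) = c[0] + x*(c[1] + x*(c[2] + ...)).
 * A RISC compiler turns this into a tight loop of simple
 * multiply and add instructions, which is easy to pipeline,
 * schedule, and optimize across iterations. */
double poly_eval(const double *c, size_t degree, double x) {
    double result = c[degree];
    for (size_t i = degree; i > 0; i--)
        result = result * x + c[i - 1];
    return result;
}
```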
Computer Architecture is an
Integrated Approach
• What really matters is the functioning of the complete
system
– hardware, runtime system, compiler, operating system, and
application
– In networking, this is called the “End to End argument”
• Computer architecture is not just about transistors,
individual instructions, or particular implementations
– E.g., Original RISC projects replaced complex instructions with a
compiler + simple instructions
• It is very important to think across all
hardware/software boundaries
– New technology → New Capabilities →
New Architectures → New Tradeoffs
– Delicate balance between backward compatibility and efficiency
Computer Architecture is
Design and Analysis
Architecture is an iterative process:
• Searching the space of possible designs
• At all levels of computer systems
[Figure: design/analysis loop; creativity proposes candidate designs, cost/performance analysis sorts them into good ideas, mediocre ideas, and bad ideas]
Computer Architecture Topics
[Figure: cross-section of a computer system, annotated with course topics]
• Input/Output and Storage: disks, WORM, tape; RAID; emerging technologies
• Memory Hierarchy: L1 cache, L2 cache, DRAM; interleaving, bus protocols; coherence, bandwidth, latency
• Instruction Set Architecture: addressing, protection, exception handling
• Pipelining and Instruction Level Parallelism: hazard resolution, superscalar, reordering, prediction, speculation, vector, dynamic compilation
• VLSI implementation
• Network communication, other processors
Computer Architecture Topics
[Figure: Processor-Memory-Switch organization; P/M nodes connected by an interconnection network]
• Multiprocessors: shared memory, message passing, data parallelism
• Networks and Interconnections: network interfaces; topologies, routing, bandwidth, latency, reliability
Building Hardware
that Computes
Finite State Machines: implementation as combinational logic + latch
• “Moore Machine”: output is a function of the current state only
• “Mealy Machine”: output is a function of the current state and the input
[Figure: Mealy-machine state diagram; states Alpha (00), Beta (01), Delta (10), with edges labeled input/output: 0/0, 1/0, 0/0, 1/1, 0/1, 1/1]
Example: serial divide-by-3 (dividend bits enter MSB-first, the state holds the running remainder, and Div is the quotient bit):

Input  State old  State new  Div
0      00         00         0
0      01         10         0
0      10         01         1
1      00         01         0
1      01         00         1
1      10         10         1
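A minimal C sketch (not from the lecture) of the same Mealy machine, with the next-state and output tables transcribed directly from the table above; bits are fed MSB-first.

```c
#include <stdio.h>

/* The Mealy machine tabulated above: states Alpha=0, Beta=1, Delta=2
 * hold the running remainder; the output is the quotient bit. */
static const int next_state[3][2] = {
    /* in=0  in=1 */
    {  0,    1 },   /* Alpha (00) */
    {  2,    0 },   /* Beta  (01) */
    {  1,    2 },   /* Delta (10) */
};
static const int div_out[3][2] = {
    {  0,    0 },   /* Alpha */
    {  0,    1 },   /* Beta  */
    {  1,    1 },   /* Delta */
};

int main(void) {
    int state = 0;                    /* start in Alpha */
    int bits[] = { 1, 1, 1, 1 };      /* dividend 15, MSB first */
    for (int i = 0; i < 4; i++) {
        printf("%d", div_out[state][bits[i]]);
        state = next_state[state][bits[i]];
    }
    /* prints 0101 (quotient 5); final state 00 means remainder 0 */
    printf("  final state: %d\n", state);
    return 0;
}
```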
Microprogrammed Controllers
• State machine in which part of the state is a “micro-pc”.
– Explicit circuitry for incrementing or changing the PC
• Includes a ROM with “microinstructions”.
– Controlled logic implements at least branches and jumps
[Figure: the micro-PC addresses a ROM of microinstructions; a MUX selects between PC+1 and a branch address; ROM outputs drive the combinational logic / controlled machine]
Example microprogram:

Addr  Instruction       Branch
0:    forw 35           xxx
1:    b_no_obstacles    000
2:    back 10           xxx
3:    rotate 90         xxx
4:    goto              001
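A hedged C sketch of the micro-PC/ROM structure above: each step either advances the micro-PC by one or loads a branch address through the MUX. The opcode set and step semantics below are invented for illustration; only the five-entry program mirrors the microcode listing.

```c
#include <stdio.h>

/* One microinstruction: an operation, an argument, and an optional
 * branch target (-1 = none). The encoding is hypothetical. */
typedef enum { FORW, BACK, ROTATE, B_NO_OBSTACLES, GOTO } uop_t;
typedef struct { uop_t op; int arg; int branch; } uinstr_t;

static const uinstr_t rom[] = {
    /* 0 */ { FORW,           35, -1 },
    /* 1 */ { B_NO_OBSTACLES,  0,  0 },  /* branch to 000 if clear */
    /* 2 */ { BACK,           10, -1 },
    /* 3 */ { ROTATE,         90, -1 },
    /* 4 */ { GOTO,            0,  1 },  /* unconditional goto 001 */
};

static int obstacle_ahead(void) { return 1; }  /* stub sensor input */

int main(void) {
    int upc = 0;                          /* the "micro-pc" */
    for (int step = 0; step < 8; step++) {
        uinstr_t u = rom[upc];
        printf("upc=%d op=%d arg=%d\n", upc, u.op, u.arg);
        /* MUX: select upc+1 or the branch address */
        if (u.op == GOTO)
            upc = u.branch;
        else if (u.op == B_NO_OBSTACLES && !obstacle_ahead())
            upc = u.branch;
        else
            upc = upc + 1;
    }
    return 0;
}
```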
Fundamental Execution Cycle
[Figure: the six-stage cycle, with the processor (regs, F.U.s) on one side and memory (program, data) on the other; the processor-memory path is the von Neumann bottleneck]
• Instruction Fetch: obtain the instruction from program storage
• Instruction Decode: determine required actions and instruction size
• Operand Fetch: locate and obtain operand data
• Execute: compute the result value or status
• Result Store: deposit results in storage for later use
• Next Instruction: determine the successor instruction
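A minimal C sketch of the same cycle as an interpreter loop; the toy one-byte ISA here is invented for illustration (it is not MIPS), but each line maps onto one stage above.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy ISA, invented for illustration: four 2-bit fields per byte,
 * op | rd | rs1 | rs2. op: 0 = add, 1 = sub, 2 = halt. */
enum { OP_ADD, OP_SUB, OP_HALT };

int main(void) {
    uint8_t mem[] = {   /* add r1,r2,r3; sub r0,r1,r2; halt */
        (OP_ADD  << 6) | (1 << 4) | (2 << 2) | 3,
        (OP_SUB  << 6) | (0 << 4) | (1 << 2) | 2,
        (OP_HALT << 6),
    };
    int32_t regs[4] = { 0, 0, 5, 7 };
    int pc = 0;
    for (;;) {
        uint8_t inst = mem[pc];                 /* instruction fetch  */
        int op = (inst >> 6) & 3;               /* instruction decode */
        int rd = (inst >> 4) & 3;
        int32_t a = regs[(inst >> 2) & 3];      /* operand fetch      */
        int32_t b = regs[inst & 3];
        if (op == OP_HALT) break;
        int32_t r = (op == OP_ADD) ? a + b : a - b;  /* execute       */
        regs[rd] = r;                           /* result store       */
        pc = pc + 1;                            /* next instruction   */
    }
    printf("r0=%d r1=%d\n", regs[0], regs[1]);  /* r0=7, r1=12 */
    return 0;
}
```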
What’s a Clock Cycle?
[Figure: a clock cycle spans the path from one latch or register, through combinational logic, to the next]
• Old days: 10 levels of gates
• Today: determined by numerous time-of-flight
issues + gate delays
– clock propagation, wire lengths, drivers
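The slide's point can be written as a standard timing budget (implied, not stated, in the lecture): the clock period must cover the whole register-to-register path,

$$ T_{cycle} \;\ge\; t_{clk \to Q} + t_{comb,max} + t_{setup} + t_{skew}, $$

where t_comb,max is the worst-case delay through the combinational logic (the “10 levels of gates” of the old days) and t_skew absorbs the clock-propagation and wire-length effects listed above.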
Pipelined Instruction Execution
[Figure: five-stage pipeline over clock cycles 1-7; four instructions in program order each flow through Ifetch, Reg, ALU, DMem, Reg, with a new instruction entering the pipeline every clock cycle]
Limits to pipelining
• Maintain the von Neumann “illusion” of one
instruction at a time execution
• Hazards prevent next instruction from executing
during its designated clock cycle
– Structural hazards: attempt to use the same hardware to do
two different things at once
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps).
• Power: Too many things happening at once → melt your chip!
– Must disable parts of the system that are not being used
– Clock Gating, Asynchronous Design, Low Voltage Swings, …
Progression of ILP
• 1st generation RISC - pipelined
– Full 32-bit processor fit on a chip => issue almost 1 IPC
» Need to access memory 1+x times per cycle
– Floating-Point unit on another chip
– Cache controller a third, off-chip cache
– 1 board per processor → multiprocessor systems
• 2nd generation: superscalar
– Processor and floating point unit on chip (and some cache)
– Issuing only one instruction per cycle uses at most half
– Fetch multiple instructions, issue a couple
» Grows from 2 to 4 to 8 …
– How to manage dependencies among all these instructions?
– Where does the parallelism come from?
• VLIW
– Expose some of the ILP to compiler, allow it to schedule
instructions to reduce dependences
Modern ILP
• Dynamically scheduled, out-of-order execution
– Current microprocessors issue 6-8 instructions per cycle
– Pipelines are 10s of cycles deep
→ many simultaneous instructions in execution at once
– Unfortunately, hazards cause discarding of much work
• What happens:
– Grab a bunch of instructions, determine all their dependences,
eliminate dep’s wherever possible, throw them all into the execution
unit, let each one move forward as its dependences are resolved
– Appears as if executed sequentially
– On a trap or interrupt, capture the state of the machine between
instructions perfectly
• Huge complexity
– Complexity of many components scales as n² (n = issue width)
– Power consumption big problem
IBM Power 4
• Combines: Superscalar and OOO
• Properties:
– 8 execution units in out-of-order engine,
each may issue an instruction each cycle.
– In-order Instruction Fetch, Decode (compute
dependencies)
– Reordering for in-order commit
When all else fails - guess
• Programs make decisions as they go
– Conditionals, loops, calls
– Translate into branches and jumps (roughly 1 of every 5 instructions)
• How do you determine which instructions to fetch
when the ones before them haven’t executed?
– Branch prediction
– Lots of clever machine structures to predict the future based on
history (see the sketch below)
– Machinery to back out of mis-predictions
• Execute all the possible branches
– Likely to hit additional branches, perform stores
→ speculative threads
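One of the classic “clever machine structures” is the 2-bit saturating counter; here is a minimal C sketch, where the table size and the PC-indexing scheme are illustrative assumptions rather than any particular machine's design.

```c
#include <stdint.h>

/* 2-bit saturating counters: 0-1 predict not-taken, 2-3 predict taken.
 * A branch history table indexed by low PC bits; the 1K-entry size
 * is an illustrative choice. */
#define BHT_ENTRIES 1024
static uint8_t bht[BHT_ENTRIES];   /* all start at 0 (strongly not-taken) */

int predict(uint32_t pc) {
    return bht[(pc >> 2) % BHT_ENTRIES] >= 2;   /* 1 = predict taken */
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
    if (taken  && *c < 3) (*c)++;   /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;   /* saturate at 0 */
}
```

The two bits give hysteresis: a single atypical outcome (e.g., a loop exit) does not flip the prediction for the next visit.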
What can hardware do to make programming
(with performance) easier?
Have we reached the end of ILP?
• Multiple processors easily fit on a chip
• Every major microprocessor vendor
has gone to multithreaded cores
– Thread: locus of control, execution context
– Fetch instructions from multiple threads at once,
throw them all into the execution unit
– Intel: hyperthreading
– Concept has existed in high-performance computing
for 20 years (or is it 40? CDC 6600)
• Vector processing (see the sketch below)
– Each instruction processes many distinct data elements
– Ex: MMX
• Raise the level of architecture: many processors per chip
[Figure: Tensilica configurable processor]
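A hedged sketch of the “each instruction processes many distinct data” idea, using SSE2 intrinsics (a successor to the MMX instructions named on the slide; the choice of SSE2 here is mine): a single vector add instruction operates on eight 16-bit elements at once.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    short a[8] = {  1,  2,  3,  4,  5,  6,  7,  8 };
    short b[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
    short c[8];

    /* One vector instruction adds all eight 16-bit lanes at once. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi16(va, vb);
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 8; i++)
        printf("%d ", c[i]);         /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}
```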
Limiting Forces: Clock Speed and ILP
• Chip density continues to
increase ~2x every 2 years
– Clock speed is not
– # processors/chip (cores)
may double instead
• There is little or no
more Instruction
Level Parallelism (ILP)
to be found
– Can no longer allow the
programmer to think in
terms of a serial
programming model
• Conclusion:
Parallelism must be
exposed to software!
Source: Intel, Microsoft (Sutter) and
Stanford (Olukotun, Hammond)
Examples of MIMD Machines
• Symmetric Multiprocessor
– Multiple processors in box with shared
memory communication
– Current MultiCore chips like this
– Every processor runs copy of OS
• Non-uniform shared-memory with
separate I/O through host
– Multiple processors
» Each with local memory
» general scalable network
– Extremely light “OS” on node provides
simple services
» Scheduling/synchronization
– Network-accessible host for I/O
• Cluster
– Many independent machines connected with a
general network
– Communication through messages
[Figures: a bus-based SMP, with processors sharing memory over a bus; a grid of P/M (processor/memory) nodes with an attached host for I/O]
Categories of Thread Execution
[Figure: issue slots over time (processor cycles) for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; shading distinguishes threads 1-5, empty boxes are idle slots]