Lecture 20 RISC Architecture and Super Computer


RISC Architecture and Super
Computer
Prof. Sin-Min Lee
Department of Computer Science
San Jose State University
The Basis for RISC
• Use of simple instructions
• A key realization of the early RISC designers was
that a sequence of simple instructions produces
the same results as a sequence of complex
instructions, but can be implemented with a
simpler (and faster) hardware design.
Reduced Instruction Set Computers (RISC
machines) were the result.
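The claim that simple instructions can reproduce the effect of a complex one can be illustrated with a toy interpreter. This is only a sketch: the three-instruction set (LOAD, ADD, STORE), the register names, and the memory addresses are hypothetical and do not correspond to any real processor. It composes a memory-to-memory add, the kind of operation a CISC machine might offer as a single instruction, out of four simple steps.

```python
def run(program, regs, mem):
    """Execute a list of simple instructions on a toy load/store machine."""
    for op, *args in program:
        if op == "LOAD":        # LOAD rd, addr: register <- memory
            rd, addr = args
            regs[rd] = mem[addr]
        elif op == "ADD":       # ADD rd, rs1, rs2: register-to-register add
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "STORE":     # STORE rs, addr: memory <- register
            rs, addr = args
            mem[addr] = regs[rs]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return regs, mem


# Hypothetical memory contents: two operands and a result slot.
mem = {0x10: 5, 0x14: 7, 0x18: 0}
regs = {}

# Four simple instructions doing the work of one "add memory to memory".
program = [
    ("LOAD", "r1", 0x10),
    ("LOAD", "r2", 0x14),
    ("ADD", "r3", "r1", "r2"),
    ("STORE", "r3", 0x18),
]
run(program, regs, mem)
print(mem[0x18])  # 12
```

The program is longer than a single complex instruction would be, but each step touches memory at most once and has one fixed format.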
Addressing modes
• Limited number of addressing modes
• The effective address is computed in a
single clock cycle.
Instruction Pipeline
• Similar to a manufacturing assembly line
1. Fetch an instruction
2. Decode the instruction
3. Execute the instruction
4. Store results
• Each stage processes simultaneously (after
initial latency)
• Execute one instruction per clock cycle
Pipeline Stages
• Some processors use 3, 4, or 5 stages
RISC characteristics
• Simple instruction set.
• In a RISC machine, the instruction set
contains simple, basic instructions, from
which more complex instructions can be
composed.
• Same length instructions.
RISC characteristics
• Each instruction is the same length, so that
it may be fetched in a single operation.
• 1 machine-cycle instructions.
• Most instructions complete in one machine
cycle, which allows the processor to handle
several instructions at the same time. This
pipelining is a key technique used to speed
up RISC machines.
Instructions Pipelines
• The idea is to prepare the next instruction while
the current instruction is still executing.
• A three-stage RISC pipeline is:
1. Fetch instruction
2. Decode and select registers
3. Execute the instruction

Clock    1   2   3   4   5   6   7
Stage 1  i1  i2  i3  i4  i5  i6  i7
Stage 2  -   i1  i2  i3  i4  i5  i6
Stage 3  -   -   i1  i2  i3  i4  i5
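The stage-by-clock occupancy of a three-stage pipeline can be generated mechanically. The sketch below assumes an ideal pipeline with no stalls or branches; the function name `pipeline_table` is made up for illustration. In clock c (counting from 0), stage s holds instruction c - s, if that instruction exists.

```python
def pipeline_table(n_instructions, n_stages):
    """Return one row per stage; each row lists the instruction held per clock."""
    total_clocks = n_instructions + n_stages - 1  # fill time plus one per instruction
    rows = []
    for stage in range(n_stages):
        row = []
        for clock in range(total_clocks):
            i = clock - stage  # index of the instruction in this stage, if any
            row.append(f"i{i + 1}" if 0 <= i < n_instructions else "-")
        rows.append(row)
    return rows


rows = pipeline_table(n_instructions=7, n_stages=3)
for stage, row in enumerate(rows, start=1):
    print(f"Stage {stage}: " + " ".join(f"{cell:>3}" for cell in row))
```

With seven instructions the table has nine clock columns: the first seven show the pipeline filling and running, and the last two show it draining.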
RISC vs. CISC
• RISC processors have fewer and simpler instructions;
therefore, they are less complex and easier to
design. They also allow higher clock speeds than
CISC processors. However, when a high-level
language is compiled, a RISC CPU needs more
instructions than a CISC CPU.
• CISC processors are more complex, but this does not
necessarily increase their cost. CISC processors are
backward compatible.
Why RISC is better
The 80/20 rule: Analysis of the instruction mix generated by
CISC compilers shows that more than 80% of the instructions
generated and executed used only 20% of the instruction set. The
obvious conclusion was that if this 20% of the instructions were sped
up, the performance benefits would be far greater. Further analysis
shows that these instructions tend to perform the simpler operations
and use only the simpler addressing modes. For the CISC machine,
all the effort invested in processor design to provide complex
instructions and thereby reduce the compiler workload was being
wasted.
• Lower cost: Since only the simpler instructions are
needed, the processor hardware required to
implement them can be reduced in complexity.
Therefore, it should be possible to design a
higher-performance processor at lower cost.
• Good performance: With a simpler instruction set,
it should be possible for a processor to execute each
instruction in a single clock cycle, so higher
performance can be achieved.
Pipelining: A key RISC technique
RISC designers are concerned primarily with creating the
fastest chip possible, and so they use a number of techniques,
including pipelining.
Pipelining is a design technique where the computer's
hardware processes more than one instruction at a time, and
doesn't wait for one instruction to complete before starting
the next.
The advantages of RISC
Implementing a processor with a simplified instruction set design
provides several advantages over implementing a comparable CISC
design:
(1) Speed. Since a simplified instruction set allows for a pipelined, superscalar
design, RISC processors often achieve 2 to 4 times the performance of CISC
processors using comparable semiconductor technology and the same clock rates.
(2) Simpler hardware. Because the instruction set of a RISC processor is so simple,
it uses up much less chip space; extra functions, such as memory management
units or floating point arithmetic units, can also be placed on the same chip.
Smaller chips allow a semiconductor manufacturer to place more parts on a single
silicon wafer, which can lower the per-chip cost dramatically.
(3) Shorter design cycle. Since RISC processors are simpler than corresponding
CISC processors, they can be designed more quickly, and can take advantage of
other technological developments sooner than corresponding CISC designs,
leading to greater leaps in performance between generations.
Early RISC Machines
IBM 801 1980
120 instructions
No microcode
32 bit instructions
MSI technology
Berkeley RISC
Coined RISC and CISC
Promoted architecture and implementation innovations as RISC
Single VLSI chip implementation
Stanford MIPS
Concentrated on compiler technology to improve system
performance
IBM 801
Put in hardware what
Could not be moved to compile time
Could not be efficiently implemented in executable code
by a compiler
Could be implemented as random logic
Architecture
Thirty-two 32-bit registers
Separate data and instruction caches
Two stage pipeline, decode-operand fetch-execute, shift-set
conditions-write
Delayed branches, Branch with execute
Compilers
No intent on letting end users program in assembly
Berkeley RISC
Unlike IBM 801
No heavy reliance on compiler technology
Single chip implementation
Argues that RISC is the best way to use scarce silicon area
Influential because
Introduced RISC and CISC terms
First single chip RISC processor
Introduced several innovations at once
Great marketing job
Current RISC
RISC -> SPARC
MIPS -> MIPS R[2-4]000
IBM 801 -> IBM RT -> IBM RS/6000
HP-PA RISC
ARM
M88000
PowerPC
i860
i960
Instruction Pipeline
An instruction pipeline is very similar to a manufacturing
assembly line. Imagine an assembly line partitioned into four
stages:
• 1st stage receives some parts, performs its assembly
task, and passes the results to the second stage;
• 2nd stage takes the partially assembled product
from the first stage, performs its task, and passes its
work to the third stage;
• 3rd stage does its work, passing the results to the
last stage, which completes the task and outputs its
results.
As the first piece moves from the first stage to the
second stage, a new set of parts for a new piece
enters the first stage. Ultimately, every stage
processes a piece simultaneously. This is how time
is saved. Each product requires the same amount
of time to be processed (actually slightly more, to
account for the transfers between stages), but
products are manufactured more quickly because
several are being created at the same time.
An instruction pipeline processes an instruction the
way the assembly line processes a product.
• 1st stage: fetches the instruction from memory.
• 2nd stage: decodes the instruction and
fetches any required operands.
• 3rd stage: executes the instruction.
• 4th stage: stores the result.
Consider a nonpipelined machine with 6
execution stages of lengths 50 ns, 50 ns, 60 ns,
60 ns, 50 ns, and 50 ns.
- Find the instruction latency on this
machine.
- How much time does it take to execute
100 instructions?
Instruction latency = 50+50+60+60+50+50= 320 ns
Time to execute 100 instructions = 100*320 = 32000 ns
Suppose we introduce pipelining on this machine.
Assume that when introducing pipelining, the clock
skew adds 5ns of overhead to each execution stage.
- What is the instruction latency on the pipelined
machine?
- How much time does it take to execute 100
instructions?
Solution:
Remember that in the pipelined implementation, the
lengths of the pipe stages must all be the same, i.e., the
length of the slowest stage plus overhead. With 5 ns
overhead it comes to:
The length of a pipelined stage = MAX(lengths of unpipelined
stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 × 65 ns = 390 ns
The first instruction takes all 6 stages to complete; each of the
remaining 99 completes one stage time later:
Time to execute 100 instructions = 65 × 6 × 1 + 65 × 1 × 99 = 390 +
6435 = 6825 ns
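The arithmetic of this worked example can be checked with a few lines of Python. This is a sketch using the numbers given above; the variable names are chosen here for illustration only.

```python
stage_ns = [50, 50, 60, 60, 50, 50]  # stage lengths from the example
overhead_ns = 5                      # clock-skew overhead per pipelined stage
n = 100                              # number of instructions

# Nonpipelined: every instruction passes through all stages back to back.
unpipelined_latency = sum(stage_ns)
unpipelined_total = n * unpipelined_latency

# Pipelined: every stage is stretched to the slowest stage plus overhead.
cycle_ns = max(stage_ns) + overhead_ns
pipelined_latency = len(stage_ns) * cycle_ns
# The first instruction needs the full latency; each later one adds one cycle.
pipelined_total = pipelined_latency + (n - 1) * cycle_ns

print(unpipelined_latency, unpipelined_total)          # 320 32000
print(cycle_ns, pipelined_latency, pipelined_total)    # 65 390 6825
```

Note that the latency of a single instruction gets worse (390 ns versus 320 ns); the gain is in throughput.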
What is the speedup obtained from pipelining?
Solution:
Speedup is the ratio of the average instruction
time without pipelining to the average
instruction time with pipelining.
Average instruction time not pipelined = 320 ns
Average instruction time pipelined = 65 ns
Speedup = 320 / 65 = 4.92
The four-stage pipeline described above is one possible
configuration of a RISC pipeline; it is the pipeline
implemented in the SPARC MB86900 CPU. The IBM 801,
the first RISC computer, also uses a four-stage
instruction pipeline. Other processors,
such as the RISC II, use only three stages;
they combine the execute and store result
operations into a single stage.
The MIPS processor uses a five-stage pipeline; it decodes the
instruction and selects the operand registers in separate stages.
These three configurations are shown in the following figure.
• Note that each stage has a register that
latches its data at the end of the stage
to synchronize data flow between
stages. The flow of instructions
through each pipeline is shown in the
following Figure.
A Single Pipelined Control Unit
Offers Several Advantages:
• The primary advantage is the
reduced hardware requirements of
the pipeline.
• A second advantage of instruction
pipelines is the reduced complexity
of the memory interface.
• Many video game systems, such as the Sony PlayStation and
Nintendo consoles, use small RISC processors (about 33.9 MHz in
the PS1). These machines are single-purpose machines and always
run the same types of programs, so small RISC processors give
excellent performance on machines like these.
• Pocket PCs like the Palm Pilot and Compaq’s iPAQ series
also use small RISC processors. Again, a machine like this is
basically single-purpose. Yes, you can do a lot of things with
them, but often you use a calendar, an MP3 player, and maybe a
word processor.
So, why don’t I have a RISC processor at
home? (Continued)
• RISC-based PC processors are still quite a bit more
expensive than their CISC counterparts.
• When you write code for a RISC-based machine, you
are writing code native to that particular processor.
Compatibility becomes a serious issue: another
RISC processor using the same OS won’t be able to run
software that you coded on the previous machine.
• The rather bright fellows at Intel have come up
with a solution for you. The current processor you own
(provided that it is a 486 or later) is a CRISC
processor.
CRISC – I shouldn’t have to tell you what this
stands for
• Intel realized that while the x86 CISC instruction set is very
large, there are a few instructions that are quite common and
only do one thing (e.g., JMP, MOV, INC).
• Intel decided to take those common instructions,
adjust them to be the same size, and hardwire
them into the CPU’s core so they could be executed in a
RISC-like fashion.
• Yes, your Pentium III processor at home will behave
like a RISC processor, sometimes. This helps gain
more efficiency from the CPU while remaining
backward compatible.
Why Use Pipelining?
• Pipelining allows you to start executing
one instruction before the previous one has completed
• Even if there are delays in any one stage of the process
for one instruction, it is still more efficient than a
nonpipelined processor
• In Intel's x86 line, pipelining was introduced with the 486 processor
Review of the 6-stage execution process
• FETCH – Instructions are fetched from a
MICROCODE ROM (CISC)
• DECODE – Instructions are decoded into simple code
that the CPU understands (often called Micro-ops)
• ISSUE/SCHEDULE – Once instructions have been
decoded, they are placed into a pool and then issued to
a unit (Integer, FPU, MMX) for execution
• EXECUTE – The instruction is executed here
• RETIRE – Results are analyzed and put back into
their proper order
• WRITE BACK – The results of the instructions are
written to memory (committed to code)
Super Scalar
• Put simply, a super scalar processor has two or more
integer execution units that run in parallel (they can
execute instructions simultaneously)
• The Pentium processor was the first Intel super scalar
processor
• The scheduling unit can issue instructions
simultaneously to different units to be executed at the
same time
Data Flow
Performance Improvement
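• The speedup is the ratio of the time needed
to process n instructions using a nonpipelined
control unit to the time needed
using a pipelined control unit:
$S_n = \frac{n T_1}{(n + k - 1) T_k}$
where k is the number of pipeline stages, T_1 is the
time to process one instruction without pipelining, and
T_k is the clock period of the pipelined unit.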
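The formula can be tried on the six-stage example from earlier in the lecture (320 ns per instruction nonpipelined, 65 ns per pipelined stage). The sketch below reads T1 as the nonpipelined time per instruction, Tk as the pipelined clock period, and k as the number of stages; the function name `speedup` is chosen here for illustration.

```python
def speedup(n, k, t1, tk):
    """S_n = n*T1 / ((n + k - 1)*Tk) for n instructions on a k-stage pipeline."""
    return (n * t1) / ((n + k - 1) * tk)


# 100 instructions: 32000 ns nonpipelined versus 6825 ns pipelined.
s_100 = speedup(n=100, k=6, t1=320, tk=65)
print(round(s_100, 2))   # 4.69

# As n grows, the fill time of the pipeline stops mattering and the
# speedup approaches T1 / Tk.
s_large = speedup(n=10**9, k=6, t1=320, tk=65)
print(round(s_large, 2))  # 4.92
```

The limiting value T1/Tk is the same ratio obtained by comparing average instruction times (320 / 65), which ignores the cycles spent filling the pipeline.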
Pipeline Problems
• Memory access
– Fetch an instruction in one clock cycle
– Include cache memory
• Branch statements
– When a branch is taken, the instructions that are
already in the pipeline should not be there
Register Windowing
• More than 100 registers, but not all of them
are accessible at any one time
• Global registers are always accessible
• The remaining registers are windowed,
accessible at specific times
SPARC Processor Register
Windowing
Keeping Track
• A window pointer register contains the value
of the window that is currently active
• A window mask register contains 1 bit per
window and denotes which windows
contain valid data.
Subroutine Calls
• Register windows provide greatest benefit
during subroutine calls
• During the calling process, the register
window is moved down one position.
• CPU can pass parameters to the subroutine
via the registers that overlap
• Same register can be used to return results
to the calling routine.
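The overlap between caller and callee windows can be sketched as a small model. Everything concrete here is an assumption made for illustration: 4 windows, 8 registers each for the in, local, and out groups, 8 globals, and a window pointer that decreases by one on a call. Overflow and underflow handling (the job of the window mask register) is deliberately left out.

```python
N_WINDOWS = 4        # assumed number of windows
GROUP = 8            # assumed registers per group (in, local, out, global)


class RegisterFile:
    def __init__(self):
        self.global_regs = [0] * GROUP
        # Each window contributes 8 outs + 8 locals; its ins are the
        # outs of the neighbouring window, so they need no extra storage.
        self.windowed = [0] * (2 * GROUP * N_WINDOWS)
        self.cwp = 0  # current window pointer

    def _index(self, group, n):
        if not 0 <= n < GROUP:
            raise IndexError(n)
        offset = {"out": 0, "local": GROUP, "in": 2 * GROUP}[group]
        return (2 * GROUP * self.cwp + offset + n) % len(self.windowed)

    def read(self, group, n):
        if group == "global":
            return self.global_regs[n]
        return self.windowed[self._index(group, n)]

    def write(self, group, n, value):
        if group == "global":
            self.global_regs[n] = value
        else:
            self.windowed[self._index(group, n)] = value

    def call(self):
        # Move the window down one position: caller's outs become callee's ins.
        self.cwp = (self.cwp - 1) % N_WINDOWS

    def ret(self):
        self.cwp = (self.cwp + 1) % N_WINDOWS


rf = RegisterFile()
rf.write("local", 0, 7)    # caller's private value
rf.write("out", 0, 42)     # parameter placed in an overlapping register
rf.call()
print(rf.read("in", 0))    # 42: the callee sees the parameter without a copy
rf.write("local", 0, 1)    # callee's locals do not disturb the caller's
rf.write("in", 0, 99)      # return value goes back through the same register
rf.ret()
print(rf.read("out", 0), rf.read("local", 0))  # 99 7
```

No memory traffic is needed to pass the parameter or the result; only the window pointer changes.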
RISC Advantages
• RISC processors have fewer and simpler instructions.
– Their control units are less complex and easier
to design
– Run at higher clock frequencies
– Reduced amount of space needed on the
processor chip -> more space for additional
registers
– Easier to incorporate parallelism
– Compilers are less complex
CISC Advantages
• New CISC processors incorporate the
designs of their predecessors.
• Backward compatibility with other
processors in their series.