CA_7_Performance - KTU
COMPUTER
ARCHITECTURE
Assoc.Prof. Stasys Maciulevičius
Computer Dept.
Computer performance
We have already noted: since performance
is inversely proportional to execution
time, performance can be defined as
follows:
Perf ≡ F / CPI
where F is the clock frequency and CPI is the
average number of clock cycles per instruction.
It shows that performance can be increased in
two ways:
1) by increasing the clock frequency,
2) by enhancing the processor architecture
(decreasing CPI)
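The relation Perf ≡ F / CPI can be tried out numerically. The sketch below is only an illustration: the function name and the sample figures (2 GHz, 3 GHz, CPI of 2 and 1) are assumptions, not values from the lecture.

```python
def performance(clock_hz: float, cpi: float) -> float:
    """Perf = F / CPI, i.e. instructions completed per second."""
    return clock_hz / cpi


# Hypothetical baseline: 2 GHz clock, 2 cycles per instruction on average.
base = performance(2.0e9, 2.0)

# Way 1: raise the clock frequency, keep CPI.
faster_clock = performance(3.0e9, 2.0)

# Way 2: improve the architecture (lower CPI), keep the clock.
lower_cpi = performance(2.0e9, 1.0)

print(faster_clock / base)  # 1.5
print(lower_cpi / base)     # 2.0
```

Under these assumed numbers, halving CPI gains more than a 50% clock increase, which is why both ways are pursued.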
2009-2014
©S.Maciulevičius
Computer performance
What makes a higher clock frequency possible?
1) the level of technology development
2) features of the processor microarchitecture.
What makes a lower CPI possible? Such
features of the processor microarchitecture as:
1) smart algorithms for implementing instructions
2) more actions within one clock period.
Another factor contributing to performance gains is
the use of SIMD principles
Let's look more closely at these and other things
Processor performance
Since computer performance primarily
depends on processor performance, we will
talk about the processor
Over the past decades, several microprocessor
families have been developed, competing in the server,
workstation, desktop and notebook markets
Each such family has its own microarchitecture,
created according to its designers' view of user
needs, performance evaluation criteria,
development prospects and market trends
This view influences the decisions that determine the
CPU's architectural features and trade-offs
General tools for increasing
processor performance
improving silicon technology
superscalarity
out-of-order execution
pipelining and superpipelining
special functional units (MMX, SSE, …)
register renaming
increasing the inner memory capacity
integrated caches
branch prediction
predication
speculation
multithreading
using several cores on one chip
using SoCs (Systems on a Chip)
Improving silicon technology
The fabrication process determines the die size
and in effect limits the complexity of the device
placed on it
The first Intel 16-bit microprocessor for
personal computers was produced using a 3
micron process technology; now a 0.032 micron
process technology is in use, and a 0.022 micron
process technology is coming
Shrinking the feature size almost 100 times
allows almost 10000 times more complex circuitry
to be placed on the same chip area, and
increases its speed
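The "almost 100 times" and "almost 10000 times" figures follow from simple geometry: if every linear dimension shrinks by a factor s, the number of elements per unit area grows by about s squared. A minimal sketch of that arithmetic, assuming ideal scaling (the function name is illustrative):

```python
def scaling(old_feature_um: float, new_feature_um: float):
    """Return (linear shrink factor, density gain) under ideal scaling."""
    linear = old_feature_um / new_feature_um
    density = linear ** 2  # area scales with the square of the linear size
    return linear, density


linear, density = scaling(3.0, 0.032)
print(round(linear, 2))  # linear shrink from 3 micron to 0.032 micron
print(round(density))    # elements that fit in the same area, relative to before
```

For 3 micron versus 0.032 micron this gives a linear factor of about 94 and a density gain of roughly 8800, in line with the rounded figures on the slide.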
Superscalarity
A CPU that has several
functional units is
called superscalar.
The first step toward
superscalarity was the
Intel 486DX processor,
which in addition to the
arithmetic-logic unit (IU,
Integer Unit) has a
floating-point
processing unit (FPU):
[Figure: Intel 486DX block diagram showing the prefetch buffer, BIU (Bus Interface Unit), cache (8 KB), IU, FPU and registers]
Superscalarity
Pentium has two IUs and one FPU:
[Figure: Pentium block diagram showing the instruction cache (8 KB), BTB, prefetch buffer, BIU (Bus Interface Unit), two ALUs (U pipe and V pipe), FPU with register stack, registers and data cache (8 KB)]
Superscalarity– Pentium 4
Superscalar processor
[Figure: superscalar processor organization showing main memory, instruction memory (cache), instruction fetch, operand fetch, register file, functional units (Load, Store, Addr, ALU, Shift, Branch), write result and data memory]
Out-of-order execution
In a superscalar processor, execution of
several instructions can be started at the
same time
Since the sequence of instructions generated
by the compiler will not always keep several
functional units busy, modern processors have
special circuitry that analyzes a large queue
of instructions waiting for execution and
dynamically selects those instructions that
can be carried out at a given time, even
though other instructions are ahead of them
Out-of-order execution
To allow instructions to be initiated
out of order, the processor has an
instruction buffer, called the instruction
window. It sits in the pipeline between stages F
and X; it holds the instructions that are already
waiting for execution
The instruction window can be:
centralized: common to all functional units
distributed: every functional unit has its own
window
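How a centralized instruction window selects work can be sketched in a few lines. This is a simplified model, not a description of any real processor: it assumes unit latency, an issue width of two, no WAR/WAW conflicts, and hypothetical names (`Instr`, `schedule`, registers r1 to r8).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def schedule(window, ready_regs, width=2):
    """Issue up to `width` instructions per cycle whose operands are ready.

    Returns one list of instruction names per cycle.
    """
    ready = set(ready_regs)
    pending = list(window)          # program order
    cycles = []
    while pending:
        # Pick the oldest instructions that have all source operands available.
        issued = [i for i in pending if all(s in ready for s in i.srcs)][:width]
        if not issued:
            raise ValueError("no waiting instruction has all operands ready")
        pending = [i for i in pending if i not in issued]
        # Results become visible to later cycles (unit latency assumed).
        ready.update(i.dest for i in issued)
        cycles.append([i.name for i in issued])
    return cycles


window = [
    Instr("k1", "r1", ("r2", "r3")),
    Instr("k2", "r4", ("r1",)),        # must wait for k1
    Instr("k3", "r5", ("r6", "r7")),   # independent of k1 and k2
    Instr("k4", "r8", ("r5",)),        # must wait for k3
]
order = schedule(window, {"r2", "r3", "r6", "r7"})
print(order)  # [['k1', 'k3'], ['k2', 'k4']]
```

Here k3 starts ahead of k2 although k2 precedes it in program order, which is exactly the out-of-order behaviour described above.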
Pipelining and superpipelining
A pipeline is a set of data processing elements
connected in series, so that the output of
one element is the input of the next one
The elements of a pipeline often operate
in parallel or in time-sliced fashion
The first processors had relatively short (two to five
stages) pipelines
In order to reduce the clock period, the number
of stages has been increased
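The benefit of pipelining can be seen from a cycle count. In an ideal pipeline with no stalls (an assumption made here for illustration), n instructions on a k-stage pipeline need k + n - 1 cycles, against n × k cycles when each instruction must finish before the next one starts. The function names are illustrative.

```python
def pipelined_cycles(n_instr: int, stages: int) -> int:
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    return stages + n_instr - 1


def unpipelined_cycles(n_instr: int, stages: int) -> int:
    """No overlap: every instruction passes through all stages alone."""
    return n_instr * stages


n, k = 100, 5
print(pipelined_cycles(n, k))    # 104
print(unpipelined_cycles(n, k))  # 500
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))  # speedup, close to k
```

For long instruction sequences the speedup approaches the number of stages; real pipelines fall short of this because of stalls and branches.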
Superpipelining
Pipelines are now much longer and have more than
10 stages
When a pipeline has more stages, each stage is
simpler, and its operation can therefore be
completed within a shorter period of time
Such processors are called superpipelined.
The Pentium pipeline has only 5 stages; the AMD Athlon
integer pipeline has 10 stages, its FPU pipeline 15 stages
Superpipelining
Pentium 4 has a very deep instruction pipeline
(20 stages) to achieve very high clock speeds
(up to 3.8 GHz)
A new word was introduced for it: hyperpipeline.
To achieve higher frequencies, the pipeline of the
Pentium 4 Prescott was lengthened to 31
stages
Special functional units
Image processing and complex computer graphics are
necessary parts of modern computer-aided design
systems and computer games
This calls for special operations, which are implemented in
dedicated functional units, sometimes called
multimedia functional units
The first such units were developed for the Intel
80860 microprocessor (although the 80860
was not used in PCs)
In 1996, the Pentium MMX was introduced with the same
basic Pentium microarchitecture complemented with MMX
instructions
Special functional units
The MMX technology is designed to accelerate
multimedia and communications applications by
including new instructions and data types that
allow applications to achieve a new level of
performance
Originally MMX instructions were executed in the FPU,
later in specialized functional units (XMM etc.)
The main feature of the multimedia functional units is
parallel processing, in which several (2-16) short
words are packed into one long (128 or 256 bit)
word
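Packed processing can be imitated in software to show the idea. The sketch below packs sixteen 8-bit values into one 128-bit integer and adds two such words lane by lane with wraparound. It only models the data layout; real SIMD hardware does all lanes in one operation, and the helper names are hypothetical.

```python
LANES = 16       # sixteen short words ...
LANE_BITS = 8    # ... of 8 bits each fill a 128-bit long word
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack LANES small integers into one long word (lane 0 in the low bits)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a long word back into its LANES short words."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def packed_add(a, b):
    """Add two packed words lane by lane; each lane wraps around modulo 256."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane << shift
    return result


a = pack(range(16))
b = pack([250] * 16)
print(unpack(packed_add(a, b)))
```

A carry out of one lane never reaches its neighbour, which is what distinguishes a packed add from an ordinary 128-bit add.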
Register renaming
Suppose a program contains these two instructions:
k1: add ..., r2, ...;
[ ... ← (r2) + (…) ]
k2: mult r2, ..., …;
[ r2 ← (...) × (…) ]
The second instruction replaces the content of
register r2 with a new value. Clearly, k2
cannot write its result until k1 has read
the former content of register r2.
This situation is called a WAR
(write-after-read) conflict.
Register renaming
If the processor has 'spare' registers, one of
them (for example, r33) can be used in k2:
k1: add ..., r2, ...;
[ ... ← (r2) + (…) ]
k2: mult r33, ..., …;
[ r33 ← (...) × (…) ]
This resolves the WAR conflict. Clearly,
subsequent instructions must refer to r33
instead of r2 (for as long as this register
renaming remains in effect)
The term "register renaming" was introduced in
1975
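The renaming step can be sketched as a small table that maps each architectural register to the register currently holding its newest value. This is a simplified illustration, not the mechanism of any particular processor: it assumes architectural registers r1 to r32, spare registers starting at r33, an unlimited supply of spares, and a hypothetical three-instruction program in which the elided operands of the slide are filled with made-up registers.

```python
def rename(program, first_spare=33):
    """Rename destinations that conflict with earlier reads or writes.

    program: list of (op, dest, src1, src2) tuples using architectural names.
    """
    mapping = {}      # architectural register -> spare register now holding it
    seen = set()      # registers read or written by earlier instructions
    next_spare = first_spare
    renamed = []
    for op, dest, *srcs in program:
        # Sources must read the newest copy of each register.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        if dest in seen:
            # WAR or WAW conflict: give this write a fresh spare register.
            mapping[dest] = f"r{next_spare}"
            next_spare += 1
        new_dest = mapping.get(dest, dest)
        seen.update(srcs)
        seen.add(dest)
        renamed.append((op, new_dest) + new_srcs)
    return renamed


program = [
    ("add", "r1", "r2", "r3"),    # k1 reads r2
    ("mult", "r2", "r4", "r5"),   # k2 overwrites r2: WAR conflict with k1
    ("sub", "r6", "r2", "r1"),    # a later instruction that uses k2's result
]
for instr in rename(program):
    print(instr)
```

In the output k2 writes r33 and the following instruction reads r33 instead of r2, so k2 no longer has to wait for k1.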
Increasing the inner memory
capacity
Processor speed is increasing faster than
memory speed, widening the gap
between the processor's capacity to carry out
calculations and the memory's ability to supply
it with instructions and data
To reduce the consequences of that gap,
processor developers are increasing the
storage capacity on the chip by increasing the
number of internal registers for storing data
and intermediate results
Integrated caches
Further efforts of processor developers led to
the use of cache memory and its integration into the processor chip
The Intel 486DX processor has just one 8 KB cache
for storing the most commonly used
instructions and data
Modern processors have on-chip caches with a
total capacity of several hundred kilobytes
or even several megabytes (for example, the Intel
Sandy Bridge processor has separate 32 KB
L1 (Level 1) caches for data and instructions
per core, a 256 KB L2 cache per core, and a
shared 3-8 MB L3 cache)
Branch prediction
With the introduction of pipelined superscalar
processors like the Intel Pentium, DEC Alpha
21064, and the IBM POWER series, keeping the
pipeline full became more important
When the processor executes a branch instruction, it is
important to determine which instruction must
be fetched after the branch
Branch prediction is the tool that determines
whether a conditional branch in the instruction
flow of a program is likely to be taken or not
Branch prediction
If the prediction is correct, the pipeline continues its work
If the prediction is wrong, the pipeline must be reloaded
with instructions from the correct address
Branch prediction can be implemented in
different ways:
Static prediction is the simplest branch
prediction technique: it predicts the outcome of
a branch based solely on the branch instruction
Dynamic prediction relies on information about
the dynamic state of the code executing on a CPU
Static prediction
A more complex form of static prediction assumes
that backward branches will be taken, and
forward-pointing branches will not be taken
The article "PC Processor Microarchitecture" states:
"Forward branches dominate backward branches by
about 4 to 1 (whether conditional or not). About 60%
of the forward conditional branches are taken, while
approximately 85% of the backward conditional
branches are taken (because of the prevalence of
program loops)"
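The quoted statistics allow a rough estimate of how often such a static rule is right. The sketch below is a back-of-the-envelope calculation only: it assumes the 4 to 1 ratio applies to conditional branches as well, and the function names are illustrative.

```python
FORWARD_SHARE = 4 / 5    # forward : backward is about 4 to 1
BACKWARD_SHARE = 1 / 5
FORWARD_TAKEN = 0.60     # share of forward conditional branches that are taken
BACKWARD_TAKEN = 0.85    # share of backward conditional branches that are taken


def predict_static(branch_addr: int, target_addr: int) -> bool:
    """Backward branches are predicted taken, forward branches not taken."""
    return target_addr < branch_addr


def accuracy(predict_forward_taken: bool, predict_backward_taken: bool) -> float:
    """Expected share of correct predictions under the quoted statistics."""
    fwd = FORWARD_TAKEN if predict_forward_taken else 1 - FORWARD_TAKEN
    bwd = BACKWARD_TAKEN if predict_backward_taken else 1 - BACKWARD_TAKEN
    return FORWARD_SHARE * fwd + BACKWARD_SHARE * bwd


backward_taken_only = accuracy(False, True)   # the rule described above
always_taken = accuracy(True, True)           # simplest alternative

print(round(backward_taken_only, 2))  # 0.49
print(round(always_taken, 2))         # 0.65
```

With these particular figures the forward branches dominate the result, so the estimate is very sensitive to how often forward branches are taken in the workload at hand.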
Dynamic prediction
The hardware adjusts the prediction while
execution proceeds
The prediction is based on the computation history of
the program
During the start-up phase of program execution,
when static branch prediction may be effective,
the history information is gathered; after that, dynamic
branch prediction becomes effective
In general, dynamic branch prediction gives better
results than static branch prediction, but at the cost
of increased hardware complexity
Dynamic prediction
For dynamic prediction, various
structures can be used:
Branch Prediction Buffer or Branch
History Table
Branch Target Buffer
Branch Target Cache
Return Address Stack or Return Stack
Buffer
Branch History Table
To refine branch prediction, we could create a
table that is indexed by the low-order address
bits of recent branch instructions
In this buffer (sometimes called a "Branch
History Table" (BHT)), for each branch
instruction, we'd store a bit that indicates
whether the branch was recently taken
If the BHT's prediction bit indicates the branch
should be taken, then the pipeline can go ahead
and start fetching instructions from the new
address
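A one-bit Branch History Table of this kind can be sketched as follows. The table size (16 entries), the class name and the branch address 0x4010 are assumptions chosen for illustration.

```python
class BranchHistoryTable:
    """One prediction bit per entry, indexed by low-order address bits."""

    def __init__(self, index_bits: int = 4):
        self.mask = (1 << index_bits) - 1
        self.taken = [False] * (1 << index_bits)

    def predict(self, branch_addr: int) -> bool:
        return self.taken[branch_addr & self.mask]

    def update(self, branch_addr: int, was_taken: bool) -> None:
        # Remember only the most recent outcome of the branch.
        self.taken[branch_addr & self.mask] = was_taken


# A loop-closing branch: taken nine times, then falls through once.
bht = BranchHistoryTable()
outcomes = [True] * 9 + [False]
correct = 0
for taken in outcomes:
    if bht.predict(0x4010) == taken:
        correct += 1
    bht.update(0x4010, taken)

print(correct, "of", len(outcomes))  # 8 of 10
```

The two misses are the first iteration (nothing learned yet) and the loop exit. Because only low-order bits are used as the index, two different branches can share one entry and disturb each other's history.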
Branch Target Buffer
In addition to a large BHT, most predictors also
include a buffer that stores the actual target
address of taken branches (along with optional
prediction bits)
This table allows the CPU to look to see if an
instruction is a branch and start fetching at the
target address early on in the pipeline
processing
By storing the instruction address and the target
address, even before the processor decodes the
instruction, it can know that it is a branch
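The lookup described above can be modelled with a table keyed by instruction address. This is a sketch under simplifying assumptions: the buffer is an unbounded dictionary rather than a small hardware table, instructions are assumed to be 4 bytes long, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Maps the address of a taken branch to its target address."""

    def __init__(self):
        self.entries = {}

    def next_fetch(self, pc: int, instr_size: int = 4) -> int:
        """Address to fetch after pc: stored target on a hit, else fall through."""
        return self.entries.get(pc, pc + instr_size)

    def record(self, pc: int, taken: bool, target: int) -> None:
        """Store the actual target once the branch has been resolved."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)


btb = BranchTargetBuffer()
print(hex(btb.next_fetch(0x100)))   # miss: fetch continues sequentially
btb.record(0x100, taken=True, target=0x40)
print(hex(btb.next_fetch(0x100)))   # hit: fetch is redirected to the target
```

The hit is decided from the address alone, before the instruction at 0x100 has been decoded, which is the point made in the text.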
Pipeline and BTB
[Figure: pipeline and BTB. The PC is looked up in a table holding the instruction address, branch address and V bit (or prediction bit); the fetch unit passes address and instruction to the instruction pipeline, and the actual branch address is stored back into the table]
Speculative execution
Speculative execution is the execution of
code whose result may not be needed
Significant time losses are associated with
instructions that require data loaded
from main memory
In order to reduce these losses, some
processors use speculative loads: a
special circuit analyzes the instructions
waiting for execution and, on seeing a load instruction,
starts to execute it out of order so as to minimize
the delay for data from main memory
Speculative load
Original code:
add r1, r3, r4
cmp r1, r5
bcc L1
sub r6, r1, r7
add r6, r2, r1
st var2, r6
…
L1: sub r6, r1, r8
ld r2, var1
add r6, r2, r6
…
With a speculative load (ld moved up, ahead of the branch):
add r1, r3, r4
ld r2, var1
cmp r1, r5
bcc L1
sub r6, r1, r7
add r6, r2, r1
st var2, r6
…
L1: sub r6, r1, r8
Is r2 ready?
add r6, r2, r6
…
Parallel computing options
Various forms of parallel computing can be used
to improve computer performance,
including those in multiprocessor systems:
Multiprocessing: the use of two or more central
processing units (CPUs) within a single computer
system. This significantly increases the complexity of
the motherboard, as well as the system price
Chip multiprocessing (CMP): a processing
system composed of two or more independent cores.
The cores are typically integrated onto a single
integrated circuit die, or they may be integrated onto
multiple dies in a single chip package
Multithreading
Multithreading: the ability of a processor to
be used by more than one thread
(program part) at a time. These threads share
the processor resources but can be
executed independently
Since modern technology allows hundreds of millions
of transistors to be placed on one chip,
it is possible to build into the chip even the
complicated circuits required for dynamic
analysis of program code and for identifying,
initializing and maintaining threads
Multithreading
Time-slice multithreading: the processor
switches from one thread to another regularly
(at fixed intervals)
Switch-on-event multithreading: the processor
switches from one thread to another when a
pause occurs in the current thread (e.g. on a
cache miss)
Simultaneous multithreading: the processor
executes different threads at once, without
switching from one thread to another. Resources
are allocated dynamically ("if you do not need it,
give it to another")
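The gain from switch-on-event multithreading can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: a thread is a string in which 'c' is one compute cycle and 'm' is a load that misses the cache, a miss blocks its thread for three further cycles, and switching threads costs nothing.

```python
def run(threads, miss_penalty=3):
    """Total cycles needed with switch-on-event multithreading.

    threads: list of strings; 'c' = compute cycle, 'm' = cache miss.
    """
    pos = [0] * len(threads)        # next operation of each thread
    ready_at = [0] * len(threads)   # cycle at which a blocked thread may resume
    current = 0
    cycle = 0
    while any(p < len(t) for p, t in zip(pos, threads)):
        runnable = [i for i in range(len(threads))
                    if pos[i] < len(threads[i]) and ready_at[i] <= cycle]
        if not runnable:
            cycle += 1              # every thread waits for memory: idle cycle
            continue
        if current not in runnable:
            current = runnable[0]   # the event (stall or finish) causes a switch
        op = threads[current][pos[current]]
        pos[current] += 1
        if op == "m":
            ready_at[current] = cycle + 1 + miss_penalty
        cycle += 1
    return max([cycle] + ready_at)


one_after_another = run(["cmc"]) + run(["cmc"])
interleaved = run(["cmc", "cmc"])
print(one_after_another, interleaved)  # 12 8
```

While one thread waits for memory the other one computes, so part of the miss latency is hidden; the processor still idles when both threads are waiting.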
Hyper-Threading
[Figure: use of execution resources over time (CPU cycles) by Thread 0 and Thread 1 in a superscalar architecture, a multiprocessor architecture and with Hyper-Threading]
Benefit of Hyper-Threading
Source:
Multi-core processors
Increasing performance by raising the clock
frequency is becoming more and more
difficult
Instead, processor manufacturers have focused their efforts
on increasing parallelism: developing dual-core
processors first, and later moving to multi-core
processors
This is a logical outcome of the development of
multithreading
System on a Chip
Advances in chip technology make it possible
to integrate other computer components on one chip
(such as chipsets, the memory
controller, controllers of peripheral devices etc.)
"System on a Chip", or SoC, refers to the
integration of all the necessary electronic
circuits of diverse functions onto a single chip,
yielding a complete electronic system
that performs the function of a more complex
final product
System on a Chip
This not only makes the core of the computer
more compact, but also significantly
increases its speed, because data
exchange within a chip is much faster
than between chips
In addition, two or more processors
can be implemented on a single chip, creating
the preconditions for developing powerful
multiprocessor systems
Transmeta Crusoe 6000
Look at the Crusoe 6000 (Transmeta), an example of an SoC:
[Figure: Transmeta Crusoe 6000]
Intel Atom SoC
In 2012, Intel expanded the Atom processor family
with a new system on chip (SoC) platform designed
for smartphones and tablets
Atom competes with existing SoCs developed for
the smartphone and tablet market by companies
like Texas Instruments, Nvidia, Qualcomm and
Samsung
Intel has announced an accelerated roadmap for
the Atom SoC, with the 22 nm Silvermont core scheduled
for 2013, and the 14 nm Airmont core for 2014
Intel Atom SoC
Parameter              Intel Atom Z2460     Intel Atom Z2760
Platform Codename      Medfield             Clovertrail
OS/Platform Target     Android Smartphones  Windows 8 Tablets
Manufacturing Process  32nm SoC (P1269)     32nm SoC (P1269)
CPU Cores/Threads      1/2                  2/4
CPU Clock              up to 2.0 GHz        up to 1.8 GHz
GPU                    PowerVR SGX 540      PowerVR SGX 545
GPU Clock              400 MHz              533 MHz
Memory Interface       2 x 32-bit LPDDR2    2 x 32-bit LPDDR2
Atom Z2760 SoC
Atom Z2760 SoC Features
A full x86 desktop, purpose-built processor
for Windows 8
Allows for thin, light tablets on Intel®
architecture and convertible form factors as small as 8.5 mm thin and 1.5 pounds
Power management provides long battery
life and over three weeks of standby time;
standby resume time is less than a second
Atom Z2760 SoC Features
The 32-bit dual-channel memory controller
offers fast memory read/write performance
through efficient pre-fetching algorithms,
low latency and high memory bandwidth.
Includes support for LPDDR2 800 MT/s
data rates, up to 2 GB
Includes Intel® Graphics Media
Accelerator with up to 533 MHz graphics
core frequency
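The memory figures above imply a theoretical peak bandwidth that is easy to work out: two channels, each 32 bits (4 bytes) wide, at 800 million transfers per second. The function name is illustrative, and the result is an upper bound that ignores refresh, protocol overhead and access patterns.

```python
def peak_bandwidth_bytes_per_s(channels: int, bus_width_bits: int,
                               transfers_per_s: int) -> int:
    """Theoretical peak: channels x bytes per transfer x transfers per second."""
    return channels * (bus_width_bits // 8) * transfers_per_s


bw = peak_bandwidth_bytes_per_s(channels=2, bus_width_bits=32,
                                transfers_per_s=800_000_000)
print(bw / 1e9, "GB/s")  # 6.4 GB/s
```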
Atom Z2760 SoC Features
Easily create and share superb HD-quality
video with super-smooth playback quality,
and hardware acceleration support for
video encode and decode
Enables the processor to dynamically
burst to higher performance
Full disk encryption can use hardware-enhanced
AES algorithms to protect data
Atom Z2760 SoC Details