Architecture of Multicore Processors


Department of Information Technologies
COMPUTER ORGANIZATION & TECHNOLOGIES
Topic No. 2. CISC and RISC microprocessors.
(Processor architecture. CISC processors. RISC processors. Architecture of multicore processors.)
Azər Fərhad oğlu Həsənov
www.berkut.ws/teaching.html
[email protected]
office phone: 497-26-00, 24-20
www.berkut.ws/comporgtech.html
Topic No. 2. CISC and RISC microprocessors.
• Processor architecture.
• CISC processors (Complex Instruction Set Computers).
• RISC processors (Reduced Instruction Set Computers).
• Main features of the RISC architecture.
• Registers in RISC processors.
• Advantages and disadvantages of the RISC architecture.
• Architecture of multicore processors.
Processor architecture
• functions and sizes of the processor's registers (register file);
• methods of, and restrictions on, memory access for reading and writing;
• number of operations performed by a single instruction;
• instruction length (variable or fixed);
• number of data types.
Complex Instruction Set Computers
• a large number (hundreds) of different machine instructions, each executed over several CPU clock cycles;
• a control unit with programmable (microprogrammed) logic;
• a small number of general-purpose registers (GPRs);
• instructions of various formats and various lengths;
• predominance of two-address addressing;
• well-developed operand addressing mechanisms, including various indirect addressing modes.
Reduced Instruction Set Computers
• execution of all instructions (or at least 75% of them) in one cycle;
• instructions of a standard one-word length equal to the width of the data bus, which allows uniform pipelined processing of all instructions;
• a small number of instructions (no more than 128);
• a small number of instruction formats (no more than 4);
• a small number of addressing modes (no more than 4);
• memory access only through "load" and "store" instructions;
• all instructions other than "load" and "store" use register-to-register transfers inside the processor;
• a hardwired control unit;
• a relatively large general-purpose register file (at least 32 registers; in modern RISC microprocessors it can exceed 500).
Registers in RISC processors
Architecture of multicore processors
Figure 2.5. Structures of multicore processors from different manufacturers:
a – IBM Power 6; b – Intel Pentium D; c – Intel Core 2 Quad; d – Intel Nehalem;
e – Intel Itanium 3 Tukwila; f – AMD Phenom X4; g – Sun UltraSPARC T2; h – Intel Core i7-990X.
William Stallings
Computer Organization and Architecture
9th Edition
Chapter 15
Reduced Instruction Set Computers (RISC)
Table 15.1 Characteristics of Some CISCs, RISCs, and Superscalar Processors
Instruction Execution Characteristics
High-level languages (HLLs)
• Allow the programmer to express algorithms more concisely
• Allow the compiler to take care of details that are not important in the programmer’s expression of algorithms
• Often support naturally the use of structured programming and/or object-oriented design
Semantic gap
• The difference between the operations provided in HLLs and those provided in computer architecture
Operations performed
• Determine the functions to be performed by the processor and its interaction with memory
Operands used
• The types of operands and the frequency of their use determine the memory organization for storing them and the addressing modes for accessing them
Execution sequencing
• Determines the control and pipeline organization
Table 15.2 Weighted Relative Dynamic Frequency of HLL Operations [PATT82a]
Table 15.3 Dynamic Percentage of Operands
Table 15.4 Procedure Arguments and Local Scalar Variables
Implications
• HLLs can best be supported by optimizing performance of the most time-consuming features of typical HLL programs
• Three elements characterize RISC architectures:
  • Use a large number of registers or use a compiler to optimize register usage
  • Careful attention needs to be paid to the design of instruction pipelines
  • Instructions should have predictable costs and be consistent with a high-performance implementation
The Use of a Large Register File
Software Solution
• Requires the compiler to allocate registers
• Allocates registers to the variables used most in a given period of time
• Requires sophisticated program analysis
Hardware Solution
• More registers
• Thus more variables will be in registers
Overlapping Register Windows
Circular Buffer Organization of Overlapped Windows
Global Variables
• Variables declared as global in an HLL can be assigned memory locations by the compiler, and all machine instructions that reference these variables will use memory-reference operands
  • However, for frequently accessed global variables this scheme is inefficient
• Alternative is to incorporate a set of global registers in the processor
  • These registers would be fixed in number and available to all procedures
  • A unified numbering scheme can be used to simplify the instruction format
• There is an increased hardware burden to accommodate the split in register addressing
• In addition, the linker must decide which global variables should be assigned to registers
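A unified numbering scheme over global registers and overlapping windows might be sketched as follows. This is only an illustration under assumed sizes: the constants (8 global registers, 24 windowed registers, an overlap of 8, 8 windows) and the function name `physical_register` are hypothetical and are not taken from the slides or from any particular processor.

```python
NUM_GLOBALS = 8      # logical registers 0..7 are global (assumed size)
WINDOW_SIZE = 24     # logical registers 8..31 belong to the current window (assumed)
OVERLAP = 8          # caller's temporaries double as the callee's parameters (assumed)
NUM_WINDOWS = 8      # number of windows in the circular buffer (assumed)

STRIDE = WINDOW_SIZE - OVERLAP            # physical registers added per window
WINDOW_FILE_SIZE = NUM_WINDOWS * STRIDE   # size of the circular window file


def physical_register(logical, cwp):
    """Map a logical register number to a physical one.

    cwp is the current window pointer. Returns ("global", n) for the
    shared registers and ("window", n) for a slot in the circular buffer.
    """
    if not 0 <= logical < NUM_GLOBALS + WINDOW_SIZE:
        raise ValueError("logical register out of range")
    if logical < NUM_GLOBALS:
        return ("global", logical)        # same register for every procedure
    offset = logical - NUM_GLOBALS
    # Modulo arithmetic makes the window file behave as a circular buffer.
    return ("window", (cwp * STRIDE + offset) % WINDOW_FILE_SIZE)


caller_cwp = 0
callee_cwp = (caller_cwp + 1) % NUM_WINDOWS   # a call advances the window pointer

# The caller's first temporary (r24) and the callee's first parameter (r8)
# land on the same physical register, so arguments are passed without copying.
shared = physical_register(24, caller_cwp) == physical_register(8, callee_cwp)
```

The split between global and windowed numbers is the extra hardware burden the slide mentions: every register reference has to be checked against the boundary before the window offset is applied.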
Table 15.5 Characteristics of Large-Register-File and Cache Organizations
Referencing a Scalar
Graph Coloring Approach
Why CISC (Complex Instruction Set Computer)?
• There is a trend toward richer instruction sets, which include a larger number of instructions and more complex instructions
• Two principal reasons for this trend:
  • A desire to simplify compilers
  • A desire to improve performance
• There are two advantages to smaller programs:
  • The program takes up less memory
  • Should improve performance
    • Fewer instructions means fewer instruction bytes to be fetched
    • In a paging environment smaller programs occupy fewer pages, reducing page faults
    • More instructions fit in cache(s)
Table 15.6 Code Size Relative to RISC I
Characteristics of Reduced Instruction Set Architectures
One machine instruction per machine cycle
• Machine cycle: the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register
Register-to-register operations
• Only simple LOAD and STORE operations accessing memory
• This simplifies the instruction set and therefore the control unit
Simple addressing modes
• Simplifies the instruction set and the control unit
Simple instruction formats
• Generally only one or a few formats are used
• Instruction length is fixed and aligned on word boundaries
• Opcode decoding and register operand accessing can occur simultaneously
Comparison of Register-to-Register and Memory-to-Memory Approaches
Table 15.7
Characteristics of Some Processors
The Effects of Pipelining
Optimization of Pipelining
• Delayed branch
  • Does not take effect until after execution of the following instruction
  • This following instruction is the delay slot
• Delayed load
  • The register that is the target of the load is locked by the processor
  • Execution of the instruction stream continues until that register is required
  • The processor then idles until the load is complete
  • Re-arranging instructions can allow useful work while loading
• Loop unrolling
  • Replicate body of loop a number of times
  • Iterate loop fewer times
  • Reduces loop overhead
  • Increases instruction parallelism
  • Improved register, data cache, or TLB locality
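The delayed branch can be illustrated with a toy interpreter. This is a minimal sketch, not a model of any real processor: the instruction names (`LOAD`, `ADDI`, `ADD`, `STORE`, `JUMP`, `NOOP`), the register names, and the function `run` are all assumptions made for the example. With `delayed=True`, a jump takes effect only after the instruction in its delay slot has executed.

```python
def run(program, memory, delayed):
    """Interpret a toy program and return (final memory, addresses executed).

    Addresses are list indices. With delayed=True a JUMP takes effect only
    after the instruction that follows it (the delay slot) has executed.
    """
    memory = dict(memory)                 # work on a copy
    regs = {"rA": 0, "rB": 0}
    trace = []
    pc = 0
    pending_target = None                 # jump waiting for its delay slot
    while pc < len(program):
        op, *args = program[pc]
        trace.append(pc)
        jump_now, pending_target = pending_target, None
        next_pc = pc + 1
        if op == "LOAD":                  # LOAD mem, reg
            regs[args[1]] = memory[args[0]]
        elif op == "ADDI":                # ADDI constant, reg
            regs[args[1]] += args[0]
        elif op == "ADD":                 # ADD src, dst
            regs[args[1]] += regs[args[0]]
        elif op == "STORE":               # STORE reg, mem
            memory[args[1]] = regs[args[0]]
        elif op == "JUMP":
            if delayed:
                pending_target = args[0]  # taken after the delay slot
            else:
                next_pc = args[0]         # taken immediately
        elif op != "NOOP":
            raise ValueError(f"unknown opcode {op!r}")
        if jump_now is not None:          # this instruction was a delay slot
            next_pc = jump_now
        pc = next_pc
    return memory, trace


LOAD_X = ("LOAD", "X", "rA")
INC = ("ADDI", 1, "rA")
SKIPPED = ("ADD", "rA", "rB")             # jumped over in every variant
STORE_Z = ("STORE", "rA", "Z")

normal = [LOAD_X, INC, ("JUMP", 4), SKIPPED, STORE_Z]
delayed_noop = [LOAD_X, INC, ("JUMP", 5), ("NOOP",), SKIPPED, STORE_Z]
delayed_reordered = [LOAD_X, ("JUMP", 4), INC, SKIPPED, STORE_Z]

mem = {"X": 41, "Z": 0}
normal_result = run(normal, mem, delayed=False)
noop_result = run(delayed_noop, mem, delayed=True)
reordered_result = run(delayed_reordered, mem, delayed=True)
```

All three variants store the same value in `Z`. The `NOOP` version executes one extra instruction, while the reordered version fills the delay slot with the increment and so does useful work in the slot instead of wasting it.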
Table 15.8 Normal and Delayed Branch
Use of the Delayed Branch
do i=2, n-1
  a[i] = a[i] + a[i-1] * a[i+1]
end do
Becomes
do i=2, n-2, 2
  a[i] = a[i] + a[i-1] * a[i+1]
  a[i+1] = a[i+1] + a[i] * a[i+2]
end do
if (mod(n-2,2) = 1) then
  a[n-1] = a[n-1] + a[n-2] * a[n]
end if
Loop Unrolling Twice Example
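The equivalence of the two loops can be checked with a short sketch. It assumes the pseudocode's indexing, where `n` is the last valid index of `a`; the function names `rolled` and `unrolled_twice` are made up for the illustration.

```python
def rolled(a):
    """Original loop: i runs from 2 to n-1, where n is the last index of a."""
    a = list(a)
    n = len(a) - 1
    for i in range(2, n):                 # i = 2, 3, ..., n-1
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a


def unrolled_twice(a):
    """Same computation with the loop body replicated twice."""
    a = list(a)
    n = len(a) - 1
    if n < 2:                             # nothing to do; also guards the cleanup step
        return a
    for i in range(2, n - 1, 2):          # i = 2, 4, ..., at most n-2
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
    if (n - 2) % 2 == 1:                  # odd trip count: one iteration is left over
        a[n - 1] = a[n - 1] + a[n - 2] * a[n]
    return a


# Compare both versions on arrays of several lengths.
all_equal = all(
    rolled(list(range(n + 1))) == unrolled_twice(list(range(n + 1)))
    for n in range(2, 12)
)
```

The unrolled loop performs the same updates in the same order, so the results match exactly; what changes is that the loop test and index increment run about half as often.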
RISC versus CISC Controversy
• Quantitative
  • Compare program sizes and execution speeds of programs on RISC and CISC machines that use comparable technology
• Qualitative
  • Examine issues of high-level language support and use of VLSI real estate
• Problems with comparisons:
  • No pair of RISC and CISC machines that are comparable in life-cycle cost, level of technology, gate complexity, sophistication of compiler, operating system support, etc.
  • No definitive set of test programs exists
  • Difficult to separate hardware effects from compiler effects
  • Most comparisons done on “toy” rather than commercial products
  • Most commercial devices advertised as RISC possess a mixture of RISC and CISC characteristics
Summary
Chapter 15: Reduced Instruction Set Computers (RISC)
• Instruction execution characteristics
  • Operations
  • Operands
  • Procedure calls
  • Implications
• The use of a large register file
  • Register windows
  • Global variables
  • Large register file versus cache
• Reduced instruction set architecture
  • Characteristics of RISC
  • CISC versus RISC characteristics
• RISC pipelining
  • Pipelining with regular instructions
  • Optimization of pipelining
• MIPS R4000
  • Instruction set
  • Instruction pipeline
• SPARC
  • SPARC register set
  • Instruction set
  • Instruction format
• Compiler-based register optimization
• RISC versus CISC controversy
Chapter 18
Multicore Computers
Alternative Chip Organization
Intel Hardware Trends
Processor Trends
Power
Memory
Power Consumption
• By 2015 we can expect to see microprocessor chips with about 100 billion transistors on a 300 mm² die
• Assuming that about 50-60% of the chip area is devoted to memory, the chip will support cache memory of about 100 MB and leave over 1 billion transistors available for logic
• How to use all those logic transistors is a key design issue
• Pollack’s Rule
  • States that performance increase is roughly proportional to the square root of the increase in complexity
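Pollack's rule can be made concrete with a small calculation. This is a rough sketch: the function name `pollack_performance` is hypothetical, and the multicore figure assumes a perfectly parallel workload with no communication overhead, which real software rarely achieves.

```python
import math


def pollack_performance(complexity_ratio):
    """Relative performance predicted by Pollack's rule.

    complexity_ratio is the factor by which core logic grows; the rule
    predicts performance growing only as its square root.
    """
    if complexity_ratio <= 0:
        raise ValueError("complexity ratio must be positive")
    return math.sqrt(complexity_ratio)


# Spending twice the logic on one larger core: about 1.41x the performance.
one_big_core = pollack_performance(2.0)

# Spending the same logic on two cores of the original size. The factor of 2
# is an idealized upper bound that assumes the workload parallelizes perfectly.
two_small_cores_ideal = 2 * pollack_performance(1.0)
```

Under these assumptions the second option has the higher ceiling, which is one way to read the slide's point about how to spend the available logic transistors.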
Performance Effect of Multiple Cores
Scaling of Database Workloads on Multiple-Processor Hardware
Effective Applications for Multicore Processors
• Multi-threaded native applications
  • Characterized by having a small number of highly threaded processes
  • Lotus Domino, Siebel CRM (Customer Relationship Manager)
• Multi-process applications
  • Characterized by the presence of many single-threaded processes
  • Oracle, SAP, PeopleSoft
• Java applications
  • Java Virtual Machine is a multi-threaded process that provides scheduling and memory management for Java applications
  • Sun’s Java Application Server, BEA’s Weblogic, IBM Websphere, Tomcat
• Multi-instance applications
  • One application running multiple times
  • If multiple application instances require some degree of isolation, virtualization technology can be used to provide each of them with its own separate and secure environment
Hybrid Threading for Rendering Module
Multicore Organization Alternatives
Intel Core Duo Block Diagram
Intel x86 Multicore Organization: Core Duo
• Advanced Programmable Interrupt Controller (APIC)
  • Provides inter-processor interrupts, which allow any process to interrupt any other processor or set of processors
  • Accepts I/O interrupts and routes these to the appropriate core
  • Includes a timer, which can be set by the OS to generate an interrupt to the local core
• Power management logic
  • Responsible for reducing power consumption when possible, thus increasing battery life for mobile platforms
  • Monitors thermal conditions and CPU activity and adjusts voltage levels and power consumption appropriately
  • Includes an advanced power-gating capability that allows for ultra-fine-grained logic control, turning on individual processor logic subsystems only if and when they are needed
• 2 MB shared L2 cache
  • Cache logic allows for a dynamic allocation of cache space based on current core needs
  • MESI support for L1 caches
  • Extended to support multiple Core Duo in SMP
    • L2 cache controller allows the system to distinguish between a situation in which data are shared by the two local cores, and a situation in which the data are shared by one or more caches on the die as well as by an agent on the external bus
• Bus interface
  • Connects to the external bus, known as the Front Side Bus, which connects to main memory, I/O controllers, and other processor chips
Intel Core i7-990X Block Diagram
Table 18.1 Cache Latency
Summary
Chapter 18: Multicore Computers
• Hardware performance issues
  • Increase in parallelism and complexity
  • Power consumption
• Software performance issues
  • Software on multicore
  • Valve game software example
• Multicore organization
• Intel x86 multicore organization
  • Intel Core Duo
  • Intel Core i7-990X
• ARM11 MPCore
  • Interrupt handling
  • Cache coherency
• IBM zEnterprise mainframe
Topic of the next lecture
Topic No. 3. Microarchitecture and microprogramming.
(Examples of microarchitectures. The data path. Microinstructions. The Mic-1 microarchitecture: microinstruction control. IJVM: an example of an instruction set architecture. Microarchitecture of 8051-family microcontrollers. Microprogramming.)
www.berkut.ws/comporgtech.html