Architecture of Multicore Processors
Department of Information Technologies
COMPUTER ORGANIZATION & TECHNOLOGIES
Topic No. 2. CISC- and RISC-type microprocessors.
(Processor architecture. Processors with CISC architecture. Processors with RISC architecture. Architecture of multicore processors.)
Azər Fərhad oğlu Həsənov
www.berkut.ws/teaching.html
[email protected]
work phone: 497-26-00, 24-20
www.berkut.ws/comporgtech.html
Topic No. 2. CISC- and RISC-type microprocessors.
• Processor architecture.
• Processors with CISC architecture (Complex Instruction Set Computers).
• Processors with RISC architecture (Reduced Instruction Set Computers).
• Main features of the RISC architecture.
• Registers in RISC processors.
• Advantages and shortcomings of the RISC architecture.
• Architecture of multicore processors.
Processor Architecture
The architecture of a processor is characterized by:
• the functions and sizes of the processor's registers (the register file);
• the methods of, and restrictions on, memory access for reads and writes;
• the number of operations performed by a single instruction;
• the instruction length (variable or fixed);
• the number of supported data types.
Complex Instruction Set Computers
• a large number (hundreds) of different machine instructions, each executed in several CPU clock cycles;
• a control unit with programmable (microprogram) logic;
• a comparatively small number of general-purpose registers (GPRs);
• instructions of different formats and different lengths;
• support for two-address instructions;
• highly developed operand-addressing mechanisms, including various indirect addressing modes.
Reduced Instruction Set Computers
• all instructions (or at least 75% of them) are executed in a single cycle;
• instructions of a standard word length, equal to the natural machine word and to the width of the data bus, which permits unified pipelined processing of all instructions;
• a smaller number of instructions (no more than 128);
• a smaller number of instruction formats (no more than 4);
• a smaller number of addressing modes (no more than 4);
• memory is accessed only by "load" and "store" instructions;
• all instructions other than "load" and "store" use only register-to-register transfers inside the processor;
• a control unit with hardwired logic;
• a larger general-purpose register file (no fewer than 32 registers; in modern RISC microprocessors it can exceed 500).
Registers in RISC Processors
Architecture of Multicore Processors
Figure 2.5. Structures of multicore processors from different manufacturers: a – IBM Power 6; b – Intel Pentium D; c – Intel Core 2 Quad; d – Intel Nehalem; e – Intel Itanium 3 Tukwila; f – AMD Phenom X4; g – Sun UltraSPARC T2; h – Intel Core i7-990X.
William Stallings, Computer Organization and Architecture, 9th Edition
Chapter 15. Reduced Instruction Set Computers (RISC)
Table 15.1 Characteristics of Some CISCs, RISCs, and Superscalar Processors
Instruction Execution Characteristics
• Operations performed: determine the functions to be performed by the processor and its interaction with memory
• Operands used: the types of operands and the frequency of their use determine the memory organization for storing them and the addressing modes for accessing them
• Execution sequencing: determines the control and pipeline organization
High-level languages (HLLs)
• Allow the programmer to express algorithms more concisely
• Allow the compiler to take care of details that are not important in the programmer's expression of algorithms
• Often support naturally the use of structured programming and/or object-oriented design
Semantic gap
• The difference between the operations provided in HLLs and those provided in computer architecture
Table 15.2 Weighted Relative Dynamic Frequency of HLL Operations [PATT82a]
Table 15.3 Dynamic Percentage of Operands
Table 15.4 Procedure Arguments and Local Scalar Variables
Implications
• HLLs can best be supported by optimizing performance of the most time-consuming features of typical HLL programs
• Three elements characterize RISC architectures:
  • Use a large number of registers or use a compiler to optimize register usage
  • Careful attention needs to be paid to the design of instruction pipelines
  • Instructions should have predictable costs and be consistent with a high-performance implementation
The Use of a Large Register File
Software solution
• Requires compiler to allocate registers
• Allocates based on most used variables in a given time
• Requires sophisticated program analysis
Hardware solution
• More registers
• Thus more variables will be in registers
Overlapping Register Windows
Circular-Buffer Organization of Overlapped Windows
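The two figure titles above refer to the SPARC-style scheme in which procedure parameters and locals live in overlapping register windows arranged as a circular buffer. The short C model below is a minimal sketch of that idea; the window count, the 8-in/8-local/8-out split, and the printed spill/restore handling are illustrative assumptions, not the register file of any particular processor.

/* Toy model of overlapping register windows in a circular buffer.
 * The caller's "out" registers are physically the callee's "in" registers. */
#include <stdio.h>

#define NWIN  8                    /* physical windows (assumed)            */
#define NPHYS (NWIN * 16)          /* 8 locals + 8 shared in/out per window */

static int phys[NPHYS];            /* physical register file (circular)     */
static int cwp = 0;                /* current window pointer                */
static int used = 1;               /* windows currently holding call frames */

/* Map the current window's logical registers onto physical registers. */
static int *in_reg(int i)    { return &phys[(cwp * 16 + i)      % NPHYS]; }
static int *local_reg(int i) { return &phys[(cwp * 16 + 8 + i)  % NPHYS]; }
static int *out_reg(int i)   { return &phys[(cwp * 16 + 16 + i) % NPHYS]; }

static void proc_call(void) {      /* like SPARC SAVE                       */
    if (used == NWIN) puts("window overflow: spill oldest window to memory");
    else used++;
    cwp = (cwp + 1) % NWIN;
}

static void proc_return(void) {    /* like SPARC RESTORE                    */
    cwp = (cwp + NWIN - 1) % NWIN;
    if (used == 1) puts("window underflow: restore window from memory");
    else used--;
}

int main(void) {
    *out_reg(0) = 42;                 /* caller passes an argument in out[0] */
    proc_call();                      /* "call": shift to the next window    */
    *local_reg(0) = *in_reg(0) + 1;   /* callee sees the argument as in[0]   */
    printf("callee: in[0]=%d local[0]=%d\n", *in_reg(0), *local_reg(0));
    proc_return();
    return 0;
}

Because the caller's "out" registers and the callee's "in" registers map to the same physical cells, parameters are passed without any memory traffic; only when the circular buffer wraps does the oldest window have to be spilled to memory.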
Global Variables
• Variables declared as global in an HLL can be assigned memory locations by the compiler, and all machine instructions that reference these variables will use memory-reference operands
• However, for frequently accessed global variables this scheme is inefficient
• An alternative is to incorporate a set of global registers in the processor:
  • These registers would be fixed in number and available to all procedures
  • A unified numbering scheme can be used to simplify the instruction format
• There is an increased hardware burden to accommodate the split in register addressing
• In addition, the linker must decide which global variables should be assigned to registers
Table 15.5 Characteristics of Large-Register-File and Cache Organizations
Referencing a Scalar
Graph Coloring Approach
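The graph-coloring figure refers to the standard compiler technique for register allocation: symbolic registers are nodes of an interference graph, an edge joins two values that are live at the same time, and each color corresponds to a physical register. The sketch below is a minimal greedy version of that idea; the interference graph, the three physical registers, and the first-fit ordering are illustrative assumptions, not the algorithm of any particular compiler.

/* Toy graph-coloring register allocation: nodes are symbolic registers,
 * an edge means the two values interfere (are live at the same time),
 * and a color is a physical register. Uncolorable nodes would be spilled. */
#include <stdio.h>

#define NSYM  6   /* symbolic registers A..F (hypothetical program) */
#define NPHYS 3   /* available physical registers r0..r2 (assumed)  */

int main(void) {
    /* interference[i][j] = 1 if symbolic registers i and j are live
     * simultaneously (illustrative data, not from the text). */
    int interference[NSYM][NSYM] = {
        {0,1,1,0,0,0},
        {1,0,1,1,0,0},
        {1,1,0,1,1,0},
        {0,1,1,0,1,1},
        {0,0,1,1,0,1},
        {0,0,0,1,1,0},
    };
    int color[NSYM];

    for (int i = 0; i < NSYM; i++) {
        int used[NPHYS] = {0};
        for (int j = 0; j < i; j++)      /* colors taken by colored neighbors */
            if (interference[i][j] && color[j] >= 0)
                used[color[j]] = 1;
        color[i] = -1;                   /* -1 means "spill to memory" */
        for (int c = 0; c < NPHYS; c++)
            if (!used[c]) { color[i] = c; break; }
        if (color[i] >= 0)
            printf("symbolic %c -> physical r%d\n", 'A' + i, color[i]);
        else
            printf("symbolic %c -> spilled to memory\n", 'A' + i);
    }
    return 0;
}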
Why CISC? (Complex Instruction Set Computer)
• There is a trend toward richer instruction sets, which include a larger number of instructions and more complex instructions
• Two principal reasons for this trend: a desire to simplify compilers, and a desire to improve performance
• There are two advantages to smaller programs: the program takes up less memory, and smaller programs should improve performance
  • Fewer instructions means fewer instruction bytes to be fetched
  • In a paging environment, smaller programs occupy fewer pages, reducing page faults
  • More instructions fit in the cache(s)
Table 15.6 Code Size Relative to RISC I
Characteristics of Reduced Instruction Set Architectures
One machine instruction per machine cycle
• Machine cycle: the time it takes to fetch two operands from registers, perform an ALU operation, and store the result in a register
Register-to-register operations
• Only simple LOAD and STORE operations access memory
• This simplifies the instruction set and therefore the control unit
Simple addressing modes
• Simplifies the instruction set and the control unit
Simple instruction formats
• Generally only one or a few formats are used
• Instruction length is fixed and aligned on word boundaries
• Opcode decoding and register operand accessing can occur simultaneously
Comparison of Register-to-Register and Memory-to-Memory Approaches
Table 15.7 Characteristics of Some Processors
The Effects of Pipelining
Optimization of Pipelining
Delayed branch
• Does not take effect until after execution of the following instruction
• This following instruction is the delay slot
Delayed load
• The register that is to be the target is locked by the processor
• Execution of the instruction stream continues until the register is required
• The processor idles until the load is complete
• Re-arranging instructions can allow useful work to be done while loading
Loop unrolling
• Replicate the body of the loop a number of times
• Iterate the loop fewer times
• Reduces loop overhead
• Increases instruction parallelism
• Improves register, data cache, or TLB locality
Table 15.8 Normal and Delayed Branch
Use of the Delayed Branch
do i=2, n-1
a[i] = a[i] + a[i-1] * a[i+1]
end do
Becomes
do i=2, n-2, 2
a[i] = a[i] + a[i-1] * a[i+1]
a[i+1] = a[i+1] + a[i] * a[i+2]
end do
if (mod(n-2,2) = 1) then
a[n-1] = a[n-1] + a[n-2] * a[n]
end if
Loop Unrolling Twice Example
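For comparison, the same unrolling-by-two transformation written in C; the function name smooth and the 1-based indexing convention are assumptions carried over from the pseudocode above.

/* Loop unrolling by a factor of two, mirroring the example above.
 * Indices follow the original 1-based pseudocode, so a[] is assumed to
 * have at least n+1 elements (a[0] unused). */
void smooth(double a[], int n) {
    int i;
    /* original loop: for (i = 2; i <= n-1; i++) a[i] += a[i-1] * a[i+1]; */
    for (i = 2; i + 1 <= n - 1; i += 2) {      /* two iterations per pass */
        a[i]   = a[i]   + a[i-1] * a[i+1];
        a[i+1] = a[i+1] + a[i]   * a[i+2];
    }
    if (i == n - 1)                            /* leftover odd iteration  */
        a[n-1] = a[n-1] + a[n-2] * a[n];
}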
RISC versus CISC Controversy
Quantitative
• Compare program sizes and execution speeds of programs on RISC and CISC machines that use comparable technology
Qualitative
• Examine issues of high-level language support and use of VLSI real estate
Problems with comparisons:
• There is no pair of RISC and CISC machines that are comparable in life-cycle cost, level of technology, gate complexity, sophistication of compiler, operating system support, etc.
• No definitive set of test programs exists
• It is difficult to separate hardware effects from compiler effects
• Most comparisons have been done on "toy" machines rather than commercial products
• Most commercial devices advertised as RISC possess a mixture of RISC and CISC characteristics
Summary: Chapter 15, Reduced Instruction Set Computers (RISC)
• Instruction execution characteristics: operations; operands; procedure calls; implications
• The use of a large register file: register windows; global variables; large register file versus cache
• Reduced instruction set architecture: characteristics of RISC; CISC versus RISC characteristics
• RISC pipelining: pipelining with regular instructions; optimization of pipelining
• MIPS R4000: instruction set; instruction pipeline
• SPARC: SPARC register set; instruction set; instruction format
• Compiler-based register optimization
• RISC versus CISC controversy
Chapter 18. Multicore Computers
Alternative Chip Organization
Intel Hardware Trends: Processor Trends, Power, Memory
Power Consumption
• By 2015 we can expect to see microprocessor chips with about 100 billion transistors on a 300 mm² die
• Assuming that about 50–60% of the chip area is devoted to memory, the chip will support cache memory of about 100 MB and leave over 1 billion transistors available for logic
• How to use all those logic transistors is a key design issue
• Pollack's Rule: states that performance increase is roughly proportional to the square root of the increase in complexity (a short worked example follows below)
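A short worked reading of Pollack's Rule, with illustrative numbers that are not from the slides:

\[ \text{Perf} \propto \sqrt{\text{Complexity}} \]

Spending twice the logic on a single larger core therefore buys only about \(\sqrt{2} \approx 1.4\) times the performance, while spending the same transistor budget on two cores of the original size offers up to 2 times the throughput on parallel work, which is the economic argument for multicore designs.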
Performance Effect of Multiple Cores
Scaling of Database Workloads on Multiple-Processor Hardware
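The performance effect of multiple cores is commonly analyzed with Amdahl's law. Writing f for the parallelizable fraction of the workload and N for the number of cores, and omitting parallelization overhead:

\[ \text{Speedup}(N) = \frac{1}{(1 - f) + \dfrac{f}{N}} \]

For example, with f = 0.9 and N = 8 cores the speedup is 1 / (0.1 + 0.9/8) ≈ 4.7, well short of 8, which is why such scaling curves flatten as cores are added. The numbers here are illustrative, not data from the referenced figures.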
Effective Applications for Multicore Processors
• Multi-threaded native applications
  • Characterized by having a small number of highly threaded processes
  • Examples: Lotus Domino, Siebel CRM (Customer Relationship Manager)
• Multi-process applications
  • Characterized by the presence of many single-threaded processes
  • Examples: Oracle, SAP, PeopleSoft
• Java applications
  • The Java Virtual Machine is a multi-threaded process that provides scheduling and memory management for Java applications
  • Examples: Sun's Java Application Server, BEA's Weblogic, IBM WebSphere, Tomcat
• Multi-instance applications
  • One application running multiple times
  • If multiple application instances require some degree of isolation, virtualization technology can be used to provide each of them with its own separate and secure environment
Hybrid Threading for Rendering Module
Multicore Organization Alternatives
Intel Core Duo Block Diagram
Intel x86 Multicore Organization: Intel Core Duo
Advanced Programmable Interrupt Controller (APIC)
• Provides inter-processor interrupts, which allow any process to interrupt any other processor or set of processors
• Accepts I/O interrupts and routes these to the appropriate core
• Includes a timer which can be set by the OS to generate an interrupt to the local core
Power management logic
• Responsible for reducing power consumption when possible, thus increasing battery life for mobile platforms
• Monitors thermal conditions and CPU activity and adjusts voltage levels and power consumption appropriately
• Includes an advanced power-gating capability that allows for ultra-fine-grained logic control, turning on individual processor logic subsystems only if and when they are needed
Intel x86 Multicore Organization: Intel Core Duo (continued)
2 MB shared L2 cache
• Cache logic allows for a dynamic allocation of cache space based on current core needs
• MESI support for the L1 caches, extended to support multiple Core Duo chips in an SMP configuration (a minimal MESI state sketch follows this list)
• The L2 cache controller allows the system to distinguish between a situation in which data are shared by the two local cores, and a situation in which the data are shared by one or more caches on the die as well as by an agent on the external bus
Bus interface
• Connects to the external bus, known as the Front Side Bus, which connects to main memory, I/O controllers, and other processor chips
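The slide above only names MESI; the following is a minimal sketch of the four cache-line states and a few representative transitions. The event names and the simplified transition rules are illustrative assumptions, not the actual Core Duo cache controller.

/* Minimal MESI cache-coherence sketch: four line states and a few
 * representative transitions seen by a single cache line. */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

static const char *name(mesi_t s) {
    static const char *n[] = { "Invalid", "Shared", "Exclusive", "Modified" };
    return n[s];
}

/* Next state of one cache line when it observes an event. other_copies
 * tells whether another cache holds the line (used on a local read miss
 * to choose Shared vs Exclusive). */
static mesi_t next_state(mesi_t s, event_t e, int other_copies) {
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID) return other_copies ? SHARED : EXCLUSIVE;
        return s;                                        /* read hit: no change  */
    case LOCAL_WRITE:
        return MODIFIED;                                 /* gain exclusive ownership */
    case REMOTE_READ:
        if (s == MODIFIED || s == EXCLUSIVE) return SHARED;  /* supply the data  */
        return s;
    case REMOTE_WRITE:
        return INVALID;                                  /* another core now owns it */
    }
    return s;
}

int main(void) {
    mesi_t line = INVALID;
    line = next_state(line, LOCAL_READ, 0);   printf("-> %s\n", name(line)); /* Exclusive */
    line = next_state(line, LOCAL_WRITE, 0);  printf("-> %s\n", name(line)); /* Modified  */
    line = next_state(line, REMOTE_READ, 1);  printf("-> %s\n", name(line)); /* Shared    */
    line = next_state(line, REMOTE_WRITE, 1); printf("-> %s\n", name(line)); /* Invalid   */
    return 0;
}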
Intel Core i7-990X Block Diagram
Table 18.1 Cache Latency
Summary: Chapter 18, Multicore Computers
• Hardware performance issues: increase in parallelism and complexity; power consumption
• Software performance issues: software on multicore; Valve game software example
• Multicore organization
• Intel x86 multicore organization: Intel Core Duo; Intel Core i7-990X
• ARM11 MPCore: interrupt handling; cache coherency
• IBM zEnterprise mainframe
Topic of the Next Lecture
Topic No. 3. Microarchitecture and microprogramming.
(Examples of microarchitectures. The data path. Microinstructions. The Mic-1 microarchitecture – control of microinstructions. IJVM – an example of an instruction set architecture. The microarchitecture of 8051-family microcontrollers. Microprogramming.)
www.berkut.ws/comporgtech.html