CREST Overview - CoC wiki server


Multicore: Commercial Processors
ECE 4100/6100 (1)
Some Examples
• Desktop and Server/Enterprise Space
– Intel
– AMD
– SUN Microsystems
• The Embedded Space: Freescale Semiconductor
ECE 4100/6100 (2)
Focus
• The Chip Level Architecture
– What do we have on chip?
• The Core Architecture
– Note the presence/absence/configuration
of concepts studied earlier in class
– Rationalize the design decisions that led to
the preceding
– What can/should we expect next?
ECE 4100/6100 (3)
The Intel Core Duo Processor Series
ECE 4100/6100 (4)
Intel Core Duo
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared memory
• Traditional I/O
• Cores: classic OOO (reservation stations, issue ports, schedulers, etc.)
• L2 cache: large, shared, set associative, with prefetch, etc.
Source: Intel Corp.
ECE 4100/6100 (5)
Intel Core Duo: Vital Stats
• 151 million transistors; shared 2 MB L2 cache
• Each core has a 12-stage pipeline (Yonah)
• Low-power (less than 25 watts) dual-core microprocessor
• Supports Intel’s Vanderpool virtualization technology
• EM64T (Intel x86-64 extensions) is not supported
– Desktop market: not severe, given the lack of 64-bit OS and software
– Sossaman processor for servers, which is based on Yonah, also lacks
EM64T support → severe disadvantage
• Communication between the L2 cache and both execution cores is
handled by an arbitration bus unit
– Eliminates cache coherency traffic over the FSB
– Raises the core-to-L2 latency
– The increase in clock frequency offsets the impact
• Core processors communicate with the system chipset over a 667
MT/s front side bus (FSB), up from 533 MT/s used by the fastest
Pentium M.
• Intel Core Solo uses the same two-core die as the Core Duo, but
features only one active core
– Chips with one core failing quality control can still be sold
– Core 2 Duo processors will also include the ability to disable one core to
conserve power
ECE 4100/6100 (6)
The Core™ micro-architecture
Source: Ars Technica
ECE 4100/6100 (7)
The Core Execution core
Source: Ars Technica
ECE 4100/6100 (8)
Intel Core Duo
• High memory latency due to the lack of an on-die
memory controller (further aggravated by the system chipset's use of DDR-II RAM)
• Main-memory transactions have to pass through
the Northbridge of the chipset
– Higher latency compared to AMD's Turion platform
– Weakness shared by the entire line of Pentium processors
– L2 cache is quite effective at hiding main-memory latency
• Execution units
– Three 64-bit integer exec units
• one CIU (complex) + two SIU (simple)
– Two FPUs
– Poor floating-point unit (FPU) throughput
• Little gain in "performance per watt" in single-threaded
applications compared to its predecessor
ECE 4100/6100 (9)
Core 2 Duo and Core Duo
• Very similar architectures
• Bump in the processor speed
• Increase in Level 2 cache (2 MB to 4 MB)
• Both chips are built on a 65 nm process and
support a 667 MHz front-side bus (FSB)
• 14-stage pipeline in Core 2 Duo (12 stages in Core Duo)
Source: Intel Corp.
ECE 4100/6100 (10)
Intel® Core™2 Duo Processor
Process Technology: 65 nm
Number of Processor Cores: 2
L2 Cache Size (shared between 2 processor cores): Up to 4 MB
Transistor Gate Height / Gate Oxide Thickness (65 nm): 1.2 nm
Transistor Gate Length (for 65 nm Process Technology): 35 nm
Line Width: 65 nm
Number of Transistors: 291 million
Processor Die Size: 143 mm²
Average Power: <1.1 Watt
ECE 4100/6100 (11)
Intel Core 2 Duo
Source: Hard Core Hardware
ECE 4100/6100 (12)
Wide Dynamic Execution
Source: Bit Tech
ECE 4100/6100 (13)
Wide Dynamic Execution
Source: Bit Tech
ECE 4100/6100 (14)
Wide Dynamic Execution
• Pipeline width of 4 instructions per clock in each core (Pentium M and
Pentium 4 NetBurst handle 3)
• Delivery of more instructions per clock cycle
• Pipeline depth of 14 stages vs. 31 in the Pentium 4 Prescott
– Compromise between efficient execution of short instructions and
long instructions
• Ops fusion
– Less work for the processor pipeline to run
– Micro-ops fusion
• fuses micro-ops that come from the same x86 instruction so they
travel through the pipeline as one
– Macro-ops fusion
• works on the x86 instructions themselves, not just their micro-op
derivatives (for example, a compare followed by a conditional jump)
• Instruction loads and micro-ops can be reduced by approximately 15%
and 10%, respectively
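As a rough illustration of macro-ops fusion, the sketch below scans a list of x86-style mnemonics and merges each compare or test that is immediately followed by a conditional jump into one fused operation. The helper name `fuse_macro_ops`, the mnemonic sets, and the sample stream are illustrative assumptions; this is a toy model, not Intel's decoder logic.

```python
# Toy model of macro-ops fusion: a compare or test followed directly by a
# conditional jump is emitted as one fused operation instead of two.
# The mnemonic sets and the sample stream are illustrative assumptions.

COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def fuse_macro_ops(stream):
    """Return a new list in which adjacent compare + conditional-jump pairs are fused."""
    fused = []
    i = 0
    while i < len(stream):
        op = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if op in COMPARES and nxt in COND_JUMPS:
            fused.append(f"{op}+{nxt}")  # the pair occupies a single slot
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused


sample = ["mov", "add", "cmp", "jne", "mov", "test", "je", "cmp", "mov"]
fused_sample = fuse_macro_ops(sample)
print(fused_sample)
print(f"{len(sample)} instructions -> {len(fused_sample)} slots after fusion")
```

Note that the trailing `cmp` in the sample is not fused because the instruction after it is not a conditional jump; only adjacent pairs qualify in this model.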
ECE 4100/6100 (15)
Intelligent Power Capability
Source: Bit Tech
ECE 4100/6100 (16)
Intelligent Power Capability
• SpeedStep technology
– Dynamic clock speed reduction
– Intel mobile processors include this already
– Enhanced SpeedStep used in Core 2 Duo
• Controller that turns on sections of the
processor as needed. One core can be
shut down for single-threaded
applications
ECE 4100/6100 (17)
Advanced Smart Cache
Source: Bit Tech
ECE 4100/6100 (18)
Advanced Smart Cache
Source: Bit Tech
• Both cores share data stored in the L2 cache via an arbitration bus unit
embedded in the cache.
– Dynamically allocates cache space between the two cores, minimising bus
traffic by allowing both cores to access one copy of data
• Does a larger L2 cache matter?
– Studies point out that going from 2 MB to 4 MB improves execution time
only slightly (2-4%) for most applications
ECE 4100/6100 (19)
Smart Memory Access
Source: Bit Tech
ECE 4100/6100 (20)
Smart Memory Access
• Improved prefetch units
• Memory disambiguation
– Allows re-ordering instructions more efficiently
[Figure: execution with and without memory disambiguation, illustrating memory aliasing]
Source: Ars Technica
Example from
http://arstechnica.com/articles/paedia/cpu/core.ars/8
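To make the idea of memory disambiguation concrete, here is a minimal sketch that decides which loads may be moved ahead of earlier stores. It assumes every address is already known, which is a simplification: real hardware has to predict whether a load aliases an earlier store and recover when the prediction is wrong. The function name and the tiny program are hypothetical.

```python
# Toy model of memory disambiguation. Each op is (kind, address).
# Without disambiguation a load waits for every earlier store; with it, a
# load waits only for earlier stores to the same address (true aliases).
# Assumption: addresses are known up front; real hardware has to predict.

def loads_free_to_hoist(ops, disambiguate):
    """Return indices of loads that may execute ahead of all earlier stores."""
    free = []
    earlier_stores = []
    for idx, (kind, addr) in enumerate(ops):
        if kind == "store":
            earlier_stores.append(addr)
        elif kind == "load":
            if not earlier_stores:
                free.append(idx)
            elif disambiguate and addr not in earlier_stores:
                free.append(idx)
    return free


program = [
    ("store", 0x100),
    ("load", 0x200),   # different address: safe to move above the store
    ("load", 0x100),   # aliases the store: must wait for it
]
print(loads_free_to_hoist(program, disambiguate=False))
print(loads_free_to_hoist(program, disambiguate=True))
```

In this model the conservative scheduler frees no loads, while the disambiguating one frees the load to the unrelated address.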
ECE 4100/6100 (21)
Advanced Digital Media Boost
Source: Bit Tech
ECE 4100/6100 (22)
Advanced Digital Media Boost
• Streaming SIMD Extension (SSE) instructions
– SSE instructions are an extension of the standard
x86 instruction set.
– Utilized in multimedia encoding, decoding, image
manipulation and encryption
• 128-bit SSE instructions now execute on a full 128-bit datapath
– Up from a 64-bit datapath
– Double the SSE performance over previous
generation
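The doubling follows from simple width arithmetic, sketched below: a 128-bit SSE operation needs two passes through a 64-bit execution unit but only one through a 128-bit unit. The helper names `lanes` and `passes_needed` are hypothetical, and the model ignores every other pipeline effect.

```python
import math

SSE_REGISTER_BITS = 128  # width of an SSE (XMM) register


def lanes(element_bits):
    """Number of packed elements that fit in one 128-bit SSE register."""
    return SSE_REGISTER_BITS // element_bits


def passes_needed(datapath_bits):
    """Passes through the execution unit to process one 128-bit SSE operation."""
    return math.ceil(SSE_REGISTER_BITS / datapath_bits)


print("packed 32-bit floats per register:", lanes(32))
print("packed 64-bit doubles per register:", lanes(64))
print("passes on a 64-bit datapath:", passes_needed(64))
print("passes on a 128-bit datapath:", passes_needed(128))
```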
ECE 4100/6100 (23)
Comparison of SSE to prior
processors
Source: Ars Technica
ECE 4100/6100 (24)
Intel Conroe vs. Presler
• What is the major difference?
– Shared L2 (Conroe) versus separate caches (Presler)
[Figure: Conroe and Presler die diagrams]
Source: Bit Tech
ECE 4100/6100 (25)
Intel’s Roadmap for Multicore
[Figure: roadmap chart for mobile, desktop, and enterprise processors over 2006, 2007, and 2008. Entries range from single-core (SC) parts with 512 KB to 2 MB of cache, through dual-core (DC) and quad-core (QC) parts, to 8-core (8C) parts with 12 MB of shared cache at 45 nm.]
Source: Adapted from Tom’s Hardware
• Drivers are
– Market segments
– More cache
– More cores
• 80 core processor prototype has been designed!
ECE 4100/6100 (26)
Intel Chipset Example
Source: Extreme Tech
ECE 4100/6100 (27)
References and Links
• http://www.intel.com/products/processor/coreduo/
• http://en.wikipedia.org/wiki/Intel_Core
• http://www.hothardware.com/viewarticle.aspx?articleid=845&cid=1
• http://www.bit-tech.net/hardware/2006/03/10/intel_core_microarchitecture/
• http://www.bit-tech.net/hardware/2006/05/19/intel_core_duo_t2600_on_the_desktop
• http://www.bit-tech.net/hardware/2006/07/14/intel_core_2_duo_processors/
• http://www.hardcoreware.net/reviews/review-347-1.htm
• http://www.trustedreviews.com/cpu-memory/review/2006/08/28/Intel-Core-2-DuoMerom-Notebooks/p1
• http://www.trustedreviews.com/cpu-memory/review/2006/07/14/Intel-Core-2-DuoConroe-E6400-E6600-E6700-X6800/p1
• http://techreport.com/reviews/2006q2/core-duo/index.x?pg=1
• http://arstechnica.com/articles/paedia/cpu/core.ars/1
• http://www.anandtech.com/mobile/showdoc.aspx?i=2663&p=4
• http://www.extremetech.com/article2/0,1697,1988794,00.asp
• http://www.coreduoinfo.com/blog/about-intel-core-duo/
• http://67.91.114.164/intel_c2d_info.htm
• http://www.pcper.com/article.php?aid=272&type=expert
ECE 4100/6100 (28)
AMD MultiCore Processors
ECE 4100/6100 (29)
Dual Core AMD Opteron
Source: AMD
ECE 4100/6100 (30)
AMD Multicore (Dualcore) Opteron
• Two AMD Opteron CPU cores on a single die
– Each has 1MB L2 cache
– Approximately same die size as 130nm single-core AMD
Opteron processor
• 95 watt power envelope
• 90nm, ~205 million transistors
– fits into 90nm power infrastructure
• Introduced with “K8” Revision E core in April 2005
[Figure: die layout with Core 0 and Core 1, each with a 1-MB L2, and the Northbridge]
Source: AMD
ECE 4100/6100 (31)
Opteron Core Pipeline
Source: Chip Architect
ECE 4100/6100 (32)
AMD Opteron Processor Core Architecture
[Figure: block diagram of the Opteron core. Labels: Branch Prediction; Fetch; 64KB L1 Icache; Scan/Align/Decode; Fastpath; Microcode Engine; µops; Instruction Control Unit (72 entries); Int Decode & Rename; three reservation stations (Res), each feeding an AGU and an ALU, plus MULT; FP Decode & Rename; 36-entry FP scheduler feeding FADD, FMUL, FMISC; 44-entry Load/Store Queue; 64KB L1 Dcache]
Source: The 3D shop
ECE 4100/6100 (33)
Dual Core AMD Opteron
• AMD64 technology
– Runs 32-bit applications and is 64-bit capable
– Compatible with the x86 software infrastructure
– Enables a single architecture across 32- and 64-bit
environments
• Direct Connect Architecture
– NUMA system
• Each processor shares its memory with other processors in
the system
– Integrated Memory Controller on-die
• DDR2 DRAM memory controller offers memory BW up to
10.7 GB/s per processor
– HyperTransport
• Point-to-point interconnect can be used to build a mesh of
multiple-processor Opteron systems
• Scalable bandwidth interconnect between processors, I/O
subsystems, and other chipsets
• 24.0 GB/s peak bandwidth per processor
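The peak figures quoted for the Opteron can be approximately reproduced with back-of-envelope arithmetic. The sketch below assumes a dual-channel DDR2-667 configuration with 8-byte channels and three HyperTransport links at 8 GB/s each; these configurations are assumptions for illustration, not stated on the slide. For comparison it also computes the Core Duo's 667 MT/s FSB, assuming a 64-bit data bus.

```python
# Back-of-envelope check of the peak bandwidth figures.
# Assumed configuration (not stated on the slide): dual-channel DDR2-667
# with 64-bit (8-byte) channels, three HyperTransport links of 8 GB/s,
# and a 64-bit front side bus for the Intel comparison.

def peak_bandwidth_gbs(transfers_per_sec, bytes_per_transfer, channels=1):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    return transfers_per_sec * bytes_per_transfer * channels / 1e9


ddr2 = peak_bandwidth_gbs(667e6, 8, channels=2)
ht_total = 3 * 8.0  # three links at 8 GB/s each
fsb = peak_bandwidth_gbs(667e6, 8)

print(f"DDR2 memory bandwidth: {ddr2:.1f} GB/s")
print(f"HyperTransport total:  {ht_total:.1f} GB/s")
print(f"667 MT/s FSB:          {fsb:.1f} GB/s")
```

Under these assumptions the integrated memory controller alone offers roughly twice the peak bandwidth of a shared front side bus, before any I/O or coherence traffic is counted.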
ECE 4100/6100 (34)
Dual Core AMD Opteron
• Not a simple aggregation of K8 cores
– Integrated the cores for efficiency
• Dual-core Opteron acts very much like a SMP
system
• Compatible with existing single-threaded, multithreaded (hyperthreaded) software
• MOESI coherency protocol (O – “Owned”)
– Updates through system request interface
• SSE3 support with 10 new instructions
• Quad-core upgradeability
• Hardware-assisted AMD Virtualization
• Optimized Power Management
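For readers unfamiliar with MOESI, the following is a minimal textbook-style sketch of how one cache line's state changes in one cache. It is a simplification under stated assumptions: the event names and the `others_have_copy` flag are hypothetical, and it does not model the Opteron's actual system request interface.

```python
# Simplified MOESI state machine for one cache line in one cache.
# States: M (Modified), O (Owned), E (Exclusive), S (Shared), I (Invalid).
# Events: local_read, local_write, remote_read, remote_write.
# This is a textbook-style sketch, not the Opteron's exact implementation.

def next_state(state, event, others_have_copy=False):
    if event == "local_read":
        if state == "I":
            return "S" if others_have_copy else "E"
        return state
    if event == "local_write":
        return "M"            # other copies are invalidated
    if event == "remote_read":
        if state == "M":
            return "O"        # keep the dirty line and supply it to the reader
        if state == "E":
            return "S"
        return state          # O, S, I are unchanged
    if event == "remote_write":
        return "I"
    raise ValueError(f"unknown event: {event}")


state = "I"
trace = [state]
for event in ["local_read", "local_write", "remote_read", "remote_write"]:
    state = next_state(state, event)
    trace.append(state)
print(" -> ".join(trace))
```

The Owned state is what distinguishes MOESI from MESI: a modified line can be shared with another cache without first being written back to memory.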
ECE 4100/6100 (35)
Dual Core AMD Opteron
Source: Elec Design
ECE 4100/6100 (36)
AMD Opteron (SOI)
Source: Chip Architect
ECE 4100/6100 (37)
AMD 64 bit Core
• 1MB L2 Cache
• Detailed discussion of the 64-bit core
architecture at:
– http://chiparchitect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
ECE 4100/6100 (38)
Multiprocessor Systems using AMD Opteron
[Figure: two system diagrams. Legacy x86: CPUs attached to a Memory Controller Hub, with PCI-E bridges and an I/O Hub (USB, PCI). AMD64 Direct Connect: CPUs each with their own SRQ, crossbar, memory controller, and HT links, connected to PCI-E bridges and I/O hubs over links labeled 8 GB/s.]
Source: AMD
Legacy x86 Architecture
• CPUs, Memory, I/O all share a bus
• Major bottleneck to performance
• Faster CPUs or more cores for performance
• Symmetric Multiprocessing
AMD64 Direct Connect Architecture
• Eliminates FSB bottleneck
• HyperTransport™ Technology interconnect for high
bandwidth and low latency
• Each CPU has its own memory
• Each CPU can access the main memory of another
processor, transparent to the programmer → different from SMP
ECE 4100/6100 (39)
Multiprocessor Systems using AMD
Opteron
Source: XBitlabs
ECE 4100/6100 (40)
Cache coherency
Source: Chip Architect
ECE 4100/6100 (41)
AMD Athlon 64 X2
Source: AMD
ECE 4100/6100 (42)
References and Links
• http://techreport.com/reviews/2005q2/opteron-x75/index.x?pg=1
• http://www.tomshardware.com/2005/06/03/dual_core_stress_test/index.html
• http://www.a1-electronics.net/AMD_Section/CPUs/2005/AMD_Athlon64x2_Apr.shtml
• http://en.wikipedia.org/wiki/Opteron
• http://en.wikipedia.org/wiki/Athlon_64_X2
• http://www.amd.com/usen/Processors/ProductInformation/0,,30_118_8796_14309,00.html
• http://chiparchitect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html
• http://firingsquad.com/hardware/amd_dual-core_opteron_875/page2.asp
• http://www.xbitlabs.com/articles/cpu/display/opteron-ws_4.html
• http://www.extremetech.com/article2/0,1697,1675784,00.asp
• http://www.elecdesign.com/Articles/Index.cfm?AD=1&ArticleID=11991
• http://www.the3dshop.com/userimages/amd_systems/opteron_dualcore.htm
• http://www.nextcomputing.com/advantages/thruadv.shtml
• http://arstechnica.com/news.ars/post/20060817-7535.html
• http://www.bit-tech.net/hardware/2005/05/09/amd_a64x2_4800/1.html
ECE 4100/6100 (43)
SUN – UltraSPARC Multicore
ECE 4100/6100 (44)
SUN – UltraSPARC T1
• Eight cores, each 4-way
threaded
• 1.2 GHz
• Cache
– 16K 4-way 32B L1-I
– 8K 4-way 16B L1-D
– 3MB internal L2 cache
partitioned into four banks
and four memory
controllers.
– Data moved between the L2
and the cores using an
integrated crossbar switch
to provide high throughput
Source: Sun
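The L1 parameters listed for the T1 determine how many sets each cache has and how an address splits into offset and index bits. The sketch below works this out, reading "32B" and "16B" as line sizes, which is an assumption about the slide's shorthand; the helper names are hypothetical.

```python
# Cache geometry from size, line size, and associativity.
# Assumption: "32B" and "16B" on the slide are cache line sizes.

def cache_sets(size_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    return size_bytes // (line_bytes * ways)


def address_bits(size_bytes, line_bytes, ways):
    """Return (offset_bits, index_bits) for power-of-two cache parameters."""
    sets = cache_sets(size_bytes, line_bytes, ways)
    return line_bytes.bit_length() - 1, sets.bit_length() - 1


l1i = (16 * 1024, 32, 4)  # T1 L1-I: 16K, 32B lines, 4-way
l1d = (8 * 1024, 16, 4)   # T1 L1-D: 8K, 16B lines, 4-way

print("L1-I sets:", cache_sets(*l1i), "offset/index bits:", address_bits(*l1i))
print("L1-D sets:", cache_sets(*l1d), "offset/index bits:", address_bits(*l1d))
```

Both caches come out at 128 sets under this reading; the instruction cache is larger only because its lines are twice as long.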
ECE 4100/6100 (45)
SUN – UltraSPARC T1
Source: Sun
ECE 4100/6100 (46)
SUN – UltraSPARC T1 Pipeline
• T1's integer pipeline
– Fetch, Thread Selection, Decode, Execute, Memory
Access, Writeback
Source: Sun
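The Thread Selection stage is what lets one T1 core interleave its four threads. A minimal sketch of that idea follows, assuming a plain round-robin policy that skips stalled threads; the actual selection policy in the T1 may differ, and the function names and stall model are hypothetical.

```python
# Toy model of a thread-select pipeline stage for a 4-way threaded core.
# Assumption: simple round-robin among ready threads; a stalled thread
# (for example, one waiting on a cache miss) is skipped until it is ready.

def select_thread(ready, last):
    """Pick the next ready thread after `last` in round-robin order, or None."""
    n = len(ready)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if ready[candidate]:
            return candidate
    return None


def run(cycles, stalled_until):
    """Simulate selection; stalled_until[t] is the first cycle thread t is ready."""
    schedule = []
    last = -1
    for cycle in range(cycles):
        ready = [cycle >= until for until in stalled_until]
        chosen = select_thread(ready, last)
        schedule.append(chosen)
        if chosen is not None:
            last = chosen
    return schedule


# Thread 2 is stalled until cycle 4; the other three threads fill its slots.
print(run(8, stalled_until=[0, 0, 4, 0]))
```

The point of the design is visible in the schedule: the pipeline issues an instruction every cycle even while one thread waits, which is how the T1 hides memory latency without out-of-order execution.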
ECE 4100/6100 (47)
SUN UltraSPARC T2 – Niagara 2
Source: Sun
ECE 4100/6100 (48)
SUN UltraSPARC T2
• UltraSPARC T2 has 8 SPARC cores with 8 threads per core
• 8-stage integer pipeline (as opposed to 6 for T1)
• Twice the performance of T1 with a transactional workload
(under the same power envelope)
• Clock frequency increased to 1.4 GHz from 1.2 GHz
• One PCI Express port (x8 1.0)
• Two 10 Gigabit Ethernet ports with packet classification and
filtering
• L2 cache size increased to 4 MB shared (8 banks, 16-way
associative)
• 1 floating point unit per core
• Eight encryption engines
• Four dual-channel FBDIMM memory controllers
• 711 signal I/O pins, 1831 total
ECE 4100/6100 (49)
UltraSparc T2 Core Microarchitecture
Source: Realworld Tech
ECE 4100/6100 (50)
UltraSparc T2 Memory System
Source: Sun
ECE 4100/6100 (51)
UltraSparc T2 Core Block Diagram
• IFU – Instruction Fetch Unit
– 16 KB I$, 32B lines, 8-way SA
– 64-entry fully-associative ITLB
• EXU0/1 – Integer Execution Units
– 4 threads share each unit
– Executes one integer instrn/cycle
• LSU – Load/Store Unit
– 8KB D$, 16B lines, 4-way SA
– 128-entry fully-associative DTLB
• FGU – Floating/Graphics Unit
• SPU – Stream Processing Unit
– Cryptographic acceleration
• TLU – Trap Logic Unit
– Updates machine state, handles exceptions and interrupts
• MMU – Memory Management Unit
– Hardware tablewalk (HWTW)
– 8KB, 64KB, 4MB, 256MB pages
Source: Sun
ECE 4100/6100 (52)
UltraSparc T2 Core Pipeline
• 8 stages for integer operations:
– Fetch, Cache, Pick, Decode, Execute, Memory,
Bypass, Writeback
– > 3-cycle load-use
– Memory (translation, tag/data access)
– Bypass (late select, formatting)
• 12 stages for floating-point:
– Fetch, Cache, Pick, Decode, Execute, FX1, FX2,
FX3, FX4, FX5, FB, FW
– 6-cycle latency for dependent FP ops
– Longer pipeline for divide/sqrt
ECE 4100/6100 (53)
References and Links
• http://realworldtech.com/page.cfm?ArticleID=RWT090406012516&p=4
• http://www.opensparc.net/cgibin/goto.php?w=/pubs/preszo/06/HotChips06_09_ppt_master.pdf
• http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf
ECE 4100/6100 (54)
The Embedded Multicores
ECE 4100/6100 (55)
Freescale MPC8572 PowerQUICC III Processor
Source: Freescale
ECE 4100/6100 (56)
Freescale MPC8572 PowerQUICC III Processor
• Dual embedded e500 cores with 36-bit physical addressing
• Double-precision floating-point
• Integrated L1/L2 cache
– L1 cache: 32 KB data and 32 KB instruction
– Shared L2 cache: 1 MB with ECC
– L2 configurable as SRAM or cache; I/O transactions can
be stashed into L2 cache regions
• Integrated DDR memory controller with full ECC support
• Integrated security engine, Pattern Matching Engine,
Packet Deflate Engine
• Four on-chip triple-speed Ethernet controllers
ECE 4100/6100 (57)
References and Links
• http://www.freescale.com/files/netcomm/doc/fact_sheet/MPC8572FS.pdf
ECE 4100/6100 (58)
Summary
• Multicore technology spans the product
spectrum
– The downward migration of leading edge
technology continues
• Architectural principles are key to
– Developers: extracting performance
– Designers: improving performance
– Marketing: understanding new markets for
performance
ECE 4100/6100 (59)