Manycores: From Hardware Perspective to Software
MANYCORES – FROM HARDWARE PERSPECTIVE TO SOFTWARE
Presenter: D96943001, Graduate Institute of Electronics Engineering, 陳泓輝
Why Moore’s Law is dying
He is not CEO anymore!!
Walls => ILP, Frequency, Power, Memory walls
ILP: more cost, less return
ILP: instruction-level parallelism
OOO: out-of-order execution of micro-ops
Frequency wall
FO4 delay metric: delay of an inverter driven by an inverter ¼ its size and driving another inverter 4x its size (fan-out of 4)
Freq ↑ => Some OP cycle counts ↑
Saturated!
Memory wall
External access penalty is increasing (the gap)
Solution => enlarge cache
Cache decides the performance and the price
It’s cache that matters!
The power wall
High power might imply
Thermal runaway of device behavior
Larger current => electromigration => reliability issues in the metal interconnect
Hitting the packaging heat limit
Change to high-cost packaging
Cooling noise!!
Form factor
The great wall……
Moore’s Law
CMOS
Multicore
Manycore
Historical - Intel 2007 Xeon
Dual
on chip memory controller => fcpu > 2*fmem
Point-to-point interconnection => fabrics
Multiple communication activities (c.f. “bus” => one activity)
Fabric working notation
AMD – Opteron(Shanghai)
Much the same as Intel Xeon
Shared L3 cache among 2 cores
Game consoles
XBox360 => Triple core (homogeneous)
PS3 => Cell, 8+1 cores (heterogeneous)
PowerPC wins!
State-of-the-art multicore DSP chips
TI TNETV3020
Homogeneous
Freescale 8156
Heterogeneous
State-of-the-art multicore DSP chips
picoChip PC205
Heterogeneous
Tilera TILE64
Homogeneous, Mesh
State-of-the-art multicore x86 chips
Intel Single-chip Cloud Computer
1GHz Pentium
24 “tiles” with two IA cores per tile
A 24-router mesh network with 256 GB/s bisection bandwidth
4 integrated DDR3 memory controllers
Hardware support for message-passing!!
GPGPU - OpenCL
Official logo (image)
Special case: multicore video processor
Characteristics of video applications in consumer electronics
High computational capability
Low hardware cost
Low power consumption
A general solution
Fixed-function logic designs
Challenges
Multiple video decoding standards
Updating video decoding standards
Ill-posed video processing algorithms
Product requirements are diverse and mutually exclusive
mediaDSP technology
Nickname: accelerator
Heterogeneous (programmable and fixed-function units)
A task-based programming model
A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements
A platform, easily extendable to support a range of applications
Broadcom: mediaDSP technology
Easily customized for special purposes
Success stories
SD MPEG video encoder including scaling and noise reduction
Frame-rate-conversion video processing for FHD@60Hz/120Hz videos
Classes of video processing
Highly parallelizable operations on fixed-point data, with no floating point => a processor with a SIMD data path engine
Ad-hoc computation and decision making, operating on smaller sets of data produced by the parallelizable processes => a general processor such as a RISC
Data movement and formatting on multidimensional pixels
Bit-serial processing for entropy decoding and encoding => dedicated hardware does this job very efficiently
Task-based programming model
Programmers’ duties as follows:
Partition a sequential algorithm into a set of parallelizable tasks and then efficiently map it to the massively parallel architecture
A task has a definite initiation time
A task runs until completion with no interruption and no further synchronization with other tasks
Understand the hardware architecture and its limitations
Shared memory (instead of FIFO mode)
Buffer size must be enough for a data unit
Interconnect bandwidth must be enough
Computational power must be enough for real time
(IP) Platform-based architecture
Task-oriented engine (TOE)
A programmable DSP or a fixed function unit
Task control unit (TCU)
A RISC maintains a queue of tasks and synchronizes with other TCUs/TOEs
To maximize the utilization of TOEs
Control engine
Shared memory
Communication fabric
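The TCU's job of keeping a task queue and releasing work to TOEs can be sketched in a few lines. This is only an illustration of the run-to-completion idea, not Broadcom's implementation: the `Task` fields, the engine names and the `run_tcu` function are all hypothetical, and the sketch dispatches tasks one after another rather than in parallel.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    engine: str                               # hypothetical TOE type, e.g. "simd"
    deps: set = field(default_factory=set)    # tasks that must finish first


def run_tcu(tasks):
    """Return the dispatch order as (engine, task name) pairs.

    A task is started only when all of its inputs are done, and it then
    runs to completion with no further synchronization.
    """
    pending = deque(tasks)
    done, schedule = set(), []
    while pending:
        progressed = False
        for _ in range(len(pending)):
            task = pending.popleft()
            if task.deps <= done:
                schedule.append((task.engine, task.name))
                done.add(task.name)
                progressed = True
            else:
                pending.append(task)          # not ready yet, retry next round
        if not progressed:
            raise ValueError("dependency cycle or missing task")
    return schedule


if __name__ == "__main__":
    demo = [
        Task("scale", "simd", {"decode"}),
        Task("decode", "entropy"),
        Task("denoise", "simd", {"scale"}),
    ]
    for engine, name in run_tcu(demo):
        print(engine, name)
```

A real TCU would also track which TOE is busy, so that independent tasks can overlap; that bookkeeping is left out here.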
Memory architecture
Memory hierarchy
L1 - Processor Instruction and Data Memory
L2 - On-chip Shared Memory
L3 - Off-chip
All TOEs use software-managed DMA rather than caches for their local storage
6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller subblocks
No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …}
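Chunking a frame into subblocks that fit a TOE's local memory might look like the sketch below. It covers only the two spatial dimensions (x, y) of the 6D addressing, and the block size, the double-buffering factor and both function names are assumptions made for illustration.

```python
def subblocks(width, height, block_w, block_h):
    """Yield (x, y, w, h) subblocks that tile a width x height frame.

    Blocks on the right and bottom edges are clipped to the frame.
    """
    for y in range(0, height, block_h):
        for x in range(0, width, block_w):
            yield x, y, min(block_w, width - x), min(block_h, height - y)


def fits_local_memory(block_w, block_h, bytes_per_pixel, local_bytes, buffers=2):
    """Check that `buffers` copies of one subblock fit in local memory.

    Two buffers is the usual assumption for software-managed DMA: one
    block is processed while the next one is being transferred.
    """
    return block_w * block_h * bytes_per_pixel * buffers <= local_bytes


if __name__ == "__main__":
    blocks = list(subblocks(1920, 1080, 64, 64))
    print(len(blocks), blocks[-1])
    print(fits_local_memory(64, 64, 1, 16 * 1024))
```

Since there is no cache to hide a bad choice, the block size has to be picked by the programmer so that transfer time and compute time per block stay balanced.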
Broadcom BCM35421 chip [1/2]
Performs motion-compensated frame-rate conversion
Doubles frame rate from FHD@60fps to FHD@120fps (to conquer motion blur)
24fps => 60fps (de-judder)
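To see what the converter has to interpolate, it helps to compute where each output frame falls between two source frames. The sketch below does only that timing arithmetic; `frc_phases` is a hypothetical helper, and the motion-compensated interpolation itself is not shown.

```python
from fractions import Fraction


def frc_phases(src_fps, dst_fps, n_out):
    """For each output frame return (source index, phase).

    `phase` is the position between source frame `index` and `index + 1`:
    0 means the output coincides with a source frame, anything else has
    to be interpolated.
    """
    step = Fraction(src_fps, dst_fps)      # source frames per output frame
    result = []
    for k in range(n_out):
        pos = k * step
        idx = pos.numerator // pos.denominator
        result.append((idx, pos - idx))
    return result


if __name__ == "__main__":
    for idx, phase in frc_phases(24, 60, 6):
        print(idx, phase)
```

For 24fps => 60fps only one output frame in five coincides with a source frame (phases 0, 2/5, 4/5, 1/5, 3/5), while for 60fps => 120fps every second output frame sits halfway between two sources.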
Broadcom BCM35421 chip [2/2]
65nm CMOS process
mediaDSP runs at 400 MHz
106 Million transistors
Two Teraops of peak integer performance
Performance of DSPs for applications
A DSP becomes useful when it can perform a minimum of 100 instructions per sample period
68% of DSPs were shipped for mobile handsets and base stations in 2008
Several K cycles for processing an input sample
Multiple elements
Increase in performance: multiple elements > higher-performance single elements
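The cycle budget per sample is simply the clock rate divided by the sample rate, multiplied by the number of elements working on the stream. The numbers below are illustrative assumptions (a 400 MHz clock and a 48 kHz sample rate), and `cycles_per_sample` is a hypothetical helper.

```python
def cycles_per_sample(clock_hz, sample_rate_hz, cores=1):
    """Whole clock cycles available per input sample across `cores` elements."""
    return cores * clock_hz // sample_rate_hz


if __name__ == "__main__":
    print(cycles_per_sample(400_000_000, 48_000))           # one element
    print(cycles_per_sample(400_000_000, 48_000, cores=4))  # four elements
    print(cycles_per_sample(100_000_000, 1_000_000))        # the 100-cycle floor
```

Adding elements scales the budget linearly without raising the clock, which is the point of the last bullet above; the budget is only usable if the work actually parallelizes.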
Go deeper –TI’s multicore
Multicore Programming Guide
Mapping application to multicore
Know the processing model options
Identify all the tasks
Task partition into many small ones
Be familiar with inter-task communication/data flow
Combination/aggregation
Mapping
Memory hierarchy => L1/L2/L3, private/shared memory, external memory channel numbers/capability
DMA
Special purpose hardware!!
FFT, Viterbi, Reed-Solomon, AES codec, entropy codec
Parallel processing model
Master/Slave model
Data flow
Very successful in communication systems
Router
Base station
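The master/slave model can be illustrated with a thread pool standing in for the slave cores. This is a sketch under that assumption only: `master` and `slave` are hypothetical names, and on a real DSP the chunks would be handed over through shared memory and a notification rather than a Python executor.

```python
from concurrent.futures import ThreadPoolExecutor


def slave(chunk):
    """Work done by one slave: here, the energy of a block of samples."""
    return sum(x * x for x in chunk)


def master(samples, n_slaves=4):
    """Split the samples, hand one chunk to each slave, aggregate the results."""
    size = max(1, -(-len(samples) // n_slaves))   # ceiling division
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partials = list(pool.map(slave, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(master(list(range(10))))
```

In the data-flow model, by contrast, each core would run one fixed stage and pass its output to the next core, with no central master deciding who does what.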
Data movement
Shared memory
Dedicated memory
Transitional memory => ownership changes, content is not copied
Notification [1/4]
Direct signaling
Create an event to the other core’s local interrupt controller
The other core polls its local status
Or the local interrupt controller converts this event to a real interrupt
Notification [2/4]
Indirect signaling
Not directly controlled by software
Notification [3/4]
Atomic arbitration
Hardware semaphore/mutex
Semaphore => allow limited multiple access => example: multiport SRAM/external DDR memory
Mutex => allow one access only
Use a software semaphore instead if the resource is shared only between processes executing on one core
Overhead of a hardware semaphore is not small
It’s only a facility for software usage; hardware only guarantees the atomic operation, the locked content is not protected
Cost, performance consideration
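The difference between the two primitives is easiest to see by counting how many workers are inside the guarded region at once. The sketch uses Python's software `threading.Lock` and `threading.Semaphore` purely to show the semantics; it says nothing about how a hardware semaphore is programmed, and `max_concurrency` is a hypothetical helper.

```python
import threading
import time


def max_concurrency(guard, n_workers=6):
    """Run n_workers threads through `guard` and return the peak number
    of threads that were inside the guarded region at the same time."""
    active = 0
    peak = 0
    counter_lock = threading.Lock()     # protects the two counters only

    def worker():
        nonlocal active, peak
        with guard:
            with counter_lock:
                active += 1
                peak = max(peak, active)
            time.sleep(0.01)            # pretend to use the shared resource
            with counter_lock:
                active -= 1

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print("mutex peak:", max_concurrency(threading.Lock()))
    print("semaphore(2) peak:", max_concurrency(threading.Semaphore(2)))
```

As on the slide, nothing here protects the resource itself: a worker that skips the `with guard` line can still touch it, so the guarantee holds only if every user plays by the rules.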
Notification [4/4]
Left diagram is mutex
Just like the software counterpart
Data transfer engines
DMA => System DMA, local DMA belongs to a core
Ethernet
Up to 32 MAC addresses
RapidIO
Implemented with ultra fast serial IO physical layer
Maybe multiple serial IO links, uni/bi-directional
High speed serial link examples
USB: USB 2.0 => 480 Mbit/sec, USB 3.0 => 5 Gbit/sec
Serial ATA: 1.0, Gen 1 => 1.5 Gbit/sec; 2.0, Gen 2 => 3 Gbit/sec; 3.0, Gen 3 => 6 Gbit/sec
Memory management
Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced
Switched central resource fabric
Highlights [1/3]
Portion of the cache could be configured as memory-mapped SRAM
Transparent cache => visible
Implicit: address aliasing => masking MSByte
For core 0: 0x10800000 == 0x00800000
For core 1: 0x11800000 == 0x00800000
For core 2: 0x12800000 == 0x00800000
Write common ROM code, still access the core’s private area
Explicit: special register DNUM for dynamic pointer address update
Each core has its own DNUM
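The aliasing rule in the three examples above can be written as a pair of address conversions. The sketch generalizes the pattern shown for cores 0 to 2 (most significant byte = 0x10 + core number); that generalization and the function names are assumptions, and `dnum` stands for the value a core would read from its DNUM register.

```python
L2_LOCAL_BASE = 0x00800000   # local (aliased) address used in the examples


def local_to_global(local_addr, dnum):
    """Explicit form: build the globally visible address for core `dnum`
    by replacing the most significant byte with 0x10 + dnum."""
    return ((0x10 + dnum) << 24) | (local_addr & 0x00FFFFFF)


def global_to_local(global_addr):
    """Implicit form: masking the most significant byte gives the alias
    that every core uses for its own private area."""
    return global_addr & 0x00FFFFFF


if __name__ == "__main__":
    for core in range(3):
        print(core, hex(local_to_global(L2_LOCAL_BASE, core)))
```

Common code can therefore use the local alias for private data and only needs the global form when it hands a pointer to another core or to a DMA engine.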
Highlight [2/3]
The only coherency guaranteed by hardware
L2 (core-locally)
L1D L2 SL2 (if as memory-mapped SRAM) (core-locally)
Equal access to the external DDR2 SDRAM through SCR
This may be the bottleneck for certain applications
(Diagram labels: L1P, L1D)
Highlight [3/3]
If any portion of the L1s is configured as memory-mapped SRAM, there is a small paging engine built into the core (IDMA) that can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation
Paging engine => MMU
IDMA may also be used to perform bulk peripheral configuration register access
DSP code and data image
Image types
Single image
Multiple image
Multiple image with shared code and data
Complex linking scheme should be used
Device boot
Debugging
Cuda => XXOO↑↑↓↓←→←→BA
TI’s offer
Hardware emulation => ICE, JTAG
Basically, not intrusive
Software instrumentation
Patching the original code to enable the same ability => this time, “Trace Logs”
Basically, intrusive
Types of Trace Logs
API call log, statistics log, DMA transaction log, event log, customer data log
More on logs
Information stored in memory is pulled back to the host through the hardware emulation path
Tools are provided to correlate all the logs
Display them in an organized manner
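Correlating several logs boils down to merging them on a common timestamp. The sketch below is not TI's tool: the record layout `(timestamp, source, message)`, the sample entries and the `correlate` function are all made up for illustration, and each log is assumed to be already sorted by time.

```python
import heapq


def correlate(*logs):
    """Merge per-source logs into one timeline ordered by timestamp.

    Each log is a list of (timestamp, source, message) records that is
    already sorted by timestamp.
    """
    return list(heapq.merge(*logs, key=lambda record: record[0]))


if __name__ == "__main__":
    api_log = [(10, "core0/api", "send()"), (40, "core0/api", "recv()")]
    dma_log = [(25, "dma", "channel 1 done")]
    event_log = [(12, "core1/event", "irq 5"), (41, "core1/event", "irq 7")]
    for timestamp, source, message in correlate(api_log, dma_log, event_log):
        print(f"{timestamp:>6}  {source:<12} {message}")
```

The merge only makes sense if all cores stamp their records from the same time base, which is something the logging facility has to provide.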
Log example:
Go deeper –Freescale’s manycore
Embedded Multicore: An Introduction
Why manycore?
Freescale MPC8641
Single core => freq x 1.5 => power x 2
Dual core => freq x 1.5 => power x 1.3
Bug in this Fig.
Memory system types
SMP + AMP + Sharing
Manycore enables multiple OSes to run concurrently
Memory sharing => MMU
Interface/peripheral sharing => hypervisor
Virtualization is good for legacy support
Review of single core
Manycore example [1/2]
Manycore example [2/2]
Highlights [1/2]
CoreNet fabric supports cache coherency across all cache layers
CoreNet fabric also supports software semaphores by extending the bit-test to guarantee atomic access between cores
CLASS is better suited for DSPs, as they tend to use less complex operating systems and the application software is more in control
Silicon area of the fabric is reduced
If a core is configured as a software accelerator
Some space of L1, L2 could be converted to memory-mapped SRAM on a per-way basis
Highlights [2/2]
While the MMU protects memory access, DMA could ruin everything => solution: “PAMU”
PAMU is located at the connection of non-core masters and the CoreNet fabric, configured to map memory and to limit access windows, thereby increasing system stability
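The access-window idea can be sketched as a table lookup that every transfer from a non-core master has to pass. This is a conceptual model only, not the PAMU's register interface: the `Window` fields, the master names and `access_allowed` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    base: int
    size: int
    writable: bool = True


def access_allowed(windows, master, addr, length, write=False):
    """True if [addr, addr + length) lies fully inside one of the windows
    granted to `master`, and that window permits the requested access."""
    return any(
        w.base <= addr
        and addr + length <= w.base + w.size
        and (w.writable or not write)
        for w in windows.get(master, ())
    )


if __name__ == "__main__":
    windows = {
        "dma0": [Window(0x1000_0000, 0x1000)],
        "eth0": [Window(0x2000_0000, 0x8000, writable=False)],
    }
    print(access_allowed(windows, "dma0", 0x1000_0800, 0x100, write=True))
    print(access_allowed(windows, "dma0", 0x1000_0F80, 0x100))
    print(access_allowed(windows, "eth0", 0x2000_0000, 0x40, write=True))
```

A master with no entry gets no access at all, so a misprogrammed DMA descriptor is rejected instead of overwriting another partition's memory.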
Cache stashing
DMA between cache and external memory
Deal with 10Gb Ethernet
Parsing network traffic up to layer 4
Assign packets to designated cores
TCP port 80/22 to core 0
ARP to core 1
UDP to core 4~7
Queue Mgr and Buffer Mgr simplify driver code
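The steering rules above can be written as a small classifier. The three rules come from the slide; everything else is an assumption made for the sketch: the dictionary layout of a parsed packet, the `pick_core` name, and the choice to spread UDP flows over cores 4 to 7 with a CRC32 of the flow tuple. On the real chip this parsing is done in hardware before software sees the packet.

```python
import zlib


def pick_core(pkt):
    """Return the core that should handle `pkt`, or None if no rule matches.

    `pkt` is a dict of already-parsed header fields (hypothetical layout):
    'ethertype', and for IP traffic 'proto', 'src', 'dst', 'sport', 'dport'.
    """
    if pkt.get("ethertype") == "arp":
        return 1
    if pkt.get("proto") == "tcp" and pkt.get("dport") in (80, 22):
        return 0
    if pkt.get("proto") == "udp":
        # Assumption: hash the flow so one flow always lands on the same core.
        flow = f"{pkt['src']}:{pkt['sport']}-{pkt['dst']}:{pkt['dport']}".encode()
        return 4 + zlib.crc32(flow) % 4
    return None


if __name__ == "__main__":
    print(pick_core({"ethertype": "arp"}))
    print(pick_core({"ethertype": "ip", "proto": "tcp", "dport": 80}))
    print(pick_core({"ethertype": "ip", "proto": "udp", "src": "10.0.0.1",
                     "dst": "10.0.0.2", "sport": 5000, "dport": 53}))
```

Keeping a flow on one core avoids reordering and keeps its state in that core's cache, which is why a hash is used here instead of round-robin.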
Debugging
JTAG
High-speed logging link
Platinum version “printk”
Why is JTAG slow?
It involves serial bit shifting
Conclusion
Hardware guys are crazy and unfriendly
They are all “Postscript” geeks
However, they give the world a chance to handle each real-time data sample with thousands of cycles => more things could be done by software
Besides data structures and algorithms, a good engineer now needs to know more
The more you know the hardware, the more it dances with you, or at least……
Talk as common sense
Reference[1/2]
Sudhakar Yalamanchili, Georgia Institute of Technology, “Multicore Computing - Evolution”
(Broadcom) “Broadcom mediaDSP: A platform for building programmable multicore video processors,” IEEE Micro, Mar/April 2009.
(Texas Instruments, TI) Lina J. Karam, Ismail Alkamal, Alan Gatherer, Gene A. Frantz, David V. Anderson, and Brian L. Evans, “Trends in Multicore DSP Platforms,” IEEE Signal Processing Magazine, Nov. 2009.
University of Texas and Texas Instruments
Reference[2/2]
http://en.wikipedia.org
(Tilera) Tile64 processor products. http://www.tilera.com (originates from MIT)
(Intel) Single-chip Cloud Computer
TI, “Multicore Programming Guide”
Freescale, “Embedded Multicore: An Introduction”
Presentation slides from lab member