Manycores – From Hardware Perspective to Software



MANYCORES – FROM HARDWARE PERSPECTIVE TO SOFTWARE
Presenter: D96943001, Institute of Electronics, 陳泓輝
Why Moore’s Law is dying


He is not CEO anymore!!
Walls => ILP, Frequency, Power, Memory walls
ILP – more cost, less return


ILP: instruction level parallelism
OOO: out-of-order execution of micro-operations
Frequency wall

FO4 delay metric: the delay of an inverter driving four copies of itself (a fan-out of 4)
Frequency ↑ => the cycle counts of some operations ↑
Saturated!
Memory wall


External memory access penalty keeps increasing (the CPU-memory gap)
Solution => enlarge cache

The cache decides both performance and price
It’s the cache that matters!
The power wall

High power may imply:
 Thermal runaway of device behavior
 Larger current => electromigration => reliability issues in the metal interconnect
 Hitting the packaging heat limit => a change to high-cost packaging
 Noise!!
 Form factor
 Cooling
The great wall……
Moore’s Law
CMOS
Multicore
Manycore
Historical - Intel 2007 Xeon
 Dual on-chip memory controllers => f_cpu > 2 × f_mem
 Point-to-point interconnection => fabrics
 Multiple simultaneous communication activities (cf. a “bus” => one activity at a time)
Fabric working notation
AMD – Opteron(Shanghai)


Much the same as the Intel Xeon
Shared L3 cache among 2 cores
Game consoles


Xbox 360 => triple core (homogeneous)
PS3 => Cell, 8+1 cores (heterogeneous)
PowerPC wins!
State-of-the-art multicore DSP chips
 TI TNETV3020 => homogeneous
 Freescale 8156 => heterogeneous
 picoChip PC205 => heterogeneous
 Tilera TILE64 => homogeneous, mesh
State-of-the-art multicore x86 chips: Intel Single-chip Cloud Computer
 24 “tiles” with two IA cores per tile (1 GHz Pentium cores)
 A 24-router mesh network with 256 GB/s bisection bandwidth
 4 integrated DDR3 memory controllers
 Hardware support for message-passing!!
GPGPU - OpenCL
Special case: multicore video processor

Characteristics of video applications in consumer electronics:
 High computational capability
 Low hardware cost
 Low power consumption
A general solution => fixed-function logic design
Challenges:
 Multiple video decoding standards
 Video decoding standards keep being updated
 Ill-posed video processing algorithms
 Product requirements are diverse and mutually exclusive
mediaDSP technology

Nickname: accelerator
 Heterogeneous (programmable and fixed-function units)
 A task-based programming model
 A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements
 A platform, easily extendable to support a range of applications
Broadcom: mediaDSP technology


 Easily customized for special purposes
 Success stories:
  SD MPEG video encoder, including scaling and noise reduction
  Frame-rate-conversion video processing for FHD@60Hz/120Hz videos

Classes of video processing

 Highly parallelizable operations on fixed-point data (no floating point)
  => a processor with a SIMD data path engine
 Ad-hoc computation and decision making, operating on the smaller data sets produced by the parallelizable processes
  => a general processor such as a RISC
 Data movement and formatting on multidimensional pixels
 Bit-serial processing for entropy decoding and encoding
  => dedicated hardware does this job very efficiently
Task-based programming model


Programmers’ duties are as follows:
 Partition a sequential algorithm into a set of parallelizable tasks, then efficiently map them onto the massively parallel architecture
  A task has a definite initiation time
  A task runs until completion, with no interruption and no further synchronization with other tasks (see the sketch after this list)
 Understand the hardware architecture and its limitations
  Shared memory (instead of FIFO mode)
  Buffer sizes must be enough for a data unit
  Interconnect bandwidth must be enough
  Computational power must be enough for real time
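A minimal C sketch of this run-to-completion task model; the task_t type, the queue, and the stage functions are illustrative assumptions, not Broadcom's actual API:

    /* Each task has a definite initiation time and runs to completion:
     * no preemption, no mid-task synchronization. */
    #include <stdio.h>

    typedef struct {
        const char *name;
        unsigned    init_time;        /* definite initiation time (tick)  */
        void      (*run)(void *arg);  /* runs to completion, never yields */
        void       *arg;
    } task_t;

    static void deblock(void *arg) { printf("deblock row %d\n", *(int *)arg); }
    static void scale(void *arg)   { printf("scale   row %d\n", *(int *)arg); }

    int main(void) {
        int row = 0;
        /* A sequential algorithm partitioned into small tasks, queued in
         * initiation-time order. */
        task_t queue[] = {
            { "deblock", 0, deblock, &row },
            { "scale",   1, scale,   &row },
        };
        for (size_t t = 0; t < sizeof queue / sizeof queue[0]; ++t)
            queue[t].run(queue[t].arg);  /* dispatch; each runs to the end */
        return 0;
    }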

(IP) Platform-based architecture

 Task-oriented engine (TOE)
  A programmable DSP or a fixed-function unit
 Task control unit (TCU)
  A RISC that maintains a queue of tasks and synchronizes with the other TCUs/TOEs
  Goal: maximize the utilization of the TOEs
 Control engine
 Shared memory
 Communication fabric
Memory architecture

Memory hierarchy:
 L1 - processor instruction and data memory
 L2 - on-chip shared memory
 L3 - off-chip memory
All TOEs use software-managed DMA rather than caches for their local storage
 6D addressing (x, y, t, Y, U, V) and the chunking of blocks into smaller subblocks (sketched below)
 No {prefetching, early load scheduling, caches, speculative execution, multithreading, …}
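A hedged sketch of what such a 6D transfer could look like in C; every field name is a hypothetical stand-in, and the nested loop is only a software model of what the DMA engine would do autonomously while the CPU keeps computing:

    #include <stdio.h>
    #include <stddef.h>

    typedef struct {
        size_t x_count, x_stride;   /* pixels within a line */
        size_t y_count, y_stride;   /* lines within a block */
        size_t t_count, t_stride;   /* frames (time)        */
        size_t plane_count;         /* Y, U, V              */
        size_t plane_stride;
        const unsigned char *src;
        unsigned char *dst;
    } dma6d_desc_t;

    static void dma6d_copy(const dma6d_desc_t *d) {
        for (size_t p = 0; p < d->plane_count; ++p)
            for (size_t t = 0; t < d->t_count; ++t)
                for (size_t y = 0; y < d->y_count; ++y)
                    for (size_t x = 0; x < d->x_count; ++x) {
                        size_t off = p * d->plane_stride + t * d->t_stride
                                   + y * d->y_stride + x * d->x_stride;
                        d->dst[off] = d->src[off];
                    }
    }

    int main(void) {
        unsigned char src[24], dst[24] = {0};
        for (size_t i = 0; i < sizeof src; ++i) src[i] = (unsigned char)i;
        /* A tiny 2x2 block, 2 frames, 3 planes; subblock chunking would
         * simply issue several such descriptors. */
        dma6d_desc_t d = { 2, 1, 2, 2, 2, 4, 3, 8, src, dst };
        dma6d_copy(&d);
        printf("dst[23] = %u\n", dst[23]);
        return 0;
    }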
Broadcom BCM35421 chip [1/2]

 Performs motion-compensated frame-rate conversion
 Doubles the frame rate from FHD@60fps to FHD@120fps (to reduce motion blur)
 24fps → 60fps (de-judder)

Broadcom BCM35421 chip [2/2]




65nm CMOS process
mediaDSP runs at 400 MHz
106 million transistors
Two teraops of peak integer performance
Performance of DSPs for applications


 A DSP becomes useful when it can perform a minimum of 100 instructions per sample period (e.g., at a 48 kHz audio sample rate that implies at least roughly 4.8 MIPS)
 68% of DSPs shipped in 2008 went to mobile handsets and base stations
 Several thousand cycles for processing an input sample
 Increase in performance: multiple elements > higher-performance single elements
Go deeper – TI’s multicore
Multicore Programming Guide
Mapping an application to multicore:
 Know the processing model options
 Identify all the tasks
  Partition tasks into many small ones
  Be familiar with inter-task communication/data flow
  Combination/aggregation
 Mapping
  Memory hierarchy => L1/L2/L3, private/shared memory, external memory channel count/capability
  DMA
  Special-purpose hardware!! => FFT, Viterbi, Reed-Solomon, AES codec, entropy codec
Parallel processing model
 Master/slave model (a sketch follows below)
 Data flow model
  Very successful in communication systems
   Routers
   Base stations
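An illustrative master/slave dispatch in C, using POSIX threads as a stand-in for cores; on a real multicore DSP the slaves would be fed through mailboxes or hardware queues rather than a mutex-protected work index:

    #include <pthread.h>
    #include <stdio.h>

    #define SLAVES 4
    #define JOBS   16

    static int next_job;                      /* master-owned work pool */
    static pthread_mutex_t pool = PTHREAD_MUTEX_INITIALIZER;

    static void *slave(void *id) {
        for (;;) {
            pthread_mutex_lock(&pool);
            int job = next_job < JOBS ? next_job++ : -1;
            pthread_mutex_unlock(&pool);
            if (job < 0) return NULL;         /* no work left */
            printf("slave %ld handles job %d\n", (long)id, job);
        }
    }

    int main(void) {
        pthread_t t[SLAVES];
        for (long i = 0; i < SLAVES; ++i)     /* master spawns slaves */
            pthread_create(&t[i], NULL, slave, (void *)i);
        for (int i = 0; i < SLAVES; ++i)
            pthread_join(t[i], NULL);
        return 0;
    }

In a data-flow model, by contrast, each core would run one pipeline stage and pass buffers downstream instead of pulling jobs from a central master.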
Data movement



 Shared memory
 Dedicated memory
 Transitional memory => ownership changes; contents are not copied
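A small C sketch of the "ownership change, contents not copied" idea; the buffer type and the owner tag are illustrative only:

    #include <stdio.h>

    enum owner { CORE0, CORE1 };

    typedef struct {
        enum owner owner;        /* only the owner may touch data[] */
        int        data[1024];
    } tbuf_t;

    static void hand_over(tbuf_t *b, enum owner to) {
        /* On real hardware this is where a notification (next slides)
         * and any cache maintenance would go; the payload stays put. */
        b->owner = to;
    }

    int main(void) {
        static tbuf_t buf = { CORE0, {0} };
        buf.data[0] = 42;          /* core 0 fills the buffer            */
        hand_over(&buf, CORE1);    /* core 1 now owns it; nothing copied */
        printf("owner=%d value=%d\n", (int)buf.owner, buf.data[0]);
        return 0;
    }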
Notification [1/4]

Direct signaling
 Create an event in the other core’s local interrupt controller
 The other core polls its local status
 Or the local interrupt controller converts the event into a real interrupt
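A hypothetical sketch of direct signaling in C: core A sets an event bit in core B's local interrupt controller, and core B either polls the status register or takes an interrupt. The register addresses and the write-1-to-clear behavior are assumptions for illustration, not a real memory map:

    #include <stdint.h>

    #define IPC_GEN_REG(core) (*(volatile uint32_t *)(0x40000100u + 4u * (core)))
    #define IPC_STS_REG(core) (*(volatile uint32_t *)(0x40000200u + 4u * (core)))

    static void notify_core(unsigned core, unsigned event_bit) {
        IPC_GEN_REG(core) = 1u << event_bit;     /* create the event      */
    }

    static int poll_event(unsigned self, unsigned event_bit) {
        if (!(IPC_STS_REG(self) & (1u << event_bit)))
            return 0;                            /* nothing pending       */
        IPC_STS_REG(self) = 1u << event_bit;     /* ack: write-1-to-clear */
        return 1;
    }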
Notification [2/4]

Indirect signaling
 Not directly controlled by software
Notification [3/4]

Atomic arbitration
 Hardware semaphore/mutex
  Semaphore => allows limited multiple access => example: multiport SRAM / external DDR memory
  Mutex => allows only one access
 Use a software semaphore instead if the resource is shared only between processes executing on a single core
 The overhead of a hardware semaphore is not small
 It is only a facility for software use; the hardware merely guarantees the atomic operation, the locked content itself is not protected
 Cost/performance considerations
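A sketch of taking a hardware semaphore from C; the memory-mapped layout and the read-to-acquire behavior are hypothetical. Note the caveat from this slide: the hardware only makes acquire/release atomic, so nothing stops buggy code from touching the guarded resource without acquiring first:

    #include <stdint.h>

    #define HW_SEM(n) (*(volatile uint32_t *)(0x40001000u + 4u * (n)))

    static void sem_acquire(unsigned n) {
        while (HW_SEM(n) != 1u)     /* read returns 1 once per free slot */
            ;                       /* spin; the overhead is not small   */
    }

    static void sem_release(unsigned n) {
        HW_SEM(n) = 1u;             /* give the slot back                */
    }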
Notification [4/4]


 The left diagram shows the mutex
 Just like its software counterpart
Data transfer engines


 DMA => system DMA, plus local DMA belonging to a core
 Ethernet
  Up to 32 MAC addresses
 RapidIO
  Implemented with an ultra-fast serial I/O physical layer
  Possibly multiple serial I/O links, uni-/bi-directional
 Other high-speed serial links for comparison:
  USB 2.0 => 480 Mbit/s; USB 3.0 => 5 Gbit/s
  Serial ATA: 1.0 (Gen 1) => 1.5 Gbit/s; 2.0 (Gen 2) => 3 Gbit/s; 3.0 (Gen 3) => 6 Gbit/s
Memory management

 The devices do not support automated cache coherency among cores, because of the power consumption involved and the latency overhead introduced; software must therefore manage coherency itself (sketched below)
 Switched central resource (SCR) => fabric
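A sketch of the software-managed coherency this implies, in C. The cache_wb()/cache_inv() calls are placeholders for whatever the chip-support library actually provides; the names here are assumptions:

    #include <stddef.h>

    extern void cache_wb(void *addr, size_t bytes);   /* hypothetical */
    extern void cache_inv(void *addr, size_t bytes);  /* hypothetical */

    static int shared_buf[256];                /* lives in shared DDR */

    void producer_publish(void) {
        shared_buf[0] = 123;                      /* written into L1D */
        cache_wb(shared_buf, sizeof shared_buf);  /* push out to DDR  */
        /* ...then notify the consumer core (see Notification above). */
    }

    int consumer_read(void) {
        cache_inv(shared_buf, sizeof shared_buf); /* drop stale lines */
        return shared_buf[0];
    }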
Highlights [1/3]

 A portion of the cache can be configured as memory-mapped SRAM
  Transparent to the cache => visible
 Implicit: address aliasing => masking the most-significant byte, so all cores can run common code
  For core 0: 0x10800000 == 0x00800000
  For core 1: 0x11800000 == 0x00800000, still accessing the core’s own private area
  For core 2: 0x12800000 == 0x00800000
 Explicit: the special register DNUM (each core has its own) for dynamic pointer address update
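A sketch of both schemes in C, following the address layout on this slide (core-local L2 at 0x00800000, global alias 0x1N800000 for core N). DNUM is TI's core-number register; read_dnum() is a stand-in for however the compiler exposes it:

    #include <stdint.h>

    extern unsigned read_dnum(void);     /* stand-in for reading DNUM */

    #define LOCAL_L2_BASE   0x00800000u
    #define GLOBAL_L2(core) (0x10000000u | ((uint32_t)(core) << 24) | LOCAL_L2_BASE)

    /* Implicit: common code just uses the masked local alias on every
     * core, so the same binary runs unmodified on all of them. */
    static volatile uint32_t *my_l2_local(void) {
        return (volatile uint32_t *)LOCAL_L2_BASE;
    }

    /* Explicit: build the globally visible pointer to *this* core's L2,
     * e.g. so another core or a DMA engine can be pointed at it. */
    static volatile uint32_t *my_l2_global(void) {
        return (volatile uint32_t *)GLOBAL_L2(read_dnum());
    }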
Highlights [2/3]

 The only coherency guaranteed by hardware (core-locally):
  L1D <=> L2
  L1D <=> SL2 (if configured as memory-mapped SRAM)
 Equal access to the external DDR2 SDRAM through the SCR
  This may be the bottleneck for certain applications
Highlights [3/3]

 If any portion of the L1s is configured as memory-mapped SRAM, there is a small paging engine built into the core (IDMA) that can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation
  Paging engine => MMU
 IDMA may also be used to perform bulk peripheral configuration register accesses
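A hedged sketch of using such a paging engine from C; idma_submit() and idma_busy() are placeholder names, not the actual CSL API:

    #include <stddef.h>

    extern void idma_submit(void *dst, const void *src, size_t bytes); /* hypothetical */
    extern int  idma_busy(void);                                       /* hypothetical */

    /* Stream a large L2 buffer through a small L1 window, one page at a
     * time. Real code would double-buffer so the IDMA transfer overlaps
     * the compute() call instead of spinning. */
    void process_in_pages(char *l1_buf, const char *l2_src,
                          size_t page, size_t pages,
                          void (*compute)(char *, size_t)) {
        for (size_t p = 0; p < pages; ++p) {
            idma_submit(l1_buf, l2_src + p * page, page);
            while (idma_busy())
                ;                   /* wait for the page to land in L1 */
            compute(l1_buf, page);
        }
    }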
DSP code and data image

Image types:
 Single image
 Multiple images
 Multiple images with shared code and data
  A complex linking scheme should be used
Device boot
 Tool support
Debugging

CUDA => XXOO↑↑↓↓←→←→BA
TI’s offer

 Hardware emulation => ICE, JTAG
  Basically non-intrusive
 Software instrumentation
  Patching the original code to enable the same ability => this time, “Trace Logs”
  Basically intrusive
 Types of Trace Logs:
  API call log, statistics log, DMA transaction log, event log, customer data log
More on logs


Information stores in memory pull back to host by
path through hardware emulation
Provide tool to correlate all the logs
 Display
them with an organized manner
 Log example:
Go deeper – Freescale’s manycore
Embedded Multicore: An Introduction
Why manycore?

 Freescale MPC8641
  Single core => freq × 1.5 => power × 2
  Dual core => freq × 1.5 => power × 1.3
 (Note: there is a bug in this figure.)
Memory system types
SMP + AMP + Sharing

 Manycore enables multiple OSes to run concurrently
  Memory sharing => MMU
  Interface/peripheral sharing => hypervisor
  Virtualization is good for legacy support
Review of single core
Manycore example [1/2]
Manycore example [2/2]
Highlights [1/2]



 The CoreNet fabric supports cache coherency across all cache layers
 CoreNet also supports software semaphores, by extending bit-test operations to guarantee atomic access between cores (sketched below)
 CLASS is better suited for DSPs, as they tend to use less complex operating systems and the application software is more in control
  The silicon area of the fabric is reduced
 If a core is configured as a software accelerator, some space in L1/L2 can be converted to memory-mapped SRAM on a per-way basis
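A sketch of the software semaphore such an atomic bit-test enables, using C11 atomics to stand in for the fabric's cross-core atomicity guarantee:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void sem_take(void) {
        /* Atomic test-and-set: loops until the bit was observed clear. */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                                     /* spin */
    }

    static void sem_give(void) {
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }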
Highlights [2/2]

 While the MMU protects memory accesses from the cores, DMA could ruin everything => solution: the “PAMU”
  The PAMU sits at the connection between non-core masters and the CoreNet fabric; it is configured to map memory and to limit access windows, thereby increasing system stability
 Cache stashing
  DMA directly between a cache and external memory
Deal with 10Gb Ethernet


 Parses network traffic up to layer 4
 Assigns packets to designated cores, e.g.:
  TCP port 80/22 to core 0
  ARP to core 1
  UDP to cores 4~7
 The Queue Manager and Buffer Manager simplify driver code (a model of the steering table follows below)
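An illustrative software model of the classification/steering table implied above; the rule layout and the steer() helper are hypothetical, not the QorIQ driver API:

    #include <stdint.h>

    typedef struct {
        uint8_t  ip_proto;    /* 6 = TCP, 17 = UDP; 0 = match any */
        uint16_t l4_port;     /* 0 = match any                    */
        uint8_t  core_first;  /* destination core range           */
        uint8_t  core_last;
    } steer_rule_t;

    static const steer_rule_t rules[] = {
        {  6,  80, 0, 0 },    /* TCP port 80 -> core 0                */
        {  6,  22, 0, 0 },    /* TCP port 22 -> core 0                */
        { 17,   0, 4, 7 },    /* any UDP -> cores 4..7, hash-spread   */
        {  0,   0, 1, 1 },    /* everything else (e.g. ARP) -> core 1 */
    };

    static unsigned steer(uint8_t proto, uint16_t port, uint32_t hash) {
        for (unsigned i = 0; i < sizeof rules / sizeof rules[0]; ++i) {
            const steer_rule_t *r = &rules[i];
            if ((r->ip_proto == 0 || r->ip_proto == proto) &&
                (r->l4_port  == 0 || r->l4_port  == port))
                return r->core_first + hash % (r->core_last - r->core_first + 1u);
        }
        return 1;             /* unreachable: the last rule matches all */
    }

On the real chip this matching happens in hardware before the packet ever reaches software, so the driver only consumes per-core queues.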
Debugging
 JTAG
 High-speed logging link
  A platinum version of “printk”
 Why is JTAG slow?
  It involves serial bit shifting (e.g., even at a 10 MHz TCK the raw shift rate is only ~10 Mbit/s, before protocol overhead)
Conclusion

 Hardware guys are crazy and unfriendly
  They are all “PostScript” geeks
 However, they give the world a chance to deal with real-time sampled data within thousands of cycles => more things can be done by software
 Besides data structures and algorithms, a good engineer now needs to know more
 The more you know the hardware, the more it dances with you, or at least…… talks common sense with you
References [1/2]

 Sudhakar Yalamanchili, Georgia Institute of Technology, “Multicore Computing - Evolution”
 (Broadcom) “Broadcom mediaDSP: A platform for building programmable multicore video processors,” IEEE Micro, Mar./Apr. 2009
 (Texas Instruments) Lina J. Karam, Ismail Alkamal, Alan Gatherer, Gene A. Frantz, David V. Anderson, and Brian L. Evans, “Trends in Multicore DSP Platforms,” IEEE Signal Processing Magazine, Nov. 2009
  University of Texas and Texas Instruments
References [2/2]

 http://en.wikipedia.org
 (Tilera) TILE64 processor products, http://www.tilera.com
  Originated from MIT
 (Intel) Single-chip Cloud Computer
 TI, “Multicore Programming Guide”
 Freescale, “Embedded Multicore: An Introduction”
 Presentation slides from lab members