
Digital Space
Anant Agarwal
MIT and Tilera Corporation
Arecibo
Stages of Reality
[Figure: chip evolution across process generations – a single CPU with off-chip memory in 1996–1997, a few processors and memories on one die by 2002, a tiled grid of processors (p), memories (m), and switches (s) at 1 billion transistors in 2007, growing denser through 2014 toward 100 billion transistors in 2018]
Virtual reality → Simulator reality → Prototype reality → Product reality
(now: Virtual reality)
The Opportunity
1996…
A 20 MIPS CPU in 1987: a few thousand gates
The Opportunity
The billion transistor chip of 2007
How to Fritter Away Opportunity
• Caches
• Control
• 100-ported RegFile and RR
• More resolution buffers, control
• The x1786? Does not scale
["1/10 ns" wire-delay annotation]
Take Inspiration from ASICs
[Figure: ASIC floorplan with many small local memories (mem)]
• Lots of ALUs, lots of registers, lots of local memories – huge on-chip parallelism – but with a slower clock
• Custom-routed, short wires optimized for specific applications
Fast, low power, area efficient – but not programmable
Our Early Raw Proposal
Got parallelism?
[Figure: conventional CPU + Mem contrasted with an array of ALUs, registers, and memory banks]
E.g., a 100-way unrolled loop, running on 100 ALUs, 1000 regs, 100 memory banks
But how to build programmable, yet custom, wires?
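To make the proposal concrete, here is a minimal C sketch of the kind of parallelism Raw targeted: a loop whose iterations are fully independent, so a spatial compiler could place each one on its own ALU. The TILES count, the function name scaled_add, and the array striping are illustrative assumptions, not Raw's actual mapping.

  #include <stddef.h>

  #define TILES 100  /* illustrative: one iteration per ALU/tile */

  /* Each of the TILES iterations is independent, so a spatial compiler
     could place each one on a separate ALU, with a[] and b[] striped
     across 100 memory banks and live values held in per-tile registers. */
  void scaled_add(float *a, const float *b, float n) {
      for (size_t t = 0; t < TILES; t++) {
          a[t] += b[t] * n;  /* no cross-iteration dependences */
      }
  }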
A Digital Wire
Pipeline it! Multiplex it! Software orchestrate it! (Uh! What were we smoking!)
• Fast clock (10 GHz in 2010)
• Improve utilization
• Customize to application and maximize utilization
A dynamic router! Replace custom wires with routed on-chip networks
[Figure: a wire pipelined into stages, each with its own control (Ctrl)]
Static Router
[Figure: the compiler translates the application into per-switch programs; each switch runs its own switch code]
A static router!
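As a rough illustration of what per-switch "switch code" means, the sketch below (in C, mirroring the route P->E, N->S notation used later in this talk) models a compiler-emitted schedule for one switch. The port names and two-routes-per-cycle format are assumptions for illustration, not Raw's actual switch ISA.

  #include <stdio.h>

  enum port { N, S, E, W, P };            /* four neighbours + local processor */
  struct route { enum port src, dst; };   /* one crossbar connection */

  /* Compiler-emitted program for one switch: no runtime arbitration,
     cycle i simply performs schedule[i]. That is what makes it static. */
  static const struct route schedule[][2] = {
      { {P, E}, {N, S} },   /* cycle 0: route P->E, N->S */
      { {W, P}, {S, N} },   /* cycle 1: route W->P, S->N */
  };

  int main(void) {
      static const char *name = "NSEWP";
      for (int c = 0; c < 2; c++)
          printf("cycle %d: %c->%c, %c->%c\n", c,
                 name[schedule[c][0].src], name[schedule[c][0].dst],
                 name[schedule[c][1].src], name[schedule[c][1].dst]);
      return 0;
  }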
Replace Wires with Routed Networks
50-Ported Register File → Distributed Registers
[Figure: a gigantic 50-ported register file split into small per-tile register files]
Distributed Registers + Routed Network
[Figure: distributed register files connected through routers (R)]
Called NURA [ASPLOS 1998]
16-Way ALU Clump → Distributed ALUs
[Figure: a clump of 16 ALUs redrawn as distributed ALUs]
Distributed ALUs, Routed Bypass Network
[Figure: ALUs sharing a centralized register file (RF) and bypass net, versus distributed ALUs joined by routers (R)]
Scalar Operand Network (SON) [TPDS 2005]
Mongo Cache → Distributed Cache
[Figure: a gigantic 10-ported cache]
Distributing the Cache
Distributed Shared Cache
[Figure: ALUs each with a local cache ($) and router (R)]
Like DSM (distributed shared memory), the cache is distributed; but unlike NUCA, caches are local to processors, not far away [ISCA 1999]
Tiled Multicore Architecture
E.g., Operand Routing in a 16-Way Superscalar
[Figure: 16 ALUs sharing one register file (RF) and bypass net; an operand flows from a '+' to a '>>' through the centralized bypass network]
Source: [Taylor ISCA 2004]
Operand Routing in a Tiled Architecture
[Figure: the same '+' to '>>' operand now routed between tiles, each with its own ALUs, router (R), and cache ($)]
Tiled Multicore
• Scales to large numbers of cores
• Modular – design, layout, and verify one tile
• Power efficient [MIT-CSAIL-TR-2008-066]
  – Short wires (lower C in CV²f)
  – Chandrakasan effect (lower V in CV²f)
  – Dynamic and compiler-scheduled routing
Core + Switch = Tile
[Figure: a processor core plus a switch (S) forming a tile]
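To see why the Chandrakasan effect favors tiles, here is a back-of-the-envelope check of dynamic power P = CV²f, with made-up but representative numbers: spreading work across two cores at half frequency permits a lower supply voltage, and the V² term dominates.

  #include <stdio.h>

  int main(void) {
      /* Normalized baseline: one core at full voltage and frequency. */
      double C = 1.0, V = 1.0, f = 1.0;
      double p_one = C * V * V * f;               /* P = CV^2f = 1.00 */

      /* Two cores at half frequency; assume voltage scales down with f. */
      double V2 = 0.6, f2 = 0.5;                  /* assumed scaling */
      double p_two = 2.0 * C * V2 * V2 * f2;      /* = 0.36 */

      printf("one fast core: %.2f, two slow cores: %.2f\n", p_one, p_two);
      return 0;
  }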
A Prototype Tiled Architecture: The Raw Microprocessor
[Billion transistor IEEE Computer issue '97]
www.cag.csail.mit.edu/raw
[Figure: a Raw tile – IMEM, DMEM, SMEM, PC, registers, FPU, ALU, and the Raw switch – and the Raw chip as an array of tiles wired to packet streams, disk streams, video, and DRAM]
Scalar operand network (SON): capable of low-latency transport of small (or large) packets [IEEE TPDS 2005]
Virtual reality → Simulator reality → Prototype reality → Product reality
Scalar Operand Transport in Raw
Goal: flow-controlled, in-order delivery of operands
  fmul r24, r3, r4      (producer tile)
  fadd r5, r3, r24      (consumer tile)
  route P->E, N->S
  route W->P, S->N
[Figure: a software-controlled crossbar on each tile carries the operand from the producer's r24 to the consumer]
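A minimal sketch of the register-mapped transport idea, modeled in C: writing the network-mapped register injects an operand and reading it dequeues in order. The FIFO depth and the assert-based blocking are illustrative stand-ins for the hardware flow control.

  #include <assert.h>
  #include <stdio.h>

  #define NET_DEPTH 4
  static double fifo[NET_DEPTH];
  static int head, tail, count;

  /* Writing "r24" injects the value into the network (fmul r24, r3, r4). */
  static void net_send(double v) {
      assert(count < NET_DEPTH);       /* real HW stalls instead of asserting */
      fifo[tail] = v; tail = (tail + 1) % NET_DEPTH; count++;
  }

  /* Reading "r24" dequeues in order (fadd r5, r3, r24). */
  static double net_recv(void) {
      assert(count > 0);               /* real HW stalls until data arrives */
      double v = fifo[head]; head = (head + 1) % NET_DEPTH; count--;
      return v;
  }

  int main(void) {
      double r3 = 2.0, r4 = 3.0;
      net_send(r3 * r4);               /* producer tile */
      double r5 = r3 + net_recv();     /* consumer tile */
      printf("r5 = %f\n", r5);         /* flow-controlled, in-order delivery */
      return 0;
  }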
RawCC: Distributed ILP Compilation (DILP)
C source:
  tmp0 = (seed*3+2)/2
  tmp1 = seed*v1 + 2
  tmp2 = seed*v2 + 2
  tmp3 = (seed*6+2)/3
  v2 = (tmp1 - tmp3)*5
  v1 = (tmp1 + tmp2)*3
  v0 = tmp0 - v1
  v3 = tmp3 - v2
Partitioning → Place, Route, Schedule
[Figure: the renamed dataflow graph (e.g., seed.0=seed, pval1=seed.0*3.0, tmp0.1=pval0/2.0, …, v0=v0.9) partitioned and scheduled across tiles; black arrows show operand communication over the SON]
[ASPLOS 1998]
Virtual reality → Simulator reality → Prototype reality → Product reality
A Tiled Processor Architecture Prototype:
the Raw Microprocessor
Michael Taylor
Walter Lee
Jason Miller
David Wentzlaff
Ian Bratt
Ben Greenwald
Henry Hoffmann
Paul Johnson
Jason Kim
James Psota
Arvind Saraf
Nathan Shnidman
Volker Strumpen
Matt Frank
Rajeev Barua
Elliot Waingold
Jonathan Babb
Sri Devabhaktuni
Saman Amarasinghe
Anant Agarwal
Raw Die Photo (October 2002)
IBM 0.18 micron process, 16 tiles, 425 MHz, 18 W (vpenta) [ISCA 2004]
Raw Motherboard
Raw Ideas and Decisions: What Worked, What Did Not
• Build a complete prototype system
• Simple processor with single-issue cores
• FPGA logic block in each tile
• Distributed ILP and static network
• Static network for streaming
• Multiple types of computation – ILP, streams, TLP, server
• PC in every tile
Why Build?
• Compiler (Amarasinghe), OS and runtime (ISI), and apps (ISI, Lincoln Labs, Durand) folks will not work with you unless you are serious about building hardware
• Need motivation to build software tools – compilers, runtimes, debugging, visualization – many challenges here
• Run large data sets (simulation takes forever even with 100 servers!)
• Many hard problems show up or are better understood after you begin building (how to maintain ordering for distributed ILP, slack for streaming codes)
• Have to solve hard problems – no magic!
• The more radical the idea, the more important it is to build
  – The world will only trust end-to-end results, since it is too hard to dive into details and understand all assumptions
  – Would you believe this: "Prof. John Bull has demonstrated a simulation prototype of a 64-way issue out-of-order superscalar"
• Cycle simulator became a cycle-accurate simulator only after the HW got precisely defined
• Don't bother to commercialize unless you have a working prototype
• Total network power is a few percent for real apps [ISLPED Aug 2003, Kim et al., Energy characterization of a tiled architecture processor with on-chip networks] [MIT-CSAIL-TR-2008-066, Energy scalability of on-chip interconnection networks in multicore architectures]
  – Network power is a few percent in Raw for real apps; it reaches 36% only on a highly contrived synthetic sequence meant to toggle every network wire
Raw Ideas and Decisions: What Worked, What Did Not
• Build a complete prototype system – Yes
• Simple processor with single-issue cores – 1 GHz, 2-way, in-order in 2016
• FPGA logic block in each tile – No
• Distributed ILP – Yes '02, No '06, Yes '14
• Static network for streaming – see next slide
• Multiple types of computation (ILP, streams, TLP, server) – Yes
• PC in every tile – Yes
Raw Ideas and Decisions: Streaming – Interconnect Support
Forced synchronization in the static network:
  route P->E, N->S
  route W->P, S->N
[Figure: software-controlled crossbars; the static route schedule forces producer and consumer tiles into lockstep]
Streaming in Tilera's Tile Processor
• Streaming done over the dynamic interconnect with stream demuxing (AsTrO SDS)
• Automatic demultiplexing of streams into registers
• Number of streams is virtualized
  add r55, r3, r4      (producer tile)
  sub r5, r3, r55      (consumer tile)
[Figure: dynamic switches route tagged (TAG) packets; the tag steers each stream into its destination register]
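The sketch below models the tag-demultiplexing idea in C; the packet format, tag width, and per-tag landing slots are illustrative assumptions, not Tilera's actual interface.

  #include <stdint.h>
  #include <stdio.h>

  #define MAX_STREAMS 64                      /* illustrative tag space */
  static int64_t stream_reg[MAX_STREAMS];     /* one landing slot per tag */

  struct packet { uint8_t tag; int64_t payload; };

  /* On arrival, hardware-style demux steers the payload by tag, so many
     logical streams can share one dynamic network without software sorting. */
  static void on_packet_arrival(struct packet pkt) {
      stream_reg[pkt.tag % MAX_STREAMS] = pkt.payload;
  }

  int main(void) {
      struct packet p = { .tag = 55, .payload = 6 };  /* e.g., stream "r55" */
      on_packet_arrival(p);
      printf("stream 55 received %lld\n", (long long)stream_reg[55]);
      return 0;
  }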
Virtual reality → Simulator reality → Prototype reality → Product reality
Why Do We Care? Markets Demanding More Performance
• Wireless networks
  – Demand for high throughput – more channels
  – Fast-moving standards (LTE), services
• Networking market
  – Demand for high performance – 10 Gbps
  – Demand for more services, intelligence
• Digital multimedia market
  – Demand for high performance – H.264 HD
  – Demand for more services – VoD, transcode
… and with power efficiency and programming ease
[Figure: example deployments – base stations, GGSN, switches, security appliances, routers, video conferencing, cable & broadcast]
Tilera’s TILEPro64™ Processor
Multicore Performance (90nm)
Number of tiles
Cache-coherent distributed cache
Operations @ 750MHz (32, 16, 8 bit)
Bisection bandwidth
64
5 MB
144-192-384 BOPS
2 Terabits per second
Power Efficiency
Power per tile (depending on app)
Core power for h.264 encode (64
tiles)
Clock speed
170 – 300 mW
12W
Up to 866
MHz
I/O and Memory Bandwidth
I/O bandwidth
Main Memory bandwidth
40 Gbps
200 Gbps
Product reality
Programming
ANSI standard C
SMP Linux programming
Stream programming
40
[Tile64, Hotchips 2007]
[Tile64, Microprocessor Report Nov 2007]
Tile Processor Block Diagram – A Complete System on a Chip
[Figure: tile array with per-tile processor (register file, P0–P2 pipes, L1I/L1D, ITLB/DTLB, L2 cache, 2D DMA) and switch (MDN, TDN, UDN, IDN, STN networks), surrounded by four DDR2 memory controllers, XAUI and PCIe MAC/PHY blocks with SerDes, GbE 0/1, flexible I/O, and UART, HPI, JTAG, I2C, SPI]
Tile Processor NoC
• 5 independent non-blocking networks
  – 64 switches per network
  – 1 Terabit/sec per tile
• Each network switch directly and independently connected to tiles
• One hop per clock on all networks
• Examples: I/O write, memory write, tile-to-tile access
• All accesses can be performed simultaneously on non-blocking networks
[IEEE Micro Sep 2007]
[Figure: tile array with the UDN, STN, IDN, MDN, TDN, and VDN networks]
Multicore Hardwall Implementation, or Protection and Interconnects
[Figure: a link between switches carries data and valid signals; a HARDWALL_ENABLE bit gates valid so that tiles running OS1/APP1, OS2/APP2, and OS1/APP3 cannot send traffic across the wall]
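A minimal sketch, in C, of the hardwall mechanism as the figure suggests it: a per-link enable bit masks the valid signal, so flits are never presented across a protection boundary. The struct and field names are illustrative.

  /* One mesh link with its protection bit. */
  struct link {
      int data;
      int valid;
      int hardwall_enable;   /* set by the OS/hypervisor to wall off the link */
  };

  /* The downstream switch sees a flit only when the wall is down; with the
     wall up, traffic is simply never presented as valid. */
  static int link_valid_out(const struct link *l) {
      return l->valid && !l->hardwall_enable;
  }

  int main(void) {
      struct link l = { .data = 7, .valid = 1, .hardwall_enable = 1 };
      return link_valid_out(&l);   /* 0: the flit never crosses the wall */
  }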
Product Reality Differences
• Market forces
  – Need a crisper answer to "who cares"
  – SMP Linux programming with pthreads – fully cache coherent
  – C + API approach to streaming vs. the new StreamIt language in Raw
  – Special instructions for video, networking
  – Floating point needed in the research project, but not in the product for the embedded market
• Lessons from Raw
  – E.g., dynamic network for streams
  – HW instruction cache
  – Protected interconnects
• More substantial engineering
  – 3-way VLIW CPU, subword arithmetic
  – Engineering for clock speed and power efficiency
  – Completeness – I/O interfaces on chip, a complete system chip; just add DRAM for a system
  – Support for virtual memory, 2D DMA
  – Runs SMP Linux (can run multiple OSes simultaneously)
Virtual reality → Simulator reality → Prototype reality → Product reality
What Does the Future Look Like?
Corollary of Moore's law: the number of cores will double every 18 months

             '02    '05    '08    '11    '14
  Research    16     64    256   1024   4096
  Industry     4     16     64    256   1024

1K cores by 2014! Are we ready?
(Cores minimally big enough to run a self-respecting OS!)
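A quick arithmetic check of the table under the stated 18-month doubling, cores(t) = cores(2002) · 2^((t−2002)/1.5), with the 2002 starting points (16 research, 4 industry) taken from the table:

  #include <math.h>
  #include <stdio.h>

  int main(void) {
      for (int year = 2002; year <= 2014; year += 3) {
          double doublings = (year - 2002) / 1.5;   /* one per 18 months */
          printf("'%02d  research %4.0f  industry %4.0f\n", year % 100,
                 16.0 * pow(2.0, doublings), 4.0 * pow(2.0, doublings));
      }
      return 0;   /* reproduces 16..4096 and 4..1024 */
  }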
Vision for the Future
• The ‘core’ is the logic gate of the 21st century
[Figure: the whole chip as a vast fabric of tiles – processors (p), memories (m), and switches (s) – repeated across the die]
Research Challenges for 1K Cores
• 4-16 cores not interesting. Industry is there. University must focus on “1K cores”;
Everything will change!
• Can we use 4 cores to get 2X through DILP? Remember cores will be 1GHz and
simple! What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you
cores are free. Can you discover mechanisms to make programming easier?
• What is the right grain size for a core?
• How must our computational models change in the face of small memories per core?
• How to “feed the beast”? I/O and external memory bandwidth
• Can we assume perfect reliability any longer?
ATAC Architecture
[Figure: an array of tiles (processor + memory + switch) connected by an Electrical Mesh Interconnect (EMesh) and an Optical Broadcast WDM Interconnect]
[Proc. BARC Jan 2007, MIT-CSAIL-TR-2009-018]
Research Challenges for 1K Cores (repeated)
FOS – Factored Operating System
OS cores collaborate, inspired by the distributed internet services model
The key idea: space sharing replaces time sharing
[Figure: a user app core sends a "need new page" message to OS service cores – file system (FS OS) and I/O – pinned on their own tiles]
• Today: user app and OS kernel thrash each other in a core's cache
  – User/OS time sharing is inefficient
• Angstrom: OS assumes an abstracted space model. OS services are bound to distinct cores, separate from user cores; OS service cores collaborate to achieve the best resource management
  – User/OS space sharing is efficient
[OS Review 2008]
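As a rough model of the space-sharing idea, the sketch below pins an OS service in its own thread (standing in for a dedicated core) and answers an application's request by message passing instead of a kernel call on the app's core. The one-slot mailbox and the '+1000' service work are illustrative.

  #include <pthread.h>
  #include <stdio.h>

  /* One-slot mailbox standing in for the on-chip message network. */
  static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
  static int request, reply, have_req, have_rep;

  /* "FS service core": waits for a request and answers it by message,
     never time-sharing the application's core. */
  static void *fs_service(void *arg) {
      (void)arg;
      pthread_mutex_lock(&mu);
      while (!have_req) pthread_cond_wait(&cv, &mu);
      reply = request + 1000;            /* stand-in for page/FS work */
      have_rep = 1;
      pthread_cond_broadcast(&cv);
      pthread_mutex_unlock(&mu);
      return NULL;
  }

  int main(void) {
      pthread_t svc;
      pthread_create(&svc, NULL, fs_service, NULL);  /* dedicated "OS core" */
      pthread_mutex_lock(&mu);
      request = 42;                      /* app core: "need new page" */
      have_req = 1;
      pthread_cond_broadcast(&cv);
      while (!have_rep) pthread_cond_wait(&cv, &mu);
      printf("service reply: %d\n", reply);
      pthread_mutex_unlock(&mu);
      pthread_join(svc, NULL);
      return 0;
  }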
Research Challenges for 1K Cores (repeated)
The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile
Processor, TILE64, Embedding Multicore, Multicore Development Environment,
Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other
trademarks and/or registered trademarks are the property of their respective
owners.