RAMP-White
Derek Chiou and Hari Angepat
The University of Texas at Austin
© Derek Chiou
Supported in part by DOE, NSF, IBM, Intel, and Xilinx

RAMP-White Requirements
Coherent shared memory experimental platform
Scalable to the same level as other RAMP machines
  1K eventual target
  Down to 2
Full system (OS, I/O, etc.)
Intentions
  Configurable coherence protocol, engine
  ISA/Architecture independent (like all RAMP efforts)
    Use different cores
  Integrate components from other RAMP participants
    A test-bed for sharing IP
Derek Chiou, RAMP-White Tutorial, FCRC 2007

Texas Modifications to RAMP-White
New code in Bluespec rather than Verilog/VHDL
  Many advantages including interfaces, configurability
  My group’s hardware development is exclusively Bluespec
  Free/low cost for academics (www.bluespec.com)
Start with XUP board
  We had XUP before BEE2
Embedded PowerPC is starting core
  It’s a free, fast core with real (incoherent) 16KB caches
    RAMP is core independent
    My research needs fast cores
    Can then use synthesizable 405s
  No space issues on XUP
    2 Leons + MMU + memory controller barely fits (no space for our stuff)
Multi-OS shared space
  Processors map to shared global space
  May try SMP OS, but unlikely to scale well to 1K processors

High-Level Architecture Philosophy
Flexibility
  Avoid wasted work
  Easy changes
  Module-agnostic
Interfaces
  Complete set of necessary interfaces
  All communication via messages
    Fixed fields, but fields are configurable
  “Shims” connect components to White infrastructure
    Processors, network, I/O, etc.
    Use existing IP
Building one instance to confirm interface completeness

32b Address in Shared Memory Machine??
4GB possible per BEE2 FPGA
  Need more than 32b
Eventually, hope for 64b soft-core processors
For now, two options:
  Live with 4GB space
  Or, provide one more layer of translation
    Physical address in certain region is global virtual address
    Translated by hardware to node + physical address
    Also useful for multiple OSs in single memory
      OSs tend to assume they own physical address 0
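
The extra translation layer on the 32b address slide can be sketched in a few lines. This is only an illustration under stated assumptions: the window base, window size, per-node memory size, and the contiguous node layout are all hypothetical values, since the slides do not give the real ones, and `window_offset` is an invented knob showing how a 32b window could slide over a larger global space.

```python
from typing import NamedTuple, Optional

# Hypothetical parameters: the slides do not give the window location or layout.
GLOBAL_BASE = 0x8000_0000      # start of the region treated as global virtual
GLOBAL_SIZE = 0x4000_0000      # size of that region as seen by one 32b processor
NODE_MEM_BYTES = 0x1000_0000   # global memory contributed by each node


class Target(NamedTuple):
    node: int    # node that owns the memory
    paddr: int   # physical address within that node's contribution


def translate(paddr32: int, window_offset: int = 0) -> Optional[Target]:
    """Map a 32b processor physical address to (node, physical address).

    Returns None for addresses outside the global region (purely local).
    window_offset (hypothetical) places the 32b window in a larger global space.
    """
    if not GLOBAL_BASE <= paddr32 < GLOBAL_BASE + GLOBAL_SIZE:
        return None
    gaddr = window_offset + (paddr32 - GLOBAL_BASE)
    return Target(node=gaddr // NODE_MEM_BYTES, paddr=gaddr % NODE_MEM_BYTES)
```

Because each OS only sees its own local range plus the global window, every OS image can still believe it owns physical address 0.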

RAMP-White Block Diagram
[Figure: block diagram. Components: Processor with Coherent $ (proc dependent), Intersection Unit (IU), Memory Controller (MC), IO & Platform Devices, Network Interface (NIU), Network Router]

Three Phase Approach to Hardware
Phase 1: Incoherent shared memory
  No hardware global cache, just global shared memory support
  However, software can maintain coherence if necessary
    Network virtual memory
    Run a simulator on top of the processor
  Ring network
Phase 2: Ring-based coherence (scalable bus)
  Requires a coherent cache, IU awareness
  Running what is essentially a snoopy protocol
    True coherence engine not required
    But, very restricted communication
  Sufficient for testing, modeling many targets
Phase 3: General network-based coherence
  Requires general coherence engine, general network
[Figure: node diagrams showing processor (P) with caches ($), coherent cache (C$), IU, MC, and I/O; annotated "Optional cache for local memory"]

Intersection Unit
Processor interface
  Slave
  Snoop
Network interface
  Master (send)
  Slave (receive)
Memory interface
  Master (issue memory requests)
Hooks for coherency engine
  Bluespec nice to specify coherence engine
  Incoherent version is a special case
Programmable memory regions
  Global (local and remote)
  Local
  Translation
[Figure: Processor with Coherent $, Intersection Unit (IU), Network Interface (NIU), Memory Controller (MC), IO & Platform Devices]
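
One way to picture the programmable memory regions is as a small table the IU consults on every request. The sketch below is an assumption-heavy illustration, not the Bluespec implementation: the `Region` fields, the `Port` names, the contiguous ownership rule, and the fall-through to I/O for unmapped addresses are all invented for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Port(Enum):
    MEMORY = "MC"     # local memory controller
    NETWORK = "NIU"   # remote node, reached through the network interface
    IO = "IO"         # I/O and platform devices


@dataclass
class Region:
    base: int
    size: int
    is_global: bool   # global regions may resolve to local or remote memory


class IntersectionUnit:
    def __init__(self, node_id: int, node_mem_bytes: int) -> None:
        self.node_id = node_id
        self.node_mem_bytes = node_mem_bytes
        self.regions: List[Region] = []

    def program_region(self, region: Region) -> None:
        self.regions.append(region)

    def route(self, addr: int) -> Port:
        for r in self.regions:
            if r.base <= addr < r.base + r.size:
                if not r.is_global:
                    return Port.MEMORY
                # Hypothetical ownership rule: contiguous slices per node.
                owner = (addr - r.base) // self.node_mem_bytes
                return Port.MEMORY if owner == self.node_id else Port.NETWORK
        return Port.IO  # assumption: unmapped addresses fall through to I/O
```

With node 1 configured with a local region and a global region, a global address owned by node 0 would be routed to the NIU, while one owned by node 1 would go straight to the local memory controller.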

Intersection Unit Internals
[Figure: Intersection Unit internals. Labels: Proc, Net, IO, Memory Controller & DRAM, Global Address Translation, Intersection Unit Controller, hardware, IO Controller, BRAMs]

Network Interface Unit
Currently two virtual channels
Split into two components
  Msg composition/queuing
  Net transmit/receive
    Insert/extract for ring
    Intended to permit other net-specific transmit/receive
One input/one output
  Creates a simple unidirectional ring
  Can interface to more advanced fabrics
[Figure: Processor with Coherent $, Intersection Unit (IU), Network Interface (NIU), Memory Controller (MC), IO & Platform Devices]
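
The insert/extract behavior of a one-input, one-output NIU on a unidirectional ring can be modeled in software. This is a minimal sketch under assumptions: the `(dest, src, payload)` packet layout, the single slot per link, and the rule that a node inserts only when its incoming slot is free are illustrative choices, and virtual channels are not modeled.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Packet = Tuple[int, int, str]  # (dest, src, payload): illustrative layout only


class RingNode:
    def __init__(self, node_id: int) -> None:
        self.node_id = node_id
        self.to_send: Deque[Packet] = deque()   # messages waiting to be inserted
        self.received: List[Packet] = []        # messages extracted from the ring


def step(nodes: List[RingNode],
         links: List[Optional[Packet]]) -> List[Optional[Packet]]:
    """Advance the ring one cycle; links[i] is the slot arriving at node i."""
    n = len(nodes)
    nxt: List[Optional[Packet]] = [None] * n
    for i, node in enumerate(nodes):
        pkt = links[i]
        if pkt is not None and pkt[0] == node.node_id:
            node.received.append(pkt)           # extract: message is for this node
            pkt = None
        if pkt is None and node.to_send:
            pkt = node.to_send.popleft()        # insert into the free slot
        nxt[(i + 1) % n] = pkt                  # forward to the next node
    return nxt
```

Calling `step` repeatedly on a three-node ring moves each queued message one hop per cycle until its destination extracts it.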

IU Internal Message
Message layout:
  PRI | CMD | PERM | SIZE | TAG
  GADDR
  DATA
Defaults
  PRI: High priority, Low priority
  CMD: Read, Write, Coherence, …
  PERM: Modified, Exclusive, Shared, Invalid
  SIZE: Byte, word, double word, cache-line
  GADDR: global address (translated by IU)
  DATA: dependent on size
Bluespec permits easy modification for your protocol
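
The default IU message fields map naturally onto a record type. The sketch below is in Python rather than Bluespec and makes several assumptions: the numeric encodings are arbitrary, the byte counts attached to SIZE (4-byte word, 8-byte double word, 32-byte cache line) are illustrative, TAG is treated as a plain integer because the slide does not describe it, and the length check on writes is an invented sanity rule.

```python
from dataclasses import dataclass
from enum import Enum


class Pri(Enum):
    HIGH = 0
    LOW = 1


class Cmd(Enum):
    READ = 0
    WRITE = 1
    COHERENCE = 2


class Perm(Enum):
    MODIFIED = 0
    EXCLUSIVE = 1
    SHARED = 2
    INVALID = 3


class Size(Enum):
    # Values are byte counts; these widths are assumptions for illustration.
    BYTE = 1
    WORD = 4
    DOUBLE_WORD = 8
    CACHE_LINE = 32


@dataclass(frozen=True)
class IUMessage:
    pri: Pri
    cmd: Cmd
    perm: Perm
    size: Size
    tag: int          # meaning not specified on the slide
    gaddr: int        # global address, translated by the IU
    data: bytes = b""

    def __post_init__(self) -> None:
        # Hypothetical check: DATA is "dependent on size", so writes must match.
        if self.cmd is Cmd.WRITE and len(self.data) != self.size.value:
            raise ValueError("write data length must match SIZE")
```

Adapting the message to a different protocol then amounts to editing the enumerations, which mirrors the slide's point about easy modification.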

Network Message
Message layout:
  PRI | DEST | SRC | SIZE | NETTAG | CMD
  MESSAGE
Fields
  PRI: High and Low
  DEST, SRC: destination, source of message
  SIZE: Total message size
  NETTAG: network tag (optional)
  CMD: network command (optional)
  MESSAGE: data
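
To make the network message layout concrete, here is a small pack/unpack sketch. The field widths (1-byte PRI, 2-byte DEST, SRC, and SIZE, 1-byte NETTAG and CMD) and the big-endian byte order are assumptions chosen for the example; the slide lists only the field names and says SIZE is the total message size.

```python
import struct
from dataclasses import dataclass

# PRI, DEST, SRC, SIZE, NETTAG, CMD; all widths are assumed for illustration.
HEADER = struct.Struct(">BHHHBB")


@dataclass(frozen=True)
class NetMessage:
    pri: int
    dest: int
    src: int
    nettag: int
    cmd: int
    message: bytes

    def pack(self) -> bytes:
        size = HEADER.size + len(self.message)   # SIZE covers header + payload
        header = HEADER.pack(self.pri, self.dest, self.src,
                             size, self.nettag, self.cmd)
        return header + self.message

    @classmethod
    def unpack(cls, raw: bytes) -> "NetMessage":
        pri, dest, src, size, nettag, cmd = HEADER.unpack_from(raw)
        if size != len(raw):
            raise ValueError("SIZE field does not match received length")
        return cls(pri, dest, src, nettag, cmd, raw[HEADER.size:])
```

An IU internal message would travel as the MESSAGE payload, so the network layer never needs to understand coherence commands.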

Programmer View
Sequential consistency
  PowerPC
    Global addresses labeled as uncached
    Ordered accesses from PowerPC 405
    Coherent global cache still uncached
  Soft cores can be weaker
User interface
  Terminal per core/OS if desired
  Mmap to map shared memory
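
From user code, mapping the shared region and touching it with ordinary loads and stores might look like the sketch below. It uses an anonymous mapping as a stand-in, because the slides do not name the device or driver that exposes the global region; the helper names and the big-endian 32-bit word format are likewise assumptions.

```python
import mmap
import struct

# Stand-in for the global region: an anonymous mapping of one page.
# On the real platform the mapping would come from whatever device or
# driver exposes global memory, which the slides do not name.
shared = mmap.mmap(-1, mmap.PAGESIZE)


def store_word(buf: mmap.mmap, offset: int, value: int) -> None:
    """Store a 32-bit big-endian word at the given byte offset."""
    struct.pack_into(">I", buf, offset, value)


def load_word(buf: mmap.mmap, offset: int) -> int:
    """Load a 32-bit big-endian word from the given byte offset."""
    return struct.unpack_from(">I", buf, offset)[0]


store_word(shared, 0, 0xCAFE)
value = load_word(shared, 0)
```

Once the region is mapped, no special API is needed; every access is a plain memory operation.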

Operating System
Issues with SMP OS on embedded PowerPC
  Incoherent cache
  Load-reservation/store-conditional instructions not MP capable
  Also missing TLB invalidation & OpenPIC (interprocessor interrupts, bring-up)
  How scalable anyway? (1K processors)
Therefore, separate OS per core
Region of memory is global
  Mmap
  Locks implemented using regular loads/stores + sequential consistency
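
A lock built from regular loads and stores under sequential consistency can be illustrated with Peterson's algorithm for two processors. The slides do not say which algorithm was used, so this choice is an assumption. The sketch simulates two processors as generators that yield after every memory operation, with a seeded scheduler interleaving them in a single global order, which is what sequential consistency provides; the variable names in `mem` are hypothetical.

```python
import random
from typing import Dict, Generator


def worker(me: int, mem: Dict[str, int], iters: int) -> Generator[None, None, None]:
    """One processor: take lock, increment counter, release lock."""
    other = 1 - me
    for _ in range(iters):
        mem[f"flag{me}"] = 1            # store: I want the lock
        yield
        mem["turn"] = other             # store: give priority to the other side
        yield
        while True:                     # spin using plain loads only
            other_wants = mem[f"flag{other}"]
            yield
            turn = mem["turn"]
            yield
            if not other_wants or turn == me:
                break
        tmp = mem["counter"]            # critical section: non-atomic increment
        yield
        mem["counter"] = tmp + 1
        yield
        mem[f"flag{me}"] = 0            # store: release the lock
        yield


def run(iters: int, seed: int) -> int:
    """Interleave two workers one memory operation at a time."""
    mem = {"flag0": 0, "flag1": 0, "turn": 0, "counter": 0}
    rng = random.Random(seed)
    procs = [worker(0, mem, iters), worker(1, mem, iters)]
    while procs:
        p = rng.choice(procs)
        try:
            next(p)
        except StopIteration:
            procs.remove(p)
    return mem["counter"]
```

Because the increment is deliberately split into a separate load and store, the final count equals the total number of iterations only if the lock provides mutual exclusion. On hardware with a weaker memory model the same code would need barriers, which is why the sequential-consistency guarantee matters here.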

Status: Phase 1 RAMP-White
Hari Angepat did the work
Components
  Written in Bluespec
  NIU code complete and tested
    2 processor ring
  IU code complete and tested
    Processor slave (no coherence right now)
    PLB master/slave interface (I/O)
    NIU interface
  PLB master and slave shims written
Hardware intended to target different ISAs
Some preliminary OS work
  Multi-image mmap interface running

Current RAMP-White Phase 1
[Figure: two nodes, each a PPC 405 running Linux with its own Intersection Unit (IU) and Network Interface (NIU); also shown: PLB shim, Memory Controller (MC), IO & Platform Devices]

Phase 1 Demo on XUP Configuration
See both processors boot and run (top, cpu_info)
Run a simple “take-lock, increment counter, release lock”

Our Long Term Plans
Phase 1, XUP just started to work
  With multi-OS, limited device support
  Limited alpha release end of 3Q07
Phase 2
  Coherent cache, IU forwarding modifications
  Better OS support (ProtoFlex?)
  Limited alpha release 1Q08
Phase 3
  Arbitrary network, cache coherency engine
    Getting network from Washington, Berkeley
  RDL? Leon?
  Release depends on ease of integration

Conclusions
RAMP-White architecture
  Phased approach minimizes wasted work
  Designed to be easy to modify for your purpose
    Many architectures only require modified coherence engine, maybe cache
ISA/implementation agnostic
  Care taken to not be specific
RAMP-White Phase 1 works
  Running on XUP
We will be our own customer
  Building cycle-accurate x86 CMP simulator on top

Extra slides

Node Architecture
[Figure: four node variants, each with a processor (P), caches ($), an IU, MC, and I/O; two of the variants add a coherent cache (C$)]

Generalized Architecture
[Figure: Proc with $, Mem/MC, PLB, OPB bridge, Intersection Unit (IU), Network Interface Unit (NIU); regions labeled "Proc dependent" and "Proc independent"]

Sharing IP: Some Preliminary Experience
We looked at RAMP-Red XUP
  Used some code (PLB master)
  Red-BEE is not ready to distribute
Looking for switch code
Berkeley’s code on CVS repository
  But, we can’t use memory controller because we don’t have BEE2 board yet
Bluespec
  We are spinning almost all of our own code right now
Would like to steal software
  OS (kernel proxy)
  SMP OS port
  MPI reference design in BEE2 repository
Naming
  Is that RAMP-Blue?
  A central CVS repository for RAMP code?

Sharing Over the Long Term
Processor is shared
MC is shared
Borrow half from Berkeley?
Network can be shared
Trying to make ours general
NIU can be shared
CMU/Stanford
IU functionality can be shared
Peripherals
Transactional/traditional
Borrow Stanford’s?
Coherency engine can be shared
Xilinx or Berkeley
Coherent cache can be shared
Leon
PowerPC
MicroBlaze
Everything else
Borrow Berkeley’s?
[Figure: node diagram with Proc, $, CCE, IU, NIU, Mem/MC]