Piranha: A Scalable Architecture Based on Single-Chip
Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000
Presented by Wael Kdouh
Spring 2006
Professor: Dr. Hisham El Rewini
Computer Science and Engineering
Motivation
 Economic: high demand for OLTP (on-line transaction processing) machines
 Disconnect between the industry's ILP focus and this demand
 OLTP workload characteristics:
   - High memory latency
   - Little ILP (get, process, store)
   - Large TLP
 OLTP is poorly served by aggressive ILP machines
 Use "old" cores and an ASIC design methodology to build "glueless," scalable OLTP machines with low development cost and short time to market
 Short wires instead of costly, slow long wires that can limit cycle time
 Amdahl's Law: with a small serial fraction, throughput grows nearly linearly with the number of cores
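The Amdahl's Law bullet above can be made concrete with a toy calculation (the workload fractions below are illustrative, not from the paper): with the large TLP of OLTP, eight simple cores approach their ideal speedup, while a half-serial workload wastes most of them.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Amdahl's Law: overall speedup when only part of the work parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_workers)

# OLTP-style workload (large TLP, tiny serial fraction) vs. a half-serial one.
oltp_like = amdahl_speedup(0.99, 8)    # close to the ideal 8x
half_serial = amdahl_speedup(0.50, 8)  # under 2x
```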
Computer Science and Engineering
Other Innovations
 The shared second-level cache uses a sophisticated protocol that does not enforce inclusion of the first-level instruction and data caches, maximizing utilization of on-chip cache capacity.
 The inter-node cache coherence protocol incorporates a number of unique features that result in fewer protocol messages and lower protocol-engine occupancies than other designs.
 A unique I/O architecture: the I/O node is a full-fledged member of the interconnect and of the global shared-memory coherence protocol.
The Piranha Processing Node
 CPU: Alpha core, single-issue, in-order, 8-stage pipeline, 500 MHz (an ECE152-style simple pipeline)
 180 nm process (2000); almost entirely ASIC design: roughly 50% of the clock speed at 200% of the area versus a full-custom methodology
 Intra-Chip Switch (ICS): a unidirectional crossbar
 Separate I and D L1 caches (64 KB, 2-way set-associative) for each CPU
 Logically shared, interleaved L2 cache (1 MB)
 Eight memory controllers, each interfacing to a bank of up to 32 Rambus DRAM chips; aggregate peak bandwidth of 12.8 GB/s
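The aggregate bandwidth figure can be sanity-checked with a one-line calculation, assuming each of the eight memory controllers drives one Rambus (RDRAM) channel at a 1.6 GB/s peak rate (the per-channel rate is an assumption about the era's RDRAM, not stated on this slide):

```python
# Eight controllers, one assumed 1.6 GB/s Rambus channel each.
controllers = 8
channel_peak_gb_per_s = 1.6
aggregate_gb_per_s = controllers * channel_peak_gb_per_s  # 12.8 GB/s
```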
Communication Assist
+ Home Engine (exporting) and Remote Engine (importing) support shared memory across multiple nodes
+ System Control handles system miscellany: interrupts, exceptions, initialization, monitoring, etc.
+ Standard OQ, Router, IQ, and Switch blocks link multiple Piranha chips
+ Each link and block here corresponds to actual wiring and a module
+ Total inter-node I/O bandwidth: 32 GB/s
+ Note: the processing node has NO INHERENT I/O CAPABILITY
I/O Organization
 Smaller than the processing node
 Router has only 2 links, alleviating the need for a routing table
 Memory is globally visible and part of the coherence scheme
 CPU placement is optimized for drivers, translations, etc., which need low-latency access to I/O
 A re-used dL1 design provides the interface to the PCI/X bus
 Supports an arbitrary I/O-to-processor ratio and network topology
 Glueless scaling up to 1024 nodes of any type supports application-specific customization
Piranha System
Coherence: Local
 Each L2 bank and its associated controller hold the directory data for intra-chip requests: a centralized on-chip directory
 The intra-chip switch (ICS) is responsible for all on-chip communication
 The L2 is non-inclusive of the L1s
 The L2 acts as a "large victim buffer" for the L1s and keeps duplicate copies of the L1 tags and state
 The L2 controller can determine whether data is cached remotely, and if so whether exclusively; the majority of L1 requests therefore require no communication-assist involvement
 On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or obtain the data from memory
 While a request is forwarded, the L2 blocks conflicting requests to the same line
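The L2 controller's four-way choice on an L1 miss can be modeled as a simple dispatch. This is an illustrative sketch only; the function and data-structure names are invented for clarity and are not from the paper:

```python
from enum import Enum, auto

class Action(Enum):
    SERVICE_FROM_L2 = auto()
    FORWARD_TO_OWNER_L1 = auto()
    FORWARD_TO_PROTOCOL_ENGINE = auto()
    FETCH_FROM_MEMORY = auto()

def l2_lookup(addr, l2_tags, l1_owner, remote_exclusive):
    """Dispatch an L1 miss arriving at its home L2 bank.

    l2_tags: addresses currently held in this L2 bank
    l1_owner: addr -> CPU id, from the duplicate L1 tag/state copies
    remote_exclusive: addresses held exclusively by another node
    """
    if addr in remote_exclusive:      # needs the protocol engines (CA)
        return Action.FORWARD_TO_PROTOCOL_ENGINE
    if addr in l2_tags:               # hit: service directly from L2
        return Action.SERVICE_FROM_L2
    if addr in l1_owner:              # non-inclusive: line lives only in an L1
        return Action.FORWARD_TO_OWNER_L1
    return Action.FETCH_FROM_MEMORY   # miss everywhere local: go to memory
```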
Coherence: Global

 Trades ECC granularity for "free" directory-data storage (computing ECC at 4x granularity leaves 44 bits per 64-byte line)
 Invalidation-based distributed directory protocol
 Some optimizations:
 No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes (L: requests to the home node, low priority; H: forwarded requests and replies, high priority)
 Forwarded requests are guaranteed to be serviced by their targets: e.g., an owner writing back to home holds the data until the home acknowledges
 Removes NACK/retry traffic, as well as the "ownership change" messages of DASH, the retry counts of Origin, and the "no, seriously" persistent retries of Token Coherence
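The "44 free bits per line" figure can be reproduced from standard SECDED check-bit counts. A minimal sketch, assuming Hamming SECDED codes over a 64-byte (512-bit) line; the 4x-granularity claim is from the slide, the check-bit formula is the textbook one:

```python
def secded_check_bits(data_bits):
    """Smallest k with 2**(k-1) >= data_bits + k: Hamming single-error
    correction plus one extra parity bit for double-error detection."""
    k = 1
    while 2 ** (k - 1) < data_bits + k:
        k += 1
    return k

LINE_BITS = 512  # 64-byte cache line
fine = (LINE_BITS // 64) * secded_check_bits(64)      # per 64-bit word: 8 * 8 = 64
coarse = (LINE_BITS // 256) * secded_check_bits(256)  # 4x granularity: 2 * 10 = 20
freed = fine - coarse                                 # bits left for directory state
```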
Evaluation Methodology

 Admittedly favorable OLTP benchmarks chosen (modified TPC-B and TPC-D)
 Simulated and compared to the performance of an aggressive out-of-order core (Alpha 21364) with integrated coherence and cache hardware
 Results "fudged" to estimate the full-custom effect
 Four evaluations: P1 (one-CPU Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz), and P8 (the full eight-CPU Piranha chip)
Parameters for different processor designs (table).
Results
Performance Evaluation
 OLTP and DSS workloads: TPC-B/TPC-D on an Oracle database
 SimOS-Alpha simulation environment
 Compared: Piranha (P8) @ 500 MHz and full-custom Piranha (P8F) @ 1.25 GHz versus a next-generation microprocessor (OOO) @ 1 GHz
Single-Chip Evaluation
 OOO outperforms P1 (an individual Piranha CPU) by 2.3x
 P8 outperforms OOO by 3x
 Speedup of P8 over P1 = 7x
Multi-Chip Configurations
 Four chips (only 4 CPUs per chip ?!)
 Results show that Piranha scales better than OOO
Questions/Discussion
 Evaluation methodology?
 Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
 Is reliability better or worse with multiple CPUs per chip?
 Power consumption?
Conclusion

The authors maintain that:
1) The use of chip multiprocessing is inevitable in future microprocessor designs.
2) As more transistors become available, further increasing on-chip cache sizes or building more complex cores will only lead to diminishing performance gains and possibly longer design cycles.
Given the enormous emphasis Intel engineers are placing on massive L2 caches, they appear to disagree.
Given the huge investment that both Intel and Compaq/HP have put into the Itanium family, and the fact that the Alpha is a moribund architecture, it is unlikely that the innovative Piranha microprocessor will ever see the light of day.
The Future


 In his paper "Forty-Five Years of Computer Architecture—All That's Old is New Again," Harvey G. Cragon finds that most performance-improvement advances in computer microarchitecture have been based on exploiting only two ideas: locality and pipelining.
 In my personal opinion, the coming years are going to exploit two ideas: SMT and CMP.
 No more penguins to eat…
Questions