Piranha: A Scalable Architecture Based on Single-Chip
Multiprocessing
Barroso, Gharachorloo, McNamara, et al.
Proceedings of the 27th Annual ISCA, June 2000
Presented by Wael Kdouh
Spring 2006
Professor: Dr. Hisham El Rewini
Computer Science and Engineering
Motivation
Economic: high demand for OLTP (on-line transaction processing) machines
Disconnect between the industry's focus on ILP and this demand
OLTP
-High memory latency
-Little ILP (Get, process, store)
-Large TLP
OLTP is poorly served by aggressive ILP machines
Use “old” cores and an ASIC design methodology for “glueless,” scalable OLTP machines with low development cost and short time to market
Short wires as opposed to costly and slow long wires that can affect cycle time
Amdahl’s Law
Other Innovations
The shared second-level cache uses a sophisticated protocol that does not enforce inclusion of the first-level instruction and data caches, in order to maximize utilization of the on-chip caches.
The cache coherence protocol among nodes incorporates a number of
unique features that result in fewer protocol messages and lower protocol
engine occupancies compared to other designs.
It has a unique I/O architecture, with an I/O node that is a full-fledged member of the interconnect and of the global shared-memory coherence protocol.
The Piranha Processing Node
CPU:
Alpha core (simple enough to be ECE152 work)
Single-issue, in-order, 8-stage pipeline
500 MHz
180 nm process (2000)
Almost entirely ASIC design:
50% clock speed, 200% area versus a full-custom methodology
Intra-Chip Switch (ICS): a unidirectional crossbar
Separate I/D L1 caches (64 KB, 2-way set-associative) for each CPU.
Logically shared, interleaved L2 cache (1 MB).
Eight memory controllers, each interfacing to a bank of up to 32 Rambus DRAM chips; aggregate peak bandwidth of 12.8 GB/s (sanity-checked below).
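As a quick sanity check on the 12.8 GB/s figure, here is a minimal sketch; the 1.6 GB/s per-channel rate is an assumption (the standard Rambus RDRAM channel rate of the era), not a number from the slide:

```python
# Back-of-the-envelope check of the node's aggregate memory bandwidth.
# Assumed: 1.6 GB/s per channel (800 MHz x 16-bit RDRAM channel of the era);
# the 12.8 GB/s aggregate is the figure stated above.
MEMORY_CONTROLLERS = 8
GB_PER_SEC_PER_CHANNEL = 1.6

aggregate = MEMORY_CONTROLLERS * GB_PER_SEC_PER_CHANNEL
print(f"aggregate memory bandwidth = {aggregate} GB/s")  # -> 12.8 GB/s
```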
Communication Assist
+ Home Engine (exporting) and Remote Engine (importing) support shared memory across multiple nodes (see the sketch after this list)
+ System Control handles system miscellany: interrupts, exceptions, init, monitoring, etc.
+ OQ, Router, IQ, and Switch form the standard interconnect that links multiple Piranha chips
+ Each link and block here corresponds to actual wiring and a module
+ Total inter-node I/O bandwidth: 32 GB/s
THERE IS NO INHERENT I/O CAPABILITY.
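A minimal sketch of the exporting/importing split, assuming a simple address-interleaved home mapping; `home_node_of` is hypothetical, not the paper's actual function:

```python
# Hypothetical sketch of how a memory request is steered to the Home or
# Remote engine. home_node_of() is an assumed, simple interleaving; the
# real mapping is not specified on this slide.

def home_node_of(addr: int, num_nodes: int) -> int:
    return (addr >> 6) % num_nodes  # assume 64-byte lines, round-robin homes

def engine_for(addr: int, this_node: int, num_nodes: int) -> str:
    if home_node_of(addr, num_nodes) == this_node:
        return "Home engine"   # exporting: serve other nodes' requests for local memory
    return "Remote engine"     # importing: handle this node's requests for remote lines
```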
I/O Organization
Smaller than a processing node
Router has only 2 links, alleviating the need for a routing table (sketched below)
Memory is globally visible and part of the coherence scheme
CPU placement is optimized so that drivers, translation code, etc. get the low-latency access to I/O they need
Re-used L1 data-cache design provides the interface to PCI/X
Supports an arbitrary I/O-to-processor ratio and network topology
Glueless scaling up to 1024 nodes of any type supports application-specific customization
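A minimal sketch of why two links need no routing table; the function below is illustrative, not the paper's router logic:

```python
# Illustrative only: with exactly two links, a packet either terminates at
# this node or leaves on the one link it did not arrive on -- no table needed.

def route(dest: int, this_node: int, arrived_on_link: int) -> str:
    if dest == this_node:
        return "deliver locally"
    return f"forward on link {1 - arrived_on_link}"
```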
Piranha System
Coherence: Local
Each L2 bank and its associated controller hold the directory data for intra-chip requests – a centralized directory
The chip's ICS is responsible for all on-chip communication
L2 is “non-inclusive”: it acts as a large victim buffer for the L1s and keeps copies of the L1 tags and state
The L2 controller can thus determine whether data is cached elsewhere and, if so, whether exclusively; the majority of L1 requests then require no communication-assist (CA) involvement
On a request, the L2 can service it directly, forward it to the owner L1, forward it to a protocol engine, or fetch from memory (sketched below)
While a request is forwarded, the L2 blocks conflicting requests to the same line
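A minimal sketch of that four-way choice; the state fields below are illustrative assumptions, not the paper's actual encodings:

```python
# Illustrative sketch of the L2 controller's options on an L1 miss, matching
# the list above. Field names are assumptions made for exposition; while a
# forward is outstanding, conflicting requests to the same line block.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LineState:
    valid_in_l2: bool
    owner_l1: Optional[int]   # index of the owning L1, per duplicate L1 tags
    homed_off_chip: bool

def handle_l1_miss(line: LineState, requester: int) -> str:
    if line.valid_in_l2:              # L2 holds the data: service directly
        return f"reply to L1 {requester} from L2"
    if line.owner_l1 is not None:     # duplicate L1 tags locate the on-chip owner
        return f"forward to owner L1 {line.owner_l1}"
    if line.homed_off_chip:           # remotely homed: hand off to a protocol engine
        return "forward to protocol engine"
    return "fetch from local memory"  # locally homed, not cached on chip
```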
Coherence: Global
Trades ECC granularity for “free” directory data storage (computing ECC at 4x granularity leaves 44 bits per 64-byte line – worked out below)
Invalidation-based distributed directory protocol
Some optimizations
No NACKing: deadlock avoidance through I/O, L, and H priority virtual lanes – L: requests to the home node (low priority); H: forwarded requests and replies (high priority)
Also guarantees that forwards are always serviced by their targets: e.g., an owner writing back to home holds the data until the home acknowledges
Removes NACK/retry traffic, as well as “ownership change” (DASH), retry counts (Origin), “No, seriously” (Token)
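The 44-bit figure follows from standard SEC-DED check-bit counts; here is a minimal worked check, assuming ordinary SEC-DED codes (which the 4x-granularity claim implies):

```python
# Worked check of the "44 free bits" claim, assuming standard SEC-DED codes.
# SEC-DED needs k check bits where 2**(k-1) >= data_bits + k.

def secded_bits(data_bits: int) -> int:
    k = 1
    while 2 ** (k - 1) < data_bits + k:
        k += 1
    return k

LINE_BITS = 512                                 # one 64-byte cache line
fine   = (LINE_BITS // 64)  * secded_bits(64)   # 8 words x 8 bits  = 64
coarse = (LINE_BITS // 256) * secded_bits(256)  # 2 words x 10 bits = 20
print(fine - coarse)                            # -> 44 bits freed per line
```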
Evaluation Methodology
Admittedly favorable benchmarks chosen (modified TPC-B for OLTP and TPC-D for DSS)
Simulated and compared to the performance of an aggressive OOO core (Alpha 21364) with integrated coherence and cache hardware
“Fudged” for full-custom effect
Four evaluations: P1 (single-core Piranha @ 500 MHz), INO (1 GHz single-issue in-order aggressive core), OOO (4-issue, 1 GHz out-of-order), and P8 (the eight-CPU Piranha chip)
Parameters for the different processor designs.
Results
Performance Evaluation
OLTP and DSS workloads: TPC-B/D, Oracle database
SimOS-Alpha environment
Compared:
Piranha (P8) @ 500 MHz and Full-Custom (P8F) @ 1.25 GHz
Next-generation microprocessor (OOO) @ 1 GHz
Single Chip Evaluation
OOO outperforms P1 (a single Piranha core) by 2.3x
P8 outperforms OOO by 3x
Speedup of P8 over P1 = 7x
Multi-chip Configurations
Four chips (only 4 CPUs per chip?!)
Results show that Piranha scales better than OOO
Questions/Discussion
Evaluation methodology?
Would the Piranha design be worthwhile if there were a
well-designed SMT processor (with 4 or 8 threads)?
Is reliability better or worse with multiple CPUs per chip?
Power consumption?
Conclusion
The authors maintain that:
1) The use of chip multiprocessing is inevitable in future
microprocessor designs.
2) As more transistors become available, further
increasing on-chip cache sizes or building more
complex cores will only lead to diminishing performance
gains and possibly longer design cycles.
Given the enormous emphasis that Intel engineers are placing on massive L2 caches, they appear to disagree.
Given the huge investment that both Intel and Compaq/HP have put into the Itanium family, and the fact that Alpha is a moribund architecture, it is unlikely that the innovative Piranha microprocessor will ever see the light of day.
The Future
Harvey G. Cragon argues in his paper “Forty Five Years of Computer Architecture—All That's Old is New Again” that most of the performance-improvement advances in computer microarchitecture have been based on the exploitation of only two ideas: locality and pipelining.
In my personal opinion, the coming years are going to exploit two ideas: SMT and CMP.
No more penguins to eat…
Questions