Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing
Luiz A. Barroso et al. (Compaq Computer Corporation)
Presented by: Nick Kirchem
Feb 13, 2004
Target and Motivation

Commercial applications (databases, OLTP)
  – Most important market for high-performance servers
  – Data-dependent computation (low ILP)
  – Little gained from complex multiple-issue, out-of-order processors
Complexity of current processors
  – Long design times
  – High development costs
  – Better use of transistors?
Project Goals

Design a Chip Multiprocessing (CMP) system
  – Integrate 8 simple processor cores on a single chip
  – Exploit thread-level parallelism instead of ILP
High performance, low cost
  – Achieve superior performance on commercial workloads
  – Small team, modest investment, short design time
Architecture Overview
Architecture Elements

Simple processors (500 MHz, in-order)
No I/O capability on chip (separate I/O nodes)
Up to 1024 nodes in a system
Individual L1 caches (64 KB, 2-way set-associative)
One logical L2 cache, interleaved, 1 MB
Intra-chip switch
  – Unidirectional crossbar
  – Transaction-based, atomic transfers
  – Bandwidth ~3x memory bandwidth
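The cache figures above can be sanity-checked with a small sketch. The slide does not state the line size, the number of L2 banks, or the interleaving function; the values below (64-byte lines, 8 banks, bank chosen by low-order line-address bits, and separate 64 KB instruction and data L1 caches per core) are assumptions made for illustration.

```python
# Figures taken from the slide.
NUM_CORES = 8    # 8 simple cores per chip
L1_KB = 64       # each L1 cache is 64 KB
L2_KB = 1024     # one logical 1 MB L2

# Assumptions for illustration (not stated on the slide).
L1_PER_CORE = 2  # assumed: separate instruction and data L1 caches
LINE_BYTES = 64  # assumed cache line size
L2_BANKS = 8     # assumed number of interleaved L2 banks


def aggregate_l1_kb() -> int:
    """Total L1 capacity on the chip, in KB."""
    return NUM_CORES * L1_PER_CORE * L1_KB


def l2_bank(addr: int) -> int:
    """Pick an L2 bank from the low-order bits of the line address.

    This is one plausible interleaving, not necessarily the real one.
    """
    return (addr // LINE_BYTES) % L2_BANKS


if __name__ == "__main__":
    print("aggregate L1 (KB):", aggregate_l1_kb())
    for addr in (0, 64, 128, 512):
        print(hex(addr), "-> bank", l2_bank(addr))
```

Under these assumptions the aggregate L1 capacity equals the L2 capacity, which matches the "1 MB aggregate L1, 1 MB L2" figure quoted on the coherence slide, and consecutive cache lines map to different banks.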
Intra-Chip Cache Coherence

MESI protocol
No inclusion (1 MB aggregate L1, 1 MB L2)
  – But L2 holds a copy of the L1 tags and state (no snooping required at L1)
  – L1 filled directly from memory (L2 = victim cache)
Coherence handled by L2 controllers
  – Can service a request directly, forward it to the owner L1, forward it to a protocol engine, or obtain the data from memory
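The four outcomes listed for an L2 controller can be sketched as a toy routing function. This is a heavily simplified model, not the actual controller logic: the class and method names (`L2Bank`, `route`, `fill_l1`, `evict_l1`) are hypothetical, the order in which the cases are checked is an assumption, and MESI state is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class L2Bank:
    """Toy model of one L2 bank with duplicate L1 tags (hypothetical names)."""
    l2_lines: set = field(default_factory=set)       # lines held in the L2 data array
    l1_dup_tags: dict = field(default_factory=dict)  # line address -> owning core id
    local_memory: set = field(default_factory=set)   # lines whose home is this node

    def fill_l1(self, addr: int, core: int) -> None:
        # L1 is filled directly from memory: only the duplicate tag is
        # recorded, and no L2 line is allocated (non-inclusive).
        self.l1_dup_tags[addr] = core

    def evict_l1(self, addr: int) -> None:
        # L2 acts as a victim cache: a line evicted from L1 moves into L2.
        if self.l1_dup_tags.pop(addr, None) is not None:
            self.l2_lines.add(addr)

    def route(self, addr: int) -> str:
        # Assumed check order: L2 data, then duplicate L1 tags, then home.
        if addr in self.l2_lines:
            return "service from L2"
        owner = self.l1_dup_tags.get(addr)
        if owner is not None:
            return f"forward to owner L1 (core {owner})"
        if addr in self.local_memory:
            return "obtain from memory"
        return "forward to protocol engine"


if __name__ == "__main__":
    bank = L2Bank(local_memory={0x100, 0x140})
    bank.fill_l1(0x100, core=3)
    print(bank.route(0x100))  # line lives only in core 3's L1
    bank.evict_l1(0x100)
    print(bank.route(0x100))  # victim now held in L2
    print(bank.route(0x999))  # not local, so it goes off-node
```

Because the duplicate tags live at the L2, the controller can decide where a line is without interrupting any L1, which is the point of the "no snooping required at L1" bullet.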
Inter-Node Coherence

Protocol engines (microprogrammable controllers)
  – Home: exports local memory
  – Remote: imports remote memory
Directory storage
  – Compute ECC at coarse granularity and use the extra bits for directory info, so there is no memory space overhead
  – Directory granularity = 1 node (not individual processor)
Interconnect: I/O queues, router (point-to-point, 4 links)
No NAKs: avoid deadlock through sufficient buffering, and guarantee that forwarded requests can be serviced
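The directory-storage trick is easier to see with numbers. The sketch below counts SECDED (single-error-correct, double-error-detect) check bits at two granularities and reports how many bits per cache line are freed. The slide gives no concrete sizes, so the 64-byte line and the move from 64-bit to 256-bit ECC words are assumptions chosen for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a SECDED Hamming code over `data_bits` of data.

    Uses the standard bound 2**r >= data_bits + r + 1 for single-error
    correction, plus one overall parity bit for double-error detection.
    """
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def freed_bits_per_line(line_bits: int, fine: int, coarse: int) -> int:
    """Bits saved per line by computing ECC over `coarse`-bit words
    instead of `fine`-bit words."""
    fine_total = (line_bits // fine) * secded_check_bits(fine)
    coarse_total = (line_bits // coarse) * secded_check_bits(coarse)
    return fine_total - coarse_total


if __name__ == "__main__":
    LINE_BITS = 64 * 8  # assumed 64-byte cache line
    print("check bits per 64-bit word:", secded_check_bits(64))
    print("check bits per 256-bit word:", secded_check_bits(256))
    print("freed bits per line:", freed_bits_per_line(LINE_BITS, 64, 256))
```

Under these assumptions the ECC storage already provisioned for 64-bit words leaves a few dozen spare bits per line once the code is computed over wider words, and those spare bits can hold the directory state without taking any additional memory.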
Performance Evaluation

OLTP and DSS workloads: TPC-B/D, Oracle database
SimOS-Alpha environment
Compared:
  – Piranha (P8) @ 500 MHz and Full-Custom (P8F) @ 1.25 GHz
  – Next-generation microprocessor (OOO) @ 1 GHz
Single-Chip Evaluation
  – OOO outperforms P1 (individual processor) by 2.3x
  – P8 outperforms OOO by 3x
  – Speedup of P8 over P1 = 7x
Multi-Chip Configurations
  – Four chips (only 4 CPUs per chip?!)
  – Results show that Piranha scales better than OOO
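The three single-chip ratios are mutually consistent, which a quick calculation shows. The function names are hypothetical, and the efficiency figure is only as precise as the rounded ratios quoted on the slide.

```python
def chained_speedup(ooo_over_p1: float, p8_over_ooo: float) -> float:
    """P8-over-P1 speedup implied by the two quoted ratios."""
    return ooo_over_p1 * p8_over_ooo


def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved."""
    return speedup / cores


if __name__ == "__main__":
    implied = chained_speedup(2.3, 3.0)  # ratios quoted on the slide
    print(f"implied P8/P1 speedup: {implied:.1f}x")
    print(f"efficiency on 8 cores: {parallel_efficiency(implied, 8):.0%}")
```

Multiplying the two ratios gives roughly the 7x quoted for P8 over P1, which corresponds to a little under 90% of ideal scaling across the 8 cores.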
Questions/Concerns

Would the Piranha design be worthwhile if there were a well-designed SMT processor (with 4 or 8 threads)?
Is reliability better or worse with multiple chips per processor?
Power consumption?