Transactional Server Processor –
A Bright Future Ahead
Nhon Quach (Guo Ruiren), Ph.D.
Processor Architect
Zhejiang University
Hangzhou
September 13th, 2005
Outline
• What are transactional server processors (TSP)?
• Background
• Past and present TSP design trends
• Server performance growth
• A bright future for TSPs
• Challenges and opportunities
• Summary

You are expected to have fun and ask as many questions as you can.
What Are TSPs?
• Processors designed for transaction processing
  – Formally, transactions have ACID (atomicity, consistency, isolation, and durability) properties in database terminology.
    • E.g.: placing an order, doing a query, updating the database, etc.
  – Often used in corporate server farms and datacenters in tiers.
    • Tier-2: web servers running Apache.
    • Tier-3: application servers running business logic.
    • Tier-4: backend servers running commercial databases such as Oracle, SQL, or DB2.
• Do NOT include processors used in compute servers.
  – Compute servers place a high demand on floating-point computing capability.
• Examples: IBM POWER4, AMD Opteron and Jackhammer, Intel Xeon and Itanium.

Today, the same TSPs are used in all servers (including compute servers).
Tiers of Servers
[Diagram: clients connect via the intranet/Internet and a switch to web servers (small SMPs with 2-8P, stateless connections), which feed application servers (small-medium size SMPs with 4-16P), which feed backend servers (large SMPs with 8-128 processors, stateful transactions).]

Web & app servers usually scale out and backend servers usually scale up.
Next Item of Discussion
• What are transactional server processors (TSP)?
• Background
• Past and present TSP design trends
• Server performance growth
• A bright future for TSPs
• Challenges and opportunities
• Summary
Server Hardware Characteristics
• Symmetric multiprocessing (SMP) systems
  – 2 to 128 processors with large physical memory.
    • Large (>16P) SMPs are often logically partitioned.
    • Stringent requirements on RAS (reliability, availability, serviceability), performance stability, and scalability.
      – Negligible silent data corruption
      – Vendors often need to write service level agreements
• SMP vs. clusters
  – Hardware vs. software enforced coherence
  – Scale up vs. scale out
  – Manageability vs. scalability and availability
  – 90% of the servers sold today are SMPs
    • But clusters are gaining in popularity recently (e.g., grid or utility computing, etc.).

Some large SMPs have more than 1 terabyte of main memory!
SMP vs. Cluster
[Diagram: an SMP, processors and memories joined by hardware links through coherency logic, versus a cluster, nodes of processors plus memory joined by software links via TCP.]
• SMP
  – One OS image
  – HW enforced coherency
    • All memory accesses must check other processor caches or memory for consistency.
    • Need multiple layers of coherency logic for large systems.
• Cluster
  – Multiple OS images
  – SW enforced coherency
    • Only remote data accesses will experience messaging delay.

Manageability is a major reason why SMPs are preferred over clusters.
Server Workload Characteristics
• Large code and data footprints
  – Large cache and TLB miss rates
    • Performance highly sensitive to memory latency
    • Low hardware instruction level parallelism (ILP)
      – Often misconstrued to have low application ILP
• Large number of I/O operations and context switches
  – Results in a large number of interrupts.
• Large number of synchronization operations
• High (>100) level of thread level parallelism (TLP)
• Low demand on floating-point computing capability
  – May change with the proposed IEEE 754R decimal floating-point standard.

TSPs face one of the most demanding workloads.
TSP Performance – A Quick Introduction
• Measured by TPC-C transactions per minute (TPMC)
  – TPMC = N * (1/PL) * (1/CPI) * Freq * 60
    • N: number of processors or threads in the system
      – Function of processor implementation style
    • PL: pathlength, or number of instructions per transaction
      – Function of the ISA and the compiler
    • CPI: cycles per instruction
      – Function of microarchitecture and implementation style
      – Dominated by L3 miss rates (30-40%!)
    • Freq: frequency of the processor
      – Function of process technology and length of the pipeline
• Easier to increase N and/or reduce CPI
  – Use more processors (or multithreading) and a large L3 cache.

TPMC ratings often decide the market success (or failure) of a TSP.
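The formula above can be made concrete with a back-of-the-envelope calculation; every parameter value below is invented for illustration, not taken from any real system.

```python
def tpmc(n_procs, pathlength, cpi, freq_hz):
    """TPMC = N * (1/PL) * (1/CPI) * Freq * 60.

    n_procs:    N, number of processors or threads in the system
    pathlength: PL, instructions per transaction
    cpi:        cycles per instruction
    freq_hz:    processor frequency in Hz
    """
    return n_procs * (1.0 / pathlength) * (1.0 / cpi) * freq_hz * 60

# Hypothetical 4-processor server: 1M instructions per transaction,
# CPI of 5 (server workloads miss the caches often), 2 GHz clock.
rating = tpmc(4, 1_000_000, 5.0, 2e9)
print(f"{rating:,.0f} tpmC")  # 96,000 tpmC
```

Doubling N doubles the rating, and halving CPI (say, with a larger L3) does the same, which is why the slide calls these the easier knobs to turn.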
Components of Memory Latency
[Diagram: a four-processor bus-based SMP. A miss from one CPU traverses (1) its L1, (2) its L2, (3) the bus, (4) a snoop of the other processors' caches, and (5) the local or remote memory access.]

Memory latency = L1 + L2 + L3 + snoop + local/remote memory access time.
Memory latencies are an important determining factor of server performance.
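The additive latency model can be sketched numerically; the cycle counts below are assumptions chosen only to show the relative weight of each component.

```python
# Hypothetical latency components, in processor cycles (illustrative only).
L1_CYCLES = 2       # L1 lookup
L2_CYCLES = 10      # L2 lookup after an L1 miss
L3_CYCLES = 30      # L3 lookup after an L2 miss
SNOOP_CYCLES = 40   # checking the other processors' caches over the bus
LOCAL_MEM = 150     # local DRAM access
REMOTE_MEM = 300    # DRAM access on a remote node

local_latency = L1_CYCLES + L2_CYCLES + L3_CYCLES + SNOOP_CYCLES + LOCAL_MEM
remote_latency = L1_CYCLES + L2_CYCLES + L3_CYCLES + SNOOP_CYCLES + REMOTE_MEM
print(local_latency, remote_latency)  # 232 382
```

Under these assumptions the memory access itself dominates the total, which is consistent with the deck's emphasis on integrating the memory controller to cut latency.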
Next Items of Discussion
• What are transactional server processors (TSP)?
• Background
• Past and present TSP design trends
• Server performance growth
• A bright future for TSPs
• Challenges and opportunities
• Summary
The 80's
• No concept of tiered servers yet.
  – RISC processors dominated the server and workstation markets.
  – CISC processors dominated the desktop market.
• All processors were designed to exploit ILP.
  – One-size-fits-all design style
  – Processed multiple (2-3) instructions per cycle.
  – Used dynamic branch prediction and on-chip I and D caches.
  – Performance limited by the number of available transistors
    • Simplified instruction sets to ease decoding => the RISC "revolution"
• Power consumption was not an issue.

Design focus was how to find more transistors.
The 90's
• The search for ILP continued.
  – Pipelined floating-point execution units
  – Processed 4-8 instructions per cycle
  – Out-of-order (OOO) execution
  – Larger and more (2-3) levels of on-chip caches
• Frequency sold!
  – Ultra long and high frequency pipelines became popular.
• CISC processors took over the workstation market.
  – But RISC processors still dominated the server market.
• Efforts continued to invent new ISAs (e.g., Intel EPIC).
  – Increased awareness of the abundance of TLP in server apps.
• Power consumption was somewhat an issue.
  – But most vendors adopted a wait-and-see attitude.

Design focus was how to find even more transistors.
TSP Integration History
[Timeline diagram, 1980-2005: in 1980 only the CPU was on chip, with the FPU, cache, and memory controller off chip; by 1985 the FPU was integrated (CPU + FPU); by 1990 a cache and branch tables moved on chip; by 1995 two cache levels (L1, L2) were on chip; by 2000 an on-chip L3 was added; by 2005 two cores (CPU0, CPU1), each with branch tables and L1/L2 caches, sit on one chip with L3 caches and integrated memory controllers.]

Integration reduces both local and remote memory latencies.
Performance Improvement
[Chart: tpmC and $/tpmC by year reported, 1990-2004, on a log scale from 1 to 10,000,000, with Moore's Law shown for reference: tpmC ratings rise steadily while $/tpmC falls. Source: http://www.tpc.org, 2005.]

Achieved by having faster and more processors in the system.
The Present
• Today, all TSPs use
  – Large dynamic branch prediction tables
  – Three levels of on-chip caches with huge L3 caches
  – Two levels of on-chip TLBs
  – Pure hardware or software-assisted OOO (e.g., Intel IPF)
• Impressive TSP performance gain
  – 50% due to process technology
  – 50% due to implementation (uarch + circuit) techniques
• ISA largely irrelevant
  – CISC processors dominate the low-end server market
    • Dominance of the high-end market likely

Memory and system architecture is more important than ISA style.
The Present (Cont.)
• The search for ILP is slowing
  – Though certain researchers are still trying
  – Re-discovery of the KISS (Keep It Simple, Stupid) principle
  – Search for ways to exploit TLP begins
• Memory latencies are stabilizing!
  – Cache hierarchy stops growing – stays at three levels
  – Processor clock frequency increase is slowing
  – Use of integrated memory controllers
• Transistor integration density still follows Moore's law.
  – But transistors are slower, leakier, and less reliable.
• Power consumption is a major problem
  – Market segmentation – the birth of TSPs

Design focus is how to use the transistors more efficiently.
Next Item of Discussion
• What are transactional server processors (TSP)?
• Background
• Past and present TSP design trends
• Server performance growth
• A bright future for TSPs
• Challenges and opportunities
• Summary
New Developments Favoring TSP
• Multiple processor cores on a single chip
  – The era of chip multiprocessing (CMP) has finally arrived!
• Implicit and explicit multithreading
  – Efficient way to increase the number of logical processors.
• Active power management
  – Consume power only when absolutely needed.
• Virtualization technology
  – Allows server consolidation
  – Reduces cost of ownership

These technologies will enable a new breed of low-cost and powerful servers.
The Promise of CMP
[Diagram: a conventional 4P SMP, built from four processor+L3 nodes with attached memories, compared with systems built from CMP chips of 4-8 cores each, so the same four-socket configuration scales from 16P to 32P.]
• Scale out by increasing the number of cores in a CMP
  – Protect IT infrastructure investment
  – Memory latencies remain unchanged

CMP is an especially cost effective way to build a single-chip server.
Two-Core CMP – An Example
[Diagram: two high-performance ILP cores (dynamic branch prediction, multiple issue, and OOO execution), each with L1I, L1D, L2 queues, and an L2 cache, feeding L3 queues, an L3, and a P2P fabric + MC logic block. Annotations:]
• L1I: 16-64K, read-only, designed for latency. Branches must be resolved quickly, hence the small size to achieve a one-cycle access time. TLB address translation is often in the critical path.
• L1D: 16-64K, WT or WB, designed for latency. Determines the memory-op penalty; there is a large penalty for every extra cycle of latency. TLB address translation is often in the critical path.
• L2 queues: buffer read/write and instruction traffic. Reads and writes may be reordered depending on the memory ordering model.
• L2: 0.5-1MB, unified and WB. Designed for latency and bandwidth.
• L3 queues: reads and writes may be reordered depending on the memory ordering model. May or may not be shared.
• L3: 4-8MB, WB. Designed for bandwidth with a large line size. May or may not be shared.
• P2P fabric + MC logic: point-to-point network running at a high frequency. Enforces cache coherency and ordering.

For simplicity, some CMP designs don't share the L3 cache.
The Promise of Multithreading
[Diagram: a four-socket system of CMP chips whose cores each run multiple threads, scaling the logical processor count from 16P to 32P to 64P.]
• CMP + multithreading is an effective way to exploit TLP
  – Increase the number of (logical) processors without increasing memory latency.

I/O and memory BWs place an upper limit on the number of threads.
Multithreading
• Amortize memory and stall penalties.
  – Benefit depends on the stall and memory access patterns.
• Threads share the same pipelines and uarch resources.
  – Need to duplicate arch and enlarge uarch resources.
  – Thread switch and run time slice depend on threading style.
    • May require complex pipeline interlock logic for certain threading styles
• Explicit threads
  – Threads appear as logical processors to the OS.
    • Exploit TLP to improve transaction throughput.
    • Single thread performance may suffer.
• Implicit threads
  – Used strictly by the hardware to enhance single-thread performance.
    • Often with compiler assist.

There is still no consensus on how to best use the threads.
Multithreading – An Example
[Diagram: an unthreaded processor pipeline (L1I$, branch tables, register file, execution units, D$ hierarchy) next to a 2-way threaded version. The branch tables are typically enlarged to handle two threads; the L1I$ contains instructions from both threads; thread switch (TS) logic is added; the register file contains architectural state and must be duplicated (many ways to do this, as shown on the next slide); the execution units and D$ (L1/L2/L3) are shared between the threads.]

Complexity of implementing multithreading is often underestimated.
Many Multithreading Styles
• Interleaved: each thread takes a turn
  – Simple issue logic, complicated interlock logic
    • Fits well with non-stallable pipeline and in-order design styles
  – Exploits both short and long stall latencies
• Blocked: thread switch on long latency events (e.g., an L3 miss)
  – Simple issue and interlock logic
    • Pipeline flush on each thread switch
    • No need to redesign the pipeline interlock logic
  – Reasonable performance
• SMT: both threads may issue in the same clock
  – Complicated issue and interlock logic
    • Fits well with non-stallable pipeline design style
  – Exploits both short and long stall latencies
  – Takes advantage of often-empty slots in superscalar processors

Existing implementation constraints often determine the threading style.
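The payoff of the blocked style can be seen in a toy utilization model; every number here (miss latency, switch cost, run burst) is an assumption for illustration, not a measurement of any real processor.

```python
# Toy model of blocked multithreading: a thread runs until it takes an
# L3 miss, then the core switches to the other thread instead of
# stalling for the whole miss, paying only a pipeline-flush penalty.
MISS_LATENCY = 300  # cycles an L3 miss would stall the pipeline (assumed)
SWITCH_COST = 20    # cycles lost to the pipeline flush on a switch (assumed)
RUN_BURST = 500     # cycles of useful work between misses (assumed)

def utilization_single_thread():
    # One thread: work, then sit idle for the entire miss.
    return RUN_BURST / (RUN_BURST + MISS_LATENCY)

def utilization_blocked_2way():
    # Two threads: the other thread runs while the first waits on its
    # miss; if its burst outlasts the miss, only the switch cost is lost.
    exposed = max(0, MISS_LATENCY - RUN_BURST)
    return RUN_BURST / (RUN_BURST + SWITCH_COST + exposed)

print(f"single thread: {utilization_single_thread():.1%}")  # 62.5%
print(f"blocked 2-way: {utilization_blocked_2way():.1%}")   # 96.2%
```

Interleaved and SMT styles additionally hide the short stalls inside each run burst, which this model deliberately ignores.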
Server Performance Trend Prediction
[Chart: predicted TPMC rating on a log scale (1 to 100,000) versus chip integration generation. Successive curves for "ILP only", "ILP + CMP", and "ILP + multithreading + CMP" each extend the growth trend, with a "we are here" marker at the start of the ILP + CMP era.]

Achieved by increasing the number of processors and reducing memory latencies.
Future TSP – An Example
• Processor core design
  – 0.065-0.045 µm (65-45nm) technology
  – 4-8 GHz (~10-20 stage pipeline)
  – Dynamic branch prediction with three levels of caches
  – Adequate level of reliability
• 4-16 processor cores
• 2-8 explicit threads
• 16-128MB L3 cache size
• Point-to-point bus topology with integrated memory controller
• Aggressive active power management

128P server systems on a single chip!
Outline
• What are transactional server processors (TSP)?
• Background
• Past and present TSP design trends
• Server performance growth
• A bright future for TSPs
• Challenges and opportunities
• Summary
Possible Roadblocks
• Power consumption
  – Limits the number of cores or threads on a chip.
  – Passive power management (PPM) still has a lot of room for improvement – and research!
    • Process technology level: reduce leakage currents, etc.
    • Circuit level: size transistors properly, use sleep transistors, etc.
    • Logic level: use low voltage swing logic, use clock gating, etc.
    • Micro-architecture level: use stallable pipelines, etc.
    • Systems level: use active power management (APM)
• Reliability (against soft errors)
  – May require expensive protection of random logic elements
    • Extent of the problem not yet known.

Power consumption will play the same prominent role as ILP once did.
Different Types of Power Consumption
[Chart: power vs. time for two running applications (App A, App B), with four horizontal reference levels and what determines each:]
• Thermal power: determined by the type of packaging
• Worst-case power: determined by a representative set of target applications or a power virus
• Max. app. power: determined by the currently running application
• Ave. app. power: determined by a representative set of target applications or the currently running application

Power consumption is one of the most difficult quantities to estimate.
Power Consumption Trend with PPM
[Chart: relative performance (SpecInt), die size, and power across the 486, Pentium, PIII, and P4 generations on a 0-18 relative scale, with power growing faster than performance. Source: Deborah Marr et al., "Hyper-Threading Technology Architecture And Microarchitecture," Intel Technology Journal, Q1, 2002.]

With PPM, it is hard to integrate multiple cores on a single chip.
Active Power Management
[Diagram: four processor cores with power, temperature, and performance sensors feeding an APM module, which drives a frequency controller, a voltage controller, and an off-chip voltage regulator.]
• Max. power is dynamically allocated among cores depending on the power requirements of the running apps.

Adjusting voltage is the most effective way to control power consumption.
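Why voltage is the biggest knob follows from the dynamic power relation P = C·V²·f; the sketch below uses made-up capacitance, voltage, and frequency values purely to show the scaling.

```python
# Dynamic CMOS power: P = C * V^2 * f.  Lowering voltage usually forces
# a proportional frequency drop, so power falls roughly with the cube
# of the scaling factor while performance falls only linearly.
def dynamic_power(c_eff, volts, freq_hz):
    return c_eff * volts ** 2 * freq_hz

C_EFF = 1e-9  # effective switched capacitance in farads (assumed)

full = dynamic_power(C_EFF, 1.2, 3.0e9)     # nominal operating point
scaled = dynamic_power(C_EFF, 0.9, 2.25e9)  # V and f both cut by 25%

print(f"power at 75% speed: {scaled / full:.0%} of full power")  # 42%
```

An APM module that briefly drops a core's voltage/frequency pair therefore recovers most of the power budget at a comparatively modest performance cost.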
PPM vs. APM
• PPM
  – Design for worst-case power.
    • Max. power = thermal power
    • Shutdown on overheat
  – Processor runs at fixed voltage and frequency.
    • No power balancing between cores.
    • Fixed power budget
  – Save energy by
    • Local clock gating when blocks are idle.
    • Global clock gating when the processor is idle (P states rarely used).
  – No saving on static power consumption.
• APM
  – Design for ave. app. power.
    • Max. power may be adjusted
    • Shutdown on overheat
  – Processor runs at variable voltages and frequencies.
    • Dynamic power balancing among cores.
    • Power budget dynamically adjusted.
  – Save energy by
    • Local clock gating.
    • Dynamically adjusting voltage and clock frequency (hardware P state transitions).
    • Shutting off power to a block to save static power consumption.

The voltage regulator's switching speed determines how aggressive APM can be.
APM Challenges
• Difficult to monitor performance level
  – Processor may be stalled for memory accesses
  – Applications often show phase behaviors
    • Quick response time needed
• Customers want a minimum guarantee on performance
• May affect system software behaviors
  – Timestamps, timing loops, etc.
• Intel IPF2 is a good start
  – Showed 3x improvement in power efficiency
  – Still much room for improvement

Software and systems vendors are now open to the concept of APM.
Summary
• Server workloads are highly demanding because of their large code and data footprints.
  – Large miss rates and synchronization and I/O rates
• Past TSPs used a one-size-fits-all design style.
  – Good performance gain in the 80's and 90's
  – Poor scalability in terms of power consumption and memory latency.
• CMP + multithreading is a more scalable way to build SMPs
  – But needs active power management techniques.
• Future TSP designs will be more tailored and balanced: focusing on memory latency, TLP, soft errors, and power consumption.
• The best is yet to come for TSPs
  – An extended period of growth and simultaneous drop in cost

TSPs – A Bright Future Awaits!
Q&A
• Thank you!
• Hope you had fun and learned something new.
• Questions or comments?
• Contact info: [email protected]