Transcript: Slide 1

Toward a Sustainable
Architecture at
Extreme Scale
Zhimin Tang, CTO
[email protected]
Outline
Sustainable (Cost Effective) HPC
Counter-examples in history
Current and Future Challenges
New computing forms from sensor to cloud
Silicon-based IC processes approaching their
physical limits
Strategy
Abandon HPC-only acceleration features
Design sustainable architecture for HPC and
other applications
Considerations of Cost
Effectiveness or Sustainability
Application (Algorithm) Requirements
High performance
Technology Constraints
CMOS vs. bipolar, Moore’s Law
Commercial MPU vs. custom ASIP
Economic Feasibility
Good eco-system
Mass production
Low energy consumption
HPCs in History
Vector Supercomputers
CMOS Dominated, SIMD Weakness
Connection Machine
SIMD PE Array
Optimal only for some algorithms
Custom chips, tiny processor
MIMD with Custom CPUs
Chip Level Integration (SoC)
nCube/2, KSR-1 (COMA), …
High NRE cost due to custom design without
mass production
Low node processor performance
Why No Cost Effectiveness?
HPC Is a Small Market
Architectures Designed Only for HPC
Lower volume, higher cost (NRE)
Not enough resources to implement a top-level
(performance-wise) solution
Longer time-to-market, behind Moore’s Law
Result: COTS Solutions in the Last 20 Years
Commercial off-the-shelf
Co-design with the IT Ecosystem
From Cloud computers to sensors
Ecosystem Requirements
High Performance and Low Cost
Low cost continues to be a must
New factors of cost: energy/power, big NRE
Performance no longer the bottleneck
for most applications
(like cars, trains, and airplanes in transportation)
New manifestations of performance
Computing: MIPS/MFLOPS
Transaction processing: TPM
Cloud applications: requests serviced in unit time
Energy Efficiency
Two Ends of the Computing System
Cloud: large scale power dissipation
Terminal: limited battery life
Energy cost: compute < memory < communication
For each FLOP in Linpack:
the FPU spends 10 pJ, a memory access 475 pJ
Wireless Sensor Network
RF radio consumes most of the power
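The 10 pJ vs. 475 pJ figures above can be turned into a quick sketch of why locality matters so much for energy; the arithmetic intensities tried below (FLOPs per memory access) are hypothetical, not measurements from the talk:

```python
# Back-of-envelope energy breakdown per Linpack FLOP, using the
# figures quoted above (10 pJ per FPU operation, 475 pJ per memory
# access). Real codes with good locality amortize the 475 pJ over
# many FLOPs; the intensities below are illustrative.
FPU_PJ = 10.0
MEM_PJ = 475.0

def memory_energy_fraction(flops_per_access: float) -> float:
    """Fraction of total energy spent on memory, given the number
    of FLOPs performed per memory access (arithmetic intensity)."""
    total = FPU_PJ * flops_per_access + MEM_PJ
    return MEM_PJ / total

for intensity in (1, 8, 47.5):
    f = memory_energy_fraction(intensity)
    print(f"{intensity:>5} FLOPs/access -> memory takes {f:.0%} of energy")
```

Even at 8 FLOPs per access, memory still dominates the energy budget; it only drops to half at an intensity of 47.5.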
What Do We Need Besides Locality?
New Architectures Needed
Architectures consuming less energy
Many-core, custom-designed for applications
Flattened software stack
Architecture for New Performance Metrics
High volume throughput computers
New Algorithms and Methodology
Complexity of computation
Complexity of memory access and
communication
Constraints to Innovation
Existing Software Ecosystem
standard or de facto interfaces
e.g., ISA: Instruction Set Architecture
Pro: Compatibility of Software
Con: Obstacles to Innovation, legacy burden
Huge Expenses of Development
new architecture needs new processors
NRE of chip development increasing rapidly,
as the CMOS process approaches its limit
NRE: Non-Recurring Engineering
CMOS Technology
Approaching Its Limit, with No Replacement!
Moore's Law: 7 nm @ 2024, ~30 atoms wide
Different from the Transition in the 1990s
Bipolar (ECL/TTL) was faster, but consumed
much more power
CMOS had matured for 20 years: not too slow,
low cost, and low power
But now, even CMOS needs liquid cooling
In the foreseeable future, still CMOS
"More Moore" and "More than Moore"
2011 ITRS Exec. Summary Fig. 4
Dark Silicon
At 8 nm, more than half of the transistors must be
turned off
Speedup of only 4-8x over 5 process generations
ISCA’11, IEEE Micro’12, CACM’13
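A back-of-envelope sketch of the dark-silicon trend, using illustrative scaling factors (2x transistor density per generation, but only ~1.4x lower per-transistor switching power once Dennard scaling ends); these factors are textbook assumptions, not numbers from the cited papers:

```python
# Illustrative dark-silicon arithmetic. Each process generation:
# 2x transistors, but per-transistor power falls only ~1.4x after
# Dennard scaling ends, so under a fixed chip power budget the
# active fraction shrinks by 2/1.4 ~ 1.4x per generation.
def active_fraction(generations: int,
                    density_scale: float = 2.0,
                    power_scale: float = 1.4) -> float:
    """Fraction of transistors that can switch simultaneously
    after `generations` process shrinks, fixed power budget."""
    return (power_scale / density_scale) ** generations

for g in range(6):
    print(f"gen {g}: {active_fraction(g):.1%} of transistors active")
```

Under these assumptions, the active fraction drops below half within two generations, matching the shape of the claim above.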
Economical Feasibility
Moore’s Law Provides More Transistors
But switching speed is no longer increasing
Process development at the nanometer scale
increases NRE tremendously
Mass Production Is Essential
Otherwise, chip business is not sustainable
Advantages of general-purpose processors
How about Many-core Processors?
GPU, Tilera, MIC, …
Pros and Cons of MPU
Most Advanced Process, Mass Production
Stable, reliable, low cost
Mature ecosystem and solutions
Not Optimal for Many Applications
Aim: not too bad for most applications
Over-allocation of resources
Wasted resources, higher energy
consumption
MPUs Not Good for the Cloud
High L1-I Cache Miss Rate
Processor idle (instruction starvation)
Small ILP and MLP
Wide issue not effective
Low Efficiency of Memory Access
Large L3 takes half the chip area, but does not
help improve performance
Unused High Bandwidth On-chip
Little data sharing among cores
Low Utilization of Resources
Only about 1/3 of the resources are frequently used
[Figure: typical MPU floorplan, with four cores (each with out-of-order (OOO) logic, an FPU, and a private L2 cache), an on-chip GPU, and a large shared L3 cache]
Pros and Cons of ASIP
Optimally Designed for Specific Applications
high efficiency, low resource, low power
But There Is No Free Lunch
Much design/verification work
Stability/Reliability?
May affect the time to market
How to amortize the huge NRE?
A small market means high cost
MPU + Accelerator
GPU
Pro: mass production
Con: PCIe overhead, small memory size
MIC (Xeon Phi)
Mass production possible?
FPGA
Resource utilization
Ease of programming
MPU interface, e.g., QPI or PCIe
Design of New Processors
Bridging the Gap between General and
Special Purpose
Many Simple Cores
Reduce power consumption
Multiple Hardware Threads in Each Core
Massive threads on chip
Exploit concurrency, tolerate latency
Dynamic Scheduling of On-chip Threads
Improve performance for general apps
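A minimal sketch of why multiple hardware threads per core tolerate latency, as claimed above: while one thread stalls on memory, the core issues instructions from the others. The 10-cycle compute burst and 70-cycle miss latency below are illustrative assumptions, not figures from the talk:

```python
# Round-robin hardware multithreading: each thread repeats a burst
# of compute cycles followed by a long memory stall. With enough
# threads, the stalls of one thread are covered by the others and
# pipeline utilization saturates at 100%.
def utilization(n_threads: int, compute_cycles: int, mem_latency: int) -> float:
    """Steady-state pipeline utilization for n interleaved threads,
    each alternating `compute_cycles` of work with a
    `mem_latency`-cycle memory stall."""
    busy = n_threads * compute_cycles
    period = compute_cycles + mem_latency
    return min(1.0, busy / period)

for n in (1, 2, 4, 8):
    print(f"{n} threads: {utilization(n, 10, 70):.0%} utilization")
```

With these parameters, a single thread keeps the pipeline busy only one cycle in eight, while eight threads fully hide the latency.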
Combining Multithreading
and Vector Pipelining
[Figure: pipelined vector processing engine. Multiple thread PCs feed a shared instruction cache (I$), instruction register (IR), instruction decode (ID), and register file (RF); the back end (ALU, FPU, LSU, vector registers, and data cache/scratchpad memory, D$/SPM) can switch to a single thread driving a deep scalar pipeline, or switch to the vector pipeline]
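The scalar-to-vector mode switch in the engine above can be motivated with a rough energy model: one fetched and decoded vector instruction drives many element operations, amortizing front-end cost. The per-instruction and per-operation energies below are assumptions for illustration, not measurements:

```python
# Why vector mode saves energy (illustrative costs): the fixed
# fetch/decode cost of an instruction is spread over all elements
# it processes, so energy per element falls toward the raw ALU cost
# as vector length grows.
FETCH_DECODE_PJ = 30.0   # assumed front-end cost per instruction
ALU_OP_PJ = 10.0         # assumed cost per element operation

def energy_per_element(vector_length: int) -> float:
    """Energy per element operation when one instruction covers
    `vector_length` elements (vector_length=1 is scalar mode)."""
    return FETCH_DECODE_PJ / vector_length + ALU_OP_PJ

for vl in (1, 4, 16, 64):
    print(f"VL={vl:>2}: {energy_per_element(vl):.2f} pJ/element")
```

Scalar mode pays the full front-end cost per operation; long vectors approach the per-operation floor, which is the point of switching pipelines per workload.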
Thread Parallelism and Data
Parallelism in Two Dimensions
Deep thread parallelism and data parallelism
[Figure: the engine replicated in two dimensions. Widening the vector register file yields wide data parallelism; replicating the PC/instruction-decode/register-file front end yields wide thread parallelism]
In Conclusion
A Universal Architecture
Scalable and reconfigurable processor array
Supports thread and data level parallelism
Fulfills All Requirements from Terminals to
Cloud Data Centers
High performance computers
Cloud computing servers
Core network equipment
Terminals for cloud and mobile Internet
Thanks!