Toward a Sustainable
Architecture at
Extreme Scale
Zhimin Tang, CTO
[email protected]
Outline
Sustainable (Cost Effective) HPC
Counter-examples in history
Current and Future Challenges
New computing forms from sensor to cloud
Silicon-based IC processes approaching their physical limit
Strategy
Abandon HPC-only acceleration features
Design a sustainable architecture for HPC and other applications
Considerations of Cost
Effectiveness or Sustainability
Application (Algorithm) Requirements
High performance
Technology Constraints
CMOS vs. bipolar, Moore’s Law
Commercial MPU vs. custom ASIP
Economic Feasibility
Good ecosystem
Mass production
Low energy consumption
HPCs in History
Vector Supercomputers
CMOS came to dominate; SIMD weaknesses
Connection Machine
SIMD PE Array
Optimal only for some algorithms
Custom chips, tiny processors
MIMD with Custom CPUs
Chip Level Integration (SoC)
nCube/2, KSR-1 (COMA), …
High NRE cost due to custom design without mass production
Low node processor performance
Why Not Cost-Effective?
HPC Is a Small Market
Architectures Designed Only for HPC
Lower volume, higher cost (NRE)
Not enough resources to implement a top-level (performance-wise) solution
Longer time-to-market, behind Moore’s Law
Result: COTS Solutions over the Last 20 Years
Commercial off-the-shelf
Co-design with the IT Ecosystem
From Cloud computers to sensors
Ecosystem Requirements
High Performance and Low Cost
Low cost remains a must
New cost factors: energy/power, large NRE
Performance is no longer the bottleneck for most applications
Analogous to cars, trains, and airplanes in transportation
New forms of performance metrics
Computing: MIPS/MFLOPS
Transaction processing: TPM (transactions per minute)
Cloud applications: requests served per unit time
Energy Efficiency
Two Ends of the Computing System
Cloud: large-scale power dissipation
Terminal: limited battery life
Energy: compute < memory < communication
For each FLOP in Linpack, the FPU spends ~10 pJ while a memory access costs ~475 pJ (rough estimate below)
Wireless Sensor Networks
The RF radio consumes most of the power
What Do We Need Besides Locality?
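To make the point concrete, here is a back-of-the-envelope estimate using the per-operation energies quoted above; the 1 memory access per FLOP ratio is an illustrative assumption, not a measured Linpack profile.

```latex
% Illustrative assumption: one memory access per floating-point operation.
E_{\mathrm{flop}} = 10\,\mathrm{pJ}, \qquad E_{\mathrm{mem}} = 475\,\mathrm{pJ}
E_{\mathrm{per\,FLOP}} = E_{\mathrm{flop}} + E_{\mathrm{mem}} = 485\,\mathrm{pJ},
\qquad \frac{E_{\mathrm{mem}}}{E_{\mathrm{per\,FLOP}}} = \frac{475}{485} \approx 98\%
% At 1 PFLOP/s this rate of data movement alone implies
P \approx 485\,\mathrm{pJ} \times 10^{15}\,\mathrm{s^{-1}} = 485\,\mathrm{kW},
\quad \text{versus only } 10\,\mathrm{kW} \text{ for the arithmetic itself.}
```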
New Architectures Needed
Architectures That Consume Less Energy
Many-core, custom-designed for the applications
Flattened software stack
Architecture for New Performance Metrics
High-volume throughput computers
New Algorithms and Methodologies
Complexity of computation
Complexity of memory access and communication (example below)
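A standard example of why memory/communication complexity must be analyzed alongside arithmetic complexity: dense n x n matrix multiplication on a node with a fast memory (cache) of M words. The Hong-Kung bound is a known result; the rest is only illustrative.

```latex
\text{arithmetic: } W = 2n^{3} \ \text{flops}
\text{data movement, naive triple loop: } Q = O(n^{3}) \ \text{words}
\text{data movement, blocked (matching the lower bound): } Q = \Theta\!\left(n^{3}/\sqrt{M}\right) \ \text{words (Hong--Kung)}
\Rightarrow \ \text{arithmetic intensity rises from } O(1) \text{ to } O(\sqrt{M}) \ \text{flops/word,}
\text{so algorithms with the same operation count can differ enormously in energy and time.}
```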
Constraints on Innovation
Existing Software Ecosystem
Standard or de facto interfaces
e.g., the ISA (Instruction Set Architecture)
Pro: software compatibility
Con: obstacle to innovation, legacy burden
Huge Development Expenses
A new architecture needs new processors
NRE of chip development is increasing rapidly as the CMOS process approaches its limit
NRE: Non-Recurring Engineering
CMOS Technology
Approaching the Limit, and No Replacement!
Moore's law: 7 nm around 2024, only ~30 atoms across
Different from the Transition in the 1990s
Bipolar (ECL/TTL) was faster but consumed much more power
CMOS, after 20 years of development, was not too slow, and was low cost and low power
But now even CMOS needs liquid cooling
In the foreseeable future, it is still CMOS
"More Moore" and "More than Moore"
(2011 ITRS Executive Summary, Fig. 4)
Dark Silicon
At 8 nm, more than half of the transistors must be turned off
Only a 4-8x speedup over 5 process generations
(ISCA'11, IEEE Micro'12, CACM'13)
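One hedged way to see where such numbers come from, with purely illustrative post-Dennard scaling factors (not taken from the cited papers): per process generation the feature size shrinks by s ≈ 0.7.

```latex
\text{Dennard era: } P_{\mathrm{transistor}} \propto C V^{2} f \propto s \cdot s^{2} \cdot s^{-1} = s^{2} \approx 0.5
\quad\Rightarrow\quad \text{density} \times 2,\ \text{power/transistor} \times 0.5 \ \Rightarrow\ \text{power density constant.}
\text{Post-Dennard (V and f nearly flat): } P_{\mathrm{transistor}} \propto C \propto s \approx 0.7
\quad\Rightarrow\quad \text{full-chip power grows} \approx 2 \times 0.7 = 1.4\times \text{ per generation.}
\text{With a fixed power budget, the usable fraction shrinks by} \approx 1/1.4 \text{ per generation,}
\text{so after a few generations well over half the transistors must stay dark.}
```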
Economic Feasibility
Moore’s Law Provides More Transistors
But switching speed is no longer getting faster
Nanometer-scale process development increases NRE tremendously
Mass Production Is Essential
Otherwise, the chip business is not sustainable
This is the advantage of general-purpose processors
What About Many-Core Processors?
GPU, Tilera, MIC, …
Pros and Cons of MPU
Most Advanced Process, Mass Production
Stable, reliable, low cost
Mature ecosystem and solutions
Not Optimal for Many Applications
Design aim: not too bad for most applications
Over-allocation of resources
Wasted resources, higher energy consumption
MPUs Are Not Good for Cloud
High L1-I Cache Miss Rate
Processor idles (instruction starvation)
Low ILP and MLP
Wide issue is not effective
Low Efficiency of Memory Access
A large L3 takes half the chip area but does not help performance
High On-chip Bandwidth Goes Unused
Little data sharing among cores
Low Utilization of Resources
Only about 1/3 of the resources are frequently used
[Figure: typical MPU die layout — four cores, each with out-of-order (OOO) logic, an FPU, and a private L2 cache, plus an on-chip GPU and a large shared L3 cache.]
Pros and Cons of ASIP
Optimally Designed for Specific Applications
High efficiency, low resource use, low power
But There Is No Free Lunch
A lot of design/verification work
Stability/reliability?
May lengthen time to market
How to amortize the huge NRE?
A small market means high cost
MPU + Accelerator
GPU
Pro: mass production
Con: PCIe overhead (rough estimate below), small memory size
MIC (Xeon Phi)
Is mass production possible?
FPGA
Resource utilization?
Ease of programming?
MPU interface, e.g., QPI or PCIe
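A rough estimate of the PCIe overhead mentioned above, with illustrative numbers (≈8 GB/s effective PCIe bandwidth, ≈1 TFLOP/s accelerator): offloading a kernel that performs F flops on B transferred bytes only pays off when compute time exceeds transfer time.

```latex
t_{\mathrm{xfer}} = \frac{B}{8\,\mathrm{GB/s}}, \qquad t_{\mathrm{comp}} = \frac{F}{1\,\mathrm{TFLOP/s}}
t_{\mathrm{comp}} \gtrsim t_{\mathrm{xfer}} \;\Longleftrightarrow\; \frac{F}{B} \gtrsim \frac{10^{12}}{8\times 10^{9}} \approx 125 \ \text{flops/byte}
\text{DGEMM: } F = 2n^{3},\ B = 3\cdot 8n^{2} \ \Rightarrow\ F/B = n/12 \ \Rightarrow\ n \gtrsim 1500
\text{DAXPY: } F = 2n,\ B = 24n \ \Rightarrow\ F/B = 1/12 \ \ll 125 \ \text{(never worth offloading on its own).}
```

The small device memory then further limits how much work can stay resident on the accelerator between transfers.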
Design of New Processors
Crossing the Gap between General-Purpose and Special-Purpose
Many Simple Cores
Reduce power consumption
Multiple Hardware Threads in Each Core
Massive numbers of threads on chip
Exploit concurrency, tolerate latency
Dynamic Scheduling of On-chip Threads
Improves performance for general applications
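A minimal sketch (mine, not from the talk) of why many hardware threads per core tolerate latency: assume each thread alternates a short compute burst with a long memory stall, and the core round-robins to a ready thread on every stall at zero cost.

```c
/* Model of latency tolerance via hardware multithreading (illustrative
 * assumption): each thread alternates C compute cycles with an L-cycle
 * memory stall; the core switches to a ready thread on every stall. */
#include <stdio.h>

/* Fraction of cycles in which the core's pipeline does useful work:
 * min(1, T*C / (C+L)) in steady state. */
static double utilization(int threads, int compute_cycles, int stall_cycles)
{
    double busy = (double)threads * compute_cycles;
    double window = compute_cycles + stall_cycles;   /* one thread's period */
    return busy >= window ? 1.0 : busy / window;
}

int main(void)
{
    const int C = 10, L = 200;   /* e.g., 10 cycles compute, 200 cycles DRAM stall */
    for (int t = 1; t <= 32; t *= 2)
        printf("%2d hardware threads -> %.0f%% pipeline utilization\n",
               t, 100.0 * utilization(t, C, L));
    return 0;
}
```

With 10 compute cycles per 200-cycle stall, about 21 threads (T*C >= C+L) are enough to keep the pipeline fully busy.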
Combining Multithreading
and Vector Pipelining
[Figure: pipelined vector processing engine — multiple program counters (PCs, one per hardware thread) feed a shared instruction cache (I$), instruction decode (ID), and register file (RF); ALU/FPU/LSU function units and vector registers are backed by a data cache / scratchpad memory (D$/SPM). Annotations: "Switch to single thread", "Deep scalar pipeline", "Switch to vector pipeline".]
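The same combination expressed in software terms, as a minimal sketch (not code from the talk): OpenMP threads stand in for the hardware thread contexts, and the simd clause lets each thread's chunk run down a vector pipeline.

```c
/* Sketch: thread-level parallelism (OpenMP threads) combined with data-level
 * parallelism (SIMD/vector lanes) on a simple AXPY-style kernel.
 * Compile e.g. with: cc -O2 -fopenmp axpy.c */
#include <stdio.h>
#include <stdlib.h>

static void saxpy(size_t n, float a, const float *x, float *y)
{
    /* Outer: distribute iterations across threads.
       Inner: each thread's chunk is vectorized. */
    #pragma omp parallel for simd schedule(static)
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    size_t n = 1 << 20;
    float *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;
    for (size_t i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 3.0f, x, y);
    printf("y[0] = %.1f\n", y[0]);    /* expect 5.0 */
    free(x); free(y);
    return 0;
}
```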
Thread Parallelism and Data Parallelism in Two Dimensions
Deep thread parallelism and data parallelism
[Figure: the processing engine replicated along two dimensions — each instance has its own PCs (hardware threads), instruction cache (I$), decode (ID), scalar register files (RF), a vector register file, ALU/FPU/LSU units, and data cache / scratchpad (D$/SPM); one dimension is labeled "Wide thread parallelism", the other "Wide data parallelism".]
In Conclusion
A Universal Architecture
Scalable and reconfigurable processor array
Supports thread- and data-level parallelism
Fulfills All Requirements from Terminal to Cloud Data Center
High-performance computers
Cloud computing servers
Core-network equipment
Terminals for the cloud and the mobile Internet
Thanks!