High Performance, Multi-CPU
Power Signoff for Mega Designs
Patrick Sproule
Director of Engineering, VLSI Methodology
Nvidia Power Analysis Requirements
Static and Dynamic Full Chip Power Analysis
Tool implementation must handle both sub-chip and full-die analysis in a single session.
Ideally, provide full-domain analysis for full accuracy in a single run.
Design Size Scalability
Full flat design analysis must handle both small and the largest production designs on existing/available compute resources.
Runtime Predictability
Designs get larger, but the schedule time for power analysis is required to stay constant or shrink. Requires close-ended runtime estimates.
Clear Reporting
Large amounts of analysis data must be condensed into clear reports.
Power Analysis Challenges
Designs have seen device count grow by 4 orders of magnitude in less than 10 years.
Increased metal-layer counts and modelled device counts cause the calculation to expand faster than tools and compute resources.
Large runtimes and/or inefficient subdivision of designs are required.
Designs have also become highly replicated at a multitude of hierarchy levels.
Complexity of data handling and integration within the tools.
Many engineers run analysis at different hierarchy levels.
Recreation of databases and duplication of analysis cost schedule time (a reuse sketch follows below).
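The replication point suggests an obvious mitigation: analyze each unique partition master once and reuse the result for every replicated instance. Below is a minimal illustrative sketch in Python; the partition map and the `analyze_partition` stand-in are invented for illustration, not part of any actual tool flow.

```python
from functools import lru_cache

# Several partition instances in a chiplet, but only a few unique
# masters (instance and master names invented for illustration).
PARTITION_INSTANCES = {
    "core_0": "core", "core_1": "core", "core_2": "core", "core_3": "core",
    "mem_0": "mem", "mem_1": "mem",
    "io_top": "io",
}

@lru_cache(maxsize=None)
def analyze_partition(master: str) -> float:
    """Stand-in for an expensive per-partition rail analysis run."""
    print(f"analyzing unique master: {master}")  # executes once per master
    return 42.0                                  # placeholder worst IR drop (mV)

# Each replicated instance reuses its master's cached result, so 7
# instances here cost only 3 analysis runs.
results = {inst: analyze_partition(master)
           for inst, master in PARTITION_INSTANCES.items()}
```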
Current Rail Analysis Methodology
• Partition-based hierarchical methodology is planned and executed within a large design team at many levels
[Diagram: full-chip integration hierarchy, from partition to chiplet to full chip, with partition owners and chiplet owners at each level]
• Unique design technologies, especially in low power
– Multi-power domains, power gating switches, …
Typical Extraction and Rail Analysis
• Rail Analysis
– Power-Grid-View (PGV): physical modeling of IP
– Current Signatures
– Extraction
– Rail: RC, current, geometry
[Flow diagram: the physical database, current signatures, and primitive PGVs feed RC extraction into the PGDB; rail analysis then produces IR-drop results/plots]
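At its core, static rail analysis solves the power grid as a resistive network with the current signatures as loads: nodal analysis, G·v = i. The toy three-node rail below, with invented conductances and cell currents, sketches that solve; it illustrates the principle only, not the tool's implementation.

```python
import numpy as np

# Nodal conductance matrix (S) for a toy 3-node rail:
# segment node0-node1 has g = 1.0, segment node1-node2 has g = 1.5.
G = np.array([
    [ 1.0, -1.0,  0.0],
    [-1.0,  2.5, -1.5],
    [ 0.0, -1.5,  1.5],
])
g_pad, vdd = 2.0, 1.0
G[0, 0] += g_pad                       # node 0 ties to the VDD pad

i_cell = np.array([0.0, 0.03, 0.05])   # per-node cell current signatures (A)
rhs = np.array([g_pad * vdd, 0.0, 0.0]) - i_cell
v = np.linalg.solve(G, rhs)            # node voltages (V)
print("IR drop per node (V):", vdd - v)
```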
Hierarchical Rail Analysis Method (H-PGV)
[Flow diagram: each partition (Partition 1 … Partition N) goes through RC extraction to produce its H-PGV (H-PGV 1 … H-PGV N); the H-PGVs, primitive PGVs, top-level database, and current signatures then feed RC extraction into the PGDB, and rail analysis produces IR-drop results/plots]
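The source does not detail how an H-PGV models a partition's grid internally, but a standard way to abstract a grid down to its boundary nodes is Schur-complement (Kron) reduction: eliminate internal nodes and keep an equivalent conductance model at the ports. A small illustrative sketch, with an invented 4-node partition grid:

```python
import numpy as np

def kron_reduce(G, boundary):
    """Reduce a partition conductance matrix to its boundary nodes:
    G_red = G_bb - G_bi @ inv(G_ii) @ G_ib (Schur complement)."""
    n = G.shape[0]
    b = np.asarray(boundary)
    i = np.setdiff1d(np.arange(n), b)      # internal nodes to eliminate
    Gbb = G[np.ix_(b, b)]
    Gbi = G[np.ix_(b, i)]
    Gii = G[np.ix_(i, i)]
    # G is symmetric, so G_ib = Gbi.T
    return Gbb - Gbi @ np.linalg.solve(Gii, Gbi.T)

# Invented 4-node partition grid; nodes 0 and 3 touch the boundary.
G = np.array([
    [ 2.0, -1.0,  0.0, -1.0],
    [-1.0,  3.0, -2.0,  0.0],
    [ 0.0, -2.0,  3.0, -1.0],
    [-1.0,  0.0, -1.0,  2.0],
])
print(kron_reduce(G, boundary=[0, 3]))     # 2x2 boundary-level model
```

A full boundary model would also push the internal current signatures out to the ports, but the grid reduction above is the essential idea behind abstracting a partition for top-level analysis.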
H-PGV Advantages
H-PGV generation runtime is minimal compared to full-chip database setup for IR-drop analysis.
H-PGVs can be generated in parallel (see the sketch after this list).
Hierarchical methodology supports both bottom-up and top-down rail analysis.
Captures H-PGV boundary conditions for ECOs at the partition level (top-down push).
Full- and sub-chip-level analysis time is greatly improved with the same accuracy.
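A minimal sketch of the parallel-generation point, assuming the per-partition generation step can run as independent processes; the partition names and the `generate_hpgv` stand-in are hypothetical, not the tool's API.

```python
from concurrent.futures import ProcessPoolExecutor

UNIQUE_PARTITIONS = ["sm", "l2cache", "nvdec", "pcie"]  # hypothetical names

def generate_hpgv(partition: str) -> str:
    """Stand-in for per-partition RC extraction + H-PGV generation."""
    return f"{partition}.hpgv"                          # placeholder artifact

if __name__ == "__main__":
    # Generate one H-PGV per unique partition in parallel; replicated
    # instances then reuse their master's model.
    with ProcessPoolExecutor() as pool:
        hpgvs = dict(zip(UNIQUE_PARTITIONS,
                         pool.map(generate_hpgv, UNIQUE_PARTITIONS)))
    print(hpgvs)
```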
Flat vs. Hierarchical Correlation
Example Analysis: Sub-chip level
14.4M total primitive instance count (modelled cells)
– 8.9M regular logic and memory cells
– 5.5M filler, tap, and decap cells
18 total partitions in chiplet
– 7 unique partitions
– 3 partitions replicated 4 times each
H-PGV run metrics:
– Runtime: 18~32 minutes
– Memory: 40~45 GB
[Histogram: % of instances (y-axis, 0 to 50) vs. % difference between flat and hierarchical IR-drop analysis (x-axis, 0 to 2.5)]
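The correlation metric behind the histogram is straightforward: per instance, the percentage difference between the flat and hierarchical IR-drop results, bucketed into bins. A sketch with synthetic data (the arrays below are invented, not NVIDIA's results):

```python
import numpy as np

# Synthetic per-instance IR-drop results (V) for flat and hierarchical
# runs; the hierarchical values deviate by ~1% to mimic tight correlation.
flat = np.random.default_rng(0).uniform(0.02, 0.10, 1000)
hier = flat * (1 + np.random.default_rng(1).normal(0, 0.01, 1000))

pct_diff = 100 * np.abs(hier - flat) / flat
counts, edges = np.histogram(pct_diff, bins=np.arange(0, 3.0, 0.5))
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"{lo:.1f}-{hi:.1f}%: {100 * c / pct_diff.size:.1f}% of instances")
```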
Rail Analysis at Full Chip Level
Design               | Metal Layers | # of Transistors (Billions) | RAM (GB)     | CPUs | Rail Analysis Runtime
GF100 (flat)         | 9            | 3.0                         | 200          | 1    | 2.25 days
GK104 (flat)         | 11           | 3.5                         | 600          | 8    | 10 days
GK110 (flat)         | 11           | 7                           | 1000+ (est.) | 8    | 26 days (est.)
GK110 (hierarchical) | 11           | 7                           | 650          | 8    | 8 days
Nvidia Scale and Runtime Issues
Design size growth is outpacing tool and resource capability.
[Chart: design instances (1.00E+06 to 1.00E+09, log scale) and rail-analysis runtime in days (0 to 30) across the design progression Tesla, Fermi, Kepler; runtime series for VS, EPS, and EPS-H alongside design sizes]
Voltus on Kepler
~380M-instance flat analysis (TSMC 28nm)
Main resource: ~725 GB memory on a 1 TB, 32-CPU machine.
Static and dynamic signoff power analysis at VDD & VSS (done as parallel runs; see the sketch below).
21-hour runtime per analysis domain.
~8x runtime improvement over the previous method with equivalent accuracy.
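A minimal sketch of the parallel-domain arrangement, assuming each domain's analysis can be launched from a wrapper script; `run_domain.sh` is an invented name, and the actual tool invocation is not shown in the source.

```python
import subprocess

# Launch the VDD and VSS signoff runs side by side as separate
# processes, so the two ~21 h analyses overlap instead of serializing.
procs = [subprocess.Popen(["./run_domain.sh", domain])  # hypothetical wrapper
         for domain in ("VDD", "VSS")]
for p in procs:
    p.wait()
```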
Rail Analysis at Full Chip Level
Design               | Metal Layers | # of Transistors (Billions) | RAM (GB)     | CPUs | Rail Analysis Runtime
GF100 (flat)         | 9            | 3.0                         | 200          | 1    | 2.25 days
GK104 (flat)         | 11           | 3.5                         | 600          | 8    | 10 days
GK110 (flat)         | 11           | 7                           | 1000+ (est.) | 8    | 26 days (est.)
GK110 (hierarchical) | 11           | 7                           | 650          | 8    | 8 days
GK110 (VOLTUS)       | 11           | 7                           | 700          | 32   | 21 hours
Nvidia Scale and Runtime Issues
[Chart: design instances (1.00E+06 to 1.00E+09, log scale) and rail-analysis runtime in days (0 to 30) across the design progression Tesla, Fermi, Kepler, Kepler-V; runtime series for VS, EPS, EPS-H, and VOLTUS alongside design sizes]
Summary
Voltus meets our rail-analysis needs for accuracy and runtime, with far lower runtimes than expected.
Further testing proved it possible to run the combined VDD-GND domain in a single pass with a 50-hour runtime using multi-threaded and distributed capabilities.
The capability to run both multi-threaded and distributed gives us the flexibility to manage schedule and resource requirements.
Congratulations to the Voltus team on delivering a disruptive runtime improvement.
Q&A