VLSI TECHNOLOGY SYMPOSIUM 2011
TECHNOLOGY IMPACTS FROM THE
NEW WAVE OF ARCHITECTURES
FOR MEDIA-RICH WORKLOADS
Samuel Naffziger
AMD Corporate Fellow
June 14th, 2011
1 VLSI Technology Symposium | June 2011 | Public
Outline
Introduction
The new workloads and demands on computation
Characteristics of serial and parallel computation
The Accelerated Processing Unit (APU) architecture
APU architecture implications for technology
Summary
The Big Experience/Small Form Factor Paradox
Mid 1990s: Early Internet and Multimedia Experiences
• Display: 4:3 @ 0.5 megapixel
• Content: Email, film & scanners
• Online: Text and low res photos
• Multimedia: CD-ROM
• Interface: Mouse & keyboard
• Battery Life*: 1-2 Hours
Mid 2000s: Standard-definition Internet
• Display: 4:3 @ 1.2 megapixels
• Content: Digital cameras, SD webcams (1-5 MB files)
• Online: WWW and streaming SD video
• Multimedia: DVDs
• Interface: Mouse & keyboard
• Battery Life*: 3-4 Hours
Now: Parallel/Data-Dense, immersive and interactive performance
• Display: 16:9 @ 7 megapixels
• Content: HD video flipcams, phones, webcams (1GB)
• Online: 3D Internet apps and HD video online, social networking w/HD files
• Multimedia: 3D Blu-ray HD
• Interface: Multi-touch, facial/gesture/voice recognition + mouse & keyboard
• Battery Life*: All day computing (8+ Hours)
[Table row groups: Workloads; Technology. The Form Factors row is shown as images.]
*Resting battery life as measured with industry standard tests.
Focusing on the experiences that matter
[Bar chart: Consumer PC Usage, share of respondents, 0% to 100%]
Activities charted: Email; Web browsing; Office productivity; Listen to music; Online chat; Watching online video; Photo editing; Personal finances; Taking notes; Online web-based games; Social networking; Calendar management; Locally installed games; Educational apps; Video editing; Internet phone
New Experiences: Accelerated Internet and HD Video; Simplified Content Management; Immersive Gaming
Source: IDC's 2009 Consumer PC Buyer Survey
People Prefer Visual Communications
Verbal Perception: Words are processed at only 150 words per minute
Visual Perception: Pictures and video are processed 400 to 2000 times faster
Augmenting Today’s Content:
• Rich visual experiences
• Multiple content sources
• Multi-Display
• Stereo 3D
The Emerging World of New Data Rich Applications
The Ultimate Visual
Experience™
Fast Rich Web content, favorite HD
Movies, games with realistic
graphics
Communicating
• IM, Email, Facebook
• Video Chat, NetMeeting
Using photos
• Viewing & Sharing
• Search, Recognition, Labeling?
• Advanced Editing
Using video
• DVD, BLU-RAY™, HD
• Search, Recognition, Labeling
• Advanced Editing & Mixing
Gaming
• Mainstream Games
• 3D games
Music
• Listening and Sharing
• Editing and Mixing
• Composing and compositing
Example applications shown: ViVu Desktop Telepresence; Nuvixa Be Present; CyberLink Power Director 9; CyberLink Media Espresso 6; ArcSoft TotalMedia® Theatre 5; ArcSoft Media Converter® 7; Corel Digital Studio 2010; Internet Explorer 9; Microsoft® PowerPoint® 2010; Windows Live Essentials; Codemasters F1 2010; Viewdle Uploader; Corel VideoStudio Pro
New Workload Examples: Changing Consumer Behavior
• 24 hours of video uploaded to YouTube every minute
• 50 million + digital media files added to personal content libraries every day
• Approximately 9 billion video files owned are high-definition
• 1000 images are uploaded to Facebook every second
What Are the Implications for Computation?
Insatiable demand for high
bandwidth processing
–Visual image processing
–Natural user interfaces
–Massive data mining for
associative searches,
recognition
Some of these compute needs
can be offloaded to servers,
some must be done on the
mobile device
– Similar compute needs and
massive growth in both
spaces
How must CPU
architecture change to
deal with these trends?
Parallel and Serial Computation
[Diagram: code styles arranged along a CPU-to-GPU axis]
Serial Code: conditional branches; loops, branches and conditional evaluation
    i=0
    i++
    load x(i)
    fmul
    store
    cmp i (16)
    bc
    …
Loop 1M times for 1M pieces of data
    i=0
    i++
    load x(i)
    fmul
    store
    cmp i (1000000)
    bc
    …
Data Parallel Code: 2D array representing very large dataset
    i,j=0
    i++
    j++
    load x(i,j)
    fmul
    store
    cmp j (100000)
    bc
    cmp i (100000)
    bc
    …
[Chart: 35 Years of Microprocessor Trend Data. Series: Transistors (thousands), Single-thread Performance (SpecINT), Frequency (MHz), Typical Power (Watts), Number of Cores. Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond and C. Batten.]
[Chart: GFLOPs Trend. Peak GFlops (SPFP), 0 to 8000, by year, 2005 to 2014, including AMD projections.]
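The loop shapes sketched on the Parallel and Serial Computation slide can be illustrated with a minimal Python sketch. The function names and the multiply-by-constant body are assumptions made for illustration, not code from the talk. The serial version mirrors the load / fmul / store / compare-and-branch sequence; the data-parallel version separates the per-element kernel from the iteration, which is what would let a GPU assign one work-item per element.

```python
# Hypothetical illustration of the loop shapes on the slide; names are not from the talk.

def scale_serial(x, k):
    """One element per iteration: load x(i), fmul, store, cmp i (N), bc."""
    out = [0.0] * len(x)
    i = 0
    while i < len(x):        # cmp i (N) / bc
        out[i] = x[i] * k    # load x(i), fmul, store
        i += 1               # i++
    return out


def scale_kernel(value, k):
    """Per-element body with no loop-carried state, so elements are independent."""
    return value * k


def scale_data_parallel(grid, k):
    """Apply the kernel at every (i, j) of a 2D dataset.

    Python evaluates this sequentially; the point is only that no element
    depends on another, so the order of evaluation is free.
    """
    return [[scale_kernel(v, k) for v in row] for row in grid]


if __name__ == "__main__":
    print(scale_serial([1.0, 2.0, 3.0], 2.0))
    print(scale_data_parallel([[1.0, 2.0], [3.0, 4.0]], 0.5))
```

In the serial form the loop counter and branch are part of every iteration; in the data-parallel form the iteration space is just a description of the data, which is the property the slide associates with the GPU end of the axis.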
GPU/CPU Design Differences
CPU (Serial compute)
Lots of instructions, little data
• Out of order exec, branch prediction
• Few hardware threads
Weak performance gains through density
Maximize speed with fast devices
GPU (parallel compute)
Few instructions, lots of data
• Single Instruction Multiple Data
• Extensive fine-threading capability
Nearly linear performance gains with density
Maximize density with cool devices
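One common way to reason about the contrast between weak gains from density on serial code and nearly linear gains on parallel code is Amdahl's law. The talk does not cite it; the sketch below is an illustration under that assumption, and the parallel fractions used in the demo are hypothetical values, not measurements from the slides.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Speedup over one unit when only parallel_fraction of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_units < 1:
        raise ValueError("n_units must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


if __name__ == "__main__":
    # Illustrative fractions only: mostly serial, mixed, mostly data-parallel.
    for fraction in (0.10, 0.50, 0.99):
        gains = [round(amdahl_speedup(fraction, n), 2) for n in (1, 4, 16, 64)]
        print(f"parallel fraction {fraction:.2f}: {gains}")
```

A workload with a small parallel fraction flattens out almost immediately as units are added, while one that is almost entirely data-parallel keeps scaling, which is consistent with the slide pairing fast devices with the CPU and dense devices with the GPU.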
Three Eras of Processor Performance
Single-Core Era
[Chart: Single-thread Performance vs. Time, marked "we are here"]
Enabled by:
• Moore’s Law
• Voltage & Process Scaling
• Micro Architecture
Constrained by:
• Power
• Complexity
Multi-Core Era
[Chart: Throughput Performance vs. Time (# of Processors), marked "we are here"]
Enabled by:
• Moore’s Law
• Desire for Throughput
• 20 years of SMP arch
Constrained by:
• Power
• Parallel SW availability
• Scalability
Heterogeneous Systems Era
[Chart: Targeted Application Performance vs. Time (Data-parallel exploitation), marked "we are here" with the future trajectory shown as "?"]
Enabled by:
• Moore’s Law
• Abundant data parallelism
• Power efficient GPUs
Temporarily constrained by:
• Programming models
• Communication overheads
• Workloads
Heterogeneous Computing with an APU Architecture
2010 IGP-based (“Danube”) Platform
[Diagram: CPU Chip (CPU Cores, MC) with DDR3 DIMM Memory; separate GPU with UVD; links labeled ~17 GB/sec, ~17 GB/sec and ~7 GB/sec]
Bandwidth pinch points and latency hold back the GPU capabilities
2011 APU-based (“Llano”) Platform
[Diagram: APU Chip (CPU Cores, UNB / MC, GPU, UVD) with DDR3 DIMM Memory; FCH Chip (SB Functions); optional PCIe® GPU; links labeled ~27 GB/sec, ~27 GB/sec]
Graphics requires memory BW to bring full capabilities to life
Integration Provides Improvement
• Eliminate power and latency of extra chip crossing
• 3X bandwidth between GPU and Memory!
• Same sized GPU is substantially more effective
• Power efficient, advanced technology for both CPU and GPU
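To give a feel for what the bandwidth figures on the APU architecture slide mean in practice, here is a small worked calculation. It assumes the ~7 GB/sec label is the GPU's path to memory on the 2010 platform and ~27 GB/sec is the corresponding path on the 2011 APU; the extracted diagram does not make that mapping certain. The frame size is a hypothetical payload chosen for illustration.

```python
def transfer_time_ms(num_bytes, bandwidth_gb_per_s):
    """Milliseconds to move num_bytes at a sustained rate in GB/s (1 GB = 1e9 bytes)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Approximate figures read off the slide; which link each one labels is an assumption.
DANUBE_GPU_PATH_GB_S = 7.0    # assumed: GPU-to-memory path, 2010 IGP-based platform
LLANO_GPU_PATH_GB_S = 27.0    # assumed: GPU-to-memory path, 2011 APU-based platform

# Hypothetical payload: one 1920x1080 frame at 4 bytes per pixel.
FRAME_BYTES = 1920 * 1080 * 4

if __name__ == "__main__":
    for name, bandwidth in (("Danube", DANUBE_GPU_PATH_GB_S),
                            ("Llano", LLANO_GPU_PATH_GB_S)):
        ms = transfer_time_ms(FRAME_BYTES, bandwidth)
        print(f"{name}: {ms:.3f} ms per frame at ~{bandwidth:.0f} GB/s")
```

The calculation ignores latency and protocol overhead, so it only bounds the best case; the slide's point about eliminating the extra chip crossing concerns latency and power as well as raw bandwidth.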
The Challenges of Integration
CPU: Performance
• Thick, fast metal
• Big devices
• Flop count for 4 Llano CPU cores = 0.66M
• CPU flop area = 2.14
GPU: Density
• Dense, thin metal, small devices
• Flop count for Llano GPU = 3.5M
• GPU flop area = 1.0
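The flop counts and relative flop areas on the Challenges of Integration slide can be combined in a short worked calculation. This sketch reads "flop" as flip-flop and treats the area figures as per-flop areas relative to a GPU flop of 1.0; both readings are assumptions about the slide, and the variable names are made up for illustration.

```python
# Figures from the slide; the interpretation of each is an assumption (see lead-in).
CPU_FLOPS_MILLIONS = 0.66   # flop count for 4 Llano CPU cores
GPU_FLOPS_MILLIONS = 3.5    # flop count for the Llano GPU
CPU_FLOP_AREA = 2.14        # relative area of one CPU flop
GPU_FLOP_AREA = 1.0         # relative area of one GPU flop (the reference)


def total_flop_area(count_millions, area_per_flop):
    """Total relative area taken by flops, in millions of GPU-flop equivalents."""
    return count_millions * area_per_flop


def gpu_to_cpu_count_ratio():
    return GPU_FLOPS_MILLIONS / CPU_FLOPS_MILLIONS


def gpu_to_cpu_area_ratio():
    gpu_area = total_flop_area(GPU_FLOPS_MILLIONS, GPU_FLOP_AREA)
    cpu_area = total_flop_area(CPU_FLOPS_MILLIONS, CPU_FLOP_AREA)
    return gpu_area / cpu_area


if __name__ == "__main__":
    print(f"GPU/CPU flop count ratio: {gpu_to_cpu_count_ratio():.2f}")
    print(f"GPU/CPU total flop area ratio: {gpu_to_cpu_area_ratio():.2f}")
```

Under these assumptions the GPU holds a little over five times as many flops as the four CPU cores but needs only about two and a half times the flop area, which illustrates the density-versus-performance trade the slide describes.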
How to Balance the Metal Stack?
Performance (CPU) vs. Density (GPU)
[Chart: Cu resistivity (uohm-cm), 1.5 to 2.5, vs. Line Width (um), 0 to 1. Series: Cu Resistivity without barrier; With barrier.]
With the 20nm node, even local metal will be seeing large RC increase, so compromises become more difficult
Add metal layers?
• Thin, dense layers for the GPU
• Thick, low resistance layers for the CPU
• Cost issues?
• Via resistance?
Technology improvements in BEOL are required
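The shape of the resistivity-versus-line-width curve with a barrier can be approximated with simple geometry: if a barrier liner of fixed thickness occupies the sidewalls and bottom of a trench and carries negligible current, the copper core shrinks faster than the drawn wire. The sketch below is a minimal model under that assumption. All dimensions are hypothetical, and it ignores surface and grain-boundary scattering, which also raise resistivity in narrow lines.

```python
def effective_resistivity(rho_cu, width_um, thickness_um, barrier_um):
    """Resistivity of the drawn wire if only the copper core conducts.

    Assumes a barrier of uniform thickness on both sidewalls and the bottom.
    """
    core_width = width_um - 2.0 * barrier_um
    core_thickness = thickness_um - barrier_um
    if core_width <= 0 or core_thickness <= 0:
        raise ValueError("barrier consumes the whole wire cross-section")
    drawn_area = width_um * thickness_um
    core_area = core_width * core_thickness
    return rho_cu * drawn_area / core_area


if __name__ == "__main__":
    RHO_CU = 1.7        # uohm-cm, approximate bulk copper at room temperature
    BARRIER_UM = 0.005  # hypothetical liner thickness
    for width in (1.0, 0.5, 0.1, 0.05):
        thickness = 2.0 * width  # hypothetical 2:1 aspect ratio
        rho = effective_resistivity(RHO_CU, width, thickness, BARRIER_UM)
        print(f"width {width:.2f} um: {rho:.2f} uohm-cm")
```

Because the liner thickness does not scale with the line, its share of the cross-section grows as width shrinks, which is one reason thin, dense layers suited to the GPU and thick, low-resistance layers suited to the CPU pull the metal stack in different directions.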
Device Optimization
To achieve breakthrough APU performance, the Llano GPU has ~5X the flops and ~5X the device count of the CPUs
[Chart: APU Vt Mix. GPU and CPU bars split by device type: LC-HVT, HVT, RVT, LC-RVT, LVT; axis 0 to 1.2.]
[Chart: FPG vs. Ioff. FPG RO speed vs. Ioff (nA/um room temp), spanning Speed to Leakage; devices LVT, RVT, LC-RVT, HVT, LC-HVT; desired device ranges marked for Llano CPU and Llano GPU. Annotation: Broader span of devices required.]
[Table: Device Ioff. Devices: LVT, RVT, LC-RVT, HVT, LC-HVT. Values as extracted: 175, 50, 4.3, 2.7.]
A broader device suite is required
Power Transfers
[Chart: CPU-centric serial workload, Balanced workload, GPU-centric data parallel workload; axis range 85.0 to 110.0]
Voltage range is critical to enabling the efficient power transfers that make for compelling APU performance
Operating Voltage Range
Operating voltage requirements:
• Low voltage necessary for power efficiency
• High voltage necessary for a snappy user experience enabled by turbo mode
[Chart: E/op vs. V. Energy per operation (0 to 2.5) vs. voltage (0.7V to 1.3V).]
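The general trend behind an energy-per-operation versus voltage curve can be sketched with the standard first-order model in which dynamic switching energy scales with the square of the supply voltage. The chart's actual data points are not recoverable from this transcript, so the sketch below only shows the quadratic trend relative to a 1.0 V reference. The reference voltage and function name are assumptions, and leakage energy is ignored.

```python
def relative_energy_per_op(voltage, reference_voltage=1.0):
    """Dynamic energy per operation relative to the reference, using E ~ C * V^2."""
    if voltage <= 0 or reference_voltage <= 0:
        raise ValueError("voltages must be positive")
    return (voltage / reference_voltage) ** 2


if __name__ == "__main__":
    # Voltage points match the chart's axis labels; the energies are model output.
    for volts in (0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3):
        print(f"{volts:.1f} V: {relative_energy_per_op(volts):.2f}x energy per op")
```

In this model, moving from the top to the bottom of the 0.7 V to 1.3 V range changes dynamic energy per operation by more than 3x, which is why the slide treats low-voltage operation as the efficiency lever and high voltage as the price of turbo responsiveness.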
Operating Voltage Challenges
• To maintain cost effective performance growth with technology node, the GPU must:
– Hold power density constant
– Exploit density gains to add compute units
This necessarily drives operating voltage down
• This would be good for energy efficiency except …
– Variation impacts are much greater at low voltage
[Chart: Power Density Limited GPU, 40nm to 14nm. Series: Frequency, Power Density, Voltage; nodes 40nm, 28nm, 20nm, 14nm; voltage data labels 0.952V, 0.915V, 0.886V, 0.805V.]
[Chart: 40nm Juniper GPU Frequency Data. Frequency (0 to 1000 MHz) vs. Nominal Voltage (0.85V to 1.15V). Annotation: Frequency spread increases at low voltage.]
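Why holding power density constant pushes voltage down can be seen from the first-order dynamic power relation, power per unit area ~ (switched capacitance per unit area) x V^2 x f. The sketch below solves that relation for voltage. The per-node capacitance-density gain used in the demo is a hypothetical number chosen for illustration, not a figure from the talk, and leakage is ignored.

```python
import math


def voltage_for_constant_power_density(v_old, cap_density_gain, freq_gain=1.0):
    """New voltage that keeps C_per_area * V^2 * f unchanged.

    cap_density_gain: factor by which switched capacitance per unit area grows.
    freq_gain: factor by which clock frequency changes.
    """
    if v_old <= 0 or cap_density_gain <= 0 or freq_gain <= 0:
        raise ValueError("all arguments must be positive")
    return v_old / math.sqrt(cap_density_gain * freq_gain)


if __name__ == "__main__":
    voltage = 1.0                  # hypothetical starting point
    HYPOTHETICAL_GAIN = 1.3        # assumed capacitance-density growth per node
    for node in ("40nm", "28nm", "20nm", "14nm"):
        print(f"{node}: {voltage:.3f} V")
        voltage = voltage_for_constant_power_density(voltage, HYPOTHETICAL_GAIN)
```

Any growth in switched capacitance per unit area at fixed frequency has to be paid for with a lower supply voltage under this constraint, which leads directly to the slide's concern that variation effects are much greater at low voltage.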
The Operating Voltage Challenge
Many barriers to maintaining both high and low voltage as technology scales
• TDDB vs. SCE control
• ULK breakdown vs. denser pitches
• Variation control
FD devices should enable maintaining the functional range for a generation or two
Will turbo modes be too compromised?
What’s next?
[Figure: device cross-sections, labels Poly, Fin, BOX]
3D Integration to the Rescue?
• Stacking offers many attractive benefits
– Higher bandwidth to local memory
– Enables parallel and serial compute die to be in their own separate optimized technology – interconnect speed vs. density, device optimization etc.
– Allows IO and southbridge content to remain in older, more analog-friendly technology
[Diagram: die stack. Labels: Heat Sink; TIM (Thermal Interface Material); DRAM; DRAM; Microbumps; Metal Layers; Analog Die (SB, Power); Through Silicon Vias (TSVs); GPU Die; CPU Die; Package Substrate; South Bridge.]
3D Integration Challenges
• Economical 3D stacking in high volume manufacturing presents many challenges
– Benefits must exceed the additional costs of TSVs, and yield fallout
– Logistics of testing and assembling die from multiple sources can be immense
– Countless mechanical and thermal issues to solve in high volume mfg
[Diagram: die stack. Labels: Heat Sink; TIM (Thermal Interface Material); DRAM; DRAM; Die to Die Vias; Through Silicon Vias (TSVs); Metal Layers; Analog Die (SB, Power); GPU Die; CPU Die; Package Substrate; South Bridge.]
Clearly 3D provides compelling solutions to many problems, but the barriers to entry mean heavy R&D $$ and partnerships required
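The yield-fallout concern for stacked die can be illustrated with a simple compounding model: if dies are stacked without being screened first, the stack works only when every die and every bonding step is good. The sketch below assumes independent failures and untested die; known-good-die screening would change the picture. Every yield number in the demo is hypothetical.

```python
from math import prod


def stack_yield(die_yields, bond_yield_per_step):
    """Yield of a stack built from unscreened dies with independent failures.

    One bonding step is assumed for each die added on top of the first.
    """
    if not die_yields:
        raise ValueError("at least one die is required")
    if any(not 0.0 <= y <= 1.0 for y in die_yields):
        raise ValueError("die yields must be between 0 and 1")
    if not 0.0 <= bond_yield_per_step <= 1.0:
        raise ValueError("bond yield must be between 0 and 1")
    bonding_steps = len(die_yields) - 1
    return prod(die_yields) * bond_yield_per_step ** bonding_steps


if __name__ == "__main__":
    # Hypothetical yields for CPU, GPU, analog and DRAM dies; not from the talk.
    dies = [0.90, 0.85, 0.95, 0.92]
    print(f"stack yield: {stack_yield(dies, bond_yield_per_step=0.98):.3f}")
```

Even with each die yielding well on its own, the product falls quickly as layers are added, which is one way to read the slide's point that the benefits must exceed the added costs of TSVs and yield fallout.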
Summary
Insatiable demand for high bandwidth computation
– Visual image processing
– Natural user interfaces
– Massive data mining for associative searches, recognition
Some of these compute needs can be offloaded to servers,
some must be done on the mobile device
– Similar compute needs and massive growth in both spaces
– Combined serial and parallel computation architectures are
key in both spaces
Huge technology challenges to meeting this opportunity
– Interconnect scaling is hitting a wall that must be overcome
– A broad device suite is necessary that operates efficiently at
low voltage while enabling high speed for response time
– 3D integration offers a promising long term solution