Part 1: Introduction

Transcript Part 1: Introduction

UNIVERSITY OF MASSACHUSETTS
Dept. of Electrical & Computer Engineering
Computer Architecture
ECE 668
2016
Introduction
Professor Csaba Andras Moritz
www.ecs.umass.edu/ece/andras
Copyright 2016 Csaba Andras Moritz - ECE668 Introduction .1
Coping with ECE 668
• Students with varied backgrounds
• Prerequisites – Basic Computer Architecture/568, VLSI
  • See details on the website on what you need to know and next slide
• 3 projects to choose from, some flexibility beyond that
  • You need software and/or Verilog/HSPICE skills to complete it
• 2 exams – midterm and final
• Class participation, attend office hours
• Many lectures will be using the whiteboard vs. slides
• Many lectures are outside the textbook or based on papers
• Web: www.ecs.umass.edu/ece/andras/courses/ECE668/
• About the instructor
  • Research and entrepreneurship
  • Founder of BlueRISC (built CPUs and tools), WindowsSCOPE (cyber) and eprivo (privacy, stealth)
What you should know
• Basic architecture (like in ECE568)
  • processor (data path, control/branching schemes, arithmetic), memory, I/O, Tomasulo, basic speculation, hazards (RAW, WAR, WAW)
• Read and write in an assembly language, C, C++, ...
  • MIPS/ARM ISA preferred
• Basic VLSI – HSPICE and/or Verilog
Textbook and references
• Main Textbooks:
  • J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 4th edition (or later), Morgan Kaufmann.
  • M. Dubois et al., Parallel Computer Organization and Design, 1st edition
• Recommended reading:
  • J.P. Shen and M.H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2005.
  • Chandrakasan et al., Design of High-Performance Microprocessor Circuits
  • NASIC research papers and Nanoelectronics textbook chapter; SKYBRIDGE 3D IC, N3ASIC, CMOL, FPNI, SPWF, Bayesian computing papers
  • Other research papers we bring up in class or post online.
Course Outline – 3 Main Focus Areas
Power-Aware Embedded CPU Design
Multiprocessor, many-core design
Emerging Architectures with Nanoscale, 3D and Unconventional
Course Outline
• (1) Introduction; (2) Pipelined Processors - Power Issues Pipelines/Caches, including Compiler-Exposed; (3) Power-Aware Branch Prediction and Control-Flow; (4) Process Variation Mitigation in Pipelines; (5) Process Variation Mitigation in Caches; (6) Hardware and Compiler-Managed Prefetching; (7) Power-Aware Prefetching with Compiler Assist; (8) ARM CPUs (ARM Reference Manual); (9) Out-of-order Processors; (10) Superscalar, Limits of ILP, and VLIW (Compiler Managed); (11) Intel Multi-Cores and Many-Cores; (12) Shared Memory Processors/Cache Coherency (Snooping & Directory-based); (13) Transactional Memory; (14) Synchronization Coherence; (15) Interconnection Networks; (16) 3D IC Technology and Processors; (17) Unconventional Architectures (Probabilistic and Neuromorphic).
• Offered for the first time in this format
Administrative Details
Instructor: Prof. Csaba Andras Moritz
KEB 2nd floor
Email: [email protected]
Office Hours: 11:30-12:30 pm, Tues. & Thur.
• TA – Jiajun Shi (only for projects)
• Course web page: details available at: http://www.ecs.umass.edu/ece/andras/courses/ECE668
Grading
• Midterm I - 35%
• Project – 30%: several projects to choose from
• Class Participation – 5%
• Final Exam - 30%
• Homework – exam questions
What is “Computer Architecture”
Computer Architecture = Instruction Set Architecture + Machine Organization (e.g., Pipelining, Memory Hierarchy, Storage systems, etc.)
Von Neumann CPUs, and also Unconventional Organization
IBM 360 (minicomputer, mainframe, supercomputer)
Intel x86 vs. ARM vs. Neuromorphic vs. Nanoprocessors
Computer Architecture Topics - Processors
[Figure: layered map of topics. Input/Output and Storage (Disks, Tape, RAID; performance, reliability); Memory Hierarchy (DRAM, Interleaving, Bus protocols, L2 Cache, L1 Cache; Bandwidth, Latency); VLSI; Instruction Set Architecture (Addressing); Instruction Level Parallelism (Pipelining, Hazard Resolution, Superscalar, Reordering, Branch Prediction, VLIW, Vector)]
Computer Revolution
Aishare.com; IBM
Scaling Trend & Challenges
Transistor Cost Projection
Lisley forecasting, 2014
ITRS Projections
Performance Trend
J. Warnock, DAC 2011
CMOS Scaling Challenges
Lithographic Challenges with Scaling
L. Liebmann, SPIE 2008
• Lithography Challenge
  • Design rule explosion
  • Increasing cost
• Reliability Challenge
  • Reducing lifespan of lower level metals
• Device Scaling Challenge
Degrading Reliability with Scaling
J. Warnock, DAC 2011
Overall Scaling
Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program
Shrinking geometry
Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program
Pipelined CPUs – like ARM, Multi-Core SoCs
ARM7-9 Pipelines
ARM multi-core with ARMv7
Application Processors
ARM CPUs Currently
Application
RT
Low power
Security
Multi-Cores – Intel’s Core i5
Superscalar CISC core based vs. RISC
Die: 177 mm^2, Transistors: 1.7B
Intel CPUs 6th Generation (latest)
Intel CPUs 6th Generation Mobile (latest)
Multi/many-core = Network on a chip
• Everything you learn as CSE students applied/integrated in a chip
Intel Polaris with 80 cores
Figure courtesy of Intel; Copyright – Baskaran Ganesan, Intel Higher Education Program
Tilera processor with 64 cores
• MIT startup from Raw project (used to be involved in this)
What is next: Nanoprocessors?
• Molecular memory, NASIC processors, 3D?
[Figure: Cross NW devices, courtesy of Prof. Chui’s group at UCLA]
[Figure: NASIC ALU built from 2-4 decoders, rf3~0, and an adder/multiplier, with signals opcode, dest, operanda, operandb, and result. Copyright: NASIC group, UMASS]
From Nanodevices to Nanocomputing
[Figure: Crossed Nanowire Array (n+ gate, p-channel, n+ source & drain); Array-based Circuits with Built-in Fault-tolerance (NASICs); Evaluation/Cascading: Streaming Control with Surrounding Microwires; Nanoprocessor]
NASICs Fabric Based Architectures
WIre Streaming Processor
• General purpose stream processor
• 5-stage pipeline with minimal feedback
• Built-in fault tolerance: up to 10% device level defect rates
• 33X density advantage vs. 16nm scaled CMOS
• Simpler manufacturing
• ~9X improved power-per-performance efficiency (rough estimate)
Cellular Architecture
• Special purpose for image and signal processing
• Massively parallel array of identical interacting simple functional cells
• Fully programmable from external template signals
• 22X denser than in 16nm scaled CMOS
N3ASIC - 3D Nanowire Technology
Skybridge 3D Circuits – Vertically Integrated
• 3D Circuit concept and 1 bit full adder based on initial Skybridge
• Technology designed in my group
• FETs are gate-all-around on vertical nanowires
• Also did 3D CMOS
Example ISAs in Processors (Instruction Set Architectures)
ARM            (32, 64-bit, v8)                                   1985
Digital Alpha  (v1, v3)                                           1992
HP PA-RISC     (v1.1, v2.0)                                       1986
Sun Sparc      (v8, v9)                                           1987
MIPS           (MIPS I, II, III, IV, V)                           1986
Intel          (8086, 80286, 80386, 80486, Pentium, MMX, ...)     1978
RISC vs. CISC
RISC ISA Encoding Example
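The figure for this slide is not reproduced in the transcript. As a stand-in, here is a minimal sketch of one fixed-width RISC encoding, assuming the MIPS32 R-type format (opcode, rs, rt, rd, shamt, funct). The choice of MIPS and the helper name `encode_r_type` are assumptions made for illustration, not taken from the slide.

```python
def encode_r_type(rs, rt, rd, shamt, funct, opcode=0):
    """Pack a MIPS32 R-type instruction into one 32-bit word.

    Field widths, most significant first:
    opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6).
    """
    fields = [(opcode, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word = (word << width) | value
    return word


# add $t0, $t1, $t2: rd = $t0 (reg 8), rs = $t1 (reg 9), rt = $t2 (reg 10),
# funct = 0x20 for ADD.
word = encode_r_type(rs=9, rt=10, rd=8, shamt=0, funct=0x20)
print(f"{word:#010x}")  # prints 0x012a4020
```

Every instruction occupies exactly 32 bits and the register fields sit at fixed positions, which is what keeps RISC decode logic simple compared with variable-length CISC encodings.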
Also, Virtualized ISAs
• ISA is randomized internally and/or chosen at (software) compile time
  • E.g., BlueRISC TrustGUARD
• Fluid - more than one ISA possible
  • Instruction set can be morphed
  • Can be made device unique – digital DNA
• Some implementations possible in ASICs but most suitable for FPGAs
Power consumption
• Two parts
  • Dynamic
    • α * Vdd^2 * f * Cl
    • Cl – load capacitance, roughly proportional to the size of the circuit
    • α – activity factor – how often the FETs are switching
  • Leakage
    • Mainly from subthreshold (the FETs leak current through the channel) but several other currents
      » Not switched off strongly enough
    • Significant for small feature sizes (lower Ion/Ioff)
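The dynamic term above can be evaluated directly. The sketch below computes α * Vdd^2 * f * Cl and shows the quadratic effect of lowering the supply voltage. The function name and all numeric values are illustrative assumptions, not figures from the course.

```python
def dynamic_power(alpha, vdd, freq_hz, c_load_farads):
    """Dynamic switching power in watts: alpha * Vdd^2 * f * Cl."""
    return alpha * vdd ** 2 * freq_hz * c_load_farads


# Hypothetical operating point: 10% activity, 1.0 V, 2 GHz, 1 nF switched capacitance.
baseline = dynamic_power(alpha=0.1, vdd=1.0, freq_hz=2e9, c_load_farads=1e-9)

# Same design with the supply lowered to 0.8 V and everything else unchanged.
scaled = dynamic_power(alpha=0.1, vdd=0.8, freq_hz=2e9, c_load_farads=1e-9)

print(f"baseline: {baseline:.3f} W")
print(f"at 0.8 V: {scaled:.3f} W ({scaled / baseline:.0%} of baseline)")
```

Because Vdd enters squared, a 20% supply reduction cuts dynamic power to roughly 64% of the baseline, while reducing α or f only helps linearly.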
Define and quantify leakage power
Power_idle = Current_idle × Voltage
• Leakage current increases in processors with smaller transistor sizes
• Increasing the number of transistors increases power even if they are turned off
• Leakage started to become dominant below 90 nm
• Very low power systems even gate the supply voltage to inactive modules to control loss due to leakage
  • Vdd and ground gating, stacked designs
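A minimal sketch of the idle-power relation and of supply gating follows. It assumes an idealized gate (a gated module leaks nothing), and the module currents are made-up numbers. The names `idle_power` and `gated_idle_power` are hypothetical.

```python
def idle_power(idle_current_amps, vdd):
    """Leakage (idle) power in watts: Current_idle * Voltage."""
    return idle_current_amps * vdd


def gated_idle_power(module_currents_amps, vdd, powered_flags):
    """Idle power when the supply to some modules is gated off.

    Assumes ideal gating: a module whose flag is False draws no leakage current.
    """
    return sum(
        idle_power(current, vdd)
        for current, powered in zip(module_currents_amps, powered_flags)
        if powered
    )


# Hypothetical chip with three modules leaking 0.5 A, 0.25 A and 0.25 A at 1.0 V.
currents = [0.5, 0.25, 0.25]
print(gated_idle_power(currents, 1.0, [True, True, True]))    # all powered
print(gated_idle_power(currents, 1.0, [True, False, False]))  # two modules gated
```

In a real design the sleep transistors themselves leak and add wake-up latency, so the saving is smaller than this ideal model suggests.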
Leakage Power
Where it occurs
Leakage and Temperature
Power Density/Thermal Trends
Power-Aware Architectures
• Objective is to minimize activity (dynamic power) and avoid idle power (leakage) by playing tricks
• Role of compilers - control
  • Combined with circuit level optimizations – make it more efficient
  • Interactions between circuits, architecture, compilers
  • Compiler-exposed architectures when the ISA supports it
• In addition to CAD tools – that can do clock gating and some circuit level optimizations
• Can be extended to thermal awareness with on-chip sensors and management
Process/parameter Variation
• Random and systematic variation on the die increases with smaller technology nodes
  • Random variation, e.g., doping; systematic variation, e.g., due to chemical mechanical polishing affecting regions the same way
  • Delay and power implications
• Architectures would need to be made aware
  • Can help mitigate variability
  • Rather than considering worst case scenarios they would adapt on-the-fly to changing performance due to variation
    » Note manufacturing variation is exacerbated by environmental aspects like temperature fluctuation due to workload fluctuation
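Performance limits: Amdahl’s Law
Law of Diminishing Returns
Speedup_overall = 1 / ((1 - f_enh) + f_enh / Speedup_enh), where f_enh is the fraction of execution time that benefits from the enhancement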
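The diminishing returns are easy to see numerically. The sketch below implements the standard form of Amdahl's Law, with f_enh the fraction of execution time that is enhanced. The function names and the 90% example are assumptions chosen for illustration.

```python
def amdahl_speedup(f_enh, speedup_enh):
    """Overall speedup = 1 / ((1 - f_enh) + f_enh / speedup_enh)."""
    if not 0.0 <= f_enh <= 1.0:
        raise ValueError("f_enh must be a fraction between 0 and 1")
    if speedup_enh <= 0:
        raise ValueError("speedup_enh must be positive")
    return 1.0 / ((1.0 - f_enh) + f_enh / speedup_enh)


def amdahl_limit(f_enh):
    """Upper bound on overall speedup as speedup_enh grows without bound."""
    return float("inf") if f_enh == 1.0 else 1.0 / (1.0 - f_enh)


# Hypothetical case: 90% of the run time can be accelerated.
for s in (2, 10, 100, 1000):
    print(f"enhancement x{s:<5} -> overall x{amdahl_speedup(0.9, s):.2f}")
print(f"limit -> overall x{amdahl_limit(0.9):.2f}")
```

However large the enhancement, the untouched 1 - f_enh portion caps the overall speedup at 1 / (1 - f_enh).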
Latency Lags Bandwidth (last ~20 years)
• Performance Milestones (latency improvement, bandwidth improvement)
• Processor: ‘286, ‘386, ‘486, Pentium, Pentium Pro, Pentium 4 (21x, 2250x)
• Ethernet: 10Mb, 100Mb, 1000Mb, 10000 Mb/s (16x, 1000x)
• Memory Module: 16bit plain DRAM, Page Mode DRAM, 32b, 64b, SDRAM, DDR SDRAM (4x, 120x)
• Disk: 3600, 5400, 7200, 10000, 15000 RPM (8x, 143x)
[Figure: log-log plot of Relative BW Improvement (1 to 10000) vs. Relative Latency Improvement (1 to 100) for Processor, Network, Memory, and Disk, with the reference line Latency improvement = Bandwidth improvement. CPU high, Memory low (“Memory Wall”)]
Rule of Thumb for Latency Lagging BW
In the time that bandwidth doubles, latency improves by no more than a factor of 1.2 to 1.4
(and capacity improves faster than bandwidth)
• Stated alternatively: Bandwidth improves by more than the square of the improvement in Latency
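The rule of thumb can be checked against the milestone figures quoted on the previous slide. The sketch below assumes each pair there is (latency improvement, bandwidth improvement). The helper names are hypothetical.

```python
import math

# Pairs taken from the milestones slide; the (latency, bandwidth) ordering
# of each pair is an assumption.
MILESTONES = {
    "Processor": (21, 2250),
    "Ethernet": (16, 1000),
    "Memory module": (4, 120),
    "Disk": (8, 143),
}


def latency_gain_per_bandwidth_doubling(latency_gain, bandwidth_gain):
    """Average latency improvement factor over one doubling of bandwidth."""
    doublings = math.log2(bandwidth_gain)
    return latency_gain ** (1.0 / doublings)


def bandwidth_exceeds_latency_squared(latency_gain, bandwidth_gain):
    """True if bandwidth improved by more than the square of the latency gain."""
    return bandwidth_gain > latency_gain ** 2


for name, (lat, bw) in MILESTONES.items():
    per_doubling = latency_gain_per_bandwidth_doubling(lat, bw)
    squared_ok = bandwidth_exceeds_latency_squared(lat, bw)
    print(f"{name:14s} latency x{per_doubling:.2f} per BW doubling, "
          f"BW > latency^2: {squared_ok}")
```

Under that reading, all four technologies land between 1.2 and 1.4 per bandwidth doubling, and the two statements of the rule agree because 1.4 squared is still below 2.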
6 Reasons Latency Lags Bandwidth
1. Moore’s Law helps BW more than latency
• Faster transistors, more transistors, more pins help Bandwidth
• Smaller, faster transistors but communicate over (relatively) longer lines: limits latency
6 Reasons Latency Lags Bandwidth (cont’d)
2. Distance limits latency
• Size of DRAM block => long bit and word lines => most of DRAM access time
• Speed of light and computers on network
3. Bandwidth easier to sell (“bigger=better”)
• 4400 MB/s DIMM (“PC4400”) vs. 50 ns latency
• Even if just marketing, customers now trained
• Since bandwidth sells, more resources thrown at bandwidth, which further tips the balance
6 Reasons Latency Lags Bandwidth (cont’d)
4. Latency helps BW, but not vice versa
• Spinning disk faster improves both bandwidth and rotational latency
• Lower DRAM latency => More access/second (higher bandwidth)
6 Reasons Latency Lags Bandwidth (cont’d)
5. Bandwidth hurts latency
• Queues help Bandwidth, hurt Latency (Queuing Theory)
• Adding chips to widen a memory module increases Bandwidth but higher fan-out on address lines may increase Latency
6. Operating System overhead hurts Latency more than Bandwidth
• Long messages amortize overhead; overhead bigger part of short messages
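Reason 6 can be illustrated with a simple cost model in which every message pays a fixed software overhead plus a size-dependent transfer time. The model, the function name, and the numbers (10 microseconds of overhead, 1 GB/s peak link) are assumptions for illustration only.

```python
def effective_bandwidth(message_bytes, overhead_s, peak_bytes_per_s):
    """Achieved bytes/second when each message pays a fixed overhead."""
    transfer_time = overhead_s + message_bytes / peak_bytes_per_s
    return message_bytes / transfer_time


OVERHEAD_S = 10e-6  # hypothetical per-message OS overhead
PEAK_BPS = 1e9      # hypothetical peak link bandwidth in bytes/second

for size in (100, 10_000, 1_000_000):
    bw = effective_bandwidth(size, OVERHEAD_S, PEAK_BPS)
    print(f"{size:>9} B message: {bw / PEAK_BPS:6.1%} of peak bandwidth")
```

With these numbers a 100-byte message is dominated by overhead and sees about 1% of peak bandwidth, while a 1 MB message amortizes the same overhead and reaches about 99%.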
Summary of Architecture Trends
• CMOS Microprocessors focus on computing bandwidth with multiple cores
  • Accelerators for specialized support
  • Software to take advantage; Von Neumann designs at heart
  • Power, thermal management and process variation management key considerations
• Many-cores go beyond the 4-8 cores
  • Key challenge is compilers – automate workload distribution/parallelism
  • Cache coherency schemes, memory models, synchronization, IO virtualization
• At nanoscale new architectural areas could be enabled
  • Unconventional architectures
    » Not programmed – more like the brain through learning, inference
  • As well as new opportunities for microprocessor design
  • 3D designs open new opportunities – we will study some emerging approaches