
CS1103 Electrical Engineering and Computer Science Practicum (電機資訊工程實習)
Parallel Processing
Prof. Chung-Ta King
Department of Computer Science
National Tsing Hua University
(Contents from Saman Amarasinghe, Katherine Yelick, N. Guydosh)
The “Software Crisis”
“To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.”
-- E. Dijkstra, 1972 Turing Award Lecture
1
The First Software Crisis
- Time frame: the '60s and '70s
- Problem: assembly language programming
  - Computers could handle larger, more complex programs
- Needed to get abstraction and portability without losing performance
2
How Was the Crisis Solved?
- High-level languages for von Neumann machines
  - FORTRAN and C
- Provided a "common machine language" for uniprocessors
  - Common properties: single flow of control, single memory image
  - Differences: register file, ISA, functional units
3
The Second Software Crisis
- Time frame: the '80s and '90s
- Problem: inability to build and maintain complex, robust applications requiring multi-million lines of code developed by hundreds of programmers
  - Computers could handle larger, more complex programs
- Needed to get composability, flexibility, and maintainability
  - High performance was not an issue: left to Moore's Law
4
How Was the Crisis Solved?
- Object-oriented programming
  - C++, C#, and Java
- Also…
  - Better tools: component libraries
  - Better software engineering methodology: design patterns, specification, testing, code reviews
5
Today: Programmers Are Oblivious to Processors
- Solid boundary between hardware and software
- Programmers don't have to know the processor
  - High-level languages abstract away the processor, e.g., Java bytecode is machine independent
  - Moore's Law allows programmers to get good speedup without knowing anything about the processors
- Programs are oblivious of the processor → they work on any processor
  - A program written in the '70s still works and is much faster today
- This abstraction provides a lot of freedom for the programmers
- But, not anymore!
6
Moore's Law Dominates
- Moore's Law: Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of IC chips would double roughly every 18 months, i.e., 2X transistors/chip every 1.5 years
- Microprocessors have become smaller, denser, and more powerful
- Effects of shrinking sizes ...
Slide source: Jack Dongarra
7
Transistors and Clock Rate
[Figure: two log-scale plots versus year. Left: growth in transistors per chip, 1970-2005, from ~1,000 (i4004, i8080) through the i8086, i80286, i80386, R2000, and R3000 to ~10,000,000 (Pentium, R10000). Right: increase in clock rate, 1970-2000, from 0.1 to 1000 MHz.]
Performance not as expected? Just wait a year or two…
8
But, Not Anymore
- Performance has tapered off
9
Why?
Hidden Parallelism Tapped Out
- The '80s: superscalar expansion
  - 50% per year improvement in performance
  - Transistors applied to implicit parallelism
    - multiple instruction issue
    - dynamic scheduling: hardware discovers parallelism between instructions
    - speculative execution: look past predicted branches
    - non-blocking caches: multiple outstanding memory ops
  - Pipelined superscalar processors (10 CPI → 1 CPI)
- The '90s: era of diminishing returns
  - 2-way to 6-way issue, out-of-order issue, branch prediction (1 CPI → 0.5 CPI)
10
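The CPI figures above can be turned into speedups with the classic performance equation, time = instruction count × CPI / clock rate. In this sketch the instruction count and clock rate are assumed values chosen only to make the ratios visible; the CPI numbers are the ones from the slide.

```python
# Illustrative sketch of the CPI numbers above using the classic
# performance equation: time = instruction_count * CPI / clock_rate.
# The instruction count and clock rate are assumed, not from the slide.

def exec_time(instructions, cpi, clock_hz):
    """Execution time in seconds for a given CPI and clock rate."""
    return instructions * cpi / clock_hz

IC = 1e9    # assumed: 1 billion instructions
F = 100e6   # assumed: 100 MHz clock, held constant across designs

t_pipelined = exec_time(IC, 10.0, F)   # early pipelined machine, CPI ~ 10
t_superscalar = exec_time(IC, 1.0, F)  # '80s superscalar, CPI ~ 1
t_ooo = exec_time(IC, 0.5, F)          # '90s out-of-order issue, CPI ~ 0.5

print(t_pipelined / t_superscalar)  # 10.0: the '80s bought a 10x speedup
print(t_superscalar / t_ooo)        # 2.0: the '90s bought only 2x more
```

The shrinking ratio (10x, then only 2x) is exactly the "diminishing returns" the slide describes.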
The Origins of a Third Crisis
- Time frame: 2005 to 20??
- Problem: we are running out of innovative ideas for hidden parallelism, and sequential performance is being left behind by Moore's Law
- Need continuous and reasonable performance improvements
  - to support new features
  - to support larger datasets
- While sustaining portability, flexibility, and maintainability, without increasing the complexity faced by the programmer
  - critical to keep up with the current rate of evolution in software
11
Other Forces for a Change
- General-purpose single-core processors have stopped historic performance scaling
  - Diminishing returns from more instruction-level parallelism
  - Power consumption
  - Chip yield
  - Wire delays
  - DRAM access latency
→ Go for multicore and parallel programming
12
Limit #1: Power Density
Scaling clock speed (business as usual) will not work
[Figure: power density in W/cm2 versus year (1970-2010), log scale from 1 to 10,000, rising from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through the Pentium and P6; reference lines mark a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface]
Power Wall: we can put more transistors on a chip than we can afford to turn on
Source: Patrick Gelsinger, Intel
13
Power Efficiency (watts/spec)
14
Multicores Save Power
- Multicores built from simple cores decrease frequency and power
  - Dynamic power is proportional to V^2 f C
- Example: uniprocessor with power budget N
  - Increase frequency by 20%: power goes up substantially (by more than 50%), but performance increases by only 13%
  - Decrease frequency by 20% (e.g., by simplifying the core): power drops by 50%
    - Can now add another simple core
    - Power budget stays at N, with increased performance
Source: John Cavazos
15
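The power arithmetic on this slide can be sketched in a few lines, under the common assumption that supply voltage scales roughly linearly with frequency, so dynamic power (proportional to V^2 f C) scales with f^3. The 13% performance figure on the slide reflects that performance grows slower than frequency; this sketch only reproduces the power side of the argument.

```python
# Sketch of the slide's power argument. Assumption: supply voltage V
# tracks frequency f, so dynamic power (~ V^2 * f * C) scales with f^3.

def relative_power(freq_scale):
    """Dynamic power relative to the baseline, with V tracking f."""
    return freq_scale ** 3

print(relative_power(1.2))      # ~1.73: +20% frequency costs >50% more power
print(relative_power(0.8))      # ~0.51: -20% frequency roughly halves power
print(2 * relative_power(0.8))  # ~1.02: two slow cores fit the old budget N
```

Two slowed-down cores thus stay within the original power budget N while offering up to nearly twice the throughput of one slow core.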
Why Parallelism Lowers Power
- Highly concurrent systems are more power efficient
  - Hidden concurrency burns power: speculation, dynamic dependence checking, etc.
  - Push parallelism discovery to software (compilers and application programmers) to save power
- Challenge: can you double the concurrency in your algorithms every 2 years?
16
Limit #2: Chip Yield
- Manufacturing costs and yield limit the use of density
  - Moore's (Rock's) 2nd law: fabrication costs go up
  - Yield (% usable chips) drops
- Parallelism can help
  - More, smaller, simpler processors are easier to design and validate
  - Can use partially working chips, e.g., the Cell processor (PS3) is sold with 7 of its 8 coprocessors "on" to improve yield
17
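A minimal sketch of why shipping 7-of-8-core chips raises yield, assuming core defects are independent and each core survives fabrication with an assumed probability of 90% (a simplified binomial model, not actual fab data):

```python
# Simplified binomial yield model (assumptions: independent per-core
# defects, assumed 90% per-core survival; not actual fab data).

from math import comb

def yield_all_good(p, n):
    """Chip is usable only if all n cores are defect-free."""
    return p ** n

def yield_one_spare(p, n):
    """Chip is usable if at least n-1 of the n cores are defect-free."""
    return p ** n + comb(n, 1) * p ** (n - 1) * (1 - p)

p = 0.9
print(yield_all_good(p, 8))   # ~0.43: demand all 8 coprocessors work
print(yield_one_spare(p, 8))  # ~0.81: accept 7-of-8, nearly doubling yield
```

Under these assumed numbers, tolerating one dead core almost doubles the fraction of sellable chips, which is the economic logic behind the Cell example above.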
Limit #3: Speed of Light
- Consider a 1 Tflop/s sequential machine:
  - Data must travel some distance, r, to get from memory to the CPU
  - To get 1 data element per cycle, data must make the trip 10^12 times per second at the speed of light, c = 3x10^8 m/s. Thus r < c/10^12 = 0.3 mm.
- Now put 1 Tbyte of storage in that 0.3 mm x 0.3 mm area:
  - Each bit occupies about 1 square Angstrom, the size of a small atom
- No choice but parallelism
[Figure: a 1 Tflop/s, 1 Tbyte sequential machine with r = 0.3 mm]
18
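The slide's back-of-the-envelope can be checked directly; only the two numbers from the slide go in (a 1 Tflop/s fetch rate and c = 3x10^8 m/s).

```python
# The slide's speed-of-light argument, reproduced as code.

C = 3e8      # speed of light, m/s
RATE = 1e12  # 1 Tflop/s: one data element must arrive per "cycle"

r = C / RATE  # farthest the data can live from the CPU
print(r)      # 3e-4 m, i.e., 0.3 mm

# Pack 1 Tbyte (8e12 bits) into the resulting 0.3 mm x 0.3 mm square:
bits = 8e12
area_per_bit = (r * r) / bits
print(area_per_bit)  # ~1.1e-20 m^2, about 1 square Angstrom (1e-20 m^2)
```

Since a square Angstrom is roughly the cross-section of a single small atom, a sequential 1 Tflop/s, 1 Tbyte machine is physically implausible, hence "no choice but parallelism".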
Limit #4: Memory Access
- Growing processor-memory performance gap
- Total chip performance still growing with Moore's Law
- Bandwidth, rather than latency, will be the growing concern
19
Revolution is Happening Now
- Chip density is continuing to increase ~2x every 2 years
  - Clock speed is not
  - The number of processor cores may double instead
- There is little or no hidden parallelism (ILP) left to be found
- Parallelism must be exposed to and managed by software
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
20
Multicore in Products
- All processor companies switched to MP (2X CPUs / 2 yrs)

  Manufacturer/Year    AMD/'05   Intel/'06   IBM/'04   Sun/'07
  Processors/chip         2          2          2         8
  Threads/processor       1          2          2        16
  Threads/chip            2          4          4       128

And at the same time,
- The STI Cell processor (PS3) has 8 cores
- The latest NVidia Graphics Processing Unit (GPU) has 128 cores
- Intel has demonstrated an 80-core research chip
21
The Rise of Multicores
22
IBM Cell Processor
- Consists of a 64-bit Power core, augmented with 8 specialized SIMD coprocessors (SPUs, Synergistic Processor Units), and a coherent bus
23
Intel Xeon Processor 5300 Series
- Dual-die quad-core processor
24
Open Sparc T1
25
Parallel Computing Not New
- Researchers have been using parallel computing for decades:
  - Mostly in computational science and engineering
  - Problems too large to solve on one computer; use 100s or 1000s
- Many companies in the '80s/'90s "bet" on parallel computing and failed
  - Computers got faster too quickly for there to be a large market
26
Old: Parallelism only for High-End Computing
New: Parallelism by Necessity

Why Parallelism?
- All major processor vendors are producing multicores
  - Every machine will soon be a parallel machine
  - All programmers will be parallel programmers???
- New software model
  - Want a new feature? Hide the "cost" by speeding up the code first
  - All programmers will be performance programmers???
- Some of this may eventually be hidden in libraries, compilers, and high-level languages
  - But a lot of work is needed to get there
- Big open questions:
  - What will be the killer apps for multicore machines?
  - How should the chips be designed and programmed?
28
A View from Intel
- Recognition: What is a tumor?
- Mining: Is there a tumor here?
- Synthesis: What if the tumor progresses?
It is all about dealing efficiently with complex multimodal datasets
Source: Pradeep K Dubey, Intel
Emerging Workload Focus: iRMS
[Diagram: the RMS loop around a Model. Recognition ("What is …?") creates a new model instance; Mining ("Is it …?") finds an existing model instance; Synthesis ("What if …?") performs graphics rendering + physical simulation. Computer Vision feeds Visual Input Streams into Learning & Modeling, and Synthesized Visuals provide Reality Augmentation.]
Most RMS apps are about enabling an interactive (real-time) RMS Loop, or iRMS
Source: Pradeep K Dubey, Intel
Technology Enables Two Paths
1. increasing performance, same price (& form factor)
31
Technology Enables Two Paths
2. constant performance, decreasing cost (& form factor)
32
Re-inventing Client/Server
- "The Datacenter is the Computer"
  - Building-sized computers: Google, MS …
- "The Laptop/Handheld is the Computer"
  - '08: Dell # laptops > # desktops?
  - 1B cell phones/yr, increasing in function
  - Will desktops disappear? Laptops?
- Laptop/handheld as the future client, datacenter or "cloud" as the future server
  - But energy budgets could soon dominate facility costs
33
Summary of Trends
- Power is the dominant concern for future systems
  - The primary "knob" for reducing power is to lower clock rates and increase parallelism
- The memory wall (latency and bandwidth) will continue to get worse
- Memory capacity will be increasingly limited/costly
- The entire spectrum of computing will need to address parallelism → performance is a software problem
  - Handheld devices: to conserve battery power
  - Laptops/desktops: each new "feature" requires saving time elsewhere
  - High-end computing facilities and data centers: to reduce energy costs
34
Quiz
- You might have used computers with a dual-core CPU, but you did not have to write parallel programs; your sequential programs can still run on such computers. So, what is the point of writing parallel programs? Please give your thoughts on this.
- [Hint] What happens if you have a computer with 64 cores?
35
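One way to reason about the hint is Amdahl's law, a standard formula that is not on these slides: on n cores, only the parallel fraction p of a program's work speeds up, so a purely sequential program gains nothing from extra cores.

```python
# Amdahl's law (a standard formula, not taken from these slides): the
# overall speedup on n cores when a fraction p of the work is parallel.

def speedup(p, n):
    """Amdahl's law: speedup for parallel fraction p on n cores."""
    return 1.0 / ((1.0 - p) + p / n)

print(speedup(0.0, 64))   # 1.0: a sequential program leaves 63 cores idle
print(speedup(0.95, 64))  # ~15.4: even 95%-parallel code is far below 64x
```

The sketch makes the quiz's point concrete: on a 64-core machine, a sequential program uses 1/64 of the hardware, and even mostly-parallel code falls well short of a 64x speedup.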