CS 258
Parallel Computer Architecture
Lecture 1
Introduction to Parallel Architecture
January 23, 2008
Prof John D. Kubiatowicz
Computer Architecture Is …
the attributes of a [computing] system as seen
by the programmer, i.e., the conceptual
structure and functional behavior, as distinct
from the organization of the data flows and
controls, the logic design, and the physical
implementation.
Amdahl, Blaauw, and Brooks, 1964
The Instruction Set: a Critical Interface
[Diagram: software above, hardware below, with the instruction set as the interface between them]
• Properties of a good abstraction
  – Lasts through many generations (portability)
  – Used in many different ways (generality)
  – Provides convenient functionality to higher levels
  – Permits an efficient implementation at lower levels
  – Changes very slowly! (although this rate of change is increasing)
• Is there a solid interface for multiprocessors?
  – No standard hardware interface
What is Parallel Architecture?
• A parallel computer is a collection of processing
elements that cooperate to solve large problems
• Some broad issues:
– Models of computation: PRAM? BSP? Sequential Consistency?
– Resource Allocation:
» how large a collection?
» how powerful are the elements?
» how much memory?
– Data access, Communication and Synchronization
» how do the elements cooperate and communicate?
» how are data transmitted between processors?
» what are the abstractions and primitives for cooperation?
– Performance and Scalability
» how does it all translate into performance?
» how does it scale?
Topologies of Parallel Machines
• Symmetric Multiprocessor
  – Multiple processors in a box with shared-memory communication
  – Current multicore chips are like this
  – Every processor runs a copy of the OS
  [Diagram: several processors (P) on a shared Bus with Memory]
• Non-uniform shared memory with separate I/O through host
  – Multiple processors
    » each with local memory
    » general scalable network
  – Extremely lightweight “OS” on each node provides simple services
    » scheduling/synchronization
  – Network-accessible host for I/O
  [Diagram: grid of processor/memory (P/M) nodes plus a Host]
• Cluster
  – Many independent machines connected by a general network
  – Communication through messages
  [Diagram: machines attached to a Network]
Conventional Wisdom (CW) in
Computer Architecture
1. Old CW: Power is free, but transistors expensive
   • New CW is the “Power wall”: Power is expensive, but transistors are “free”
     – Can put more transistors on a chip than we have the power to turn on
2. Old CW: Only concern is dynamic power
   • New CW: For desktops and servers, static power due to leakage is 40% of total power
3. Old CW: Monolithic uniprocessors are reliable internally, with errors occurring only at pins
   • New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates
Conventional Wisdom (CW)
in Computer Architecture
4. Old CW: By building upon prior successes, continue raising level of abstraction and size of HW designs
   • New CW: Wire delay, noise, cross coupling, reliability, clock jitter, design validation, … stretch development time and cost of large designs at ≤65 nm
5. Old CW: Researchers demonstrate new architectures by building chips
   • New CW: Cost of 65 nm masks, cost of ECAD, and design time for GHz clocks
     ⇒ Researchers no longer build believable chips
6. Old CW: Performance improves latency & bandwidth
   • New CW: BW improves > (latency improvement)²
Conventional Wisdom (CW)
in Computer Architecture
7. Old CW: Multiplies slow, but loads and stores fast
   • New CW is the “Memory wall”: Loads and stores are slow, but multiplies fast
     – 200 clocks to DRAM, but even FP multiplies take only 4 clocks
8. Old CW: We can reveal more ILP via compilers and architecture innovation
     – Branch prediction, OOO execution, speculation, VLIW, …
   • New CW is the “ILP wall”: Diminishing returns on finding more ILP
9. Old CW: 2X CPU performance every 18 months
   • New CW: Power Wall + Memory Wall + ILP Wall = Brick Wall
Uniprocessor Performance (SPECint)
[Graph: SPECint performance relative to the VAX-11/780, 1978–2006, log scale from 1 to 10000; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006. Growth drops from 52%/year to ??%/year around 2002, opening roughly a 3X gap versus the earlier trend.]
⇒ Sea change in chip design: multiple “cores” or processors per chip
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
Sea Change in Chip Design
• Intel 4004 (1971): 4-bit processor, 2312 transistors, 0.4 MHz, 10 micron PMOS, 11 mm² chip
• RISC II (1983): 32-bit, 5-stage pipeline, 40,760 transistors, 3 MHz, 3 micron NMOS, 60 mm² chip
• A 125 mm² chip in 0.065 micron CMOS = 2312 RISC II + FPU + Icache + Dcache
  – RISC II shrinks to ≈ 0.02 mm² at 65 nm
  – Caches via DRAM or 1-transistor SRAM (www.t-ram.com) or 3D chip stacking
  – Proximity Communication via capacitive coupling at > 1 TB/s? (Ivan Sutherland @ Sun / Berkeley)
• Processor is the new transistor?
ManyCore Chips: The future is here
• Intel 80-core multicore chip (Feb 2007)
  – 80 simple cores
  – Two floating-point engines per core
  – Mesh-like “network-on-a-chip”
  – 100 million transistors
  – 65 nm feature size

  Frequency   Voltage   Power   Bandwidth          Performance
  3.16 GHz    0.95 V    62 W    1.62 Terabits/s    1.01 Teraflops
  5.1 GHz     1.2 V     175 W   2.61 Terabits/s    1.63 Teraflops
  5.7 GHz     1.35 V    265 W   2.92 Terabits/s    1.81 Teraflops

• “ManyCore” refers to many processors per chip
  – 64? 128? Hard to say where the exact boundary is
• How to program these?
  – Use 2 CPUs for video/audio
  – Use 1 for word processor, 1 for browser
  – 76 for virus checking???
• Something new is clearly needed here…
Conventional Wisdom (CW)
in Computer Architecture
10. Old CW: Increasing clock frequency is the primary method of performance improvement
    • New CW: Processor parallelism is the primary method of performance improvement
11. Old CW: Don’t bother parallelizing your app, just wait and run it on a much faster sequential computer
    • New CW: Very long wait for a faster sequential CPU
      – 2X uniprocessor performance takes 5 years?
      – End of the La-Z-Boy Programming Era
      – “This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs …; instead, this … is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions.”
        The Parallel Computing Landscape: A Berkeley View, Dec 2006
12. Old CW: Less-than-linear scaling ⇒ failure
    • New CW: Given the switch to parallel hardware, even sublinear speedups are beneficial
13. New Moore’s Law is 2X processors (“cores”) per chip every technology generation, but at the same clock rate
Déjà vu all over again?
• Multiprocessors imminent in 1970s, ‘80s, ‘90s, …
• “… today’s processors … are nearing an impasse as technologies approach the speed of light…”
  David Mitchell, The Transputer: The Time Is Now (1989)
• Transputer was premature
  ⇒ Custom multiprocessors strove to lead uniprocessors
  ⇒ Procrastination rewarded: 2X sequential performance / 1.5 years
• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.”
  Paul Otellini, President, Intel (2004)
• Difference now is that all microprocessor companies have switched to multiprocessors (AMD, Intel, IBM, Sun; all new Apples have 2 CPUs)
  ⇒ Procrastination penalized: 2X sequential performance / 5 years
  ⇒ Biggest programming challenge: going from 1 to 2 CPUs
CS258: Information
Instructor: Prof John D. Kubiatowicz
Office: 673 Soda Hall
Phone: 643-6817
Email: [email protected]
Office Hours: Wed 1:00 - 2:00 or by appt.
Class: Mon, Wed 2:30-4:00pm
310 Soda Hall
Web page: http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/
Lectures available online before noon on the day of lecture
Email: [email protected]
Clip signup link on web page (as soon as it is up)
Computer Architecture Topics (252+)
[Diagram of a uniprocessor system annotated with topic areas:]
• Input/Output and Storage: Disks, WORM, Tape; RAID; Emerging Technologies; Interleaving; Bus protocols
• Memory Hierarchy: DRAM, L2 Cache, L1 Cache; Coherence, Bandwidth, Latency; VLSI
• Instruction Set Architecture: Addressing, Protection, Exception Handling
• Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, Dynamic Compilation
• Network Communication, Other Processors
Computer Architecture Topics (258)
[Diagram: processor/memory (P, M) nodes connected through switches (S) to an Interconnection Network]
• Multiprocessors: Processor-Memory-Switch organization
• Networks and Interconnections: Network Interfaces; Topologies, Routing, Bandwidth, Latency, Reliability
• Programming Models/Communication Styles: Shared Memory, Message Passing, Data Parallelism
• Reliability/Fault Tolerance: Transactional Memory, Checkpoint/Restart
• Everything in the previous slide, but more so!
What will you get out of CS258?
• In-depth understanding of the design and engineering
of modern parallel computers
– technology forces
– Programming models
– fundamental architectural issues
» naming, replication, communication, synchronization
– basic design techniques
» cache coherence, protocols, networks, pipelining, …
– methods of evaluation
• from moderate to very large scale
• across the hardware/software boundary
• Study of REAL parallel processors
– Research papers, white papers
• Natural consequences??
– Massive Parallelism  Reconfigurable computing?
– Message Passing Machines  NOW  Peer-to-peer systems?
Will it be worthwhile?
• Absolutely!
– Now, more than ever, industry trying to figure out how to build
these new multicore chips….
• The fundamental issues and solutions translate
across a wide spectrum of systems.
– Crisp solutions in the context of parallel machines.
• Pioneered at the thin end of the platform pyramid on the most demanding applications
  – migrate downward with time
  [Platform pyramid, top to bottom: SuperServers, Departmental Servers, Workstations, Personal Computers]
• Understand implications for software
• Network attached storage, MEMs, etc?
Why Study Parallel Architecture?
Role of a computer architect:
To design and engineer the various levels of a computer
system to maximize performance and programmability within
limits of technology and cost.
Parallelism:
• Provides alternative to faster clock for performance
• Applies at all levels of system design
• Is a fascinating perspective from which to view
architecture
• Is increasingly central in information processing
• How is instruction-level parallelism related to coarse-grained parallelism??
Is Parallel Computing Inevitable?
This was certainly not clear just a few years ago
Today, however:
YES!
• Industry is desperate for solutions!
• Application demands: Our insatiable need for
computing cycles
• Technology Trends: Easier to build
• Architecture Trends: Better abstractions
• Current trends:
– Today’s microprocessors are multiprocessors and/or have
multiprocessor support
– Servers and workstations becoming MP: Sun, SGI, DEC,
COMPAQ!...
Why Me? The Alewife Multiprocessor
• Cache-coherent Shared Memory
  – Partially in Software!
• User-level Message-Passing
• Rapid Context-Switching
• Asynchronous network
• One node/board
Lecture style
• 1-Minute Review
• 20-Minute Lecture/Discussion
• 5-Minute Administrative Matters
• 25-Minute Lecture/Discussion
• 5-Minute Break (water, stretch)
• 25-Minute Lecture/Discussion
• Instructor will come to class early & stay after to answer questions
[Graph: attention vs. time over the lecture; attention falls after about 20 min., hence the mid-lecture break and the closing “In Conclusion, …”]
Course Methodology
• Study existing designs through research papers
  – We will read about a number of real multiprocessors
  – We will discuss network router designs
  – We will discuss cache coherence protocols
  – Etc…
• High-level goal:
  – Understand past solutions so that…
  – We can make proposals for ManyCore designs
• This is a critical point in parallel architecture design
  – Industry is really looking for suggestions
  – Perhaps you can make them listen to your ideas???
Research Paper Reading
• As graduate students, you are now researchers.
• Most information of importance to you will be in
research papers.
• Ability to rapidly scan and understand research
papers is key to your success.
• So: you will read lots of papers in this course!
– Quick 1-paragraph summaries will be due in class
– Students will take turns discussing papers
• Papers will be scanned and on web page.
Textbook: Two leaders in the field
Text: Parallel Computer Architecture:
A Hardware/Software Approach,
By: David Culler & Jaswinder Singh
Covers a range of topics
We will not necessarily cover
them in order.
How will grading work?
No TA This Term!
Rough Breakdown:
• 20% Paper Summaries/Presentations
• 30% One Midterm
• 40% Research Project (work in pairs)
  – meet 3 times with me to see progress
  – give oral presentation
  – give poster session
  – written report like conference paper
  – 6 weeks of full-time work for 2 people
  – Opportunity to do “research in the small” to help make the transition from good student to research colleague
• 10% Class Participation
Application Trends
• Application demand for performance fuels advances in hardware, which enables new applications, which…
  – This cycle drives exponential increase in microprocessor performance
  – Drives parallel architecture harder
    » most demanding applications
  [Cycle diagram: More Performance ↔ New Applications]
• Programmers willing to work really hard to improve high-end applications
• Need incremental scalability:
  – Need a range of system performance with progressively increasing cost
Metrics of Performance
[Diagram: system levels from Application down to Transistors/Wires/Pins, paired with the performance metric natural at each level (alignment approximate):]
  Application                 Answers per month, Operations per second
  Programming Language
  Compiler                    (millions of) Instructions per second: MIPS
  ISA                         (millions of) (FP) operations per second: MFLOP/s
  Datapath / Control          Megabytes per second
  Function Units              Cycles per second (clock rate)
  Transistors, Wires, Pins
And what about: Programmability, Reliability, Energy?
Speedup
• Speedup(p processors) = Time(1 processor) / Time(p processors)
• Common mistake:
  – Compare parallel program on 1 processor to parallel program on p processors
• Wrong!:
  – Should compare uniprocessor program on 1 processor to parallel program on p processors
• Why? Keeps you honest
  – It is easy to parallelize overhead.
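The measurement itself is easy to get wrong; here is a minimal Python sketch of the honest comparison (the toy kernel, sizes, and process count are illustrative, not from the slides):

```python
import time
from multiprocessing import Pool

def work(n):
    # CPU-bound toy kernel: sum of squares
    return sum(i * i for i in range(n))

def sequential(chunks):
    # The honest baseline: a plain sequential program, no parallel overhead
    return [work(n) for n in chunks]

def parallel(chunks, p):
    with Pool(p) as pool:
        return pool.map(work, chunks)

if __name__ == "__main__":
    chunks = [2_000_000] * 8
    t0 = time.perf_counter(); sequential(chunks); t_seq = time.perf_counter() - t0
    t0 = time.perf_counter(); parallel(chunks, 4); t_par = time.perf_counter() - t0
    # Correct speedup: sequential baseline / parallel time,
    # NOT (parallel program on 1 processor) / parallel time
    print(f"speedup = {t_seq / t_par:.2f}")
```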
Amdahl's Law
Speedup due to enhancement E:
Speedup(E) = ExTime(without E) / ExTime(with E)
           = Performance(with E) / Performance(without E)

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected.
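Completing the algebra gives the standard closed form of Amdahl’s Law (not spelled out on this slide): Speedup(E) = 1 / ((1 − F) + F/S). A quick check in Python, with illustrative values:

```python
def amdahl_speedup(f, s):
    """Amdahl's Law: fraction f of the task is accelerated by factor s."""
    return 1.0 / ((1.0 - f) + f / s)

# Enhance 80% of the task by 10x -> only ~3.57x overall
print(amdahl_speedup(0.8, 10))   # 3.571...
# Even as s -> infinity, the untouched 20% caps speedup at 5x
print(amdahl_speedup(0.8, 1e9))  # ~5.0
```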
Amdahl’s Law for parallel programs?
ExTime_par = ExTime_ser × [ (1 − Fraction_par) + Fraction_par / p ] + ExTime_overhead(p, stuff)

Speedup = 1 / [ (1 − Fraction_par) + Fraction_par / p + ExTime_overhead(p, stuff) / ExTime_ser ]

Best you could ever hope to do:

Speedup_maximum = 1 / (1 − Fraction_par)

Worse: Overhead may kill your performance!
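A small Python sketch of this formula, treating the overhead term as a fraction of the sequential time (the function name and example numbers are illustrative):

```python
def parallel_speedup(frac_par, p, overhead_frac=0.0):
    """Amdahl's Law for parallel programs: frac_par of the work runs on
    p processors; overhead_frac is ExTime_overhead / ExTime_ser."""
    return 1.0 / ((1.0 - frac_par) + frac_par / p + overhead_frac)

# 95% parallel code on 32 processors...
print(parallel_speedup(0.95, 32))        # ~12.5x
# ...but 5% extra overhead (sync, communication) cuts it badly
print(parallel_speedup(0.95, 32, 0.05))  # ~7.7x
# Upper bound as p -> infinity: 1 / (1 - 0.95) = 20x
```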
Where is Parallel Arch Going?
Old view: Divergent architectures, no predictable pattern of growth.
[Figure: Application Software and System Software sit atop an “Architecture” layer that fans out to Systolic Arrays, SIMD, Dataflow, Message Passing, and Shared Memory]
• Uncertainty of direction paralyzed parallel software development!
Today
• Extension of “computer architecture” to support
communication and cooperation
– Instruction Set Architecture plus Communication
Architecture
• Defines
– Critical abstractions, boundaries, and primitives
(interfaces)
– Organizational structures that implement interfaces
(hw or sw)
• Compilers, libraries and OS are important bridges today
• Still – not enough standardization!
– Nothing close to the stable x86 instruction set!
P.S. Parallel Revolution May Fail
• John Hennessy, President, Stanford University, 1/07:
“…when we start talking about parallelism and ease of use of truly
parallel computers, we're talking about a problem that's as hard
as any that computer science has faced. … I would be panicked if
I were in industry.”
“A Conversation with Hennessy & Patterson,” ACM Queue Magazine,
4:10, 1/07.
• 100% failure rate of Parallel Computer Companies
  – Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, …
• What if IT goes from a growth industry to a replacement industry?
  – If SW can’t effectively use 32, 64, … cores per chip
    ⇒ SW no faster on new computer
    ⇒ Only buy if computer wears out
  [Graph: PCs sold per year (millions, 0–300), 1985–2015]
Can programmers handle parallelism?
• Historically: Humans not as good at parallel
programming as they would like to think!
– Need good model to think of machine
– Architects pushed on instruction-level parallelism
really hard, because it is “transparent”
• Can compiler extract parallelism?
– Sometimes
• How do programmers manage parallelism??
– Language to express parallelism?
– How to schedule varying number of processors?
• Is communication Explicit (message-passing) or
Implicit (shared memory)?
– Are there any ordering constraints on
communication?
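To make the contrast concrete, here is a tiny illustrative Python sketch (not from the slides): implicit communication through shared data guarded by a lock, versus explicit communication through a message queue.

```python
import threading, queue

# Implicit (shared memory): communicate by reading/writing shared data
counter = 0
lock = threading.Lock()

def shared_memory_worker():
    global counter
    with lock:                 # synchronization is the programmer's job
        counter += 1

# Explicit (message passing): communicate by sending messages
mailbox = queue.Queue()

def message_passing_worker():
    mailbox.put("result")      # no shared state; ordering comes from the queue

for target in (shared_memory_worker, message_passing_worker):
    t = threading.Thread(target=target)
    t.start(); t.join()

print(counter, mailbox.get())  # -> 1 result
```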
Is it obvious that more cores ⇒ more performance?
• AMBER molecular dynamics simulation program
• Starting point was vector code for the Cray-1
• 145 MFLOPS on a Cray C90; 406 MFLOPS for the final version on a 128-processor Paragon; 891 MFLOPS on a 128-processor Cray T3D
Granularity:
• Is communication fine or coarse grained?
– Small messages vs big messages
• Is parallelism fine or coarse grained?
– Small tasks (frequent synchronization) vs big tasks
• If hardware handles fine-grained parallelism, then
easier to get incremental scalability
• Fine-grained communication and parallelism harder
than coarse-grained:
– Harder to build with low overhead
– Custom communication architectures often needed
• Ultimate coarse-grained communication:
  – GIMPS (Great Internet Mersenne Prime Search)
  – Communication once a month
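A hedged sketch of the granularity tradeoff using Python’s multiprocessing Pool (chunksize is the real knob; the workload and sizes are illustrative): one item per message pays per-task communication overhead, while large chunks amortize it.

```python
from multiprocessing import Pool
import time

def tiny_task(x):
    return x * x        # almost no work: overhead dominates

if __name__ == "__main__":
    data = list(range(100_000))
    with Pool(4) as pool:
        # Fine-grained: one item per message -> high per-task overhead
        t0 = time.perf_counter()
        pool.map(tiny_task, data, chunksize=1)
        fine = time.perf_counter() - t0
        # Coarse-grained: big chunks amortize communication cost
        t0 = time.perf_counter()
        pool.map(tiny_task, data, chunksize=10_000)
        coarse = time.perf_counter() - t0
    print(f"fine: {fine:.3f}s  coarse: {coarse:.3f}s")
```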
Current Commercial Computing targets
• Relies on parallelism for high end
– Computational power determines scale of business that can be
handled
• Databases, online-transaction processing, decision
support, data mining, data warehousing ...
• Google, Yahoo, ….
• TPC benchmarks (TPC-C order entry, TPC-D decision support)
  – Explicit scaling criteria provided
  – Size of enterprise scales with size of system
  – Problem size not fixed as p increases.
  – Throughput is the performance measure (transactions per minute, or tpm)
Scientific Computing Demand
Engineering Computing Demand
• Large parallel machines a mainstay in many
industries
– Petroleum (reservoir analysis)
– Automotive (crash simulation, drag analysis, combustion
efficiency),
– Aeronautics (airflow analysis, engine efficiency, structural
mechanics, electromagnetism),
– Computer-aided design
– Pharmaceuticals (molecular modeling)
– Visualization
» in all of the above
» entertainment (films like Toy Story)
» architecture (walk-throughs and rendering)
– Financial modeling (yield and derivative analysis)
– etc.
Can anyone afford high-end MPPs???
• ASCI (Accelerated Strategic Computing Initiative)
  ASCI White: Built by IBM
  – 12.3 TeraOps, 8192 processors (RS/6000)
  – 6 TB of RAM, 160 TB Disk
  – 2 basketball courts in size
  – Program it??? Message passing
Need New class of applications
• Handheld devices with ManyCore
processors!
– Great Potential, right?
• Human Interface applications very important:
  “The Laptop/Handheld is the Computer”
  – ’07: HP shipped more laptops than desktops
  – 1B+ cell phones/yr, increasing in function
  – Otellini demoed “Universal Communicator” (combination cell phone, PC, and video device)
  – Apple iPhone
• User wants increasing performance with weeks or months of battery power
Applications: Speech and Image Processing
[Chart: processing demand (1 MIPS to 10 GIPS, log scale) vs. year (1980–1995) for speech and video applications: Sub-Band Speech Coding; Telephone Number Recognition; 200-Word Isolated Speech Recognition; CELP Speech Coding; Speaker Verification; 1,000-Word Continuous Speech Recognition; ISDN-CD Stereo Receiver; CIF Video; 5,000-Word Continuous Speech Recognition; HDTV Receiver]
• Also CAD, Databases, …
Compelling Laptop/Handheld Apps
• Meeting Diarist
  – Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
• Teleconference speaker identifier, speech helper
  – L/Hs used for teleconference, identify who is speaking, give a “closed caption” hint of what is being said
Compelling Laptop/Handheld Apps
• Health Coach
  – Since laptop/handheld is always with you: record images of all meals, weigh plate before and after, analyze calories consumed so far
    » “What if I order a pizza for my next meal? A salad?”
  – Since laptop/handheld is always with you: record amount of exercise so far, show how body would look if you maintain this exercise and diet pattern for the next 3 months
    » “What would I look like if I regularly ran less? Further?”
• Face Recognizer/Name Whisperer
  – Laptop/handheld scans faces, matches image database, whispers name in ear (relies on Content-Based Image Retrieval)
Content-Based Image Retrieval (Kurt Keutzer)
See “Porting MapReduce to a GPU,” Thursday 11:30
[Pipeline diagram: Query by example, Similarity Metric, Image Database (1000’s of images), Candidate Results, Relevance Feedback, Final Result]
• Built around key characteristics of personal databases:
  – Very large number of pictures (>5K)
  – Non-labeled images
  – Many pictures of few people
  – Complex pictures including people, events, places, and objects
What About….?
• Many other applications might make sense
• If Only…..
– Parallel programming is already really hard
– Who is going to write these apps???
• Domain Expert is not necessarily qualified to
write complex parallel programs!
Need a Fresh Approach to Parallelism
• Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism
  – Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, …
  – Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis
• Tried to learn from successes in high-performance computing (LBNL) and parallel embedded (BWRC)
• Led to “Berkeley View” Tech. Report and new Parallel Computing Laboratory (“Par Lab”)
• Goal: Productive, Efficient, Correct Programming of 100+ cores & scale as cores double every 2 years (!)
Why Target 100+ Cores?
• 5-year research program, aim 8+ years out
• Multicore: 2X / 2 yrs ⇒ ≈ 64 cores in 8 years
• Manycore: 8X to 16X multicore
[Graph: projected cores per chip, 2003–2015, log scale: multicore doubling 2 → 4 → … → 64; manycore reaching 128, 256, 512; the flat single-core region is labeled “Automatic Parallelization, Thread Level Speculation”]
4 Themes of View 2.0/ Par Lab
1. Applications
   • Compelling apps drive top-down research agenda
2. Identify Common Computational Patterns
   • Breaking through disciplinary boundaries
3. Developing Parallel Software with Productivity, Efficiency, and Correctness
   • 2 Layers + Coordination & Composition Language + Autotuning
4. OS and Architecture
   • Composable primitives, not packaged solutions
   • Deconstruction, fast barrier synchronization, partitions
Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Layer diagram, top to bottom:
 Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
 Motifs/Dwarfs
 Productivity layer: Composition & Coordination Language (C&CL); C&CL Compiler/Interpreter; Parallel Libraries; Parallel Frameworks
 Efficiency layer: Efficiency Languages; Sketching; Autotuners; Legacy Code; Schedulers; Communication & Synch. Primitives; Efficiency Language Compilers
 Correctness (spans layers): Static Verification; Type Systems; Directed Testing; Dynamic Checking; Debugging with Replay
 OS: Legacy OS; OS Libraries & Services; Hypervisor
 Hardware: Multicore/GPGPU; RAMP Manycore]
“Motifs” Popularity (Red Hot ↔ Blue Cool)
• How do compelling apps relate to the 13 motifs/dwarfs?
[Heat-map table: rows are the 13 motifs; columns are application areas (Embed, SPEC, DB, Games, ML, HPC) and the Par Lab apps (Health, Image, Speech, Music, Browser); cell color shows how important each motif is in each area]
 1. Finite State Mach.
 2. Combinational
 3. Graph Traversal
 4. Structured Grid
 5. Dense Matrix
 6. Sparse Matrix
 7. Spectral (FFT)
 8. Dynamic Prog
 9. N-Body
10. MapReduce
11. Backtrack/ B&B
12. Graphical Models
13. Unstructured Grid
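To make one motif concrete, here is a minimal structured-grid sketch (motif 4) in Python/NumPy; the 5-point stencil, grid size, and boundary values are illustrative:

```python
import numpy as np

def jacobi_step(grid):
    """One structured-grid update: each interior point becomes the
    average of its four neighbors (a 5-point stencil)."""
    new = grid.copy()
    new[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                              grid[1:-1, :-2] + grid[1:-1, 2:])
    return new

grid = np.zeros((64, 64))
grid[0, :] = 100.0            # hot boundary
for _ in range(100):          # each sweep is independent per point -> parallelizable
    grid = jacobi_step(grid)
print(grid[1, 32])            # temperature just inside the hot edge
```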
Developing Parallel Software
• 2 types of programmers  2 layers
• Efficiency Layer (10% of today’s programmers)
– Expert programmers build Frameworks & Libraries,
Hypervisors, …
– “Bare metal” efficiency possible at Efficiency Layer
• Productivity Layer (90% of today’s programmers)
– Domain experts / Naïve programmers productively build
parallel apps using frameworks & libraries
– Frameworks & libraries composed to form app frameworks
• Effective composition techniques allow the efficiency programmers to be highly leveraged ⇒
  Create language for Composition and Coordination (C&C)
Architectural Trends
• Greatest trend in VLSI generation is increase in
parallelism
– Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit
» slows after 32 bit
» adoption of 64-bit now under way, 128-bit far (not
performance issue)
» great inflection point when 32-bit micro and cache fit on a
chip
– Mid 80s to mid 90s: instruction level parallelism
» pipelining and simple instruction sets, + compiler advances
(RISC)
» on-chip caches and functional units => superscalar
execution
» greater sophistication: out of order execution, speculation,
prediction
• to deal with control transfer and latency problems
– Next step: ManyCore.
– Also: Thread level parallelism? Bit-level parallelism?
Can ILP go any farther?
[Graphs: (left) fraction of total cycles (%) vs. number of instructions issued per cycle, 0 to 6+; (right) speedup vs. instructions issued per cycle, 0 to 15, saturating around 3]
• Infinite resources and fetch bandwidth, perfect branch prediction and renaming
  – real caches and non-zero miss latencies
Thread-Level Parallelism “on board”
[Diagram: four processors (Proc) sharing one memory (MEM)]
• Micro on a chip makes it natural to connect many to shared memory
  – dominates server and enterprise market, moving down to desktop
• Alternative: many PCs sharing one complicated pipe
• Faster processors began to saturate the bus, then bus technology advanced
  – today, range of sizes for bus-based systems, desktop to large servers
[Graph: number of processors in fully configured commercial shared-memory systems]
What about Storage Trends?
• Divergence between memory capacity and speed even more
pronounced
– Capacity increased by 1000x from 1980-95, speed only 2x
– Gigabit DRAM by c. 2000, but gap with processor speed much greater
• Larger memories are slower, while processors get faster
– Need to transfer more data in parallel
– Need deeper cache hierarchies
– How to organize caches?
• Parallelism increases effective size of each level of hierarchy,
without increasing access time
• Parallelism and locality within memory systems too
– New designs fetch many bits within memory chip; follow with fast pipelined
transfer across narrower interface
– Buffer caches most recently accessed data
– Processor in memory?
• Disks too: Parallel disks plus caching
Summary: Why Parallel Architecture?
• Desperate situation for Hardware Manufacturers!
– Economics, technology, architecture, application demand
– Perhaps you can help this situation!
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs”)
• Focus of this class: multiprocessor level of
parallelism
• Same story from memory system perspective
– Increase bandwidth, reduce average latency with many local
memories
• Spectrum of parallel architectures make sense
– Different cost, performance and scalability