PARALLEL PROGRAMMING
Pingpeng Yuan
PARALLEL PROGRAMMING
What
Why
How
Goal
Exam
What is Parallel Programming?
Coordinating multiple processing elements to solve a problem
PARALLELISM - A SIMPLISTIC UNDERSTANDING
Multiple tasks at once.
Distribute work into multiple execution units.
Two approaches:
Data Parallelism: split the data into chunks, then have each compute unit process its own chunk.
Functional or Control Parallelism: divide the problem into distinct tasks, then have each processing unit handle its own task.
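The two approaches can be sketched in a few lines of Python. This is only an illustration, not material from the slides: the helper names (`chunk`, `data_parallel_sum_of_squares`, `functional_parallel_stats`) and the choice of a thread pool are assumptions made for the example. In CPython, threads show the structure of the decomposition rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, n_chunks):
    """Split data into n_chunks nearly equal contiguous blocks."""
    size, extra = divmod(len(data), n_chunks)
    blocks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def data_parallel_sum_of_squares(data, workers=4):
    """Data parallelism: every worker runs the SAME operation on its own block."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda block: sum(x * x for x in block),
                            chunk(data, workers))
        return sum(partials)


def functional_parallel_stats(data):
    """Functional parallelism: each worker runs a DIFFERENT task on the same data."""
    tasks = {"min": min, "max": max, "total": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: fut.result() for name, fut in futures.items()}


numbers = list(range(1, 101))
squares_total = data_parallel_sum_of_squares(numbers)
stats = functional_parallel_stats(numbers)
print(squares_total, stats)
```

In the first function the data is partitioned and the work is identical; in the second the data is shared and the work is partitioned.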
WHY
Why
Technology Trend
Application Needs
HUMAN ARCHITECTURE! GROWTH PERFORMANCE
[Figure: growth plotted against age (5 to 45 and beyond), with Vertical and Horizontal growth labeled.]
COMPUTATIONAL POWER IMPROVEMENT
[Figure: computational power improvement (C.P.I.) plotted against number of processors (1, 2, ...), comparing Multiprocessor and Uniprocessor.]
GENERAL TECHNOLOGY TRENDS
• Microprocessor performance increases 50% to 100% per year
• Clock frequency doubles every 3 years
• Transistor count quadruples every 3 years
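To compare "doubles every 3 years" with a per-year figure, the multiplying period can be converted into an equivalent annual rate. The sketch below assumes steady compound growth; the function name `annual_growth_rate` is made up for the example.

```python
def annual_growth_rate(factor, years):
    """Annual rate (as a fraction) that multiplies a quantity by `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0


clock_rate = annual_growth_rate(2, 3)       # doubling every 3 years
transistor_rate = annual_growth_rate(4, 3)  # quadrupling every 3 years

print(f"clock frequency:  {clock_rate:.1%} per year")
print(f"transistor count: {transistor_rate:.1%} per year")
```

Under that assumption, doubling every 3 years is about 26% per year and quadrupling every 3 years is about 59% per year.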
CLOCK FREQUENCY GROWTH RATE (INTEL FAMILY)
• 30% per year
INTEL MANY INTEGRATED CORE (MIC)
32-core version of MIC:
TILERA’S 100 CORES (JUNE 2011)
Tilera has introduced a range of processors (64-bit Gx family: 36 cores, 64 cores and 100 cores), aiming to take on Intel in servers that handle high-throughput web applications
64-bit cores running up to 1.5 GHz
Manufactured in 40 nm technology
TOP500
[Figure: Number of cores of the No. 1 system from Top500, June 1993 to June 2011; y-axis from 0 to 600,000 cores. Annotation: Paradigm Change in HPC.]
GPU ARCHITECTURE
NVIDIA Fermi, 512 Processing Elements (PEs)
THE GAP BETWEEN CPU AND GPU
ref: Tesla GPU Computing Brochure
GPU WILL TOP THE LIST IN NOV 2010
TRANSISTOR COUNT GROWTH RATE (INTEL FAMILY)
• Transistor count grows much faster than clock rate
- 40% per year, order of magnitude more contribution in 2 decades
HOW TO USE MORE TRANSISTORS
Improve single-threaded performance via architecture:
Not keeping up with potential given by technology
Use transistors for memory structures to improve data locality
Use parallelism
Instruction-level
Thread-level
SIMILAR STORY FOR STORAGE (TRANSISTOR COUNT)
TRENDS IN DRAM CAPABILITIES
• DRAM densities to double every 3 years
• Projections for DRAM densities revised downwards over time
• Current densities at 4 Gb/die
• DRAM data rates to double every 4-5 years
• Projections for DRAM data rates revised upwards over time
• Current data rates at 2.2 Gb/s
[Figures: DRAM Density (Gbits/die) and DRAM I/O Rate (Gb/s), plotted by year from 1999 to 2024. Source: ITRS ITWG.]
SIMILAR STORY FOR STORAGE
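The gap between memory capacity and memory access speed is even more pronounced
From 1980 to 1995, memory capacity grew 1000x, about 50% per year
Latency dropped only 3% per year (only 2x from 1980-95)
Memory bandwidth increased 2x
Processors get faster and memory gets larger, so memory becomes relatively slower
Need to transfer more data in parallel
Need more levels of cache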
MEMORY HIERARCHY
Level              Capacity   Access time
CPU registers      100 bytes  < 1 ns
L1 cache           32 KB      1 ns
L2 cache           256 KB     4 ns
Primary Memory     1 GB       60 ns
Secondary Storage  1 TB       10 ms
Tertiary Storage   1 PB       1 s - 1 hr
Each level can be viewed as a cache for the level below it.
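A rough way to see why the hierarchy pays off is to compute the average access time when most requests hit in the fast levels. The sketch below uses the 1 ns, 4 ns and 60 ns latencies from the slide; the hit rates (95% in L1, 90% in L2) are hypothetical values chosen for illustration, and `average_access_time` is a made-up helper.

```python
def average_access_time(levels, backing_latency_ns):
    """Average access time for a cache hierarchy.

    levels: list of (latency_ns, hit_rate) pairs, fastest cache first.
    backing_latency_ns: latency of the level that always hits (e.g. main memory).
    """
    # Work upward from the slowest level: a miss at one level pays
    # the average cost of the level below it.
    cost = backing_latency_ns
    for latency_ns, hit_rate in reversed(levels):
        cost = latency_ns + (1.0 - hit_rate) * cost
    return cost


# Latencies from the slide; hit rates are assumed for illustration only.
caches = [(1.0, 0.95), (4.0, 0.90)]
amat = average_access_time(caches, backing_latency_ns=60.0)
no_cache = average_access_time([], backing_latency_ns=60.0)

print(f"with caches: {amat:.2f} ns, memory only: {no_cache:.2f} ns")
```

With these assumed hit rates the average access costs about 1.5 ns instead of 60 ns, which is the sense in which each level acts as a cache for the one below.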
SIMILAR STORY FOR STORAGE
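Parallelism improves the efficiency of each level without increasing access time
Parallelism and locality apply inside the memory system as well
Memory chips fetch many bits at once, then stream them over a narrow channel in pipelined fashion
Buffers hold recently accessed data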
DISK TRENDS
Disks too: Parallel disks plus caching
Disk capacity, 1975-1989
doubled every 3+ years
25% improvement each year
factor of 10 every decade
Still exponential, but far less rapid than processor performance
Disk capacity, 1990-recently
doubling every 12 months
100% improvement each year
factor of 1000 every decade
Capacity growth 10x as fast as processor performance!
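The per-year and per-decade figures above are tied together by compounding. A minimal check, assuming constant annual growth (the helper names are invented for this example):

```python
import math


def growth_over(years, annual_rate):
    """Total growth factor after `years` of compounding at `annual_rate`."""
    return (1.0 + annual_rate) ** years


def doubling_time(annual_rate):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1.0 + annual_rate)


early_decade = growth_over(10, 0.25)   # 25% per year, 1975-1989
recent_decade = growth_over(10, 1.00)  # 100% per year, 1990 onward
early_doubling = doubling_time(0.25)

print(f"25%/yr:  x{early_decade:.1f} per decade, doubles every {early_doubling:.1f} years")
print(f"100%/yr: x{recent_decade:.0f} per decade")
```

At 25% per year capacity grows roughly tenfold per decade and doubles a little over every 3 years; at 100% per year it grows by a factor of 1024 per decade, matching the "factor of 1000" on the slide.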
DISK TRENDS
Only a few years ago, we purchased disks by the megabyte
Today, 1 GB (a billion bytes) from Dell costs $1 (later $0.50, then $0.05)
=> 1 TB costs $1K (later $500, then $50); 1 PB costs $1M (later $500K, then $50K)
Technology is amazing
Flying a 747 6” above the ground
Reading/writing a strip of postage stamps
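IN SUMMARY: RAPID GROWTH
Processor speed
Storage capacity
The gap between bandwidth on one hand and latency and clock frequency on the other
Parallelism is the inevitable direction of computer architecture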
COMMODITY COMPUTER SYSTEMS
1946 to 2003: General-purpose computing: Serial. 5 kHz to 4 GHz.
2004: General-purpose computing goes parallel.
Clock frequency growth flat. #Transistors/chip, 1980 to 2011: 29K to 30B!
#"cores": ~d^(y-2003)
If you want your program to run significantly faster … you're going to have to parallelize it
DRIVERS OF PARALLEL COMPUTING: APPLICATION NEEDS
ref: http://www.nvidia.com/object/tesla_computing_solutions.html
APPLICATIONS OF PARALLEL PROCESSING
WHY DO WE NEED PARALLEL PROCESSING?
Reasonable running time = Fraction of hour to several hours (10^3 to 10^4 s)
In this time, a TIPS/TFLOPS machine can perform 10^15 to 10^16 operations
Example 1: Southern oceans heat modeling (10-minute iterations)
300 GFLOP per iteration × 300 000 iterations per 6 yrs = 10^16 FLOP
Example 2: Fluid dynamics calculations (1000 × 1000 × 1000 lattice)
10^9 lattice points × 1000 FLOP/point × 10 000 time steps = 10^16 FLOP
Example 3: Monte Carlo simulation of nuclear reactor
10^11 particles to track (for 1000 escapes) × 10^4 FLOP/particle = 10^15 FLOP
Decentralized supercomputing (from Mathworld News, 2006/4/7):
Grid of tens of thousands of networked computers discovers 2^(30 402 457) - 1, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits)
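The arithmetic behind Examples 2 and 3 and the digit count of the Mersenne prime can be reproduced directly. This is a sketch for checking the numbers, assuming a machine that sustains exactly 10^12 operations per second; the function names are invented for the example.

```python
import math

TFLOPS = 1e12  # assumed sustained rate of a TIPS/TFLOPS machine, operations per second


def run_time_hours(total_flop, rate=TFLOPS):
    """Hours needed to execute total_flop operations at the given rate."""
    return total_flop / rate / 3600.0


def mersenne_digits(p):
    """Number of decimal digits of 2**p - 1, without constructing the number."""
    return math.floor(p * math.log10(2)) + 1


# Example 2: lattice points x FLOP per point x time steps
fluid_flop = 1000 ** 3 * 1000 * 10_000
# Example 3: particles x FLOP per particle
monte_carlo_flop = 10 ** 11 * 10 ** 4

print(f"fluid dynamics: {fluid_flop:.0e} FLOP, {run_time_hours(fluid_flop):.2f} h")
print(f"Monte Carlo:    {monte_carlo_flop:.0e} FLOP, {run_time_hours(monte_carlo_flop):.2f} h")
print(f"digits of 2^30402457 - 1: {mersenne_digits(30402457)}")
```

At the assumed rate, the 10^16 FLOP job takes just under 3 hours and the 10^15 FLOP job about 17 minutes, both inside the "reasonable running time" window.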
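THE BIG DATA ERA
According to an IDC report, the world held 2.7 ZB of data in 2012, and the total is projected to reach 35 ZB by 2020.
Categories of big data:
Internet data
Scientific data
Multimedia data
Industry application data, such as financial data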
WHAT MAKES IT BIG DATA?
[Figure labels: SOCIAL, BLOG, SMART METER; VOLUME, VELOCITY, VARIETY, VALUE.]
NUMBERS
How much data is there in the world?
800 Terabytes, 2000
160 Exabytes, 2006
500 Exabytes (Internet), 2009
2.7 Zettabytes, 2012
35 Zettabytes by 2020
How much data is generated in ONE day?
7 TB, Twitter
10 TB, Facebook
Big data: The next frontier for innovation, competition, and productivity
McKinsey Global Institute 2011
BIG DATA USE CASES
                         Today's Challenge            New Data                    What's Possible
Healthcare               Expensive office visits      Remote patient monitoring   Preventive care, reduced hospitalization
Manufacturing            In-person support            Product sensors             Automated diagnosis, support
Location-Based Services  Based on home zip code       Real-time location data     Geo-advertising, traffic, local search
Public Sector            Standardized services        Citizen surveys             Tailored services, cost reductions
Retail                   One-size-fits-all marketing  Social media                Sentiment analysis, segmentation
HOW
How
Practice is the sole criterion for testing truth
PARALLEL PROGRAMMING
Course content structure
Parallel Architectures
Parallel Algorithms
Parallel Programming
GOAL
• Most people in the research community agree that there are at least two kinds of parallel programmers that will be important to the future of computing
• Programmers that understand how to write software, but are naïve about parallelization and mapping to architecture
• Programmers that are knowledgeable about parallelization, and mapping to architecture, so can achieve high performance
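COURSE SCHEDULE
32 class hours in total
4 hours: course introduction + parallel computing system architectures
4 hours: fundamentals of parallel algorithms
24 hours: parallel programming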
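ASSESSMENT REQUIREMENTS
Grading: coursework (attendance + 1 doc) + exam (weighting 20:80)
1 doc
Pick a technical problem in parallel computing, review the existing techniques that address it, and propose improvements
The review should focus on the novel contributions and open problems of those techniques, as well as possible next steps for research.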