PARALLEL PROGRAMMING
Pingpeng Yuan
PARALLEL PROGRAMMING
• What
• Why
• How
• Goal
• Exam
What is Parallel Programming?

Coordinating multiple processing elements to solve a problem
PARALLELISM - A SIMPLISTIC UNDERSTANDING
• Multiple tasks at once.
• Distribute work into multiple execution units.
• Two approaches:
  - Data Parallelism: split the data into blocks, then have each compute unit process its own block.
  - Functional or Control Parallelism: split the problem into different tasks, then have each processing unit handle its own task.
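The two approaches can be contrasted in a short sketch. This is only an illustration, not code from the course: the helper names (`chunked`, `data_parallel_sum_of_squares`, `functional_parallel_stats`) are hypothetical, and a thread pool is assumed purely for brevity. In CPython, threads do not speed up CPU-bound work, so a process pool would be the usual substitute for real workloads.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into n_chunks contiguous blocks of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    blocks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        blocks.append(data[start:end])
        start = end
    return blocks


def data_parallel_sum_of_squares(data, n_workers=4):
    """Data parallelism: every worker runs the SAME operation on its own block."""
    blocks = chunked(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = pool.map(lambda block: sum(x * x for x in block), blocks)
        return sum(partial_sums)  # combine the per-block results


def functional_parallel_stats(data):
    """Functional parallelism: each worker runs a DIFFERENT task on the same data."""
    tasks = {"min": min, "max": max, "total": sum}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: fut.result() for name, fut in futures.items()}


numbers = list(range(1, 101))
squares = data_parallel_sum_of_squares(numbers)
stats = functional_parallel_stats(numbers)
print(squares, stats)
```

In the first function the work is divided by data; in the second it is divided by task, which is the distinction the slide draws.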
WHY
Technology Trend
Application Needs
HUMAN ARCHITECTURE! GROWTH PERFORMANCE
[Figure: growth versus age, with ages 5 to 45 and beyond on the x-axis; labels: Vertical, Horizontal]
COMPUTATIONAL POWER IMPROVEMENT
[Figure: computational power improvement (C.P.I.) versus number of processors (1, 2, ...); curves: Multiprocessor, Uniprocessor]
GENERAL TECHNOLOGY TRENDS
• Microprocessor performance increases 50% to 100% per year
• Clock frequency doubles every 3 years
• Transistor count quadruples every 3 years
CLOCK FREQUENCY GROWTH RATE (INTEL FAMILY)
• 30% per year
INTEL MANY INTEGRATED CORE (MIC)
32 core version of MIC:
TILERA’S 100 CORES (JUNE 2011)
• Tilera has introduced a range of processors (64-bit Gx family: 36 cores, 64 cores and 100 cores), aiming to take on Intel in servers that handle high-throughput web applications
• 64-bit cores running up to 1.5GHz
• Manufactured in 40nm technology
PARADIGM CHANGE IN HPC
[Figure: number of cores of the No. 1 system from TOP500, June 1993 to June 2011; y-axis: number of cores, 0 to 600,000]
GPU ARCHITECTURE
NVIDIA Fermi, 512 Processing Elements (PEs)
THE GAP BETWEEN CPU AND GPU
ref: Tesla GPU Computing Brochure
GPU WILL TOP THE LIST IN NOV 2010
TRANSISTOR COUNT GROWTH RATE (INTEL FAMILY)
• Transistor count grows much faster than clock rate
- 40% per year, order of magnitude more contribution in 2 decades
HOW TO USE MORE TRANSISTORS
• Improve single threaded performance via architecture:
  - Not keeping up with potential given by technology
• Use transistors for memory structures to improve data locality
• Use parallelism
  - Instruction-level
  - Thread level
SIMILAR STORY FOR STORAGE (TRANSISTOR COUNT)
TRENDS IN DRAM CAPABILITIES
• DRAM densities to double every 3 years
• Projections for DRAM densities revised downwards over time
• Current densities at 4Gb/die
[Figure: DRAM Density (Gbits/die) by year, 1999 to 2024 (Source: ITRS ITWG)]
• DRAM data rates to double every 4-5 years
• Projections for DRAM data rates revised upwards over time
• Current data rates at 2.2 Gb/s
[Figure: DRAM I/O Rate (Gb/s) by year, 1999 to 2024 (Source: ITRS ITWG)]
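SIMILAR STORY FOR STORAGE
• The gap between memory capacity and memory access speed is even more pronounced
  - From 1980 to 1995, memory capacity grew 1000x, about 50% per year
  - Latency fell by only 3% per year (only 2x from 1980-95)
  - Memory bandwidth increased 2x
• Processors get faster and memories get larger, so memory becomes relatively slower
  - More data must be transferred in parallel
  - More levels of cache are needed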
MEMORY HIERARCHY
Level | Capacity | Access time
CPU registers | 100 bytes | < 1 ns
L1 cache | 32KB | 1 ns
L2 cache | 256KB | 4 ns
Primary Memory | 1GB | 60 ns
Secondary Storage | 1TB | 10 ms
Tertiary Storage | 1PB | 1s-1hr
• Each level can be viewed as a cache for the next level
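To see why it helps for each level to act as a cache for the next, the following sketch estimates an average access time across L1 cache, L2 cache and primary memory. The access times (1 ns, 4 ns, 60 ns) are the ones listed in the hierarchy; the hit rates (95% and 90%) are hypothetical values chosen only for illustration, and the additive miss model is a simplifying assumption rather than a description of any particular machine.

```python
def average_access_time_ns(levels):
    """Estimate the average access time of a cache hierarchy.

    levels: list of (access_time_ns, hit_rate) ordered fastest first.
    The last level is assumed to always hit (hit_rate = 1.0).
    Simplified model: an access pays the cost of every level it reaches.
    """
    total_ns = 0.0
    reach_prob = 1.0  # probability that an access gets as far as this level
    for access_ns, hit_rate in levels:
        total_ns += reach_prob * access_ns
        reach_prob *= 1.0 - hit_rate  # only misses continue to the next level
    return total_ns


# Access times from the hierarchy; hit rates are hypothetical.
with_caches = average_access_time_ns([(1, 0.95), (4, 0.90), (60, 1.0)])
no_cache = average_access_time_ns([(60, 1.0)])
print(f"with caches: {with_caches:.2f} ns, memory only: {no_cache:.2f} ns")
```

Under these assumed hit rates the average cost stays close to the L1 access time even though primary memory is 60 times slower.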
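SIMILAR STORY FOR STORAGE
• Parallelism increases the efficiency of each level without increasing access time
• Parallelism and locality apply inside the memory system as well
  - A memory chip fetches many bits at once, then pipelines them out over a narrow channel
  - Buffers hold the most recently accessed data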
DISK TRENDS
Disks too: Parallel disks plus caching
Disk capacity, 1975-1989
• doubled every 3+ years
• 25% improvement each year
• factor of 10 every decade
• Still exponential, but far less rapid than processor performance
Disk capacity, 1990-recently
• doubling every 12 months
• 100% improvement each year
• factor of 1000 every decade
• Capacity growth 10x as fast as processor performance!
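The three figures quoted for each period (doubling time, yearly improvement, factor per decade) are different ways of stating one compound growth rate. The sketch below converts an annual rate into the other two forms; the function names are hypothetical and the only inputs are the 25% and 100% yearly rates from the slide.

```python
import math


def doubling_time_years(annual_rate):
    """Years needed to double at a given compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_rate)


def factor_per_decade(annual_rate):
    """Total growth factor after ten years of compound growth."""
    return (1 + annual_rate) ** 10


for label, rate in [("1975-1989", 0.25), ("1990-recently", 1.00)]:
    print(
        f"{label}: doubles every {doubling_time_years(rate):.1f} years, "
        f"x{factor_per_decade(rate):.0f} per decade"
    )
```

At 25% per year the doubling time is a little over 3 years and the decade factor is a little under 10; at 100% per year capacity doubles yearly and grows by 2^10 = 1024 per decade, matching the rounded numbers on the slide.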
DISK TRENDS
Only a few years ago, we purchased disks by the megabyte
Today, 1 GB (a billion bytes) costs $1 -> $0.50 -> $0.05 from Dell
• => 1 TB costs $1K -> $500 -> $50, 1 PB costs $1M -> $500K -> $50K
Technology is amazing
• Flying a 747 6” above the ground
• Reading/writing a strip of postage stamps
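In summary, rapid growth in:
• Processor speed
• Storage capacity
• The gap between bandwidth on one side and latency and clock frequency on the other
Parallelism is the inevitable trend in the development of computer architecture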
COMMODITY COMPUTER SYSTEMS
1946-2003 General-purpose computing: Serial. 5KHz to 4GHz.
2004 General-purpose computing goes parallel.
Clock frequency growth flat. #Transistors/chip 1980-2011: 29K to 30B!
#"cores": ~d^(y-2003)
If you want your program to run significantly faster … you’re going to have to parallelize it
APPLICATION NEEDS
DRIVERS OF PARALLEL COMPUTING
ref: http://www.nvidia.com/object/tesla_computing_solutions.html
APPLICATIONS OF PARALLEL PROCESSING
WHY DO WE NEED PARALLEL PROCESSING?
Reasonable running time = Fraction of hour to several hours (10^3 to 10^4 s)
In this time, a TIPS/TFLOPS machine can perform 10^15 to 10^16 operations
Example 1: Southern oceans heat modeling (10-minute iterations)
300 GFLOP per iteration × 300 000 iterations per 6 yrs = 10^16 FLOP
Example 2: Fluid dynamics calculations (1000 × 1000 × 1000 lattice)
10^9 lattice points × 1000 FLOP/point × 10 000 time steps = 10^16 FLOP
Example 3: Monte Carlo simulation of nuclear reactor
10^11 particles to track (for 1000 escapes) × 10^4 FLOP/particle = 10^15 FLOP
Decentralized supercomputing (from Mathworld News, 2006/4/7):
Grid of tens of thousands networked computers discovers 2^(30 402 457) - 1, the 43rd Mersenne prime, as the largest known prime (9 152 052 digits)
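The operation counts in the three examples can be checked with a few lines of arithmetic. The sketch below multiplies the factors stated on the slide and converts each total into running time on a machine sustaining 1 TFLOPS; the sustained rate and the names used are assumptions for illustration. Note that the stated factors for Example 1 multiply to 9 × 10^16 FLOP, somewhat above the rounded 10^16 figure.

```python
TFLOPS = 1e12  # assumed sustained rate: 10^12 floating-point operations per second

# Total FLOP for each example, from the factors stated on the slide.
examples = {
    "Southern oceans heat modeling": 300e9 * 300_000,  # FLOP/iteration x iterations
    "Fluid dynamics": 1e9 * 1000 * 10_000,             # points x FLOP/point x steps
    "Monte Carlo reactor simulation": 1e11 * 1e4,      # particles x FLOP/particle
}


def runtime_hours(total_flop, flop_per_second=TFLOPS):
    """Running time in hours for a given amount of work at a sustained rate."""
    return total_flop / flop_per_second / 3600


for name, flop in examples.items():
    print(f"{name}: {flop:.1e} FLOP, {runtime_hours(flop):.2f} h at 1 TFLOPS")
```

Examples 2 and 3 land inside the "fraction of an hour to several hours" window at 1 TFLOPS, while Example 1 needs about a day at that rate.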
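THE BIG DATA ERA
According to an IDC report, the global data volume was 2.7 ZB in 2012 and is expected to reach 35 ZB by 2020.
Categories of big data:
• Internet data
• Scientific data
• Multimedia data
• Industry application data, such as financial data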
WHAT MAKES IT BIG DATA?
[Figure: data sources (social, blog, smart meter) and the dimensions VOLUME, VELOCITY, VARIETY, VALUE]
NUMBERS
• How much data is there in the world?
  - 800 Terabytes, 2000
  - 160 Exabytes, 2006
  - 500 Exabytes (Internet), 2009
  - 2.7 Zettabytes, 2012
  - 35 Zettabytes by 2020
• How much data is generated in ONE day?
  - 7 TB, Twitter
  - 10 TB, Facebook
Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute 2011
BIG DATA USE CASES
Sector | Today’s Challenge | New Data | What’s Possible
Healthcare | Expensive office visits | Remote patient monitoring | Preventive care, reduced hospitalization
Manufacturing | In-person support | Product sensors | Automated diagnosis, support
Location-Based Services | Based on home zip code | Real time location data | Geo-advertising, traffic, local search
Public Sector | Standardized services | Citizen surveys | Tailored services, cost reductions
Retail | One size fits all marketing | Social media | Sentiment analysis, segmentation
HOW
• Practice is the sole criterion for testing truth
PARALLEL PROGRAMMING
Course content structure
Parallel Architectures
Parallel Algorithms
Parallel Programming
GOAL
• Most people in the research community agree that there are at least two kinds of parallel programmers that will be important to the future of computing
• Programmers that understand how to write software, but are naïve about parallelization and mapping to architecture
• Programmers that are knowledgeable about parallelization and mapping to architecture, so can achieve high performance
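TEACHING PLAN
• 32 class hours in total
• 4 hours: course introduction + parallel computing system architecture
• 4 hours: fundamentals of parallel algorithms
• 24 hours: parallel programming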
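ASSESSMENT REQUIREMENTS
Grading: regular coursework (attendance + 1 doc) + exam score (weighting 20:80)
• 1 doc
• For a specific technical problem in parallel computing, review the relevant existing solutions and propose improvements
• The review should focus on the innovations and the remaining problems, as well as possible next steps for research.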