Multi Core Development
Alex Becker
Multi-core is short for “multiple cores”
Advances in technology allow for several
discrete cores on one chip
This however is not multi-CPU
A core contains the core processor, but does not
include the other components of the CPU
A CPU contains things such as a front-side bus,
caches, and video processing
Number of transistors on a chip doubles every
two years
This makes it possible to split those transistors into two separate cores on the same chip
Multi-core chips can handle multiple tasks
better than single core
Case in point: SETI@Home
Two cores meant one could be dedicated to S@H
The other core would take care of normal tasks
Parallel computing
Overall speed increase at the same or lower
power requirements
Early multi-core chips had a major downside
compared to single-core chips
Individual core speed was less than a single core's speed
Tasks would run slower
The first commercial multi-core chips were
Advanced Micro Devices' Opterons
Designed for servers
Provided a significant advantage over prior,
single-core devices
More cores meant servers could process more data
Speed isn’t as critical in server applications
compared to standard applications
Intel’s first multi-core offering was a Xeon
The second major offering was the Core Duo
The Core Duo was designed for mobile
computing
Also offered significant advantages
Allowed a lower thermal design power (TDP) than
single core chips
Saved power
Used in video cards
CUDA technology
Hundreds to over a thousand CUDA cores
Parallelism is using multiple threads, cores,
or CPUs to run parts of a program side by side
Results in a speed increase
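As a concrete illustration (a minimal sketch, not from the original transcript; the names and the workload are made up), two POSIX threads can each work on half of an array at the same time:

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static float data[N];

struct range { int start; int end; };

/* Each thread processes only its own slice of the array */
static void *process_slice(void *arg) {
    struct range *r = (struct range *)arg;
    for (int i = r->start; i < r->end; i++)
        data[i] = data[i] * 2.0f + 1.0f;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    struct range first = { 0, N / 2 };
    struct range second = { N / 2, N };
    /* Both halves run at the same time on a multi-core chip */
    pthread_create(&t1, NULL, process_slice, &first);
    pthread_create(&t2, NULL, process_slice, &second);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("data[0] = %f\n", data[0]);
    return 0;
}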
How much of an increase though?
One would expect speed to increase linearly with
the number of cores working on the program in
parallel
[Chart: naive expectation of Speed Increase Factor versus Number of Cores (1-20), rising linearly]
Back in the 1960s, Gene Amdahl came up with
a formula to determine the speed increase that
would come with parallel computing.
This became known as Amdahl’s Law
X = 1 / (rs + rp / N)
rp is the portion of the program that can be run
in parallel
rs is the portion that cannot be run in parallel
N is the number of threads
The sum of rp and rs must equal one
X is the factor by which the program’s speed
can be increased.
10 minute program
9 minutes can be in parallel
1 minute must be sequential
Based on this:
rp = 0.9
rs = 0.1
N=4
X will be around 3.08
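A quick way to check that number (a small C sketch, not part of the original slides; the function name amdahl_speedup is just illustrative):

#include <stdio.h>

/* X = 1 / (rs + rp / N), per Amdahl's Law */
static double amdahl_speedup(double rs, double rp, int n) {
    return 1.0 / (rs + rp / (double)n);
}

int main(void) {
    /* rs = 0.1, rp = 0.9, N = 4, as in the example above */
    printf("X = %.2f\n", amdahl_speedup(0.1, 0.9, 4));  /* prints X = 3.08 */
    return 0;
}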
[Chart: Speed Increase Factor (X) versus Number of Cores (N) for rs = 0.1, flattening out as N grows]
Core Count:
GTX 670: 1344
GTX 680: 1536
Or, the 670 has 7/8 the cores of the 680
Should one expect a performance drop of 1/8?
Obviously no.
Amdahl’s law shows that using more cores
suffers from diminishing returns
Using the example from before, let’s see the
time differences
GTX 670: ~10.07% of the time, or 60.42s
GTX 680: ~10.06% of the time, or 60.36s
The difference is much less than 12.5%
Noticed in real world frame rate testing
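Plugging the two core counts into the same formula (an illustrative sketch; it assumes the 10-minute, 90%-parallel program from the earlier example):

#include <stdio.h>

static double amdahl_speedup(double rs, double rp, int n) {
    return 1.0 / (rs + rp / (double)n);
}

int main(void) {
    double total = 600.0;  /* the 10-minute program, in seconds */
    /* Roughly 60.4 s vs 60.35 s -- the small gap from the 60.42 s and
       60.36 s quoted above comes from rounding the percentages first */
    printf("GTX 670: %.2f s\n", total / amdahl_speedup(0.1, 0.9, 1344));
    printf("GTX 680: %.2f s\n", total / amdahl_speedup(0.1, 0.9, 1536));
    return 0;
}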
Writing in parallel has its own set of challenges
Thread Safety
What can and can’t be parallel
A simple equation: y[i] = a * x[i] + y[i], computed for every element of two arrays
// Sequential version: y = a*x + y, one element at a time
void seq_function(int n, float a, float *x, float *y) {
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

// Call:
seq_function(n, 3.14, x, y);
Nvidia’s implementation of parallel processing
Designed to be parallel from the get go
Work is split into blocks; a block's index is blockIdx.x
The number of threads in each block is blockDim.x
A thread's index within its block is threadIdx.x
When calling, specify number of blocks and the
number of threads per block
__global__
void CUDA_function(int n, float a, float *x, float *y) {
    /* Each thread handles one element of y = a*x + y */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Call:
int n_blocks = (n + 255) / 256;
CUDA_function<<<n_blocks, 256>>>(n, 3.14, x, y);
n_blocks is the total number of blocks, and
there are 256 threads per block
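For completeness, a minimal host-side sketch of how this kernel might be set up and launched (not in the original transcript; it assumes host arrays x and y of length n as in the sequential version, and the device pointer names d_x and d_y are made up for illustration):

// Allocate device copies of x and y and copy the input over
float *d_x, *d_y;
cudaMalloc((void **)&d_x, n * sizeof(float));
cudaMalloc((void **)&d_y, n * sizeof(float));
cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

// Round up so every element gets a thread, then launch
int n_blocks = (n + 255) / 256;
CUDA_function<<<n_blocks, 256>>>(n, 3.14, d_x, d_y);

// Copy the result back and release the device memory
cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);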
Parallel code can cause race conditions and
other nasty side effects
Different levels of safety
Thread Unsafe
Thread Safe
Thread Safe-MT
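A classic way to see the difference (an illustration, not from the slides) is the C library's strtok, which keeps its position in hidden static state and is therefore thread unsafe, versus POSIX strtok_r, which keeps that state in a caller-supplied pointer and is safe to call from multiple threads:

#include <stdio.h>
#include <string.h>

/* Thread unsafe: strtok's internal position is shared process-wide,
   so two threads tokenizing at once corrupt each other */
void tokenize_unsafe(char *line) {
    for (char *tok = strtok(line, ","); tok != NULL; tok = strtok(NULL, ","))
        printf("%s\n", tok);
}

/* Thread safe: each caller owns its own save pointer */
void tokenize_safe(char *line) {
    char *save;
    for (char *tok = strtok_r(line, ",", &save); tok != NULL; tok = strtok_r(NULL, ",", &save))
        printf("%s\n", tok);
}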
“Thread Locks”
For example, when one thread is accessing data
that possibly could be accessed by other
threads, the thread locks the data
When another thread tries to access the data,
that thread is put in a queue
When the first thread is done, the waiting
thread can access the data
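A minimal sketch of this locking pattern with a POSIX mutex (illustrative names, not from the slides):

#include <pthread.h>

static long shared_counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* other threads wait (queue) here */
        shared_counter++;             /* only one thread is ever in this section */
        pthread_mutex_unlock(&lock);  /* lets the next waiting thread in */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With the lock, shared_counter is reliably 200000 */
    return 0;
}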
An atomic function is a function that must
execute as a single, uninterrupted operation
The only threads able to access the data
contained within one are the threads in the
function itself
Uninterruptible
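One standard way to express an atomic update in plain C is C11's <stdatomic.h> (a sketch for illustration; the avr-libc ATOMIC_BLOCK cited in the references achieves the same effect on AVR microcontrollers by disabling interrupts, and CUDA offers the same idea on the GPU with functions such as atomicAdd):

#include <stdatomic.h>

static atomic_long counter = 0;

void record_event(void) {
    /* The increment completes as one uninterruptible step, so no lock
       is needed and no other thread can observe it half-done */
    atomic_fetch_add(&counter, 1);
}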
Way of the future
Multi-core development is one subset of parallel development
Program speeds will increase as more
programs become parallel
Video game engines
Eventually almost every program will have
some parallelism in it
Intel. Moore's Law Inspires Intel Innovation. Retrieved from http://www.intel.com/content/www/us/en/siliconinnovations/moores-law-technology.html
Nickolls, J., Buck, I., Garland, M., & Skadron, K. Scalable Parallel Programming with CUDA. ACM Queue. Retrieved from http://queue.acm.org/detail.cfm?id=1365500
Oracle Documentation. Multithreaded Programming Guide. Retrieved from http://docs.oracle.com/cd/E19963-01/html/8211601/docinfo.html
Amdahl, G. M. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. Retrieved from http://www-inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
Intel ARK. Mobile Intel® Pentium® 4 Processor - M 2.30 GHz, 512K Cache, 400 MHz FSB. Retrieved from http://ark.intel.com/products/27360/Mobile-IntelPentium-4-Processor---M-2_30-GHz-512K-Cache-400-MHz-FSB
Intel ARK. Intel® Core™ Duo Processor T2700 (2M Cache, 2.33 GHz, 667 MHz FSB). Retrieved from http://ark.intel.com/products/27238/Intel-Core-Duo-Processor-T2700-2M-Cache-2_33GHz-667-MHz-FSB
Intel ARK. Intel® Pentium® Mobile Processor (Mobile). Retrieved from http://ark.intel.com/products/family/41878/IntelPentium-Mobile-Processor/mobile
Intel ARK. Intel® Core™ Duo Processor (Mobile). Retrieved from http://ark.intel.com/products/family/22731/Intel-Core-DuoProcessor/mobile
Angelini, C. GeForce GTX 670 2 GB Review: Is It Already Time To Forget GTX 680? Tom's Hardware. Retrieved from http://www.tomshardware.com/reviews/geforce-gtx-670-review,3200.html
Robbins, D. Common threads: POSIX threads explained, Part 2. IBM developerWorks. Retrieved from http://www.ibm.com/developerworks/library/l-posix2/
avr-libc User Manual. <util/atomic.h>: Atomically and Non-Atomically Executed Code Blocks. Retrieved from http://www.nongnu.org/avr-libc/usermanual/group__util__atomic.html