A Tutorial on High Performance Computing Taxonomy


A Tutorial on High Performance Computing Taxonomy
By Prof. V. Kamakoti
Department of Computer Science and Engineering
Indian Institute of Technology, Madras
Chennai – 600 036, India
Organization of the Tutorial
• Session – 1
– Instruction Level Parallelism (ILP)
• Pipelining concepts
• RTL and Speed-up
• Superscalar/VLIW concepts
– Static Instruction Scheduling
– Dynamic Instruction Scheduling
• Branch Prediction
Organization of the Tutorial
• Session – 2
– Amdahl’s law and its applications
– Symmetric Multiprocessors (SMP)
• The Cache Coherency problem
– ESI Protocol
– Distributed Memory Systems
– Basics of Message Passing Systems
– Parallel Models of Computing
– Design of Algorithms for Parallel Processors
• Brent’s Lemma
Why this Title?
• Performance related issues at
– circuit level (RTL)
– instruction level (Processor level)
– Shared Memory Multiprocessor level (SMP)
– Distributed Memory Multiprocessor level
(Cluster/Grid – Collection of SMPs)
ILP - Pipelining
[Diagram: a five-stage pipeline – Fetch + Inc. PC, Decode Instrn, Fetch Data, Execute Instrn, Store Data – with instructions I1–I5 overlapped across the stages.]
• With pipelining, the first instruction completes at the end of the 5th time unit, the second at the end of the 6th unit, and the 10000th instruction at the end of the 10004th unit.
• With no pipelining, 10000 instructions take 50000 units.
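A small check of these numbers, assuming an ideal k-stage pipeline in which every stage takes one time unit (a sketch, not from the slides):

# Sketch: pipeline timing for a k-stage pipeline, one time unit per stage.
def pipelined_finish(i, stages=5):
    """Completion time of the i-th instruction (1-based) in an ideal pipeline."""
    return stages + (i - 1)          # pipeline fills once, then one result per unit

def unpipelined_total(n, stages=5):
    """Total time when each instruction uses all stages before the next starts."""
    return n * stages

assert pipelined_finish(1) == 5        # first instruction: end of 5th unit
assert pipelined_finish(2) == 6        # second instruction: end of 6th unit
assert pipelined_finish(10000) == 10004
assert unpipelined_total(10000) == 50000
print("speed-up for 10000 instructions:",
      unpipelined_total(10000) / pipelined_finish(10000))   # ~4.998, close to 5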
Performance
• With pipelining we get a speed up of close to 5
• This will not work always
– Hazards
• Data
• Control
• Structural
• Non-ideal: not every step takes the same amount of
time
– Float Mult – 10 cycles
– Float Div – 40 cycles
• Performance comes out of parallelism. Let us try to
understand parallelism.
Types of Parallelism
• Recognizing parallelism
– Example: To add 100 numbers
– for j = 1 to 100 do
• Sum = Sum + A[j]; //Inherently Sequential
– A better solution in terms of parallelism, assuming 4
processors are available
• Split 100 numbers into 4 parts each of 25 numbers and allot
one part each to all the four processors
• Each processor adds 25 numbers allotted to it and sends the
answer to a head processor, which further adds and gives the
sum.
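A minimal sketch (not from the tutorial) of the data-parallel sum just described, using Python's standard multiprocessing pool; the worker count of 4 matches the four processors in the example.

# Sketch: split 100 numbers into 4 parts, each worker adds its part,
# then a "head" step adds the 4 partial sums.
from multiprocessing import Pool

def parallel_sum(a, workers=4):
    chunk = (len(a) + workers - 1) // workers
    parts = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with Pool(workers) as pool:
        partial = pool.map(sum, parts)   # each "processor" adds its 25 numbers
    return sum(partial)                  # head processor combines the answers

if __name__ == "__main__":
    data = list(range(1, 101))
    assert parallel_sum(data) == sum(data) == 5050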
Types of Parallelism
• Data Parallelism or SIMD
– The above example
– Same instruction “Add 25 numbers” but on
multiple data
– The parallelism is because of data.
Types of Parallelism
• Functional Parallelism or MIMD
– Multiple functions to be performed on a data set
or data sets.
– Multiple Instructions and Multiple Data.
– Example is the case of the pipeline discussed
earlier
Example of Pipelining
• Imagine that 100 sets of data are to be
processed in a sequence by the following
system.
[Diagram: each data set flows through Part 1 and then Part 2.]
Part 1 takes 10 ms and Part 2 takes 15 ms.
To process 100 sets of data it takes 2500 ms.
Example of Pipelining
• Consider the following changes, with a
storage element. When first data set is in
part 2, the second data set can be in part 1.
[Diagram: Part 1 → storage element → Part 2.]
First data set finishes at 30ms and after every 15
ms one data set will come out – total processing
time is 1515 ms. – A tremendous speedup.
Functional Parallelism
• Different data sets and different instructions on
them.
• Hence, Multiple Instruction and Multiple data.
• An interesting problem is to convert circuits with
large delays into pipelined circuits to get very
good throughput as seen earlier.
• Useful in the context of using the same circuit for
different sets of data.
Pipelining Concepts
• A combinational circuit can be easily modeled as a
Directed Acyclic Graph (DAG).
• Every node of the DAG is a subcircuit of the given
circuit – forms a stage of a pipeline.
• An edge of the DAG connects two nodes of the
DAG.
• Perform a topological sorting of the DAG.
[Diagram: an example DAG with nodes N1–N8 assigned to levels 1–4 by topological sorting.]
A Pipelined Circuit
• If an edge connects two nodes of levels j
and k, j < k, then introduce k − j storage
elements in between, on the edge.
• Each edge can carry one or more bits.
[Diagram: the same DAG with storage elements inserted on each edge; the edge labels give the number of storage stages the data passes through.]
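A sketch of the level assignment and storage insertion just described; the DAG here is an illustrative example, not the exact circuit in the figure.

# Sketch: assign pipeline levels to a combinational DAG by longest path from
# the inputs, then count the k - j storage elements needed on each edge.
from functools import lru_cache

edges = [("N1", "N2"), ("N3", "N4"), ("N2", "N5"),
         ("N4", "N5"), ("N1", "N5"), ("N5", "N6")]
nodes = {n for e in edges for n in e}
preds = {n: [u for (u, v) in edges if v == n] for n in nodes}

@lru_cache(maxsize=None)
def level(n):
    # Nodes with no predecessors form level 1; otherwise one past the deepest input.
    return 1 if not preds[n] else 1 + max(level(p) for p in preds[n])

for u, v in edges:
    regs = level(v) - level(u)          # k - j storage elements on this edge
    print(f"{u}(level {level(u)}) -> {v}(level {level(v)}): {regs} storage element(s)")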
Optimization
• The delay at every stage should be almost
equal.
• The stage with the maximum delay dictates the
throughput.
• The number of bits transferred across nodes should
be minimized, as that reduces the storage
requirements.
Stage Time Balancing
If Part 1 takes 10 ms and Part 2 takes 15 ms then the
first data set finishes at 30 ms (two clock periods of 15 ms) and after every 15
ms one data set will come out – the total processing
time for 100 data sets is 1515 ms. – A tremendous
speedup.
[Diagram: Part 1 → storage element → Part 2.]
If Part 1 takes 12 ms and Part 2 takes 13 ms then the first
data set finishes at 26 ms (two clock periods of 13 ms) and after every 13 ms one
data set will come out – the total processing time for 100
data sets is 1313 ms. – A significant improvement. (A small timing check appears below.)
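A small sketch reproducing these numbers: the clock period equals the slowest stage, the first result appears once the pipeline fills, then one result per period.

# Sketch: total time to push n data sets through a 2-stage pipeline whose
# clock period equals the slowest stage.
def pipeline_time(n, stage_delays):
    period = max(stage_delays)                 # slowest stage sets the rate
    fill = len(stage_delays) * period          # first result after filling the pipe
    return fill + (n - 1) * period

print(pipeline_time(100, [10, 15]))   # 1515 ms (vs 100 * 25 = 2500 ms unpipelined)
print(pipeline_time(100, [12, 13]))   # 1313 ms after balancing the stages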
RTL View of Circuits and
Performance
• Register Transfer Level
[Diagram: Reg L1 → combinational logic (3 ns) → Reg L2 → combinational logic (5 ns) → Reg L3]
Clock frequency is (1/5) × 10^9 Hz = 200 MHz.
Improve the frequency by reducing the maximum
stage delay.
High Speed Circuits
• Carry Ripple to Carry Look-ahead
• Wallace Tree Multipliers
• Increased area and power consumption
• Lower time delays
• Why do Laptops have lower frequency ratings than
Desktops?
• Reference: Cormen, Leiserson and Rivest, Introduction
to Algorithms (First Edition); or Computer
Architecture by Hamacher et al.
ILP Continues….
• Data Hazards
– LOAD [R2 + 10], R1 // Loads into R1
– ADD R3, R1, R2 //R3 = R1 + R2
• This is the “Read After Write (RAW)” Data
Hazard for R1
– LD [R2+10], R1
– ADD R3, R1, R12
– LD [R2 + 14], R1
– ADD R12, R1, R2
• This shows the WAW for R1 and WAR for R12
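A small sketch (not from the slides) that classifies the hazard between two instructions from the register sets they read and write; the encodings of the four instructions above are written out by hand.

# Sketch: classify the data hazard between an earlier and a later instruction.
def hazards(first, second):
    """first/second are (reads, writes) pairs of register-name sets."""
    r1, w1 = first
    r2, w2 = second
    found = []
    if w1 & r2: found.append("RAW")   # second reads what first writes
    if r1 & w2: found.append("WAR")   # second writes what first reads
    if w1 & w2: found.append("WAW")   # both write the same register
    return found

ld1  = ({"R2"}, {"R1"})          # LD  [R2+10], R1
add1 = ({"R1", "R12"}, {"R3"})   # ADD R3, R1, R12
ld2  = ({"R2"}, {"R1"})          # LD  [R2+14], R1
add2 = ({"R1", "R2"}, {"R12"})   # ADD R12, R1, R2

print(hazards(ld1, add1))   # ['RAW'] on R1
print(hazards(ld1, ld2))    # ['WAW'] on R1
print(hazards(add1, add2))  # ['WAR'] on R12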
ILP – Pipelining Advanced
[Diagram: pipeline with Fetch + Inc. PC, Decode Instrn, Fetch Data, several parallel execute units (Execute Unit 1 … Execute Unit K), Store Data.]
• Superscalar: CPI < 1
• Success: different instructions take different cycle times
• Four FMULs can execute while one FDIV is in progress
• Implies Out-of-Order Execution
Difficulties in Superscalar
Construction
• Ensuring no data hazards among the several
instructions executing in the different
execution units at the same point of time.
• If this is done by the compiler – then Static
Instruction Scheduling – VLIW – Itanium
• If done by the hardware – then Dynamic
Instruction Scheduling – Tomasulo – MIPS
Embedded Processor
Static Instruction Scheduling
• The compiler makes bundles of "K" instructions that can be issued at the same
time to the execution units such that there are no data dependencies
between them.
– Very Long Instruction Word (VLIW) to accommodate "K" instructions at
a time
• Lots of "NOPs" if the bundle cannot be filled with relevant instructions
– Increases the size of the executable
• Does not complicate the hardware
• Source code portability – if I make the next-generation processor with K+5
units (say) – then?
– Solved by having a software/firmware emulator, which negatively
affects performance.
Thorn in the Flesh for Static
Instruction Scheduling
• The famous “Memory Aliasing Problem”
– LD [R1+20], R2 // Loads memory[R1+20] into R2
– ST R3, [R4+40] // Stores R3 into memory[R4+40]
• This implies a data dependence through memory if (R1 + 20 == R4 + 40),
and this cannot be detected at compile time
• Such combinations of memory operations are not
put in the same bundle, and memory operations are
strictly scheduled in program order.
Dynamic Instruction Scheduling
• The data hazards are handled by the
hardware
– RAW using Operand Forwarding Technique
– WAR and WAW using Register Renaming
Technique
Processor Overview
[Diagram: Processor (register file, ALU/Control, multiple function units) connected over a bus to Memory.]
RAW example:
– LD [R1+20], R2
– ADD R3, R2, R4
Why should the result of the LD go to R2 in the register file and then be reloaded into the ALU? Forward it on its way to the register file.
Register Renaming
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4 + 54]
6. ADD R1, R9, R10
Dependencies due to Reg R1:
RAW: (1,2), (1,4), (1,5), (3,4), (3,5)
WAR: (2,3), (2,6), (4,6), (5,6)
WAW: (1,3), (1,6), (3,6)
Register Renaming: Static
Scheduling
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R12, R5, R6
4. SUB R7, R12, R8
5. ST R12, [R4 + 54]
6. ADD R1, R9, R10
Rename R1 to R12 in Instructions 3 to 5.
The dependencies are now only within a window
and not across the whole program.
The only WAR and WAW left are between
(1,6) and (2,6), which are far apart in
the program order.
Increases register pressure for the
compiler. (See the renaming sketch below.)
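A sketch of such a renaming pass; the free-register pool (R12, R13) is illustrative, and unlike the slide this pass also renames the final writer of R1.

# Sketch: rename later redefinitions of a destination register to remove
# WAR/WAW hazards. ST is modelled with dest=None.
def rename(program, free_regs):
    free = list(free_regs)
    current = {}                                   # architectural -> current name
    out = []
    for dest, s1, s2 in program:
        s1 = current.get(s1, s1)                   # read the latest name
        s2 = current.get(s2, s2)
        if dest is not None:
            if dest in current or any(dest == d for d, _, _ in out):
                current[dest] = free.pop(0)        # later writer gets a fresh register
            new_dest = current.get(dest, dest)
        else:
            new_dest = None
        out.append((new_dest, s1, s2))
    return out

prog = [("R1", "R2", "R3"),     # 1. ADD R1, R2, R3
        (None, "R1", "R4"),     # 2. ST  R1, [R4+50]
        ("R1", "R5", "R6"),     # 3. ADD R1, R5, R6
        ("R7", "R1", "R8"),     # 4. SUB R7, R1, R8
        (None, "R1", "R4"),     # 5. ST  R1, [R4+54]
        ("R1", "R9", "R10")]    # 6. ADD R1, R9, R10
for ins in rename(prog, ["R12", "R13"]):
    print(ins)                  # instructions 3-5 now use R12; instruction 6 gets R13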
Dynamic Scheduling - Tomasulo
[Diagram: Instruction Fetch Unit and Register Status Indicator feed reservation stations in front of execution units Exec 1 – Exec 4; results go to the Reg file/Memory over the Common Data Bus (CDB).]
• Instructions are fetched one by one and decoded to find the type of operation, the source operands and the allotted execution unit.
• The Register Status Indicator states whether the latest value of a register is in the register file or is currently being computed by some execution unit.
• If all operands are available, the operation proceeds to execution; else it waits in the reservation station of the allotted execution unit, pinging the CDB.
• Every execution unit writes its result along with its unit number on the CDB, which is forwarded to all the reservation stations, the register file and the memory unit.
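The slides track two structures: the Register Status Indicator and the per-unit reservation-station state. A minimal Python sketch of that issue-time bookkeeping (greatly simplified: one execution unit per instruction, no timing, names illustrative) reproduces the final state shown in the example that follows.

# Sketch: reg_status[r] == 0 means the value is in the register file;
# otherwise it holds the number of the unit that will produce it.
program = [("ADD", "R1", ["R2", "R3"]),   # 1
           ("ST",  None, ["R1", "R4"]),   # 2
           ("ADD", "R1", ["R5", "R6"]),   # 3
           ("SUB", "R7", ["R1", "R8"]),   # 4
           ("ST",  None, ["R1", "R4"]),   # 5
           ("ADD", "R1", ["R9", "R10"])]  # 6

reg_status = {f"R{i}": 0 for i in range(1, 11)}
units = {}                                   # unit number -> state string

for unit, (op, dest, srcs) in enumerate(program, start=1):
    waits_on = [reg_status[s] for s in srcs if reg_status[s] != 0]
    units[unit] = (f"I{unit}, Waiting on unit {waits_on[0]}" if waits_on
                   else f"I{unit}, Executing")
    if dest is not None:
        reg_status[dest] = unit              # dest will now come from this unit

print(reg_status)   # R1 -> 6, R7 -> 4, the rest 0 (value in the register file)
print(units)        # I1, I3, I6 executing; I2, I4, I5 waiting, as on the slides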
An Example: Instruction Fetch
1. ADD R1, R2, R3
2. ST R1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, R1, R8
5. ST R1, [R4 + 54]
6. ADD R1, R9, R10
Register Status Indicator (initial state):
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  0  0  0  0  0  0  0  0  0  0
Execution units 1–6: all Empty
An Example: after fetching Instruction 1 (ADD R1, R2, R3)
Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  1  0  0  0  0  0  0  0  0  0
Execution units: 1: Ins 1; 2–6: Empty
An Example: after fetching Instruction 2 (ST R1, [R4+50])
Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  1  0  0  0  0  0  0  0  0  0
Execution units: 1: I1, Executing; 2: I2, Waiting on unit 1; 3–6: Empty
An Example: after fetching Instruction 3 (ADD R1, R5, R6)
Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  3  0  0  0  0  0  0  0  0  0
Execution units: 1: I1, Executing; 2: I2, Waiting on unit 1; 3: I3, Executing; 4–6: Empty
Note: the reservation station stores the number of the execution
unit that shall yield the latest value of a register.
An Example: after fetching Instruction 4 (SUB R7, R1, R8)
Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  3  0  0  0  0  0  4  0  0  0
Execution units: 1: I1, Executing; 2: I2, Waiting on unit 1; 3: I3, Executing; 4: I4, Waiting on unit 3; 5–6: Empty
An Example: after fetching Instruction 5 (ST R1, [R4 + 54])
Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  3  0  0  0  0  0  4  0  0  0
Execution units: 1: I1, Executing; 2: I2, Waiting on unit 1; 3: I3, Executing; 4: I4, Waiting on unit 3; 5: I5, Waiting on unit 3; 6: Empty
An Example: after fetching Instruction 6 (ADD R1, R9, R10)
Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  6  0  0  0  0  0  4  0  0  0
Execution units: 1: I1, Executing; 2: I2, Waiting on unit 1; 3: I3, Executing; 4: I4, Waiting on unit 3; 5: I5, Waiting on unit 3; 6: I6, Executing
An Example: the program as seen by the hardware
1. ADD R1, R2, R3
2. ST U1, [R4+50]
3. ADD R1, R5, R6
4. SUB R7, U3, R8
5. ST U3, [R4 + 54]
6. ADD R1, R9, R10
Register Status Indicator:
Reg:    R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
Status:  6  0  0  0  0  0  4  0  0  0
Execution units: 1: I1, Executing; 2: I2, Waiting on unit 1; 3: I3, Executing; 4: I4, Waiting on unit 3; 5: I5, Waiting on unit 3; 6: I6, Executing
• Effectively three instructions are executing and the others are waiting.
• Execution unit 6, on completion, will make the R1 entry in the Register Status Indicator 0. Similarly, unit 4 will make the R7 entry 0.
• See that operand forwarding and register renaming are done automatically for the appropriate results.
• The whole program is converted as shown above.
Memory Aliasing
• Every Memory location is a register
• Conceptually the same method can be used
• The size of memory status indicator will be
prohibitively large.
• An associative memory is used to record the
memory address to be written to and the number
of the unit doing it.
Other Hazards
• Control Hazards
– Conditional jumps – which instruction to fetch
next in the pipeline?
– Branch predictors are used, which predict
whether a branch is taken or not.
– A misprediction leads to undoing some actions,
increasing the penalty, but nothing much can be
done.
Branch Prediction
• Different types of predictors
– Tournament
– Correlation
– K-bit
• Reference: Hennessy and Patterson,
Computer Architecture.
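The slides only name the predictor families; as an illustration, a minimal 2-bit (K = 2) saturating-counter predictor might look like the following sketch (the branch trace is made up).

# Sketch: a K-bit (here 2-bit) saturating-counter branch predictor.
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries            # counters 0..3; >= 2 predicts taken

    def predict(self, pc):
        return self.table[pc % len(self.table)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.table)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

bp = TwoBitPredictor()
trace = [(0x40, t) for t in [True]*8 + [False] + [True]*8]   # loop-like branch
correct = 0
for pc, taken in trace:
    correct += (bp.predict(pc) == taken)
    bp.update(pc, taken)
print(f"{correct}/{len(trace)} predictions correct")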
Other Hazards
• Structural Hazards
– Non-availability of a functional unit
– Say we would like to schedule the seventh
instruction in our example
– The new instruction has to wait.
– Separate Integer, FPU and Load/Store units are
made available
• Load-Store Architecture – what is it?
Architectural Enhancements
Amdahl’s Law

Speedup(overall) = Exec time without enhancement / Exec time with enhancement

A = fraction of computation time in the original
architecture that can be converted to take advantage of
the enhancement.

Exec_time(new) = (1 – A) * Exec_time(old) + Exec_time of enhanced portion (new)   -- (1)

Speedup(enhanced) = Exec_time of enhanced portion (old) / Exec_time of enhanced portion (new)
                  = A * Exec_time(old) / Exec_time of enhanced portion (new)

Hence Exec_time of enhanced portion (new) = A * Exec_time(old) / Speedup(enhanced)

Substituting in (1) above, we get the final form of Amdahl’s Law:

Exec_time(new) = Exec_time(old) * (1 – A + A / Speedup(enhanced))

Speedup(overall) = 1 / (1 – A + A / Speedup(enhanced))
Application of Amdahl’s Law:
A workload always has 50% FP operations
– 20% FP square root, 30% other FP operations
Choice 1: Use hardware and improve FP square root to get a
speedup of 10
Choice 2: Use software and improve all FP operations by a
speedup of 1.6
Speedup of Choice 1 is 1/(1 – 0.2 + 0.2/10) = 1.22
Speedup of Choice 2 is 1/(1 – 0.5 + 0.5/1.6) = 1.23
Choice 2 is better than Choice 1. (A quick check appears below.)
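A one-line check of the two choices using the final form of the law:

# Sketch: Amdahl's law as a function, reproducing the two choices above.
def amdahl(A, speedup_enhanced):
    """Overall speed-up when a fraction A of the time is sped up."""
    return 1.0 / ((1.0 - A) + A / speedup_enhanced)

print(round(amdahl(0.2, 10), 2))    # Choice 1: 1.22
print(round(amdahl(0.5, 1.6), 2))   # Choice 2: 1.23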
Shared Memory Architectures
•Sharing one memory space among several
processors.
•Maintaining coherence among several copies
of a data item.
Shared Memory Multiprocessor
[Diagram: four processors, each with its own registers and caches, connected through a snoopy bus and chipset to the shared Memory and Disk & other IO.]
Snoopy-cache State Machine I – Shared Memory Architectures
[State machine for CPU requests, maintained for each cache block; states: Invalid, Shared (read only), Exclusive (read/write); applies to write-back caches.]
• Invalid → Shared: CPU read; place read miss on bus.
• Invalid → Exclusive: CPU write; place write miss on bus.
• Shared: a CPU read hit stays in Shared; a CPU read miss places a read miss on the bus.
• Shared → Exclusive: CPU write (hit/miss); place write miss on bus.
• Exclusive: CPU read hit and CPU write hit stay in Exclusive.
• Exclusive: on a CPU read miss, write back the block and place a read miss on the bus; on a CPU write miss, write back the cache block and place a write miss on the bus.
Snoopy-cache State Machine II – Shared Memory Architectures
[State machine for bus requests, maintained for each cache block.]
• Shared → Invalid: write miss on the bus for this block.
• Exclusive → Invalid: write miss on the bus for this block; write back the block (abort memory access).
• Exclusive → Shared: read miss on the bus for this block; write back the block (abort memory access).
Shared Memory Architectures
Example
The example traces the following sequence of processor requests:
P1: Write 10 to A1
P1: Read A1
P2: Read A1
P2: Write 20 to A1
P2: Write 40 to A2
Assumes the initial cache state is Invalid and that A1 and A2 map
to the same cache block, but A1 ≠ A2.
[Combined per-block state machine for P1's cache: Invalid ↔ Shared ↔ Exclusive, driven by CPU read/write hits and misses and by remote reads/writes seen on the bus, with write-backs where required.]
Shared Memory Architectures
Example: step 1 – P1 writes 10 to A1
P1 places a write miss on the bus (WrMs P1 A1) and moves the block to Exclusive with value 10.
P1: Excl. A1 10 | P2: Invalid | Bus: WrMs P1 A1 | Memory: unchanged
Shared Memory Architectures
Example: step 2 – P1 reads A1
A read hit on the Exclusive block; no bus action is needed.
P1: Excl. A1 10 | P2: Invalid | Bus: no action | Memory: unchanged
Shared Memory Architectures
Example: step 3 – P2 reads A1
P2 places a read miss on the bus (RdMs P2 A1); P1 writes the block back (WrBk P1 A1 10), so memory now holds A1 = 10, and the data is returned to P2 (RdDa P2 A1 10). Both caches end in Shared.
P1: Shar. A1 10 | P2: Shar. A1 10 | Bus: RdMs P2 A1; WrBk P1 A1 10; RdDa P2 A1 10 | Memory: A1 = 10
Shared Memory Architectures
Example: step 4 – P2 writes 20 to A1
P2 places a write miss on the bus (WrMs P2 A1); P1 invalidates its copy; P2's block becomes Exclusive with value 20. Memory still holds A1 = 10.
P1: Inv. | P2: Excl. A1 20 | Bus: WrMs P2 A1 | Memory: A1 = 10
Shared Memory Architectures
Example: step 5 – P2 writes 40 to A2
P2 places a write miss for A2 on the bus (WrMs P2 A2). Since A2 maps to the same block, P2 first writes back A1 = 20 (WrBk P2 A1 20), so memory now holds A1 = 20, and then takes the block Exclusive for A2 with value 40.
P1: Inv. | P2: Excl. A2 40 | Bus: WrMs P2 A2; WrBk P2 A1 20 | Memory: A1 = 20
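A toy simulation of the three-state write-back snoopy protocol (Invalid/Shared/Exclusive) that replays the five steps above; this is only a sketch, with single-block caches and memory modelled as a dict, not a full coherence implementation.

# Sketch: two single-block caches snooping each other's misses on a shared bus.
class Cache:
    def __init__(self, name, memory, peers):
        self.name, self.memory, self.peers = name, memory, peers
        self.state, self.addr, self.value = "Inv", None, None

    def snoop(self, kind, addr):
        if self.addr != addr or self.state == "Inv":
            return
        if self.state == "Excl":                       # supply dirty data
            self.memory[addr] = self.value             # write back
        self.state = "Shar" if kind == "read_miss" else "Inv"

    def _miss(self, kind, addr):
        for p in self.peers:                           # broadcast the miss on the bus
            p.snoop(kind, addr)
        if self.state == "Excl" and self.addr != addr: # replacing a dirty block
            self.memory[self.addr] = self.value
        self.addr, self.value = addr, self.memory.get(addr)

    def read(self, addr):
        if not (self.addr == addr and self.state != "Inv"):
            self._miss("read_miss", addr)
            self.state = "Shar"
        return self.value

    def write(self, addr, value):
        if not (self.addr == addr and self.state == "Excl"):
            self._miss("write_miss", addr)
        self.state, self.value = "Excl", value

memory = {}
p1, p2 = Cache("P1", memory, []), Cache("P2", memory, [])
p1.peers, p2.peers = [p2], [p1]

p1.write("A1", 10)   # P1: Excl A1=10
p1.read("A1")        # read hit
p2.read("A1")        # P1 writes back, both Shared, memory[A1] = 10
p2.write("A1", 20)   # P1 invalidated, P2: Excl A1=20
p2.write("A2", 40)   # P2 writes back A1=20, then Excl A2=40
print(p1.state, p2.state, p2.addr, p2.value, memory)   # Inv Excl A2 40 {'A1': 20}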
Distributed Memory Systems
[Diagram: processors P1–P4, each with its own local memory, connected by a NETWORK.]
Message passing is used for inter-process communication.
Shared Vs Distributed Memory
Shared memory:      1. Single address space.   2. Easy to program.      3. Less scalable.
Distributed memory: 1. Multiple address spaces. 2. Difficult to program. 3. More scalable, provided you know how to program it.
Basics of Message Passing Systems
Messages
• Which processor is sending the message.
• Where is the data on the sending processor.
• What kind of data is being sent.
• How much data is there.
• Which processor(s) are receiving the message.
• Where should the data be left on the receiving processor.
• How much data is the receiving processor prepared to accept.
Aspects
• Access to the message passing system
• Addressing
• Reception
• Point-to-point communication: two processors communicate
– Synchronous and asynchronous
– Blocking and non-blocking operations
• Collective communication: a group of processors communicate
– Barrier, broadcast and reduction operations
Point – Point Synchronous Communication
Synchronous communication does not complete until the message
has been received.
Point-Point Asynchronous Communication
Asynchronous communication completes as soon as the message is on
its way.
Non-Blocking Operations
Non-blocking operations return straight away after initiating the
operation and hence allow useful work to be performed while
waiting for the operation to complete. One can test for the completion
of the operation when necessary.
Blocking Operations
A blocking operation waits for the operation to complete before
proceeding further.
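The tutorial does not tie these ideas to a particular library; assuming an MPI installation and the mpi4py package, blocking and non-blocking point-to-point messages look roughly like this sketch (run e.g. with: mpirun -n 2 python demo.py).

# Sketch: blocking vs non-blocking point-to-point messages with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": 42}, dest=1, tag=0)      # blocking send
    req = comm.isend("non-blocking hello", dest=1, tag=1)
    # ... useful work could be done here while the message is in flight ...
    req.wait()                                     # complete (or test) when needed
elif rank == 1:
    data = comm.recv(source=0, tag=0)              # blocking receive
    msg = comm.recv(source=0, tag=1)
    print("rank 1 got:", data, msg)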
Barrier
[Diagram: several processes each reaching the barrier before any of them proceeds.]
• Synchronizes processors by blocking until all of the participating
processors have called the barrier routine.
• There is no exchange of data.
Broadcast
One-to-many communication. One processor sends the same
message to several destinations in a single operation.
Reduction
[Diagram: a strike-vote example of a reduction.]
Takes data items from several processors and reduces them to a
single data item that is usually made available to all of the
participating processors, e.g. strike voting, summation.
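Again assuming MPI and mpi4py (an illustration, not part of the tutorial), the three collective operations look like this sketch (run e.g. with: mpirun -n 4 python demo.py).

# Sketch: barrier, broadcast and reduction with mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

comm.Barrier()                                   # everyone waits here; no data moves

params = comm.bcast({"n": 100} if rank == 0 else None, root=0)   # one-to-many

votes = 1 if rank % 2 == 0 else 0                # e.g. a strike vote per process
total = comm.reduce(votes, op=MPI.SUM, root=0)   # reduce to a single value
if rank == 0:
    print("parameters:", params, "votes in favour:", total)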
Parallel Models
EREW PRAM : Exclusive Read
Exclusive Write Parallel Random
Access Memory.
CREW PRAM : Concurrent Read
Exclusive Write Parallel Random
Access Memory.
CRCW PRAM : Concurrent Read
Concurrent Write Parallel Random
Access Memory
Parallel Algorithm: Recursive Doubling Technique
• Finding the prefix sum of 8 numbers
Index:   7   6   5   4   3   2   1   0
Input:  -15  6  -8   7   3  -2   1   0
Parallel Algorithm: Recursive Doubling Technique (Step 1)
• Each element x[i] adds in x[i−1], all in parallel:
Index:   7   6   5   4   3   2   1   0
Step 1: -9  -2  -1  10   1  -1   1   0
Parallel Algorithm: Recursive Doubling Technique (Step 2)
• Each element x[i] adds in x[i−2]:
Index:    7   6   5   4   3   2   1   0
Step 2: -10   8   0   9   2  -1   1   0
Parallel Algorithm: Recursive Doubling Technique (Step 3)
• Each element x[i] adds in x[i−4]; x[i] now holds the prefix sum of x[0..i]:
Index:   7   6   5   4   3   2   1   0
Step 3: -8   7   1   9   2  -1   1   0
• Prefix Sum of n numbers in O(log n) steps
• EREW Implementation
• Applicable for any semigroup operator like
min , max , mul etc.
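A sketch of the recursive-doubling prefix sum, simulated sequentially (on a PRAM all additions of one step happen in parallel); it reproduces the slide's numbers.

# Sketch: in step t every element adds in the element 2**(t-1) positions before it.
import math

def prefix_sums(a):
    x = list(a)
    n = len(x)
    for t in range(int(math.ceil(math.log2(n)))):
        d = 1 << t
        # On a PRAM all of these additions happen in parallel in one step.
        x = [x[i] + x[i - d] if i >= d else x[i] for i in range(n)]
    return x

data = [0, 1, -2, 3, 7, -8, 6, -15]   # index 0..7 (the slides list index 7 down to 0)
print(prefix_sums(data))              # [0, 1, -1, 2, 9, 1, 7, -8]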
CRCW Algorithms
• What can you do in constant time?
• Find the OR of n bits (b1, b2, …, bn) using n processors.
– Initialise C = 0.
– Each processor Pi reads bi in parallel.
– If (bi = 1) then C = 1.
• All processors that write to C write the same value, so the concurrent write is well defined.
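A sequential simulation of this constant-time CRCW OR; the loop stands in for the n processors acting in parallel.

# Sketch: every processor whose bit is 1 "concurrently" writes the same value 1.
def crcw_or(bits):
    C = 0                                  # shared cell, initialised to 0
    for i, b in enumerate(bits):           # conceptually all i run in parallel
        if b == 1:
            C = 1                          # all writers store the same value
    return C

print(crcw_or([0, 0, 1, 0]))   # 1
print(crcw_or([0, 0, 0, 0]))   # 0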
Optimality of Parallel Algorithms
• A parallel algorithm is optimal iff
Time taken by the parallel algorithm * No. of processors used
= Time taken by the best known sequential algorithm.
• Prefix sum of n numbers:
– O(n) sequential
– O(n log n) processor-time product (n processors, O(log n) time)
– Hence not optimal.
Make it Optimal
 Yes – using Brent's technique
 Use p processors and split the data set into p blocks of n/p elements each.
[Diagram: the input 6 5 4 3 2 1 split into blocks Bp, …, B2, B1.]
Allot one processor per block. Let us take 3
processors and 6 numbers as an example.
The Algorithm
Input (6 numbers, 3 blocks of 2): 6 5 | 4 3 | 2 1
Step 1: Processor i finds the prefix sums of the elements in Bi sequentially and
outputs Si, the sum up to its last element – O(n/p) time.
  Block prefix sums: 11 5 | 7 3 | 3 1, so (S3, S2, S1) = (11, 7, 3)
Step 2: Find the prefix sums of (Sp, …, S1), call them (S'p, …, S'1) –
O(log p) time by recursive doubling.
  (S'3, S'2, S'1) = (21, 10, 3)
Step 3: Processor i, i > 1, adds S'(i−1) to all the prefixes
of its block Bi – O(n/p) time.
  Result: 21 15 10 6 3 1
Total: O(n/p + log p) time, which is O(n/p) when n/p ≥ log p (for example p = n / log n).
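A sequential simulation of the three steps for p = 3 and the six numbers above (stored lowest index first, so the answer reads left to right):

# Sketch of the Brent-style block prefix sum, simulated sequentially.
def block_prefix_sums(a, p):
    n = len(a)
    size = n // p                                      # n/p elements per block
    blocks = [a[i * size:(i + 1) * size] for i in range(p)]

    # Step 1: each "processor" computes prefix sums inside its own block.
    local = []
    for blk in blocks:
        pref, s = [], 0
        for x in blk:
            s += x
            pref.append(s)
        local.append(pref)
    S = [pref[-1] for pref in local]                   # block totals S1..Sp

    # Step 2: prefix sums of the p block totals (recursive doubling on a PRAM).
    Sp = [sum(S[:i + 1]) for i in range(p)]

    # Step 3: processor i (i > 1) adds S'[i-1] to every prefix in its block.
    out = []
    for i, pref in enumerate(local):
        offset = Sp[i - 1] if i > 0 else 0
        out.extend(x + offset for x in pref)
    return out

print(block_prefix_sums([1, 2, 3, 4, 5, 6], 3))   # [1, 3, 6, 10, 15, 21]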
Sublogarithmic Algorithms and
CRCW PRAM
• Minimum of (a(1), a(2), …, a(n)) using n^2
processors in O(1) time.
• C[i] = 1 for all i; uses n processors.
• for all pairs (i, j):
– if a(i) > a(j) then C[i] = 0; uses n^2 processors.
• if C[i] = 1 then B = a(i); uses n processors.
• B now holds the minimum.
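A sequential simulation of this constant-time minimum (both loops stand in for processors acting in parallel; in the common-CRCW model all winning writers store the same value).

# Sketch: simulating the O(1)-time CRCW minimum that uses n^2 processors.
def crcw_min(a):
    n = len(a)
    C = [1] * n                        # step 1: n processors set C[i] = 1
    for i in range(n):                 # step 2: conceptually n^2 processors,
        for j in range(n):             # one per pair (i, j), all in parallel
            if a[i] > a[j]:
                C[i] = 0               # a[i] loses to some a[j]
    B = None
    for i in range(n):                 # step 3: n processors; only winners write,
        if C[i] == 1:                  # and all winners hold the same value
            B = a[i]
    return B

print(crcw_min([9, 4, 7, 4, 11]))   # 4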
What if we have only “n” processors?
• Split into n^0.5 blocks, each of n^0.5 elements, and assign
n^0.5 processors to each block.
• Solve recursively in each block to get n^0.5
block minima.
• Now we have n^0.5 elements among which we have to
find the minimum using n processors. Do
it in constant time using the previous algorithm
(since (n^0.5)^2 = n processors suffice).
• T(n) = T(n^0.5) + O(1) = O(log log n)
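A sketch of the resulting doubly-logarithmic recursion, reusing crcw_min from the previous sketch (math.isqrt needs Python 3.8+); simulated sequentially.

# Sketch: split into ~sqrt(n) blocks, recurse in each, finish with the O(1)
# n^2-processor minimum on the block minima.
import math

def fast_min(a):
    n = len(a)
    if n <= 2:
        return min(a)
    size = max(1, math.isqrt(n))                 # about sqrt(n) elements per block
    blocks = [a[i:i + size] for i in range(0, n, size)]
    minima = [fast_min(blk) for blk in blocks]   # the T(sqrt(n)) recursive step
    return crcw_min(minima)                      # previous O(1) algorithm

print(fast_min([13, 5, 8, 21, 3, 9, 17, 2, 6]))   # 2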