FPGA computational Strengths:

Download Report

Transcript FPGA computational Strengths:

Case Studies of Porting Two
Algorithms to Reconfigurable
Processors
Reconfigurable Systems Summer Institute
Wednesday July 13 2005
Craig Steffen
National Center for SuperComputing Applications
National Center for Supercomputing Applications
Reminder: FPGA computational Strengths
•
•
•
•
•
Parallel elements
Integer processing means small footprint
Processor cache structure—data reuse
Simple problems: maximum utilization
Simple problems to describe
National Center for Supercomputing Applications
Example Algorithm: Matrix Multiply
• Popular and much-used algorithm
• Well-known API in place
• Advantages: Simple, dividable, inherently
parallel, data re-use
• Disadvantage: floating point
National Center for Supercomputing Applications
Matrix Multiply Algorithm
• Parallel computations
• Multiple data uses
• Lends well to MAC units
A
a b c d
e f g h
B
X
k
m
n
p
q = ak + bm + cn + dp
C
=
q
r
r = ek + fm + gn + hp
National Center for Supercomputing Applications
Matrix Multiply Algorithm
(matrix dimensions)
δ
θ
α
a b c d
e f g h
X
θ
q = ak + bm + cn + dp
k
m
n
p
δ
=α
q
r
r = ek + fm + gn + hp
National Center for Supercomputing Applications
Matrix Multiply Implementation in C
for(i=0; i<α; i++){
for(j=0; j<δ; j++){
C[i][j] = 0.0;
for(k=0; k<θ; k++){
C[i][j] += A[i][k] * B[k][j];
}
}
}
National Center for Supercomputing Applications
Naïve Implementation Performance
• Generated a version that timed matrix
multiply with inputs α,δ,θ and N (iterations)
• Going from 40,800,250,40 to
400,800,250,40 caused a 2.5x slowdown
(cache issues) (data re-use rears its head)
• Speed was 500M MACs per second, or 1B
operations per second on a 2.8 GHz CPU
• Real optimized library would run about 6x
faster
National Center for Supercomputing Applications
Matrix Multiply: block-wise divisible
• Any block of elements may be multiplied as a unit
• As long as the general rules are followed the final result
is the same as it would have been
• This can be exploited to take advantage of specialized
units with preferred operand sizes
X
=
=
X
+
X
National Center for Supercomputing Applications
64-bit Floating-Point FPGA Matrix Multiplication
• Yong Dou, S. Vassiliadis, G. K. Kuzmanov,
G. N. Gaydadjiev
• FPGA ’05, February 20-22, 2005,
Monterey, California, USA
• Contains a 12-stage pipelined MAC block
design
• Multi-FPGA master-slave multiple
simultaneous execution design
National Center for Supercomputing Applications
Plan for Implementation on SRC
MAP Processor
• MAP runs at 100 MHz clock speed
• Assuming fully pipelined logical units
(MAC in this case), requires 5 MACs
running in parallel (disregarding transfer
latencies)
National Center for Supercomputing Applications
Single Data Use: MAP Starves
32 bit
64 bit
32 bit
MAP Processor
32 bit
64 bit
32 bit
• RAM to MAP data pipe can feed 2 MACs
• 6 required to make it worthwhile
National Center for Supercomputing Applications
Using Caching: Now Equals CPU Speed
Data re-use in OBM
MAP Processor
On-Board Memory
32 bit
64 bit
32 bit
32 bit
64 bit
On-Board Memory
32 bit
On-Board Memory
FPGA
On-Board Memory
On-Board Memory
On-Board Memory
National Center for Supercomputing Applications
More Speedup:
Requires Data Re-use in FPGA Block Ram
MAP Processor
On-Board Memory
32 bit
64 bit
32 bit
32 bit
64 bit
On-Board Memory
32 bit
On-Board Memory
FPGA
On-Board Memory
On-Board Memory
On-Board Memory
National Center for Supercomputing Applications
Matrix Multiply Status
• On hold for the moment
• Need to understand programming and
access issues for Block Ram
National Center for Supercomputing Applications
BLAST: DNA Comparison Code
• Basic Local Alignment Search Tool
• Biology code for comparing DNA and protein
sequences
• Multiple modes: DNA-DNA, DNA-protein,
protein-DNA, protein-protein
• Answers the question:
“Is A a subset of B?” where A is short and B is
very long
• Not a complete match—takes into account DNA
combinatoric and protein substitution rules
National Center for Supercomputing Applications
blastp: compare DNA to Protein
• First, translate DNA to Protein
DNA
Amino
Acid
sequence
National Center for Supercomputing Applications
blastp: compare DNA to Protein
• First, translate DNA to Protein
DNA
Frame 1
Amino
Acid
sequences
Frame 2
National Center for Supercomputing Applications
blastp: compare DNA to Protein
• First, translate DNA to Protein
• Translate all forward frames
DNA
Frame 1
Amino
Acid
sequences
Frame 2
Frame 3
National Center for Supercomputing Applications
blastp: compare DNA to Protein
• First, translate DNA to Protein
• Translate all forward frames
• Complete “6-Frame translation”
DNA
Amino
Acid
sequences
National Center for Supercomputing Applications
BLAST Method (per Frame):
• Finds small local matches
• Tries to expand matches to improve them, by changing matches or
inserting gaps; all of which have weights which are tied to the
probability of one combination mutating to another
• Each change causes a change in the “goodness of match” score
• Many combinations are attempted until the goodness value peaks
• This method very unsuited for FPGA: each step depends on the
previous one, one small comparison loop carrying all the weight
Initial match:
Score: two elements
Score: 2 el. – 1 gap
Score: 4 el. – 1 gap
National Center for Supercomputing Applications
BLAST: Solve the same problem differently
• Component-bycomponent comparison
for multiple offsets
• For each offset record a
score based on the
number and/or
arrangement of matches
• After all comparisons are
finished, then (perhaps)
do iterative matching
Position score: 0
Position score: 2
Position score: 0
Position score: 2
Position score: 0
National Center for Supercomputing Applications
Performance
• Problem provided by Matt Hudson of Crop
Sciences: a protein that is built by a DNA in a
certain plant chromosome.
• Protein is ~1100 amino acids, Chromosome is
31 million bases
• BLAST detects two very strong hits and two
weak ones, taking 3 CPU-seconds
• My algorithm, when coded naively, takes 20
minutes
• With reasonable speed-ups, takes 19 to 40
seconds
National Center for Supercomputing Applications
FPGA Advantages to this Algorithm:
• Each offset is
independent
• each element’s
comparision is
independent
• Data re-use factor is
the length of the short
sequence
• 6-fold parallelism due
to 6-frame translation
Position score: 0
Position score: 2
Position score: 0
Position score: 2
Position score: 0
National Center for Supercomputing Applications
Implementation:
• Do comparisons in
parallel
• Shift dna sequence,
push new element
into pipe
• repeat
National Center for Supercomputing Applications
Current Status
• Element-wise shift not trivially defined in MAP-C,
defined Perl-expanded macro
• Must create parallel comparision and adder tree
to finish
?
?
?
?
+
+
Total offset score
+
National Center for Supercomputing Applications
Conclusion
• Tools are gaining in usefulness and
sophistication
• The programmer must explicitly deal with
memory architectures and data movement
• Some things just don’t work as you’re
used to thinking about them
National Center for Supercomputing Applications