Mapping of Regular Nested Loop Programs to
Coarse-grained Reconfigurable Arrays – Constraints and Methodology
F. Hannig, H. Dutta, and J. Teich
University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS’04)
Presented by: Luis Ortiz
Department of Computer Science
The University of Texas at San Antonio
Outline
Overview
The Problem
Reconfigurable Architectures
Design Flow for Regular Mapping
Parallelizing Transformations
Constraints Related to CG Reconfigurable Arrays
Case Study
Results
Conclusions and Future Work
Overview
Constructing a parallel program is equivalent to specifying its
execution order
• the operations of a program form a set, and its execution order is
a binary, transitive and asymmetric relation
• the relevant sets are (unions of) Z-polytopes
• most optimizations can be presented as transformations of
the original program
The problem of automatic parallelization
• given a set of operations E and a strict total order on it
• find a partial order on E such that execution of E under it is
determinate and gives the same results as the original program
Overview (cont.)
Defining a polyhedron
• a set of linear inequalities: Ax + a ≥ 0
• the polyhedron is the set of all x which satisfies these inequalities
• the basic property of a polyhedron is convexity:
• if two points a and b belong to a polyhedron, then so do all of
their convex combinations
• λa + (1 − λ)b, 0 ≤ λ ≤ 1
• a bounded polyhedron is called a polytope
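• a quick worked instance (my own illustration): if a = (0, 0) and
b = (4, 2) belong to a polytope, then so does every point
λa + (1 − λ)b = (4(1 − λ), 2(1 − λ)), 0 ≤ λ ≤ 1, e.g. the midpoint
(2, 1) for λ = 1/2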
Overview (cont.)
The essence of the polytope model is to apply affine
transformations to the iteration spaces of a program
• the iteration domain of statement S:
Dom(S) = {x | DS·x + dS ≥ 0}
• DS and dS are the matrix and the constant vector that define the
iteration polytope; dS may depend linearly on the structure
parameters
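As an illustration (not from the slides), the triangular loop nest

    for (i = 0; i <= N - 1; i++)
        for (j = 0; j <= i; j++)
            S;

has the iteration domain Dom(S) = {(i, j) | i ≥ 0 ∧ N − 1 − i ≥ 0 ∧
j ≥ 0 ∧ i − j ≥ 0}, i.e. DS has the rows (1 0), (−1 0), (0 1), (1 −1)
and dS = (0, N − 1, 0, 0); note that dS depends linearly on the
structure parameter N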
Overview (cont.)
Coarse-grained reconfigurable architectures
• provide flexibility of software combined with the performance of
hardware
• but their hardware complexity is difficult to handle due to a lack
of mapping tools
Parallelization techniques and compilers
• map computationally intensive algorithms efficiently to coarse-grained
reconfigurable arrays
The Problem
“Mapping a certain class of regular nested loop programs onto
a dedicated processor array”
Reconfigurable Architectures
Span a wide range of abstraction levels
• from fine-grained Look-Up Table (LUT) based reconfigurable logic
devices to distributed and hierarchical systems with
heterogeneous reconfigurable components
Efficiency comparison
• standard arithmetic is less efficient on fine-grained architectures
• due to the large routing area overhead
Little research so far deals with compilation to coarse-grained
reconfigurable architectures
Design Flow for Regular Mapping
Design Flow for Regular Mapping (cont.)
A piecewise regular algorithm contains N quantified
equations
• S1[I], …, SN[I]
• each equation Si[I] is of the form
xi[I] = fi(…, xj[I − dji], …), ∀ I ∈ Ii
• xi[I] are indexed variables
• fi are arbitrary functions
• dji ∈ ℤn are constant data dependence vectors, and the ellipses “…”
denote similar arguments
• Ii are called index spaces
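For instance (an illustration anticipating the case study), matrix
multiplication C = A · B can be written as a piecewise regular
algorithm with embedded variables and the quantified equations

a[i, j, k] = a[i, j − 1, k]
b[i, j, k] = b[i − 1, j, k]
c[i, j, k] = c[i, j, k − 1] + a[i, j, k] · b[i, j, k]

for all I = (i, j, k) ∈ {I ∈ ℤ3 | 1 ≤ i, j, k ≤ N}, with the constant
data dependence vectors (0, 1, 0), (1, 0, 0), and (0, 0, 1)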
Design Flow for Regular Mapping (cont.)
Linearly bounded lattice
• I = {I ∈ ℤn | I = M·κ + c ∧ A·κ ≥ b}
• κ ∈ ℤl, M ∈ ℤn×l, c ∈ ℤn, A ∈ ℤm×l, b ∈ ℤm
• the integer vectors κ satisfying A·κ ≥ b are affinely mapped onto
iteration vectors I by the affine transformation I = M·κ + c
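A simple instance (assuming the standard definition above): with M the
identity matrix and c = 0 the lattice is dense, and the index space is
simply the set of integer points of a polytope; choosing instead M = (2)
and c = 0 in one dimension gives I = {I ∈ ℤ | I = 2κ ∧ 0 ≤ κ ≤ N} =
{0, 2, …, 2N}, a lattice containing only every second point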
Block pipelining period
• time interval between the initiations of two successive problem
instances (β)
Parallelizing Transformations
Based on the representation of equations and index spaces
several combinations of parallelizing transformations in the
polytope model can be applied
• Affine Transformations
• Localization
• Operator Splitting
• Exploration of Space-Time Mappings
• Partitioning
• Control Generation
• HDL Generation & Synthesis
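To make the exploration of space-time mappings concrete: a typical
affine mapping for a 3-deep loop nest such as matrix multiplication (an
illustration, not necessarily the authors’ chosen mapping) is p = Q·I,
t = λ·I with Q = (1 0 0; 0 1 0) and λ = (1 1 1), i.e. iteration
I = (i, j, k) runs on processor p = (i, j) at time t = i + j + k; each
admissible choice of Q and λ is one design point of the exploration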
Constraints Related to CG Reconfigurable Arrays
Coarse-grained (re)configurable architectures consist of an
array of processor elements (PEs); each PE contains
• one or more dedicated functional units or
• one or more arithmetic logic units (ALU)
• memory
• local memory → register files
• memory banks
• an instruction memory is required if the PE contains an
instruction-programmable ALU
• interconnect structures
• I/O ports
• synchronization and reconfiguration mechanisms
Case Study
Regular mapping methodology applied for a matrix
multiplication algorithm
• target architecture
• PACT XPP64-A reconfigurable processor array
• 64 ALU-PAEs with 24-bit data width in an 8×8 array
• each ALU-PAE consists of three objects
• an ALU object
• a Back-Register object (BREG)
• a Forward-Register object (FREG)
• all objects are connected to horizontal routing channels
Case Study (cont.)
• RAM-PAEs are located in two columns at the left and the
right border of the array, with two ports for independent
read/write operations
• each RAM can be configured in FIFO mode
• each RAM-PAE has a 512 × 24-bit storage capacity
• four independent I/O interfaces are located in the corners of
the array
Case Study (cont.)
Structure of the PACT XPP64-A reconfigurable processor
ALU-PAE objects
Case Study (cont.)
Matrix multiplication algorithm
• C = A · B
• A, B ∈ ℤN×N
• computations may be represented by a dependence graph (DG)
• dependence graphs can be represented in a reduced form
• Reduced Dependence Graph: to each edge e = (vi, vj) there is
associated a dependence vector dij ∈ Zn
• virtual Processor Elements (VPEs) are used to map the PE
obtained from the design flow to the given architecture
Case Study (cont.)
Matrix multiplication algorithm, C-code
Matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization
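The code itself is not reproduced in the transcript; the following is a
minimal sketch of both versions, my own reconstruction following the
transformations named above (the array names A, B, C for the embedded
variables are illustrative):

    #include <stdio.h>
    #define N 4

    int main(void) {
        int a[N][N], b[N][N];                          /* input matrices */
        int c[N][N] = {{0}};                           /* result of original code */
        static int A[N][N][N], B[N][N][N], C[N][N][N]; /* embedded variables */

        for (int i = 0; i < N; i++)                    /* toy inputs: b = identity */
            for (int j = 0; j < N; j++) {
                a[i][j] = i * N + j;
                b[i][j] = (i == j);
            }

        /* Original C code: global reads of a[i][k] and b[k][j]
         * (affine dependencies, i.e. broadcasts). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    c[i][j] += a[i][k] * b[k][j];

        /* After embedding and localization: every variable is handed on to a
         * neighbouring iteration, so all data dependence vectors are constant:
         * (0,1,0) for A, (1,0,0) for B, (0,0,1) for C. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++) {
                    A[i][j][k] = (j == 0) ? a[i][k] : A[i][j-1][k];
                    B[i][j][k] = (i == 0) ? b[k][j] : B[i-1][j][k];
                    C[i][j][k] = ((k == 0) ? 0 : C[i][j][k-1])
                                 + A[i][j][k] * B[i][j][k];
                }

        for (int i = 0; i < N; i++)                    /* both versions agree */
            for (int j = 0; j < N; j++)
                if (c[i][j] != C[i][j][N-1]) {
                    printf("mismatch\n");
                    return 1;
                }
        printf("ok\n");
        return 0;
    }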
Case Study (cont.)
DG of the transformed matrix multiplication algorithm (N = 2) on a
4 × 4 processor array, and the corresponding reduced dependence graph
Case Study (cont.)
Output data
• Ox denotes the output-variable space of a variable x of the
space-time mapped or partitioned index space
• the output can be two-dimensional
• the transformed output variables are distributed over the entire
array
• collect the data from one processor’s line PL and feed them out to
an array border
• with a row vector m ∈ ℤ1×n from the space-time mapping, Tx(Pi,j)
denotes the set of time instances t at which the variable x produces
an output at processor element Pi,j
Case Study (cont.)
• if one of the following conditions holds, output data can be
serialized
Case Study (cont.)
Partitioned implementation of the matrix multiplication algorithm
a) dataflow graph of the LPGS-partitioned matrix multiplication,
4 × 4 example
b) dataflow graph after performing localization inside each tile
c) array implementation of the partitioned example
Results
Both implementations (full-size and partitioned) show
optimal utilization of resources
Each configured MAC-unit performs one operation per
cycle
It is observed that a better implementation can achieve more
performance per cycle while using fewer resources
The number of ALUs is reduced from 3N to N
Merging and writing of output data streams is overlapped
with computations in PEs
Conclusions and Future Work
The mapping methodology based on loop parallelization
in the polytope model provides results that are efficient in
terms of utilization of resources and execution time
Future work focuses on the automatic compilation of nested
loop programs