Mapping of Regular Nested Loop Programs to
Coarse-grained Reconfigurable Arrays – Constraints and Methodology
F. Hannig, H. Dutta, and J. Teich
University of Erlangen-Nuremberg, Germany
Proceedings of the 18th International Parallel and Distributed Processing Symposium (IPDPS'04)
Presented by: Luis Ortiz
Department of Computer Science
The University of Texas at San Antonio
Outline
■ Overview
■ The Problem
■ Reconfigurable Architectures
■ Design Flow for Regular Mapping
■ Parallelizing Transformations
■ Constraints Related to CG Reconfigurable Arrays
■ Case Study
■ Results
■ Conclusions and Future Work
Overview
■ Constructing a parallel program is equivalent to specifying its execution order
• the operations of a program form a set, and its execution order is a binary, transitive, and asymmetric relation
• the relevant sets are (unions of) Z-polytopes
• most of the optimizations can be presented as transformations of the original program
■ The problem of automatic parallelization
• given a set of operations E and a strict total order on it
• find a partial order on E such that executing E under it is determinate and gives the same results as the original program
Overview (cont.)
■ Defining a polyhedron
• a set of linear inequalities: Ax + a ≥ 0
• the polyhedron is the set of all x that satisfy these inequalities (a membership test is sketched below)
• the basic property of a polyhedron is convexity: if two points a and b belong to a polyhedron, then so do all convex combinations λa + (1 − λ)b, 0 ≤ λ ≤ 1
• a bounded polyhedron is called a polytope
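As a concrete illustration (not part of the original slides), a minimal C sketch that tests whether an integer point x satisfies Ax + a ≥ 0; the matrix A, vector a, and the 4 x 4 square domain are made-up example values:

#include <stdio.h>
#include <stdbool.h>

#define ROWS 4 /* number of inequalities */
#define DIM  2 /* dimension of x */

/* Tests whether A*x + a >= 0 holds component-wise, i.e. whether x
   lies inside the polyhedron. */
static bool in_polyhedron(const int A[ROWS][DIM], const int a[ROWS],
                          const int x[DIM]) {
    for (int r = 0; r < ROWS; r++) {
        int s = a[r];
        for (int c = 0; c < DIM; c++)
            s += A[r][c] * x[c];
        if (s < 0)
            return false;
    }
    return true;
}

int main(void) {
    /* Example polytope: the square 0 <= x1, x2 <= 3, written as
       x1 >= 0, x2 >= 0, -x1 + 3 >= 0, -x2 + 3 >= 0. */
    const int A[ROWS][DIM] = {{1, 0}, {0, 1}, {-1, 0}, {0, -1}};
    const int a[ROWS] = {0, 0, 3, 3};
    const int x[DIM] = {2, 3};
    printf("%s\n", in_polyhedron(A, a, x) ? "inside" : "outside");
    return 0;
}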
Overview (cont.)
■ The essence of the polytope model is to apply affine transformations to the iteration spaces of a program
• the iteration domain of statement S (a worked example follows):
Dom(S) = {x | D_S x + d_S ≥ 0}
• D_S and d_S are the matrix and constant vector that define the iteration polytope; d_S may depend linearly on the structure parameters
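As an illustrative example (not from the slides), consider the triangular loop nest for (i = 0; i <= N-1; i++) for (j = 0; j <= i; j++) S;. Its iteration domain is

\mathrm{Dom}(S) = \left\{ x = \begin{pmatrix} i \\ j \end{pmatrix} \;\middle|\; \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \\ -1 & 0 \end{pmatrix}}_{D_S} \begin{pmatrix} i \\ j \end{pmatrix} + \underbrace{\begin{pmatrix} 0 \\ 0 \\ 0 \\ N-1 \end{pmatrix}}_{d_S} \ge 0 \right\}

The four rows encode i ≥ 0, j ≥ 0, i ≥ j, and i ≤ N − 1; note that d_S depends linearly on the structure parameter N, as stated above.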
Overview (cont.)
■ Coarse-grained reconfigurable architectures
• provide the flexibility of software combined with the performance of hardware
• but hardware complexity is a problem due to a lack of mapping tools
■ Parallelization techniques and compilers
• map computationally intensive algorithms efficiently to coarse-grained reconfigurable arrays
The Problem
“Mapping a certain class of regular nested loop programs onto a dedicated processor array”
Reconfigurable Architectures
■ Span a wide range of abstraction levels
• from fine-grained Look-Up Table (LUT) based reconfigurable logic devices to distributed and hierarchical systems with heterogeneous reconfigurable components
■ Efficiency comparison
• standard arithmetic is less efficient on fine-grained architectures due to the large routing area overhead
■ Little research work deals with compilation to coarse-grained reconfigurable architectures
Design Flow for Regular Mapping
Design Flow for Regular Mapping (cont.)
■ A piecewise regular algorithm contains N quantified equations S_1[I], …, S_N[I]
• each equation S_i[I] is of the form (a worked instance follows)
x_i[I] = f_i( …, x_j[I − d_{ji}], … )  ∀ I ∈ I_i
• x_i[I] are indexed variables
• f_i are arbitrary functions
• d_{ji} ∈ ℤ^n are constant data dependence vectors, and the "…" denote similar arguments
• I_i are called index spaces
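As an illustration that anticipates the case study (the variable names are my assumptions, not verbatim from the slides), the accumulation in matrix multiplication can be written as the quantified equation

z[I] = z[I - d] + a[I] \cdot b[I], \quad \forall\, I = (i, j, k)^T \in \mathcal{I}, \qquad d = (0, 0, 1)^T

so the partial sums z carry a constant data dependence along the k direction.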
Design Flow for Regular Mapping (cont.)
■ Linearly bounded lattice
• an index space of the form I = {I ∈ ℤ^n | I = Mκ + c ∧ Aκ ≥ b}
• with κ ∈ ℤ^l, M ∈ ℤ^{n×l}, c ∈ ℤ^n, A ∈ ℤ^{m×l}, and b ∈ ℤ^m
• the set {κ ∈ ℤ^l | Aκ ≥ b} defines an integral polyhedron
• this set is affinely mapped onto iteration vectors I using the affine transformation I = Mκ + c (an example follows)
■ Block pipelining period
• the time interval between the initiations of two successive problem instances (β)
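For example (illustrative, not from the slides), a dense rectangular N x N index space is the special case of a linearly bounded lattice with M the identity matrix and c = 0:

\mathcal{I} = \{ I \in \mathbb{Z}^2 \mid I = \kappa \ \wedge\ 0 \le \kappa_1 \le N-1 \ \wedge\ 0 \le \kappa_2 \le N-1 \}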
Parallelizing Transformations
■ Based on the representation of equations and index spaces, several combinations of parallelizing transformations in the polytope model can be applied:
• Affine Transformations
• Localization
• Operator Splitting
• Exploration of Space-Time Mappings
• Partitioning
• Control Generation
• HDL Generation & Synthesis
Constraints Related to CG Reconfigurable Arrays
■ Coarse-grained (re)configurable architectures consist of an array of processor elements (PEs), where each PE contains (a parameter sketch follows this list)
• one or more dedicated functional units or one or more arithmetic logic units (ALUs)
• memory
  • local memory → register files
  • memory banks
  • an instruction memory is required if the PE contains an instruction-programmable ALU
• interconnect structures
• I/O ports
• synchronization and reconfiguration mechanisms
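As a concrete (hypothetical) reading of these constraints, a C sketch of a PE resource description that a mapping tool would have to respect; every field name and example value is an illustrative assumption, not taken from the paper:

#include <stdbool.h>

/* Hypothetical resource description of one coarse-grained PE;
   all fields and the example values are illustrative assumptions. */
typedef struct {
    int num_alus;            /* instruction-programmable ALUs */
    int num_func_units;      /* dedicated functional units */
    int register_file_size;  /* local memory: registers per PE */
    int memory_bank_words;   /* words per attached memory bank */
    bool has_instr_memory;   /* required iff the ALU is instruction programmable */
    int num_io_ports;        /* I/O ports toward neighbors and array borders */
} pe_constraints_t;

/* Example: a PE loosely resembling a 24-bit ALU-PAE (values assumed). */
static const pe_constraints_t example_pe = {
    .num_alus = 1, .num_func_units = 2, .register_file_size = 8,
    .memory_bank_words = 512, .has_instr_memory = false, .num_io_ports = 4,
};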
Case Study
■ Regular mapping methodology applied to a matrix multiplication algorithm
• target architecture: PACT XPP64-A reconfigurable processor array
• 64 ALU-PAEs with 24-bit data width in an 8x8 array
• each ALU-PAE consists of three objects
  • an ALU object
  • a Back-Register object (BREG)
  • a Forward-Register object (FREG)
• all objects are connected to horizontal routing channels
Case Study (cont.)
• RAM-PAEs are located in two columns at the left and right borders of the array, with two ports for independent read/write operations
• the RAM can be configured to FIFO mode
• each RAM-PAE has a 512 x 24-bit storage capacity
• four independent I/O interfaces are located in the corners of the array
Case Study (cont.)
[Figure: structure of the PACT XPP64-A reconfigurable processor array and its ALU-PAE objects]
Case Study (cont.)
■ Matrix multiplication algorithm
• C = A · B, with A ∈ ℤ^{N×N} and B ∈ ℤ^{N×N}
• computations may be represented by a dependence graph (DG)
• dependence graphs can be represented in a reduced form
  • reduced dependence graph (RDG): to each edge e = (v_i, v_j) there is associated a dependence vector d_{ij} ∈ ℤ^n
• virtual processor elements (VPEs) are used to map the PEs obtained from the design flow to the given architecture
Case Study (cont.)
[Code listing: matrix multiplication algorithm, C code]
[Code listing: matrix multiplication algorithm after parallelization, operator splitting, embedding, and localization; a reconstruction follows]
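As a hedged reconstruction of the two listings (the single-assignment variable names a, b, z and the index ranges are my assumptions, not the paper's exact code), a minimal C sketch of the original loop nest and a localized single-assignment form:

#include <stdio.h>

#define N 4 /* problem size; an assumed value for illustration */

/* Original nested loop program: C = A * B. */
static void matmul(const int A[N][N], const int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[i][j] = 0;
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
        }
}

/* Sketch of the single-assignment form after embedding and localization:
   inputs are propagated along constant dependence vectors, and the
   partial sums z accumulate along k. */
static void matmul_localized(const int A[N][N], const int B[N][N], int C[N][N]) {
    static int a[N][N][N], b[N][N][N], z[N][N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                a[i][j][k] = (j == 0) ? A[i][k] : a[i][j - 1][k]; /* d = (0,1,0)^T */
                b[i][j][k] = (i == 0) ? B[k][j] : b[i - 1][j][k]; /* d = (1,0,0)^T */
                z[i][j][k] = ((k == 0) ? 0 : z[i][j][k - 1])      /* d = (0,0,1)^T */
                             + a[i][j][k] * b[i][j][k];
            }
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = z[i][j][N - 1];
}

int main(void) {
    int A[N][N], B[N][N], C1[N][N], C2[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = i - j; }
    matmul(A, B, C1);
    matmul_localized(A, B, C2);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C1[i][j] != C2[i][j]) { printf("mismatch\n"); return 1; }
    printf("both versions agree\n");
    return 0;
}

The localized version replaces global accesses to A and B by propagation along the constant dependence vectors annotated in the comments; it is this locality that makes a processor-array mapping possible.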
Case Study (cont.)
[Figure: DG of the transformed matrix multiplication algorithm for N = 2, the resulting 4 x 4 processor array, and the reduced dependence graph]
Case Study (cont.)
■ Output data
• O_x denotes the output-variable space of variable x of the space-time mapped or partitioned index space
• the output can be two-dimensional
• the transformed output variables are distributed over the entire array
• the data must be collected from one processor line PL and fed out to an array border
• with m ∈ ℤ^{1×n}, the time instances t ∈ T_x(P_{i,j}) are those at which variable x produces an output at processor element P_{i,j}
Case Study (cont.)
• if one of the following conditions holds, output data can be
serialized
Case Study (cont.)
■ Partitioned implementation of the matrix multiplication algorithm (an LPGS tiling sketch follows the panel captions)
a) dataflow graph of the LPGS-partitioned matrix multiplication 4 x 4 example
b) dataflow graph after performing localization inside each tile
c) array implementation of the partitioned example
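As a hedged illustration of LPGS partitioning (the loop structure, tile size T, and the name lpgs_schedule are my assumptions, not the paper's code), a C sketch of the schedule:

#define N 4 /* problem size (assumed) */
#define T 2 /* tile edge length = extent of the processor array (assumed) */

/* LPGS ("locally parallel, globally sequential") schedule sketch:
   the tile loops (it, jt) execute one tile after another, while every
   iteration inside a tile is assigned to its own PE and can run in
   parallel. */
void lpgs_schedule(void (*body)(int i, int j, int pe_row, int pe_col)) {
    for (int it = 0; it < N; it += T)         /* globally sequential */
        for (int jt = 0; jt < N; jt += T)
            for (int i = it; i < it + T; i++) /* locally parallel on PEs */
                for (int j = jt; j < jt + T; j++)
                    body(i, j, i - it, j - jt); /* iteration (i,j) on PE (i-it, j-jt) */
}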
Results
■ Both implementations (full-size and partitioned) show optimal utilization of resources
■ Each configured MAC unit performs one operation per cycle
■ Using fewer resources with a better implementation, more performance per cycle can be achieved
■ The number of ALUs is reduced from 3N to N
■ Merging and writing of output data streams is overlapped with computations in the PEs
Conclusions and Future Work
■ The mapping methodology based on loop parallelization in the polytope model yields results that are efficient in terms of resource utilization and execution time
■ Future work focuses on the automatic compilation of nested loop programs