Polyhedral-Based Data Reuse Optimization for Configurable Computing

Size: px

Start display at page:

Download "Polyhedral-Based Data Reuse Optimization for Configurable Computing"

Alan Joseph
6 years ago
Views:

1 Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February 12, 2013 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays Monterey, CA

2 Overview: Overview The current situation: Tremendous improvements on FPGA capacity/speed/energy But off-chip communications remains very costly, on-chip memory is scarce HLS/ESL tools have made great progresses (ex: AutoESL/Vivado) But still extensive manual effort needed for best performance Numerous previous research work on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and data reuse optimizations But (strong) limitations in applicability / transformations supported / performance achieved UCLA / OSU 2

3 Overview: Overview The current situation: Tremendous improvements on FPGA capacity/speed/energy But off-chip communications remains very costly, on-chip memory is scarce HLS/ESL tools have made great progresses (ex: AutoESL/Vivado) But still extensive manual effort needed for best performance Numerous previous research work on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and data reuse optimizations But (strong) limitations in applicability / transformations supported / performance achieved UCLA / OSU 2

4 Overview: Overview The current situation: Tremendous improvements on FPGA capacity/speed/energy But off-chip communications remains very costly, on-chip memory is scarce HLS/ESL tools have made great progresses (ex: AutoESL/Vivado) But still extensive manual effort needed for best performance Numerous previous research work on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and data reuse optimizations But (strong) limitations in applicability / transformations supported / performance achieved UCLA / OSU 2

5 Overview: Overview The current situation: Tremendous improvements on FPGA capacity/speed/energy But off-chip communications remains very costly, on-chip memory is scarce Our solution: automatic, resource-aware data reuse optimization framework (combining loop transformations, on-chip buffers, and communication generation) HLS/ESL tools have made great progresses (ex: AutoESL/Vivado) But still extensive manual effort needed for best performance Numerous previous research work on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and data reuse optimizations But (strong) limitations in applicability / transformations supported / performance achieved UCLA / OSU 2

6 Overview: Overview The current situation: Tremendous improvements on FPGA capacity/speed/energy But off-chip communications remains very costly, on-chip memory is scarce Our solution: automatic, resource-aware data reuse optimization framework (combining loop transformations, on-chip buffers, and communication generation) HLS/ESL tools have made great progresses (ex: AutoESL/Vivado) But still extensive manual effort needed for best performance Our solution: complete HLS-focused source-to-source compiler Numerous previous research work on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and data reuse optimizations But (strong) limitations in applicability / transformations supported / performance achieved UCLA / OSU 2

7 Overview: Overview The current situation: Tremendous improvements on FPGA capacity/speed/energy But off-chip communications remains very costly, on-chip memory is scarce Our solution: automatic, resource-aware data reuse optimization framework (combining loop transformations, on-chip buffers, and communication generation) HLS/ESL tools have made great progresses (ex: AutoESL/Vivado) But still extensive manual effort needed for best performance Our solution: complete HLS-focused source-to-source compiler Numerous previous research work on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and data reuse optimizations But (strong) limitations in applicability / transformations supported / performance achieved Our solution: unleash the true power of the polyhedral framework (loop transfo., comm. scheduling, etc.) UCLA / OSU 2

8 The Polyhedral Model: The Polyhedral Model in a Nutshell Affine program regions: Loops have affine control only (over-approximation otherwise) Image processing, including medical imaging pipeline (NSF CDSC project) Linear algebra Iterative solvers (PDE, etc.) UCLA / OSU 3

9 The Polyhedral Model: The Polyhedral Model in a Nutshell Affine program regions: Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra for (i=1; i<=n; ++i). for (j=1; j<=n; ++j).. if (i<=n-j+2)... s[i] =... D S1 = i j n 1 0 UCLA / OSU 3

10 The Polyhedral Model: The Polyhedral Model in a Nutshell Affine program regions: Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p for (i=0; i<n; ++i) {. s[i] = 0;. for (j=0; j<n; ++j).. s[i] = s[i]+a[i][j]*x[j]; } f s ( x S2 ) = [ ]. [ f a ( x S2 ) = ]. f x ( x S2 ) = [ ]. x S2 n 1 x S2 n 1 x S2 n 1 UCLA / OSU 3

11 The Polyhedral Model: The Polyhedral Model in a Nutshell Affine program regions: Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p Data dependence between S1 and S2: a subset of the Cartesian product of D S1 and D S2 (exact analysis) for (i=1; i<=3; ++i) {. s[i] = 0;. for (j=1; j<=3; ++j).. s[i] = s[i] + 1; } D S1δS2 : i S1. i S2 j S2 1 = 0 0 S1 iterations S2 iterations i UCLA / OSU 3

12 The Polyhedral Model: The Polyhedral Model in a Nutshell Affine program regions: Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p Data dependence between S1 and S2: a subset of the Cartesian product of D S1 and D S2 (exact analysis) Polyhedral compilation: Precise dataflow analysis [Feautrier,88] Optimal algorithms for data locality [Bondhugula,08] Effective code generation [Bastoul,04] Computationally expensive algorithms (ILP/PIP) UCLA / OSU 3

13 Data Reuse Optimization: Step 1: Scheduling for Better Data Reuse Main idea: schedule operations accessing the same data as close as possible from each other Tiling is useful, but not all programs are tilable by default! Need complex sequence of loop transformations to enable tiling The Tiling Hyperplane method automatically finds such sequence Uses an ILP for the optimization problem In our software, the first stage is to transform the input code so that: 1 The number of tilable "loops" is maximized 2 Temporal data locality is maximized 3 All tilable loops can be tiled with an arbitrary tile size UCLA / OSU 4

14 Data Reuse Optimization: Step 2: Reuse Data Using On-Chip Buffers Key ideas: Compute the set of data used at a given loop iteration Reuse data between consecutive loop iterations The process works for any loop in the program Natural complement of tiling: the tile size will determine how much data is read by a non-inner-loop iteration The polyhedral framework can be used to easily compute all this information, including what to communicate UCLA / OSU 5

15 Data Reuse Optimization: Computing the Per-Iteration Data Reuse j-2 j-1 j j+1 j+2 i+2 i+1 i i-1 i-2 // Two-dimensional Jacobi-like stencil for (t = 0; t < T; ++t) for (i = 0; i < N; ++i) for (j = 0; j < N; ++j) B[i][j] = 0.2*( A[i][j-1] + A[i][j] + A[i][j+1] + A[i-1][j] + A[i+1][j]); UCLA / OSU 6

16 Data Reuse Optimization: Computing the Per-Iteration Data Reuse j-2 j-1 j j+1 j+2 i+2 i+1 i Compute the data space of A, at iteration x = (t,i,j) DS A ( x) = s SFS s A ( x) i-1 i-2 F( x) is the image of x by the function F. UCLA / OSU 7

17 Data Reuse Optimization: Computing the Per-Iteration Data Reuse j-2 j-1 j j+1 j+2 i+2 i+1 i Compute the data space of A, at iteration y = (t,i,j 1) DS A ( y) = s SFS s A ( y) i-1 i-2 UCLA / OSU 7

18 Data Reuse Optimization: Computing the Per-Iteration Data Reuse j-2 j-1 j j+1 j+2 i+2 i+1 Reused data: red set i i-1 ReuseSet = DS A ( x) DS A ( y) i-2 UCLA / OSU 7

19 Data Reuse Optimization: Computing the Per-Iteration Data Reuse j-2 j-1 j j+1 j+2 i+2 i+1 Per-iteration communication: blue set i i-1 PerCommSet = DS B ( x) ReuseSet i-2 UCLA / OSU 7

20 Data Reuse Optimization: Computing the Per-Iteration Data Reuse j-2 j-1 j j+1 j+2 i+2 i+1 i i-1 i-2 These sets are parametric polyhedral sets Use CLooG to scan them Work for any value of t,i,j an initialization copy is executed before the first iteration of the loop, and communications are done at each iteration UCLA / OSU 7

21 Data Reuse Optimization: Computing the Per-Iteration Data Reuse j-2 j-1 j j+1 j+2 i+2 i+1 i i-1 Buffer set: full blue set (data space at (t,i,j)) i-2 UCLA / OSU 7

22 Data Reuse Optimization: Quick Overview of the Full Algorithm 1 For each array and each loop, compute: the buffer polyhedron the per-iteration communication polyhedron 2 For a given array, find the loop which minimizes communication volume with a buffer fitting the FPGA resource 3 Make the entire program use on-chip arrays (buffers) Example: A[i][j] = A[i][j+1] becomes for a buffer A_l[bs1][bs2]: A_l[i % bs1][j % bs2] = A_l[i % bs1][(j+1) % bs2] 4 Insert the codes scanning the polyhedral sets in the program Example of copy-in statement: A_l[i % bs1][j % bs2] = A[i][j]; UCLA / OSU 8

23 High-Level Synthesis: Step 3: HLS-specific Optimizations For good performance, numerous complementary optimizations needed Reduce the II of inner loops by forcing inner-most parallel loops Use polyhedral-based parallelization methods Exhibit usable task-level parallelism Use polyhedral-based analysis, and factor the tasks in functions Overlap communication and computation Use FIFO commmunication modules, and scan polyhedral commmunication sets also in prefetch functions to issue requests Find the best tile size / shape for a program Create a machine-specific accurate communication latency model Run AutoESL on a variety of tile sizes, retain the best one UCLA / OSU 9

24 Total communic 1e+07 Experimental Results: 5e e+06 4e e+06 1e+071.5e+072e+072.5e+073e+073.5e+074e+074.5e+07 Total communication time (in cycles) Performance Results Total communic 2e+06 5e+06 1e e+07 2e e+07 3e+07 Total communication time (in cycles) Figure 5: Communication time vs. Communication volume Total communic 1e+07 5e e+07 4e+07 6e+07 8e+07 1e e+081.4e+081.6e+08 Total communication time (in cycles) Total BRAMs (in 16kB blocks) Denoise: Pareto-optimal e+08 2e+08 3e+08 4e+08 5e+08 6e+08 7e+08 Total execution time (in cycles) Total BRAMs (in 16kB blocks) Segmentation: Pareto-optimal e e+09 2e e+09 3e e+09 4e e+09 Total execution time (in cycles) Total BRAMs (in 16kB blocks) DGEMM: Pareto-optimal e e+07 2e e e e e e e e e+07 Total execution time (in cycles) Figure 6: Total time vs. On-Chip Buffer Size Requirement, Pareto-optimal points E reports the performance achieved with our automated framework. Those are AutoESL results obtained with our fast DSE framework. Hand-tuned reports the performance of a manually hand-tuned version serving as our performance reference, from Cong et al. [17]. It has been designed through time-consuming source code level man Complete Results tomatically generated code. On the other hand, the reference manual implementation basic off-chip uses numerous PolyOpt techniques nothand-tuned implemented in[17] for each tested benchmark. We report #PEs the number of replica- our automatic framework, such as in-register data reuse, fine-grain Table Benchmark 2 summarizes the best version found Description by our framework, tionsdenoise of the full computation 3Dwe Jacobi+Seidel-like have been able to place 7-point on a single stencilscommunication 0.02 GF/s pipelining, and4.58 algorithmic GF/smodifications 52.0 leading GF/s to Virtex-6 FPGA as in the Convey HC-1, showing the level of coarsegrain parallelization we have achieved. BRAM and LUT are totals For segmentation, 0.05 GF/swe outperform 24.91the GF/s manual design, despite GF/s the near-optimal performance for this version. segmentation 3D Jacobi-like 7-point stencils for the DGEMM set of PEs placed on the chip. matrix-multiplication clear remaining 0.04 GF/s room for improvement GF/s our framework still N/A has, as shown by the denoise number. We mention that semi-automated GEMVER sequence of matrix-vector 0.10 GF/s 1.07 GF/s N/A Table 2: Characteristics of Best Found Versions manual design can be performed on top of our framework, to address Benchmark tile size #PEs #BRAM #LUT optimizations we do not support, such as array partitioning. denoise segmentation DGEMM Table 3: Side-by-side comparison [17] GF/s Benchmark out-of-the-box PolyOpt/HLS-E hand-tuned GEMVER Convey HC-1 (4 Xilinx Virtex-6 FPGAs), total denoise bandwidth 0.02 GF/s up to4.5880gb/s GF/s 52.0 segmentation 0.05 GF/s GF/s GF/s Table 3 reports the performance, in GigaFlop per second, of numerous AutoESL different implementations versionof , the same benchmark. use memory/control out-of- gemver interfaces 0.10 GF/s provided 1.07 GF/s by Convey N/A dgemm 0.04 GF/s GF/s N/A the-box reports the performance of a basic manual off-chip-only implementation ofthebenchmark, without ourframework. PolyOpt/HLS- Finally Table 4 compares the latency as reported by AutoESL us- Core design frequency: 150MHz, off-chip memory frequency: 300HMz ing our memory latency framework for fast DSE, against the wallclock time observed on the machine after full synthesis of the generated RTL. We report the performance of a single PE call executing UCLA / OSU a subset (slice) of the full computation. 10

Software Infrastructure: PolyOpt/HLS Input full C program Parser C-to-AST PolyParser AST-to-polyhedra

for inner-parallel HLS-friendly C program Unparser AST-to-C Outliner restructure code for HLS PolyUnparser

generation C code Sage AST (ROSE) SCoP (polyhedral rep.

25 Software Infrastructure: PolyOpt/HLS Input full C program Parser C-to-AST PolyParser AST-to-polyhedra Candl dependence analysis PIPLib Pluto Transfo. for tilability vectorizer Transfo. for inner-parallel HLS-friendly C program Unparser AST-to-C Outliner restructure code for HLS PolyUnparser PAST-to-AST CLooG Polyhedra-to- PAST ISL LMP buffer and comm. generation C code Sage AST (ROSE) SCoP (polyhedral rep.) PAST (Polyhedral AST) PoCC, the Polyhedral Compiler Collection PolyOpt, a Polyhedral Optimizer for the ROSE compiler ROSE compiler infrastructure (LLNL) More at UCLA / OSU 11

26 Conclusion: Conclusions Take-home message: Affine programs are an excellent fit for FPGA/HLS Recent progresses in HLS tools let compiler researchers target FPGA optimization Complete, end-to-end framework implemented and effectiveness demonstrated Future work: Use analytical models for tile size selection Improve further the performance with additional optimizations Support more machines/fpgas (currently: developed for Convey HC-1) Improve polyhedral code generation for HLS/FPGAs UCLA / OSU 12

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet, 1 Peng Zhang, 1 P. Sadayappan, 2 Jason Cong 1 1 University of California, Los Angeles {pouchet,pengzh,cong}@cs.ucla.edu