Polyhedral-Based Data Reuse Optimization for Configurable Computing
Louis-Noël Pouchet (1), Peng Zhang (1), P. Sadayappan (2), Jason Cong (1)
(1) University of California, Los Angeles; (2) The Ohio State University
February 12, 2013, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA

Overview
The current situation:
- FPGA capacity, speed, and energy efficiency have improved tremendously, but off-chip communication remains very costly and on-chip memory is scarce.
  Our solution: an automatic, resource-aware data reuse optimization framework (combining loop transformations, on-chip buffers, and communication generation).
- HLS/ESL tools have made great progress (e.g., AutoESL/Vivado), but extensive manual effort is still needed for best performance.
  Our solution: a complete HLS-focused source-to-source compiler.
- There is a large body of previous research on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and on data reuse optimizations, but with strong limitations in applicability, transformations supported, and performance achieved.
  Our solution: unleash the true power of the polyhedral framework (loop transformations, communication scheduling, etc.).

The Polyhedral Model in a Nutshell
- Affine program regions: loops have affine control only (over-approximation otherwise)
  - Image processing, including a medical imaging pipeline (NSF CDSC project)
  - Linear algebra
  - Iterative solvers (PDE, etc.)
- Iteration domain: represented as integer polyhedra

    for (i=1; i<=n; ++i)
      for (j=1; j<=n; ++j)
        if (i<=n-j+2)
          s[i] = ...

  $$\mathcal{D}_{S1} = \left\{ (i,j) \;\middle|\; \begin{bmatrix} 1 & 0 & 0 & -1 \\ -1 & 0 & 1 & 0 \\ 0 & 1 & 0 & -1 \\ 0 & -1 & 1 & 0 \\ -1 & -1 & 1 & 2 \end{bmatrix} \cdot \begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \ge \vec{0} \right\}$$
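As a rough illustration of what "iteration domain as an integer polyhedron" means for the loop nest above, the sketch below encodes the loop bounds and the guard as rows of a constraint matrix over (i, j, n, 1) and checks that the integer points satisfying all rows are exactly the iterations the loops execute. The matrix is derived here from the code on the slide; the function names are hypothetical.

```c
#include <assert.h>

/* Constraint matrix for the domain of S1, columns ordered (i, j, n, 1).
   Each row r encodes the affine inequality r . (i, j, n, 1)^T >= 0.
   Derived from the loop bounds and the guard of the example nest. */
static const int D_S1[5][4] = {
    { 1,  0, 0, -1},  /* i >= 1         */
    {-1,  0, 1,  0},  /* i <= n         */
    { 0,  1, 0, -1},  /* j >= 1         */
    { 0, -1, 1,  0},  /* j <= n         */
    {-1, -1, 1,  2},  /* i + j <= n + 2 */
};

/* Returns 1 if (i, j) satisfies every row of the constraint matrix. */
int in_domain_S1(int i, int j, int n) {
    for (int r = 0; r < 5; ++r) {
        int v = D_S1[r][0] * i + D_S1[r][1] * j + D_S1[r][2] * n + D_S1[r][3];
        if (v < 0)
            return 0;
    }
    return 1;
}

/* Counts the statement instances actually executed by the loop nest. */
int count_executed_S1(int n) {
    int count = 0;
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            if (i <= n - j + 2)
                ++count;
    return count;
}

/* Counts the integer points of the polyhedron by scanning a bounding box. */
int count_domain_S1(int n) {
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i <= n + 1; ++i)
        for (int j = 0; j <= n + 1; ++j)
            count += in_domain_S1(i, j, n);
    return count;
}
```

For any n, `count_domain_S1(n)` and `count_executed_S1(n)` should agree, which is the sense in which the polyhedron represents the loop nest.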
- Memory accesses: static references, represented as affine functions of $\vec{x}_S$ and $\vec{p}$

    for (i=0; i<n; ++i) {
      s[i] = 0;
      for (j=0; j<n; ++j)
        s[i] = s[i]+a[i][j]*x[j];
    }

  With $\vec{x}_{S2} = (i, j)$:
  $$f_s(\vec{x}_{S2}) = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} \cdot \begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix}$$
  $$f_a(\vec{x}_{S2}) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \cdot \begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix}$$
  $$f_x(\vec{x}_{S2}) = \begin{bmatrix} 0 & 1 & 0 & 0 \end{bmatrix} \cdot \begin{pmatrix} \vec{x}_{S2} \\ n \\ 1 \end{pmatrix}$$
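To make the affine access functions concrete, here is a small sketch that applies an access matrix to the vector (i, j, n, 1) and returns the array subscripts touched by statement S2 of the matrix-vector example. The matrices follow from the subscripts s[i], a[i][j] and x[j] in the code; `apply_access` and the array names are hypothetical helpers for illustration.

```c
#include <assert.h>

/* Access matrices for S2: s[i] = s[i] + a[i][j] * x[j].
   Columns are ordered (i, j, n, 1); one row per array dimension. */
static const int F_s[1][4] = { {1, 0, 0, 0} };                 /* s[i]    */
static const int F_a[2][4] = { {1, 0, 0, 0}, {0, 1, 0, 0} };   /* a[i][j] */
static const int F_x[1][4] = { {0, 1, 0, 0} };                 /* x[j]    */

/* Computes out = F . (i, j, n, 1)^T, i.e. the subscripts accessed at (i, j). */
void apply_access(int rows, const int F[][4], int i, int j, int n, int out[]) {
    assert(rows > 0);
    for (int r = 0; r < rows; ++r)
        out[r] = F[r][0] * i + F[r][1] * j + F[r][2] * n + F[r][3];
}
```

Calling `apply_access(2, F_a, 3, 5, 10, out)` fills `out` with the subscripts of `a[3][5]`; the parameter n has a zero coefficient in all three functions because none of the subscripts depends on it.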
- Data dependence between S1 and S2: a subset of the Cartesian product of $\mathcal{D}_{S1}$ and $\mathcal{D}_{S2}$ (exact analysis)

    for (i=1; i<=3; ++i) {
      s[i] = 0;
      for (j=1; j<=3; ++j)
        s[i] = s[i] + 1;
    }

  $$\mathcal{D}_{S1 \delta S2} = \left\{ (i_{S1}, i_{S2}, j_{S2}) \;\middle|\; i_{S1} - i_{S2} = 0,\ 1 \le i_{S1} \le 3,\ 1 \le i_{S2} \le 3,\ 1 \le j_{S2} \le 3 \right\}$$
  [Figure: S1 iterations and S2 iterations plotted along i, linked by the dependence]
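The dependence relation for this small example can be cross-checked by brute force. The sketch below, which is only an illustration and not how Candl or any polyhedral tool works, records every write by S1 and every read by S2 with a timestamp, pairs those that touch the same cell of `s` in write-before-read order, and compares the count with the number of integer points satisfying the constraints i_S1 = i_S2 and 1 <= i_S1, i_S2, j_S2 <= 3. All names are hypothetical.

```c
#include <assert.h>

enum { LO = 1, HI = 3, EXTENT = HI - LO + 1 };

/* Brute force: run the loop nest, log accesses to s, and count
   (S1 write, S2 read) pairs on the same cell with the write first. */
int count_dependence_pairs(void) {
    int w_addr[EXTENT], w_time[EXTENT];
    int r_addr[EXTENT * EXTENT], r_time[EXTENT * EXTENT];
    int nw = 0, nr = 0, tick = 0;

    for (int i = LO; i <= HI; ++i) {
        w_addr[nw] = i; w_time[nw] = tick++; ++nw;       /* S1: s[i] = 0        */
        for (int j = LO; j <= HI; ++j) {
            r_addr[nr] = i; r_time[nr] = tick++; ++nr;   /* S2: reads s[i]      */
        }
    }
    assert(nw == EXTENT && nr == EXTENT * EXTENT);

    int pairs = 0;
    for (int a = 0; a < nw; ++a)
        for (int b = 0; b < nr; ++b)
            if (w_addr[a] == r_addr[b] && w_time[a] < r_time[b])
                ++pairs;
    return pairs;
}

/* Counts integer points (i1, i2, j2) satisfying the dependence constraints. */
int count_polyhedron_points(void) {
    int points = 0;
    for (int i1 = LO - 1; i1 <= HI + 1; ++i1)
        for (int i2 = LO - 1; i2 <= HI + 1; ++i2)
            for (int j2 = LO - 1; j2 <= HI + 1; ++j2)
                if (i1 - i2 == 0 &&
                    LO <= i1 && i1 <= HI &&
                    LO <= i2 && i2 <= HI &&
                    LO <= j2 && j2 <= HI)
                    ++points;
    return points;
}
```

Both functions return 9 for this example: each of the three S1 instances is the source of a dependence to the three S2 instances with the same i.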
- Polyhedral compilation:
  - Precise dataflow analysis [Feautrier,88]
  - Optimal algorithms for data locality [Bondhugula,08]
  - Effective code generation [Bastoul,04]
  - Computationally expensive algorithms (ILP/PIP)

Data Reuse Optimization: Step 1: Scheduling for Better Data Reuse
Main idea: schedule operations accessing the same data as close as possible to each other.
- Tiling is useful, but not all programs are tilable by default!
- A complex sequence of loop transformations may be needed to enable tiling
- The Tiling Hyperplane method automatically finds such a sequence
- It formulates the optimization problem as an ILP
In our software, the first stage is to transform the input code so that:
1. The number of tilable "loops" is maximized
2. Temporal data locality is maximized
3. All tilable loops can be tiled with an arbitrary tile size
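To show what "tiled with an arbitrary tile size" looks like in code, here is a hand-written sketch of rectangular tiling applied to matrix multiplication, a kernel that is tilable without any enabling transformation. It is not the output of the Tiling Hyperplane method or of Pluto; the function names and the row-major layout are assumptions made for the example. The `min_int` bounds handle tile sizes that do not divide n.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference kernel: C += A * B on row-major n x n matrices. */
void matmul_ref(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Same computation with all three loops tiled by ts.
   The outer loops enumerate tiles; the inner loops scan one tile,
   so the data touched between two reuses is bounded by the tile size. */
void matmul_tiled(int n, int ts, const double *A, const double *B, double *C) {
    assert(ts > 0);
    for (int it = 0; it < n; it += ts)
        for (int jt = 0; jt < n; jt += ts)
            for (int kt = 0; kt < n; kt += ts)
                for (int i = it; i < min_int(it + ts, n); ++i)
                    for (int j = jt; j < min_int(jt + ts, n); ++j)
                        for (int k = kt; k < min_int(kt + ts, n); ++k)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

For a fixed (i, j), the tiled version still accumulates over k in increasing order, so both versions produce the same values; only the order in which different (i, j) cells are visited changes.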

Data Reuse Optimization: Step 2: Reuse Data Using On-Chip Buffers
Key ideas:
- Compute the set of data used at a given loop iteration
- Reuse data between consecutive loop iterations
- The process works for any loop in the program
- Natural complement of tiling: the tile size determines how much data is read by an iteration of a non-innermost loop
- The polyhedral framework can be used to easily compute all this information, including what to communicate

Data Reuse Optimization: Computing the Per-Iteration Data Reuse
[Figure: grid of cells of array A, columns j-2 to j+2, rows i-2 to i+2, with the sets below highlighted]

    // Two-dimensional Jacobi-like stencil
    for (t = 0; t < T; ++t)
      for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
          B[i][j] = 0.2*( A[i][j-1] + A[i][j] + A[i][j+1] + A[i-1][j] + A[i+1][j]);

- Compute the data space of A at iteration $\vec{x} = (t,i,j)$: $DS_A(\vec{x}) = \bigcup_{s \in S} F_{S_s}^{A}(\vec{x})$, where $F(\vec{x})$ is the image of $\vec{x}$ by the function $F$.
- Compute the data space of A at iteration $\vec{y} = (t,i,j-1)$: $DS_A(\vec{y}) = \bigcup_{s \in S} F_{S_s}^{A}(\vec{y})$
- Reused data (red set): $\mathit{ReuseSet} = DS_A(\vec{x}) \cap DS_A(\vec{y})$
- Per-iteration communication (blue set): $\mathit{PerCommSet} = DS_A(\vec{x}) \setminus \mathit{ReuseSet}$
- These sets are parametric polyhedral sets:
  - Use CLooG to scan them
  - They are valid for any value of t, i, j
  - An initialization copy is executed before the first iteration of the loop, and communications are done at each iteration
- Buffer set: full blue set (data space at (t,i,j))
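The set operations above can be tried out on the five-point stencil with explicit point sets instead of parametric polyhedra. The sketch below is a simplification for illustration only: it enumerates the cells of A read at iteration (i, j), intersects them with the cells read at the previous iteration (i, j-1) of the j loop to obtain the reused cells, and reports the remaining cells as the per-iteration communication. The `cell` type and function names are hypothetical.

```c
#include <assert.h>

typedef struct { int i, j; } cell;

/* Offsets of the five reads of A in the stencil body:
   A[i][j-1], A[i][j], A[i][j+1], A[i-1][j], A[i+1][j]. */
static const cell STENCIL_READS[5] = { {0, -1}, {0, 0}, {0, 1}, {-1, 0}, {1, 0} };

/* Data space of A at iteration (i, j): the five cells read there. */
void data_space_A(int i, int j, cell out[5]) {
    for (int s = 0; s < 5; ++s) {
        out[s].i = i + STENCIL_READS[s].i;
        out[s].j = j + STENCIL_READS[s].j;
    }
}

static int contains(const cell set[5], cell c) {
    for (int s = 0; s < 5; ++s)
        if (set[s].i == c.i && set[s].j == c.j)
            return 1;
    return 0;
}

/* Splits the data space at x = (i, j) against y = (i, j-1):
   cells also read at y go to reuse[], the others to comm[].
   Returns the number of reused cells and stores the number of
   communicated cells in *n_comm. */
int split_data_space(int i, int j, cell reuse[5], cell comm[5], int *n_comm) {
    assert(n_comm != 0);
    cell ds_x[5], ds_y[5];
    data_space_A(i, j, ds_x);
    data_space_A(i, j - 1, ds_y);

    int n_reuse = 0;
    *n_comm = 0;
    for (int s = 0; s < 5; ++s) {
        if (contains(ds_y, ds_x[s]))
            reuse[n_reuse++] = ds_x[s];
        else
            comm[(*n_comm)++] = ds_x[s];
    }
    return n_reuse;
}
```

At the innermost loop level, two of the five cells (A[i][j-1] and A[i][j]) were already read by the previous iteration, so three cells have to be brought in per iteration. Applying the same computation at an outer loop of a tiled nest is where the reuse becomes substantial.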

Data Reuse Optimization: Quick Overview of the Full Algorithm
1. For each array and each loop, compute:
   - the buffer polyhedron
   - the per-iteration communication polyhedron
2. For a given array, find the loop which minimizes communication volume with a buffer fitting the FPGA resource
3. Make the entire program use on-chip arrays (buffers)
   Example: A[i][j] = A[i][j+1] becomes, for a buffer A_l[bs1][bs2]:
       A_l[i % bs1][j % bs2] = A_l[i % bs1][(j+1) % bs2]
4. Insert the codes scanning the polyhedral sets in the program
   Example of copy-in statement:
       A_l[i % bs1][j % bs2] = A[i][j];
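The modulo-indexed buffer and the copy-in statement of steps 3 and 4 can be illustrated on one sweep of the five-point stencil. The sketch below is hand-written under simplifying assumptions (interior points only, buffering at the i loop, a buffer of three rows) and is not code generated by the framework. It performs an initialization copy before the loop, then copies in only row i+1 at each iteration, and computes from `A_l` alone. The names `GRID_N`, `BS1`, `BS2` and both functions are hypothetical.

```c
#include <assert.h>

enum { GRID_N = 8, BS1 = 3, BS2 = GRID_N };

/* Reference: one sweep over interior points, reading A directly. */
void stencil_direct(double A[GRID_N][GRID_N], double B[GRID_N][GRID_N]) {
    for (int i = 1; i < GRID_N - 1; ++i)
        for (int j = 1; j < GRID_N - 1; ++j)
            B[i][j] = 0.2 * (A[i][j-1] + A[i][j] + A[i][j+1] + A[i-1][j] + A[i+1][j]);
}

/* Same sweep, reading A only through the on-chip style buffer A_l.
   Rows i-1, i, i+1 map to three distinct slots modulo BS1, so the row
   copied in at iteration i overwrites row i-2, which is no longer needed. */
void stencil_buffered(double A[GRID_N][GRID_N], double B[GRID_N][GRID_N]) {
    double A_l[BS1][BS2];
    assert(BS1 >= 3 && BS2 >= GRID_N);

    /* Initialization copy, executed before the first iteration of the i loop. */
    for (int r = 0; r < 2; ++r)
        for (int j = 0; j < GRID_N; ++j)
            A_l[r % BS1][j % BS2] = A[r][j];

    for (int i = 1; i < GRID_N - 1; ++i) {
        /* Per-iteration communication: only row i+1 is new. */
        for (int j = 0; j < GRID_N; ++j)
            A_l[(i + 1) % BS1][j % BS2] = A[i + 1][j];

        for (int j = 1; j < GRID_N - 1; ++j)
            B[i][j] = 0.2 * (A_l[i % BS1][(j - 1) % BS2] + A_l[i % BS1][j % BS2]
                           + A_l[i % BS1][(j + 1) % BS2]
                           + A_l[(i - 1) % BS1][j % BS2] + A_l[(i + 1) % BS1][j % BS2]);
    }
}
```

With this choice, each element of A is read from the large array once per sweep instead of up to five times, at the cost of 3 x GRID_N buffer elements. A smaller or larger buffer at another loop level trades communication volume against on-chip memory in the same way.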

High-Level Synthesis: Step 3: HLS-specific Optimizations
For good performance, numerous complementary optimizations are needed:
- Reduce the initiation interval (II) of inner loops by forcing innermost loops to be parallel
  - Use polyhedral-based parallelization methods
- Expose usable task-level parallelism
  - Use polyhedral-based analysis, and factor the tasks into functions
- Overlap communication and computation
  - Use FIFO communication modules, and also scan the polyhedral communication sets in prefetch functions to issue requests
- Find the best tile size / shape for a program
  - Create a machine-specific, accurate communication latency model
  - Run AutoESL on a variety of tile sizes, retain the best one
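The last item, trying a variety of tile sizes and retaining the best one, amounts to an exhaustive search over candidates. A minimal sketch of that search loop is given below. In the flow described here each candidate would be evaluated by an AutoESL run combined with the latency model; in the sketch a function pointer stands in for that evaluation, and `toy_cycles` is an invented cost function with no relation to any real measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical evaluation hook: returns an estimated cycle count
   for one tile size. ctx carries whatever the estimator needs. */
typedef long (*cost_fn)(int tile_size, void *ctx);

/* Evaluates every candidate and returns the one with the lowest
   estimated cost; on ties the earlier candidate is kept. */
int best_tile_size(const int *candidates, size_t n, cost_fn estimate, void *ctx) {
    assert(candidates != NULL && n > 0 && estimate != NULL);
    int best = candidates[0];
    long best_cost = estimate(best, ctx);
    for (size_t k = 1; k < n; ++k) {
        long cost = estimate(candidates[k], ctx);
        if (cost < best_cost) {
            best_cost = cost;
            best = candidates[k];
        }
    }
    return best;
}

/* Toy stand-in for an HLS latency report (an assumption, not a real model):
   a fixed overhead per tile plus a term that grows with the tile area. */
long toy_cycles(int tile_size, void *ctx) {
    long n = *(const long *)ctx;                      /* problem size */
    long tiles = (n + tile_size - 1) / tile_size;     /* tiles per dimension */
    return tiles * 100 + (long)tile_size * tile_size;
}
```

With the toy model and a problem size of 64, the candidates 4, 8, 16 and 32 cost 1616, 864, 656 and 1224 respectively, so the search keeps 16. The structure stays the same when the estimator is replaced by a real synthesis report.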

Experimental Results: Performance Results
[Figure 5: Communication time vs. communication volume. One axis: total communication time (in cycles).]
[Figure 6: Total time vs. on-chip buffer size requirement, Pareto-optimal points. Panels: Denoise, Segmentation, DGEMM. Axes: total execution time (in cycles), total BRAMs (in 16kB blocks).]

Complete Results

Benchmark     Description                               basic off-chip   PolyOpt       hand-tuned [17]
denoise       3D Jacobi+Seidel-like 7-point stencils    0.02 GF/s        4.58 GF/s     52.0 GF/s
segmentation  3D Jacobi-like 7-point stencils           0.05 GF/s        24.91 GF/s    [illegible]
DGEMM         matrix-multiplication                     0.04 GF/s        [illegible]   N/A
GEMVER        sequence of matrix-vector                 0.10 GF/s        1.07 GF/s     N/A

- Convey HC-1 (4 Xilinx Virtex-6 FPGAs), total bandwidth up to 80 GB/s
- AutoESL version [illegible], using the memory/control interfaces provided by Convey
- Core design frequency: 150 MHz, off-chip memory frequency: 300 MHz

Text of the accompanying paper shown on the slide:
- Table 2 (Characteristics of Best Found Versions; columns: benchmark, tile size, #PEs, #BRAM, #LUT; values illegible) summarizes the best version found by our framework, for each tested benchmark. We report #PEs, the number of replications of the full computation we have been able to place on a single Virtex-6 FPGA as in the Convey HC-1, showing the level of coarse-grain parallelization we have achieved. BRAM and LUT are totals for the set of PEs placed on the chip.
- Table 3 (Side-by-side comparison) reports the performance, in GigaFlop per second, of numerous different implementations of the same benchmark. out-of-the-box reports the performance of a basic manual off-chip-only implementation of the benchmark, without our framework. PolyOpt/HLS-E reports the performance achieved with our automated framework. Those are AutoESL results obtained with our fast DSE framework. Hand-tuned reports the performance of a manually hand-tuned version serving as our performance reference, from Cong et al. [17].
- The reference manual implementation uses numerous techniques not implemented in our automatic framework, such as in-register data reuse, fine-grain communication pipelining, and algorithmic modifications leading to near-optimal performance for this version.
- For segmentation, we outperform the manual design, despite the clear remaining room for improvement our framework still has, as shown by the denoise number. We mention that semi-automated manual design can be performed on top of our framework, to address optimizations we do not support, such as array partitioning.
- Finally, Table 4 compares the latency as reported by AutoESL using our memory latency framework for fast DSE, against the wall-clock time observed on the machine after full synthesis of the generated RTL. We report the performance of a single PE call executing a subset (slice) of the full computation.
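Figure 6 plots Pareto-optimal design points for execution time against BRAM usage. As a reminder of what that selection means, the sketch below marks a point as Pareto-optimal when no other candidate is at least as good on both metrics and strictly better on one. The `design_point` type and function names are hypothetical, and the sketch says nothing about how the points in the figure were actually produced.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double cycles;  /* total execution time, lower is better   */
    int brams;      /* on-chip buffer blocks, lower is better  */
} design_point;

/* a dominates b if a is no worse on both metrics and better on at least one. */
static int dominates(design_point a, design_point b) {
    return a.cycles <= b.cycles && a.brams <= b.brams &&
           (a.cycles < b.cycles || a.brams < b.brams);
}

/* Sets keep[k] to 1 if pts[k] is Pareto-optimal, 0 otherwise.
   Returns the number of Pareto-optimal points. */
size_t pareto_mark(const design_point *pts, size_t n, int *keep) {
    assert(pts != NULL && keep != NULL);
    size_t count = 0;
    for (size_t a = 0; a < n; ++a) {
        keep[a] = 1;
        for (size_t b = 0; b < n; ++b) {
            if (b != a && dominates(pts[b], pts[a])) {
                keep[a] = 0;
                break;
            }
        }
        count += (size_t)keep[a];
    }
    return count;
}
```

Each surviving point represents a different trade-off: moving along the front buys shorter execution time with more on-chip memory, which is the choice the resource-aware buffer selection has to make.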

Software Infrastructure: PolyOpt/HLS
Modules (input: full C program; output: HLS-friendly C program):
- Parser: C-to-AST
- PolyParser: AST-to-polyhedra
- Candl: dependence analysis
- Pluto: transformation for tilability
- vectorizer: transformation for inner-parallel loops
- LMP: buffer and communication generation
- CLooG: polyhedra-to-PAST
- PolyUnparser: PAST-to-AST
- Outliner: restructure code for HLS
- Unparser: AST-to-C
Supporting libraries: PIPLib, ISL
Intermediate representations: C code, Sage AST (ROSE), SCoP (polyhedral representation), PAST (Polyhedral AST)
Built on:
- PoCC, the Polyhedral Compiler Collection
- PolyOpt, a Polyhedral Optimizer for the ROSE compiler
- ROSE compiler infrastructure (LLNL)

Conclusions
Take-home message:
- Affine programs are an excellent fit for FPGA/HLS
- Recent progress in HLS tools lets compiler researchers target FPGA optimization
- A complete, end-to-end framework has been implemented and its effectiveness demonstrated
Future work:
- Use analytical models for tile size selection
- Further improve performance with additional optimizations
- Support more machines/FPGAs (currently: developed for Convey HC-1)
- Improve polyhedral code generation for HLS/FPGAs
More informationPerformance Comparison Between Patus and Pluto Compilers on Stencils
Louisiana State University LSU Digital Commons LSU Master's Theses Graduate School 214 Performance Comparison Between Patus and Pluto Compilers on Stencils Pratik Prabhu Hanagodimath Louisiana State University
More informationA Novel Design Framework for the Design of Reconfigurable Systems based on NoCs
Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction
More informationBoosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search
Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid
More informationBehavioral Array Mapping into Multiport Memories Targeting Low Power 3
Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,
More informationPutting Automatic Polyhedral Compilation for GPGPU to Work
Putting Automatic Polyhedral Compilation for GPGPU to Work Soufiane Baghdadi 1, Armin Größlinger 2,1, and Albert Cohen 1 1 INRIA Saclay and LRI, Paris-Sud 11 University, France {soufiane.baghdadi,albert.cohen@inria.fr
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationSDSLc: a Multi-Target Domain-Specific Compiler for Stencil Computations
SDSLc: a Multi-Target Domain-Specific Compiler for Stencil Computations Prashant Rawat 1 Martin Kong 1 Tom Henretty 1 Justin Holewinski 1 Kevin Stock 1 Louis-Noël Pouchet 1 J. Ramanuam 2 Atanas Rountev
More informationIterative Refinement on FPGAs
Iterative Refinement on FPGAs Tennessee Advanced Computing Laboratory University of Tennessee JunKyu Lee July 19 th 2011 This work was partially supported by the National Science Foundation, grant NSF
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationFIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations
FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for oc Modeling in Full-System Simulations Michael K. Papamichael, James C. Hoe, Onur Mutlu papamix@cs.cmu.edu, jhoe@ece.cmu.edu, onur@cmu.edu
More informationCompiling Affine Loop Nests for Distributed-Memory Parallel Architectures
Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures Uday Bondhugula Indian Institute of Science Supercomputing 2013 Nov 16 22, 2013 Denver, Colorado 1/46 1 Introduction 2 Distributed-memory
More informationA Compiler Framework for Optimization of Affine Loop Nests for GPGPUs
A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs Muthu Manikandan Baskaran Department of Computer Science and Engg. The Ohio State University baskaran@cse.ohiostate.edu J. Ramanujam
More informationReconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization
Reconfigurable Hardware Implementation of Mesh Routing in the Number Field Sieve Factorization Sashisu Bajracharya, Deapesh Misra, Kris Gaj George Mason University Tarek El-Ghazawi The George Washington
More informationMapping-aware Logic Synthesis with Parallelized Stochastic Optimization
Mapping-aware Logic Synthesis with Parallelized Stochastic Optimization Zhiru Zhang School of ECE, Cornell University September 29, 2017 @ EPFL A Case Study on Digit Recognition bit6 popcount(bit49 digit)
More informationLocality Aware Concurrent Start for Stencil Applications
Locality Aware Concurrent Start for Stencil Applications Sunil Shrestha Guang R. Gao University of Delaware sunil@udel.edu, ggao@capsl.udel.edu Joseph Manzano Andres Marquez John Feo Pacific Nothwest National
More informationGenerating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation
More informationFCUDA: Enabling Efficient Compilation of CUDA Kernels onto
FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:
More informationAutomatic Polyhedral Optimization of Stencil Codes
Automatic Polyhedral Optimization of Stencil Codes ExaStencils 2014 Stefan Kronawitter Armin Größlinger Christian Lengauer 31.03.2014 The Need for Different Optimizations 3D 1st-grade Jacobi smoother Speedup
More informationA Configurable Multi-Ported Register File Architecture for Soft Processor Cores
A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box
More informationStatic and Dynamic Frequency Scaling on Multicore CPUs
Static and Dynamic Frequency Scaling on Multicore CPUs WENLEI BAO and CHANGWAN HONG, The Ohio State University SUDHEER CHUNDURI, IBM Research India SRIRAM KRISHNAMOORTHY, Pacific Northwest National Laboratory
More informationDesigning a Hardware in the Loop Wireless Digital Channel Emulator for Software Defined Radio
Designing a Hardware in the Loop Wireless Digital Channel Emulator for Software Defined Radio Janarbek Matai, Pingfan Meng, Lingjuan Wu, Brad Weals, and Ryan Kastner Department of Computer Science and
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationPolyhedral Operations. Algorithms needed for automation. Logistics
Polyhedral Operations Logistics Intermediate reports late deadline is Friday March 30 at midnight HW6 (posted) and HW7 (posted) due April 5 th Tuesday April 4 th, help session during class with Manaf,
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationMemory Partitioning and Scheduling Co-optimization in Behavioral Synthesis
Memory Partitioning and Scheduling Co-optimization in Behavioral Synthesis Peng Li, Yuxin Wang, Peng Zhang, 2 Guojie Luo, Tao Wang,,3 Jason Cong,2,3 Center for Energy-Efficient Computing and Applications,
More informationUNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY
UNIVERSITÉ DE PARIS-SUD 11 U.F.R. SCIENTIFIQUE D ORSAY In Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY Specialty: Computer Science Louis-Noël POUCHET Subject: ITERATIVE
More informationTowards Optimal Custom Instruction Processors
Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT CHIPS 18 Overview 1. background: extensible processors
More informationTiling: A Data Locality Optimizing Algorithm
Tiling: A Data Locality Optimizing Algorithm Previously Unroll and Jam Homework PA3 is due Monday November 2nd Today Unroll and Jam is tiling Code generation for fixed-sized tiles Paper writing and critique
More informationMemory-Centric Accelerator Design for Convolutional Neural Networks
Memory-Centric Accelerator Design for Convolutional Neural Networks Maurice Peemen, Arnaud A. A. Setio, Bart Mesman and Henk Corporaal Department of Electrical Engineering, Eindhoven University of Technology,
More informationLatency Masking Threads on FPGAs
Latency Masking Threads on FPGAs Walid Najjar UC Riverside & Jacquard Computing Inc. Credits } Edward B. Fernandez (UCR) } Dr. Jason Villarreal (Jacquard Computing) } Adrian Park (Jacquard Computing) }
More informationUniversity of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory
University of Delaware Department of Electrical and Computer Engineering Computer Architecture and Parallel Systems Laboratory Locality Optimization of Stencil Applications using Data Dependency Graphs
More informationD-TEC DSL Technology for Exascale Computing
D-TEC DSL Technology for Exascale Computing Progress Report: March 2013 DOE Office of Science Program: Office of Advanced Scientific Computing Research ASCR Program Manager: Dr. Sonia Sachs 1 Introduction
More informationControl flow graphs and loop optimizations. Thursday, October 24, 13
Control flow graphs and loop optimizations Agenda Building control flow graphs Low level loop optimizations Code motion Strength reduction Unrolling High level loop optimizations Loop fusion Loop interchange
More information