Polyhedral-Based Data Reuse Optimization for Configurable Computing


1 Polyhedral-Based Data Reuse Optimization for Configurable Computing
Louis-Noël Pouchet (1), Peng Zhang (1), P. Sadayappan (2), Jason Cong (1)
(1) University of California, Los Angeles   (2) The Ohio State University
February 12, 2013, ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA

7 Overview: Overview
The current situation:
Tremendous improvements in FPGA capacity/speed/energy, but off-chip communication remains very costly and on-chip memory is scarce.
  Our solution: an automatic, resource-aware data reuse optimization framework (combining loop transformations, on-chip buffers, and communication generation).
HLS/ESL tools have made great progress (e.g., AutoESL/Vivado), but extensive manual effort is still needed for best performance.
  Our solution: a complete HLS-focused source-to-source compiler.
Numerous previous research works on C-to-FPGA (PICO, DEFACTO, MMAlpha, etc.) and on data reuse optimization, but with (strong) limitations in applicability, transformations supported, and performance achieved.
  Our solution: unleash the true power of the polyhedral framework (loop transformations, communication scheduling, etc.).
UCLA / OSU 2

8 The Polyhedral Model: The Polyhedral Model in a Nutshell
Affine program regions: Loops have affine control only (over-approximation otherwise).
Typical affine codes: image processing, including the medical imaging pipeline (NSF CDSC project); linear algebra; iterative solvers (PDEs, etc.).
UCLA / OSU 3

9 The Polyhedral Model: The Polyhedral Model in a Nutshell
Affine program regions: Loops have affine control only (over-approximation otherwise).
Iteration domain: represented as integer polyhedra.
for (i = 1; i <= n; ++i)
  for (j = 1; j <= n; ++j)
    if (i <= n-j+2)
      s[i] = ...;
D_S1 = { (i, j) | 1 <= i <= n, 1 <= j <= n, i <= n-j+2 }
UCLA / OSU 3
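In matrix form (a reconstruction, assuming the usual encoding with one row per affine inequality and columns for i, j, the parameter n, and the constant), the same domain reads:

$$
\mathcal{D}_{S1} = \left\{ \begin{pmatrix} i \\ j \end{pmatrix} \;\middle|\;
\begin{pmatrix}
 1 &  0 & 0 & -1 \\
-1 &  0 & 1 &  0 \\
 0 &  1 & 0 & -1 \\
 0 & -1 & 1 &  0 \\
-1 & -1 & 1 &  2
\end{pmatrix}
\begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} \ge \vec{0} \right\}
$$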

10 The Polyhedral Model: The Polyhedral Model in a Nutshell
Affine program regions: Loops have affine control only (over-approximation otherwise).
Iteration domain: represented as integer polyhedra.
Memory accesses: static references, represented as affine functions of x_S and p.
for (i = 0; i < n; ++i) {
  s[i] = 0;
  for (j = 0; j < n; ++j)
    s[i] = s[i] + a[i][j] * x[j];
}
For statement S2 (the inner update), with iteration vector x_S2 = (i, j) and parameter n:
f_s(x_S2) = (i)    f_a(x_S2) = (i, j)    f_x(x_S2) = (j)
UCLA / OSU 3
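In matrix form (again a reconstruction under the same encoding, with the iteration vector of S2 extended by the parameter n and the constant 1):

$$
f_s(\vec{x}_{S2}) = \begin{pmatrix} 1 & 0 & 0 & 0 \end{pmatrix}\!\begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} = (i), \qquad
f_a(\vec{x}_{S2}) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}\!\begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} = \begin{pmatrix} i \\ j \end{pmatrix}, \qquad
f_x(\vec{x}_{S2}) = \begin{pmatrix} 0 & 1 & 0 & 0 \end{pmatrix}\!\begin{pmatrix} i \\ j \\ n \\ 1 \end{pmatrix} = (j)
$$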

11 The Polyhedral Model: The Polyhedral Model in a Nutshell
Affine program regions: Loops have affine control only (over-approximation otherwise).
Iteration domain: represented as integer polyhedra.
Memory accesses: static references, represented as affine functions of x_S and p.
Data dependence between S1 and S2: a subset of the Cartesian product of D_S1 and D_S2 (exact analysis).
for (i = 1; i <= 3; ++i) {
  s[i] = 0;
  for (j = 1; j <= 3; ++j)
    s[i] = s[i] + 1;
}
Here S1 is the initialization s[i] = 0 and S2 is the update in the inner loop:
D_{S1 δ S2} = { (i_S1, i_S2, j_S2) | i_S1 = i_S2, 1 <= i_S1 <= 3, 1 <= j_S2 <= 3 }
[Figure: S1 iterations vs. S2 iterations, dependence edges along i]
UCLA / OSU 3

12 The Polyhedral Model: The Polyhedral Model in a Nutshell Affine program regions: Loops have affine control only (over-approximation otherwise) Iteration domain: represented as integer polyhedra Memory accesses: static references, represented as affine functions of x S and p Data dependence between S1 and S2: a subset of the Cartesian product of D S1 and D S2 (exact analysis) Polyhedral compilation: Precise dataflow analysis [Feautrier,88] Optimal algorithms for data locality [Bondhugula,08] Effective code generation [Bastoul,04] Computationally expensive algorithms (ILP/PIP) UCLA / OSU 3

13 Data Reuse Optimization: Step 1: Scheduling for Better Data Reuse
Main idea: schedule operations accessing the same data as close to each other as possible.
Tiling is useful, but not all programs are tilable by default! A complex sequence of loop transformations is needed to enable tiling.
The Tiling Hyperplane method automatically finds such a sequence, using an ILP formulation of the optimization problem.
In our software, the first stage is to transform the input code so that:
1. the number of tilable loops is maximized,
2. temporal data locality is maximized,
3. all tilable loops can be tiled with an arbitrary tile size.
UCLA / OSU 4
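To make the effect of tiling concrete, here is a minimal hand-written sketch of a rectangularly tiled matrix-multiply kernel. It is an illustration, not the output of the tool described here; TS is a hypothetical fixed tile size (in the actual flow the tile size is a tunable parameter), and C is assumed to be zero-initialized by the caller.

/* Rectangular loop tiling: the i/j/k loops are split into tile loops
 * (ii, jj, kk) and intra-tile point loops, so each tile works on
 * TS x TS blocks of the arrays that can be kept in on-chip memory. */
#define TS 32  /* illustrative tile size */

void matmul_tiled(int n, float C[n][n], float A[n][n], float B[n][n])
{
  for (int ii = 0; ii < n; ii += TS)
    for (int jj = 0; jj < n; jj += TS)
      for (int kk = 0; kk < n; kk += TS)
        /* point loops over one tile (guards handle partial tiles) */
        for (int i = ii; i < ii + TS && i < n; ++i)
          for (int j = jj; j < jj + TS && j < n; ++j)
            for (int k = kk; k < kk + TS && k < n; ++k)
              C[i][j] += A[i][k] * B[k][j];
}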

14 Data Reuse Optimization: Step 2: Reuse Data Using On-Chip Buffers Key ideas: Compute the set of data used at a given loop iteration Reuse data between consecutive loop iterations The process works for any loop in the program Natural complement of tiling: the tile size will determine how much data is read by a non-inner-loop iteration The polyhedral framework can be used to easily compute all this information, including what to communicate UCLA / OSU 5

15 Data Reuse Optimization: Computing the Per-Iteration Data Reuse
[Figure: 5x5 grid of array elements around (i, j), rows i-2..i+2, columns j-2..j+2]
// Two-dimensional Jacobi-like stencil
for (t = 0; t < T; ++t)
  for (i = 0; i < N; ++i)
    for (j = 0; j < N; ++j)
      B[i][j] = 0.2*( A[i][j-1] + A[i][j] + A[i][j+1] + A[i-1][j] + A[i+1][j]);
UCLA / OSU 6

16 Data Reuse Optimization: Computing the Per-Iteration Data Reuse
[Figure: grid highlighting the elements of A accessed at iteration (t, i, j)]
Compute the data space of A at iteration x = (t, i, j):
DS_A(x) = ∪_{s ∈ S} F_s^A(x), the union of the images of x under the access functions F_s^A to A.
F(x) is the image of x by the function F.
UCLA / OSU 7

17 Data Reuse Optimization: Computing the Per-Iteration Data Reuse
[Figure: grid highlighting the elements of A accessed at iteration (t, i, j-1)]
Compute the data space of A at iteration y = (t, i, j-1):
DS_A(y) = ∪_{s ∈ S} F_s^A(y)
UCLA / OSU 7

18 Data Reuse Optimization: Computing the Per-Iteration Data Reuse
[Figure: grid, reused elements shown as the red set]
Reused data (red set): ReuseSet = DS_A(x) ∩ DS_A(y)
UCLA / OSU 7

19 Data Reuse Optimization: Computing the Per-Iteration Data Reuse
[Figure: grid, newly communicated elements shown as the blue set]
Per-iteration communication (blue set): PerCommSet = DS_A(x) \ ReuseSet
UCLA / OSU 7
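Working the stencil example through by hand (consistent with the figures, although the concrete sets below are not spelled out in the slides), with x = (t, i, j) and y = (t, i, j-1):

$$
\begin{aligned}
DS_A(x) &= \{A[i][j-1],\; A[i][j],\; A[i][j+1],\; A[i-1][j],\; A[i+1][j]\} \\
DS_A(y) &= \{A[i][j-2],\; A[i][j-1],\; A[i][j],\; A[i-1][j-1],\; A[i+1][j-1]\} \\
\mathrm{ReuseSet} &= DS_A(x) \cap DS_A(y) = \{A[i][j-1],\; A[i][j]\} \\
\mathrm{PerCommSet} &= DS_A(x) \setminus \mathrm{ReuseSet} = \{A[i][j+1],\; A[i-1][j],\; A[i+1][j]\}
\end{aligned}
$$

That is, only three new elements of A must be brought on chip per iteration of the innermost loop.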

20 Data Reuse Optimization: Computing the Per-Iteration Data Reuse
These sets are parametric polyhedral sets; use CLooG to scan them. This works for any value of t, i, j.
An initialization copy is executed before the first iteration of the loop, and communications are done at each iteration.
UCLA / OSU 7
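As a minimal sketch of what the resulting copy-in / compute / copy-out structure can look like, here is a hand-written version of the stencil with reuse exploited at the level of the i loop and restricted to interior points. The buffer names A_l and B_l, the 3-row rolling buffer, and the modulo mapping are illustrative assumptions, not the code emitted by the actual generator.

/* Rolling line buffers for the Jacobi-like stencil: A_l holds the three
 * rows of A alive at iteration i (row r is kept in A_l[r % 3]); per
 * iteration of i only one new row is copied in (the per-iteration
 * communication set) and the other two rows are reused. */
void jacobi_reuse(int T, int N, float A[N][N], float B[N][N])
{
  float A_l[3][N];  /* on-chip buffer: three rows of A */
  float B_l[N];     /* on-chip buffer: one row of B */

  for (int t = 0; t < T; ++t) {
    /* initialization copy: rows 0 and 1, needed before the first iteration i = 1 */
    for (int j = 0; j < N; ++j) { A_l[0][j] = A[0][j]; A_l[1][j] = A[1][j]; }
    for (int i = 1; i < N - 1; ++i) {
      /* per-iteration copy-in: only the new row i+1 */
      for (int j = 0; j < N; ++j)
        A_l[(i + 1) % 3][j] = A[i + 1][j];
      /* compute from on-chip data only (interior points, for simplicity) */
      for (int j = 1; j < N - 1; ++j)
        B_l[j] = 0.2f * (A_l[i % 3][j - 1] + A_l[i % 3][j] + A_l[i % 3][j + 1]
                         + A_l[(i - 1) % 3][j] + A_l[(i + 1) % 3][j]);
      /* copy-out the row of B produced at this iteration */
      for (int j = 1; j < N - 1; ++j)
        B[i][j] = B_l[j];
    }
  }
}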

21 Data Reuse Optimization: Computing the Per-Iteration Data Reuse
[Figure: grid, the buffer set shown as the full blue set]
Buffer set: the full blue set (the data space at (t, i, j)).
UCLA / OSU 7

22 Data Reuse Optimization: Quick Overview of the Full Algorithm
1. For each array and each loop, compute the buffer polyhedron and the per-iteration communication polyhedron.
2. For a given array, find the loop which minimizes the communication volume with a buffer fitting the FPGA resources.
3. Make the entire program use on-chip arrays (buffers).
   Example: A[i][j] = A[i][j+1] becomes, for a buffer A_l[bs1][bs2]:
   A_l[i % bs1][j % bs2] = A_l[i % bs1][(j+1) % bs2]
4. Insert the code scanning the polyhedral sets into the program.
   Example of a copy-in statement: A_l[i % bs1][j % bs2] = A[i][j];
UCLA / OSU 8
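As an illustration of the resource-aware selection in step 2, here is a hedged sketch with made-up data structures: level_stats and select_loop_level are hypothetical names, and the framework is assumed to have already computed, for every candidate loop level, the buffer size and the total communication volume implied by placing the reuse buffer at that level.

#include <stddef.h>

struct level_stats {
  size_t buffer_bytes;  /* size of the buffer polyhedron at this loop level */
  size_t comm_bytes;    /* total communication volume implied by this level */
};

/* Return the loop level minimizing communication among those whose buffer
 * fits the on-chip budget, or -1 if no candidate fits. */
int select_loop_level(const struct level_stats *levels, int n_levels,
                      size_t onchip_budget_bytes)
{
  int best = -1;
  for (int l = 0; l < n_levels; ++l) {
    if (levels[l].buffer_bytes > onchip_budget_bytes)
      continue;  /* buffer does not fit: discard this level */
    if (best < 0 || levels[l].comm_bytes < levels[best].comm_bytes)
      best = l;  /* less communication: keep this level */
  }
  return best;
}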

23 High-Level Synthesis: Step 3: HLS-specific Optimizations
For good performance, numerous complementary optimizations are needed:
Reduce the II (initiation interval) of inner loops by forcing inner-most parallel loops: use polyhedral-based parallelization methods.
Exhibit usable task-level parallelism: use polyhedral-based analysis, and factor the tasks into functions.
Overlap communication and computation: use FIFO communication modules, and also scan the polyhedral communication sets in prefetch functions to issue requests.
Find the best tile size/shape for a program: create a machine-specific, accurate communication latency model; run AutoESL on a variety of tile sizes and retain the best one.
UCLA / OSU 9
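For instance, once the innermost loop is parallel and reads only on-chip buffers, it can be pipelined with an initiation interval of 1. The sketch below uses the Vivado HLS pragma syntax (the AutoESL-era syntax differs slightly) and reuses the hypothetical buffer names from the earlier sketch; it is an illustration, not generated code.

/* Innermost compute loop over one row, pipelined at II = 1.
 * Assumes i >= 1 and that A_l/B_l are on-chip buffers as sketched above. */
void compute_row(int N, int i, float A_l[3][N], float B_l[N])
{
  for (int j = 1; j < N - 1; ++j) {
#pragma HLS PIPELINE II=1
    B_l[j] = 0.2f * (A_l[i % 3][j - 1] + A_l[i % 3][j] + A_l[i % 3][j + 1]
                     + A_l[(i - 1) % 3][j] + A_l[(i + 1) % 3][j]);
  }
}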

24 Experimental Results: Performance Results / Complete Results
[Figure 5: Total communication time (in cycles) vs. communication volume, one plot per benchmark]
[Figure 6: Total execution time (in cycles) vs. on-chip buffer size requirement (total BRAMs, in 16 kB blocks): Pareto-optimal points for Denoise, Segmentation, and DGEMM]
Platform: Convey HC-1 (4 Xilinx Virtex-6 FPGAs), total bandwidth up to 80 GB/s. Core design frequency: 150 MHz, off-chip memory frequency: 300 MHz. The AutoESL versions use the memory/control interfaces provided by Convey.
Table 2 (Characteristics of Best Found Versions) summarizes, for each tested benchmark, the best version found by our framework: tile size, #PEs, #BRAM, and #LUT. #PEs is the number of replications of the full computation we have been able to place on a single Virtex-6 FPGA as in the Convey HC-1, showing the level of coarse-grain parallelization achieved; BRAM and LUT are totals for the set of PEs placed on the chip.
Table 3 reports the performance, in GigaFlop per second, of different implementations of the same benchmark. out-of-the-box is a basic manual off-chip-only implementation of the benchmark, without our framework. PolyOpt/HLS-E is the performance achieved with our automated framework; those are AutoESL results obtained with our fast DSE framework. Hand-tuned is a manually hand-tuned version serving as our performance reference, from Cong et al. [17], designed through time-consuming work at the source-code level.

Table 3: Side-by-side comparison (GF/s)
Benchmark     Description                              out-of-the-box  PolyOpt/HLS-E  hand-tuned [17]
denoise       3D Jacobi+Seidel-like 7-point stencils   0.02            4.58           52.0
segmentation  3D Jacobi-like 7-point stencils          0.05            24.91
dgemm         matrix multiplication                    0.04                           N/A
gemver        sequence of matrix-vector operations     0.10            1.07           N/A

For segmentation, we outperform the manual design, despite the clear remaining room for improvement our framework still has, as shown by the denoise number. The reference manual implementation uses numerous techniques not implemented in our automatic framework, such as in-register data reuse, fine-grain communication pipelining, and algorithmic modifications leading to near-optimal performance for this version. Semi-automated manual design can be performed on top of our framework, to address optimizations we do not support, such as array partitioning.
Finally, Table 4 compares the latency as reported by AutoESL, using our memory latency framework for fast DSE, against the wall-clock time observed on the machine after full synthesis of the generated RTL. We report the performance of a single PE call executing a subset (slice) of the full computation.
UCLA / OSU 10

25 Software Infrastructure: PolyOpt/HLS
[Toolchain diagram] Input: a full C program; output: an HLS-friendly C program.
Components: Parser (C-to-AST); PolyParser (AST-to-polyhedra); Candl (dependence analysis); PIPLib; Pluto (transformation for tilability); vectorizer (transformation for inner-parallel loops); Outliner (restructures code for HLS); PolyUnparser (PAST-to-AST); Unparser (AST-to-C); CLooG (polyhedra-to-PAST); ISL; LMP (buffer and communication generation).
Intermediate representations: Sage AST (ROSE), SCoP (polyhedral representation), PAST (Polyhedral AST).
Built on PoCC (the Polyhedral Compiler Collection), PolyOpt (a Polyhedral Optimizer for the ROSE compiler), and the ROSE compiler infrastructure (LLNL).
UCLA / OSU 11

26 Conclusion: Conclusions
Take-home message: Affine programs are an excellent fit for FPGAs/HLS. Recent progress in HLS tools lets compiler researchers target FPGA optimization. A complete, end-to-end framework has been implemented and its effectiveness demonstrated.
Future work: Use analytical models for tile size selection. Further improve performance with additional optimizations. Support more machines/FPGAs (currently developed for the Convey HC-1). Improve polyhedral code generation for HLS/FPGAs.
UCLA / OSU 12
