DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS

Size: px
Start display at page:

Download "DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS"

Transcription

1 Georgios Zacharopoulos Giovanni Ansaloni Laura Pozzi DATA REUSE ANALYSIS FOR AUTOMATED SYNTHESIS OF CUSTOM INSTRUCTIONS IN SLIDING WINDOW APPLICATIONS Università della Svizzera italiana (USI Lugano), Faculty of Informatics IMPACT 2017 Jan 23, 2017 Stockholm, Sweden

2 INTRODUCTION 2 MOORE S LAW - DENNARD SCALING Expectation Performance should keep increasing Reality Performance is NOT increasing anymore

3 INTRODUCTION BREAKDOWN OF DENNARD SCALING Single core > Multiple cores Dark Silicon 3

4 INTRODUCTION 4 HETEROGENEOUS ERA Multiple cores > Heterogeneous Computing Need for Specialised HW [1] [1]

5 BACKGROUND 5 SPECIALISATION AND PERFORMANCE ASICS CPU + HW ACCELERATORS SPECIALISATION GENERAL PURPOSE CPU DOMAIN SPECIFIC PROCESSORS (ADSPS) APPLICATION SPECIFIC INSTRUCTION-SET PROCESSORS (ASIPS) PERFORMANCE

6 RELATED WORK 6 DATA TRANSFER IS THE BOTTLENECK Accelerate data transfer between accelerators and [4] [5] memory Custom storage as an optimisation [6] [7] Identify Data reuse for building custom storage [4] G. Gutin, A. Johnstone, J. Reddington, E. Scott, and A. Yeo. An algorithm for finding input-output constrained convex sets in an acyclic digraph. J. Discrete Algorithms, 13:47 58, [5] E. Giaquinta, A. Mishra, and L. Pozzi. Maximum convex subgraphs under I/O constraint for automatic identification of custom instructions. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(3): , [6] P. Biswas, N. Dutt, P. Ienne, and L. Pozzi. Automatic identification of application-specific functional units with architecturally visible storage. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pages , Mar [7] M.Haaß,L.Bauer,andJ.Henkel.Automatic Custom Instruction Identification in Memory Streaming Algorithms. In Proceedings of the International Conference on Compilers, Architectures, and Synthesis for Embedded Systems, pages 1 9, Oct

7 RELATED WORK 7 DATA TRANSFER IS THE BOTTLENECK Rely on source-to-source transformations [8] [9] [10] Limited Parallelism [8] [9] Large Internal Buffers [10] [8] W. Meeus and D. Stroobandt. Automating data reuse in high-level synthesis. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pages 1 4, Mar [9] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. Polyhedral-based data reuse optimization for configurable computing. In Proceedings of the 2013 ACM/SIGDA 21st International Symposium on Field Programmable Gate Arrays, pages 29 38, Feb [10] M. Schmid, O. Reiche, F. Hannig, and J. Teich. Loop coarsening in C-based high-level synthesis. In Proceedings of the 26th International Conference on Application-specific Systems, Architectures and Processors, pages , July 2015.

8 RELATED WORK 8 DATA TRANSFER IS THE BOTTLENECK Rely on source-to-source transformations [8] [9] [10] Limited Parallelism [8] [9] Large Internal Buffers [10] Multiple Datapaths Area efficient Accelerators [8] W. Meeus and D. Stroobandt. Automating data reuse in high-level synthesis. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pages 1 4, Mar [9] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. Polyhedral-based data reuse optimization for configurable computing. In Proceedings of the 2013 ACM/SIGDA 21st International Symposium on Field Programmable Gate Arrays, pages 29 38, Feb [10] M. Schmid, O. Reiche, F. Hannig, and J. Teich. Loop coarsening in C-based high-level synthesis. In Proceedings of the 26th International Conference on Application-specific Systems, Architectures and Processors, pages , July 2015.

9 METHODOLOGY 9 SW ANALYSIS FOR DATA REUSE void Sobelsystem() { outer_loop:for(i = 1; i < 100; i++) { inner_loop:for(j = 1; j < 100; j++) { Sobelmodule(image[i-1][j-1], image[i][j-1], image[i+1][j+1], image[i-1][j], image[i+1][j], image[i-1][j+1], image[i][j+1], image[i+1][j-1], image[i][j]); LOADS } } }

10 METHODOLOGY 10 SW ANALYSIS FOR DATA REUSE

11 METHODOLOGY 11 SW ANALYSIS FOR DATA REUSE

12 METHODOLOGY 12 SW ANALYSIS FOR DATA REUSE

13 METHODOLOGY 13 WINDOW 1

14 METHODOLOGY 14 WINDOW 2

15 METHODOLOGY 15 WINDOW 1 WINDOW 2 UNROLLING

16 METHODOLOGY 16 WINDOW 1 WINDOW 2 UNROLLING PIPELINING

17 METHODOLOGY 17 WINDOW 1 WINDOW 2 UNROLLING PIPELINING

18 METHODOLOGY 18 COMPILERS Perform Source Code Analysis Identify Data Reuse in Sliding Window Applications

19 METHODOLOGY 19 COMPILERS Perform Source Code Analysis Identify Data Reuse in Sliding Window Applications [2] [3] [2] C. Lattner and V. Adve. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2nd International Symposium on Code generation and optimization, Palo Alto, California, Mar [3] T. Grosser, H. Zheng, R. Aloor, A. Simbürger, A. Größlinger, and L.-N. Pouchet. Polly-polyhedral optimization in llvm. In Proceedings of the First International Workshop on Polyhedral Compilation Techniques (IMPACT), volume 2011, 2011.

20 METHODOLOGY 20 LLVM POLLY SCOP ANALYSIS FOR DATA REUSE WINDOW SIZE

21 METHODOLOGY 21 LLVM POLLY SCOP ANALYSIS FOR DATA REUSE STRIDE

22 METHODOLOGY 22 LLVM POLLY SCOP ANALYSIS FOR DATA REUSE FRAME SIZE

23 METHODOLOGY 23 HW IMPLEMENTATION Parameters retrieved with LLVM Polly Analysis Pass: Horizontal and Vertical Window Size Stride Iteration Domain (Frame Size)

24 METHODOLOGY 24 HW IMPLEMENTATION

25 METHODOLOGY 25 HW IMPLEMENTATION AVAILABLE INPUT DATA WIDTH

26 METHODOLOGY 26 HW IMPLEMENTATION AVAILABLE INPUT DATA WIDTH Buffer Size: Horizontal Width = Input Width (INw) INw

27 METHODOLOGY 27 HW IMPLEMENTATION Buffer Size: Horizontal Width = Input Width (INw) Example: INw INw = Horizontal Win. Size + 1

28 METHODOLOGY 28 HW IMPLEMENTATION Buffer Size: Horizontal Width = Input Width (INw) Vertical Width = Vertical Win. Size Example: INw INw = Horizontal Win. Size + 1

29 METHODOLOGY 29 HW IMPLEMENTATION Buffer Size: Horizontal Width = Input Width (INw) Vertical Width = Vertical Win. Size Example: INw INw = Horizontal Win. Size + 1

30 METHODOLOGY 30 WINDOW 1 WINDOW 2 BUFFER

31 METHODOLOGY 31 WINDOW 1 WINDOW 2 STRIDE - DISCARD TOPMOST ROW STRIDE - STORE NEW ROW

32 METHODOLOGY 32 WINDOW 1 WINDOW 2

33 METHODOLOGY 33 WINDOW 1 WINDOW 2

34 METHODOLOGY 34 WINDOW 1 WINDOW 2

35 METHODOLOGY 35 WINDOW 1 WINDOW 2

36 METHODOLOGY 36 DISPLACEMENT WINDOW 1 WINDOW 2

37 METHODOLOGY 37 DISPLACEMENT WINDOW 1 WINDOW 2 Horiz. Displacement = INw - Horiz. Win. Size + 1

38 METHODOLOGY 38 DISPLACEMENT WINDOW 1 WINDOW 2 Horiz. Displacement = INw - Horiz. Win. Size + 1 Example: INw = Horizontal Win. Size + 1 Horiz. Displacement = 2

39 EXPERIMENTAL SETUP 39 APPROACHES COMPARED ROCCC Vivado HLS Our Approach

40 EXPERIMENTAL SETUP 40 APPROACHES COMPARED ROCCC Vivado HLS Our Approach PIPELINING DATA REUSE

41 EXPERIMENTAL SETUP 41 APPROACHES COMPARED ROCCC Vivado HLS Our Approach WINDOW 1 WINDOW 2 PIPELINING PIPELINING DATA REUSE

42 EXPERIMENTAL SETUP 42 APPROACHES COMPARED ROCCC Vivado HLS Our Approach WINDOW 1 WINDOW 2 PIPELINING ROCCC PIPELINING DATA REUSE

43 EXPERIMENTAL SETUP 43 APPROACHES COMPARED ROCCC Vivado HLS Our Approach WINDOW 1 WINDOW 2 PIPELINING ROCCC PIPELINING DATA REUSE VIVADO Manual Rewrite

44 EXPERIMENTAL SETUP 44 APPROACHES COMPARED ROCCC Vivado HLS Our Approach WINDOW 1 WINDOW 2 PIPELINING ROCCC PIPELINING DATA REUSE VIVADO Manual Rewrite OUR

45 EXPERIMENTAL SETUP 45 APPROACHES COMPARED ROCCC Vivado HLS Our Approach PIPELINING DATA REUSE UNROLLING DATA REUSE ROCCC VIVADO Manual Rewrite OUR

46 EXPERIMENTAL SETUP 46 APPROACHES COMPARED ROCCC WINDOW 1 WINDOW 2 Vivado HLS Our Approach UNROLLING PIPELINING DATA REUSE UNROLLING DATA REUSE ROCCC VIVADO Manual Rewrite OUR

47 EXPERIMENTAL SETUP 47 APPROACHES COMPARED ROCCC WINDOW 1 WINDOW 2 Vivado HLS Our Approach UNROLLING PIPELINING DATA REUSE UNROLLING DATA REUSE ROCCC VIVADO Manual Rewrite OUR

48 EXPERIMENTAL SETUP 48 APPROACHES COMPARED ROCCC WINDOW 1 WINDOW 2 Vivado HLS Our Approach UNROLLING PIPELINING DATA REUSE UNROLLING DATA REUSE ROCCC VIVADO Manual Rewrite OUR

49 EXPERIMENTAL SETUP 49 APPROACHES COMPARED ROCCC WINDOW 1 WINDOW 2 Vivado HLS Our Approach UNROLLING PIPELINING DATA REUSE UNROLLING DATA REUSE ROCCC VIVADO Manual Rewrite OUR

50 EXPERIMENTAL SETUP 50 APPROACHES COMPARED ROCCC Vivado HLS Our Approach ROCCC PIPELINING DATA REUSE UNROLLING DATA REUSE PARAMETRIC I/O WIDTH VIVADO Manual Rewrite OUR

51 EXPERIMENTAL SETUP 51 APPROACHES COMPARED INPUT WIDTH ROCCC Vivado HLS BUFFER Our Approach PIPELINING DATA REUSE UNROLLING DATA REUSE PARAMETRIC I/O WIDTH ROCCC VIVADO Manual Rewrite OUR

52 EXPERIMENTAL SETUP 52 APPROACHES COMPARED INPUT WIDTH ROCCC Vivado HLS BUFFER Our Approach PIPELINING DATA REUSE UNROLLING DATA REUSE PARAMETRIC I/O WIDTH ROCCC VIVADO Manual Rewrite OUR

53 EXPERIMENTAL SETUP 53 APPROACHES COMPARED INPUT WIDTH ROCCC Vivado HLS BUFFER Our Approach PIPELINING DATA REUSE UNROLLING DATA REUSE PARAMETRIC I/O WIDTH ROCCC VIVADO Manual Rewrite OUR

54 EXPERIMENTAL SETUP 54 APPROACHES COMPARED INPUT WIDTH ROCCC Vivado HLS BUFFER Our Approach PIPELINING DATA REUSE UNROLLING DATA REUSE PARAMETRIC I/O WIDTH ROCCC VIVADO Manual Rewrite OUR

55 EXPERIMENTAL SETUP 55 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation)

56 EXPERIMENTAL SETUP 56 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) 3 CONFIGURATIONS

57 EXPERIMENTAL SETUP 57 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) 3 CONFIGURATIONS INPUT WIDTH # DATAPATHS

58 EXPERIMENTAL SETUP 58 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) 3 CONFIGURATIONS INPUT WIDTH # DATAPATHS WINDOW SIZE SAD MAX FILTER SOBEL 4X4 8X8 3X3

59 EXPERIMENTAL SETUP 59 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) INPUT WIDTH # DATAPATHS WINDOW SIZE SINGLE DATAPATH SDP SAD MAX FILTER SOBEL 4X4 8X8 3X3

60 EXPERIMENTAL SETUP 60 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) INPUT WIDTH # DATAPATHS SAD MAX FILTER SOBEL WINDOW SIZE 4X4 8X8 3X3 SINGLE DATAPATH SDP 4, 1 8, 1 3, 1

61 EXPERIMENTAL SETUP 61 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) INPUT WIDTH # DATAPATHS WINDOW SIZE SINGLE DATAPATH SDP MULTIPLE DATAPATHS MDP SAD MAX FILTER SOBEL 4X4 8X8 3X3 4, 1 8, 1 3, 1

62 EXPERIMENTAL SETUP 62 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) INPUT WIDTH # DATAPATHS WINDOW SIZE SINGLE DATAPATH SDP MULTIPLE DATAPATHS MDP SAD 4X4 4, 1 8, 5 MAX FILTER 8X8 8, 1 16, 9 SOBEL 3X3 3, 1 8, 6

63 EXPERIMENTAL SETUP 63 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) INPUT WIDTH # DATAPATHS WINDOW SIZE SINGLE DATAPATH SDP MULTIPLE DATAPATHS MDP EXTRA MULT. DATAPATHS XMDP SAD 4X4 4, 1 8, 5 MAX FILTER 8X8 8, 1 16, 9 SOBEL 3X3 3, 1 8, 6

64 EXPERIMENTAL SETUP 64 OUR APPROACH Xilinx Virtex7 FPGA (HDL Simulation) INPUT WIDTH # DATAPATHS WINDOW SIZE SINGLE DATAPATH SDP MULTIPLE DATAPATHS MDP EXTRA MULT. DATAPATHS XMDP SAD 4X4 4, 1 8, 5 16, 13 MAX FILTER 8X8 8, 1 16, 9 24, 17 SOBEL 3X3 3, 1 8, 6 16, 14

65 EXPERIMENTS 65 EXECUTION TIME COMPARISON ROCCC Vivado_norew Vivado_norev Vivado_rew Vivado_rev CISC=Conf.1 SDP Conf.1========= CISC=Conf.2 Conf.2 MDP Conf.3 CISC=Conf.3 XMDP ns Sobel MaxFilter BlockSAD

66 EXPERIMENTS 66 EXECUTION TIME COMPARISON ROCCC Vivado_norew Vivado_norev Vivado_rew Vivado_rev CISC=Conf.1 SDP Conf.1========= CISC=Conf.2 Conf.2 MDP Conf.3 CISC=Conf.3 XMDP ns Sobel MaxFilter BlockSAD PIPELINING DATA REUSE UNROLLING DATA REUSE PARAMETRIC I/O WIDTH VIVADO_NOREW

67 EXPERIMENTS 67 EXECUTION TIME COMPARISON ROCCC Vivado_norew Vivado_norev Vivado_rew Vivado_rev CISC=Conf.1 SDP Conf.1========= CISC=Conf.2 Conf.2 MDP Conf.3 CISC=Conf.3 XMDP ns Sobel MaxFilter BlockSAD PIPELINING DATA REUSE UNROLLING DATA REUSE PARAMETRIC I/O WIDTH VIVADO_REW Manual Rewrite

68 EXPERIMENTS 68 EXECUTION TIME COMPARISON ROCCC Vivado_norew Vivado_norev Vivado_rew Vivado_rev CISC=Conf.1 SDP Conf.1========= CISC=Conf.2 Conf.2 MDP Conf.3 CISC=Conf.3 XMDP ns Sobel MaxFilter BlockSAD Exec. Time to process a 100x100 frame

69 EXPERIMENTS 69 EXECUTION TIME COMPARISON ROCCC Vivado_norew Vivado_norev Vivado_rew Vivado_rev CISC=Conf.1 SDP Conf.1========= CISC=Conf.2 Conf.2 MDP Conf.3 CISC=Conf.3 XMDP ns Sobel MaxFilter BlockSAD Exec. Time to process a 100x100 frame

70 EXPERIMENTS 70 EXECUTION TIME COMPARISON ROCCC Vivado_norew Vivado_norev Vivado_rew Vivado_rev CISC=Conf.1 SDP Conf.1========= CISC=Conf.2 Conf.2 MDP Conf.3 CISC=Conf.3 XMDP ns Sobel MaxFilter BlockSAD Exec. Time to process a 100x100 frame

71 EXPERIMENTS 71 EXECUTION TIME COMPARISON ROCCC Vivado_norew Vivado_norev Vivado_rew Vivado_rev CISC=Conf.1 SDP Conf.1========= CISC=Conf.2 Conf.2 MDP Conf.3 CISC=Conf.3 XMDP ns Sobel MaxFilter BlockSAD Exec. Time to process a 100x100 frame

72 EXPERIMENTS 72 AREA COMPARISON 45666!"""" 7,8)9$:;$.&8 7,8)9$:;$.&C!"""" 7,8)9$:.&8 7,8)9$:.&C 6<#6=6$;>?! SDP 6$;>?!=========!"""!""" MDP 6<#6=6$;>?A 6$;>?A XMDP!"" #$%&' ()*+,'-&. /'$01#23!"" #$%&' ()*+,'-&. /'$01#23 )B FLIP FLOPS LUTS

73 EXPERIMENTS 73 AREA COMPARISON 45666!"""" 7,8)9$:;$.&8 7,8)9$:;$.&C!"""" 7,8)9$:.&8 7,8)9$:.&C 6<#6=6$;>?! SDP 6$;>?!=========!"""!""" MDP 6<#6=6$;>?A 6$;>?A XMDP!"" #$%&' ()*+,'-&. /'$01#23!"" #$%&' ()*+,'-&. /'$01#23 )B FLIP FLOPS LUTS

74 EXPERIMENTS 74 AREA COMPARISON 45666!"""" 7,8)9$:;$.&8 7,8)9$:;$.&C!"""" 7,8)9$:.&8 7,8)9$:.&C 6<#6=6$;>?! SDP 6$;>?!=========!"""!""" MDP 6<#6=6$;>?A 6$;>?A XMDP!"" #$%&' ()*+,'-&. /'$01#23!"" #$%&' ()*+,'-&. /'$01#23 )B FLIP FLOPS LUTS

75 EXPERIMENTS 75 AREA COMPARISON 45666!"""" 7,8)9$:;$.&8 7,8)9$:;$.&C!"""" 7,8)9$:.&8 7,8)9$:.&C 6<#6=6$;>?! SDP 6$;>?!=========!"""!""" MDP 6<#6=6$;>?A 6$;>?A XMDP!"" #$%&' ()*+,'-&. /'$01#23!"" #$%&' ()*+,'-&. /'$01#23 )B FLIP FLOPS LUTS

76 EXPERIMENTS 76 AREA COMPARISON 45666!"""" 7,8)9$:;$.&8 7,8)9$:;$.&C!"""" 7,8)9$:.&8 7,8)9$:.&C 6<#6=6$;>?! SDP 6$;>?!=========!"""!""" MDP 6<#6=6$;>?A 6$;>?A XMDP!"" #$%&' ()*+,'-&. /'$01#23!"" #$%&' ()*+,'-&. /'$01#23 )B FLIP FLOPS LUTS

77 EXPERIMENTS 77 AREA COMPARISON 45666!"""" 7,8)9$:;$.&8 7,8)9$:;$.&C!"""" 7,8)9$:.&8 7,8)9$:.&C 6<#6=6$;>?! SDP 6$;>?!=========!"""!""" MDP 6<#6=6$;>?A 6$;>?A XMDP!"" #$%&' ()*+,'-&. /'$01#23!"" #$%&' ()*+,'-&. /'$01#23 )B FLIP FLOPS LUTS

78 EXPERIMENTS 78 AREA COMPARISON 45666!"""" 7,8)9$:;$.&8 7,8)9$:;$.&C!"""" 7,8)9$:.&8 7,8)9$:.&C 6<#6=6$;>?! SDP 6$;>?!=========!"""!""" MDP 6<#6=6$;>?A 6$;>?A XMDP!"" #$%&' ()*+,'-&. /'$01#23!"" #$%&' ()*+,'-&. /'$01#23 )B FLIP FLOPS LUTS

79 EXPERIMENTS 79 SPEEDUP AND AREA 74 7& & :9 ;/0,'"(0/<0,-'=>?@)A ;/0,'"(0/<0,-'=BB)A SOBEL MAX FILTER SAD SOBEL MAX FILTER SAD MDP vs Vivado_Rew!"#$%&''()%''*+(,-"./01 XMDP vs Vivado_Rew!"#$%2''()%''*+(,-"./01

80 CONCLUSIONS 80 DATA REUSE ANALYSIS BASED ON POLLY Optimize HLS of Accelerators for Sliding Window Applications Use Polly for Better HW Designs

81 CONCLUSIONS 81 DATA REUSE ANALYSIS BASED ON POLLY Optimize HLS of Accelerators for Sliding Window Applications Use Polly for Better HW Designs Exploit Data Locality

82 CONCLUSIONS 82 DATA REUSE ANALYSIS BASED ON POLLY Optimize HLS of Accelerators for Sliding Window Applications Use Polly for Better HW Designs Exploit Data Locality Design Efficient Buffers

83 CONCLUSIONS 83 DATA REUSE ANALYSIS BASED ON POLLY Optimize HLS of Accelerators for Sliding Window Applications Use Polly for Better HW Designs Exploit Data Locality Design Efficient Buffers Better Use of Resources

84 CONCLUSIONS 84 DATA REUSE ANALYSIS BASED ON POLLY Optimize HLS of Accelerators for Sliding Window Applications Use Polly for Better HW Designs Exploit Data Locality Design Efficient Buffers Better Use of Resources Better Performance

Data Reuse Analysis for Automated Synthesis of Custom Instructions in Sliding Window Applications

Data Reuse Analysis for Automated Synthesis of Custom Instructions in Sliding Window Applications Data Reuse Analysis for Automated Synthesis of Custom Instructions in Sliding Window Applications Georgios Zacharopoulos, Giovanni Ansaloni, Laura Pozzi Faculty of Informatics University of Lugano (USI)

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information

An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers

An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers An Optimal Microarchitecture for Stencil Computation Acceleration Based on Non-Uniform Partitioning of Data Reuse Buffers Jason Cong *, Peng Li, Bingjun Xiao *, and Peng Zhang Computer Science Dept. &

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Run Fast When You Can: Loop Pipelining with Uncertain and Non-uniform Memory Dependencies

Run Fast When You Can: Loop Pipelining with Uncertain and Non-uniform Memory Dependencies Run Fast When You Can: Loop Pipelining with Uncertain and Non-uniform Memory Dependencies Junyi Liu, John Wickerson, Samuel Bayliss, and George A. Constantinides Department of Electrical and Electronic

More information

Polyhedral-Based Data Reuse Optimization for Configurable Computing

Polyhedral-Based Data Reuse Optimization for Configurable Computing Polyhedral-Based Data Reuse Optimization for Configurable Computing Louis-Noël Pouchet 1 Peng Zhang 1 P. Sadayappan 2 Jason Cong 1 1 University of California, Los Angeles 2 The Ohio State University February

More information

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou University of California, Los Angeles 1 What is stencil computation? 2 What is Stencil Computation? A sliding

More information

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips

Overview. CSE372 Digital Systems Organization and Design Lab. Hardware CAD. Two Types of Chips Overview CSE372 Digital Systems Organization and Design Lab Prof. Milo Martin Unit 5: Hardware Synthesis CAD (Computer Aided Design) Use computers to design computers Virtuous cycle Architectural-level,

More information

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Efficient Hardware Acceleration on SoC- FPGA using OpenCL Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA

More information

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators

Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Generation of Multigrid-based Numerical Solvers for FPGA Accelerators Christian Schmitt, Moritz Schmid, Frank Hannig, Jürgen Teich, Sebastian Kuckuk, Harald Köstler Hardware/Software Co-Design, System

More information

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer

More information

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier CHAPTER 3 METHODOLOGY 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier The design analysis starts with the analysis of the elementary algorithm for multiplication by

More information

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University

ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm

More information

EARLY PREDICTION OF HARDWARE COMPLEXITY IN HLL-TO-HDL TRANSLATION

EARLY PREDICTION OF HARDWARE COMPLEXITY IN HLL-TO-HDL TRANSLATION UNIVERSITA DEGLI STUDI DI NAPOLI FEDERICO II Facoltà di Ingegneria EARLY PREDICTION OF HARDWARE COMPLEXITY IN HLL-TO-HDL TRANSLATION Alessandro Cilardo, Paolo Durante, Carmelo Lofiego, and Antonino Mazzeo

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder

ESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators

More information

FPGA: What? Why? Marco D. Santambrogio

FPGA: What? Why? Marco D. Santambrogio FPGA: What? Why? Marco D. Santambrogio marco.santambrogio@polimi.it 2 Reconfigurable Hardware Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much

More information

An Overlay Architecture for FPGA-based Industrial Control Systems Designed with Functional Block Diagrams

An Overlay Architecture for FPGA-based Industrial Control Systems Designed with Functional Block Diagrams R2-7 SASIMI 26 Proceedings An Overlay Architecture for FPGA-based Industrial Control Systems Designed with Functional Block Diagrams Taisei Segawa, Yuichiro Shibata, Yudai Shirakura, Kenichi Morimoto,

More information

Exploring Automatically Generated Platforms in High Performance FPGAs

Exploring Automatically Generated Platforms in High Performance FPGAs Exploring Automatically Generated Platforms in High Performance FPGAs Panagiotis Skrimponis b, Georgios Zindros a, Ioannis Parnassos a, Muhsen Owaida b, Nikolaos Bellas a, and Paolo Ienne b a Electrical

More information

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures

Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Implementing Logic in FPGA Memory Arrays: Heterogeneous Memory Architectures Steven J.E. Wilton Department of Electrical and Computer Engineering University of British Columbia Vancouver, BC, Canada, V6T

More information

High-Level Synthesis of Programmable Hardware Accelerators Considering Potential Varieties

High-Level Synthesis of Programmable Hardware Accelerators Considering Potential Varieties THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE.,, (VDEC) 113 8656 7 3 1 CREST E-mail: hiroaki@cad.t.u-tokyo.ac.jp, fujita@ee.t.u-tokyo.ac.jp SoC Abstract

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

MOJTABA MAHDAVI Mojtaba Mahdavi DSP Design Course, EIT Department, Lund University, Sweden

MOJTABA MAHDAVI Mojtaba Mahdavi DSP Design Course, EIT Department, Lund University, Sweden High Level Synthesis with Catapult MOJTABA MAHDAVI 1 Outline High Level Synthesis HLS Design Flow in Catapult Data Types Project Creation Design Setup Data Flow Analysis Resource Allocation Scheduling

More information

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014 Not a Complete Product Overview Page 2 Outline Page 3 Petabytes per month Increasing Bandwidth Global IP Traffic Growth

More information

Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm

Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction

More information

International Journal of Computer Sciences and Engineering. Research Paper Volume-6, Issue-2 E-ISSN:

International Journal of Computer Sciences and Engineering. Research Paper Volume-6, Issue-2 E-ISSN: International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-6, Issue-2 E-ISSN: 2347-2693 Implementation Sobel Edge Detector on FPGA S. Nandy 1*, B. Datta 2, D. Datta 3

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

HIGH-LEVEL SYNTHESIS

HIGH-LEVEL SYNTHESIS HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms

More information

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM

ESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM ESE532: System-on-a-Chip Architecture Day 20: April 3, 2017 Pipelining, Frequency, Dataflow Today What drives cycle times Pipelining in Vivado HLS C Avoiding bottlenecks feeding data in Vivado HLS C Penn

More information

Resource-Efficient Scheduling for Partially-Reconfigurable FPGAbased

Resource-Efficient Scheduling for Partially-Reconfigurable FPGAbased Resource-Efficient Scheduling for Partially-Reconfigurable FPGAbased Systems Andrea Purgato: andrea.purgato@mail.polimi.it Davide Tantillo: davide.tantillo@mail.polimi.it Marco Rabozzi: marco.rabozzi@polimi.it

More information

Course Overview Revisited

Course Overview Revisited Course Overview Revisited void blur_filter_3x3( Image &in, Image &blur) { // allocate blur array Image blur(in.width(), in.height()); // blur in the x dimension for (int y = ; y < in.height(); y++) for

More information

FPGA Implementation and Validation of the Asynchronous Array of simple Processors

FPGA Implementation and Validation of the Asynchronous Array of simple Processors FPGA Implementation and Validation of the Asynchronous Array of simple Processors Jeremy W. Webb VLSI Computation Laboratory Department of ECE University of California, Davis One Shields Avenue Davis,

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Vivado HLx Design Entry. June 2016

Vivado HLx Design Entry. June 2016 Vivado HLx Design Entry June 2016 Agenda What is the HLx Design Methodology? New & Early Access features for Connectivity Platforms Creating Differentiated Logic 2 What is the HLx Design Methodology? Page

More information

New Approach for Affine Combination of A New Architecture of RISC cum CISC Processor

New Approach for Affine Combination of A New Architecture of RISC cum CISC Processor Volume 2 Issue 1 March 2014 ISSN: 2320-9984 (Online) International Journal of Modern Engineering & Management Research Website: www.ijmemr.org New Approach for Affine Combination of A New Architecture

More information

Using FPGAs as Microservices

Using FPGAs as Microservices Using FPGAs as Microservices David Ojika, Ann Gordon-Ross, Herman Lam, Bhavesh Patel, Gaurav Kaul, Jayson Strayer (University of Florida, DELL EMC, Intel Corporation) The 9 th Workshop on Big Data Benchmarks,

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

Hello I am Zheng, Hongbin today I will introduce our LLVM based HLS framework, which is built upon the Target Independent Code Generator.

Hello I am Zheng, Hongbin today I will introduce our LLVM based HLS framework, which is built upon the Target Independent Code Generator. Hello I am Zheng, Hongbin today I will introduce our LLVM based HLS framework, which is built upon the Target Independent Code Generator. 1 In this talk I will first briefly introduce what HLS is, and

More information

Overview of ROCCC 2.0

Overview of ROCCC 2.0 Overview of ROCCC 2.0 Walid Najjar and Jason Villarreal SUMMARY FPGAs have been shown to be powerful platforms for hardware code acceleration. However, their poor programmability is the main impediment

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

Introduction to Neural Networks

Introduction to Neural Networks ECE 5775 (Fall 17) High-Level Digital Design Automation Introduction to Neural Networks Ritchie Zhao, Zhiru Zhang School of Electrical and Computer Engineering Rise of the Machines Neural networks have

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

Design Space Exploration Using Parameterized Cores

Design Space Exploration Using Parameterized Cores RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS UNIVERSITY OF WINDSOR Design Space Exploration Using Parameterized Cores Ian D. L. Anderson M.A.Sc. Candidate March 31, 2006 Supervisor: Dr. M. Khalid 1 OUTLINE

More information

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs)

EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Fall 2002 EECS150 - Lec06-FPGA Page 1 Outline What are FPGAs? Why use FPGAs (a short history

More information

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Jin Hee Kim and Jason Anderson FPL 2015 London, UK September 3, 2015 2 Motivation for Synthesizable FPGA Trend towards ASIC design flow Design

More information

Lossless Compression using Efficient Encoding of Bitmasks

Lossless Compression using Efficient Encoding of Bitmasks Lossless Compression using Efficient Encoding of Bitmasks Chetan Murthy and Prabhat Mishra Department of Computer and Information Science and Engineering University of Florida, Gainesville, FL 326, USA

More information

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs?

Outline. EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) FPGA Overview. Why FPGAs? EECS150 - Digital Design Lecture 6 - Field Programmable Gate Arrays (FPGAs) September 12, 2002 John Wawrzynek Outline What are FPGAs? Why use FPGAs (a short history lesson). FPGA variations Internal logic

More information

A Methodology for Predicting Application- Specific Achievable Memory Bandwidth for HW/SW-Codesign

A Methodology for Predicting Application- Specific Achievable Memory Bandwidth for HW/SW-Codesign Göbel, M.; Elhossini, A.; Juurlink, B. A Methodology for Predicting Application- Specific Achievable Memory Bandwidth for HW/SW-Codesign Conference paper Accepted manuscript (Postprint) This version is

More information

Advanced FPGA Design Methodologies with Xilinx Vivado

Advanced FPGA Design Methodologies with Xilinx Vivado Advanced FPGA Design Methodologies with Xilinx Vivado Alexander Jäger Computer Architecture Group Heidelberg University, Germany Abstract With shrinking feature sizes in the ASIC manufacturing technology,

More information

C vs. VHDL: Benchmarking CAESAR Candidates Using High- Level Synthesis and Register- Transfer Level Methodologies

C vs. VHDL: Benchmarking CAESAR Candidates Using High- Level Synthesis and Register- Transfer Level Methodologies C vs. VHDL: Benchmarking CAESAR Candidates Using High- Level Synthesis and Register- Transfer Level Methodologies Ekawat Homsirikamol, William Diehl, Ahmed Ferozpuri, Farnoud Farahmand, and Kris Gaj George

More information

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Big Data Systems on Future Hardware. Bingsheng He NUS Computing Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big

More information

Specializing Hardware for Image Processing

Specializing Hardware for Image Processing Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.

More information

EECS150 - Digital Design Lecture 13 - Accelerators. Recap and Outline

EECS150 - Digital Design Lecture 13 - Accelerators. Recap and Outline EECS150 - Digital Design Lecture 13 - Accelerators Oct. 10, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof. John Wawrzynek)

More information

Advanced Synthesis Techniques

Advanced Synthesis Techniques Advanced Synthesis Techniques Reminder From Last Year Use UltraFast Design Methodology for Vivado www.xilinx.com/ultrafast Recommendations for Rapid Closure HDL: use HDL Language Templates & DRC Constraints:

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

FPGA briefing Part II FPGA development DMW: FPGA development DMW:

FPGA briefing Part II FPGA development DMW: FPGA development DMW: FPGA briefing Part II FPGA development FPGA development 1 FPGA development FPGA development : Domain level analysis (Level 3). System level design (Level 2). Module level design (Level 1). Academical focus

More information

Experience with the NetFPGA Program

Experience with the NetFPGA Program Experience with the NetFPGA Program John W. Lockwood Algo-Logic Systems Algo-Logic.com With input from the Stanford University NetFPGA Group & Xilinx XUP Program Sunday, February 21, 2010 FPGA-2010 Pre-Conference

More information

Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit

Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi

More information

Unit 2: High-Level Synthesis

Unit 2: High-Level Synthesis Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

Technology Dependent Logic Optimization Prof. Kurt Keutzer EECS University of California Berkeley, CA Thanks to S. Devadas

Technology Dependent Logic Optimization Prof. Kurt Keutzer EECS University of California Berkeley, CA Thanks to S. Devadas Technology Dependent Logic Optimization Prof. Kurt Keutzer EECS University of California Berkeley, CA Thanks to S. Devadas 1 RTL Design Flow HDL RTL Synthesis Manual Design Module Generators Library netlist

More information

Polly Polyhedral Optimizations for LLVM

Polly Polyhedral Optimizations for LLVM Polly Polyhedral Optimizations for LLVM Tobias Grosser - Hongbin Zheng - Raghesh Aloor Andreas Simbürger - Armin Grösslinger - Louis-Noël Pouchet April 03, 2011 Polly - Polyhedral Optimizations for LLVM

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089

More information

The Design of Sobel Edge Extraction System on FPGA

The Design of Sobel Edge Extraction System on FPGA The Design of Sobel Edge Extraction System on FPGA Yu ZHENG 1, * 1 School of software, Beijing University of technology, Beijing 100124, China; Abstract. Edge is a basic feature of an image, the purpose

More information

Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis

Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis Theory and Algorithm for Generalized Memory Partitioning in High-Level Synthesis Yuxin Wang ayerwang@pku.edu.cn Peng Li peng.li@pku.edu.cn Jason Cong,3,, cong@cs.ucla.edu Center for Energy-Efficient Computing

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (I)

COSC 6385 Computer Architecture. - Memory Hierarchies (I) COSC 6385 Computer Architecture - Hierarchies (I) Fall 2007 Slides are based on a lecture by David Culler, University of California, Berkley http//www.eecs.berkeley.edu/~culler/courses/cs252-s05 Recap

More information

FPGA Entering the Era of the All Programmable SoC

FPGA Entering the Era of the All Programmable SoC FPGA Entering the Era of the All Programmable SoC Ivo Bolsens, Senior Vice President & CTO Page 1 Moore s Law: The Technology Pipeline Page 2 Industry Debates on Cost Page 3 Design Cost Estimated Chip

More information

ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining

ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining ECE 5775 (Fall 17) High-Level Digital Design Automation More Pipelining Announcements HW 2 due Monday 10/16 (no late submission) Second round paper bidding @ 5pm tomorrow on Piazza Talk by Prof. Margaret

More information

FPGA design with National Instuments

FPGA design with National Instuments FPGA design with National Instuments Rémi DA SILVA Systems Engineer - Embedded and Data Acquisition Systems - MED Region ni.com The NI Approach to Flexible Hardware Processor Real-time OS Application software

More information

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Network Chen Zhang 1, Peng Li 3, Guangyu Sun 1,2, Yijin Guan 1, Bingjun Xiao 3, Jason Cong 1,2,3 1 Peking University 2 PKU/UCLA Joint

More information

II. LITERATURE SURVEY

II. LITERATURE SURVEY Hardware Co-Simulation of Sobel Edge Detection Using FPGA and System Generator Sneha Moon 1, Prof Meena Chavan 2 1,2 Department of Electronics BVUCOE Pune India Abstract: This paper implements an image

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Chip Design for Turbo Encoder Module for In-Vehicle System

Chip Design for Turbo Encoder Module for In-Vehicle System Chip Design for Turbo Encoder Module for In-Vehicle System Majeed Nader Email: majeed@wayneedu Yunrui Li Email: yunruili@wayneedu John Liu Email: johnliu@wayneedu Abstract This paper studies design and

More information

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011

FPGA for Complex System Implementation. National Chiao Tung University Chun-Jen Tsai 04/14/2011 FPGA for Complex System Implementation National Chiao Tung University Chun-Jen Tsai 04/14/2011 About FPGA FPGA was invented by Ross Freeman in 1989 SRAM-based FPGA properties Standard parts Allowing multi-level

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Parallelized Radix-4 Scalable Montgomery Multipliers

Parallelized Radix-4 Scalable Montgomery Multipliers Parallelized Radix-4 Scalable Montgomery Multipliers Nathaniel Pinckney and David Money Harris 1 1 Harvey Mudd College, 301 Platt. Blvd., Claremont, CA, USA e-mail: npinckney@hmc.edu ABSTRACT This paper

More information

Synthesis of VHDL Code for FPGA Design Flow Using Xilinx PlanAhead Tool

Synthesis of VHDL Code for FPGA Design Flow Using Xilinx PlanAhead Tool Synthesis of VHDL Code for FPGA Design Flow Using Xilinx PlanAhead Tool Md. Abdul Latif Sarker, Moon Ho Lee Division of Electronics & Information Engineering Chonbuk National University 664-14 1GA Dekjin-Dong

More information

ESL design with the Agility Compiler for SystemC

ESL design with the Agility Compiler for SystemC ESL design with the Agility Compiler for SystemC SystemC behavioral design & synthesis Steve Chappell & Chris Sullivan Celoxica ESL design portfolio Complete ESL design environment Streaming Video Processing

More information

An Automated flow for Arithmetic Component Generation in Field-Programmable Gate Arrays

An Automated flow for Arithmetic Component Generation in Field-Programmable Gate Arrays An Automated flow for Arithmetic Component Generation in Field-Programmable Gate Arrays ALASTAIR M. SMITH, GEORGE A. CONSTANTINIDES, PETER Y. K. CHEUNG Imperial College London, UK State of the art configurable

More information

Reconfigurable Computing. Introduction

Reconfigurable Computing. Introduction Reconfigurable Computing Tony Givargis and Nikil Dutt Introduction! Reconfigurable computing, a new paradigm for system design Post fabrication software personalization for hardware computation Traditionally

More information

Field Programmable Gate Array (FPGA)

Field Programmable Gate Array (FPGA) Field Programmable Gate Array (FPGA) Lecturer: Krébesz, Tamas 1 FPGA in general Reprogrammable Si chip Invented in 1985 by Ross Freeman (Xilinx inc.) Combines the advantages of ASIC and uc-based systems

More information

FPGA system development What you need to think about. Frédéric Leens, CEO

FPGA system development What you need to think about. Frédéric Leens, CEO FPGA system development What you need to think about Frédéric Leens, CEO About Byte Paradigm 2005 : Founded by 3 ASIC-SoC-FPGA engineers as a Design Center for high-end FPGA and board design. 2007 : GP

More information

Fast FPGA prototyping with Software Development Kit for FPGA (SDK4FPGA)

Fast FPGA prototyping with Software Development Kit for FPGA (SDK4FPGA) Fast FPGA prototyping with Software Development Kit for FPGA (SDK4FPGA) Andrea Suardi cas.ee.ic.ac.uk/projects/sdk4fpga This research has been supported by EPSRC Impact Acceleration grant number EP/K503733/1

More information

ii Copyright c Harikrishna Samala 2009 All Rights Reserved

ii Copyright c Harikrishna Samala 2009 All Rights Reserved ii Copyright c Harikrishna Samala 2009 All Rights Reserved iii Abstract Methodology to Derive Resource Aware Context Adaptable Architectures for Field Programmable Gate Arrays by Harikrishna Samala, Master

More information

Analyzing the Generation and Optimization of an FPGA Accelerator using High Level Synthesis

Analyzing the Generation and Optimization of an FPGA Accelerator using High Level Synthesis Paper Analyzing the Generation and Optimization of an FPGA Accelerator using High Level Synthesis Sebastian Kaltenstadler Ulm University Ulm, Germany sebastian.kaltenstadler@missinglinkelectronics.com

More information

PINE TRAINING ACADEMY

PINE TRAINING ACADEMY PINE TRAINING ACADEMY Course Module A d d r e s s D - 5 5 7, G o v i n d p u r a m, G h a z i a b a d, U. P., 2 0 1 0 1 3, I n d i a Digital Logic System Design using Gates/Verilog or VHDL and Implementation

More information

Performance Verification for ESL Design Methodology from AADL Models

Performance Verification for ESL Design Methodology from AADL Models Performance Verification for ESL Design Methodology from AADL Models Hugues Jérome Institut Supérieur de l'aéronautique et de l'espace (ISAE-SUPAERO) Université de Toulouse 31055 TOULOUSE Cedex 4 Jerome.huges@isae.fr

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,

More information

Exploring Logic Block Granularity for Regular Fabrics

Exploring Logic Block Granularity for Regular Fabrics 1530-1591/04 $20.00 (c) 2004 IEEE Exploring Logic Block Granularity for Regular Fabrics A. Koorapaty, V. Kheterpal, P. Gopalakrishnan, M. Fu, L. Pileggi {aneeshk, vkheterp, pgopalak, mfu, pileggi}@ece.cmu.edu

More information

ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University

ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University ECE 5775 High-Level Digital Design Automation, Fall 2016 School of Electrical and Computer Engineering, Cornell University Optical Flow on FPGA Ian Thompson (ijt5), Joseph Featherston (jgf82), Judy Stephen

More information

Unlocking FPGAs Using High- Level Synthesis Compiler Technologies

Unlocking FPGAs Using High- Level Synthesis Compiler Technologies Unlocking FPGAs Using High- Leel Synthesis Compiler Technologies Fernando Mar*nez Vallina, Henry Styles Xilinx Feb 22, 2015 Why are FPGAs Good Scalable, highly parallel and customizable compute 10s to

More information

Fast and Flexible High-Level Synthesis from OpenCL using Reconfiguration Contexts

Fast and Flexible High-Level Synthesis from OpenCL using Reconfiguration Contexts Fast and Flexible High-Level Synthesis from OpenCL using Reconfiguration Contexts James Coole, Greg Stitt University of Florida Department of Electrical and Computer Engineering and NSF Center For High-Performance

More information

A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation

A Novel Deadlock Avoidance Algorithm and Its Hardware Implementation A ovel Deadlock Avoidance Algorithm and Its Hardware Implementation + Jaehwan Lee and *Vincent* J. Mooney III Hardware/Software RTOS Group Center for Research on Embedded Systems and Technology (CREST)

More information

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation

Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation Multi-Gigahertz Parallel FFTs for FPGA and ASIC Implementation Doug Johnson, Applications Consultant Chris Eddington, Technical Marketing Synopsys 2013 1 Synopsys, Inc. 700 E. Middlefield Road Mountain

More information

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany

Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany Model-Based Design for effective HW/SW Co-Design Alexander Schreiber Senior Application Engineer MathWorks, Germany 2013 The MathWorks, Inc. 1 Agenda Model-Based Design of embedded Systems Software Implementation

More information

Christophe HURIAUX. Embedded Reconfigurable Hardware Accelerators with Efficient Dynamic Reconfiguration

Christophe HURIAUX. Embedded Reconfigurable Hardware Accelerators with Efficient Dynamic Reconfiguration Mid-term Evaluation March 19 th, 2015 Christophe HURIAUX Embedded Reconfigurable Hardware Accelerators with Efficient Dynamic Reconfiguration Accélérateurs matériels reconfigurables embarqués avec reconfiguration

More information