Lecture 10: Vivado C to IP HLS. James C. Hoe Department of ECE Carnegie Mellon University
|
|
- Oliver Nichols
- 5 years ago
- Views:
Transcription
1 Lecture 10: Vivado C to IP HLS James C. Hoe Department of ECE Carnegie Mellon University F17 L10 S1, James C. Hoe, CMU/ECE/CALCM, 2017
2 Housekeeping Your goal today: learn how to tell Vivado HLS what you really want and understand what Vivado HLS is telling you Notices Handout #4: lab 2, due noon, 10/6 3.5 weeks to project proposal Readings Ch 15, The Zynq Book (skim Ch 14) Vivado Design Suite User Guide: High Level Synthesis (UG902) F17 L10 S2, James C. Hoe, CMU/ECE/CALCM, 2017
3 Tortoise Tortoise and Hare delivers exact optimal implementation to a fully specified objective (functional + tuning) perfection takes time say last 10% of quality takes up 90% of the time Hare only gets to 90% quality delivers the design 10 times faster This hare doesn t take a nap after one design F17 L10 S3, James C. Hoe, CMU/ECE/CALCM, 2017
4 The Design Race power hey, it works out of time 90% Good Enough Box educated guess best possible 1/perf F17 L10 S4, James C. Hoe, CMU/ECE/CALCM, 2017
5 Why the Hare Wins In real design projects don t always know exact target initially can t land first shot on target anyway good enough really is good enough hitting schedule is everything show at COMDEX in Nov or bust in Dec There are a lot more rabbits than turtles in this world; there are not enough turtles in this world Even more turkeys... but that s a different class F17 L10 S5, James C. Hoe, CMU/ECE/CALCM, 2017 All characters appearing in this story are fictitious. Any resemblance to real persons, living or dead, is purely coincidental.
6 F17 L10 S6, James C. Hoe, CMU/ECE/CALCM, 2017 Vivado HLS
7 Function to IP, not Program to HW **Object of design is an IP module** Designer still in charge (garbage in, garbage out) specify functionality as algorithm (in C) specify structure as pragmas (beyond C) set optimization constraints (beyond C) Offload bit and cycle level design/opt. to tools Vivado HLS (formerly AutoESL; formerly UCLA) never mind all of C (what s main( )? what malloc?) never mind all usages of allowed subset (all loops okay, but static ones actually work well) what else beyond C might a HW designer need (types, interface, structural hints) F17 L10 S7, James C. Hoe, CMU/ECE/CALCM, 2017
8 What does Vivado see? int fibi(int n) { int last=1; int lastlast=0; int temp; if (n==0) return 0; if (n==1) return 1; for(;n>1;n--) { temp=last+lastlast; lastlast=last; last=temp; } } return temp; F17 L10 S8, James C. Hoe, CMU/ECE/CALCM, 2017
9 Function to IP Block n ap_clk ap_rst ap_start Don t look inside yet fibi ap_ready ap_done ap_idle What if I want multiple outputs? F17 L10 S9, James C. Hoe, CMU/ECE/CALCM, 2017 int fibi(int n) {.... return...; }
10 AP_CTRL_HS Block Protocol ap_clk I ap_rst ap_start ap_idle 1 1 O ap_ready ap_done F17 L10 S10, James C. Hoe, CMU/ECE/CALCM, 2017 inputs consumed output valid ready for new ap_start
11 Function Invocation: Latency vs Throughput minimum initiation interval latency start ready done start ready done start ready done F17 L10 S11, James C. Hoe, CMU/ECE/CALCM, 2017
12 Other Block Control Options ap_ctrl_chain separate input producer and output consumer ap_continue:driven by the consumer to backpressure the block and producer IF a block reaches done AND ap_continue is deasserted, the block will hold ap_done and keep output valid until ap_continue is asserted AXI compatible port interfaces software on ARM interacts with the block using fxn call like interfaces (input, output, start, etc.) IP specific.h and routines generated automatically F17 L10 S12, James C. Hoe, CMU/ECE/CALCM, 2017
13 Scalar I/O Port Timing By default (ap_none) input ports should be stable between ap_start and ap_ready output port is valid when ap_done 3 asynchronous handshake options on input ap_vldonly: consumes only if input valid ap_ackonly: signals back when input consumed ap_hs: ap_vld + ap_ack HLS s job to follow protocol F17 L10 S13, James C. Hoe, CMU/ECE/CALCM, 2017 n ap_vld ap_ack
14 Pass by Reference Arguments void fibi(int *n, int *fib) { int last=1; int lastlast=0; int temp; int nn=*n; if (nn==0) { *fib=0; *n=0; return; } if (nn==1) { *fib=1; *n=0; return; } for(;nn>1;nn--) { temp=last+lastlast; lastlast=last; last=temp; } } *fib=last; *n=lastlast; F17 L10 S14, James C. Hoe, CMU/ECE/CALCM, 2017
15 Pass by Reference I/O n_i ap_clk ap_rst ap_start Don t look inside yet fib n_o ap_ready ap_done ap_idle They are not really pointers do not evaluate *(fib+1) or fib except to pretend to be a fifo F17 L10 S15, James C. Hoe, CMU/ECE/CALCM, 2017 void fibi(int *n, int *fib) {.... *n in RHS and LHS; *fib in LHS only.... } used before assigned
16 All I/O Options Fig 1 49, Vivado Design Suite User Guide: High Level Synthesis F17 L10 S16, James C. Hoe, CMU/ECE/CALCM, 2017
17 Array Arguments #define N (1<<10) void D2XPY (double Y[N], double X[N]) { for(i=0; i<n; i++) { Y[i]=2*X[i]+Y[i]; } } X_q0[63:0] X_ce0 X_addr0[9:0] F17 L10 S17, James C. Hoe, CMU/ECE/CALCM, 2017 *could ask to use separate read and write ports Y_q0[63:0] Y_ce0 Y_we0 Y_addr0[9:0]
18 Array Arg Options By default, array args become BRAM ports array must be fixed size can use 2 ports for bandwidth or split read/write If array arg is accessed always consecutively AND only either read or written can become ap_fifo port i.e., no addresses, just push or pop Array args can also become AXI or a generic bus master ports Scheduler handles port sharing and dynamic delays F17 L10 S18, James C. Hoe, CMU/ECE/CALCM, 2017
19 Time to Look Inside n fibi ap_clk ap_rst ap_start ap_ready ap_done ap_idle F17 L10 S19, James C. Hoe, CMU/ECE/CALCM, 2017
20 MMM (yet again) void mmm(char A[N][N], char B[N][N], short C[N][N) { for(int i=0; i<n; i++) { for(int j=0; j<n; j++) { C[i][j]=0; for(int k=0; k<n; k++) { C[i][j] += A[i][k]*B[k][j]; } } } keep it simple } F17 L10 S20, James C. Hoe, CMU/ECE/CALCM, 2017 N 2 by 8b BRAM N 2 by 8b BRAM BRAM Rd BRAM Rd mmm BRAM Rd/Wr N 2 by 8b BRAM Same example as Zynq Book Tutorial 3
21 Structural Pragma: Pipelining Fully elaborate scope (e.g., unroll loops) Find minimum iteration interval (II) schedule II >= num stages a resource instance is used II >= RAW hazard distance E.g., to pipeline C[i][j]+=A[i][k]*B[k][j]; RAW hazard, II>=3 rd0 A rd0 B A*B rd0 C rd0 A rd0 B F17 L10 S21, James C. Hoe, CMU/ECE/CALCM, 2017 accum A*B rd0 C rd0 A rd0 B wr0 C accum A*B rd0 C rd0 A rd0 B wr0 C accum A*B rd0 C wr0 C accum structural conflict, II>=2 (II>=1 if 2 port) wr0 C
22 HLS Analysis and Visualization // Zynq Book Tutorial 3, Sol#2 for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { C[i][j]=0; for(int k=0; k<5; k++) { #pragma HLS PIPELINE C[i][j] += A[i][k]*B[k][j]; F17 L10 S22, James C. Hoe, CMU/ECE/CALCM, 2017 [Vivado HLS Screenshots]
23 Design by Trial and Error // Zynq Book Tutorial 3, Sol#3 for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { C[i][j]=0; #pragma HLS PIPELINE for(int k=0; k<5; k++) { C[i][j] += A[i][k]*B[k][j]; F17 L10 S23, James C. Hoe, CMU/ECE/CALCM, 2017 [Vivado HLS Screenshots]
24 Design by Trial and Error // Zynq Book Tutorial 3, Sol#4 #program HLS ARRAY_RESHAPE variable=a, dim=2 #program HLS ARRAY_RESHAPE variable=b, dim=1 for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { C[i][j]=0; #pragma HLS PIPELINE for(int k=0; k<5; k++) { C[i][j] += A[i][k]*B[k][j]; F17 L10 S24, James C. Hoe, CMU/ECE/CALCM, 2017 A and B reshaped to read entire row/column at a time? What if N>>5? [Vivado HLS Screenshots]
25 Recall from Last Time for(k= for(i= for(i= for(j= for(j= GET C[i][j] for(k= GET C[i][j] for(i= for(j= GET C[i][j] parallel kernel pipelines fully unrolled inner loops F17 L10 S25, James C. Hoe, CMU/ECE/CALCM, 2017
26 With Algo. Rewrite (Option 1) From here we can play with pragmas to sensibly widen concurrency if needed // assume C initialized to 0 for(int k=0; k<5; k++) for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { #pragma HLS PIPELINE C[i][j]+= A[i][k]*B[k][j]; can fix by disable flattening F17 L10 S26, James C. Hoe, CMU/ECE/CALCM, 2017 [Vivado HLS Screenshots]
27 With Algo. Rewrite (Option 2) for(int i=0; i<5; i++) { for(int j=0; j<5; j++) { short Ctemp=0; for(int k=0; k<5; k++) #pragma HLS PIPELINE Ctemp += A[i][k]*B[k][j]; C[i][j]=Ctemp; can fix by disable flattening F17 L10 S27, James C. Hoe, CMU/ECE/CALCM, 2017 HLS figured out forwarding [Vivado HLS Screenshots]
28 Loop Unroll (full and partial) amortize loop control overhead increase loop body size, hence ILP and scheduling flexibility Loop Merge combine loop bodies of independent loops of same control improve parallelism and scheduling Loop Flatten streamline loop nest control reduce start/finish stutter F17 L10 S28, James C. Hoe, CMU/ECE/CALCM, 2017 Pragma Crib Sheet: Loops 4 iter 2iter (unroll by2) 2x (2 iter) fully unrolled 2+2 iter 2 iter merged 4 iter longer steadystate
29 Map multiple arrays in same BRAM no perf loss if no scheduling conflicts Reshape change BRAM aspect ratio to widen ports higher bandwidth on consecutive addresses Partition Pragma Crib Sheet: Arrays map 1 array to multiple BRAMs multiple independent ports if no bank conflicts addr data F17 L10 S29, James C. Hoe, CMU/ECE/CALCM, 2017 A lot more you can control; must read UG902
30 Design by Exploration reference algorithm & testbench algorithm for synthesis pragmas When this takes only minutes, a little trial anderror is okay (just a little!!!!) co simulation validation HLS & analysis good enough yes no F17 L10 S30, James C. Hoe, CMU/ECE/CALCM, 2017 RTL RTL backend not good enough after backend
31 Putting it in context (from last time) Why hardware design is hard reason #1: low level abstraction reason #2: unrestricted design freedom reason #3: massive concurrency C to HW (i.e., C to RTL) compiler bridges the gap between functionality and implementation fill in the details below the functional abstraction make good decisions when filling in the details extract parallelism from a sequential specification Vivado does its part fast and without mistakes F17 L10 S31, James C. Hoe, CMU/ECE/CALCM, 2017
32 F17 L10 S32, James C. Hoe, CMU/ECE/CALCM, 2017 Parting Thoughts Vivado doesn t turn program into HW Vivado doesn t turn programmer into HW designer Multifaceted benefits to HW designer algo. development/debug/validate in SW pragma steering (no RTL hacking, machine tuning) fast analysis and visualization data type support it is about more than adding double to Verilog built in, stylized IP interfaces integration with the rest of Vivado and Zynq!! We are entering a new era for FPGAs
33 Vivado Software Defined SoC Screenshot, page 24, SDSoC Environment Getting Started (UG1028) F17 L10 S33, James C. Hoe, CMU/ECE/CALCM, 2017
Lecture 10: Vivado C to IP HLS. Housekeeping
18 643 Lecture 10: Vivado C to IP HLS James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L10 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: learn how to tell Vivado
More information81920**slide. 1Developing the Accelerator Using HLS
81920**slide - 1Developing the Accelerator Using HLS - 82038**slide Objectives After completing this module, you will be able to: Describe the high-level synthesis flow Describe the capabilities of the
More informationLecture 11: OpenCL and Altera OpenCL. James C. Hoe Department of ECE Carnegie Mellon University
18 643 Lecture 11: OpenCL and Altera OpenCL James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L11 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: understand Altera
More informationLecture 8: Abstractions for HW. James C. Hoe Department of ECE Carnegie Mellon University
18 643 Lecture 8: Abstractions for HW James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L08 S1, James C. Hoe, CMU/ECE/CALCM, 2017 18 643 F17 L08 S2, James C. Hoe, CMU/ECE/CALCM, 2017
More informationImproving Area and Resource Utilization Lab
Lab Workbook Introduction This lab introduces various techniques and directives which can be used in Vivado HLS to improve design performance as well as area and resource utilization. The design under
More informationInternational Training Workshop on FPGA Design for Scientific Instrumentation and Computing November 2013
2499-20 International Training Workshop on FPGA Design for Scientific Instrumentation and Computing 11-22 November 2013 High-Level Synthesis: how to improve FPGA design productivity RINCON CALLE Fernando
More informationNEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES
NEW FPGA DESIGN AND VERIFICATION TECHNIQUES MICHAL HUSEJKO IT-PES-ES Design: Part 1 High Level Synthesis (Xilinx Vivado HLS) Part 2 SDSoC (Xilinx, HLS + ARM) Part 3 OpenCL (Altera OpenCL SDK) Verification:
More informationLecture 7: Structural RTL Design. Housekeeping
18 643 Lecture 7: Structural RTL Design James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L07 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: think about what you
More informationLecture 16: Cache in Context (Uniprocessor) James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 16: Cache in Context (Uniprocessor) James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L16 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping understand
More informationThis material exempt per Department of Commerce license exception TSU. Improving Performance
This material exempt per Department of Commerce license exception TSU Performance Outline Adding Directives Latency Manipulating Loops Throughput Performance Bottleneck Summary Performance 13-2 Performance
More informationVivado HLx Design Entry. June 2016
Vivado HLx Design Entry June 2016 Agenda What is the HLx Design Methodology? New & Early Access features for Connectivity Platforms Creating Differentiated Logic 2 What is the HLx Design Methodology? Page
More informationLecture 13: Bus and I/O. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 13: Bus and I/O James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L13 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping take first peek outside of the
More informationESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM
ESE532: System-on-a-Chip Architecture Day 20: April 3, 2017 Pipelining, Frequency, Dataflow Today What drives cycle times Pipelining in Vivado HLS C Avoiding bottlenecks feeding data in Vivado HLS C Penn
More informationHigh-Level Synthesis: Accelerating Alignment Algorithm using SDSoC
High-Level Synthesis: Accelerating Alignment Algorithm using SDSoC Steven Derrien & Simon Rokicki The objective of this lab is to present how High-Level Synthesis (HLS) can be used to accelerate a given
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationLecture 15: Cache Design (in Isolation) James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 15: Cache Design (in Isolation) James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L15 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping recover from Spring
More informationLecture 25: Busses. A Typical Computer Organization
S 09 L25-1 18-447 Lecture 25: Busses James C. Hoe Dept of ECE, CMU April 27, 2009 Announcements: Project 4 due this week (no late check off) HW 4 due today Handouts: Practice Final Solutions A Typical
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More informationSimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels
SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels Ahmed Sanaullah, Chen Yang, Daniel Crawley and Martin C. Herbordt Department of Electrical and Computer Engineering, Boston University The Intel
More informationLecture 4: Modern FPGA Programmability: PR and SoC. Housekeeping
18 643 Lecture 4: Modern FPGA Programmability: PR and SoC James C. Hoe Department of ECE Carnegie Mellon University 18 643 F17 L04 S1, James C. Hoe, CMU/ECE/CALCM, 2017 Housekeeping Your goal today: appreciate
More informationLecture 22: 1 Lecture Worth of Parallel Programming Primer. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 22: 1 Lecture Worth of Parallel Programming Primer James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L22 S1, James C. Hoe, CMU/ECE/CALCM, 2018 18 447 S18 L22 S2, James
More informationVivado Design Suite Tutorial: High-Level Synthesis
Vivado Design Suite Tutorial: Notice of Disclaimer The information disclosed to you hereunder (the "Materials") is provided solely for the selection and use of Xilinx products. To the maximum extent permitted
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationCan High-Level Synthesis Compete Against a Hand-Written Code in the Cryptographic Domain? A Case Study
Can High-Level Synthesis Compete Against a Hand-Written Code in the Cryptographic Domain? A Case Study Ekawat Homsirikamol & Kris Gaj George Mason University USA Project supported by NSF Grant #1314540
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationHigh Level Synthesis of Cryptographic Hardware. Jeremy Trimble ECE 646
High Level Synthesis of Cryptographic Hardware Jeremy Trimble ECE 646 High Level Synthesis Synthesize (FPGA) hardware using software programming languages: C / C++, Domain specific Languages ( DSL ) Typical
More informationDSP Mapping, Coding, Optimization
DSP Mapping, Coding, Optimization On TMS320C6000 Family using CCS (Code Composer Studio) ver 3.3 Started with writing a simple C code in the class, from scratch Project called First, written for C6713
More informationOptimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.
Optimizing HW/SW Partition of a Complex Embedded Systems Simon George November 2015 Zynq-7000 All Programmable SoC HP ACP GP Page 2 Zynq UltraScale+ MPSoC Page 3 HW/SW Optimization Challenges application()
More informationEnergy aware transprecision computing
17-20 July 2018 NiPS Summer School 2018 University of Perugia, Italy Co-Funded by the H2020 Framework Programme of the European Union Energy aware transprecision computing FPGA programming using arbitrary
More informationLecture 25: Synchronization. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 25: Synchronization James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L25 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping be introduced to synchronization
More informationECE1387 Exercise 3: Using the LegUp High-level Synthesis Framework
ECE1387 Exercise 3: Using the LegUp High-level Synthesis Framework 1 Introduction and Motivation This lab will give you an overview of how to use the LegUp high-level synthesis framework. In LegUp, you
More informationHEAD HardwarE Accelerated Deduplication
HEAD HardwarE Accelerated Deduplication Final Report CS710 Computing Acceleration with FPGA December 9, 2016 Insu Jang Seikwon Kim Seonyoung Lee Executive Summary A-Z development of deduplication SW version
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationApple LLVM GPU Compiler: Embedded Dragons. Charu Chandrasekaran, Apple Marcello Maggioni, Apple
Apple LLVM GPU Compiler: Embedded Dragons Charu Chandrasekaran, Apple Marcello Maggioni, Apple 1 Agenda How Apple uses LLVM to build a GPU Compiler Factors that affect GPU performance The Apple GPU compiler
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationLecture 11: Interrupt and Exception. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 11: Interrupt and Exception James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L11 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping first peek outside
More informationEECS 470 Lab 6. SystemVerilog. Department of Electrical Engineering and Computer Science College of Engineering. (University of Michigan)
EECS 470 Lab 6 SystemVerilog Department of Electrical Engineering and Computer Science College of Engineering University of Michigan Thursday, October. 18 th, 2018 Thursday, October. 18 th, 2018 1 / Overview
More informationLecture: Pipeline Wrap-Up and Static ILP
Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Multicycle
More informationAdvanced optimizations of cache performance ( 2.2)
Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped
More informationEEL 4783: HDL in Digital System Design
EEL 4783: HDL in Digital System Design Lecture 9: Coding for Synthesis (cont.) Prof. Mingjie Lin 1 Code Principles Use blocking assignments to model combinatorial logic. Use nonblocking assignments to
More informationLecture: Static ILP. Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)
Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationLegUp HLS Tutorial for Microsemi PolarFire Sobel Filtering for Image Edge Detection
LegUp HLS Tutorial for Microsemi PolarFire Sobel Filtering for Image Edge Detection This tutorial will introduce you to high-level synthesis (HLS) concepts using LegUp. You will apply HLS to a real problem:
More informationCreating a Processor System Lab
Lab Workbook Introduction This lab introduces a design flow to generate a IP-XACT adapter from a design using Vivado HLS and using the generated IP-XACT adapter in a processor system using IP Integrator
More informationModern Processor Architectures. L25: Modern Compiler Design
Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions
More informationHigh Performance Computing and Programming, Lecture 3
High Performance Computing and Programming, Lecture 3 Memory usage and some other things Ali Dorostkar Division of Scientific Computing, Department of Information Technology, Uppsala University, Sweden
More informationESL design with the Agility Compiler for SystemC
ESL design with the Agility Compiler for SystemC SystemC behavioral design & synthesis Steve Chappell & Chris Sullivan Celoxica ESL design portfolio Complete ESL design environment Streaming Video Processing
More informationCS 31: Intro to Systems Caching. Kevin Webb Swarthmore College March 24, 2015
CS 3: Intro to Systems Caching Kevin Webb Swarthmore College March 24, 205 Reading Quiz Abstraction Goal Reality: There is no one type of memory to rule them all! Abstraction: hide the complex/undesirable
More informationThe Power of Streams on the SRC MAP. Wim Bohm Colorado State University. RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RESERVED.
The Power of Streams on the SRC MAP Wim Bohm Colorado State University RSS!2006 Copyright 2006 SRC Computers, Inc. ALL RIGHTS RSRV. MAP C Pure C runs on the MAP Generated code: circuits Basic blocks in
More informationECE 5775 Student-Led Discussions (10/16)
ECE 5775 Student-Led Discussions (10/16) Talks: 18-min talk + 2-min Q&A Adam Macioszek, Julia Currie, Nick Sarkis Sparse Matrix Vector Multiplication Nick Comly, Felipe Fortuna, Mark Li, Serena Krech Matrix
More informationComputer Generation of IP Cores
A I n Computer Generation of IP Cores Peter Milder (ECE, Carnegie Mellon) James Hoe (ECE, Carnegie Mellon) Markus Püschel (CS, ETH Zürich) addfxp #(16, 1) add15282(.a(a69),.b(a70),.clk(clk),.q(t45)); addfxp
More informationComputer Systems C S Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College
Computer Systems C S 0 7 Cynthia Lee Today s materials adapted from Kevin Webb at Swarthmore College 2 Today s Topics TODAY S LECTURE: Caching ANNOUNCEMENTS: Assign6 & Assign7 due Friday! 6 & 7 NO late
More informationLecture 19: Memory Hierarchy: Cache Design. Recap: Basic Cache Parameters
S 09 L19-1 18-447 Lecture 19: Memory Hierarchy: Cache Design James C. Hoe Dept of ECE, CMU April 6, 2009 Announcements: Ckpt 1 bonus reminder Graded midterms You are invited to attend Amdahl's Law in the
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationCS 103 Lecture 4 Slides
1 CS 103 Lecture 4 Slides Algorithms Mark Redekopp ARRAYS 2 3 Need for Arrays If I want to keep the score of 100 players in a game I could declare a separate variable to track each one s score: int player1
More informationOpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen
OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,
More informationIntroduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines
Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi
More informationHigh-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication
High-Level Synthesis Optimization for Blocked Floating-Point Matrix Multiplication Erik H. D Hollander Electronics and Information Systems Department Ghent University, Ghent, Belgium Erik.DHollander@ugent.be
More informationLab 1: Using the LegUp High-level Synthesis Framework
Lab 1: Using the LegUp High-level Synthesis Framework 1 Introduction and Motivation This lab will give you an overview of how to use the LegUp high-level synthesis framework. In LegUp, you can compile
More information19.1. Unit 19. OpenMP Library for Parallelism
19.1 Unit 19 OpenMP Library for Parallelism 19.2 Overview of OpenMP A library or API (Application Programming Interface) for parallelism Requires compiler support (make sure the compiler you use supports
More informationECE 5730 Memory Systems
ECE 5730 Memory Systems Spring 2009 Off-line Cache Content Management Lecture 7: 1 Quiz 4 on Tuesday Announcements Only covers today s lecture No office hours today Lecture 7: 2 Where We re Headed Off-line
More informationYet Another Implementation of CoRAM Memory
Dec 7, 2013 CARL2013@Davis, CA Py Yet Another Implementation of Memory Architecture for Modern FPGA-based Computing Shinya Takamaeda-Yamazaki, Kenji Kise, James C. Hoe * Tokyo Institute of Technology JSPS
More informationLab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm 1 Introduction COordinate
More informationConcurrent Programming Introduction
Concurrent Programming Introduction Frédéric Haziza Department of Computer Systems Uppsala University Ericsson - Fall 2007 Outline 1 Good to know 2 Scenario 3 Definitions 4 Hardware 5 Classical
More informationECE 5775 (Fall 17) High-Level Digital Design Automation. Hardware-Software Co-Design
ECE 5775 (Fall 17) High-Level Digital Design Automation Hardware-Software Co-Design Announcements Midterm graded You can view your exams during TA office hours (Fri/Wed 11am-noon, Rhodes 312) Second paper
More informationInvestigation of High-Level Synthesis tools applicability to data acquisition systems design based on the CMS ECAL Data Concentrator Card example
Journal of Physics: Conference Series PAPER OPEN ACCESS Investigation of High-Level Synthesis tools applicability to data acquisition systems design based on the CMS ECAL Data Concentrator Card example
More informationReducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip
Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off
More informationLecture 17: Address Translation. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 17: Address Translation James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L17 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping see Virtual Memory into
More informationAdvanced OpenACC. Steve Abbott November 17, 2017
Advanced OpenACC Steve Abbott , November 17, 2017 AGENDA Expressive Parallelism Pipelining Routines 2 The loop Directive The loop directive gives the compiler additional information
More informationCSCI 104 Runtime Complexity. Mark Redekopp David Kempe
1 CSCI 104 Runtime Complexity Mark Redekopp David Kempe 2 Runtime It is hard to compare the run time of an algorithm on actual hardware Time may vary based on speed of the HW, etc. The same program may
More informationImproving Energy Efficiency with Special-Purpose Accelerators
Improving Energy Efficiency with Special-Purpose Accelerators Alexandru Fiodorov Embedded Computing Systems Submission date: June 2013 Supervisor: Magnus Jahre, IDI Norwegian University of Science and
More informationLegUp: Accelerating Memcached on Cloud FPGAs
0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are
More informationOverview of OpenMP. Unit 19. Using OpenMP. Parallel for. OpenMP Library for Parallelism
19.1 Overview of OpenMP 19.2 Unit 19 OpenMP Library for Parallelism A library or API (Application Programming Interface) for parallelism Requires compiler support (make sure the compiler you use supports
More informationComputer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović
Computer Architecture and Engineering CS152 Quiz #4 April 11th, 2012 Professor Krste Asanović Name: ANSWER SOLUTIONS This is a closed book, closed notes exam. 80 Minutes 17 Pages Notes: Not all questions
More informationPerformance Issues in Parallelization. Saman Amarasinghe Fall 2010
Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationApplication parallelization for multi-core Android devices
SOFTWARE & SYSTEMS DESIGN Application parallelization for multi-core Android devices Jos van Eijndhoven Vector Fabrics BV The Netherlands http://www.vectorfabrics.com MULTI-CORE PROCESSORS: HERE TO STAY
More informationLecture 10: Static ILP Basics. Topics: loop unrolling, static branch prediction, VLIW (Sections )
Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 4.4) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationExploring OpenCL Memory Throughput on the Zynq
Exploring OpenCL Memory Throughput on the Zynq Technical Report no. 2016:04, ISSN 1652-926X Chalmers University of Technology Bo Joel Svensson bo.joel.svensson@gmail.com Abstract The Zynq platform combines
More informationESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder
ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators
More informationAutotuning. John Cavazos. University of Delaware UNIVERSITY OF DELAWARE COMPUTER & INFORMATION SCIENCES DEPARTMENT
Autotuning John Cavazos University of Delaware What is Autotuning? Searching for the best code parameters, code transformations, system configuration settings, etc. Search can be Quasi-intelligent: genetic
More informationLecture 8: Data Hazard and Resolution. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 8: Data Hazard and Resolution James C. Hoe Department of ECE Carnegie ellon University 18 447 S18 L08 S1, James C. Hoe, CU/ECE/CALC, 2018 Your goal today Housekeeping detect and resolve
More informationWeeks 6&7: Procedures and Parameter Passing
CS320 Principles of Programming Languages Weeks 6&7: Procedures and Parameter Passing Jingke Li Portland State University Fall 2017 PSU CS320 Fall 17 Weeks 6&7: Procedures and Parameter Passing 1 / 45
More informationCS61C Machine Structures. Lecture 3 Introduction to the C Programming Language. 1/23/2006 John Wawrzynek. www-inst.eecs.berkeley.
CS61C Machine Structures Lecture 3 Introduction to the C Programming Language 1/23/2006 John Wawrzynek (www.cs.berkeley.edu/~johnw) www-inst.eecs.berkeley.edu/~cs61c/ CS 61C L03 Introduction to C (1) Administrivia
More informationCSCI 104 Runtime Complexity. Mark Redekopp David Kempe Sandra Batista
1 CSCI 104 Runtime Complexity Mark Redekopp David Kempe Sandra Batista 2 Motivation You are given a large data set with n = 500,000 genetic markers for 5000 patients and you want to examine that data for
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationSDAccel Development Environment User Guide
SDAccel Development Environment User Guide Features and Development Flows Revision History The following table shows the revision history for this document. Date Version Revision 05/13/2016 2016.1 Added
More information借助 SDSoC 快速開發複雜的嵌入式應用
借助 SDSoC 快速開發複雜的嵌入式應用 May 2017 What Is C/C++ Development System-level Profiling SoC application-like programming Tools and IP for system-level profiling Specify C/C++ Functions for Acceleration Full System
More informationLecture 14: Memory Hierarchy. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 14: Memory Hierarchy James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L14 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Your goal today Housekeeping understand memory system
More informationAdvanced Computer Architecture
18-742 Advanced Computer Architecture Test 2 April 14, 1998 Name (please print): Instructions: DO NOT OPEN TEST UNTIL TOLD TO START YOU HAVE UNTIL 12:20 PM TO COMPLETE THIS TEST The exam is composed of
More informationIntroduction to Embedded Systems. Lab Logistics
Introduction to Embedded Systems CS/ECE 6780/5780 Al Davis Today s topics: lab logistics interrupt synchronization reentrant code 1 CS 5780 Lab Logistics Lab2 Status Wed: 3/11 teams have completed their
More informationBeiHang Short Course, Part 5: Pandora Smart IP Generators
BeiHang Short Course, Part 5: Pandora Smart IP Generators James C. Hoe Department of ECE Carnegie Mellon University Collaborator: Michael Papamichael J. C. Hoe, CMU/ECE/CALCM, 0, BHSC L5 s CONNECT NoC
More informationECEN 449: Microprocessor System Design Department of Electrical and Computer Engineering Texas A&M University. Laboratory Exercise #1 Using the Vivado
ECEN 449: Microprocessor System Design Department of Electrical and Computer Engineering Texas A&M University Prof. Sunil P Khatri (Lab exercise created and tested by Ramu Endluri, He Zhou, Andrew Douglass
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationFloating-Point Design with Xilinx s Vivado HLS
Floating-Point Design with Xilinx s Vivado HLS by James Hrica Senior Staff Software Applications Engineer Xilinx, Inc. jhrica@xilinx.com 28 Xcell Journal Fourth Quarter 2012 The ability to easily implement
More informationCache Memories /18-213/15-513: Introduction to Computer Systems 12 th Lecture, October 5, Today s Instructor: Phil Gibbons
Cache Memories 15-213/18-213/15-513: Introduction to Computer Systems 12 th Lecture, October 5, 2017 Today s Instructor: Phil Gibbons 1 Today Cache memory organization and operation Performance impact
More informationSystemC Synthesis Standard: Which Topics for Next Round? Frederic Doucet Qualcomm Atheros, Inc
SystemC Synthesis Standard: Which Topics for Next Round? Frederic Doucet Qualcomm Atheros, Inc 2/29/2016 Frederic Doucet, Qualcomm Atheros, Inc 2 What to Standardize Next Benefit of current standard: Provides
More informationVirtual Memory Primitives for User Programs
Virtual Memory Primitives for User Programs Andrew W. Appel & Kai Li Department of Computer Science Princeton University Presented By: Anirban Sinha (aka Ani), anirbans@cs.ubc.ca 1 About the Authors Andrew
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationOptimising for the p690 memory system
Optimising for the p690 memory Introduction As with all performance optimisation it is important to understand what is limiting the performance of a code. The Power4 is a very powerful micro-processor
More informationAn Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware
An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh Computer Systems Laboratory School of Electrical
More information