ECE 5775 (Fall 17) High-Level Digital Design Automation. More Pipelining
|
|
- Heather Hawkins
- 6 years ago
- Views:
Transcription
1 ECE 5775 (Fall 17) High-Level Digital Design Automation More Pipelining
2 Announcements HW 2 due Monday 10/16 (no late submission) Second round paper 5pm tomorrow on Piazza Talk by Prof. Margaret Martonosi 4:30pm today PHL 203 on End of Moore s Law Challenges and Opportunities: Computer Architecture Perspectives 5pm instructor office hour cancelled Midterm logistics Exam time: Thursday 10/19 in class (75 mins) Open book, open notes, closed Internet In-class recitation on Tuesday 10/17 Instructor office hour next Thursday moved to Wednesday 10/18, 4:30-5:30pm (Rhodes 320) 1
3 Agenda Midterm coverage Modulo scheduling Case studies on pipelining 2
4 Midterm (20%) Topics covered Specialized computing FPGAs C-based synthesis Control flow graph and SSA Scheduling Resource sharing Pipelining 3
5 Key Topics (1) Algorithm basics Time complexity, esp. big-o notation Graphs Trees, DAGs, topological sort BDDs, timing analysis FPGAs LUTs and LUT mapping C-based synthesis Arbitrary precision and fixed-point types Basic C-to-RTL mapping Key HLS optimizations to improve design performance 4
6 Key Topics (2) Compiler analysis Control data flow graph Dominance relation SSA Scheduling TCS & RCS algorithms: ILP, list scheduling, SDC Operation chaining, frequency/latency/resource constraints Ability to devise a simple scheduling algorithm to optimize certain design metric 5
7 Key Topics (3) Resource sharing Conflict and compatibility graphs Left edge algorithm Ability to determine minimum resource usage in # of functional units and/or registers, given a fixed schedule Pipelining Dependence types Ability to determine minimum II given a code snippet Modulo scheduling concepts: MII, RecMII, ResMII 6
8 Review: Dependence Graph Data dependences of a loop often represented by a dependence graph Forward edges: Intra-iteration (loopindependent) dependences Back edges: Inter-iteration (loop-carried) dependences Edges are annotated with distance values: number of iterations separating the two dependent operations involved [1] [0] v 1 [0] v 2 v 3 [0] [0] [2] Recurrence manifests itself as a circuit in the dependence graph v 4 Edges annotated with distance values 7
9 Pipelining through Modulo Scheduling Dependence graph: LD + Schedule: 0 1 LD + II = ST II = 2 Time (cycle) Iteration 0 LD + ST LD + ST ST LD + ST LD + ST LD ST + Initiation Interval (II) slot 0 slot 1 Steady state (II cycles) Steady state determines both performance resource usage 8
10 Modulo Scheduling A regular form of software pipelining technique Also applies to loop (/ function) pipelining for hardware synthesis Loop iterations (/ function) use the same schedule, which are initiated at a constant rate Advantages of modulo scheduling Easy to analyze Steady state determines performance and resource usage Cost efficient No code or hardware replication Optimization objective: (1) Minimize II under resource constraints or (2) minimize resource usage under II constraint NP-hard in general Optimal polynomial time solution exists without recurrences or resource constraints 9
11 Algorithmic Scheme for Modulo Scheduling Common scheme of heuristic algorithms Find MII (a lower bound on II) = max { ResMII, RecMII) Look for a schedule with given II If a feasible schedule not found, increase II and try again Find MII and set II = MII Look for a schedule Found it? No Increase II Yes 10
12 Calculating Lower Bound of Initiation Interval Minimum possible II (MII) MII = max(resmii, RecMII) A lower bound, not necessary achievable Resource constrained MII (ResMII) ResMII = max i éops(r i ) / Limit(r i )ù OPs(r): number of operations that use resource of type r Limit(r): number of available resources of type r Recurrence constrained MII (RecMII) RecMII = max i élatency(c i ) / Distance(c i )ù Latency(c i ): total latency in dependence circuit c i Distance(c i ): total distance in dependence circuit c i 11
13 Minimum II due to Resource Limits (ResMII) Dependence Reservation tables adders a0 a1 a2 a3 time i0 i1 i2 i3 i0 i1 i2 i0 i1 i0 4 5 i4 i5 i3 i4 i2 i3 i1 i2 0, 1, 2, 3, : time (clock cycles) a0, a1, a2, a3 : available adders i0, i1, i2, : loop iterations adders a0 a1 i0 i0 i1 i1 i0 i0 i2 i2 i3 i3 i1 i1 i2 i2 due to limited resources, cannot initiate iterations less than 2 cycles apart Compute ResMII: Max among all types of resources ResMII = max i éops(r i ) / Limit(r i )ù 12
14 Minimum II due to Recurrences (RecMII) Dependence Schedule (II=2) Dependence Schedule (II=1) [0] a b [1] [1] dependence distance = 1 a b a [0] a b [3] [3] dependence distance = 3 b Assume no operation chaining a b a b a b a b Compute Recurrence Minimum II (RecMII): Max among all circuits of: RecMII = max i élatency(c i ) / Distance(c i )ù Latency(c) : sum of operation latencies along circuit c Distance(c) : sum of dependence distances along circuit c 13
15 SDC-Based Modulo Scheduling SDC-based modulo scheduling Unifies intra-iteration and interiteration scheduling constraints in a single SDC Iterative algorithm with efficient incremental SDC update Loop Find Minimum II Model intra-iteration scheduling constraints Model inter-iteration scheduling constraints SDC feasible? Yes Incremental scheduling No Fail Increase II Schedule 14
16 Example: Modeling Loop-Carried Dependence The dependence between two operations from different iterations is termed inter-iteration (loop-carried) dependence Loop-carried dependence u à v with Dist(u, v) = K s u + Lat u s v + K* II A[i] C[i] ld ld v 1 v 2 II = 2 s 5 s 1 + 2*II for (i = 0; i < N-2; i++) { B[i] = A[i] * C[i]; A[i+2] = B[i] + C[i]; } v 3 + st v 4 v 5 ld v 1 v 3 ld v 2 Dist(v 5, v 1 ) = 2 A[i+2] + v 4 ld ld v 1 v 2 st v 5 v 3 15
17 SDC Constraint Graph The constraint inequalities form a system of difference constraints SDC(X, C) X denotes the variable set C denotes the constraint set The constraint graph G c (V c, E c ) of SDC(X, C) is a weighted directed graph x i Î X : v i Î V c x i x j b k Î C : e(v i, v j ) ÎE c and weight(e) = b k 16
18 Feasiblity Checking SDC Constraint Graph G c X: {x 1, x 2, x 3, x 4, x 5 } v 1 0 x 1 x 2 0 x 1 x 5-1 x 2 x 5 1 x 3 x 1-4 x 1 x 4 3 x 4 x 3-1 v 5-1 v v 2 x 3 x 1-4 x 1 x 4 3 x 4 x 3-1 à 0-2 Contradiction! v 3 An SDC(X, C) is feasible if and only if its constraint graph G c does not contain any negative cycles Feasibility check can be done by solving a single-source shortest path problem 17
19 Case Study: Prefix Sum Prefix sum computes a cumulative sum of a sequence of numbers commonly used in many applications such as radix sort, histogram, etc. void prefixsum ( int in[n], int out[n] ) out[0] = in[0]; for ( int i = 1; i < N; i++ ) { #pragma HLS pipeline II=? out[i] = out[i-1]+ in[i]; } } out[0] = in[0]; out[1] = in[0] + in[1]; out[1] = in[0] + in[1] + in[2]; out[1] = in[0] + in[1] + in[2] + in[3]; 18
20 Prefix Sum: RecMII Loop-carried dependence exists between to reads on out[] Assume chaining is not possible on memory reads (i.e., ld) and writes (i.e., st) due to target cycle time RecMII = 3 in[i] ld 2 out[i-1] ld 1 out[0] = in[0]; for ( int i = 1; i < N; i++ ) out[i] = out[i-1]+ in[i]; out[i] + st ld Load st Store cycle 1 cycle 2 cycle 3 cycle 4 i = 0 ld 1 ld 2 + st i = 1 ld 1 II = 1 + st ld 2 Assume chaining is not possible on memory reads (i.e., ld) and writes (i.e., st) due to cycle time constraint 19
21 Prefix Sum: Code Optimization Introduce an intermediate variable tmp to hold the running sum from the previous in[] values Shorter dependence circuit leads to RecMII = 1 in[i] ld + tmp int tmp = in[0]; for ( int i = 1; i < N; i++ ) { tmp += in[i]; out[i] = tmp; } out[i] st ld Load st Store cycle 1 cycle 2 cycle 3 cycle 4 i = 0 ld + st i = 1 ld + st II = 1 20
22 Case Study: Convolution for Image Processing A common computation of image/video processing is performed over overlapping stencils, termed as convolution Input image frame x3 convolution Output image frame 21
23 Achieving High Throughput with Pipelining for (r = 1; r < R; r++) for (c = 1; c < C; c++) { #pragma HLS pipeline II=? for (i = 0; i < 3; i++) for (j = 0; j < 3; j++) out[r][c] += img[r+i-1][c+j-1] * f[i][j]; } Inner loops (i & j) are automatically unrolled With a 3x3 convolution kernel, 9 pixels are required for calculating the value of one output pixel If the entire input image is stored in an on-chip buffer with two read ports ResMII =? What about RecMII? 22
24 Achieving II=1 for 3x3 Convolution Pixels in line buffer (2 lines stored) Push three pixels into shift register window: one new pixel plus two pixels from line buffer New pixel fetched from input stream or frame buffer in off-chip memory Output pixel produced by one convolution operation 23
25 Resulting Specialized Memory Hierarchy Memory architecture customized for convolution Input pixel stream Flip- Flops Convolve Output pixel stream Processing window Line buffers Frame buffers On-chip SRAMs Frame n-2 Frame n-1 Frame n Off-chip DDR 24
26 HLS Code Snippet 1 LineBuffer<2,C,pixel_t> linebuf; 2 Window<3,3,pixel_t> window; 3 for (int r = 1; r < R+1; r++) { 4 for (int c = 1; c < C+1; c++) { 5 #pragma HLS pipeline II=1 6 pixel_t new_pixel = img[r][c]; 7 // Update shift window 8 window.shift_left(); 9 if (r < R && c < C) { 10 for (int i = 0; i < 2; i++ ) 11 window.insert(buf[i][c]); 12 } 13 else { // zero padding 14 for (int i = 0; i < 2; i++) 15 window.insert(0); 16 } 17 window.insert(new_pixel); 18 // Update line buffer 19 linebuf.shift_up(c); 20 if (r < R && c < C) 21 linebuf[1].insert(c, new_pixel); 22 else // Zero padding 23 linebuf[1].insert(c, 0); 24 // Perform 3x3 convolution 25 out[r-1][c-1] = convolve(window, weights); 26 } 27 } 25
27 Summary Pipelining is one of the most commonly used techniques in HLS to boost performance of the synthesized hardware Recurrences and resource restrictions limit the pipeline throughput Modulo scheduling A regular form of software pipeline technique Also applies to loop pipelining for hardware synthesis NP-hard problem in general 26
28 Acknowledgements These slides contain/adapt materials developed by Prof. Ryan Kastner (UCSD) Prof. Scott Mahlke (UMich) Dr. Stephen Neuendorffer (Xilinx) 27
Topic 12: Cyclic Instruction Scheduling
Topic 12: Cyclic Instruction Scheduling COS 320 Compiling Techniques Princeton University Spring 2015 Prof. David August 1 Overlap Iterations Using Pipelining time Iteration 1 2 3 n n 1 2 3 With hardware
More informationECE 5775 (Fall 17) High-Level Digital Design Automation. More Binding Pipelining
ECE 5775 (Fall 17) High-Level Digital Design Automation More Binding Pipelining Logistics Lab 3 due Friday 10/6 No late penalty for this assignment (up to 3 days late) HW 2 will be posted tomorrow 1 Agenda
More informationEECS 583 Class 13 Software Pipelining
EECS 583 Class 13 Software Pipelining University of Michigan October 29, 2012 Announcements + Reading Material Project proposals» Due Friday, Nov 2, 5pm» 1 paragraph summary of what you plan to work on
More informationECE 5775 High-Level Digital Design Automation Fall More CFG Static Single Assignment
ECE 5775 High-Level Digital Design Automation Fall 2018 More CFG Static Single Assignment Announcements HW 1 due next Monday (9/17) Lab 2 will be released tonight (due 9/24) Instructor OH cancelled this
More informationECE 5775 High-Level Digital Design Automation Fall Fixed-Point Types Analysis of Algorithms
ECE 5775 High-Level Digital Design Automation Fall 2018 Fixed-Point Types Analysis of Algorithms Announcements Lab 1 on CORDIC is released Due Monday 9/10 @11:59am Part-time PhD TA: Hanchen Jin (hj424)
More informationECE 5775 Student-Led Discussions (10/16)
ECE 5775 Student-Led Discussions (10/16) Talks: 18-min talk + 2-min Q&A Adam Macioszek, Julia Currie, Nick Sarkis Sparse Matrix Vector Multiplication Nick Comly, Felipe Fortuna, Mark Li, Serena Krech Matrix
More informationMapping-Aware Constrained Scheduling for LUT-Based FPGAs
Mapping-Aware Constrained Scheduling for LUT-Based FPGAs Mingxing Tan, Steve Dai, Udit Gupta, Zhiru Zhang School of Electrical and Computer Engineering Cornell University High-Level Synthesis (HLS) for
More informationCE431 Parallel Computer Architecture Spring Compile-time ILP extraction Modulo Scheduling
CE431 Parallel Computer Architecture Spring 2018 Compile-time ILP extraction Modulo Scheduling Nikos Bellas Electrical and Computer Engineering University of Thessaly Parallel Computer Architecture 1 Readings
More informationLegUp HLS Tutorial for Microsemi PolarFire Sobel Filtering for Image Edge Detection
LegUp HLS Tutorial for Microsemi PolarFire Sobel Filtering for Image Edge Detection This tutorial will introduce you to high-level synthesis (HLS) concepts using LegUp. You will apply HLS to a real problem:
More informationLecture 19. Software Pipelining. I. Example of DoAll Loops. I. Introduction. II. Problem Formulation. III. Algorithm.
Lecture 19 Software Pipelining I. Introduction II. Problem Formulation III. Algorithm I. Example of DoAll Loops Machine: Per clock: 1 read, 1 write, 1 (2-stage) arithmetic op, with hardware loop op and
More informationUnit 2: High-Level Synthesis
Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis
More informationHigh-Level Synthesis
High-Level Synthesis 1 High-Level Synthesis 1. Basic definition 2. A typical HLS process 3. Scheduling techniques 4. Allocation and binding techniques 5. Advanced issues High-Level Synthesis 2 Introduction
More informationHigh-Level Synthesis (HLS)
Course contents Unit 11: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 11 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis
More informationIntroduction to High level. Synthesis
Introduction to High level Synthesis LISHA/UFSC Prof. Dr. Antônio Augusto Fröhlich Tiago Rogério Mück http://www.lisha.ufsc.br/~guto June 2007 http://www.lisha.ufsc.br/ 1 What is HLS? Example: High level
More informationLecture 21. Software Pipelining & Prefetching. I. Software Pipelining II. Software Prefetching (of Arrays) III. Prefetching via Software Pipelining
Lecture 21 Software Pipelining & Prefetching I. Software Pipelining II. Software Prefetching (of Arrays) III. Prefetching via Software Pipelining [ALSU 10.5, 11.11.4] Phillip B. Gibbons 15-745: Software
More informationECE 5775 (Fall 17) High-Level Digital Design Automation. Static Single Assignment
ECE 5775 (Fall 17) High-Level Digital Design Automation Static Single Assignment Announcements HW 1 released (due Friday) Student-led discussions on Tuesday 9/26 Sign up on Piazza: 3 students / group Meet
More informationLecture 6 MIPS R4000 and Instruction Level Parallelism. Computer Architectures S
Lecture 6 MIPS R4000 and Instruction Level Parallelism Computer Architectures 521480S Case Study: MIPS R4000 (200 MHz, 64-bit instructions, MIPS-3 instruction set) 8 Stage Pipeline: first half of fetching
More informationECE 2300 Digital Logic & Computer Organization. More Verilog Finite State Machines
ECE 2300 Digital Logic & Computer Organization Spring 2018 More Verilog Finite Machines Lecture 8: 1 Prelim 1, Thursday 3/1, 1:25pm, 75 mins Arrive early by 1:20pm Review sessions Announcements Monday
More informationResolve: Generation of High Performance Sorting Architectures from High Level Synthesis
Resolve: Generation of High Performance Sorting Architectures from High Level Synthesis Janarbek Matai, Dustin Richmond, Dajung Lee, Zac Blair, Qiongzhi Wu, Amin Abazari, Ryan Kastner Department of Computer
More informationMultiple Instruction Issue. Superscalars
Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths
More informationInstruction Scheduling. Software Pipelining - 3
Instruction Scheduling and Software Pipelining - 3 Department of Computer Science and Automation Indian Institute of Science Bangalore 560 012 NPTEL Course on Principles of Compiler Design Instruction
More informationMany ways to build logic out of MOSFETs
Many ways to build logic out of MOSFETs pass transistor logic (most similar to the first switch logic we saw) static CMOS logic (what we saw last time) dynamic CMOS logic Clock=0 precharges X through the
More informationECE 5775 (Fall 17) High-Level Digital Design Automation. Fixed-Point Types Analysis of Algorithms
ECE 5775 (Fall 17) High-Level Digital Design Automation Fixed-Point Types Analysis of Algorithms Announcements Lab 1 on CORDIC is released Due Friday 9/8 @11:59am MEng TA: Jeffrey Witz (jmw483) Office
More informationIntroduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation
Introduction to Electronic Design Automation Model of Computation Jie-Hong Roland Jiang 江介宏 Department of Electrical Engineering National Taiwan University Spring 03 Model of Computation In system design,
More informationHIGH-LEVEL SYNTHESIS
HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms
More informationHardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University
Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis
More informationCourse Overview Revisited
Course Overview Revisited void blur_filter_3x3( Image &in, Image &blur) { // allocate blur array Image blur(in.width(), in.height()); // blur in the x dimension for (int y = ; y < in.height(); y++) for
More informationESE532: System-on-a-Chip Architecture. Today. Message. Real Time. Real-Time Tasks. Real-Time Guarantees. Real Time Demands Challenges
ESE532: System-on-a-Chip Architecture Day 9: October 1, 2018 Real Time Real Time Demands Challenges Today Algorithms Architecture Disciplines to achieve Penn ESE532 Fall 2018 -- DeHon 1 Penn ESE532 Fall
More informationLab 3 Sequential Logic for Synthesis. FPGA Design Flow.
Lab 3 Sequential Logic for Synthesis. FPGA Design Flow. Task 1 Part 1 Develop a VHDL description of a Debouncer specified below. The following diagram shows the interface of the Debouncer. The following
More informationParallel Programming for FPGAs. Ryan Kastner and Stephen Neuendorffer
Parallel Programming for FPGAs Ryan Kastner and Stephen Neuendorffer Sunday 31 st July, 2016 When someone says, I want a programming language in which I need only say what I wish done, give him a lollipop.
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationECE 498AL. Lecture 12: Application Lessons. When the tires hit the road
ECE 498AL Lecture 12: Application Lessons When the tires hit the road 1 Objective Putting the CUDA performance knowledge to work Plausible strategies may or may not lead to performance enhancement Different
More informationScan Algorithm Effects on Parallelism and Memory Conflicts
Scan Algorithm Effects on Parallelism and Memory Conflicts 11 Parallel Prefix Sum (Scan) Definition: The all-prefix-sums operation takes a binary associative operator with identity I, and an array of n
More informationLab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm 1 Introduction COordinate
More informationIn-order vs. Out-of-order Execution. In-order vs. Out-of-order Execution
In-order vs. Out-of-order Execution In-order instruction execution instructions are fetched, executed & committed in compilergenerated order if one instruction stalls, all instructions behind it stall
More informationHello I am Zheng, Hongbin today I will introduce our LLVM based HLS framework, which is built upon the Target Independent Code Generator.
Hello I am Zheng, Hongbin today I will introduce our LLVM based HLS framework, which is built upon the Target Independent Code Generator. 1 In this talk I will first briefly introduce what HLS is, and
More informationESE532: System-on-a-Chip Architecture. Today. Message. Graph Cycles. Preclass 1. Reminder
ESE532: System-on-a-Chip Architecture Day 8: September 26, 2018 Spatial Computations Today Graph Cycles (from Day 7) Accelerator Pipelines FPGAs Zynq Computational Capacity 1 2 Message Custom accelerators
More informationVHDL for Synthesis. Course Description. Course Duration. Goals
VHDL for Synthesis Course Description This course provides all necessary theoretical and practical know how to write an efficient synthesizable HDL code through VHDL standard language. The course goes
More informationBinary Adders. Ripple-Carry Adder
Ripple-Carry Adder Binary Adders x n y n x y x y c n FA c n - c 2 FA c FA c s n MSB position Longest delay (Critical-path delay): d c(n) = n d carry = 2n gate delays d s(n-) = (n-) d carry +d sum = 2n
More informationAn Approach for Integrating Basic Retiming and Software Pipelining
An Approach for Integrating Basic Retiming and Software Pipelining Noureddine Chabini Department of Electrical and Computer Engineering Royal Military College of Canada PB 7000 Station Forces Kingston
More informationIntel HLS Compiler: Fast Design, Coding, and Hardware
white paper Intel HLS Compiler Intel HLS Compiler: Fast Design, Coding, and Hardware The Modern FPGA Workflow Authors Melissa Sussmann HLS Product Manager Intel Corporation Tom Hill OpenCL Product Manager
More informationEECS 583 Class 15 Register Allocation
EECS 583 Class 15 Register Allocation University of Michigan November 2, 2011 Announcements + Reading Material Midterm exam: Monday, Nov 14?» Could also do Wednes Nov 9 (next week!) or Wednes Nov 16 (2
More informationCS 61C: Great Ideas in Computer Architecture. Multilevel Caches, Cache Questions
CS 61C: Great Ideas in Computer Architecture Multilevel Caches, Cache Questions Instructor: Alan Christopher 7/14/2014 Summer 2014 -- Lecture #12 1 Great Idea #3: Principle of Locality/ Memory Hierarchy
More informationEfficient Hardware Acceleration on SoC- FPGA using OpenCL
Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA
More informationFPGA for Software Engineers
FPGA for Software Engineers Course Description This course closes the gap between hardware and software engineers by providing the software engineer all the necessary FPGA concepts and terms. The course
More informationSoftware Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors
Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Francisco Barat, Murali Jayapala, Pieter Op de Beeck and Geert Deconinck K.U.Leuven, Belgium. {f-barat, j4murali}@ieee.org,
More informationSoftware Pipelining by Modulo Scheduling. Philip Sweany University of North Texas
Software Pipelining by Modulo Scheduling Philip Sweany University of North Texas Overview Instruction-Level Parallelism Instruction Scheduling Opportunities for Loop Optimization Software Pipelining Modulo
More informationEECS 151/251A Fall 2017 Digital Design and Integrated Circuits. Instructor: John Wawrzynek and Nicholas Weaver. Lecture 14 EE141
EECS 151/251A Fall 2017 Digital Design and Integrated Circuits Instructor: John Wawrzynek and Nicholas Weaver Lecture 14 EE141 Outline Parallelism EE141 2 Parallelism Parallelism is the act of doing more
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationTutorial Outline. 9:00 am 10:00 am Pre-RTL Simulation Framework: Aladdin. 8:30 am 9:00 am! Introduction! 10:00 am 10:30 am! Break!
Tutorial Outline Time Topic! 8:30 am 9:00 am! Introduction! 9:00 am 10:00 am Pre-RTL Simulation Framework: Aladdin 10:00 am 10:30 am! Break! 10:30 am 11:00 am! Workload Characterization Tool: WIICA! 11:00
More informationEECS150 - Digital Design Lecture 09 - Parallelism
EECS150 - Digital Design Lecture 09 - Parallelism Feb 19, 2013 John Wawrzynek Spring 2013 EECS150 - Lec09-parallel Page 1 Parallelism Parallelism is the act of doing more than one thing at a time. Optimization
More informationMOJTABA MAHDAVI Mojtaba Mahdavi DSP Design Course, EIT Department, Lund University, Sweden
High Level Synthesis with Catapult MOJTABA MAHDAVI 1 Outline High Level Synthesis HLS Design Flow in Catapult Data Types Project Creation Design Setup Data Flow Analysis Resource Allocation Scheduling
More informationECE 5775 (Fall 17) High-Level Digital Design Automation. Specialized Computing
ECE 5775 (Fall 7) High-Level Digital Design Automation Specialized Computing Announcements All students enrolled in CMS & Piazza Vivado HLS tutorial on Tuesday 8/29 Install an SSH client (mobaxterm or
More informationCS521 \ Notes for the Final Exam
CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )
More informationTECH. 9. Code Scheduling for ILP-Processors. Levels of static scheduling. -Eligible Instructions are
9. Code Scheduling for ILP-Processors Typical layout of compiler: traditional, optimizing, pre-pass parallel, post-pass parallel {Software! compilers optimizing code for ILP-processors, including VLIW}
More informationAnnouncements. Midterm 2 next Thursday, 6-7:30pm, 277 Cory Review session on Tuesday, 6-7:30pm, 277 Cory Homework 8 due next Tuesday Labs: project
- Fall 2002 Lecture 20 Synthesis Sequential Logic Announcements Midterm 2 next Thursday, 6-7:30pm, 277 Cory Review session on Tuesday, 6-7:30pm, 277 Cory Homework 8 due next Tuesday Labs: project» Teams
More informationPage # Let the Compiler Do it Pros and Cons Pros. Exploiting ILP through Software Approaches. Cons. Perhaps a mixture of the two?
Exploiting ILP through Software Approaches Venkatesh Akella EEC 270 Winter 2005 Based on Slides from Prof. Al. Davis @ cs.utah.edu Let the Compiler Do it Pros and Cons Pros No window size limitation, the
More informationLecture 20: High-level Synthesis (1)
Lecture 20: High-level Synthesis (1) Slides courtesy of Deming Chen Some slides are from Prof. S. Levitan of U. of Pittsburgh Outline High-level synthesis introduction High-level synthesis operations Scheduling
More informationMidterm Exam ECE 448 Spring 2019 Wednesday, March 6 15 points
Midterm Exam ECE 448 Spring 2019 Wednesday, March 6 15 points Instructions: Zip all your deliverables into an archive .zip and submit it through Blackboard no later than Wednesday, March 6,
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationMaterial covered. Areas/Topics covered. Logistics. What to focus on. Areas/Topics covered 5/14/2015. COS 226 Final Exam Review Spring 2015
COS 226 Final Exam Review Spring 2015 Ananda Gunawardena (guna) guna@cs.princeton.edu guna@princeton.edu Material covered The exam willstressmaterial covered since the midterm, including the following
More informationTopic 14: Scheduling COS 320. Compiling Techniques. Princeton University Spring Lennart Beringer
Topic 14: Scheduling COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer 1 The Back End Well, let s see Motivating example Starting point Motivating example Starting point Multiplication
More informationComputer Architecture and Engineering. CS152 Quiz #5. April 23rd, Professor Krste Asanovic. Name: Answer Key
Computer Architecture and Engineering CS152 Quiz #5 April 23rd, 2009 Professor Krste Asanovic Name: Answer Key Notes: This is a closed book, closed notes exam. 80 Minutes 8 Pages Not all questions are
More information1-D Time-Domain Convolution. for (i=0; i < outputsize; i++) { y[i] = 0; for (j=0; j < kernelsize; j++) { y[i] += x[i - j] * h[j]; } }
Introduction: Convolution is a common operation in digital signal processing. In this project, you will be creating a custom circuit implemented on the Nallatech board that exploits a significant amount
More informationECE 5730 Memory Systems
ECE 5730 Memory Systems Spring 2009 Off-line Cache Content Management Lecture 7: 1 Quiz 4 on Tuesday Announcements Only covers today s lecture No office hours today Lecture 7: 2 Where We re Headed Off-line
More informationCS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. VLIW, Vector, and Multithreaded Machines
CS152 Computer Architecture and Engineering CS252 Graduate Computer Architecture VLIW, Vector, and Multithreaded Machines Assigned 3/24/2019 Problem Set #4 Due 4/5/2019 http://inst.eecs.berkeley.edu/~cs152/sp19
More informationA Lost Cycles Analysis for Performance Prediction using High-Level Synthesis
A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:
More informationNotes slides from before lecture. CSE 21, Winter 2017, Section A00. Lecture 4 Notes. Class URL:
Notes slides from before lecture CSE 21, Winter 2017, Section A00 Lecture 4 Notes Class URL: http://vlsicad.ucsd.edu/courses/cse21-w17/ Notes slides from before lecture Notes January 23 (1) HW2 due tomorrow
More informationCOVER SHEET: Problem#: Points
EEL 4712 Midterm 3 Spring 2017 VERSION 1 Name: UFID: Sign here to give permission for your test to be returned in class, where others might see your score: IMPORTANT: Please be neat and write (or draw)
More informationESE532: System-on-a-Chip Architecture. Today. Message. Clock Cycle BRAM
ESE532: System-on-a-Chip Architecture Day 20: April 3, 2017 Pipelining, Frequency, Dataflow Today What drives cycle times Pipelining in Vivado HLS C Avoiding bottlenecks feeding data in Vivado HLS C Penn
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests Mingxing Tan 1 2, Gai Liu 1, Ritchie Zhao 1, Steve Dai 1, Zhiru Zhang 1 1 Computer Systems Laboratory, Electrical and Computer
More informationRUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC. Zoltan Baruch
RUN-TIME RECONFIGURABLE IMPLEMENTATION OF DSP ALGORITHMS USING DISTRIBUTED ARITHMETIC Zoltan Baruch Computer Science Department, Technical University of Cluj-Napoca, 26-28, Bariţiu St., 3400 Cluj-Napoca,
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationDSP Mapping, Coding, Optimization
DSP Mapping, Coding, Optimization On TMS320C6000 Family using CCS (Code Composer Studio) ver 3.3 Started with writing a simple C code in the class, from scratch Project called First, written for C6713
More informationLecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 9: Case Study MIPS R4000 and Introduction to Advanced Pipelining Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.SP96 1 Review: Evaluating Branch Alternatives Two part solution: Determine
More informationESE532: System-on-a-Chip Architecture. Today. Message. Fixed Point. Operator Sizes. Observe. Fixed Point
ESE532: System-on-a-Chip Architecture Day 27: December 5, 2018 Representation and Precision Fixed Point Today Errors from Limited Precision Precision Analysis / Interval Arithmetic Floating Point If time
More informationSynthesis at different abstraction levels
Synthesis at different abstraction levels System Level Synthesis Clustering. Communication synthesis. High-Level Synthesis Resource or time constrained scheduling Resource allocation. Binding Register-Transfer
More informationEECS2021. EECS2021 Computer Organization. EECS2021 Computer Organization. Morgan Kaufmann Publishers September 14, 2016
EECS2021 Computer Organization Fall 2015 The slides are based on the publisher slides and contribution from Profs Amir Asif and Peter Lian The slides will be modified, annotated, explained on the board,
More informationSpecializing Hardware for Image Processing
Lecture 6: Specializing Hardware for Image Processing Visual Computing Systems So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.
More informationCS 152, Spring 2011 Section 10
CS 152, Spring 2011 Section 10 Christopher Celio University of California, Berkeley Agenda Stuff (Quiz 4 Prep) http://3dimensionaljigsaw.wordpress.com/2008/06/18/physics-based-games-the-new-genre/ Intel
More informationGoals for Today. CS 133: Databases. Final Exam: Logistics. Why Use a DBMS? Brief overview of course. Course evaluations
Goals for Today Brief overview of course CS 133: Databases Course evaluations Fall 2018 Lec 27 12/13 Course and Final Review Prof. Beth Trushkowsky More details about the Final Exam Practice exercises
More informationECE 5775 (Fall 17) High-Level Digital Design Automation. Hardware-Software Co-Design
ECE 5775 (Fall 17) High-Level Digital Design Automation Hardware-Software Co-Design Announcements Midterm graded You can view your exams during TA office hours (Fri/Wed 11am-noon, Rhodes 312) Second paper
More informationA Statically Scheduled Time- Division-Multiplexed Networkon-Chip for Real-Time Systems
A Statically Scheduled Time- Division-Multiplexed Networkon-Chip for Real-Time Systems Martin Schoeberl, Florian Brandner, Jens Sparsø, Evangelia Kasapaki Technical University of Denamrk 1 Real-Time Systems
More informationCSE 417 Practical Algorithms. (a.k.a. Algorithms & Computational Complexity)
CSE 417 Practical Algorithms (a.k.a. Algorithms & Computational Complexity) Outline for Today > Course Goals & Overview > Administrivia > Greedy Algorithms Why study algorithms? > Learn the history of
More informationComputer Science 246 Computer Architecture
Computer Architecture Spring 2009 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Compiler ILP Static ILP Overview Have discussed methods to extract ILP from hardware Why can t some of these
More informationLecture 10: Static ILP Basics. Topics: loop unrolling, static branch prediction, VLIW (Sections )
Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 4.4) 1 Static vs Dynamic Scheduling Arguments against dynamic scheduling: requires complex structures
More informationInternational Training Workshop on FPGA Design for Scientific Instrumentation and Computing November 2013
2499-20 International Training Workshop on FPGA Design for Scientific Instrumentation and Computing 11-22 November 2013 High-Level Synthesis: how to improve FPGA design productivity RINCON CALLE Fernando
More informationVerilog for High Performance
Verilog for High Performance Course Description This course provides all necessary theoretical and practical know-how to write synthesizable HDL code through Verilog standard language. The course goes
More informationECE 595Z Digital Systems Design Automation
ECE 595Z Digital Systems Design Automation Anand Raghunathan, raghunathan@purdue.edu How do you design chips with over 1 Billion transistors? Human designer capability grows far slower than Moore s law!
More informationESE535: Electronic Design Automation. Today. LUT Mapping. Simplifying Structure. Preclass: Cover in 4-LUT? Preclass: Cover in 4-LUT?
ESE55: Electronic Design Automation Day 7: February, 0 Clustering (LUT Mapping, Delay) Today How do we map to LUTs What happens when IO dominates Delay dominates Lessons for non-luts for delay-oriented
More informationCprE 488 Embedded Systems Design. Exam 2 Review
CprE 488 Embedded Systems Design Exam 2 Review Joseph Zambreno Electrical and Computer Engineering Iowa State University www.ece.iastate.edu/~zambreno rcl.ece.iastate.edu This is the end. My only friend,
More informationLecture 1: CS/ECE 3810 Introduction
Lecture 1: CS/ECE 3810 Introduction Today s topics: Why computer organization is important Logistics Modern trends 1 Why Computer Organization 2 Image credits: uber, extremetech, anandtech Why Computer
More informationHigh Level, high speed FPGA programming
Opportunity: FPAs High Level, high speed FPA programming Wim Bohm, Bruce Draper, Ross Beveridge, Charlie Ross, Monica Chawathe Colorado State University Reconfigurable Computing technology High speed at
More informationNotes slides from before lecture. CSE 21, Winter 2017, Section A00. Lecture 10 Notes. Class URL:
Notes slides from before lecture CSE 21, Winter 2017, Section A00 Lecture 10 Notes Class URL: http://vlsicad.ucsd.edu/courses/cse21-w17/ Notes slides from before lecture Notes February 13 (1) HW5 is due
More informationThis material exempt per Department of Commerce license exception TSU. Improving Performance
This material exempt per Department of Commerce license exception TSU Performance Outline Adding Directives Latency Manipulating Loops Throughput Performance Bottleneck Summary Performance 13-2 Performance
More informationECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html
More information