ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
1 ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
Mingxing Tan (1,2), Gai Liu (1), Ritchie Zhao (1), Steve Dai (1), Zhiru Zhang (1)
(1) Computer Systems Laboratory, Electrical and Computer Engineering, Cornell University
(2) Google Inc.
2 Outline
Loop pipelining in HLS
Irregular loop nests
ElasticFlow architecture
ElasticFlow synthesis
Experimental results
3 Loop Pipelining
An important optimization in HLS: create a static schedule for the loop body so that successive loop iterations overlap in execution. The objective is to minimize the initiation interval (II), the number of clock cycles between the starts of consecutive iterations; in steady state, II=1 yields a throughput of one iteration per cycle.

    for (i = 0; i < 4; i++) {
      for (j = 0; j < 4; j++) {
        #pragma pipeline
        acc += A[j] * i;
      }
    }

[Figure: pipeline schedule for j=0..3, each iteration loading A, multiplying, and accumulating; with II=1, a new iteration starts every cycle.]
4 Pipelining Outer Loop
Option 1: pipeline only the inner loop.

    for (i = 0; i < 4; i++) {
      for (j = 0; j < 4; j++) {
        #pragma pipeline
        acc += A[j] * i;
      }
    }

Throughput: one inner loop iteration per cycle; requires a fixed inner loop bound.

Option 2: pipeline the outer loop by fully unrolling the inner loop.

    for (i = 0; i < 4; i++) {
      #pragma pipeline
      acc += A[0] * i;
      acc += A[1] * i;
      acc += A[2] * i;
      acc += A[3] * i;
    }

Throughput: one outer loop iteration per cycle.
5 Pipelining Irregular Loop Nest
An irregular loop nest contains one or more dynamic-bound inner loops: the number of inner loop iterations varies at run time. Such nests access less-regular data structures (e.g., sparse matrices, graphs, and hash tables) that are common in emerging applications.

Hash lookup example -- how do we pipeline this loop nest to achieve one lookup per cycle?

    for (k : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);
      p = hashtbl[hv].keys;
      while (p && p->key != k)
        p = p->next;
      format_output(p);
    }

[Figure: hash table with buckets 0..N, each bucket holding a chain of keys.]
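A behavioral sketch (ours, not the authors' HLS code; `build_hashtbl` and `lookup` are made-up names) of why this nest is irregular: each lookup's trip count through the chain is a property of the data, known only at run time.

```python
# Chained hash table: the inner loop's trip count depends on the length
# of the chain the key hashes to, which varies per lookup.

def build_hashtbl(pairs, num_buckets):
    """Bucket array, each bucket a list of (key, value) nodes."""
    buckets = [[] for _ in range(num_buckets)]
    for k, v in pairs:
        buckets[hash(k) % num_buckets].append((k, v))
    return buckets

def lookup(buckets, key):
    """Walk the chain for `key`; returns (value, trip_count)."""
    chain = buckets[hash(key) % len(buckets)]
    trips = 0
    for k, v in chain:              # dynamic-bound inner loop
        trips += 1
        if k == key:
            return v, trips
    return None, trips
```

Two lookups into the same table can take very different numbers of inner iterations, which is exactly what defeats a single static schedule for the outer loop.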
6 Aggressively Unrolling Inner Loop
Replace the while loop with a for loop bounded by the worst-case chain length and fully unroll it, so the outer loop can be pipelined:

    for (k : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);           // A
      p = hashtbl[hv].keys;
      for (j = 0; j < 6; j++) {       // B
        #pragma unroll
        if (p && p->key != k)
          p = p->next;
      }
      format_output(p);               // C
    }

[Figure: pipeline schedule in which every outer iteration executes all six unrolled steps j=0..5, achieving 1 lookup/cycle.]
7 Issues with Aggressive Unrolling
1. The worst-case bound may not be statically determinable.
2. The worst-case bound can be far larger than the common case (e.g., 99 vs. 2).
3. The result is an unnecessarily deep pipeline that is very inefficient in area.
[Figure: schedule in which every outer iteration pays for stages j=0..99 even when only a few are needed.]
8 Need for a New Approach
Irregular loop nests are prevalent: graph processing, scientific computation, image processing, etc. Naive approaches yield either low throughput or large area. We need resource-efficient pipelining of the outer loop of an irregular loop nest, targeting one outer loop iteration per cycle.
9 ElasticFlow Concept
ElasticFlow is an architecture, with associated synthesis techniques, that effectively accelerates irregular loop nests. It transforms the irregular loop nest into a multi-stage dataflow pipeline and dynamically distributes different outer-loop instances of the dynamic-bound inner loop to one or more processing units, so inner loops execute in a pipelined fashion across different outer loop iterations.
10 ElasticFlow Architecture
Each dynamic-bound inner loop is mapped to an application-specific loop processing array (LPA). An LPA contains one or more loop processing units (LPUs); each LPU executes an inner loop instance to completion, which automatically handles inner-loop carried dependences.
[Figure: stage A feeds a Distributor, which sends <i, val> tuples to LPU 1..K in the LPA for stage B; a Collector gathers <i, val> results and forwards them to stage C.]
11 Distributor and Collector
Distributor: dynamically distributes inner loop instances to the LPUs.
Collector: collects results from the LPUs and acts as a reorder buffer (ROB) to ensure that results are committed to the next stage in order.
[Figure: stages A, B, C with the Distributor, LPA, and Collector between A and C.]
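The collector's in-order commit can be sketched as follows (our construction; `commit_in_order` is a made-up name): results arrive from the LPUs out of order, are buffered in the ROB, and drain strictly in outer-iteration order.

```python
def commit_in_order(arrivals, total):
    """arrivals: (outer_iteration, result) pairs in completion order.
    Buffers out-of-order results and returns them in iteration order."""
    rob = {}          # outer iteration -> buffered result
    head = 0          # next iteration to commit
    committed = []
    for i, result in arrivals:
        rob[i] = result
        while head in rob:              # drain every in-order-ready result
            committed.append(rob.pop(head))
            head += 1
    assert head == total, "some iterations never completed"
    return committed
```

For example, arrivals [(2, 'c'), (0, 'a'), (3, 'd'), (1, 'b')] commit as ['a', 'b', 'c', 'd'].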
12 ElasticFlow on Hash Lookup
Stage A (hashing), stage B (the chain walk, mapped to LPUs specialized for B), and stage C (output formatting):

    for (k : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);          // A
      p = hashtbl[hv].keys;
      while (p && p->key != k)       // B
        p = p->next;
      format_output(p);              // C
    }

ElasticFlow dynamically overlaps inner loops across outer loop iterations to achieve a throughput of one outer loop iteration per cycle.
13 Execution with Single LPU
With a single LPU for stage B, execution in stages A and C can overlap in time, but inner loop iterations execute serially on stage B. Stages A and C repeatedly stall waiting for B, so throughput is bottlenecked by the inner loop latency in stage B.
[Figure: schedule for i=4..7 showing stalls in stages A and C while each inner loop occupies the lone LPU.]
14 Execution with Multiple LPUs
With multiple LPUs for stage B, inner loops are dynamically scheduled across the LPUs, so several outer iterations (i=4..7) make progress in stage B simultaneously and the stalls seen with a single LPU disappear.
[Figure: side-by-side schedules for a single LPU vs. four LPUs (LPU 1..4) for stage B.]
15 Dynamic Scheduling
Static scheduling of inner loop instances to LPUs is inefficient in throughput and resource utilization: the latency variation across different inner loops causes many stalls and idle cycles. A dynamic scheduling policy mitigates the effect of this unbalanced workload.
[Figure: schedules for i=4..7 on LPU 1..4 under dynamic vs. static scheduling.]
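The stall/idle argument can be illustrated with a toy simulation (our construction, not the paper's model; function names are made-up, and issue bandwidth is ignored). Static scheduling binds outer iteration i to LPU i % K at compile time; dynamic scheduling sends each inner loop instance to whichever LPU frees up first.

```python
import heapq

def static_makespan(latencies, k):
    """Compile-time binding: iteration i always runs on LPU i % k."""
    finish = [0] * k
    for i, lat in enumerate(latencies):
        finish[i % k] += lat          # queued behind its LPU's earlier work
    return max(finish)

def dynamic_makespan(latencies, k):
    """Run-time binding: each instance goes to the earliest-free LPU."""
    free = [0] * k                    # min-heap of LPU free times
    heapq.heapify(free)
    for lat in latencies:
        t = heapq.heappop(free)       # earliest-free LPU
        heapq.heappush(free, t + lat)
    return max(free)
```

With an unbalanced workload such as latencies [9, 1, 1, 1, 9, 1, 1, 1] on K=2 LPUs, static binding finishes in 20 cycles while dynamic binding finishes in 12.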
16 Multiple Dynamic-Bound Inner Loops
Example: database join, with two dynamic-bound inner loops B and D.

    for (i = 0; i < num_keys; i++) {
      #pragma pipeline
      // A: look up hashtbl1
      // B: dynamic-bound loop
      while (p && p->key != k)
        p = p->next;
      // C: look up hashtbl2
      // D: dynamic-bound loop
      while (q && q->key != k)
        q = q->next;
      // E: merge results
    }

Architecture with dedicated LPAs: each LPA is dedicated to a particular inner loop.
[Figure: stages A, C, E connected through Distributor/Collector pairs to one LPA of slpus for loop B and another for loop D, passing <i, val> tuples.]
17 Issues with Dedicated LPAs
If loop B incurs a much longer average latency than loop D, the LPA for loop D sits idle much of the time, resulting in poor resource utilization.
[Figure: executing dbjoin on dedicated LPUs -- slpa B keeps its three slpus busy on i_B=0..5 while slpa D finishes i_D=0..5 quickly and then idles.]
18 LPA Sharing
An LPA can be shared among multiple inner loops.
slpu: single-loop processing unit, dedicated to one loop.
mlpu: multi-loop processing unit, shared among multiple loops.
slpa: single-loop processing array, consisting of multiple slpus for a particular loop.
mlpa: multi-loop processing array, consisting of multiple mlpus, each shared among loops.
[Figure: architecture with shared LPUs -- stages A and C feed a Distributor with <s, i, val> tuples (s identifies the source loop); shared B/D mlpus execute either loop; a Collector forwards <s, i, val> results toward stage E. Compare mlpa_{B,D} vs. separate slpas.]
19 Execution with Shared LPUs
The mlpa improves resource utilization and performance by reducing pipeline stalls under unbalanced workloads. In the dbjoin example, the dedicated slpas leave slpa D idle while slpa B is the bottleneck; the shared mlpa_{B,D} keeps all units busy by running B and D instances on the same pool, and it even requires fewer LPUs (four mlpus vs. three slpus per loop).
[Figure: schedules of dbjoin on dedicated LPAs vs. the shared mlpa.]
20 ElasticFlow Synthesis
Maps an irregular loop nest to the ElasticFlow architecture:
1. Partition the loop nest into multiple stages.
2. Identify inner loop candidates to form the LPAs.
3. Synthesize these inner loops into slpus and mlpus.
Goal: optimize the LPU allocation to meet the expected throughput. Two questions per LPA: (1) how many LPUs? (2) shared or not shared?
[Figure: the dbjoin loop nest (stages A-E) mapped onto a shared mlpa_{B,D} behind a Distributor and Collector.]
21 slpu Allocation
Definitions:
TP: expected number of outer loop iterations per cycle.
II_i: achievable initiation interval (II) of inner loop i.
L_i: latency in cycles of a single iteration of loop i.
B_i: common-case bound of inner loop i (from profiling).

The common-case latency of one instance of inner loop i is II_i * (B_i - 1) + L_i cycles. To achieve the expected throughput, the slpa must hide that latency, which requires

    U_i = ceil( (II_i * (B_i - 1) + L_i) * TP )

slpus -- the number of outer loop iterations that must be simultaneously in flight.
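The allocation rule in code (the ceiling is our reading of the slide's bracket notation, since LPU counts are integers; the function name is made-up):

```python
from math import ceil

def slpus_needed(ii, lat, bound, tp):
    """U_i = ceil((II_i * (B_i - 1) + L_i) * TP): the common-case latency
    of one inner loop instance, times the target outer-loop throughput,
    equals the number of instances that must be in flight at once."""
    common_case_latency = ii * (bound - 1) + lat
    return ceil(common_case_latency * tp)
```

For instance, an inner loop with II=1, per-iteration latency 3, and common-case bound 2 takes 1*(2-1)+3 = 4 cycles per instance; sustaining one outer iteration per cycle (TP=1) therefore needs 4 slpus, while TP=0.5 needs 2.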
22 mlpu Allocation
Replace dedicated slpus with shared mlpus to improve performance and resource utilization. How many slpus should be replaced with mlpus? There is an inherent trade-off between performance and area: mlpus improve performance by allowing adaptive assignment of resources to different types of loops depending on the workload, but an mlpu typically consumes more area than an slpu.
23 LPU Allocation
The trade-off is optimized as an integer linear program.
Given: the resource usage of each type of LPU, and the area of the slpa architecture.
Objective: minimize the total area of the LPAs while meeting the performance target (sharing + number of LPUs).
Constraints: prevent over-allocation of LPUs; map each loop to a single type of LPA; map loops only to compatible LPAs.
24 ROB Buffer Sizing
The reorder buffer (ROB) must hold all results from the LPUs that are not yet ready to be committed. When the ROB is full, the distributor stalls: the LPUs cannot accept new outer loop iterations and become underutilized. In the example, the results for i=5..7 must be buffered because they finish before i=4.
Problem: how do we statically, yet suitably, size the ROB during synthesis?
[Figure: LPA with Distributor, LPU 1..K, and the Collector acting as the ROB; schedule in which i=5..7 complete on their LPUs while i=4 is still running.]
25 ROB Buffer Sizing
We estimate the ROB size based on profiling statistics of the inner loop latency: the maximum latency L_max, the minimum latency L_min, the average latency L_avg, and the latency standard deviation sigma. Our estimates for K LPUs, derived from an empirical formulation of these statistics, achieve good performance.
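The quantity being estimated can also be measured directly on a profiled trace (a sketch of ours, not the paper's formula, which combines L_max, L_min, L_avg, and sigma): a result occupies the ROB from the cycle it finishes until every earlier outer iteration has also finished.

```python
def peak_rob_occupancy(finish_times):
    """finish_times[i]: cycle at which outer iteration i's result is ready.
    Returns the peak number of results buffered awaiting in-order commit."""
    # Iteration i commits once all iterations 0..i have finished.
    commit, cmax = [], 0
    for f in finish_times:
        cmax = max(cmax, f)
        commit.append(cmax)
    # Sweep the [finish, commit) residency intervals for peak overlap.
    events = []
    for f, c in zip(finish_times, commit):
        if c > f:                       # buffered only if it finished early
            events.append((f, 1))
            events.append((c, -1))
    peak = cur = 0
    for _, delta in sorted(events):
        cur += delta
        peak = max(peak, cur)
    return peak
```

If iteration 0 is a long chain finishing at cycle 10 while iterations 1-3 finish at cycles 3-5, three results must wait in the ROB; a trace that completes in order needs no buffering at all.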
26 Deadlock Avoidance
Both the slpa and the mlpa are deadlock-free: the number of in-flight outer loop iterations is limited to be no greater than the number of available ROB entries. The entire dataflow architecture cannot deadlock if it forms a directed acyclic graph; additional care is needed if there is a data dependence between shared inner loops.
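The in-flight limit can be sketched as a credit scheme (our naming and code; the slide states the policy, not this implementation): the distributor reserves a ROB entry when it issues an outer iteration and gets the credit back at commit, so every result always has a slot to land in and a full collector can never wedge the pipeline.

```python
class CreditDistributor:
    """Admits a new outer iteration only if a ROB entry can be reserved."""

    def __init__(self, rob_entries):
        self.credits = rob_entries      # one credit per ROB slot

    def try_issue(self):
        if self.credits == 0:
            return False                # stall instead of overflowing the ROB
        self.credits -= 1               # reserve the slot at issue time
        return True

    def on_commit(self):
        self.credits += 1               # slot freed when its result commits
```

With two ROB entries, a third issue attempt stalls until one of the first two results commits.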
27 Experimental Setup
ElasticFlow's flow leverages a commercial HLS tool that uses the LLVM compiler as its front end. We compared ElasticFlow against the pipelining techniques employed in a state-of-the-art commercial HLS tool, targeting a Xilinx Virtex-7 FPGA with a 5 ns target clock period. Benchmark applications span graph processing, databases, scientific computing, and image processing.
28 Performance for Different Numbers of LPUs
Across the benchmark applications, performance improves close to proportionally as the number of LPUs increases from 1 to 2, 4, and 8.
[Figure: normalized speedup per benchmark for 1, 2, 4, and 8 LPUs.]
29 ElasticFlow vs. Aggressive Unrolling
ElasticFlow achieves comparable performance with significantly less resource usage. Moreover, unrolling is inapplicable when the worst-case loop bound cannot be statically determined.
[Table: latency, LUTs, and registers for dbjoin and spmv under Unroll vs. ElasticFlow -- comparable latency, with 15x and 45x reductions in resource usage.]
30 Effectiveness of LPU Sharing
Using an mlpa further improves performance by 21%-34% with similar area.
[Table: for cfd-a, cfd-b, dbjoin-a, and dbjoin-b, the number of slpus vs. mlpus, the latency reduction, and the slice overhead -- significant latency reduction at a small area overhead.]
31 Take-Away Points
Existing HLS tools rely on static pipelining techniques that extract parallelism only at compile time; they are not competitive for irregular programs with dynamic parallelism. Adaptive pipelining techniques are needed to dynamically extract parallelism at run time and efficiently handle statically unanalyzable program patterns. This work addresses the pipelining of irregular loop nests containing dynamic-bound inner loops, with a novel dataflow pipeline architecture and synthesis techniques that yield substantial performance improvements.
32 ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests
Mingxing Tan (1,2), Gai Liu (1), Ritchie Zhao (1), Steve Dai (1), Zhiru Zhang (1)
(1) Computer Systems Laboratory, Electrical and Computer Engineering, Cornell University
(2) Google Inc.
33 Backup Slides
34 Coarse-Grained Pipelined Accelerators (CGPA)
Liu, Johnson, and August, DAC '14. CGPA generates coarse-grained pipelines for a loop nest by partitioning it into parallel and non-parallel sections; it employs replicated data-level parallelism to create multiple identical copies of the parallel section and applies decoupled pipeline parallelism to separate the parallel and sequential sections with a set of FIFOs. ElasticFlow achieves additional performance and resource efficiency: it enables out-of-order execution and dynamic scheduling, optimizes the allocation and sharing of LPUs with the mlpa architecture, and studies sizing for both the ROB and the delay line together with a runtime policy to prevent deadlock.
35 Comparison with CGPA
36 Widx
Kocberber, Grot, Picorel, Falsafi, Lim, and Ranganathan, MICRO '13. Widx is a reconfigurable accelerator for hash indexing in database systems that uses a decoupled pipeline architecture similar to ElasticFlow's: a hashing unit distributes work to a parallel array of walker units, whose results are combined in an output unit. ElasticFlow, by contrast, addresses the more general problem of pipelining irregular loop nests.
More informationTDT 4260 lecture 7 spring semester 2015
1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding
More informationLab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 1: CORDIC Design Due Friday, September 8, 2017, 11:59pm 1 Introduction COordinate
More informationEarly Performance-Cost Estimation of Application-Specific Data Path Pipelining
Early Performance-Cost Estimation of Application-Specific Data Path Pipelining Jelena Trajkovic Computer Science Department École Polytechnique de Montréal, Canada Email: jelena.trajkovic@polymtl.ca Daniel
More informationOpenMP Device Offloading to FPGA Accelerators. Lukas Sommer, Jens Korinth, Andreas Koch
OpenMP Device Offloading to FPGA Accelerators Lukas Sommer, Jens Korinth, Andreas Koch Motivation Increasing use of heterogeneous systems to overcome CPU power limitations 2017-07-12 OpenMP FPGA Device
More information! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationDesign of Parallel Algorithms. Models of Parallel Computation
+ Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes
More informationPatterns of Parallel Programming with.net 4. Ade Miller Microsoft patterns & practices
Patterns of Parallel Programming with.net 4 Ade Miller (adem@microsoft.com) Microsoft patterns & practices Introduction Why you should care? Where to start? Patterns walkthrough Conclusions (and a quiz)
More informationMemory Consistency. Challenges. Program order Memory access order
Memory Consistency Memory Consistency Memory Consistency Reads and writes of the shared memory face consistency problem Need to achieve controlled consistency in memory events Shared memory behavior determined
More informationNOW Handout Page 1. Review from Last Time #1. CSE 820 Graduate Computer Architecture. Lec 8 Instruction Level Parallelism. Outline
CSE 820 Graduate Computer Architecture Lec 8 Instruction Level Parallelism Based on slides by David Patterson Review Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism
More informationCSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1
CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level
More informationA Novel Design Framework for the Design of Reconfigurable Systems based on NoCs
Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction
More informationDeveloping Dynamic Profiling and Debugging Support in OpenCL for FPGAs
Developing Dynamic Profiling and Debugging Support in OpenCL for FPGAs ABSTRACT Anshuman Verma Virginia Tech, Blacksburg, VA anshuman@vt.edu Skip Booth, Robbie King, James Coole, Andy Keep, John Marshall
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationQuery Processing. Introduction to Databases CompSci 316 Fall 2017
Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2
More informationCore Fusion: Accommodating Software Diversity in Chip Multiprocessors
Core Fusion: Accommodating Software Diversity in Chip Multiprocessors Authors: Engin Ipek, Meyrem Kırman, Nevin Kırman, and Jose F. Martinez Navreet Virk Dept of Computer & Information Sciences University
More informationSparse Matrix-Vector Multiplication FPGA Implementation
UNIVERSITY OF CALIFORNIA, LOS ANGELES Sparse Matrix-Vector Multiplication FPGA Implementation (SID: 704-272-121) 02/27/2015 Table of Contents 1 Introduction... 3 2 Sparse Matrix-Vector Multiplication...
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationUltra-Fast NoC Emulation on a Single FPGA
The 25 th International Conference on Field-Programmable Logic and Applications (FPL 2015) September 3, 2015 Ultra-Fast NoC Emulation on a Single FPGA Thiem Van Chu, Shimpei Sato, and Kenji Kise Tokyo
More informationComputer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13
Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationLecture 9 Basic Parallelization
Lecture 9 Basic Parallelization I. Introduction II. Data Dependence Analysis III. Loop Nests + Locality IV. Interprocedural Parallelization Chapter 11.1-11.1.4 CS243: Parallelization 1 Machine Learning
More informationA Virtualized Quality of Service Packet Scheduler Accelerator. Kangtao Kendall Chuang
A Virtualized Quality of Service Packet Scheduler Accelerator A Thesis Presented to The Academic Faculty by Kangtao Kendall Chuang In Partial Fulfillment of the Requirements for the Degree Master of Science
More informationMT-SDF: Scheduled Dataflow Architecture with mini-threads
2013 Data-Flow Execution Models for Extreme Scale Computing MT-SDF: Scheduled Dataflow Architecture with mini-threads Domenico Pace University of Pisa Pisa, Italy col.pace@hotmail.it Krishna Kavi University
More informationEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More information