ElasticFlow: A Complexity-Effective Approach for Pipelining Irregular Loop Nests

Mingxing Tan (1, 2), Gai Liu (1), Ritchie Zhao (1), Steve Dai (1), Zhiru Zhang (1)
(1) Computer Systems Laboratory, Electrical and Computer Engineering, Cornell University
(2) Google Inc.

Outline
- Loop pipelining in HLS
- Irregular loop nest
- ElasticFlow architecture
- ElasticFlow synthesis
- Experimental results

Loop Pipelining
- An important optimization in HLS
- Creates a static schedule for the loop body so that successive loop iterations can overlap
- Objective: initiation interval (II), which determines throughput

    for (; i < 4; i++) {
      for (j = 0; j < 4; j++) {
        #pragma pipeline
        acc += A[j] * i;
      }
    }

[Figure: schedule over clock cycles for j = 0 to 3. Each iteration performs a load followed by a multiply and an add; in steady state a new iteration starts every cycle (II = 1).]
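
The effect of II on throughput can be made concrete with a small cycle-count model. This is only a sketch under stated assumptions: the function names are hypothetical, the pipeline never stalls, and the 3-cycle iteration latency in the usage note is an assumed value, not one given on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Cycles needed to run `trip_count` iterations on a stall-free pipeline:
// the first iteration takes `latency` cycles, and every later iteration
// starts `ii` cycles after the previous one.
std::uint64_t pipelined_cycles(std::uint64_t trip_count,
                               std::uint64_t ii,
                               std::uint64_t latency) {
    assert(ii >= 1 && latency >= 1);
    if (trip_count == 0) return 0;
    return (trip_count - 1) * ii + latency;
}

// Cycles needed when iterations do not overlap at all.
std::uint64_t sequential_cycles(std::uint64_t trip_count,
                                std::uint64_t latency) {
    return trip_count * latency;
}
```

With 4 iterations, II = 1, and an assumed latency of 3 cycles, `pipelined_cycles(4, 1, 3)` gives 6 cycles, against 12 for `sequential_cycles(4, 3)`.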

Pipelining Outer Loop

    for (; i < 4; i++) {
      for (j = 0; j < 4; j++) {
        acc += A[j] * i;
      }
    }

1. Pipelining only the inner loop: 1 inner loop iteration per cycle

    for (; i < 4; i++) {
      for (j = 0; j < 4; j++) {
        #pragma pipeline
        acc += A[j] * i;
      }
    }

2. Pipelining the outer loop by unrolling the inner loop (fixed inner loop bound): 1 outer loop iteration per cycle

    for (; i < 4; i++) {
      #pragma pipeline
      acc += A[0] * i;
      acc += A[1] * i;
      acc += A[2] * i;
      acc += A[3] * i;
    }

Pipelining Irregular Loop Nest
- Contains one or more dynamic-bound inner loops
  - The number of inner loop iterations varies at runtime
- Accesses less-regular data structures (e.g., sparse matrices, graphs, and hash tables) common in emerging applications
- How to pipeline this loop nest to achieve one lookup per cycle?

    for (i : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);
      p = hashtbl[hv].keys;
      while (p && p->key != k)
        p = p->next;
      format_output(p);
    }

[Figure: hash lookup. Keys are hashed into hash buckets 0, 1, ..., N.]

Aggressively Unrolling Inner Loop

    for (i : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);            // A
      p = hashtbl[hv].keys;
      for (j = 0; j < 6; j++)          // B
        #pragma unroll
        if (p && p->key != k)
          p = p->next;
      format_output(p);                // C
    }

[Figure: resource usage over clock cycles. Stage B is unrolled into j = 0 to 5 for every outer loop iteration, giving 1 lookup/cycle.]

Issues with Aggressive Unrolling
1. The inner loop bound may not be statically determinable
2. Worst-case bound >> common case (e.g., 99 vs. 2)
3. Unnecessarily deep pipeline, very inefficient in area

[Figure: resource usage over clock cycles. Stage B is unrolled up to j = 99, while most outer loop iterations (i = 4 to 9) use only the first few inner iterations.]

Need for a New Approach
- Irregular loop nests are prevalent
  - Graph processing, scientific computation, image processing, etc.
- Naive approaches result in low throughput or large area
- Need resource-efficient pipelining of the outer loop for an irregular loop nest to target one outer loop iteration per cycle

ElasticFlow Concept
- ElasticFlow architecture and associated synthesis techniques
  - Effectively accelerate irregular loop nests
- Transform the irregular loop nest into a multi-stage dataflow pipeline
  - Dynamically distribute different outer loop instances of the dynamic-bound inner loop to one or more processing units
  - Inner loops execute in a pipelined fashion across different outer loop iterations

ElasticFlow Architecture
- Each dynamic-bound inner loop is mapped to an application-specific loop processing array (LPA)
  - An LPA contains one or more loop processing units (LPUs)
  - Each LPU executes an inner loop until completion, which automatically handles inner loop-carried dependences

[Figure: Loop Processing Array (LPA) for B. Stage A feeds <i, val> tuples to a Distributor, which dispatches them to LPU 1, LPU 2, ..., LPU K; a Collector gathers the <i, val> results and passes them to stage C.]

Distributor and Collector
- Distributor
  - Dynamically distributes inner loop instances to LPUs
- Collector
  - Collects results from the LPUs
  - Acts as a reorder buffer (ROB) to ensure that results are committed to the next stage in order

[Figure: A, Distributor, LPA (B), Collector, C.]
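
The in-order commit behavior of the Collector can be sketched in software as follows. This is a minimal illustration, not the hardware design from the talk: the class name `ReorderBuffer`, its methods, and the use of `int` payloads are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software model of the Collector's reorder buffer (ROB).
// Results tagged with the outer-loop index i may arrive in any order,
// but they are committed strictly in increasing order of i.
class ReorderBuffer {
public:
    explicit ReorderBuffer(std::size_t capacity)
        : valid_(capacity, false), value_(capacity, 0) {
        assert(capacity > 0);
    }

    // True if iteration i falls inside the window [head, head + capacity).
    bool can_accept(std::size_t i) const {
        return i >= head_ && i - head_ < valid_.size();
    }

    // Called when an LPU finishes iteration i with result val.
    void write(std::size_t i, int val) {
        assert(can_accept(i));
        valid_[i % valid_.size()] = true;
        value_[i % valid_.size()] = val;
    }

    // Commits the oldest outstanding iteration if its result has arrived.
    bool try_commit(int& out) {
        std::size_t slot = head_ % valid_.size();
        if (!valid_[slot]) return false;
        out = value_[slot];
        valid_[slot] = false;
        ++head_;
        return true;
    }

private:
    std::vector<bool> valid_;
    std::vector<int> value_;
    std::size_t head_ = 0;  // next outer-loop index to commit
};
```

If iteration 1 finishes before iteration 0, `try_commit` keeps returning false until the result for iteration 0 has been written; both results then commit in order.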

ElasticFlow on Hash Lookup

    for (i : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);            // A
      p = hashtbl[hv].keys;
      while (p && p->key != k)         // B
        p = p->next;
      format_output(p);                // C
    }

- LPUs are specialized for B
- Dynamically overlap inner loops across outer loop iterations to achieve a throughput of one outer loop iteration per cycle

[Figure: stages A, B, and C over clock cycles.]

Execution with Single LPU
- Single LPU for Stage B
  - Execution in Stages A and C can overlap in time
  - Inner loop iterations execute serially on Stage B
- Throughput is bottlenecked by the inner loop latency in stage B

    for (i : keys_to_find) {
      #pragma pipeline
      hv = Jenkins_hash(k);            // A
      p = hashtbl[hv].keys;
      while (p && p->key != k)         // B
        p = p->next;
      format_output(p);                // C
    }

[Figure: stages A, B, and C over clock cycles for i = 4 to 7 with a single LPU. Stages A and C stall repeatedly while stage B processes one inner loop at a time.]

Execution with Multiple LPUs
- Multiple LPUs for Stage B
  - Dynamically schedule inner loops

[Figure: stages A, B, and C over clock cycles for i = 4 to 7, comparing a single LPU for B with multiple LPUs for B (LPU 1 to LPU 4).]

Dynamic Scheduling
- Dynamic scheduling policy
  - Mitigates the effect of unbalanced workload on throughput due to latency variation of different inner loops
- Static scheduling
  - Inefficient resource utilization due to many stalls and idles

[Figure: stages A, B, and C for i = 4 to 7 on LPU 1 to LPU 4, comparing dynamic scheduling with static scheduling.]
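
A small simulation illustrates why dynamic scheduling helps when inner loop latencies vary. The model is an assumption of this sketch, not the scheduler from the talk: instance i becomes available at cycle i, each instance runs to completion on one LPU, the dynamic policy picks the LPU that frees up first, and the static policy binds instance i to LPU (i mod K). The function name `finish_cycle` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the cycle at which the last inner-loop instance finishes.
// latency[i] is the number of cycles instance i occupies an LPU.
long finish_cycle(const std::vector<long>& latency,
                  std::size_t num_lpus,
                  bool dynamic) {
    assert(num_lpus > 0);
    std::vector<long> free_at(num_lpus, 0);  // cycle at which each LPU is idle
    long last = 0;
    for (std::size_t i = 0; i < latency.size(); ++i) {
        std::size_t lpu = i % num_lpus;  // static: round-robin binding
        if (dynamic) {                   // dynamic: earliest-free LPU
            lpu = static_cast<std::size_t>(
                std::min_element(free_at.begin(), free_at.end()) -
                free_at.begin());
        }
        long start = std::max(free_at[lpu], static_cast<long>(i));
        free_at[lpu] = start + latency[i];
        last = std::max(last, free_at[lpu]);
    }
    return last;
}
```

For latencies {8, 1, 8, 1} on two LPUs, static binding sends both long instances to the same LPU and finishes at cycle 16, while the dynamic policy finishes at cycle 10.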

Multiple Dynamic-Bound Inner Loops

Database join:

    for (; i < num_keys; i++) {
      #pragma pipeline
      // A: look up hashtbl1
      // B: dynamic-bound loop
      while (p && p->key != k) p = p->next;
      // C: look up hashtbl2
      // D: dynamic-bound loop
      while (q && q->key != k) q = q->next;
      // E: merge results
    }

- Architecture with dedicated LPAs
- Each LPA is dedicated to a particular inner loop

[Figure: A, Distributor, LPA B (slpu 1 to slpu K), Collector, C, Distributor, LPA D (slpu 1 to slpu K), Collector, E. Stages exchange <i, val> tuples.]

Issues with Dedicated LPAs
- If loop B incurs much longer average latency than loop D, the LPA for loop D results in poor resource utilization

[Figure: execution of dbjoin on dedicated LPUs over clock cycles. slpa B (slpu 1 to 3) stays busy with i_B = 0 to 5, while slpa D (slpu 1 to 3) finishes i_D = 0 to 5 early and then sits idle.]

LPA Sharing
- An LPA can be shared among one or more inner loops
  - slpu: single-loop processing unit, dedicated to one loop
  - mlpu: multi-loop processing unit, shared among multiple loops
  - slpa: single-loop processing array, consisting of multiple slpus for a particular loop
  - mlpa: multi-loop processing array, consisting of multiple mlpus, each shared among loops

[Figure: architecture with shared LPUs (mlpa B,D vs. slpa). Stages A and C send <s, i, val> tuples to a Distributor, which dispatches them to shared mlpus (B/D); a Collector returns <i, val> results toward stage E.]

Execution with Shared LPUs
- mlpa improves resource utilization and performance by reducing pipeline stalls for unbalanced workload
- Even requires fewer LPUs

[Figure: execution of dbjoin over clock cycles. On dedicated LPAs (slpa B and slpa D, three slpus each), slpa D goes idle while slpa B is still busy. On a shared mlpa B,D (mlpu 1 to mlpu 4), instances of B and D are interleaved across all mlpus.]

ElasticFlow Synthesis
- Maps an irregular loop nest to the ElasticFlow architecture
  - Partition the loop nest into multiple stages
  - Identify inner loop candidates to form the LPAs
  - Synthesize these inner loops into slpus and mlpus
- Goal: optimize LPU allocation to meet the expected throughput
  1. How many?
  2. Shared or not shared?

    for (; i < num_keys; i++) {
      #pragma pipeline
      // A: look up hashtbl1
      // B: dynamic-bound loop
      while (p && p->key != k) p = p->next;
      // C: look up hashtbl2
      // D: dynamic-bound loop
      while (q && q->key != k) q = q->next;
      // E: merge results
    }

[Figure: stages A to E, with a Distributor, shared mlpus, and a Collector forming mlpa B,D.]

slpu Allocation
- Definitions
  - $TP$: expected number of outer loop iterations per cycle
  - $II_i$: achievable initiation interval (II) of inner loop $i$
  - $L_i$: latency in cycles of a single iteration of loop $i$
  - $B_i$: common-case bound of inner loop $i$ (from profiling)

$U_i = [\,II_i\,(B_i - 1) + L_i\,] \cdot TP$

- $U_i$: number of slpus
- $II_i\,(B_i - 1) + L_i$: common-case latency of each inner loop instance
- $TP$: to achieve the expected throughput
- This many slpus are needed to hide the latency of the inner loop: how many simultaneous in-flight outer loop iterations are required?
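
The allocation formula on this slide can be evaluated directly. The sketch below uses hypothetical function names, and rounding the product up to a whole number of slpus is an assumption of this example rather than something stated on the slide.

```cpp
#include <cassert>
#include <cmath>

// Common-case latency in cycles of one inner-loop instance:
// II_i * (B_i - 1) + L_i.
long common_case_latency(long ii, long bound, long iter_latency) {
    assert(ii >= 1 && bound >= 1 && iter_latency >= 1);
    return ii * (bound - 1) + iter_latency;
}

// Number of slpus needed to sustain `tp` outer-loop iterations per cycle.
// Rounding up to an integer is an assumption made for this sketch.
long slpus_needed(long ii, long bound, long iter_latency, double tp) {
    assert(tp > 0.0);
    double u = static_cast<double>(
                   common_case_latency(ii, bound, iter_latency)) * tp;
    return static_cast<long>(std::ceil(u));
}
```

For example, with II = 1, a common-case bound of 2, and a 3-cycle iteration latency, one instance takes 4 cycles, so sustaining TP = 1 needs 4 slpus and TP = 0.5 needs 2.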

mlpu Allocation
- Replace dedicated slpus with shared mlpus to improve performance and resource utilization
  - How many slpus should be replaced with mlpus?
- Inherent trade-off between performance and area
  - mlpus improve performance by allowing adaptive assignment of resources to different types of loops depending on workload
  - mlpus typically consume more area than slpus

LPU Allocation
- Optimize the tradeoff as an integer linear program, given:
  - Resource usage of each type of LPU
  - Area of the slpa architecture

[Equation: integer linear program; the formulas were not captured in the transcript. Surviving annotations: sharing + #LPUs (performance); total area of the LPAs; prevent over-allocation of LPUs; each loop maps to a single type of LPA; loops mapped to compatible LPA.]

ROB Buffer Sizing
- The reorder buffer (ROB) must hold all results from the LPUs that are not yet ready to be committed
- The Distributor is stalled when the ROB is full
  - LPUs cannot process new outer loop iterations, and become underutilized
- Need to store results from [...] to i=7 because they finish before [...]
- Problem: how to statically but suitably size the ROB during synthesis?

[Figure: Distributor, LPU 1, LPU 2, ..., LPU K, and Collector (ROB) forming an LPA, with a timeline of stages A, B, and C for i = 4 to 7 on LPU 1 to LPU 4 showing a stall.]

ROB Buffer Sizing
- We estimate the ROB size based on profiling
  - Maximum latency $L_{max}$
  - Minimum latency $L_{min}$
  - Average latency $L_{avg}$
  - Latency standard deviation $\sigma$
- Our estimates (for K LPUs) achieve good performance based on the following empirical formulation

[Equation: empirical ROB sizing formula; not captured in the transcript.]

Deadlock Avoidance
- Both slpa and mlpa are deadlock-free
  - Limit the number of in-flight outer loop iterations to be no greater than the number of available ROB entries
- Entire dataflow architecture cannot deadlock
  - If the architecture forms a directed acyclic graph
  - If there is data dependence between shared inner loops
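
The rule of bounding in-flight outer loop iterations by the number of ROB entries can be pictured as a credit counter. This is a hedged software sketch, not the mechanism described in the talk: the class name `InFlightLimiter` and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical credit counter with one credit per ROB entry.
// The Distributor takes a credit before dispatching an outer-loop
// iteration, and the Collector returns it when the result commits.
class InFlightLimiter {
public:
    explicit InFlightLimiter(std::size_t rob_entries)
        : credits_(rob_entries) {}

    // Distributor side: returns false (stall) when no ROB entry is free.
    bool try_issue() {
        if (credits_ == 0) return false;
        --credits_;
        ++in_flight_;
        return true;
    }

    // Collector side: frees one ROB entry after an in-order commit.
    void commit() {
        assert(in_flight_ > 0);
        --in_flight_;
        ++credits_;
    }

    std::size_t in_flight() const { return in_flight_; }

private:
    std::size_t credits_;
    std::size_t in_flight_ = 0;
};
```

Because an iteration is dispatched only after it has claimed an entry, every finished result is guaranteed a slot in the ROB, so an LPU never blocks waiting for buffer space.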

Experimental Setup
- ElasticFlow's setup leverages a commercial HLS tool that uses the LLVM compiler as its front end
- Compared ElasticFlow to the pipelining techniques employed in a state-of-the-art commercial HLS tool
- Target: Xilinx Virtex-7 FPGA with a 5-ns target clock period
- Benchmark applications
  - Graph processing, database, scientific computing, image processing

Performance for Different Numbers of LPUs
- Close to proportional improvement in performance for an increasing number of LPUs

[Figure: Performance Comparison. Normalized speedup across the benchmark applications for 1 LPU, 2 LPUs, 4 LPUs, and 8 LPUs.]

ElasticFlow vs. Aggressive Unrolling
- Achieves comparable performance with significantly less resource usage
- Unrolling is inapplicable when the worst-case loop bound cannot be statically determined

[Table: Latency, LUTs, and Registers for dbjoin and spmv under the Unroll and ElasticFlow techniques; the numeric values were not captured in the transcript. Annotations: comparable, 15x reduction, 45x reduction.]

Effectiveness of LPU Sharing
- Using mlpa can further improve the performance by 21%-34% with similar area
- Significant latency reduction
- Small area overhead

[Table: comparison of mlpus over slpus. Columns: Design, # slpus, # mlpus, Latency Reduction, Slice Overhead. Rows: cfd-a, cfd-b, dbjoin-a, dbjoin-b. Most values were not captured in the transcript; the surviving percentages read 38%, 52%, 70%, and 57% for the four rows respectively.]

Take-Away Points
- Existing HLS tools rely on static pipelining techniques
  - Extract parallelism only at compile time
  - Not competitive for irregular programs with dynamic parallelism
- Need for adaptive pipelining techniques
  - Dynamically extract parallelism at runtime
  - Efficiently handle statically unanalyzable program patterns
- We address pipelining of irregular loop nests containing dynamic-bound inner loops
  - Novel dataflow pipeline architecture and synthesis techniques
  - Substantial performance improvement

Backup Slides

Coarse-Grained Pipelined Accelerators (CGPA)
Liu, Johnson, and August, DAC '14
- Generates coarse-grained pipelines for a loop nest by partitioning it into parallel and non-parallel sections
  - Employs replicated data-level parallelism to create multiple identical copies of the parallel section
  - Applies decoupled pipeline parallelism to separate the parallel and sequential sections with a set of FIFOs
- ElasticFlow achieves additional performance and resource efficiency
  - Enables out-of-order execution and dynamic scheduling
  - Optimizes allocation and sharing of LPUs with the mlpa architecture
  - Studies sizing for both the ROB and the delay line, and a runtime policy to prevent deadlock

Comparison with CGPA

Widx
Kocberber, Grot, Picorel, Falsafi, Lim, and Ranganathan, MICRO '13
- A reconfigurable accelerator for hash indexing in database systems
- Uses a decoupled pipeline architecture similar to ElasticFlow
  - A hashing unit distributes work to a parallel array of walker units, whose results are combined in an output unit
- ElasticFlow is a technique for addressing the more general problem of pipelining irregular loop nests


Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Authors: Shuangchen Li, Yongpan Liu, X.Sharon Hu, Xinyu He, Pei Zhang, and Huazhong Yang 2013/01/23 Outline

More information

EE178 Spring 2018 Lecture Module 4. Eric Crabill

EE178 Spring 2018 Lecture Module 4. Eric Crabill EE178 Spring 2018 Lecture Module 4 Eric Crabill Goals Implementation tradeoffs Design variables: throughput, latency, area Pipelining for throughput Retiming for throughput and latency Interleaving for

More information

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation

A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation A 3-D CPU-FPGA-DRAM Hybrid Architecture for Low-Power Computation Abstract: The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem,

More information
