A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms


Shuoxin Lin, Yanzhou Liu, William Plishker, Shuvra Bhattacharyya
Maryland DSPCAD Research Group
Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies
University of Maryland
International Workshop on Software and Compilers for Embedded Systems
May 23, 2016, Sankt Goar, Germany

2 Motivation
From high-level system specification to software on hybrid multicore CPU-GPU platforms
[Figure: dataflow graph with actors A, B, and C.]

3 Synchronous Dataflow (SDF) [1]
Vertices (actors): computational modules
Edges: FIFO buffers
Tokens: data elements passed between actors
Production / consumption rates: in SDF, production and consumption rates are known at compile time.
Iterative execution on large or unbounded data streams
[Figure: SDF graph with actors A, B, and C connected by edges e1 and e2, annotated with production rates (p1, p2) and consumption rates (c1, c2).]

4 Objectives
To automatically exploit data, task, and pipeline parallelism from model-based specifications of digital signal processing (DSP) applications.
To generate throughput-optimized code for hybrid CPU-GPU platforms (with multi-core CPU and GPU devices working together on the application).
[Figure: dataflow graph (A, B, C) passed through the Dataflow Design Framework.]

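5 Exploiting Parallelism in SDF
[Figure with three panels over an example graph with actors A, B, C, and D:
  Task Parallelism: actors distributed across processors P1 and P2
  Data Parallelism: multiple instances of an actor executed in parallel
  Pipeline Parallelism: stage 1, stage 2, and stage 3 mapped to P1, P2, and P3]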

6 Multicore CPU-GPU architecture
Multi-core CPU (host)
  CPU cores (Multiple Instructions Multiple Data, MIMD)
  Cores share main memory
GPU (device)
  Many SIMD multiprocessors
  Separate memory

    float *hp, *dp;                          /* host and device pointers */
    hp = (float*)malloc(sizeof(float) * N);
    cudaMalloc((void**)&dp, sizeof(float) * N);
    cudaMemcpy(dp, hp, sizeof(float) * N, cudaMemcpyHostToDevice);
    call_kernel(dp);
    /* ... */ /* Other kernel executions */
    cudaMemcpy(hp, dp, sizeof(float) * N, cudaMemcpyDeviceToHost);
    cudaFree(dp);
    free(hp);

7 Dataflow Design Framework: Challenges
Many factors affect system throughput:
  Vectorization
  Multiprocessor scheduling
  Inter-processor communication
  Other system constraints

8 Dataflow Design Framework: DIF-GPU framework
  Model specification
  Actor implementation
  Vectorization
  Compile-time scheduling
  Code generation

9 Dataflow Design Framework: Comparison of DIF-GPU with some dataflow runtime frameworks
DIF-GPU
  Compile-time: vectorization, scheduling, inter-processor communication
  Run-time: peer-worker multithreading
Related works (StarPU [2] and OmpSS [3])
  Run-time: scheduling, inter-processor communication, manager-worker multithreading
[2] C. Augonnet et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Journal of Concurrency and Computation: Practice & Experience, 23(2), February.
[3] A. Duran et al. OmpSS: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02).

10 Dataflow Design Framework: Why DIF-GPU?
Compile-time scheduling and data transfers: less runtime overhead
Integration of vectorization and code generation: more extensive design automation

11 Model Specification in DIF
Dataflow Interchange Format (DIF) [1]: a standard language for specifying mixed-grain dataflow models for digital signal processing (DSP) systems
[Figure: src → usp → snk with repetition counts <1>, <2>, <3>; on e1, src produces 2 tokens and usp consumes 1; on e2, usp produces 3 tokens and snk consumes 2.]
src: source; usp: upsampler; snk: sink

    sdf usp_graph {
        topology {
            nodes = src, usp, snk;
            edges = e1 (src,usp), e2 (usp,snk);
        }
        production { e1 = 2; e2 = 3; }
        consumption { e1 = 1; e2 = 2; }
        attribute edge_type { e1 = "float"; e2 = "float"; }
        actor src {
            name = "src_1f";
            port_0 : OUTPUT = e1;
        }
        actor usp {
            name = "usp3";
            GPU_enabled = 1;
            port_0 : INPUT = e1;
            port_1 : OUTPUT = e2;
        }
        actor snk {
            name = "snk_1f";
            port_0 : INPUT = e2;
        }
    }

12 Actor Implementation in the LIghtweight Dataflow Environment (LIDE)
LIDE: programming methodology and APIs for implementing dataflow graph actors and edges

Function       Description
Actor
  new()        Performs memory allocation and initialization for the actor; specifies CPU/GPU version and vectorization
  enable()     Checks the input availability and output free space
  invoke()     For the CPU version, executes the CPU function; for the GPU version, launches GPU kernel(s)
  terminate()  Frees memory that has been dynamically allocated for the actor
Edge
  new()        Allocates the FIFO buffer in CPU/GPU memory
  free()       Frees the FIFO buffer

13 Actor Implementation in LIghtweight Dataflow Environment (LIDE)
LIDE: compact, extensible, flexible
(Same table of actor and edge functions as on slide 12.)
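The enable()/invoke() contract described for LIDE actors can be illustrated with a small CPU-only sketch. This is not the LIDE API: the `Fifo` and `Upsampler` classes, their method names, and the zero-insertion behavior of the upsampler are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical bounded FIFO standing in for a dataflow edge (not the LIDE API).
class Fifo {
public:
    explicit Fifo(std::size_t capacity) : capacity_(capacity) {}
    std::size_t population() const { return data_.size(); }
    std::size_t free_space() const { return capacity_ - data_.size(); }
    void write(float v) { assert(free_space() > 0); data_.push_back(v); }
    float read() {
        assert(!data_.empty());
        float v = data_.front();
        data_.pop_front();
        return v;
    }
private:
    std::size_t capacity_;
    std::deque<float> data_;
};

// Hypothetical upsampler actor: consumes 1 token and produces `factor` tokens
// per firing (assumed here to be the input followed by zeros).
class Upsampler {
public:
    Upsampler(Fifo& in, Fifo& out, std::size_t factor)
        : in_(in), out_(out), factor_(factor) { assert(factor > 0); }

    // enable(): true only if one firing has enough input and enough output space.
    bool enable() const {
        return in_.population() >= 1 && out_.free_space() >= factor_;
    }

    // invoke(): performs exactly one firing; the caller checks enable() first.
    void invoke() {
        assert(enable());
        out_.write(in_.read());
        for (std::size_t i = 1; i < factor_; ++i) out_.write(0.0f);
    }
private:
    Fifo& in_;
    Fifo& out_;
    std::size_t factor_;
};
```

A scheduler thread would typically loop over its actors and call invoke() only when enable() returns true, which keeps the firing rule separate from the computation.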

14 Exploiting Parallelism in DIF-GPU: Vectorization (Data Parallelism)
Graph-level vectorization
[Figure: original graph src → usp → snk with repetition counts <1>, <2>, <3> and rates 2:1 and 3:2; vectorized graph with repetition counts <1>, <1>, <1>, vectorization degrees b, 2b, 3b, and rates 2b:2b and 6b:6b.]
Each actor v is vectorized by b · q(v), where q(v) is the repetition count of v, and b is the graph vectorization degree (GVD).
Multi-rate SDF graph → block-processing single-rate SDF task graph for scheduling
b is limited by system constraints (memory, latency, etc.).
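The relationship between repetition counts and vectorized rates can be sketched as follows. This is an illustrative calculation, not DIF-GPU code: `SdfEdge`, `repetition_vector`, and `vectorized_rate` are hypothetical names, and the graph is assumed to be connected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SdfEdge {
    std::size_t src, dst;  // actor indices
    long prod, cons;       // tokens produced / consumed per firing
};

// Euclid's algorithm; gcd_l(0, x) == x.
static long gcd_l(long a, long b) {
    while (b != 0) { long t = a % b; a = b; b = t; }
    return a;
}

// Minimal positive repetition vector q of a connected SDF graph, i.e. the
// smallest integers with q[src] * prod == q[dst] * cons on every edge.
// Returns an empty vector if the graph is disconnected or inconsistent.
std::vector<long> repetition_vector(std::size_t n, const std::vector<SdfEdge>& edges) {
    if (n == 0) return {};
    // q[v] is tracked as the fraction num[v] / den[v]; num[v] == 0 means unvisited.
    std::vector<long> num(n, 0), den(n, 1);
    num[0] = 1;
    for (bool changed = true; changed;) {
        changed = false;
        for (const SdfEdge& e : edges) {
            assert(e.src < n && e.dst < n && e.prod > 0 && e.cons > 0);
            const bool fwd = num[e.src] != 0 && num[e.dst] == 0;
            const bool bwd = num[e.dst] != 0 && num[e.src] == 0;
            if (!fwd && !bwd) continue;
            const std::size_t from = fwd ? e.src : e.dst;
            const std::size_t to = fwd ? e.dst : e.src;
            const long a = num[from] * (fwd ? e.prod : e.cons);
            const long b = den[from] * (fwd ? e.cons : e.prod);
            const long g = gcd_l(a, b);
            num[to] = a / g;
            den[to] = b / g;
            changed = true;
        }
    }
    long scale = 1;  // least common multiple of all denominators
    for (std::size_t v = 0; v < n; ++v) {
        if (num[v] == 0) return {};
        scale = scale / gcd_l(scale, den[v]) * den[v];
    }
    std::vector<long> q(n);
    long g = 0;
    for (std::size_t v = 0; v < n; ++v) {
        q[v] = num[v] * (scale / den[v]);
        g = gcd_l(g, q[v]);
    }
    for (long& x : q) x /= g;
    // Verify every balance equation (catches inconsistent cycles).
    for (const SdfEdge& e : edges)
        if (q[e.src] * e.prod != q[e.dst] * e.cons) return {};
    return q;
}

// Tokens moved over edge e per firing once each actor v is vectorized by b * q[v].
long vectorized_rate(const SdfEdge& e, const std::vector<long>& q, long b) {
    return b * q[e.src] * e.prod;
}
```

For the src/usp/snk chain with rates 2:1 and 3:2, this yields q = (1, 2, 3) and per-edge rates of 2b and 6b, matching the vectorized graph on the slide.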

15 Exploiting Parallelism in DIF-GPU: Scheduling (Task & Pipeline Parallelism)
First Come First Serve (FCFS) [4]
  Simple greedy approach
  Schedules an actor whenever a processor becomes idle
Heterogeneous Earliest Finish Time (HEFT) [5]
  Manages a list of actors that are ready to be executed
  Selects the actor-processor pair with earliest finish time
Can be extended with other scheduling strategies
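The earliest-finish-time selection described above can be sketched as a greedy list scheduler. This follows only the two steps named on the slide (ready list, best actor-processor pair); it is a simplification that omits the task ranking of the published HEFT algorithm and ignores inter-processor communication cost. `Task`, `Placement`, and `schedule_eft` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

struct Task {
    std::vector<double> cost;        // cost[p]: execution time on processor p
    std::vector<std::size_t> preds;  // indices of predecessor tasks
};

struct Placement {
    std::size_t proc;
    double start;
    double finish;
};

// Repeatedly pick, among all ready tasks, the (task, processor) pair with the
// earliest finish time. Assumes an acyclic task graph.
std::vector<Placement> schedule_eft(const std::vector<Task>& tasks, std::size_t num_procs) {
    assert(num_procs > 0);
    const std::size_t n = tasks.size();
    std::vector<Placement> placed(n);
    std::vector<bool> done(n, false);
    std::vector<double> proc_free(num_procs, 0.0);  // time at which each processor becomes idle

    for (std::size_t step = 0; step < n; ++step) {
        std::size_t best_task = n, best_proc = 0;
        double best_start = 0.0;
        double best_finish = std::numeric_limits<double>::infinity();

        for (std::size_t t = 0; t < n; ++t) {
            if (done[t]) continue;
            assert(tasks[t].cost.size() == num_procs);
            // A task is ready once all of its predecessors have been scheduled.
            double ready_time = 0.0;
            bool ready = true;
            for (std::size_t p : tasks[t].preds) {
                if (!done[p]) { ready = false; break; }
                ready_time = std::max(ready_time, placed[p].finish);
            }
            if (!ready) continue;
            for (std::size_t p = 0; p < num_procs; ++p) {
                const double start = std::max(ready_time, proc_free[p]);
                const double finish = start + tasks[t].cost[p];
                if (finish < best_finish) {
                    best_task = t;
                    best_proc = p;
                    best_start = start;
                    best_finish = finish;
                }
            }
        }
        assert(best_task < n);  // fails only if the graph contains a cycle
        placed[best_task] = Placement{best_proc, best_start, best_finish};
        done[best_task] = true;
        proc_free[best_proc] = best_finish;
    }
    return placed;
}
```

An FCFS variant would differ only in the selection rule: it would assign the next ready task to whichever processor becomes idle first, without comparing finish times across processors.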

16 Exploiting Parallelism in DIF-GPU: Scheduling (Task & Pipeline Parallelism)
Example
[Figure: task graph with actors A to F, each annotated with an execution-time pair t1/t2.]
FCFS: P1 runs A, B, C, F, A; P2 runs D, E; T = 4
HEFT: P1 runs A, D, E, F, A; P2 runs B, C; T =

17 Inter-processor Data Transfer: Host-centered FIFO Allocation (HCFA)
Maintain all FIFOs on host memory
Frameworks without GPU support (e.g., GNU Radio)
Easy integration with existing frameworks
Large overhead due to excessive CPU-GPU data transfer
[Figure: example graph with actors u, v, w, x, y and edges e1, e2, e3; buffers buf1, buf2, buf3 reside in host memory, and GPU actor v wraps its kernel with H2D and D2H copies. Legend: CPU actor, GPU actor.]

18 Inter-processor Data Transfer: Mapping-dependent FIFO Allocation (MDFA)
FIFOs can be allocated in host or device memory depending on the schedule
Insert H2D/D2H actors to explicitly move data
Inter-processor data transfer only occurs at locations determined by the schedule
[Figure: the same example graph with explicit H2D and D2H actors; edges are split into host-side (e1c, e2c) and device-side (e1g, e2g) FIFOs around GPU actor v. Legend: CPU actor, GPU actor.]
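The insertion of H2D/D2H actors under MDFA can be sketched as a pass over the edge list. This is a hypothetical illustration, not DIF-GPU's implementation: `Proc`, `Edge`, `insert_transfer_actors`, and the `H2D_n`/`D2H_n` naming scheme are assumptions, as is the choice to map the transfer actors to the GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class Proc { CPU, GPU };

struct Edge {
    std::string src, dst;
};

// For every edge whose endpoints are mapped to different processors, split the
// edge and insert a transfer actor: H2D for CPU -> GPU, D2H for GPU -> CPU.
// Edges inside one processor are kept unchanged, so no copy is scheduled there.
// The new actors are added to `mapping` (assumed here to run on the GPU thread).
std::vector<Edge> insert_transfer_actors(const std::vector<Edge>& edges,
                                         std::map<std::string, Proc>& mapping) {
    std::vector<Edge> out;
    std::size_t h2d = 0, d2h = 0;
    for (const Edge& e : edges) {
        const Proc ps = mapping.at(e.src);
        const Proc pd = mapping.at(e.dst);
        if (ps == pd) {
            out.push_back(e);
            continue;
        }
        const std::string name = (ps == Proc::CPU)
            ? "H2D_" + std::to_string(h2d++)
            : "D2H_" + std::to_string(d2h++);
        mapping[name] = Proc::GPU;
        out.push_back(Edge{e.src, name});  // FIFO on the producer's side
        out.push_back(Edge{name, e.dst});  // FIFO on the consumer's side
    }
    return out;
}
```

With src and snk mapped to the CPU and usp to the GPU, the chain src → usp → snk becomes src → H2D_0 → usp → D2H_0 → snk. Under HCFA, by contrast, every firing of a GPU actor would copy its inputs and outputs regardless of where its neighbors run.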

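19 DIF-GPU Example
Original graph: src → usp → snk with repetition counts <1>, <2>, <3> and rates 2:1 (src to usp) and 3:2 (usp to snk)
Vectorization: repetition counts <1>, <1>, <1>; vectorization degrees src = b, usp = 2b, snk = 3b; rates 2b:2b and 6b:6b
Scheduling & data transfer actor insertion: src → H2D → usp → D2H → snk, each with repetition count <1>
  CPU: src, snk, src, snk, ...
  GPU: H2D, usp, D2H, H2D, ...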

20 DIF-GPU Example: Generated LIDE-CUDA code
Header file

Headers:

    #include <stdio.h>
    /* ... */

Macro definitions:

    #define SRC 0
    #define USP 1
    #define SNK 2
    #define H2D_0 3
    #define D2H_0 4
    #define ACTOR_COUNT 5
    #define CPU 0
    #define GPU 1
    #define NUMBER_OF_THREADS 2

Class declaration:

    class usp_graph {
    public:
        usp_graph();
        ~usp_graph();
        void execute();
    private:
        thread_list* thread_list;
        actor_context_type* actors[ACTOR_COUNT];
        fifo_pointer edge_in_h2d_0;
        fifo_pointer edge_out_d2h_0;
        fifo_pointer edge_in_d2h_0;
        fifo_pointer edge_out_h2d_0;
    };

21 DIF-GPU Example: Generated LIDE-CUDA code
Source code: graph constructor

    #include "usp_graph.h"

    usp_graph::usp_graph(){
        /* Create edges */
        edge_in_h2d_0 = fifo_new(4, sizeof(float), CPU);
        edge_out_d2h_0 = fifo_new(12, sizeof(float), CPU);
        edge_in_d2h_0 = fifo_new(12, sizeof(float), GPU);
        edge_out_h2d_0 = fifo_new(4, sizeof(float), GPU);

        /* Create actors */
        actors[D2H_0] = (actor_context_type*) memcpy_new(
            edge_in_d2h_0, edge_out_d2h_0, 12, 12, sizeof(float), GPU);
        actors[SNK] = (actor_context_type*) snk_1f_new(
            edge_out_d2h_0, 12, CPU);
        actors[H2D_0] = (actor_context_type*) memcpy_new(
            edge_in_h2d_0, edge_out_h2d_0, 4, 4, sizeof(float), GPU);
        actors[USP] = (actor_context_type*) usp3_new(
            edge_out_h2d_0, edge_in_d2h_0, 4, 12, GPU);
        actors[SRC] = (actor_context_type*) src_1f_new(
            edge_in_h2d_0, 4, CPU);

        /* Create schedules of each thread */
        const char* thread_schedules[NUMBER_OF_THREADS] =
            {"thread_0.txt", "thread_1.txt"};
        thread_list = thread_list_init(NUMBER_OF_THREADS,
            thread_schedules, actors, ACTOR_COUNT);
    }

22 DIF-GPU Example: Generated LIDE-CUDA code
Source code: graph-level execute() and destructor

    void usp_graph::execute(){
        thread_list_scheduler(thread_list);
    }

    usp_graph::~usp_graph(){
        /* Terminate threads */
        thread_list_terminate(thread_list);

        /* Free FIFOs */
        fifo_free(edge_in_h2d_0);
        fifo_free(edge_out_d2h_0);
        fifo_free(edge_in_d2h_0);
        fifo_free(edge_out_h2d_0);

        /* Destroy actors */
        memcpy_terminate((memcpy_context_type*)actors[D2H_0]);
        snk_1f_terminate((snk_1f_context_type*)actors[SNK]);
        memcpy_terminate((memcpy_context_type*)actors[H2D_0]);
        usp3_terminate((usp3_context_type*)actors[USP]);
        src_1f_terminate((src_1f_context_type*)actors[SRC]);
    }

23 Case Study
Throughput for a b-vectorized graph: Th = b/T
  Th: throughput; b: vectorization degree; T: schedule period
Test bench: MP-Sched (P x S)

Item        Values
Grid size   2x5, 4x4, 6x3
Platform    1 CC + 1 GPU; 3 CCs + 1 GPU
Scheduler   HEFT, FCFS
* CC: CPU core

24 Speedup of FIR Filter
K = 7; excluding CPU-GPU data-transfer time
[Figure: FIR actor consuming and producing B tokens per firing, B = 1, 2, ..., N; filter length = K. Plot of speedup versus vectorization degree b.]
Slow increase from b = 2^7 to 2^10: low GPU utilization
Fast increase from b = 2^10 to 2^16: increased utilization
Slow increase from b = 2^17 to 2^19: saturation

25 Data Transfer Evaluation
Throughput and data transfer overhead for FIFO implementations based on HCFA and MDFA. Percentage = fraction of time spent on data transfer.

Topology         2x5      4x4      6x3
HCFA
  Th (10^6/s)
  D2H            37.4%    37.6%    37.1%
  H2D            16.4%    16.2%    15.8%
MDFA
  Th (10^6/s)
  D2H            17.2%    15.5%    20.8%
  H2D             6.7%     9.9%     8.2%

26 GPU Workload vs. Vectorization MP-Sched 4x4 Vec. Degree = 128 Vec. Degree = 256 Vec. Degree =

27 GPU Workload vs. Vectorization MP-Sched 4x4 Vec. Degree = 1024 Vec. Degree = 2048 Vec. Degree =

28 GPU Workload vs. Vectorization MP-Sched 4x4 Vec. Degree =

29 System Level Evaluation
Single-processor baselines
  CPU baseline Th_c: all actors scheduled on the same CPU core (CC)
  GPU baseline Th_g: all actors with GPU acceleration scheduled on the GPU; all others scheduled on the same CC
DIF-GPU speedup: sp = Th / max(Th_c, Th_g)
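As a worked illustration of the two metrics used in the evaluation, the sketch below computes throughput Th = b/T and the speedup over the better single-processor baseline. The function names and the numbers in the usage note are hypothetical and are not measurements from the talk.

```cpp
#include <algorithm>
#include <cassert>

// Throughput of a b-vectorized graph with schedule period T (in seconds): Th = b / T.
double throughput(double b, double period_s) {
    assert(period_s > 0.0);
    return b / period_s;
}

// Speedup relative to the better of the CPU-only and GPU-only baselines:
// sp = Th / max(Th_c, Th_g).
double speedup(double th, double th_c, double th_g) {
    const double baseline = std::max(th_c, th_g);
    assert(baseline > 0.0);
    return th / baseline;
}
```

For example, with made-up throughputs Th = 300, Th_c = 100, and Th_g = 200, the speedup is 300 / 200 = 1.5. Normalizing by the maximum of the two baselines means a value above 1 indicates a gain over the best single-processor mapping, not merely over the slower one.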

30 System Level Evaluation: MP-Sched 2x5

31 System Level Evaluation: MP-Sched 4x4 (FCFS, HEFT)

32 System Level Evaluation: MP-Sched 6x3 (FCFS, HEFT)

33 System Level Evaluation
For small vectorization degrees, 3CC + 1GPU gives higher throughput
  GPU version slower
  More cores
For large vectorization degrees, 1CC + 1GPU gives higher throughput
  GPU version much faster than CPU
  HEFT/FCFS scheduling
  Multithreading runtime overhead

34 Scheduler Evaluation
Speedup in different topologies: 2x5, 4x4, 6x3
Th(HEFT) > Th(FCFS) in general
Consistent gain over GPU baseline
Inter-processor data transfer

35 Scheduler Evaluation
Speedup in different topologies: 2x5, 4x4, 6x3
In some cases (b_l < b < b_u), FCFS is better

36 Conclusion
DIF-GPU framework
  SDF graph specification (DIF)
  Vectorization
  Scheduling
  Code generation
Demonstration → MP-Sched benchmarks
  Data transfer overhead reduction using MDFA
  Performance improvement over CPU and GPU baselines

37 References
1. E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9), September.
2. C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Journal of Concurrency and Computation: Practice & Experience, 23(2), February.
3. A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. OmpSS: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02).
4. G. Teodoro, R. Sachetto, O. Sertel, M. N. Gurcan, W. Meira, U. Catalyurek, and R. Ferreira. Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In Proceedings of the IEEE International Conference on Cluster Computing and Workshops, pages 1-10.
5. H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3).


More information

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh GPU Programming EPCC The University of Edinburgh Contents NVIDIA CUDA C Proprietary interface to NVIDIA architecture CUDA Fortran Provided by PGI OpenCL Cross platform API 2 NVIDIA CUDA CUDA allows NVIDIA

More information

Postprint. This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden.

Postprint.   This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. http://www.diva-portal.org Postprint This is the accepted version of a paper presented at MCC13, November 25 26, Halmstad, Sweden. Citation for the original published paper: Ceballos, G., Black-Schaffer,

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Dynamic Dataflow. Seminar on embedded systems

Dynamic Dataflow. Seminar on embedded systems Dynamic Dataflow Seminar on embedded systems Dataflow Dataflow programming, Dataflow architecture Dataflow Models of Computation Computation is divided into nodes that can be executed concurrently Dataflow

More information

TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism

TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism TaskGenX: A Hardware-Software Proposal for Accelerating Task Parallelism Kallia Chronaki, Marc Casas, Miquel Moreto, Jaume Bosch, Rosa M. Badia Barcelona Supercomputing Center, Artificial Intelligence

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

EE213A - EE298-2 Lecture 8

EE213A - EE298-2 Lecture 8 EE3A - EE98- Lecture 8 Synchronous ata Flow Ingrid Verbauwhede epartment of Electrical Engineering University of California Los Angeles ingrid@ee.ucla.edu EE3A, Spring 000, Ingrid Verbauwhede, UCLA - Lecture

More information

GPU Programming Using NVIDIA CUDA

GPU Programming Using NVIDIA CUDA GPU Programming Using NVIDIA CUDA Siddhante Nangla 1, Professor Chetna Achar 2 1, 2 MET s Institute of Computer Science, Bandra Mumbai University Abstract: GPGPU or General-Purpose Computing on Graphics

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

High Performance Computing and GPU Programming

High Performance Computing and GPU Programming High Performance Computing and GPU Programming Lecture 1: Introduction Objectives C++/CPU Review GPU Intro Programming Model Objectives Objectives Before we begin a little motivation Intel Xeon 2.67GHz

More information

Accelerating CFD with Graphics Hardware

Accelerating CFD with Graphics Hardware Accelerating CFD with Graphics Hardware Graham Pullan (Whittle Laboratory, Cambridge University) 16 March 2009 Today Motivation CPUs and GPUs Programming NVIDIA GPUs with CUDA Application to turbomachinery

More information

! Readings! ! Room-level, on-chip! vs.!

! Readings! ! Room-level, on-chip! vs.! 1! 2! Suggested Readings!! Readings!! H&P: Chapter 7 especially 7.1-7.8!! (Over next 2 weeks)!! Introduction to Parallel Computing!! https://computing.llnl.gov/tutorials/parallel_comp/!! POSIX Threads

More information

GPU Programming Using CUDA. Samuli Laine NVIDIA Research

GPU Programming Using CUDA. Samuli Laine NVIDIA Research GPU Programming Using CUDA Samuli Laine NVIDIA Research Today GPU vs CPU Different architecture, different workloads Basics of CUDA Executing code on GPU Managing memory between CPU and GPU CUDA API Quick

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture

COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture COMP 605: Introduction to Parallel Computing Lecture : GPU Architecture Mary Thomas Department of Computer Science Computational Science Research Center (CSRC) San Diego State University (SDSU) Posted:

More information

Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids

Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids Energy-efficient acceleration of task dependency trees on CPU-GPU hybrids Mark Silberstein - Technion Naoya Maruyama Tokyo Institute of Technology Mark Silberstein, Technion 1 The case for heterogeneous

More information

Introduction to GPU programming. Introduction to GPU programming p. 1/17

Introduction to GPU programming. Introduction to GPU programming p. 1/17 Introduction to GPU programming Introduction to GPU programming p. 1/17 Introduction to GPU programming p. 2/17 Overview GPUs & computing Principles of CUDA programming One good reference: David B. Kirk

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

A Pattern-supported Parallelization Approach

A Pattern-supported Parallelization Approach A Pattern-supported Parallelization Approach Ralf Jahr, Mike Gerdes, Theo Ungerer University of Augsburg, Germany The 2013 International Workshop on Programming Models and Applications for Multicores and

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Introduc)on to GPU Programming

Introduc)on to GPU Programming Introduc)on to GPU Programming Mubashir Adnan Qureshi h3p://www.ncsa.illinois.edu/people/kindr/projects/hpca/files/singapore_p1.pdf h3p://developer.download.nvidia.com/cuda/training/nvidia_gpu_compu)ng_webinars_cuda_memory_op)miza)on.pdf

More information

Lecture 8: GPU Programming. CSE599G1: Spring 2017

Lecture 8: GPU Programming. CSE599G1: Spring 2017 Lecture 8: GPU Programming CSE599G1: Spring 2017 Announcements Project proposal due on Thursday (4/28) 5pm. Assignment 2 will be out today, due in two weeks. Implement GPU kernels and use cublas library

More information

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors

ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors ad-heap: an Efficient Heap Data Structure for Asymmetric Multicore Processors Weifeng Liu and Brian Vinter Niels Bohr Institute University of Copenhagen Denmark {weifeng, vinter}@nbi.dk March 1, 2014 Weifeng

More information

Finite Element Integration and Assembly on Modern Multi and Many-core Processors

Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

Heterogeneous platforms

Heterogeneous platforms Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

Serial and Parallel Sobel Filtering for multimedia applications

Serial and Parallel Sobel Filtering for multimedia applications Serial and Parallel Sobel Filtering for multimedia applications Gunay Abdullayeva Institute of Computer Science University of Tartu Email: gunay@ut.ee Abstract GSteamer contains various plugins to apply

More information

An Efficient Stream Buffer Mechanism for Dataflow Execution on Heterogeneous Platforms with GPUs

An Efficient Stream Buffer Mechanism for Dataflow Execution on Heterogeneous Platforms with GPUs An Efficient Stream Buffer Mechanism for Dataflow Execution on Heterogeneous Platforms with GPUs Ana Balevic Leiden Institute of Advanced Computer Science University of Leiden Leiden, The Netherlands balevic@liacs.nl

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Automatic Intra-Application Load Balancing for Heterogeneous Systems Automatic Intra-Application Load Balancing for Heterogeneous Systems Michael Boyer, Shuai Che, and Kevin Skadron Department of Computer Science University of Virginia Jayanth Gummaraju and Nuwan Jayasena

More information

SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters

SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters Farouk Mansouri, Sylvain Huet, Dominique Houzet To cite this version: Farouk Mansouri, Sylvain Huet, Dominique

More information

Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications

Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Web Physics: A Hardware Accelerated Physics Engine for Web- Based Applications Tasneem Brutch, Bo Li, Guodong Rong, Yi Shen, Chang Shu Samsung Research America-Silicon Valley {t.brutch, robert.li, g.rong,

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

cuda-on-cl A compiler and runtime for running NVIDIA CUDA C++11 applications on OpenCL 1.2 devices Hugh Perkins (ASAPP)

cuda-on-cl A compiler and runtime for running NVIDIA CUDA C++11 applications on OpenCL 1.2 devices Hugh Perkins (ASAPP) cuda-on-cl A compiler and runtime for running NVIDIA CUDA C++11 applications on OpenCL 1.2 devices Hugh Perkins (ASAPP) Demo: CUDA on Intel HD5500 global void setvalue(float *data, int idx, float value)

More information

Predictive Runtime Code Scheduling for Heterogeneous Architectures

Predictive Runtime Code Scheduling for Heterogeneous Architectures Predictive Runtime Code Scheduling for Heterogeneous Architectures Víctor Jiménez, Lluís Vilanova, Isaac Gelado Marisa Gil, Grigori Fursin, Nacho Navarro HiPEAC 2009 January, 26th, 2009 1 Outline Motivation

More information

CS 31: Intro to Systems Threading & Parallel Applications. Kevin Webb Swarthmore College November 27, 2018

CS 31: Intro to Systems Threading & Parallel Applications. Kevin Webb Swarthmore College November 27, 2018 CS 31: Intro to Systems Threading & Parallel Applications Kevin Webb Swarthmore College November 27, 2018 Reading Quiz Making Programs Run Faster We all like how fast computers are In the old days (1980

More information

Chapter 4: Multi-Threaded Programming

Chapter 4: Multi-Threaded Programming Chapter 4: Multi-Threaded Programming Chapter 4: Threads 4.1 Overview 4.2 Multicore Programming 4.3 Multithreading Models 4.4 Thread Libraries Pthreads Win32 Threads Java Threads 4.5 Implicit Threading

More information