A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms
1 A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms
Shuoxin Lin, Yanzhou Liu, William Plishker, Shuvra Bhattacharyya
Maryland DSPCAD Research Group, Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies, University of Maryland
International Workshop on Software and Compilers for Embedded Systems, May 23, 2016, Sankt Goar, Germany
2 Motivation
From high-level system specification to software on hybrid multicore-CPU/GPU platforms.
(Figure: an example dataflow graph with actors A, B, C mapped onto a CPU-GPU platform.)
3 Synchronous Dataflow (SDF) [1]
(Figure: an SDF graph with actors A, B, C; edges e1, e2 are annotated with production and consumption rates.)
- Vertices (actors): computational modules
- Edges: FIFO buffers
- Tokens: data elements passed between actors
- Production/consumption rates: in SDF, production and consumption rates are known at compile time.
- Iterative execution on large or unbounded data streams
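The compile-time rates are what make an SDF graph statically schedulable: they determine a repetitions vector q that fixes how many times each actor fires per graph iteration. As a brief aside (standard SDF theory, not stated on the slide), q is the smallest positive integer solution of the balance equations:

\[ \mathrm{prd}(e)\, q(\mathrm{src}(e)) \;=\; \mathrm{cns}(e)\, q(\mathrm{snk}(e)) \qquad \text{for every edge } e, \]

where prd(e) and cns(e) denote the production and consumption rates on edge e, and src(e), snk(e) its producer and consumer actors.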
4 Objectives
- To automatically exploit data, task, and pipeline parallelism from model-based specifications of digital signal processing (DSP) applications.
- To generate throughput-optimized code for hybrid CPU-GPU platforms (with multi-core CPU and GPU devices working together on the application).
(Figure: a dataflow graph passing through the dataflow design framework onto a CPU-GPU platform.)
5 Exploiting Parallelism in SDF
(Figure: three example mappings of a four-actor graph A, B, C, D. Task parallelism: independent actors run concurrently on processors P1 and P2. Data parallelism: an actor is replicated so multiple tokens are processed concurrently. Pipeline parallelism: successive actors are assigned to stages 1-3 on processors P1, P2, P3.)
6 Multicore CPU-GPU architecture
- Multi-core CPU (host): CPU cores (Multiple Instruction, Multiple Data; MIMD); cores share main memory.
- GPU (device): many SIMD multiprocessors; separate memory.

float *hp, *dp;
hp = (float*)malloc(sizeof(float) * N);
cudaMalloc(&dp, sizeof(float) * N);
cudaMemcpy(dp, hp, sizeof(float) * N, cudaMemcpyHostToDevice);
call_kernel(dp);
/* ... other kernel executions ... */
cudaMemcpy(hp, dp, sizeof(float) * N, cudaMemcpyDeviceToHost);
cudaFree(dp);
free(hp);
7 Dataflow Design Framework: Challenges
Many factors affect system throughput: vectorization, multiprocessor scheduling, inter-processor communication, and other system constraints.
8 Dataflow Design Framework: the DIF-GPU framework
Flow: model specification and actor implementation → vectorization → compile-time scheduling → code generation.
9 Dataflow Design Framework: comparison of DIF-GPU with some dataflow runtime frameworks
- DIF-GPU — compile time: vectorization, scheduling, inter-processor communication; run time: peer-worker multithreading.
- Related works (StarPU [2] and OmpSs [3]) — run time: scheduling, inter-processor communication, manager-worker multithreading.
10 Dataflow Design Framework: Why DIF-GPU?
- Compile-time scheduling and data transfers → less runtime overhead.
- Integration of vectorization and code generation → more extensive design automation.
11 Model Specification in DIF
Dataflow Interchange Format (DIF) [1]: a standard language for specifying mixed-grain dataflow models for digital signal processing (DSP) systems.
(Figure: graph src --2/1--> usp --3/2--> snk with repetition counts <1>, <2>, <3>; src: source, usp: upsampler, snk: sink.)

sdf usp_graph {
  topology {
    nodes = src, usp, snk;
    edges = e1 (src, usp), e2 (usp, snk);
  }
  production  { e1 = 2; e2 = 3; }
  consumption { e1 = 1; e2 = 2; }
  attribute edge_type { e1 = "float"; e2 = "float"; }
  actor src { name = "src_1f"; port_0 : OUTPUT = e1; }
  actor usp { name = "usp3"; GPU_enabled = 1;
              port_0 : INPUT = e1; port_1 : OUTPUT = e2; }
  actor snk { name = "snk_1f"; port_0 : INPUT = e2; }
}
12 Actor Implementation in the LIghtweight Dataflow Environment (LIDE)
LIDE: programming methodology and APIs for implementing dataflow graph actors and edges.
Actor functions:
- new(): performs memory allocation and initialization for the actor; specifies CPU/GPU version and vectorization.
- enable(): checks input availability and output free space.
- invoke(): for the CPU version, executes the CPU function; for the GPU version, launches GPU kernel(s).
- terminate(): frees memory that has been dynamically allocated for the actor.
Edge functions:
- new(): allocates the FIFO buffer in CPU/GPU memory.
- free(): frees the FIFO buffer.
13 Actor Implementation in the LIghtweight Dataflow Environment (LIDE)
LIDE is compact, extensible, and flexible; a minimal sketch of its actor pattern follows.
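To illustrate the contract described by the table above, here is a small, self-contained C sketch of a CPU actor written in the new/enable/invoke/terminate style. All names (fifo, gain_*) are illustrative stand-ins, not the actual LIDE API.

#include <stdlib.h>
#include <stdbool.h>

/* Minimal ring-buffer FIFO standing in for a LIDE edge. */
typedef struct { float *data; int cap, head, count; } fifo;

fifo *fifo_new(int cap) {
    fifo *f = malloc(sizeof *f);
    f->data = malloc(cap * sizeof *f->data);
    f->cap = cap; f->head = 0; f->count = 0;
    return f;
}
void fifo_free(fifo *f) { free(f->data); free(f); }
void fifo_write(fifo *f, float x) {
    f->data[(f->head + f->count++) % f->cap] = x;
}
float fifo_read(fifo *f) {
    float x = f->data[f->head];
    f->head = (f->head + 1) % f->cap; f->count--;
    return x;
}

/* Actor context: one input edge, one output edge, a vectorization degree. */
typedef struct { fifo *in, *out; int vec; } gain_context;

/* new(): allocate and initialize the actor context. */
gain_context *gain_new(fifo *in, fifo *out, int vec) {
    gain_context *c = malloc(sizeof *c);
    c->in = in; c->out = out; c->vec = vec;
    return c;
}

/* enable(): check input availability and output free space. */
bool gain_enable(gain_context *c) {
    return c->in->count >= c->vec &&
           (c->out->cap - c->out->count) >= c->vec;
}

/* invoke(): one vectorized firing on the CPU (here: scale each token by 2). */
void gain_invoke(gain_context *c) {
    for (int i = 0; i < c->vec; i++)
        fifo_write(c->out, 2.0f * fifo_read(c->in));
}

/* terminate(): free dynamically allocated actor state. */
void gain_terminate(gain_context *c) { free(c); }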
14 Exploiting Parallelism in DIF-GPU: Vectorization (Data Parallelism)
(Figure: graph-level vectorization transforms the multi-rate graph src --2/1--> usp --3/2--> snk, with repetition counts <1>, <2>, <3>, into a single-rate graph of block actors whose edges carry 2b and 6b tokens per iteration.)
- Each actor v is vectorized by b·q(v), where q(v) is the repetition count of v and b is the graph vectorization degree (GVD).
- A multi-rate SDF graph becomes a block-processing, single-rate SDF task graph for scheduling (a worked example follows).
- b is limited by system constraints (memory, latency, etc.).
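To make the transformation concrete on the src/usp/snk graph above (a worked example derived from the rates shown), the balance equations give

\[ 2\,q(\mathit{src}) = 1\,q(\mathit{usp}), \qquad 3\,q(\mathit{usp}) = 2\,q(\mathit{snk}) \;\Rightarrow\; q = (1,\, 2,\, 3). \]

With GVD b, the block actors perform b·1, b·2, and b·3 firings per invocation, so the two edges carry 2b and 6b tokens per iteration, every block actor has repetition count 1, and the vectorized graph is single-rate.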
15 Exploiting Parallelism in DIF-GPU: Scheduling (Task and Pipeline Parallelism)
- First Come First Serve (FCFS) [4]: simple greedy approach; schedules an actor whenever a processor becomes idle.
- Heterogeneous Earliest Finish Time (HEFT) [5]: manages a list of actors that are ready to be executed; selects the actor-processor pair with the earliest finish time (sketched below).
- Can be extended with other scheduling strategies.
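The selection step the slide attributes to HEFT — picking the actor-processor pair with the earliest finish time — can be sketched in a few lines of C. This is a simplified illustration only: upward-rank priorities, precedence constraints, and transfer costs are omitted, and the execution times are made up.

#include <stdio.h>

#define NP 2   /* processors: 0 = CPU core, 1 = GPU */
#define NA 3   /* ready actors in this toy example  */

int main(void) {
    /* exec[a][p]: execution time of ready actor a on processor p */
    double exec[NA][NP] = { {1.0, 0.5}, {1.0, 0.5}, {0.5, 0.5} };
    double avail[NP]    = { 0.0, 0.0 }; /* when each processor frees up */
    int scheduled[NA]   = { 0 };

    for (int step = 0; step < NA; step++) {
        int best_a = -1, best_p = -1;
        double best_fin = 1e30;
        /* Consider every unscheduled actor on every processor. */
        for (int a = 0; a < NA; a++) {
            if (scheduled[a]) continue;
            for (int p = 0; p < NP; p++) {
                double fin = avail[p] + exec[a][p];
                if (fin < best_fin) { best_fin = fin; best_a = a; best_p = p; }
            }
        }
        scheduled[best_a] = 1;
        avail[best_p] = best_fin;  /* processor is busy until then */
        printf("actor %d -> proc %d, finishes at %.2f\n",
               best_a, best_p, best_fin);
    }
    return 0;
}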
16 Exploiting Parallelism in DIF-GPU: Scheduling Example
(Figure: a six-actor graph A-F, each annotated with CPU/GPU execution times t1/t2, and the resulting two-processor schedules. FCFS: P1 runs A, B, C, F; P2 runs D, E; period T = 4. HEFT: P1 runs A, D, E, F; P2 runs B, C; yielding a shorter period.)
17 Inter-processor Data Transfer: Host-centered FIFO Allocation (HCFA)
- Maintains all FIFOs in host memory.
- Used by frameworks without GPU support (e.g., GNU Radio); easy integration with existing frameworks.
- Large amounts of overhead due to excessive CPU-GPU data transfer.
(Figure: a graph whose buffers buf1-buf3 all reside in host memory, so each firing of the GPU actor v is wrapped in H2D transfers, a kernel launch, and a D2H transfer; w denotes a CPU actor, v a GPU actor.)
18 Inter-processor Data Transfer: Mapping-dependent FIFO Allocation (MDFA)
- FIFOs can be allocated in host or device memory depending on the schedule.
- H2D/D2H actors are inserted to move data explicitly (a sketch follows).
- Inter-processor data transfer occurs only at locations determined by the schedule.
(Figure: the same graph with host-side FIFOs e1c, e2c and device-side FIFOs e1g, e2g; inserted H2D and D2H actors bridge them around the GPU actor v; w denotes a CPU actor, v a GPU actor.)
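The inserted H2D/D2H actors behave like ordinary actors whose invoke() is just a memory copy. A hedged CUDA sketch of the host-to-device direction — the context layout and names here are illustrative, not the generated memcpy actor's real interface:

#include <cuda_runtime.h>

typedef struct {
    float *host_buf;   /* backing store of the CPU-side FIFO */
    float *dev_buf;    /* backing store of the GPU-side FIFO */
    int    n;          /* tokens moved per firing            */
} h2d_context;

/* invoke(): one firing moves n tokens from the host FIFO to the device
 * FIFO, at exactly the point the static schedule placed the transfer. */
void h2d_invoke(h2d_context *c) {
    cudaMemcpy(c->dev_buf, c->host_buf,
               c->n * sizeof(float), cudaMemcpyHostToDevice);
}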
19 DIF-GPU Example
- Original graph: src --2/1--> usp --3/2--> snk with repetition counts <1>, <2>, <3>.
- After vectorization: single-rate block actors (src, usp, snk vectorized by b, 2b, and 3b firings); edges carry 2b and 6b tokens per iteration.
- After scheduling and data-transfer actor insertion: src --2b--> H2D --2b--> usp --6b--> D2H --6b--> snk.
- Schedules — CPU: src, snk, src, snk, ...; GPU: H2D, usp, D2H, H2D, ...
20 DIF-GPU Example: Generated LIDE-CUDA code — header file

#include <stdio.h>
/* ... */

/* Macro definitions */
#define SRC 0
#define USP 1
#define SNK 2
#define H2D_0 3
#define D2H_0 4
#define ACTOR_COUNT 5
#define CPU 0
#define GPU 1
#define NUMBER_OF_THREADS 2

/* Class declaration */
class usp_graph {
public:
    usp_graph();
    ~usp_graph();
    void execute();
private:
    thread_list* thread_list;
    actor_context_type* actors[ACTOR_COUNT];
    fifo_pointer edge_in_h2d_0;
    fifo_pointer edge_out_d2h_0;
    fifo_pointer edge_in_d2h_0;
    fifo_pointer edge_out_h2d_0;
};
21 DIF-GPU Example: Generated LIDE-CUDA code — source code: graph constructor

#include "usp_graph.h"

usp_graph::usp_graph() {
    /* Create edges */
    edge_in_h2d_0  = fifo_new(4,  sizeof(float), CPU);
    edge_out_d2h_0 = fifo_new(12, sizeof(float), CPU);
    edge_in_d2h_0  = fifo_new(12, sizeof(float), GPU);
    edge_out_h2d_0 = fifo_new(4,  sizeof(float), GPU);

    /* Create actors */
    actors[D2H_0] = (actor_context_type*) memcpy_new(
        edge_in_d2h_0, edge_out_d2h_0, 12, 12, sizeof(float), GPU);
    actors[SNK] = (actor_context_type*) snk_1f_new(
        edge_out_d2h_0, 12, CPU);
    actors[H2D_0] = (actor_context_type*) memcpy_new(
        edge_in_h2d_0, edge_out_h2d_0, 4, 4, sizeof(float), GPU);
    actors[USP] = (actor_context_type*) usp3_new(
        edge_out_h2d_0, edge_in_d2h_0, 4, 12, GPU);
    actors[SRC] = (actor_context_type*) src_1f_new(
        edge_in_h2d_0, 4, CPU);

    /* Create schedules of each thread */
    const char* thread_schedules[NUMBER_OF_THREADS] =
        {"thread_0.txt", "thread_1.txt"};
    thread_list = thread_list_init(NUMBER_OF_THREADS,
        thread_schedules, actors, ACTOR_COUNT);
}
22 DIF-GPU Example: Generated LIDE-CUDA code — source code: graph-level execute() and destructor (a usage sketch follows)

void usp_graph::execute() {
    thread_list_scheduler(thread_list);
}

usp_graph::~usp_graph() {
    /* Terminate threads */
    thread_list_terminate(thread_list);
    /* Free FIFOs */
    fifo_free(edge_in_h2d_0);
    fifo_free(edge_out_d2h_0);
    fifo_free(edge_in_d2h_0);
    fifo_free(edge_out_h2d_0);
    /* Destroy actors */
    memcpy_terminate((memcpy_context_type*)actors[D2H_0]);
    snk_1f_terminate((snk_1f_context_type*)actors[SNK]);
    memcpy_terminate((memcpy_context_type*)actors[H2D_0]);
    usp3_terminate((usp3_context_type*)actors[USP]);
    src_1f_terminate((src_1f_context_type*)actors[SRC]);
}
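With the generated class above, a driver needs only a few lines. This main() is a hypothetical usage sketch, not part of the generator output shown:

#include "usp_graph.h"

int main(void) {
    usp_graph g;   /* constructor builds FIFOs, actors, and thread schedules */
    g.execute();   /* worker threads run their static schedules              */
    return 0;      /* destructor joins threads, frees FIFOs and actors       */
}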
23 Case Study
- Throughput for the b-vectorized graph: Th = b/T, where Th is throughput, b the vectorization degree, and T the schedule period.
- Test bench: MP-Sched (P x S).
- Grid size: 2x5, 4x4, 6x3.
- Platform: 1 CC + 1 GPU; 3 CCs + 1 GPU (CC: CPU core).
- Scheduler: HEFT, FCFS.
24 Speedup of FIR Filter
- Filter length K = 7; excluding CPU-GPU data-transfer time (a kernel sketch follows).
(Figure: speedup vs. vectorization degree b for a bank of FIR blocks, B = 1, 2, ..., N.)
- Slow increase from b = 2^7 to 2^10: low GPU utilization.
- Fast increase from b = 2^10 to 2^16: increased utilization.
- Slow increase from b = 2^17 to 2^19: saturation.
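For orientation, a minimal CUDA kernel for a length-K FIR filter of the kind benchmarked here — an illustrative sketch, not the kernel used in the experiments:

#include <cuda_runtime.h>

#define K 7   /* filter length, matching the experiment */

/* y[i] = sum_{j=0..K-1} h[j] * x[i+j]; x must hold n + K - 1 samples. */
__global__ void fir(const float *x, const float *h, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        for (int j = 0; j < K; j++)
            acc += h[j] * x[i + j];
        y[i] = acc;
    }
}

/* Launch example: fir<<<(n + 255) / 256, 256>>>(d_x, d_h, d_y, n); */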
25 Data Transfer Evaluation
Throughput and data-transfer overhead for FIFO implementations based on HCFA and MDFA; the percentage is the ratio of time spent on data transfer.

Topology              2x5      4x4      6x3
HCFA  Th (10^6/s)
HCFA  D2H             37.4%    37.6%    37.1%
HCFA  H2D             16.4%    16.2%    15.8%
MDFA  Th (10^6/s)
MDFA  D2H             17.2%    15.5%    20.8%
MDFA  H2D             6.7%     9.9%     8.2%
26 GPU Workload vs. Vectorization: MP-Sched 4x4
(Figures: GPU workload traces for vectorization degrees 128, 256, ...)
27 GPU Workload vs. Vectorization: MP-Sched 4x4
(Figures: GPU workload traces for vectorization degrees 1024, 2048, ...)
28 GPU Workload vs. Vectorization: MP-Sched 4x4
(Figure: GPU workload trace at one further vectorization degree.)
29 System Level Evaluation: single-processor baselines
- CPU baseline Th_c: all actors scheduled on the same CPU core (CC).
- GPU baseline Th_g: all actors with GPU acceleration scheduled on the GPU; all others scheduled on the same CC.
- DIF-GPU speedup: sp = Th / max(Th_c, Th_g).
30 System Level Evaluation: MP-Sched 2x5 (figure)
31 System Level Evaluation: MP-Sched 4x4 (figures: FCFS and HEFT)
32 System Level Evaluation: MP-Sched 6x3 (figures: FCFS and HEFT)
33 System Level Evaluation
- For small vectorization degrees, 3 CCs + 1 GPU gives higher throughput: the GPU versions of actors are slower there, so the extra cores dominate.
- For large vectorization degrees, 1 CC + 1 GPU gives higher throughput: the GPU versions are much faster than the CPU versions, and HEFT/FCFS scheduling with fewer workers incurs less multithreading runtime overhead.
34 Scheduler Evaluation
- Speedup in different topologies: 2x5, 4x4, 6x3.
- Th(HEFT) > Th(FCFS) in general.
- Consistent gain over the GPU baseline.
- Inter-processor data transfer.
35 Scheduler Evaluation
- Speedup in different topologies: 2x5, 4x4, 6x3.
- In some cases (b_l < b < b_u), FCFS performs better.
36 Conclusion
- DIF-GPU framework: SDF graph specification (DIF), vectorization, scheduling, code generation.
- Demonstration → MP-Sched benchmarks.
- Data-transfer overhead reduction using MDFA.
- Performance improvement over CPU and GPU baselines.
37 References
1. E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235-1245, September 1987.
2. C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187-198, February 2011.
3. A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(2):173-193, 2011.
4. G. Teodoro, R. Sachetto, O. Sertel, M. N. Gurcan, W. Meira, U. Catalyurek, and R. Ferreira. Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In Proceedings of the IEEE International Conference on Cluster Computing and Workshops, pages 1-10, 2009.
5. H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260-274, 2002.