A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms


Shuoxin Lin, Yanzhou Liu, William Plishker, Shuvra Bhattacharyya
Maryland DSPCAD Research Group
Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies, University of Maryland
http://www.ece.umd.edu/dspcad/home/dspcad.htm
International Workshop on Software and Compilers for Embedded Systems, May 23, 2016, Sankt Goar, Germany

Motivation

From high-level system specification to software on hybrid multicore CPU-GPU platforms.

[Figure: example dataflow graph with actors A, B, and C]

Synchronous Dataflow (SDF) [1]

[Figure: SDF graph with actors A, B, and C connected by edges e1 and e2, annotated with production and consumption rates]

Vertices (actors): computational modules
Edges: FIFO buffers
Tokens: data elements passed between actors
Production / consumption rates: in SDF, production and consumption rates are known at compile time.
Iterative execution on large or unbounded data streams

Objectives

To automatically exploit data, task, and pipeline parallelism from model-based specifications of digital signal processing (DSP) applications.
To generate throughput-optimized code for hybrid CPU-GPU platforms (with multi-core CPU and GPU devices working together on the application).

[Figure: a dataflow design framework transforming an SDF graph specification into a CPU-GPU implementation]

Exploiting Parallelism in SDF

Task parallelism: independent actors (e.g., B and C) can execute concurrently on different processors.
Data parallelism: an actor is replicated so that multiple invocations process different data in parallel.
Pipeline parallelism: successive graph stages execute concurrently on different processors, each working on a different iteration of the data stream.

[Figures: task-, data-, and pipeline-parallel schedules of a four-actor SDF graph (A, B, C, D) on processors P1-P3]

Multicore CPU-GPU Architecture

Multi-core CPU (host): CPU cores operate in Multiple-Instruction, Multiple-Data (MIMD) fashion; cores share main memory.
GPU (device): many SIMD multiprocessors; separate device memory.

Typical host-side CUDA usage:

float *hp, *dp;
hp = (float *)malloc(sizeof(float) * N);
cudaMalloc(&dp, sizeof(float) * N);
cudaMemcpy(dp, hp, sizeof(float) * N, cudaMemcpyHostToDevice);
call_kernel(dp);
/* ... other kernel executions ... */
cudaMemcpy(hp, dp, sizeof(float) * N, cudaMemcpyDeviceToHost);
cudaFree(dp);
free(hp);

Dataflow Design Framework: Challenges

Many factors affect system throughput:
Vectorization
Multiprocessor scheduling
Inter-processor communication
Other system constraints

Dataflow Design Framework: DIF-GPU

DIF-GPU design flow: model specification + actor implementation → vectorization → compile-time scheduling → code generation.

Dataflow Design Framework: Comparison of DIF-GPU with some dataflow runtime frameworks

DIF-GPU:
  Compile-time: vectorization, scheduling, inter-processor communication
  Run-time: peer-worker multithreading
Related works (StarPU [2] and OmpSs [3]):
  Run-time: scheduling, inter-processor communication, manager-worker multithreading

[2] C. Augonnet et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187-198, February 2011.
[3] A. Duran et al. OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02):173-193, 2011.

Dataflow Design Framework: Why DIF-GPU?

Compile-time scheduling and data transfers → less runtime overhead
Integration of vectorization and code generation → more extensive design automation

Model Specification in DIF

The Dataflow Interchange Format (DIF) is a standard language for specifying mixed-grain dataflow models for digital signal processing (DSP) systems.

[Figure: SDF graph with repetition counts <1>, <2>, <3>: src produces 2 tokens per firing on e1; usp consumes 1 from e1 and produces 3 on e2; snk consumes 2 from e2. src: source, usp: upsampler, snk: sink]

sdf usp_graph {
  topology {
    nodes = src, usp, snk;
    edges = e1 (src, usp), e2 (usp, snk);
  }
  production { e1 = 2; e2 = 3; }
  consumption { e1 = 1; e2 = 2; }
  attribute edge_type { e1 = "float"; e2 = "float"; }
  actor src { name = "src_1f"; port_0 : OUTPUT = e1; }
  actor usp { name = "usp3"; GPU_enabled = 1; port_0 : INPUT = e1; port_1 : OUTPUT = e2; }
  actor snk { name = "snk_1f"; port_0 : INPUT = e2; }
}

Actor Implementation in the LIghtweight Dataflow Environment (LIDE)

LIDE: a programming methodology and set of APIs for implementing dataflow graph actors and edges. A minimal usage sketch follows the table.

Actor functions:
  new()        Performs memory allocation and initialization for the actor; specifies the CPU/GPU version and vectorization.
  enable()     Checks input availability and output free space.
  invoke()     For the CPU version, executes the CPU function; for the GPU version, launches GPU kernel(s).
  terminate()  Frees memory that has been dynamically allocated for the actor.

Edge functions:
  new()        Allocates the FIFO buffer in CPU/GPU memory.
  free()       Frees the FIFO buffer.
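To make the calling discipline concrete, here is a self-contained toy in C that mimics the new()/enable()/invoke()/terminate() pattern from the table with a trivial "scale by 2" actor and array-backed buffers. The types and names (fifo_t, scale2_*) are illustrative assumptions, not the actual LIDE API.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int *data; int head, tail, capacity; } fifo_t;
typedef struct { fifo_t *in; fifo_t *out; } scale2_context_type;

/* new(): allocate and initialize the actor context. */
scale2_context_type *scale2_new(fifo_t *in, fifo_t *out) {
    scale2_context_type *a = (scale2_context_type *)malloc(sizeof(*a));
    a->in = in;
    a->out = out;
    return a;
}

/* enable(): check input availability and output free space. */
int scale2_enable(scale2_context_type *a) {
    return (a->in->tail - a->in->head) >= 1 &&
           (a->out->tail - a->out->head) < a->out->capacity;
}

/* invoke(): consume one token, produce one token (CPU version). */
void scale2_invoke(scale2_context_type *a) {
    int token = a->in->data[a->in->head++];
    a->out->data[a->out->tail++] = 2 * token;
}

/* terminate(): free dynamically allocated actor state. */
void scale2_terminate(scale2_context_type *a) { free(a); }

int main(void) {
    int in_buf[4] = {1, 2, 3, 4}, out_buf[4];
    fifo_t in  = { in_buf, 0, 4, 4 };
    fifo_t out = { out_buf, 0, 0, 4 };
    scale2_context_type *a = scale2_new(&in, &out);
    while (scale2_enable(a))   /* fire only while enabled */
        scale2_invoke(a);
    scale2_terminate(a);
    for (int i = 0; i < out.tail; i++)
        printf("%d ", out.data[i]);
    printf("\n");
    return 0;
}

Compiled and run, this fires the actor while enable() holds and prints 2 4 6 8.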

LIDE is compact, extensible, and flexible.

Exploiting Parallelism in DIF-GPU: Vectorization (Data Parallelism)

[Figure: graph-level vectorization of the src → usp → snk graph; the multi-rate graph with repetition counts <1>, <2>, <3> becomes a single-rate graph whose actors fire b, 2b, and 3b times per block, with 2b and 6b tokens per block on its edges]

Each actor v is vectorized by b·q(v), where q(v) is the repetition count of v and b is the graph vectorization degree (GVD).
A multi-rate SDF graph thus becomes a block-processing, single-rate SDF task graph for scheduling (a worked example follows).
b is limited by system constraints (memory, latency, etc.).
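As a worked example using the production/consumption rates from the DIF specification above, the repetition counts q(v) come from the SDF balance equations, which require each edge to produce and consume the same number of tokens per graph iteration:

  edge e1 (src → usp): 2·q(src) = 1·q(usp)
  edge e2 (usp → snk): 3·q(usp) = 2·q(snk)

The smallest positive integer solution is q(src) = 1, q(usp) = 2, q(snk) = 3, matching the repetition counts <1>, <2>, <3> in the figure. With graph vectorization degree b, the actors therefore execute b, 2b, and 3b times per block, and e1 carries 2b tokens per block while e2 carries 6b.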

Exploiting Parallelism in DIF-GPU: Scheduling (Task & Pipeline Parallelism)

First-Come, First-Served (FCFS) [4]: a simple greedy approach; schedules an actor whenever a processor becomes idle.
Heterogeneous Earliest Finish Time (HEFT) [5]: maintains a list of actors that are ready to be executed and selects the actor-processor pair with the earliest finish time (see the sketch below).
The framework can be extended with other scheduling strategies.
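As a rough illustration of the HEFT selection rule just described, the following C sketch picks, among all ready actors and all processors, the pair with the earliest finish time. The data layout (exec_time, proc_free, actor_ready) is a simplifying assumption and inter-processor communication costs are ignored; this is not DIF-GPU's internal scheduler.

#include <float.h>

#define NUM_ACTORS 6
#define NUM_PROCS  2

double exec_time[NUM_ACTORS][NUM_PROCS]; /* actor cost on each processor       */
double proc_free[NUM_PROCS];             /* time at which each processor idles  */
double actor_ready[NUM_ACTORS];          /* time at which each actor's inputs
                                            become available                    */

/* Returns the chosen actor; writes the chosen processor and finish time. */
int heft_select(const int *ready, int num_ready, int *best_proc, double *best_eft)
{
    int best_actor = -1;
    *best_eft = DBL_MAX;
    for (int i = 0; i < num_ready; i++) {
        int a = ready[i];
        for (int p = 0; p < NUM_PROCS; p++) {
            /* earliest start = max(processor idle time, input ready time) */
            double start = proc_free[p] > actor_ready[a] ? proc_free[p]
                                                         : actor_ready[a];
            double eft = start + exec_time[a][p];
            if (eft < *best_eft) {   /* earliest finish time wins */
                *best_eft = eft;
                best_actor = a;
                *best_proc = p;
            }
        }
    }
    return best_actor;
}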

Exploiting Parallelism in DIF-GPU: Scheduling Example

[Figure: a six-actor graph A-F, each annotated with its CPU/GPU execution times t1/t2. FCFS maps A, B, C, F to P1 and D, E to P2, giving schedule period T = 4; HEFT maps A, D, E, F to P1 and B, C to P2, giving T = 3.5.]

Inter-processor Data Transfer: Host-Centered FIFO Allocation (HCFA)

All FIFOs are maintained in host memory.
This matches frameworks without native GPU support (e.g., GNU Radio) and allows easy integration with existing frameworks.
It incurs large overhead from excessive CPU-GPU data transfer: each invocation of a GPU actor is bracketed by H2D copies of its inputs and a D2H copy of its outputs.

[Figure: a graph containing CPU actors and a GPU actor v; since all buffers (buf 1-3) reside on the host, v's kernel execution is surrounded by H2D and D2H transfers]

Inter-processor Data Transfer: Mapping-Dependent FIFO Allocation (MDFA)

FIFOs can be allocated in host or device memory, depending on the schedule.
H2D/D2H actors are inserted to move data explicitly (a minimal sketch of such a transfer actor follows).
Inter-processor data transfer occurs only at the locations determined by the schedule.

[Figure: the same graph under MDFA; edges crossing the CPU-GPU boundary are split into host-side and device-side FIFOs (e.g., e1c/e1g, e2c/e2g) connected by explicit H2D and D2H actors]
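The following is a minimal CUDA sketch of what the invoke step of an explicit H2D transfer actor could look like. The function name h2d_invoke and the raw-pointer view of the FIFO buffers are assumptions made for illustration; the memcpy actors that DIF-GPU actually generates use the LIDE FIFO API (see the generated code below).

#include <stddef.h>
#include <cuda_runtime.h>

/* Sketch of an H2D transfer actor's invoke step: copy n tokens from a
 * host-side FIFO buffer into a device-side FIFO buffer. Under MDFA,
 * this one explicit, schedule-determined copy replaces the repeated
 * per-invocation transfers that HCFA would perform. */
void h2d_invoke(const float *host_fifo_data, float *device_fifo_data, size_t n)
{
    cudaMemcpy(device_fifo_data, host_fifo_data,
               n * sizeof(float), cudaMemcpyHostToDevice);
}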

DIF-GPU Example

Vectorization: the multi-rate graph src -(2,1)-> usp -(3,2)-> snk is transformed into a single-rate graph in which src, usp, and snk execute b, 2b, and 3b times per block.

Scheduling and data-transfer actor insertion: because usp is mapped to the GPU, H2D and D2H transfer actors are inserted around it, yielding the schedules
  CPU: src, snk, src, snk, ...
  GPU: H2D, usp, D2H, H2D, ...

DIF-GPU Example: Generated LIDE-CUDA Code (Header File)

#include <stdio.h>
/* ... */

/* Macro definitions */
#define SRC 0
#define USP 1
#define SNK 2
#define H2D_0 3
#define D2H_0 4
#define ACTOR_COUNT 5
#define CPU 0
#define GPU 1
#define NUMBER_OF_THREADS 2

/* Class declaration */
class usp_graph {
public:
    usp_graph();
    ~usp_graph();
    void execute();
private:
    thread_list* thread_list;
    actor_context_type* actors[ACTOR_COUNT];
    fifo_pointer edge_in_h2d_0;
    fifo_pointer edge_out_d2h_0;
    fifo_pointer edge_in_d2h_0;
    fifo_pointer edge_out_h2d_0;
};

DIF-GPU Example: Generated LIDE-CUDA Code (Graph Constructor)

#include "usp_graph.h"

usp_graph::usp_graph() {
    /* Create edges */
    edge_in_h2d_0  = fifo_new(4,  sizeof(float), CPU);
    edge_out_d2h_0 = fifo_new(12, sizeof(float), CPU);
    edge_in_d2h_0  = fifo_new(12, sizeof(float), GPU);
    edge_out_h2d_0 = fifo_new(4,  sizeof(float), GPU);

    /* Create actors */
    actors[D2H_0] = (actor_context_type*) memcpy_new(
        edge_in_d2h_0, edge_out_d2h_0, 12, 12, sizeof(float), GPU);
    actors[SNK] = (actor_context_type*) snk_1f_new(
        edge_out_d2h_0, 12, CPU);
    actors[H2D_0] = (actor_context_type*) memcpy_new(
        edge_in_h2d_0, edge_out_h2d_0, 4, 4, sizeof(float), GPU);
    actors[USP] = (actor_context_type*) usp3_new(
        edge_out_h2d_0, edge_in_d2h_0, 4, 12, GPU);
    actors[SRC] = (actor_context_type*) src_1f_new(
        edge_in_h2d_0, 4, CPU);

    /* Create the schedule of each thread */
    const char* thread_schedules[NUMBER_OF_THREADS] =
        {"thread_0.txt", "thread_1.txt"};
    thread_list = thread_list_init(NUMBER_OF_THREADS,
        thread_schedules, actors, ACTOR_COUNT);
}

DIF-GPU Example: Generated LIDE-CUDA Code (Graph-Level execute() and Destructor)

void usp_graph::execute() {
    thread_list_scheduler(thread_list);
}

usp_graph::~usp_graph() {
    /* Terminate threads */
    thread_list_terminate(thread_list);

    /* Free FIFOs */
    fifo_free(edge_in_h2d_0);
    fifo_free(edge_out_d2h_0);
    fifo_free(edge_in_d2h_0);
    fifo_free(edge_out_h2d_0);

    /* Destroy actors */
    memcpy_terminate((memcpy_context_type*)actors[D2H_0]);
    snk_1f_terminate((snk_1f_context_type*)actors[SNK]);
    memcpy_terminate((memcpy_context_type*)actors[H2D_0]);
    usp3_terminate((usp3_context_type*)actors[USP]);
    src_1f_terminate((src_1f_context_type*)actors[SRC]);
}

Case Study

Throughput for a b-vectorized graph: Th = b/T, where Th is the throughput, b is the vectorization degree, and T is the schedule period.

Test bench: MP-Sched (P x S)

  Item        Values
  Grid size   2x5, 4x4, 6x3
  Platform    1 CC + 1 GPU; 3 CCs + 1 GPU
  Scheduler   HEFT, FCFS

(CC: CPU core)
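For example, under this definition, a schedule with vectorization degree b = 1024 that completes one period in T = 0.2 ms (hypothetical numbers, for illustration only) yields Th = 1024 / (2 x 10^-4 s) = 5.12 x 10^6 samples/s, on the same 10^6/s scale as the throughput values reported below.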

Speedup of FIR Filter (K = 7)

[Figure: speedup of a b-vectorized FIR filter (filter length K) versus vectorization degree b, excluding CPU-GPU data-transfer time]

Slow increase from b = 2^7 to 2^10: low GPU utilization.
Fast increase from b = 2^10 to 2^16: increased utilization.
Slow increase from b = 2^17 to 2^19: saturation.

Data Transfer Evaluation

Throughput and data-transfer overhead for FIFO implementations based on HCFA and MDFA. Percentages give the fraction of time spent on data transfer.

          Topology      2x5     4x4     6x3
  HCFA    Th (10^6/s)   4.80    2.84    2.52
          D2H           37.4%   37.6%   37.1%
          H2D           16.4%   16.2%   15.8%
  MDFA    Th (10^6/s)   6.71    4.06    3.59
          D2H           17.2%   15.5%   20.8%
          H2D           6.7%    9.9%    8.2%

GPU Workload vs. Vectorization (MP-Sched 4x4)

[Figures: GPU workload traces for vectorization degrees 128, 256, 512, 1024, 2048, 4096, and 8192]

System-Level Evaluation

Single-processor baselines:
  CPU baseline Th_c: all actors scheduled on the same CPU core (CC).
  GPU baseline Th_g: all actors with GPU acceleration scheduled on the GPU; all others scheduled on the same CC.

DIF-GPU speedup: sp = Th / max(Th_c, Th_g)

System-Level Evaluation

[Figures: DIF-GPU speedup versus vectorization degree for MP-Sched 2x5, 4x4, and 6x3, under the FCFS and HEFT schedulers]

System-Level Evaluation: Observations

For small vectorization degrees, 3 CCs + 1 GPU gives higher throughput: the GPU versions of the actors are slower at small block sizes, so additional CPU cores pay off.
For large vectorization degrees, 1 CC + 1 GPU gives higher throughput: the GPU versions are much faster than the CPU versions, so under HEFT/FCFS scheduling the extra CPU threads mostly contribute multithreading runtime overhead.

Scheduler Evaluation

Speedup across the 2x5, 4x4, and 6x3 topologies:
  Th(HEFT) > Th(FCFS) in general.
  Consistent gain over the GPU baseline; inter-processor data transfer influences the results.

Scheduler Evaluation (continued)

Across the same topologies (2x5, 4x4, 6x3), there are cases (b_l < b < b_u) in which FCFS outperforms HEFT.

Conclusion

DIF-GPU framework: SDF graph specification (DIF), vectorization, scheduling, and code generation.
Demonstration on the MP-Sched benchmarks:
  Data-transfer overhead reduction using MDFA.
  Performance improvement over the CPU and GPU baselines.

References

1. E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235-1245, September 1987.
2. C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187-198, February 2011.
3. A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02):173-193, 2011.
4. G. Teodoro, R. Sachetto, O. Sertel, M. N. Gurcan, W. Meira, U. Catalyurek, and R. Ferreira. Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In Proceedings of the IEEE International Conference on Cluster Computing and Workshops, pages 1-10, 2009.
5. H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260-274, 2002.