A Design Framework for Mapping Vectorized Synchronous Dataflow Graphs onto CPU-GPU Platforms

Shuoxin Lin, Yanzhou Liu, William Plishker, Shuvra Bhattacharyya
Maryland DSPCAD Research Group
Department of Electrical and Computer Engineering, and Institute for Advanced Computer Studies, University of Maryland
http://www.ece.umd.edu/dspcad/home/dspcad.htm

International Workshop on Software and Compilers for Embedded Systems
May 23, 2016, Sankt Goar, Germany
Motivation

From high-level system specification to software on hybrid multicore CPU-GPU platforms.

[Figure: example SDF graph with actors A, B, and C.]
Synchronous Dataflow (SDF) [1]

[Figure: SDF graph with actors A, B, C connected by edges annotated with production and consumption rates.]

- Vertices (actors): computational modules
- Edges: FIFO buffers
- Tokens: data elements passed between actors
- Production / consumption rates: numbers of tokens produced/consumed per actor firing

In SDF, production and consumption rates are known at compile time. This enables iterative execution on large or unbounded data streams.
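Because the rates are known at compile time, SDF graphs can be scheduled statically: token production and consumption must balance on every edge over one graph iteration. A minimal statement of this balance condition (standard SDF theory, paraphrased rather than taken from the slide) in LaTeX:

    % For each edge e = (A, B), with prod(e) tokens produced per firing
    % of A and cons(e) tokens consumed per firing of B:
    \[
      q(A)\,\mathrm{prod}(e) \;=\; q(B)\,\mathrm{cons}(e)
    \]
    % The smallest positive integer solution q is the repetition
    % vector: q(v) gives the firings of actor v per graph iteration.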
Objectives

- To automatically exploit data, task, and pipeline parallelism from model-based specifications of digital signal processing (DSP) applications.
- To generate throughput-optimized code for hybrid CPU-GPU platforms (with multi-core CPU and GPU devices working together on the application).

[Figure: SDF graph fed into the dataflow design framework.]
Exploiting Parallelism in SDF

[Figures: the same SDF graph of actors A-D exploited three ways. Task parallelism: independent actors run concurrently on processors P1 and P2. Data parallelism: an actor is replicated so that multiple firings run concurrently. Pipeline parallelism: actors are assigned to pipeline stages 1-3 on processors P1-P3, overlapping successive graph iterations.]
Multicore CPU-GPU Architecture

- Multi-core CPU (host): CPU cores (Multiple Instruction, Multiple Data, MIMD); cores share main memory.
- GPU (device): many SIMD multiprocessors; separate memory.

Typical host code for moving data to and from the device:

    float *hp, *dp;
    hp = (float *)malloc(sizeof(float) * N);
    cudaMalloc(&dp, sizeof(float) * N);
    cudaMemcpy(dp, hp, sizeof(float) * N, cudaMemcpyHostToDevice);
    call_kernel(dp);
    /* ... other kernel executions ... */
    cudaMemcpy(hp, dp, sizeof(float) * N, cudaMemcpyDeviceToHost);
    cudaFree(dp);
    free(hp);
Dataflow Design Framework: Challenges

Many factors affect system throughput:
- Vectorization
- Multiprocessor scheduling
- Inter-processor communication
- Other system constraints
Dataflow Design Framework: DIF-GPU

DIF-GPU flow: model specification + actor implementations → vectorization → compile-time scheduling → code generation.
Dataflow Design Framework

Comparison of DIF-GPU with some dataflow runtime frameworks:

    Framework               Compile-time                    Run-time
    DIF-GPU                 Vectorization, scheduling,      Peer-worker multithreading
                            inter-processor communication
    StarPU [2], OmpSs [3]   -                               Scheduling, inter-processor
                                                            communication, manager-worker
                                                            multithreading
Dataflow Design Framework: Why DIF-GPU?

- Compile-time scheduling and data transfers → less runtime overhead
- Integration of vectorization and code generation → more extensive design automation
Model Specification in DIF

Dataflow Interchange Format (DIF): a standard language for specifying mixed-grain dataflow models for digital signal processing (DSP) systems.

[Figure: SDF graph src → usp → snk with rates (2,1) and (3,2) and repetition counts <1>, <2>, <3>; src: source, usp: upsampler, snk: sink.]

    sdf usp_graph {
        topology {
            nodes = src, usp, snk;
            edges = e1(src, usp), e2(usp, snk);
        }
        production  { e1 = 2; e2 = 3; }
        consumption { e1 = 1; e2 = 2; }
        attribute edge_type { e1 = "float"; e2 = "float"; }
        actor src {
            name = "src_1f";
            port_0 : OUTPUT = e1;
        }
        actor usp {
            name = "usp3";
            GPU_enabled = 1;
            port_0 : INPUT  = e1;
            port_1 : OUTPUT = e2;
        }
        actor snk {
            name = "snk_1f";
            port_0 : INPUT = e2;
        }
    }
Actor Implementation in the LIghtweight Dataflow Environment (LIDE)

LIDE: programming methodology and APIs for implementing dataflow graph actors and edges.

Actor interface:
- new(): performs memory allocation and initialization for the actor; specifies CPU/GPU version and vectorization.
- enable(): checks input token availability and output free space.
- invoke(): for the CPU version, executes the CPU function; for the GPU version, launches GPU kernel(s).
- terminate(): frees memory that has been dynamically allocated for the actor.

Edge interface:
- new(): allocates the FIFO buffer in CPU/GPU memory.
- free(): frees the FIFO buffer.
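To make the interface concrete, here is a minimal sketch of how a LIDE-style CPU actor might implement enable() and invoke(). The gain actor, the context struct, and the fifo_* helpers are hypothetical stand-ins for illustration, not LIDE's actual API:

    #include <stdbool.h>

    /* Opaque FIFO handle and access helpers: hypothetical
       stand-ins for LIDE's FIFO API. */
    typedef void *fifo_pointer;
    extern int  fifo_population(fifo_pointer f);  /* tokens available */
    extern int  fifo_free_space(fifo_pointer f);  /* slots free */
    extern void fifo_read(fifo_pointer f, void *dst, int n);
    extern void fifo_write(fifo_pointer f, const void *src, int n);

    /* Hypothetical single-input, single-output "gain" actor. */
    typedef struct {
        fifo_pointer in;   /* input FIFO */
        fifo_pointer out;  /* output FIFO */
        float gain;        /* scale factor applied per token */
    } gain_context_type;

    /* enable(): fire only if one input token is available and
       the output FIFO has room for one token. */
    bool gain_enable(gain_context_type *ctx) {
        return fifo_population(ctx->in) >= 1 &&
               fifo_free_space(ctx->out) >= 1;
    }

    /* invoke(): consume one token, produce one token (CPU version). */
    void gain_invoke(gain_context_type *ctx) {
        float x;
        fifo_read(ctx->in, &x, 1);
        x *= ctx->gain;
        fifo_write(ctx->out, &x, 1);
    }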
Actor Implementation in LIghtweight Dataflow Environment (LIDE)

LIDE is compact, extensible, and flexible.
Exploiting Parallelism in DIF-GPU: Vectorization (Data Parallelism)

[Figure: graph-level vectorization of the src → usp → snk example. With repetition counts <1>, <2>, <3>, vectorization turns the multi-rate graph into a single-rate graph whose actors fire in blocks of b, 2b, and 3b, with edge rates 2b and 6b.]

- Each actor v is vectorized by b·q(v), where q(v) is the repetition count of v and b is the graph vectorization degree (GVD).
- A multi-rate SDF graph becomes a block-processing, single-rate SDF task graph for scheduling.
- b is limited by system constraints (memory, latency, etc.).
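As a worked check using the rates from the DIF listing earlier (src produces 2 on e1, usp consumes 1; usp produces 3 on e2, snk consumes 2), the balance equations yield the repetition counts <1>, <2>, <3> shown in the figure:

    \[
      2\,q(\mathrm{src}) = q(\mathrm{usp}), \qquad
      3\,q(\mathrm{usp}) = 2\,q(\mathrm{snk})
      \;\Rightarrow\; q = (1, 2, 3)
    \]
    % With GVD b, each actor v fires in blocks of b*q(v): src in blocks
    % of b (producing 2b tokens), usp in blocks of 2b (consuming 2b,
    % producing 6b), and snk in blocks of 3b (consuming 6b), matching
    % the 2b and 6b edge rates in the vectorized graph.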
Exploiting Parallelism in DIF-GPU: Scheduling (Task and Pipeline Parallelism)

- First Come, First Served (FCFS) [4]: a simple greedy approach that schedules a ready actor whenever a processor becomes idle.
- Heterogeneous Earliest Finish Time (HEFT) [5]: maintains a list of actors that are ready to be executed and selects the actor-processor pair with the earliest finish time (see the sketch after this list).
- The framework can be extended with other scheduling strategies.
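To illustrate the HEFT-style selection rule, here is a simplified sketch in C. It is not the DIF-GPU scheduler: it ignores HEFT's upward-rank priority ordering, precedence constraints, and communication costs, keeping only the "earliest finish time across heterogeneous processors" core; the cost values are illustrative assumptions.

    #include <stdio.h>

    #define NUM_ACTORS 4
    #define NUM_PROCS  2   /* e.g., proc 0 = CPU core, proc 1 = GPU */

    int main(void) {
        /* exec_time[a][p]: execution time of actor a on processor p
           (heterogeneous: CPU and GPU times differ per actor). */
        double exec_time[NUM_ACTORS][NUM_PROCS] = {
            {1.0, 0.5}, {1.0, 0.5}, {1.0, 2.0}, {0.5, 0.5}
        };
        double proc_ready[NUM_PROCS] = {0.0, 0.0}; /* when each proc frees up */

        /* Ready actors in priority order (real HEFT orders them by
           upward rank; index order is used here for brevity). */
        for (int a = 0; a < NUM_ACTORS; a++) {
            int best_p = 0;
            double best_eft = 0.0;
            for (int p = 0; p < NUM_PROCS; p++) {
                /* Earliest finish time of actor a on processor p. */
                double eft = proc_ready[p] + exec_time[a][p];
                if (p == 0 || eft < best_eft) { best_eft = eft; best_p = p; }
            }
            proc_ready[best_p] = best_eft;  /* commit the assignment */
            printf("actor %d -> proc %d, finishes at %.1f\n", a, best_p, best_eft);
        }
        return 0;
    }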
Exploiting Parallelism in DIF-GPU: Scheduling Example

[Figure: example task graph of actors A-F annotated with per-actor execution times t1/t2 (processor 1 / processor 2), and the resulting two-processor Gantt charts. FCFS assigns A, B, C, F to P1 and D, E to P2, giving schedule period T = 4; HEFT assigns A, D, E, F to P1 and B, C to P2, giving T = 3.5.]
Inter-processor Data Transfer: Host-Centered FIFO Allocation (HCFA)

- Maintains all FIFOs in host memory.
- Used by frameworks without GPU support (e.g., GNU Radio); easy integration with existing frameworks.
- Incurs significant overhead due to excessive CPU-GPU data transfer: each GPU actor firing is wrapped in an H2D copy before the kernel and a D2H copy after it.

[Figure: HCFA buffer layout, with host-side buffers and H2D/kernel/D2H steps around each GPU actor; CPU and GPU actors marked.]
Inter-processor Data Transfer: Mapping-Dependent FIFO Allocation (MDFA)

- FIFOs can be allocated in host or device memory, depending on the schedule.
- H2D/D2H actors are inserted to move data explicitly.
- Inter-processor data transfer occurs only at locations determined by the schedule.

[Figure: MDFA buffer layout, with host-side FIFOs (e1c, e2c), device-side FIFOs (e1g, e2g, e3), and explicit H2D/D2H transfer actors; CPU and GPU actors marked.]
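The inserted transfer actors are essentially memcpy actors between a host-side FIFO and a device-side FIFO. Below is a minimal sketch of what an H2D actor's invoke() could look like; the context struct and fields are hypothetical stand-ins, not the actual implementation behind LIDE-CUDA's memcpy_new():

    #include <cuda_runtime.h>

    /* Hypothetical H2D transfer actor: copies one block of tokens
       from a host-memory FIFO into a device-memory FIFO. */
    typedef struct {
        float *host_buf;    /* read pointer into the host-side FIFO */
        float *dev_buf;     /* write pointer into the device-side FIFO */
        int    block_size;  /* tokens moved per firing */
    } h2d_context_type;

    void h2d_invoke(h2d_context_type *ctx) {
        /* One explicit transfer per firing, at a point fixed by the
           compile-time schedule (the MDFA idea). */
        cudaMemcpy(ctx->dev_buf, ctx->host_buf,
                   sizeof(float) * ctx->block_size,
                   cudaMemcpyHostToDevice);
    }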
DIF-GPU Example

[Figure: the src → usp → snk graph through the flow.]

- Vectorization: each actor is vectorized (src by b, usp by 2b, snk by 3b), yielding a single-rate graph with edge rates 2b and 6b.
- Scheduling and data-transfer actor insertion: H2D and D2H actors are inserted around the GPU-mapped usp actor.
- Resulting schedule: CPU thread: src, snk, src, snk, ...; GPU thread: H2D, usp, D2H, H2D, ...
DIF-GPU Example: Generated LIDE-CUDA Code (Header File)

    /* Headers */
    #include <stdio.h>
    /* ... */

    /* Macro definitions */
    #define SRC         0
    #define USP         1
    #define SNK         2
    #define H2D_0       3
    #define D2H_0       4
    #define ACTOR_COUNT 5
    #define CPU 0
    #define GPU 1
    #define NUMBER_OF_THREADS 2

    /* Class declaration */
    class usp_graph {
    public:
        usp_graph();
        ~usp_graph();
        void execute();
    private:
        thread_list* thread_list;
        actor_context_type* actors[ACTOR_COUNT];
        fifo_pointer edge_in_h2d_0;
        fifo_pointer edge_out_d2h_0;
        fifo_pointer edge_in_d2h_0;
        fifo_pointer edge_out_h2d_0;
    };
DIF-GPU Example: Generated LIDE-CUDA Code (Graph Constructor)

    #include "usp_graph.h"

    usp_graph::usp_graph() {
        /* Create edges */
        edge_in_h2d_0  = fifo_new(4,  sizeof(float), CPU);
        edge_out_d2h_0 = fifo_new(12, sizeof(float), CPU);
        edge_in_d2h_0  = fifo_new(12, sizeof(float), GPU);
        edge_out_h2d_0 = fifo_new(4,  sizeof(float), GPU);

        /* Create actors */
        actors[D2H_0] = (actor_context_type*) memcpy_new(
            edge_in_d2h_0, edge_out_d2h_0, 12, 12, sizeof(float), GPU);
        actors[SNK] = (actor_context_type*) snk_1f_new(
            edge_out_d2h_0, 12, CPU);
        actors[H2D_0] = (actor_context_type*) memcpy_new(
            edge_in_h2d_0, edge_out_h2d_0, 4, 4, sizeof(float), GPU);
        actors[USP] = (actor_context_type*) usp3_new(
            edge_out_h2d_0, edge_in_d2h_0, 4, 12, GPU);
        actors[SRC] = (actor_context_type*) src_1f_new(
            edge_in_h2d_0, 4, CPU);

        /* Create schedules of each thread */
        const char* thread_schedules[NUMBER_OF_THREADS] =
            {"thread_0.txt", "thread_1.txt"};
        thread_list = thread_list_init(NUMBER_OF_THREADS,
            thread_schedules, actors, ACTOR_COUNT);
    }
DIF-GPU Example: Generated LIDE-CUDA Code (Graph-Level execute() and Destructor)

    void usp_graph::execute() {
        thread_list_scheduler(thread_list);
    }

    usp_graph::~usp_graph() {
        /* Terminate threads */
        thread_list_terminate(thread_list);

        /* Free FIFOs */
        fifo_free(edge_in_h2d_0);
        fifo_free(edge_out_d2h_0);
        fifo_free(edge_in_d2h_0);
        fifo_free(edge_out_h2d_0);

        /* Destroy actors */
        memcpy_terminate((memcpy_context_type*)actors[D2H_0]);
        snk_1f_terminate((snk_1f_context_type*)actors[SNK]);
        memcpy_terminate((memcpy_context_type*)actors[H2D_0]);
        usp3_terminate((usp3_context_type*)actors[USP]);
        src_1f_terminate((src_1f_context_type*)actors[SRC]);
    }
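A plausible top-level driver for the generated class would look like the following; the slides do not show the generated main(), so this is an assumed sketch:

    #include "usp_graph.h"

    int main() {
        usp_graph graph;   /* constructor builds edges, actors, and threads */
        graph.execute();   /* threads run their static schedules */
        return 0;          /* destructor tears everything down */
    }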
Case Study

Throughput for a b-vectorized graph: Th = b / T, where Th is throughput, b is the vectorization degree, and T is the schedule period. (For example, b = 512 and T = 100 µs give Th = 5.12 × 10^6 tokens/s.)

Test bench: MP-Sched (P x S)

    Item        Values
    Grid size   2x5, 4x4, 6x3
    Platform    1 CC + 1 GPU; 3 CCs + 1 GPU
    Scheduler   HEFT, FCFS

(CC: CPU core)
Speedup of FIR Filter

[Figure: FIR actor vectorized by degree B = 1, 2, ..., N, with filter length K = 7; speedup vs. vectorization degree, excluding CPU-GPU data-transfer time.]

- Slow increase from b = 2^7 to 2^10: low GPU utilization.
- Fast increase from b = 2^10 to 2^16: increased utilization.
- Slow increase from b = 2^17 to 2^19: saturation.
Data Transfer Evaluation

Throughput and data-transfer overhead for FIFO implementations based on HCFA and MDFA. Percentages are the fraction of time spent on data transfer.

    Topology               2x5     4x4     6x3
    HCFA  Th (10^6/s)      4.80    2.84    2.52
          D2H              37.4%   37.6%   37.1%
          H2D              16.4%   16.2%   15.8%
    MDFA  Th (10^6/s)      6.71    4.06    3.59
          D2H              17.2%   15.5%   20.8%
          H2D              6.7%    9.9%    8.2%
GPU Workload vs. Vectorization (MP-Sched 4x4)

[Figures: GPU workload traces for vectorization degrees 128, 256, 512, 1024, 2048, 4096, and 8192.]
System-Level Evaluation

Single-processor baselines:
- CPU baseline Th_c: all actors scheduled on the same CPU core (CC).
- GPU baseline Th_g: all actors with GPU acceleration scheduled on the GPU; all others scheduled on the same CC.

DIF-GPU speedup: sp = Th / max(Th_c, Th_g)
System-Level Evaluation

[Figures: speedup vs. vectorization degree for MP-Sched 2x5, 4x4, and 6x3, under FCFS and HEFT.]
System-Level Evaluation

- For small vectorization degrees, 3 CCs + 1 GPU gives higher throughput: the GPU versions of actors are slower at small workloads, so more cores help.
- For large vectorization degrees, 1 CC + 1 GPU gives higher throughput: the GPU versions are much faster than the CPU versions, and the extra cores mainly add HEFT/FCFS scheduling and multithreading runtime overhead.
Scheduler Evaluation

Speedup in different topologies (2x5, 4x4, 6x3):
- Th(HEFT) > Th(FCFS) in general.
- Consistent gain over the GPU baseline.
- Inter-processor data transfer accounts for the remaining overhead.
Scheduler Evaluation

Speedup in different topologies (2x5, 4x4, 6x3): in some cases (b_l < b < b_u), FCFS is better.
Conclusion

DIF-GPU framework:
- SDF graph specification (DIF)
- Vectorization
- Scheduling
- Code generation

Demonstration on the MP-Sched benchmarks:
- Data-transfer overhead reduction using MDFA
- Performance improvement over the CPU and GPU baselines
References

1. E. A. Lee and D. G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9):1235-1245, September 1987.
2. C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice & Experience, 23(2):187-198, February 2011.
3. A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. OmpSs: a proposal for programming heterogeneous multi-core architectures. Parallel Processing Letters, 21(02):173-193, 2011.
4. G. Teodoro, R. Sachetto, O. Sertel, M. N. Gurcan, W. Meira, U. Catalyurek, and R. Ferreira. Coordinating the use of GPU and CPU for improving performance of compute intensive applications. In Proceedings of the IEEE International Conference on Cluster Computing and Workshops, pages 1-10, 2009.
5. H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260-274, 2002.