A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads


A Many-Core Machine Model for Designing Algorithms with Minimum Parallelism Overheads

Sardar Anisul Haque, Marc Moreno Maza, Ning Xie
University of Western Ontario, Canada
IBM CASCON, November 4, 2014

Optimize algorithms targeting GPU-like many-core devices

Background
- Given CUDA code, an experienced programmer may apply well-known strategies to improve its performance in terms of arithmetic intensity and memory bandwidth.
- Given a CUDA-like algorithm, one would like to derive code for which much of this optimization process has been lifted to the design level, i.e. before the code is written.

Methodology
We need a model of computation which
- captures the computer hardware characteristics that have a dominant impact on program performance, and
- combines its complexity measures (work, span) so as to determine the best algorithm among different possible algorithmic solutions to a given problem.

Challenges in designing a model of computation

Theoretical aspects
- GPU-like architectures introduce many machine parameters (like memory sizes, number of cores), and too many of them could lead to intractable calculations.
- GPU-like code also depends on program parameters (like the number of threads per thread-block) which specify how the work is divided among the computing resources.

Practical aspects
- One wants to avoid answers like: Algorithm 1 is better than Algorithm 2 provided that the machine parameters satisfy a system of constraints.
- We prefer analysis results independent of machine parameters.
- Moreover, this should be achieved by selecting program parameters in appropriate ranges.

Overview

We present a model of multithreaded computation with an emphasis on estimating parallelism overheads of programs written for modern many-core architectures. We evaluate the benefits of our model with fundamental algorithms from scientific computing.
- For two case studies, our model is used to minimize parallelism overheads by determining an appropriate value range for a given program parameter.
- For the others, our model is used to compare different algorithms solving the same problem.
In each case, the studied algorithms were implemented (publicly available, written in CUDA, at http://www.cumodp.org/) and the results of their experimental comparison are coherent with the theoretical analysis based on our model.

Plan
1 Models of computation: the fork-join model, the parallel random access machine (PRAM) model, and the threaded many-core memory (TMM) model
2 A many-core machine (MCM) model
3 Experimental validation
4 Concluding remarks

Fork-join model

This model has become popular with the development of the concurrency platform CilkPlus, targeting multi-core architectures.
- The work T_1 is the total time to execute the entire program on one processor.
- The span T_∞ is the longest time to execute along any path in the DAG.
We recall that the Graham-Brent theorem states that the running time T_P on P processors satisfies T_P ≤ T_1/P + T_∞. A refinement of this theorem captures scheduling and synchronization costs, that is, T_P ≤ T_1/P + 2δ T̂_∞, where δ is a constant and T̂_∞ is the burdened span.

Figure: An example of a computation DAG: the 4-th Fibonacci number
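
As a quick illustration (with made-up figures): a program with work T_1 = 100 and span T_∞ = 10 satisfies T_4 ≤ 100/4 + 10 = 35 on P = 4 processors, and since T_P ≥ T_∞ always holds, the parallelism T_1/T_∞ = 10 bounds the speedup one can hope for.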

Parallel random access machine (PRAM) model

Figure: Abstract machine of the PRAM model

- Instructions on a processor execute in a 3-phase cycle: read-compute-write.
- Processors access the global memory in unit time (unless an access conflict occurs).
- Three strategies deal with read/write conflicts on the same global memory cell: EREW, CREW and CRCW (exclusive or concurrent reads and writes).
- A refinement of PRAM integrates communication delay into the computation time.

Hybrid CPU-GPU system

Figure: Overview of a hybrid CPU-GPU system

Threaded many-core memory (TMM) model

Ma, Agrawal and Chamberlain introduce the TMM model, which retains many important characteristics of GPU-type architectures.

Parameter   Description
L           Time for a global memory access
P           Number of processors (cores)
C           Memory access width
Z           Size of fast private memory per core group
Q           Number of cores per core group
X           Hardware limit on number of threads per core

Table: Machine parameters of the TMM model

In TMM analysis, the running time of an algorithm is estimated by choosing the maximum quantity among the work, the span and the amount of memory accesses. No Graham-Brent-like theorem is provided. Such running time estimates depend on the machine parameters.

Plan
1 Models of computation
2 A many-core machine (MCM) model: characteristics and complexity measures
3 Experimental validation
4 Concluding remarks

A many-core machine (MCM) model

We propose a many-core machine (MCM) model which aims at
- tuning program parameters to minimize parallelism overheads of algorithms targeting GPU-like architectures, as well as
- comparing different algorithms independently of the values of the machine parameters of the targeted hardware device.

In the design of this model, we insist on the following features:
- Two-level DAG programs
- Parallelism overhead
- A Graham-Brent theorem

Characteristics of the abstract many-core machines

Figure: A many-core machine

It has a global memory with high latency and low throughput, while private memories have low latency and high throughput.

Characteristics of the abstract many-core machines

Figure: Overview of a many-core machine program

Characteristics of the abstract many-core machines

Synchronization costs
- It follows that MCM kernel code needs no synchronization statement. Consequently, the only form of synchronization taking place among the threads executing a given thread-block is that implied by code divergence.
- An MCM machine handles code divergence by eliminating the corresponding conditional branches via code replication, and the corresponding cost is captured by the complexity measures (work, span and parallelism overhead) of the MCM model.

Machine parameters of the abstract many-core machines

Z: private memory size of any SM
It sets up an upper bound on several program parameters, for instance, the number of threads of a thread-block or the number of words in a data transfer between the global memory and the private memory of a thread-block.

U: data transfer time
Time (expressed in clock cycles) to transfer one machine word between the global memory and the private memory of any SM, that is, U > 0. As an abstract machine, the MCM aims at capturing either the best or the worst scenario for the data transfer time T_D of a thread-block, that is,
T_D ≤ (α + β) U, if coalesced accesses occur; or
T_D ≤ ℓ (α + β) U, otherwise,
where α and β are the numbers of words respectively read from and written to the global memory by one thread of a thread-block B, and ℓ is the number of threads per thread-block.
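
To make the two scenarios concrete, here is a minimal CUDA sketch (hypothetical kernels, not taken from the CUMODP library) contrasting coalesced and non-coalesced global memory accesses; each thread reads one word and writes one word, so α + β = 2:

// Coalesced: neighbouring threads of a thread-block touch neighbouring words,
// so a thread-block of l threads pays roughly (alpha + beta) * U = 2 U.
__global__ void copy_coalesced(const float *in, float *out, long n) {
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Non-coalesced: neighbouring threads are 'stride' words apart, so in the
// worst case their accesses are serviced one thread at a time, costing
// roughly l * (alpha + beta) * U = 2 l U per thread-block.
__global__ void copy_strided(const float *in, float *out, long n, int stride) {
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long j = i * stride;
    if (j < n)
        out[j] = in[j];
}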

Complexity measures for the many-core machine model

For any kernel K of an MCM program:
- the work W(K) is the total number of local operations of all its threads;
- the span S(K) is the maximum number of local operations of one thread;
- the parallelism overhead O(K) is the total data transfer time among all its thread-blocks.

For the entire program P:
- the work W(P) is the total work of all its kernels;
- the span S(P) is the longest path, counting the weight (span) of each vertex (kernel), in the kernel DAG;
- the parallelism overhead O(P) is the total parallelism overhead of all its kernels.
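
For instance (an illustrative toy kernel, not one of the case studies): if each of n threads, grouped into thread-blocks of ℓ threads, reads one word, performs one local addition and writes one word back with coalesced accesses, then W = n, S = 1, and each of the n/ℓ thread-blocks spends (α + β) U = 2 U on data transfer, so O = 2 U n/ℓ.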

Characteristic quantities of the thread-block DAG

Figure: Thread-block DAG of a many-core machine program

- N(P): the number of vertices in the thread-block DAG of P,
- L(P): the critical path length (where the length of a path is the number of edges in that path) in the thread-block DAG of P.

Complexity measures for the many-core machine model

Theorem (a Graham-Brent theorem with parallelism overhead)
We have the following estimate for the running time T_P of the program P when executed on P SMs:
T_P ≤ (N(P)/P + L(P)) C(P),   (1)
where C(P) is the maximum running time of local operations (including read/write requests) and data transfers by one thread-block.

Corollary
Let K be the maximum number of thread-blocks along an anti-chain of the thread-block DAG of P. Then the running time T_P of the program P satisfies:
T_P ≤ (N(P)/K + L(P)) C(P).   (2)

Plan
1 Models of computation
2 A many-core machine (MCM) model
3 Experimental validation
4 Concluding remarks

Tuning a program parameter with the MCM model

Consider an MCM program P depending on a program parameter s varying in a range S, and let s_0 be an initial value of s corresponding to an instance P_0 of P. Assume the work ratio W_{s_0}/W_s remains essentially constant while the parallelism overhead O_s varies more substantially, say O_{s_0}/O_s ∈ Θ(s − s_0).
- Then, we determine a value s_min ∈ S maximizing the ratio O_{s_0}/O_s.
- Next, we use our version of the Graham-Brent theorem to confirm that the upper bound for the running time of P(s_min) is less than that of P(s_0).

Plain (or long) univariate polynomial multiplication

We denote by a and b two univariate polynomials (or vectors) with sizes n ≥ m, and we compute the product f = a b.

(Diagram: schoolbook long multiplication, in which each of the coefficients a_3, a_2, a_1, a_0 is multiplied by each of b_0, b_1, b_2, and the partial products are added column-wise to form f_5, f_4, f_3, f_2, f_1, f_0.)

This long multiplication process has two phases. Matrix multiplication follows a similar pattern.

Plain univariate polynomial multiplication

- Multiplication phase: every coefficient of a is multiplied with every coefficient of b; each thread accumulates s partial sums into an auxiliary array M.
- Addition phase: these partial sums are added together repeatedly to form the polynomial f.
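
Below is a minimal CUDA sketch of such a multiplication phase, written only to illustrate the role of the parameter s; the kernel name, the layout of M (m/s rows of n + s − 1 partial sums, assuming s divides m) and the use of plain integer arithmetic are assumptions of the sketch, and the actual CUMODP kernel (which works over a prime field) may differ.

// Hypothetical multiplication-phase kernel.
// Row r of M is owned by the group of s coefficients b[r*s .. r*s+s-1]:
// entry M[r][c] accumulates the s products a[c-k] * b[r*s+k], 0 <= k < s,
// i.e. all products of that group contributing to the output f[c + r*s].
__global__ void mul_phase(const int *a, const int *b, int *M,
                          int n, int m, int s) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column index, 0 .. n+s-2
    int r = blockIdx.y;                              // row index,    0 .. m/s-1
    if (c >= n + s - 1 || r >= m / s)
        return;
    int acc = 0;
    for (int k = 0; k < s; ++k) {
        int i = c - k;            // index into a
        int j = r * s + k;        // index into b
        if (i >= 0 && i < n && j < m)
            acc += a[i] * b[j];   // plain products; modular arithmetic in practice
    }
    M[r * (n + s - 1) + c] = acc; // partial sum contributing to f[c + r*s]
}

With this layout, the addition phase reduces the rows: f[e] is the sum over r of M[r][e − r*s] (whenever that entry exists), which can be computed with a tree of pairwise additions.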

Plain univariate polynomial multiplication

The work, span and parallelism overhead ratios between s_0 = 1 (the initial program) and an arbitrary s are, respectively (see the detailed analysis, in the form of executable Maple worksheets for three applications, at http://www.csd.uwo.ca/~nxie6/projects/mcm/):

W_1 / W_s = n / (n + s − 1),
S_1 / S_s = (log_2(m) + 1) / (s (log_2(m/s) + 2 s − 1)),
O_1 / O_s = n s² (7 m − 3) / ((n + s − 1) (5 m s + 2 m − 3 s − 2)).

Increasing s leaves the work essentially constant, while the span increases and the parallelism overhead decreases in the same order when m escapes to infinity. Hence, should s be large or close to s_0 = 1?

Plain univariate polynomial multiplication

Applying our version of the Graham-Brent theorem, the ratio R of the estimated running times on Θ((n + s − 1) m / (ℓ s²)) SMs is

R = (m log_2(m) + 3 m − 1) (1 + 4 U) / ((m log_2(m/s) + 3 m − s) (2 U s + 2 U + 2 s² − s)),

which is asymptotically equivalent to 2 log_2(m) / (s log_2(m/s)). This latter ratio is greater than 1 if and only if s = 1 or s = 2. In other words, increasing s makes the algorithm's performance worse.

Plain univariate polynomial multiplication

Figure: Running time of the plain polynomial multiplication algorithm for polynomials a (deg(a) = n − 1) and b (deg(b) = m − 1) and the parameter s, on a GeForce GTX 670.

The Euclidean algorithm

Let s > 0. We proceed by repeatedly calling a subroutine which takes as input a pair (a, b) of polynomials and returns another pair (a', b') of polynomials such that gcd(a, b) = gcd(a', b') and either b' = 0 or deg(a') + deg(b') ≤ deg(a) + deg(b) − s.

When s = Θ(ℓ) (where ℓ is the number of threads per thread-block), the work increases by a constant factor and the parallelism overhead is reduced by a factor in Θ(s). Further, the estimated running time ratio T_1/T_s on Θ(m/ℓ) SMs is greater than 1 if and only if s > 1.

Fast Fourier Transform

Let f be a vector with coefficients in a field (either a prime field like Z/pZ or C) and size n, which is a power of 2. Let ω be an n-th primitive root of unity. The n-point Discrete Fourier Transform (DFT) at ω is the linear map defined by x ↦ DFT_n x, with DFT_n = [ω^(i j)] for 0 ≤ i, j < n.

We are interested in comparing popular algorithms for computing DFTs on many-core architectures:
- the Cooley & Tukey FFT algorithm,
- the Stockham FFT algorithm.
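
For instance, for n = 4 over C with ω = i, the definition gives
DFT_4 = [ 1  1  1  1 ;  1  i  −1  −i ;  1  −1  1  −1 ;  1  −i  −1  i ],
so computing DFT_n x naively costs Θ(n²) operations, which FFT algorithms reduce to Θ(n log_2(n)).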

Fast Fourier Transforms: Cooley & Tukey vs Stockham

The work, span and parallelism overhead ratios between Cooley & Tukey's and Stockham's FFT algorithms are, respectively,

W_ct / W_sh ≈ 4 n (47 log_2(n) ℓ + 34 log_2(n) ℓ log_2(ℓ)) / (172 n log_2(n) ℓ + n + 48 ℓ²),
S_ct / S_sh ≈ (34 log_2(n) log_2(ℓ) + 47 log_2(n)) / (43 log_2(n) + 16 log_2(ℓ)),
O_ct / O_sh = 8 n (4 log_2(n) + ℓ log_2(ℓ) − log_2(ℓ) − 15) / (20 n log_2(n) + 5 n − 4 ℓ),

where ℓ is the number of threads per thread-block. Both the work and the span of the Cooley & Tukey algorithm are increased by a Θ(log_2(ℓ)) factor w.r.t. their counterparts in the Stockham algorithm.

Fast Fourier Transforms: Cooley & Tukey vs Stockham

The ratio R = T_ct / T_sh of the estimated running times (using our Graham-Brent theorem) on Θ(n/ℓ) SMs, where ℓ is the number of threads per thread-block, satisfies

R ≈ log_2(n) (2 U ℓ + 34 log_2(ℓ) + 2 U) / (5 log_2(n) (U + 2 log_2(ℓ)))

when n escapes to infinity. This latter ratio is greater than 1 if and only if ℓ > 1.

n      Cooley & Tukey   Stockham
2^14   0.583296         0.666496
2^15   0.826784         0.7624
2^16   1.19542          0.929632
2^17   2.07514          1.24928
2^18   4.66762          1.86458
2^19   9.11498          3.04365
2^20   16.8699          5.38781

Table: Running time (in seconds) with input size n on a GeForce GTX 670.

Univariate polynomial multiplication: plain vs FFT-based

Polynomial multiplication can be done either via the long (= plain) scheme or via FFT computations. Let n be the largest size of an input polynomial and ℓ be the number of threads per thread-block.

The theoretical analysis with our model indicates that plain multiplication performs more work and more parallelism overhead. However, on O(n²/ℓ) SMs, the ratio T_plain/T_fft of the estimated running times is essentially constant. On the other hand, the running time ratio T_plain/T_fft on Θ(n/ℓ) SMs suggests that FFT-based multiplication outperforms plain multiplication for n large enough.

Plan
1 Models of computation
2 A many-core machine (MCM) model
3 Experimental validation
4 Concluding remarks

Concluding remarks

We have presented a model of multithreaded computation combining fork-join and SIMD parallelisms, with an emphasis on estimating the parallelism overheads of GPU programs.

In practice, our model determines a trade-off among work, span and parallelism overhead by checking the estimated overall running time so as to (1) either tune a program parameter or (2) compare different algorithms independently of the hardware details.

Intensive experimental validation was conducted with the CUMODP library (http://www.cumodp.org/), which is integrated into the Maple computer algebra system.

Future work

(1) We plan to integrate our model into the MetaFork compilation framework for automatic translation between multithreaded languages.
(2) Within MetaFork, we plan to use our model to optimize CUDA-like code automatically generated from CilkPlus or OpenMP code enhanced with device constructs.
(3) We also plan to extend our model to support hybrid architectures like CPU-GPU.

The MetaFork framework: http://www.metafork.org/