High Performance Stencil Code Generation with Lift. Bastian Hagedorn Larisa Stoltzfus Michel Steuwer Sergei Gorlatch Christophe Dubach

Size: px
Start display at page:

Download "High Performance Stencil Code Generation with Lift. Bastian Hagedorn Larisa Stoltzfus Michel Steuwer Sergei Gorlatch Christophe Dubach"

Transcription

1 High Performance Stencil Code Generation with Lift Bastian Hagedorn Larisa Stoltzfus Michel Steuwer Sergei Gorlatch Christophe Dubach

2 Why Stencil Computations? Stencil computations are a class of kernels which update neighboring array elements according to a fixed pattern, called stencil. Frequently occur in: Medical Imaging Physics Simulations Machine Learning PDE Solvers

3 Why Stencil Computations? Stencil compute time: HPC Center München 49% 2017 HPC Center Stuttgart 68% 2016 Frequently occur in: Medical Imaging Physics Simulations Machine Learning PDE Solvers

4 Yet Another Stencil Paper? ICS'09 ICS'05 PLDI' CGO'12 SC'10 CLUSTER' CGO' CLUSTER'17 WOLFHPC'16 CGO'18

5 Domain Specific Languages PATUS Pochoir PARTANS Halide... DSL

6 Exploiting Domain Knowledge PATUS Multicore CPU Pochoir GPU PARTANS HPC Mobile Xeon Phi Halide KNC KNL... DSL... Hardware

7 Exploiting Domain Knowledge Stencil DSLs Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware

8 Exploiting Domain Knowledge Linear Algebra DSLs Multicore CPU GPU Stencil DSLs HPC Mobile Xeon Phi N-Body DSLs KNC KNL... Hardware...

9 approaching Performance Portability Linear Algebra DSLs Stencil DSLs N-Body DSLs Universal High Performance Code Generator Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware...

10 approaching Performance Portability Linear Algebra DSLs Stencil DSLs N-Body DSLs Lift Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware...

11 DSL DSL Lift Multicore CPU GPU HPC Mobile Xeon Phi DSL KNC KNL... Hardware

12 DSL DSL Lift DSL High-Level IR Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware

13 DSL DSL Lift DSL High-Level IR Explore Optimizations by rewriting Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL [CASES'16]... Hardware

14 DSL DSL Lift DSL High-Level IR Explore Optimizations by rewriting [CASES'16] Low-Level Program Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL... Hardware

15 DSL DSL Lift DSL High-Level IR Explore Optimizations by rewriting Low-Level Program Multicore CPU GPU HPC Mobile Xeon Phi KNC KNL [CASES'16] Code Generation [CGO'17]... Hardware

16

17 Lift's High-level Primitives map( ) reduce( ) split(n) join zip

18 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) join zip a b +

19 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) + join zip a b reduce(+,0, map(*, zip(a,b)))

20 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) + join zip a b reduce(+,0, map(*, zip(a,b)))

21 Lift's High-level Primitives map( ) dotproduct.lift reduce( ) * split(n) + join zip a b reduce(+,0, map(*, zip(a,b)))

22 Lift's High-level Primitives map( ) stencil.lift? reduce( ) split(n) join zip Can we express stencil computations in Lift? Let's look at a simple stencil example...

23 What ARE Stencil Computations? 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; } A B

24 What ARE Stencil Computations? 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; }

25 What ARE Stencil Computations? 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; }

26 Stencil computations in Lift map( ) reduce( ) split(n) join zip 3-point-stencil.lift

27 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip stencil Add specialized primitive: Job done?

28 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip stencil Add specialized primitive: Job done? No Reuse of existing primitives and optimizations Domain-specific rather than generic Multidimensional? is it composable?

29 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { j;? 0 : pos; - 1? N - 1 : pos; ]; }

30 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; - 1? N - 1 : pos; ]; } (a) access neighborhoods for every element

31 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; // ( b ) - 1? N - 1 : pos; ]; } (a) access neighborhoods for every element (b) specify boundary handling

32 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; // ( b ) - 1? N - 1 : pos; ]; } // ( c ) (a) access neighborhoods for every element (b) specify boundary handling (c) apply stencil function to neighborhoods

33 Decomposing Stencil Computations 3-point-stencil.c for (int i = 0; i int sum = 0; for ( int j = int pos = pos = pos pos = pos sum += A[ B[ i ] = sum ; } < N ; i ++) { -1; i + < 0 > N pos j <= 1; j ++) { // ( a ) j;? 0 : pos; // ( b ) - 1? N - 1 : pos; ]; } // ( c ) (a) access neighborhoods for every element (b) specify boundary handling (c) apply stencil function to neighborhoods

34 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip??????????????????

35 Stencil computations in Lift map( ) 3-point-stencil.lift reduce( ) split(n) join zip Reuse map allows to reuse existing rewrite rules Simplicity???????????? one primitive per task Multidimensional easily composable??? map

36 Boundary Handling Using Pad pad ( reindexing ) pad ( constant ) C C pad-reindexing.lift pad-constant.lift clamp(i, n) = (i < 0)? 0 : ((i >= n)? n-1:i) constant(i, n) = C pad(1,1,clamp, [a,b,c,d]) = [a,a,b,c,d,d] pad(1,1,constant, [a,b,c,d]) = [C,a,b,c,d,C]

37 Neighborhood Creation using Slide size slide-example.lift slide(3,1,[a,b,c,d,e]) = [[a,b,c],[b,c,d],[c,d,e]] step...

38 Applying Stencil function using Map sum-neighborhoods.lift map(nbh => reduce(add, 0.0f, nbh))

39 Putting it Together map( ) stencil1d.lift reduce( ) split(n) join zip pad(l,r,b) slide(n,s) pad def stencil1d = fun(a => map(reduce(add, 0.0f), slide(3,1, pad(1,1,clamp,a)))) slide map

40 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re

41 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sumnbh, slide2(3,1, pad2(1,1,clamp,input)))

42 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sumnbh, slide2(3,1, pad2(1,1,clamp,input)))

43 Multidimensional Boundary Handling using PAD 2 input pad2 = map(pad(l,r,b,pad(l,r,b,input)))

44 Multidimensional Boundary Handling using PAD 2 input pad2 = map(pad(l,r,b,pad(l,r,b,input)))

45 Multidimensional Boundary Handling using PAD 2 input pad2 = map(pad(l,r,b,pad(l,r,b,input)))

46 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sum, slide2(3,1, pad2(1,1,clamp,input)))

47 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sum, slide2(3,1, pad2(1,1,clamp,input)))

48 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map2(sum, slide2(3,1, pad2(1,1,clamp,input)))

49 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map3(sum, slide3(3,1, pad3(1,1,clamp,input)))

50 m co De to are expressed as compositions of intuitive, generic 1D primitives se po Multidimensional stencil computations se po om -C Re map3(sum, slide3(3,1, pad3(1,1,clamp,input))) Advantages: Compact Language Reuse Rewrites Simple Compilation

51

52 Reusing existing Rewrite Rules Divide & Conquer map(f, A)

53 Reusing existing Rewrite Rules Divide & Conquer map(f, A) join(map(map(f), split(n, A)))

54 Optimization: Overlapped Tiling Exploit Locality Close neighborhoods share elements that can be grouped in tiles Shared Memory Fast memory can be used to cache tiles

55 Optimization: Overlapped Tiling Shared memory size Exploit Locality Close neighborhoods share elements that can be grouped in tiles Shared Memory Fast memory can be used to cache tiles

56 Optimization: Overlapped Tiling overlap Exploit Locality Close neighborhoods share elements that can be grouped in tiles Shared Memory Fast memory can be used to cache tiles

57 Optimization: Overlapped Tiling overlap Exploit Locality Close neighborhoods share elements that can be grouped in tiles tile containing three neighborhoods Shared Memory Fast memory can be used to cache tiles

58 Overlapped Tiling as a rewrite rule overlapped tiling rule map(f, slide(3,1,input)) u v

59 Overlapped Tiling as a rewrite rule overlapped tiling rule map(f, slide(3,1,input)) join(map(tile map(f, slide(3,1,tile)), slide(u,v,input))) u v

60 Overlapped Tiling as a rewrite rule overlapped tiling rule map(f, slide(3,1,input)) join(map(tile map(f, slide(3,1,tile)), slide(u,v,input))) u v

61

62 Comparison with Hand-Optimized codes higher is better Lift achieves the same performance as hand optimized code

63 Comparison with polyhedral compilation higher is better Lift outperforms state-of-the-art optimizing compilers

64 Stencil Computations in LIft We added: 2 Primitives pad, slide 1 Rewrite Rule overlapped tiling High-Level IR

65 Lift is open Source! more info at: lift-project.org Paper CGO Artifact Bastian Hagedorn: Source Code

High Performance Stencil Code Generation with Lift

High Performance Stencil Code Generation with Lift High Performance Stencil Code Generation with Lift Bastian Hagedorn Larisa Stoltzfus Michel Steuwer University of Münster Münster, Germany b.hagedorn@wwu.de The University of Edinburgh Edinburgh, Scotland,

More information

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules

Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules Lift: a Functional Approach to Generating High Performance GPU Code using Rewrite Rules Toomas Remmelg Michel Steuwer Christophe Dubach The 4th South of England Regional Programming Language Seminar 27th

More information

Administrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13

Administrative. Optimizing Stencil Computations. March 18, Stencil Computations, Performance Issues. Stencil Computations 3/18/13 Administrative Optimizing Stencil Computations March 18, 2013 Midterm coming April 3? In class March 25, can bring one page of notes Review notes, readings and review lecture Prior exams are posted Design

More information

Michel Steuwer.

Michel Steuwer. Michel Steuwer http://homepages.inf.ed.ac.uk/msteuwer/ SKELCL: Algorithmic Skeletons for GPUs X i a i b i = reduce (+) 0 (zip ( ) A B) #include #include #include

More information

Open Compute Stack (OpenCS) Overview. D.D. Nikolić Updated: 20 August 2018 DAE Tools Project,

Open Compute Stack (OpenCS) Overview. D.D. Nikolić Updated: 20 August 2018 DAE Tools Project, Open Compute Stack (OpenCS) Overview D.D. Nikolić Updated: 20 August 2018 DAE Tools Project, http://www.daetools.com/opencs What is OpenCS? A framework for: Platform-independent model specification 1.

More information

TORSTEN HOEFLER

TORSTEN HOEFLER TORSTEN HOEFLER MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures most work performed by TOBIAS GYSI AND TOBIAS GROSSER Stencil computations (oh no,

More information

The LIFT Project. Performance Portable Parallel Code Generation via Rewrite Rules. Michel Steuwer

The LIFT Project. Performance Portable Parallel Code Generation via Rewrite Rules. Michel Steuwer The LIFT Project Performance Portable Parallel Code Generation via Rewrite Rules Michel Steuwer michel.steuwer@glasgow.ac.uk What are the problems LIFT tries to tackle? Parallel processors everywhere CPU

More information

LARISA C. STOLTZFUS Informatics Forum 10 Crichton Street Edinburgh, UK EH8 9AB ed.ac.uk

LARISA C. STOLTZFUS Informatics Forum 10 Crichton Street Edinburgh, UK EH8 9AB ed.ac.uk Education LARISA C. STOLTZFUS Informatics Forum 10 Crichton Street Edinburgh, UK EH8 9AB larisa.stoltzfus @ ed.ac.uk http://homepages.inf.ed.ac.uk/s1147290/ PhD Informatics - University of Edinburgh, UK

More information

Programming Languages Research Programme

Programming Languages Research Programme Programming Languages Research Programme Logic & semantics Planning Language-based security Resource-bound analysis Theorem proving/ CISA verification LFCS Logic Functional "Foundational" PL has been an

More information

Tessellating Stencils. Liang Yuan, Yunquan Zhang, Peng Guo, Shan Huang SKL of Computer Architecture, ICT, CAS

Tessellating Stencils. Liang Yuan, Yunquan Zhang, Peng Guo, Shan Huang SKL of Computer Architecture, ICT, CAS Tessellating Stencils Liang Yuan, Yunquan Zhang, Peng Guo, Shan Huang SKL of Computer Architecture, ICT, CAS Outline Introduction Related work Tessellating Stencils Stencil Stencil Overview update each

More information

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures TORSTEN HOEFLER MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures with support of Tobias Gysi, Tobias Grosser @ SPCL presented at Guangzhou, China,

More information

Scheduling Image Processing Pipelines

Scheduling Image Processing Pipelines Lecture 14: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];

More information

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters

Auto-Generation and Auto-Tuning of 3D Stencil Codes on GPU Clusters Auto-Generation and Auto-Tuning of 3D Stencil s on GPU Clusters Yongpeng Zhang, Frank Mueller North Carolina State University CGO 2012 Outline Motivation DSL front-end and Benchmarks Framework Experimental

More information

Scheduling Image Processing Pipelines

Scheduling Image Processing Pipelines Lecture 15: Scheduling Image Processing Pipelines Visual Computing Systems Simple image processing kernel int WIDTH = 1024; int HEIGHT = 1024; float input[width * HEIGHT]; float output[width * HEIGHT];

More information

Performance Portable GPU Code Generation for Matrix Multiplication

Performance Portable GPU Code Generation for Matrix Multiplication Performance Portable GPU Code Generation for Matrix Multiplication Toomas Remmelg Thibaut Lutz Michel Steuwer Christophe Dubach University of Edinburgh {toomas.remmelg, thibaut.lutz, michel.steuwer, christophe.dubach}@ed.ac.uk

More information

Text. Parallelism Gabriele Keller University of New South Wales

Text. Parallelism Gabriele Keller University of New South Wales Text Parallelism Gabriele Keller University of New South Wales Programming Languages Mentoring Workshop 2015 undergrad Karslruhe & Berlin, Germany PhD Berlin & Tsukuba, Japan University of Technology Sydney

More information

Progress Report on QDP-JIT

Progress Report on QDP-JIT Progress Report on QDP-JIT F. T. Winter Thomas Jefferson National Accelerator Facility USQCD Software Meeting 14 April 16-17, 14 at Jefferson Lab F. Winter (Jefferson Lab) QDP-JIT USQCD-Software 14 1 /

More information

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX

Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State

More information

Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17

Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17 Expressing Heterogeneous Parallelism in C++ with Intel Threading Building Blocks A full-day tutorial proposal for SC17 Tutorial Instructors [James Reinders, Michael J. Voss, Pablo Reble, Rafael Asenjo]

More information

Parallel Systems. Project topics

Parallel Systems. Project topics Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a

More information

Generating Performance Portable Code using Rewrite Rules

Generating Performance Portable Code using Rewrite Rules Generating Performance Portable Code using Rewrite Rules From High-Level Functional Expressions to High-Performance OpenCL Code Michel Steuwer Christian Fensch Sam Lindley Christophe Dubach The Problem(s)

More information

Parallel Programming in the Age of Ubiquitous Parallelism. Andrew Lenharth Slides: Keshav Pingali The University of Texas at Austin

Parallel Programming in the Age of Ubiquitous Parallelism. Andrew Lenharth Slides: Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Andrew Lenharth Slides: Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Andrew Lenharth

More information

Delite: A Framework for Heterogeneous Parallel DSLs

Delite: A Framework for Heterogeneous Parallel DSLs Delite: A Framework for Heterogeneous Parallel DSLs Hassan Chafi, Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Kunle Olukotun Stanford University Tiark Rompf, Martin Odersky EPFL Heterogeneous Parallel

More information

Designing a Domain-specific Language to Simulate Particles. dan bailey

Designing a Domain-specific Language to Simulate Particles. dan bailey Designing a Domain-specific Language to Simulate Particles dan bailey Double Negative Largest Visual Effects studio in Europe Offices in London and Singapore Large and growing R & D team Squirt Fluid Solver

More information

Parallel Programming Patterns

Parallel Programming Patterns Parallel Programming Patterns Pattern-Driven Parallel Application Development 7/10/2014 DragonStar 2014 - Qing Yi 1 Parallelism and Performance p Automatic compiler optimizations have their limitations

More information

A Multi-layered Domain-specific Language for Stencil Computations

A Multi-layered Domain-specific Language for Stencil Computations A Multi-layered Domain-specific Language for Stencil Computations Christian Schmitt Hardware/Software Co-Design, University of Erlangen-Nuremberg SPPEXA-Kolloquium, Erlangen, Germany; July 09, 2014 Challenges

More information

A Domain-Specific Language and Compiler for Stencil Computations for Different Target Architectures J. Ram Ramanujam Louisiana State University

A Domain-Specific Language and Compiler for Stencil Computations for Different Target Architectures J. Ram Ramanujam Louisiana State University A Domain-Specific Language and Compiler for Stencil Computations for Different Target Architectures J. Ram Ramanujam Louisiana State University SIMAC3 Workshop, Boston Univ. Acknowledgments Collaborators

More information

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures

Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Photos placed in horizontal position with even amount of white space between photos and header Portability and Scalability of Sparse Tensor Decompositions on CPU/MIC/GPU Architectures Christopher Forster,

More information

Optimization of Tele-Immersion Codes

Optimization of Tele-Immersion Codes Optimization of Tele-Immersion Codes Albert Sidelnik, I-Jui Sung, Wanmin Wu, María Garzarán, Wen-mei Hwu, Klara Nahrstedt, David Padua, Sanjay Patel University of Illinois at Urbana-Champaign 1 Agenda

More information

Ultra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc.

Ultra Large-Scale FFT Processing on Graphics Processor Arrays. Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Abstract Ultra Large-Scale FFT Processing on Graphics Processor Arrays Author: J.B. Glenn-Anderson, PhD, CTO enparallel, Inc. Graphics Processor Unit (GPU) technology has been shown well-suited to efficient

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

ICSA Institute for Computing Systems Architecture. LFCS Laboratory for Foundations of Computer Science

ICSA Institute for Computing Systems Architecture. LFCS Laboratory for Foundations of Computer Science Largest Informatics Department in the UK: > 500 academic and research staff + PhD students Overall 6 Research Institutes 2 particular relevant for the topic of the talk: ICSA Institute for Computing Systems

More information

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning

Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)

More information

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA

2006: Short-Range Molecular Dynamics on GPU. San Jose, CA September 22, 2010 Peng Wang, NVIDIA 2006: Short-Range Molecular Dynamics on GPU San Jose, CA September 22, 2010 Peng Wang, NVIDIA Overview The LAMMPS molecular dynamics (MD) code Cell-list generation and force calculation Algorithm & performance

More information

Platform-Specific Optimization and Mapping of Stencil Codes through Refinement

Platform-Specific Optimization and Mapping of Stencil Codes through Refinement Platform-Specific Optimization and Mapping of Stencil Codes through Refinement Marcel Köster, Roland Leißa, and Sebastian Hack Compiler Design Lab, Saarland University Intel Visual Computing Institute

More information

Reusing this material

Reusing this material XEON PHI BASICS Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming

AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming AutoTuneTMP: Auto-Tuning in C++ With Runtime Template Metaprogramming David Pfander, Malte Brunn, Dirk Pflüger University of Stuttgart, Germany May 25, 2018 Vancouver, Canada, iwapt18 May 25, 2018 Vancouver,

More information

Early Experiences Writing Performance Portable OpenMP 4 Codes

Early Experiences Writing Performance Portable OpenMP 4 Codes Early Experiences Writing Performance Portable OpenMP 4 Codes Verónica G. Vergara Larrea Wayne Joubert M. Graham Lopez Oscar Hernandez Oak Ridge National Laboratory Problem statement APU FPGA neuromorphic

More information

CFD Solvers on Many-core Processors

CFD Solvers on Many-core Processors CFD Solvers on Many-core Processors Tobias Brandvik Whittle Laboratory CFD Solvers on Many-core Processors p.1/36 CFD Backgroud CFD: Computational Fluid Dynamics Whittle Laboratory - Turbomachinery CFD

More information

Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition

Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition Optimised all-to-all communication on multicore architectures applied to FFTs with pencil decomposition CUG 2018, Stockholm Andreas Jocksch, Matthias Kraushaar (CSCS), David Daverio (University of Cambridge,

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Parallel Programming

Parallel Programming Parallel Programming 7. Data Parallelism Christoph von Praun praun@acm.org 07-1 (1) Parallel algorithm structure design space Organization by Data (1.1) Geometric Decomposition Organization by Tasks (1.3)

More information

Code Generators for Stencil Auto-tuning

Code Generators for Stencil Auto-tuning Code Generators for Stencil Auto-tuning Shoaib Kamil with Cy Chan, Sam Williams, Kaushik Datta, John Shalf, Katherine Yelick, Jim Demmel, Leonid Oliker Diagnosing Power/Performance Correctness Where this

More information

Conflict-Free Vectorization of Associative Irregular Applications with Recent SIMD Architectural Advances

Conflict-Free Vectorization of Associative Irregular Applications with Recent SIMD Architectural Advances Conflict-Free Vectorization of Associative Irregular Applications with Recent SIMD Architectural Advances Peng Jiang Gagan Agrawal Department of Computer Science and Engineering The Ohio State University,

More information

Convey Vector Personalities FPGA Acceleration with an OpenMP-like programming effort?

Convey Vector Personalities FPGA Acceleration with an OpenMP-like programming effort? Convey Vector Personalities FPGA Acceleration with an OpenMP-like programming effort? Björn Meyer, Jörn Schumacher, Christian Plessl, Jens Förstner University of Paderborn, Germany 2 ), - 4 * 4 + - 6-4.

More information

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014

Automatic Generation of Algorithms and Data Structures for Geometric Multigrid. Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Automatic Generation of Algorithms and Data Structures for Geometric Multigrid Harald Köstler, Sebastian Kuckuk Siam Parallel Processing 02/21/2014 Introduction Multigrid Goal: Solve a partial differential

More information

HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing. Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF

HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing. Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF 1 Outline KNL results Our other work related to HPCG 2 ~47 GF/s per KNL ~10

More information

Deep Learning as a Polyhedral Compiler's Killer App

Deep Learning as a Polyhedral Compiler's Killer App Deep Learning as a Polyhedral Compiler's Killer App From Research to Industry Transfer and Back and Again Albert Cohen - Inria and Facebook (work in progress) Nicolas Vasilache, Oleksandr Zinenko, Theodoros

More information

Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves

Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves Parallelizing Adaptive Triangular Grids with Refinement Trees and Space Filling Curves Daniel Butnaru butnaru@in.tum.de Advisor: Michael Bader bader@in.tum.de JASS 08 Computational Science and Engineering

More information

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2

OpenACC2 vs.openmp4. James Lin 1,2 and Satoshi Matsuoka 2 2014@San Jose Shanghai Jiao Tong University Tokyo Institute of Technology OpenACC2 vs.openmp4 he Strong, the Weak, and the Missing to Develop Performance Portable Applica>ons on GPU and Xeon Phi James

More information

Generating Performance Portable Code using Rewrite Rules

Generating Performance Portable Code using Rewrite Rules Generating Performance Portable Code using Rewrite Rules From High-Level Functional Expressions to High-Performance OpenCL Code Michel Steuwer The University of Edinburgh (UK) University of Münster (Germany)

More information

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson adrianj@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past, computers

More information

Unleashing the performance of ccnuma multiprocessor architectures in heterogeneous stencil computations

Unleashing the performance of ccnuma multiprocessor architectures in heterogeneous stencil computations The Journal of Supercomputing https://doi.org/10.1007/s11227-018-2460-0 Unleashing the performance of ccnuma multiprocessor architectures in heterogeneous stencil computations Lukasz Szustak 1 Kamil Halbiniak

More information

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains: The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention

More information

Intel Xeon Phi Programmability (the good, the bad and the ugly)

Intel Xeon Phi Programmability (the good, the bad and the ugly) Intel Xeon Phi Programmability (the good, the bad and the ugly) Robert Geva Parallel Programming Models Architect My Perspective When the compiler can solve the problem When the programmer has to solve

More information

Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 2017 Codeplay Software Ltd.

Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 2017 Codeplay Software Ltd. SYCL-BLAS: LeveragingSYCL-BLAS Expression Trees for Linear Algebra Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 1 About me... Phd in Compilers and Parallel

More information

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou

SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou SODA: Stencil with Optimized Dataflow Architecture Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou University of California, Los Angeles 1 What is stencil computation? 2 What is Stencil Computation? A sliding

More information

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory

Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Opportunities and Challenges in Sparse Linear Algebra on Many-Core Processors with High-Bandwidth Memory Jongsoo Park, Parallel Computing Lab, Intel Corporation with contributions from MKL team 1 Algorithm/

More information

How to Optimize Geometric Multigrid Methods on GPUs

How to Optimize Geometric Multigrid Methods on GPUs How to Optimize Geometric Multigrid Methods on GPUs Markus Stürmer, Harald Köstler, Ulrich Rüde System Simulation Group University Erlangen March 31st 2011 at Copper Schedule motivation imaging in gradient

More information

Generating high-performance multiplatform finite element solvers using the Manycore Form Compiler and OP2

Generating high-performance multiplatform finite element solvers using the Manycore Form Compiler and OP2 Generating high-performance multiplatform finite element solvers using the Manycore Form Compiler and OP2 Graham R. Markall, Florian Rathgeber, David A. Ham, Paul H. J. Kelly, Carlo Bertolli, Adam Betts

More information

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian

INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER. Adrian INTRODUCTION TO THE ARCHER KNIGHTS LANDING CLUSTER Adrian Jackson a.jackson@epcc.ed.ac.uk @adrianjhpc Processors The power used by a CPU core is proportional to Clock Frequency x Voltage 2 In the past,

More information

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs

OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs www.bsc.es OpenMPSuperscalar: Task-Parallel Simulation and Visualization of Crowds with Several CPUs and GPUs Hugo Pérez UPC-BSC Benjamin Hernandez Oak Ridge National Lab Isaac Rudomin BSC March 2015 OUTLINE

More information

Fast Image Processing using Halide

Fast Image Processing using Halide Fast Image Processing using Halide Andrew Adams (Google) Jonathan Ragan-Kelley (Berkeley) Zalman Stern (Google) Steven Johnson (Google) Dillon Sharlet (Google) Patricia Suriana (Google) 1 This talk The

More information

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU

Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Speed up a Machine-Learning-based Image Super-Resolution Algorithm on GPGPU Ke Ma 1, and Yao Song 2 1 Department of Computer Sciences 2 Department of Electrical and Computer Engineering University of Wisconsin-Madison

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening. Alberto Magni, Christophe Dubach, Michael O'Boyle

A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening. Alberto Magni, Christophe Dubach, Michael O'Boyle A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening Alberto Magni, Christophe Dubach, Michael O'Boyle Introduction Wide adoption of GPGPU for HPC Many GPU devices from many of vendors AMD

More information

Scientific Computing with Intel Xeon Phi Coprocessors

Scientific Computing with Intel Xeon Phi Coprocessors Scientific Computing with Intel Xeon Phi Coprocessors Andrey Vladimirov Colfax International HPC Advisory Council Stanford Conference 2015 Compututing with Xeon Phi Welcome Colfax International, 2014 Contents

More information

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures

Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid Architectures Procedia Computer Science Volume 51, 2015, Pages 2774 2778 ICCS 2015 International Conference On Computational Science Big Data Analytics Performance for Large Out-Of- Core Matrix Solvers on Advanced Hybrid

More information

Towards Mapping Lift to Deep Neural Network Accelerators

Towards Mapping Lift to Deep Neural Network Accelerators Towards Mapping Lift to Deep Neural Network Accelerators Naums Mogers University of Edinburgh naums.mogers@ed.ac.uk Aaron Smith University of Edinburgh Microsoft Research aaron.smith@microsoft.com Dimitrios

More information

Optical Flow Estimation with CUDA. Mikhail Smirnov

Optical Flow Estimation with CUDA. Mikhail Smirnov Optical Flow Estimation with CUDA Mikhail Smirnov msmirnov@nvidia.com Document Change History Version Date Responsible Reason for Change Mikhail Smirnov Initial release Abstract Optical flow is the apparent

More information

Comparison and analysis of parallel tasking performance for an irregular application

Comparison and analysis of parallel tasking performance for an irregular application Comparison and analysis of parallel tasking performance for an irregular application Patrick Atkinson, University of Bristol (p.atkinson@bristol.ac.uk) Simon McIntosh-Smith, University of Bristol Motivation

More information

High performance Computing and O&G Challenges

High performance Computing and O&G Challenges High performance Computing and O&G Challenges 2 Seismic exploration challenges High Performance Computing and O&G challenges Worldwide Context Seismic,sub-surface imaging Computing Power needs Accelerating

More information

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California

Dynamic Cuda with F# HPC GPU & F# Meetup. March 19. San Jose, California Dynamic Cuda with F# HPC GPU & F# Meetup March 19 San Jose, California Dr. Daniel Egloff daniel.egloff@quantalea.net +41 44 520 01 17 +41 79 430 03 61 About Us! Software development and consulting company!

More information

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber,

HPC trends (Myths about) accelerator cards & more. June 24, Martin Schreiber, HPC trends (Myths about) accelerator cards & more June 24, 2015 - Martin Schreiber, M.Schreiber@exeter.ac.uk Outline HPC & current architectures Performance: Programming models: OpenCL & OpenMP Some applications:

More information

High-Level Synthesis of Functional Patterns for Reconfigurable Logic. Martin Kristien

High-Level Synthesis of Functional Patterns for Reconfigurable Logic. Martin Kristien High-Level Synthesis of Functional Patterns for Reconfigurable Logic Martin Kristien 4th Year Project Report Computer Science School of Informatics University of Edinburgh 2017 Abstract Together with

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST

Introduction to tuning on many core platforms. Gilles Gouaillardet RIST Introduction to tuning on many core platforms Gilles Gouaillardet RIST gilles@rist.or.jp Agenda Why do we need many core platforms? Single-thread optimization Parallelization Conclusions Why do we need

More information

Efficient numerical simulation on multicore processors (MuCoSim) WS 2017

Efficient numerical simulation on multicore processors (MuCoSim) WS 2017 ERLANGEN REGIONAL COMPUTING CENTER Efficient numerical simulation on multicore processors (MuCoSim) WS 2017 Prof. Gerhard Wellein, Dr. G. Hager Department für Informatik & HPC Services Regionales Rechenzentrum

More information

Offload acceleration of scientific calculations within.net assemblies

Offload acceleration of scientific calculations within.net assemblies Offload acceleration of scientific calculations within.net assemblies Lebedev A. 1, Khachumov V. 2 1 Rybinsk State Aviation Technical University, Rybinsk, Russia 2 Institute for Systems Analysis of Russian

More information

Computing and energy performance

Computing and energy performance Equipe I M S Equipe Projet INRIA AlGorille Computing and energy performance optimization i i of a multi algorithms li l i PDE solver on CPU and GPU clusters Stéphane Vialle, Sylvain Contassot Vivier, Thomas

More information

STELLA: A Domain Specific Language for Stencil Computations

STELLA: A Domain Specific Language for Stencil Computations STELLA: A Domain Specific Language for Stencil Computations, Center for Climate Systems Modeling, ETH Zurich Tobias Gysi, Department of Computer Science, ETH Zurich Oliver Fuhrer, Federal Office of Meteorology

More information

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou Study and implementation of computational methods for Differential Equations in heterogeneous systems Asimina Vouronikoy - Eleni Zisiou Outline Introduction Review of related work Cyclic Reduction Algorithm

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

D-TEC DSL Technology for Exascale Computing

D-TEC DSL Technology for Exascale Computing D-TEC DSL Technology for Exascale Computing Progress Report: March 2013 DOE Office of Science Program: Office of Advanced Scientific Computing Research ASCR Program Manager: Dr. Sonia Sachs 1 Introduction

More information

CME 213 S PRING Eric Darve

CME 213 S PRING Eric Darve CME 213 S PRING 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

More information

Optimisation Myths and Facts as Seen in Statistical Physics

Optimisation Myths and Facts as Seen in Statistical Physics Optimisation Myths and Facts as Seen in Statistical Physics Massimo Bernaschi Institute for Applied Computing National Research Council & Computer Science Department University La Sapienza Rome - ITALY

More information

High Performance Computing: Tools and Applications

High Performance Computing: Tools and Applications High Performance Computing: Tools and Applications Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology Lecture 8 Processor-level SIMD SIMD instructions can perform

More information

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Data parallel algorithms, algorithmic building blocks, precision vs. accuracy Robert Strzodka Architecture of Computing Systems GPGPU and CUDA Tutorials Dresden, Germany, February 25 2008 2 Overview Parallel

More information

Sequoia. Mattan Erez. The University of Texas at Austin

Sequoia. Mattan Erez. The University of Texas at Austin Sequoia Mattan Erez The University of Texas at Austin EE382N: Parallelism and Locality, Fall 2015 1 2 Emerging Themes Writing high-performance code amounts to Intelligently structuring algorithms [compiler

More information

Software and Performance Engineering for numerical codes on GPU clusters

Software and Performance Engineering for numerical codes on GPU clusters Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3

More information

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Keshav Pingali The University of Texas at Austin Parallelism is everywhere Texas Advanced Computing Center Laptops Cell-phones Parallel programming?

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Advances of parallel computing. Kirill Bogachev May 2016

Advances of parallel computing. Kirill Bogachev May 2016 Advances of parallel computing Kirill Bogachev May 2016 Demands in Simulations Field development relies more and more on static and dynamic modeling of the reservoirs that has come a long way from being

More information

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection

Dynamic Selection of Auto-tuned Kernels to the Numerical Libraries in the DOE ACTS Collection Numerical Libraries in the DOE ACTS Collection The DOE ACTS Collection SIAM Parallel Processing for Scientific Computing, Savannah, Georgia Feb 15, 2012 Tony Drummond Computational Research Division Lawrence

More information

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC Course Class #1 Q&A Contents OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4 OpenACC/CUDA/OpenMP Q: Is OpenACC an NVIDIA standard or is it accepted

More information

Extending the Capabilities of Tiramisu. Malek Ben Romdhane

Extending the Capabilities of Tiramisu. Malek Ben Romdhane Extending the Capabilities of Tiramisu by Malek Ben Romdhane B.S., Massachusetts Institute of Technology (2017) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment

More information