Delite: A Framework for Heterogeneous Parallel DSLs

1 Delite: A Framework for Heterogeneous Parallel DSLs. Hassan Chafi, Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Kunle Olukotun (Stanford University); Tiark Rompf, Martin Odersky (EPFL)

2 Heterogeneous Parallel Programming: each class of hardware comes with its own programming model: Pthreads/OpenMP (Sun T2), CUDA/OpenCL (Nvidia Fermi), Verilog/VHDL (Altera FPGA), MPI (Cray Jaguar).

3 Programmability Chasm: applications in scientific engineering, virtual worlds, personal robotics, and data informatics must target too many different programming models: Pthreads/OpenMP (Sun T2), CUDA/OpenCL (Nvidia Fermi), Verilog/VHDL (Altera FPGA), MPI (Cray Jaguar).

6 PPL Vision: applications (scientific engineering, virtual worlds, personal robotics, data informatics) are written in domain-specific languages: Statistics (R), Physics (Liszt), Data Analytics (OptiQL), Graph Algorithms (Green-Marl), Machine Learning (OptiML). The DSLs are built on a domain embedding language (Scala) and a DSL infrastructure providing polymorphic embedding, staging, and static domain-specific optimization. Underneath, a parallel runtime (Delite, GRAMPS) performs dynamic domain-specific optimization, task and data parallelism, and locality-aware scheduling on heterogeneous hardware.

7 DSLs Present a New Problem: we now need to develop all of these DSLs, and current DSL development methods are unsatisfactory.

8 Current DSL Development Approaches
Stand-alone DSLs:
- Can include extensive optimizations
- Enormous effort to develop to a sufficient degree of maturity: an actual compiler/optimizer plus tooling (IDE, debuggers, ...)
- Interoperation between multiple DSLs is very difficult
Purely embedded DSLs (just a library):
- Easy to develop (can reuse the full host language) and easier to learn
- Can combine multiple DSLs in one program and share DSL infrastructure among several DSLs
- Hard to optimize using domain knowledge; targets the same architecture as the host language
We need to do better.

9 Need to Do Better. Goal: develop embedded DSLs that perform as well as stand-alone ones. Intuition: general-purpose languages should be designed with DSL embedding in mind.

10 Modular Staging Approach. A typical compiler pipeline consists of a lexer, parser, type checker, analysis, optimization, and code generation. A purely embedded DSL gets all of this for free from the host language but can't change any of it; a stand-alone DSL is highly expressive but must implement everything itself. Modular staging provides a hybrid approach: DSLs adopt the front end from the embedding language but can customize the IR and participate in the back-end phases. (GPCE '10: Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs.)

11 Linear Algebra Example

    object TestMatrix {
      def example(a: Matrix, b: Matrix, c: Matrix, d: Matrix) = {
        val x = a*b + a*c
        val y = a*c + a*d
        println(x+y)
      }
    }

12 Abstract Matrix Usage

    object TestMatrix extends DSLApp with MatrixArith {
      def example(a: Rep[Matrix], b: Rep[Matrix], c: Rep[Matrix], d: Rep[Matrix]) = {
        val x = a*b + a*c
        val y = a*c + a*d
        println(x+y)
      }
    }

Rep[Matrix] is an abstract type constructor standing for a range of possible implementations of Matrix. Operations on Rep[Matrix] are defined in the MatrixArith trait.

13 Lifting Matrix to an Abstract Representation. The DSL interface building blocks are structured as traits. Expressions of type Rep[T] represent expressions of type T, so different representations can be plugged in. We need to be able to convert (lift) Matrix values to the abstract representation and to define an interface for our DSL type:

    trait MatrixArith {
      type Rep[T]
      implicit def liftMatrixToRep(x: Matrix): Rep[Matrix]
      def infix_+(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
      def infix_*(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
    }

Now we can plug in different implementations and representations for the DSL.
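To make the "plug in different representations" point concrete, here is a minimal sketch (not from the slides) of a direct, unstaged implementation that instantiates Rep[T] as T itself; the add and mult methods on Matrix are hypothetical names, any concrete Matrix API would do:

    trait MatrixArithDirect extends MatrixArith {
      type Rep[T] = T
      implicit def liftMatrixToRep(x: Matrix): Matrix = x
      // Matrix.add / Matrix.mult are assumed for illustration
      def infix_+(x: Matrix, y: Matrix): Matrix = x.add(y)
      def infix_*(x: Matrix, y: Matrix): Matrix = x.mult(y)
    }

Mixing this trait into TestMatrix instead of a staged implementation would evaluate the example eagerly on concrete matrices.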

14 Modest Effort. Hand-lifting each new DSL to a slightly different IR violates the modest-effort criterion. We need a DSL embedding infrastructure that provides building blocks of common DSL compiler functionality.

15 Delite DSL Compiler. DSL programs (e.g., Liszt and OptiML programs) enter through the Scala Embedding Framework and the Delite Parallelism Framework, which build a common Intermediate Representation (IR). The Base IR provides a common IR that can be extended while still benefiting from generic analysis and optimization.

16 Now We Can Build an IR. Start with a common IR structure to be shared among DSLs:

    trait Expressions {
      // constants/symbols (atomic)
      abstract class Exp[T]
      case class Const[T](x: T) extends Exp[T]
      case class Sym[T](n: Int) extends Exp[T]

      // operations (composite, defined in subtraits)
      abstract class Op[T]

      // additional members for managing encountered definitions
      def findOrCreateDefinition[T](op: Op[T]): Sym[T]
      implicit def toExp[T](d: Op[T]): Exp[T] = findOrCreateDefinition(d)
    }

Generic optimizations are provided to all DSLs: common subexpression elimination, dead code elimination, and code motion.
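As a rough illustration of how this structure already enables common subexpression elimination, here is a hedged sketch (not the actual Delite/LMS code) in which findOrCreateDefinition hashes each operation and reuses the symbol of a structurally equal definition seen before:

    trait ExpressionsImpl extends Expressions {
      private var nextSym = 0
      private val defs = scala.collection.mutable.Map[Op[_], Sym[_]]()

      // case-class equality means two identical Ops map to the same Sym,
      // so repeated subexpressions collapse into one definition (CSE)
      def findOrCreateDefinition[T](op: Op[T]): Sym[T] =
        defs.getOrElseUpdate(op, { nextSym += 1; Sym[T](nextSym) })
            .asInstanceOf[Sym[T]]
    }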

17 Delite DSL Compiler (continued). On top of the Base IR and its generic analysis and optimization, the Delite Parallelism Framework extends the common IR with the Delite IR: nodes that encode parallel execution patterns. Now parallelism analysis, optimization, and mapping can be performed.

18 Delite Ops encode known parallel execution patterns (data-parallel: map, zip, reduce, ...). Delite provides implementations of these patterns for multiple hardware targets, e.g., multi-core and GPU, and the DSL developer maps each domain operation to the appropriate pattern, as in the sketch below.
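A hedged sketch of what "mapping a domain operation to a pattern" can look like; DeliteOpMapSketch and its signature are illustrative stand-ins for Delite's actual op classes, and the Exp and Matrix types are the ones introduced on the previous slides:

    // illustrative pattern trait: the pattern, not the DSL op, owns the parallel schedule
    trait DeliteOpMapSketch[A, B] {
      def in: Exp[Matrix]
      def func: A => B
    }

    // domain op "element-wise exp over a Matrix" declared as a map pattern
    case class MatrixExp(in: Exp[Matrix]) extends DeliteOpMapSketch[Double, Double] {
      def func = (x: Double) => math.exp(x)
    }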

19 Delite Op Fusion operates on all loop-based ops. It reduces op overhead and improves locality by eliminating temporary data structures and communicating through registers. Both dependent and side-by-side operations can be fused, and fused ops can have multiple outputs. The algorithm: fuse two loops if size(loop1) == size(loop2) and no dependencies exist that would make a valid schedule impossible after fusion; e.g., if C depends on B and B depends on A, then C and A cannot be fused, because B would have to execute both before and after the fused loop. A sketch of this check follows.
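A minimal sketch of the fusion test just described, under assumed LoopOp and dependsOn names (not Delite's internals):

    case class LoopOp(id: Int, size: Int, deps: Set[LoopOp]) {
      // transitive dependency: this op (directly or indirectly) consumes `other`
      def dependsOn(other: LoopOp): Boolean =
        deps.contains(other) || deps.exists(_.dependsOn(other))
    }

    def canFuse(a: LoopOp, b: LoopOp, allOps: Seq[LoopOp]): Boolean = {
      val sameSize = a.size == b.size
      // an op strictly between a and b in the dependency graph blocks fusion,
      // since it would have to run both before and after the fused loop
      val blocked = allOps.exists(x => x != a && x != b &&
        ((x.dependsOn(a) && b.dependsOn(x)) || (x.dependsOn(b) && a.dependsOn(x))))
      sameSize && !blocked
    }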

20 Delite DSL Compiler (continued). The DSL extends the appropriate parallel Delite IR nodes with its own domain-specific IR (DS IR) for its operations. Domain-specific analysis and optimization can now be performed on top of the parallelism analysis, optimization, and mapping of the Delite IR and the generic analysis and optimization of the Base IR.

21 Customize the IR with Domain Information

    trait MatrixArithRepExp extends MatrixArith with Expressions {
      type Rep[T] = Exp[T]
      implicit def liftMatrixToRep(x: Matrix) = Const(x)

      case class Plus(x: Exp[Matrix], y: Exp[Matrix]) extends DeliteOpZip[Matrix]
      def infix_+(x: Exp[Matrix], y: Exp[Matrix]) = Plus(x, y)
    }

Choose Exp as the representation for the DSL types, define the lifting function to create expressions, and extend the Delite IR with domain-specific node types. The DSL methods build the IR as the program runs.
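Because infix_+ builds IR nodes rather than computing values, the DSL author can also apply domain rewrites at construction time. The following is a hedged sketch, not from the slides: Matrix.isZero is a hypothetical predicate, and the result relies on the implicit toExp conversion from slide 16 to turn the Plus op into an Exp.

    def infix_+(x: Exp[Matrix], y: Exp[Matrix]): Exp[Matrix] = (x, y) match {
      case (Const(m), _) if m.isZero => y   // adding a known zero matrix is the identity
      case (_, Const(m)) if m.isZero => x
      case _                         => Plus(x, y)   // Op lifted to Exp via toExp
    }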

22 The Delite IR has three layers. At the application level (DSL user), programs are written in domain terms: M1 = M2 + M3, V1 = exp(V2), s = sum(M), C2 = sort(C1). The DSL author defines the domain-specific IR (DS IR) nodes behind these operations: Matrix Plus, Vector Exp, Matrix Sum, Collection Quicksort; this layer enables domain analysis and optimization. Each DS IR node extends a Delite Op IR node that captures its parallel pattern: ZipWith, Map, Reduce, Divide & Conquer; this layer enables parallelism analysis, optimization, and code generation. Everything sits on the Delite Base IR, which provides generic analysis and optimization.

23 Delite DSL Compiler (complete picture). From the layered IR, code generation produces a Delite Execution Graph, kernels (Scala, C, Cuda, MPI, Verilog, ...), and data structures (arrays, trees, graphs, ...). To summarize the pipeline: the Base IR provides a common IR with generic analysis and optimization; the Delite IR adds nodes encoding data-parallel execution patterns, enabling parallel optimization and mapping; the DS IR extends those nodes per domain operation, enabling domain-specific analysis and optimization; and the back end generates the execution graph, kernels, and data structures.

24 Delite Runtime. The runtime maps the machine-agnostic DSL compiler output (the Delite Execution Graph, kernels in Scala, C, Cuda, ..., and DSL data structures) onto a given machine specification, e.g., a local system with an SMP and a GPU, for customized execution.
- Walk time: the scheduler produces partial schedules from the execution graph and machine inputs, generating variants to be selected at run time once the actual execution flow is resolved. The code generator then JIT-produces fused, specialized kernels (kernel fusion, specialization, synchronization) to be launched on each resource, optimizing the execution plan using the partial schedules.
- Run time: the executor controls and optimizes execution: it dispatches the schedule, manages memory, performs lazy data transfers, launches kernels on resources, and monitors execution and system state to dynamically adjust kernels and the schedule.
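For intuition only, a hedged sketch of the walk-time inputs and a deliberately naive partial scheduler; the data types and placement policy are illustrative, not Delite's actual format:

    case class Kernel(id: String, deps: Set[String], variants: Set[String]) // e.g. "scala", "cuda"
    case class Machine(cpuThreads: Int, gpus: Int)

    // naive placement: any kernel with a cuda variant goes to the GPU when one exists;
    // the real scheduler also considers dependencies, locality, and kernel cost
    def partialSchedule(graph: Seq[Kernel], m: Machine): Map[String, Seq[Kernel]] =
      graph.groupBy(k => if (m.gpus > 0 && k.variants("cuda")) "gpu0" else "cpu")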

25 Schedule and Kernel Compilation. After scheduling, the execution graph is compiled into executables for each resource, and all required synchronization is added at this point. Kernels are specialized based on the number of processors allocated to them, e.g., the height of a tree reduction is specialized (see the sketch below).
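A minimal sketch (illustrative, not generated Delite code) of the tree-reduction shape being specialized: with p per-processor partial results the combine tree has height log2(p), so for a known p the loop below can be unrolled into a fixed number of rounds.

    def treeReduce[A](partials: Seq[A], combine: (A, A) => A): A = {
      var level = partials
      while (level.length > 1) {
        level = level.grouped(2).map {
          case Seq(l, r) => combine(l, r)   // pairwise combine in each round
          case Seq(l)    => l               // odd element carries over to the next round
        }.toSeq
      }
      level.head
    }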

26 GPU Management. A CUDA host thread launches kernels and automatically performs data transfers as required by the schedule. The compiler provides helper functions to copy data structures between address spaces, pre-allocate outputs and temporaries, and select the number of threads and thread blocks. It also provides device memory management for kernels: a liveness analysis determines when op inputs and outputs are dead on the GPU, and the runtime frees dead data when it experiences memory pressure.
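A hedged sketch of the liveness idea (names and bookkeeping are assumptions, not the runtime's actual data structures): each device-resident buffer records the last scheduled op that reads it, and once that op completes the buffer becomes a candidate to free under memory pressure.

    case class GpuBuffer(name: String, bytes: Long, lastUseOp: Int)

    // buffers whose last reader has completed are dead and may be freed
    def freeable(resident: Seq[GpuBuffer], lastCompletedOp: Int): Seq[GpuBuffer] =
      resident.filter(_.lastUseOp <= lastCompletedOp)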

27 Conclusions. We need to simplify the process of developing DSLs for parallelism, and programming languages need to be designed for flexible embedding. Lightweight modular staging in Scala allows for more powerful embedded DSLs, and Delite provides a framework for adding parallelism.
