Delite: A Framework for Heterogeneous Parallel DSLs
1 Delite: A Framework for Heterogeneous Parallel DSLs
Hassan Chafi, Arvind Sujeeth, Kevin Brown, HyoukJoong Lee, Kunle Olukotun (Stanford University)
Tiark Rompf, Martin Odersky (EPFL)
2 Heterogeneous Parallel Programming
- Pthreads, OpenMP (Sun T2)
- CUDA, OpenCL (Nvidia Fermi)
- Verilog, VHDL (Altera FPGA)
- MPI (Cray Jaguar)
3 Programmability Chasm
- Applications: Scientific Engineering, Virtual Worlds, Personal Robotics, Data Informatics
- Programming models: Pthreads, OpenMP (Sun T2); CUDA, OpenCL (Nvidia Fermi); Verilog, VHDL (Altera FPGA); MPI (Cray Jaguar)
- Too many different programming models
6 PPL Vision
- Applications: Scientific Engineering, Virtual Worlds, Personal Robotics, Data Informatics
- Domain Specific Languages: Statistics (R), Physics (Liszt), Data Analytics (OptiQL), Graph Alg. (Green-Marl), Machine Learning (OptiML)
- DSL Infrastructure: embedding language (Scala) with polymorphic embedding, staging, and static domain-specific opt.; parallel runtime (Delite, GRAMPS) with dynamic domain-specific opt., task & data parallelism, and locality-aware scheduling
- Heterogeneous Hardware
7 DSLs Present a New Problem
We need to develop all of these DSLs, but current DSL development methods are unsatisfactory.
8 Current DSL Development Approaches
Stand-alone DSLs:
- Can include extensive optimizations
- Enormous effort to develop to a sufficient degree of maturity: an actual compiler with optimizations, plus tooling (IDE, debuggers, ...)
- Interoperation between multiple DSLs is very difficult
Purely embedded DSLs ("just a library"):
- Easy to develop (can reuse the full host language) and easier to learn
- Can combine multiple DSLs in one program and share DSL infrastructure among several DSLs
- Hard to optimize using domain knowledge; target the same architecture as the host language
We need to do better.
9 Need to Do Better
Goal: develop embedded DSLs that perform as well as stand-alone ones.
Intuition: general-purpose languages should be designed with DSL embedding in mind.
10 Modular Staging Approach
Typical compiler phases: Lexer, Parser, Type checker, Analysis, Optimization, Code gen.
- Stand-alone DSL: highly expressive, but implements everything itself.
- Embedded DSL: gets it all for free, but can't change any of it.
- Modular Staging provides a hybrid approach: DSLs adopt the front-end from the embedding language, but can customize the IR and participate in the back-end phases.
GPCE'10: "Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs"
11 Linear Algebra Example

  object TestMatrix {
    def example(a: Matrix, b: Matrix, c: Matrix, d: Matrix) = {
      val x = a*b + a*c
      val y = a*c + a*d
      println(x+y)
    }
  }
12 Abstract Matrix Usage

  object TestMatrix extends DSLApp with MatrixArith {
    def example(a: Rep[Matrix], b: Rep[Matrix], c: Rep[Matrix], d: Rep[Matrix]) = {
      val x = a*b + a*c
      val y = a*c + a*d
      println(x+y)
    }
  }

Rep[Matrix] is an abstract type constructor covering the range of possible implementations of Matrix. Operations on Rep[Matrix] are defined in the MatrixArith trait.
13 Lifting Matrix to an Abstract Representation
DSL interface building blocks are structured as traits. Expressions of type Rep[T] represent expressions of type T, so we can plug in different representations. We need to be able to convert (lift) Matrix to the abstract representation, and to define an interface for our DSL type:

  trait MatrixArith {
    type Rep[T]
    implicit def liftMatrixToRep(x: Matrix): Rep[Matrix]
    def infix_+(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
    def infix_*(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
  }

Now we can plug in different implementations and representations for the DSL.
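As a toy illustration of "plugging in" a representation, the sketch below instantiates the MatrixArith interface with the identity representation Rep[T] = T, so operations evaluate eagerly. The Matrix fields, `combine`, and `MatrixArithDirect` are illustrative inventions, not Delite code, and infix_* is omitted for brevity.

```scala
// Minimal sketch: a toy Matrix and a direct, unstaged implementation
// of the MatrixArith interface from the slide.
case class Matrix(rows: Int, cols: Int, data: Vector[Double]) {
  def combine(that: Matrix)(f: (Double, Double) => Double): Matrix =
    Matrix(rows, cols, data.zip(that.data).map { case (a, b) => f(a, b) })
}

trait MatrixArith {
  type Rep[T]
  implicit def liftMatrixToRep(x: Matrix): Rep[Matrix]
  def infix_+(x: Rep[Matrix], y: Rep[Matrix]): Rep[Matrix]
}

// One possible plug-in: Rep[T] = T, so operations evaluate immediately.
trait MatrixArithDirect extends MatrixArith {
  type Rep[T] = T
  implicit def liftMatrixToRep(x: Matrix): Matrix = x
  def infix_+(x: Matrix, y: Matrix): Matrix = x.combine(y)(_ + _)
}

object DirectDemo extends MatrixArithDirect {
  def run(): Matrix =
    infix_+(Matrix(1, 2, Vector(1.0, 2.0)), Matrix(1, 2, Vector(3.0, 4.0)))
}
```

A staged implementation (shown later in the deck) swaps Rep[T] for IR expression nodes without touching user programs.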
14 Modest Effort
Lifting each new DSL that uses a slightly different IR by hand violates the modest-effort criterion. We need a DSL embedding infrastructure that provides the building blocks of common DSL compiler functionality.
15 Delite DSL Compiler
Liszt and OptiML programs pass through the Scala Embedding Framework and the Delite Parallelism Framework into a common Intermediate Representation (IR): the Base IR with Generic Analysis & Opt.
- Provide a common IR that can be extended while still benefitting from generic analysis and optimization.
16 Now We Can Build an IR
Start with a common IR structure to be shared among DSLs:

  trait Expressions {
    // constants/symbols (atomic)
    abstract class Exp[T]
    case class Const[T](x: T) extends Exp[T]
    case class Sym[T](n: Int) extends Exp[T]

    // operations (composite, defined in subtraits)
    abstract class Op[T]

    // additional members for managing encountered definitions
    def findOrCreateDefinition[T](op: Op[T]): Sym[T]
    implicit def toExp[T](d: Op[T]): Exp[T] = findOrCreateDefinition(d)
  }

Generic optimizations are provided to all DSLs: common subexpression elimination, dead code elimination, code motion.
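The sketch below shows one way findOrCreateDefinition can yield common subexpression elimination for free: identical Op nodes (compared structurally, since they are case classes) map to the same symbol. The map-based bookkeeping and the Add node are illustrative assumptions, not Delite's actual implementation.

```scala
// Simplified Expressions trait: structurally equal ops share one Sym (CSE).
trait Expressions {
  abstract class Exp[T]
  case class Const[T](x: T) extends Exp[T]
  case class Sym[T](n: Int) extends Exp[T]
  abstract class Op[T]

  private var defs: Map[Op[_], Sym[_]] = Map.empty
  private var nSyms = 0

  def findOrCreateDefinition[T](op: Op[T]): Sym[T] =
    defs.get(op) match {
      case Some(s) => s.asInstanceOf[Sym[T]]   // seen before: reuse (CSE)
      case None =>
        val s = Sym[T](nSyms); nSyms += 1
        defs += (op -> s)
        s
    }

  implicit def toExp[T](op: Op[T]): Exp[T] = findOrCreateDefinition(op)
}

object CseDemo extends Expressions {
  case class Add(x: Exp[Int], y: Exp[Int]) extends Op[Int]
  def demo(): Boolean = {
    val a = Const(1); val b = Const(2)
    // the same Op is built twice, but only one definition is created
    findOrCreateDefinition(Add(a, b)) == findOrCreateDefinition(Add(a, b))
  }
}
```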
17 Delite DSL Compiler
The IR now has two layers: the Base IR (Generic Analysis & Opt.) and the Delite IR (Parallelism Analysis, Opt. & Mapping).
- Provide a common IR that can be extended while still benefitting from generic analysis and optimization.
- Extend the common IR with IR nodes that encode parallel execution patterns; now we can do parallel optimizations and mapping.
18 Delite Ops
Encode known parallel execution patterns, e.g. data-parallel: map, zip, reduce, ... Delite provides implementations of these patterns for multiple hardware targets (e.g., multi-core, GPU). The DSL developer maps each domain operation to the appropriate pattern.
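The reason one IR node can target multiple kinds of hardware is that these patterns decompose into independent chunks. A minimal sketch of a chunked zip pattern, in plain sequential Scala; `ZipPattern`, `zipWith`, and the chunking scheme are illustrative assumptions, not Delite's implementation (where each chunk could run on its own core or GPU thread block).

```scala
// Hypothetical chunked execution of a data-parallel zip pattern.
object ZipPattern {
  def zipWith[A, B, C](xs: IndexedSeq[A], ys: IndexedSeq[B], chunks: Int)
                      (f: (A, B) => C): IndexedSeq[C] = {
    val n = xs.length
    (0 until chunks).flatMap { c =>      // each chunk is independent work
      val lo = c * n / chunks
      val hi = (c + 1) * n / chunks
      (lo until hi).map(i => f(xs(i), ys(i)))
    }
  }
}
```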
19 Delite Op Fusing
Operates on all loop-based ops. Reduces op overhead and improves locality: temporary data structures are eliminated, and values are communicated through registers. Fuses both dependent and side-by-side operations; fused ops can have multiple outputs.
Algorithm: fuse two loops if
- size(loop1) == size(loop2), and
- no dependencies exist that would require an impossible schedule if fused (e.g., if C depends on B and B depends on A, then C and A cannot be fused).
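The payoff of fusing two same-size loops can be sketched directly (hypothetical hand-written code, not Delite's IR transformation): the intermediate array disappears and the value flows through a local, i.e. a register.

```scala
// Before/after picture of loop fusion for out = (a * 2) + 1.
object Fusion {
  def unfused(a: Array[Double]): Array[Double] = {
    val tmp = new Array[Double](a.length)        // temporary data structure
    var i = 0
    while (i < a.length) { tmp(i) = a(i) * 2.0; i += 1 }
    val out = new Array[Double](a.length)
    var j = 0
    while (j < a.length) { out(j) = tmp(j) + 1.0; j += 1 }
    out
  }
  def fused(a: Array[Double]): Array[Double] = {
    val out = new Array[Double](a.length)
    var i = 0
    while (i < a.length) {
      val t = a(i) * 2.0                         // communicated via a register
      out(i) = t + 1.0
      i += 1
    }
    out
  }
}
```

Both versions compute the same result; the fused one makes a single pass and allocates no temporary.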
20 Delite DSL Compiler
The IR now has three layers: the Base IR (Generic Analysis & Opt.), the Delite IR (Parallelism Analysis, Opt. & Mapping), and the DS IR (Domain Analysis & Opt.).
- Provide a common IR that can be extended while still benefitting from generic analysis and optimization.
- Extend the common IR with IR nodes that encode parallel execution patterns; now we can do parallel optimizations and mapping.
- The DSL extends the appropriate parallel nodes for its operations; now we can do domain-specific analysis and optimization.
21 Customize the IR with Domain Info

  trait MatrixArithRepExp extends MatrixArith with Expressions {
    type Rep[T] = Exp[T]
    implicit def liftMatrixToRep(x: Matrix) = Const(x)

    case class Plus(x: Exp[Matrix], y: Exp[Matrix]) extends DeliteOpZip[Matrix]
    def infix_+(x: Exp[Matrix], y: Exp[Matrix]) = Plus(x, y)
  }

Choose Exp as the representation for the DSL types, define a lifting function to create expressions, and extend the Delite IR with domain-specific node types. DSL methods build the IR as the program runs.
22 The Delite IR
- Application (DSL user): M1 = M2 + M3, V1 = exp(V2), s = sum(M), C2 = sort(C1)
- DS IR (DSL author), domain analysis & opt.: Matrix Plus, Vector Exp, Matrix Sum, Collection Quicksort
- Delite Op IR (Delite), parallelism analysis & opt., code generation: ZipWith, Map, Reduce, Divide & Conquer
- Base IR (Delite), generic analysis & opt.: Op
23 Delite DSL Compiler
Code generation from the layered IR produces a Delite Execution Graph, kernels (Scala, C, Cuda, MPI, Verilog, ...), and data structures (arrays, trees, graphs, ...).
- Provide a common IR that can be extended while still benefitting from generic analysis and optimization.
- Extend the common IR with IR nodes that encode data-parallel execution patterns; now we can do parallel optimizations and mapping.
- The DSL extends the appropriate data-parallel nodes for its operations; now we can do domain-specific analysis and optimization.
- Generate an execution graph, kernels, and data structures.
24 Delite Runtime
Maps the machine-agnostic DSL compiler output onto the given machine specification (e.g., a local system with an SMP and a GPU) for customized execution. Inputs: the machine description, the Delite Execution Graph, kernels (Scala, C, Cuda, ...), DSL data structures, and the application inputs.
Walk-time: the scheduler produces partial schedules, generating variants to be selected at run-time once the actual execution flow is resolved; the code generator JIT-compiles fused, specialized kernels, with synchronization, to be launched on each resource, optimizing the execution plan using the partial schedules.
Run-time: the executor controls and optimizes execution: it dispatches the schedule, launches kernels on resources, manages memory with lazy data transfers, and monitors execution and system state to dynamically adjust kernels and the schedule.
25 Schedule and Kernel Compilation
After scheduling, the execution graph is compiled into executables for each resource. All required synchronization is added at this point. Kernels are specialized based on the number of processors allocated to them, e.g., specializing the height of a tree reduction.
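To make "specialize the height of the tree reduction" concrete: with p workers, each worker reduces its own chunk sequentially, and the partial results are then combined pairwise over ceil(log2(p)) levels, so a kernel specialized for a known p can fix that height at compile time. The sketch below is a hypothetical sequential model of this shape, not Delite's generated code.

```scala
// p workers reduce their chunks; partials combine in a log2(p)-level tree.
object TreeReduce {
  def reduce(xs: IndexedSeq[Int], p: Int): Int = {
    var level = (0 until p).map { w =>
      val lo = w * xs.length / p
      val hi = (w + 1) * xs.length / p
      var s = 0; var i = lo
      while (i < hi) { s += xs(i); i += 1 }   // sequential chunk reduction
      s
    }
    while (level.length > 1)                  // pairwise tree combination
      level = level.grouped(2).map(_.sum).toIndexedSeq
    level.head
  }
}
```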
26 GPU Management
The CUDA host thread launches kernels and automatically performs data transfers as required by the schedule. The compiler provides helper functions to copy data structures between address spaces, pre-allocate outputs and temporaries, and select the number of threads and thread blocks. The runtime provides device memory management for kernels: it performs liveness analysis to determine when op inputs and outputs are dead on the GPU, and frees dead data when it experiences memory pressure.
27 Conclusions
- We need to simplify the process of developing DSLs for parallelism.
- Programming languages need to be designed for flexible embedding.
- Lightweight modular staging in Scala allows for more powerful embedded DSLs.
- Delite provides a framework for adding parallelism.
More informationCOMP528: Multi-core and Multi-Processor Computing
COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk https://cgi.csc.liv.ac.uk/~mkbane/comp528 2X So far Why and
More informationCS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME
CS 179: GPU Programming LECTURE 5: GPU COMPUTE ARCHITECTURE FOR THE LAST TIME 1 Last time... GPU Memory System Different kinds of memory pools, caches, etc Different optimization techniques 2 Warp Schedulers
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationComputing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany
Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been
More informationOPTIML. Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Tiark Rompf, Anand Atreya, Michael Wu, Kunle Olukotun
OPTIML Arvind K. Sujeeth, HyoukJoong Lee, Kevin J. Brown, Hassan Chafi, Tiark Rompf, Anand Atreya, Michael Wu, Kunle Olukotun Stanford University Pervasive Parallelism Laboratory (PPL) Outline OptiML as
More informationDomain Specific Languages for Financial Payoffs. Matthew Leslie Bank of America Merrill Lynch
Domain Specific Languages for Financial Payoffs Matthew Leslie Bank of America Merrill Lynch Outline Introduction What, How, and Why do we use DSLs in Finance? Implementation Interpreting, Compiling Performance
More informationPorting Performance across GPUs and FPGAs
Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE
More informationScala in Martin Odersky
Scala in 2016 - Martin Odersky 2015 was on the quiet side Maturing tools: 2.11.x, IDEs, sbt Steady growth indeed.com jobs google trends In 2016, things will move again Scala 2.12 release Rethinking the
More informationGeneral Purpose GPU Programming. Advanced Operating Systems Tutorial 9
General Purpose GPU Programming Advanced Operating Systems Tutorial 9 Tutorial Outline Review of lectured material Key points Discussion OpenCL Future directions 2 Review of Lectured Material Heterogeneous
More informationSupporting Data Parallelism in Matcloud: Final Report
Supporting Data Parallelism in Matcloud: Final Report Yongpeng Zhang, Xing Wu 1 Overview Matcloud is an on-line service to run Matlab-like script on client s web browser. Internally it is accelerated by
More informationProductive Performance on the Cray XK System Using OpenACC Compilers and Tools
Productive Performance on the Cray XK System Using OpenACC Compilers and Tools Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 The New Generation of Supercomputers Hybrid
More informationPractical Introduction to CUDA and GPU
Practical Introduction to CUDA and GPU Charlie Tang Centre for Theoretical Neuroscience October 9, 2009 Overview CUDA - stands for Compute Unified Device Architecture Introduced Nov. 2006, a parallel computing
More informationImproving the Productivity of Scalable Application Development with TotalView May 18th, 2010
Improving the Productivity of Scalable Application Development with TotalView May 18th, 2010 Chris Gottbrath Principal Product Manager Rogue Wave Major Product Offerings 2 TotalView Technologies Family
More informationMulti-agent System Simulation in Scala: An Evaluation of Actors for Parallel Simulation
Multi-agent System Simulation in Scala: An Evaluation of for Parallel Simulation Aaron B. Todd 1, Amara K. Keller 2, Mark C. Lewis 2 and Martin G. Kelly 3 1 Department of Computer Science, Grinnell College,
More informationGETTING STARTED WITH CUDA SDK SAMPLES
GETTING STARTED WITH CUDA SDK SAMPLES DA-05723-001_v01 January 2012 Application Note TABLE OF CONTENTS Getting Started with CUDA SDK Samples... 1 Before You Begin... 2 Getting Started With SDK Samples...
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More information