Programming Languages Research Programme



Topics: logic & semantics, planning, language-based security, resource-bound analysis, theorem proving/verification, logic, functional (CISA, LFCS)

"Foundational" PL has been an area of strength for Edinburgh since the 70s/80s, with activity centred in CISA and LFCS (frequently cross-fertilizing) and numerous clubs and research groups active at different times. However, by 2012-2013 there was no longer an organised activity, so we formed the Programming Languages Interest Group (PLInG).

Topics: theorem proving/verification, logic, functional, parallel, real-world DSLs, compilers, high-performance computing, C/C++, Scala, optimisation, multicore, GPU (CISA, LFCS, ICSA)

Over the last few years we have broadened the scope to include compilers, optimisation, and parallelism (cf. ICSA and the Pervasive Parallelism CDT). Recently, this led to the creation of the School's Programming Languages Research Programme.

Research collaborations between different parts of Informatics have been strengthened by PLInG and have already led to publications and good PhD student interactions.

Security: language-based security, analysis of the Ethereum language used in blockchain systems, ...?

[Embedded paper excerpt:]

Generating Performance Portable Code using Rewrite Rules: From High-Level Functional Expressions to High-Performance OpenCL Code

Michel Steuwer (michel.steuwer@ed.ac.uk), Christian Fensch (c.fensch@hw.ac.uk), Sam Lindley (sam.lindley@ed.ac.uk), Christophe Dubach (christophe.dubach@ed.ac.uk)
The University of Edinburgh (UK), Heriot-Watt University (UK), University of Münster (Germany)

Abstract

Computers have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort, resulting in a tension between performance and code portability. Typically, code is either tuned in a low-level imperative language using hardware-specific optimizations to achieve maximum performance, or is written in a high-level, possibly functional, language to achieve portability at the expense of performance. We propose a novel approach aiming to combine high-level programming, code portability, and high performance. Starting from a high-level functional expression, we apply a simple set of rewrite rules to transform it into a low-level functional representation, close to the OpenCL model, from which OpenCL code is generated. Our rewrite rules define a space of possible implementations which we automatically explore to generate hardware-specific OpenCL implementations. We formalize our system with a core dependently typed λ-calculus along with a denotational semantics, which we use to prove the correctness of the rewrite rules. We test our design in practice by implementing a compiler which generates high-performance imperative OpenCL code.
Our experiments show that we can automatically derive hardware-specific implementations from simple high-level functional algorithmic expressions, offering performance on a par with highly tuned code for multicore CPUs and GPUs written by experts.

Categories and Subject Descriptors: D.3.2 [Programming Languages]: Language Classifications - Applicative (functional) languages; Concurrent, distributed, and parallel languages; D.3.4 [Processors]: Code generation, Compilers, Optimization

Keywords: algorithmic patterns, rewrite rules, performance portability, GPU, OpenCL, code generation

1. Introduction

In recent years, graphics processing units (GPUs) have emerged as the workhorse of high-performance computing. These devices offer enormous raw performance but require programmers to have a deep understanding of the hardware in order to maximize performance. This means software is written and tuned on a per-device basis and needs to be adapted frequently to keep pace with ever-changing hardware. Programming models such as OpenCL offer the promise of functional portability of code across different parallel processors.
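The rewrite-based approach described in the abstract can be made concrete with a small sketch. This is an illustration only, in plain Python rather than the paper's formal calculus, and the function names and rule set are hypothetical: two classic semantics-preserving rewrites, map fusion and reduce splitting, each turn a functional expression into an equivalent one that maps differently onto parallel hardware.

```python
from functools import reduce

# Rewrite 1 (fusion): map(f) . map(g)  ==  map(f . g)
# Eliminates the intermediate array; one kernel instead of two.
def map_map(f, g, xs):
    return [f(y) for y in [g(x) for x in xs]]

def map_fused(f, g, xs):
    return [f(g(x)) for x in xs]

# Rewrite 2 (splitting): reduce(op)  ==  reduce(op) . map(reduce(op)) . split(n)
# Each chunk can be reduced by an independent thread or work-group,
# provided `op` is associative. Assumes len(xs) is a multiple of n.
def reduce_flat(op, xs):
    return reduce(op, xs)

def reduce_split(op, xs, n):
    chunks = [xs[i:i + n] for i in range(0, len(xs), n)]
    return reduce(op, [reduce(op, c) for c in chunks])

xs = list(range(16))
add = lambda a, b: a + b
assert map_map(lambda x: x + 1, lambda x: 2 * x, xs) == \
       map_fused(lambda x: x + 1, lambda x: 2 * x, xs)
assert reduce_flat(add, xs) == reduce_split(add, xs, 4)  # both 120
```

Applying such rules repeatedly generates the space of equivalent low-level implementations that the paper's compiler explores automatically.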
However, performance portability often remains elusive: code achieving high performance on one device might achieve only a fraction of the available performance on a different device. Figure 1 illustrates this problem by showing how a parallel reduce (a.k.a. fold) implementation, written and optimized for one particular device, performs on other devices. Three implementations have been tuned to maximize performance on each device: the Nvidia_opt and AMD_opt implementations are tuned for the Nvidia and AMD GPU respectively, implementing a tree-based reduce using an iterative approach with carefully specified synchronization primitives. The Nvidia_opt version utilizes the local (a.k.a. shared) memory to store intermediate results and exploits a hardware feature of Nvidia GPUs to avoid certain synchronization barriers. The AMD_opt version does not perform these two optimizations but instead uses vectorized operations. The Intel_opt parallel implementation, tuned for an Intel CPU, also relies on vectorized operations; however, it uses a much coarser form of parallelism with fewer threads, in which each thread performs more work.

[Figure 1 (bar chart; y-axis: relative performance, 0.0-1.0; device groups: Nvidia GPU, AMD GPU, Intel CPU; bars: Nvidia opt, AMD opt, Intel opt; one bar marked "Failed"): Performance is not portable across devices. Each bar represents the device-specific optimized implementation of a parallel reduce implemented in OpenCL and tuned for an Nvidia GPU, AMD GPU, and Intel CPU respectively. Performance is normalized with respect to the best implementation on each device.]

ICFP '15, August 31 - September 2, 2015, Vancouver, BC, Canada. (c) 2015 ACM. 978-1-4503-3669-7/15/08...$15.00. http://dx.doi.org/10.1145/2784731.2784754

We hope the School's recognition of the PL Research Programme will also help spark new projects or collaborative activities.

Machine learning: better abstractions/optimisations for data science, probabilistic, ...?
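The device-specific reduction strategies contrasted above can be sketched schematically. Plain Python stands in here for the OpenCL kernels, and the function names are hypothetical; the point is the difference in parallelism granularity, not the actual tuned code.

```python
from functools import reduce

def tree_reduce(op, xs):
    """GPU-style iterative tree reduction: log2(n) rounds, each round
    combining adjacent pairs in parallel (here simulated sequentially).
    Assumes len(xs) is a power of two."""
    while len(xs) > 1:
        xs = [op(xs[i], xs[i + 1]) for i in range(0, len(xs), 2)]
    return xs[0]

def coarse_reduce(op, xs, num_threads):
    """CPU-style coarse-grained reduction: a few threads each reduce a
    large contiguous chunk sequentially; partial results are then
    combined. Assumes len(xs) is a multiple of num_threads."""
    chunk = len(xs) // num_threads
    partials = [reduce(op, xs[t * chunk:(t + 1) * chunk])
                for t in range(num_threads)]
    return reduce(op, partials)

data = list(range(1024))
add = lambda a, b: a + b
assert tree_reduce(add, data) == coarse_reduce(add, data, 4) == sum(data)
```

Both compute the same result for an associative operator; which one runs fastest depends on the device, which is exactly the performance-portability problem Figure 1 illustrates.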

Topics: logic & semantics, planning, language-based security, resource-bound analysis, theorem proving/verification, logic, functional, parallel/distributed, real-world DSLs, compilers, high-performance computing, C/C++, Scala, optimisation, multicore, GPU (CISA, LFCS, ICSA)
