Meta-Programming and JIT Compilation

Sean Treichler

Portability vs. Performance

Many scientific codes spend ~100% of their cycles in a tiny fraction of the code base. We want these kernels to be as fast as possible, so we:
- Start with an efficient algorithm
- Rely on the compiler's help to optimize the code
- Manually perform compiler-like optimizations (e.g. loop unrolling)
- Take advantage of processor-specific features (e.g. SIMD)
- Add prefetching, block transfers, etc. to improve memory bandwidth
- Inline/fold values that are constant at compile time
- Optimize for a known memory layout
- Hoist out computations based on run-time parameters that change slowly or not at all
- Tune them based on run-time profiling
- ...

The Problem(s)

Each additional step generally improves performance, but:
- Decreases portability
  - Can write multiple versions to target different machines and use cases, but this increases code development/debug/maintenance costs
- Optimizations are baked into the checked-in source
  - Obfuscates the intent of the code

A Solution: Meta-Programming

Instead of writing (many variations of) your kernel, write code that generates the variations programmatically:
- Ideally, write kernels at a high level, focusing on intent
- Apply target- and use-case-specific optimizations by lowering the code through layers of abstraction
- Code is transformed programmatically, at compile time
  - Provides the benefits of abstraction without runtime overhead
- The transformations themselves are often applicable to many types of kernels
(A toy sketch of the idea follows below.)
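To make this concrete before introducing any tooling, here is a toy sketch in plain Lua (the scripting language used throughout the rest of these slides). The make_copy generator and its unroll factor are illustrative additions, not from the original slides:

    -- Build a copy routine with the loop fully unrolled, where the trip
    -- count N is known when we *generate* the code, not when we write it.
    local function make_copy(N)
      local lines = { "return function(dst, src)" }
      for i = 1, N do
        lines[#lines + 1] = ("  dst[%d] = src[%d]"):format(i, i)
      end
      lines[#lines + 1] = "end"
      -- compile the generated source and return the resulting function
      return assert(load(table.concat(lines, "\n")))()
    end

    local copy4 = make_copy(4)  -- straight-line code: four assignments, no loop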

Meta-Programming isn't New

Meta-programming exists in many forms today:
- Offline, app-specific code generators (e.g. Singe, FFTW)
- Compile-time, e.g. C++ templates
- Run-time, e.g. Lisp, MetaOCaml

Legion applications already meta-program offline and at compile time. We would like to meta-program at runtime, in a way that:
- Generates FAST code
- Takes advantage of Legion runtime information

Introducing Lua-Terra

An active research project at Stanford: http://terralang.org

Starts with Lua:
- a very simple dynamic scripting language
- designed in the late '90s, fairly mature at this point
- designed to be embeddable just about anywhere

And then adds Terra:
- a statically typed, just-in-time (JIT) compiled language
- designed to interoperate with Lua code
- also designed to interoperate with existing compiled code

Introduction to Lua-Terra

A simple Lua function adds one to whatever it's given; it is evaluated in Lua's stack-based VM:

    function lua_addone(a)
      return a + 1
    end

    > = lua_addone(10)
    11

Terra looks a lot like Lua code, except with types. A Terra function is compiled to native code using LLVM and executed directly on the host CPU:

    terra terra_addone(a : int) : int
      return a + 1
    end

    > = terra_addone(10)
    11

Capturing JIT-time Constants

Arbitrary Lua variables are captured when a Terra function is defined, allowing the compiler to optimize a specialized version:

    > X, Y = 10, 100

    terra foo(a : int) : int
      for i = 0,X do
        a = a + Y
      end
      return a
    end

    > foo:disas()
    assembly for function at address 0x22e6070
    0x22e6070(+0): lea EAX, DWORD PTR [EDI + 1000]
    0x22e6077(+7): ret

The ten iterations of adding 100 have been folded into a single add of 1000.
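The same definition-time capture can be exploited deliberately. A small sketch (our addition, not from the slides) that builds a family of kernels, each specialized on a different constant:

    -- Each loop iteration defines a new Terra function; the current value
    -- of k is captured and folded into the generated code as a constant.
    local scalers = {}
    for _, k in ipairs({2, 10, 100}) do
      scalers[k] = terra(a : int) : int
        return a * k
      end
    end

    print(scalers[10](7))  -- 70, with the multiply-by-10 baked in at JIT time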

Functions, Types are Lua Objects

A Lua function can take a type as a parameter, because base Terra types are just Lua values. Here the in-scope Lua variable T supplies the element type of an anonymous Terra function:

    function axpy(T)
      return terra(alpha : T, X : &T, Y : &T, n : int)
        for i = 0,n do
          Y[i] = Y[i] + alpha * X[i]
        end
      end
    end

    saxpy = axpy(float)
    daxpy = axpy(double)
    caxpy = axpy(complex(float))
    zaxpy = axpy(complex(double))

The functions returned by the generator are given useful names, and the types themselves can be generated by other Lua code (e.g. complex(float)).
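A minimal usage sketch, assuming the definitions above are loaded in the same Lua-Terra session (the test harness itself is our addition):

    terra test_daxpy()
      var X = arrayof(double, 1.0, 2.0, 3.0, 4.0)
      var Y = arrayof(double, 1.0, 1.0, 1.0, 1.0)
      daxpy(2.0, &X[0], &Y[0], 4)  -- Y[i] <- Y[i] + 2.0 * X[i]
      return Y[3]                  -- 1.0 + 2.0 * 4.0 = 9.0
    end

    print(test_daxpy())  -- 9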

Quotes, Escapes

Lua code looks at the structure of the sparse matrix at invocation: it iterates to generate a quoted Terra expression for each row's sum and returns the sequence of quoted Terra statements as a list. An escape from Terra to Lua generates that list and interpolates the statements into the Terra function:

    function spmv(A)
      local function body(y, x)
        local assns = {}
        for i = 1,A.rows do
          local sum = `0
          for _,nz in pairs(A[i]) do
            local col, weight = nz[1], nz[2]
            sum = `sum + weight * x[col]
          end
          assns[i] = quote y[i-1] = sum end
        end
        return assns
      end
      return terra(y : &double, x : &double)
        [ body(y, x) ]
      end
    end
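For illustration, one plausible encoding of the matrix argument; this layout is our assumption, chosen to be consistent with the generator above (A.rows gives the row count, A[i] holds {column, weight} pairs, with columns 0-based to match the pointer indexing):

    -- A 2x3 sparse matrix: spmv(A) bakes this exact sparsity pattern,
    -- including the weights, into straight-line Terra code.
    local A = {
      rows = 2,
      { {0, 2.0}, {2, 0.5} },  -- y[0] = 2.0*x[0] + 0.5*x[2]
      { {1, 4.0} },            -- y[1] = 4.0*x[1]
    }
    local mv = spmv(A)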

Portability, Dynamic Tasks

Terra generates LLVM IR/bitcode, so it can target:
- x86 (+SSE, AVX, AVX-512, ...)
- CUDA
- ARM
- anything else for which an LLVM backend exists
(see the cross-compilation sketch below)

We are expanding the Legion task registration API:
- Tasks can be dynamically registered during execution
  - Take advantage of properties of the program input
- Registration can specify constraints on usage
  - Preserves the mapper's ability to make arbitrary decisions
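As a sketch of retargeting, using Terra's cross-compilation API as documented at terralang.org (the target triple is illustrative, and daxpy is the kernel generated earlier):

    -- Emit an AArch64 object file for the daxpy kernel without leaving Lua.
    local arm = terralib.newtarget { Triple = "aarch64-unknown-linux-gnu" }
    terralib.saveobj("daxpy_arm.o", "object", { daxpy = daxpy }, nil, arm)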

On-Demand Variant Generation

Recall that multiple variants of a task can be registered; the runtime will select a variant that is compatible with the processor and instance layouts chosen by the mapper. We don't really want to pre-generate all possible variants, though... Instead, register a variant generator function:
- The generator function is written in Lua
- It will be called by the runtime if no suitable variant exists
- The runtime provides the processor/layout information
- The generator function returns a new task variant and the conditions under which it can be used
(a sketch of such a generator follows below)
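A sketch of what such a generator might look like. The shape of req and the returned condition table are hypothetical stand-ins for whatever processor/layout description the runtime actually passes; none of these field names are the real Legion API:

    -- Hypothetical: 'req' describes the processor and instance layout
    -- chosen by the mapper for the task launch that needs a variant.
    local function axpy_generator(req)
      local T = req.field_type       -- e.g. float or double
      local variant = terra(alpha : T, X : &T, Y : &T, n : int)
        for i = 0,n do
          Y[i] = Y[i] + alpha * X[i]
        end
      end
      -- return the new variant plus the conditions under which the
      -- runtime may reuse it without calling the generator again
      return variant, { processor_kind = req.processor_kind }
    end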

Meta-Programming within Legion

Planning to take advantage of meta-programming within the runtime as well:
- DMA subsystem
  - Exponential explosion of memory types and instance layouts
  - Could even specialize for particular index space sparsity
- Avoiding compile-time capacity limits
  - e.g. the number of fields per instance
- Dynamic optimization of dependency analysis
  - the next step after trace replay

Beyond Lua-Terra

Modular architecture: Terra is just the trailblazer. The task registration API supports different languages:
- C function pointer
- Terra expression
- name of a symbol from a dynamic shared object
- LLVM IR
- ...

This works for variant generators as well:
- Lua
- native C/C++?
- queries to a remote database?