Deep Learning as a Polyhedral Compiler's Killer App


Deep Learning as a Polyhedral Compiler's Killer App: From Research to Industry Transfer and Back and Again. Albert Cohen, Inria and Facebook (work in progress)

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zach DeVito, William S. Moses, Sven Verdoolaege, Andrew Adams, Albert Cohen

Programming Language & Systems Research for Deep Learning?
Deep learning has become massively popular over the last 10 years. Machine Learning (ML) frameworks are all over the place. Is this good enough?

Writing Good Neural Network Layers
A tedious experience, out of reach for Machine Learning (ML) users and researchers. Existing layers already live in vendor libraries from Intel, Nvidia, etc., and can reach very high performance, 95%+ efficiency, on a few almost-ideal kernels. But practice is often far from machine peak: a new layer or neural net architecture becomes a performance bottleneck, and porting across ML frameworks is a heavy engineering burden. HPC library coders are scarce, not all geniuses, don't scale, and tune by hand.
Goal: capture the ML expert's idea, concisely, and automate the rest.

Our Approach
Don't write programs, synthesize them. Derive ML-specific techniques from software synthesis and compilers:
A mini-language, close to the mathematics and easy to manage by automatic tools
A compiler for algebraic/algorithmic optimization and automatic differentiation
A compiler for polyhedral scheduling and mapping
Generates efficient kernel implementations, e.g., for GPU acceleration
Just-in-time specialization of the model (hyper)parameters, automatic tuning
Works transparently: "A New Op" for machine learning and applications
Integrated with industry-standard frameworks (currently Caffe2, PyTorch)
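To give a flavor of the mini-language idea, a layer can be written as a single index expression over tensor dimensions. The sketch below uses NumPy's einsum as a stand-in for the actual DSL; the TC-style definition quoted in the docstring is illustrative.

```python
import numpy as np

def matmul_spec(A, B):
    """A matmul layer written as a one-line index expression,
    standing in for a TC-style definition such as
        C(m, n) +=! A(m, k) * B(k, n)
    (illustrative notation, not the exact production syntax)."""
    return np.einsum("mk,kn->mn", A, B)

A = np.arange(6, dtype=np.float64).reshape(2, 3)
B = np.ones((3, 4))
C = matmul_spec(A, B)
assert np.allclose(C, A @ B)   # identical to an ordinary matrix product
```

The point is that the specification names only the mathematics (indices and a reduction); scheduling, tiling, and mapping are left entirely to the compiler.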

Abstraction Without Regret
To make development efficient, we need abstractions that provide productivity without sacrificing performance. Given the enormous number of potential kernels, this suggests a dynamic code-generation approach.
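A minimal toy illustration of the dynamic code-generation idea, making no claims about the real stack: a kernel is generated and compiled once per (hyper)parameter value, then cached and reused. The real system JIT-compiles GPU code, not Python; names here are hypothetical.

```python
_cache = {}

def specialized_scale(n):
    """Toy just-in-time specialization: build a kernel whose loop
    bound n is baked in at generation time, compile it once, cache it."""
    if n not in _cache:
        src = f"def kernel(x):\n    return [2.0 * x[i] for i in range({n})]"
        ns = {}
        exec(src, ns)            # compile the size-specialized kernel
        _cache[n] = ns["kernel"]
    return _cache[n]

k = specialized_scale(4)
print(k([1.0, 2.0, 3.0, 4.0]))   # [2.0, 4.0, 6.0, 8.0]
```

Specializing on fixed sizes is what lets a code generator unroll, vectorize, and pick tile sizes per problem instance, at the cost of one compilation per shape.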

Prior Work
Direct generation approaches such as active libraries [2] or built-to-order (BTO) [3] provide usability but miss optimizations. DSLs such as Halide [4] provide usability and permit scheduling transformations, though schedules must be specified manually. Compilers like XLA [5] or Latte [6] optimize and fuse operators, but performance lags because their languages cannot represent the complex schedules crucial to GPUs and other targets.
[2] Run-time code generation in C++ as a foundation for domain-specific optimization, 2003
[3] Automating the generation of composed linear algebra kernels, 2009
[4] Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines, 2013
[5] XLA: a domain-specific compiler for linear algebra to optimize TensorFlow computations, 2017
[6] Latte: a language, compiler, and runtime for elegant and efficient deep neural networks, 2016


Synthesize From Mathematical Model
A tight mathematical model emits thousands of optimized lines of code:
Group Convolution
Kronecker Recurrent Units (KRU): algorithmic exploration of storage/recompute tradeoffs
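As an example of such a model, a pointwise group convolution reduces to a single index expression. The NumPy sketch below (shapes and names are my own, for illustration) shows the kind of compact mathematical definition the synthesizer expands into thousands of optimized lines.

```python
import numpy as np

def group_conv_1x1(I, W):
    """Pointwise (1x1) group convolution as one index expression:
        O(n, g, f, h, w) = sum_c W(g, f, c) * I(n, g, c, h, w)
    Each group g mixes only its own input channels c."""
    return np.einsum("gfc,ngchw->ngfhw", W, I)

rng = np.random.default_rng(0)
I = rng.standard_normal((2, 4, 8, 5, 5))   # batch, groups, C per group, H, W
W = rng.standard_normal((4, 16, 8))        # groups, F per group, C per group
O = group_conv_1x1(I, W)
assert O.shape == (2, 4, 16, 5, 5)
```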

Rather Than Write Accelerator Code


Heard That Before?
30 years of parallelizing and optimizing compiler research wrapped into a robust, automated tool, with much ML specialization, a modern C++ interface, and tight ML framework integration. Embeds the most complex compositions of loop nest and tensor optimizations.
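As a taste of the simplest such loop-nest transformation, tiling splits one loop into a tile loop and an intra-tile loop; this toy Python model (not the generated code) shows the rewrite preserves the result while reshaping the iteration order for locality.

```python
def tiled_sum(x, tile=4):
    """Loop tiling: the outer loop walks tiles, the inner loop walks
    elements within a tile. Same result as a single flat loop; on real
    hardware the tile can be sized to fit cache or shared memory."""
    n, acc = len(x), 0.0
    for t in range(0, n, tile):               # tile loop
        for i in range(t, min(t + tile, n)):  # intra-tile loop
            acc += x[i]
    return acc

assert tiled_sum(list(range(10))) == sum(range(10))
```

The polyhedral framework composes many such transformations (tiling, fusion, skewing, mapping to threads) on arbitrarily nested loops, with exact dependence analysis guaranteeing legality.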

Algorithmic Contributions

Memory Promotion
O[l+Idx[i][j]][k] => shared_o[l][i][j][k]
Cache indirectly accessed arrays, only when O and Idx are only read (not written). Promote direct accesses if the tile is of fixed size, elements are reused, and at least one access lacks memory coalescing. Promote indirectly accessed arrays in the same way (ignoring coalescing).
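A toy model of the promotion: the indirect read is gathered once into a dense buffer whose subscripts are exactly the loop indices, after which every access in the kernel body becomes direct. This is a NumPy sketch under assumed shapes; legality rests on O and Idx being read-only, as stated above.

```python
import numpy as np

def promote_indirect(O, Idx, L, K):
    """Gather the indirectly accessed read O[l + Idx[i][j]][k] into a
    dense buffer shared_o[l][i][j][k] (one copy per loop index tuple).
    Valid only because O and Idx are read, never written, in the kernel."""
    I, J = Idx.shape
    shared_o = np.empty((L, I, J, K))
    for l in range(L):
        for i in range(I):
            for j in range(J):
                shared_o[l, i, j] = O[l + Idx[i, j]]  # one-time gather
    return shared_o

O = np.arange(20.0).reshape(10, 2)   # source array, read-only
Idx = np.array([[0, 2], [1, 3]])     # indirection array, read-only
S = promote_indirect(O, Idx, L=3, K=2)
# After promotion, O[l + Idx[i][j]][k] == shared_o[l][i][j][k].
assert np.all(S[1, 0, 1] == O[1 + Idx[0, 1]])
```

On a GPU, `shared_o` would live in on-chip shared memory, so the expensive gather through `Idx` is paid once per tile instead of once per use.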

Performance?

Algorithmic Exploration

Recurrent Networks? Courtesy Chris Olah, Google Brain

Simple RNN

Long Short Term Memory (LSTM)

Recurrent Networks? Work in progress.
Inspiration from functional languages with combinators, and from reactive and dataflow synchronous languages. Related to the modeling of finite state machines on infinite streams with Kahn semantics in Lucid Synchrone / Esterel Technologies Scade 6.
Challenge: combining dataflow on init/last values with storage of tensor slices into higher-dimensional tensors.
Good news: natural modeling of nested loops, space and time, in a polyhedral framework.
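The recurrence in question can be sketched as a plain loop carrying the "last" hidden state from an "init" value, with the per-step states stored as slices of a higher-dimensional tensor. This is an illustrative NumPy sketch of a simple RNN, not the dataflow formalism itself.

```python
import numpy as np

def simple_rnn(xs, W, U, h0):
    """Simple RNN unrolled over time: h_t = tanh(W @ h_{t-1} + U @ x_t).
    h0 plays the role of the init value; each step reads the last state;
    stacking the states turns time into an extra tensor dimension."""
    hs, h = [], h0
    for x in xs:
        h = np.tanh(W @ h + U @ x)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(1)
W, U = rng.standard_normal((3, 3)), rng.standard_normal((3, 2))
H = simple_rnn(rng.standard_normal((5, 2)), W, U, np.zeros(3))
assert H.shape == (5, 3)   # (time steps, hidden size)
```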

Where From? The Polly Labs Initiative
Mathematical core isl: parametric linear optimization, Presburger arithmetic.
Industry transfer lab founded in February 2015, building on 12 years of collaboration with AMD (Austin, Bangalore) and IISc (CSA, Bangalore); INRIA-CEFIPRA Associate Team with IISc; Google Summer of Code(s) with IITH.
Partners: IIT Hyderabad, IISc, ETH Zürich, Qualcomm, Xilinx, Facebook.

Research and Industry Transfer - Virtual Lab, https://www.pollylabs.org
Bilateral contract between INRIA and ARM, in cooperation with Qualcomm (QuIC), Xilinx, and Facebook (FAIR). Focus on the LLVM ecosystem (http://llvm.org): exploitation and developer community. Mutualization of core polyhedral compilation infrastructure. Contributing to domain-specific deep learning, image processing, solvers, linear algebra, and research compilers. Training and tutorials.

Timeline
isl started in 2008, licensed under LGPLv2.1; used by GCC as its polyhedral library since 5.0. Virtually every smartphone runs code generated using our algorithms. http://repo.or.cz/w/isl.git
2013: Relicensed under MIT through the CARP EU project; used by LLVM through the Polly project
2014: Triggered ARM to release tools to generate linear algebra kernels
2014: Qualcomm, then ARM, push for Polly Labs
2015: Qualcomm Snapdragon LLVM uses Polly, compiles Android OSP
2016: Xilinx starts an isl-based project within Vivado HLS
2017: Facebook works on a deep learning compiler using isl
2018: Safran, Morphing Machines (IISc spin-off), CEFIPRA project to work on correct-by-construction signal processing and data analytics

Context: the Economic Challenge
Funding sustainable software tool development. Mutualize investments by large and not-so-large users. Leverage many different research funding instruments. Build a sustainable ecosystem of tool vendors.

Polly Labs experience so far: the importance of both upstream and application-driven, cross-domain research; of open source, distributed teams, and world-class recruitment; and of advanced prototyping of software artifacts for successful industry transfer. But it is very hard to work around labor law and the inflexible rules of academic institutions...
We are OPEN, HIRING, and Looking for COLLABORATIONS. Albert.Cohen@inria.fr