Exploiting High-Performance Heterogeneous Hardware for Java Programs using Graal


Exploiting High-Performance Heterogeneous Hardware for Java Programs using Graal. James Clarkson±, Juan Fumero, Michalis Papadimitriou, Foivos S. Zakkak, Christos Kotselidis and Mikel Luján. ±Dyson; The University of Manchester. ManLang '18, Linz (Austria), 12th September 2018

Outline
- Background
- Tornado
- Tornado API
- Tornado Runtime
- Tornado JIT Compiler
- Performance Results
- Conclusions

Context of this project
- Started as the PhD thesis of James Clarkson: "Compiler and Runtime Support for Heterogeneous Programming"
- James Clarkson, Christos Kotselidis, Gavin Brown, and Mikel Luján. "Boosting Java Performance using GPGPUs". In Proceedings of the 30th International Conference on Architecture of Computing Systems (ARCS)
- Christos Kotselidis, James Clarkson, Andrey Rodchenko, Andy Nisbet, John Mawer, and Mikel Luján. "Heterogeneous Managed Runtime Systems: A Computer Vision Case Study". ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17)
- Partially funded by the EPSRC AnyScale grant EP/L000725/1

Currently part of the EU H2020 E2Data Project: "End-to-end solution for heterogeneous Big Data deployments that fully exploits and advances the state-of-the-art in infrastructure". https://e2data.eu/ Funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 780622.

1. Background

Current Heterogeneous Computing Landscape

Current Virtual Machines

Our Solution: VM + Heterogeneous Runtime

2. Tornado: A Practical Heterogeneous Programming Framework

Tornado: A Java-based Heterogeneous Programming Framework
- It exposes a task-based parallel programming API
- It contains an OpenCL JIT compiler and a runtime for running on heterogeneous devices
- Modular system currently using: OpenJDK/Graal, OpenCL
- It currently runs on CPUs, GPUs and FPGAs*

Tornado Overview

Tornado API: @Parallel
"It's a developer-provided annotation that instructs the JIT compiler that it is OK for each iteration to be executed independently."
It does not specify or imply:
- that iterations should be executed in parallel;
- the parallelization scheme to be used
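To make the annotation's contract concrete, here is a plain-Java sketch (not Tornado code) of what "each iteration can execute independently" means: every iteration writes a distinct element, so the loop computes the same result whether it runs sequentially or with its iterations executed in parallel. The class and method names are illustrative only.

```java
import java.util.stream.IntStream;

public class ParallelSemantics {
    // Each iteration touches only c[i]: iterations are independent,
    // which is exactly what @Parallel would assert about this loop.
    static void addSequential(int[] a, int[] b, int[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // The same computation with iterations executed in parallel;
    // independence guarantees an identical result.
    static void addParallel(int[] a, int[] b, int[] c) {
        IntStream.range(0, c.length).parallel().forEach(i -> c[i] = a[i] + b[i]);
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4}, b = {10, 20, 30, 40};
        int[] c1 = new int[4], c2 = new int[4];
        addSequential(a, b, c1);
        addParallel(a, b, c2);
        System.out.println(java.util.Arrays.equals(c1, c2)); // true
    }
}
```

A loop with a loop-carried dependency (e.g. `c[i] = c[i-1] + a[i]`) would not satisfy this contract, which is why the annotation is a promise from the developer rather than something the compiler infers.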

Task Schedules
"A task schedule describes how to co-ordinate the execution of tasks across heterogeneous hardware."
- Composability
- Sequential consistency
- Task-based parallelism
- Automatic and optimised data movement

Tornado API: enabling task-based parallelism

Task Schedules: example

    class Ex {
      public static void multiply(Double4[] a, Double4[] b, Double4[] c) {
        // code here
      }

      public static void add(Double4[] a, Double4[] b, Double4[] c) {
        // code here
      }
    }
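To show how such methods compose into a schedule without requiring the Tornado library, here is a hypothetical mock of the fluent pattern: tasks are added to a named schedule and executed in order, so a later task can consume what an earlier one produced (the sequential-consistency and composability properties listed above). `TaskScheduleSketch` and its `task` method are invented for illustration; they are not the Tornado API.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskScheduleSketch {
    private final String name;
    private final List<Runnable> tasks = new ArrayList<>();

    public TaskScheduleSketch(String name) { this.name = name; }

    // Add a task; the real framework takes a method reference plus its arguments.
    public TaskScheduleSketch task(Runnable t) {
        tasks.add(t);
        return this;
    }

    // Run tasks in insertion order: a later task sees earlier results.
    public void execute() {
        for (Runnable t : tasks) t.run();
    }

    public static void main(String[] args) {
        double[] a = {1, 2}, b = {3, 4}, c = new double[2], d = new double[2];
        new TaskScheduleSketch("s0")
            .task(() -> { for (int i = 0; i < c.length; i++) c[i] = a[i] * b[i]; })
            .task(() -> { for (int i = 0; i < d.length; i++) d[i] = c[i] + b[i]; })
            .execute();
        // d depends on the c produced by the first task: {6.0, 12.0}
    }
}
```

In Tornado itself the schedule additionally carries the data-movement information (which arrays to stream in and out), which this mock omits.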

3. Tornado Runtime

Tornado: Workflow

1. Tornado API: a task schedule describes a data-flow graph; each node is a task.

       new TaskSchedule("s0")
           .add(Ex1::add, a, b, c)
           .streamOut(c)
           .execute();

       void add(int[] a, int[] b, int[] c) {
         for (@Parallel int i = 0; i < c.length; i++) {
           c[i] = a[i] + b[i];
         }
       }

2. Sketcher (Tornado compiler): code reachability analysis; data-dependency analysis.
3. HIR cache: stores the resulting sketches.
4. Graph optimizer (runtime optimizations): task placement; data-flow optimization; inserts low-level tasks.
5. Serialized task schedule: handed to the executor.
6. Code generator: compiles cached sketches; parallelization; device-specific built-ins; generated code is kept in a code cache.
7. Task executor: maps tasks onto the driver API; triggers JIT compilation; triggers data movements. A pluggable driver calls the OpenCL runtime (clEnqueueWriteBuffer(), clEnqueueNDRangeKernel(), clEnqueueReadBuffer()) to run the generated OpenCL C kernel.

Data parallelism: task specialisation. E.g., currently we have two parallel schemes: coarse-grain and fine-grain.

    // Fine-grain loop for GPUs
    int idx = get_global_id(0);
    int size = get_global_size(0);
    for (int i = idx; i < c.length; i += size) {
      // computation
      c[i] = a[i] + b[i];
    }

    // Coarse-grain loop for CPUs
    int id = get_global_id(0);
    int size = get_global_size(0);
    int block_size = (size + inputSize - 1) / size;
    int start = id * block_size;
    int end = min(start + block_size, c.length);
    for (int i = start; i < end; i++) {
      // computation
      c[i] = a[i] + b[i];
    }
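The two schemes above can be simulated in plain Java to check that they partition the work identically: the fine-grain scheme gives each logical work-item a strided subset of indices, while the coarse-grain scheme gives each one a contiguous block. The sketch below models `get_global_id` with an explicit loop over `threads` work-items; names are illustrative, not generated Tornado code.

```java
public class Schemes {
    // Fine-grain (GPU style): work-item `id` handles id, id+threads, id+2*threads, ...
    static void strideAdd(int[] a, int[] b, int[] c, int threads) {
        for (int id = 0; id < threads; id++) {
            for (int i = id; i < c.length; i += threads) {
                c[i] = a[i] + b[i];
            }
        }
    }

    // Coarse-grain (CPU style): work-item `id` handles one contiguous block,
    // with the last block clipped to the array length.
    static void blockedAdd(int[] a, int[] b, int[] c, int threads) {
        int blockSize = (c.length + threads - 1) / threads;
        for (int id = 0; id < threads; id++) {
            int start = id * blockSize;
            int end = Math.min(start + blockSize, c.length);
            for (int i = start; i < end; i++) {
                c[i] = a[i] + b[i];
            }
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5}, b = {1, 1, 1, 1, 1};
        int[] c1 = new int[5], c2 = new int[5];
        strideAdd(a, b, c1, 2);
        blockedAdd(a, b, c2, 2);
        System.out.println(java.util.Arrays.equals(c1, c2)); // true
    }
}
```

The blocked scheme favours CPUs because each work-item touches a contiguous region (good cache locality), while the strided scheme favours GPUs because adjacent work-items access adjacent elements (coalesced memory accesses).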

Memory Management
- Each heterogeneous device has a managed heap
- Enables objects to persist on devices
- Currently we duplicate objects which reside in the JVM heap
- No object creation on devices
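The duplication model can be illustrated with a minimal plain-Java sketch, under the assumption that the device heap holds a copy rather than an alias of the JVM-heap object: device-side updates are invisible to the host until an explicit copy back (which is what a schedule's stream-out step triggers). The class and method names are hypothetical.

```java
import java.util.Arrays;

public class DeviceHeapSketch {
    // "Device heap" holds a duplicate of the JVM-heap array, not an alias.
    static double[] copyToDevice(double[] host) {
        return Arrays.copyOf(host, host.length);
    }

    // Results only become visible to the JVM after an explicit copy back.
    static void copyBack(double[] device, double[] host) {
        System.arraycopy(device, 0, host, 0, host.length);
    }

    public static void main(String[] args) {
        double[] host = {1.0, 2.0};
        double[] device = copyToDevice(host);
        device[0] = 42.0;               // device-side update
        System.out.println(host[0]);    // 1.0 — host copy untouched
        copyBack(device, host);
        System.out.println(host[0]);    // 42.0 — visible after copy back
    }
}
```

This is why the runtime must track data movement explicitly: with duplicated objects, correctness depends on copying at the right points, and performance depends on not copying more often than needed.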

4. Tornado JIT Compiler

5. Case study

Case study: Kinect Fusion, a complex computer vision application that is able to reconstruct a 3D model from an RGB-D camera in real time.

Why KFusion?
- Not a normal Java application
- Complex multi-kernel pipeline
- Sustains the execution of 540-1620 kernels per second
- SLA of 30 FPS
- Representative of cutting-edge robotics/computer vision applications
- We want to deploy across many platform and accelerator combinations

What did we get with Tornado? Running on an NVIDIA Tesla, up to 150 fps.

And compared to native code? [Chart: frames per second (0-250) over frame number (0-750) for native OpenCL, Tornado-OR and Tornado-JR.] Tornado is 28% slower than the best OpenCL native code.

6. Announcement & Conclusions

Tornado is now Open Source! We also have a poster tomorrow, come along! If you are interested, we can also show you demos on GPUs and FPGAs!

Takeaway
- We have presented Tornado
- We have shown runtime code generation for OpenCL
- We have shown a case study for computer vision
- It is open source, give it a try! We look forward to your feedback!

Thank you very much for your attention. This work is partially supported by the EPSRC grants PAMELA EP/K008730/1 and AnyScale Apps EP/L000725/1, and the EU Horizon 2020 E2Data 780245. Juan Fumero <juan.fumero@manchester.ac.uk>

Compilation times. [Chart: compilation time in seconds (0-0.20) for the OpenCL and Graal compilers across devices: AMD A10-7850K, Intel i7-4850HQ, Intel E5-2620, AMD Radeon R7, Intel Iris Pro 5200, NVIDIA GT 750M, NVIDIA Tesla K20m.]

OpenCL Device Driver: Just-In-Time Compiler. [Diagram: OpenCL JIT compiler and runtime.]