Offloading Java to Graphics Processors

Peter Calvert (prc33@cam.ac.uk)
University of Cambridge, Computer Laboratory

Abstract

Massively-parallel graphics processors have the potential to offer high performance at low cost. However, at present such devices are largely inaccessible from higher-level languages such as Java. This work allows compilation from Java bytecode by making use of annotations to specify loops for parallel execution. Data copying to and from the GPU is handled automatically. Evaluation showed that significant speedups could be achieved (up to 18×), similar to those claimed by other, less automated, work.

1 Problem and Motivation

It has been widely recognised that future improvements in processor performance are likely to come from parallelism rather than increased clock speeds [1]. As well as multi-core CPUs, massively-parallel graphics processors (GPUs) are now widely available and can be used for single instruction multiple data (SIMD) style general processing. However, existing frameworks for making use of this hardware require the developer to have a detailed understanding of the underlying architecture and to make explicit data movements. Therefore, while the scientific community is making extensive use of such hardware, it has been difficult to do so from standard desktop applications. These tend to be written in higher-level languages such as Java, and it is from these languages that the hardware has been largely inaccessible.

While providing bindings [2] for the standard APIs (such as CUDA) makes the hardware available, it goes against many of the reasons for using a high-level language:

- High-level languages are typically used to make programs portable, so making use of hardware-specific libraries loses this benefit.
- Computation kernels still have to be written in the language supported by the framework (typically a C variant), leading to a multi-language solution.

If we could provide a solution that allowed high-level code to be executed on the GPU without affecting portability, then GPUs might be used in more everyday applications. It is also important to minimise the burden on the developer. In this work, I concentrated on Java and NVIDIA's CUDA framework.

2 Background and Related Work

General-purpose GPUs support execution of a single computation kernel across a grid, i.e. data parallelism. However, in order to make use of this, we must find such parallelism in our program. Normally, this will be in the form of (nested) for loops where each iteration is independent of the others. Unfortunately, standard Java code does not provide dependency information. Other work [3] has tried to reconstruct this data by analysis. However, automatic parallelisation is well known to be a hard problem due to the difficulty of alias analysis, i.e. determining whether two references might point to the same memory location.
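To make the distinction concrete, the following Java fragment (illustrative only; it is not taken from this work's benchmarks) contrasts a loop whose iterations are independent, and hence amenable to SIMD-style execution, with one that carries a dependency between iterations. The compiled bytecode of the two loops records nothing about this difference.

    public class LoopExamples {
        // Independent iterations: each out[i] depends only on in[i],
        // so the iterations may run in any order, or in parallel.
        static void mapSquare(double[] in, double[] out) {
            for (int i = 0; i < in.length; i++) {
                out[i] = in[i] * in[i];
            }
        }

        // Loop-carried dependency: iteration i reads the value written
        // by iteration i - 1, so the iterations must run sequentially.
        static void prefixSum(double[] a) {
            for (int i = 1; i < a.length; i++) {
                a[i] = a[i] + a[i - 1];
            }
        }
    }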

A further complication in using graphics processors is that they operate on separate memory. Data must therefore be transferred to and from the device as appropriate. However, to avoid excess time overheads, we want to restrict these transfers to the minimum possible.

A common alternative has been to simply wrap the GPU API, either as a library [2] or with extra language syntax [4]. Even when the solution allows the kernel's source to be included as a string in the high-level source code, this approach suffers from the problems already described.

3 Approach and Uniqueness

Unlike other approaches, the compiler produced by this work neither relies on automatic analysis nor acts as a simple wrapper for a low-level API. Instead, I introduce program annotations that explicitly indicate parallel for loops. Based on this metadata, the compiler determines the required data transfers by analysis. It also produces GPU code from the Java bytecode, rather than requiring kernels to be written in the CUDA language. This is all done as an extra compilation step, taking compiled class files and producing modified classes that make the required calls to a native library (also produced by the compiler) that uses the GPU. Since this step acts on bytecode, it could be performed on an end-user machine without access to source code. This offers some portability, since the transformation can be made only when a suitable GPU is available (provided the required annotations are present).

Whilst [3] also acts on bytecode, it does so within the virtual machine. There are arguments for both approaches. Incorporating the extra steps into the virtual machine requires less end-user intervention, but means that the compilation must be done on every execution, and also that users must replace their existing virtual machine.

3.1 Loop Detection

Java bytecode is made up of unstructured control flow rather than constructs such as conditionals and loops. For the compiler, it is important to reconstruct at least the applicable loops to be able to convert sequential for loops into parallel ones. In general, loops can have a wide variety of forms, as shown in Figure 1.

[Figure 1: Various examples of loops: (a) single entry/single exit; (b) multiple entries; (c) multiple exits.]

A standard restriction is to those with only a single entry point: natural loops. These can be detected using a simple dataflow analysis that considers dominators [5, p. 655]. The class of loops that can be executed on the GPU must also be suitable for parallel execution. In this work, the loops addressed are limited to trivial loops, defined as natural loops with only one exit, which compares the loop's index variable with an expression. All writes within the loop to the index variable must be increments or decrements, i.e. it must be an increment variable (an example of such a loop is sketched below). This is a slightly more general form of the loops considered in [6].
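As a rough illustration, a trivial loop marked for GPU execution might look as follows. Only the name and meaning of the @Parallel annotation come from this work; its declaration, retention, and attachment to the loop index shown here are assumptions made for this example. The loop body computes sin²x + cos²x, as in the numeric benchmarks of Section 4.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // Hypothetical declaration of the marker annotation; the real
    // definition used by the compiler may differ.
    @Target(ElementType.LOCAL_VARIABLE)
    @Retention(RetentionPolicy.CLASS)
    @interface Parallel {}

    class NumericKernelExample {
        // A "trivial loop": a single exit comparing the index i against
        // data.length, and i is only ever incremented, so it is an
        // increment variable.
        static void compute(double[] data, double[] result) {
            for (@Parallel int i = 0; i < data.length; i++) {
                double x = data[i];
                double s = Math.sin(x);
                double c = Math.cos(x);
                result[i] = s * s + c * c;   // sin²x + cos²x
            }
        }
    }

Under the @Parallel contract of Section 3.1.2, the iteration count (here data.length) is known before the loop starts, and no value flows between iterations other than the increment variable i.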

3.1.1 Increment Variables

Increment variables can be detected using a simple analysis. We maintain a value I_v(x) for each variable v at each point x in the code under consideration (normally a loop body). This value is taken from the set Z ∪ {⊤}. An integer n ∈ Z indicates that the net effect on the variable from entry until the given point is an increment of n; ⊤ indicates that the variable is written to in a more complex fashion. We can calculate these values in an iterative manner using the following equation (a code sketch of this iteration is given at the end of this section):

    I_v(x) = m + n   if instruction x increments v by n, and ∀p ∈ predecessors(x). I_v(p) = m
    I_v(x) = ⊤       otherwise

A proof that this analysis converges is given in the dissertation associated with this work [7, Section 3.3.2].

3.1.2 @Parallel Annotation

When a developer indicates that a loop should be run in parallel using a @Parallel annotation, this means that (apart from increment variables) there are no loop-carried dependencies, and that the expression with which the loop index variable is compared is unaffected by the loop body. In this case, it is therefore possible to know, before executing the loop, how many iterations will occur and the values of the increment variables on each one.

3.2 Code Generation

Once it is established that a loop is suitable for GPU execution, it is necessary to compile the loop body for the graphics processor. This must be done before any alterations are made to the bytecode itself, since the body may make use of features not supported by the GPU (such as recursion or exceptions). The compiler generates CUDA C, which is in turn passed to NVIDIA's compiler. This avoided the unnecessary effort of writing an optimiser. Java objects are mapped onto C structs, which are supported by CUDA directly. Methods within the objects are converted into simple functions by a name-mangling process similar to that used by JNI [8, Table 2-1]. Unfortunately, it is not possible to support inheritance, since function calls are determined statically. In the case of both arrays and objects, extra checks are needed to ensure that multiple copies of the same object are not imported (due to aliasing).

3.3 Determining Data Transfers

Finally, the compiler must replace the loop with a call to native code that first copies relevant data to the graphics card, and then invokes the generated kernel. The interesting aspect of this is determining which variables need to be copied in to the kernel, and which need to be copied out.

3.4 Copy-In

The variables used by the kernel can be determined by using live variable analysis [5, p. 608] on the kernel. Arrays can be handled by marking the whole array live whenever any element of it is either read or written. This treatment is similar to the behaviour of a write-allocate memory cache (where a cache line must be loaded even for a memory write).
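Returning to the increment-variable analysis of Section 3.1.1, the following sketch shows the fixpoint iteration behind the equation, over a simplified control-flow-graph representation. The node encoding, the use of Integer.MIN_VALUE to stand for ⊤, and the handling of not-yet-reached predecessors are assumptions of this example rather than details of the actual compiler; the full treatment and convergence proof are in [7].

    import java.util.*;

    class IncrementAnalysis {
        // Sentinel standing in for the "written in a more complex fashion" value (⊤).
        static final int TOP = Integer.MIN_VALUE;

        enum Kind { INCREMENT, OTHER_WRITE, NO_WRITE }

        static class Node {
            final Kind kind;
            final int delta;                          // only meaningful for INCREMENT
            final List<Node> predecessors = new ArrayList<>();
            Node(Kind kind, int delta) { this.kind = kind; this.delta = delta; }
        }

        // Computes I_v(x) for every reachable node x by iterating to a fixpoint.
        static Map<Node, Integer> analyse(List<Node> nodes, Node entry) {
            Map<Node, Integer> value = new HashMap<>();
            // Net increment observed on entry to the region is zero; the entry
            // node itself is assumed not to write v.
            value.put(entry, 0);
            boolean changed = true;
            while (changed) {
                changed = false;
                for (Node x : nodes) {
                    if (x == entry) continue;
                    Integer in = meet(x.predecessors, value);   // null = no information yet
                    if (in == null) continue;
                    int out;
                    if (in == TOP || x.kind == Kind.OTHER_WRITE) out = TOP;
                    else if (x.kind == Kind.INCREMENT)           out = in + x.delta;
                    else                                         out = in;
                    if (!Integer.valueOf(out).equals(value.get(x))) {
                        value.put(x, out);
                        changed = true;
                    }
                }
            }
            return value;
        }

        // All predecessors reached so far must agree on the same net increment m;
        // otherwise the result is ⊤.  Unreached predecessors are ignored until
        // they acquire a value (a simplification of the treatment in [7]).
        static Integer meet(List<Node> preds, Map<Node, Integer> value) {
            Integer m = null;
            for (Node p : preds) {
                Integer v = value.get(p);
                if (v == null) continue;
                if (v == TOP || (m != null && !m.equals(v))) return TOP;
                m = v;
            }
            return m;
        }
    }

Roughly, v then qualifies as an increment variable for the loop when no point of the loop body maps to ⊤.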

3.5 Copy-Out

Any writes made by the kernel must be to an array, since otherwise an output dependency would exist between different iterations of the loop. The above treatment of arrays means that the copy-out set is always a subset of the copy-in set. However, by using alias analysis it is possible to refine this result. The simple alias analysis used in the compiler does not always converge to a result, so it is occasionally necessary to revert to the complete copy-in set (for further details see [7, Section 3.3.3]).

4 Results and Contributions

4.1 Results

The compiler was applied to a range of benchmarks, consisting of both examples written to test specific aspects and existing publicly available code. The performance results were encouraging and suggested that the usual speedups associated with CUDA can still be achieved from within Java. On the hardware used (2 × 3.2 GHz Pentium 4, with GeForce GTX 260), overall speedup factors of about 30 were typical for the simple numeric benchmarks (calculating sin²x + cos²x for each number in an array), as shown in Table 1. However, cases requiring many small, potentially aliased objects to be transferred expose the communication and JNI overheads, exhibiting slowdowns when little work is done per object on the GPU. It is clear from the results that reducing data transfer is key to gaining further improvements.

Table 1: Speedup factors for simple numeric benchmarks.

  Numeric (N = 1.5 × 10^6)         Time (ms)                      Speedup Factor
                            Copy-In   Execute   Copy-Off      Kernel Only   Overall
  CPU Execution                -       1180        -               -           -
  Local Variable (1D)         9.3        6.2      15.2            190          38
  Static Field (1D)           9.3        5.0      15.2            236          39
  Object Array              11112        5.4      59.6            219           -
  2D Array                   25.5        5.2      34.0            227          18

Measurements were also made for two benchmarks from the Java Grande Benchmark Suite [9]. These were broadly similar to the results published for JCUDA [4] on similar hardware (see Table 2). The Crypt benchmark is interesting since it shows that the benefits of GPU execution are not restricted to floating-point computations.

Table 2: Comparison of Java Grande benchmark timings with JCUDA.

  Benchmark                                  Series (Floating Point)       Crypt (Integer)
  Data Size                                  10^4    10^5     10^6         3×10^6  2×10^7  5×10^7
  CPU (ms)                                   17971   182894   2878469      414     219     5344
  This Project on GeForce GTX 260 (ms)       99      968      9358         41      245     545
  JCUDA on Tesla C1060 (ms)                  11      14       114          2       16      45

A final benchmark used was Mandelbrot set computation. In this case, the number of iterations used in the computation allows the amount of time spent on the GPU to be varied without affecting the data transfer overheads. This confirmed the expected result that overall speedups improve when the amount of GPU computation is large compared with the data movement required (Figure 2).

[Figure 2: Mandelbrot set computation speedup factors (kernel-only and overall) and percentage overhead, with varying number of iterations and the grid size held fixed.]

In addition, a model was developed that predicts the overheads for a variable given its type and length. Excluding cases with large numbers of objects, this gave accurate results and might allow future work to give feedback on the suitability of GPU execution, or even to choose between CPU and GPU execution at runtime. Figure 3 shows the predictions made for the Mandelbrot set benchmark.

[Figure 3: Fit of the overheads model to Mandelbrot timings, broken down into import via JNI, copy-in, copy-out, and export via JNI.]

4.2 Conclusions

This work has developed a new compiler which enables Java code to be executed on massively-parallel graphics processors without any need for explicit data transfers. The results achieved demonstrate that it is possible to simplify the development process for making use of GPUs without sacrificing performance compared to other recent, but less automated, approaches. It is also possible to predict the overheads in certain scenarios, and this might be important in future work that is either more automated or gives more developer guidance.

4.3 Acknowledgements

This work was completed as a final-year undergraduate project at the University of Cambridge Computer Laboratory between October 2009 and May 2010, under the supervision of Dr Andrew Rice and Dominic Orchard.

References

[1] H. Sutter, The free lunch is over: A fundamental turn toward concurrency in software, Dr. Dobb's Journal, March 2005.

[2] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih, PyCUDA: GPU Run-Time Code Generation for High-Performance Computing, arXiv preprint arXiv:0911.3456, 2009.

[3] A. Leung, O. Lhoták, and G. Lashari, Automatic parallelization for graphics processing units, in Proceedings of the 7th International Conference on Principles and Practice of Programming in Java, 2009, pp. 91–100.

[4] Y. Yan, M. Grossman, and V. Sarkar, JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA, in Proceedings of the 15th International Euro-Par Conference on Parallel Processing, 2009.

[5] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, & Tools, 2nd ed. Addison-Wesley, 2007.

[6] A. Bik and D. Gannon, javab - a prototype bytecode parallelization tool, in ACM Workshop on Java for High-Performance Network Computing, 1998.

[7] P. Calvert, Parallelisation of Java for Graphics Processors, final-year dissertation, University of Cambridge Computer Laboratory, 2010. Available from http://www.cl.cam.ac.uk/~prc33/.

[8] S. Liang, Java Native Interface 6.0 Specification. Sun, 1999.

[9] J. M. Bull, L. A. Smith, M. D. Westhead, D. S. Henty, and R. A. Davey, A methodology for benchmarking Java Grande applications, in JAVA '99: Proceedings of the ACM 1999 Conference on Java Grande. New York, NY, USA: ACM, 1999, pp. 81–88.