GPU Linear algebra extensions for GNU/Octave


Journal of Physics: Conference Series. L B Bosi et al 2012 J. Phys.: Conf. Ser. 368 012062

GPU Linear algebra extensions for GNU/Octave

L B Bosi 1, M Mariotti 2 and A Santocchia 3
1 Dipartimento di Fisica, Camerino, IT
2 Dipartimento di Fisica, Perugia, IT
3 INFN and Dipartimento di Fisica, Perugia, IT
E-mail: leonebosi@gmail.com, mirko.mariotti@fisica.unipg.it, attilio.santocchia@pg.infn.it

Abstract. Octave is one of the most widely used open source tools for numerical analysis and linear algebra. Our project aims to improve Octave by introducing support for GPU computing in order to speed up some linear algebra operations. The core of our work is a C library that executes some BLAS operations concerning vector-vector, vector-matrix and matrix-matrix functions on the GPU. OpenCL functions are used to program GPU kernels, which are bound within the GNU/Octave framework. We report the project implementation design and some preliminary results about performance.

1. Introduction
GPU devices are having an enormous impact on scientific computing, mainly due to their very attractive performance in terms of GFlops per Watt per Euro. To give an idea, looking at the Top500 list [1] we find that the first and the third ranked supercomputers in the world are GPU powered. The strong penetration of GPU hardware into scientific computing has been stimulated and supported by many important GPU software projects started all around the world. The main purpose of our work is to make the computational power of modern GPUs accessible within Octave, the popular free scientific software tool.

2. Octave
Octave [2] provides functions for the numerical solution of linear and nonlinear problems and for performing other numerical experiments, together with plotting facilities. Octave can be used via the command line, with a Matlab-like [3] language, and also via program scripts for non-interactive execution. GNU Octave was written by John W. Eaton and many others; it is freely redistributable software and can be redistributed and/or modified under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation. Octave is free software, and the Octave community encourages users to help make it more useful, for example by writing additional functions and by reporting any problems encountered.

3. OpenCL
OpenCL [4] stands for Open Computing Language. This language was developed to allow writing programs across multiple platforms built with CPUs, GPUs and other processors.

OpenCL was initially developed by Apple, which holds trademark rights, and refined into an initial proposal in collaboration with technical teams at AMD, IBM, Intel, ARM and Nvidia. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008 the Khronos Compute Working Group was formed with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details of the OpenCL 1.0 specification by November 18, 2008. The technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008. OpenCL is analogous to the open industry standards OpenGL and OpenAL for 3D graphics and computer audio, respectively, and is managed by the non-profit technology consortium Khronos Group.

OpenCL includes a language based on C99 for writing the functions that are executed on OpenCL devices. These functions are called kernels. Moreover, the framework provides an API used to enumerate and control the various platforms. OpenCL gives applications access to the graphics processing unit for non-graphical computing, with task-based and data-based parallel computing models. We use OpenCL to develop the GPU linear algebra library extension for Octave.

4. Extending Octave
The aim of our work is to introduce GPU computing inside the Octave framework, bringing the OpenCL power within GNU Octave. This approach promises much better performance for large matrix/vector sizes, where GPU devices can best express their characteristics. In particular, we started by analysing the feasibility of this integration, optimizing the GPU context handling and producing the first prototype version of the octave-gpu library. By design, this version provides simple vector-vector, vector-matrix and matrix-matrix operations such as the fundamental operators and determinants.

The work was complicated by the peculiarities of Octave and OpenCL, but also by the poor Octave documentation on object extensions, which forced us to analyse the Octave sources and pre-processed code. This work allowed us to reconstruct enough detailed information on the Octave design to introduce our optimized OpenCL support.

The standard way to extend Octave is through so-called OCT files. An OCT file is essentially a loadable shared library that the interpreter may load at runtime and use. Octave provides a utility, mkoctfile, to create OCT files from sources. Let's take a look at a trivial OCT file:

    #include <octave/oct.h>
    #include <stdio.h>
    #include <stdlib.h>

    DEFUN_DLD(testme, args, , "Test OCT file")
    {
      int nargin = args.length ();
      return octave_value(nargin);
    }

This example adds an Octave function whose purpose is to count and return the number of its own arguments. The DEFUN_DLD macro is needed to create the interpreter entry point of a new Octave function (testme in this example). The file has to be named after the function it declares: in this case testme.cc will be the source and testme.oct the OCT file. The Octave interpreter expects to find a called external function in an OCT file named in the same way. If we want to put two or more functions within the same source file, we have to make sure that every function gets its own file (for instance by creating symbolic links or copying a single OCT file to several file names). Creating the OCT file is straightforward: "mkoctfile testme.cc". To use it from an Octave interpreter (opened in the same directory as the OCT file):

    octave:1> testme(3,4,5)
    ans = 3
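As an illustration of the kind of OpenCL kernel that such an OCT file can drive (a sketch written for this discussion, not the project's actual source), here is an element-wise vector-add kernel in OpenCL C, held as a string ready for runtime compilation, together with a plain-C host loop that computes the same result; the loop index plays the role of each work-item's get_global_id(0):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative OpenCL C kernel source (a sketch, not the paper's code).
   Each work-item adds one pair of elements, indexed by its global id. */
static const char *vector_add_kernel_src =
    "__kernel void vector_add(__global const double *a,\n"
    "                         __global const double *b,\n"
    "                         __global double *c)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

/* Host-side reference: the same computation the kernel performs,
   with the loop index standing in for get_global_id(0). */
void vector_add_host(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

On a real device the string would be passed to clCreateProgramWithSource and clBuildProgram before the kernel is enqueued; the host loop above only documents the per-element semantics.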

4.1. The first extension test: vector_add
Our first test putting OpenCL code within Octave was to create a vector_add.cc that takes two Octave arrays, copies them to a GPU device, performs the sum and copies the result back to Octave. All the OpenCL context, device and kernel initialization happens inside this function. This was a level-zero feasibility test of the OpenCL-Octave link: it demonstrated that the integration is possible and provided precious information for the next steps of the library, reported in the following sections.

4.1.1. vector_add benchmarking. We wrote a simple benchmark under Octave for the vector_add level-zero code. The code loops the vector-add command several times over different vector lengths, where the length is varied in order to test the code under increasing computational complexity; the GPU is expected to work best at high computing density. Moreover, we loop over each length several times in order to gather some statistics. We developed several implementations of the benchmark, in order to have numbers to compare:
- Octave native vector sum: uses the native Octave + operator
- vector_add: the first GPU implementation
- vector_add_precompiled_kernel: loads a precompiled kernel
- vector_add_static: not called from the Octave interpreter but called directly as a function (we use it as the upper limit on the attainable gain)

The results are reported below as execution time versus vector size. The hardware used for the test is an Intel Xeon E5520 2.27 GHz CPU and an Nvidia Tesla C1060 GPU.

Figure 1. Comparison between GPU and CPU performance, reporting the vector_add average processing time (s) vs vector size.

With the level-zero implementation we demonstrated the feasibility of the Octave-GPU link and the implementation perspectives, but also the first evident criticality: the GPU performance is much worse than the CPU's. Some overheads were clear at design time and others emerged while analysing the Octave data/object handling. At first order:
- The OpenCL initialization occurs every time an operation is performed.
- The OpenCL kernel compilation occurs every time an operation is performed.
- The vector data are copied every time from host to host (double buffering) and between GPU and host memory.
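The first two overheads disappear once the expensive setup is hoisted out of the per-call path and performed exactly once. A minimal C sketch of such once-only initialization (hypothetical names; the real library caches OpenCL contexts, programs and buffers rather than the plain fields shown here):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of a once-only GPU context.  Initialization and
   kernel compilation run on the first call only and are reused afterwards,
   so repeated operations no longer pay the setup cost. */
typedef struct {
    bool initialized;
    int  device_id;
    int  init_count;   /* counts how many times the expensive setup ran */
} gpu_context;

static gpu_context ctx = { false, 0, 0 };

gpu_context *gpu_context_get(int device_id)
{
    if (!ctx.initialized) {
        /* In a real library: create the OpenCL context, select the
           device, compile and cache every kernel.  Here we only record
           that the setup happened. */
        ctx.device_id   = device_id;
        ctx.init_count += 1;
        ctx.initialized = true;
    }
    return &ctx;
}
```

Every per-operation call then asks for the shared context instead of rebuilding it, which is exactly the design direction taken by the version-one library.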

Figure 2. vector_add profiling, reporting the percentage of time spent on each specific step.

The result analysis shows at first sight the main latency causes, mainly due to OpenCL initialization and OpenCL/Octave I/O overheads. We discovered that several overheads are added by OpenCL to the Octave code; in particular, the copy operations to and from the native Octave elements introduce many delays. The analysis of these results led us to the version-one implementation, which introduces important optimizations:
- Definition of a permanent GPU object hosting the OpenCL status variables and status objects.
- Kernel pre-compilation and buffering.
- A once-only initialization approach for the OpenCL device and memory.

Our GPU-Octave implementation consists of two main objects, GPUdevice and GPUobject: a GPUdevice is the representation of a GPU inside the Octave space, while a GPUobject is the representation of GPU data inside the Octave space. Moreover, at the time of this article we had already introduced the overloading of the most common operators: +, -, *.

4.1.2. The GPUdevice Octave object. The purpose of the GPUdevice is to store all the OpenCL initialization data for a given GPU device id. The implementation supports multiple devices, where the device id is given as an initialization parameter. A GPUdevice object, such as ogpu, can be declared using the following syntax:

    ogpu=gpudevice(0)

where 0 means GPU device 0, in our case the C1060. When a GPU device object is declared, the code executes the following operations:
- Compile and load into memory all the defined OpenCL kernels.
- Initialize the GPU device.
- Initialize the GPU context.

4.1.3. The GPUobject Octave object. Once a GPUdevice has been declared, the user can allocate several GPUobjects from it. GPUobjects are the objects used by the Octave user to perform vector/matrix manipulation. The syntax introduced for this purpose is:

    A=GPUobject(oGPU,[1:100000])

The command creates a GPUobject, embedding the OpenCL context and pre-initialized data through the initial GPUdevice-GPUobject linkage. Moreover, with this command the GPU memory is allocated and a vector of 100000 elements is stored, containing the numbers from 1 to 100000. The user can declare and initialize several GPUobjects starting from the same GPUdevice, and the purpose of the library is to provide the common Octave syntax for vector operations. For example, the user can declare two objects:

    A=GPUobject(oGPU,[1:100000])
    B=GPUobject(oGPU,[10:100010])

and perform a sum, difference or element-by-element multiplication:

    C=A+B
    D=A-B
    E=A*B

4.1.4. Benchmarking. We benchmarked the sum and product functions of the new version of the library following an approach similar to that of the level-zero version. In detail, we estimated the gain factor between CPU and GPU timing as the vector size increases, for:
- A+B (SUM)
- A*B (PRODUCT)
- A+B+C (SUM3)
- A+B*C (TRIAD)

Figure 3. Processing gain factors of sum (a+b), mul (a*b), sum of 3 terms (a+b+c) and triad (a+b*c) operations between CPU and GPU as the vector length changes.
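Each benchmarked expression reduces to a simple element-wise loop; a host-side C sketch of the heaviest one, TRIAD, together with the kind of repeated timing loop the benchmark uses (illustrative only; the actual benchmark runs inside Octave against the GPU library):

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* TRIAD: r = a + b*c, element by element. */
void vtriad(const double *a, const double *b, const double *c,
            double *r, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        r[i] = a[i] + b[i] * c[i];
}

/* Time one operation, repeated `reps` times for statistics, as in the
   benchmark loop described above; returns average seconds per call. */
double time_triad(size_t n, int reps,
                  const double *a, const double *b, const double *c,
                  double *r)
{
    clock_t t0 = clock();
    for (int k = 0; k < reps; ++k)
        vtriad(a, b, c, r, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC / (double)reps;
}
```

The gain factor for a given vector length is then the ratio of the CPU average time to the GPU average time measured this way.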

The first main result of this library version is the gain in data processing using the GPU with respect to the normal Octave CPU implementation. This holds over the whole range of the tests. We observe, as expected, a gain factor that increases roughly proportionally with the processed vector length. We also observe a drop at 2^23 that needs investigation, probably due to the GPU computing block size, which has not been optimized at this stage. Sum and product show very similar performance, with a gain between 2 and 17. SUM3 and TRIAD show a higher gain, between 2.5 and 24; this means that as the GPU computing volume increases, the gain factor increases. This leads us to deduce that the measured timing still has a dominant systematic delay component due to some overhead that can again be the object of optimization.

5. Conclusions
Our work demonstrates the feasibility of the GPU (OpenCL)-Octave integration. Our library design achieves a significant gain from the GPU implementation: in this version of the library we report gains between 2 and 24 with respect to the normal CPU-Octave computation. Now that the design responds positively, the work will continue by finalizing version 1 of the library, implementing more operators and extending the supported types to matrices.

6. References
[1] http://www.top500.org
[2] http://www.gnu.org/software/octave/index.html
[3] http://www.mathworks.com/products/matlab/
[4] http://www.khronos.org/opencl/