GPU Programming with Ateji PX
June 8th, 2010
Ateji. All rights reserved.

Goals

- Write once, run everywhere, even on a GPU.
- Target heterogeneous architectures from Java: GPU accelerators, the OpenCL standard.
- Get a reasonable out-of-the-box performance improvement on mainstream devices.
- Provide enhanced constructs that allow an extra speed-up for HPC workloads.

The Road to GPU

The Ateji PX language already provides constructs for expressing many parallel patterns, including the kind of data-parallel programs that fit well on a GPU. Ateji PX constructs map onto OpenCL concepts (CUDA equivalents in parentheses):

1. Parallel branch -> OpenCL kernel (thread)
2. Branch quantification -> N-dimensional range (grid)
3. Branch scopes -> private, local, global memories (local, shared, global)

    final int K = 10;
    [
        (#OpenCL())
        final int[] result = new int[n];
        [
            (int i : result.length, #Kernel("C2070"))
            int x = i * i + K;
            result[i] = x;
        ]
        display(result);
    ]

A one-dimensional kernel executed N times on a C2070 GPU.

Architecture Overview

- The compiler produces Java and OpenCL source files.
- Annotations allow the user to explicitly select a particular OpenCL device and pass parameters to the compiler:
    #Kernel("AMD-9350", "-cl-mad-enable")
    #Kernel()   // first GPU, default compiler settings
- Multi-GPU configurations are supported.
- The runtime uses the OpenCL driver to:
  - compile programs and cache binaries
  - transfer memory
  - set kernel parameters
  - launch kernels
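
To make the two annotation forms concrete, here is a minimal sketch in the style of the code on the previous slide; the arrays, sizes, and computation are hypothetical, and only the #Kernel arguments come from this slide:

    final int SIZE = 1024;                 // hypothetical problem size
    final float[] in  = new float[SIZE];   // hypothetical input
    final float[] out = new float[SIZE];   // hypothetical output
    [
        (#OpenCL())
        [
            // target a named device and pass an OpenCL compiler flag
            (int i : SIZE, #Kernel("AMD-9350", "-cl-mad-enable"))
            out[i] = in[i] * 2.0f;
        ]
    ]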

Supported Java Subset

Functions:
- Local and static, no recursion, no overloading
- Cloning to support polymorphism on memory-space qualifiers
- +, -, *, /, %, java.lang.Math.*
- System.out.format() (OpenCL extension)

Types:
- Primitive types
- 1-D arrays of primitive types

Standard control-flow statements:
- No labels (not supported by some OpenCL compilers)
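
As a sketch of what fits in this subset, a kernel that uses only primitive types, a 1-D primitive array, and java.lang.Math; the array name and size are hypothetical (and double precision assumes a device with an fp64 extension, see the limitations slide):

    final int SIZE = 4096;                     // hypothetical size
    final double[] samples = new double[SIZE];
    [
        (#OpenCL())
        [
            (int i : SIZE, #Kernel())
            // only primitives, a 1-D array and java.lang.Math: inside the subset
            double x = (double) i / SIZE;
            samples[i] = Math.sqrt(x) + Math.sin(x);
        ]
    ]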

OpenCL Platforms

Tested platforms:
- NVIDIA CUDA (GPU)
- AMD Accelerated Parallel Processing (CPU & GPU)
- Intel OpenCL (CPU)

Using OpenCL to target multi-core CPUs is a realistic path:
- The generated OpenCL code exhibits more vectorization opportunities than plain Java code.
- The Intel compiler, based on the Threading Building Blocks library, could open the door to many-core architectures.

The diversity of OpenCL devices forces one to write (performance-)portable code.

N-Dimensional Range (1/2)

- Defines the index space: one work-item (CUDA: thread) per point of the index space.
- Optionally organized into explicit work-groups (CUDA: blocks) for coarse-grained decomposition.
- The main parameter of clEnqueueNDRangeKernel().
- Expressed using the existing branch quantifiers of the Ateji PX language:
    ;               // unique instance
    (int i : N) ;   // N instances
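
For instance, a quantified branch that spawns one work-item per index point (a sketch following the quantifier syntax above; the array and its size are hypothetical):

    final int N = 1024;             // hypothetical index-space size
    final int[] a = new int[N];
    [
        (#OpenCL())
        [
            (int i : N, #Kernel())  // one work-item per point of the index space
            a[i] = 2 * i;
        ]
    ]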

N-Dimensional Range (2/2)

2-D index space:
    (int x : X, int y : Y, #Kernel())
NDRange = [X,Y] / [-,-]: X*Y work-items, implicit work-groups (not always optimal).

1-D index space with explicit work-groups:
    (int n : N, #Kernel())   // N work-groups
    (int m : M)              // M work-items per work-group
NDRange = [N*M] / [M].

The generated code automatically references the get_global_id(x) and get_local_id(x) primitives (CUDA: blockIdx.x * blockDim.x + threadIdx.x, and threadIdx.x).
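
Putting the 1-D form with explicit work-groups into a complete block (a sketch assuming the nested quantifiers shown above; N, M, and the array are hypothetical):

    final int N = 64;                   // hypothetical number of work-groups
    final int M = 256;                  // hypothetical work-items per work-group
    final int[] data = new int[N * M];
    [
        (#OpenCL())
        [
            (int n : N, #Kernel())      // N work-groups
            [
                (int m : M)             // M work-items per work-group
                // the flat index n*M + m plays the role of get_global_id(0)
                data[n * M + m] = n * M + m;
            ]
        ]
    ]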

Memory Hierarchy

Mapped onto Java scopes; no extra qualifier is required. Local and private variables are defined in nested parallel blocks:

    [
        (int n : N, #Kernel())
        // local variable shared by all work-items
        // pertaining to the same work-group
        int[] t = new int[n];
        [
            (int m : M)
            // private variables
            int x = A[N*n + m];
        ]
    ]

Other variables are allocated in the global address space. Constant data is generated as globally scoped OpenCL constants. Images are not yet supported.

Buffer Management

Automatic:
- The compiler automatically creates memory buffers with the appropriate flags (READ_ONLY, WRITE_ONLY, READ_WRITE), releases buffers, and enqueues read/write commands.
- Transparent to the user, but transfers may not start at optimal locations.

Manual:
- The user manually creates channels to/from the OpenCL device to transfer memory explicitly.
- Memory transfers can then be optimized.

Performance

Mandelbrot set (zoom in/out, 128 frames):
- Massively parallel
- No manual optimization
- Small memory buffers (1024x1024 pixels + color palette)

    Execution                           FPS   Speed-up vs. serial   Speed-up vs. multi-core
    Serial                              ~2    -                     -
    Multi-core (Intel i5, 2 cores HT)   ~7    3.5                   -
    GPU (NVIDIA GT330M)                 ~39   19.5                  5.5

The speed-ups are FPS ratios (e.g. 39/2 = 19.5 over serial, 39/7 ≈ 5.5 over multi-core).

Issues & Limitations

Language:
- No direct access (yet) to the OpenCL vector types that some architectures could benefit from.

NVIDIA:
- Public drivers still support only OpenCL 1.0.
- New CUDA 4.0 features are not (yet) available in OpenCL.

Non-standard platform extensions:
- cl_khr_fp64 vs. cl_amd_fp64
- cl_amd_printf vs. cl_intel_printf
- All OpenCL implementations are slowly converging.

Profilers:
- Not Java-aware
- Cumbersome configuration
- No integration into Java IDEs

Ongoing work

OpenGPU: RNA folding
- Intrinsically parallel: a window sliding over an RNA sequence...
- ...but the current implementation relies heavily on shared state to minimize the number of candidates.

European Space Agency, Gaia project: characterizing the photometric and spectral variability of stars
- Intrinsically parallel: light sources are independent.
- The current bottleneck is memory (RAM & throughput).

Questions