A GPU based brute force de-dispersion algorithm for LOFAR

A GPU based brute force de-dispersion algorithm for LOFAR. W. Armour, M. Giles, A. Karastergiou and C. Williams, University of Oxford. 8th May 2012.

GPUs: Why use GPUs?

Latest Kepler/Fermi based cards:
- GeForce GTX 690: 2x GK104 GPUs, peak performance ~6 TFLOPS, 18.7 GFLOPS/watt, price point ~ 1K.
- M2090: 1x GF110 GPU, 512 cores, peak performance 1.3 TFLOPS, 6.8 GFLOPS/watt, price point ~ 3K.

Influence and take-up:
- TOP 500: 3 out of the top 5 systems utilise Fermi/Tesla GPUs.
- Tianhe-1A: 2.5 petaflops, based on 14,336 Xeon CPUs and 7,168 M2050 GPUs. To achieve the same performance using only CPUs would take ~50,000 CPUs, 2x the floor space and 3x the power (estimates made by NVIDIA). Uses Lustre :-s

The MOTIVATE project. MOTIVATE is a pathfinder project that investigates the latest many-core technologies with the aim of delivering energy and cost efficiency in the area of radio astronomy HPC. MOTIVATE stands for Many-cOre Technology Investigating Value, Application, deployment and Efficiency. The project is funded by the Oxford-Martin School through the Institute for the Future of Computing. http://www.oerc.ox.ac.uk/research/many-core

Radio Astronomy De-dispersion

Experimental data. Most of the measured signals live in the noise of the apparatus; hence the frequency channels have to be folded.

Brute force algorithm. Every trial DM is calculated to see whether a signal is present: in a blind search for a signal, many different dispersion measures are evaluated, so many data points in the (f, t) domain are used multiple times by the different dispersion trials. This data reuse is what the GPU algorithm exploits.
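To make the data-reuse point concrete, here is a minimal, hypothetical reference implementation of brute-force de-dispersion (a sketch, not the authors' code): every trial DM shifts and sums the same (f, t) samples, so each input point is read once per trial DM. The function names, the channel-major array layout and the use of the standard cold-plasma dispersion delay for the shift are assumptions made for illustration.

```cpp
// Hypothetical reference loop for brute-force de-dispersion (illustrative only).
#include <cstddef>
#include <vector>

// Dispersion delay, in samples, of the channel at 'freq_mhz' relative to the
// top frequency 'f_top_mhz', for trial dispersion measure 'dm' (pc cm^-3).
static int delay_in_samples(float dm, float freq_mhz, float f_top_mhz, float t_samp_s)
{
    const float kdm = 4.148808e3f;  // dispersion constant in MHz^2 pc^-1 cm^3 s
    float delay_s = kdm * dm * (1.0f / (freq_mhz * freq_mhz)
                              - 1.0f / (f_top_mhz * f_top_mhz));
    return static_cast<int>(delay_s / t_samp_s + 0.5f);
}

// For every trial DM, shift each frequency channel by its delay and sum over
// channels. The same (f, t) samples are touched once per trial DM, which is
// exactly the data reuse the GPU algorithm exploits.
void dedisperse_brute_force(const std::vector<std::vector<float>>& input,  // [nchan][nsamp]
                            const std::vector<float>& freqs_mhz,           // per-channel frequency
                            const std::vector<float>& trial_dms,
                            float t_samp_s,
                            std::vector<std::vector<float>>& output)       // [ndm][nsamp]
{
    const std::size_t nchan = input.size();
    const std::size_t nsamp = input[0].size();
    const float f_top = freqs_mhz[0];  // assumes channel 0 is the highest frequency
    output.assign(trial_dms.size(), std::vector<float>(nsamp, 0.0f));

    for (std::size_t d = 0; d < trial_dms.size(); ++d)
        for (std::size_t c = 0; c < nchan; ++c) {
            const std::size_t shift =
                delay_in_samples(trial_dms[d], freqs_mhz[c], f_top, t_samp_s);
            for (std::size_t t = 0; t + shift < nsamp; ++t)
                output[d][t] += input[c][t + shift];
        }
}
```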

A new GPU brute force algorithm. Three key features:
- Each thread processes a variable number of dispersion measures in local registers.
- Exploit the L1 cache / shared memory present on the Fermi architecture.
- Optimise the region of dispersion space being processed (thread block size).
Web page: http://www.oerc.ox.ac.uk/research/wes

Processing several DMs per thread. The new algorithm works in DM-t space rather than frequency-time space: each thread block covers a region of DM-t space, and each thread processes a varying number of time samples for a constant dispersion measure. This ensures the frequency-time data is loaded into the fast L1 cache, and keeping the accumulators in registers ensures very quick memory access. [Figure: the region of DM-t space processed by one thread block.]

Exploiting the L1 cache / shared memory. Each dispersion measure for a given frequency channel needs a shifted time value (a constant DM with varying time). Incrementing all of the registers at every frequency step ensures high reuse of the stored frequency-time data in the L1 cache. [Figure: shifted time samples in the (f, t) plane.]
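The sketch below is a minimal CUDA illustration of this register/L1 scheme, assuming a pre-computed shift table and a channel-major input layout; it is not the authors' kernel, and the kernel name, the DMS_PER_THREAD constant and the parameter names are all hypothetical.

```cuda
// Illustrative CUDA kernel: each thread accumulates several trial DMs in
// registers while sweeping the frequency channels, so neighbouring DMs keep
// re-reading the same (f, t) samples from the L1 cache.
#define DMS_PER_THREAD 8   // trial DMs held in registers by each thread (assumption)

// d_shift[dm * nchan + c] is the pre-computed delay (in samples) of channel c
// for trial DM 'dm'; d_input is the (f, t) data stored channel-major.
__global__ void dedisperse_registers(const float* __restrict__ d_input,
                                     const int*   __restrict__ d_shift,
                                     float* __restrict__ d_output,
                                     int nchan, int nsamp, int ndm)
{
    // Each block covers a tile of (t, DM) space; each thread owns one time
    // sample and DMS_PER_THREAD consecutive trial DMs.
    int t   = blockIdx.x * blockDim.x + threadIdx.x;
    int dm0 = (blockIdx.y * blockDim.y + threadIdx.y) * DMS_PER_THREAD;
    if (t >= nsamp || dm0 >= ndm) return;

    float acc[DMS_PER_THREAD];                 // per-DM accumulators stay in registers
    #pragma unroll
    for (int i = 0; i < DMS_PER_THREAD; ++i) acc[i] = 0.0f;

    for (int c = 0; c < nchan; ++c) {          // step through the frequency channels
        const float* chan = d_input + (size_t)c * nsamp;
        #pragma unroll
        for (int i = 0; i < DMS_PER_THREAD; ++i) {
            // Neighbouring DMs need almost the same shifted samples, so these
            // loads keep hitting lines already resident in the L1 cache.
            int s = t + d_shift[(dm0 + i) * nchan + c];
            if (s < nsamp) acc[i] += chan[s];
        }
    }

    #pragma unroll
    for (int i = 0; i < DMS_PER_THREAD; ++i)
        if (dm0 + i < ndm) d_output[(size_t)(dm0 + i) * nsamp + t] = acc[i];
}
```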

Optimising the parameterisation. The GPU block size of the new algorithm can take any size that is an integer multiple of the size of a data chunk. [Figure: alternative tilings of the DM-t plane by thread blocks.]
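Because the best block shape is not obvious in advance, one way to optimise this parameterisation is a simple empirical sweep over candidate block sizes, timing each launch. The sketch below builds on the hypothetical kernel above (same names, same DMS_PER_THREAD) and is not the authors' tuning code; the candidate shapes are arbitrary examples.

```cuda
// Illustrative block-size sweep for the hypothetical kernel above.
#include <cstdio>
#include <cuda_runtime.h>

void tune_block_size(const float* d_input, const int* d_shift, float* d_output,
                     int nchan, int nsamp, int ndm)
{
    const dim3 candidates[] = { dim3(32, 2), dim3(32, 4), dim3(64, 2),
                                dim3(64, 4), dim3(128, 2) };
    float best_ms = 1e30f;
    dim3  best    = candidates[0];

    for (const dim3& block : candidates) {
        dim3 grid((nsamp + block.x - 1) / block.x,
                  (ndm / DMS_PER_THREAD + block.y - 1) / block.y);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        dedisperse_registers<<<grid, block>>>(d_input, d_shift, d_output,
                                              nchan, nsamp, ndm);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block (%u, %u): %.3f ms\n", block.x, block.y, ms);
        if (ms < best_ms) { best_ms = ms; best = block; }

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    printf("best block: (%u, %u)\n", best.x, best.y);
}
```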

A CPU brute force algorithm. Four key features:
- Aim to achieve full cache-line utilization.
- Exploit the large (~375 GB/s) LLC bandwidth present on the new Intel Sandy Bridge CPUs.
- Use Intel intrinsics to exploit the 16 AVX/SSE (YMM/XMM) SIMD registers (don't rely on the Intel auto-vectorizer!).
- Use OpenMP to share work across the CPU cores.
A minimal sketch combining the last two points follows this list.
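The sketch below is a minimal illustration of this approach, assuming the same channel-major data layout and pre-computed shift table as the earlier reference loop; it is not the authors' implementation. OpenMP spreads trial DMs across cores, and AVX intrinsics accumulate eight time samples per 256-bit YMM register instead of relying on the auto-vectorizer.

```cpp
// Illustrative OpenMP + AVX de-dispersion loop (hypothetical, not the authors' code).
#include <immintrin.h>
#include <cstddef>

void dedisperse_avx(const float* input,     // [nchan * nsamp], channel-major
                    const int*   shift,     // [ndm * nchan] delays in samples
                    float*       output,    // [ndm * nsamp_out]
                    int nchan, int nsamp, int nsamp_out, int ndm)
{
    // nsamp_out is assumed to be nsamp minus the largest delay, rounded down
    // to a multiple of 8, so every shifted load below stays in bounds.
    #pragma omp parallel for schedule(static)
    for (int d = 0; d < ndm; ++d) {                   // one trial DM per OpenMP task
        float* out = output + (std::size_t)d * nsamp_out;
        for (int t = 0; t < nsamp_out; t += 8) {
            __m256 acc = _mm256_setzero_ps();         // 8 time samples per YMM register
            for (int c = 0; c < nchan; ++c) {
                int s = t + shift[d * nchan + c];
                // Unaligned load of 8 shifted samples from this frequency channel.
                __m256 x = _mm256_loadu_ps(input + (std::size_t)c * nsamp + s);
                acc = _mm256_add_ps(acc, x);
            }
            _mm256_storeu_ps(out + t, acc);
        }
    }
}
```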

Brute force results

Results [chart]: comparison of different computing technologies, plotting multiples of real time against the number of channels (number of channels = total number of DMs, with a maximum DM of 200). Series: optimised MDSM kernel (NVIDIA C2070), L1-cached GPU algorithm (NVIDIA C2070), shared-memory GPU algorithm (NVIDIA C2070), Intel i7 2600K AVX (4 cores, 4.2 GHz), Intel Xeon X7550 SSE (x4 = 32 cores, 2.7 GHz).

Results

Results [chart]: simulated comparison of Fermi to Kepler GPUs, plotting multiples of real time against the number of channels (number of channels = total number of DMs, with a maximum DM of 100). Series: shared-memory GPU algorithm on the C2070 and on a simulated Kepler GTX 690.

Results [chart]: number of beams that can be processed by one (Fermi) GPU as a function of the number of de-dispersions performed (number of de-dispersions = number of channels). Series: L1-cache-based algorithm (no maximum DM limit) and shared-memory algorithm (maximum DM = 100).

Results [chart]: number of beams that can be processed by one Kepler GTX 690 as a function of the number of de-dispersions performed (number of de-dispersions = number of channels; simulated results for the GTX 690 Kepler GPU). Series: L1-cache-based algorithm (no maximum DM limit) and shared-memory algorithm (maximum DM = 100).

Results [chart]: adding more CPU cores doesn't help and is expensive! Time in ms plotted against the number of cores used.

Conclusions and Future Work

Conclusions and Future Work [chart]: number of Kepler GPUs needed to process a 127-beam tied-array, plotted against the number of tied-arrays (127 pencils with 2000 channels each). Series: shared-memory algorithm on Kepler and L1-cache algorithm on Kepler.

Conclusions and future work:
- GPU wins hands-down, at the moment (and for the foreseeable future)!
- The shared-memory algorithm achieves between 60-70% of peak performance.
- AVX puts up a good fight.
- Watch out for Intel's MIC (Many Integrated Core) chip: 32 in-order cores, 4 threads per core, 512-bit SIMD units running on a 1024-bit ring bus.
- OpenCL algorithm, Dan Curran / Simon McIntosh-Smith (Bristol): initial results are currently 2x slower than the NVIDIA CUDA code.
- Is a GPU based back-end to the Blue Gene/P at the Central Processing (CEP) facility feasible?

Acknowledgments and collaborators
GPU de-dispersion: http://www.oerc.ox.ac.uk/research/wes
ARTEMIS: http://www.oerc.ox.ac.uk/research/artemis
University of Malta: Alessio Magro (MDSM).
University of Oxford: Mike Giles (Maths) CUDA, GPU algorithms; Aris Karastergiou (Physics) ARTEMIS, astrophysics, experimental work; Kimon Zagkouris (Physics) astrophysics, experimental work; Chris Williams (OeRC) RFI clipper, data pipeline; Ben Mort (OeRC) data pipeline, Pelican; Fred Dulwich (OeRC) data pipeline, Pelican; Stef Salvini (OeRC) data pipeline, Pelican; Steve Roberts (Engineering) signal processing/detection algorithms.
University of Bristol: Dan Curran (Electrical Engineering) OpenCL work; Simon McIntosh-Smith (Electrical Engineering) OpenCL work.