John W. Romein, Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, the Netherlands
1 Signal Processing on GPUs for Radio Telescopes John W. Romein Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands 1
2 Overview radio telescopes six radio telescope algorithms on GPUs part 1: real-time processing of telescope data 1) FIR filter 2) FFT 3) bandpass correction 4) delay compensation 5) correlator part 2: creation of sky images 6) gridding (new GPU algorithm!) 2
3 Intro: Radio Telescopes 3
4 LOFAR Radio Telescope largest low-frequency telescope distributed sensor network ~85,000 sensors 4
5 LOFAR: A Software Telescope different observation modes require flexibility standard imaging pulsar survey known pulsar epoch of reionization transients ultra-high energy particles need supercomputer real time 5
6 LOFAR Data Processing Blue Gene/P supercomputer 6
7 Square Kilometre Array future radio telescope huge processing requirements TFLOPS: LOFAR (2012): ~30; 10% SKA (2016): ~30,000; full SKA (2020): ~1,000,000 7
8 Part 1: Real-Time Processing of Telescope Data 8
9 Rationale 2005: LOFAR needed supercomputer 2012: can GPUs do this work? 9
10 Blue Gene/P Algorithms on GPUs BG/P software complex several processing pipelines try imaging pipeline on GPU computational kernels only other pipelines + control software: later 10
11 CUDA or OpenCL? OpenCL advantages vendor independent runtime compilation: easier programming (parameters constant) float2 samples[nr_stations][nr_channels][nr_times][nr_polarizations]; OpenCL disadvantages less mature e.g., poor support for FFTs cannot use all GPU features go for OpenCL 11
12 Poly-Phase Filter (PPF) bank splits frequency band into channels like prism time resolution freq. resolution 12
13 Poly-Phase Filter (PPF) bank FIR filter + FFT 13
14 1) Finite Impulse Response (FIR) Filter history & weights (in registers) no physical shift many FMAs operational intensity = 32 ops / 5 bytes 14
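The FIR stage above can be sketched in plain Python. This is an illustrative scalar version, not the actual OpenCL kernel; on the GPU the history and weights stay in registers with no physical shift, only rotating indices. The name `fir_filter` is made up for this sketch:

```python
def fir_filter(samples, weights):
    """Apply a FIR filter; `weights` holds the taps (newest sample first).
    Each output is a weighted sum of the current and previous samples."""
    ntaps = len(weights)
    history = [0.0] * ntaps              # delay line (registers on the GPU)
    out = []
    for x in samples:
        history.insert(0, x)             # newest sample in front
        history.pop()                    # oldest sample falls off
        out.append(sum(w * h for w, h in zip(weights, history)))  # FMAs
    return out
```

Feeding an impulse through the filter returns the taps themselves, which is a quick way to check the weight ordering.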
15 Performance Measurements maximum foreseen LOFAR load 77 stations KHz dual pol 2x8 bits/sample 240 Gb/s GTX 580, GTX 680, HD 6970, HD 7970 need Tesla quality for real use 15
16 FIR Filter Performance GTX 580 performs best restricted by memory bandwidth 16
17 2) FFT 1D complex-to-complex, 256 points tweaked Apple FFT library 64 work items: 1 FFT 256 work items: 4 FFTs 17
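For reference, the recursion the library implements can be sketched as a minimal radix-2 Cooley-Tukey transform. The real code is a tweaked Apple OpenCL FFT library; this pure-Python `fft` is only illustrative:

```python
import cmath

def fft(x):
    """Minimal radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return x[:]
    even, odd = fft(x[0::2]), fft(x[1::2])
    # twiddle factors applied to the odd half
    tw = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + \
           [even[k] - tw[k] for k in range(n // 2)]
```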
18 FFT Performance N=256, tweaked library FLOP count: 5 n log(n) 18
19 Clock Correction corrects cable length errors merge with next step (phase delay) 19
20 3) Delay Compensation (a.k.a. Tracking) track observed source delay telescope data delay changes due to earth rotation shift samples remainder: rotate phase (= cmul) 18 FLOPs / 32 bytes 20
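The delay step can be sketched as follows (hypothetical `delay_compensate` helper, written for this sketch): the coarse delay shifts whole samples, and the sub-sample remainder becomes one phase rotation, i.e. a single complex multiplication per sample:

```python
import cmath
import math

def delay_compensate(samples, delay, freq, sample_rate):
    """Coarse delay: shift by whole samples; residual delay: rotate phase.
    `samples` are complex; `delay` in seconds, `freq` in Hz."""
    whole = round(delay * sample_rate)            # shift in whole samples
    residual = delay - whole / sample_rate        # remaining sub-sample delay
    shifted = samples[whole:] + [0j] * whole      # shift, zero-pad the tail
    phasor = cmath.exp(-2j * math.pi * freq * residual)
    return [s * phasor for s in shifted]          # one cmul per sample
```

In the real pipeline the phasor varies slowly with time as the Earth rotates, so it is interpolated rather than recomputed per sample.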
21 4) BandPass Correction powers in channels unequal artifact from station processing multiply by channel-dependent weights 1 FLOP / 8 bytes 21
22 Transpose reorder data for next step (correlator) through local memory see talk S
23 Combined Kernel combine: delay compensation bandpass correction transpose reduces global memory accesses 18 FLOPs / 32 bytes 23
24 Delay / Band Pass Performance poor operational intensity 156 GB/s! 24
25 5) Correlator see previous talk (S0347) multiply samples from each pair of stations integrate ~1s 25
26 Correlator Implementation global memory local memory 1 thread: 2x2 stations (dual pol) 4 float4 loads 64 FMAs 32 accumulator registers one thread 26
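Stripped of the 2x2-station tiling and dual polarization, the correlation reduces to a multiply-accumulate over all station pairs; a pure-Python sketch (hypothetical `correlate`, single polarization):

```python
def correlate(samples):
    """samples[station][time] (complex). For each station pair (i, j), i <= j,
    accumulate V[i][j] = sum_t s_i(t) * conj(s_j(t)) over the integration time."""
    n = len(samples)
    vis = {}
    for i in range(n):
        for j in range(i, n):            # includes auto-correlations
            vis[(i, j)] = sum(a * b.conjugate()
                              for a, b in zip(samples[i], samples[j]))
    return vis
```

The GPU kernel amortizes loads by letting one thread handle a 2x2 tile of stations in both polarizations, turning 4 float4 loads into 64 FMAs.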
27 Correlator #Threads max 1024 threads [plot: #threads vs #stations for GTX and HD GPUs] beyond 58 stations, HD 6970 / HD 7970 need multiple passes! (LOFAR: 77 stations) 27
28 Correlator Performance HD 7970: multiple passes register usage low occupancy 28
29 Combined Pipeline full pipeline 2 host threads own queue, own buffers overlap I/O & computations easy model! [diagram: per host thread, H-to-D copy, FIR, FFT, Delay&BandPass, Correlate, D-to-H copy] 29
30 Overall Performance Imaging Pipeline #GPUs needed for LOFAR GTX 680 (marginally) fastest ~13 GPUs HD 7970 real improvement over HD 6970 30
31 Performance Breakdown GTX 580 dominated by correlator correlator: compute bound others: memory I/O bound PCIe I/O overlapped 31
32 Performance Breakdown GTX 680 ~20% faster than GTX 580 32
33 Performance Breakdown HD 7970 multiple passes correlator visible poor overlap I/O 33
34 Performance Breakdown HD 6970 slower 34
35 Are GPUs Efficient? % of FPU peak performance: FIR filter: GTX 680 ~21% vs Blue Gene/P 85%; FFT: ~17% vs 44%; Delay/BandPass: ~2.6% vs 26%; Correlator: ~35% vs 96% Blue Gene/P: better compute-I/O balance & integrated network few tens of GPUs as powerful as 2 BG/P racks 35
36 Feasible? imaging pipeline ~13 GTX 680s (~8 Tesla K10s) + RFI detection? other pipelines? 240 Gb/s FDR InfiniBand transpose 36
37 Future Optimizations combine more kernels fewer passes over global memory FFT: difficult invoke FFT from GPU kernel, not CPU 37
38 Conclusions Part 1 OpenCL ok FFT support = minimal GTX 680 (Kepler) marginally faster than HD 7970 (GCN) <35% of FPU peak: memory I/O bottleneck heavy use of FMA instructions LOFAR imaging pipeline on GPUs = feasible 38
39 Part 2: Creation of Sky Images 39
40 Context after observation: remove RFI calibrate create sky image calibration/imaging loop possibly repeated 40
41 Creating a Sky Image convolve correlations and add to grid 2D FFT sky image 41
42 Gridding corr conv (~100x100) convolve each correlation and add it to the grid (~4096x4096), for all correlations 42
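A naive sketch of this loop (illustrative names; real gridders use complex samples and oversampled, w-dependent convolution matrices) makes both costs visible: many FLOPs, and one memory update per convolution-matrix entry per correlation:

```python
def grid_correlations(grid, correlations, conv):
    """Naive gridding: add each correlation, weighted by the convolution
    matrix, onto the grid at its integer (u, v) position.
    correlations: list of (value, (u, v)) pairs."""
    for value, (u, v) in correlations:
        for dv, row in enumerate(conv):
            for du, c in enumerate(row):
                grid[v + dv][u + du] += value * c   # slow memory update
    return grid
```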
43 Two Problems corr conv (~100x100) 1. lots of FLOPS grid (~4096x4096) 2. add to memory: slow! 43
44 Two Solutions corr conv (~100x100) 1. lots of FLOPS use GPUs grid (~4096x4096) 2. add to memory: slow! avoid 44
45 This Is A Hard Problem literature: 4 other GPU gridders estimated performance on GTX 680, compensated for faster hardware (bandwidth difference + 50%) [plot: GFLOPS / giga-pixel-updates-per-second vs conv. matrix size, 16x16 to 128x128] 45
46 This Is A Hard Problem 1) MWA (Edgar et al. [CPC'11]): searches correlations 2) Cell BE (Varbanescu [PhD '10]): private grid per block in local store 3) van Amesfoort et al. [CF'09]: adds directly to grid in memory 4) Humphreys & Cornwell [SKA memo 132, '11]: very small grids [plot: GFLOPS vs conv. matrix size, 16x16 to 128x128] 46
47 This Is A Hard Problem ~3% of FPU peak performance! SKA: exascale [plot: GFLOPS / giga-pixel-updates-per-second vs conv. matrix size, 16x16 to 128x128] 47
48 W-Projection Gridding corr conv correlation has associated (u,v,w) coords (u,v) are not exact grid points use different convolution matrices, depending on frac(u), frac(v), w choose the most appropriate one, add at (int(u), int(v)) grid 48
49 Where Is The Data? corr conv (~100x100) grid: device memory conv. matrices: texture correlations + (u,v,w) coords: shared (local) memory grid (~4096x4096) 49
50 Placement Movement f corr conv t per baseline: grid (u,v,w) changes slowly grid locality 50
51 Use Locality corr reduce #memory accesses X: one thread accumulate additions in register until conv. matrix slides off conv grid 51
52 But How??? corr conv 1 thread / grid point grid which correlations contribute? severe load imbalance 52
53 An Unintuitive Approach corr conv grid conceptual blocks of conv. matrix size 53
54 An Unintuitive Approach corr 1 thread monitors all X at any time: 1 X covers conv. matrix!!! conv grid 54
55 An Unintuitive Approach corr conv thread computes current: grid X grid point X conv. matrix entry 55
56 An Unintuitive Approach corr conv (u,v) coords change grid 56
57 An Unintuitive Approach corr conv (u,v) coords change more grid 57
58 An Unintuitive Approach corr conv grid (atomically) adds data if switching to another X 58
59 An Unintuitive Approach corr conv grid #threads = block size too many threads do in parts 59
60 (Dis)Advantages corr conv grid some overhead; < 1% of grid-point updates go to memory 60
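The approach of the preceding slides can be sketched sequentially: one "thread" per offset inside the convolution matrix; for each correlation that thread covers the unique grid point in the convolution footprint congruent to its offset, accumulating in a register and flushing (on the GPU: an atomic add) only when that grid point changes. This is a simplified, hedged reconstruction in Python, not the actual CUDA/OpenCL kernel:

```python
def grid_with_locality(grid, correlations, conv):
    """Locality-exploiting gridder: thread (du, dv) owns, per correlation,
    the grid point in the conv footprint with coords == (du, dv) mod size.
    While (u, v) changes slowly, that grid point stays fixed, so additions
    accumulate in a register instead of hitting memory."""
    size = len(conv)
    for dv in range(size):
        for du in range(size):              # on the GPU: one thread each
            acc, cur = 0.0, None            # register accumulator, grid point
            for value, (u, v) in correlations:
                gx = u + (du - u) % size    # this thread's grid point
                gy = v + (dv - v) % size
                if (gx, gy) != cur:         # conv matrix slid off: flush
                    if cur is not None:
                        grid[cur[1]][cur[0]] += acc   # GPU: atomic add
                    acc, cur = 0.0, (gx, gy)
                acc += value * conv[gy - v][gx - u]
            if cur is not None:
                grid[cur[1]][cur[0]] += acc  # final flush
    return grid
```

For any input this produces the same grid as the naive gridder; the win is that consecutive correlations with nearby (u, v) collapse into one memory update.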
61 Performance Measurements 61
62 Performance Tests Setup conv. matrix size: 256x256; oversampling: 8x; grid size: 2048x2048; observation time: 6 h; (u,v,w) coordinates from a real 6-hour LOFAR observation 62
63 GTX 680 Performance (CUDA) 25% of peak FPU overhead: index computations most additions stay in registers; only 0.23%-0.55% become memory updates atomic add = 26% of total run time! occupancy: texture hit rate: >0.872 [plot: GFLOPS / giga-pixel-updates-per-second vs conv. matrix size, 16x16 to 128x128] 63
64 GTX 680 Performance (OpenCL) OpenCL slower than CUDA no atomic FP add! use atomic cmpxchg OpenCL 1.1: no 1D images (added in 1.2) 2D image: slower [plot: GTX 680 CUDA vs OpenCL, GFLOPS vs conv. matrix size] 64
65 HD 7970 Performance (OpenCL) medium & large conv. sizes: outperforms GTX 680 (~25% more bandwidth, FPU, power) small conv. sizes: poor computation-I/O overlap map host memory into device [plot: GTX 680 (CUDA), GTX 680 (OpenCL), HD 7970; GFLOPS vs conv. matrix size] 65
66 2 x Xeon E Performance (C++/AVX) C++ & AVX vector intrinsics adds directly to grid relies on L1 cache works well on CPU; insufficient cache for GPUs 48-79% of peak FPU [plot: GTX 680 (CUDA/OpenCL), HD 7970, 2x Xeon; GFLOPS vs conv. matrix size] 66
67 Multi-GPU Scaling eight Nvidia GTX 580s 131,072 threads! scales well [plot: GFLOPS vs nr. GPUs (1-8) for conv. sizes 16x16, 64x64, 256x256] 67
68 Green Computing up to 1.94 GFLOP/W (with previous-generation hardware!) [plots: power efficiency (GFLOP/W) and power consumption (kW) vs nr. GPUs for conv. sizes 16x16, 64x64, 256x256] 68
69 Compared To Other GPU Gridders new method ~10x faster [plot: GFLOPS vs conv. matrix size; new method vs 1) MWA (Edgar et al. [CPC'11]), 2) Cell BE (Varbanescu [PhD '10]), 3) van Amesfoort et al. [CF'09], 4) Humphreys & Cornwell [SKA memo 132, '11]] 69
70 See Also An Efficient Work-Distribution Strategy for Gridding Radio-Telescope Data on GPUs, John W. Romein, ACM International Conference on Supercomputing (ICS'12), June 25-29, 2012, Venice, Italy 70
71 Future Work LOFAR gridder combine with A-projection time-dependent conv. function compute on GPU 71
72 Conclusions Part 2 efficient GPU gridding algorithm minimizes memory accesses OpenCL lacks atomic floating-point add ~10x faster than other gridders scales well on 8 GPUs energy efficient 72
Lecture 15: Introduction to GPU programming Lecture 15: Introduction to GPU programming p. 1 Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei
More informationTUNING CUDA APPLICATIONS FOR MAXWELL
TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2
More informationASKAP Central Processor: Design and Implementa8on
ASKAP Central Processor: Design and Implementa8on Calibra8on and Imaging Workshop 2014 Ben Humphreys ASKAP So(ware and Compu3ng Project Engineer 3rd - 7th March 2014 ASTRONOMY AND SPACE SCIENCE Australian
More informationSoftware and Performance Engineering for numerical codes on GPU clusters
Software and Performance Engineering for numerical codes on GPU clusters H. Köstler International Workshop of GPU Solutions to Multiscale Problems in Science and Engineering Harbin, China 28.7.2010 2 3
More informationPowering Real-time Radio Astronomy Signal Processing with latest GPU architectures
Powering Real-time Radio Astronomy Signal Processing with latest GPU architectures Harshavardhan Reddy Suda NCRA, India Vinay Deshpande NVIDIA, India Bharat Kumar NVIDIA, India What signals we are processing?
More informationAccelerating Molecular Modeling Applications with Graphics Processors
Accelerating Molecular Modeling Applications with Graphics Processors John Stone Theoretical and Computational Biophysics Group University of Illinois at Urbana-Champaign Research/gpu/ SIAM Conference
More informationIdentifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011
Identifying Performance Limiters Paulius Micikevicius NVIDIA August 23, 2011 Performance Optimization Process Use appropriate performance metric for each kernel For example, Gflops/s don t make sense for
More informationPeter Messmer Developer Technology Group Stan Posey HPC Industry and Applications
Peter Messmer Developer Technology Group pmessmer@nvidia.com Stan Posey HPC Industry and Applications sposey@nvidia.com U Progress Reported at This Workshop 2011 2012 CAM SE COSMO GEOS 5 CAM SE COSMO GEOS
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationAccelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX
Accelerating Octo-Tiger: Stellar Mergers on Intel Knights Landing with HPX David Pfander*, Gregor Daiß*, Dominic Marcello**, Hartmut Kaiser**, Dirk Pflüger* * University of Stuttgart ** Louisiana State
More informationA Multi-Tiered Optimization Framework for Heterogeneous Computing
A Multi-Tiered Optimization Framework for Heterogeneous Computing IEEE HPEC 2014 Alan George Professor of ECE University of Florida Herman Lam Assoc. Professor of ECE University of Florida Andrew Milluzzi
More informationA MATLAB Interface to the GPU
Introduction Results, conclusions and further work References Department of Informatics Faculty of Mathematics and Natural Sciences University of Oslo June 2007 Introduction Results, conclusions and further
More informationATS-GPU Real Time Signal Processing Software
Transfer A/D data to at high speed Up to 4 GB/s transfer rate for PCIe Gen 3 digitizer boards Supports CUDA compute capability 2.0+ Designed to work with AlazarTech PCI Express waveform digitizers Optional
More informationTR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut
TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1
More informationAdministrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.
Administrivia HW0 scores, HW1 peer-review assignments out. HW2 out, due Nov. 2. If you re having Cython trouble with HW2, let us know. Review on Wednesday: Post questions on Piazza Introduction to GPUs
More informationDESIGN AND TESTING OF GPU BASED RTC FOR TMT NFIRAOS
Florence, Italy. Adaptive May 2013 Optics for Extremely Large Telescopes III ISBN: 978-88-908876-0-4 DOI: 10.12839/AO4ELT3.13172 DESIGN AND TESTING OF GPU BASED RTC FOR TMT NFIRAOS Lianqi Wang 1,a, 1 Thirty
More information