Realtime Signal Processing on Embedded GPUs

Size: px
Start display at page:

Download "Realtime Signal Processing on Embedded GPUs"

Transcription

1 Realtime Signal Processing on Embedded s Dr. Matthias Rosenthal Armin Weiss Dr. Amin Mazloumian Institute of Embedded Systems Realtime Platforms Research Group Zurich University of Applied Sciences

2 Motivation In Comparison to FPGA / DSP Solutions: Performance Gain: 100x (e.g. Analog Devices SHARC) Fast Development Time System on Single Chip Cost Effective CPU Nvidia TX Series 2

3 Signal Processing Applications Embedded System mit Video CPU Realtime Linux CUDA OpenCL Video Audio Audio Sensor Data Processed Data 3

4 Challenges Short and Deterministic Latency Video: 16.7 ms / Frame (60 Hz) Audio: 0.33 ms / 32 Samples (@ 96 khz) High Input and Output Data Rate Video: 3.0 Gb/s (1080p@60 24 bit RGB): Audio: 5.5 Gb/s (256x7 Channels 32Bit@96 khz) 4

5 Short and Deterministic Latency Variability in Kernel Launch 99.8% as expected global void identity(float *input, float *output, int numelem) { for (int index = 0; index < numelem; index++) { output[index] = input[index]; } } Outliers numelem = % - 0.2% < 0.1% 5

6 Short and Deterministic Latency Variability in Kernel Launch Latency ~ Buffer Size Outliers 50 ms 6

7 Short and Deterministic Latency Problems How to Improve Deterministic Behavior? à Solution: Persistent CUDA Kernel Eliminate Launch Time 7

8 Short and Deterministic Latency Persistent Kernel CPU Host Application while (running) { if (new_audio_samples() == true) { send_sync_to_(); wait_for sync(); } } global void audiokernel( ) { } // Infinite loop while (true) { } wait_for_cpu_sync(); // Audio channel processing CUDA Kernel wait_for_all_threads_to_finish(); send_sync_to_cpu(); Sync with Memory Accessible by CPU and 8

9 Short and Deterministic Latency CUDA Memory Comparison Zero Copy <-> Managed Zero Copy Managed Memory TX2 TX2 CPU CPU Memory Controller Incl. SMMU DRAM 8GB Buffer Memory Controller Incl. SMMU DRAM 8GB Buffer Not usable for Persistent Kernels 9

10 Short and Deterministic Latency Problems Memory Accessible by CPU and Use Zero Copy Memory for Synchronization What about Parameters? 10

11 Short and Deterministic Latency Infinite Loop: Parameter Communication CPU Application Writes at Arbitrary Time Audio Processing Thread Local Parameter Copy Periodic Update System Memory Parameter Memory (Zero Copy) Slow Access! 11

12 Short and Deterministic Latency Conclusion Use Persistent Kernels Use Memory Accessible by CPU and Use Zero Copy Memory for Synchronization What about Parameters? Speed-up with Local Copy of Parameters How to Make CPUs Deterministic? CPU Core Isolation Custom Linux: Initramfs built with Yocto No Flash Access Anymore During Runtime Minimize Influence from other Applications 12

13 Challenges Short and Deterministic Latency Video: 16.7 ms / Frame (60 Hz) Audio: 0.33 ms / 32 Samples (@ 96 khz) High Input and Output Data Rate Video: 3.0 Gb/s (1080p@60 24 bit RGB): Audio: 5.5 Gb/s (256x7 Channels 32Bit@96 khz) 13

14 High Input / Output Data Rate Separate Buffers TX2 Signal I / O Card PCIe CPU Memory Controller Incl. SMMU DRAM 8GB I/O Buffer memcpy Buffer 14

15 High Input / Output Data Rate memcpy() Measurements memcpy() Time khz for 12h on 3 CPUs (A57) 1.E+09 # Occurrences 1.E+06 1.E % CPU Usage! 1.E Time (µs) TX1 ( Kernel v ) TX2 (Kernel v4.4.15) 15

16 High Input / Output Data Rate Shared Buffer TX2 Signal I / O Card PCIe CPU Memory Controller Incl. SMMU DRAM 8GB I/O Buffer memcpy Buffer 16

17 High Input / Output Data Rate Shared Buffer TX2 Signal I / O Card PCIe CPU Memory Controller Incl. SMMU DRAM 8GB Shared Buffer 17

18 High Input / Output Data Rate Shared Buffer Existing Solutions for Buffer Sharing Direct RDMA (Remote Direct Memory Access) 18

19 High Input / Output Data Rate Direct RDMA Desktop Discrete TX2 Integrated CPUs CPUs Bridge System Memory PCIe Controller Memory Interconnect System Memory PCIe PCIe 3rd Party Device Direct RDMA Memory 3rd Party Device 19

20 High Input / Output Data Rate Shared Buffer Existing Solutions for Buffer Sharing Direct RDMA à Not Available CudaHostRegister() 20

21 High Input / Output Data Rate CudaHostRegister() TX2 Signal I / O Card PCIe CPU Memory Controller Incl. SMMU DRAM 8GB I/O Buffer CudaHostRegister() 21

22 High Input / Output Data Rate Shared Buffer Existing Solutions for Buffer Sharing Direct RDMA à Not Available CudaHostRegister() à Not Implemented on TX2 Video4Linux2 à Userptr Mode 22

23 High Input / Output Data Rate Video4Linux - Userptr TX2 Embedded Camera Video Input CPU Memory Controller Incl. SMMU DRAM 8GB Userptr Mode Mapped Access Buffer 23

24 High Input / Output Data Rate Video4Linux - Userptr TX2 Signal I / O Card PCIe CPU Memory Controller Incl. SMMU DRAM 8GB Userptr Mode Mapped Access Buffer 24

25 High Input / Output Data Rate Shared Buffer Existing Solutions for Buffer Sharing Direct RDMA à Not Available CudaHostRegister() à Not Implemented on TX2 Video4Linux2 à Userptr Mode 25

26 Conclusion Short and Deterministic Latency à Persistent CUDA Kernel High Input / Output Data Rate à Shared Buffer I/O <-> (based on Video4Linux) Feasibility of Low-Latency Signal Processing on Audio Signal Processing on Nvidia TX2 with kHz 26

27 Questions Website: Blog: Github:

Realtime Signal Processing on Nvidia TX2 using CUDA

Realtime Signal Processing on Nvidia TX2 using CUDA Realtime Signal Processing on Nvidia TX2 using CUDA Armin Weiss Dr. Amin Mazloumian Dr. Matthias Rosenthal Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of

More information

Efficient Video Processing on Embedded GPU

Efficient Video Processing on Embedded GPU Efficient Video Processing on Embedded GPU Tobias Kammacher Armin Weiss Matthias Frei Institute of Embedded Systems High Performance Multimedia Research Group Zurich University of Applied Sciences (ZHAW)

More information

4K HEVC Video Processing with GPU Optimization on Jetson TX1

4K HEVC Video Processing with GPU Optimization on Jetson TX1 4K HEVC Video Processing with GPU Optimization on Jetson TX1 Tobias Kammacher Matthias Frei Hans Gelke Institute of Embedded Systems / High Performance Multimedia Research Group Zurich University of Applied

More information

4K Video Processing and Streaming Platform on TX1

4K Video Processing and Streaming Platform on TX1 4K Video Processing and Streaming Platform on TX1 Tobias Kammacher Dr. Matthias Rosenthal Institute of Embedded Systems / High Performance Multimedia Research Group Zurich University of Applied Sciences

More information

4K Video Processing and Streaming Platform on TX1

4K Video Processing and Streaming Platform on TX1 4K Video Processing and Streaming Platform on TX1 Tobias Kammacher Dr. Matthias Rosenthal Institute of Embedded Systems / High Performance Multimedia Research Group Zurich University of Applied Sciences

More information

Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu

Towards Automatic Heterogeneous Computing Performance Analysis. Carl Pearson Adviser: Wen-Mei Hwu Towards Automatic Heterogeneous Computing Performance Analysis Carl Pearson pearson@illinois.edu Adviser: Wen-Mei Hwu 2018 03 30 1 Outline High Performance Computing Challenges Vision CUDA Allocation and

More information

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD INTRODUCTION TO OPENCL TM A Beginner s Tutorial Udeepta Bordoloi AMD IT S A HETEROGENEOUS WORLD Heterogeneous computing The new normal CPU Many CPU s 2, 4, 8, Very many GPU processing elements 100 s Different

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

Martin Dubois, ing. Contents

Martin Dubois, ing. Contents Martin Dubois, ing Contents Without OpenNet vs With OpenNet Technical information Possible applications Artificial Intelligence Deep Packet Inspection Image and Video processing Network equipment development

More information

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 15 ECE 574 Cluster Computing Lecture 15 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 30 March 2017 HW#7 (MPI) posted. Project topics due. Update on the PAPI paper Announcements

More information

Seamless Dynamic Runtime Reconfiguration in a Software-Defined Radio

Seamless Dynamic Runtime Reconfiguration in a Software-Defined Radio Seamless Dynamic Runtime Reconfiguration in a Software-Defined Radio Michael L Dickens, J Nicholas Laneman, and Brian P Dunn WINNF 11 Europe 2011-Jun-22/24 Overview! Background on relevant SDR! Problem

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning

More information

A Real Time Controller for E-ELT

A Real Time Controller for E-ELT A Real Time Controller for E-ELT Addressing the jitter/latency constraints Maxime Lainé, Denis Perret LESIA / Observatoire de Paris Project #671662 funded by European Commission under program H2020-EU.1.2.2

More information

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research

Introduction to CUDA CME343 / ME May James Balfour [ NVIDIA Research Introduction to CUDA CME343 / ME339 18 May 2011 James Balfour [ jbalfour@nvidia.com] NVIDIA Research CUDA Programing system for machines with GPUs Programming Language Compilers Runtime Environments Drivers

More information

Accelerating Data Centers Using NVMe and CUDA

Accelerating Data Centers Using NVMe and CUDA Accelerating Data Centers Using NVMe and CUDA Stephen Bates, PhD Technical Director, CSTO, PMC-Sierra Santa Clara, CA 1 Project Donard @ PMC-Sierra Donard is a PMC CTO project that leverages NVM Express

More information

A Real Time Controller for E-ELT

A Real Time Controller for E-ELT A Real Time Controller for E-ELT Addressing the jitter/latency constraints Maxime Lainé, Denis Perret LESIA / Observatoire de Paris Project #671662 funded by European Commission under program H2020-EU.1.2.2

More information

Efficient Data Transfers

Efficient Data Transfers Efficient Data fers Slide credit: Slides adapted from David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016 PCIE Review Typical Structure of a CUDA Program Global variables declaration Function prototypes global

More information

Lecture 10. Efficient Host-Device Data Transfer

Lecture 10. Efficient Host-Device Data Transfer 1 Lecture 10 Efficient Host-Device Data fer 2 Objective To learn the important concepts involved in copying (transferring) data between host and device System Interconnect Direct Memory Access Pinned memory

More information

27 March 2018 Mikael Arguedas and Morgan Quigley

27 March 2018 Mikael Arguedas and Morgan Quigley 27 March 2018 Mikael Arguedas and Morgan Quigley Separate devices: (prototypes 0-3) Unified camera: (prototypes 4-5) Unified system: (prototypes 6+) USB3 USB Host USB3 USB2 USB3 USB Host PCIe root

More information

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh Gzip Compression Using Altera OpenCL Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh Gzip Widely-used lossless compression program Gzip = LZ77 + Huffman Big data needs fast

More information

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State

More information

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association Unified Memory Notes on GPU Data Transfers Andreas Herten, Forschungszentrum Jülich, 24 April 2017 Handout Version Overview, Outline Overview Unified Memory enables easy access to GPU development But some

More information

AT-501 Cortex-A5 System On Module Product Brief

AT-501 Cortex-A5 System On Module Product Brief AT-501 Cortex-A5 System On Module Product Brief 1. Scope The following document provides a brief description of the AT-501 System on Module (SOM) its features and ordering options. For more details please

More information

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA Sreepathi Pai October 18, 2017 URCS Outline Background Memory Code Execution Model Outline Background Memory Code Execution Model

More information

GstShark profiling: a real-life example. Michael Grüner - David Soto -

GstShark profiling: a real-life example. Michael Grüner - David Soto - GstShark profiling: a real-life example Michael Grüner - michael.gruner@ridgerun.com David Soto - david.soto@ridgerun.com Introduction Michael Grüner Technical Lead at RidgeRun Digital signal processing

More information

Porting Performance across GPUs and FPGAs

Porting Performance across GPUs and FPGAs Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Intel Xeon Phi Coprocessors

Intel Xeon Phi Coprocessors Intel Xeon Phi Coprocessors Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013 Ring Bus on Intel Xeon Phi Example with 8 cores Xeon

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

HETEROGENEOUS MEMORY MANAGEMENT. Linux Plumbers Conference Jérôme Glisse

HETEROGENEOUS MEMORY MANAGEMENT. Linux Plumbers Conference Jérôme Glisse HETEROGENEOUS MEMORY MANAGEMENT Linux Plumbers Conference 2018 Jérôme Glisse EVERYTHING IS A POINTER All data structures rely on pointers, explicitly or implicitly: Explicit in languages like C, C++,...

More information

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto

FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FCUDA: Enabling Efficient Compilation of CUDA Kernels onto FPGAs October 13, 2009 Overview Presenting: Alex Papakonstantinou, Karthik Gururaj, John Stratton, Jason Cong, Deming Chen, Wen-mei Hwu. FCUDA:

More information

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Farzad Farshchi, Qijing Huang, Heechul Yun University of Kansas, University of California, Berkeley SiFive Internship Rocket

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014

More information

GPU programming. Dr. Bernhard Kainz

GPU programming. Dr. Bernhard Kainz GPU programming Dr. Bernhard Kainz Overview About myself Motivation GPU hardware and system architecture GPU programming languages GPU programming paradigms Pitfalls and best practice Reduction and tiling

More information

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance

CSE 599 I Accelerated Computing - Programming GPUS. Memory performance CSE 599 I Accelerated Computing - Programming GPUS Memory performance GPU Teaching Kit Accelerated Computing Module 6.1 Memory Access Performance DRAM Bandwidth Objective To learn that memory bandwidth

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA PROGRAMMING MODEL Carlo Nardone Sr. Solution Architect, NVIDIA EMEA CUDA: COMMON UNIFIED DEVICE ARCHITECTURE Parallel computing architecture and programming model GPU Computing Application Includes

More information

Using mobile processors for general purpose industrial signal processing

Using mobile processors for general purpose industrial signal processing Using mobile processors for general purpose industrial signal processing Hans-Joachim Gelke hans.gelke@zhaw.ch Tobias Kammacher tobias.kammacher@zhaw.ch Matthias Rosenthal matthias.rosenthal@zhaw.ch Amin

More information

Visualization of OpenCL Application Execution on CPU-GPU Systems

Visualization of OpenCL Application Execution on CPU-GPU Systems Visualization of OpenCL Application Execution on CPU-GPU Systems A. Ziabari*, R. Ubal*, D. Schaa**, D. Kaeli* *NUCAR Group, Northeastern Universiy **AMD Northeastern University Computer Architecture Research

More information

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions GPGPU, 4th Meeting Mordechai Butrashvily, CEO moti@gass-ltd.co.il GASS Company for Advanced Supercomputing Solutions Agenda 3rd meeting 4th meeting Future meetings Activities All rights reserved (c) 2008

More information

Throughput Exploration and Optimization of a Consumer Camera Interface for a Reconfigurable Platform

Throughput Exploration and Optimization of a Consumer Camera Interface for a Reconfigurable Platform Throughput Exploration and Optimization of a Consumer Camera Interface for a Reconfigurable Platform By: Floris Driessen (f.c.driessen@student.tue.nl) Introduction 1 Video applications on embedded platforms

More information

Programming Multicore Systems

Programming Multicore Systems Programming Multicore Systems Enabling real time applications for multicore with the XMOS development tools 5 th September 2011 Matt Fyles XMOS Company Overview Established Fabless Semiconductor Company

More information

NVM PCIe Networked Flash Storage

NVM PCIe Networked Flash Storage NVM PCIe Networked Flash Storage Peter Onufryk Microsemi Corporation Santa Clara, CA 1 PCI Express (PCIe) Mid-range/High-end Specification defined by PCI-SIG Software compatible with PCI and PCI-X Reliable,

More information

Distributing Computation to Large GPU Clusters

Distributing Computation to Large GPU Clusters Distributing Computation to Large GPU Clusters What is this about? DiCE: Software library for writing applications scaling to many GPUs and CPUs in a cluster What is this about? DiCE: Software library

More information

Digital Audio Networking with Dante

Digital Audio Networking with Dante Digital Audio Networking with Dante AES Melbourne 2015 Aidan Williams CTO Audinate 2015-04- 20 1 Introduction CTO & Founder, Audinate We make Dante audio networking Elec Eng / Computer Science background

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 18

ECE 571 Advanced Microprocessor-Based Design Lecture 18 ECE 571 Advanced Microprocessor-Based Design Lecture 18 Vince Weaver http://www.eece.maine.edu/ vweaver vincent.weaver@maine.edu 11 November 2014 Homework #4 comments Project/HW Reminder 1 Stuff from Last

More information

GTC 2017 Green Flash Persistent Kernel : Real-Time, Low-Latency and HighPerformance Computation on Pascal. Julien BERNARD

GTC 2017 Green Flash Persistent Kernel : Real-Time, Low-Latency and HighPerformance Computation on Pascal. Julien BERNARD Green Flash Persistent Kernel : Real-Time, Low-Latency and HighPerformance Computation on Pascal Julien BERNARD Project #671662 funded by European Commission under program H2020-EU.1.2.2 coordinated in

More information

Device Memories and Matrix Multiplication

Device Memories and Matrix Multiplication Device Memories and Matrix Multiplication 1 Device Memories global, constant, and shared memories CUDA variable type qualifiers 2 Matrix Multiplication an application of tiling runningmatrixmul in the

More information

CUDA Performance Optimization. Patrick Legresley

CUDA Performance Optimization. Patrick Legresley CUDA Performance Optimization Patrick Legresley Optimizations Kernel optimizations Maximizing global memory throughput Efficient use of shared memory Minimizing divergent warps Intrinsic instructions Optimizations

More information

EE 457 Unit 7b. Main Memory Organization

EE 457 Unit 7b. Main Memory Organization 1 EE 457 Unit 7b Main Memory Organization 2 Motivation Organize main memory to Facilitate byte-addressability while maintaining Efficient fetching of the words in a cache block Low order interleaving (L.O.I)

More information

Profiling & Tuning Applications. CUDA Course István Reguly

Profiling & Tuning Applications. CUDA Course István Reguly Profiling & Tuning Applications CUDA Course István Reguly Introduction Why is my application running slow? Work it out on paper Instrument code Profile it NVIDIA Visual Profiler Works with CUDA, needs

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

GStreamer Daemon - Building a media server under 30min. Michael Grüner - David Soto -

GStreamer Daemon - Building a media server under 30min. Michael Grüner - David Soto - GStreamer Daemon - Building a media server under 30min Michael Grüner - michael.gruner@ridgerun.com David Soto - david.soto@ridgerun.com Introduction Michael Grüner Technical Lead at RidgeRun Digital signal

More information

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012 Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs

More information

COMP528: Multi-core and Multi-Processor Computing

COMP528: Multi-core and Multi-Processor Computing COMP528: Multi-core and Multi-Processor Computing Dr Michael K Bane, G14, Computer Science, University of Liverpool m.k.bane@liverpool.ac.uk https://cgi.csc.liv.ac.uk/~mkbane/comp528 21 You should compute

More information

GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE DAVIDE ROSSETTI, SW COMPUTE TEAM

GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE DAVIDE ROSSETTI, SW COMPUTE TEAM GPUDIRECT: INTEGRATING THE GPU WITH A NETWORK INTERFACE DAVIDE ROSSETTI, SW COMPUTE TEAM GPUDIRECT FAMILY 1 GPUDirect Shared GPU-Sysmem for inter-node copy optimization GPUDirect P2P for intra-node, accelerated

More information

GPU-centric communication for improved efficiency

GPU-centric communication for improved efficiency GPU-centric communication for improved efficiency Benjamin Klenk *, Lena Oden, Holger Fröning * * Heidelberg University, Germany Fraunhofer Institute for Industrial Mathematics, Germany GPCDP Workshop

More information

ROB IN Performance Measurements

ROB IN Performance Measurements ROB IN Performance Measurements I. Mandjavidze CEA Saclay, 91191 Gif-sur-Yvette CEDEX, France ROB Complex Hardware Organisation Mode of Operation ROB Complex Software Organisation Performance Measurements

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

A Framework for Modeling GPUs Power Consumption

A Framework for Modeling GPUs Power Consumption A Framework for Modeling GPUs Power Consumption Sohan Lal, Jan Lucas, Michael Andersch, Mauricio Alvarez-Mesa, Ben Juurlink Embedded Systems Architecture Technische Universität Berlin Berlin, Germany January

More information

Lecture 1: Gentle Introduction to GPUs

Lecture 1: Gentle Introduction to GPUs CSCI-GA.3033-004 Graphics Processing Units (GPUs): Architecture and Programming Lecture 1: Gentle Introduction to GPUs Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Who Am I? Mohamed

More information

Key Points. Rotational delay vs seek delay Disks are slow. Techniques for making disks faster. Flash and SSDs

Key Points. Rotational delay vs seek delay Disks are slow. Techniques for making disks faster. Flash and SSDs IO 1 Today IO 2 Key Points CPU interface and interaction with IO IO devices The basic structure of the IO system (north bridge, south bridge, etc.) The key advantages of high speed serial lines. The benefits

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels

SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels SimBSP Enabling RTL Simulation for Intel FPGA OpenCL Kernels Ahmed Sanaullah, Chen Yang, Daniel Crawley and Martin C. Herbordt Department of Electrical and Computer Engineering, Boston University The Intel

More information

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Remote CUDA (rcuda) Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Better performance-watt, performance-cost

More information

Massively Parallel Architectures

Massively Parallel Architectures Massively Parallel Architectures A Take on Cell Processor and GPU programming Joel Falcou - LRI joel.falcou@lri.fr Bat. 490 - Bureau 104 20 janvier 2009 Motivation The CELL processor Harder,Better,Faster,Stronger

More information

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems

Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems Designing, developing, debugging ARM and heterogeneous multi-processor systems Kinjal Dave Senior Product Manager, ARM ARM Tech Symposia India December 7 th 2016 Topics Introduction System design Software

More information

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management

X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management X10 specific Optimization of CPU GPU Data transfer with Pinned Memory Management Hideyuki Shamoto, Tatsuhiro Chiba, Mikio Takeuchi Tokyo Institute of Technology IBM Research Tokyo Programming for large

More information

CS 179: GPU Computing. Lecture 2: The Basics

CS 179: GPU Computing. Lecture 2: The Basics CS 179: GPU Computing Lecture 2: The Basics Recap Can use GPU to solve highly parallelizable problems Performance benefits vs. CPU Straightforward extension to C language Disclaimer Goal for Week 1: Fast-paced

More information

CS/ECE 217. GPU Architecture and Parallel Programming. Lecture 16: GPU within a computing system

CS/ECE 217. GPU Architecture and Parallel Programming. Lecture 16: GPU within a computing system CS/ECE 217 GPU Architecture and Parallel Programming Lecture 16: GPU within a computing system Objective To understand the major factors that dictate performance when using GPU as an compute co-processor

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

Learn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh

Learn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh Learn CUDA in an Afternoon Alan Gray EPCC The University of Edinburgh Overview Introduction to CUDA Practical Exercise 1: Getting started with CUDA GPU Optimisation Practical Exercise 2: Optimising a CUDA

More information

Bringing Intelligence to Enterprise Storage Drives

Bringing Intelligence to Enterprise Storage Drives Bringing Intelligence to Enterprise Storage Drives Neil Werdmuller Director Storage Solutions Arm Santa Clara, CA 1 Who am I? 28 years experience in embedded Lead the storage solutions team Work closely

More information

LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS

LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS LATTICE-BOLTZMANN AND COMPUTATIONAL FLUID DYNAMICS NAVIER-STOKES EQUATIONS u t + u u + 1 ρ p = Ԧg + ν u u=0 WHAT IS COMPUTATIONAL FLUID DYNAMICS? Branch of Fluid Dynamics which uses computer power to approximate

More information

CUDA (Compute Unified Device Architecture)

CUDA (Compute Unified Device Architecture) CUDA (Compute Unified Device Architecture) Mike Bailey History of GPU Performance vs. CPU Performance GFLOPS Source: NVIDIA G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

CUDA Memories. Introduction 5/4/11

CUDA Memories. Introduction 5/4/11 5/4/11 CUDA Memories James Gain, Michelle Kuttel, Sebastian Wyngaard, Simon Perkins and Jason Brownbridge { jgain mkuttel sperkins jbrownbr}@cs.uct.ac.za swyngaard@csir.co.za 3-6 May 2011 Introduction

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012

CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 CUDA Memory Types All material not from online sources/textbook copyright Travis Desell, 2012 Overview 1. Memory Access Efficiency 2. CUDA Memory Types 3. Reducing Global Memory Traffic 4. Example: Matrix-Matrix

More information

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34 1 / 34 GPU Programming Lecture 2: CUDA C Basics Miaoqing Huang University of Arkansas 2 / 34 Outline Evolvements of NVIDIA GPU CUDA Basic Detailed Steps Device Memories and Data Transfer Kernel Functions

More information

Understanding Storage I/O Behaviors of Mobile Applications. Louisiana State University Department of Computer Science and Engineering

Understanding Storage I/O Behaviors of Mobile Applications. Louisiana State University Department of Computer Science and Engineering Understanding Storage I/O Behaviors of Mobile Applications Jace Courville jcourv@csc.lsu.edu Feng Chen fchen@csc.lsu.edu Louisiana State University Department of Computer Science and Engineering The Rise

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Real-Time Support for GPU. GPU Management Heechul Yun

Real-Time Support for GPU. GPU Management Heechul Yun Real-Time Support for GPU GPU Management Heechul Yun 1 This Week Topic: Real-Time Support for General Purpose Graphic Processing Unit (GPGPU) Today Background Challenges Real-Time GPU Management Frameworks

More information

ATS-GPU Real Time Signal Processing Software

ATS-GPU Real Time Signal Processing Software Transfer A/D data to at high speed Up to 4 GB/s transfer rate for PCIe Gen 3 digitizer boards Supports CUDA compute capability 2.0+ Designed to work with AlazarTech PCI Express waveform digitizers Optional

More information

New Communication Standard Takyon Proposal Overview

New Communication Standard Takyon Proposal Overview Khronos Group Inc. 2018 - Page 1 Heterogenous Communications Exploratory Group New Communication Standard Takyon Proposal Overview November 2018 Khronos Group Inc. 2018 - Page 2 Khronos Exploratory Group

More information

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors Michael Boyer, David Tarjan, Scott T. Acton, and Kevin Skadron University of Virginia IPDPS 2009 Outline Leukocyte

More information

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video

A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video A Hardware-Friendly Bilateral Solver for Real-Time Virtual-Reality Video Amrita Mazumdar Armin Alaghi Jonathan T. Barron David Gallup Luis Ceze Mark Oskin Steven M. Seitz University of Washington Google

More information

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s)

Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Directed Optimization On Stencil-based Computational Fluid Dynamics Application(s) Islam Harb 08/21/2015 Agenda Motivation Research Challenges Contributions & Approach Results Conclusion Future Work 2

More information

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog

CUDA GPGPU Workshop CUDA/GPGPU Arch&Prog CUDA GPGPU Workshop 2012 CUDA/GPGPU Arch&Prog Yip Wichita State University 7/11/2012 GPU-Hardware perspective GPU as PCI device Original PCI PCIe Inside GPU architecture GPU as PCI device Traditional PC

More information

RAVENNA WDM Virtual Sound Card (RVSC) Specification

RAVENNA WDM Virtual Sound Card (RVSC) Specification Draft 1.0 RAVENNA WDM Virtual Sound Card (RVSC) Specification This document describes the specification of the RAVENNA Virtual Sound Card (RVSC) with WDM API. ALC NetworX GmbH Am Loferfeld 58 81249 Munich

More information

ReFlex: Remote Flash Local Flash

ReFlex: Remote Flash Local Flash ReFlex: Remote Flash Local Flash Ana Klimovic Heiner Litz Christos Kozyrakis NVMW 18 Memorable Paper Award Finalist 1 Flash in Datacenters Flash provides 1000 higher throughput and 100 lower latency than

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Prototyping the Autonomous Future Joe Cassar, Engineering Group Manager. dspace Inc Pontiac Trail, Wixom, MI 48393

Prototyping the Autonomous Future Joe Cassar, Engineering Group Manager. dspace Inc Pontiac Trail, Wixom, MI 48393 Prototyping the Autonomous Future Joe Cassar, Engineering Group Manager dspace Inc 50131 Pontiac Trail, Wixom, MI 48393 2 What s the Common Denominator? Ford AUDI RS7 Concept Nissan Porsche ZF 3 MicroAutobox

More information

Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing

Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing Microsecond Latency, Real-Time, Multi-Input/Output Control using GPU Processing Nikolaus Rath March 20th, 2013 N. Rath (Columbia University) µs Latency Control using GPU Processing March 20th, 2013 1 /

More information

GPUfs: Integrating a file system with GPUs

GPUfs: Integrating a file system with GPUs GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information