Implementing CUDA Audio Networks
Giancarlo Del Sordo, Acustica Audio (giancarlo@acusticaudio.com)

Motivation
- Vintage gear processing in the software domain
- High audio quality results
- Low cost and a user-driven content approach

Outline
- Audio theory
- Basic engine solution
- Network implementation
- Results

Audio theory

Standard DSP approach (input to output). Example: a simple 2-pole filter (source: www.musicdsp.org):

    c = pow(0.5, (128 - cutoff) / 16.0);
    r = pow(0.5, (resonance + 24) / 16.0);
    // Loop:
    v0 = (1 - r*c)*v0 - c*v1 + c*input;
    v1 = (1 - r*c)*v1 + c*v0;
    output = v1;

Drawbacks:
- time and cost required to achieve good-quality algorithms
- complexity of taking advantage of GPU resources
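For context, here is the same filter wrapped into a runnable block-processing routine; this is a minimal sketch, and the function name, buffer interface, and float precision are illustrative additions, not from the original slide:

    #include <math.h>

    // Minimal sketch: run the 2-pole filter above over a block of samples.
    // cutoff and resonance are 0..127 controls, as in the musicdsp.org code.
    void process (const float* input, float* output, int n,
                  float cutoff, float resonance)
    {
        static float v0 = 0.0f, v1 = 0.0f;  // filter state kept across blocks
        const float c = powf (0.5f, (128.0f - cutoff) / 16.0f);
        const float r = powf (0.5f, (resonance + 24.0f) / 16.0f);
        for (int i = 0; i < n; i++) {
            v0 = (1.0f - r * c) * v0 - c * v1 + c * input [i];
            v1 = (1.0f - r * c) * v1 + c * v0;
            output [i] = v1;
        }
    }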

Volterra Series. [Diagram: x(n), x(n)·x(n) and x(n)·x(n)·x(n) are convolved with kernels h1(n), h2(n), h3(n) respectively and summed to give y(n).]
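In formulas, the diagram corresponds to a diagonal (Hammerstein-style) truncation of the Volterra expansion, with one 1-D kernel per order:

    y(n) = \sum_{k} h_1(k)\, x(n-k) + \sum_{k} h_2(k)\, x^2(n-k) + \sum_{k} h_3(k)\, x^3(n-k)

Each nonlinear order is just a linear convolution of a power of the input, which is what makes the FFT-based kernel engine described below applicable.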

Solution for dynamic processing (preamps, reverbs). [Diagram: an envelope detector/follower selects among kernel banks h1(n), h2(n), h3(n) sampled at different drive levels (0 dB, -2 dB, -4 dB).]
- Preamplifiers: kernel length 2 ms to 200 ms, refresh every 2 ms to 50 ms
- Reverbs: kernel length 200 ms to 10 s, refresh up to 180 ms

Solution for dynamic processing (compressors/limiters). [Diagram: an envelope detector/follower drives a transfer curve Tf(input, output), with levels sampled at 0 dB, -2 dB, -4 dB.]
- Kernel length 2 ms to 200 ms, refresh every 2 ms to 50 ms
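A minimal sketch of the kind of envelope detector/follower used on these slides: a one-pole smoother with separate attack and release, whose output selects (or interpolates between) the kernels sampled at 0, -2 and -4 dB. The struct layout and coefficient handling are illustrative assumptions:

    #include <math.h>

    // One-pole envelope follower; attack/release are smoothing coefficients
    // in (0,1), precomputed from the desired time constants.
    typedef struct { float env, attack, release; } Follower;

    static float follow (Follower* f, float x)
    {
        const float level = fabsf (x);
        const float coeff = (level > f->env) ? f->attack : f->release;
        f->env = coeff * f->env + (1.0f - coeff) * level;
        return f->env;  // linear amplitude; 20*log10f(env) gives dB
    }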

Solution for equalizers/filters. [Diagram: a tree of sampled kernels indexed by knob positions (knob 1, knob 2), from root nodes to leaves.]
- Kernel length 10 ms to 500 ms
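A minimal sketch of the kernel lookup this implies, with the tree flattened into a uniform knob1 x knob2 grid for brevity; all names and the nearest-neighbor policy are illustrative:

    // Kernels sampled on a uniform nk1 x nk2 grid of knob positions;
    // knobs are normalized to 0..1. Returns the nearest sampled kernel.
    const float* lookup_kernel (const float* const* bank, int nk1, int nk2,
                                float knob1, float knob2)
    {
        int i = (int) (knob1 * (nk1 - 1) + 0.5f);
        int j = (int) (knob2 * (nk2 - 1) + 0.5f);
        return bank [i * nk2 + j];
    }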

Solution for time-varying fx (chorus, flanger, ...). [Diagram: interpolation over a tree of kernels sampled along the modulation cycle at 0 ms, 20 ms, 40 ms, 60 ms, from root to leaves.]
- Kernel length 2 ms to 200 ms, refresh every 2 ms to 50 ms

Block diagram of the engine (1). [Diagram: x(n) and x(n)·x(n) feed the realtime Kernel engine, whose outputs through h1(n) and h2(n) are summed into y(n); a Vectorial engine refreshes the kernels every 2 ms to 180 ms.]
Advantages:
- low costs required for achieving accurate algorithms (sampling approach)
- easy CUDA implementation of the Kernel engine

Block diagram of the engine (2). [Diagram: the input x (and its square for the second-order path) is blocked into ivldim samples, zero-padded to ivextendedbufferbytes, and FFT-ed into X; X is multiplied by the partitioned kernel spectra (H1_1, H1_2 and H2_1, H2_2) and accumulated into a circular accumulator of ivldim * ivbins; an IFFT of the current slot produces y.]
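A minimal sketch of one step of this uniformly partitioned convolution built on stock cuFFT; the names (muladd, d_in, d_H, d_acc, d_out) and the launch configuration are illustrative stand-ins for the engine's own buffers and its muacp kernel shown later:

    #include <cuda_runtime.h>
    #include <cufft.h>

    // Multiply the input spectrum by every kernel partition and accumulate
    // into a circular frequency-domain buffer (one slot per partition).
    __global__ void muladd (const cufftComplex* X, const cufftComplex* H,
                            cufftComplex* acc, int nfft, int npart, int pos)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nfft;
             i += blockDim.x * gridDim.x)
            for (int p = 0; p < npart; p++) {
                int slot = nfft * ((pos + p) % npart) + i;
                cufftComplex a = X [i], b = H [(size_t) p * nfft + i];
                acc [slot].x += a.x * b.x - a.y * b.y;
                acc [slot].y += a.x * b.y + a.y * b.x;
            }
    }

    // One block step. plan: 1-D C2C cuFFT plan of size nfft.
    // d_in: zero-padded input block; d_H: npart partition spectra;
    // d_acc: circular accumulator of npart * nfft bins; pos: current slot.
    void convolve_block (cufftHandle plan, cufftComplex* d_in,
                         const cufftComplex* d_H, cufftComplex* d_acc,
                         cufftComplex* d_out, int nfft, int npart, int pos,
                         cudaStream_t stream)
    {
        cufftSetStream (plan, stream);
        cufftExecC2C (plan, d_in, d_in, CUFFT_FORWARD);
        muladd<<<64, 256, 0, stream>>> (d_in, d_H, d_acc, nfft, npart, pos);
        cufftExecC2C (plan, d_acc + (size_t) pos * nfft, d_out, CUFFT_INVERSE);
        // a full engine would now zero the consumed slot and advance pos
    }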

Basic engine solution

Inner block (1)

Input (zero-pad the block and copy it to the device):

    memset (cvtemp, 0, cop32->ivextendedbufferbytes);
    for (iv = 0; iv < cop32->ivldim; iv++)
        cvtemp [iv].x = (float) dx_ [iv];
    cudaMemcpyAsync (cvtemp_, cvtemp, cop32->ivextendedbufferbytes,
                     cudaMemcpyHostToDevice, cop32->ivstream);

FFT (input and, when the kernel was refreshed, the kernel partitions too):

    if (!bupdate)
        acfftexecc2c (cop32->cvacplanx, cvtemp_, cvtemp_,
                      CUFFT_FORWARD, cop32->ivstream);
    else {
        cudaMemcpyAsync (cop32->fvirfftpartition_ [iirchannel],
                         cop32->fvirfftpartition [iirchannel],
                         cop32->ivcircularbufferbytes,
                         cudaMemcpyHostToDevice, cop32->ivstream);
        acfftexecc2c (cop32->cvacplany [iirchannel],
                      (cufftComplex*) cop32->fvcoalescedirfftpartition_ [iirchannel],
                      (cufftComplex*) cop32->fvcoalescedirfftpartition_ [iirchannel],
                      CUFFT_FORWARD, cop32->ivstream);
    }

Multiply and add:

    muacp<<<icpar0, icpar1, 0, cop32->ivstream>>> (
        cvtemppointera_,
        (tccomplex*) fvtemppointerb_,
        (tccomplex*) fvtemppointerc_,
        cop32->ivbins, cop32->ivpartitions, cop32->ivcircularpointer);

Inner block (2)

Output (chaining of instances, then IFFT and copy back to the host):

    switch (cop32->ivchainedposition) {
    case icchainedfirst_:
        cudaMemcpyAsync (&cop32->fvchainedbuffer_ [ivtemp2], fvtemppointerc_,
                         cop32->ivextendedbufferbytes,
                         cudaMemcpyDeviceToDevice, cop32->ivstream);
        break;
    case icchainedmiddle:
        addfl<<<icpar0, icpar1, 0, cop32->ivstream>>>
            (fvtemppointerc_, &cop32->fvchainedbuffer_ [ivtemp2],
             cop32->ivextendedsection);
        break;
    case icchainedlast:
        addfl<<<icpar0, icpar1, 0, cop32->ivstream>>>
            (&cop32->fvchainedbuffer_ [ivtemp2], fvtemppointerc_,
             cop32->ivextendedsection);
        // deliberate fall-through: the last link also produces the output y
    case icchainednone:
        acfftexecc2c (cop32->cvacplanxinv, (cufftComplex*) fvtemppointerc_,
                      cvtemp_, CUFFT_INVERSE, cop32->ivstream);
        cudaMemcpyAsync (cop32->fvextendedbuffer, cvtemp_,
                         cop32->ivextendedbufferbytes,
                         cudaMemcpyDeviceToHost, cop32->ivstream);
    }

Inner block (3)

In-depth analysis: multiply and add, muacp<<< >>> ( );

    static __global__ void muacp (tccomplex* ca, tccomplex* cb, tccomplex* cc,
                                  int isize, int ipartitions,
                                  int icircularpointer)
    {
        // grid-stride loop: each thread walks a strided subset of the bins
        const int icnumthreads = blockDim.x * gridDim.x;
        const int icthreadid = blockIdx.x * blockDim.x + threadIdx.x;
        int ivctemp;

        for (int iv = icthreadid; iv < isize; iv += icnumthreads)
            for (int ivinner = 0; ivinner < ipartitions; ivinner++) {
                ivctemp = isize * ((icircularpointer + ivinner) % ipartitions) + iv;
                cc [ivctemp] = masce (ca [iv], cb [isize * ivinner + iv],
                                      cc [ivctemp]);
            }
    }

    // complex multiply-accumulate: returns cc + ca * cb
    static __device__ __host__ inline tccomplex masce (tccomplex ca,
                                                       tccomplex cb,
                                                       tccomplex cc)
    {
        tccomplex cvreturn;
        cvreturn.x = cc.x + ca.x*cb.x - ca.y*cb.y;
        cvreturn.y = cc.y + ca.x*cb.y + ca.y*cb.x;
        return cvreturn;
    }

Inner block (4)

In-depth analysis: FFT, acfftexecc2c ( )

    void acfftexecc2c (ACFFTLOCHANDLE cplan, float2* cinput, float2* coutput,
                       int idirection, int istream = 0)
    {
        ((ACFFTLOC*) cplan)->cvexecutefun [idirection <= 0 ? 0 : 1]
            (cinput, coutput, cplan, istream);
    }

    // dispatch table: one specialized kernel per supported FFT size
    (*cplan)->cvexecutefun [0] = cvacfftfunfw [ ];
    int ivacfftsize [] = {8, 16, 32, 64, 128, 256, 512, 1024,
                          2048, 4096, 8192, 16384};
    ACFFTFUN cvacfftfunfw [] = {acfnv, acfnv, acfnv, acfnv, acfnv, acfnv,
                                acf01, acf02, acf03, acf04, acf05, acf06};

    void acf06 (float2* cinput, float2* coutput, ACFFTLOCHANDLE cacfft,
                int istream = 0)
    {
        ACFFTLOC* cvloc = (ACFFTLOC*) cacfft;
        v2_16<<< grid2d(cvloc->ivbatches*(16384/16)/64), 64, 0, istream >>>
            (cinput, coutput);
        v1024<<< grid2d(cvloc->ivbatches*16), 64, 0, istream >>>
            (coutput, coutput);
    }

Custom FFT notes:
- acfnv = basic c2c radix-2
- larger sizes use Vasily Volkov's inner blocks: 1024 = 16*4*16, 16384 = 16*1024

Buffer tuning and latency (1) 18000 16000 14000 12000 10000 8000 6000 4000 2000 0 FFT buffer (ivextendedbuffersize) CUDA accelleration 0 100 200 300 400 500 600 ms compressors preamps equalizers early reflections long TMV rooms reverbs

Buffer tuning and latency (2). [Chart: load (%, 0 to 25) versus kernel length in ms (700 to 10700) for GPU and CPU.]
- CPU: Core2 Duo T5800 2.0 GHz; GPU: GeForce 9600M GT
- Result: real-time resources are doubled by adding GPU-powered instances

Buffer tuning and latency (3)
Setup: CPU Core2 Duo T5800 2.0 GHz, GPU GeForce 9600M GT; 3 instances, 1 kernel of 5.5 seconds, FFT buffer = 16384, 30 partitions.
- IPP (Intel Performance Primitives): CPU 40%
- FFTW: CPU 44%
- CUDA cuFFT library: CPU 5%, GPU 90%
- CUDA custom FFT implementation: CPU 5%, GPU 36%

Audio processing buffers (1). [Diagram: the input is processed ahead of real time against a headroom; the output is delayed by latency compensation, and the processing thread sleeps once it is far enough ahead.]

Audio processing buffers (2). [Diagram: FFT buffer and refresh period (kernel swap); the raw FFT output is combined by, for example, linear interpolation; the processed result uses per-sample smoothing.]
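A minimal sketch of that per-sample smoothing: when the kernel is swapped at a refresh boundary, the block convolved with the old kernel is crossfaded into the block convolved with the new one. The linear ramp and all names here are illustrative:

    // Linear per-sample crossfade across one block of n samples.
    void crossfade (const float* with_old, const float* with_new,
                    float* out, int n)
    {
        for (int i = 0; i < n; i++) {
            float t = (float) i / (float) n;  // ramps 0 -> 1 over the block
            out [i] = (1.0f - t) * with_old [i] + t * with_new [i];
        }
    }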

Audio processing buffers (3). Example: a preamplifier featuring 50 ms kernels. [Plot, in dB: output without interpolation versus log interpolation versus custom interpolation.]

Threading model. [Diagram: while new data is loading, the Host/DSP thread bypasses processing and keeps sending old data; the GUI thread issues the loading command, which has some latency; the CUDA thread allocates H1 and H2, handles chaining, and processes H1 and H2, entering critical sections around its calls into the CUDA RUNTIME.]
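A minimal sketch of the hand-off the diagram implies: the CUDA thread prepares new kernel data in a back buffer while the Host/DSP thread keeps using (or bypassing with) the old one, and a short critical section swaps the pointers. pthreads and all names here are illustrative assumptions:

    #include <pthread.h>
    #include <stddef.h>

    static pthread_mutex_t swap_lock = PTHREAD_MUTEX_INITIALIZER;
    static float* active_kernel;   // what the DSP thread convolves with
    static float* pending_kernel;  // freshly loaded data, not yet in use

    // CUDA/loader thread: publish a newly prepared kernel buffer.
    void publish_kernel (float* fresh)
    {
        pthread_mutex_lock (&swap_lock);    // critical section
        pending_kernel = fresh;
        pthread_mutex_unlock (&swap_lock);
    }

    // Host/DSP thread: called at a block boundary; picks up the new kernel
    // if one is ready, otherwise keeps processing with the old data.
    float* acquire_kernel (void)
    {
        pthread_mutex_lock (&swap_lock);
        if (pending_kernel) {
            active_kernel = pending_kernel;
            pending_kernel = NULL;
        }
        pthread_mutex_unlock (&swap_lock);
        return active_kernel;
    }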

Block diagram of a plug-in implementation. [Diagram: inside the HOST, each plug-in instance is a DSP + EDITOR pair; each DSP drives an ENGINE made of a KERNEL part and a VECTORIAL part; every engine reaches the CUDA RUNTIME through its own BRIDGE.]

Network implementation

Network implementation. [Diagram: as in the previous block diagram, but the per-instance BRIDGEs are replaced by IPC connections to a single SERVER process.]

Inter-process communication (1)
Custom socket implementation: cvsocket = new CoreSckt (type); with the usual accept/connect/bind/listen/send/recv primitives.
- Socket: reliable, slower, remote host
- UDP: not reliable (only for media data), remote host
- Named pipes / domain sockets: reliable, fast (10x), local host
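For reference, a minimal sketch of the fast local-host transport from the list above: a POSIX domain-socket client that connects and sends one message. The function name, path argument, and error handling are illustrative:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    // Connect to a local IPC server over a Unix domain socket and send a
    // message; returns the descriptor so the caller can recv() the ack.
    int send_local (const char* path, const void* msg, size_t len)
    {
        int fd = socket (AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) return -1;
        struct sockaddr_un addr;
        memset (&addr, 0, sizeof addr);
        addr.sun_family = AF_UNIX;
        strncpy (addr.sun_path, path, sizeof addr.sun_path - 1);
        if (connect (fd, (struct sockaddr*) &addr, sizeof addr) < 0 ||
            send (fd, msg, len, 0) < 0) { close (fd); return -1; }
        return fd;
    }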

Inter-process communication (2)
Example: set a 2-state (boolean) variable bean: cvprocess->setbb (coreprcs::icbeanbypass, true);

Client:

    cvtransportopcode->setvl (0, icopcodesetbb);
    cvtransportopcode->send_ (cvlocalsocket);
    cvbeanstructb->setvl (0, iobject);
    cvbeanstructb->setvl (1, bvalue);
    cvbeanstructb->send_ (cvlocalsocket);
    cvlocalsocket->lflsh ();
    cvtransportack->recve (cvlocalsocket);

Server (the class is wrapped):

    case icopcodesetbb:
        cvbeanstructb->recve (cvlocalsocket);
        cvbeanstructb->getvl (0, ivsubopcode);
        cvbeanstructb->getvl (1, bvtemp);
        cvbeanstructb->getvl (2, cvbeanstruct.ivobject);
        cvbeanstructb->getvl (3, cvbeanstruct.ivsubobject);
        cvprocess->setbb (ivsubopcode, bvtemp, &cvbeanstruct);
        cvtransportack->send_ (cvlocalsocket);

Message layout: opcode (icopcodesetbb), then the bean payload: subopcode/bean (icbeanbypass), item, value.
Different threads on the server side: audio processing; editor (low-priority beans).
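A minimal sketch of this opcode-plus-bean wire format in plain C; the struct layout, field order, and opcode value are illustrative assumptions, not the library's actual encoding:

    #include <stdint.h>
    #include <string.h>

    enum { icopcodesetbb_sketch = 1 };   // illustrative opcode value

    typedef struct {                     // fixed-size boolean "bean" payload
        int32_t subopcode;               // which bean (e.g. bypass)
        int32_t value;                   // the 2-state value, 0 or 1
        int32_t object, subobject;       // addressed item inside the plug-in
    } BoolBean;

    // Client side: serialize opcode + bean into one send() buffer.
    size_t pack_set_bool (uint8_t* buf, const BoolBean* bean)
    {
        int32_t opcode = icopcodesetbb_sketch;
        memcpy (buf, &opcode, sizeof opcode);
        memcpy (buf + sizeof opcode, bean, sizeof *bean);
        return sizeof opcode + sizeof *bean;
    }

    // Server side: read the opcode, dispatch, apply the bean, then ack.
    void dispatch (const uint8_t* buf, void (*set_bool) (const BoolBean*))
    {
        int32_t opcode;
        memcpy (&opcode, buf, sizeof opcode);
        if (opcode == icopcodesetbb_sketch) {
            BoolBean bean;
            memcpy (&bean, buf + sizeof opcode, sizeof bean);
            set_bool (&bean);
        }
    }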

32-bit legacy hosts in a 64-bit OS. [Diagram: a 32-bit HOST (no x64 version available) loads the plug-in as a dynamic library; the plug-in talks over IPC to a 64-bit SERVER + BRIDGE running on the CUDA runtime/drivers.]
Benefits of the 64-bit server process:
- extended address space
- easy CUDA installation (driver, toolkit)
- very low overhead using named pipes/domain sockets
- no memory-fragmentation effects in the host, and fault tolerance

Multi-server/multi-client block diagram. [Diagram: several HOSTs, each running PLUG-INs, connect over IPC to SERVER + BRIDGE processes sharing the CUDA runtime/drivers.]
Achievements:
- single control point, shared resources, easy maintenance
- multiple OSes available at the same time (example: a Mac OS X host connected to cheap Windows servers)

Results

The library
- Mac OS X (Tiger through Snow Leopard), Windows XP through 7
- Support for Microsoft, Borland/CodeGear, and Apple compilers
- Support for the IPP library; optional support for the FFTW library
- Direct convolution implemented in ia32 and x64 assembly
- An automatic application aimed at vintage gear sampling
- A huge number of emulations available and used daily in professional audio productions

On the road sampling
- Big recording facility
- Client/server setup as a replacement for missing outboard gear

Next steps
- CUDA implementation of the vectorial engine
- Resource optimization (clients connected to the same server instance could share resources)
- New features based on client-to-client communication (for example, cross-talk modeling)