CMPE 665: Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN

Graphics Processing Unit A GPU accelerates the creation of images in a frame buffer intended for output to a display. It is used for image processing and computer graphics. The term was popularized by NVIDIA in 1999. A GPU is also known as a Visual Processing Unit (VPU), a term ATI Technologies used when it released the Radeon 9700 in 2002.

History of GPU development 1970s and 1980s Video shifters and video address generators acted as hardware intermediaries between the main processor and the display unit. RCA's "Pixie" video chip (1976) could output a video signal at 62x128 resolution. Motorola's MC6845 video address generator (1978) became the basis for the IBM PC display adapters and the Apple II display. The IBM Professional Graphics Controller (PGA, 1984) was one of the very first video cards for the PC. Silicon Graphics Inc. (SGI) introduced the OpenGL technology in 1989.

History of GPU development 1990s SGI's graphics hardware was mainly used in workstations. 3dfx's Voodoo was one of the first true gaming cards; it operated at 50 MHz with 4 MB of 64-bit DRAM. NVIDIA's GeForce 256 offered features such as multi-texturing, bump mapping, light maps, and hardware geometry transforms and lighting. It operated at a clock speed of 120 MHz with 32 MB of 128-bit DRAM and used a fixed-function pipeline. This is when the GPU hardware and computer gaming markets took off.

History of GPU development 2000s The programmable pipeline was introduced; popular cards of this period include NVIDIA's GeForce3 and the ATI Radeon 8500. Fully programmable graphics cards arrived in 2002 with the NVIDIA GeForce FX and the ATI Radeon 9700. In 2004, GPU programming started to take off, and in 2006 GPUs began to be exposed as massively parallel processors, with more programmability added to the pixel and vertex shaders. Current GPUs are highly programmable, and the trend is toward GPU-accelerated processing.

CUDA - Compute Unified Device Architecture CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics-processing unit (GPU). CUDA gives developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs.

CUDA - Compute Unified Device Architecture Using CUDA, GPUs can be used for general-purpose processing; this approach is known as GPGPU (general-purpose computing on graphics processing units). Unlike CPUs, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly rather than executing a single thread very quickly.

Processing Flow
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU executes the work in parallel on each core.
4. Copy the result from GPU memory back to main memory.
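A minimal CUDA C sketch of this four-step flow (illustrative only; the vector-add kernel and buffer names are hypothetical and not part of the original slides):

// Sketch of the processing flow: copy in, launch, execute in parallel, copy out.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    // Step 1: copy data from main memory to GPU memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Steps 2-3: the CPU launches the kernel; the GPU executes it in parallel
    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Step 4: copy the result from GPU memory back to main memory
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}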

Advantages With millions of CUDA-enabled GPUs sold to date, software developers, scientists, and researchers are finding broad-ranging uses for GPU computing with CUDA. Scattered reads: code can read from arbitrary addresses in memory. Unified Memory bridges the CPU-GPU divide (see the sketch below). Faster downloads to and readback from the GPU. Full support for integer and bitwise operations, including integer texture lookups.
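As a small illustration of the Unified Memory point, a sketch assuming CUDA 6 or later (the scale kernel and variable names are hypothetical):

// Sketch: Unified Memory gives one pointer that both the CPU and the GPU can use.
#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));    // visible to host and device
    for (int i = 0; i < n; ++i) x[i] = (float)i; // written by the CPU
    scale<<<(n + 255) / 256, 256>>>(x, n);       // processed by the GPU
    cudaDeviceSynchronize();                     // wait before the CPU reads x again
    cudaFree(x);
    return 0;
}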

Disadvantages Unlike OpenCL, CUDA-enabled GPUs are available only from NVIDIA. CUDA does not support the full C standard. Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency. Valid C/C++ may sometimes be flagged and prevent compilation due to optimization techniques the compiler is required to employ to use limited resources.

Real Time Applications Identify hidden plaque in arteries: Heart attacks are the leading cause of death worldwide. GPUs can be used to simulate blood flow and identify hidden arterial plaque without invasive imaging techniques or exploratory surgery. Analyze air traffic flow: The National Airspace System manages the nationwide coordination of air traffic flow. Computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently.

Real Time Applications Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds. Visualize molecules: A molecular simulation called NAMD (nanoscale molecular dynamics) gets a large performance boost with GPUs. The speed-up is a result of the parallel architecture of GPUs, which enables NAMD developers to port compute-intensive portions of the application to the GPU using the CUDA Toolkit.

Message Passing Interface (MPI) An MPI program runs on systems with a distributed memory space, such as a cluster. MPI defines a message-passing API that covers point-to-point messages as well as collective operations such as reductions. In MPI, each process is identified by a rank.
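A minimal MPI sketch showing ranks and a point-to-point message (illustrative; only standard MPI calls are used):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // each process is identified by its rank

    int value = 0;
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // point-to-point send to rank 1
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}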

Reasons for combining CUDA and MPI: to solve problems with a data size too large to fit into the memory of a single GPU; to solve problems that would require unreasonably long compute time on a single node; to accelerate an existing MPI application with GPUs; or to enable a single-node multi-GPU application to scale across multiple nodes.

CUDA-AWARE MPI In regular MPI, if one wants to send a GPU buffer, it has to be staged through host memory, as shown in the code below.
//MPI Rank 0
cudaMemcpy(s_buf_h, s_buf_d, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
//MPI Rank 1
MPI_Recv(r_buf_h, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
cudaMemcpy(r_buf_d, r_buf_h, size, cudaMemcpyHostToDevice);

CUDA-AWARE MPI With CUDA-aware MPI, the GPU (device) buffers can be passed directly to MPI.
//MPI Rank 0
MPI_Send(s_buf_d, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
//MPI Rank 1
MPI_Recv(r_buf_d, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
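Putting both ranks into one program, a sketch of the same exchange with a device buffer (assumes the MPI library is built with CUDA support; the buffer name buf_d is illustrative):

// Sketch: send a device buffer directly, relying on a CUDA-aware MPI build.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int size = 1 << 20;          // 1 MB message
    char *buf_d;
    cudaMalloc(&buf_d, size);          // device buffer handed straight to MPI

    if (rank == 0) {
        cudaMemset(buf_d, 1, size);    // fill the send buffer on the GPU
        MPI_Send(buf_d, size, MPI_CHAR, 1, 100, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(buf_d, size, MPI_CHAR, 0, 100, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d bytes into device memory\n", size);
    }

    cudaFree(buf_d);
    MPI_Finalize();
    return 0;
}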

CUDA-AWARE MPI Handling Buffers

CUDA-AWARE MPI All operations that require message transfers can be pipelined. Acceleration technologies such as NVIDIA GPUDirect can be utilized by the MPI library, so GPU buffers can be passed directly to the network adapter.

CUDA-Aware MPI Working of CUDA-Aware MPI. The accompanying diagram shows the various steps involved.

MPI vs. CUDA-AWARE MPI PERFORMANCE For tasks where communication between processes is low.

MPI vs. CUDA-AWARE MPI PERFORMANCE For communication-intensive tasks

MPI vs. CUDA-AWARE MPI PERFORMANCE Ease of use: pipelined data transfers automatically provide optimizations when available.

CUDA-AWARE MPI Implementations Open MPI 1.7 (beta): better interaction with streams; better small-message performance via the eager protocol; support for reduction operations; support for non-blocking collectives.
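When it matters at run time whether the MPI build is actually CUDA-aware, newer Open MPI releases expose an extension to check (a sketch; MPIX_CUDA_AWARE_SUPPORT and MPIX_Query_cuda_support are Open MPI-specific and may not be present in the 1.7 beta or in other MPI implementations):

// Sketch: query whether this Open MPI build advertises CUDA-aware support.
#include <mpi.h>
#include <stdio.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h>                   // Open MPI extensions, incl. the CUDA query
#endif

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
#if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
    printf("CUDA-aware support compiled in; runtime reports: %d\n",
           MPIX_Query_cuda_support());
#else
    printf("This MPI build does not advertise CUDA-aware support.\n");
#endif
    MPI_Finalize();
    return 0;
}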

CUDA-AWARE MPI Implementations IBM Platform MPI (8.3): obtain higher-quality results faster; reduce development and support costs; improve engineer and developer productivity; supports the broadest range of industry-standard platforms, interconnects, and operating systems, helping ensure that parallel applications can run almost anywhere.