Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA Tesla GPU Cluster. Veerendra Allada and Troy Benjegerdes (Electrical and Computer Engineering, Ames Laboratory, Iowa State University) and Brett Bode (National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign). Workshop on Parallel Programming on Accelerator Clusters, IEEE Cluster 2009, New Orleans, August 31, 2009.

Outline: Introduction; the Tesla processor for scientific computing; the CUDA programming model; NetPIPE and memory-transfer performance results; GEMM routines from CUBLAS and Intel MKL; conclusions and future work.

Introduction: Commodity clusters are increasingly augmented with special-purpose computing devices to accelerate scientific applications. Contemporary accelerator architectures include GPUs, the Cell processor, and FPGAs. GPUs are widely deployed because of their high computational density, favorable performance-per-price ratio, and active programming-framework support from vendors. In our research, the algorithms currently targeted for acceleration come from the computational chemistry domain.

Introduction (continued): Earlier studies focused on understanding communication bottlenecks among the cluster compute nodes. Understanding the bottlenecks in transferring data to the coprocessors helps application developers achieve balanced computation. Computational chemistry algorithms also depend heavily on matrix-vector and matrix-matrix subroutines. In this work we focus on the Level 3 BLAS routines SGEMM/DGEMM and on memory-transfer throughput and latency.

NVIDIA Tesla for Scientific Computing. Features: a scalable array of streaming multiprocessors (SMs); eight scalar processors (SPs) per SM; two special function units and one double-precision unit per SM; fast on-chip shared memory; 4 GB of GPU device memory. [Figure: NVIDIA Tesla hardware architecture. Reference: NVIDIA CUDA Programming Guide]

CUDA Programming Model. Salient features: the GPU is viewed as a highly fine-grained multithreaded compute device, capable of executing many threads in parallel using the Single-Instruction, Multiple-Thread (SIMT) model. Performance benefits come from overlapping computation and memory fetches across threads. CUDA is designed as a minimal extension to the C/C++ programming language and offers two programming interfaces: the high-level run-time API and the low-level driver API.
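As an illustration of the run-time API and the SIMT execution model described above, the following is a minimal sketch (not taken from the slides; the kernel, its name, and the launch configuration are hypothetical):

    // Minimal CUDA run-time API sketch: each thread computes one element.
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: allocate device buffers, copy data, launch, copy back.
    void run_saxpy(int n, float a, const float *hx, float *hy)
    {
        float *dx, *dy;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMalloc(&dy, n * sizeof(float));
        cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, a, dx, dy);

        cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dx);
        cudaFree(dy);
    }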

Tesla GPU Cluster. Tesla S1070 system: four Tesla T10 processors, each with 4 GB of device memory. Compute node: dual quad-core Intel Xeon 5405 processors and 8 GB of main memory. Each compute node is connected to two T10 GPUs via a PCI Express 2.0 link.

NetPIPE (Network Protocol Independent Performance Evaluator): a micro-benchmark that measures the latency and throughput of memory transfers, based on the principles of the HINT benchmark. It fixes neither the problem size nor the execution time. It provides an integrated framework for measuring and comparing the performance of various interconnects, which is useful for identifying bottlenecks among the cluster compute nodes. We extended the framework to measure the performance of memory transfers between the host and the GPU.

NetPIPE methodology: for each buffer size of c bytes, three measurements are taken, at c-p, c, and c+p bytes, where p is a user-defined perturbation value. Timing measurements use the gettimeofday() system call. Throughput plot: x-axis is the packet size in bytes on a logarithmic scale; y-axis is the throughput in Gigabits/sec. Latency plot: x-axis is the packet size in bytes on a logarithmic scale; y-axis is the latency in seconds on a logarithmic scale.

NetPIPE modules (diagram): two-sided protocols: MPI (MPICH, LAM/MPI), TCP, PVM; native software layers: Myrinet, TCGMSG, InfiniBand; one-sided protocols: ARMCI, MPI-2 one-sided (MPI_Put/MPI_Get), LAPI, SHMEM/GPSHMEM one-sided puts/gets; internal systems: memcpy, cudaMemcpy, cuMemcpy.

NetPIPE cudaMemcpy: CUDA provides two APIs for managing contexts, memory, and data transfers, the run-time API and the driver API. NetPIPE modules are available for both (NPcudaMemcpy and NPcuMemcpy). Host memory is allocated either as paged memory via malloc() or as pinned (page-locked) memory via the CUDA API. Memory on the device is allocated as a linear array. Three types of tests were performed: one-sided copies between the host and the device, roundtrip copies between the host and the device, and device-to-device memory copies.
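A minimal sketch of the two host-allocation paths and the copy directions exercised by these tests (illustrative only, not the NetPIPE module source; the buffer size and variable names are hypothetical):

    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        size_t bytes = 1 << 24;                       /* 16 MB test buffer (hypothetical size) */
        float *paged, *pinned, *dev;

        paged = (float *) malloc(bytes);              /* pageable host memory */
        cudaMallocHost((void **) &pinned, bytes);     /* pinned (page-locked) host memory */
        cudaMalloc((void **) &dev, bytes);            /* linear array in device memory */

        cudaMemcpy(dev, paged, bytes, cudaMemcpyHostToDevice);   /* pageable: staged internally */
        cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);  /* pinned: direct DMA, usually faster */
        cudaMemcpy(pinned, dev, bytes, cudaMemcpyDeviceToHost);  /* roundtrip back to the host */

        cudaFree(dev);
        cudaFreeHost(pinned);
        free(paged);
        return 0;
    }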

NetPIPE cudaMemcpy algorithm.
Input: one-sided or roundtrip (RT); host-to-device, device-to-host, or device-to-device; paged or pinned host memory.
Output: latency in seconds and throughput in Gigabits/sec.
/* Begin */
T = MAXTIME
for i = 1 to NTRIALS do
    t0 = time()
    for j = 1 to NREPEAT do
        if RT then
            copy buf from host (paged/pinned) to dev
            copy buf from dev to host (paged/pinned)
        else
            copy buf from host (paged/pinned) to dev
    t1 = time()
    T = min(T, t1 - t0)                      /* keep the minimum value of T */
T = T / ((1 + RT) * NREPEAT)                 /* RT = 1 for roundtrip, 0 for one-sided */
NREPEAT = target / ((bsz2/bsz1) * tlast)     /* adapt the repeat count for the next buffer size */
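A host-side sketch of this timing loop, assuming blocking cudaMemcpy calls and using gettimeofday() as in NetPIPE (the function and parameter names are hypothetical placeholders):

    #include <sys/time.h>
    #include <cuda_runtime.h>

    static double now(void)                          /* wall-clock time via gettimeofday() */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec * 1e-6;
    }

    /* Time host-to-device copies of 'bytes' bytes; keep the best of ntrials trials. */
    double time_h2d(void *dev, void *host, size_t bytes, int ntrials, int nrepeat)
    {
        double best = 1e30;
        for (int i = 0; i < ntrials; i++) {
            double t0 = now();
            for (int j = 0; j < nrepeat; j++)
                cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
            cudaDeviceSynchronize();                 /* blocking copies make this redundant; kept for safety */
            double t1 = now();
            best = (t1 - t0 < best) ? (t1 - t0) : best;
        }
        return best / nrepeat;                       /* seconds per copy */
    }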

NetPIPE cudaMemcpy (host to device), comparison with the CUDA SDK bandwidth test:
SDK results: 16855040 bytes: 2012.5 MB/sec; 33632256 bytes: 2075 MB/sec; 67186688 bytes: 2102 MB/sec.
NetPIPE results: 16777216 bytes: 2043 MB/sec; 33554432 bytes: 2087 MB/sec; 67108864 bytes: 2103 MB/sec.

NetPIPE cudaMemcpy (device to host): host-to-device throughput peaks at buffer sizes around 6 MB, while device-to-host throughput peaks around 1 MB. The latency ratio between the two directions varies roughly between 1 and 3 times. With pinned buffers, device-to-host throughput is higher than host-to-device throughput.

NetPIPE cudaMemcpy (device to device): the T10 has 4 GB of 512-bit GDDR3 memory. A peak device-to-device throughput of 72 GB/s is reached at buffer sizes around 16 MB; sensitivity to the perturbation values is seen at smaller buffer sizes.

Basic Linear Algebra Subroutines (GEMM): C := alpha*op(A)*op(B) + beta*C, where alpha and beta are scalars; A, B, and C are matrices of dimension m x k, k x n, and m x n respectively; and op(X) = X or X^T. CUBLAS is the implementation of the BLAS routines on top of the CUDA driver. It provides functions to create matrix and vector objects in GPU memory space and to move data between the host and the device, and it uses column-major ordering as in the Fortran language. GEMM performs O(N^3) computations on O(N^2) data, giving an O(N) computation-to-communication ratio.
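A sketch of a single-precision GEMM call through the legacy CUBLAS interface of that era (illustrative only; the square, non-transposed case is assumed and the variable names are hypothetical):

    #include <cublas.h>

    /* C := alpha*A*B + beta*C with square n x n matrices, column-major storage. */
    void gpu_sgemm(int n, float alpha, const float *A, const float *B, float beta, float *C)
    {
        float *dA, *dB, *dC;
        cublasInit();
        cublasAlloc(n * n, sizeof(float), (void **) &dA);   /* matrix objects in device memory */
        cublasAlloc(n * n, sizeof(float), (void **) &dB);
        cublasAlloc(n * n, sizeof(float), (void **) &dC);

        cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);  /* host -> device */
        cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);
        cublasSetMatrix(n, n, sizeof(float), C, n, dC, n);

        cublasSgemm('N', 'N', n, n, n, alpha, dA, n, dB, n, beta, dC, n);

        cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);  /* device -> host */
        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }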

Intel Math Kernel Library (MKL) and KMP_AFFINITY: MKL provides highly optimized, multithreaded BLAS routines based on the OpenMP run-time library; version 10.0.1 was used for this study. The thread affinity interface restricts the execution of particular threads to a subset of the physical processing units; performance varies if threads are not scheduled across all physical processing units. The high-level affinity interface is controlled through the KMP_AFFINITY environment variable.
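For the CPU side, a minimal sketch of a multithreaded MKL DGEMM call (illustrative only; the column-major, non-transposed square case is assumed, and KMP_AFFINITY would be set in the environment before launching the program rather than in the code):

    #include <mkl.h>

    /* C := alpha*A*B + beta*C with square n x n matrices (column major).          */
    /* Thread placement is controlled outside the program, for example with        */
    /* an environment setting such as KMP_AFFINITY=granularity=fine,compact.       */
    void cpu_dgemm(int n, double alpha, const double *A, const double *B, double beta, double *C)
    {
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A, n, B, n, beta, C, n);
    }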

cublasSgemm / cublasDgemm. The Tesla T10 has 30 SMs with 8 SPs per SM; one MAD counts as 2 floating-point operations.
GPU DP peak = 30 DP units * 1.3 GHz * 2 = 78 GFLOPS
GPU SP peak = 240 SPs * 1.3 GHz * 3 flops/cycle (MAD dual-issued with a MUL) = 936 GFLOPS
CPU DP peak = 4 flops/cycle * 3 GHz * 4 cores = 48 GFLOPS per processor
CPU SP peak = 8 flops/cycle * 3 GHz * 4 cores = 96 GFLOPS per processor
cublasSgemm / cublasDgemm GFLOPS by matrix size (m = n = k):
8128: 344.32 / 69.96; 8136: 167.08 / 48.46; 8144: 167.42 / 62.11; 8152: 167.46 / 48.46; 8160: 245.47 / 62.11; 8168: 167.13 / 48.46; 8172: 166.58 / 62.11; 8180: 166.08 / 48.44.

Performance of the SGEMM subroutine: the MKL SGEMM scaled well to 8 cores, although explicit control of thread placement was necessary to reach its maximum of 157.8 GFLOPS. The cublasSgemm performance is segmented based on the matrix size, with a maximum of about 345 GFLOPS.

Performance of the DGEMM subroutine: the MKL DGEMM scaled well to 8 cores, although explicit thread control via KMP_AFFINITY was necessary to reach its maximum of 75.82 GFLOPS. The cublasDgemm performance is segmented over the input matrix size, with a maximum of about 70 GFLOPS. Single-precision performance is by far the highest, so it should be harnessed as much as possible; techniques for mixed-precision algorithms should be developed so that a judicious choice can be made.

Conclusions: A cudaMemcpy module was developed within the NetPIPE framework to measure memory-transfer overheads, and the results were compared against the bandwidth benchmark program provided with the CUDA SDK. The single-precision cublasSgemm is roughly 3 times faster than the MKL version, while the performance of cublasDgemm is comparable to that of MKL DGEMM with 8 threads.

Future Work: Study the performance of asynchronous memory copies; copies to different devices from independent CPU threads; and copies to remote GPU memory via InfiniBand one-sided communication, which would be useful for extending the distributed data array programming model to GPU device memory. A NetPIPE port for OpenCL is also planned.

Acknowledgements: Professor Mark S. Gordon; Professor Theresa Windus; Andrey Asadchev, PhD candidate; Mark Klein, systems personnel, SCL, Ames Laboratory; NVIDIA Corporation.

Questions