1 rCUDA: an approach to provide remote access to GPU computational power. Rafael Mayo Gual, Universitat Jaume I, Spain. (1 of 59) HPC Advisory Council Workshop

2 Outline: GPU computing; rCUDA goals; rCUDA structure; rCUDA implementations; rCUDA current status.

4 GPU computing: it will be the near future of HPC. In fact, it is already here!

5 GPU computing: the GPU has been the first massively parallel hardware. For the right kind of code, GPU computing brings huge benefits in terms of performance and energy. Development tools and libraries facilitate the use of the GPU.

6 GPU computing: there are two main approaches to GPU computing development environments. CUDA: NVIDIA proprietary. OpenCL: open standard.

7 GPU computing: OpenCL and CUDA follow basically the same work scheme. Compilation: the CPU code and the GPU code (the GPU kernel) are compiled separately. Running: data is transferred between the CPU and GPU memory spaces. Before kernel execution, data moves from the CPU memory space to the GPU memory space. Computation: the kernel executes. After kernel execution, results move from the GPU memory space back to the CPU memory space.

8 GPU computing: not all algorithms benefit from GPU power. In some cases only part of a program must run on a GPU. Depending on the algorithm, the GPU can be idle for long periods.

9 GPU computing: you can find two different scenarios. Scenario 1: if all your programs are going to use the GPU for long periods, add a GPU to each node. You don't need our tool.

10 GPU computing: you can find two different scenarios. Scenario 2: only part of your programs are going to use the GPU, or all your programs use the GPU but only part-time. You could think of adding a GPU only to some nodes. OUR TOOL CAN HELP YOU!

11 GPU computing: cost from the energy point of view. An NVIDIA Tesla S2050 draws nearly 900 W (TDP specification). With 75% usage, 25% is wasted. Then, for each node (approx.), 160 kWh are wasted per month and 2 MWh per year. This could amount to several hundred kg of CO2 per year.
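
The estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes, as the slide implicitly does, that an idle GPU system still draws its full 900 W TDP; it also assumes a 30-day month and a 365-day year. The function name `wasted_energy_kwh` is ours and is not part of rCUDA.

```python
def wasted_energy_kwh(tdp_watts: float, idle_fraction: float, hours: float) -> float:
    """Energy in kWh drawn while idle, assuming full TDP draw when idle."""
    return tdp_watts * idle_fraction * hours / 1000.0


TDP_WATTS = 900.0      # Tesla S2050 TDP quoted on the slide
IDLE_FRACTION = 0.25   # 75% usage leaves the GPU idle 25% of the time

per_month_kwh = wasted_energy_kwh(TDP_WATTS, IDLE_FRACTION, 24 * 30)
per_year_kwh = wasted_energy_kwh(TDP_WATTS, IDLE_FRACTION, 24 * 365)

print(f"wasted per month: {per_month_kwh:.0f} kWh")
print(f"wasted per year:  {per_year_kwh / 1000.0:.2f} MWh")
```

Under these assumptions the result is 162 kWh per month and about 1.97 MWh per year, which matches the rounded figures on the slide.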

12 Outline: GPU computing; rCUDA goals; rCUDA structure; rCUDA implementations; rCUDA current status.

13 rCUDA: add only the GPU computing nodes that provide the necessary computational power.

14 rCUDA: rCUDA provides remote access from each node to any GPU in the system.

15 Outline: GPU computing; Cost of a node; rCUDA goals; rCUDA structure; rCUDA implementations; rCUDA current status.

16 rCUDA structure. [Diagram] CUDA application: the Application layer sits on top of the CUDA driver + runtime.

17 rCUDA structure. [Diagram] Client side: Application on top of the CUDA driver + runtime. Server side: Application on top of the CUDA driver + runtime.

18 rCUDA structure. [Diagram] Client side: Application, rCUDA library, network interface. Server side: rCUDA daemon, network interface, CUDA driver + runtime.
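
The client/server split on this slide can be illustrated with a toy model. The sketch below is not rCUDA code: the class names, the JSON message format, and the `launch_scale` operation are all hypothetical, and the "network" is a plain in-process function call. It only shows the idea that the client-side library exposes the calls while the daemon on the GPU node owns the device memory and executes them.

```python
import json


class Daemon:
    """Hypothetical stand-in for the daemon running on the GPU node."""

    def __init__(self):
        self.device_memory = {}   # handle -> list of floats (models GPU memory)
        self.next_handle = 1

    def handle(self, request: bytes) -> bytes:
        req = json.loads(request.decode("utf-8"))
        if req["op"] == "malloc":
            handle = self.next_handle
            self.next_handle += 1
            self.device_memory[handle] = [0.0] * req["count"]
            reply = {"status": 0, "handle": handle}
        elif req["op"] == "memcpy_h2d":
            self.device_memory[req["handle"]] = list(req["data"])
            reply = {"status": 0}
        elif req["op"] == "launch_scale":
            # Stand-in for a kernel launch: scale the buffer on the "device".
            buf = self.device_memory[req["handle"]]
            self.device_memory[req["handle"]] = [x * req["factor"] for x in buf]
            reply = {"status": 0}
        elif req["op"] == "memcpy_d2h":
            reply = {"status": 0, "data": self.device_memory[req["handle"]]}
        else:
            reply = {"status": 1}
        return json.dumps(reply).encode("utf-8")


class ClientLibrary:
    """Hypothetical stand-in for the client-side library: no local GPU needed."""

    def __init__(self, send):
        self._send = send   # callable bytes -> bytes; models the network interface

    def _call(self, **request):
        raw = self._send(json.dumps(request).encode("utf-8"))
        reply = json.loads(raw.decode("utf-8"))
        if reply["status"] != 0:
            raise RuntimeError(f"remote call failed: {request['op']}")
        return reply

    def malloc(self, count):
        return self._call(op="malloc", count=count)["handle"]

    def memcpy_h2d(self, handle, data):
        self._call(op="memcpy_h2d", handle=handle, data=list(data))

    def launch_scale(self, handle, factor):
        self._call(op="launch_scale", handle=handle, factor=factor)

    def memcpy_d2h(self, handle):
        return self._call(op="memcpy_d2h", handle=handle)["data"]


daemon = Daemon()
gpu = ClientLibrary(send=daemon.handle)   # in-process "network" for this sketch
d_x = gpu.malloc(3)
gpu.memcpy_h2d(d_x, [1.0, 2.0, 3.0])      # host to device
gpu.launch_scale(d_x, 2.0)                # "kernel" runs on the server side
result = gpu.memcpy_d2h(d_x)              # device to host
print(result)
```

Replacing the `send` callable with a real socket round trip would leave the application-facing calls unchanged, which is the property the slide's layering is meant to convey.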

19 rCUDA functionality. CUDA programming consists of C extensions and a runtime library. C extensions: not supported in the current version of rCUDA; we don't want to rewrite a compiler (for now). Runtime library: support for almost all functions. For some internal functions NVIDIA does not provide information, so they are not supported in rCUDA.

20 rCUDA functionality: supported CUDA 4.0 runtime functions, listed as module: functions / supported. Device management: counts not legible. Error handling: 3 / 3. Event management: 7 / 7. Execution control: 7 / 7. Memory management: counts not legible. Peer device memory access: 5 / 4. Stream management: 2 / 2. Surface reference management: 8 / 8. Texture reference management: 6 / 6. Thread management: 6 / 6. Version management: 2 / 2.

21 rCUDA functionality: CUDA 4.0 runtime functions NOT YET supported, listed as module: functions / supported. Unified addressing: 11 / 0. Peer device memory access: 3 / 0. OpenGL interoperability: 3 / 0. Direct3D 9 interoperability: 5 / 0. Direct3D 10 interoperability: 5 / 0. Direct3D 11 interoperability: 5 / 0. VDPAU interoperability: 4 / 0. Graphics interoperability: 6 / 0.

22 rCUDA functionality: supported CUBLAS functions. Modules: helper function reference, BLAS1, BLAS2, BLAS3.

23 Outline: GPU computing; rCUDA goals; rCUDA structure; rCUDA implementations; rCUDA current status.

24 rCUDA: basic TCP/IP version. Characteristics: it uses the TCP/IP stack. It is a basic version meant to show the functionality and to estimate the overhead due to the communication network. It runs over all TCP/IP networks: Ethernet, InfiniBand, etc.

25 rCUDA: basic TCP/IP version. Example of rCUDA interaction: rCUDA initialization. [Sequence diagram over time between the client application, the network, and the server daemon. Recoverable labels: query; locate and send kernel; load kernel; return result.]

26 rCUDA: basic TCP/IP version. Example of rCUDA interaction: cudaMemcpy(..., cudaMemcpyHostToDevice). Client application: copy data from the application to the send buffers, then send the buffers to the server. Server daemon: copy data from the receive buffers to the daemon buffers, then copy the data to GPU memory.

27 rCUDA: basic TCP/IP version. Main problem: data movement overhead. In CUDA this overhead is due to the PCIe transfer. In rCUDA it is due to the network transfer plus the PCIe transfer (the latter also appears in CUDA).
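
A first-order model of this overhead is sketched below. It is an illustration, not the authors' measurement method: it ignores latency and any overlap between the two transfers, assumes single-precision elements (4 bytes), and uses nominal line rates for the networks. The PCIe bandwidth of 5 GB/s is a placeholder value chosen only for the example.

```python
def gemm_bytes(n: int, element_size: int = 4) -> int:
    """Bytes moved for an n x n GEMM: two input matrices plus one result matrix."""
    return 3 * n * n * element_size


def cuda_transfer_time(n: int, pcie_bw: float) -> float:
    """Local GPU: data only crosses PCIe. Bandwidth in bytes per second."""
    return gemm_bytes(n) / pcie_bw


def rcuda_transfer_time(n: int, network_bw: float, pcie_bw: float) -> float:
    """Remote GPU: data crosses the network and then PCIe (no overlap assumed)."""
    return gemm_bytes(n) / network_bw + gemm_bytes(n) / pcie_bw


PCIE_BW = 5.0e9                 # assumed placeholder, bytes per second
NETWORKS = {                    # nominal line rates, bytes per second
    "1 Gbit Ethernet": 125.0e6,
    "10 Gbit Ethernet": 1.25e9,
}

n = 4096
print(f"local CUDA: {cuda_transfer_time(n, PCIE_BW):.3f} s")
for name, bw in NETWORKS.items():
    print(f"rCUDA over {name}: {rcuda_transfer_time(n, bw, PCIE_BW):.3f} s")
```

In this model the difference between the remote and local cases is exactly the network term, so a faster network shrinks the gap toward zero.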

28 rCUDA: basic TCP/IP version. [Figure: data transfer time for matrix-matrix multiplication (GEMM), with 2 data matrices sent from the client to the remote GPU and 1 result matrix returned from the remote GPU to the client. Legend entries: Gb Ethernet, 10Gb Ethernet, rCUDA, CUDA. Axes: time (msec) versus matrix dimension.]

29 rCUDA: basic TCP/IP version. [Figure: execution time for matrix-matrix multiplication. Platform: Tesla C1060, Intel Xeon E5410 at 2.33 GHz, 10Gb Ethernet. Legend entries: CPU, kernel execution, data transfer, rCUDA data transfer, rCUDA misc. Axes: time (sec) versus matrix dimension.]

30 rCUDA: basic TCP/IP version. [Figure: estimated execution time for matrix multiplication, including data transfers, for some HPC networks. Legend entries: CPU, Gbit Ethernet, 10Gbit InfiniBand, 40Gbit InfiniBand. Axes: time (sec) versus matrix dimension.]

31 rCUDA: basic TCP/IP version. We have shown the functionality (almost all CUDA SDK examples have been tested). As we decrease the network overhead, the performance of our solution will approach that of native CUDA.

32 rCUDA: InfiniBand version. Why? InfiniBand is the most used HPC network: low latency, high bandwidth. As shown, good results are expected...

33 rCUDA: InfiniBand version. Facts: use of IB Verbs, so all the TCP/IP software stack overhead is removed. A bandwidth test of our IB network gives about [...] MB/sec. Our goal is to get near this peak bandwidth.

34 rCUDA: InfiniBand version, but... The bandwidth is far from the peak. We want to be closer to the peak IB bandwidth. What can we do? Reduce the data movements between memory buffers. Overlap the memory accesses with the network communication.

35 rCUDA: optimized InfiniBand version. Same user-level functionality. Client to/from remote GPU bandwidth near the peak InfiniBand network bandwidth. Use of GPUDirect: reduces the number of memory copies. Use of pipelined transfers: overlaps memory copies and communications.

36-39 rCUDA: GPUDirect. [Diagram built up over four slides; components: InfiniBand, chipset, GPU memory, CPU, main memory.] Without GPUDirect, the data makes three moves: GPU memory to main memory, main memory to main memory, and main memory to network.

40-42 rCUDA: GPUDirect. [Diagram built up over three slides; components: InfiniBand, chipset, GPU memory, CPU, main memory.] With GPUDirect, the data makes two moves: GPU memory to main memory, and main memory to network. A memory copy is avoided.

43 rCUDA: pipelined transfers. [Diagram: a client node and a server node connected through the network. Each node shows CPU, chipset, main memory, and InfiniBand; the server node also shows GPU memory. A timeline below has three rows: client, network, server.]

44-46 rCUDA: pipelined transfers. [Diagram built up over three slides.] Without pipelined transfers, the three steps run one after another on the timeline: copy to network buffers (client), data transfer (network), copy to GPU (server).

47-51 rCUDA: pipelined transfers. [Diagram built up over five slides.] With pipelined transfers, the data moves in several blocks and the steps overlap on the timeline: while one block is in the data transfer step, the next block is being copied to the network buffers on the client and the previous block is being copied to the GPU on the server.
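
The benefit of overlapping the three steps can be estimated with a small timing model. The sketch below is an illustration under stated assumptions, not the rCUDA implementation: each block takes a fixed time in each step, a step handles one block at a time, and there is unlimited buffering between steps. The function names and the example step times are ours.

```python
def sequential_time(stage_times, num_blocks):
    """No pipelining: every block goes through all steps before the next starts."""
    return num_blocks * sum(stage_times)


def pipelined_time(stage_times, num_blocks):
    """Pipelining: a block enters a step as soon as the block and the step are free."""
    finish = [0.0] * len(stage_times)   # finish time of the latest block in each step
    for _ in range(num_blocks):
        ready = 0.0                     # when this block leaves the previous step
        for step, duration in enumerate(stage_times):
            start = max(ready, finish[step])
            finish[step] = start + duration
            ready = finish[step]
    return finish[-1]


# Per-block times (arbitrary units) for: copy to network buffers,
# data transfer, copy to GPU. The values are made up for the example.
STAGE_TIMES = [1.0, 2.0, 1.0]
NUM_BLOCKS = 4

t_seq = sequential_time(STAGE_TIMES, NUM_BLOCKS)
t_pipe = pipelined_time(STAGE_TIMES, NUM_BLOCKS)
print(f"without pipelining: {t_seq}")
print(f"with pipelining:    {t_pipe}")
```

With these example values the total drops from 16 to 10 time units: once the pipeline is full, the total time is governed by the slowest step (here the data transfer) rather than by the sum of all three.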

52 rCUDA: optimized InfiniBand version. [Figure: bandwidth (MB/sec) for the 4096x4096 matrix-matrix product over 40Gb InfiniBand. IB peak bandwidth: 2900 MB/sec. Bars: rCUDA GigaE, rCUDA IPoIB, rCUDA IBVerbs, CUDA.]

53 rCUDA: optimized InfiniBand version. [Figure: time (sec) for the matrix-matrix product (4096x4096). Platform: GeForce GTX, Intel Xeon E5645. Bars: rCUDA IPoIB, rCUDA GigaE, CUDA, rCUDA IBVerbs, CPU (MKL). Bar values as extracted, not matched to bars: 2.28, 1.30, 0.70, 0.65, 0.62.]

54 Outline: GPU computing; rCUDA goals; rCUDA structure; rCUDA implementations; rCUDA current status.

55 rCUDA: work in progress. rCUDA port to Microsoft. Thread-safe rCUDA. rCUDA support for CUDA 4.0. Support for CUDA C/C++ extensions. rOpenCL.

56 rCUDA: near future. Support for iWARP communications. Dynamic remote scheduling. Workload balance. Remote data cache. Remote kernel cache.

57 rCUDA: more information. (1) "GPU virtualization in high performance clusters", J. Duato, F. Igual, R. Mayo, A. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization and High-Performance Cloud Computing, VHPC'09. (2) "rCUDA: reducing the number of GPU-based accelerators in high performance clusters", J. Duato, A. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems, OPTIM. (3) "Performance of CUDA virtualized remote GPUs in high performance clusters", J. Duato, R. Mayo, A. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing, ICPP 2011 (accepted).

58 rCUDA: credits. Parallel Architectures Group, Technical University of València (Spain). High Performance Computing and Architectures Group, University Jaume I of Castelló (Spain).

59 rCUDA. Thanks to ... and ... for their hardware donation for the development of this work. MORE INFORMATION: poster session (Tuesday 21 and Wednesday 22). Thanks for your attention. Questions?


Finite Element Integration and Assembly on Modern Multi and Many-core Processors Finite Element Integration and Assembly on Modern Multi and Many-core Processors Krzysztof Banaś, Jan Bielański, Kazimierz Chłoń AGH University of Science and Technology, Mickiewicza 30, 30-059 Kraków,

More information

VSC Users Day 2018 Start to GPU Ehsan Moravveji

VSC Users Day 2018 Start to GPU Ehsan Moravveji Outline A brief intro Available GPUs at VSC GPU architecture Benchmarking tests General Purpose GPU Programming Models VSC Users Day 2018 Start to GPU Ehsan Moravveji Image courtesy of Nvidia.com Generally

More information

Accelerating image registration on GPUs

Accelerating image registration on GPUs Accelerating image registration on GPUs Harald Köstler, Sunil Ramgopal Tatavarty SIAM Conference on Imaging Science (IS10) 13.4.2010 Contents Motivation: Image registration with FAIR GPU Programming Combining

More information

OpenPOWER Performance

OpenPOWER Performance OpenPOWER Performance Alex Mericas Chief Engineer, OpenPOWER Performance IBM Delivering the Linux ecosystem for Power SOLUTIONS OpenPOWER IBM SOFTWARE LINUX ECOSYSTEM OPEN SOURCE Solutions with full stack

More information

Motivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University

Motivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University Part 1: General introduction Ch. Hoelbling Wuppertal University Lattice Practices 2011 Outline 1 Motivation 2 Hardware Overview History Present Capabilities 3 Programming model Past: OpenGL Present: CUDA

More information

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구

MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 MELLANOX EDR UPDATE & GPUDIRECT MELLANOX SR. SE 정연구 Leading Supplier of End-to-End Interconnect Solutions Analyze Enabling the Use of Data Store ICs Comprehensive End-to-End InfiniBand and Ethernet Portfolio

More information

High-Performance Computing Using GPUs

High-Performance Computing Using GPUs High-Performance Computing Using GPUs Luca Caucci caucci@email.arizona.edu Center for Gamma-Ray Imaging November 7, 2012 Outline Slide 1 of 27 Why GPUs? What is CUDA? The CUDA programming model Anatomy

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 GPU ARCHITECTURE Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE CPU versus GPU Why are they different? CUDA

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor

Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Experiences with the Sparse Matrix-Vector Multiplication on a Many-core Processor Juan C. Pichel Centro de Investigación en Tecnoloxías da Información (CITIUS) Universidade de Santiago de Compostela, Spain

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory

GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Presentation Outline NVIDIA

More information

Memcached Design on High Performance RDMA Capable Interconnects

Memcached Design on High Performance RDMA Capable Interconnects Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan

More information

Introduction to CELL B.E. and GPU Programming. Agenda

Introduction to CELL B.E. and GPU Programming. Agenda Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU

More information

Paralization on GPU using CUDA An Introduction

Paralization on GPU using CUDA An Introduction Paralization on GPU using CUDA An Introduction Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction to GPU 2 Introduction to CUDA Graphics Processing

More information

Automatic Development of Linear Algebra Libraries for the Tesla Series

Automatic Development of Linear Algebra Libraries for the Tesla Series Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

More information

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication

More information

Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 2017 Codeplay Software Ltd.

Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 2017 Codeplay Software Ltd. SYCL-BLAS: LeveragingSYCL-BLAS Expression Trees for Linear Algebra Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 1 About me... Phd in Compilers and Parallel

More information

Large scale Imaging on Current Many- Core Platforms

Large scale Imaging on Current Many- Core Platforms Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,

More information

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs

Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Francisco D. Igual Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón (Spain). Matrix Computations on

More information

CPU-GPU Heterogeneous Computing

CPU-GPU Heterogeneous Computing CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems

More information

GPU ARCHITECTURE Chris Schultz, June 2017

GPU ARCHITECTURE Chris Schultz, June 2017 Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE Problems Solved Over Time versus Why are they different? Complex

More information

Introduction to Xeon Phi. Bill Barth January 11, 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013 Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider

More information

Manuel F. Dolz, Juan C. Fernández, Rafael Mayo, Enrique S. Quintana-Ortí. High Performance Computing & Architectures (HPCA)

Manuel F. Dolz, Juan C. Fernández, Rafael Mayo, Enrique S. Quintana-Ortí. High Performance Computing & Architectures (HPCA) EnergySaving Cluster Roll: Power Saving System for Clusters Manuel F. Dolz, Juan C. Fernández, Rafael Mayo, Enrique S. Quintana-Ortí High Performance Computing & Architectures (HPCA) University Jaume I

More information

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands

John W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands Signal Processing on GPUs for Radio Telescopes John W. Romein Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands 1 Overview radio telescopes six radio telescope algorithms on

More information

CP2K Performance Benchmark and Profiling. April 2011

CP2K Performance Benchmark and Profiling. April 2011 CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox

More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability

Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability Atsushi Kawai, Kenji Yasuoka Department of Mechanical Engineering, Keio University Yokohama, Japan

More information

ATS-GPU Real Time Signal Processing Software

ATS-GPU Real Time Signal Processing Software Transfer A/D data to at high speed Up to 4 GB/s transfer rate for PCIe Gen 3 digitizer boards Supports CUDA compute capability 2.0+ Designed to work with AlazarTech PCI Express waveform digitizers Optional

More information

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

More information

Outline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work

Outline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3

More information

Performance potential for simulating spin models on GPU

Performance potential for simulating spin models on GPU Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational

More information

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT 7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT Draft Printed for SECO Murex S.A.S 2012 all rights reserved Murex Analytics Only global vendor of trading, risk management and processing systems focusing also

More information

Mathematical computations with GPUs

Mathematical computations with GPUs Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing

More information

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization

Duksu Kim. Professional Experience Senior researcher, KISTI High performance visualization Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics Why GPU? Chapter 1 Graphics Hardware Graphics Processing Unit (GPU) is a Subsidiary hardware With massively multi-threaded many-core Dedicated to 2D and 3D graphics Special purpose low functionality, high

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012 Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs

More information

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty

More information

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de

More information

Current Trends in Computer Graphics Hardware

Current Trends in Computer Graphics Hardware Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)

More information

Numerical Algorithms on Multi-GPU Architectures

Numerical Algorithms on Multi-GPU Architectures Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information