rcuda: an approach to provide remote access to GPU computational power
1 rcuda: an approach to provide remote access to GPU computational power. Rafael Mayo Gual, Universitat Jaume I, Spain. (1 of 59) HPC Advisory Council Workshop
2 Outline: GPU computing; rcuda goals; rcuda structure; rcuda implementations; rcuda current status.
4 GPU computing: it will be the near future in HPC; in fact, it is already here!
5 GPU computing: GPUs have been the first massively parallel hardware. For the right kind of code, GPU computing brings huge benefits in terms of performance and energy. Development tools and libraries facilitate the use of the GPU.
6 GPU computing: two main approaches in GPU computing development environments: CUDA (NVIDIA proprietary) and OpenCL (an open standard).
7 GPU computing: basically, OpenCL and CUDA share the same work scheme. Compilation: separate the CPU code from the GPU code (the kernel). Running: data transfers between the CPU and GPU memory spaces. Before kernel execution: data from the CPU memory space to the GPU memory space. Computation: kernel execution. After kernel execution: results from the GPU memory space back to the CPU memory space.
8 GPU computing: not all algorithms profit from GPU power. In some cases only part of a program must be run on a GPU. Depending on the algorithm, the GPU can be idle for long periods.
9 GPU computing: you can find two different scenarios. Scenario 1: if all your programs are going to use the GPU for long periods, add a GPU to each node; you don't need our tool.
10 GPU computing: Scenario 2: only part of your programs are going to use the GPU, or all your programs use the GPU but only part-time. You could then think of adding a GPU only to some nodes. OUR TOOL CAN HELP YOU!
11 GPU computing: cost from the energy point of view. An NVIDIA Tesla S2050 draws near 900 watts (TDP specification). Usage: 75%, so 25% is wasted. Then, for each node (approx.), 160 kWh are wasted per month and 2 MWh are wasted per year. That can amount to several hundred kg of CO2 per year.
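The waste figures on this slide follow from a quick back-of-the-envelope computation. A minimal sketch, taking the 900 W TDP and the 75% usage figure from the slide and assuming a 720-hour month:

```python
# Back-of-the-envelope check of the energy-waste figures on this slide.
TDP_WATTS = 900          # NVIDIA Tesla S2050, TDP specification
USAGE = 0.75             # fraction of time the GPU is actually busy
HOURS_PER_MONTH = 24 * 30

wasted_watts = TDP_WATTS * (1 - USAGE)                  # 225 W of idle draw
wasted_kwh_month = wasted_watts * HOURS_PER_MONTH / 1000
wasted_mwh_year = wasted_kwh_month * 12 / 1000

print(f"{wasted_kwh_month:.0f} kWh wasted per month")   # ~162 kWh
print(f"{wasted_mwh_year:.2f} MWh wasted per year")     # ~1.94 MWh
```

The result (about 162 kWh per month and 1.94 MWh per year) matches the rounded 160 kWh and 2 MWh figures quoted on the slide.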
13 rcuda goals: add only the GPUs that provide the necessary computational power.
14 rcuda goals: rcuda provides remote access from each node to any GPU in the system.
15 Outline: GPU computing; cost of a GPU node; rcuda goals; rcuda structure; rcuda implementations; rcuda current status.
16 rcuda structure: in plain CUDA, the application runs directly on top of the CUDA driver + runtime.
17 rcuda structure: the application stays on the client side, while the CUDA driver + runtime sit on the server side.
18 rcuda structure: client side: application, rcuda library, network interface. Server side: network interface, rcuda daemon, CUDA driver + runtime.
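The client/server split on this slide (application → rcuda library → network → rcuda daemon → CUDA runtime) can be sketched with an ordinary socket pair. This is an illustrative model only, not rcuda's actual wire protocol: the call name and arguments are hypothetical, and a pickled, length-prefixed tuple stands in for rcuda's serialization.

```python
import pickle
import socket
import struct
import threading

def recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def send_msg(sock, obj):
    """Length-prefixed send, standing in for the rcuda wire format."""
    data = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(data)) + data)

def recv_msg(sock):
    size = struct.unpack("!I", recv_exact(sock, 4))[0]
    return pickle.loads(recv_exact(sock, size))

def daemon(server_sock):
    """Server side: receive a forwarded API call and return a result."""
    conn, _ = server_sock.accept()
    call, args = recv_msg(conn)          # e.g. ("cudaMalloc", (1024,))
    send_msg(conn, ("ok", f"{call}{args} executed on remote GPU"))
    conn.close()

# Client side: the rcuda library intercepts a CUDA call and forwards it.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=daemon, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(server.getsockname())
send_msg(client, ("cudaMalloc", (1024,)))
status, result = recv_msg(client)
print(status, result)
```

The point of the sketch is the division of labour: the client never touches a GPU; it only serializes the call, and everything on the right of the network interface happens in the daemon's address space.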
19 rcuda functionality: CUDA programming comprises the C extensions and the runtime library. C extensions: not supported in the current version of rcuda; we don't want to rewrite a compiler (for now). Runtime library: support for almost all functions; for some internal functions NVIDIA does not give information, so those are not supported in rcuda.
20 rcuda functionality: supported CUDA 4.0 runtime functions, per module (functions / supported): Device management; Error handling: 3/3; Event management: 7/7; Execution control: 7/7; Memory management; Peer device memory access: 5/4; Stream management: 2/2; Surface reference management: 8/8; Texture reference management: 6/6; Thread management: 6/6; Version management: 2/2.
21 rcuda functionality: CUDA 4.0 runtime functions not yet supported, per module (functions / supported): Unified addressing: 11/0; Peer device memory access: 3/0; OpenGL interoperability: 3/0; Direct3D 9 interoperability: 5/0; Direct3D 10 interoperability: 5/0; Direct3D 11 interoperability: 5/0; VDPAU interoperability: 4/0; Graphics interoperability: 6/0.
22 rcuda functionality: supported CUBLAS functions, by module: helper function reference and the BLAS routines.
24 rcuda: basic TCP/IP version. Characteristics: it uses the TCP/IP stack; it is a basic version to show the functionality and to estimate the overhead due to the communication network; it runs over all TCP/IP networks: Ethernet, InfiniBand, etc.
25 rcuda: basic TCP/IP version. Example of rcuda interaction (initialization): the client application sends a GPU query to the server daemon, the daemon loads the kernel software, the client locates and sends the kernel, and the daemon returns the result; each exchange is a SEND/RECEIVE data transfer over the network, laid out over time.
26 rcuda: basic TCP/IP version. Example of rcuda interaction for cudaMemcpy(..., cudaMemcpyHostToDevice): the client copies the data from the application to the send buffers and sends the buffers to the server; the server daemon copies the data from the receive buffers to the daemon buffers and then copies it to GPU memory.
27 rcuda: basic TCP/IP version. Main problem: data-movement overhead. In CUDA this overhead is due to the PCIe transfer. In rcuda it is due to the network transfer plus the PCIe transfer (the latter also appears in CUDA).
28 rcuda: basic TCP/IP version. Data transfer time for the matrix-matrix multiplication (GEMM): two data matrices from the client to the remote GPU and one result matrix from the remote GPU back to the client. [Chart: time in msec versus matrix dimension, comparing rcuda over 1Gb and 10Gb Ethernet against CUDA.]
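The transfer volume behind this chart is easy to model. A sketch under stated assumptions: single-precision elements, a 10 Gb Ethernet link, and an effective-bandwidth factor of 0.7 that is an illustration, not a measured value from the slides:

```python
def gemm_transfer_time(n, link_gbit=10.0, efficiency=0.7, elem_bytes=4):
    """Time to move the GEMM operands for rcuda: two n-by-n input
    matrices to the remote GPU plus one result matrix back."""
    total_bytes = 3 * n * n * elem_bytes              # A and B out, C back
    link_bytes_per_sec = link_gbit * 1e9 / 8 * efficiency
    return total_bytes / link_bytes_per_sec

for n in (1024, 2048, 4096):
    print(f"n={n}: {gemm_transfer_time(n) * 1000:.1f} ms")
```

Because the moved data grows as n^2 while GEMM's work grows as n^3, the relative weight of this transfer overhead shrinks for larger matrices, which is why the gap between rcuda and CUDA narrows as the matrix dimension grows.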
29 rcuda: basic TCP/IP version. Execution time for the matrix-matrix multiplication on a Tesla C1060 with an Intel Xeon E5410 at 2.33 GHz over 10Gb Ethernet. [Chart: time in seconds versus matrix dimension; series: CPU, GPU kernel execution, rcuda kernel and data transfer, rcuda data transfer, CUDA data transfer, rcuda misc.]
30 rcuda: basic TCP/IP version. Estimated execution time for the matrix multiplication, including data transfers, for some HPC networks. [Chart: time in seconds versus matrix dimension; series: CPU, 1Gbit Ethernet, 10Gbit InfiniBand, 40Gbit InfiniBand.]
31 rcuda: basic TCP/IP version. We have shown the functionality (almost all CUDA SDK examples have been tested). As we decrease the network overhead, our solution will have performance close to the CUDA solution.
32 rcuda: InfiniBand version. Why? InfiniBand is the most used HPC network: low latency and high bandwidth. As shown, good results are expected...
33 rcuda: InfiniBand version facts: use of IB Verbs, so all the TCP/IP software-stack overhead is left out. A bandwidth test of our IB network gives about 2900 MB/sec; our goal is to get near this peak bandwidth.
34 rcuda: InfiniBand version, but... the bandwidth is still far from the peak, and we want to be closer to the peak IB bandwidth. What can we do? Reduce the data movements between memory buffers, and overlap the memory accesses with the network communication.
35 rcuda: optimized InfiniBand version. Same user-level functionality. Client-to/from-remote-GPU bandwidth near the peak InfiniBand network bandwidth. Use of GPUDirect to reduce the number of memory copies. Use of pipelined transfers to overlap memory copies and communications.
36 rcuda: GPUDirect. Without GPUDirect (diagram: CPU, chipset, main memory, GPU memory and InfiniBand adapter): a transfer goes from GPU memory to main memory, from main memory to main memory (the network buffers), and from main memory to the network.
40 rcuda: GPUDirect. With GPUDirect: a transfer goes from GPU memory to main memory and from main memory straight to the network. A memory copy is avoided.
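The saved copy can be put in numbers with a minimal staging model. This is a sketch with illustrative bandwidth figures, not measurements from the slides: without GPUDirect the server performs two host-side memory copies before the data reaches the wire, with GPUDirect only one.

```python
def staged_transfer_time(n_bytes, copy_bw, net_bw, host_copies):
    """Total time for a non-pipelined transfer that performs
    `host_copies` memory copies before hitting the network."""
    return host_copies * n_bytes / copy_bw + n_bytes / net_bw

MB = 1024 * 1024
size = 64 * MB
copy_bw = 6.0e9     # host memcpy bandwidth, illustrative
net_bw = 2.9e9      # roughly the ~2900 MB/sec IB peak from the slides

without = staged_transfer_time(size, copy_bw, net_bw, host_copies=2)
with_gd = staged_transfer_time(size, copy_bw, net_bw, host_copies=1)
print(f"without GPUDirect: {without * 1e3:.1f} ms")
print(f"with GPUDirect:    {with_gd * 1e3:.1f} ms")
```

With these (assumed) bandwidths the avoided copy removes roughly a quarter of the end-to-end transfer time, which is why the optimized version combines GPUDirect with pipelining rather than relying on either alone.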
43 rcuda: pipelined transfers (diagram: a client node and a server node, each with CPU, chipset and main memory, connected by InfiniBand; the server node also holds the GPU memory).
44 rcuda: pipelined transfers. Without pipelined transfers: the client first copies all the data to the network buffers, then the data is transferred over the network, and only then is it copied to the GPU; each stage waits for the previous one to complete.
47 rcuda: pipelined transfers. With pipelined transfers: the data is split into chunks, and while one chunk travels over the network, the next one is already being copied into the network buffers and the previous one is being copied to the GPU; the buffer copies overlap with the communication.
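The overlap in the last few slides can be captured with a classic pipeline formula. A toy model (chunk count and per-stage costs are invented for illustration): once the pipeline is full, the transfer advances one chunk per slowest stage instead of paying copy + network + GPU copy for every chunk.

```python
def sequential_time(chunks, copy_t, net_t, gpu_t):
    """No pipelining: copy everything, then send it, then copy to the GPU."""
    return chunks * (copy_t + net_t + gpu_t)

def pipelined_time(chunks, copy_t, net_t, gpu_t):
    """Three-stage pipeline: fill and drain once, then one chunk
    per slowest stage."""
    slowest = max(copy_t, net_t, gpu_t)
    return copy_t + net_t + gpu_t + (chunks - 1) * slowest

CHUNKS, COPY_T, NET_T, GPU_T = 8, 1.0, 2.0, 1.0   # arbitrary time units
print("sequential:", sequential_time(CHUNKS, COPY_T, NET_T, GPU_T))  # 32.0
print("pipelined: ", pipelined_time(CHUNKS, COPY_T, NET_T, GPU_T))   # 18.0
```

With many chunks the pipelined time approaches chunks x max(stage), i.e. the transfer becomes limited by the network alone, which is exactly the behaviour the optimized InfiniBand version is after.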
52 rcuda: optimized InfiniBand version. Bandwidth for the matrix-matrix product (4096x4096). [Chart: bandwidth in MB/sec for rcuda GigaE, rcuda IPoIB and rcuda IBVerbs over 40Gb InfiniBand versus CUDA; the IB peak bandwidth is 2900 MB/sec.]
53 rcuda: optimized InfiniBand version. Time for the matrix-matrix product (4096x4096) on a GeForce GTX with an Intel Xeon E5645. [Chart: time in seconds for rcuda IPoIB, rcuda GigaE, CUDA, rcuda IBVerbs and the CPU (MKL); the bars read 2.28, 1.30, 0.70, 0.65 and 0.62 s.]
55 rcuda: work in progress: a port of rcuda to Microsoft Windows; making rcuda thread safe; rcuda support for CUDA 4.0; support for the CUDA C/C++ extensions; ropencl.
56 rcuda: near future: support for iWARP communications; dynamic remote GPU scheduling; workload balance; remote data cache; remote kernels cache.
57 rcuda: more information.
GPU virtualization in high performance clusters. J. Duato, F. Igual, R. Mayo, A. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'09).
rcuda: reducing the number of GPU-based accelerators in high performance clusters. J. Duato, A. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems (OPTIM).
Performance of CUDA virtualized remote GPUs in high performance clusters. J. Duato, R. Mayo, A. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing (ICPP 2011, accepted).
58 rcuda: credits. Parallel Architectures Group, Technical University of València (Spain); High Performance Computing and Architectures Group, University Jaume I of Castelló (Spain).
59 rcuda: thanks to (sponsor logos) for their hardware donation for the development of this work. MORE INFORMATION: POSTER SESSION (Tuesday 21 and Wednesday 22). Thanks for your attention. Questions?
More informationGPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory
GPU Clusters for High- Performance Computing Jeremy Enos Innovative Systems Laboratory National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Presentation Outline NVIDIA
More informationMemcached Design on High Performance RDMA Capable Interconnects
Memcached Design on High Performance RDMA Capable Interconnects Jithin Jose, Hari Subramoni, Miao Luo, Minjia Zhang, Jian Huang, Md. Wasi- ur- Rahman, Nusrat S. Islam, Xiangyong Ouyang, Hao Wang, Sayantan
More informationIntroduction to CELL B.E. and GPU Programming. Agenda
Introduction to CELL B.E. and GPU Programming Department of Electrical & Computer Engineering Rutgers University Agenda Background CELL B.E. Architecture Overview CELL B.E. Programming Environment GPU
More informationParalization on GPU using CUDA An Introduction
Paralization on GPU using CUDA An Introduction Ehsan Nedaaee Oskoee 1 1 Department of Physics IASBS IPM Grid and HPC workshop IV, 2011 Outline 1 Introduction to GPU 2 Introduction to CUDA Graphics Processing
More informationAutomatic Development of Linear Algebra Libraries for the Tesla Series
Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source
More informationExploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR
Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication
More informationJose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 2017 Codeplay Software Ltd.
SYCL-BLAS: LeveragingSYCL-BLAS Expression Trees for Linear Algebra Jose Aliaga (Universitat Jaume I, Castellon, Spain), Ruyman Reyes, Mehdi Goli (Codeplay Software) 1 About me... Phd in Compilers and Parallel
More informationLarge scale Imaging on Current Many- Core Platforms
Large scale Imaging on Current Many- Core Platforms SIAM Conf. on Imaging Science 2012 May 20, 2012 Dr. Harald Köstler Chair for System Simulation Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen,
More informationMatrix Computations on GPUs, multiple GPUs and clusters of GPUs
Matrix Computations on GPUs, multiple GPUs and clusters of GPUs Francisco D. Igual Departamento de Ingeniería y Ciencia de los Computadores. University Jaume I. Castellón (Spain). Matrix Computations on
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationGPU ARCHITECTURE Chris Schultz, June 2017
Chris Schultz, June 2017 MISC All of the opinions expressed in this presentation are my own and do not reflect any held by NVIDIA 2 OUTLINE Problems Solved Over Time versus Why are they different? Complex
More informationIntroduction to Xeon Phi. Bill Barth January 11, 2013
Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider
More informationManuel F. Dolz, Juan C. Fernández, Rafael Mayo, Enrique S. Quintana-Ortí. High Performance Computing & Architectures (HPCA)
EnergySaving Cluster Roll: Power Saving System for Clusters Manuel F. Dolz, Juan C. Fernández, Rafael Mayo, Enrique S. Quintana-Ortí High Performance Computing & Architectures (HPCA) University Jaume I
More informationJohn W. Romein. Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands
Signal Processing on GPUs for Radio Telescopes John W. Romein Netherlands Institute for Radio Astronomy (ASTRON) Dwingeloo, the Netherlands 1 Overview radio telescopes six radio telescope algorithms on
More informationCP2K Performance Benchmark and Profiling. April 2011
CP2K Performance Benchmark and Profiling April 2011 Note The following research was performed under the HPC Advisory Council HPC works working group activities Participating vendors: HP, Intel, Mellanox
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationDistributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability
Distributed-Shared CUDA: Virtualization of Large-Scale GPU Systems for Programmability and Reliability Atsushi Kawai, Kenji Yasuoka Department of Mechanical Engineering, Keio University Yokohama, Japan
More informationATS-GPU Real Time Signal Processing Software
Transfer A/D data to at high speed Up to 4 GB/s transfer rate for PCIe Gen 3 digitizer boards Supports CUDA compute capability 2.0+ Designed to work with AlazarTech PCI Express waveform digitizers Optional
More informationHETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE Haibo Xie, Ph.D. Chief HSA Evangelist AMD China OUTLINE: The Challenges with Computing Today Introducing Heterogeneous System Architecture (HSA)
More informationHiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.
HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationOutline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work
Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3
More informationPerformance potential for simulating spin models on GPU
Performance potential for simulating spin models on GPU Martin Weigel Institut für Physik, Johannes-Gutenberg-Universität Mainz, Germany 11th International NTZ-Workshop on New Developments in Computational
More information7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT
7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT Draft Printed for SECO Murex S.A.S 2012 all rights reserved Murex Analytics Only global vendor of trading, risk management and processing systems focusing also
More informationMathematical computations with GPUs
Master Educational Program Information technology in applications Mathematical computations with GPUs GPU architecture Alexey A. Romanenko arom@ccfit.nsu.ru Novosibirsk State University GPU Graphical Processing
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationMAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures
MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010
More informationGraphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics
Why GPU? Chapter 1 Graphics Hardware Graphics Processing Unit (GPU) is a Subsidiary hardware With massively multi-threaded many-core Dedicated to 2D and 3D graphics Special purpose low functionality, high
More informationWorld s most advanced data center accelerator for PCIe-based servers
NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying
More informationDIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka
USE OF FOR Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka Faculty of Nuclear Sciences and Physical Engineering Czech Technical University in Prague Mini workshop on advanced numerical methods
More informationGeneral Purpose GPU Computing in Partial Wave Analysis
JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data
More informationScalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012
Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs
More informationG P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G
Joined Advanced Student School (JASS) 2009 March 29 - April 7, 2009 St. Petersburg, Russia G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G Dmitry Puzyrev St. Petersburg State University Faculty
More informationCUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav
CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav CMPE655 - Multiple Processor Systems Fall 2015 Rochester Institute of Technology Contents What is GPGPU? What s the need? CUDA-Capable GPU Architecture
More informationSolving Dense Linear Systems on Platforms with Multiple Hardware Accelerators
Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de
More informationCurrent Trends in Computer Graphics Hardware
Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)
More informationNumerical Algorithms on Multi-GPU Architectures
Numerical Algorithms on Multi-GPU Architectures Dr.-Ing. Harald Köstler 2 nd International Workshops on Advances in Computational Mechanics Yokohama, Japan 30.3.2010 2 3 Contents Motivation: Applications
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More information