An approach to provide remote access to GPU computational power


1 An approach to provide remote access to GPU computational power. Rafael Mayo Gual, Universitat Jaume I, Spain. Joint research effort. 1/84

2 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 2/84

3 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 3/84

4 GPU computing. GPU computing covers all the technological issues (hardware and software) involved in using the GPU computational power to execute general-purpose code. This leads to a heterogeneous system. GPU computing has grown considerably in recent years. 4/84

5 GPU computing. Nov 2008, TOP500 list: the first supercomputer on the TOP500 (#29) using GPU computing, at the Tokyo Institute of Technology. 5/84

6 GPU computing. TOP500 June 2011 (Rank, Site, Computer, Vendor): 1 RIKEN Advanced Institute for Computational Science (AICS), Japan - K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect / 2011, Fujitsu. 2 National Supercomputing Center in Tianjin, China - Tianhe-1A, NUDT TH MPP, Xeon X5670 2.93 GHz 6C, NVIDIA GPU, FT-1000 8C / 2010, NUDT. 3 DOE/SC/Oak Ridge National Laboratory, United States - Jaguar, Cray XT5-HE, Opteron 6-core 2.6 GHz / 2009, Cray Inc. 4 National Supercomputing Centre in Shenzhen (NSCS), China - Nebulae, Dawning TC3600 Blade, Intel X5650, NVIDIA Tesla C2050 / 2010, Dawning. 5 GSIC Center, Tokyo Institute of Technology, Japan - TSUBAME 2.0, HP ProLiant SL390s G7, Xeon 6C X5670, NVIDIA GPU, Linux/Windows / 2010, NEC/HP. 6/84

7 GPU computing. TOP500 June 2011: same top-5 list as the previous slide, note that three of the five systems (Tianhe-1A, Nebulae and TSUBAME 2.0) use NVIDIA GPUs. 7/84

8 GPU computing. Green500 June 2011 (Rank, Site, Computer): 1 IBM Thomas J. Watson Research Center - NNSA/SC Blue Gene/Q Prototype 2. 2 IBM Thomas J. Watson Research Center - NNSA/SC Blue Gene/Q Prototype 2. 3 Nagasaki University - DEGIMA Cluster, Intel i5, ATI Radeon GPU, InfiniBand QDR. 4 GSIC Center, Tokyo Institute of Technology - HP ProLiant SL390s G7, Xeon 6C, NVIDIA GPU, Linux/Windows. 5 CINECA / SCS Supercomputing Solution - iDataPlex DX360M3, Xeon 2.4, NVIDIA GPU, InfiniBand. 8/84

9 GPU computing. Green500 June 2011: same top-5 list as the previous slide, note that GPU-accelerated systems take three of the five places. 9/84

10 GPU computing. GPUs have been the first commodity massively parallel processors. For the right kind of code, the use of GPUs brings huge benefits in terms of performance and energy. Development tools have been introduced to ease the programming of GPUs. 10/84

11 GPU computing. Basic construction node: GPUs inside the box. (Diagram: a node with CPU, main memory and GPUs in the same box.) 11/84

12 GPU computing. Basic construction node: GPUs outside the box. (Diagram: a node with CPU and main memory, with the GPUs attached in an external box.) 12/84

13 GPU computing. From the programming point of view: a set of nodes, each one with one or more CPUs (with several cores per CPU) and one or more GPUs (1-4), plus an interconnection network. (Diagram: several nodes, each with CPU, main memory and GPUs, linked by the interconnection network.) 13/84

14 GPU computing. Two main approaches in GPU computing development environments: CUDA (NVIDIA proprietary) and OpenCL (open standard). 14/84

15 GPU computing. Basically, OpenCL and CUDA have the same work scheme. Compilation: separate CPU code and GPU code (the kernel). 15/84

16 GPU computing. Basically, OpenCL and CUDA have the same work scheme. Running: data transfers between the CPU and GPU memory spaces. 1. Before kernel execution: data from the CPU memory space to the GPU memory space. 2. Computation: kernel execution. 3. After kernel execution: results from the GPU memory space to the CPU memory space. 16/84
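A minimal CUDA sketch of these three steps (illustrative vector-addition kernel, not taken from the slides):

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];          /* one element per thread: data parallelism */
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = malloc(bytes), *h_b = malloc(bytes), *h_c = malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

    /* 1. Before kernel execution: CPU memory space -> GPU memory space */
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    /* 2. Computation: kernel execution on the GPU */
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    /* 3. After kernel execution: GPU memory space -> CPU memory space */
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```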

17 GPU computing. What does the right kind of code mean? There must be data parallelism in the code: this is the only way to take advantage of the hundreds of processing cores in a GPU. 17/84

18 GPU computing. What does the right kind of code mean? There must be a limited overhead due to data movement between the CPU memory space and the GPU memory space. 18/84

19 GPU computing. What does the right kind of code mean? Influence of data transfers for SGEMM. (Chart: percentage of time devoted to data transfers vs. matrix size, for pinned and non-pinned host memory.) Source: Matrix computations on graphics processors and clusters of GPUs, Francisco D. Igual Peña, PhD dissertation. 19/84
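The pinned-memory effect shown in the chart can be reproduced with a small experiment; the following sketch (matrix size and timing code are illustrative assumptions) times the same host-to-device copy from pageable and from page-locked host memory:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

/* Times one host-to-device copy and returns the elapsed milliseconds. */
static float time_h2d(const void *src, void *dst, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    size_t bytes = (size_t)4096 * 4096 * sizeof(float);   /* one SGEMM-sized matrix */
    void *d_buf, *pinned, *pageable = malloc(bytes);
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&pinned, bytes);                        /* page-locked (pinned) host memory */

    printf("pageable: %.2f ms\n", time_h2d(pageable, d_buf, bytes));
    printf("pinned:   %.2f ms\n", time_h2d(pinned,   d_buf, bytes));

    cudaFreeHost(pinned); free(pageable); cudaFree(d_buf);
    return 0;
}
```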

20 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 20/84

21 GPU computing scenarios. Different scenarios from the point of view of the application: low amount of data parallelism; high amount of data parallelism; moderate amount of data parallelism; applications for multi-GPU computing. 21/84

22 GPU computing scenarios. Low amount of data parallelism: the application has only a small part where data parallelism can be extracted. BAD for GPU computing. No GPU is needed in the system; just proceed with the traditional HPC strategies. 22/84

23 GPU computing scenarios. High amount of data parallelism: a lot of data parallelism can be extracted from every application. GOOD for GPU computing. Add as many GPUs as possible to each node in the system and rewrite the applications to use them. 23/84

24 GPU computing scenarios. Moderate amount of data parallelism: the application has a moderate level of data parallelism (around 40%-80%). What about GPU computing? If every node in the system includes GPUs, these GPUs are used only when data parallelism appears in some part of the application. The rest of the time the GPUs are idle, which is an extra cost in both acquisition and maintenance (energy). 24/84

25 GPU computing scenarios. Applications for multi-GPU computing: an application can use a large number of GPUs in parallel. What about GPU computing? The code running in a node can only access the GPUs in that node, but it could run faster if it were possible to access more GPUs. 25/84

26 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 26/84

27 Introduction to rCUDA. A tool that enables code running in one node to access GPUs in other nodes. It is useful when you have: a moderate level of data parallelism; applications for multi-GPU computing. 27/84

28 Introduction to rCUDA. Moderate level of data parallelism. (Diagram: every node has its own GPUs.) Adding a set of GPUs to each node leads to having a set of GPUs idle for long periods. This is a waste of money and energy. 28/84

29 Introduction to rCUDA. Moderate level of data parallelism. (Diagram: only some nodes have GPUs.) Add only the GPUs that can actually be used, considering the applications and their amount of data parallelism, and... 29/84

30 Introduction to rCUDA. Moderate level of data parallelism. (Diagram: logical interconnections from every node to the GPUs.) Add only the GPUs that can actually be used, considering the applications and their amount of data parallelism, and make all of them accessible from every node. 30/84

31 Introduction to rCUDA. Applications for multi-GPU computing. (Diagram: each node with its own GPUs.) From each node it is only possible to access the GPUs attached to that node. 31/84

32 Introduction to rCUDA. Applications for multi-GPU computing. (Diagram: logical interconnections from every node to all GPUs.) Make all GPUs accessible from every node. 32/84

33 Introduction to rCUDA. Applications for multi-GPU computing. (Diagram: logical interconnections from every node to all GPUs.) Make all GPUs accessible from every node and enable access from one node to as many GPUs as necessary. 33/84

34 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 34/84

35 rCUDA structure. (Diagram: in plain CUDA, the application runs on top of the CUDA driver + runtime, which drives the local GPU.) 35/84

36 rCUDA structure. (Diagram: the stack is split into a client side, which runs the CUDA application, and a server side, which holds the CUDA driver + runtime and the GPU.) 36/84

37 rCUDA structure. Client side: the CUDA application is linked against the rCUDA library, which takes the place of the CUDA runtime. Server side: the rCUDA daemon sits on top of the CUDA driver + runtime and accesses the GPU devices. 37/84

38 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 38/84

39 rCUDA functionality. CUDA programming consists of C extensions and the runtime library. C extensions: not supported in the current version of rCUDA; we are working on it. Runtime library: support for almost all functions; for some internal functions, NVIDIA does not give information (not supported in rCUDA). 39/84
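To make the distinction concrete, here is a minimal sketch (hypothetical kernel, not taken from the slides): the <<<...>>> launch syntax is a CUDA C extension expanded by nvcc, while calls such as cudaMalloc and cudaMemcpy belong to the runtime library that rCUDA can intercept and forward.

```c
#include <cuda_runtime.h>

__global__ void scale(float *v, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= alpha;
}

void run(float *host, int n) {
    float *dev;
    size_t bytes = n * sizeof(float);

    /* Runtime-library calls: plain C functions that a wrapper library can forward. */
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);

    /* C extension: the <<<grid, block>>> launch is expanded by nvcc into
       internal runtime calls, which is what makes it harder to virtualize. */
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);

    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```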

40 rCUDA functionality. Supported CUDA 4.0 runtime functions (module: functions / supported): Device management; Error handling (3/3); Event management (7/7); Execution control (7/7); Memory management; Peer device memory access (3/3); Stream management (5/5); Surface reference management (2/2); Texture reference management (8/8); Thread management (6/6); Unified addressing (1/1); Version management. 40/84

41 rCUDA functionality. NOT YET supported CUDA 4.0 runtime functions (module: functions / supported): OpenGL interoperability (4/0); Direct3D 9 interoperability (5/0); Direct3D 10 interoperability (5/0); Direct3D 11 interoperability (5/0); VDPAU interoperability (4/0); Graphics interoperability. 41/84

42 rCUDA functionality. Supported CUBLAS functions. (Table: number of functions and number supported for the helper function reference and the BLAS1, BLAS2 and BLAS3 modules.) 42/84

43 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 43/84

44 Basic TCP/IP version. Proof of concept: it uses the TCP/IP stack. It is a basic version to show the functionality and to estimate the overhead due to the communication network. It runs over all TCP/IP networks: Ethernet, InfiniBand, etc. 44/84

45 Basic TCP/IP version. Example of rCUDA interaction: initialization. (Diagram: the client application queries the GPU software; the rCUDA library locates the kernel and sends it to the server daemon, which loads the kernel on the GPU and returns the result.) 45/84

46 Basic TCP/IP version. Example of rCUDA interaction: cudaMemcpy(..., cudaMemcpyHostToDevice). (Diagram: the client application sends the data to the server daemon, which copies it to the GPU memory and returns the result.) 46/84
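The following sketch illustrates the general idea of forwarding a memory copy to a remote daemon over a TCP socket; the message layout, names and opcode are assumptions for illustration, not rCUDA's actual protocol.

```c
#include <stdint.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical message header; the real rCUDA wire format is not reproduced here. */
enum { RPC_MEMCPY_H2D = 1 };
struct rpc_header {
    uint32_t op;         /* which CUDA call is being forwarded           */
    uint64_t dev_ptr;    /* destination address in the remote GPU memory */
    uint64_t size;       /* payload size in bytes                        */
};

static int send_all(int fd, const void *buf, size_t len) {
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return -1;
        p += n; len -= (size_t)n;
    }
    return 0;
}

/* Client-side stub: instead of touching a local GPU, ship the copy request. */
int remote_memcpy_h2d(int sock, uint64_t dev_ptr, const void *src, size_t size) {
    struct rpc_header h = { RPC_MEMCPY_H2D, dev_ptr, size };
    if (send_all(sock, &h, sizeof(h)) != 0) return -1;
    return send_all(sock, src, size);    /* the server performs the real cudaMemcpy */
}
```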

47 Basic TCP/IP version. Main problem: data movement overhead. On CUDA this overhead is due to PCIe data transfers. On rCUDA this overhead is due to PCIe data transfers plus network data transfers. 47/84

48 Basic TCP/IP version. Data transfer time for matrix-matrix multiplication (GEMM): two data matrices from the client to the remote GPU and one result matrix from the remote GPU back to the client. (Chart: transfer time in msec vs. matrix dimension, comparing rCUDA over 10Gb Ethernet with local CUDA.) 48/84

49 Basic TCP/IP version. Execution time for matrix-matrix multiplication (GEMM) on a Tesla C1060 with an Intel Xeon E5410 at 2.33 GHz, over 10Gb Ethernet. (Chart: execution time in sec vs. matrix dimension, broken down into kernel execution, data transfers and rCUDA misc operations.) 49/84

50 Basic TCP/IP version. Estimated execution time for matrix multiplication, including data transfers, for some HPC networks. (Chart: time in sec vs. matrix dimension for Gb Ethernet, Gb InfiniBand and local CUDA.) 50/84

51 Basic TCP/IP version. The functionality has been shown: almost all CUDA SDK examples have been tested. If the network overhead can be minimized, a remote rCUDA device will have performance close to a local CUDA device. 51/84

52 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 52/84

53 InfiniBand version. Why an InfiniBand version? InfiniBand is the most used HPC network, with low latency and high bandwidth. (Chart: interconnect share in the TOP500 list of June 2011: InfiniBand QDR, InfiniBand DDR, InfiniBand DDR 4x, Gigabit Ethernet, proprietary, custom and other interconnects.) 53/84

54 InfiniBand version. Why an InfiniBand version? InfiniBand is the most used HPC network, with low latency and high bandwidth, so good results are expected. (Chart: SGEMM execution time in sec vs. matrix dimension for rCUDA over Gb InfiniBand and local CUDA.) 54/84

55 InfiniBand version. InfiniBand version facts: it uses IB Verbs, so all the TCP/IP stack overhead is avoided. Our goal is to get as close as possible to the network peak performance. A bandwidth test of our IB network gives about 2900 MB/sec. 55/84

56 InfiniBand version. Same user-level functionality. Bandwidth between the client and the remote GPU close to the peak InfiniBand network bandwidth. Use of GPUDirect to reduce the number of intra-node data movements. Use of pipelined transfers to overlap intra-node data movements and network transfers. 56/84

57 InfiniBand version. Intra-node data movement, basic method: two different main memory zones are needed, one associated with the GPU and one associated with the InfiniBand network card. (Diagram: GPU, chipset, main memory, processor and InfiniBand adapter.) 57/84

58 InfiniBand version. Intra-node data movement, basic method, step 1: copy data from the GPU memory to the main memory buffer associated with the GPU. 58/84

59 InfiniBand version. Intra-node data movement, basic method, step 2: copy data between the two main memory buffers. 59/84

60 InfiniBand version. Intra-node data movement, basic method, step 3: send data from the main memory buffer associated with the network card. 60/84

61 InfiniBand version. Intra-node data movement, basic method: three data movements have been needed. 61/84

62 InfiniBand version. Intra-node data movement with GPUDirect: only ONE main memory zone is needed. This zone is bound to both the GPU and the network device. 62/84

63 InfiniBand version. Intra-node data movement with GPUDirect, step 1: copy data from the GPU memory to the main memory. 63/84

64 InfiniBand version. Intra-node data movement with GPUDirect, step 2: send data from the main memory. 64/84

65 InfiniBand version. Intra-node data movement with GPUDirect: only TWO data movements have been needed. 65/84
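A minimal sketch of the single shared memory zone (assumptions: a previously created InfiniBand protection domain, and driver support for sharing the pinned buffer between the CUDA runtime and the verbs layer, as in GPUDirect v1):

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stddef.h>

/* One host buffer, page-locked for the GPU and registered with the HCA,
   so the same zone serves both the CUDA copy and the InfiniBand send. */
void *alloc_shared_zone(struct ibv_pd *pd, size_t bytes, struct ibv_mr **mr_out) {
    void *buf;
    /* Pinned allocation usable by the CUDA runtime for fast DMA transfers. */
    if (cudaHostAlloc(&buf, bytes, cudaHostAllocPortable) != cudaSuccess)
        return NULL;
    /* Register the same buffer with the InfiniBand device. */
    *mr_out = ibv_reg_mr(pd, buf, bytes,
                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    if (*mr_out == NULL) { cudaFreeHost(buf); return NULL; }
    return buf;
}
```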

66 InfiniBand version. Standard data transfers between nodes. (Diagram: rCUDA client and rCUDA server nodes, each with its chipset, main memory and InfiniBand adapter, and the GPU on the server side.) 66/84

67 InfiniBand version. Standard data transfers between nodes. (Timeline diagram: the client copies the data to the network buffers.) 67/84

68 InfiniBand version. Standard data transfers between nodes. (Timeline diagram: the client copies the data to the network buffers and then sends it over InfiniBand.) 68/84

69 InfiniBand version. Standard data transfers between nodes. (Timeline diagram: the client copies the data to the network buffers, sends it, and finally the server copies it to the GPU; the three stages run one after the other.) 69/84

70 InfiniBand version. Pipelined data transfers. (Timeline diagram: the data is split into chunks; the client copies the first chunk to the network buffer.) 70/84

71 InfiniBand version. Pipelined data transfers. (Timeline diagram: successive chunks are copied to the network buffer.) 71/84

72 InfiniBand version. Pipelined data transfers. (Timeline diagram: the copy of the next chunk to the network buffer starts.) 72/84

73 InfiniBand version. Pipelined data transfers. (Timeline diagram: the send of a chunk overlaps with the copy of the next chunk to the network buffer.) 73/84

74 InfiniBand version. Pipelined data transfers. (Timeline diagram: chunk copies and sends overlap, and the server starts copying received chunks to the GPU.) 74/84

75 InfiniBand version. Pipelined data transfers. (Timeline diagram: copies to the network buffer, sends and server-side copies to the GPU proceed concurrently.) 75/84

76 InfiniBand version. Pipelined data transfers. (Timeline diagram: the pipeline is full; every stage works on a different chunk at the same time.) 76/84

77 InfiniBand version. Pipelined data transfers. (Timeline diagram: copies to the network buffer, sends and copies to the GPU overlap; the highlighted span is the overhead for transferring data to the remote node.) 77/84
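A minimal sketch of the pipelining idea (chunk size, buffer handling and the send_chunk() helper are assumptions, not rCUDA's implementation): the GPU-to-staging-buffer copy of one chunk overlaps with the network send of the previous chunk, as on the server side of a cudaMemcpy(..., cudaMemcpyDeviceToHost).

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Assumed helper: pushes one staged chunk onto the wire (e.g. an IB verbs post). */
void send_chunk(const void *buf, size_t bytes);

/* Stream a device buffer to the network in CHUNK-sized pieces, overlapping
   the GPU->host copy of the next chunk with the send of the current one. */
void pipelined_d2h_send(const char *dev_src, size_t total, cudaStream_t stream) {
    enum { CHUNK = 1 << 20 };            /* 1 MiB staging chunks (assumed size)   */
    static char *staging[2];             /* two pinned buffers, used in ping-pong */
    if (!staging[0]) {
        cudaHostAlloc((void **)&staging[0], CHUNK, cudaHostAllocDefault);
        cudaHostAlloc((void **)&staging[1], CHUNK, cudaHostAllocDefault);
    }

    size_t off = 0, pending = 0;
    int cur = 0, have_pending = 0;
    while (off < total || have_pending) {
        if (off < total) {               /* start copying the next chunk */
            size_t n = (total - off < CHUNK) ? total - off : CHUNK;
            cudaMemcpyAsync(staging[cur], dev_src + off, n,
                            cudaMemcpyDeviceToHost, stream);
            off += n;
            if (have_pending)            /* overlap: send the previous chunk now */
                send_chunk(staging[cur ^ 1], pending);
            cudaStreamSynchronize(stream);
            pending = n; have_pending = 1; cur ^= 1;
        } else {                         /* drain the last staged chunk */
            send_chunk(staging[cur ^ 1], pending);
            have_pending = 0;
        }
    }
}
```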

78 InfiniBand version. Bandwidth for a 4096 x 4096 single-precision matrix. (Chart: bandwidth in MB/sec for rCUDA over Gigabit Ethernet, rCUDA over IPoIB, rCUDA over IB Verbs and local CUDA; the InfiniBand peak bandwidth is about 2900 MB/sec.) 78/84

79 InfiniBand version. Execution time for matrix-matrix multiplication (dim = 4096) on a GeForce 9800 GTX with an Intel Xeon E5645. (Chart: times between 0.62 and 2.28 sec for rCUDA over IPoIB, rCUDA over Gigabit Ethernet, local CUDA, rCUDA over IB Verbs and the CPU with MKL.) 79/84

80 Outline: GPU computing; GPU computing scenarios; Introduction to rCUDA; rCUDA structure; rCUDA functionality; Basic TCP/IP version; InfiniBand version; Work in progress and near future. 80/84

81 Work in progress: dynamic remote GPU scheduling; port to Microsoft Windows; full support for CUDA 4.0; support for the CUDA C/C++ extensions; applying the approach to OpenCL. 81/84

82 Near future: support for iWARP communications; workload balancing; remote data cache; remote kernel cache. 82/84

83 More information. GPU virtualization in high performance clusters. J. Duato, F. Igual, R. Mayo, A. J. Peña, E. S. Quintana, F. Silla. 4th Workshop on Virtualization in High-Performance Cloud Computing, VHPC 2009. rCUDA: reducing the number of GPU-based accelerators in high performance clusters. J. Duato, A. J. Peña, F. Silla, R. Mayo, E. S. Quintana. Workshop on Optimization Issues in Energy Efficient Distributed Systems, OPTIM 2010. Performance of CUDA virtualized remote GPUs in high performance clusters. J. Duato, R. Mayo, A. J. Peña, E. S. Quintana, F. Silla. International Conference on Parallel Processing, ICPP 2011. Enabling CUDA acceleration within virtual machines using rCUDA. J. Duato, A. J. Peña, F. Silla, J. C. Fernández, R. Mayo, E. S. Quintana. High Performance Computing Conference, HiPC 2011. 83/84

84 People: Antonio Peña, José Duato, Federico Silla, Enrique S. Quintana-Ortí, Rafael Mayo. Thanks to Mellanox and AIC for their support of this work. 84/84
