Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)


1 Carlos Reaño, Javier Prades and Federico Silla, Technical University of Valencia (Spain). 4th IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB 2018), Vienna, Austria, February 24, 2018

2 Outline: Introduction; GPU Virtualization; Approach proposed; The NVIDIA Jetson TX1 system; Performance evaluation; Conclusions and future work

3 Outline (next section: Introduction)

4-5 Current supercomputers: high performance, but high energy consumption. Desired future supercomputers: exaflops performance within a 20 MW energy budget.
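The exascale target above pins down a required power efficiency, which a one-line computation makes explicit:

```python
# Required power efficiency for an exaflop machine within a 20 MW budget.
exaflops = 1e18        # 1 exaflop/s = 10^18 floating-point operations per second
power_budget_w = 20e6  # 20 MW expressed in watts

required_gflops_per_watt = exaflops / power_budget_w / 1e9
print(f"required efficiency: {required_gflops_per_watt:.0f} GFlops/W")
# prints: required efficiency: 50 GFlops/W
```

Fifty GFlops per watt is far beyond what general-purpose processors alone deliver, which is what motivates combining low-power processors with shared accelerators in the rest of the talk.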

6-8 Current high-performance processors (Xeon): high performance, but high energy consumption (~165 W)... plus high-performance accelerators (GPU): high performance, moderate energy consumption (~100 W). Current low-power processors (ARM): low energy consumption (~15 W), but also low performance.

9-11 How can we improve the efficiency of future exascale systems? Our proposal: heterogeneous platforms: low-power processors (ARM), accelerated with remote HPC servers (Xeon + GPUs) that are shared among all the low-power processors. How to share them? GPU virtualization: GPUs transparently shared among all cluster nodes, at the cost of virtualization framework overhead and network overhead (hence the need for a high-performance interconnect!).

12 Outline (next section: GPU Virtualization)

13-15 GPGPU: OpenCL or CUDA. General architecture of remote GPU solutions: a client-side library on the node running the application forwards the GPU API calls to a server on the node hosting the GPU. Existing solutions: rcuda, freely available, supports CUDA 8.

16 rcuda communication architecture: a modular communication layer supporting several networks (Ethernet, InfiniBand, ...).
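The client/server split behind such a communication architecture can be illustrated with a minimal mock: a client stub serializes each "API call" and ships it over a socket to a server that executes it and returns the result. This is only a sketch; the names (`RemoteGPUClient`, `memcpy_h2d`, `launch_kernel`) are hypothetical, and rcuda's real wire protocol and API interception are far more elaborate.

```python
import pickle
import socket
import threading

def recv_exact(conn, n):
    """Receive exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def gpu_server(listener):
    """Toy server loop: receives (name, args) requests and answers them.
    A real remote-GPU middleware would call into the native CUDA runtime
    here; these handlers are stand-ins."""
    handlers = {
        "memcpy_h2d": lambda data: len(data),             # pretend copy, ack byte count
        "launch_kernel": lambda xs: [2 * x for x in xs],  # stand-in computation
    }
    conn, _ = listener.accept()
    with conn:
        while True:
            try:
                size = int.from_bytes(recv_exact(conn, 8), "big")
            except ConnectionError:
                break
            name, args = pickle.loads(recv_exact(conn, size))
            reply = pickle.dumps(handlers[name](args))
            conn.sendall(len(reply).to_bytes(8, "big") + reply)

class RemoteGPUClient:
    """Client stub standing in for an intercepted CUDA-like API."""
    def __init__(self, host, port):
        self.sock = socket.create_connection((host, port))

    def call(self, name, args):
        request = pickle.dumps((name, args))
        self.sock.sendall(len(request).to_bytes(8, "big") + request)
        size = int.from_bytes(recv_exact(self.sock, 8), "big")
        return pickle.loads(recv_exact(self.sock, size))

# Demo: start the toy server on a free local port and forward two calls.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=gpu_server, args=(listener,), daemon=True).start()

client = RemoteGPUClient("127.0.0.1", listener.getsockname()[1])
copied = client.call("memcpy_h2d", b"\x00" * 1024)
result = client.call("launch_kernel", [1, 2, 3])
print(copied, result)  # prints: 1024 [2, 4, 6]
```

Because the application only ever sees the client stub, it needs no source changes: every call crosses the network, which is why the interconnect dominates remote-GPU performance.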

17 Local GPU (application and GPU in Node 1): performance depends on GPU bandwidth. Remote GPU (application in Node 2 using the GPU in Node 1 across the network): performance depends on network bandwidth.
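Which bandwidth dominates is easy to estimate with a back-of-the-envelope transfer-time model; the sustained-bandwidth figures below are illustrative assumptions, not measurements from this work.

```python
def transfer_time_ms(payload_bytes, bandwidth_gb_s):
    """Time to move a payload at a given sustained bandwidth (GB/s)."""
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e3

payload = 256 * 1024 * 1024  # a 256 MiB input buffer

# Assumed sustained bandwidths in GB/s; adjust for your own hardware.
links = {
    "local GPU over PCIe 3.0 x16": 12.0,
    "remote GPU over FDR InfiniBand": 6.0,
    "remote GPU over 1 Gbps Ethernet": 0.117,
}
for name, bw in links.items():
    print(f"{name}: {transfer_time_ms(payload, bw):.0f} ms")
```

With a fast interconnect the remote-copy penalty is a factor of a few; over commodity Ethernet it is orders of magnitude, which is why a high-performance interconnection network is essential to this proposal.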

18 Outline (next section: Approach proposed)

19 Several ARM nodes connected through the network to a Xeon + GPU server: offload GPU workloads from the ARMs to the remote GPU server...

20 ARM side: NVIDIA Jetson TX1, with a 4-core ARM processor and a 256-core CUDA-compatible GPU. Xeon + K20 side: Supermicro 1027GR-TRF, with two hexa-core Xeon processors and an NVIDIA Tesla K20 GPU with 2496 CUDA cores.

21 Outline (next section: The NVIDIA Jetson TX1 system)

22-23 Performance comparison: ARM CPU vs. Xeon; ARM GPU vs. K20.

24-28 [Figure] Performance comparison of the NVIDIA Jetson TX1 and Supermicro 1027GR-TRF systems using the sysbench benchmark (tests: CPU in events/s, memory in MiB/s, threads in events, mutex in seconds; the Y-axis is in logarithmic scale). The Xeon system performs better in all four tests.

29 Performance comparison: ARM GPU vs. K20.


33-36 [Figure] Performance comparison of the GPU included in the Tegra X1 chip and the Tesla K20m GPU in terms of data transfer bandwidth and computational power (tests: memory copy to GPU in MiB/s, memory copy from GPU in MiB/s, matrix multiplication in GFlop/s; the primary Y-axis is in logarithmic scale, with the K20m speed-up on the secondary axis). The K20 wins in all three tests.

37 Outline (next section: Performance evaluation)

38-39 Three scenarios: 1. ARM; 2. Xeon + GPU; 3. ARM + remote GPU. Performance comparison: Rodinia benchmarks; Gaussian benchmark varying the problem size; Gaussian benchmark with a fixed problem size, sharing the GPU among multiple ARM clients.

40 Local GPU configurations: ARM (Jetson TX1) and Xeon + K20. Remote GPU configuration: ARM connected through the network to a Xeon + K20 server.

41-45 [Figure] Comparison of Rodinia benchmarks executed in three different scenarios: (i) a regular server with one NVIDIA K20 GPU, (ii) an NVIDIA Jetson TX1, and (iii) an NVIDIA Jetson TX1 offloading GPU work to a remote regular server with one NVIDIA K20 GPU. Zone A, short execution times: the ARM wins. Zone B, medium and large execution times: the Xeon wins. Zone C, large execution times: the remote GPU is better than the ARM.
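A simple cost model reproduces the three zones: each platform's execution time is a fixed overhead (kernel launch, network transfers, virtualization) plus work divided by throughput. All constants below are illustrative assumptions chosen only to show the crossover structure, not values fitted to the measurements.

```python
def run_time(work, throughput, overhead):
    """Total time: fixed startup/communication overhead plus compute time."""
    return overhead + work / throughput

# Illustrative platforms: the ARM is slow but pays no remote overhead; the
# remote K20 computes fast but pays network + virtualization overhead.
platforms = {
    "ARM (Jetson TX1)": dict(throughput=1.0, overhead=0.0),
    "Xeon + K20": dict(throughput=20.0, overhead=0.3),
    "ARM + remote K20": dict(throughput=15.0, overhead=0.6),
}

for work in (0.1, 5.0, 100.0):  # short, medium, large problems
    times = {name: run_time(work, **p) for name, p in platforms.items()}
    print(f"work={work:6}: winner = {min(times, key=times.get)}")
```

Small problems never amortize the remote overhead, so the ARM wins (Zone A); once they do, the GPU-equipped platforms win (Zone B); and for long runs even the remote GPU clearly beats the ARM alone (Zone C).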

46 Performance comparison (next: Gaussian benchmark varying the problem size).


48-51 [Figure] Comparison of the Rodinia benchmark gaussian varying the problem size, executed in three different scenarios: (i) a regular server with one NVIDIA K20 GPU, (ii) an NVIDIA Jetson TX1, and (iii) an NVIDIA Jetson TX1 offloading GPU work to a remote regular server with one NVIDIA K20 GPU. Zone A, short execution times: the ARM wins. Zone B, medium and large execution times: the Xeon wins. Zone C, large execution times: the remote GPU is better than the ARM.

52 Performance comparison (next: Gaussian benchmark, fixed problem size, sharing the GPU among multiple ARM clients).


54-58 [Figure] Comparison of the Rodinia benchmark gaussian with problem size 8,192, executed in three different scenarios: (i) a server with one K20, (ii) a Jetson TX1, and (iii) a Jetson TX1 offloading GPU work to a remote server with one K20; in the third scenario, the number of concurrent clients sharing the same rcuda server varies from 1 up to 6. With no sharing, offloading to the remote K20 yields over a 12x speed-up; with 6 ARMs sharing one remote GPU, the speed-up is still 2.83x.
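The drop from over 12x with one client to 2.83x with six is close to what a naive time-sharing model predicts. The model below is an illustrative worst case (pure serialization of the clients on one GPU); that the measured 2.83x exceeds it suggests some overlap between different clients' communication and computation.

```python
def worst_case_shared_speedup(single_client_speedup, n_clients):
    """Per-client speed-up if n clients strictly time-share one remote GPU."""
    return single_client_speedup / n_clients

single = 12.5  # roughly the measured single-client speed-up ("over 12x")
for n in range(1, 7):
    print(f"{n} client(s): at least {worst_case_shared_speedup(single, n):.2f}x")
```

Even this pessimistic bound leaves every one of the six clients ahead of running on its local Tegra GPU, which is the argument for sharing one accelerated server among many ARM nodes.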

59 Outline (next section: Conclusions and future work)

60 Conclusions: a proposal for improving the efficiency of future exascale systems, using rcuda to provide ARMs with access to high-end remote GPUs. Execution time is sped up by up to 12x. ARM power efficiency is leveraged along with the high-end accelerators, which are dynamically shared among multiple ARMs.

61 Future work: analyze this proposal with HPC applications; include energy and power measurements; validate whether our proposal can contribute to energy efficiency in future exascale systems.

62 Get a free copy of rcuda at


More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

Mapping MPI+X Applications to Multi-GPU Architectures

Mapping MPI+X Applications to Multi-GPU Architectures Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under

More information

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Big Data Systems on Future Hardware. Bingsheng He NUS Computing Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker CUDA on ARM Update Developing Accelerated Applications on ARM Bas Aarts and Donald Becker CUDA on ARM: a forward-looking development platform for high performance, energy efficient hybrid computing It

More information

Atos announces the Bull sequana X1000 the first exascale-class supercomputer. Jakub Venc

Atos announces the Bull sequana X1000 the first exascale-class supercomputer. Jakub Venc Atos announces the Bull sequana X1000 the first exascale-class supercomputer Jakub Venc The world is changing The world is changing Digital simulation will be the key contributor to overcome 21 st century

More information

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige How GPUs can find your next hit: Accelerating virtual screening with OpenCL Simon Krige ACS 2013 Agenda > Background > About blazev10 > What is a GPU? > Heterogeneous computing > OpenCL: a framework for

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE

MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE LEVERAGE OUR EXPERTISE sales@microway.com http://microway.com/tesla NUMBERSMASHER TESLA 4-GPU SERVER/WORKSTATION Flexible form factor 4 PCI-E GPUs + 3 additional

More information

IBM Power Advanced Compute (AC) AC922 Server

IBM Power Advanced Compute (AC) AC922 Server IBM Power Advanced Compute (AC) AC922 Server The Best Server for Enterprise AI Highlights IBM Power Systems Accelerated Compute (AC922) server is an acceleration superhighway to enterprise- class AI. A

More information

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1

More information

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL

More information

Heterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1

Heterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1 COSCOⅣ Heterogeneous SoCs M5171111 HASEGAWA TORU M5171112 IDONUMA TOSHIICHI May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1 Contents Background Heterogeneous technology May 28, 2014 COMPUTER SYSTEM COLLOQUIUM

More information

arxiv: v1 [physics.ins-det] 11 Jul 2015

arxiv: v1 [physics.ins-det] 11 Jul 2015 GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University

More information

GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting

GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting Girona, Spain May 4-5 GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting David Camp, Hari Krishnan, David Pugmire, Christoph Garth, Ian Johnson, E. Wes Bethel, Kenneth

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms

Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms Javier Cuenca Luis-Pedro García Domingo Giménez Francisco J. Herrera Scientific Computing and Parallel

More information

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION Vignesh Adhinarayanan Ph.D. (CS) Student Synergy Lab, Virginia Tech INTRODUCTION Supercomputers are constrained by power Power

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent

PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary. Motivation

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Berni Schiefer schiefer@ca.ibm.com Tuesday March 17th at 12:00

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines

More information

Document downloaded from:

Document downloaded from: Document downloaded from: http://hdl.handle.net/10251/70225 This paper must be cited as: Reaño González, C.; Silla Jiménez, F. (2015). On the Deployment and Characterization of CUDA Teaching Laboratories.

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA 1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,

More information

GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester

GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester NVIDIA GPU Computing A Revolution in High Performance Computing GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester John Ashley Senior Solutions Architect

More information

NVIDIA Update and Directions on GPU Acceleration for Earth System Models

NVIDIA Update and Directions on GPU Acceleration for Earth System Models NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

NVIDIA Jetson Platform Characterization

NVIDIA Jetson Platform Characterization NVIDIA Jetson Platform Characterization Hassan Halawa, Hazem A. Abdelhafez, Andrew Boktor, Matei Ripeanu The University of British Columbia {hhalawa, hazem, boktor, matei}@ece.ubc.ca Abstract. This study

More information

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

On the Capability and Achievable Performance of FPGAs for HPC Applications "On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

High Performance Computing

High Performance Computing High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and

More information

Understanding the Role of GPGPU-accelerated SoC-based ARM Clusters

Understanding the Role of GPGPU-accelerated SoC-based ARM Clusters Understanding the Role of GPGPU-accelerated SoC-based ARM Clusters Reza Azimi, Tyler Fox and Sherief Reda Brown University Providence, RI firstname lastname@brown.edu Abstract The last few years saw the

More information

Building supercomputers from commodity embedded chips

Building supercomputers from commodity embedded chips http://www.montblanc-project.eu Building supercomputers from commodity embedded chips Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results

More information

Deep learning in MATLAB From Concept to CUDA Code

Deep learning in MATLAB From Concept to CUDA Code Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,

More information

An Introduction to the SPEC High Performance Group and their Benchmark Suites

An Introduction to the SPEC High Performance Group and their Benchmark Suites An Introduction to the SPEC High Performance Group and their Benchmark Suites Robert Henschel Manager, Scientific Applications and Performance Tuning Secretary, SPEC High Performance Group Research Technologies

More information