Carlos Reaño, Javier Prades and Federico Silla Technical University of Valencia (Spain)


1 Carlos Reaño, Javier Prades and Federico Silla, Technical University of Valencia (Spain). 4th IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB 2018), Vienna, Austria, February 24, 2018

2 Outline: Introduction; GPU Virtualization; Approach proposed; The NVIDIA Jetson TX1 system; Performance evaluation; Conclusions and future work

3 Outline (next section: Introduction)

4-5 Current supercomputers: high performance, but high energy consumption. Desired future supercomputers: exaflops performance within a 20 MW energy budget.
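The exascale target above pins down a required power efficiency, which a one-line computation makes explicit:

```python
# Required power efficiency for an exaflop machine within a 20 MW budget.
exaflops = 1e18        # 1 exaflop/s = 10^18 floating-point operations per second
power_budget_w = 20e6  # 20 MW expressed in watts

required_gflops_per_watt = exaflops / power_budget_w / 1e9
print(f"required efficiency: {required_gflops_per_watt:.0f} GFlops/W")
# prints: required efficiency: 50 GFlops/W
```

Fifty GFlops per watt is far beyond what general-purpose processors alone deliver, which is what motivates combining low-power processors with shared accelerators in the rest of the talk.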

6-8 Current high-performance processors (Xeon): high performance, but high energy consumption (~165 W)... plus high-performance accelerators (GPU): high performance, moderate energy consumption (~100 W). Current low-power processors (ARM): low energy consumption (~15 W), but also low performance.

9-11 How can we improve the efficiency of future exascale systems? Our proposal: heterogeneous platforms: low-power processors (ARM), accelerated with remote HPC servers (Xeon + GPUs) that are shared among all the low-power processors. How to share them? GPU virtualization: GPUs transparently shared among all cluster nodes, at the cost of virtualization framework overhead and network overhead (hence the need for a high-performance interconnect!).

12 Outline (next section: GPU Virtualization)

13-15 GPGPU: OpenCL or CUDA. General architecture of remote GPU solutions: a client-side library on the node running the application forwards the GPU API calls to a server on the node hosting the GPU. Existing solutions: rcuda, freely available, supports CUDA 8.

16 rcuda communication architecture: a modular communication layer supporting several networks (Ethernet, InfiniBand, ...).
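The client/server split behind such a communication architecture can be illustrated with a minimal mock: a client stub serializes each "API call" and ships it over a socket to a server that executes it and returns the result. This is only a sketch; the names (`RemoteGPUClient`, `memcpy_h2d`, `launch_kernel`) are hypothetical, and rcuda's real wire protocol and API interception are far more elaborate.

```python
import pickle
import socket
import threading

def recv_exact(conn, n):
    """Receive exactly n bytes from a socket."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed")
        buf += chunk
    return buf

def gpu_server(listener):
    """Toy server loop: receives (name, args) requests and answers them.
    A real remote-GPU middleware would call into the native CUDA runtime
    here; these handlers are stand-ins."""
    handlers = {
        "memcpy_h2d": lambda data: len(data),             # pretend copy, ack byte count
        "launch_kernel": lambda xs: [2 * x for x in xs],  # stand-in computation
    }
    conn, _ = listener.accept()
    with conn:
        while True:
            try:
                size = int.from_bytes(recv_exact(conn, 8), "big")
            except ConnectionError:
                break
            name, args = pickle.loads(recv_exact(conn, size))
            reply = pickle.dumps(handlers[name](args))
            conn.sendall(len(reply).to_bytes(8, "big") + reply)

class RemoteGPUClient:
    """Client stub standing in for an intercepted CUDA-like API."""
    def __init__(self, host, port):
        self.sock = socket.create_connection((host, port))

    def call(self, name, args):
        request = pickle.dumps((name, args))
        self.sock.sendall(len(request).to_bytes(8, "big") + request)
        size = int.from_bytes(recv_exact(self.sock, 8), "big")
        return pickle.loads(recv_exact(self.sock, size))

# Demo: start the toy server on a free local port and forward two calls.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=gpu_server, args=(listener,), daemon=True).start()

client = RemoteGPUClient("127.0.0.1", listener.getsockname()[1])
copied = client.call("memcpy_h2d", b"\x00" * 1024)
result = client.call("launch_kernel", [1, 2, 3])
print(copied, result)  # prints: 1024 [2, 4, 6]
```

Because the application only ever sees the client stub, it needs no source changes: every call crosses the network, which is why the interconnect dominates remote-GPU performance.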

17 Local GPU (application and GPU in Node 1): performance depends on GPU bandwidth. Remote GPU (application in Node 2 using the GPU in Node 1 across the network): performance depends on network bandwidth.
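Which bandwidth dominates is easy to estimate with a back-of-the-envelope transfer-time model; the sustained-bandwidth figures below are illustrative assumptions, not measurements from this work.

```python
def transfer_time_ms(payload_bytes, bandwidth_gb_s):
    """Time to move a payload at a given sustained bandwidth (GB/s)."""
    return payload_bytes / (bandwidth_gb_s * 1e9) * 1e3

payload = 256 * 1024 * 1024  # a 256 MiB input buffer

# Assumed sustained bandwidths in GB/s; adjust for your own hardware.
links = {
    "local GPU over PCIe 3.0 x16": 12.0,
    "remote GPU over FDR InfiniBand": 6.0,
    "remote GPU over 1 Gbps Ethernet": 0.117,
}
for name, bw in links.items():
    print(f"{name}: {transfer_time_ms(payload, bw):.0f} ms")
```

With a fast interconnect the remote-copy penalty is a factor of a few; over commodity Ethernet it is orders of magnitude, which is why a high-performance interconnection network is essential to this proposal.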

18 Outline (next section: Approach proposed)

19 Several ARM nodes connected through the network to a Xeon + GPU server: offload GPU workloads from the ARMs to the remote GPU server...

20 ARM side: NVIDIA Jetson TX1, with a 4-core ARM processor and a 256-core CUDA-compatible GPU. Xeon + K20 side: Supermicro 1027GR-TRF, with two hexa-core Xeon processors and an NVIDIA Tesla K20 GPU with 2496 CUDA cores.

21 Outline (next section: The NVIDIA Jetson TX1 system)

22-23 Performance comparison: ARM CPU vs. Xeon; ARM GPU vs. K20.

24-28 [Figure] Performance comparison of the NVIDIA Jetson TX1 and Supermicro 1027GR-TRF systems using the sysbench benchmark (tests: CPU in events/s, memory in MiB/s, threads in events, mutex in seconds; the Y-axis is in logarithmic scale). The Xeon system performs better in all four tests.

29 Performance comparison: ARM GPU vs. K20.


33-36 [Figure] Performance comparison of the GPU included in the Tegra X1 chip and the Tesla K20m GPU in terms of data transfer bandwidth and computational power (tests: memory copy to GPU in MiB/s, memory copy from GPU in MiB/s, matrix multiplication in GFlop/s; the primary Y-axis is in logarithmic scale, with the K20m speed-up on the secondary axis). The K20 wins in all three tests.

37 Outline (next section: Performance evaluation)

38-39 Three scenarios: 1. ARM; 2. Xeon + GPU; 3. ARM + remote GPU. Performance comparison: Rodinia benchmarks; Gaussian benchmark varying the problem size; Gaussian benchmark with a fixed problem size, sharing the GPU among multiple ARM clients.

40 Local GPU configurations: ARM (Jetson TX1) and Xeon + K20. Remote GPU configuration: ARM connected through the network to a Xeon + K20 server.

41-45 [Figure] Comparison of Rodinia benchmarks executed in three different scenarios: (i) a regular server with one NVIDIA K20 GPU, (ii) an NVIDIA Jetson TX1, and (iii) an NVIDIA Jetson TX1 offloading GPU work to a remote regular server with one NVIDIA K20 GPU. Zone A, short execution times: the ARM wins. Zone B, medium and large execution times: the Xeon wins. Zone C, large execution times: the remote GPU is better than the ARM.
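A simple cost model reproduces the three zones: each platform's execution time is a fixed overhead (kernel launch, network transfers, virtualization) plus work divided by throughput. All constants below are illustrative assumptions chosen only to show the crossover structure, not values fitted to the measurements.

```python
def run_time(work, throughput, overhead):
    """Total time: fixed startup/communication overhead plus compute time."""
    return overhead + work / throughput

# Illustrative platforms: the ARM is slow but pays no remote overhead; the
# remote K20 computes fast but pays network + virtualization overhead.
platforms = {
    "ARM (Jetson TX1)": dict(throughput=1.0, overhead=0.0),
    "Xeon + K20": dict(throughput=20.0, overhead=0.3),
    "ARM + remote K20": dict(throughput=15.0, overhead=0.6),
}

for work in (0.1, 5.0, 100.0):  # short, medium, large problems
    times = {name: run_time(work, **p) for name, p in platforms.items()}
    print(f"work={work:6}: winner = {min(times, key=times.get)}")
```

Small problems never amortize the remote overhead, so the ARM wins (Zone A); once they do, the GPU-equipped platforms win (Zone B); and for long runs even the remote GPU clearly beats the ARM alone (Zone C).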

46 Performance comparison (next: Gaussian benchmark varying the problem size).


48-51 [Figure] Comparison of the Rodinia benchmark gaussian varying the problem size, executed in three different scenarios: (i) a regular server with one NVIDIA K20 GPU, (ii) an NVIDIA Jetson TX1, and (iii) an NVIDIA Jetson TX1 offloading GPU work to a remote regular server with one NVIDIA K20 GPU. Zone A, short execution times: the ARM wins. Zone B, medium and large execution times: the Xeon wins. Zone C, large execution times: the remote GPU is better than the ARM.

52 Performance comparison (next: Gaussian benchmark, fixed problem size, sharing the GPU among multiple ARM clients).


54-58 [Figure] Comparison of the Rodinia benchmark gaussian with problem size 8,192, executed in three different scenarios: (i) a server with one K20, (ii) a Jetson TX1, and (iii) a Jetson TX1 offloading GPU work to a remote server with one K20; in the third scenario, the number of concurrent clients sharing the same rcuda server varies from 1 up to 6. With no sharing, offloading to the remote K20 yields over a 12x speed-up; with 6 ARMs sharing one remote GPU, the speed-up is still 2.83x.
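The drop from over 12x with one client to 2.83x with six is close to what a naive time-sharing model predicts. The model below is an illustrative worst case (pure serialization of the clients on one GPU); that the measured 2.83x exceeds it suggests some overlap between different clients' communication and computation.

```python
def worst_case_shared_speedup(single_client_speedup, n_clients):
    """Per-client speed-up if n clients strictly time-share one remote GPU."""
    return single_client_speedup / n_clients

single = 12.5  # roughly the measured single-client speed-up ("over 12x")
for n in range(1, 7):
    print(f"{n} client(s): at least {worst_case_shared_speedup(single, n):.2f}x")
```

Even this pessimistic bound leaves every one of the six clients ahead of running on its local Tegra GPU, which is the argument for sharing one accelerated server among many ARM nodes.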

59 Outline (next section: Conclusions and future work)

60 Conclusions: a proposal for improving the efficiency of future exascale systems, using rcuda to provide ARMs with access to high-end remote GPUs. Execution time is sped up by up to 12x. ARM power efficiency is leveraged along with the high-end accelerators, which are dynamically shared among multiple ARMs.

61 Future work: analyze this proposal with HPC applications; include energy and power measurements; validate whether our proposal can contribute to energy efficiency in future exascale systems.

62 Get a free copy of rcuda at


More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

The Era of Heterogeneous Computing

The Era of Heterogeneous Computing The Era of Heterogeneous Computing EU-US Summer School on High Performance Computing New York, NY, USA June 28, 2013 Lars Koesterke: Research Staff @ TACC Nomenclature Architecture Model -------------------------------------------------------

More information

Mapping MPI+X Applications to Multi-GPU Architectures

Mapping MPI+X Applications to Multi-GPU Architectures Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under

More information

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

Big Data Systems on Future Hardware. Bingsheng He NUS Computing

Big Data Systems on Future Hardware. Bingsheng He NUS Computing Big Data Systems on Future Hardware Bingsheng He NUS Computing http://www.comp.nus.edu.sg/~hebs/ 1 Outline Challenges for Big Data Systems Why Hardware Matters? Open Challenges Summary 2 3 ANYs in Big

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA

GPU COMPUTING AND THE FUTURE OF HPC. Timothy Lanfear, NVIDIA GPU COMPUTING AND THE FUTURE OF HPC Timothy Lanfear, NVIDIA ~1 W ~3 W ~100 W ~30 W 1 kw 100 kw 20 MW Power-constrained Computers 2 EXASCALE COMPUTING WILL ENABLE TRANSFORMATIONAL SCIENCE RESULTS First-principles

More information

Solutions for Scalable HPC

Solutions for Scalable HPC Solutions for Scalable HPC Scot Schultz, Director HPC/Technical Computing HPC Advisory Council Stanford Conference Feb 2014 Leading Supplier of End-to-End Interconnect Solutions Comprehensive End-to-End

More information

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker CUDA on ARM Update Developing Accelerated Applications on ARM Bas Aarts and Donald Becker CUDA on ARM: a forward-looking development platform for high performance, energy efficient hybrid computing It

More information

Atos announces the Bull sequana X1000 the first exascale-class supercomputer. Jakub Venc

Atos announces the Bull sequana X1000 the first exascale-class supercomputer. Jakub Venc Atos announces the Bull sequana X1000 the first exascale-class supercomputer Jakub Venc The world is changing The world is changing Digital simulation will be the key contributor to overcome 21 st century

More information

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige

How GPUs can find your next hit: Accelerating virtual screening with OpenCL. Simon Krige How GPUs can find your next hit: Accelerating virtual screening with OpenCL Simon Krige ACS 2013 Agenda > Background > About blazev10 > What is a GPU? > Heterogeneous computing > OpenCL: a framework for

More information

World s most advanced data center accelerator for PCIe-based servers

World s most advanced data center accelerator for PCIe-based servers NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying

More information

MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE

MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE MICROWAY S NVIDIA TESLA V100 GPU SOLUTIONS GUIDE LEVERAGE OUR EXPERTISE sales@microway.com http://microway.com/tesla NUMBERSMASHER TESLA 4-GPU SERVER/WORKSTATION Flexible form factor 4 PCI-E GPUs + 3 additional

More information

IBM Power Advanced Compute (AC) AC922 Server

IBM Power Advanced Compute (AC) AC922 Server IBM Power Advanced Compute (AC) AC922 Server The Best Server for Enterprise AI Highlights IBM Power Systems Accelerated Compute (AC922) server is an acceleration superhighway to enterprise- class AI. A

More information

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut

TR An Overview of NVIDIA Tegra K1 Architecture. Ang Li, Radu Serban, Dan Negrut TR-2014-17 An Overview of NVIDIA Tegra K1 Architecture Ang Li, Radu Serban, Dan Negrut November 20, 2014 Abstract This paperwork gives an overview of NVIDIA s Jetson TK1 Development Kit and its Tegra K1

More information

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer

NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY. Peter Messmer NOVEL GPU FEATURES: PERFORMANCE AND PRODUCTIVITY Peter Messmer pmessmer@nvidia.com COMPUTATIONAL CHALLENGES IN HEP Low-Level Trigger High-Level Trigger Monte Carlo Analysis Lattice QCD 2 COMPUTATIONAL

More information

Heterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1

Heterogeneous SoCs. May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1 COSCOⅣ Heterogeneous SoCs M5171111 HASEGAWA TORU M5171112 IDONUMA TOSHIICHI May 28, 2014 COMPUTER SYSTEM COLLOQUIUM 1 Contents Background Heterogeneous technology May 28, 2014 COMPUTER SYSTEM COLLOQUIUM

More information

arxiv: v1 [physics.ins-det] 11 Jul 2015

arxiv: v1 [physics.ins-det] 11 Jul 2015 GPGPU for track finding in High Energy Physics arxiv:7.374v [physics.ins-det] Jul 5 L Rinaldi, M Belgiovine, R Di Sipio, A Gabrielli, M Negrini, F Semeria, A Sidoti, S A Tupputi 3, M Villa Bologna University

More information

GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting

GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting Girona, Spain May 4-5 GPU Acceleration of Particle Advection Workloads in a Parallel, Distributed Memory Setting David Camp, Hari Krishnan, David Pugmire, Christoph Garth, Ian Johnson, E. Wes Bethel, Kenneth

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

GPU Architecture. Alan Gray EPCC The University of Edinburgh

GPU Architecture. Alan Gray EPCC The University of Edinburgh GPU Architecture Alan Gray EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? Architectural reasons for accelerator performance advantages Latest GPU Products From

More information

Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms

Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms Empirical Modeling: an Auto-tuning Method for Linear Algebra Routines on CPU plus Multi-GPU Platforms Javier Cuenca Luis-Pedro García Domingo Giménez Francisco J. Herrera Scientific Computing and Parallel

More information

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION

ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION ENERGY-EFFICIENT VISUALIZATION PIPELINES A CASE STUDY IN CLIMATE SIMULATION Vignesh Adhinarayanan Ph.D. (CS) Student Synergy Lab, Virginia Tech INTRODUCTION Supercomputers are constrained by power Power

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29

Introduction CPS343. Spring Parallel and High Performance Computing. CPS343 (Parallel and HPC) Introduction Spring / 29 Introduction CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Introduction Spring 2018 1 / 29 Outline 1 Preface Course Details Course Requirements 2 Background Definitions

More information

PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent

PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python. F.D. Witherden, M. Klemm, P.E. Vincent PyFR: Heterogeneous Computing on Mixed Unstructured Grids with Python F.D. Witherden, M. Klemm, P.E. Vincent 1 Overview Motivation. Accelerators and Modern Hardware Python and PyFR. Summary. Motivation

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration

Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration Concurrent execution of an analytical workload on a POWER8 server with K40 GPUs A Technology Demonstration Sina Meraji sinamera@ca.ibm.com Berni Schiefer schiefer@ca.ibm.com Tuesday March 17th at 12:00

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues

represent parallel computers, so distributed systems such as Does not consider storage or I/O issues Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines

More information

Document downloaded from:

Document downloaded from: Document downloaded from: http://hdl.handle.net/10251/70225 This paper must be cited as: Reaño González, C.; Silla Jiménez, F. (2015). On the Deployment and Characterization of CUDA Teaching Laboratories.

More information

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA

HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA HETEROGENEOUS HPC, ARCHITECTURAL OPTIMIZATION, AND NVLINK STEVE OBERLIN CTO, TESLA ACCELERATED COMPUTING NVIDIA STATE OF THE ART 2012 18,688 Tesla K20X GPUs 27 PetaFLOPS FLAGSHIP SCIENTIFIC APPLICATIONS

More information

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA 1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,

More information

GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester

GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester NVIDIA GPU Computing A Revolution in High Performance Computing GPUs and the Future of Accelerated Computing Emerging Technology Conference 2014 University of Manchester John Ashley Senior Solutions Architect

More information

NVIDIA Update and Directions on GPU Acceleration for Earth System Models

NVIDIA Update and Directions on GPU Acceleration for Earth System Models NVIDIA Update and Directions on GPU Acceleration for Earth System Models Stan Posey, HPC Program Manager, ESM and CFD, NVIDIA, Santa Clara, CA, USA Carl Ponder, PhD, Applications Software Engineer, NVIDIA,

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

NVIDIA Jetson Platform Characterization

NVIDIA Jetson Platform Characterization NVIDIA Jetson Platform Characterization Hassan Halawa, Hazem A. Abdelhafez, Andrew Boktor, Matei Ripeanu The University of British Columbia {hhalawa, hazem, boktor, matei}@ece.ubc.ca Abstract. This study

More information

"On the Capability and Achievable Performance of FPGAs for HPC Applications"

On the Capability and Achievable Performance of FPGAs for HPC Applications "On the Capability and Achievable Performance of FPGAs for HPC Applications" Wim Vanderbauwhede School of Computing Science, University of Glasgow, UK Or in other words "How Fast Can Those FPGA Thingies

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation

CUDA Accelerated Linpack on Clusters. E. Phillips, NVIDIA Corporation CUDA Accelerated Linpack on Clusters E. Phillips, NVIDIA Corporation Outline Linpack benchmark CUDA Acceleration Strategy Fermi DGEMM Optimization / Performance Linpack Results Conclusions LINPACK Benchmark

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

High Performance Computing

High Performance Computing High Performance Computing Dror Goldenberg, HPCAC Switzerland Conference March 2015 End-to-End Interconnect Solutions for All Platforms Highest Performance and Scalability for X86, Power, GPU, ARM and

More information

Understanding the Role of GPGPU-accelerated SoC-based ARM Clusters

Understanding the Role of GPGPU-accelerated SoC-based ARM Clusters Understanding the Role of GPGPU-accelerated SoC-based ARM Clusters Reza Azimi, Tyler Fox and Sherief Reda Brown University Providence, RI firstname lastname@brown.edu Abstract The last few years saw the

More information

Building supercomputers from commodity embedded chips

Building supercomputers from commodity embedded chips http://www.montblanc-project.eu Building supercomputers from commodity embedded chips Alex Ramirez Barcelona Supercomputing Center Technical Coordinator This project and the research leading to these results

More information

Deep learning in MATLAB From Concept to CUDA Code

Deep learning in MATLAB From Concept to CUDA Code Deep learning in MATLAB From Concept to CUDA Code Roy Fahn Applications Engineer Systematics royf@systematics.co.il 03-7660111 Ram Kokku Principal Engineer MathWorks ram.kokku@mathworks.com 2017 The MathWorks,

More information

An Introduction to the SPEC High Performance Group and their Benchmark Suites

An Introduction to the SPEC High Performance Group and their Benchmark Suites An Introduction to the SPEC High Performance Group and their Benchmark Suites Robert Henschel Manager, Scientific Applications and Performance Tuning Secretary, SPEC High Performance Group Research Technologies

More information