Experiences Using Tegra K1 and X1 for Highly Energy Efficient Computing

Gaurav Mitra, Andrew Haigh, Luke Angove, Anish Varghese, Eric McCreath, Alistair P. Rendell
Research School of Computer Science, Australian National University, Canberra, Australia
April 07, 2016

Overview
1 Introduction & Background
2 Power Measurement Environment
3 Experimental Platforms
4 Approach
5 Results & Analysis
6 Conclusion

Mitra et al. (ANU), GTC 2016, San Francisco, April 07, 2016

Introduction & Background: Use of low-powered SoCs for HPC
- Nvidia Jetson TK1: ARM + GPU SoC
- Nvidia Jetson TX1: ARM + GPU SoC
- TI Keystone II: ARM + DSP SoC
- Adapteva Parallella: ARM + 64-core NoC
- TI BeagleBoard: ARM + DSP SoC
- Terasic DE1: ARM + FPGA SoC
- Rockchip Firefly: ARM + GPU SoC
- Freescale Wandboard: ARM + GPU SoC
- Cubieboard4: ARM + GPU SoC

Introduction & Background: Use of low-powered SoCs for HPC
For SoC processors to be considered viable exascale building blocks, important factors to explore include:
- Absolute performance
- Balancing use of different on-chip devices
- Understanding the performance-energy trade-off

Introduction & Background: Contributions
- An environment for monitoring and collecting high-resolution power measurements for SoC systems
- Understanding the benefits of exploiting both the host CPU and accelerator GPU cores simultaneously for critical HPC kernels
- Performance and energy comparisons with conventional HPC systems: Intel Xeon CPUs and NVIDIA K20 and K80 GPUs

Power Measurement Environment: Measurement Requirements
- SoC systems generally consume very little power: a few watts
- Subtle differences in energy consumption are triggered by factors such as the use of the CPU or the on-chip GPU cores
- Changes in the DC current supplied to SoC system boards must be measured reliably
- Current draw ranges from microamps to a few amps, so a very high-precision ammeter is needed to measure subtle changes

Power Measurement Environment: Measurement Apparatus
- µCurrent Gold: high-precision ammeter for measuring low currents
- An mbed LPC1768 microcontroller with a 12-bit ADC (0-3.3 V) measures the analog output signal from the µCurrent Gold
- The ADC has a resolution of 0.81 ± 0.40 mV, which corresponds to 0.81 mA, or 9.7 ± 4.8 mW at 12 V
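The resolution figures on this slide follow from the ADC parameters. The sketch below reproduces the arithmetic under two assumptions: the ammeter output is 1 mV per mA (implied by the slide equating 0.81 mV with 0.81 mA), and the quoted uncertainty is half of one ADC step. The function and parameter names are illustrative only.

```python
def adc_resolution(v_ref=3.3, bits=12, mv_per_ma=1.0, supply_v=12.0):
    """Return (step_mv, step_ma, step_mw) for a single ADC count."""
    step_mv = v_ref / (2 ** bits) * 1000.0  # volts per count, converted to mV
    step_ma = step_mv / mv_per_ma           # assumed ammeter gain: 1 mV per mA
    step_mw = step_ma * supply_v            # P = I * V, with mA * V giving mW
    return step_mv, step_ma, step_mw


step_mv, step_ma, step_mw = adc_resolution()
print(f"{step_mv:.2f} mV, {step_ma:.2f} mA, {step_mw:.1f} mW per count")
print(f"half-step uncertainty: {step_mv / 2:.2f} mV, {step_mw / 2:.1f} mW")
```

With the default arguments this gives roughly 0.81 mV (0.81 mA, 9.7 mW) per count and a half-step of about 0.40 mV (4.8 mW), matching the slide.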

Power Measurement Environment
[Figure: power measurement environment]
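The slides do not say how sampled current readings are turned into an energy figure. One plausible approach, shown here purely as an assumption, is to multiply each current sample by the supply voltage and integrate the resulting power over time with the trapezoidal rule. The sample data and the name `energy_joules` are hypothetical.

```python
def energy_joules(timestamps_s, current_a, supply_v=12.0):
    """Integrate P = V * I over time using the trapezoidal rule."""
    if len(timestamps_s) != len(current_a):
        raise ValueError("timestamps and samples must have equal length")
    energy = 0.0
    for idx in range(1, len(timestamps_s)):
        dt = timestamps_s[idx] - timestamps_s[idx - 1]
        p_prev = supply_v * current_a[idx - 1]  # power at previous sample (W)
        p_curr = supply_v * current_a[idx]      # power at current sample (W)
        energy += 0.5 * (p_prev + p_curr) * dt  # area of one trapezoid (J)
    return energy


# Hypothetical trace: a constant 0.5 A drawn for 2 s from a 12 V supply.
times = [0.0, 0.5, 1.0, 1.5, 2.0]
amps = [0.5, 0.5, 0.5, 0.5, 0.5]
print(energy_joules(times, amps))  # 6 W for 2 s -> 12.0
```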

Experimental Platforms

            TK1             TX1             SANDY           HASWELL
CPU         ARM Cortex-A15  ARM Cortex-A57  Xeon E5-2665    Xeon E5-2670 v3
CPU Cores
CPU Freq.   2.3 GHz         2.2 GHz         2.4 GHz         2.3 GHz
RAM         2GB LPDDR3      3GB LPDDR4      128GB DDR3      128GB DDR3
GPU         GK20A           GM20B           K20m (GK110)    K80 (GK210)
GPU Cores
GPU Freq.   852 MHz         998 MHz         706 MHz         875 MHz
GPU RAM     Shared          Shared          5GB             12GB
CUDA        v6.5            v7.0            v7.0            v7.5

Approach: Evaluation Kernel
Matrix multiplication: C = A B
Column partition: [C1 C2] = A [B1 B2], so C1 = A B1 and C2 = A B2
C1 = A B1 is assigned to the CPU and C2 = A B2 to the GPU
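The column partition can be checked with a small pure-Python sketch. It only illustrates the algebra: both partial products run on the host here, and the helper names (`matmul`, `split_columns`, `partitioned_matmul`) are illustrative rather than taken from the talk.

```python
def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_columns(b, n_cpu):
    """Split B into B1 (first n_cpu columns) and B2 (remaining columns)."""
    return [row[:n_cpu] for row in b], [row[n_cpu:] for row in b]


def partitioned_matmul(a, b, n_cpu):
    """Compute C = [C1 C2] with C1 = A*B1 and C2 = A*B2."""
    b1, b2 = split_columns(b, n_cpu)
    c1 = matmul(a, b1)  # the share that would go to the CPU
    c2 = matmul(a, b2)  # the share that would go to the GPU
    return [r1 + r2 for r1, r2 in zip(c1, c2)]  # concatenate columns


a = [[1, 2], [3, 4]]
b = [[5, 6, 7], [8, 9, 10]]
print(partitioned_matmul(a, b, 1) == matmul(a, b))  # True
```

Because each column of C depends only on the matching column of B, the two partial products are independent and can run concurrently on different devices.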

Approach: Approaches
- Traditional methods: assign all work to the GPU or to the CPU
- Static partitioning: partition work between GPU and CPU based on a priori information
    Beaumont et al., Matrix Multiplication on Heterogeneous Platforms
    C. Yang et al., Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing
    Donfack et al., Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs
- Dynamic partitioning:
    Papadrakakis et al., A New Era in Scientific Computing: Domain Decomposition Methods in Hybrid CPU-GPU Architectures
- Existing approaches do not consider the use of shared physical memory or the implications for energy efficiency

Approach: Our approach
- Static partitioning:
    Guess a partition based on experimentally measured peak performances of the CPU and GPU
    Use the achieved peaks to refine the partition
    Repeat until convergence
    Suitable for repeated calculations of the same size
- Use of shared memory on SoC systems:
    The CUDA driver automatically protects CUDA-allocated memory during the kernel execution phase
    We circumvent this by unprotecting the memory with mprotect() immediately after initiating a kernel execution
- Dynamic partitioning:
    CPU and GPU remove chunks of matrix columns from a workqueue
    Chunk size must be sufficient to occupy the CPU and GPU fully
    On traditional discrete GPU systems, copies have to be carefully scheduled
    Implemented using OpenMP: two threads, one each for CPU and GPU, taking work off a master queue
    The GPU thread executes at the expense of doing productive work on the CPU cores
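The two partitioning schemes on this slide can be sketched as follows. This is not the authors' implementation: the talk used OpenMP and CUDA, whereas this sketch uses Python threads as a stand-in, and `measure`, `workers`, and all function names are hypothetical. The static routine assumes the split is made proportional to the measured rates, which the slide does not state explicitly.

```python
import queue
import threading


def refine_static_split(n_cols, cpu_peak, gpu_peak, measure, max_iter=20):
    """Iterate a column split until it stops changing.

    measure(n_cpu, n_gpu) is a hypothetical callback returning the
    achieved (cpu_rate, gpu_rate) for that split.
    """
    n_cpu = round(n_cols * cpu_peak / (cpu_peak + gpu_peak))  # initial guess
    for _ in range(max_iter):
        cpu_rate, gpu_rate = measure(n_cpu, n_cols - n_cpu)
        new_n_cpu = round(n_cols * cpu_rate / (cpu_rate + gpu_rate))
        if new_n_cpu == n_cpu:  # converged
            break
        n_cpu = new_n_cpu
    return n_cpu


def dynamic_partition(n_cols, chunk, workers):
    """Let one thread per device pull column chunks off a shared queue.

    workers maps a device name to a callable(start, stop) that processes
    columns [start, stop). Returns the chunks each device handled.
    """
    work = queue.Queue()
    for start in range(0, n_cols, chunk):
        work.put((start, min(start + chunk, n_cols)))
    done = {name: [] for name in workers}

    def run(name, fn):
        while True:
            try:
                start, stop = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this device is finished
            fn(start, stop)
            done[name].append((start, stop))

    threads = [threading.Thread(target=run, args=item) for item in workers.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


# Hypothetical devices: the achieved rates are fixed at 10 and 30 GFLOPS.
print(refine_static_split(4000, 20.0, 20.0, lambda c, g: (10.0, 30.0)))  # 1000

handled = dynamic_partition(10, 4, {"cpu": lambda s, e: None,
                                    "gpu": lambda s, e: None})
print(sorted(handled["cpu"] + handled["gpu"]))
```

The static scheme suits repeated runs at one problem size because the converged split can be reused, while the queue-based scheme adapts at run time at the cost of dedicating a thread to driving the GPU.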

Results & Analysis: Results - Best split performance
[Table: Platform | Matrix Size | CPU GFLOPS | GPU GFLOPS | CPU SPLIT COLS | SPLIT GFLOPS]
DGEMM rows: TK1, TX1, SANDY, HASWELL
SGEMM rows: TK1, TX1, SANDY, HASWELL

Results & Analysis: Best Split Search - Tegra K1/X1
[Figure: DGEMM and SGEMM panels showing GFLOPS and Joules for TK1 and TX1 against split size given to CPU]

Results & Analysis: Best Split Search - Intel + NVIDIA GPUs
[Figure: DGEMM and SGEMM panels showing GFLOPS and Joules for SANDY and HASWELL against split size given to CPU]

Results & Analysis: Performance Scaling - TK1
[Figure: DGEMM and SGEMM GFLOPS against matrix dimension M=N=K for CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU)]

Results & Analysis: Performance Scaling - TX1
[Figure: DGEMM and SGEMM GFLOPS against matrix dimension M=N=K for CPU, GPU, SPLIT, DYNAMIC, TBALANCE, and PEAK (CPU+GPU)]

Results & Analysis: Energy Efficiency - TX1 - SGEMM
[Figure: Joules/FLOP (SP) against matrix dimension M=N=K for CPU, GPU, SPLIT, TBALANCE, and DYNAMIC]

Results & Analysis: Energy Efficiency - Haswell - SGEMM
[Figure: Joules/FLOP (SP) against matrix dimension M=N=K for CPU, GPU, SPLIT, TBALANCE, and DYNAMIC]
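The Joules/FLOP metric on the energy-efficiency slides can be computed from a measured energy and the matrix dimensions. The sketch below assumes the conventional GEMM operation count of 2·M·N·K (one multiply and one add per inner-product step); the talk does not state which count it used, and the function names are illustrative. The conversion helper is applied to the two pJ/flop figures quoted in the conclusion.

```python
def gemm_flops(m, n, k):
    """Conventional GEMM operation count: a multiply and an add per step."""
    return 2 * m * n * k


def joules_per_flop(energy_j, m, n, k):
    """Energy per floating-point operation for one GEMM call."""
    return energy_j / gemm_flops(m, n, k)


def gflops_per_watt(pj_per_flop):
    """1 pJ/flop equals 1000 GFLOP per joule, i.e. 1000 GFLOPS per watt."""
    return 1000.0 / pj_per_flop


print(f"{gflops_per_watt(37.5):.1f} GFLOPS/W")  # TX1 SGEMM figure -> 26.7
print(f"{gflops_per_watt(82.4):.1f} GFLOPS/W")  # K80 SGEMM figure -> 12.1
```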

Conclusion
- The high-accuracy, high-resolution energy measurement system introduced here enables tuning algorithms for optimal energy usage. This would allow libraries like ATLAS to tune for and produce both best-performance and best-energy optimized builds.
- How might a running application use information on energy usage to dynamically change its behaviour?
- Use of shared physical memory on SoC systems eliminates transfer overhead
- In one case (TX1 DGEMM), an energy benefit was observed from exploiting both CPU and GPU together
- The best energy efficiency observed on SoC systems was 37.5 pJ/flop for SGEMM on the TX1, while on conventional systems the best was 82.4 pJ/flop for SGEMM on the K80

Contact: Alistair.Rendell@anu.edu.au
