CUDA on ARM Update. Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker

CUDA on ARM Update Developing Accelerated Applications on ARM Bas Aarts and Donald Becker

CUDA on ARM: a forward-looking development platform for high performance, energy efficient hybrid computing It s a platform for the next generation of HPC, leveraging commodity driven improvements from the most rapidly evolving compute markets. 2

The next revolution: Power Efficiency Look at the market for the next generation of HPC components Power-effective computing driven by phones and tablets ARM, with architectural and experience advantages System-level software complexity is high HPC driven by accelerated computing All major vendors have switched to accelerators GPUs have an architectural efficiency advantage Titan gets 90% of its performance from the accelerator 3

Possible Obvious Power-efficient Future Power-efficient general purpose cores combined with Compute Accelerators Power control shared with mobile products Ultra-focused on power efficiency Competition forces rapid improvement Technology evolution driven by commodity market Bulk of compute power provided by inherently efficient GPUs Increase to over 50% of chip power for flops. 4

NVIDIA has these elements GPU and Computing ARM SoCs VGX 5

Why CUDA on ARM? Development platforms for future HPC systems Explore the efficiency and performance trade-offs Utilize existing hardware: construct systems with ARM CPUs combined with a discrete GPU 6

Current Generation: MXM Devkit SECO carrier board: SECO MXM Devkit NVidia Tegra 3 CPU on Q7 module 4 arm A9 cores, NEON and VFPv3 2GB DRAM, and 4-8GB embedded flash NVidia MXM GPU module Quadro 1000m (GF108) on 4 lanes of PCIe 96 CUDA cores with 269 GFlops peak Carrier provides I/O connectors, power supplies PCIe connected 1Gbps Ethernet (i82574), USB, SATA 7

Current Generation Software ARM Linux distribution L4T r15.2 softfp, Ubuntu 11.04 Linux 3.1.10 kernel Cuda 4.2 toolkit and samples, 302.26 driver x86 system support for cross development nvcc cross-compiler support 9

Introducing KAYLA Support of Kepler-class GPU SM35 adds dynamic parallelism and other features 2 SMX, 384 CUDA cores Comes in MXM and PCIe form factor Capability approaching Logan SoC Integrated solution will be more power-efficient 10

Next Generation: mitx Devkit Seco carrier board: Seco mini-itx GPU devkit NVidia Tegra 3 CPU on Q7 module 4 arm A9 cores, NEON and VFPv3 2GB DRAM, and 4-8GB embedded flash NVidia PCIe GPU ATX power supply supports higher power GPUs Qualified for gf108, gk107, gk104, and Kayla GPU Carrier provides I/O connectors 11

Next Generation Hardware 12

Next Generation Software Arm Linux distribution Based on L4T R16.2 hardfp, Ubuntu 12.04 Linux 3.1.10 kernel Cuda 5.0 toolkit and samples, 313.24 driver Increased parity with x86 Linux (nvcuvid, nvprof, thrust) x86 system support for cross development nvcc cross-compiler support nfs-kernel-server support to ease cross compilation Back ported to SECO MXM Devkit 13

CUDA on ARM Roadmap Software CUDA releases starting with CUDA 5.5 and 319.xy include ARM support Native ARM compiler cuda-gdb: native ARM and client-server Long term plans for the ARM platform Logan, Tegra with integrated Kepler class GPU ARMv8 64-bit platform support Enable other partners and industry support 14

Notes on Comparing Compute Efficiency Measuring power isn t always easy Multiple points to measure input powerh Multiple power rails and components Different peripherals and activity Active cooling and over-cooling are significant power draws Measuring application power draw adds to the challenge I/O and DRAM activity can be power-hungry Different phases have different power profiles A power-efficient system has widely varying power draw Turn off the lights when you leave the room Recent activity has a big influence on present power draw 15

Power, Performance, and Benchmarks Current Power Condition 0.46A No GPU installed, SATA disk 0.50A 9.12W Idle power, fan off 0.60A 10.9W Idle power, slow fan 0.66A +1.1W Idle with SATA disk 0.86A +3.65W GPU power state set to maximum performance 1.06A 19.3W Running smoke at 27FPS, average (23W peak) 2.05A 37.4W Running real-time raytracing (41.1W peak) 16

Demos Glass, galaxy, CUDA samples, etc 17

Developer Information Information: http://www.nvidia.com/object/seco-dev-kit https://developer.nvidia.com/linux-tegra https://developer.nvidia.com/category/zone/cuda-zone Forums: http://forum.seco.com/ https://devtalk.nvidia.com/ 18

Back-up material 19

Kayla MXM Module 540 MHz GPU clock 1GB GDDR5 Single precision 384*540*2 = 415 GFLOPs 20

Partners <Add info on partner hardware and tools> 21

Multi-core CPUs Multi-core as a first response to power issues Performance through parallelism, not frequency increases Slow the complexity spiral Better locality in many cases But CPUs have evolved for single thread performance rather than energy efficiency Fast clock rates with deep pipelines Data and instruction caches optimized for latency Superscalar issue with out-of-order execution Dynamic conflict detection Lots of predictions and speculative execution Lots of instruction overhead per operation Less than 2% of chip power today goes to flops. 22

The cluster revolution was driven by Cost-effective computing: Dollars per FLOP Scalable programming and use: PC to supercomputer Long-term viability: Commodity driven improvements We now need to incorporate power-efficient computing 23

Tegra T30 Performance Tachyon 0.99.1 speed tests, balls.dat 64-bit Linux Intel Core i7-3960x @ 3.3 GHz 6-cores 0.074 sec double-prec 4-cores 0.107 sec double-prec 2-cores 0.210 sec double-prec 1-core 0.415 sec double-prec NVIDIA SECO "CARMA" Tegra3 ARM+CUDA development kit 4-cores 1.150 sec double-prec 4-cores 0.819 sec single-prec Performance tests by John Stone, UIUC 24

Tegra T30 Performance Processor 1 core 2 cores 4 cores 6 cores i7-3960x @ 3.3 GHz 0.415 0.210 0.107 0.074 CARMA T30 @ 1.2GHz 1.150 25

Possible Obvious Power-efficient Future Power-efficient general core combined with GPU Power control shared with mobile products Ultra-focused on power efficiency Competition forces rapid improvement Technology evolution driven by commodity market Bulk of compute power provided by inherently efficient GPUs Increase to over 50% of chip power for flops. 26

Tegra memory controller Main memory for the host CPUs Two 32-bit channels DDR3L or LPDDR3 DRAM Design focus on efficiency Nominally 1866 Mbps/pin when at 933 MHz 2GB, the limit for current ARM Some of the pressure for ARMv8 27