CUDA on ARM Update: Developing Accelerated Applications on ARM. Bas Aarts and Donald Becker


CUDA on ARM: a forward-looking development platform for high-performance, energy-efficient hybrid computing. It is a platform for the next generation of HPC, leveraging commodity-driven improvements from the most rapidly evolving compute markets.

The next revolution: Power Efficiency
Look at the market for the next generation of HPC components.
- Power-efficient computing is driven by phones and tablets: ARM has architectural and experience advantages here, although system-level software complexity is high.
- HPC is driven by accelerated computing: all major vendors have switched to accelerators, GPUs have an architectural efficiency advantage, and Titan gets 90% of its performance from the accelerator.

Possible (Obvious) Power-Efficient Future
- Power-efficient general-purpose cores combined with compute accelerators.
- Power control shared with mobile products: ultra-focused on power efficiency, competition forces rapid improvement, and technology evolution is driven by the commodity market.
- The bulk of compute power is provided by inherently efficient GPUs, increasing to over 50% of chip power going to flops.

NVIDIA has these elements: GPU computing, ARM SoCs, and VGX.

Why CUDA on ARM?
- Development platforms for future HPC systems.
- Explore the efficiency and performance trade-offs.
- Utilize existing hardware: construct systems with ARM CPUs combined with a discrete GPU.

Current Generation: MXM Devkit
SECO carrier board: SECO MXM Devkit
- NVIDIA Tegra 3 CPU on a Q7 module: 4 ARM Cortex-A9 cores with NEON and VFPv3, 2 GB DRAM, and 4-8 GB embedded flash.
- NVIDIA MXM GPU module: Quadro 1000M (GF108) on 4 lanes of PCIe, 96 CUDA cores with 269 GFLOPS peak.
- Carrier provides I/O connectors and power supplies: PCIe-connected 1 Gbps Ethernet (i82574), USB, and SATA.
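A quick way to sanity-check a board like this is a minimal device query from the ARM host. The sketch below uses only the standard CUDA runtime API; the file name and the fields printed are illustrative, not taken from the devkit's samples.

    // query_gpu.cu -- minimal sketch: enumerate the CUDA devices visible from the ARM host.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // On the MXM devkit this should report the Quadro 1000M (GF108), SM 2.1.
            std::printf("Device %d: %s, SM %d.%d, %d multiprocessors, %.0f MB\n",
                        i, prop.name, prop.major, prop.minor,
                        prop.multiProcessorCount,
                        prop.totalGlobalMem / (1024.0 * 1024.0));
        }
        return 0;
    }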


Current Generation Software
- ARM Linux distribution: L4T r15.2, softfp, Ubuntu 11.04, Linux 3.1.10 kernel.
- CUDA 4.2 toolkit and samples, 302.26 driver.
- x86 system support for cross development: nvcc cross-compiler support.

Introducing KAYLA
- Support for a Kepler-class GPU: SM 3.5 adds dynamic parallelism and other features; 2 SMX, 384 CUDA cores.
- Comes in MXM and PCIe form factors.
- Capability approaching the Logan SoC; the integrated solution will be more power-efficient.
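Dynamic parallelism is the headline SM 3.5 feature here, so a minimal sketch of the pattern may help: a parent kernel partitions an array and launches child grids entirely on the device. The kernels and sizes are placeholders, and the build line in the comment is an assumption about a CUDA 5.x-era setup; the -ccbin cross-compiler path in particular is not from the slides.

    // dynpar.cu -- minimal sketch of CUDA dynamic parallelism (requires SM 3.5, e.g. the Kayla GPU).
    // Assumed build line for the CUDA 5.x era:
    //   nvcc -arch=sm_35 -rdc=true dynpar.cu -lcudadevrt -o dynpar [-ccbin <ARM cross g++>]
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void child(float *slice, int len, float value) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < len)
            slice[i] = value;                      // placeholder work
    }

    __global__ void parent(float *data, int n) {
        // Each parent thread launches a child grid over its own slice of the array,
        // with no round trip to the host CPU.
        int id = threadIdx.x;
        int slice = n / blockDim.x;
        child<<<(slice + 127) / 128, 128>>>(data + id * slice, slice, (float)id);
    }

    int main() {
        const int n = 1 << 16;
        float *data;
        cudaMalloc(&data, n * sizeof(float));
        cudaMemset(data, 0, n * sizeof(float));

        parent<<<1, 4>>>(data, n);                 // 4 parent threads -> 4 device-side launches
        cudaError_t err = cudaDeviceSynchronize(); // waits for the parent and all children
        std::printf("dynamic parallelism: %s\n", cudaGetErrorString(err));

        cudaFree(data);
        return 0;
    }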

Next Generation: mini-ITX Devkit
SECO carrier board: SECO mini-ITX GPU devkit
- NVIDIA Tegra 3 CPU on a Q7 module: 4 ARM Cortex-A9 cores with NEON and VFPv3, 2 GB DRAM, and 4-8 GB embedded flash.
- NVIDIA PCIe GPU: an ATX power supply supports higher-power GPUs; qualified for GF108, GK107, GK104, and the Kayla GPU.
- Carrier provides I/O connectors.

Next Generation Hardware [image slide]

Next Generation Software
- ARM Linux distribution based on L4T R16.2: hardfp, Ubuntu 12.04, Linux 3.1.10 kernel.
- CUDA 5.0 toolkit and samples, 313.24 driver; increased parity with x86 Linux (nvcuvid, nvprof, Thrust).
- x86 system support for cross development: nvcc cross-compiler support, plus nfs-kernel-server support to ease cross compilation.
- Back-ported to the SECO MXM Devkit.
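"Increased parity" means the familiar host-side libraries behave the same as on x86. For example, a minimal Thrust program (illustrative values only) compiles and runs unchanged on the ARM host:

    // thrust_demo.cu -- minimal sketch: Thrust code works the same on ARM as on x86.
    #include <cstdio>
    #include <thrust/device_vector.h>
    #include <thrust/sequence.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <thrust/functional.h>

    int main() {
        thrust::device_vector<int> v(1 << 20);
        thrust::sequence(v.begin(), v.end());                      // 0, 1, 2, ...
        thrust::sort(v.begin(), v.end(), thrust::greater<int>());  // sort descending on the GPU
        long long sum = thrust::reduce(v.begin(), v.end(), 0LL);   // reduce on the GPU
        std::printf("sum = %lld\n", sum);                          // expect n*(n-1)/2
        return 0;
    }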

CUDA on ARM Roadmap
- Software: CUDA releases starting with CUDA 5.5 and the 319.xy drivers include ARM support, a native ARM compiler, and cuda-gdb in both native ARM and client-server configurations.
- Long-term plans for the ARM platform: Logan, a Tegra with an integrated Kepler-class GPU; ARMv8 64-bit platform support; enabling other partners and industry support.

Notes on Comparing Compute Efficiency
- Measuring power isn't always easy: there are multiple points at which to measure input power, multiple power rails and components, and different peripherals and activity; active cooling and over-cooling are significant power draws.
- Measuring application power draw adds to the challenge: I/O and DRAM activity can be power-hungry, and different phases have different power profiles.
- A power-efficient system has widely varying power draw ("turn off the lights when you leave the room"), so recent activity has a big influence on present power draw.
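Because different phases have different power profiles, one practical approach (a sketch, not a methodology prescribed by the talk) is to bracket each phase with CUDA events so GPU-side timestamps can be lined up against an external power-meter log. busy_kernel and the phase names below are placeholders.

    // phase_timing.cu -- sketch: bracket application phases with CUDA events so the
    // GPU-side durations can be correlated with an external power-meter trace.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void busy_kernel(float *x, int n) {        // placeholder for real work
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            for (int k = 0; k < 1000; ++k)
                x[i] = x[i] * 1.0000001f + 0.5f;
    }

    static void timed_phase(const char *name, float *buf, int n) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        busy_kernel<<<(n + 255) / 256, 256>>>(buf, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("%s: %.1f ms\n", name, ms);            // match against the meter log
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }

    int main() {
        const int n = 1 << 20;
        float *buf;
        cudaMalloc(&buf, n * sizeof(float));
        cudaMemset(buf, 0, n * sizeof(float));

        timed_phase("phase 1 (compute)", buf, n);
        timed_phase("phase 2 (compute)", buf, n);

        cudaFree(buf);
        return 0;
    }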

Power, Performance, and Benchmarks

Current   Power     Condition
0.46 A    --        No GPU installed, SATA disk
0.50 A    9.12 W    Idle power, fan off
0.60 A    10.9 W    Idle power, slow fan
0.66 A    +1.1 W    Idle with SATA disk
0.86 A    +3.65 W   GPU power state set to maximum performance
1.06 A    19.3 W    Running smoke at 27 FPS, average (23 W peak)
2.05 A    37.4 W    Running real-time ray tracing (41.1 W peak)
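For reference, the wattage column is consistent with the current readings at roughly an 18.2 V input (presumably a nominal 19 V supply measured under load); a quick check:

    P \approx V \cdot I:\quad
    0.50\,\mathrm{A} \times 18.2\,\mathrm{V} \approx 9.1\,\mathrm{W},\qquad
    2.05\,\mathrm{A} \times 18.2\,\mathrm{V} \approx 37.3\,\mathrm{W}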

Demos: glass, galaxy, CUDA samples, etc.

Developer Information
Information:
- http://www.nvidia.com/object/seco-dev-kit
- https://developer.nvidia.com/linux-tegra
- https://developer.nvidia.com/category/zone/cuda-zone
Forums:
- http://forum.seco.com/
- https://devtalk.nvidia.com/

Back-up material

Kayla MXM Module
- 540 MHz GPU clock
- 1 GB GDDR5
- Single precision: 384 cores × 540 MHz × 2 flops/cycle ≈ 415 GFLOPS

Partners: <Add info on partner hardware and tools>

Multi-core CPUs
- Multi-core was a first response to power issues: performance through parallelism rather than frequency increases, slowing the complexity spiral, with better locality in many cases.
- But CPUs have evolved for single-thread performance rather than energy efficiency: fast clock rates with deep pipelines, data and instruction caches optimized for latency, superscalar issue with out-of-order execution, dynamic conflict detection, and lots of prediction and speculative execution.
- The result is lots of instruction overhead per operation; less than 2% of chip power today goes to flops.

The cluster revolution was driven by:
- Cost-effective computing: dollars per FLOP
- Scalable programming and use: PC to supercomputer
- Long-term viability: commodity-driven improvements
We now need to incorporate power-efficient computing.

Tegra T30 Performance
Tachyon 0.99.1 speed tests, balls.dat

64-bit Linux, Intel Core i7-3960X @ 3.3 GHz:
  6 cores   0.074 s   double precision
  4 cores   0.107 s   double precision
  2 cores   0.210 s   double precision
  1 core    0.415 s   double precision

NVIDIA SECO "CARMA" Tegra 3 ARM+CUDA development kit:
  4 cores   1.150 s   double precision
  4 cores   0.819 s   single precision

Performance tests by John Stone, UIUC.

Tegra T30 Performance (seconds, double precision)

Processor             1 core   2 cores   4 cores   6 cores
i7-3960X @ 3.3 GHz    0.415    0.210     0.107     0.074
CARMA T30 @ 1.2 GHz   --       --        1.150     --

Possible (Obvious) Power-Efficient Future
- Power-efficient general-purpose cores combined with the GPU.
- Power control shared with mobile products: ultra-focused on power efficiency, competition forces rapid improvement, and technology evolution is driven by the commodity market.
- The bulk of compute power is provided by inherently efficient GPUs, increasing to over 50% of chip power going to flops.

Tegra memory controller
- Main memory for the host CPUs: two 32-bit channels of DDR3L or LPDDR3 DRAM, with a design focus on efficiency; nominally 1866 Mbps/pin when running at 933 MHz.
- 2 GB is the limit for the current ARM parts, which is some of the pressure for ARMv8.
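From those figures, a rough peak-bandwidth estimate for the host memory (a back-of-the-envelope number from the stated channel width and data rate, not a measurement from the talk):

    2\ \text{channels} \times 32\ \text{bits/channel} \times 1866\ \text{Mbps/pin}
      = 119{,}424\ \text{Mbit/s} \approx 14.9\ \text{GB/s peak}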