Scaling Throughput Processors for Machine Intelligence

Scaling Throughput Processors for Machine Intelligence. ScaledML, Stanford, 24 March 2018. simon@graphcore.ai

The impact on humanity of harnessing machine intelligence (MI) will be greater than the impact of harnessing machine power. This is the strongest driver to re-invent computing since the invention of computing. Programming, software, and hardware will all change.

There are three parts (so far) to "intelligence compute": simulating environments, exploring model structures, and optimizing candidate models. The fundamental and challenging new data type is the graph, not really the tensor calculations at its vertices. [Figure: ResNet-50 training, batch=4.]

Processors are built for specific workloads.
CPU: scalar. Designed for office apps; evolved for web servers.
GPU: vector. Designed for graphics; evolved for linear algebra.
IPU: graph. Designed for intelligence.

We've all tried to squeeze the pips out of tensor algebra. Headline vs delivered dot-product flops training ResNet-50 (Amdahl's Law is tough!):
Volta: 125 Tflop/s peak, 11% delivered, batch=64.
TPU2: 180 Tflop/s peak*, 18% delivered, batch=256.
Colossus: >200 Tflop/s peak**, ~20% delivered, batch=4.
*The TPU2 card certainly burns more power than V100 and Colossus.
**Colossus peak tba.

Our comprehension and mechanization of intelligence is nascent. This is a new class of workload: approximate computing on probability distributions learned from data. SOTA models and algorithms change every year, and there is huge compute demand for model optimization, and more for exploration. Until we know more, we need computing machines which:
- exploit massive parallelism;
- are broadly agnostic to model structure, but can assume sparsity, stochasticity, and physical invariances in space and time;
- have a simple programming abstraction;
- are efficient for exploration, training, and deployed inference.

Say we want 1000x throughput performance in the next decade. How much will come for free from silicon scaling? The era of amazing silicon scaling ended around 2005 (90nm). All throughput processors are now power limited.

Wire capacitance density is a proxy for total logic capacitance density. Each process node gives ~2x transistor density and ~2x capacitance density, and dynamic power follows P ~ C·V²·f. Best case: reduce V by ~20% per node to deliver 2x compute at constant power; this is tough. [Chart: wire C density vs wire grid density, both normalized, across foundry nodes 90nm, 65nm, 40nm, 28nm, 16nm.]

Xeon servers are not delivering that best case: transistors/chip grow ~30% per year, but throughput/Watt only ~15% per year, at roughly 2.5 years per node. [Charts for big Xeon exemplars, plotting the geomean for each node: logic transistor density and SRAM bitcells/µm² scale ~2x per node, as do core cycles/joule. Exemplars: 65nm Woodcrest X5160 2c, Conroe X3085 2c; 45nm Lynnfield X3470 4c, Nehalem X7550 8c; 32nm Westmere E7-8870 10c, SandyBridge E5-2690 8c; 22nm IvyBridge E7-2890v2 15c, Haswell E7-8890v3 18c; 14nm Broadwell E7-8894v4 24c, Skylake 8180P 28c.]

Scaling at constant frequency. [Figure: three generations of the largest manufacturable die (~825 mm²), each with a 200W budget: node N with T transistors, node N+1 with 2T, node N+2 with 4T. The active fraction of the die shrinks each node, both at best-case scaling and at observed Xeon scaling.]
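A rough way to see the shrinking active fraction is to plug the slide's own factors into P ~ C·V²·f. The sketch below is illustrative only: it assumes a fixed die area, constant frequency, ~2x capacitance density per node, treats a ~20% voltage drop per node as the best case, and adds a no-voltage-scaling case as my own pessimistic assumption.

```python
# Illustrative dark-silicon arithmetic under P ~ C*V^2*f, constant frequency,
# fixed die area, and a fixed power ceiling. The per-node factors (2x
# capacitance density, ~20% best-case voltage drop) follow the slides;
# the no-voltage-scaling case is an assumption for contrast.
def active_fraction(nodes, c_density_scale=2.0, v_scale=0.8):
    """Fraction of the die that can switch at once after `nodes` shrinks."""
    frac = 1.0
    for _ in range(nodes):
        # If everything stayed active, power would grow with C density and V^2.
        power_growth = c_density_scale * v_scale ** 2
        frac = min(1.0, frac / power_growth)
    return frac

for n in range(3):   # node N, N+1, N+2 as in the figure
    best = active_fraction(n, v_scale=0.8)   # best-case voltage scaling
    flat = active_fraction(n, v_scale=1.0)   # no voltage scaling
    print(f"node N+{n}: best-case active ~{best:.0%}, no-V-scaling ~{flat:.0%}")
```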

Silicon process scaling will probably yield 3-10x throughput in the next decade. If we want another 100x, it must come from connecting chips together and from new architecture. And not enough will come from new hardware architecture without software and algorithm co-innovation.

Architecture drivers:
- Silicon efficiency is now the full (productive) use of available power. Memory on chip consumes little power and is more useful than logic which can't be powered. So keep data local, i.e. distribute the memory; serialise communication and compute; and cluster multiple chips to emulate a bigger chip with more power.
- There is a poor performance return on single-processor complexity (Pollack's Rule: performance ~ √(transistor count)). To maximize throughput performance, use many simple processors.
- Parallel programs are hard to get right and hard to map to massively parallel machines. BSP is a simple abstraction, guaranteed free of concurrency hazards. Compile communication patterns, like you compile functions.

Pure distributed machine with compiled communication:
- Static partitioning of work and memory.
- Threads hide only local latencies (arithmetic, memory, branch).
- Deterministic communication over a stateless exchange.
[Diagram: P, R, and M units per tile, connected over a shared Exchange.]
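To make "compiled communication" concrete, here is a minimal sketch (my own toy scheduler with an assumed example graph and placement, not Graphcore's toolchain): the exchange schedule is derived entirely at compile time from the program graph, so at run time every tile simply replays a fixed, deterministic list of transfers each superstep.

```python
from collections import defaultdict

# Assumed toy program graph: vertex -> vertices it sends a tensor to,
# plus a static placement of vertices onto tiles. Both are hypothetical.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
placement = {"a": 0, "b": 1, "c": 2, "d": 3}
TENSOR_BYTES = 4 * 1024

def compile_exchange(edges, placement):
    """Build the static per-superstep transfer list; nothing is decided at run time."""
    schedule = []
    for src_vertex, dests in edges.items():
        for dst_vertex in dests:
            src, dst = placement[src_vertex], placement[dst_vertex]
            if src != dst:                       # same-tile edges stay in local memory
                schedule.append((src, dst, TENSOR_BYTES))
    return sorted(schedule)                      # fixed, deterministic order

schedule = compile_exchange(edges, placement)
bytes_out = defaultdict(int)
for src, dst, nbytes in schedule:
    bytes_out[src] += nbytes
print("exchange schedule:", schedule)
print("bytes sent per tile:", dict(bytes_out))
```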

Two different plans, with the same total power and similar active logic in both cases:
- GPU + DRAMs on interposer: 16 GB @ 900 GB/s.
- IPU pair with distributed SRAM: 600 MB @ 90 TB/s, zero latency.

Colossus IPU pair (300W PCIe card): 2432 processor tiles, >200 Tflop (16.32), ~600 MB of distributed on-chip memory. [Diagram: two IPUs on the card, each with an all-to-all exchange spine of ~8 TB/s, card-to-card links on each side, and host I/O over PCIe-4; link + host bandwidth 384 GB/s per chip.]

Bulk Synchronous Parallel (BSP): massively parallel computing with no concurrency hazards. [Trace: on chip 1 and chip 2, execution alternates compute phases and exchange phases separated by syncs; syncs extend across chips (inter-chip sync) and around host I/O, and one tile abstains from a sync.]
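A minimal sketch of the BSP discipline, assuming nothing about the IPU's actual APIs: workers alternate a compute phase and an exchange phase separated by barrier syncs, so no worker ever reads data a neighbour is still writing.

```python
import threading

N_WORKERS = 4
barrier = threading.Barrier(N_WORKERS)
inbox = [0.0] * N_WORKERS         # one slot per worker, written only by its neighbour

def worker(rank, supersteps=3):
    local = float(rank)
    for _ in range(supersteps):
        local = local * 2.0 + 1.0                 # compute phase: purely local work
        barrier.wait()                            # sync: everyone has finished computing
        inbox[(rank + 1) % N_WORKERS] = local     # exchange phase: deterministic writes
        barrier.wait()                            # sync: everyone has finished exchanging
        local += inbox[rank]                      # safe to read what was sent to us

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("inbox after final superstep:", inbox)
```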

Serializing compute and communication maximizes power-limited throughput. [Figure: power against time, at the design point and at the operating point, for concurrent vs serialized compute and exchange. Run concurrently, compute and exchange must together fit under the power ceiling, and the headroom shows up as unused (saved) power; run serialized, each phase alone uses the full power budget and the headroom shows up as saved time.]

BSP execution trace. [Plot: tiles (sorted) against time over a superstep, showing COMPUTE, EXCHANGE, WAITING FOR SYNC, and SYNC phases.] When some tiles are quiet, the clock can increase.

BSP trace: ResNet-50 training, batch=4. [Plot: COMPUTE, EXCHANGE, WAITING FOR SYNC, and SYNC phases for this workload.]

Colossus integrates enough memory to hold a large model on chip, or on a small cluster of chips. In GPU terms, all kernels are then fused. Access to the on-chip model at spectacular bandwidth and zero latency yields spectacular throughput for the many models which are memory-bottlenecked on GPUs. But what about training CNNs with batch norm using large batches?
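As a back-of-envelope check on the memory-bottleneck point, the sketch below divides the bandwidth figures quoted two slides earlier by the headline flop rates; the numbers are the slides' own, and bytes per peak flop is only a crude indicator of how much data a model can touch per operation before bandwidth rather than arithmetic limits throughput.

```python
# Rough bytes-per-peak-flop comparison using the numbers quoted on the
# earlier slides (900 GB/s off-chip vs 90 TB/s on-chip; 125 vs >200 Tflop/s).
plans = {
    "GPU + DRAMs on interposer":  {"bw_TBps": 0.9,  "peak_Tflops": 125.0},
    "IPU pair, distributed SRAM": {"bw_TBps": 90.0, "peak_Tflops": 200.0},
}
for name, p in plans.items():
    ratio = p["bw_TBps"] / p["peak_Tflops"]      # bytes available per peak flop
    print(f"{name:28s} ~{ratio:.3f} bytes/flop at peak")
```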

Small batches learn better models, more reliably. Preview of work by Dominic Masters and Carlo Luschi, coming to arXiv soon. We sweep batch size m and base learning rate η̃ = η/m, where the weight update for one batch is
θ(k+1) = θ(k) − η̃ · Σ_{i=1..m} ∇L_i(θ(k)) = θ(k) − (η/m) · Σ_{i=1..m} ∇L_i(θ(k)),
so all runs do the same amount of computation; larger batches just update the weights less often. This example is training ResNet-32 on CIFAR-100, with batch norm over the full batch. Many other examples appear in the upcoming paper.
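The sketch below is a toy illustration of that update rule (my own minimal regression example, not the authors' experiments): with the base learning rate η̃ = η/m held fixed, every run processes the same number of examples, and larger batches simply fold them into fewer, larger weight updates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)    # toy target: y ~ 3x + noise

def sgd(batch_size, eta_base=0.001, epochs=5):
    """Plain SGD with the update w <- w - eta_base * (sum of per-example grads)."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(x), batch_size):
            xb, yb = x[start:start + batch_size], y[start:start + batch_size]
            grad_sum = np.sum(2.0 * (w * xb - yb) * xb)
            w -= eta_base * grad_sum
    return w

for m in (4, 16, 64, 256):
    updates = len(range(0, len(x), m))
    print(f"batch={m:3d}  w={sgd(m):.3f}  updates/epoch={updates}")
```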

Small batches learn better models, more reliably. For this example, batch size ~8 gives the best models, with the least sensitivity to the chosen learning rate.

The warm-up strategy of Goyal et al. (arXiv:1706.02677) helps larger batches but doesn't level the field. The best batch sizes are 4-16, e.g. distribute over 8 chips at m=2 per chip.

SGD with small batches yields better models. Machines which are efficient for very small (sub-)batches allow parallel learning over more machines. We also need to embrace other types of memory-efficient algorithm: trading re-computation for memory, and reversible models.

Trading re-computation for memory: DenseNet-201 training, batch=16. Naive strategy: memorize activations only at the input of each residual block, and re-compute all the others in the backward pass. [Plot: MiB allocated against time (number of allocations, normalised), comparing TensorFlow's default schedule, a less greedy schedule, and the re-computation strategy; TF executing on CPU, recording total memory allocated for weights + activations; float16 weights and activations, single weight copy.] The re-computation strategy uses ~1/5 of the memory for ~1.25x the compute.
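As a hedged illustration of the same idea in a mainstream framework (a toy model of my own, not the DenseNet-201 profiling setup from the slide), torch.utils.checkpoint discards a block's internal activations after the forward pass and recomputes them during backward, trading roughly one extra forward pass through each block for a much smaller activation footprint.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy stack of blocks standing in for the residual/dense blocks discussed above.
blocks = nn.ModuleList(
    nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)
)

def forward(x, recompute=True):
    for block in blocks:
        # With recompute=True only the block inputs are kept alive; the
        # intermediates inside each block are recomputed during backward.
        x = checkpoint(block, x, use_reentrant=False) if recompute else block(x)
    return x

x = torch.randn(16, 512, requires_grad=True)
loss = forward(x, recompute=True).sum()
loss.backward()        # re-runs each block's forward as part of backprop
```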

Re-cap:
- Throughput processors will be increasingly power limited, so they will become mostly memory. We can now integrate enough SRAM to hold large models entirely on chip.
- IPUs don't need large batches for efficiency. Small batches deliver better models, free most of the memory for parameters, and allow parallel learning over more machines.
- Efficient massively parallel processors need to compile communication patterns as well as compute functions, so program graphs must be pseudo-static.
- BSP is a simple, concurrency-safe, and power-efficient paradigm for massively parallel MI computing.

Thank you. simon@graphcore.ai