Bridging Analog Neuromorphic and Digital von Neumann Computing


Bridging Analog Neuromorphic and Digital von Neumann Computing. Amir Yazdanbakhsh, Bradley Thwaites. Advisors: Hadi Esmaeilzadeh and Doug Burger. Qualcomm Mentors: Manu Rastogi and Girish Varatkar. Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology. Qualcomm Innovation Fellowship, 2015.

Energy is a primary constraint: Data Center, Mobile, Internet of Things.

Data growth vs. performance. Data growth trends: IDC's Digital Universe Study, December 2012. Performance growth trends: Esmaeilzadeh et al., Dark Silicon and the End of Multicore Scaling, ISCA 2011.

Approximate computing: embracing error. Relax the abstraction of near-perfect accuracy in processing, storage, and communication. Allowing errors to happen improves performance, resource utilization, and efficiency.
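As a concrete illustration of that trade, the classic loop-perforation transformation skips iterations, doing proportionally less work for a bounded output error. A minimal stand-alone sketch (not from the talk; the function names are illustrative):

```python
# Loop perforation: process only every `skip`-th element, trading a
# small, bounded output error for a proportional reduction in work.
def exact_mean(xs):
    return sum(xs) / len(xs)

def perforated_mean(xs, skip=4):
    sampled = xs[::skip]  # only 1/skip of the iterations actually run
    return sum(sampled) / len(sampled)

data = [float(i % 100) for i in range(10_000)]
exact = exact_mean(data)        # 49.5
approx = perforated_mean(data)  # 48.0 -- about 3% error for 4x less work
```

Here the result is off by about 3% while the summation does a quarter of the work, which is the kind of quality-for-efficiency exchange the rest of the talk systematizes.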

Avoiding overkill design. Approximate computing cuts across the whole system stack: Application, Programming Language, Compiler, Architecture, Microarchitecture, Circuit, Physical Device, trading cost against precision and reliability at each layer.

Adding a third dimension: embracing error. (Figure: processor Pareto frontier, energy vs. performance, spanning Data center, Desktop, Mobile, and IoT.)

Navigating a three-dimensional space. (Figure: the same energy vs. performance Pareto frontier, now navigated with error as a third axis.)

Finding the Pareto surface. (Figure: energy vs. performance across IoT, Mobile, Desktop, and Data center, annotated with our projects: Truffle [ASPLOS 12], FLEXJAVA [FSE 15], RFVP [PACT 14, IEEE D&T 15], Axilog [DATE 15, IEEE Micro 15], D-NPUs [MICRO 12], A-NPUs [ISCA 14], SNNAP [HPCA 15], GNPU [MICRO 15], MITHRA [TechCon 15]. Overall: 13.5, 11.1, 10%.)

Accelerating GPU Accelerators: Bridging Neuromorphic and von Neumann Computing, Unleashing the Beast. Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors, MICRO 2015.

Neural Transformation: Analog Neural Network.

Analog NPU Integration. (Figure: the CPU sends inputs x_0 ... x_n through DACs that produce currents I(x_0) ... I(x_n); each current is scaled by a weight resistance R(w_i), the products are summed, and the output is y = sigmoid(sum of I(x_i)*R(w_i)), digitized by an ADC. On the GPU, each streaming multiprocessor (SM) is paired with an A-NPU.) General-Purpose Code Acceleration with Limited-Precision Analog Computation, ISCA 2014; Neural Acceleration for GPU Throughput Processors, MICRO 2015.

(Figure: the analog neuron circuit. Current-steering DACs convert the inputs x_0 ... x_n (with sign bits s_x) to currents I(x_i); resistor ladders encode the weights w_0 ... w_n (with sign bits s_w) as resistances R(w_i); differential pairs produce I+(w_i x_i) and I-(w_i x_i), whose summed differential voltage V(sum of w_i x_i) drives a differential amplifier whose saturation implements y = sigmoid(sum of w_i x_i); a flash ADC converts y back to digital with sign bit s_y.)

Analog Compilation Workflow. Challenges: limited bit-width, topology restriction, circuit non-idealities. The programmer annotates CUDA code:

uchar4 p = tex2D(img, x, y);
#pragma begin_approx
a = min(r, min(g, b));
b = max(r, max(g, b));
z = ((a + b) > 254) ? 255 : 0;
#pragma end_approx
dst[img.width * y + x] = z;

The compiler, with a customized training algorithm, profiles the application and generates code in which the annotated region is replaced by accelerator invocations:

uchar4 p = tex2D(img, x, y);
send.n_data %r0;
send.n_data %r1;
send.n_data %r2;
recv.n_data %r4;
dst[img.width * y + x] = z;

together with an accelerator configuration (e.g., w0 = 0.03, ..., w8 = 0.10) loaded into the per-SM A-NPUs. Phases: Programming, Compilation (Profiling, Training, Code Generation), Execution.
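The three compilation phases can be sketched in miniature. This is a simplified stand-in, not the actual compiler: the real flow trains a multilayer perceptron against the annotated CUDA region, while here a single linear neuron and a toy target function (both hypothetical) keep the sketch self-contained:

```python
import random

# Stand-in for the code between #pragma begin_approx / end_approx.
# (Hypothetical target; the real regions are CUDA kernel fragments.)
def target_region(a, b):
    return 0.3 * a + 0.7 * b

# Phase 1 -- Profiling: run the exact region to collect input/output pairs.
random.seed(0)
dataset = [((a, b), target_region(a, b))
           for a, b in ((random.random(), random.random()) for _ in range(200))]

# Phase 2 -- Training: fit the neuron by stochastic gradient descent on MSE.
w0, w1, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    for (a, b), y in dataset:
        err = (w0 * a + w1 * b) - y
        w0 -= lr * err * a
        w1 -= lr * err * b

# Phase 3 -- Code generation: the region is replaced by send/recv
# instructions that invoke the trained network (here, a function call).
def accelerated_region(a, b):
    return w0 * a + w1 * b
```

After training, the learned weights (the w_i values) play the role of the accelerator configuration, and accelerated_region closely tracks target_region on new inputs.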

Benchmarks (domain, kernel, size in PTX instructions, neural network topology, quality loss):
Image Processing: binarization, 27 PTX instructions, 3-8-4-1, error 11.43%.
Finance: blackscholes, 96 PTX instructions, 6-8-4-1, error 8.23%.
Machine Learning: convolution, 886 PTX instructions, 17-4-4-1, error 9.29%.
Robotics: inversek2j, 132 PTX instructions, 2-16-4-3, error 10.25%.
3D Gaming: jmeint, 2,250 PTX instructions, 18-16-4-1, error 19.70%.
Image Processing: laplacian, 51 PTX instructions, 9-4-2-1, error 9.87%.
Machine Vision: meanfilter, 35 PTX instructions, 7-8-2-1, error 9.21%.
Numerical Analysis: newton-raph, 44 PTX instructions, 5-4-2-1, error 11.23%.
Image Processing: sobel, 86 PTX instructions, 9-8-4-1, error 8.03%.
Medical Imaging: srad, 110 PTX instructions, 5-8-2-1, error 9.87%.

Analog Neuromorphic versus Conventional Computing

Kirchhoff's current law: I_out = I_0 + I_1 + I_2; currents meeting at a node add, so summation comes for free. Ohm's law: V_o = I(x_n) * R(w_n); driving a current through a resistance multiplies. The saturation property of transistors supplies the nonlinearity.
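Those two laws are exactly a multiply-accumulate: Ohm's law multiplies, Kirchhoff's law adds. A small digital model of that pipeline, with a hypothetical 8-bit DAC step and a sigmoid standing in for the saturating differential amplifier (the function names and bit-width are illustrative, not the actual circuit parameters):

```python
import math

def dac(x, bits=8):
    """Quantize a digital input in [0, 1] to one of 2^bits current levels."""
    levels = (1 << bits) - 1
    return round(x * levels) / levels

def analog_dot(xs, ws, bits=8):
    # Ohm's law: each branch "current" is I(x_i) * R(w_i).
    branches = [dac(x, bits) * w for x, w in zip(xs, ws)]
    # Kirchhoff's current law: currents meeting at a node simply add.
    return sum(branches)

def analog_neuron(xs, ws):
    # Transistor saturation bounds the output, approximating a sigmoid.
    return 1.0 / (1.0 + math.exp(-analog_dot(xs, ws)))
```

The quantization in dac models the limited bit-width the compiler must train around; note that the dot product itself costs no explicit adder or multiplier in the analog domain.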

Results: 2.6x speedup, 3.1x energy reduction, 8.1x energy-delay reduction, with 10% quality reduction. Publications: [1] Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors, MICRO 2015. [2] Renée St. Amant et al., General-Purpose Code Acceleration with Limited-Precision Analog Computation, ISCA 2014.

Across the stack (Application, Programming Language, Compiler, Architecture, Microarchitecture, Circuit, Physical Device), our work spans software, architecture, memory, and hardware design. Software: FLEXJAVA: Language Support for Safe and Modular Approximate Programming [FSE 2015]; ExpAX: A Framework for Automating Approximate Programming [Tech Report 2014]. Architecture: Neural Acceleration for GPU Throughput Processors [MICRO 2015]; MITHRA: Controlling Quality Tradeoffs in Approximate Acceleration [TechCon 2015]; General-Purpose Code Acceleration with Limited-Precision Analog Computation [ISCA 2014]. Memory: Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction [IEEE Design and Test 2015]; Rollback-Free Value Prediction with Approximate Loads [PACT 2014]. Hardware Design: Axilog: Abstractions for Approximate Hardware Design and Reuse [IEEE Micro 2015]; Axilog: Language Support for Approximate Hardware Design [DATE 2015].

Rollback-Free Value Prediction. (Figure: a streaming multiprocessor (SM) with front end, pipelines, load/store unit pipeline, L1 cache, and writeback, connected through the interconnection network to a memory partition holding the L2 cache and off-chip DRAM.)


Rollback-Free Value Prediction. (Figure: the same SM organization, but the interconnection network is full; off-chip memory bandwidth is the bottleneck.)

Rollback-Free Value Prediction. (Figure: the same SM organization with an RFVP predictor attached to the load/store unit pipeline.) The RFVP predictor quickly predicts values for approximate load misses; the RFVP technique mitigates the memory bandwidth bottleneck.
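The predictor itself can be very simple. A minimal sketch, assuming a last-value-plus-stride scheme (the paper's actual predictor design differs; the class and method names here are hypothetical):

```python
# Rollback-free value prediction: when a safe-to-approximate load misses,
# supply a predicted value immediately and keep executing. Because the
# load is marked approximate, a misprediction needs no rollback machinery.
class StridePredictor:
    def __init__(self):
        self.last = 0
        self.stride = 0

    def train(self, actual):
        # Update on values that do arrive from memory (hits and fills).
        self.stride = actual - self.last
        self.last = actual

    def predict(self):
        # Guess: last observed value plus the last observed stride.
        return self.last + self.stride

pred = StridePredictor()
for value in [10, 20, 30]:   # data returned by earlier, serviced loads
    pred.train(value)
guess = pred.predict()       # 40, used in place of the missed load's data
```

Crucially, a predicted miss can also be dropped rather than serviced, which is what reduces off-chip bandwidth consumption rather than merely hiding latency.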

Results: 1.4x speedup, 1.3x energy reduction, 1.5x bandwidth consumption reduction, with 10% quality reduction. Publications: [1] Amir Yazdanbakhsh et al., Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction, IEEE Design and Test, 2015. [2] Amir Yazdanbakhsh et al., RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads, ACM Transactions on Architecture and Code Optimization (TACO) [submitted]. [3] Bradley Thwaites et al., Rollback-Free Value Prediction with Approximate Loads, International Conference on Parallel Architectures and Compilation Techniques (PACT), 2014.

module fir (clk, rst, x, y);
input clk, rst;
input [15:0] x;
output [31:0] y;
multiplier m1 (b1, d1, w1);
multiplier m2 (b2, d2, w2);
adder a1 (w0, w1, w4);
adder a2 (w2, w4, w5);
register r1 (clk, rst, x, d0);
register r2 (clk, rst, d0, d1);
endmodule
(Figure: the FIR datapath: delay registers d0-d3 feed coefficients b0-b3 into multipliers m0-m3, whose outputs w0-w3 flow through adder tree a1-a3, via w4 and w5, to the output y.)

module fir (clk, rst, x, y);
input clk, rst;
input [15:0] x;
output [31:0] y;
multiplier m1 (b1, d1, w1);
multiplier m2 (b2, d2, w2);
adder a1 (w0, w1, w4);
adder a2 (w2, w4, w5);
register r1 (clk, rst, x, d0);
register r2 (clk, rst, d0, d1);
relax(y);
endmodule
(Figure: the same FIR datapath, with relax(y) annotated on the output y.)


module fir (clk, rst, x, y);
input clk, rst;
input [15:0] x;
output [31:0] y;
multiplier m1 (b1, d1, w1);
multiplier m2 (b2, d2, w2);
adder a1 (w0, w1, w4);
adder a2 (w2, w4, w5);
register r1 (clk, rst, x, d0);
register r2 (clk, rst, d0, d1);
relax(y);
restrict(w1);
restrict(w2);
endmodule
(Figure: the same FIR datapath, with relax(y) on the output and restrict annotations shown on w1, w2, and w3.)

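One way to read the two annotations: relax(y) lets approximation flow backwards through y's fan-in cone, and restrict(w) stops it at wires that must stay precise. A minimal sketch of that propagation over the FIR netlist above (the traversal and the wire map are illustrative, not Axilog's actual analysis):

```python
def approximable(drivers, relaxed, restricted):
    """Walk backwards from each relax()'d wire through its drivers,
    marking wires safe to approximate, stopping at restrict()'d wires."""
    safe, stack = set(), list(relaxed)
    while stack:
        wire = stack.pop()
        if wire in restricted or wire in safe:
            continue
        safe.add(wire)
        stack.extend(drivers.get(wire, []))
    return safe

# Simplified fan-in of the FIR filter's output y.
drivers = {
    "y":  ["w5"],
    "w5": ["w2", "w4"],   # adder a2
    "w4": ["w0", "w1"],   # adder a1
    "w1": ["b1", "d1"],   # multiplier m1
    "w2": ["b2", "d2"],   # multiplier m2
}
safe = approximable(drivers, relaxed={"y"}, restricted={"w1", "w2"})
# safe == {"y", "w5", "w4", "w0"}: the restricted wires stay precise.
```

With only a handful of such annotations, the synthesis tool knows which parts of the netlist it may implement with approximate (cheaper) logic.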

Results: 1.6x energy reduction, 1.3x area reduction, with only 2-12 code annotations and 10% quality reduction. Publications: [1] Divya Mahajan et al., Axilog: Abstractions for Approximate Hardware Design and Reuse, IEEE Micro, 2015. [2] Amir Yazdanbakhsh et al., Axilog: Language Support for Approximate Hardware Design, Design, Automation and Test in Europe (DATE), 2015.

Finding the Pareto surface. (Figure: energy vs. performance across IoT, Mobile, Desktop, and Data center, annotated with our projects: Truffle [ASPLOS 12], FLEXJAVA [FSE 15], RFVP [PACT 14, IEEE D&T 15], Axilog [DATE 15, IEEE Micro 15], D-NPUs [MICRO 12], A-NPUs [ISCA 14], SNNAP [HPCA 15], GNPU [MICRO 15], MITHRA [TechCon 15]. Overall: 13.5, 11.1, 10%.)