Bridging Analog Neuromorphic and Digital von Neumann Computing
Amir Yazdanbakhsh, Bradley Thwaites
Advisors: Hadi Esmaeilzadeh and Doug Burger
Qualcomm Mentors: Manu Rastogi and Girish Varatkar
Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology
Qualcomm Innovation Fellowship, 2015
Energy is a primary constraint — in the data center, on mobile devices, and across the Internet of Things.
Data growth vs. performance. Data growth trends: IDC's Digital Universe Study, December 2012. Performance growth trends: Esmaeilzadeh et al., Dark Silicon and the End of Multicore Scaling, ISCA 2011.
Approximate computing: embracing error. Relax the abstraction of near-perfect accuracy in processing, storage, and communication. Allowing errors to happen improves performance, resource utilization, and efficiency.
Avoiding overkill design. Approximate computing cuts across the entire system stack — application, programming language, compiler, architecture, microarchitecture, circuit, and physical device — trading precision and reliability against cost at every layer.
Adding a third dimension: embracing error. [Figure: processor Pareto frontier in the energy-performance plane, with data center, desktop, mobile, and IoT design points.]
Navigating a three-dimensional space. [Figure: the same Pareto frontier, now explored along a third, error axis.]
Finding the Pareto surface. [Figure: the energy-performance-error space with data center, desktop, mobile, and IoT design points; the projects below populate the surface.] Truffle [ASPLOS 12], FlexJava [FSE 15], RFVP [PACT 14, IEEE D&T 15], Axilog [DATE 15, IEEE Micro 15], D-NPUs [MICRO 12], A-NPUs [ISCA 14], SNNAP [HPCA 15], GNPU [MICRO 15], MITHRA [TechCon 15]. Highlighted point: (13.5, 11.1, 10%).
Accelerating GPU accelerators: bridging neuromorphic and von Neumann computing — unleashing the beast. Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors, MICRO 2015.
Neural transformation. [Figure: an approximable code region is replaced by an analog neural network.]
Analog NPU integration. [Figure: the CPU sends inputs x_0 … x_n to the A-NPU; DACs convert each x_i to a current I(x_i), resistances encode the weights R(w_i), the analog core computes y = sigmoid(Σ I(x_i)·R(w_i)), and an ADC returns the digital result. On the GPU, every streaming multiprocessor (SM) is augmented with an A-NPU.] General-Purpose Code Acceleration with Limited-Precision Analog Computation, ISCA 2014. Neural Acceleration for GPU Throughput Processors, MICRO 2015.
[Figure: analog neuron circuit — current-steering DACs generate I(x_i) from sign/magnitude inputs, resistor ladders set R(w_i), differential pairs form the products w_i·x_i as voltages V±(w_i·x_i), a differential amplifier sums them to V ∝ Σ w_i·x_i and applies the sigmoid, and a flash ADC digitizes y.]
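The datapath above reduces to a sigmoid over a weighted sum, with every conversion stage losing precision. A minimal Python sketch of that behavior, assuming ideal uniform quantization at the DAC, weight-ladder, and ADC stages (the bit-widths and the quantization model are illustrative, not the measured circuit):

```python
import math

def quantize(value, bits, max_abs):
    """Clamp and uniformly quantize, as an ideal signed DAC/ADC of the
    given bit-width would (a simplification of the real converters)."""
    levels = 2 ** (bits - 1) - 1
    clamped = max(-max_abs, min(max_abs, value))
    return round(clamped / max_abs * levels) / levels * max_abs

def analog_neuron(xs, ws, in_bits=8, w_bits=8, out_bits=8):
    # Current-steering DACs limit input precision ...
    xs_q = [quantize(x, in_bits, 1.0) for x in xs]
    # ... and resistor ladders limit weight precision.
    ws_q = [quantize(w, w_bits, 1.0) for w in ws]
    # Differential pairs form the products; summation is free in analog.
    total = sum(x * w for x, w in zip(xs_q, ws_q))
    # Transistor saturation approximates the sigmoid nonlinearity.
    y = 1.0 / (1.0 + math.exp(-total))
    # The flash ADC quantizes the result on the way back to digital.
    return quantize(y, out_bits, 1.0)
```

Sweeping in_bits and w_bits down from 8 shows how limited analog precision turns into output error — one reason the compiler's training loop has to account for the circuit.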
Analog compilation workflow. Challenges: limited bit-width, topology restrictions, and circuit non-idealities.
Programming: the programmer marks the approximable region in the CUDA source:

    uchar4 p = tex2D(img, x, y);
    #pragma(begin_approx)
    a = min(r, min(g, b));
    b = max(r, max(g, b));
    z = ((a + b) > 254) ? 255 : 0;
    #pragma(end_approx)
    dst[img.width * y + x] = z;

Compilation (profiling, training, code generation): the compiler, using a customized training algorithm, replaces the region with neural invocation instructions

    uchar4 p = tex2D(img, x, y);
    send.n_data %r0;
    send.n_data %r1;
    send.n_data %r2;
    recv.n_data %r4;
    dst[img.width * y + x] = z;

and emits the accelerator configuration (w0 = 0.03, …, w8 = 0.10).
Execution: each SM invokes its A-NPU at runtime.
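The profiling and training phases can be sketched in Python. This stand-in mirrors the slide's annotated region but trains only a single sigmoid neuron on two profiled intermediates; the real compiler trains a multilayer network sized to the A-NPU's topology limits, so everything here is illustrative:

```python
import math
import random

# Stand-in for the annotated region between the approx pragmas.
def approx_region(r, g, b):
    a = min(r, min(g, b))
    c = max(r, max(g, b))
    return 1 if (a + c) > 254 else 0   # 255 mapped to 1 for training

# Profiling: run the exact region on representative inputs and log
# input/output pairs (here the min/max intermediates, scaled to [0, 1]).
random.seed(0)
pairs = []
for _ in range(400):
    r, g, b = (random.randrange(256) for _ in range(3))
    a, c = min(r, g, b), max(r, g, b)
    pairs.append(((a / 255.0, c / 255.0), approx_region(r, g, b)))

# Training: fit a single sigmoid neuron with stochastic gradient descent.
w, bias = [0.0, 0.0], 0.0
for _ in range(300):
    for x, y in pairs:
        z = w[0] * x[0] + w[1] * x[1] + bias
        p = 1.0 / (1.0 + math.exp(-z))
        w[0] -= 0.5 * (p - y) * x[0]
        w[1] -= 0.5 * (p - y) * x[1]
        bias -= 0.5 * (p - y)

# Code generation would then emit the send/recv instructions and ship
# the learned weights as the accelerator configuration.
hits = sum(((w[0] * x[0] + w[1] * x[1] + bias) > 0) == (y == 1)
           for x, y in pairs)
accuracy = hits / len(pairs)
```

The profiled pairs act as the training set, so the neural network mimics the region only on input distributions seen during profiling.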
Benchmarks (domain, size in PTX instructions, neural network topology, application-level error):

    binarization   Image Processing       27 PTX    3->8->4->1    Error: 11.43%
    blackscholes   Finance                96 PTX    6->8->4->1    Error:  8.23%
    convolution    Machine Learning      886 PTX    17->4->4->1   Error:  9.29%
    inversek2j     Robotics              132 PTX    2->16->4->3   Error: 10.25%
    jmeint         3D Gaming           2,250 PTX    18->16->4->1  Error: 19.70%
    laplacian      Image Processing       51 PTX    9->4->2->1    Error:  9.87%
    meanfilter     Machine Vision         35 PTX    7->8->2->1    Error:  9.21%
    newton-raph    Numerical Analysis     44 PTX    5->4->2->1    Error: 11.23%
    sobel          Image Processing       86 PTX    9->8->4->1    Error:  8.03%
    srad           Medical Imaging       110 PTX    5->8->2->1    Error:  9.87%
Analog Neuromorphic versus Conventional Computing
The analog substrate computes with physics: Kirchhoff's current law sums currents for free (I_out = I_0 + I_1 + I_2), Ohm's law multiplies (V_o = I(x_n)·R(w_n)), and the saturation property of transistors supplies the sigmoid nonlinearity.
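A numeric sanity check of how these two laws give analog multiplication and addition for free (the component values are arbitrary illustrations):

```python
currents = [1e-3, 2e-3, 3e-3]       # input currents I(x_i), in amperes
resistances = [100.0, 200.0, 50.0]  # weight resistances R(w_i), in ohms

# Ohm's law: each weighted product appears as a voltage V_i = I_i * R_i.
voltages = [i * r for i, r in zip(currents, resistances)]

# Kirchhoff's current law: currents into a node simply add,
# so summation costs nothing: I_out = I_0 + I_1 + I_2.
i_out = sum(currents)

# Together the two laws compute a dot product in a wire junction.
dot = sum(voltages)   # = sum_i I(x_i) * R(w_i)
```

No adders or multipliers are instantiated; the arithmetic is a property of the circuit itself, which is where the energy savings come from.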
Results: 2.6× speedup, 3.1× energy reduction, and 8.1× energy-delay reduction, for a 10% quality reduction.
Publications:
[1] Amir Yazdanbakhsh et al., Neural Acceleration for GPU Throughput Processors, MICRO 2015.
[2] Renée St. Amant et al., General-Purpose Code Acceleration with Limited-Precision Analog Computation, ISCA 2014.
Contributions across the stack (application, programming language, compiler, architecture, microarchitecture, circuit, physical device):
Software: FlexJava: Language Support for Safe and Modular Approximate Programming [FSE 2015]; ExpAX: A Framework for Automating Approximate Programming [Tech Report 2014].
Architecture: Neural Acceleration for GPU Throughput Processors [MICRO 2015]; MITHRA: Controlling Quality Tradeoffs in Approximate Acceleration [TechCon 2015]; General-Purpose Code Acceleration with Limited-Precision Analog Computation [ISCA 2014].
Memory: Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction [IEEE Design and Test 2015]; Rollback-Free Value Prediction with Approximate Loads [PACT 2014].
Hardware Design: Axilog: Abstractions for Approximate Hardware Design and Reuse [IEEE Micro 2015]; Axilog: Language Support for Approximate Hardware Design [DATE 2015].
Rollback-Free Value Prediction. [Figure: a GPU streaming multiprocessor (SM) — front-end pipelines, load/store unit pipeline, L1 cache, and writeback — connected through the interconnection network to memory partitions holding the L2 cache and off-chip DRAM.]
Rollback-Free Value Prediction. [Figure: the same SM diagram with the interconnection network saturated — off-chip memory bandwidth is the bottleneck.]
Rollback-Free Value Prediction. [Figure: an RFVP predictor added beside the load/store unit of each SM.] The RFVP predictor quickly predicts values for approximate load misses, mitigating the memory bandwidth bottleneck without ever rolling back.
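One plausible predictor organization is a two-delta stride table: return the last value plus the stride only when the last two observed strides agree. This sketch assumes a direct-mapped table indexed by load PC; the table size and indexing are illustrative, not the paper's exact microarchitecture:

```python
class TwoDeltaPredictor:
    """Minimal two-delta stride value predictor for approximate loads."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = [{"last": 0, "stride1": 0, "stride2": 0}
                      for _ in range(entries)]

    def _entry(self, pc):
        return self.table[pc % self.entries]

    def predict(self, pc):
        # On an approximate load miss, return a predicted value instead
        # of stalling; since the load is safe to approximate, no
        # rollback is ever issued if the prediction is wrong.
        e = self._entry(pc)
        stride = e["stride1"] if e["stride1"] == e["stride2"] else 0
        return e["last"] + stride

    def train(self, pc, actual):
        # Update with the actual value once the memory response arrives.
        e = self._entry(pc)
        e["stride2"] = e["stride1"]
        e["stride1"] = actual - e["last"]
        e["last"] = actual
```

Requiring two matching strides before predicting a non-zero delta keeps the predictor from chasing noise in irregular access streams.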
Results: 1.4× speedup, 1.3× energy reduction, and 1.5× reduction in bandwidth consumption, for a 10% quality reduction.
Publications:
[1] Amir Yazdanbakhsh et al., Mitigating the Bandwidth Bottleneck with Approximate Load Value Prediction, IEEE Design and Test 2015.
[2] Amir Yazdanbakhsh et al., RFVP: Rollback-Free Value Prediction with Safe-to-Approximate Loads, ACM Transactions on Architecture and Code Optimization (TACO) [submitted].
[3] Bradley Thwaites et al., Rollback-Free Value Prediction with Approximate Loads, PACT 2014.
Axilog example: a 4-tap FIR filter. [Figure: datapath with registers d0-d3, coefficients b0-b3, multipliers m0-m3, and adders a1-a3 producing y.]

    module fir (clk, rst, x, y);
      input clk, rst;
      input [15:0] x;
      output [31:0] y;
      ...
      multiplier m1 (b1, d1, w1);
      multiplier m2 (b2, d2, w2);
      adder a1 (w0, w1, w4);
      adder a2 (w2, w4, w5);
      register r1 (clk, rst, x, d0);
      register r2 (clk, rst, d0, d1);
    endmodule
Annotating the output: relax(y) declares that wire y may be approximated.

    module fir (clk, rst, x, y);
      ...
      register r2 (clk, rst, d0, d1);
      relax(y);
    endmodule
Annotating critical wires: restrict(w1) and restrict(w2) keep those wires exact, blocking the approximation that relax(y) would otherwise propagate back to them.

    module fir (clk, rst, x, y);
      ...
      register r2 (clk, rst, d0, d1);
      relax(y);
      restrict(w1);
      restrict(w2);
    endmodule
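The relax/restrict pair implies a backward safety analysis over the netlist: approximability flows from a relaxed wire through its fan-in until a restricted wire stops it. A small Python sketch over the FIR wiring, with assumed (not Axilog-specified) propagation semantics and an assumed third adder combining w3 and w5 into y:

```python
# Fan-in of each wire in the FIR datapath (wire -> its driver wires).
fanin = {
    "y":  ["w3", "w5"],
    "w5": ["w2", "w4"],
    "w4": ["w0", "w1"],
    "w0": ["b0", "d0"],
    "w1": ["b1", "d1"],
    "w2": ["b2", "d2"],
    "w3": ["b3", "d3"],
}

def approximable(fanin, relaxed, restricted):
    """Backward-propagate approximability from relaxed outputs; a
    restricted wire (and anything reachable only through it) stays
    precise."""
    safe, stack = set(), list(relaxed)
    while stack:
        wire = stack.pop()
        if wire in restricted or wire in safe:
            continue
        safe.add(wire)
        stack.extend(fanin.get(wire, []))
    return safe

# relax(y) with restrict(w1) and restrict(w2), as in the annotated module:
safe = approximable(fanin, {"y"}, {"w1", "w2"})
```

With a handful of annotations the analysis labels the whole datapath, which is why 2-12 annotations suffice where per-wire labeling would take hundreds.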
Results: 1.6× energy reduction and 1.3× area reduction with only 2-12 code annotations, for a 10% quality reduction.
Publications:
[1] Divya Mahajan et al., Axilog: Abstractions for Approximate Hardware Design and Reuse, IEEE Micro 2015.
[2] Amir Yazdanbakhsh et al., Axilog: Language Support for Approximate Hardware Design, DATE 2015.
Finding the Pareto surface (recap). Truffle [ASPLOS 12], FlexJava [FSE 15], RFVP [PACT 14, IEEE D&T 15], Axilog [DATE 15, IEEE Micro 15], D-NPUs [MICRO 12], A-NPUs [ISCA 14], SNNAP [HPCA 15], GNPU [MICRO 15], MITHRA [TechCon 15]. Highlighted point: (13.5, 11.1, 10%).