
Neural Network based Energy-Efficient Fault Tolerant Architectures and Accelerators. University of Rochester, February 7, 2013.

References
- Flexible Error Protection for Energy Efficient Reliable Architectures. T. Miller, N. Surapaneni, R. Teodorescu, Ohio State University, SBAC-PAD '10.
- BenchNN: On the Broad Potential Application Scope of Hardware Neural Network Accelerators. T. Chen et al., University of Wisconsin, IISWC '12.
- A Defect-Tolerant Accelerator for Emerging High-Performance Applications. Olivier Temam, INRIA France, ISCA '12.
- Neural Acceleration for General-Purpose Approximate Programs. H. Esmaeilzadeh et al., University of Washington & Microsoft, MICRO '12.

Introduction and Motivation
- Technology scaling has a detrimental effect on reliability.
- Dark silicon jeopardizes many-core designs and massive on-chip parallelism.
- One way to tackle the dark silicon and energy problem is specialization through heterogeneous multi-cores.
- In conventional architectures, a single transistor breakdown can potentially prove fatal.
- Artificial neural network (ANN) based systems are inherently more tolerant to defects and noise, and more energy efficient than conventional architectures.
- Interest in ANNs waned after they were outperformed by SVMs; with the emergence of RMS (Recognition, Mining, and Synthesis) workloads, ANNs are being revisited.

Neural Network based Architectures and Solutions
1. A multi-core architecture that achieves energy efficiency for a user-specified reliability (FIT) target by controlling replication and supply voltages with a hill-climbing algorithm.
2. A set of neural-network-based computational kernels that are alternatives to several PARSEC benchmarks (e.g., blackscholes) and achieve on-par or better performance.
3. A neural-network-based multi-purpose hardware accelerator that tolerates multiple defects, implements the computational kernels of emerging RMS workloads, and, like custom circuits, achieves roughly two orders of magnitude better energy efficiency.
4. A neural-network-based program transformation technique that targets approximable code regions in general-purpose programs and offloads them to a neural processing unit (NPU).

Machine Learning based Adaptive Multicore Architecture
- Presents a reliable, energy-efficient, and adaptive multicore architecture.
- Each core consists of a pair of pipelines that can run independently (executing separate threads) or in concert (executing the same thread and verifying results).
- The idea is to adapt to the characteristics of individual cores and applications to provide acceptable reliability at minimum energy.
- Online control based on hill climbing dynamically adjusts multiple parameters to minimize energy consumption.
- Dynamic adaptation of voltage and redundancy can reduce the energy-delay product of a CMP by 30-60% compared to static dual modular redundancy (DMR).

Architecture and Error Detection
- Shadow register replication mode: only timing errors can be detected; results are restored from the delayed shadow registers.
- Shadow pipeline replication mode: both timing and soft errors can be detected. Re-execution fixes soft errors; for timing errors, instructions are marked and re-executed, and if the error recurs the result is restored from the shadow registers.

Support for Timing Speculation
If an FU is not fully replicated, selectively enable the pipeline registers that have a delayed clock, much like RAZOR.

Neural Networks for Power and Error Prediction
- Primary Power ANN: predicts the power of the primary pipelines from voltage, utilization, and temperature.
- Shadow Power ANN: predicts the power of the shadow pipelines from voltage, utilization, replication, and temperature.
- Error Probability ANN: predicts the raw probability of an error on each cycle from voltage and temperature.
- The ANNs are trained online by comparing predictions against measurements and adjusting the weights (a minimal sketch follows).
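To make the online-training idea concrete, here is a minimal sketch of a tiny predictor trained one sample at a time. The feature set, network size, learning rate, and class name are all illustrative assumptions, not the paper's design.

```python
import numpy as np

class OnlinePowerANN:
    """Tiny one-hidden-layer network trained online with SGD.
    Illustrative only: features and sizes are assumptions."""

    def __init__(self, n_in=3, n_hidden=4, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, x):
        self.h = np.tanh(self.W1 @ x + self.b1)   # hidden activations
        return self.W2 @ self.h + self.b2          # predicted power

    def update(self, x, measured):
        # One backpropagation step against the measured power sample.
        err = self.predict(x) - measured
        dh = err * self.W2 * (1 - self.h ** 2)     # tanh derivative
        self.W1 -= self.lr * np.outer(dh, x)
        self.b1 -= self.lr * dh
        self.W2 -= self.lr * err * self.h
        self.b2 -= self.lr * err

# Each control interval: predict, then correct with the sensor reading.
ann = OnlinePowerANN()
x = np.array([0.9, 0.7, 0.55])   # normalized (voltage, utilization, temperature)
print(ann.predict(x))
ann.update(x, measured=1.8)      # hypothetical measured power in watts
```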

Hill-climbing Search for Optimal Voltage
- Energy is optimized for a given FIT target at regular intervals.
- Start with the maximum voltage for every FU, then lower voltages one step at a time, checking for errors and computing the energy-delay (ED) product.
- Voltages are lowered until the minimum ED is found (a sketch of this loop follows).
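The following is a minimal sketch of such a greedy search, under the assumptions stated on the slide. The function names, the `measure_ed`/`meets_fit` callbacks, and the toy models in the usage lines are all hypothetical placeholders for the paper's runtime measurements.

```python
def hill_climb_voltages(fus, v_max, v_min, v_step, measure_ed, meets_fit):
    """Greedy descent over per-FU supply voltages: start at maximum
    voltage and lower one FU at a time while the FIT target still holds
    and the energy-delay (ED) product keeps improving."""
    voltages = {fu: v_max for fu in fus}
    best_ed = measure_ed(voltages)
    improved = True
    while improved:
        improved = False
        for fu in fus:
            new_v = round(voltages[fu] - v_step, 3)
            if new_v < v_min:
                continue
            trial = dict(voltages)
            trial[fu] = new_v
            # Accept only if reliability still holds and ED went down.
            if meets_fit(trial) and measure_ed(trial) < best_ed:
                voltages, best_ed = trial, measure_ed(trial)
                improved = True
    return voltages, best_ed

fus = ["alu", "fpu", "lsu"]
ed_model = lambda vs: sum(v * v for v in vs.values())      # toy ED model
fit_ok = lambda vs: all(v >= 0.75 for v in vs.values())    # toy FIT check
print(hill_climb_voltages(fus, 1.0, 0.7, 0.05, ed_model, fit_ok))
```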

Results and Analysis
- Area overhead: 4%; impact on cycle time: 10%.
- A FIT target of 11.4 (MTBF = 10^5 years) yields an average power saving of 50%, with replication of 3 FUs per application.
- For a very low FIT target of 1.1-1.4, ED savings are around 30%.

BenchNN: Potential of Neural Network Accelerators
- After being hyped in the 1990s, ANNs faded away.
- There is now a surge of interest because of their energy-efficiency and fault-tolerance properties, and their applicability to emerging high-performance applications.

ANN alternative: blackscholes
- Function: predicts the price at a certain future date from today's inputs by solving partial differential equations.
- ANN alternative: a 6-input multi-layer perceptron with one output layer; hidden layers are explored during the training phase (a forward-pass sketch follows).
- Accuracy: PARSEC version, 1e-5; ANN version, 3e-5.
- Slowdown: the software NN version is 3.6x slower than the PARSEC version.
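To illustrate the shape of such a network, here is a minimal forward pass for a 6-input, 1-output MLP. The hidden-layer size, weight initialization, and activation choices are assumptions; the slide only fixes the input and output widths.

```python
import numpy as np

def mlp_forward(x, weights):
    """Fully connected MLP: sigmoid hidden units, linear output
    for the price estimate. Layer sizes are illustrative."""
    a = x
    for i, (W, b) in enumerate(weights):
        z = W @ a + b
        a = 1.0 / (1.0 + np.exp(-z)) if i < len(weights) - 1 else z
    return a

rng = np.random.default_rng(1)
sizes = [6, 8, 1]                        # 6 option parameters -> 1 price
weights = [(rng.normal(0, 0.5, (sizes[i + 1], sizes[i])),
            np.zeros(sizes[i + 1])) for i in range(len(sizes) - 1)]
x = rng.random(6)                        # normalized option parameters
print(mlp_forward(x, weights))
```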

ANN alternative: canneal
- Function: an optimization benchmark that uses simulated annealing to minimize the routing cost of a chip design.
- ANN alternative: Hopfield Neural Networks have been used to solve optimization problems including layout and placement (a minimal Hopfield update sketch follows).
- Accuracy: average wire lengths computed by the HNN are on par with or better than the PARSEC version.
- Slowdown: for 100K cells the slowdown is significant; a hierarchical approach can break the problem into smaller pieces.
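For reference, here is the textbook asynchronous Hopfield update whose energy never increases for symmetric, zero-diagonal weights. How canneal's wire cost would be encoded into the weight matrix is only suggested by the slide, not spelled out, so the tiny weight matrix below is purely illustrative.

```python
import numpy as np

def hopfield_minimize(W, theta, state, max_iters=1000, seed=0):
    """Asynchronous binary Hopfield updates: each step sets one neuron
    from its weighted input, so the energy
    E = -1/2 s^T W s + theta . s is non-increasing."""
    rng = np.random.default_rng(seed)
    s = state.copy()
    for _ in range(max_iters):
        i = rng.integers(len(s))
        # Neuron i turns on iff its field exceeds its threshold.
        s[i] = 1 if W[i] @ s > theta[i] else 0
    return s

energy = lambda W, theta, s: -0.5 * s @ W @ s + theta @ s

W = np.array([[0.0, 1.0], [1.0, 0.0]])   # symmetric, zero diagonal (toy)
s = hopfield_minimize(W, np.zeros(2), np.array([1, 0]))
print(s, energy(W, np.zeros(2), s))
```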

ANN alternative: ferret
- Function: content similarity; finding one or several objects matching an input object. Stationary image similarity, biased toward color moments, bounding boxes, and segment sizes.
- ANN alternative: object data is converted into feature vectors and compressed into compact vectors (the "sketch"); feature extraction is performed with a set of 2,160 Gabor filters.
- Accuracy: PARSEC version, 88%; ANN version, 93%.
- Slowdown: 2x compared to the PARSEC version.

ANN alternative: streamcluster
- Function: an online clustering program that classifies input data into groups so that each group shares similar features.
- ANN alternative: the most time-consuming task, reducing the data dimensionality (89% of runtime), can be done efficiently with Self-Organizing Maps (SOMs); a minimal SOM sketch follows.
- Accuracy: comparable to or better than the PARSEC version.
- Slowdown: the software ANN version is sequential, whereas the PARSEC version is parallel and divides the data into chunks.
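The sketch below shows the core SOM training loop that maps high-dimensional samples onto a small 2-D grid, which is the dimensionality-reduction role the slide assigns to SOMs. Grid size, learning-rate and neighborhood schedules are assumptions.

```python
import numpy as np

def som_train(data, grid=(8, 8), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal Self-Organizing Map trained by stochastic updates."""
    rng = np.random.default_rng(seed)
    h, w = grid
    nodes = rng.random((h * w, data.shape[1]))
    coords = np.array([(r, c) for r in range(h) for c in range(w)], float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((nodes - x) ** 2).sum(axis=1))   # best-matching unit
        lr = lr0 * (1 - t / iters)
        sigma = sigma0 * (1 - t / iters) + 1e-3
        # Pull the BMU and its grid neighbours toward the sample.
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        nbr = np.exp(-d2 / (2 * sigma ** 2))
        nodes += lr * nbr[:, None] * (x - nodes)
    return nodes

codebook = som_train(np.random.default_rng(1).random((500, 16)))
print(codebook.shape)    # 64 grid nodes, each a 16-dim prototype
```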

ANN alternative: dedup
- Function: a data-compression application that combines data deduplication with Ziv-Lempel compression to achieve high compression ratios.
- ANN alternative: 4 of the 5 stages are replaced by neural networks: fragmentation, hashing, building the global database, and compression.
- Accuracy: except for small files, the compression ratio is always better.
- Slowdown: the slowdown is so significant that even a hardware accelerator may not be competitive.

BenchNN: Summary
- The 5 PARSEC benchmarks considered here are representative of emerging high-performance workloads.
- For these applications it is possible to substitute the core computational task with a neural network algorithm.
- Neural networks achieve slightly worse, comparable, or sometimes even better solutions.
- The software versions are significantly slower, which argues for hardware accelerators for these computational kernels.
- Such accelerators would be very useful for embedded applications that need very good, but not necessarily state-of-the-art, accuracy.

Neural Network based Hardware Accelerator
- The BenchNN study makes clear the need for a neural-network-based hardware accelerator.
- Neural networks are inherently tolerant to errors and defects, so hardware built from them is naturally tolerant to defects such as transistor short or open faults.
- This study proposes a hardware ANN accelerator.
- The inputs and attributes of modern high-performance algorithms are rather limited (< 100), so a hardware neural network is conceivable.
- Emerging algorithm categories, including PARSEC and RMS, cover classification, clustering, statistical optimization, and approximation; competitive ANN-based algorithms exist for most of these.

Time-Multiplexed vs. Spatially Expanded ANN
Downsides of a time-multiplexed ANN: it incurs extra memory latency, consumes more power and energy, its control logic is vulnerable to defects, and it is less scalable.

Accelerator Implementation
- Only a scaled-down version is shown here; the actual network contains 90 inputs, 10 hidden neurons, and 10 outputs.
- Input/Output: fetch rows and write weights during training.
- Fixed-point computation: a 16-bit fixed-point design achieves the same accuracy as a floating-point design for most applications (a fixed-point neuron sketch follows).
- Activation function; partial time-multiplexing.
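To show why 16-bit fixed point can match floating point here, below is a sketch of one fixed-point neuron. The Q8.8 format, the wide accumulator, and the hard-sigmoid activation are assumptions for illustration; the paper's exact formats are not given on the slide.

```python
SCALE_BITS = 8
SCALE = 1 << SCALE_BITS                  # Q8.8 format (an assumption)

def to_fix(x):
    return int(round(x * SCALE))

def sat16(x):
    return max(-(1 << 15), min((1 << 15) - 1, x))

def neuron_fixed(inputs, weights, bias):
    """16-bit fixed-point neuron: multiply-accumulate at double width
    (Q16.16), rescale once at the end, then apply a piecewise-linear
    'hard sigmoid' activation entirely in fixed point."""
    acc = to_fix(bias) << SCALE_BITS     # align bias to the product scale
    for x, w in zip(inputs, weights):
        acc += to_fix(x) * to_fix(w)     # Q8.8 * Q8.8 -> Q16.16
    z = sat16(acc >> SCALE_BITS)         # back to Q8.8, saturated
    # Hard sigmoid: clamp z/4 + 0.5 into [0, 1].
    y = sat16((z >> 2) + (SCALE >> 1))
    return min(max(y, 0), SCALE) / SCALE

print(neuron_fixed([0.5, -0.25], [1.0, 2.0], bias=0.1))   # ~0.52
```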

Gate-level vs. Transistor-level Defects
A logic gate-level hardware fault (stuck-at) can exhibit significantly different behavior than a transistor-level hardware fault.

Impact of Defects on 4-bit Adder and Multiplier

Injection and Impact of Transistor-Level Defects

Comparison: Accelerator vs. CPU Versions
The accelerator's biggest advantage is its energy consumption, made possible by massively parallel multiplications/additions and circuit-level parallelism.

Evaluations
- The accelerator can tolerate up to 12 defects; most applications are not significantly affected by up to 20 defects.
- Accuracy is most sensitive to errors at the output layer or to defects occurring just before or at the activation function (a software fault-injection sketch follows).
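As a software proxy for this kind of defect sweep, one might inject stuck-at faults into quantized weights and re-measure accuracy for each fault count. Note this is not the paper's methodology (which injects faults into the circuit itself); the function below is purely illustrative.

```python
import numpy as np

def inject_stuck_at(weights_q, n_faults, seed=0):
    """Force randomly chosen bits of 16-bit quantized weights to 0 or 1,
    a crude software stand-in for hardware stuck-at defects."""
    rng = np.random.default_rng(seed)
    W = weights_q.astype(np.int16).copy()
    bits = W.view(np.uint16).reshape(-1)          # reinterpret raw bits
    for _ in range(n_faults):
        idx, bit = rng.integers(bits.size), rng.integers(16)
        mask = np.uint16(1 << int(bit))
        if rng.random() < 0.5:
            bits[idx] |= mask                     # stuck-at-1
        else:
            bits[idx] &= np.uint16(~mask)         # stuck-at-0
    return W

W = (np.random.default_rng(1).random((10, 9)) * 256).astype(np.int16)
W_faulty = inject_stuck_at(W, n_faults=12)       # then re-test accuracy
```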

Neural Acceleration for Approximate Programs
- Tolerance to approximation is a program characteristic that is growing increasingly important. Modern applications include image rendering, signal processing, augmented reality, data mining, robotics, speech recognition, and face recognition.
- The key idea is to learn how an original region of approximable code behaves and replace the original code with an efficient computation of the learned model.
- The compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU), which is tightly coupled to the processor pipeline.
- The NPU provides an average speedup of 2.3x and energy saving of 3.0x, with a quality loss of at most 9.6%.

Parrot Transformation at a Glance
- Programming: the programmer explicitly marks functions amenable to approximate execution for transformation.
- Compilation: the compiler selects and trains a suitable neural network and replaces the original code with NN invocations. This involves code observation (input-output probes), neural network selection and training, and binary generation (a toy observation sketch follows).
- Execution: the main core configures the NPU and invokes it to perform the neural network evaluation.
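The observe-then-replace flow can be illustrated with a toy decorator. The real Parrot transformation happens in the compiler on annotated source code; this Python stand-in, including the `approx` name and the example kernel, is purely hypothetical.

```python
def approx(func):
    """Toy stand-in for the Parrot annotation: log (input, output)
    pairs while running the precise code, so a neural network can later
    be trained on them and swapped in for the function."""
    samples = []

    def observing(*args):
        out = func(*args)
        samples.append((args, out))      # code observation phase
        return out

    observing.samples = samples
    return observing

@approx
def hot_function(x, y):                  # hypothetical approximable kernel
    return (x * x + y * y) ** 0.5

hot_function(3.0, 4.0)                   # runs precisely, logs the pair
print(hot_function.samples)              # training data for the NN
```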

Transformation Stages of an Edge Detection Algorithm
Edge detection uses the Sobel filter, a 3x3 matrix convolution that approximates the image's intensity gradient. It is executed many times, so the convolution is a hot function (a plain version follows).
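For concreteness, here is the Sobel convolution as straightforward code; nested loops keep the arithmetic explicit. This is the standard filter, not the paper's exact implementation.

```python
import numpy as np

def sobel(img):
    """3x3 Sobel convolution approximating the intensity gradient."""
    gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gy = gx.T                                   # vertical-gradient kernel
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            patch = img[r:r + 3, c:c + 3]
            out[r, c] = np.hypot((gx * patch).sum(), (gy * patch).sum())
    return out

print(sobel(np.arange(25, dtype=float).reshape(5, 5)))
```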

Neural Processing Unit Architecture and Organization
Multi-layer perceptrons (MLPs) are used because of their broad applicability; the compiler trains the neural network with the back-propagation algorithm.

ISA and Architectural Support for NPU Acceleration
- The NPU is a variable-delay, tightly coupled accelerator that communicates with the rest of the core via FIFO queues: a Config FIFO for sending and retrieving the configuration, an Input FIFO for sending the inputs of approximable functions, and an Output FIFO for retrieving the neural network's outputs.
- ISA extensions: enq.c %r, deq.c %r, enq.d %r, deq.d %r. deq.c %r is used during context switches.
- NPU instructions are not reordered; they are all treated as dependent. (A behavioral model of the FIFO protocol follows.)
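A behavioral sketch of the queue protocol may help: enq.d pushes inputs, deq.d pops outputs, and the NPU fires once a full input vector has arrived. The class below only mimics these semantics in software; the "network" is just a callable, and everything here is an illustrative assumption, not the paper's hardware.

```python
from collections import deque

class NPUModel:
    """Behavioral model of the NPU FIFO interface."""

    def __init__(self, n_inputs, network_fn):
        self.in_fifo, self.out_fifo = deque(), deque()
        self.n_inputs, self.network_fn = n_inputs, network_fn

    def enq_d(self, value):                        # enq.d %r
        self.in_fifo.append(value)
        if len(self.in_fifo) == self.n_inputs:     # full input vector arrived
            xs = [self.in_fifo.popleft() for _ in range(self.n_inputs)]
            self.out_fifo.extend(self.network_fn(xs))

    def deq_d(self):                               # deq.d %r (would stall if empty)
        return self.out_fifo.popleft()

npu = NPUModel(2, lambda xs: [sum(xs)])            # toy 'trained network'
npu.enq_d(1.5)
npu.enq_d(2.5)
print(npu.deq_d())                                 # 4.0
```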

Benchmarks Transformed in this Study
- Only functions for which the compiler can find a suitable, competitive ANN-based algorithm should be replaced.
- The best topology is selected using a 70% (training) / 30% (testing) split (sketched below).
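The 70/30 selection step might look like the sketch below. The `train_fn` and `mse_fn` callbacks stand in for the compiler's back-propagation trainer and error measurement, and the function name is hypothetical.

```python
import numpy as np

def pick_topology(X, y, hidden_options, train_fn, mse_fn, seed=0):
    """Train each candidate topology on 70% of the observed samples
    and keep the one with the lowest error on the held-out 30%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    tr, te = idx[:cut], idx[cut:]
    best = None
    for hidden in hidden_options:
        model = train_fn(X[tr], y[tr], hidden)     # placeholder trainer
        err = mse_fn(model, X[te], y[te])          # held-out error
        if best is None or err < best[0]:
            best = (err, hidden, model)
    return best                                    # (error, topology, model)
```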

Speedup and Energy Improvement
- With an ideal (zero-cycle) NPU, speedup ranges from 0.8x to 11.1x.
- Average NPU acceleration: 2.3x; average energy reduction: 3.0x.
- Optimal number of PEs in the NPU: 8.


Key Findings and Insights
- Different applications require different neural network topologies, so the NPU structure must be reconfigurable.
- The majority (80% to 100%) of each transformed application's output elements have error less than 10%.
- The Parrot transformation and NPU acceleration provide an average 2.3x speedup and 3.0x energy reduction.
- The proposed technique requires efficient neural network execution, such as hardware acceleration, to be beneficial.
- For some applications with simple neural network topologies, a tightly coupled, low-latency NPU-CPU integrated design is highly beneficial.

- Neural-network-based accelerators are more flexible than ASIC-based accelerators and can easily adapt to many high-performance applications.
- ANNs are inherently fault tolerant, so an accelerator built from them naturally possesses that quality.
- Typical hardware ANNs show two orders of magnitude better energy efficiency than conventional systems.
- They can play a major role in heterogeneous multi-core chips, helping solve some of the energy and dark silicon issues.