
Neural Network based Energy-Efficient Fault Tolerant Architectures and Accelerators. University of Rochester, February 7, 2013.

References
- Flexible Error Protection for Energy Efficient Reliable Architectures. T. Miller, N. Surapaneni, R. Teodorescu, Ohio State University, SBAC-PAD '10.
- BenchNN: On the Broad Potential Application Scope of Hardware Neural Network Accelerators. T. Chen et al., University of Wisconsin, IISWC '12.
- A Defect-Tolerant Accelerator for Emerging High-Performance Applications. Olivier Temam, INRIA France, ISCA '12.
- Neural Acceleration for General-Purpose Approximate Programs. H. Esmaeilzadeh et al., University of Washington & Microsoft, MICRO '12.

Introduction and Motivation
- Technology scaling has a detrimental effect on reliability.
- Dark silicon jeopardizes many-core designs and massive on-chip parallelism.
- One way to tackle the dark silicon and energy problem is specialization through heterogeneous multi-cores.
- In conventional architectures, a single transistor breakdown can potentially prove fatal.
- Artificial neural network (ANN) based systems are inherently more tolerant to defects and noise, and more energy efficient than conventional architectures.
- Interest in ANNs waned after they were outperformed by SVMs; with the emergence of RMS (Recognition, Mining, and Synthesis) workloads, ANNs are being revisited.

Neural Network based Architectures and Solutions
1. A multi-core architecture that achieves energy efficiency for a user-specified reliability (FIT) target by controlling replication and supply voltages with a hill-climbing algorithm.
2. A set of neural-network-based computational kernels that are alternatives to several PARSEC benchmarks (e.g., blackscholes) and achieve on-par or better performance.
3. A neural-network-based multi-purpose hardware accelerator that tolerates multiple defects, implements the computational kernels of emerging RMS workloads, and, like custom circuits, achieves roughly two orders of magnitude better energy efficiency.
4. A neural-network-based program transformation technique that targets approximable code regions in general-purpose programs and offloads them to a neural processing unit (NPU).

Machine Learning based Adaptive Multicore Architecture
- Presents a reliable, energy-efficient, and adaptive multicore architecture.
- Each core consists of a pair of pipelines that can run independently (executing separate threads) or in concert (executing the same thread and verifying results).
- The idea is to adapt to the characteristics of individual cores and applications to provide acceptable reliability at minimum energy.
- Online control based on hill climbing dynamically adjusts multiple parameters to minimize energy consumption.
- Dynamic adaptation of voltage and redundancy can reduce the energy-delay product of a CMP by 30-60% compared to static dual modular redundancy (DMR).

Architecture and Error Detection
- Shadow register replication mode: only timing errors can be detected; results are restored from the delayed shadow registers.
- Shadow pipeline replication mode: both timing and soft errors can be detected. Re-execution fixes soft errors; for timing errors, instructions are marked and re-executed, and if the error recurs the result is restored from the shadow registers.

Support for Timing Speculation
If an FU is not fully replicated, selectively enable the pipeline registers that have a delayed clock, much like RAZOR.

Neural Networks for Power and Error Prediction
- Primary Power ANN: predicts the power of the primary pipelines from voltage, utilization, and temperature.
- Shadow Power ANN: predicts the power of the shadow pipelines from voltage, utilization, replication, and temperature.
- Error Probability ANN: predicts the raw probability of an error on each cycle from voltage and temperature.
- The ANNs are trained online by comparing predictions against measurements and adjusting the weights (a minimal sketch follows).
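To make the online-training idea concrete, here is a minimal sketch of a tiny predictor trained one sample at a time. The feature set, network size, learning rate, and class name are all illustrative assumptions, not the paper's design.

```python
import numpy as np

class OnlinePowerANN:
    """Tiny one-hidden-layer network trained online with SGD.
    Illustrative only: features and sizes are assumptions."""

    def __init__(self, n_in=3, n_hidden=4, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def predict(self, x):
        self.h = np.tanh(self.W1 @ x + self.b1)   # hidden activations
        return self.W2 @ self.h + self.b2          # predicted power

    def update(self, x, measured):
        # One backpropagation step against the measured power sample.
        err = self.predict(x) - measured
        dh = err * self.W2 * (1 - self.h ** 2)     # tanh derivative
        self.W1 -= self.lr * np.outer(dh, x)
        self.b1 -= self.lr * dh
        self.W2 -= self.lr * err * self.h
        self.b2 -= self.lr * err

# Each control interval: predict, then correct with the sensor reading.
ann = OnlinePowerANN()
x = np.array([0.9, 0.7, 0.55])   # normalized (voltage, utilization, temperature)
print(ann.predict(x))
ann.update(x, measured=1.8)      # hypothetical measured power in watts
```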

Hill-climbing Search for Optimal Voltage
- Energy is optimized for a given FIT target at regular intervals.
- Start with the maximum voltage for every FU, then lower voltages one step at a time, checking for errors and computing the energy-delay (ED) product.
- Voltages are lowered until the minimum ED is found (a sketch of this loop follows).
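The following is a minimal sketch of such a greedy search, under the assumptions stated on the slide. The function names, the `measure_ed`/`meets_fit` callbacks, and the toy models in the usage lines are all hypothetical placeholders for the paper's runtime measurements.

```python
def hill_climb_voltages(fus, v_max, v_min, v_step, measure_ed, meets_fit):
    """Greedy descent over per-FU supply voltages: start at maximum
    voltage and lower one FU at a time while the FIT target still holds
    and the energy-delay (ED) product keeps improving."""
    voltages = {fu: v_max for fu in fus}
    best_ed = measure_ed(voltages)
    improved = True
    while improved:
        improved = False
        for fu in fus:
            new_v = round(voltages[fu] - v_step, 3)
            if new_v < v_min:
                continue
            trial = dict(voltages)
            trial[fu] = new_v
            # Accept only if reliability still holds and ED went down.
            if meets_fit(trial) and measure_ed(trial) < best_ed:
                voltages, best_ed = trial, measure_ed(trial)
                improved = True
    return voltages, best_ed

fus = ["alu", "fpu", "lsu"]
ed_model = lambda vs: sum(v * v for v in vs.values())      # toy ED model
fit_ok = lambda vs: all(v >= 0.75 for v in vs.values())    # toy FIT check
print(hill_climb_voltages(fus, 1.0, 0.7, 0.05, ed_model, fit_ok))
```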

Results and Analysis
- Area overhead: 4%; impact on cycle time: 10%.
- A FIT target of 11.4 (MTBF = 10^5 years) yields an average power saving of 50%, with replication of 3 FUs per application.
- For a very low FIT target of 1.1-1.4, ED savings are around 30%.

BenchNN: Potential of Neural Network Accelerators
- After being hyped in the 1990s, ANNs faded away.
- There is now a surge of interest because of their energy-efficiency and fault-tolerance properties, and their applicability to emerging high-performance applications.

ANN alternative: blackscholes
- Function: predicts the price at a certain future date from today's inputs by solving partial differential equations.
- ANN alternative: a 6-input multi-layer perceptron with one output layer; hidden layers are explored during the training phase (a forward-pass sketch follows).
- Accuracy: PARSEC version, 1e-5; ANN version, 3e-5.
- Slowdown: the software NN version is 3.6x slower than the PARSEC version.
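To illustrate the shape of such a network, here is a minimal forward pass for a 6-input, 1-output MLP. The hidden-layer size, weight initialization, and activation choices are assumptions; the slide only fixes the input and output widths.

```python
import numpy as np

def mlp_forward(x, weights):
    """Fully connected MLP: sigmoid hidden units, linear output
    for the price estimate. Layer sizes are illustrative."""
    a = x
    for i, (W, b) in enumerate(weights):
        z = W @ a + b
        a = 1.0 / (1.0 + np.exp(-z)) if i < len(weights) - 1 else z
    return a

rng = np.random.default_rng(1)
sizes = [6, 8, 1]                        # 6 option parameters -> 1 price
weights = [(rng.normal(0, 0.5, (sizes[i + 1], sizes[i])),
            np.zeros(sizes[i + 1])) for i in range(len(sizes) - 1)]
x = rng.random(6)                        # normalized option parameters
print(mlp_forward(x, weights))
```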

ANN alternative: canneal
- Function: an optimization benchmark that uses simulated annealing to minimize the routing cost of a chip design.
- ANN alternative: Hopfield Neural Networks have been used to solve optimization problems including layout and placement (a minimal Hopfield update sketch follows).
- Accuracy: average wire lengths computed by the HNN are on par with or better than the PARSEC version.
- Slowdown: for 100K cells the slowdown is significant; a hierarchical approach can break the problem into smaller pieces.
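For reference, here is the textbook asynchronous Hopfield update whose energy never increases for symmetric, zero-diagonal weights. How canneal's wire cost would be encoded into the weight matrix is only suggested by the slide, not spelled out, so the tiny weight matrix below is purely illustrative.

```python
import numpy as np

def hopfield_minimize(W, theta, state, max_iters=1000, seed=0):
    """Asynchronous binary Hopfield updates: each step sets one neuron
    from its weighted input, so the energy
    E = -1/2 s^T W s + theta . s is non-increasing."""
    rng = np.random.default_rng(seed)
    s = state.copy()
    for _ in range(max_iters):
        i = rng.integers(len(s))
        # Neuron i turns on iff its field exceeds its threshold.
        s[i] = 1 if W[i] @ s > theta[i] else 0
    return s

energy = lambda W, theta, s: -0.5 * s @ W @ s + theta @ s

W = np.array([[0.0, 1.0], [1.0, 0.0]])   # symmetric, zero diagonal (toy)
s = hopfield_minimize(W, np.zeros(2), np.array([1, 0]))
print(s, energy(W, np.zeros(2), s))
```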

ANN alternative: ferret
- Function: content similarity; finding one or several objects matching an input object. Stationary image similarity, biased toward color moments, bounding boxes, and segment sizes.
- ANN alternative: object data is converted into feature vectors and compressed into compact vectors (the "sketch"); feature extraction is performed with a set of 2,160 Gabor filters.
- Accuracy: PARSEC version, 88%; ANN version, 93%.
- Slowdown: 2x compared to the PARSEC version.

ANN alternative: streamcluster
- Function: an online clustering program that classifies input data into groups so that each group shares similar features.
- ANN alternative: the most time-consuming task, reducing the data dimensionality (89% of runtime), can be done efficiently with Self-Organizing Maps (SOMs); a minimal SOM sketch follows.
- Accuracy: comparable to or better than the PARSEC version.
- Slowdown: the software ANN version is sequential, whereas the PARSEC version is parallel and divides the data into chunks.
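The sketch below shows the core SOM training loop that maps high-dimensional samples onto a small 2-D grid, which is the dimensionality-reduction role the slide assigns to SOMs. Grid size, learning-rate and neighborhood schedules are assumptions.

```python
import numpy as np

def som_train(data, grid=(8, 8), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal Self-Organizing Map trained by stochastic updates."""
    rng = np.random.default_rng(seed)
    h, w = grid
    nodes = rng.random((h * w, data.shape[1]))
    coords = np.array([(r, c) for r in range(h) for c in range(w)], float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((nodes - x) ** 2).sum(axis=1))   # best-matching unit
        lr = lr0 * (1 - t / iters)
        sigma = sigma0 * (1 - t / iters) + 1e-3
        # Pull the BMU and its grid neighbours toward the sample.
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        nbr = np.exp(-d2 / (2 * sigma ** 2))
        nodes += lr * nbr[:, None] * (x - nodes)
    return nodes

codebook = som_train(np.random.default_rng(1).random((500, 16)))
print(codebook.shape)    # 64 grid nodes, each a 16-dim prototype
```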

ANN alternative: dedup
- Function: a data-compression application that combines data deduplication with Ziv-Lempel compression to achieve high compression ratios.
- ANN alternative: 4 of the 5 stages are replaced by neural networks: fragmentation, hashing, building the global database, and compression.
- Accuracy: except for small files, the compression ratio is always better.
- Slowdown: the slowdown is so significant that even a hardware accelerator may not be competitive.

BenchNN: Summary
- The 5 PARSEC benchmarks considered here are representative of emerging high-performance workloads.
- For these applications it is possible to substitute the core computational task with a neural network algorithm.
- Neural networks achieve slightly worse, comparable, or sometimes even better solutions.
- The software versions are significantly slower, which argues for hardware accelerators for these computational kernels.
- Such accelerators would be very useful for embedded applications that need very good, but not necessarily state-of-the-art, accuracy.

Neural Network based Hardware Accelerator
- The BenchNN study makes clear the need for a neural-network-based hardware accelerator.
- Neural networks are inherently tolerant to errors and defects, so hardware built from them is naturally tolerant to defects such as transistor short or open faults.
- This study proposes a hardware ANN accelerator.
- The inputs and attributes of modern high-performance algorithms are rather limited (< 100), so a hardware neural network is conceivable.
- Emerging algorithm categories, including PARSEC and RMS, cover classification, clustering, statistical optimization, and approximation; competitive ANN-based algorithms exist for most of these.

Time-Multiplexed vs. Spatially Expanded ANN
Downsides of a time-multiplexed ANN: it incurs extra memory latency, consumes more power and energy, its control logic is vulnerable to defects, and it is less scalable.

Accelerator Implementation
- Only a scaled-down version is shown here; the actual network contains 90 inputs, 10 hidden neurons, and 10 outputs.
- Input/Output: fetch rows and write weights during training.
- Fixed-point computation: a 16-bit fixed-point design achieves the same accuracy as a floating-point design for most applications (a fixed-point neuron sketch follows).
- Activation function; partial time-multiplexing.
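To show why 16-bit fixed point can match floating point here, below is a sketch of one fixed-point neuron. The Q8.8 format, the wide accumulator, and the hard-sigmoid activation are assumptions for illustration; the paper's exact formats are not given on the slide.

```python
SCALE_BITS = 8
SCALE = 1 << SCALE_BITS                  # Q8.8 format (an assumption)

def to_fix(x):
    return int(round(x * SCALE))

def sat16(x):
    return max(-(1 << 15), min((1 << 15) - 1, x))

def neuron_fixed(inputs, weights, bias):
    """16-bit fixed-point neuron: multiply-accumulate at double width
    (Q16.16), rescale once at the end, then apply a piecewise-linear
    'hard sigmoid' activation entirely in fixed point."""
    acc = to_fix(bias) << SCALE_BITS     # align bias to the product scale
    for x, w in zip(inputs, weights):
        acc += to_fix(x) * to_fix(w)     # Q8.8 * Q8.8 -> Q16.16
    z = sat16(acc >> SCALE_BITS)         # back to Q8.8, saturated
    # Hard sigmoid: clamp z/4 + 0.5 into [0, 1].
    y = sat16((z >> 2) + (SCALE >> 1))
    return min(max(y, 0), SCALE) / SCALE

print(neuron_fixed([0.5, -0.25], [1.0, 2.0], bias=0.1))   # ~0.52
```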

Gate-level vs. Transistor-level Defects
A logic gate-level hardware fault (stuck-at) can exhibit significantly different behavior than a transistor-level hardware fault.

Impact of Defects on 4-bit Adder and Multiplier

Injection and Impact of Transistor-Level Defects

Comparison: Accelerator vs. CPU Versions
The accelerator's biggest advantage is its energy consumption, made possible by massively parallel multiplications/additions and circuit-level parallelism.

Evaluations
- The accelerator can tolerate up to 12 defects; most applications are not significantly affected by up to 20 defects.
- Accuracy is most sensitive to errors at the output layer or to defects occurring just before or at the activation function (a software fault-injection sketch follows).
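As a software proxy for this kind of defect sweep, one might inject stuck-at faults into quantized weights and re-measure accuracy for each fault count. Note this is not the paper's methodology (which injects faults into the circuit itself); the function below is purely illustrative.

```python
import numpy as np

def inject_stuck_at(weights_q, n_faults, seed=0):
    """Force randomly chosen bits of 16-bit quantized weights to 0 or 1,
    a crude software stand-in for hardware stuck-at defects."""
    rng = np.random.default_rng(seed)
    W = weights_q.astype(np.int16).copy()
    bits = W.view(np.uint16).reshape(-1)          # reinterpret raw bits
    for _ in range(n_faults):
        idx, bit = rng.integers(bits.size), rng.integers(16)
        mask = np.uint16(1 << int(bit))
        if rng.random() < 0.5:
            bits[idx] |= mask                     # stuck-at-1
        else:
            bits[idx] &= np.uint16(~mask)         # stuck-at-0
    return W

W = (np.random.default_rng(1).random((10, 9)) * 256).astype(np.int16)
W_faulty = inject_stuck_at(W, n_faults=12)       # then re-test accuracy
```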

Neural Acceleration for Approximate Programs
- Tolerance to approximation is a program characteristic that is growing increasingly important. Modern applications include image rendering, signal processing, augmented reality, data mining, robotics, speech recognition, and face recognition.
- The key idea is to learn how an original region of approximable code behaves and replace the original code with an efficient computation of the learned model.
- The compiler replaces the original code with an invocation of a low-power accelerator called a neural processing unit (NPU), which is tightly coupled to the processor pipeline.
- The NPU provides an average speedup of 2.3x and energy saving of 3.0x, with a quality loss of at most 9.6%.

Parrot Transformation at a Glance
- Programming: the programmer explicitly marks functions amenable to approximate execution for transformation.
- Compilation: the compiler selects and trains a suitable neural network and replaces the original code with NN invocations. This involves code observation (input-output probes), neural network selection and training, and binary generation (a toy observation sketch follows).
- Execution: the main core configures the NPU and invokes it to perform the neural network evaluation.
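The observe-then-replace flow can be illustrated with a toy decorator. The real Parrot transformation happens in the compiler on annotated source code; this Python stand-in, including the `approx` name and the example kernel, is purely hypothetical.

```python
def approx(func):
    """Toy stand-in for the Parrot annotation: log (input, output)
    pairs while running the precise code, so a neural network can later
    be trained on them and swapped in for the function."""
    samples = []

    def observing(*args):
        out = func(*args)
        samples.append((args, out))      # code observation phase
        return out

    observing.samples = samples
    return observing

@approx
def hot_function(x, y):                  # hypothetical approximable kernel
    return (x * x + y * y) ** 0.5

hot_function(3.0, 4.0)                   # runs precisely, logs the pair
print(hot_function.samples)              # training data for the NN
```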

Transformation Stages of an Edge Detection Algorithm
Edge detection uses the Sobel filter, a 3x3 matrix convolution that approximates the image's intensity gradient. It is executed many times, so the convolution is a hot function (a plain version follows).
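For concreteness, here is the Sobel convolution as straightforward code; nested loops keep the arithmetic explicit. This is the standard filter, not the paper's exact implementation.

```python
import numpy as np

def sobel(img):
    """3x3 Sobel convolution approximating the intensity gradient."""
    gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    gy = gx.T                                   # vertical-gradient kernel
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for r in range(h - 2):
        for c in range(w - 2):
            patch = img[r:r + 3, c:c + 3]
            out[r, c] = np.hypot((gx * patch).sum(), (gy * patch).sum())
    return out

print(sobel(np.arange(25, dtype=float).reshape(5, 5)))
```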

Neural Processing Unit Architecture and Organization
Multi-layer perceptrons (MLPs) are used because of their broad applicability; the compiler trains the neural network with the back-propagation algorithm.

ISA and Architectural Support for NPU Acceleration
- The NPU is a variable-delay, tightly coupled accelerator that communicates with the rest of the core via FIFO queues: a Config FIFO for sending and retrieving the configuration, an Input FIFO for sending the inputs of approximable functions, and an Output FIFO for retrieving the neural network's outputs.
- ISA extensions: enq.c %r, deq.c %r, enq.d %r, deq.d %r. deq.c %r is used during context switches.
- NPU instructions are not reordered; they are all treated as dependent. (A behavioral model of the FIFO protocol follows.)
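A behavioral sketch of the queue protocol may help: enq.d pushes inputs, deq.d pops outputs, and the NPU fires once a full input vector has arrived. The class below only mimics these semantics in software; the "network" is just a callable, and everything here is an illustrative assumption, not the paper's hardware.

```python
from collections import deque

class NPUModel:
    """Behavioral model of the NPU FIFO interface."""

    def __init__(self, n_inputs, network_fn):
        self.in_fifo, self.out_fifo = deque(), deque()
        self.n_inputs, self.network_fn = n_inputs, network_fn

    def enq_d(self, value):                        # enq.d %r
        self.in_fifo.append(value)
        if len(self.in_fifo) == self.n_inputs:     # full input vector arrived
            xs = [self.in_fifo.popleft() for _ in range(self.n_inputs)]
            self.out_fifo.extend(self.network_fn(xs))

    def deq_d(self):                               # deq.d %r (would stall if empty)
        return self.out_fifo.popleft()

npu = NPUModel(2, lambda xs: [sum(xs)])            # toy 'trained network'
npu.enq_d(1.5)
npu.enq_d(2.5)
print(npu.deq_d())                                 # 4.0
```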

Benchmarks Transformed in this Study
- Only functions for which the compiler can find a suitable, competitive ANN-based algorithm should be replaced.
- The best topology is selected using a 70% (training) / 30% (testing) split (sketched below).
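The 70/30 selection step might look like the sketch below. The `train_fn` and `mse_fn` callbacks stand in for the compiler's back-propagation trainer and error measurement, and the function name is hypothetical.

```python
import numpy as np

def pick_topology(X, y, hidden_options, train_fn, mse_fn, seed=0):
    """Train each candidate topology on 70% of the observed samples
    and keep the one with the lowest error on the held-out 30%."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.7 * len(X))
    tr, te = idx[:cut], idx[cut:]
    best = None
    for hidden in hidden_options:
        model = train_fn(X[tr], y[tr], hidden)     # placeholder trainer
        err = mse_fn(model, X[te], y[te])          # held-out error
        if best is None or err < best[0]:
            best = (err, hidden, model)
    return best                                    # (error, topology, model)
```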

Speedup and Energy Improvement
- With an ideal (zero-cycle) NPU, speedup ranges from 0.8x to 11.1x.
- Average NPU acceleration: 2.3x; average energy reduction: 3.0x.
- Optimal number of PEs in the NPU: 8.


Key Findings and Insights
- Different applications require different neural network topologies, so the NPU structure must be reconfigurable.
- The majority (80% to 100%) of each transformed application's output elements have error less than 10%.
- The Parrot transformation and NPU acceleration provide an average 2.3x speedup and 3.0x energy reduction.
- The proposed technique requires efficient neural network execution, such as hardware acceleration, to be beneficial.
- For some applications with simple neural network topologies, a tightly coupled, low-latency NPU-CPU integrated design is highly beneficial.

- Neural-network-based accelerators are more flexible than ASIC-based accelerators and can easily adapt to many high-performance applications.
- ANNs are inherently fault tolerant, so an accelerator built from them naturally possesses that quality.
- Typical hardware ANNs show two orders of magnitude better energy efficiency than conventional systems.
- They can play a major role in heterogeneous multi-core chips, helping solve some of the energy and dark silicon issues.