A Tightly-coupled Light-Weight Neural Network Processing Units with RISC-V Core

1 A Tightly-coupled Light-Weight Neural Network Processing Units with RISC-V Core
Ting Hu, Ying Wang, Lei Zhang, Tingting He
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China

2 Enabling machine learning in lightweight devices
- Tasks: recognition, detection, understanding
- Devices: wearable devices, smart phones, IoT
- Platforms: GPGPU, FPGA, machine learning accelerator (ASICs/ASIP)?

3 Code Approximation for General-purpose Workloads
- Replace the computation-intensive but error-resilient code segment with neural networks
- e.g., Approxbench and AxBench

Original code:
    float foo(float a, float b) {
        /* computation-intensive loops */
        return val;
    }

Approximate version, where the marked region is replaced with a NN:
    float foo(float a, float b) {
        //#program
        return val;
        //#program
    }
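The replacement described on this slide can be sketched in C. Everything below is illustrative and not taken from the talk: `foo_exact` is a stand-in for an arbitrary loop-heavy region, `TinyMlp` is a hypothetical 2-2-1 network, `hard_sigmoid` is a piecewise-linear activation chosen only to keep the sketch free of library math calls, and the weights would have to come from offline training on samples of the exact function.

```c
#include <assert.h>
#include <stddef.h>

#define MLP_IN  2
#define MLP_HID 2

/* Hypothetical parameters of a small 2-2-1 network (assumed, not from the slides). */
typedef struct {
    float w_hid[MLP_HID][MLP_IN];
    float b_hid[MLP_HID];
    float w_out[MLP_HID];
    float b_out;
} TinyMlp;

/* Piecewise-linear stand-in for a sigmoid: clamp(0.25*x + 0.5, 0, 1). */
static float hard_sigmoid(float x) {
    float y = 0.25f * x + 0.5f;
    return y < 0.0f ? 0.0f : (y > 1.0f ? 1.0f : y);
}

/* Stand-in for the computation-intensive region: sqrt(a^2 + b^2) by Newton iteration. */
float foo_exact(float a, float b) {
    float s = a * a + b * b;
    float val = s > 1.0f ? s : 1.0f;          /* initial guess */
    for (int i = 0; i < 32; ++i)
        val = 0.5f * (val + s / val);
    return val;
}

/* Approximate version: the loop is replaced by one forward pass of the network. */
float foo_approx(const TinyMlp *nn, float a, float b) {
    assert(nn != NULL);
    const float in[MLP_IN] = {a, b};
    float out = nn->b_out;                    /* linear output neuron */
    for (int h = 0; h < MLP_HID; ++h) {
        float acc = nn->b_hid[h];
        for (int i = 0; i < MLP_IN; ++i)
            acc += nn->w_hid[h][i] * in[i];
        out += nn->w_out[h] * hard_sigmoid(acc);
    }
    return out;
}
```

The forward pass is a fixed sequence of multiply-accumulate and activation steps, which is the kind of work a neural accelerator can take over from the core.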

4 Closely-coupled Neural Accelerator through RoCC
- Loosely-coupled neural accelerator as in a typical SoC
  - An FPGA implementation, for example
- Closely-coupled neural accelerator
  - Reduced overhead of NA2CPU communication
[Figure labels, loosely-coupled design: software part (Host Processor, SD Card); hardware part (Processor Interconnection, The Neural Accelerator, AXI-lite, AXI DMA, AXI_MM2S, AXI_S2MM, AXIS_MM2S, AXIS_S2MM, DDR Memory Controller)]
[Figure labels, closely-coupled design: RoCC Interface (cmd, exception, irq, busy); Neural Accelerator (NA); ANN Accel.; Accelerator; L1 DCache (mem.req, mem.resp)]

5 Closely-coupled Neural Accelerator through RoCC
- The original definition of the RoCC interface
- Our RISC-V core with neural ACC. interface
[Figure labels: RoCC Interface (cmd, exception, irq, busy); ANN Accel.; L1 DCache (mem.req, mem.resp)]

6 Closely-coupled Neural Accelerator through RoCC
- Our RISC-V core with neural ACC. interface
- NA command interface: RoCC Interface (cmd, exception, irq, busy) to the ANN Accel.
- Signals between NA and private cache: L1 DCache (mem.req, mem.resp)
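For context on what travels over the cmd channel: a RoCC command carries a RISC-V custom-opcode instruction in the standard Rocket layout (funct7, rs2, rs1, xd, xs1, xs2, rd, opcode). The sketch below packs that 32-bit word. The bit layout and the custom-0 opcode value are the standard RoCC format; the helper name `rocc_encode` is made up here, and how the talk's accelerator interprets funct7 is not stated in the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Major opcode custom-0 (0001011b), one of the opcodes reserved for RoCC accelerators. */
#define ROCC_CUSTOM0 0x0Bu

/* Pack one RoCC instruction word:
 * [31:25] funct7 | [24:20] rs2 | [19:15] rs1 | [14] xd | [13] xs1 | [12] xs2 |
 * [11:7] rd | [6:0] opcode.
 * xd/xs1/xs2 say whether rd, rs1 and rs2 refer to core integer registers. */
uint32_t rocc_encode(uint32_t funct7, uint32_t rd, uint32_t rs1, uint32_t rs2,
                     int xd, int xs1, int xs2) {
    assert(funct7 < 128u && rd < 32u && rs1 < 32u && rs2 < 32u);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           ((uint32_t)(xd  != 0) << 14) |
           ((uint32_t)(xs1 != 0) << 13) |
           ((uint32_t)(xs2 != 0) << 12) |
           (rd << 7) | ROCC_CUSTOM0;
}
```

For example, `rocc_encode(1, 10, 11, 12, 1, 1, 1)` yields the word `0x02C5F50B`; the core forwards it together with the values of the two source registers, and the accelerator signals completion through busy and the response path.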

7 Extended Instruction Set For Neural Acceleration
e.g., for full-connection computation
- EISA instruction classes: N instructions (for neural layer operation), DMA instructions, AGU instructions, other instructions
- Mnemonics shown: MAC, POOL, LRN, PDMA, L.IOB2N, L.WB2N, S.N2IOB, ClearB, BTB
- N instructions: for NA initialization and invocation
  - Fields: opcode, exe_mode, start_cycle, issue_interval, issue_num, mode, des_register, others
- DMA instructions: for data initialization in buffer (loading neural parameters and input)
  - Fields: opcode, exe_mode, mode, layer, tiling, buffer_flag
- AGU instructions: for data streaming from buffer to processing elements
  - Fields: opcode, exe_mode, start_cycle, issue_interval, base_addr, offset, buffer_flag, agu_mode, batch_size, kernel, x_length, y_length, inputlayers, outputlayers, others
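One way to picture how an AGU instruction could drive data streaming is to expand it into a schedule of (cycle, address) pairs. The slide gives the field names only, so the semantics below are an assumption: one access every `issue_interval` cycles starting at `start_cycle`, with the address advancing by `offset` from `base_addr`. The struct, the function name `agu_expand` and the explicit `count` argument are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subset of the AGU fields named on the slide; widths and meaning are assumed. */
typedef struct {
    uint32_t start_cycle;     /* cycle of the first buffer access */
    uint32_t issue_interval;  /* cycles between consecutive accesses */
    uint32_t base_addr;       /* first buffer address */
    uint32_t offset;          /* address stride between accesses */
} AguCmd;

typedef struct {
    uint32_t cycle;
    uint32_t addr;
} AguIssue;

/* Expand one command into `count` scheduled buffer reads (assumed semantics). */
size_t agu_expand(const AguCmd *cmd, size_t count, AguIssue *out) {
    assert(cmd != NULL && out != NULL);
    for (size_t k = 0; k < count; ++k) {
        out[k].cycle = cmd->start_cycle + (uint32_t)k * cmd->issue_interval;
        out[k].addr  = cmd->base_addr   + (uint32_t)k * cmd->offset;
    }
    return count;
}
```

Under these assumptions a command with start_cycle 2, issue_interval 3, base_addr 0x100 and offset 4 reads 0x100 at cycle 2, 0x104 at cycle 5 and 0x108 at cycle 8, so a single instruction replaces a software loop of individual loads.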


8 Inside the Neural Accelerator
- One-dimension systolic array
- Key features:
  - Compact size, for low-power IoT applications
  - Good data reusability, energy-efficient
[Figure labels, the NA array (linear): data_in, acc_in, acc_out (acc_in), acc_fifo, sigmoid_lut, sigmoid_fifo, data_out (data_in)]
[Figure labels, approximate activation functions: sig_in[21:9], >127, < -127, small (-13~13), bit LUT, 127, -1, Mux, sig_out]
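The activation figure on this slide survives only as labels, so the following is one plausible reading rather than the authors' circuit: take bits [21:9] of the accumulator as a signed index, saturate when the index is above 127 or below -127, and otherwise read a lookup table, with a mux choosing among the three results. The saturation outputs (`SAT_HIGH`, `SAT_LOW`), the table size and the 8-bit output type are assumptions, and the table contents are supplied by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LUT_HALF 127                   /* index range covered by the table: -127..127 */
#define LUT_SIZE (2 * LUT_HALF + 1)
#define SAT_HIGH 127                   /* assumed output for large positive inputs */
#define SAT_LOW  0                     /* assumed output for large negative inputs */

/* Extract bits [21:9] of sig_in and sign-extend the 13-bit field. */
static int32_t field_21_9(int32_t sig_in) {
    int32_t f = (int32_t)(((uint32_t)sig_in >> 9) & 0x1FFFu);
    return (f & 0x1000) ? f - 0x2000 : f;
}

/* Saturate outside the table range, otherwise look the value up. */
int8_t approx_activation(int32_t sig_in, const int8_t lut[LUT_SIZE]) {
    assert(lut != NULL);
    int32_t idx = field_21_9(sig_in);
    if (idx > LUT_HALF)  return SAT_HIGH;
    if (idx < -LUT_HALF) return SAT_LOW;
    return lut[idx + LUT_HALF];
}
```

Replacing the exponential with two comparisons and one table read keeps each processing element small, which fits the compact-size goal listed among the key features.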

9 Inside the Neural Accelerator
- Reuse the input data for different output neurons
- Reuse the parameters for convolution operations
[Figure labels: data input direction, input direction, data_in, acc_in, acc_out (acc_in), acc_fifo, sigmoid_lut, sigmoid_fifo, data_out (data_in)]
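The input reuse named on this slide can be illustrated with a small cycle-by-cycle model of a linear array computing a fully connected layer. This is a sketch under assumptions: each processing element (PE) owns one output neuron and its weights, every input value enters at PE 0 and moves one PE further each cycle (the data_out to data_in link), and the array sizes and the function name `systolic_fc` are invented for the example. Draining results through acc_out and the sigmoid stage is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FC_PE 4   /* processing elements = output neurons (assumed size) */
#define FC_IN 3   /* input values per output neuron (assumed size) */

/* w is row-major: w[j * FC_IN + i] is the weight of input i for output neuron j. */
void systolic_fc(const int8_t x[FC_IN], const int8_t w[FC_PE * FC_IN],
                 int32_t acc[FC_PE]) {
    int8_t data_reg[FC_PE] = {0};  /* input value currently held by each PE */
    int tag[FC_PE];                /* index of that input, or -1 if the PE is idle */

    assert(x != NULL && w != NULL && acc != NULL);
    for (int j = 0; j < FC_PE; ++j) {
        acc[j] = 0;
        tag[j] = -1;
    }

    for (int t = 0; t < FC_IN + FC_PE - 1; ++t) {
        /* Shift from tail to head so each PE receives its neighbour's previous value. */
        for (int j = FC_PE - 1; j > 0; --j) {
            data_reg[j] = data_reg[j - 1];
            tag[j] = tag[j - 1];
        }
        if (t < FC_IN) {
            data_reg[0] = x[t];
            tag[0] = t;
        } else {
            tag[0] = -1;           /* no new input: a bubble enters the array */
        }
        /* Every busy PE performs one multiply-accumulate with its own weight. */
        for (int j = 0; j < FC_PE; ++j)
            if (tag[j] >= 0)
                acc[j] += (int32_t)data_reg[j] * w[j * FC_IN + tag[j]];
    }
}
```

Each input value is fetched from the buffer once and then used by all FC_PE elements as it travels down the chain, which is the reuse that saves buffer reads; the whole layer takes FC_IN + FC_PE - 1 cycles in this model.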

10 Performance Evaluation
TABLE 1 Description of the Rocket core and neural accelerator
Rocket core:         frequency: 400 MHz, L1 DCache size: 64 KB
Neural accelerator:  count: [missing], frequency: 400 MHz, peak Gops: 3.2 Gops, average power: 50 mW

#  Benchmark      Domain       Description                      Input Dataset
1  Black-Scholes  Financial    Mathematical model               4,000 options
2  FFT            Signal       Radix-2 Fast Fourier             32,767 random floating point numbers
3  Inversek2j     Robotics     Inverse kinematics for arm       300,000 (x,y) random coordinates
4  Jmeint         3D gaming    Triangle intersection detection  100,000 pairs of 3D triangle coordinates
5  JPEG encoder   Compression  JPEG encoding                    512x512 pixel color image
6  K-means        ML           K-means clustering               262,144 pairs of random (r,g,b) values
7  Sobel          Image        Sobel edge detector              512x512 pixel color image

[Figure: Performance speed-up]
[Figure: Energy saving]
- Workload: AxBench
- Baseline: ARM A9 dual-core without neural approximation
- NA performance is measured on the FPGA implementation
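As a rough reading of the accelerator figures in Table 1, the peak throughput and clock frequency imply the number of operations issued per cycle, and dividing the average power by the peak throughput gives a best-case energy per operation. The second figure mixes an average with a peak, so it is an optimistic bound rather than a measured value.

```latex
\[
\frac{3.2\times10^{9}\ \text{ops/s}}{400\times10^{6}\ \text{cycles/s}} = 8\ \text{ops/cycle},
\qquad
\frac{50\times10^{-3}\ \text{W}}{3.2\times10^{9}\ \text{ops/s}} \approx 15.6\ \text{pJ/op}
\]
```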

11 Conclusion
- The RoCC interface provides an efficient NA2core communication mechanism
- The RoCC instruction set extension provides an effective way of extending neural instructions for NA design
- The one-dimensional systolic array is an energy-efficient solution to low-power neural network inference on low-end devices
ICCAD, 2016, Austin

12 Thanks
Q&A
