SDA: Software-Defined Accelerator for Large-Scale DNN Systems

SDA: Software-Defined Accelerator for Large-Scale DNN Systems
Jian Ouyang (1), Shiding Lin (1), Wei Qi (1), Yong Wang (1), Bo Yu (1), Song Jiang (2)
(1) Baidu, Inc.  (2) Wayne State University

Introduction of Baidu
- A dominant Internet company in China: ~US$80 billion market value, 600M+ users
- Expanding into the Internet markets of Brazil, Southeast Asia, and the Middle East
- Main services
  - PC and mobile search: 70%+ market share in China
  - LBS (location-based services): 50%+ market share
  - Online travel: QUNR (subsidiary company), ~US$3 billion market value
  - Video: the top mobile video service in China
  - Personal cloud storage: 100M+ users, the largest in China
  - App store, image, and speech services
- Baidu is a technology-driven company
  - Tens of data centers, hundreds of thousands of servers
  - Over one thousand petabytes of data (logs, UGC, web pages, etc.)

DNN in Baidu
- DNN has been deployed to accelerate many critical services at Baidu
  - Speech recognition: reduces the error rate by 25%+ compared with the GMM (Gaussian Mixture Model) method
  - Image: image search, OCR, face recognition
  - Ads
  - Web page search
  - LBS / NLP (Natural Language Processing)
- What is DNN (deep neural network, or deep learning)?
  - DNN is a multi-layer neural network; it commonly uses unsupervised learning and does not rely on hand-engineered features
  - Regression and classification: pattern recognition, function fitting, and more
  - Often better than shallow learning (SVM (Support Vector Machine), Logistic Regression, etc.): learns from unlabeled features and has stronger representation ability
  - Often demands more compute power: many more parameters to train, and big training data is needed to achieve better results

Outline
- Overview of the DNN algorithm and system
- Challenges in building a large-scale DNN system
- Our solution: SDA (Software-Defined Accelerator)
  - Design goals
  - Design and implementation
  - Performance evaluation
- Conclusions

Overview of the DNN algorithm
- Single neuron structure (Fig. 1); multiple neurons and layers (Fig. 2)
- Back-propagation training. For each input vector:
  // forward, from input layer to output layer
  O_i = f(W_i * O_{i-1})
  // backward, from output layer to input layer
  delta_i = O_i * (1 - O_i) * (W_{i+1}^T * delta_{i+1})
  // weight update, from input layer to output layer
  W_i = W_i + n * delta_i * O_{i-1}^T
- Almost all of the work is matrix multiplications and additions
- Training complexity: O(3 * E * S * L * N^3), where E is the number of epochs, S the size of the data set, L the number of layers, and N the size of the weight matrix
- Online prediction: only the forward stage
  - Complexity: O(V * L * N^2), where V is the number of input vectors, L the number of layers, and N the size of the weight matrix
  - For a typical configuration (N = 2048, L = 8, V = 16), a batch of 16 input vectors costs ~1 GFLOP and takes about 33 ms on a recent x86 CPU core
(A minimal sketch of the prediction pass follows below.)
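The following is a minimal NumPy sketch (not Baidu's code) of the online-prediction pass described above: L fully connected layers of size N applied to a batch of V input vectors, with a sigmoid activation. It illustrates why the cost is roughly 2 * V * L * N^2 floating-point operations per batch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(weights, batch):
    """weights: list of L (N x N) matrices; batch: (V x N) array of input vectors."""
    out = batch
    for W in weights:
        out = sigmoid(out @ W.T)   # one layer: O_i = f(W_i * O_{i-1})
    return out

# "Typical" sizes from the slide: N = 2048, L = 8, V = 16.
N, L, V = 2048, 8, 16
weights = [np.random.randn(N, N).astype(np.float32) * 0.01 for _ in range(L)]
batch = np.random.randn(V, N).astype(np.float32)
result = predict(weights, batch)   # ~2 * V * L * N^2 ≈ 1.07e9 FLOPs per batch
```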

Overview of the DNN system
[Figure: training data and parameters flowing through the off-line training system into the models used by the on-line prediction system]
- Large-scale off-line training system
  - Scale: 10~100 TB of training data, 10M~100B parameters
  - Workload type: compute intensive and communication intensive; difficult to scale out
  - Cluster type: medium size (~100 servers), GPU and InfiniBand
- On-line prediction system
  - Scale: 10M~1B users, 100M~10B requests/day
  - Workload type: compute intensive, less communication; easy to scale out
  - Cluster type: large scale (1K~10K servers), CPU (AVX/SSE) and 10GbE

Challenges in existing large-scale DNN systems
- DNN training system
  - Scale: ~100 servers, due to algorithm and hardware limitations
  - Speed: training times from days to months
  - Cost: many machines demanded by a large number of applications
- DNN prediction
  - Cost: 1K~10K servers for a single service
  - Speed: latency of seconds for large models
- Cost and speed are critical for both training and prediction
- GPU: high cost; high power and space consumption; higher demands on data center cooling, power supply, and space utilization
- CPU: medium cost and power consumption, but low speed
- Are there any other solutions?

Challenges of a large DNN system: other solutions
- ASIC
  - High NRE cost
  - Long design period, not suitable for the fast iteration of Internet companies
- FPGA
  - Low power: less than 40 W
  - Low cost: hundreds of dollars
  - Hardware reconfigurable
- Is the FPGA suitable for a DNN system?

Challenges of a large DNN system: the FPGA's challenges
- Development time: Internet applications need very fast iteration
- Floating-point ALUs: training and some predictions require floating point
- Memory bandwidth: lower than GPU and CPU
Our approach: SDA, the Software-Defined Accelerator

SDA design goals
- Support major workloads: floating point, for both training and prediction
- Acceptable performance: 400 GFLOPS, higher than a 16-core x86 server
- Low cost: mid-range FPGA
- No changes required to the existing data center environment
  - Low power: less than 30 W total
  - Half-height, half-length, single-slot form factor
- Support fast iteration: software-defined

Design and implementation
- Hardware board design
- Architecture
- Hardware/software interface

Design: hardware board
- Specifications
  - Xilinx K7 480t FPGA
  - 2 DDR3 channels, 4 GB
  - PCIe 2.0 x8
- Size
  - Half-height, half-length, single-slot thickness
  - Can be plugged into any type of 1U or 2U server
- Power
  - Supplied by the PCIe slot
  - Peak board power below 30 W

Design: architecture
- Major functions
  - Floating-point matrix multiplication
  - Floating-point activation functions
- Challenges of matrix multiplication
  - Number of floating-point MUL and ADD units
  - Data locality
  - Scalability across FPGAs of different sizes
- Challenges of activation functions
  - Tens of different activation functions
  - Must be reconfigurable online within milliseconds

Design: architecture
- Customized floating-point MUL and ADD units: about 50% resource reduction compared with standard IP cores
- Leverage BRAM for data locality: buffers two 512x512 matrix tiles
- Scalable ALU array: each ALU works on a 32x32 tile
(An illustrative sketch of this blocking scheme follows below.)
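A host-side Python sketch of the blocking scheme described above, for illustration only (the actual design is RTL on the FPGA): 512x512 tiles of the operands are buffered, and the output tile is produced in 32x32 sub-tiles.

```python
import numpy as np

TILE = 512   # tile size buffered in on-chip BRAM
SUB = 32     # sub-tile handled by one ALU

def tiled_matmul(A, B):
    """Blocked matrix multiply; assumes square n x n inputs with n % TILE == 0."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, TILE):
        for j in range(0, n, TILE):
            for k in range(0, n, TILE):
                a = A[i:i+TILE, k:k+TILE]   # operand tile kept on chip
                b = B[k:k+TILE, j:j+TILE]
                # the ALU array covers the output tile in 32x32 pieces
                for ii in range(0, TILE, SUB):
                    for jj in range(0, TILE, SUB):
                        C[i+ii:i+ii+SUB, j+jj:j+jj+SUB] += \
                            a[ii:ii+SUB, :] @ b[:, jj:jj+SUB]
    return C

A = np.random.rand(1024, 1024)
B = np.random.rand(1024, 1024)
assert np.allclose(tiled_matmul(A, B), A @ B)
```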

Design: architecture, software-defined activation functions
- Supports tens of activation functions: sigmoid, tanh, softsign, ...
- Implemented with a lookup table plus linear fitting
- The table is reconfigured through a user-space API
- Evaluation: 1e-5 ~ 1e-6 precision; can be reconfigured within 10 us
(A minimal sketch of the lookup-table approach follows below.)
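Below is a minimal sketch of a lookup table with piecewise-linear interpolation, in the spirit of the mechanism described above; the table size, input range, and function names are assumptions, not details from the talk. Swapping the table contents is what "reconfiguring" the activation function amounts to.

```python
import numpy as np

ENTRIES = 2048            # table size (assumption)
X_MIN, X_MAX = -8.0, 8.0  # input range covered by the table (assumption)

def build_table(fn):
    """Sample fn on a uniform grid and store, per segment, its value and slope."""
    xs = np.linspace(X_MIN, X_MAX, ENTRIES + 1)
    ys = fn(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    return ys[:-1], slopes

def lut_activate(x, table):
    """Piecewise-linear approximation of the tabulated function."""
    values, slopes = table
    step = (X_MAX - X_MIN) / ENTRIES
    x = np.clip(x, X_MIN, X_MAX - 1e-9)
    idx = ((x - X_MIN) / step).astype(int)
    frac = x - (X_MIN + idx * step)
    return values[idx] + slopes[idx] * frac

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
table = build_table(sigmoid)   # "reconfigure" = rebuild and reload this table

x = np.random.uniform(-6, 6, 10000)
err = np.max(np.abs(lut_activate(x, table) - sigmoid(x)))
print(f"max abs error: {err:.1e}")   # on the order of 1e-6 for this table size
```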

Design: software/hardware interface
- Computation APIs, similar to cuBLAS
  - Memory copy: host to device and device to host
  - Matrix multiplication
  - Matrix multiplication fused with an activation function
- Reconfiguration API
  - Reconfigure the activation functions
(A hypothetical host-side sketch of this interface follows below.)
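The slide only lists the categories of calls; the following Python mock illustrates what such a cuBLAS-like interface could look like. All names and signatures here are invented for illustration and are not the actual SDA API.

```python
import numpy as np

class SDAMock:
    """In-memory stand-in for the accelerator's host-side library."""
    def __init__(self):
        self._act = lambda x: 1.0 / (1.0 + np.exp(-x))   # default activation: sigmoid

    # Memory-copy calls (no-op copies here; on real hardware they cross PCIe).
    def memcpy_h2d(self, host):
        return np.array(host, dtype=np.float32)
    def memcpy_d2h(self, dev):
        return np.array(dev)

    # Matrix multiply, and matrix multiply fused with the activation function.
    def gemm(self, A, B):
        return A @ B
    def gemm_act(self, A, B):
        return self._act(A @ B)

    # Reconfiguration call: swap the activation function at run time.
    def reconfigure_activation(self, fn):
        self._act = fn

dev = SDAMock()
x = dev.memcpy_h2d(np.random.randn(8, 512))
w = dev.memcpy_h2d(np.random.randn(512, 512) * 0.01)
y = dev.memcpy_d2h(dev.gemm_act(x, w))   # one prediction layer
dev.reconfigure_activation(np.tanh)      # switch the activation to tanh
```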

Evaluation setup
- Host
  - 2x Intel E5620v2, 2.4 GHz, 16 cores
  - 128 GB memory
  - Linux kernel 2.6.32, MKL 11.0
- SDA
  - Xilinx K7-480t
  - 2x 2 GB DDR3 on-board memory with ECC, 72-bit, 1066 MHz
  - PCIe 2.0 x8
- GPU
  - One server-class GPU with two independent devices; the following evaluation uses one device

Evaluation: micro benchmark
- SDA implementation: 300 MHz, 640 adders and 640 multipliers, i.e., a peak of (640 + 640) FLOPs/cycle x 300 MHz ≈ 384 GFLOPS
- Resource utilization: LUT 70%, DSP 100%, REG 37%, BRAM 75%
- Matrix multiplication: M x N x K = 2048 x 2048 x 2048
[Figure: achieved GFLOPS and power for the CPU server, FPGA, and GPU]
- Energy efficiency (GFLOPS/W): CPU server 4, FPGA 12.6, GPU 8.5

Evaluation: micro benchmark
- M = N = K matrix multiplication; the CPU uses one core, the GPU one device
- M = 512, 1024, and 2048
[Figure: GFLOPS achieved by CPU, GPU, and FPGA for matrix sizes 512, 1024, and 2048]

Evaluation: on-line prediction workload
- The input batch size is small
  - Batch size: the number of input vectors; typically 8 or 16
  - A typical model has 8 layers
  - Hidden layer sizes range from several hundred to several thousand, depending on the application, practical tuning, and training time
- Workload 1: batch size = 8, 8 layers, hidden layer size = 512; 1~64 threads; measure requests/s
- Workload 2: batch size = 8, 8 layers, hidden layer size = 2048; 1~32 threads; measure requests/s
(A rough sketch of such a measurement harness follows below.)
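As a rough illustration of the measurement above, the sketch below runs T client threads, each issuing batch-8 requests against an 8-layer model, and reports aggregate requests per second. The harness structure, durations, and the plain-NumPy back end are assumptions; the talk compares CPU, GPU, and FPGA back ends.

```python
import threading, time
import numpy as np

def make_model(hidden, layers=8):
    return [np.random.randn(hidden, hidden).astype(np.float32) * 0.01
            for _ in range(layers)]

def predict(weights, batch):
    out = batch
    for W in weights:
        out = 1.0 / (1.0 + np.exp(-(out @ W)))
    return out

def run_workload(hidden, threads, duration=3.0, batch_size=8):
    """Return aggregate requests/s for `threads` concurrent clients."""
    weights = make_model(hidden)
    done = [0] * threads

    def client(tid):
        batch = np.random.randn(batch_size, hidden).astype(np.float32)
        deadline = time.time() + duration
        while time.time() < deadline:
            predict(weights, batch)
            done[tid] += 1

    workers = [threading.Thread(target=client, args=(i,)) for i in range(threads)]
    for w in workers: w.start()
    for w in workers: w.join()
    return sum(done) / duration

# Workload 1 uses hidden = 512; Workload 2 uses hidden = 2048.
for t in (1, 4, 16):
    print(f"{t:2d} threads, hidden=512: {run_workload(512, t):8.1f} req/s")
```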

Evaluation: on-line prediction workload (batch size = 8, 8 layers)
- Workload 1 (weight matrix size = 512): the FPGA delivers 4.1x the throughput of the GPU and 3x that of the CPU
- Workload 2 (weight matrix size = 2048): the FPGA delivers 2.5x the throughput of the GPU and 3.5x that of the CPU
- Conclusions
  - The FPGA can merge small requests into larger ones to improve performance
  - The FPGA's throughput (requests/s) scales better with the number of threads
[Fig. a: requests/s vs. thread count (1~64) for CPU, GPU, and FPGA on Workload 1; Fig. b: requests/s vs. thread count (1~32) on Workload 2]

The features of SDA
- Software-defined
  - Activation functions are reconfigured through a user-space API
  - Supports the very fast iteration of Internet services
- Combines small requests into larger ones
  - Improves QPS when the batch size is small, and the batch size of real workloads is small
- cuBLAS-compatible APIs: easy to use

Conclusions
- SDA: Software-Defined Accelerator
  - Activation functions can be reconfigured through user-space APIs
  - Delivers higher performance on the DNN prediction system than GPU and CPU servers
  - Uses a mid-range FPGA to achieve about 380 GFLOPS
  - Draws 10~20 W in real production systems
  - Can be deployed in any type of server
- Demonstrates that the FPGA is a good choice for large-scale DNN systems