Software Defined Hardware


Software Defined Hardware
For data-intensive computation
Wade Shen, DARPA I2O
September 19, 2017

Goal Statement
Build runtime-reconfigurable hardware and software that enable near-ASIC performance (within 10x) without sacrificing programmability for data-intensive algorithms.

Problem
Processor design trades:
- Math/logic resources
- Memory (cache vs. register vs. shared)
- Address computation
- Data access and flow
[Figure: CPU, GPU, and specialized processors divide their resources differently among memory (shared, registers, cache), bus/logic, and math units.]
The problem: the optimal hardware configuration differs across algorithms.
[Chart: resource demand (0-100%) of SPMM vs. DGEMM across logic/bus, FPU, and memory. SPMM = Sparse Matrix Multiply; DGEMM = Dense Matrix Multiply.]
No one hardware design solves all problems efficiently.
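To make the sparse-vs.-dense contrast concrete, here is a minimal Python/NumPy sketch (matching the program's NumPy/Python baseline) of the two inner loops: dense multiply streams over contiguous rows and columns and is limited by math units and bandwidth, while sparse multiply in CSR form spends its time on address computation and data-dependent gathers. The loop structure is illustrative, not taken from the slides.

```python
import numpy as np

def dgemm(A, B):
    """Dense matrix multiply: regular, streaming access patterns.
    The bottleneck is math/logic resources and memory bandwidth."""
    n, m = A.shape[0], B.shape[1]
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            # Contiguous row/column traversal: predictable addresses,
            # so caches and wide math units are the right resources.
            C[i, j] = A[i, :] @ B[:, j]
    return C

def spmv(indptr, indices, data, x):
    """Sparse matrix-vector multiply (CSR): dominated by address
    computation and irregular gathers rather than by FLOPs."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for p in range(indptr[row], indptr[row + 1]):
            # x[indices[p]] is a data-dependent indirect load with poor
            # locality; a cache-heavy CPU leaves its FPUs idle here.
            y[row] += data[p] * x[indices[p]]
    return y
```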

SDH: Runtime optimization of software and hardware
For data-intensive computation
[Chart: energy efficiency (MOP/mW) vs. programmability. CPUs are the most programmable but least efficient; CPUs+GPUs, GP DSPs, and GPUs trade programmability for efficiency; specialized designs (FPGA, HIVE, Google's TPU) are the most efficient but not programmable. SDH targets specialized-class efficiency at CPU-class programmability.]
Today: hardware design specialization, one chip per algorithm
- Chip design is expensive
- Not reprogrammable
- Can't take advantage of data-dependent optimizations
Tomorrow: runtime optimization of hardware and software, one chip for many applications
- One-time design cost
- Reprogrammable via high-level languages
- Data-dependent optimization (10-100x)

Software-defined hardware
Dynamic HW/SW compilers for high-level languages (TA2):
1. Generate an optimal configuration based on static analysis of the code
2. Generate optimal code
3. Re-optimize machine code and processor configuration based on runtime data
[Figure: over time, the high-level program runs as a sequence of code versions (Code 1 ... Code N), each paired with a processor configuration (Config 1 ... Config N).]
Properties of reconfigurable processors (TA1):
1. Reconfiguration times: 300-1,000 ns
2. Re-allocatable compute resources, i.e. units for address computation or math
3. Re-allocatable memory resources, i.e. cache/register configuration matched to the data
4. Malleable external memory access, i.e. a reconfigurable memory controller
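As a rough illustration of the compile/run/re-optimize loop above, the sketch below pairs each code version with a processor configuration and reshapes the configuration from runtime traces. Every function body is an illustrative stub, not an SDH API; the configuration fields are assumptions.

```python
# Hedged sketch of the TA2 dynamic compilation loop driving a TA1
# reconfigurable processor. Real reconfiguration targets 300-1,000 ns.

def static_analysis(program):
    # Stub: pick an initial configuration from the code alone.
    return {"addr_units": 4, "math_units": 4, "scratchpad": False}

def codegen(program, config):
    # Stub: specialize machine code to the current configuration.
    return (program, tuple(sorted(config.items())))

def trace_run(code, batch):
    # Stub: execute and return a runtime trace; here, a sparsity hint.
    nonzeros = sum(1 for x in batch if x != 0)
    return {"sparse": nonzeros < len(batch) // 2}

def reoptimize(config, trace):
    # Data-dependent re-optimization: sparse inputs favor address
    # computation and scratchpads; dense inputs favor math units.
    cfg = dict(config)
    cfg["addr_units"], cfg["math_units"] = (8, 2) if trace["sparse"] else (2, 8)
    cfg["scratchpad"] = trace["sparse"]
    return cfg

def run_sdh(program, batches):
    config = static_analysis(program)           # 1. initial configuration
    code = codegen(program, config)             # 2. matching machine code
    for batch in batches:
        trace = trace_run(code, batch)          # run and collect a trace
        new_config = reoptimize(config, trace)  # 3. re-optimize HW and SW
        if new_config != config:
            config = new_config                 # "reconfigure" (stubbed)
            code = codegen(program, config)
```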

TA1: Reconfigurable processors
Graphicionado: graph search engine
- Async memory controller
- Scratchpad for active search nodes
- Address calculators for sparse vector lookup
- Performance: 157M edges/s/W graph search (BFS)
Eyeriss: image neural net engine
- Image convolution operators
- Block read controller
- Image cache
- Performance: 250 images/s/W (AlexNet)
SDH (Plasticine, Stanford Seedling): reconfigurable interconnect with programmable memory controllers
- Graph search: 102M edges/s/W
- Image recognition: 130 images/s/W
DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.

TA2: Compilers to build hardware and software
[Figure: a conventional compiler maps a high-level program to code for a fixed processor and feeds runtime traces back into recompilation; the SDH compiler additionally predicts and emits a processor configuration for the reconfigurable processor from those traces.]
Conventional compilers generate optimal code via static analysis plus tracing methods:
- Assume a static processor configuration; compile code, run, trace, recompile
SDH compilers don't assume a static processor configuration:
- Generate an optimal configuration and code given the program plus its data
- Problem: the resource and architecture optimization space is large
Solution:
1. Choose an initial processor configuration, compile the code, run and trace, then
2. Predict the best configuration via reinforcement learning / stochastic optimization
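The "predict the best configuration" step could be instantiated many ways; the sketch below uses simple stochastic hill-climbing over a toy configuration space, one plausible instance of the stochastic optimization the slide names. The space, field names, and the stand-in cost model are all assumptions; a real flow would compile, run, and trace each candidate.

```python
import random

SPACE = {
    "scratchpad_kb": [64, 128, 256],
    "cache_kb":      [0, 128, 256],
    "addr_units":    [2, 4, 8, 16],
    "mac_units":     [8, 16, 32, 64],
}

def compile_and_trace(program, cfg, data):
    """Stand-in cost model: the real flow compiles `program` for `cfg`,
    runs it on `data`, and scores the runtime trace (e.g. energy-delay)."""
    # Toy proxy so the sketch runs: favor MACs, penalize memory area.
    return 1e6 / cfg["mac_units"] + 10 * (cfg["scratchpad_kb"] + cfg["cache_kb"])

def mutate(config):
    """Perturb one randomly chosen configuration field."""
    cfg = dict(config)
    key = random.choice(list(SPACE))
    cfg[key] = random.choice(SPACE[key])
    return cfg

def search(program, data, steps=50):
    """Stochastic hill-climbing over processor configurations."""
    best = {k: random.choice(v) for k, v in SPACE.items()}
    best_cost = compile_and_trace(program, best, data)
    for _ in range(steps):
        cand = mutate(best)
        cost = compile_and_trace(program, cand, data)
        if cost < best_cost:            # keep improvements only
            best, best_cost = cand, cost
    return best
```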

How will TA2 work?
Empirical kernel mining from the DARPA D3M corpus.
[Figure: IR (intermediate representation) component optimization: a sum-of-products expression over x1 and x2 is rewritten into a chain of MAC (multiply-accumulate) operations, MAC 1 and MAC 2.]
High-level data programs, e.g. f(x) = warp(fft(window(x))) and f(A, x) = Ax + b; low-level implementations are not exposed.
- Compile time: pre-compilation to IR (data-independent), e.g. Ax, fft(y)
- Run time: data-dependent optimization, e.g. reordering MAC 1 and MAC 2
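A minimal sketch of the MAC-fusion rewrite the figure describes, on an assumed tuple-based IR (not the actual SDH IR): each multiply feeding an add collapses into a multiply-accumulate node, so a sum of two products becomes a chain of two MACs.

```python
def fuse_macs(node):
    """Rewrite ('add', ('mul', a, b), c) into ('mac', a, b, c), bottom-up.
    ('mac', a, b, c) means a*b + c. Handles binary adds only."""
    if not isinstance(node, tuple):
        return node                          # leaf: variable or constant
    op, *args = node
    args = [fuse_macs(a) for a in args]      # rewrite children first
    if op == "add" and len(args) == 2:
        for i, a in enumerate(args):
            if isinstance(a, tuple) and a[0] == "mul":
                acc = args[1 - i]
                if isinstance(acc, tuple) and acc[0] == "mul":
                    # Other operand is also a product: turn it into a MAC
                    # with a zero accumulator, yielding a MAC chain.
                    acc = ("mac", acc[1], acc[2], 0)
                return ("mac", a[1], a[2], acc)
    return (op, *args)

# A sum of two products becomes MAC 1 feeding MAC 2:
expr = ("add", ("mul", "x1", "a"), ("mul", "x2", "b"))
print(fuse_macs(expr))  # ('mac', 'x1', 'a', ('mac', 'x2', 'b', 0))
```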

Program evaluation and goals
The USG team will create a benchmark suite of machine learning, optimization, graph, and numeric applications:
- 500+ programs from the D3M program
- Implementations for GPU and CPU
- A subset of 100 optimized for ASIC (FPGA proxy)
Metrics:
- Speedup/power relative to ASICs and general-purpose processors
- Programmability: time to code a solution in SDH languages vs. NumPy/Python
Target outcomes:
           vs. CPU     vs. ASIC    vs. ASIC (sparse math, graphs)  Programmability
Phase 1    100-300x    within 10x  2x                              within 3x
Phase 2    500-1000x   within 5x   8-10x                           ~1x

www.darpa.mil

Backup

TA1: Example architecture (U. Michigan Seedling)
Dramatic performance and energy opportunity by tailoring architectures to applications, e.g. sparse and dense matrix multiplication and graph algorithms:
- Cache hierarchy, memory bandwidth, SIMD vs. MIMD, dedicated cores
- Global cache vs. local scratchpad
- Merging crossbar vs. data broadcasting
- All cores for data throughput vs. fewer cores for memory throughput
Sparse computation: 1. outer product generation, 2. merge outer products
Dense computation: 1. inner product on individual tiles, 2. merge tiles
- vs. CPU: 20-100x performance gain (data reuse, reduced memory bandwidth)
- vs. GPU: 10-50x performance gain (data movement and placement, async execution)
- vs. ASIC: within 2-3x of its performance, but with full flexibility/programmability
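A hedged NumPy sketch of the two schedules named on this slide (dense shapes and the tile size are assumed for illustration): sparse computation as outer products that are then merged, dense computation as inner products on tiles that are then merged.

```python
import numpy as np

def sparse_outer_product(A, B):
    """Sparse-friendly schedule: 1. generate one rank-1 outer product per
    shared index k, 2. merge the partial products (the role played by the
    merging crossbar in hardware)."""
    n, K = A.shape
    C = np.zeros((n, B.shape[1]))
    for k in range(K):
        C += np.outer(A[:, k], B[k, :])   # merge step folds partials into C
    return C

def dense_tiled(A, B, T=4):
    """Dense-friendly schedule: 1. inner products on individual tiles
    (maximizing data reuse in local memory), 2. merge tiles into C."""
    n, K = A.shape
    m = B.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, T):
        for j in range(0, m, T):
            for k in range(0, K, T):
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C
```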

TA1: Fast reconfigurable processors
ASICs:
- Graphicionado: ASIC for graph search. Element-based memory access; specialized indirect address calculator for sparse vectors; specialized on-chip scratchpad. 157K edges/s/mW graph search (BFS).
- Eyeriss: specialized DNN accelerator. 168 specialized convolution units; specialized implementation of the neural nonlinearity (ReLU); large block memory access only. 250 images/s/W on AlexNet.
SDH:
- Sparse: BFS on Twitter, 102K edges/s/mW
- Dense: AlexNet (full application), 130 images/s/W
SDH opportunities (vs. CPU):
Attribute                                                           Dense advantage       Sparse advantage
64 memory control units, 64 data pattern control units              30x perf, 52x perf/W  8.2x perf, 55.2x perf/W
Configurable off-chip access for dense and sparse (scatter/gather)  1x                    18x
Configurable on-chip memory for high BW and coarse-grain pipes      2.1x                  18.4x
Compute: pipelined SIMD / variable precision / shift network        12.6x / 25.2x / 7x    NA
Key challenges:
- Data flow: configurable memory control units, data patterns, data storage
- Compute flexibility: compute granularity, modular functionality, high-BW malleable interconnect