Graph Streaming Processor

Graph Streaming Processor: A Next-Generation Computing Architecture

Val G. Cook, Chief Software Architect
Satyaki Koneru, Chief Technology Officer
Ke Yin, Chief Scientist
Dinakar Munagala, Chief Executive Officer

Introduction

ThinCI, Inc. (pronounced "think-eye") is a 5-year-old strategic/venture-backed technology startup.
- Develops silicon for machine learning, computer vision, and other strategic parallel workloads
- Provides innovative software along with a comprehensive SDK
- 69-person team (95% engineering & operations)
- Key IP (patents, trade secrets): Streaming Graph Processor, Graph Computing Compiler

Product status:
- Early Access Program started Q1 2017
- First-edition PCIe-based development boards will ship Q4 2017

Architectural Objective

Exceptional efficiency via the balanced application of multiple parallel execution mechanisms.

Levels of parallelism:
- Task Level Parallelism
- Thread Level Parallelism
- Data Level Parallelism
- Instruction Level Parallelism

Key architectural choices:
- Direct Graph Processing
- Fine-Grained Scheduling
- 2D Block Processing
- Parallel Reduction Instructions
- Hardware Instruction Scheduling

Task Level Parallelism: Direct Graph Processing

Task Graphs: Formalized Task Level Parallelism

- Graphs define only computational semantics
- Nodes reference kernels; kernels are programs
- Nodes bind to buffers; buffers contain structured data
- Data dependencies are explicit

ThinCI hardware processes graphs natively: a graph is an execution primitive, and a program is a proper subset of a graph.

[Diagram: example task graph with nodes A through G.]
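The node/kernel/buffer model above can be sketched in a few lines of Python. This is a toy illustration, not ThinCI's SDK or hardware behavior; `Node`, `run_graph`, and the kernels are invented names. Each node references a kernel (a plain function), binds to named buffers, and runs once its explicit data dependencies are satisfied.

```python
from collections import namedtuple

# A node references a kernel, names its input buffers, and names its output.
Node = namedtuple("Node", ["name", "kernel", "inputs", "output"])

def run_graph(nodes, buffers):
    """Dispatch nodes in dependency order: a node runs once every
    buffer it reads has been produced."""
    done = set(buffers)            # externally supplied input buffers
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(i in done for i in n.inputs)]
        for n in ready:
            buffers[n.output] = n.kernel(*[buffers[i] for i in n.inputs])
            done.add(n.output)
            pending.remove(n)
    return buffers

# A -> B -> D and A -> C -> D: a diamond-shaped dependency graph.
graph = [
    Node("A", lambda x: x + 1,    ["in"],     "a"),
    Node("B", lambda a: a * 2,    ["a"],      "b"),
    Node("C", lambda a: a * 3,    ["a"],      "c"),
    Node("D", lambda b, c: b + c, ["b", "c"], "d"),
]
out = run_graph(graph, {"in": 1})
print(out["d"])  # (1+1)*2 + (1+1)*3 = 10
```

Note that B and C become ready in the same pass once A completes; on hardware that dispatches by dependency, they could run concurrently.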

Graph-Based Frameworks

Graph processing (data-flow graphs) is a very old concept; see, for example, Alan Turing's Graph Turing Machine. It is gaining value as a computation model, particularly in the field of machine learning, where graph-based frameworks have proliferated in recent years.

[Timeline, 2011-2017: TensorFlow, Lasagne, maxdnn, Chainer, MXNet, CNTK, Neural Designer, Leaf, cuDNN, Keras, DSSTNE, Caffe, MatConvNet, Apache SINGA, Kaldi, Torch, BIDMach, Deeplearning4j, Caffe2]

Streaming vs. Sequential Processing

Sequential node processing (commonly used by DSPs and GPUs):
- Nodes A, B, C, D execute one at a time, each to completion
- Intermediate buffers are written back and forth to memory
- Intermediate buffers are generally non-cacheable globally
- DRAM accesses are costly: excessive power and excessive latency

Graph Streaming Processor (streaming execution):
- Nodes A, B, C, D execute concurrently as data streams through the graph
- Intermediate buffers are small (~1% of the original size)
- Data is more easily cached
- Significantly reduced memory bandwidth yields lower power consumption and higher performance
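The bandwidth argument can be illustrated with a toy three-stage pipeline (hypothetical code, not the hardware's behavior): sequential execution materializes full-size intermediate buffers between nodes, while streaming pushes small tiles through all nodes so the live intermediate stays tile-sized, roughly the 1% figure the slide cites.

```python
N, TILE = 10_000, 100          # tile is ~1% of the full buffer

# Three toy "nodes", each a per-element kernel.
def a(x): return x + 1
def b(x): return x * 2
def c(x): return x - 3

def sequential(data):
    # Each node finishes before the next starts: full-size intermediates
    # that, on real hardware, would bounce through DRAM.
    t1 = [a(v) for v in data]          # full-size intermediate
    t2 = [b(v) for v in t1]            # full-size intermediate
    return [c(v) for v in t2]

def streaming(data):
    # Nodes overlap: each tile flows through the whole pipeline while it
    # is still small enough to stay in cache.
    out = []
    for i in range(0, len(data), TILE):
        tile = data[i:i + TILE]        # only a tile-sized intermediate lives
        out.extend(c(b(a(v))) for v in tile)
    return out

data = list(range(N))
assert sequential(data) == streaming(data)   # same result, far less traffic
```

Both versions compute identical results; the difference is purely in how much intermediate state is live at once.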

Thread Level Parallelism: Fine-Grained Scheduling

Fine-Grained Scheduling

The thread scheduler:
- Is aware of data dependencies
- Dispatches threads when resources are available and dependencies are satisfied
- Maintains ordered behavior as needed
- Prevents deadlock
- Supports complex scenarios, including aggregating and fracturing threads

[Block diagram: command ring units, a transfer (DMA) unit and controller, the thread scheduler, and an instruction unit feeding an array of processor quads (four processors per quad, each quad with a special-op unit, read-only cache, SPU/MPU pipelines, and an arbiter), backed by L2 and L3 caches on an AXI bus matrix, read-only and read-write cache arrays, and an input/output unit.]

Thread Count per Node: Graph Execution Trace

Threads can execute from all nodes of the graph simultaneously, giving true hardware-managed streaming behavior.

[Figure: graph execution trace, thread count per node plotted over each node's life-span.]

Data Level Parallelism: 2D Block Processing and Parallel Reduction Instructions

2D Block Processing and Reduction Instructions

- Persistent data structures are accessed in 2D blocks
- Arbitrary alignment support
- Provides for in-place compute
- Parallel reduction instructions support efficient processing: reduced power, greater throughput, and reduced bandwidth
- Scaling across data types is better than the 2x scaling of traditional vector pipelines

[Figure: source blocks of elements reduced into a destination block.]
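A parallel reduction of the kind described above can be modeled as a tree of pairwise combines. This is an illustrative Python sketch assuming a simple sum; a hardware reduction instruction would perform each level's combines in a single step rather than a Python loop.

```python
def block_reduce(block):
    """Reduce a 2D block to one value by pairwise (tree) combination:
    log2(n) combining levels instead of n-1 serial additions."""
    vals = [v for row in block for v in row]   # flatten the 2D block
    while len(vals) > 1:
        if len(vals) % 2:                      # pad odd lengths
            vals.append(0)
        # One "level": all pairs combined, in parallel on real hardware.
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

block = [[0, 1,  2,  3],
         [4, 5,  6,  7],
         [8, 9, 10, 11]]
print(block_reduce(block))  # 66, the sum of all twelve elements
```

The tree shape is why reduction instructions scale better than a fixed-width vector pipeline: doubling the block size adds one combining level rather than doubling the serial work.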

Instruction Level Parallelism: Hardware Instruction Scheduling

Hardware Instruction Scheduling

Processors are scheduled in groups of four. The hardware instruction picker:
- Selects from 100s of threads
- Targets 10s of independent pipelines: vector, scalar, custom arithmetic, flow control, memory ops, move, and thread-spawn management

[Diagram: instruction scheduler with instruction decode and register files feeding the pipelines listed above.]

Programming Model

Programming Model

Fully programmable:
- No a priori constraints regarding data types, precision, or graph topologies
- Fully pipelined concurrent graph execution
- Comprehensive SDK with support for all abstraction levels, from assembly to frameworks

Software stack:
- Machine learning frameworks: TensorFlow, Caffe, Torch
- OpenVX + OpenCL with C/C++ language kernels (seeking Khronos conformance post-silicon)
- Rich graph creation and execution semantics
- Extended with fully accelerated custom kernel support

Results

Arithmetic pipeline utilization: 95% for CNNs (VGG16, 8-bit).

Physical characteristics:
- TSMC 28nm HPC+
- Standalone SoC mode or PCIe accelerator mode
- SoC power estimate: 2.5W