Graph Streaming Processor

1 Graph Streaming Processor: A Next-Generation Computing Architecture
Val G. Cook, Chief Software Architect
Satyaki Koneru, Chief Technology Officer
Ke Yin, Chief Scientist
Dinakar Munagala, Chief Executive Officer

2 Introduction
THINCI, Inc. ("think-eye") is a 5-year-old strategic/venture-backed technology startup
Develops silicon for machine learning, computer vision and other strategic parallel workloads
Provides innovative software along with a comprehensive SDK
69-person team (95% engineering & operations)
Key IP (patents, trade secrets): Streaming Graph Processor, Graph Computing Compiler
Product Status: Early Access Program started Q; first-edition PCIe-based development boards will ship Q4 2017

3 Architectural Objective
Exceptional efficiency via balanced application of multiple parallel execution mechanisms
Levels of Parallelism: Task Level Parallelism, Thread Level Parallelism, Data Level Parallelism, Instruction Level Parallelism
Key Architectural Choices: Direct Graph Processing, Fine-Grained Scheduling, 2D Block Processing, Parallel Reduction Instructions, Hardware Instruction Scheduling

4 Task Level Parallelism: Direct Graph Processing

5 Task Graphs
Formalized Task Level Parallelism
Graphs define only computational semantics
Nodes reference kernels; kernels are programs
Nodes bind to buffers; buffers contain structured data
Data dependencies are explicit
ThinCI Hardware Processes Graphs Natively
A graph is an execution primitive
A program is a proper subset of a graph
[Figure: example task graph with nodes A through G]
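The task-graph model on this slide can be illustrated with a small software sketch. It is not ThinCI's API or hardware behavior: the `Node` and `TaskGraph` classes, the buffer names and the four-node example are all hypothetical, chosen only to show nodes referencing kernels, binding to buffers, and ordering themselves through explicit data dependencies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """A graph node: references a kernel and binds to named buffers."""
    name: str
    kernel: Callable[..., object]   # the program this node runs
    inputs: List[str]               # buffers read (the explicit dependencies)
    output: str                     # buffer written


@dataclass
class TaskGraph:
    """Hypothetical container; the graph only describes what to compute."""
    nodes: List[Node] = field(default_factory=list)

    def add(self, name, kernel, inputs, output):
        self.nodes.append(Node(name, kernel, list(inputs), output))

    def run(self, buffers: Dict[str, object]) -> Dict[str, object]:
        """Run every node once, in an order allowed by the buffer dependencies."""
        buffers = dict(buffers)
        pending = list(self.nodes)
        while pending:
            ready = [n for n in pending if all(b in buffers for b in n.inputs)]
            if not ready:
                raise ValueError("unsatisfiable dependencies (cycle or missing buffer)")
            for node in ready:
                buffers[node.output] = node.kernel(*(buffers[b] for b in node.inputs))
                pending.remove(node)
        return buffers


graph = TaskGraph()
graph.add("A", lambda x: [v + 1 for v in x], ["in"], "a")
graph.add("B", lambda a: [v * 2 for v in a], ["a"], "b")
graph.add("C", lambda a: [v * 3 for v in a], ["a"], "c")
graph.add("D", lambda b, c: [p + q for p, q in zip(b, c)], ["b", "c"], "out")

result = graph.run({"in": [1, 2, 3]})
```

Nodes B and C depend only on A's output, so they become ready in the same pass; that independence is the task-level parallelism the graph makes explicit.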


6 Graph Based Frameworks
Graph Processing or Data Flow Graphs
They are a very old concept, for example Alan Turing's Graph Turing Machine.
They are gaining value as a computation model, particularly in the field of machine learning.
Graph-based machine learning frameworks have proliferated in recent years.
Machine Learning Frameworks: TensorFlow, Lasagne, maxdnn, Chainer, MxNet, CNTK, Neural Designer, leaf, cudnn, Keras, DSSTNE, Caffe, MatConvNet, Kaldi, Torch, BIDMach, deeplearning4j, Apache SINGA, Caffe2 (2017)

7 Streaming vs. Sequential Processing
Sequential Node Processing
Commonly used by DSPs and GPUs
Intermediate buffers are written back and forth to memory
Intermediate buffers are generally non-cacheable globally
DRAM accesses are costly: excessive power, excessive latency
Graph Streaming Processor
Intermediate buffers are small (~1% of the original size)
Data is more easily cached
Benefits of significantly reduced memory bandwidth: lower power consumption, higher performance
[Figure: Sequential Execution runs Node A, Node B, Node C and Node D one after another over time; Streaming Execution runs Nodes A, B, C, D together over a shorter time]
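The contrast between the two execution styles can be sketched in software. This is only an illustration under simplifying assumptions: the stages are elementwise (real kernels such as convolutions would need overlapping tiles), the tile size of 10 is arbitrary, and `run_sequential` and `run_streaming` are hypothetical helper names. Each function reports the largest buffer any stage produced.

```python
from typing import Callable, List, Tuple

Stage = Callable[[List[int]], List[int]]


def run_sequential(stages: List[Stage], data: List[int]) -> Tuple[List[int], int]:
    """Run each node to completion; every stage output is a full-size buffer."""
    peak = 0
    for stage in stages:
        data = stage(data)
        peak = max(peak, len(data))
    return data, peak


def run_streaming(stages: List[Stage], data: List[int], tile: int) -> Tuple[List[int], int]:
    """Push one tile through all nodes before fetching the next tile."""
    out: List[int] = []
    peak = 0
    for start in range(0, len(data), tile):
        chunk = data[start:start + tile]
        for stage in stages:
            chunk = stage(chunk)
            peak = max(peak, len(chunk))
        out.extend(chunk)
    return out, peak


stages = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x - 3 for x in xs],
]
data = list(range(1000))

seq_out, seq_peak = run_sequential(stages, data)
str_out, str_peak = run_streaming(stages, data, tile=10)
```

Both runs produce the same result, but the streaming run never holds more than one tile of stage output at a time. With 1000 elements and a tile of 10 that is 1% of the full buffer, which happens to match the order of magnitude quoted on the slide; the real ratio depends on the workload.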

8 Thread Level Parallelism: Fine-Grained Scheduling

9 Fine-Grained Scheduling
Thread Scheduler
Aware of data dependencies
Dispatches threads when resources are available and dependencies are satisfied
Maintains ordered behavior as needed
Prevents deadlock
Supports Complex Scenarios: aggregates threads, fractures threads
[Figure: block diagram. Labels: DMA Command Ring Unit, Transfer (DMA) Unit, Controller, Execution Command Ring Unit, L2 Cache, AXI Bus Matrix, L3 Cache, Scheduler, Read Write Cache, Instruction Unit, Read Only Cache, Input/Output Unit, Read Only Cache Array, Read Write Cache Array; Quad 0 through Quad N, each with Processors 0 to 3, a Special Op Unit, an Arbiter, and SPU/MPU blocks]
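The dispatch rule on this slide (run a thread only when resources are free and its dependencies are satisfied) can be sketched as a small software model. Everything here is an assumption made for illustration: the `schedule` function, the `slots` resource model, and the thread names are hypothetical, and the real scheduler is a hardware unit whose policy the slides do not detail.

```python
from collections import deque
from typing import Dict, List, Set


def schedule(deps: Dict[str, Set[str]], slots: int) -> List[List[str]]:
    """Return the threads dispatched in each step.

    deps maps a thread name to the threads it must wait for.
    slots is how many threads may run in one step (hypothetical resource model).
    """
    done: Set[str] = set()
    waiting = deque(sorted(deps))      # fixed order keeps the result deterministic
    steps: List[List[str]] = []
    while waiting:
        batch: List[str] = []
        for name in list(waiting):
            if len(batch) == slots:
                break                  # no resources left in this step
            if deps[name] <= done:     # every dependency has completed
                batch.append(name)
                waiting.remove(name)
        if not batch:
            raise RuntimeError("deadlock: no thread can be dispatched")
        done.update(batch)             # threads in a step complete together
        steps.append(batch)
    return steps


deps = {
    "a0": set(),
    "a1": set(),
    "b0": {"a0"},
    "b1": {"a1"},
    "c0": {"b0", "b1"},
}
steps = schedule(deps, slots=2)
serial = schedule(deps, slots=1)
```

With two slots the five threads finish in three steps; with one slot they take five. The dependency check is made against threads completed in earlier steps only, so a thread never runs alongside something it depends on.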

10 Graph Execution Trace
Threads can execute from all nodes of the graph simultaneously
True hardware-managed streaming behavior
[Figure: Graph Execution Trace; labels: Count/Node, life-span, time]

11 Data Level Parallelism: 2D Block Processing, Parallel Reduction Instructions

12 2D Block Processing/Reduction Instructions
Persistent data structures are accessed in blocks
Arbitrary alignment support
Provides for in-place compute
Parallel reduction instructions support efficient processing
Reduced power
Greater throughput
Reduced bandwidth
Experience better scaling across data types vs. the 2x scaling of traditional vector pipelines
[Figure: two src blocks and one dst block]
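As a rough software analogy for block access plus reduction, the sketch below reads a 2D block at an arbitrary position and collapses an elementwise product to a single value with a pairwise (tree-shaped) sum. The function names, the 8x8 test image and the 3x3 weights are hypothetical; the actual instruction set is not described in the slides.

```python
from typing import List

Matrix = List[List[int]]


def read_block(src: Matrix, top: int, left: int, height: int, width: int) -> Matrix:
    """Fetch a height x width block starting at any (top, left) position."""
    return [row[left:left + width] for row in src[top:top + height]]


def tree_reduce_sum(values: List[int]) -> int:
    """Sum by repeated pairwise halving, the shape a parallel reduction takes."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:
            values.append(0)           # pad odd lengths with the additive identity
        half = len(values) // 2
        values = [values[i] + values[i + half] for i in range(half)]
    return values[0] if values else 0


def block_dot(a: Matrix, b: Matrix) -> int:
    """Multiply two blocks elementwise, then reduce all products to one value."""
    products = [x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return tree_reduce_sum(products)


image = [[r * 8 + c for c in range(8)] for r in range(8)]
weights = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]

block = read_block(image, top=1, left=2, height=3, width=3)
response = block_dot(block, weights)
```

The block starts at row 1, column 2, which is not aligned to any block grid. The reduction needs about log2(n) combining rounds for n products, which is why a dedicated instruction can do this work in few steps.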

13 Instruction Level Parallelism: Hardware Instruction Scheduling

14 Hardware Instruction Scheduling
Scheduling Groups of Four Processors
Hardware Instruction Picker
Selects from 100s of threads
Targets 10s of independent pipelines
[Figure: block diagram. Labels: Register Files, Vector Pipeline, Scalar Pipeline, Custom Arithmetic, Instruction Scheduler, Instruction Decode, Flow Control, Memory Ops., Move Pipeline, Spawn Mgmt.]
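One way to picture an instruction picker is as a per-cycle matching of waiting threads to free pipelines. The model below is a deliberately simplified assumption, not the documented hardware policy: `pick` is a hypothetical function, each thread issues in order, each pipeline takes one instruction per cycle, and ties go to the lowest thread id.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


def pick(threads: Dict[int, Deque[str]], pipelines: List[str]) -> List[List[Tuple[int, str]]]:
    """Issue instructions cycle by cycle until every thread is drained.

    Each cycle, every pipeline accepts at most one instruction and every
    thread issues at most its next in-order instruction.
    """
    trace: List[List[Tuple[int, str]]] = []
    while any(threads.values()):
        free = set(pipelines)
        issued: List[Tuple[int, str]] = []
        for tid in sorted(threads):        # fixed priority by thread id
            queue = threads[tid]
            if queue and queue[0] in free:
                free.remove(queue[0])
                issued.append((tid, queue.popleft()))
        if not issued:
            raise ValueError("an instruction targets an unknown pipeline")
        trace.append(issued)
    return trace


threads = {
    0: deque(["vector", "scalar"]),
    1: deque(["vector", "memory"]),
    2: deque(["scalar", "vector"]),
}
trace = pick(threads, ["vector", "scalar", "memory"])
```

Six instructions issue in three cycles because instructions from different threads fill different pipelines in the same cycle. With more threads to choose from, the picker is more likely to find work for every pipeline, which is the motivation for selecting from many threads.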

15 Programming Model

16 Programming Model
Fully Programmable
No a-priori constraints regarding data types, precision or graph topologies
Fully pipelined concurrent graph execution
Comprehensive SDK with support for all abstraction levels, assembly to frameworks
Machine Learning Frameworks: TensorFlow, Caffe, Torch
OpenVX + OpenCL C/C++ Language Kernels (seeking Khronos conformance post-silicon)
Provides rich graph creation and execution semantics
Extended with fully accelerated custom kernel support

17 Results
Arithmetic Pipeline Utilization: 95% for CNNs (VGG16, 8-bit)
Physical Characteristics: TSMC 28nm HPC+, Standalone SoC Mode, PCIe Accelerator Mode
SoC Power Estimate: 2.5W
