Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

2 Current Technology Limits Deep Learning Performance
Deep learning is expressed as a dataflow graph, but existing CPUs and GPUs are not designed for dataflow computations.
Typical compute model (OpenCL and CUDA on CPU + GPU/FPGA): convert the dataflow graph to sequential threads on the host, with CUDA/OpenCL kernels for acceleration.
1. Deep learning is a dataflow application
2. Dataflow is better executed on dataflow processors
* MPI = Message Passing Interface

3 Wave Dataflow Processor is Ideal for Deep Learning
Deep learning networks are dataflow graphs, programmed in deep learning software and run on the Wave Dataflow Processor.
[Figure: a network graph of Times, Plus, Sigmoid, and Softmax nodes, mapped through the WaveFlow Agent Library onto the Wave Dataflow Processor with its memory and I/O.]
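
To make the "networks are dataflow graphs" point concrete, here is a minimal sketch of data-driven execution: a node fires as soon as all of its inputs are available, with no sequential program order imposed. The node names (Times, Plus, Sigmoid, Softmax) come from the slide; the graph wiring, the `run_dataflow` helper, and the input names are assumptions made for illustration and say nothing about how the WaveFlow software actually schedules work.

```python
import math

def times(w, x):
    """Matrix-vector product."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def plus(a, b):
    """Element-wise vector add."""
    return [ai + bi for ai, bi in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [x / s for x in e]

# Hypothetical two-layer network: node name -> (operation, input names).
GRAPH = {
    "t1": (times, ["W1", "x"]),
    "p1": (plus, ["t1", "b1"]),
    "s1": (sigmoid, ["p1"]),
    "t2": (times, ["W2", "s1"]),
    "p2": (plus, ["t2", "b2"]),
    "out": (softmax, ["p2"]),
}

def run_dataflow(graph, inputs):
    """Fire every node whose inputs are ready until no nodes remain."""
    values = dict(inputs)
    pending = dict(graph)
    while pending:
        ready = [name for name, (_, args) in pending.items()
                 if all(a in values for a in args)]
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        for name in ready:
            op, args = pending.pop(name)
            values[name] = op(*[values[a] for a in args])
    return values

identity = [[1.0, 0.0], [0.0, 1.0]]
result = run_dataflow(GRAPH, {
    "W1": identity, "b1": [0.0, 0.0],
    "W2": identity, "b2": [0.0, 0.0],
    "x": [0.0, 0.0],
})
print(result["out"])  # [0.5, 0.5]
```

On a CPU or GPU the same graph would first be flattened into an ordered sequence of kernel launches; the loop above only tracks data availability.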

4 Wave's Dataflow Computer for ML Training
2.9 Peta-Ops/Second
256,000 Processing Elements
Over 2TB Bulk & High Speed Memory
Up to 32TB SSD Storage
Over 4.5TB/Sec Dataflow Bandwidth
Up to 4 Wave Computers per Data Center Node
Initially Supporting TensorFlow

5 Dataflow Processing Element (PE)
Pipelined 256-entry Instruction RAM w/ ECC
Pipelined 1KB Single Port Data RAM w/ BIST & ECC
Quad of PEs (PE a, PE b, PE c, PE d) are fully connected

6 Wave DPU Hierarchy
24 Compute Machines

7 Wave DPU Memory Hierarchy
[Figure: block diagram of the memory hierarchy. Recoverable labels:]
Cluster 0 through Cluster 63, 16 PEs each: Registers (96 8-bit), SRAM (16KB) (8KB), Switches (1.5KB)
20GB/s links between the clusters and AXI Channel 0 through AXI Channel 31 (AXIS/AXIM)
CPU with Secure I-CACHE, 128 bit
SonicsGN AXI4 NOC
High speed memory 0 through 3: 640-bit, 60GB/s (bidir) each
DDR4_0 and DDR4_1: 256-bit, 15GB/s (bidir) each
PCIe x16 (to Host): 256-bit, 30GB/s (bidir)

8 Multichannel Memory Support is Not Easy!
[Figure: the application view of the address space (Region 1, Hole 1, Region 2, Region 3, Hole 2) next to three physical organizations: 2 channels with no interleave, 2 channels interleaved, and 4 channels interleaved.]
Key Problems:
Load balancing: must balance memory traffic evenly among channels
Maintaining throughput: multiple channels cause throughput/ordering problems for pipelined memories, and DRAM controllers rely on reordering to get higher throughput
This means software and IP cores must manage multiple channels
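
The difference between the non-interleaved and interleaved organizations comes down to which address bits select the channel. The sketch below is a generic model of that mapping, not taken from the slides; `decode`, `granule`, and the 64-byte access size are illustrative assumptions.

```python
def decode(addr, granule, num_channels):
    """Map a flat address to (channel, address within that channel).

    Consecutive `granule`-sized blocks rotate across the channels.
    """
    index = addr // granule                 # which granule the address falls in
    channel = index % num_channels          # granules rotate round-robin
    local = (index // num_channels) * granule + addr % granule
    return channel, local

# A sequential 512-byte stream issued as 64-byte accesses, 4 channels.
stream = range(0, 512, 64)
fine = [decode(a, 64, 4)[0] for a in stream]      # 64B interleave
coarse = [decode(a, 4096, 4)[0] for a in stream]  # 4KB interleave

print(fine)    # [0, 1, 2, 3, 0, 1, 2, 3]
print(coarse)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

With a coarse granule the whole stream lands on one channel; with a fine granule it spreads across all four, which in turn means responses can come back from different channels out of order.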

9 Multichannel Interleaving in the Interconnect
Higher Performance, Lower Area, More Scalable
Interleaving support requires splitting traffic for delivery to the proper channel
Splitting in the memory scheduler/controller doesn't scale
Creates a wiring and performance bottleneck at the internal arbiter
Very difficult to scale past two channels
Interleaved Multichannel Technology (IMT) splitting*
Fully distributed architecture enables scalability
Network overlaps channel accesses to maximize throughput
Optimized protocols minimize reorder buffer area
Isolating channels from IP cores makes it transparent to software and other hardware
*Patented
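
As a rough illustration of what "splitting traffic for delivery to proper channel" involves, the sketch below cuts one burst at every interleave boundary and tags each piece with its target channel. It is a generic model only and makes no claim about how the patented IMT splitting is implemented; `split_burst` and its parameters are hypothetical names.

```python
def split_burst(addr, length, granule, num_channels):
    """Split a burst [addr, addr + length) at interleave boundaries.

    Returns a list of (channel, start_address, byte_count) pieces.
    """
    pieces = []
    end = addr + length
    while addr < end:
        boundary = (addr // granule + 1) * granule  # next interleave boundary
        piece_end = min(boundary, end)
        channel = (addr // granule) % num_channels
        pieces.append((channel, addr, piece_end - addr))
        addr = piece_end
    return pieces

# A 128-byte burst starting at 0x30, 64B interleave, 2 channels.
print(split_burst(0x30, 0x80, granule=64, num_channels=2))
# [(0, 48, 16), (1, 64, 64), (0, 128, 48)]
```

Each piece can be issued to its channel independently, so the responses have to be put back in order somewhere, which is where the reorder buffering mentioned on the slide comes in.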

10 Seamless Multichannel Transition
[Figure: physical organization compared with the application view.]

11 Simple 2-Channel Example (5 AXI Masters)
4K interleaving suffers from imbalanced BW on DRAM channels
64B IMT without reorder support has target switching penalty
64B IMT with reorder support delivers superior load balancing
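
The load-balancing effect can be seen in a toy model: five masters each stream through their own buffer, and accesses are counted per channel over a short window. The buffer base addresses, the 2KB window, and the 64-byte burst size are assumptions picked to make the effect visible; they are not the traffic used for the results on the slide, and the model ignores the target switching penalty and reordering.

```python
def channel_of(addr, granule, num_channels):
    """Channel selected by round-robin interleaving at the given granule."""
    return (addr // granule) % num_channels

def channel_load(bases, window_bytes, burst, granule, num_channels):
    """Count bursts per channel when every master streams `window_bytes`."""
    load = [0] * num_channels
    for base in bases:
        for offset in range(0, window_bytes, burst):
            load[channel_of(base + offset, granule, num_channels)] += 1
    return load

# Hypothetical: 5 masters, buffers aligned to 64KB, each streaming 2KB.
bases = [i * 0x10000 for i in range(5)]
coarse = channel_load(bases, 2048, 64, granule=4096, num_channels=2)
fine = channel_load(bases, 2048, 64, granule=64, num_channels=2)

print(coarse)  # [160, 0]
print(fine)    # [80, 80]
```

Over a long enough run the 4K mapping also evens out, but within a short window one channel can carry all the traffic while the other idles.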

12 SonicsGN Network on Chip (NoC)
Performance: up to 2GHz speed (14nm)
Concurrency: up to 16 Virtual Channels / link, up to 8-way IMT
Layout-friendly router-based fabric: unlimited clock/power domains, flexible distance spanning, user-controlled partitioning
Lowest power
Optional services: security, power, error management
Full design environment: capture, verification, performance analysis

13 Coping with Long Distances at High Frequency
[Figure: die plot showing the NoC ring, and the NoC request network topology connecting four HS Mem blocks, two DDR blocks, and two processor arrays.]
Large obstruction but high connectivity
Dual data/control rings
Huge delta in clock insertion delay
Mesochronous domain octants
Long router-router distance
Easy repeater insertion per domain

14 Summary
Deep Learning Applications Are Best Served by Purpose-built, High-performance DPUs
DPU Computing Fabric Leverages Huge Memory Bandwidth to Support the Faster Training of Deep Learning Algorithms
IMT Capability in NoCs Abstracts the Complexities of Heterogeneous Memory Subsystems to Simplify the Software Model
IMT Enables Dynamic Bandwidth Balancing of the DPU Traffic While Optimizing the Physical Chip Design

15 Thank You!
