Versal: AI Engine & Programming Environment

Transcription:

Versal: AI Engine & Programming Environment. Engineering Director, Xilinx Silicon Architecture Group. Presented by Ambrose Finnerty, Xilinx DSP Technical Marketing Manager. October 16, 2018.

Motivation for the AI Engine

Motivation for the AI Engine: technology scaling and applications. Moore's Law no longer delivers the performance and power scaling that traditional single-core and multi-core architectures relied on. At the same time, applications such as 5G, ADAS/AD, smart city, smart factory, machine learning, and data center workloads demand compute intensity, real-time capability everywhere, and power efficiency. Dynamic markets require adaptable compute acceleration.

Delivering Adaptable Compute Acceleration. The design space runs from CPUs (sequential) and GPUs (parallel), which are software programmable, to custom ASIC engines; the ACAP with AI Engines sits in between, combining software programmability with hardware adaptability. The trade-offs across these options are workload flexibility, throughput vs. latency, device/power efficiency, and development time and complexity, which ranges from weeks to months to years.

Introducing the AI Engine: software programmable, deterministic, and efficient. Each AI Engine is a 1 GHz+ multi-precision vector processor with high-bandwidth, extensible memory; a device carries up to 400 AI Engines, delivering 8X compute density at 40% lower power. Target domains include artificial intelligence (CNN, LSTM/MLP), signal processing, and computer vision. Adaptable. Intelligent.

Software Programmable: Any Developer. The flow is (1) design, (2) compile with the AI Engine compiler, and (3) run. Developers can work at several programming abstraction levels: ML frameworks; C/C++ data flow with Xilinx libraries (4G/5G/radar, AI, and vision libraries); data flow with user-defined libraries; architecture overlays; or kernel programs.

Hardware Adaptable: Accelerating the Whole Application. Scalar engines (a dual-core Arm Cortex-A72 and a dual-core Arm Cortex-R5) handle scalar, sequential, and complex compute; the adaptable engines provide flexible parallel compute and data manipulation; and the intelligent AI Engines handle vector, compute-intensive ML and signal processing. A network-on-chip (160 GB/s of bandwidth per interface) and any-to-any I/O connectivity tie them together, with a custom memory hierarchy and TB/s of PL-to-AI-Engine bandwidth delivering deterministic performance and low latency. The result is heterogeneous acceleration from the data center to the edge: video, genomics, risk modeling, database, network IPS, storage, and more.

AI Engine Application Performance & Power Efficiency. Versus Xilinx UltraScale+, Xilinx Versal with AI Engines delivers 20x the inference compute for image classification (GoogleNet, <1 ms latency) and 5x the 5G wireless bandwidth for a massive MIMO radio (DUC, DDC, CFR, DPD), at 40% less power consumption.

AI Engine Architecture, Programming & Applications

AI Engine: Tile-Based Architecture. Each tile combines:
- A non-blocking interconnect with up to 200+ GB/s of bandwidth per tile, linking tiles to one another and to the PS, PL, and I/O.
- Local memory in a multi-bank implementation, shared across neighboring cores.
- An ISA-based vector processor with AI and 5G vector extensions, software programmable (e.g., in C/C++).
- A cascade interface that forwards partial results to the next core.
- A data mover for non-neighbor data communication, with integrated synchronization primitives.

AI Engine: Array Architecture. The AI Engines are arranged in an array alongside the PS, PL, and I/O, so compute, memory, and communication bandwidth all increase together. The architecture is modular and scalable: more tiles mean more compute, up to 400 per device in the Versal VC1902. A distributed memory hierarchy maximizes memory bandwidth, and the array delivers deterministic performance and low latency.

AI Engine: Processor. Each AI Engine pairs a 32-bit scalar RISC processor (scalar register file, scalar ALU, non-linear functions) with a vector processor built around a 512-bit SIMD datapath (vector register file, fixed-point and floating-point vector units, three AGUs, two load units, a store unit, an instruction fetch and decode unit, and a stream interface). Local memory is 32 KB per tile, shareable with neighbors for 128 KB addressable. The design is highly parallel on two axes: instruction parallelism via VLIW, issuing 7+ operations per clock cycle (two vector loads, one multiply, one store, two scalar ops, plus stream access), and data parallelism via SIMD across multiple vector lanes, with 8/16/32-bit and single-precision floating-point operands and up to 128 MACs per clock cycle per core (INT8).
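Putting the slide's headline numbers together, the per-core and per-device peak MAC rates follow from simple arithmetic. A back-of-the-envelope sketch, using only the figures quoted on these slides (the 1 GHz clock, 128 INT8 MACs/cycle, and 400-core count; actual sustained rates depend on the kernel):

```python
# Back-of-the-envelope peak INT8 throughput for the AI Engine array,
# using the figures quoted on these slides (illustrative only).

CLOCK_HZ = 1e9              # "1GHz+" vector processor clock
MACS_PER_CYCLE_INT8 = 128   # "Up to 128 MACs / clock cycle per core (INT8)"
CORES = 400                 # "Up to 400 AI Engines per device"

per_core_macs = CLOCK_HZ * MACS_PER_CYCLE_INT8   # MACs/s per core
device_macs = per_core_macs * CORES              # MACs/s per device
device_ops = device_macs * 2                     # 1 MAC = 1 multiply + 1 add

print(f"Per-core INT8:   {per_core_macs / 1e9:.0f} GMAC/s")
print(f"Per-device INT8: {device_macs / 1e12:.1f} TMAC/s "
      f"({device_ops / 1e12:.1f} TOPS)")
```

At these assumed numbers the array peaks at 51.2 TMAC/s, i.e. roughly 102 INT8 TOPS, before any efficiency losses.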

Multi-Precision Support: MACs per cycle, per core.

Real data types:
  8x8 real       128
  16x8 real       64
  16x16 real      32
  32x16 real      16
  32x32 real       8
  32x32 SPFP       8

Complex data types:
  16-bit complex x 16-bit real   16
  16x16 complex                   8
  32x16 complex                   4
  32x32 complex                   2

Data Movement Architecture. Communication between AI Engines is streaming: a dataflow pipeline passes blocks (B0 → B1 → B2 → B3) from tile to tile through local memory, following the dataflow graph. Streams also support non-neighbor dataflow and multicast to multiple destinations. Each tile exposes a memory interface, a stream interface, and a cascade interface for forwarding partial results.

AI Engine Integration with Versal ACAP. TB/s of interface bandwidth connect the AI Engine array to the rest of the device: AI Engine to programmable logic directly, and AI Engine to the NoC through interface tiles (AXI-S and AXI-MM switches, DMA, and asynchronous clock-domain crossing). Leveraging NoC connectivity, the PS/PMC manages configuration, debug, and trace, and the AI Engines can reach external DRAM with no PL resources required.

AI Engine: Xilinx Reinvents Multicore Compute. In a traditional cache-based multicore, data is replicated across L0/L1/L2 caches and DRAM, which robs bandwidth and reduces capacity, while the fixed, shared interconnect is blocking (limiting compute) and not timing-deterministic. The AI Engine array instead uses a dedicated, non-blocking, deterministic interconnect and local, distributed memory: no cache misses, higher bandwidth, and less capacity required.

AI Engine Delivers High Compute Efficiency. Three ingredients contribute: an adaptable, non-blocking interconnect with a flexible data movement architecture that avoids interconnect bottlenecks; an adaptable memory hierarchy that is local, distributed, and shareable (extreme bandwidth, no cache misses or data replication, extensible into PL memory such as BRAM and URAM); and the ability to transfer data while the AI Engine computes, overlapping communication with compute. The result is vector processor efficiency close to peak theoretical kernel performance: 95% for ML convolutions (block-based matrix multiplication, (32x64)(64x32)), 98% for a 1024-pt FFT/iFFT, and 80% for Volterra-based forward-path DPD.

Versal ACAP Development Tools: Any Application, Any Developer. The tools span a new unified software development environment, supported ML frameworks, and the Vivado Design Suite, serving AI and data scientists, software application developers, and hardware developers.

Software Development Environment. The new unified SW development environment takes an application (e.g., in C/C++) plus performance constraints and programs the full chip: the scalar processing subsystem, the adaptable programmable logic, and the intelligent AI Engines. The whole application is SW programmable, with heterogeneous SW acceleration, full-system simulation, and debug and profiling on simulation or hardware, all with a software development experience.

AI Engine Programming Environment. The unified SW development environment provides a full SW programming tool chain for the PS, PL, and AI Engines (single-engine and multi-engine): IDE, compiler, debugger, and performance analysis. It ships with performance-optimized software libraries (for example, 4G/5G/radar, AI, and vision libraries) and run-time software (for example, error management, boot and configuration, and power/thermal management).

AI Engine Programming Experience: Dataflow Model. (1) The user defines the dataflow logic as a set of kernels (a, b, c, d, e). (2) The user describes the dataflow graph connecting them, including edges to and from the PL, using C/C++ APIs. (3) The compiler transparently manages the physical mapping of kernels to AI Engine vector cores and the interconnect between them.
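The three-step flow on this slide can be mimicked with a toy graph description: the user states only kernels and edges, and a "compiler" assigns kernels to tiles and runs the pipeline. This is a hypothetical Python stand-in for the programming model, not the actual Xilinx C/C++ graph API:

```python
# Toy dataflow-graph model of the AI Engine programming experience.
# The user declares kernels and connections; place() stands in for the
# compiler's placement step. Hypothetical sketch, not a Xilinx API.

kernels = {
    "a": lambda x: x + 1,
    "b": lambda x: x * 2,
    "c": lambda x: x - 3,
}
edges = [("a", "b"), ("b", "c")]   # the pipeline a -> b -> c

def place(kernels):
    """Naive placement: one kernel per tile, in declaration order."""
    return {name: f"tile{i}" for i, name in enumerate(kernels)}

def run(graph_input):
    """Execute the pipeline along the declared edges."""
    value = graph_input
    order = [edges[0][0]] + [dst for _, dst in edges]   # a, b, c
    for name in order:
        value = kernels[name](value)
    return value

print(place(kernels))   # {'a': 'tile0', 'b': 'tile1', 'c': 'tile2'}
print(run(5))           # ((5 + 1) * 2) - 3 = 9
```

The design point the slide makes survives even in this toy: nothing in the user-visible description says where a kernel runs; placement and interconnect are the compiler's job.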

Frameworks for Any Developer. ML frameworks target domain-specific architectures (e.g., AI inference) built from the lower abstraction levels: architecture overlays, data flow with Xilinx libraries, kernel programs, and data flow with user-defined libraries. No hardware design experience is required.

Accelerating AI Inference in the Data Center. (1) The user works in the deep learning framework of their choice, develops and trains a custom network, and provides the trained model. (2) The Xilinx DNN compiler implements the network, targeting the Xilinx AI inference domain-specific architecture: it quantizes, merges layers, prunes, and compiles to the AI Engines. (3) The result is scalable across hardware targets: start with Alveo today (U200/U250), then move to future Alveo accelerator cards powered by Versal with AI Engines.

AI Inference on Versal ACAP. The network's building blocks map onto the device: convolutions, fully connected layers, pooling (e.g., 2x2 max pooling over each single depth slice of the feature map), and activations such as ReLU (y_i = x_i for x_i >= 0, else y_i = 0) and PReLU (y_i = a_i * x_i for x_i < 0). These run on the AI Engines alongside the processing system, programmable logic, and I/O (GT, ADC/DAC), with feature map data held in the custom memory hierarchy. The same platform also serves video, genomics, storage, database, network IPS, and risk modeling workloads. (Figure credit: https://en.wikipedia.org/wiki/Convolutional_neural_network)

AI Inference Mapping on Versal ACAP. A matrix multiply of activations (A) by weights (W) is computed in blocks:

  [A00 A01]   [W00 W01]   [A00*W00 + A01*W10   ...]
  [A10 A11] x [W10 W11] = [A10*W00 + A11*W10   ...]

Convolution and fully connected layers map to the AI Engines, e.g., a (4x8) x (8x4) = (4x4) block product, with partial results forwarded from engine to engine over the cascade stream; max pool and ReLU sit in the PL next to weight and activation buffers in URAM; the scalar Arm cores and the NoC connect external memory (e.g., DDR). The custom memory hierarchy buffers data on-chip versus off-chip to reduce latency and power; stream multicast on the interconnect lets weights and activations be read once, reducing memory bandwidth; and AI-optimized vector instructions provide 128 INT8 multiplies per cycle.
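The partial-sum cascade on this slide is ordinary blocked matrix multiplication: each AI Engine computes one A-block-by-W-block product and adds it to the running sum arriving on the cascade stream. A minimal pure-Python sketch of the 2x2-block case the slide draws (function names are illustrative; 1x1 blocks keep the algebra readable):

```python
# Blocked matmul with cascaded partial sums, as on the slide:
#   C[i][j] = A[i][0] @ W[0][j] + A[i][1] @ W[1][j]
# Each matmul() term models one AI Engine; madd() models the running
# sum carried on the cascade stream between engines.

def matmul(X, Y):
    """Plain dense matrix product of two nested-list matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def madd(X, Y):
    """Elementwise matrix add (the cascade accumulation)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def blocked_matmul(A, W):
    """A and W are 2x2 grids of blocks; returns the 2x2 grid of C blocks."""
    return [[madd(matmul(A[i][0], W[0][j]),    # engine 0: partial product
                  matmul(A[i][1], W[1][j]))    # engine 1: add via cascade
             for j in range(2)] for i in range(2)]

# Tiny check with 1x1 blocks: A = [[1,2],[3,4]], W = [[5,6],[7,8]].
A = [[[[1]], [[2]]],
     [[[3]], [[4]]]]
W = [[[[5]], [[6]]],
     [[[7]], [[8]]]]
C = blocked_matmul(A, W)
print(C)   # [[[[19]], [[22]]], [[[43]], [[50]]]]
```

Because each engine consumes only its own A and W blocks and a streamed partial sum, the weights can be multicast and read from memory once, which is exactly the bandwidth argument the slide makes.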

Projected Performance. The AI Engine delivers real-time inference leadership in a 75 W power envelope: 4X the low-latency CNN throughput of a next-generation GPU (1) for the Versal device (2). Note: the Versal device achieves an 8X performance increase in a 150 W power envelope. (1) 12-nanometer T4 GPU device; projected batch=1 performance based on currently available vendor benchmarks. (2) 7-nanometer Versal AI Series VC1902 device; 75 W card power figures based on 2018.3 XPE power estimates; latency <500 us.

Market Requirements and Trends: 5G Wireless. 5G complexity is 100X that of 4G, the standard is still evolving, and 5G brings new technologies: massive MIMO, multiple antennas and frequency bands, and changing functional partitioning. The radio chain runs from the antenna array, through the analogue radio (PA, LNA, diplexer), ADC/DAC, and digital radio (DUC, CFR, DPD, DDC), to beamforming and MIMO plus some baseband transforms (linear algebra, iFFT/FFT), baseband processing (modulation & FEC), switching (IQ switch), higher-layer processing (L2-L7), and packet processing and wired backhaul (transport & control). Reference: ETRI, RWS-150029, "5G Vision and Enabling Technologies: ETRI Perspective," 3GPP RAN Workshop, Phoenix, Dec. 2015, http://www.3gpp.org/ftp/tsg_ran/tsg_ran/tsgr_70/docs

5G Wireless on Versal ACAP. In 5G wireless infrastructure (i.e., a base station), the digital radio with ADC/DAC maps cleanly onto the device. In the mapping example, the compute (DUC, DPD) maps to the AI Engines, the control (DPD update) maps to the processing system, and the I/O (direct-RF ADC/DAC and CPRI) maps to the programmable logic. Glossary: DUC = digital up-converter; DPD = digital pre-distortion; direct RF = ADC/DAC; CPRI = Common Public Radio Interface.

The AI Engine Delivers 5X More 5G Wireless Compute. A Xilinx Zynq UltraScale+ RFSoC (~1M system logic cells, ~4K DSP48 slices) drives 16 TX/RX antennas over CPRI optics but covers only part of the frequency spectrum. A Xilinx Versal device with AI Engines (hundreds of 5G wireless accelerators) drives 16 TX/RX antennas over CPRI/eCPRI optics with TSN and covers the full spectrum, enabling single-chip massive MIMO: a 16x16, 800 MHz radio.

Wrapping Up

AI Engine: Accelerating AI Inference & Signal Processing. Software programmable: frameworks and C/C++, with SW compile, debug, and deploy. Deterministic: maximum throughput with low latency, for real-time inference leadership. Efficient: up to 8X compute density at ~40% lower power, delivering 20x on AI inference and 5x on signal processing.

See What the AI Engine Can Do For You. Read the AI Engine white paper and visit www.xilinx.com/versal to find out more. The AI Engine Early Access Program is open now; contact your Xilinx sales representative.