Versal: AI Engine & Programming Environment

Size: px

Start display at page:

Download "Versal: AI Engine & Programming Environment"

Arthur Horton
5 years ago
Views:

1 Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018

2 MEMORY MEMORY MEMORY MEMORY Motivation for Engine CORE CORE CORE CORE

Smart City Smart Factory Machine Learning

Single / Multi-core Data Center Workloads

3 Technology Scaling Applications Motivation for Engine 5G ADAS / AD Compute Intensity Real Time Capability Everywhere Power Efficiency Moore s Law Smart City Smart Factory Machine Learning Performance & Power Scaling Traditional Single / Multi-core Data Center Workloads Dynamic Markets Require Adaptable Compute Acceleration Page 3

4 Delivering Adaptable Compute Acceleration CPU (Sequential) GPU (Parallel) ACAP Custom ASIC Engines SW Programmable HW Adaptable Workload Flexibility Throughput vs. Latency Device / Power Efficiency Development Time & Complexity ACAP w/ Engine Weeks Months Years Page 4

5 MEMORY MEMORY MEMORY MEMORY Introducing the Engine SW Programmable Deterministic Efficient CORE CORE CORE CORE 1GHz+ Multi-precision Vector Processor High bandwidth extensible memory Up to 400 Engines per device 8X Compute Density 40% Lower Power Artificial Intelligence Signal Processing Computer Vision CNN LSTM / MLP Adaptable. Intelligent. Page 5

6 Software Programmable: Any Developer 1 Design Run 3 C/C++ C/C++ Frameworks Programming Abstraction Levels 4G/5G/Radar Library Library Vision Library Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries 2 Compile Engine Compiler Page 6

7 Hardware Adaptable: Accelerating the Whole Application Scalar, Sequential & Complex Compute Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5 Flexible Parallel Compute, Data manipulation NETWORK-ON-CHIP I/O Adaptable ML & Signal Processing Vector, Compute Intensive Intelligent Engines 160 GB/s of B/W per Heterogeneous Acceleration from Data Center to the Edge Video + Genomics + Risk Modeling + Database + Network IPS + Storage + Any-to-Any Connectivity Custom Hierarchy TB/s of Bandwidth PL-to- Engine Delivering Deterministic Performance & Low Latency Page 7

8 Engine Application Performance & Power Efficiency Image Classification (GoogleNet, <1ms) 20x Massive MIMO Radio (DUC, DDC, CFR, DPD) 5x Xilinx UltraScale+ Xilinx Versal w/ Engine 40% Less Power Inference Compute 5G Wireless Bandwidth Power Consumption Page 8

9 MEMORY MEMORY MEMORY MEMORY Engine Architecture, Programming & Applications CORE CORE CORE CORE

10 Engine: Tile-Based Architecture Non-Blocking Interconnect Up to 200+ GB/s bandwidth per tile PS I/O PL Interconnect Local Multi-bank implementation Shared across neighbor cores Local ISA-based Vector Processor Vector Extensions ISA-based Vector Processor Software Programmable (e.g., C/C++) Data Mover 5G Vector Extensions Cascade Interface Partial results to next core Data Mover Non-neighbor data communication Integrated synchronization primitives Page 10

11 Engine: Array Architecture Array of Engines Increase in compute, memory and communication bandwidth PS I/O PL Modular and scalable architecture More tiles = more compute Up to 400 per device Versal VC1902 device Distributed memory hierarchy Maximize memory bandwidth Deterministic Performance & Low Latency Page 11

12 Engine: Processor 32-bit Scalar RISC Processor Local, Shareable 32KB Local, 128KB Addressable Scalar Register File Scalar Unit Scalar ALU Non-linear Functions AGU AGU AGU Interface Vector Register File Load Unit A Load Unit B Store Unit Vector Unit Fixed-Point Vector Unit Floating-Point Vector Unit Instruction Fetch & Decode Unit Stream Interface Vector Processor 512-bit SIMD Datapath Instruction Parallelism: VLIW 7+ operations / clock cycle 2 Vector Loads / 1 Mult / 1 Store 2 Scalar Ops / Stream Access Highly Parallel Data Parallelism: SIMD Multiple vector lanes Vector Datapath 8 / 16 / 32-bit & SPFP operands Up to 128 MACs / Clock Cycle per (INT 8) Page 12

13 Multi-Precision Support Data Types MACs / Cycle (per core) Signal Processing Data Types MACs / Cycle (per core) x32 SPFP 32x32 Real 32x16 Real 16x16 Real 16x8 Real 8x8 Real 32x32 Complex 32x16 Complex 16x16 Complex 16 Complex x 16 Real Page 13

14 Data Movement Architecture Communication Streaming Communication Dataflow Pipeline B0 B1 B2 B3 Non- Neighbor Dataflow Graph Mem Mem Mem Mem Streaming Multicast Mem Mem Interface Cascade Streaming Stream Interface Page 14 Cascade Interface

15 Engine Integration with Versal ACAP PS I/O PL TB/s of Interface Bandwidth Engine to Programmable Logic Engine to NOC Switch Switch Async CDC Switch DMA Engine Interface Tiles Leveraging NOC connectivity PS manages Config / Debug / Trace Engine to DRAM (no PL req d) PS / PMC Switch Switch AXI-S Switch AXI-MM NOC Ext. DRAM Programmable Logic PL Function Page 15

16 MEM MEM MEM MEM MEM MEM MEM MEM MEM Engine: Xilinx Reinvents Multi- Compute Traditional Multi-core (cache-based architecture) Engine Array (intelligent engine) core L0 D0 D0 D0 D0 Block 0 core L0 core L1 L0 L2 core L0 DRAM Block 1 core Data Replicated Robs bandwidth Reduces capacity L0 core L1 L0 Fixed, shared Interconnect Blocking limits compute Timing not deterministic Dedicated Interconnect Non-blocking Deterministic Local, Distributed No cache misses Higher bandwidth Less capacity required Page 16

Engine Delivers High Compute Efficiency Adaptable, non-blocking interconnect Flexible data movement architecture Avoids interconnect bottlenecks Adaptable memory hierarchy Local, distributed,

17 Engine Delivers High Compute Efficiency Adaptable, non-blocking interconnect Flexible data movement architecture Avoids interconnect bottlenecks Adaptable memory hierarchy Local, distributed, shareable = extreme bandwidth No cache misses or data replication Extend to PL memory (BRAM, URAM) Vector Processor Efficiency Peak Kernel Theoretical Performance 95% 98% 80% Transfer data while Engine Computes Comm Comm Comm Compute Compute Compute Overlap Compute and Communication ML Convolutions FFT DPD Block-based Matrix Multiplication (32 64) (64 32) 1024-pt FFT/iFFT Volterra-based forward-path DPD Page 17

18 Versal ACAP Development Tools: Any Application, Any Developer TOOLS Frameworks New Unified Software Development Environment Vivado Design Suite USER and Data Scientists Software Application Developers Hardware Developers SUPPORTED FRAMEWORKS Page 18

19 Software Development Environment Application (e.g. C/C++) Performance Constraints New Unified SW Development Environment Scalar Adaptable Intelligent Unified development environment Full chip programming Processing Sub-system Programmable Logic Engines SW programmable for whole application Heterogeneous SW acceleration System Simulation Hardware Full system simulation, debug & profiling Software development experience System Debug & Profiling Page 19

20 Engine Programming Environment Application (e.g. C/C++) New Unified SW Development Environment PS PL Engines Full SW Programming Tool Chain (Single-engine and Multi-engine) IDE Compiler Debugger Performance Analysis Performance-Optimized Software Libraries (Examples) 4G/5G/Radar Library Library Vision Library Run-Time Software (Examples) Error Management Management Boot + Configuration Power/Thermal Management Page 20

21 Engine Programming Experience: Dataflow Model 1 User defines dataflow logic 3 Compiler transparently manages placement & interconnect a b c e Physical Mapping to Engines PL to e d a b c 2 User describes dataflow graph using C/C++ APIs Vector Vector Vector Vector d Vector Page 21

22 Frameworks for Any Developer Domain Specific Architecture (e.g. Inference) Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries Target Domain Specific Architectures No HW Design Experience Required Page 22

Quantize, merge layers, prune Compile to Engines Xilinx DNN Compiler Xilinx Inference Domain Specific Architecture 3 Scalable

23 Accelerating Inference in the Data Center 1 User works in Framework of choice Develop & train custom network User provides trained model Deep Learning Frameworks 2 Xilinx DNN Compiler implements network Targets Inference Domain Specific Architecture Quantize, merge layers, prune Compile to Engines Xilinx DNN Compiler Xilinx Inference Domain Specific Architecture 3 Scalable across hardware targets Start with Alveo today Alveo U200 / U250 Future Alveo Accelerator Cards Powered by Versal with Engine Page 23

Inference on Versal ACAP Convolutions Fully Connected

4 6 6 8 3 1 1 0 1 2 2 4 6 8 3 4 y i = 0 y y i = x i x y

Genomics Storage Database Network IPS Risk modeling

Feature Map Data Volume* Custom Hierarchy *Figure

24 Inference on Versal ACAP Convolutions Fully Connected Layers Pooling Activations single depth slice X y i = 0 y y i = x i x y i = a i x i y y i = x i Y ReLU ReLU/PReLU Engines Video Genomics Storage Database Network IPS Risk modeling Processing System Programmable Logic I/O (GT, ADC/ DAC) Feature Map Data Volume* Custom Hierarchy *Figure credit: Page 24

Inference Mapping on Versal ACAP A = Activations W = Weights A 00 A 01 W 00 W 01 = A 00 W 00 + A 01 W 10 A 10 A 11 W 10 W 11 A 10 W 00 + A 11 W 10 Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5

25 Inference Mapping on Versal ACAP A = Activations W = Weights A 00 A 01 W 00 W 01 = A 00 W 00 + A 01 W 10 A 10 A 11 W 10 W 11 A 10 W 00 + A 11 W 10 Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5 Adaptable Weight Buffer (URAM) Activation Buffer (URAM) PL Max Pool Intelligent Engines Convolution Layers Fully Connected Layers ReLU A 00 W 00 A 10 Engine Engine Cascade Stream Engine Engine (4x8) X = (8x4) (4x4) Page 25 NETWORK-ON-CHIP I/O External (e.g., DDR) Custom memory hierarchy Buffer on-chip vs off-chip; Reduce latency and power Stream Multi-cast on interconnect Weights and Activations Read once: reduce memory bandwidth -optimized vector instructions (128 INT8 mults/cycle)

26 Projected Performance Engine Delivers Real-time Inference Leadership (75W Power Envelope) Low-Latency CNN Throughput 4X Next-Gen GPU (1) Versal Device (2) Note: Versal device achieves 8X performance increase in 150W power envelope (1) 12-nanometer T4 GPU device, Projected Batch=1 performance based on currently available vendor benchmarks (2) 7-nanometer Versal Series VC1902 Device, 75W card power figures based on XPE power estimates, Latency <500us Page 26

Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array Market

27 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array Market Requirements and Trends: Wireless 5G 5G Complexity is 100X that of 4G Still Evolving Standard New Technologies in 5G Massive MIMO Multiple antenna, frequency bands Changing functional partitioning ETRI RWS , 5G Vision and Enabling Technologies: ETRI Perspective 3GPP RAN Workshop Phoenix, Dec Transport & CTRL L2 L7 Modulation & FEC IQ Switch Linear Algebra ifft/ FFT DUC, CFR, DPD, DDC PA, LNA, Diplexer Page 27

28 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array 5G Wireless on Versal ACAP 5G Wireless Infrastructure (i.e., base-station) Digital Radio with ADC/DAC Compute Maps to Engine Mapping Example CPRI DUC DPD Update DPD ADC/ DAC Control Maps to PS Processing System DPD Update Engines DUC DPD Programmable Logic I/O ADC/DAC CPRI 1: DUC: Digital Up Converter 2: DPD: Digital Pre-Distortion 3: Direct RF: ADC/DAC 4: CPRI: Common Public Radio Interface I/O Maps to PL Page 28

29 Engine Delivers 5X more 5G Wireless Compute Xilinx Zynq UltraScale+ RFSoC Xilinx Versal with Engine 16x TXRX Antennae CPRI CPRI Optics 16x TXRX Antennae CPRI/eCPRI CPRI Optics w. TSN ~1M slc ~4K DSP48 Engine: 100 s of 5G Wireless Accelerators Partial Spectrum Full Spectrum Frequency Spectrum Frequency Spectrum Page 29 Enabling Single Chip Massive MIMO 16x MHz Radio

30 MEMORY MEMORY MEMORY MEMORY Wrapping Up CORE CORE CORE CORE

31 Engine: Accelerating Inference & Signal Processing 20x 5x Inference Signal Processing Software Programmable Deterministic Efficient Frameworks & C/C++ SW Compile, Debug & Deploy Max throughput w/ low latency Real-time inference leadership Up to 8X compute density At ~40% lower power Page 31

See What Engine Can Do For You Read the Engine White Paper Visit: www.xilinx.

32 See What Engine Can Do For You Read the Engine White Paper Visit: Find out more about Engine Early Access Program Open Now Contact Your Xilinx Sales Representative Page 32

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

Engineering Director, Xilinx Silicon Architecture Group Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Presented By Kees Vissers Fellow February 25, FPGA 2019 Technology scaling