Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

Size: px

Start display at page:

Download "Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm"

Emma Park
5 years ago
Views:

1 Engineering Director, Xilinx Silicon Architecture Group Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Presented By Kees Vissers Fellow February 25, FPGA 2019

2 Technology scaling coming to an end Processing Architectures are Not Scaling A Single Architecture Can t Do It Alone Performance vs. VAZ , YEARS OF PROCESSOR PERFORMANCE 100,000 2X / 3.5 Years 2X /? 6 Years 2X / 1.5 Years RISC End of Dennard Scaling Amdahls Law Safety Processing, or Latency-Critical Workloads Domain Specific Parallelism (e.g., Video, ML) Whole Application Irregular data types, instruction sets, data operation Sensor Fusion, Pre-Processing, Data Aggregation 10 2X / 3.5 Years CISC Complex Algorithms, Full Linux Services Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e 2018 >> 2

3 Need for a New Programming Paradigm Software Developer Needs Agility and Abstraction Ecosystem of Libraries Need a Scalable, Unified Platform Hardware Developer Needs Flexibility to Optimize for Performance/Power Modify, Design, Add Code >> 3

4 Versal Architecture Overview Adaptable Engines 2X compute density Scalar Engines Platform Control Edge Compute Protocol Engines Integrated 600G cores 4X encrypted bandwidth Engines Compute Diverse DSP workloads Network-on-Chip Guaranteed Bandwidth Enables SW Programmability Programmable I/O Any interface or sensor Includes 4.2Gb/s MIPI DDR 3200-DDR4, 3200-LPDDR4 2X bandwidth/pin Transceivers Broad range, 25G 112G 58G in mainstream devices PCIe & CCIX 2X PCIe & DMA bandwidth Cache-coherent interface to accelerators >> 4

5 Overview Adaptable Engines: Brian Gaide ( 9.15 today, directly following this talk) Network on Chip: Ian Swarbrick (9.45 Tuesday) Rest of this Talk: Adaptable Intelligent Engines: New processors + interconnect

6 MEMORY MEMORY MEMORY MEMORY Motivation for Engine CORE CORE CORE CORE

Technology Scaling Applications Motivation for Engine

Everywhere Power Efficiency Moore s Law Smart City

Scaling Traditional Single / Multi-core Data Center

7 Technology Scaling Applications Motivation for Engine 5G ADAS / AD Compute Intensity Real Time Capability Everywhere Power Efficiency Moore s Law Smart City Smart Factory Machine Learning Performance & Power Scaling Traditional Single / Multi-core Data Center Workloads Dynamic Markets Require Adaptable Compute Acceleration Page 7

8 Delivering Adaptable Compute Acceleration CPU (Sequential) GPU (Parallel) ACAP Custom ASIC Engines SW Programmable HW Adaptable Workload Flexibility Throughput vs. Latency Device / Power Efficiency Development Time & Complexity ACAP w/ Engine Weeks Months Years Page 8

9 MEMORY MEMORY MEMORY MEMORY Introducing the Engine SW Programmable Deterministic Efficient CORE CORE CORE CORE 1GHz+ Multi-precision Vector Processor High bandwidth extensible memory Up to 400 Engines per device 8X Compute Density 40% Lower Power Artificial Intelligence Signal Processing Computer Vision CNN LSTM / MLP Adaptable. Intelligent. Page 9

Vision Library Architecture Overlay Data Flow w/ Xilinx libraries

10 Software Programmable: Any Developer 1 Design Run 3 C/C++ C/C++ Frameworks Programming Abstraction Levels 4G/5G/Radar Library Library Vision Library Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries 2 Compile Engine Compiler Page 10

11 Hardware Adaptable: Accelerating the Whole Application Scalar, Sequential & Complex Compute Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5 Flexible Parallel Compute, Data manipulation NETWORK-ON-CHIP I/O Adaptable ML & Signal Processing Vector, Compute Intensive Intelligent Engines 160 GB/s of B/W per Heterogeneous Acceleration from Data Center to the Edge Video + Genomics + Risk Modeling + Database + Network IPS + Storage + Any-to-Any Connectivity Custom Hierarchy TB/s of Bandwidth PL-to- Engine Delivering Deterministic Performance & Low Latency Page 11

12 Engine Application Performance & Power Efficiency Image Classification (GoogleNet v1 <1ms) 10x Massive MIMO Radio (DUC, DDC, CFR, DPD) 5x Xilinx UltraScale+ Xilinx Versal w/ Engine 40% Less Power Inference Compute 5G Wireless Bandwidth Power Consumption Page 12

13 MEMORY MEMORY MEMORY MEMORY Engine Architecture, Programming & Applications CORE CORE CORE CORE

14 Engine: Tile-Based Architecture Non-Blocking Interconnect high GB/s bandwidth per tile PS I/O PL Interconnect Local Multi-bank implementation Shared across neighbor cores Local ISA-based Vector Processor Vector Extensions ISA-based Vector Processor Software Programmable (e.g., C/C++) Data Mover 5G Vector Extensions Cascade Interface Partial results to next core Data Mover Non-neighbor data communication Integrated synchronization primitives Page 14

15 PS Engine: Array Architecture PL I/O Array of Engines Increase in compute, memory and communication bandwidth Modular and scalable architecture More tiles = more compute Up to 400 per device Versal VC1902 device Distributed memory hierarchy Maximize memory bandwidth Deterministic Performance & Low Latency Page 15

16 Engine: Processor 32-bit Scalar RISC Processor Local, Shareable 32KB Local, 128KB Addressable Scalar Register File Scalar Unit Scalar ALU Non-linear Functions AGU AGU AGU Interface Vector Register File Load Unit A Load Unit B Store Unit Vector Unit Fixed-Point Vector Unit Floating-Point Vector Unit Instruction Fetch & Decode Unit Stream Interface Vector Processor 512-bit SIMD Datapath Instruction Parallelism: VLIW 7+ operations / clock cycle 2 Vector Loads / 1 Mult / 1 Store 2 Scalar Ops / Stream Access Highly Parallel Data Parallelism: SIMD Multiple vector lanes Vector Datapath 8 / 16 / 32-bit & SPFP operands Up to 128 MACs / Clock Cycle per (INT 8) Page 16

17 Multi-Precision Support Data Types MACs / Cycle (per core) Signal Processing Data Types MACs / Cycle (per core) x32 SPFP 32x32 Real 32x16 Real 16x16 Real 16x8 Real 8x8 Real 32x32 Complex 32x16 Complex 16x16 Complex 16 Complex x 16 Real Page 17

18 Data Movement Architecture Communication Streaming Communication Dataflow Pipeline B0 B1 B2 B3 Non- Neighbor Dataflow Graph Mem Mem Mem Mem Streaming Multicast Mem Mem Interface Cascade Streaming Stream Interface Page 18 Cascade Interface

Engine Integration with Versal ACAP PS I/O PL TB/s of Interface Bandwidth Engine to Programmable Logic Engine to NOC Switch Switch Async CDC Switch DMA Engine Interface Tiles

19 Engine Integration with Versal ACAP PS I/O PL TB/s of Interface Bandwidth Engine to Programmable Logic Engine to NOC Switch Switch Async CDC Switch DMA Engine Interface Tiles Leveraging NOC connectivity PS manages Config / Debug / Trace Engine to DRAM (no PL req d) PS / PMC Switch Switch AXI-S Switch AXI-MM NOC Ext. DRAM Programmable Logic PL Function Page 19

20 MEM MEM MEM MEM MEM MEM MEM MEM MEM Engine: Multi- Compute with dedicated memory Traditional Multi-core (cache-based architecture) Engine Array (intelligent engine) core L0 D0 D0 D0 D0 Block 0 core L0 core L1 L0 L2 core L0 DRAM Block 1 core Data Replicated Robs bandwidth Reduces capacity L0 core L1 L0 Fixed, shared Interconnect Blocking limits compute Timing not deterministic Dedicated Interconnect Non-blocking Deterministic Local, Distributed No cache misses Higher bandwidth Less capacity required Page 20

21 Engine Delivers High Compute Efficiency Adaptable, non-blocking interconnect Flexible data movement architecture Avoids interconnect bottlenecks Adaptable memory hierarchy Local, distributed, shareable = extreme bandwidth No cache misses or data replication Extend to PL memory (BRAM, URAM) Vector Processor Efficiency Peak Kernel Theoretical Performance 95% 98% 80% Transfer data while Engine Computes Comm Comm Comm Compute Compute Compute Overlap Compute and Communication ML Convolutions FFT DPD Block-based Matrix Multiplication (32 64) (64 32) 1024-pt FFT/iFFT Volterra-based forward-path DPD Page 21

22 Engine Programming Experience: Dataflow Model 1 User defines dataflow logic 3 Compiler transparently manages placement & interconnect a b c e Physical Mapping to Engines PL to e d a b c 2 User describes dataflow graph using C/C++ APIs Vector Vector Vector Vector d Vector Page 22

23 Versal ACAP Development Tools: TOOLS Frameworks New Unified Software Development Environment Vivado Design Suite USER and Data Scientists Software Application Developers Hardware Developers SUPPORTED FRAMEWORKS Page 23

24 Software Development Environment Application (e.g. C/C++) Performance Constraints New Unified SW Development Environment Scalar Adaptable Intelligent Unified development environment Full chip programming Processing Sub-system Programmable Logic Engines SW programmable for whole application Heterogeneous SW acceleration System Simulation Hardware Full system simulation, debug & profiling Software development experience System Debug & Profiling Page 24

25 Engine Programming Environment Application (e.g. C/C++) New Unified SW Development Environment PS PL Engines Full SW Programming Tool Chain (Single-engine and Multi-engine) IDE Compiler Debugger Performance Analysis Performance-Optimized Software Libraries (Examples) 4G/5G/Radar Library Library Vision Library Run-Time Software (Examples) Error Management Management Boot + Configuration Power/Thermal Management Page 25

26 Frameworks for Any Developer Domain Specific Architecture (e.g. Inference) Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries Target Domain Specific Architectures No HW Design Experience Required Page 26

Accelerating Inference in the Data Center 1 User works in

provides trained model Deep Learning Frameworks 2 Xilinx DNN

Architecture Quantize, merge layers, prune Compile to Engines

Architecture 3 Scalable across hardware targets Start with

27 Accelerating Inference in the Data Center 1 User works in Framework of choice Develop & train custom network User provides trained model Deep Learning Frameworks 2 Xilinx DNN Compiler implements network Targets Inference Domain Specific Architecture Quantize, merge layers, prune Compile to Engines Xilinx DNN Compiler Xilinx Inference Domain Specific Architecture 3 Scalable across hardware targets Start with Alveo today Alveo U200 / U250/U280 New Versal based Acceleration Cards Page 27

Inference on Versal ACAP Convolutions Fully Connected

4 6 6 8 3 1 1 0 1 2 2 4 6 8 3 4 y i = 0 y y i = x i x y

Genomics Storage Database Network IPS Risk modeling

Feature Map Data Volume* Custom Hierarchy *Figure

28 Inference on Versal ACAP Convolutions Fully Connected Layers Pooling Activations single depth slice X y i = 0 y y i = x i x y i = a i x i y y i = x i Y ReLU ReLU/PReLU Engines Video Genomics Storage Database Network IPS Risk modeling Processing System Programmable Logic I/O (GT, ADC/ DAC) Feature Map Data Volume* Custom Hierarchy *Figure credit: Page 28

Inference Mapping on Versal ACAP A = Activations W = Weights A 00 A 01 W 00 W 01 = A 00 W 00 + A 01 W 10 A 10 A 11 W 10 W 11 A 10 W 00 + A 11 W 10 Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5

29 Inference Mapping on Versal ACAP A = Activations W = Weights A 00 A 01 W 00 W 01 = A 00 W 00 + A 01 W 10 A 10 A 11 W 10 W 11 A 10 W 00 + A 11 W 10 Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5 Adaptable Weight Buffer (URAM) Activation Buffer (URAM) PL Max Pool Intelligent Engines Convolution Layers Fully Connected Layers ReLU A 00 W 00 A 10 Engine Engine Cascade Stream Engine Engine (4x8) X = (8x4) (4x4) Page 29 NETWORK-ON-CHIP I/O External (e.g., DDR) Custom memory hierarchy Buffer on-chip vs off-chip; Reduce latency and power Stream Multi-cast on interconnect Weights and Activations Read once: reduce memory bandwidth -optimized vector instructions (128 INT8 mults/cycle)

30 Projected Performance Engine Delivers Real-time Inference Leadership (75W Power Envelope) Low-Latency CNN Throughput 4X Next-Gen GPU (1) Versal Device (2) Note: Versal device achieves 8X performance increase in 150W power envelope (1) 12-nanometer T4 GPU device, Projected Batch=1 performance based on currently available vendor benchmarks (2) 7-nanometer Versal Series VC1902 Device, 75W card power figures based on XPE power estimates, Latency <500us Page 30

Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array Market

31 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array Market Requirements and Trends: Wireless 5G 5G Complexity is 100X that of 4G Still Evolving Standard New Technologies in 5G Massive MIMO Multiple antenna, frequency bands Changing functional partitioning ETRI RWS , 5G Vision and Enabling Technologies: ETRI Perspective 3GPP RAN Workshop Phoenix, Dec Transport & CTRL L2 L7 Modulation & FEC IQ Switch Linear Algebra ifft/ FFT DUC, CFR, DPD, DDC PA, LNA, Diplexer Page 31

32 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array 5G Wireless on Versal ACAP 5G Wireless Infrastructure (i.e., base-station) Digital Radio with ADC/DAC Compute Maps to Engine Mapping Example CPRI DUC DPD Update DPD ADC/ DAC Control Maps to PS Processing System DPD Update Engines DUC DPD Programmable Logic I/O ADC/DAC CPRI 1: DUC: Digital Up Converter 2: DPD: Digital Pre-Distortion 3: Direct RF: ADC/DAC 4: CPRI: Common Public Radio Interface I/O Maps to PL Page 32

Frameworks & C/C++ SW Compile, Debug & Deploy Max throughput w/ low

33 Engine: Accelerating Inference & Signal Processing 10x 5x Inference Signal Processing Software Programmable Deterministic Efficient Frameworks & C/C++ SW Compile, Debug & Deploy Max throughput w/ low latency Real-time inference leadership Up to 8X compute density At ~40% lower power Page 33

34 VC1902:133TOPS (int8 peak)

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY