Adaptable Intelligence The Next Computing Era

Similar documents
Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

Versal: AI Engine & Programming Environment

Xilinx ML Suite Overview

Xilinx ML Suite Overview

Inference

HW/SW Programmable Engine:

Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

Extending the Power of FPGAs to Software Developers:

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

RDMA and Hardware Support

Xilinx DNN Processor An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs

Xilinx Machine Learning Strategies For Edge

Toward a Memory-centric Architecture

Fast Hardware For AI

Extending the Power of FPGAs

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

FPGA Entering the Era of the All Programmable SoC

Enabling the A.I. Era: From Materials to Systems

Deep Learning Accelerators

Enabling FPGAs in Hyperscale Data Centers

Altera SDK for OpenCL

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

Accelerating Data Center Workloads with FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design

Next Generation Enterprise Solutions from ARM

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

Data-Centric Innovation Summit DAN MCNAMARA SENIOR VICE PRESIDENT GENERAL MANAGER, PROGRAMMABLE SOLUTIONS GROUP

Xilinx Analyst & Investor Day 2018

A So%ware Developer's Journey into a Deeply Heterogeneous World. Tomas Evensen, CTO Embedded So%ware, Xilinx

FPGA 加速机器学习应用. 罗霖 2017 年 6 月 20 日

OCP Engineering Workshop - Telco

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

Revolutionizing the Datacenter

Thomas Lin, Naif Tarafdar, Byungchul Park, Paul Chow, and Alberto Leon-Garcia

MATLAB/Simulink 기반의프로그래머블 SoC 설계및검증

Zynq-7000 All Programmable SoC Product Overview

AWS & Intel: A Partnership Dedicated to fueling your Innovations. Thomas Kellerer BDM CSP, Intel Central Europe

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

ECE 5775 (Fall 17) High-Level Digital Design Automation. Hardware-Software Co-Design

CCIX: a new coherent multichip interconnect for accelerated use cases

SmartNICs: Giving Rise To Smarter Offload at The Edge and In The Data Center

XPU A Programmable FPGA Accelerator for Diverse Workloads

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Software Defined Modem A commercial platform for wireless handsets

The Nios II Family of Configurable Soft-core Processors

Dr. Yassine Hariri CMC Microsystems

FPGA & Hybrid Systems in the Enterprise Drivers, Exemplars and Challenges

SDSoC: Session 1

Simplify System Complexity

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

Simplify System Complexity

Mapping applications into MPSoC

World s most advanced data center accelerator for PCIe-based servers

Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit

Gen-Z Memory-Driven Computing

It s ALL About Execution May 22, 2017

The Challenges of System Design. Raising Performance and Reducing Power Consumption

Optimizing ARM SoC s with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

借助 SDSoC 快速開發複雜的嵌入式應用

Multimedia in Mobile Phones. Architectures and Trends Lund

BERKELEY PAR LAB. RAMP Gold Wrap. Krste Asanovic. RAMP Wrap Stanford, CA August 25, 2010

Combining Arm & RISC-V in Heterogeneous Designs

MPSOC Design examples

UPCRC Overview. Universal Computing Research Centers launched at UC Berkeley and UIUC. Andrew A. Chien. Vice President of Research Intel Corporation

Comprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications

Welcome. Altera Technology Roadshow 2013

Validation Strategies with pre-silicon platforms

IBM Power Systems: Open innovation to put data to work Dexter Henderson Vice President IBM Power Systems

The OpenVX Computer Vision and Neural Network Inference

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive)

Accelerate Machine Learning on macos with Intel Integrated Graphics. Hisham Chowdhury May 23, 2018

SoC Systeme ultra-schnell entwickeln mit Vivado und Visual System Integrator

Embedded Systems. 8. Hardware Components. Lothar Thiele. Computer Engineering and Networks Laboratory

CPU Project in Western Digital: From Embedded Cores for Flash Controllers to Vision of Datacenter Processors with Open Interfaces

SAP HANA. Jake Klein/ SVP SAP HANA June, 2013

Embedded Systems. 7. System Components

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

An introduction to Machine Learning silicon

GRVI Phalanx. A Massively Parallel RISC-V FPGA Accelerator Accelerator. Jan Gray

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

Evolution of Rack Scale Architecture Storage

Onto Petaflops with Kubernetes

OPEN COMPUTE PLATFORMS POWER SOFTWARE-DRIVEN PACKET FLOW VISIBILITY, PART 2 EXECUTIVE SUMMARY. Key Takeaways

Agenda. Introduction Network functions virtualization (NFV) promise and mission cloud native approach Where do we want to go with NFV?

SYSTEMS ON CHIP (SOC) FOR EMBEDDED APPLICATIONS

Zynq Ultrascale+ Architecture

OpenCAPI Technology. Myron Slota Speaker name, Title OpenCAPI Consortium Company/Organization Name. Join the Conversation #OpenPOWERSummit

Computer Architecture. Fall Dongkun Shin, SKKU

Multimedia SoC System Solutions

Pactron FPGA Accelerated Computing Solutions

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

Embedded HW/SW Co-Development

RapidIO.org Update.

Industry Collaboration and Innovation

Transcription:

Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx

Pervasive Intelligence from Cloud to Edge to Endpoints >> 1

Exponential Growth and Opportunities Data Explosion Source: Patrick Cheesman Hyperscale Data Centers Hyperscale Market Value Source: Cisco Global Cloud Index, 2016-2021 Source: Bloomberg; company reports; The Economist estimates >> 2

Challenges: The End of Moore s Law and Scaling 55nm Radeon HD4870 16nm Zynq RFSoC.35um Alpha 21264 180nm MIPS IP Core.75um VAX 2um VAX Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e 2018 >> 3

Challenge: Exponential Power Density Growth Source: John Hennessy and David Patterson: A New Golden Age for Computer Architecture Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development Power consumption based on models in Dark Silicon and the End of Multicore Scaling Hadi Esmaelizadeh, ISCA, 2011 >> 4

The Third Wave: Domain Specific Architectures on Adaptable HW Cache CPU CISC RISC Multi-Core Fixed HW Accelerators GPU, ASSP, ASIC DSA s on Adaptable Platforms FPGA, Zynq, ACAP >> 5

Massive Scale Out Requires DSA s and Adaptable Platforms Integration of AI Mainframe Era Units (M) PC Era Units (100M s) Mobile Era Units (B s) Pervasive Intelligence Era Units (50B s) >> 6

The Innovation to Deployment Acceleration Imperative AI Papers Published Over-the-air Updates Source: Scopus >> 7

Application Acceleration with DSA s on FPGA Platforms FPGA s Accelerate Entire Application Hyperscale Data Centers with FPGA Accelerators Execution Time Server CPU FPGA Accelerator CPU only Front End ML Back End Database Workloads Search (Database + ML) CPU + ML ASIC Front End ML Back End Speech (ML) Workloads Security Workloads CPU + FPGA Front End ML Back End FPGA Plane App 1 App 2 Front End ML Front End ML Back End Back End CPU Plane App N Front End ML Back End Virtualized & Scaled Out Adaptable Acceleration Dynamic Optimization for Changing Workloads & Mix >> 8

Development for DC Compute, Storage, Network Apps >> 9

Stack for Application Acceleration including ML User Applications C++ with Framework API Xilinx & Partner Provided Accelerators OpenCV Application-Specific Framework SW SQL Application-Specific Acceleration Genomics Video Search Financial Database SmartNIC ML Acceleration xdnn xdnn Frameworks Provide API to User Applications SW HW Xilinx Provided ML Engines DSA Shell Silicon FPGA or SoC or ACAP Standard Platform Interface (DDR, PCIe, Network) >> 10

Cloud: Latency-sensitive High Resolution Imaging CPU/GPU Motion Analysis Custom Operation PCIe CNN OpenCV: Pre-processing a Bottleneck CNN: Object Detection/Feature Extraction Data Sharing Between CPU and GPU CPU/GPU Results OpenCV CPU GPU CNN 100ms 2.3s Xilinx FPGA 2.7X Faster CPU PCIe CNN Image Scaling, Tiling Custom Operation FPGA OpenCV: 5x Faster CNN: Up to 3x Faster Based on Image Size Model Parallelism for High Res Processing Lower Power FPGA Results OpenCV CNN 20ms 0.85s 2.7X Faster and Lower Power >> 11

Cloud: Security / Anomaly Detection CPU/GPU Custom Operation CPU PCIe PCIe NIC MLP/Random Forrest GPU High Performance NIC: 100+ G MLP: MLP/RF Acceleration Security Vulnerability: Data Travels thru CPU to GPU CPU/GPU Results OpenCV MLP 50 µs Xilinx FPGA 5X Lower Latency CPU PCIe MLP/Random Forrest Smart NIC Custom Operation FPGA Integrated Single Card/Chip Solution High Speed Smart NIC Real-time MLP/RF at 5x Lower Latency TCO and Power Advantage: 1 Card vs 2 Cards In-line Security Detection is Much Higher Security FPGA Results OpenCV 10 µs MLP 5X Lower Latency with High Security >> 12

Cloud: Smart City / Security CPU/GPU Motion Analysis Custom Operation PCIe CNN Motion Analysis H.264 Decode H.265 Decode: 4x 1080p Streams (P4) OpenCV: Pre-Processing/Motion Analysis on CPU CNN: Object Detection on GPU Data Sharing Between CPU and GPU CPU/GPU Results Decode OpenCV CNN CPU GPU H.264 16 ms 10 ms CPU/Xilinx FPGA 10X Lower Latency CPU PCIe CNN Motion Analysis H.264 Decode Custom Operation FPGA H.265 Decode: 4x 1080p Streams (P4) OpenCV: Up to 20x Higher Performance CNN: 5ms Object Detection (xdnn Acceleration) Integrated Single Chip Solution: Cloud (VU9/13P) Lower Power FPGA Results Decode OpenCV CNN H.264 0.9ms 1.7ms 10x Lower Latency and Lower Power >> 13

Xilinx Programmable Architecture Milestones First FPGA Introduced 1980 First Virtex FPGA 1990 Virtex-2 Pro 2000 First 3D FPGA & HW/SW Programmable SoC 2010 First MPSoC & RFSoC 2016 ACAP 2019 >> 14

From FPGA to Adaptive Compute Acceleration Platform New Device Category for Adaptive Workload-specific Acceleration HW/SW Programmable Engines IP Subsystems and a Network-on-Chip Platform Offerings for Compute / Storage / Networking >> 15

Adaptive Compute Acceleration Platform Dynamically Adaptable to Workloads Exponential Increase in Acceleration Software Programmable >> 16

Adaptive Compute Acceleration Platform 20x ML Inference Performance 4x 5G Communications Bandwidth 112G Transceivers >> 17

New HW/SW Programmable Architecture Application-level Performance Enabled by SW Programmable Engine Application-level Performance Enabled by SW Programmable Engine Array Architecture 20x * Everest 7nm 4x 16nm Memory Core Memory Core Memory Core 40% Less Power Memory Memory Core Core Memory Memory Core Core Memory Memory Core Core ML Inference 5G Wireless Power Compute Efficiency Domain Specific Engine Greater Compute Density Xilinx 7nm Everest >> 18 Multiple Applications Heterogenous Architecture ML Inference for Cloud DC PE Throughout and Efficiency Wireless 5G: Radio, Baseband PL Flexibility ADAS/AD Embedded Vision Customized Memory Hierarchy Wired: DOCSIS Cable Access SW Programmable SW Programmable (e.g., C/C++) Compile, Execute, Debug Increased Productivity

xdnn: Adaptable Overlay DNN Processor for Xilinx FPGA Trained Model Compiler + Runtime Overlay DNN Processor 60-80% Computation Efficiency OUTPUT OUTPUT. Next HW In-Line Relu+ Bias+ Conv HW In-Line Relu+ Bias+ Conv HW In-Line Relu+ Bias+ Conv HW In-Line Relu+ Bias+ Conv HW In-Line Relu+ Bias+ Conv HW In-Line Relu+ Bias+ Conv Pool One Shot Network Deployment Previous INPUT INPUT No FPGA Expertise Needed Low Latency & High Throughput (Batch = 1) >> 19

Soft Overlay Architectures for ML DeePhi Aristotle Architecture DeePhi Descartes Architecture DPU High Performance Scheduler Hybrid Computing Array PE PE PE PE Memory Memory Memory Interconnect Sparse_LSTM IP Host CPU Instruction Fetch Unit DMA Global Memory Pool High Speed Data Tube Host DMA AXI4 Lite Interface DMA Input Buffer AXI4 Output Buffer Acceleration Kernel PE PE PE DMA PE DMA PE DMA PE............ RAM >> 20

Building the Adaptable, Intelligent World