Versal: AI Engine & Programming Environment
|
|
- Arthur Horton
- 5 years ago
- Views:
Transcription
1 Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018
2 MEMORY MEMORY MEMORY MEMORY Motivation for Engine CORE CORE CORE CORE
3 Technology Scaling Applications Motivation for Engine 5G ADAS / AD Compute Intensity Real Time Capability Everywhere Power Efficiency Moore s Law Smart City Smart Factory Machine Learning Performance & Power Scaling Traditional Single / Multi-core Data Center Workloads Dynamic Markets Require Adaptable Compute Acceleration Page 3
4 Delivering Adaptable Compute Acceleration CPU (Sequential) GPU (Parallel) ACAP Custom ASIC Engines SW Programmable HW Adaptable Workload Flexibility Throughput vs. Latency Device / Power Efficiency Development Time & Complexity ACAP w/ Engine Weeks Months Years Page 4
5 MEMORY MEMORY MEMORY MEMORY Introducing the Engine SW Programmable Deterministic Efficient CORE CORE CORE CORE 1GHz+ Multi-precision Vector Processor High bandwidth extensible memory Up to 400 Engines per device 8X Compute Density 40% Lower Power Artificial Intelligence Signal Processing Computer Vision CNN LSTM / MLP Adaptable. Intelligent. Page 5
6 Software Programmable: Any Developer 1 Design Run 3 C/C++ C/C++ Frameworks Programming Abstraction Levels 4G/5G/Radar Library Library Vision Library Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries 2 Compile Engine Compiler Page 6
7 Hardware Adaptable: Accelerating the Whole Application Scalar, Sequential & Complex Compute Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5 Flexible Parallel Compute, Data manipulation NETWORK-ON-CHIP I/O Adaptable ML & Signal Processing Vector, Compute Intensive Intelligent Engines 160 GB/s of B/W per Heterogeneous Acceleration from Data Center to the Edge Video + Genomics + Risk Modeling + Database + Network IPS + Storage + Any-to-Any Connectivity Custom Hierarchy TB/s of Bandwidth PL-to- Engine Delivering Deterministic Performance & Low Latency Page 7
8 Engine Application Performance & Power Efficiency Image Classification (GoogleNet, <1ms) 20x Massive MIMO Radio (DUC, DDC, CFR, DPD) 5x Xilinx UltraScale+ Xilinx Versal w/ Engine 40% Less Power Inference Compute 5G Wireless Bandwidth Power Consumption Page 8
9 MEMORY MEMORY MEMORY MEMORY Engine Architecture, Programming & Applications CORE CORE CORE CORE
10 Engine: Tile-Based Architecture Non-Blocking Interconnect Up to 200+ GB/s bandwidth per tile PS I/O PL Interconnect Local Multi-bank implementation Shared across neighbor cores Local ISA-based Vector Processor Vector Extensions ISA-based Vector Processor Software Programmable (e.g., C/C++) Data Mover 5G Vector Extensions Cascade Interface Partial results to next core Data Mover Non-neighbor data communication Integrated synchronization primitives Page 10
11 Engine: Array Architecture Array of Engines Increase in compute, memory and communication bandwidth PS I/O PL Modular and scalable architecture More tiles = more compute Up to 400 per device Versal VC1902 device Distributed memory hierarchy Maximize memory bandwidth Deterministic Performance & Low Latency Page 11
12 Engine: Processor 32-bit Scalar RISC Processor Local, Shareable 32KB Local, 128KB Addressable Scalar Register File Scalar Unit Scalar ALU Non-linear Functions AGU AGU AGU Interface Vector Register File Load Unit A Load Unit B Store Unit Vector Unit Fixed-Point Vector Unit Floating-Point Vector Unit Instruction Fetch & Decode Unit Stream Interface Vector Processor 512-bit SIMD Datapath Instruction Parallelism: VLIW 7+ operations / clock cycle 2 Vector Loads / 1 Mult / 1 Store 2 Scalar Ops / Stream Access Highly Parallel Data Parallelism: SIMD Multiple vector lanes Vector Datapath 8 / 16 / 32-bit & SPFP operands Up to 128 MACs / Clock Cycle per (INT 8) Page 12
13 Multi-Precision Support Data Types MACs / Cycle (per core) Signal Processing Data Types MACs / Cycle (per core) x32 SPFP 32x32 Real 32x16 Real 16x16 Real 16x8 Real 8x8 Real 32x32 Complex 32x16 Complex 16x16 Complex 16 Complex x 16 Real Page 13
14 Data Movement Architecture Communication Streaming Communication Dataflow Pipeline B0 B1 B2 B3 Non- Neighbor Dataflow Graph Mem Mem Mem Mem Streaming Multicast Mem Mem Interface Cascade Streaming Stream Interface Page 14 Cascade Interface
15 Engine Integration with Versal ACAP PS I/O PL TB/s of Interface Bandwidth Engine to Programmable Logic Engine to NOC Switch Switch Async CDC Switch DMA Engine Interface Tiles Leveraging NOC connectivity PS manages Config / Debug / Trace Engine to DRAM (no PL req d) PS / PMC Switch Switch AXI-S Switch AXI-MM NOC Ext. DRAM Programmable Logic PL Function Page 15
16 MEM MEM MEM MEM MEM MEM MEM MEM MEM Engine: Xilinx Reinvents Multi- Compute Traditional Multi-core (cache-based architecture) Engine Array (intelligent engine) core L0 D0 D0 D0 D0 Block 0 core L0 core L1 L0 L2 core L0 DRAM Block 1 core Data Replicated Robs bandwidth Reduces capacity L0 core L1 L0 Fixed, shared Interconnect Blocking limits compute Timing not deterministic Dedicated Interconnect Non-blocking Deterministic Local, Distributed No cache misses Higher bandwidth Less capacity required Page 16
17 Engine Delivers High Compute Efficiency Adaptable, non-blocking interconnect Flexible data movement architecture Avoids interconnect bottlenecks Adaptable memory hierarchy Local, distributed, shareable = extreme bandwidth No cache misses or data replication Extend to PL memory (BRAM, URAM) Vector Processor Efficiency Peak Kernel Theoretical Performance 95% 98% 80% Transfer data while Engine Computes Comm Comm Comm Compute Compute Compute Overlap Compute and Communication ML Convolutions FFT DPD Block-based Matrix Multiplication (32 64) (64 32) 1024-pt FFT/iFFT Volterra-based forward-path DPD Page 17
18 Versal ACAP Development Tools: Any Application, Any Developer TOOLS Frameworks New Unified Software Development Environment Vivado Design Suite USER and Data Scientists Software Application Developers Hardware Developers SUPPORTED FRAMEWORKS Page 18
19 Software Development Environment Application (e.g. C/C++) Performance Constraints New Unified SW Development Environment Scalar Adaptable Intelligent Unified development environment Full chip programming Processing Sub-system Programmable Logic Engines SW programmable for whole application Heterogeneous SW acceleration System Simulation Hardware Full system simulation, debug & profiling Software development experience System Debug & Profiling Page 19
20 Engine Programming Environment Application (e.g. C/C++) New Unified SW Development Environment PS PL Engines Full SW Programming Tool Chain (Single-engine and Multi-engine) IDE Compiler Debugger Performance Analysis Performance-Optimized Software Libraries (Examples) 4G/5G/Radar Library Library Vision Library Run-Time Software (Examples) Error Management Management Boot + Configuration Power/Thermal Management Page 20
21 Engine Programming Experience: Dataflow Model 1 User defines dataflow logic 3 Compiler transparently manages placement & interconnect a b c e Physical Mapping to Engines PL to e d a b c 2 User describes dataflow graph using C/C++ APIs Vector Vector Vector Vector d Vector Page 21
22 Frameworks for Any Developer Domain Specific Architecture (e.g. Inference) Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries Target Domain Specific Architectures No HW Design Experience Required Page 22
23 Accelerating Inference in the Data Center 1 User works in Framework of choice Develop & train custom network User provides trained model Deep Learning Frameworks 2 Xilinx DNN Compiler implements network Targets Inference Domain Specific Architecture Quantize, merge layers, prune Compile to Engines Xilinx DNN Compiler Xilinx Inference Domain Specific Architecture 3 Scalable across hardware targets Start with Alveo today Alveo U200 / U250 Future Alveo Accelerator Cards Powered by Versal with Engine Page 23
24 Inference on Versal ACAP Convolutions Fully Connected Layers Pooling Activations single depth slice X y i = 0 y y i = x i x y i = a i x i y y i = x i Y ReLU ReLU/PReLU Engines Video Genomics Storage Database Network IPS Risk modeling Processing System Programmable Logic I/O (GT, ADC/ DAC) Feature Map Data Volume* Custom Hierarchy *Figure credit: Page 24
25 Inference Mapping on Versal ACAP A = Activations W = Weights A 00 A 01 W 00 W 01 = A 00 W 00 + A 01 W 10 A 10 A 11 W 10 W 11 A 10 W 00 + A 11 W 10 Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5 Adaptable Weight Buffer (URAM) Activation Buffer (URAM) PL Max Pool Intelligent Engines Convolution Layers Fully Connected Layers ReLU A 00 W 00 A 10 Engine Engine Cascade Stream Engine Engine (4x8) X = (8x4) (4x4) Page 25 NETWORK-ON-CHIP I/O External (e.g., DDR) Custom memory hierarchy Buffer on-chip vs off-chip; Reduce latency and power Stream Multi-cast on interconnect Weights and Activations Read once: reduce memory bandwidth -optimized vector instructions (128 INT8 mults/cycle)
26 Projected Performance Engine Delivers Real-time Inference Leadership (75W Power Envelope) Low-Latency CNN Throughput 4X Next-Gen GPU (1) Versal Device (2) Note: Versal device achieves 8X performance increase in 150W power envelope (1) 12-nanometer T4 GPU device, Projected Batch=1 performance based on currently available vendor benchmarks (2) 7-nanometer Versal Series VC1902 Device, 75W card power figures based on XPE power estimates, Latency <500us Page 26
27 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array Market Requirements and Trends: Wireless 5G 5G Complexity is 100X that of 4G Still Evolving Standard New Technologies in 5G Massive MIMO Multiple antenna, frequency bands Changing functional partitioning ETRI RWS , 5G Vision and Enabling Technologies: ETRI Perspective 3GPP RAN Workshop Phoenix, Dec Transport & CTRL L2 L7 Modulation & FEC IQ Switch Linear Algebra ifft/ FFT DUC, CFR, DPD, DDC PA, LNA, Diplexer Page 27
28 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array 5G Wireless on Versal ACAP 5G Wireless Infrastructure (i.e., base-station) Digital Radio with ADC/DAC Compute Maps to Engine Mapping Example CPRI DUC DPD Update DPD ADC/ DAC Control Maps to PS Processing System DPD Update Engines DUC DPD Programmable Logic I/O ADC/DAC CPRI 1: DUC: Digital Up Converter 2: DPD: Digital Pre-Distortion 3: Direct RF: ADC/DAC 4: CPRI: Common Public Radio Interface I/O Maps to PL Page 28
29 Engine Delivers 5X more 5G Wireless Compute Xilinx Zynq UltraScale+ RFSoC Xilinx Versal with Engine 16x TXRX Antennae CPRI CPRI Optics 16x TXRX Antennae CPRI/eCPRI CPRI Optics w. TSN ~1M slc ~4K DSP48 Engine: 100 s of 5G Wireless Accelerators Partial Spectrum Full Spectrum Frequency Spectrum Frequency Spectrum Page 29 Enabling Single Chip Massive MIMO 16x MHz Radio
30 MEMORY MEMORY MEMORY MEMORY Wrapping Up CORE CORE CORE CORE
31 Engine: Accelerating Inference & Signal Processing 20x 5x Inference Signal Processing Software Programmable Deterministic Efficient Frameworks & C/C++ SW Compile, Debug & Deploy Max throughput w/ low latency Real-time inference leadership Up to 8X compute density At ~40% lower power Page 31
32 See What Engine Can Do For You Read the Engine White Paper Visit: Find out more about Engine Early Access Program Open Now Contact Your Xilinx Sales Representative Page 32
33
Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm
Engineering Director, Xilinx Silicon Architecture Group Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Presented By Kees Vissers Fellow February 25, FPGA 2019 Technology scaling
More informationHW/SW Programmable Engine:
HW/SW Programmable Engine: Domain Specific Architecture for Project Everest Juanjo Noguera, Goran Bilski, Jan Langer, Baris Ozgul, Tim Tuan, David Clarke, Peter McColgan, Sneha Date, Zachary Dickman, Pedro
More informationAdaptable Intelligence The Next Computing Era
Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx Pervasive Intelligence from Cloud to Edge to Endpoints >> 1 Exponential Growth and Opportunities Data Explosion
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame
More informationMaximizing heterogeneous system performance with ARM interconnect and CCIX
Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable
More information借助 SDSoC 快速開發複雜的嵌入式應用
借助 SDSoC 快速開發複雜的嵌入式應用 May 2017 What Is C/C++ Development System-level Profiling SoC application-like programming Tools and IP for system-level profiling Specify C/C++ Functions for Acceleration Full System
More informationAdaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018
Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous
More informationSimplifying FPGA Design for SDR with a Network on Chip Architecture
Simplifying FPGA Design for SDR with a Network on Chip Architecture Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 RF NoC 3 Status and Conclusions USRP FPGA Capability Gen
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationA Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013
A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationC-Based Hardware Design Platform for Dynamically Reconfigurable Processor
C-Based Hardware Design Platform for Dynamically Reconfigurable Processor September 22 nd, 2005 IPFlex Inc. Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW
More informationRFNoC : RF Network on Chip Martin Braun, Jonathon Pendlum GNU Radio Conference 2015
RFNoC : RF Network on Chip Martin Braun, Jonathon Pendlum GNU Radio Conference 2015 Outline Motivation Current situation Goal RFNoC Basic concepts Architecture overview Summary No Demo! See our booth,
More informationFPGA Entering the Era of the All Programmable SoC
FPGA Entering the Era of the All Programmable SoC Ivo Bolsens, Senior Vice President & CTO Page 1 Moore s Law: The Technology Pipeline Page 2 Industry Debates on Cost Page 3 Design Cost Estimated Chip
More informationOvercoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics
Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing
More informationSoftware Defined Modem A commercial platform for wireless handsets
Software Defined Modem A commercial platform for wireless handsets Charles F Sturman VP Marketing June 22 nd ~ 24 th Brussels charles.stuman@cognovo.com www.cognovo.com Agenda SDM Separating hardware from
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationSimplify System Complexity
Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint
More informationHow to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs
Delivering a Generation Ahead How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs Agenda Introduction to Mobile Network Introduction to Xilinx Solution
More informationTile Processor (TILEPro64)
Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth
More informationUniversal Hardware Platform for
Universal Hardware Platform for OpenAir5G OpenAir5G Daqing Li From BUPT 1. Preface 2. Universal Platform for Baseband CONTENT Unit(BBU) 3. Universal Platform for Remote Radio Unit(RRU) 4. Accelerate Cards
More informationOptimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.
Optimizing HW/SW Partition of a Complex Embedded Systems Simon George November 2015 Zynq-7000 All Programmable SoC HP ACP GP Page 2 Zynq UltraScale+ MPSoC Page 3 HW/SW Optimization Challenges application()
More informationNew! New! New! New! New!
New! New! New! New! New! Model 5950 Features Supports Xilinx Zynq UltraScale+ RFSoC FPGAs 18 GB of DDR4 SDRAM On-board GPS receiver PCI Express (Gen. 1, 2 and 3) interface up to x8 LVDS connections to
More informationThird Genera+on USRP Devices and the RF Network- On- Chip. Leif Johansson Market Development RF, Comm and SDR
Third Genera+on USRP Devices and the RF Network- On- Chip Leif Johansson Market Development RF, Comm and SDR About Ettus Research Leader in soeware defined radio and signals intelligence Maker of USRP
More informationThe WINLAB Cognitive Radio Platform
The WINLAB Cognitive Radio Platform IAB Meeting, Fall 2007 Rutgers, The State University of New Jersey Ivan Seskar Software Defined Radio/ Cognitive Radio Terminology Software Defined Radio (SDR) is any
More informationBuilding blocks for 64-bit Systems Development of System IP in ARM
Building blocks for 64-bit Systems Development of System IP in ARM Research seminar @ University of York January 2015 Stuart Kenny stuart.kenny@arm.com 1 2 64-bit Mobile Devices The Mobile Consumer Expects
More informationMultimedia in Mobile Phones. Architectures and Trends Lund
Multimedia in Mobile Phones Architectures and Trends Lund 091124 Presentation Henrik Ohlsson Contact: henrik.h.ohlsson@stericsson.com Working with multimedia hardware (graphics and displays) at ST- Ericsson
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationArm s First-Generation Machine Learning Processor
Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive
More informationZynq-7000 All Programmable SoC Product Overview
Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform
More informationEttus Research Update
Ettus Research Update Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 Recent New Products 3 Third Generation Introduction Who am I? Core GNU Radio contributor since 2001 Designed
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationSimplify System Complexity
1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller
More informationXMC-RFSOC-A. XMC Module Xilinx Zynq UltraScale+ RFSOC. Overview. Key Features. Typical Applications. Advanced Information Subject To Change
Advanced Information Subject To Change XMC-RFSOC-A XMC Module Xilinx Zynq UltraScale+ RFSOC Overview PanaTeQ s XMC-RFSOC-A is a XMC module based on the Zynq UltraScale+ RFSoC device from Xilinx. The Zynq
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationXilinx DNN Processor An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs
ilinx DNN Proceor An Inference Engine, Network Compiler Runtime for ilinx FPGA Rahul Nimaiyar, Brian Sun, Victor Wu, Thoma Branca, Yi Wang, Jutin Oo, Elliott Delaye, Aaron Ng, Paolo D'Alberto, Sean Settle,
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationInference
Inference Architectures @Xilinx Graham Schelle, PhD Principal Engineer Xilinx Research Labs Xilinx Headlines!2 Twitch Chooses Xilinx to Enable its Broadcast-quality Livestream of esports Agenda Xilinx
More informationSDSoC: Session 1
SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the
More informationReconfigurable Cell Array for DSP Applications
Outline econfigurable Cell Array for DSP Applications Chenxin Zhang Department of Electrical and Information Technology Lund University, Sweden econfigurable computing Coarse-grained reconfigurable cell
More informationHigher Level Programming Abstractions for FPGAs using OpenCL
Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*
More informationECE 8823: GPU Architectures. Objectives
ECE 8823: GPU Architectures Introduction 1 Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) 2 1 Chapter 1 Chapter 2: 2.2, 2.3 Reading
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationImplementing Long-term Recurrent Convolutional Network Using HLS on POWER System
Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More informationHigh Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx
High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014 Not a Complete Product Overview Page 2 Outline Page 3 Petabytes per month Increasing Bandwidth Global IP Traffic Growth
More informationCopyright 2016 Xilinx
Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building
More informationScaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research
Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)
More informationINSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing
UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 22 Title: and Extended
More informationThreading Hardware in G80
ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationToward a Memory-centric Architecture
Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains
More informationDeep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm
Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional
More informationOCP Engineering Workshop - Telco
OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationAltera SDK for OpenCL
Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group
More informationMassively Parallel Processor Breadboarding (MPPB)
Massively Parallel Processor Breadboarding (MPPB) 28 August 2012 Final Presentation TRP study 21986 Gerard Rauwerda CTO, Recore Systems Gerard.Rauwerda@RecoreSystems.com Recore Systems BV P.O. Box 77,
More informationWhen MPPDB Meets GPU:
When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU
More informationOptimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs
Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs Niu Feng Technical Specialist, ARM Tech Symposia 2016 Agenda Introduction Challenges: Optimizing cache coherent subsystem
More informationProcessor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor
More informationChallenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008
Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated
More informationSoftware Defined Modems for The Internet of Things. Dr. John Haine, IP Operations Manager
Software Defined Modems for The Internet of Things Dr. John Haine, IP Operations Manager www.cognovo.com What things? 20 billion connected devices Manufactured for global markets Low cost Lifetimes from
More informationFundamental CUDA Optimization. NVIDIA Corporation
Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control
More informationDNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses
DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston,
More informationCCIX: a new coherent multichip interconnect for accelerated use cases
: a new coherent multichip interconnect for accelerated use cases Akira Shimizu Senior Manager, Operator relations Arm 2017 Arm Limited Arm 2017 Interconnects for different scale SoC interconnect. Connectivity
More informationGRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens? Seymour Cray GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens Jan Gray jan@fpga.org http://fpga.org
More informationZynq Ultrascale+ Architecture
Zynq Ultrascale+ Architecture Stephanie Soldavini and Andrew Ramsey CMPE-550 Dec 2017 Soldavini, Ramsey (CMPE-550) Zynq Ultrascale+ Architecture Dec 2017 1 / 17 Agenda Heterogeneous Computing Zynq Ultrascale+
More informationAn 80-core GRVI Phalanx Overlay on PYNQ-Z1:
An 80-core GRVI Phalanx Overlay on PYNQ-Z1: Pynq as a High Productivity Platform For FPGA Design and Exploration Jan Gray jan@fpga.org http://fpga.org/grvi-phalanx FCCM 2017 05/03/2017 Pynq Workshop My
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More informationComputer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Jim Heaton Sr. FAE Deep Learning explores the study of algorithms that can learn from and make predictions on data Deep Learning is Re-defining Many Applications Cloud Acceleration
More informationResearch Faculty Summit Systems Fueling future disruptions
Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning
More informationSystem-on-Chip Architecture for Mobile Applications. Sabyasachi Dey
System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationCombining Arm & RISC-V in Heterogeneous Designs
Combining Arm & RISC-V in Heterogeneous Designs Gajinder Panesar, CTO, UltraSoC gajinder.panesar@ultrasoc.com RISC-V Summit 3 5 December 2018 Santa Clara, USA Problem statement Deterministic multi-core
More informationCS427 Multicore Architecture and Parallel Computing
CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationWorld s most advanced data center accelerator for PCIe-based servers
NVIDIA TESLA P100 GPU ACCELERATOR World s most advanced data center accelerator for PCIe-based servers HPC data centers need to support the ever-growing demands of scientists and researchers while staying
More informationCSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller
Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,
More informationComputer Architecture. Fall Dongkun Shin, SKKU
Computer Architecture Fall 2018 1 Syllabus Instructors: Dongkun Shin Office : Room 85470 E-mail : dongkun@skku.edu Office Hours: Wed. 15:00-17:30 or by appointment Lecture notes nyx.skku.ac.kr Courses
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More informationNext Generation Enterprise Solutions from ARM
Next Generation Enterprise Solutions from ARM Ian Forsyth Director Product Marketing Enterprise and Infrastructure Applications Processor Product Line Ian.forsyth@arm.com 1 Enterprise Trends IT is the
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More informationUnified Deep Learning with CPU, GPU, and FPGA Technologies
Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine
More informationRecurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks
Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and
More informationNew! New! New! New! New!
New! New! New! New! New! Model 5950 Features Supports Xilinx Zynq UltraScale+ RFSoC FPGAs 18 GB of DDR4 SDRAM On-board GPS receiver PCI Express (Gen. 1, 2 and 3) interface up to x8 LVDS connections to
More informationGRVI Phalanx. A Massively Parallel RISC-V FPGA Accelerator Accelerator. Jan Gray
GRVI Phalanx A Massively Parallel RISC-V FPGA Accelerator Accelerator Jan Gray jan@fpga.org Introduction FPGA accelerators are hot MSR Catapult. Intel += Altera. OpenPOWER + Xilinx FPGAs as computers Massively
More informationEE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin
EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel
More informationHigh Performance Embedded Applications. Raja Pillai Applications Engineering Specialist
High Performance Embedded Applications Raja Pillai Applications Engineering Specialist Agenda What is High Performance Embedded? NI s History in HPE FlexRIO Overview System architecture Adapter modules
More informationCOSMOS Architecture and Key Technologies. June 1 st, 2018 COSMOS Team
COSMOS Architecture and Key Technologies June 1 st, 2018 COSMOS Team COSMOS: System Architecture (2) System design based on three levels of SDR radio node (S,M,L) with M,L connected via fiber to optical
More informationProfiling and Debugging OpenCL Applications with ARM Development Tools. October 2014
Profiling and Debugging OpenCL Applications with ARM Development Tools October 2014 1 Agenda 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline
More informationThe Processor: Instruction-Level Parallelism
The Processor: Instruction-Level Parallelism Computer Organization Architectures for Embedded Computing Tuesday 21 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy
More informationFCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA
1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,
More informationHEAD HardwarE Accelerated Deduplication
HEAD HardwarE Accelerated Deduplication Final Report CS710 Computing Acceleration with FPGA December 9, 2016 Insu Jang Seikwon Kim Seonyoung Lee Executive Summary A-Z development of deduplication SW version
More informationP51: High Performance Networking
P51: High Performance Networking Lecture 6: Programmable network devices Dr Noa Zilberman noa.zilberman@cl.cam.ac.uk Lent 2017/18 High Throughput Interfaces Performance Limitations So far we discussed
More informationHow to validate your FPGA design using realworld
How to validate your FPGA design using realworld stimuli Daniel Clapham National Instruments ni.com Agenda Typical FPGA Design NIs approach to FPGA Brief intro into platform based approach RIO architecture
More information