Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

1 Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm. Engineering Director, Xilinx Silicon Architecture Group. Presented by Kees Vissers, Fellow. February 25, FPGA 2019.

2 Technology scaling coming to an end. Processing architectures are not scaling; a single architecture can't do it alone. [Figure: years of processor performance, plotted as performance vs. VAX. Growth-rate annotations: CISC, 2X / 3.5 years; RISC, 2X / 1.5 years; end of Dennard scaling, 2X / 3.5 years; Amdahl's Law, 2X / 6 years. Source: John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2018.] Workload classes listed alongside: safety processing or latency-critical workloads; domain-specific parallelism (e.g., video, ML); whole application; irregular data types, instruction sets, data operations; sensor fusion, pre-processing, data aggregation; complex algorithms, full Linux services.

3 Need for a New Programming Paradigm. Software developer: needs agility and abstraction; ecosystem of libraries. Need a scalable, unified platform. Hardware developer: needs flexibility to optimize for performance/power; modify, design, add code.

4 Versal Architecture Overview. Adaptable Engines: 2X compute density. Scalar Engines: platform control, edge compute. Protocol Engines: integrated 600G cores, 4X encrypted bandwidth. Engines: compute, diverse DSP workloads. Network-on-Chip: guaranteed bandwidth, enables SW programmability. Programmable I/O: any interface or sensor, includes 4.2Gb/s MIPI. DDR: 3200-DDR4, 3200-LPDDR4, 2X bandwidth/pin. Transceivers: broad range, 25G to 112G, 58G in mainstream devices. PCIe & CCIX: 2X PCIe & DMA bandwidth, cache-coherent interface to accelerators.

5 Overview. Adaptable Engines: Brian Gaide (9.15 today, directly following this talk). Network on Chip: Ian Swarbrick (9.45 Tuesday). Rest of this talk: Adaptable Intelligent Engines, new processors + interconnect.

6 Motivation for Engine

7 Motivation for Engine. Technology scaling: Moore's Law; performance & power scaling of traditional single / multi-core. Applications: 5G, ADAS / AD, smart city, smart factory, machine learning, data center workloads. Requirements: compute intensity, real-time capability everywhere, power efficiency. Dynamic markets require adaptable compute acceleration.

8 Delivering Adaptable Compute Acceleration CPU (Sequential) GPU (Parallel) ACAP Custom ASIC Engines SW Programmable HW Adaptable Workload Flexibility Throughput vs. Latency Device / Power Efficiency Development Time & Complexity ACAP w/ Engine Weeks Months Years Page 8

9 Introducing the Engine. SW programmable, deterministic, efficient. 1GHz+ multi-precision vector processor; high-bandwidth extensible memory; up to 400 engines per device. 8X compute density; 40% lower power. Domains: artificial intelligence, signal processing, computer vision; CNN, LSTM / MLP. Adaptable. Intelligent.

10 Software Programmable: Any Developer 1 Design Run 3 C/C++ C/C++ Frameworks Programming Abstraction Levels 4G/5G/Radar Library Library Vision Library Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries 2 Compile Engine Compiler Page 10

11 Hardware Adaptable: Accelerating the Whole Application Scalar, Sequential & Complex Compute Scalar Arm Dual- Cortex-A72 Arm Dual- Cortex-R5 Flexible Parallel Compute, Data manipulation NETWORK-ON-CHIP I/O Adaptable ML & Signal Processing Vector, Compute Intensive Intelligent Engines 160 GB/s of B/W per Heterogeneous Acceleration from Data Center to the Edge Video + Genomics + Risk Modeling + Database + Network IPS + Storage + Any-to-Any Connectivity Custom Hierarchy TB/s of Bandwidth PL-to- Engine Delivering Deterministic Performance & Low Latency Page 11

12 Engine Application Performance & Power Efficiency. Xilinx UltraScale+ vs. Xilinx Versal w/ Engine. Inference compute: 10x on image classification (GoogleNet v1, <1ms). 5G wireless bandwidth: 5x on massive MIMO radio (DUC, DDC, CFR, DPD). Power consumption: 40% less power.

13 Engine Architecture, Programming & Applications

14 Engine: Tile-Based Architecture Non-Blocking Interconnect high GB/s bandwidth per tile PS I/O PL Interconnect Local Multi-bank implementation Shared across neighbor cores Local ISA-based Vector Processor Vector Extensions ISA-based Vector Processor Software Programmable (e.g., C/C++) Data Mover 5G Vector Extensions Cascade Interface Partial results to next core Data Mover Non-neighbor data communication Integrated synchronization primitives Page 14

15 Engine: Array Architecture (diagram labels: PS, PL, I/O). Array of engines: increase in compute, memory and communication bandwidth. Modular and scalable architecture: more tiles = more compute; up to 400 per device (Versal VC1902 device). Distributed memory hierarchy: maximize memory bandwidth. Deterministic performance & low latency.

16 Engine: Processor. 32-bit scalar RISC processor: scalar register file; scalar unit with scalar ALU and non-linear functions. Local, shareable memory: 32KB local, 128KB addressable. Memory interface: three AGUs, load unit A, load unit B, store unit. Vector processor, 512-bit SIMD datapath: vector register file; vector unit with fixed-point vector unit and floating-point vector unit. Instruction fetch & decode unit; stream interface. Highly parallel. Instruction parallelism (VLIW): 7+ operations / clock cycle (2 vector loads / 1 mult / 1 store / 2 scalar ops / stream access). Data parallelism (SIMD): multiple vector lanes; vector datapath with 8 / 16 / 32-bit & SPFP operands; up to 128 MACs / clock cycle per core (INT 8).

17 Multi-Precision Support. Two tables (data types; signal processing data types), each with columns "data type" and "MACs / cycle (per core)". Types listed: x32 SPFP; 32x32 real; 32x16 real; 16x16 real; 16x8 real; 8x8 real; 32x32 complex; 32x16 complex; 16x16 complex; 16 complex x 16 real.

18 Data Movement Architecture Communication Streaming Communication Dataflow Pipeline B0 B1 B2 B3 Non- Neighbor Dataflow Graph Mem Mem Mem Mem Streaming Multicast Mem Mem Interface Cascade Streaming Stream Interface Page 18 Cascade Interface

19 Engine Integration with Versal ACAP. TB/s of interface bandwidth: engine to programmable logic; engine to NOC. Engine interface tiles (diagram: switch, async CDC, DMA, AXI-S, AXI-MM). Leveraging NOC connectivity: PS manages config / debug / trace; engine to DRAM (no PL req'd). Other diagram labels: PS, I/O, PL, PS / PMC, NOC, ext. DRAM, programmable logic, PL function.

20 Engine: Multi- Compute with dedicated memory. Traditional multi-core (cache-based architecture; diagram: cores with L0, L1, L2 caches and DRAM): data replicated, which robs bandwidth and reduces capacity; fixed, shared interconnect, where blocking limits compute and timing is not deterministic. Engine array (intelligent engine; diagram: cores with dedicated MEM blocks): dedicated interconnect, non-blocking and deterministic; local, distributed memory with no cache misses, higher bandwidth, less capacity required.

21 Engine Delivers High Compute Efficiency. Adaptable, non-blocking interconnect: flexible data movement architecture; avoids interconnect bottlenecks. Adaptable memory hierarchy: local, distributed, shareable = extreme bandwidth; no cache misses or data replication; extend to PL memory (BRAM, URAM). Overlap compute and communication: transfer data while the engine computes. Vector processor efficiency (kernel performance relative to peak theoretical performance): ML convolutions, block-based matrix multiplication (32×64) × (64×32), 95%; FFT, 1024-pt FFT/iFFT, 98%; DPD, Volterra-based forward-path DPD, 80%.
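The "overlap compute and communication" point on this slide can be illustrated with a small timing model. The sketch below is not taken from the talk: it assumes a simple two-buffer (ping-pong) scheme in which the transfer of block i+1 runs while block i is being computed, and the function names and time units are hypothetical.

```python
def serial_time(comm, compute):
    """Total time when every transfer finishes before its compute starts."""
    return sum(comm) + sum(compute)


def overlapped_time(comm, compute):
    """Total time with two buffers: block i+1 is transferred while block i computes.

    Assumption (not from the slides): each step ends only when both the
    current compute and the next transfer have finished.
    """
    if len(comm) != len(compute):
        raise ValueError("need one transfer time per compute time")
    if not comm:
        return 0
    total = comm[0]  # the first block must arrive before any compute can start
    for i, work in enumerate(compute):
        next_transfer = comm[i + 1] if i + 1 < len(comm) else 0
        total += max(work, next_transfer)
    return total


if __name__ == "__main__":
    comm = [2, 2, 2]       # hypothetical transfer times per block
    compute = [5, 5, 5]    # hypothetical compute times per block
    print("serial:    ", serial_time(comm, compute))
    print("overlapped:", overlapped_time(comm, compute))
    print("compute utilisation:", sum(compute) / overlapped_time(comm, compute))
```

In this model, when compute time per block exceeds transfer time, only the first transfer is exposed, which is one way a kernel could approach the high efficiency figures quoted above.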

22 Engine Programming Experience: Dataflow Model. 1. User defines dataflow logic. 2. User describes the dataflow graph using C/C++ APIs. 3. Compiler transparently manages placement & interconnect (physical mapping to engines and PL). Diagram: graph nodes a, b, c, d, e mapped onto vector processors.
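The dataflow model on this slide can be sketched in a few lines. The slide names graph nodes a to e but the transcript does not preserve their wiring, so the edges and kernel bodies below are hypothetical, and Python is used only for illustration (the slide states that the real graph is described with C/C++ APIs). The sketch requires Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical kernels: each is a plain function of its upstream results.
KERNELS = {
    "a": lambda x: x + 1,
    "b": lambda x: x * 2,
    "c": lambda x: x - 3,
    "d": lambda x: x * x,
    "e": lambda left, right: left + right,
}

# Hypothetical wiring: node -> ordered list of upstream nodes.
EDGES = {"a": [], "b": ["a"], "c": ["b"], "d": ["a"], "e": ["c", "d"]}


def run_graph(source_value):
    """Run every kernel once, in an order that respects the graph edges."""
    order = TopologicalSorter({n: set(p) for n, p in EDGES.items()}).static_order()
    results = {}
    for node in order:
        # A node without upstream nodes reads the external input instead.
        inputs = [results[p] for p in EDGES[node]] or [source_value]
        results[node] = KERNELS[node](*inputs)
    return results


if __name__ == "__main__":
    print(run_graph(1))
```

The user-facing description stops at kernels and edges; where each kernel runs and how the edges are routed is the part the slide assigns to the compiler.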

23 Versal ACAP Development Tools: TOOLS Frameworks New Unified Software Development Environment Vivado Design Suite USER and Data Scientists Software Application Developers Hardware Developers SUPPORTED FRAMEWORKS Page 23

24 Software Development Environment Application (e.g. C/C++) Performance Constraints New Unified SW Development Environment Scalar Adaptable Intelligent Unified development environment Full chip programming Processing Sub-system Programmable Logic Engines SW programmable for whole application Heterogeneous SW acceleration System Simulation Hardware Full system simulation, debug & profiling Software development experience System Debug & Profiling Page 24

25 Engine Programming Environment Application (e.g. C/C++) New Unified SW Development Environment PS PL Engines Full SW Programming Tool Chain (Single-engine and Multi-engine) IDE Compiler Debugger Performance Analysis Performance-Optimized Software Libraries (Examples) 4G/5G/Radar Library Library Vision Library Run-Time Software (Examples) Error Management Management Boot + Configuration Power/Thermal Management Page 25

26 Frameworks for Any Developer Domain Specific Architecture (e.g. Inference) Architecture Overlay Data Flow w/ Xilinx libraries Kernel Program Data Flow w/ user defined libraries Target Domain Specific Architectures No HW Design Experience Required Page 26

27 Accelerating Inference in the Data Center. 1. User works in framework of choice (deep learning frameworks): develop & train custom network; user provides trained model. 2. Xilinx DNN Compiler implements network: targets the Xilinx inference domain specific architecture; quantize, merge layers, prune; compile to engines. 3. Scalable across hardware targets: start with Alveo today (Alveo U200 / U250 / U280); new Versal-based acceleration cards.

28 Inference on Versal ACAP. Layer types: convolutions; fully connected layers; pooling (illustrated on a single depth slice); activations. ReLU: $y_i = 0$ for $x_i < 0$ and $y_i = x_i$ otherwise. PReLU: $y_i = a_i x_i$ for $x_i < 0$ and $y_i = x_i$ otherwise. Diagram labels: engines; processing system; programmable logic; I/O (GT, ADC/DAC); custom memory hierarchy; video, genomics, storage, database, network IPS, risk modeling. Feature map data volume* (*Figure credit:)
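The two activation functions named on this slide are simple enough to state as code. This is a plain-Python sketch of the standard ReLU and PReLU definitions, with hypothetical function names; it says nothing about how the device implements them.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp everything else to zero."""
    return x if x > 0 else 0.0


def prelu(x, a):
    """PReLU: like ReLU, but negative inputs are scaled by a learned slope a."""
    return x if x > 0 else a * x


def relu_layer(xs):
    """Apply ReLU element-wise to a list of activations."""
    return [relu(x) for x in xs]


def prelu_layer(xs, slopes):
    """Apply PReLU element-wise, with one slope a_i per element x_i."""
    if len(xs) != len(slopes):
        raise ValueError("need one slope per input element")
    return [prelu(x, a) for x, a in zip(xs, slopes)]


if __name__ == "__main__":
    print(relu_layer([-1.0, 0.0, 2.0]))
    print(prelu_layer([-2.0, 3.0], [0.25, 0.25]))
```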

29 Inference Mapping on Versal ACAP. A = activations, W = weights. $\begin{bmatrix} A_{00} & A_{01} \\ A_{10} & A_{11} \end{bmatrix} \begin{bmatrix} W_{00} & W_{01} \\ W_{10} & W_{11} \end{bmatrix} = \begin{bmatrix} A_{00} W_{00} + A_{01} W_{10} & \cdots \\ A_{10} W_{00} + A_{11} W_{10} & \cdots \end{bmatrix}$. Shape example: (4x8) × (8x4) = (4x4). Scalar engines: Arm dual Cortex-A72, Arm dual Cortex-R5. Adaptable engines (PL): weight buffer (URAM), activation buffer (URAM), max pool. Intelligent engines: convolution layers, fully connected layers, ReLU; engines linked by cascade stream. Network-on-chip, I/O, external memory (e.g., DDR). Custom memory hierarchy: buffer on-chip vs off-chip; reduce latency and power. Stream multi-cast on interconnect: weights and activations are read once, which reduces memory bandwidth. Optimized vector instructions (128 INT8 mults/cycle).
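The mapping on this slide, where engines pass partial results along a cascade stream, can be sketched as a matrix multiply whose inner dimension is split across engines. This is an illustrative model only: the function names are hypothetical, the split along the inner dimension is an assumption about how the (4x8) × (8x4) example is partitioned, and plain Python lists stand in for the vector hardware.

```python
def partial_product(a_block, w_block, acc):
    """One engine's work: add its slice of the product to the incoming accumulator."""
    rows, inner, cols = len(a_block), len(w_block), len(w_block[0])
    return [
        [acc[i][j] + sum(a_block[i][k] * w_block[k][j] for k in range(inner))
         for j in range(cols)]
        for i in range(rows)
    ]


def cascaded_matmul(a, w, num_engines):
    """Multiply a (MxK) by w (KxN), splitting K evenly across a chain of engines."""
    k_total = len(w)
    if k_total % num_engines != 0:
        raise ValueError("inner dimension must divide evenly across engines")
    k_per_engine = k_total // num_engines
    acc = [[0] * len(w[0]) for _ in range(len(a))]
    for engine in range(num_engines):
        lo, hi = engine * k_per_engine, (engine + 1) * k_per_engine
        a_block = [row[lo:hi] for row in a]   # activations for this K slice
        w_block = w[lo:hi]                    # weights for this K slice
        acc = partial_product(a_block, w_block, acc)  # hand-off to the next engine
    return acc


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    w = [[5, 6], [7, 8]]
    print(cascaded_matmul(a, w, num_engines=2))
```

With two engines and 2x2 inputs, the first engine contributes the A00·W00 term and the second adds A01·W10, matching the first entry of the equation on the slide.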

30 Projected Performance: Engine Delivers Real-time Inference Leadership (75W power envelope). Low-latency CNN throughput: Versal device (2) at 4X next-gen GPU (1). Note: Versal device achieves 8X performance increase in 150W power envelope. (1) 12-nanometer T4 GPU device, projected batch=1 performance based on currently available vendor benchmarks. (2) 7-nanometer Versal Series VC1902 device, 75W card power figures based on XPE power estimates, latency <500us.

31 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array Market Requirements and Trends: Wireless 5G 5G Complexity is 100X that of 4G Still Evolving Standard New Technologies in 5G Massive MIMO Multiple antenna, frequency bands Changing functional partitioning ETRI RWS , 5G Vision and Enabling Technologies: ETRI Perspective 3GPP RAN Workshop Phoenix, Dec Transport & CTRL L2 L7 Modulation & FEC IQ Switch Linear Algebra ifft/ FFT DUC, CFR, DPD, DDC PA, LNA, Diplexer Page 31

32 Packet Processing and Wired Backhaul Higher Layer Processing Baseband Processing Switching Beam Forming & MMIO + Some Baseband Transforms Digital Radio ADC / DAC Analogue Radio Antenna Array 5G Wireless on Versal ACAP 5G Wireless Infrastructure (i.e., base-station) Digital Radio with ADC/DAC Compute Maps to Engine Mapping Example CPRI DUC DPD Update DPD ADC/ DAC Control Maps to PS Processing System DPD Update Engines DUC DPD Programmable Logic I/O ADC/DAC CPRI 1: DUC: Digital Up Converter 2: DPD: Digital Pre-Distortion 3: Direct RF: ADC/DAC 4: CPRI: Common Public Radio Interface I/O Maps to PL Page 32

33 Engine: Accelerating Inference & Signal Processing. 10x inference; 5x signal processing. Software programmable: frameworks & C/C++; SW compile, debug & deploy. Deterministic: max throughput w/ low latency; real-time inference leadership. Efficient: up to 8X compute density at ~40% lower power.

34 VC1902: 133 TOPS (INT8 peak)
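The peak figure can be cross-checked against numbers quoted earlier in the deck: up to 400 engines per device, up to 128 INT8 MACs per clock cycle per core, and a clock of "1GHz+". The sketch below is a back-of-envelope calculation, not a vendor formula. It assumes that one MAC counts as two operations (multiply plus add), and it derives the clock that the 133 TOPS figure would imply rather than taking a clock from the slides.

```python
ENGINES_PER_DEVICE = 400     # "up to 400 per device" (slides 9 and 15)
INT8_MACS_PER_CYCLE = 128    # "up to 128 MACs / clock cycle" at INT8 (slide 16)
OPS_PER_MAC = 2              # assumption: 1 MAC = 1 multiply + 1 add
QUOTED_PEAK_TOPS = 133       # "VC1902: 133 TOPS (INT8 peak)" (slide 34)


def ops_per_cycle():
    """INT8 operations the whole array could issue in one clock cycle."""
    return ENGINES_PER_DEVICE * INT8_MACS_PER_CYCLE * OPS_PER_MAC


def peak_tops(clock_ghz):
    """Peak tera-operations per second at a given engine clock."""
    return ops_per_cycle() * clock_ghz * 1e9 / 1e12


def implied_clock_ghz(tops):
    """Engine clock that a quoted peak TOPS figure would imply."""
    return tops * 1e12 / ops_per_cycle() / 1e9


if __name__ == "__main__":
    print(f"{peak_tops(1.0):.1f} TOPS at 1.0 GHz")
    print(f"{implied_clock_ghz(QUOTED_PEAK_TOPS):.2f} GHz implied by {QUOTED_PEAK_TOPS} TOPS")
```

Under these assumptions the quoted peak corresponds to a clock of roughly 1.3 GHz, which is consistent with the "1GHz+" statement on slide 9.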

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

HW/SW Programmable Engine:

HW/SW Programmable Engine: HW/SW Programmable Engine: Domain Specific Architecture for Project Everest Juanjo Noguera, Goran Bilski, Jan Langer, Baris Ozgul, Tim Tuan, David Clarke, Peter McColgan, Sneha Date, Zachary Dickman, Pedro

More information

Adaptable Intelligence The Next Computing Era

Adaptable Intelligence The Next Computing Era Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx Pervasive Intelligence from Cloud to Edge to Endpoints >> 1 Exponential Growth and Opportunities Data Explosion

More information

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Maximizing heterogeneous system performance with ARM interconnect and CCIX Maximizing heterogeneous system performance with ARM interconnect and CCIX Neil Parris, Director of product marketing Systems and software group, ARM Teratec June 2017 Intelligent flexible cloud to enable

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Zynq-7000 All Programmable SoC Product Overview

Zynq-7000 All Programmable SoC Product Overview Zynq-7000 All Programmable SoC Product Overview The SW, HW and IO Programmable Platform August 2012 Copyright 2012 2009 Xilinx Introducing the Zynq -7000 All Programmable SoC Breakthrough Processing Platform

More information

FPGA Entering the Era of the All Programmable SoC

FPGA Entering the Era of the All Programmable SoC FPGA Entering the Era of the All Programmable SoC Ivo Bolsens, Senior Vice President & CTO Page 1 Moore s Law: The Technology Pipeline Page 2 Industry Debates on Cost Page 3 Design Cost Estimated Chip

More information

借助 SDSoC 快速開發複雜的嵌入式應用

借助 SDSoC 快速開發複雜的嵌入式應用 借助 SDSoC 快速開發複雜的嵌入式應用 May 2017 What Is C/C++ Development System-level Profiling SoC application-like programming Tools and IP for system-level profiling Specify C/C++ Functions for Acceleration Full System

More information

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018 Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

Building blocks for 64-bit Systems Development of System IP in ARM

Building blocks for 64-bit Systems Development of System IP in ARM Building blocks for 64-bit Systems Development of System IP in ARM Research seminar @ University of York January 2015 Stuart Kenny stuart.kenny@arm.com 1 2 64-bit Mobile Devices The Mobile Consumer Expects

More information

Tile Processor (TILEPro64)

Tile Processor (TILEPro64) Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth

More information

CCIX: a new coherent multichip interconnect for accelerated use cases

CCIX: a new coherent multichip interconnect for accelerated use cases : a new coherent multichip interconnect for accelerated use cases Akira Shimizu Senior Manager, Operator relations Arm 2017 Arm Limited Arm 2017 Interconnects for different scale SoC interconnect. Connectivity

More information

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics

Overcoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing

More information

Toward a Memory-centric Architecture

Toward a Memory-centric Architecture Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains

More information

Next Generation Enterprise Solutions from ARM

Next Generation Enterprise Solutions from ARM Next Generation Enterprise Solutions from ARM Ian Forsyth Director Product Marketing Enterprise and Infrastructure Applications Processor Product Line Ian.forsyth@arm.com 1 Enterprise Trends IT is the

More information

Simplify System Complexity

Simplify System Complexity Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint

More information

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013 A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company

More information

C-Based Hardware Design Platform for Dynamically Reconfigurable Processor

C-Based Hardware Design Platform for Dynamically Reconfigurable Processor C-Based Hardware Design Platform for Dynamically Reconfigurable Processor September 22 nd, 2005 IPFlex Inc. Agenda Merits of C-Based hardware design Hardware enabling C-Based hardware design DAPDNA-FW

More information

Simplifying FPGA Design for SDR with a Network on Chip Architecture

Simplifying FPGA Design for SDR with a Network on Chip Architecture Simplifying FPGA Design for SDR with a Network on Chip Architecture Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 RF NoC 3 Status and Conclusions USRP FPGA Capability Gen

More information

Ettus Research Update

Ettus Research Update Ettus Research Update Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 Recent New Products 3 Third Generation Introduction Who am I? Core GNU Radio contributor since 2001 Designed

More information

Software Defined Modem A commercial platform for wireless handsets

Software Defined Modem A commercial platform for wireless handsets Software Defined Modem A commercial platform for wireless handsets Charles F Sturman VP Marketing June 22 nd ~ 24 th Brussels charles.stuman@cognovo.com www.cognovo.com Agenda SDM Separating hardware from

More information

Third Genera+on USRP Devices and the RF Network- On- Chip. Leif Johansson Market Development RF, Comm and SDR

Third Genera+on USRP Devices and the RF Network- On- Chip. Leif Johansson Market Development RF, Comm and SDR Third Genera+on USRP Devices and the RF Network- On- Chip Leif Johansson Market Development RF, Comm and SDR About Ettus Research Leader in soeware defined radio and signals intelligence Maker of USRP

More information

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman

Maximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael

More information

OCP Engineering Workshop - Telco

OCP Engineering Workshop - Telco OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,

More information

Simplify System Complexity

Simplify System Complexity 1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller

More information

The WINLAB Cognitive Radio Platform

The WINLAB Cognitive Radio Platform The WINLAB Cognitive Radio Platform IAB Meeting, Fall 2007 Rutgers, The State University of New Jersey Ivan Seskar Software Defined Radio/ Cognitive Radio Terminology Software Defined Radio (SDR) is any

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs

How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs Delivering a Generation Ahead How to Efficiently Implement Flexible and Full-Featured Digital Radio Solutions Using All Programmable SoCs Agenda Introduction to Mobile Network Introduction to Xilinx Solution

More information

Arm s First-Generation Machine Learning Processor

Arm s First-Generation Machine Learning Processor Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive

More information

IBM Cell Processor. Gilbert Hendry Mark Kretschmann

IBM Cell Processor. Gilbert Hendry Mark Kretschmann IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:

More information

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx

High Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014 Not a Complete Product Overview Page 2 Outline Page 3 Petabytes per month Increasing Bandwidth Global IP Traffic Growth

More information

RFNoC : RF Network on Chip Martin Braun, Jonathon Pendlum GNU Radio Conference 2015

RFNoC : RF Network on Chip Martin Braun, Jonathon Pendlum GNU Radio Conference 2015 RFNoC : RF Network on Chip Martin Braun, Jonathon Pendlum GNU Radio Conference 2015 Outline Motivation Current situation Goal RFNoC Basic concepts Architecture overview Summary No Demo! See our booth,

More information

Altera SDK for OpenCL

Altera SDK for OpenCL Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 22 Title: and Extended

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

Higher Level Programming Abstractions for FPGAs using OpenCL

Higher Level Programming Abstractions for FPGAs using OpenCL Higher Level Programming Abstractions for FPGAs using OpenCL Desh Singh Supervising Principal Engineer Altera Corporation Toronto Technology Center ! Technology scaling favors programmability CPUs."#/0$*12'$-*

More information

SDSoC: Session 1

SDSoC: Session 1 SDSoC: Session 1 ADAM@ADIUVOENGINEERING.COM What is SDSoC SDSoC is a system optimising compiler which allows us to optimise Zynq PS / PL Zynq MPSoC PS / PL MicroBlaze What does this mean? Following the

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

Copyright 2016 Xilinx

Copyright 2016 Xilinx Zynq Architecture Zynq Vivado 2015.4 Version This material exempt per Department of Commerce license exception TSU Objectives After completing this module, you will be able to: Identify the basic building

More information

Combining Arm & RISC-V in Heterogeneous Designs

Combining Arm & RISC-V in Heterogeneous Designs Combining Arm & RISC-V in Heterogeneous Designs Gajinder Panesar, CTO, UltraSoC gajinder.panesar@ultrasoc.com RISC-V Summit 3 5 December 2018 Santa Clara, USA Problem statement Deterministic multi-core

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (NTNU & Xilinx Research Labs Ireland) in collaboration with N Fraser, G Gambardella, M Blott, P Leong, M Jahre and

More information

Inference

Inference Inference Architectures @Xilinx Graham Schelle, PhD Principal Engineer Xilinx Research Labs Xilinx Headlines!2 Twitch Chooses Xilinx to Enable its Broadcast-quality Livestream of esports Agenda Xilinx

More information

Research Faculty Summit Systems Fueling future disruptions

Research Faculty Summit Systems Fueling future disruptions Research Faculty Summit 2018 Systems Fueling future disruptions Wolong: A Back-end Optimizer for Deep Learning Computation Jilong Xue Researcher, Microsoft Research Asia System Challenge in Deep Learning

More information

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive)

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive) NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive) NVDLA NVIDIA DEEP LEARNING ACCELERATOR IP Core for deep learning part of NVIDIA s Xavier

More information

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System

Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign

More information

Xilinx DNN Processor An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs

Xilinx DNN Processor An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs ilinx DNN Proceor An Inference Engine, Network Compiler Runtime for ilinx FPGA Rahul Nimaiyar, Brian Sun, Victor Wu, Thoma Branca, Yi Wang, Jutin Oo, Elliott Delaye, Aaron Ng, Paolo D'Alberto, Sean Settle,

More information

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008 Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated

More information

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

ECE 8823: GPU Architectures. Objectives

Introduction Objectives Distinguishing features of GPUs vs. CPUs Major drivers in the evolution of general purpose GPUs (GPGPUs) Chapter 1 Chapter 2: 2.2, 2.3 Reading


When MPPDB Meets GPU:

When MPPDB Meets GPU: An Extendible Framework for Acceleration. Laura Chen, Le Cai, Yongyan Wang. Background: Heterogeneous Computing Hardware Trend stops growing with Moore's Law Fast development of GPU


Near Memory Key/Value Lookup Acceleration MemSys 2017

October 3, 2017. Scott Lloyd, Maya Gokhale, Center for Applied Scientific Computing. This work was performed under the auspices of the U.S. Department of Energy


Reconfigurable Cell Array for DSP Applications

Chenxin Zhang, Department of Electrical and Information Technology, Lund University, Sweden. Outline: Reconfigurable computing Coarse-grained reconfigurable cell


Data-Centric Innovation Summit DAN MCNAMARA SENIOR VICE PRESIDENT GENERAL MANAGER, PROGRAMMABLE SOLUTIONS GROUP

Devices / edge network Cloud/data center Removing data bottlenecks with FPGA acceleration


Computer Systems Architecture Spring 2016

Lecture 01: Introduction. Shuai Wang, Department of Computer Science and Technology, Nanjing University [Adapted from Computer Architecture: A Quantitative Approach,


Optimizing HW/SW Partition of a Complex Embedded Systems. Simon George November 2015.

Zynq-7000 All Programmable SoC HP ACP GP Zynq UltraScale+ MPSoC HW/SW Optimization Challenges application()


XMC-RFSOC-A. XMC Module Xilinx Zynq UltraScale+ RFSOC. Overview. Key Features. Typical Applications. Advanced Information Subject To Change

Overview: PanaTeQ's XMC-RFSOC-A is an XMC module based on the Zynq UltraScale+ RFSoC device from Xilinx. The Zynq


CPU Project in Western Digital: From Embedded Cores for Flash Controllers to Vision of Datacenter Processors with Open Interfaces

Zvonimir Z. Bandic, Sr. Director; Robert Golla, Sr. Fellow; Dejan Vucinic,


Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA

Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang, National University of Defense Technology,


P51: High Performance Networking

Lecture 6: Programmable network devices. Dr Noa Zilberman, noa.zilberman@cl.cam.ac.uk, Lent 2017/18. High Throughput Interfaces Performance Limitations So far we discussed


Multimedia in Mobile Phones. Architectures and Trends Lund

Lund 091124. Presentation: Henrik Ohlsson. Contact: henrik.h.ohlsson@stericsson.com. Working with multimedia hardware (graphics and displays) at ST-Ericsson


Computer Architecture. Fall 2018. Dongkun Shin, SKKU

Computer Architecture, Fall 2018. Syllabus. Instructors: Dongkun Shin. Office: Room 85470. E-mail: dongkun@skku.edu. Office Hours: Wed. 15:00-17:30 or by appointment. Lecture notes: nyx.skku.ac.kr Courses


EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

Recap. Streaming model: 1. Use many slimmed down cores to run in parallel


Optimizing ARM SoCs with Carbon Performance Analysis Kits. ARM Technical Symposia, Fall 2014 Andy Ladd

Evolving System Requirements Processor Advances big.LITTLE Multicore Unicore DSP Cortex-R7 Block


LegUp: Accelerating Memcached on Cloud FPGAs

Xilinx Developer Forum, December 10, 2018. Andrew Canis & Ruolong Lian, LegUp Computing Inc. COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are


The Processor: Instruction-Level Parallelism

Computer Organization Architectures for Embedded Computing, Tuesday 21 October 14. Many slides adapted from: Computer Organization and Design, Patterson & Hennessy


RapidIO.org Update.

rickoco@rapidio.org, June 2015. Outline: RapidIO Overview Benefits Interconnect Comparison Ecosystem System Challenges RapidIO Markets Data Center & HPC Communications


Software Defined Modems for The Internet of Things. Dr. John Haine, IP Operations Manager

www.cognovo.com What things? 20 billion connected devices Manufactured for global markets Low cost Lifetimes from


New! New! New! New! New!

Model 5950 Features: Supports Xilinx Zynq UltraScale+ RFSoC FPGAs; 18 GB of DDR4 SDRAM; On-board GPS receiver; PCI Express (Gen. 1, 2 and 3) interface up to x8; LVDS connections to


GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray

"If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?" Seymour Cray. Jan Gray, jan@fpga.org, http://fpga.org


Industry Collaboration and Innovation

OpenCAPI Topics Industry Background Technology Overview Design Enablement OpenCAPI Consortium Industry Landscape Key changes occurring in our industry Historical microprocessor


Designing, developing, debugging ARM Cortex-A and Cortex-M heterogeneous multi-processor systems

Kinjal Dave, Senior Product Manager, ARM. ARM Tech Symposia India, December 7th 2016. Topics: Introduction System design Software


Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

Agenda Once upon a time ARM designed systems Compute trends Bringing it all together with CoreLink 400


Unified Deep Learning with CPU, GPU, and FPGA Technologies

Allen Rush (1), Ashish Sirasao (2), Mike Ignatowski (1). (1) Advanced Micro Devices, Inc.; (2) Xilinx, Inc. Abstract: Deep learning and complex machine


CS427 Multicore Architecture and Parallel Computing

Lecture 6: GPU Architecture. Li Jiang, 2014/10/9. GPU Scaling: A quiet revolution and potential build-up. Calculation: 936 GFLOPS vs. 102 GFLOPS. Memory Bandwidth:


Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Chi Zhang, Viktor K Prasanna, University of Southern California, {zhan527, prasanna}@usc.edu, fpga.usc.edu ACM


The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016

Presentation Outline: Introduction The Need for Embedded Vision & AI Vision


World's most advanced data center accelerator for PCIe-based servers

NVIDIA TESLA P100 GPU ACCELERATOR. HPC data centers need to support the ever-growing demands of scientists and researchers while staying


Efficiency and Programmability: Enablers for ExaScale. Bill Dally Chief Scientist and SVP, Research NVIDIA Professor (Research), EE&CS, Stanford

Scientific Discovery and Business Analytics Driving an Insatiable


EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor

Liang Liu, liang.liu@eit.lth.se. Outline: Reiteration, I/O, MultiProcessor, Summary. Virtual memory benefits: Using physical memory efficiently


Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

Niu Feng, Technical Specialist, ARM Tech Symposia 2016. Agenda: Introduction; Challenges: Optimizing cache coherent subsystem


M7: Next Generation SPARC. Hotchips 26, August 12, 2014. Stephen Phillips, Senior Director, SPARC Architecture, Oracle

Hotchips 26, August 12, 2014. Safe Harbor Statement: The following is intended to outline our general product direction.


SDA: Software-Defined Accelerator for general-purpose big data analysis system

Jian Ouyang (ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia. Baidu is beyond a search engine Search


Parallel Computing: Parallel Architectures Jin, Hai

School of Computer Science and Technology, Huazhong University of Science and Technology. Peripherals Computer Central Processing Unit Main Memory Computer


Takashi Shono, Ph.D. Intel 5G Tokyo Bay Summit 2017

Takashi Shono, Ph.D., Intel Corporation @ 5G Tokyo Bay Summit 2017. Legal disclaimers: Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or


Enabling the A.I. Era: From Materials to Systems

Sundeep Bajikar, Head of Market Intelligence, Applied Materials. New Street Research Conference, May 30, 2018. External Use. Key Message PART 1 PART 2 A.I. *


An introduction to Machine Learning silicon

November 28 2017. Insight for Technology Investors. AI/ML terminology: Artificial Intelligence, Machine Learning, Deep Learning. Algorithms: CNNs, RNNs, etc. Additional


High Performance Embedded Applications. Raja Pillai Applications Engineering Specialist

Agenda: What is High Performance Embedded? NI's History in HPE FlexRIO Overview System architecture Adapter modules


The Nios II Family of Configurable Soft-core Processors

James Ball, August 16, 2005. 2005 Altera Corporation. Agenda: Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture


System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

Email: sabyasachi.dey@gmail.com. Agenda: What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution


Bringing the benefits of Cortex-M processors to FPGA

Presented by Phillip Burr, Senior Product Marketing Manager, and Simon George, Director, Product & Technical Marketing, System Software and SoC Solutions. Agenda


Fundamental CUDA Optimization. NVIDIA Corporation

Outline: Fermi Architecture; Kernel optimizations; Launch configuration; Global memory throughput; Shared memory access; Instruction throughput / control


Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Agenda: 1. Introduction to GPU Compute 2. ARM Development Solutions 3. Mali GPU Architecture 4. Using ARM DS-5 Streamline
