Xilinx DNN Processor An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs
|
|
- Conrad Dickerson
- 5 years ago
- Views:
Transcription
1 ilinx DNN Proceor An Inference Engine, Network Compiler Runtime for ilinx FPGA Rahul Nimaiyar, Brian Sun, Victor Wu, Thoma Branca, Yi Wang, Jutin Oo, Elliott Delaye, Aaron Ng, Paolo D'Alberto, Sean Settle, Bill Teng, Manaa Bollavaram, Chaithanya Dudha, Hanh Hoang, Swati Gupta,, Alina Huang, Ephrem Wu, Samuel Bayli, Phil Jame-Roxby, Ralph Wittig, Ahih Siraao, Ravi Sunkavalli 21 t Augut 2018
2 Spill / Retore DMA Controller Execution Controller ilinx Inference Engine DNN Proceor Compiler Trained Model Compiler Runtime ilinx DNN Proceor Image Queue Intruction Buffer DMA Controller v V v Pooling/ EWA Bia ReLU Bia ReLU Bia ReLU Bia ReLU Pooling Pooling Pooling Pooling Cro Bar Low Latency, High Throughput Batch % Efficiency No FPGA Expertie Required >> 2
3 Virtex UltraScale VU9P FPGA 16nm TSMC FF FPGA 2.5M Sytem Logic Cell 6840 DSP Block (18x27 MAC) 382 Mbit On-Die SRAM 4 DDR x72 Channel >> 3 Virtex UltraScale FPGA VCU1525 Developer Board and Reference Deign VU9P Virtex UltraScale FPGA 21 TOPS (INT8) 382 Mbit on-chip SRAM 64 GByte on-board DRAM 75W
4 Motivation for Deep Learning on FPGA Data Parallel 2D Array of MAC Flexible on-chip memory acce High Bandwidth, Multiple Acce Port Data Reue Near Memory Compute Programmable routing for data & filter reue Compreion & Sparity Flexible Data Type FP32/16, INT16/8/4/2, Binary/Ternary Sparity friendly compute >> 4
5 Virtex UltraScale Full Spectrum of Memory UltraRAM (100 of Megabit) High Bandwidth Memory (Multi-Gigabyte) External Memory 2666-DDR4 (Multi-Gigabyte) Ditributed RAM (10 of megabit) Block RAM (10 of megabit) Gap in Memory Hierarchy VU9P 36Mb 675Tb/ 77Mb 216Tb/ 270Mb 80Tb/ VU37P: 8GB, 460GB/ 64GB 85GB/ 5 Tier of Memory -> Build cutom memory hierarchy. 500 Mb of On-chip Memory and Tb/ of On-chip Memory Bandwidth >> 5
6 Spill / Retore DMA Controller Execution Controller ilinx DNN Proceor (xdnn) Image Queue Intruction Buffer DMA Controller Sytolic Array Configurable Overlay Proceor DNN Specific Intruction Set Convolution, Max Pool etc. Any Network, Any Image Size High Frequency & High Compute Efficiency Compile and run new network Bia Bia Bia Bia Pooling/ EWA ReLU ReLU ReLU ReLU Pooling Pooling Pooling Pooling Cro Bar >> 6
7 xdnn Channel Parallel Sytolic Array 2D Channel Parallel Datapath & Ditributed Buffer Micro-Architecture Optimized for underlying Ultracale FPGA Fabric >> 7
8 xdnn Proceor Tenor Memory W R W R W R W R From Parallel Proceing From Sytolic Array From DDR DMA To Sytolic Array To DDR DMA To Parallel Proceing Channel Parallel Concurrent Acce >> 8
9 Efficient Memory Utilization Previou Layer Output Tenor Memory Time: T n Tenor Memory T n1 Tenor Memory T n2 Tenor Memory T n3 Previou Layer Output Previou Layer Output Previou Layer Output Previou Layer Output 3x3 Conv Reduce 5x5 Conv Reduce 3x3 Conv 3x3 Conv 5x5 Conv 3x3 Conv Reduce 3x3 Conv Reduce Concatenated Output Tenor Memory T n3 Tenor Memory T n4 Tenor Memory T n5 Previou Layer Output Previou Layer Output Previou Layer Output 3x3 Conv 3x3 Conv 3x3 Conv Concatenated Output Input for Next Layer 5x5 Conv 3x3 Conv Reduce 5x5 Conv Reduce 5x5 Conv Reduce >> 9
10 xdnn Compiler Runtime Deep Learning Framework MxNet Frontend Framework Tenor Graph to ilinx Tenor Graph Tenor Graph Optimization Compiler Quantizer Model Calibration Set Image Runtime CPU Layer FPGA Layer >> 10
11 Graph Partitioning One time loading of ub-graph intruction Data Flow Execution FPGA or CPU FPGA CPU FPGA CPU Pre-Proceing Subgraph 1 Parallel Subgraph Pot-Proceing 1 Core -> Multi-Core -> Multi-Chip >> 11
12 Fuing of Operation Conv Batch Norm. ReLU Bia Conv. Max Pool Pipelined Operator (In-line, no RAM acce) Fue operation a pipe-lined tage Acce activation only once >> 12
13 Intruction Level Parallelim Previou Layer Output 3x3 Conv Reduce 5x5 Conv Reduce 3x3 Max Pool 3x3 Conv 5x5 Conv Concatenated Output 3x3 Conv Reduce 3x3 Conv 3x3 Max Pool Parallel Execution 5x5 Conv Reduce 5x5 Conv Time >> 13
14 Automatic Intra Layer Tiling Tile when Feature Map ize exceed on-chip memory Work on full feature map depth Any Feature Map Size >> 14
15 xdnn Key Function Feature Detail Convolution NxM, Stride 1,2,4,8 N,M=1-15 Max Pool NxM, Stride 1,2,4,8 N,M=1-15 Avg Pool NxM, Stride 1,2,4,8 N,M=1-15 Dilated Convolution Factor 1,2,4 De-Convolution NxM, Stride 1,2,4,8 N,M=1-15 Up-ampling Factor 2,4,8,16 Activation ReLU, prelu Elementwie Addition Any Square, Rectangular Preciion Int8, Int16 Network Claification: e.g. ReNet Object Detection: e.g. YOLO v2, Segmentation: e.g. MakRCNN >> 15
16 xdnn Implementation on VU9P Reource Count VU9P Utilization LUT 612k 52% DSP % BRAM % URAM % Sytolic Array 800MHz 90% of device Fmax 3 Large 96x16 PE 1 in each SLR Sytolic Array at 800 MHz; Ret of logic at 400 MHz >> 16
17 xdnn Compute Efficiency xdnn 74% 60-80% Efficiency Acro Network Other Architecture Full Benchmark Data at ilinx Developer Forum Oct 1, 2018 >> 17
18 Cutom Deep Learning Pipeline xdnn Video Decode Proceing Video ML Genomic ML xdnn DNN Rik Modelling ML Databae ML xdnn Video Proceing Encode Network IPS ML Storage ML Integrate Cutom Application with xdnn. Lower end-to-end latency >> 18
19 ilinx DNN Proceor - Summary xdnn Performance No FPGA Expertie Needed 60-80% 90% Compile & Run Trained Network Deploy on AWS F1, other Cloud Platform EFFICIENCY FREQUENCY Deploy in PCIe Card >> 19
20 Additional Information Connect Learn Share DF connect oftware developer and ytem deigner to the deep expertie of ilinx engineer, partner, and indutry leader. Silicon Valley October 1-2 Beijing October 6 th Frankfurt December 10 Learn More >> 20
Xilinx ML Suite Overview
Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame
More informationXilinx ML Suite Overview
Xilinx ML Suite Overview Jim Heaton Sr. FAE Deep Learning explores the study of algorithms that can learn from and make predictions on data Deep Learning is Re-defining Many Applications Cloud Acceleration
More informationAdaptable Intelligence The Next Computing Era
Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx Pervasive Intelligence from Cloud to Edge to Endpoints >> 1 Exponential Growth and Opportunities Data Explosion
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationDNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs
IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei
More informationVersal: AI Engine & Programming Environment
Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationHW/SW Programmable Engine:
HW/SW Programmable Engine: Domain Specific Architecture for Project Everest Juanjo Noguera, Goran Bilski, Jan Langer, Baris Ozgul, Tim Tuan, David Clarke, Peter McColgan, Sneha Date, Zachary Dickman, Pedro
More information在数据中心中加速 AI - Xilinx 机器学习套件 (Xilinx ML Suite )
赛灵思高级主任 DSP/ 机器学习专家赛灵思高级主任 DSP/ 机器学习专家 赛灵思技术日 XILINX TECHNOLOGY DAY 在数据中心中加速 AI - Xilinx 机器学习套件 (Xilinx ML Suite ) 王宏强赛灵思资深主任 DSP/ 机器学习专家 2019 年 3 月 19 日 机器学习推断是赛灵思的长项 TRAINING Input cat =? labels dog
More informationVersal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm
Engineering Director, Xilinx Silicon Architecture Group Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Presented By Kees Vissers Fellow February 25, FPGA 2019 Technology scaling
More informationIntegrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim
Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Farzad Farshchi, Qijing Huang, Heechul Yun University of Kansas, University of California, Berkeley SiFive Internship Rocket
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationGRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens? Seymour Cray GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens Jan Gray jan@fpga.org http://fpga.org
More informationDNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses
DNN ENGINE: A 16nm Sub-uJ DNN Inference Accelerator for the Embedded Masses Paul N. Whatmough 1,2 S. K. Lee 2, N. Mulholland 2, P. Hansen 2, S. Kodali 3, D. Brooks 2, G.-Y. Wei 2 1 ARM Research, Boston,
More informationBandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design
Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters
More informationGRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens. Jan Gray
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens? Seymour Cray GRVI Phalanx Update: Plowing the Cloud with Thousands of RISC-V Chickens Jan Gray jan@fpga.org http://fpga.org
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationTowards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA Junzhong Shen, You Huang, Zelong Wang, Yuran Qiao, Mei Wen, Chunyuan Zhang National University of Defense Technology,
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationHigh Capacity and High Performance 20nm FPGAs. Steve Young, Dinesh Gaitonde August Copyright 2014 Xilinx
High Capacity and High Performance 20nm FPGAs Steve Young, Dinesh Gaitonde August 2014 Not a Complete Product Overview Page 2 Outline Page 3 Petabytes per month Increasing Bandwidth Global IP Traffic Growth
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationAn 80-core GRVI Phalanx Overlay on PYNQ-Z1:
An 80-core GRVI Phalanx Overlay on PYNQ-Z1: Pynq as a High Productivity Platform For FPGA Design and Exploration Jan Gray jan@fpga.org http://fpga.org/grvi-phalanx FCCM 2017 05/03/2017 Pynq Workshop My
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationArm s First-Generation Machine Learning Processor
Arm s First-Generation Machine Learning Processor Ian Bratt 2018 Arm Limited Introducing the Arm Machine Learning (ML) Processor Optimized ground-up architecture for machine learning processing Massive
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationOvercoming the Memory System Challenge in Dataflow Processing. Darren Jones, Wave Computing Drew Wingard, Sonics
Overcoming the Memory System Challenge in Dataflow Processing Darren Jones, Wave Computing Drew Wingard, Sonics Current Technology Limits Deep Learning Performance Deep Learning Dataflow Graph Existing
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University
ECE5775 High-Level Digital Design Automation, Fall 2018 School of Electrical Computer Engineering, Cornell University Lab 4: Binarized Convolutional Neural Networks Due Wednesday, October 31, 2018, 11:59pm
More informationFPGA 加速机器学习应用. 罗霖 2017 年 6 月 20 日
FPGA 加速机器学习应用 罗霖 Andy.luo@Xilinx.com 2017 年 6 月 20 日 Xilinx The All Programmable Company XILINX - Founded 1984 Headquarters Research and Development Sales and Support Manufacturing $2.21B FY16 revenue
More informationDNN Accelerator Architectures
DNN Accelerator Architectures ISCA Tutorial (2017) Website: http://eyeriss.mit.edu/tutorial.html Joel Emer, Vivienne Sze, Yu-Hsin Chen 1 2 Highly-Parallel Compute Paradigms Temporal Architecture (SIMD/SIMT)
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationHot Chips 2017 Xilinx 16nm Datacenter Device Family with In-Package HBM and CCIX Interconnect Gaurav Singh Sagheer Ahmad, Ralph Wittig, Millind
Hot Chips 2017 Xilinx 16nm Datacenter Device Family with In-Package HBM and Interconnect Gaurav Singh Sagheer Ahmad, Ralph Wittig, Millind Mittal, Ygal Arbel, Arun VR, Suresh Ramalingam, Kiran Puranik,
More informationVTA: Open & Flexible DL Acceleration. Thierry Moreau TVM Conference, Dec 12th 2018
VTA: Open & Flexible DL Acceleration Thierry Moreau TVM Conference, Dec 12th 2018 TVM Stack High-Level Differentiable IR Tensor Expression IR LLVM CUDA Metal TVM Stack High-Level Differentiable IR Tensor
More informationImplementing Long-term Recurrent Convolutional Network Using HLS on POWER System
Implementing Long-term Recurrent Convolutional Network Using HLS on POWER System Xiaofan Zhang1, Mohamed El Hadedy1, Wen-mei Hwu1, Nam Sung Kim1, Jinjun Xiong2, Deming Chen1 1 University of Illinois Urbana-Champaign
More information2000 N + N <100N. When is: Find m to minimize: (N) m. N log 2 C 1. m + C 3 + C 2. ESE534: Computer Organization. Previously. Today.
ESE534: Computer Organization Previously Day 7: February 6, 2012 Memories Arithmetic: addition, subtraction Reuse: pipelining bit-serial (vectorization) Area/Time Tradeoffs Latency and Throughput 1 2 Today
More informationTHE NVIDIA DEEP LEARNING ACCELERATOR
THE NVIDIA DEEP LEARNING ACCELERATOR INTRODUCTION NVDLA NVIDIA Deep Learning Accelerator Developed as part of Xavier NVIDIA s SOC for autonomous driving applications Optimized for Convolutional Neural
More informationDeep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations, and Hardware Implications Jongsoo Park Facebook AI System SW/HW Co-design Team Sep-21 2018 Team Introduction
More informationIntroduction to Neural Networks
ECE 5775 (Fall 17) High-Level Digital Design Automation Introduction to Neural Networks Ritchie Zhao, Zhiru Zhang School of Electrical and Computer Engineering Rise of the Machines Neural networks have
More informationOptimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms
Optimizing CNN-based Object Detection Algorithms on Embedded FPGA Platforms Ruizhe Zhao 1, Xinyu Niu 1, Yajie Wu 2, Wayne Luk 1, and Qiang Liu 3 1 Imperial College London {ruizhe.zhao15,niu.xinyu10,w.luk}@imperial.ac.uk
More informationRecurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks
Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and
More informationLab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm
ECE5775 High-Level Digital Design Automation, Fall 2017 School of Electrical Computer Engineering, Cornell University Lab 4: Convolutional Neural Networks Due Friday, November 3, 2017, 11:59pm 1 Introduction
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationAdaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018
Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationLegUp: Accelerating Memcached on Cloud FPGAs
0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are
More informationSimplify System Complexity
1 2 Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The Worlds Only Software Designed Controller
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationDeep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm
Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional
More informationHello Edge: Keyword Spotting on Microcontrollers
Hello Edge: Keyword Spotting on Microcontrollers Yundong Zhang, Naveen Suda, Liangzhen Lai and Vikas Chandra ARM Research, Stanford University arxiv.org, 2017 Presented by Mohammad Mofrad University of
More informationEnd to End Optimization Stack for Deep Learning
End to End Optimization Stack for Deep Learning Presenter: Tianqi Chen Paul G. Allen School of Computer Science & Engineering University of Washington Collaborators University of Washington AWS AI Team
More informationAuthenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas
Authenticated Storage Using Small Trusted Hardware Hsin-Jung Yang, Victor Costan, Nickolai Zeldovich, and Srini Devadas Massachusetts Institute of Technology November 8th, CCSW 2013 Cloud Storage Model
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More informationVivado HLx Design Entry. June 2016
Vivado HLx Design Entry June 2016 Agenda What is the HLx Design Methodology? New & Early Access features for Connectivity Platforms Creating Differentiated Logic 2 What is the HLx Design Methodology? Page
More informationCMSC 611: Advanced Computer Architecture
CMSC 611: Advanced Computer Architecture, I/O and Disk Most slides adapted from David Patterson. Some from Mohomed Younis Main Background Performance of Main : Latency: affects cache miss penalty Access
More informationSDA: Software-Defined Accelerator for general-purpose big data analysis system
SDA: Software-Defined Accelerator for general-purpose big data analysis system Jian Ouyang(ouyangjian@baidu.com), Wei Qi, Yong Wang, Yichen Tu, Jing Wang, Bowen Jia Baidu is beyond a search engine Search
More informationHigh Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS
High Performance Implementation of Microtubule Modeling on FPGA using Vivado HLS Yury Rumyantsev rumyantsev@rosta.ru 1 Agenda 1. Intoducing Rosta and Hardware overview 2. Microtubule Modeling Problem 3.
More informationTwo FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters
Two FPGA-DNN Projects: 1. Low Latency Multi-Layer Perceptrons using FPGAs 2. Acceleration of CNN Training on FPGA-based Clusters *Argonne National Lab +BU & USTC Presented by Martin Herbordt Work by Ahmed
More informationGRVI Phalanx Update: A Massively Parallel RISC-V FPGA Accelerator Framework. Jan Gray CARRV2017: 2017/10/14
GRVI halanx Update: A Massively arallel RISC-V FGA Accelerator Framework Jan Gray jan@fpga.org http://fpga.org CARRV2017: 2017/10/14 FGA Datacenter Accelerators Are Almost Mainstream Catapult v2. Intel
More informationSimplify System Complexity
Simplify System Complexity With the new high-performance CompactRIO controller Fanie Coetzer Field Sales Engineer Northern South Africa 2 3 New control system CompactPCI MMI/Sequencing/Logging FieldPoint
More informationMain Memory. EECC551 - Shaaban. Memory latency: Affects cache miss penalty. Measured by:
Main Memory Main memory generally utilizes Dynamic RAM (DRAM), which use a single transistor to store a bit, but require a periodic data refresh by reading every row (~every 8 msec). Static RAM may be
More informationEyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks Yu-Hsin Chen 1, Joel Emer 1, 2, Vivienne Sze 1 1 MIT 2 NVIDIA 1 Contributions of This Work A novel energy-efficient
More informationPerformance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference
The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee
More informationHigh Performance Computing
High Performance Computing 9th Lecture 2016/10/28 YUKI ITO 1 Selected Paper: vdnn: Virtualized Deep Neural Networks for Scalable, MemoryEfficient Neural Network Design Minsoo Rhu, Natalia Gimelshein, Jason
More informationNVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive)
NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive) NVDLA NVIDIA DEEP LEARNING ACCELERATOR IP Core for deep learning part of NVIDIA s Xavier
More informationPushing the Boundaries of Moore's Law to Transition from FPGA to All Programmable Platform Ivo Bolsens, SVP & CTO Xilinx ISPD, March 2017
Pushing the Boundaries of Moore's Law to Transition from FPGA to All Programmable Platform Ivo Bolsens, SVP & CTO Xilinx ISPD, March 2017 High Growth Markets Cloud Computing Automotive IIoT 5G Wireless
More informationNeural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research
More informationA 400Gbps Multi-Core Network Processor
A 400Gbps Multi-Core Network Processor James Markevitch, Srinivasa Malladi Cisco Systems August 22, 2017 Legal THE INFORMATION HEREIN IS PROVIDED ON AN AS IS BASIS, WITHOUT ANY WARRANTIES OR REPRESENTATIONS,
More information40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011
40Gbps+ Full Line Rate, Programmable Network Accelerators for Low Latency Applications SAAHPC 19 th July 2011 Allan Cantle President & Founder www.nallatech.com Company Overview ISI + Nallatech + Innovative
More informationRecap: Machine Organization
ECE232: Hardware Organization and Design Part 14: Hierarchy Chapter 5 (4 th edition), 7 (3 rd edition) http://www.ecs.umass.edu/ece/ece232/ Adapted from Computer Organization and Design, Patterson & Hennessy,
More informationP51: High Performance Networking
P51: High Performance Networking Lecture 6: Programmable network devices Dr Noa Zilberman noa.zilberman@cl.cam.ac.uk Lent 2017/18 High Throughput Interfaces Performance Limitations So far we discussed
More informationECE232: Hardware Organization and Design
ECE232: Hardware Organization and Design Lecture 21: Memory Hierarchy Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Ideally, computer memory would be large and fast
More informationUnlocking FPGAs Using High- Level Synthesis Compiler Technologies
Unlocking FPGAs Using High- Leel Synthesis Compiler Technologies Fernando Mar*nez Vallina, Henry Styles Xilinx Feb 22, 2015 Why are FPGAs Good Scalable, highly parallel and customizable compute 10s to
More informationPorting Performance across GPUs and FPGAs
Porting Performance across GPUs and FPGAs Deming Chen, ECE, University of Illinois In collaboration with Alex Papakonstantinou 1, Karthik Gururaj 2, John Stratton 1, Jason Cong 2, Wen-Mei Hwu 1 1: ECE
More informationOnyx: A Prototype Phase-Change Memory Storage Array
Onyx: A Prototype Phase-Change Memory Storage Array Ameen Akel * Adrian Caulfield, Todor Mollov, Rajesh Gupta, Steven Swanson Non-Volatile Systems Laboratory, Department of Computer Science and Engineering
More informationProcessor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor
More informationGRVI Phalanx. A Massively Parallel RISC-V FPGA Accelerator Accelerator. Jan Gray
GRVI Phalanx A Massively Parallel RISC-V FPGA Accelerator Accelerator Jan Gray jan@fpga.org Introduction FPGA accelerators are hot MSR Catapult. Intel += Altera. OpenPOWER + Xilinx FPGAs as computers Massively
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationBittWare s XUPP3R is a 3/4-length PCIe x16 card based on the
FPGA PLATFORMS Board Platforms Custom Solutions Technology Partners Integrated Platforms XUPP3R Xilinx UltraScale+ 3/4-Length PCIe Board with Quad QSFP and 512 GBytes DDR4 Xilinx Virtex UltraScale+ VU7P/VU9P/VU11P
More informationNVIDIA FOR DEEP LEARNING. Bill Veenhuis
NVIDIA FOR DEEP LEARNING Bill Veenhuis bveenhuis@nvidia.com Nvidia is the world s leading ai platform ONE ARCHITECTURE CUDA 2 GPU: Perfect Companion for Accelerating Apps & A.I. CPU GPU 3 Intro to AI AGENDA
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationMachine Learning on FPGAs
Machine Learning on FPGAs Jason Cong Chancellor s Professor, UCLA Director, Center for Domain-Specific Computing cong@cs.ucla.edu http://cadlab.cs.ucla.edu/~cong 1 Impacts of deep learning for many applications
More informationNear Memory Key/Value Lookup Acceleration MemSys 2017
Near Key/Value Lookup Acceleration MemSys 2017 October 3, 2017 Scott Lloyd, Maya Gokhale Center for Applied Scientific Computing This work was performed under the auspices of the U.S. Department of Energy
More informationInference
Inference Architectures @Xilinx Graham Schelle, PhD Principal Engineer Xilinx Research Labs Xilinx Headlines!2 Twitch Chooses Xilinx to Enable its Broadcast-quality Livestream of esports Agenda Xilinx
More informationShortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators
Shortcut Mining: Exploiting Cross-layer Shortcut Reuse in DCNN Accelerators Arash Azizimazreah, and Lizhong Chen School of Electrical Engineering and Computer Science Oregon State University, USA {azizimaa,
More informationPushing Performance and Integration with the UltraScale+ Portfolio
White Paper: UltraScale+ Portfolio WP471 (v1.0) November 24, 2015 Introducing the UltraScale+ Portfolio Pushing Performance and Integration with the UltraScale+ Portfolio By: Nick Mehta The Xilinx UltraScale+
More informationScaling Neural Network Acceleration using Coarse-Grained Parallelism
Scaling Neural Network Acceleration using Coarse-Grained Parallelism Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, Christos Kozyrakis Stanford University Platform Lab Review Feb 2018 Neural Networks (NNs)
More informationChapter 5B. Large and Fast: Exploiting Memory Hierarchy
Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationRe-architecting Virtualization in Heterogeneous Multicore Systems
Re-architecting Virtualization in Heterogeneous Multicore Systems Himanshu Raj, Sanjay Kumar, Vishakha Gupta, Gregory Diamos, Nawaf Alamoosa, Ada Gavrilovska, Karsten Schwan, Sudhakar Yalamanchili College
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationThe Memory Hierarchy 1
The Memory Hierarchy 1 What is a cache? 2 What problem do caches solve? 3 Memory CPU Abstraction: Big array of bytes Memory memory 4 Performance vs 1980 Processor vs Memory Performance Memory is very slow
More informationLocality. CS429: Computer Organization and Architecture. Locality Example 2. Locality Example
Locality CS429: Computer Organization and Architecture Dr Bill Young Department of Computer Sciences University of Texas at Austin Principle of Locality: Programs tend to reuse data and instructions near
More informationUnified Deep Learning with CPU, GPU, and FPGA Technologies
Unified Deep Learning with CPU, GPU, and FPGA Technologies Allen Rush 1, Ashish Sirasao 2, Mike Ignatowski 1 1: Advanced Micro Devices, Inc., 2: Xilinx, Inc. Abstract Deep learning and complex machine
More informationPRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory
Scalable and Energy-Efficient Architecture Lab (SEAL) PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in -based Main Memory Ping Chi *, Shuangchen Li *, Tao Zhang, Cong
More informationDeep Learning Compiler
Deep Learning Compiler AWS AI Acknowledgement Amazon Sagemaker Neo Enables developers to train machine learning models once and run them anywhere in the cloud and at the edge Hardware targets Intel CPU,
More informationComPEND: Computation Pruning through Early Negative Detection for ReLU in a Deep Neural Network Accelerator
ICS 28 ComPEND: Computation Pruning through Early Negative Detection for ReLU in a Deep Neural Network Accelerator June 3, 28 Dongwoo Lee, Sungbum Kang, Kiyoung Choi Neural Processing Research Center (NPRC)
More informationBHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques
BHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Network with Blocked Hashing Techniques Jingyang Zhu 1, Zhiliang Qian 2*, and Chi-Ying Tsui 1 1 The Hong Kong University of Science and
More informationCourse Overview Revisited
Course Overview Revisited void blur_filter_3x3( Image &in, Image &blur) { // allocate blur array Image blur(in.width(), in.height()); // blur in the x dimension for (int y = ; y < in.height(); y++) for
More information5051 & 5052 PCIe Card Overview
5051 & 5052 PCIe Card Overview About New Wave New Wave DV provides high performance network interface cards, system level products, FPGA IP cores, and custom engineering for: High-bandwidth low-latency
More informationHow to Estimate the Energy Consumption of Deep Neural Networks
How to Estimate the Energy Consumption of Deep Neural Networks Tien-Ju Yang, Yu-Hsin Chen, Joel Emer, Vivienne Sze MIT 1 Problem of DNNs Recognition Smart Drone AI Computation DNN 15k 300k OP/Px DPM 0.1k
More information