Application-Specific Hardware. in the real world

Size: px
Start display at page:

Download "Application-Specific Hardware. in the real world"

Transcription

1 Application-Specific Hardware in the real world 1

2 2

3 Large-Scale Reconfigurable Computing in a Microsoft Datacenter ISCA 14

4 Capabilities Performance/Watt $

5

6 FPGAs ASICs Source: Bob Broderson, Berkeley Wireless group

7

8 1U, 2U, or 4U rack-mounted 1/2/4 x 10Ge ports Up to 4 PCIe x16 slots 2 sockets, 6-core Intel Westmere

9

10

11

12 Two 8-core Xeon 2.1 GHz CPUs 64 GB DRAM 4 2 TB, GB 10 Gb Ethernet No cable attachments to server 68 ⁰C

13 Altera Stratix V GS D5 172k ALMs, 2,014 M20Ks, 1,590 DSPs 8GB DDR MB Configuration Flash Config PCIe Gen 3 x8 8 lanes to Mini-SAS SFF-8088 connectors Powered by PCIe slot Stratix V Flash PCIe Gen3 x8 4x 20 Gbps Torus Network 8GB DDR3

14 FPGA 1U Mezz Conn.

15

16 8 shell cables 6 shell cables

17 Data Center Server (1U, ½width)

18 Infrastructure and Platform Architecture To enable productive use of the FPGA: (1) APIs for interfacing software with the FPGA SW Interface (2) interfaces b/n FPGA and board-level functions shell (3) support for resilience and debugging Error correction

19 Software Interface Design goals: Host-to-FPGA: Latency (10us/16KB) multithreading Custom PCIe interface with DMA support No sys calls 1 I/P, 1 O/P buffer in non-paged memory Buffer has 64 slots of size 64KB Status bits Services initiated through calls to low-level software library

20 Math Acceleration Service Physics Engine Comp. Vision Service FPGA F W PG e A bsearch Pi F p P e G li A ne FPGA

21 4 GB DDR ECCSO-DIMM 72 4 GB DDR ECCSO-DIMM 72 Shell DDR3 Core 0 DDR3 Core 1 Role Config Flash (RSU) JTAG Mb QSPI Config Flash Host CPU 8 x8 PCIe Core DMA Engine Application LEDs Temp Sensors I 2 C Inter-FPGARouter xcvr reconfig SEU North SLIII South SLIII East SLIII West SLIII

22 Selection as a Service (SaaS) SaaS 1 IF I M FM 1 Ranking as a Service (RaaS) RaaS1 IF I M FM 1 1 Query SaaS IF I M 2 FM 2 SaaS IF I M 3 FM 3 Selected Documents RaaS2 IF I M FM 2 2 RaaS3 IF I M FM blue links SaaS 4 IFM44 RaaS48 IFM44 Ported to Catapult Selection-as-a-Service (SaaS) -Find all docs that contain query terms, -Filter and select candidate documents for ranking Ranking-as-a-Service (RaaS) - Compute scores for how relevant each selected document is for the search query - Sort the scores and return the results

23 Query: FPGA Configuration Document NumberOfOccurrences_0 = 7 NumberOfOccurrences_1 = 4 NumberOfTuples_0_1 = 1 Score

24 Document NumberOfOccurrences_0 = 7 NumberOfOccurrences_1 = 4 NumberOfTuples_0_1 = 1 FFE #1 =(2*NumberOfOccurrences_0 + NumberOfOccurrences_1) (2 * NumberOfTuples_0_1) Metafeature #1 = 9 Score

25 PCIe Compressed Document Stream Preprocessing FSM Control/Data Tokens Distribution latches Free Form Expression (FFE) Feature Gathering Network 196 feature families 43 state machines 2.6K dynamic features extracted in less than 4us (~600us in SW)

26 Out pu Cluster 0 Core 0 Core 1 Core 2 tfst Complex Core 3 Core 4 Core 5 Scheduler I-Mem Feature Store Thread 0 Thread 1 Thread 2 Thread 3 F D E M W

27 Document 8-Stage Pipeline FE: FeatureExtraction FPGA 0 FPGA 1 Route to Head Route to Head RaaS Servers Server Server FFE: Free-Form Expressions FPGA 2 FPGA 3 FPGA 4 Document Scoring Request Return Score Server Server Server FPGA 5 Document Scoring Request Server Score Compute Score Compute Score FPGA 6 FPGA 7 Return Score Server Server

28

29 Accelerating Large-Scale Services Bing Search 1,632 Servers with FPGAs Running Bing Page Ranking Service (~30,000 lines of C++) Reduce no. of Servers More compute time for improving relevance

30

31

32 Google - Tensor Processing Unit 32

33 AGoldenAge in Microprocessor Design Stunning progress in microprocessor design 40 years 10 6 x faster! Three architectural innovations(~1000x) Width: bit(~8x) Instruction level parallelism: 4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x) Multicore: 1 processor to16 cores (~16x) Clock rate: 3 to 4000 MHz (~1000x thru technology & architecture) Made possible by ICtechnology: Moore s Law: growth in transistor count (2X every 1.5 years) Dennard Scaling: power/transistor shrinks at same rate as transistors are added (constant per mm2 of silicon)

34 End of Growth of Performance? Since Transistorsnot getting muchbetter Powerbudget not getting muchhigher Already switched from 1inefficient processor/chipto Nefficient processors/chip Only path left isdomainspecific Architetures Justdo afew tasks,but extremely well

35 TPUOrigin Starting as far back as 2006, Google engineers had discussions about deploying GPUs, FPGAs, or custom ASICs in their data centers. They concluded that they can use the excess capacity of the large data centers. The conversation changed in 2013 when it was projected that if people used voice search for 3 minutes a day using speech recognition DNNs, it would have required Google s data centers to double in order to meet computation demands. Google then started a high-priority project to quickly produce a custom ASIC for inference. The goal was to improve cost-performance by 10x over GPUs. Given this mandate, the TPU was designed, verified, built, and deployed in data centers in just 15 months

36 Neural nets Weights 3 Kinds of Popular NNs Multi-Layer Perceptrons(MLP) Each new layer is a set of nonlinear functions of weighted sum of all outputs ( fully connected) from a prior one Convolutional Neural Networks(CNN) Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer, which also reuses the weights. Recurrent Neural Networks(RNN) Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM). 36

37 Inference DatacenterWorkload(95%)

38 TPU architecture PCIe coprocessor No internal instruction fetch CISC-like instructions from host: Read_Host_Memory Read_Weights MatrixMultiply/Convolve Activate Write_Host_Memory Off-chip DDR3 weight memory 38

39 TPU architecture MACs for core computation 24MB Unified Buffer Store intermediate results Sized to match pitch of matmult unit, simplify compilation w/ specific apps Tiny control logic 39

40 TPU architecture systolic structure 40

41 Software Stack Software stack is split into auserspace Driver and a KernelDriver. The Kernel Driver is lightweight Handles only memorymanagement andinterrupts. The User Space driver changes frequently. It sets up and controls TPU execution Reformats data into TPU order Translates API calls into TPU instructions, and turns them into an application binary.

42 System configurations * TPU is less than half die size of the Intel Haswell processor K80 and TPU in 28nm process, Haswell fabbed in intel 22nm process Widely deployed in Google data centers 42

43 Performance Roofline curves computation vs memory-intensity Ridge point at intensity where app becomes compute-bound Before ridge = memory-bound After ridge = compute-bound Below curve = response timeconstrained 43

44 Performance energy efficiency 44

45 Performance energy proportionality 45

46 Design space exploration 46

47 TPU v2 At HotChips 2017: 2x 128x128x32b mixed multiply units (MXUs) 64GB HBM 64x TPU modules per pod 4TB HBM Some available in TensorFlow cloud svc 47

48 Google vs. Microsoft Why Google ASIC? Why Microsoft FPGA? Flexibility? Programmability? Cost and usefulness over time? 48

49 References Figures and slides are from Norman P. Jouppi, et al., "In-Datacenter PerformanceAnalysis of atensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural NetworkAccelerator for the Datacenter", NAE Regional Meeting,April KazSato, An in-depth look at Google s firsttensor Processing Unit (TPU),

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015 The Cloud is a Growing Disruptor for HPC Moore s

More information

Today s Data Centers. How can we improve efficiencies?

Today s Data Centers. How can we improve efficiencies? Today s Data Centers O(100K) servers/data center Tens of MegaWatts, difficult to power and cool Very noisy Security taken very seriously Incrementally upgraded 3 year server depreciation, upgraded quarterly

More information

A New Era of Hardware Microservices in the Cloud. Doug Burger Distinguished Engineer, Microsoft UW Cloud Workshop March 31, 2017

A New Era of Hardware Microservices in the Cloud. Doug Burger Distinguished Engineer, Microsoft UW Cloud Workshop March 31, 2017 A New Era of Hardware Microservices in the Cloud Doug Burger Distinguished Engineer, Microsoft UW Cloud Workshop March 31, 2017 Moore s Law Dennard Scaling has been dead for a decade Moore s La is o er

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 23 Domain- Specific Architectures

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 23 Domain- Specific Architectures CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 23 Domain- Specific Architectures Krste Asanovic Electrical Engineering and Computer Sciences University of California

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

Pactron FPGA Accelerated Computing Solutions

Pactron FPGA Accelerated Computing Solutions Pactron FPGA Accelerated Computing Solutions Intel Xeon + Altera FPGA 2015 Pactron HJPC Corporation 1 Motivation for Accelerators Enhanced Performance: Accelerators compliment CPU cores to meet market

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Adaptable Intelligence The Next Computing Era

Adaptable Intelligence The Next Computing Era Adaptable Intelligence The Next Computing Era Hot Chips, August 21, 2018 Victor Peng, CEO, Xilinx Pervasive Intelligence from Cloud to Edge to Endpoints >> 1 Exponential Growth and Opportunities Data Explosion

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

More information

Enabling Technology for the Cloud and AI One Size Fits All?

Enabling Technology for the Cloud and AI One Size Fits All? Enabling Technology for the Cloud and AI One Size Fits All? Tim Horel Collaborate. Differentiate. Win. DIRECTOR, FIELD APPLICATIONS The Growing Cloud Global IP Traffic Growth 40B+ devices with intelligence

More information

Altera SDK for OpenCL

Altera SDK for OpenCL Altera SDK for OpenCL A novel SDK that opens up the world of FPGAs to today s developers Altera Technology Roadshow 2013 Today s News Altera today announces its SDK for OpenCL Altera Joins Khronos Group

More information

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm

Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Engineering Director, Xilinx Silicon Architecture Group Versal: The New Xilinx Adaptive Compute Acceleration Platform (ACAP) in 7nm Presented By Kees Vissers Fellow February 25, FPGA 2019 Technology scaling

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

Xilinx ML Suite Overview

Xilinx ML Suite Overview Xilinx ML Suite Overview Yao Fu System Architect Data Center Acceleration Xilinx Accelerated Computing Workloads Machine Learning Inference Image classification and object detection Video Streaming Frame

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

THE NVIDIA DEEP LEARNING ACCELERATOR

THE NVIDIA DEEP LEARNING ACCELERATOR THE NVIDIA DEEP LEARNING ACCELERATOR INTRODUCTION NVDLA NVIDIA Deep Learning Accelerator Developed as part of Xavier NVIDIA s SOC for autonomous driving applications Optimized for Convolutional Neural

More information

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive)

NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM. Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive) NVIDIA'S DEEP LEARNING ACCELERATOR MEETS SIFIVE'S FREEDOM PLATFORM Frans Sijstermans (NVIDIA) & Yunsup Lee (SiFive) NVDLA NVIDIA DEEP LEARNING ACCELERATOR IP Core for deep learning part of NVIDIA s Xavier

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

POWER9 Announcement. Martin Bušek IBM Server Solution Sales Specialist

POWER9 Announcement. Martin Bušek IBM Server Solution Sales Specialist POWER9 Announcement Martin Bušek IBM Server Solution Sales Specialist Announce Performance Launch GA 2/13 2/27 3/19 3/20 POWER9 is here!!! The new POWER9 processor ~1TB/s 1 st chip with PCIe4 4GHZ 2x Core

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Versal: AI Engine & Programming Environment

Versal: AI Engine & Programming Environment Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY

More information

LegUp: Accelerating Memcached on Cloud FPGAs

LegUp: Accelerating Memcached on Cloud FPGAs 0 LegUp: Accelerating Memcached on Cloud FPGAs Xilinx Developer Forum December 10, 2018 Andrew Canis & Ruolong Lian LegUp Computing Inc. 1 COMPUTE IS BECOMING SPECIALIZED 1 GPU Nvidia graphics cards are

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

Won Woo Ro, Ph.D. School of Electrical and Electronic Engineering

Won Woo Ro, Ph.D. School of Electrical and Electronic Engineering Won Woo Ro, Ph.D. School of Electrical and Electronic Engineering 학위 공학박사, EE, University of Southern California, USA (2004. 5) 공학석사, EE, University of Southern California, USA (1999. 5) 공학사, EE, Yonsei

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos

More information

Supercomputing and Mass Market Desktops

Supercomputing and Mass Market Desktops Supercomputing and Mass Market Desktops John Manferdelli Microsoft Corporation This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

More information

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems

S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems S8765 Performance Optimization for Deep- Learning on the Latest POWER Systems Khoa Huynh Senior Technical Staff Member (STSM), IBM Jonathan Samn Software Engineer, IBM Evolving from compute systems to

More information

Bringing Intelligence to Enterprise Storage Drives

Bringing Intelligence to Enterprise Storage Drives Bringing Intelligence to Enterprise Storage Drives Neil Werdmuller Director Storage Solutions Arm Santa Clara, CA 1 Who am I? 28 years experience in embedded Lead the storage solutions team Work closely

More information

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm

Deep Learning on Arm Cortex-M Microcontrollers. Rod Crawford Director Software Technologies, Arm Deep Learning on Arm Cortex-M Microcontrollers Rod Crawford Director Software Technologies, Arm What is Machine Learning (ML)? Artificial Intelligence Machine Learning Deep Learning Neural Networks Additional

More information

Machine Learning on VMware vsphere with NVIDIA GPUs

Machine Learning on VMware vsphere with NVIDIA GPUs Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology

More information

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018

Adaptable Computing The Future of FPGA Acceleration. Dan Gibbons, VP Software Development June 6, 2018 Adaptable Computing The Future of FPGA Acceleration Dan Gibbons, VP Software Development June 6, 2018 Adaptable Accelerated Computing Page 2 Three Big Trends The Evolution of Computing Trend to Heterogeneous

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc May 2018

Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc May 2018 Copyright Khronos Group 2018 - Page 1 Open Standards for Vision and AI Peter McGuinness NNEF WG Chair CEO, Highwai, Inc peter.mcguinness@gobrach.com May 2018 Khronos Mission E.g. OpenGL ES provides 3D

More information

All-Flash Storage System

All-Flash Storage System All-Flash Storage System June 2016 Jungsoo Kim Manager, SK Telecom Agenda SKT Storage Solution R&D Introduction Our approaches in developing storage system AF-Media details Computing Board Storage Module

More information

COSC 6339 Accelerators in Big Data

COSC 6339 Accelerators in Big Data COSC 6339 Accelerators in Big Data Edgar Gabriel Fall 2018 Motivation Programming models such as MapReduce and Spark provide a high-level view of parallelism not easy for all problems, e.g. recursive algorithms,

More information

DeepLearning on FPGAs

DeepLearning on FPGAs DeepLearning on FPGAs Introduction to FPGAs Sebastian Buschjäger Technische Universität Dortmund - Fakultät Informatik - Lehrstuhl 8 October 24, 2017 1 Recap: Convolution Observation 1 Even smaller images

More information

Tracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers

Tracking Acceleration with FPGAs. Future Tracking, CMS Week 4/12/17 Sioni Summers Tracking Acceleration with FPGAs Future Tracking, CMS Week 4/12/17 Sioni Summers Contents Introduction FPGAs & 'DataFlow Engines' for computing Device architecture Maxeler HLT Tracking Acceleration 2 Introduction

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

The Why and How of Developing All-Flash Storage Server

The Why and How of Developing All-Flash Storage Server The Why and How of Developing All-Flash Storage Server June 2016 Jungsoo Kim Manager, SK Telecom Agenda Why we care about All-Flash Storage Transforming to 5G Network Open HW & SW Projects @ SKT Our approaches

More information

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh

Gzip Compression Using Altera OpenCL. Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh Gzip Compression Using Altera OpenCL Mohamed Abdelfattah (University of Toronto) Andrei Hagiescu Deshanand Singh Gzip Widely-used lossless compression program Gzip = LZ77 + Huffman Big data needs fast

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

DNN Accelerator Architectures

DNN Accelerator Architectures DNN Accelerator Architectures ISCA Tutorial (2017) Website: http://eyeriss.mit.edu/tutorial.html Joel Emer, Vivienne Sze, Yu-Hsin Chen 1 2 Highly-Parallel Compute Paradigms Temporal Architecture (SIMD/SIMT)

More information

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs IBM Research AI Systems Day DNNBuilder: an Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs Xiaofan Zhang 1, Junsong Wang 2, Chao Zhu 2, Yonghua Lin 2, Jinjun Xiong 3, Wen-mei

More information

ENDURING DIFFERENTIATION Timothy Lanfear

ENDURING DIFFERENTIATION Timothy Lanfear ENDURING DIFFERENTIATION Timothy Lanfear WHERE ARE WE? 2 LIFE AFTER DENNARD SCALING GPU-ACCELERATED PERFORMANCE 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 10 3 10 2 Single-threaded perf

More information

ENDURING DIFFERENTIATION. Timothy Lanfear

ENDURING DIFFERENTIATION. Timothy Lanfear ENDURING DIFFERENTIATION Timothy Lanfear WHERE ARE WE? 2 LIFE AFTER DENNARD SCALING 10 7 40 Years of Microprocessor Trend Data 10 6 10 5 10 4 Transistors (thousands) 1.1X per year 10 3 10 2 Single-threaded

More information

Improving Packet Processing Performance of a Memory- Bounded Application

Improving Packet Processing Performance of a Memory- Bounded Application Improving Packet Processing Performance of a Memory- Bounded Application Jörn Schumacher CERN / University of Paderborn, Germany jorn.schumacher@cern.ch On behalf of the ATLAS FELIX Developer Team LHCb

More information

Bringing Intelligence to Enterprise Storage Drives

Bringing Intelligence to Enterprise Storage Drives Bringing Intelligence to Enterprise Storage Drives Neil Werdmuller Director Storage Solutions Arm Santa Clara, CA 1 Who am I? 28 years experience in embedded Lead the storage solutions team Work closely

More information

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing

Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product

More information

January 28 29, 2014San Jose. Engineering Workshop

January 28 29, 2014San Jose. Engineering Workshop January 28 29, 2014San Jose Microsoft cloud server specification Blade overview Compute and Storage Martin Goldstein, Principal Systems Architect Microsoft Cloud Server Infrastructure Engineering Chassis

More information

To hear the audio, please be sure to dial in: ID#

To hear the audio, please be sure to dial in: ID# Introduction to the HPP-Heterogeneous Processing Platform A combination of Multi-core, GPUs, FPGAs and Many-core accelerators To hear the audio, please be sure to dial in: 1-866-440-4486 ID# 4503739 Yassine

More information

Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center

Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center Naif Tarafdar, Thomas Lin, Eric Fukuda, Hadi Bannazadeh, Alberto Leon-Garcia, Paul Chow University of Toronto 1 Cloudy with

More information

Intel s Architecture for NFV

Intel s Architecture for NFV Intel s Architecture for NFV Evolution from specialized technology to mainstream programming Net Futures 2015 Network applications Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION

More information

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016

The Path to Embedded Vision & AI using a Low Power Vision DSP. Yair Siegel, Director of Segment Marketing Hotchips August 2016 The Path to Embedded Vision & AI using a Low Power Vision DSP Yair Siegel, Director of Segment Marketing Hotchips August 2016 Presentation Outline Introduction The Need for Embedded Vision & AI Vision

More information

Making Sense of Artificial Intelligence: A Practical Guide

Making Sense of Artificial Intelligence: A Practical Guide Making Sense of Artificial Intelligence: A Practical Guide JEDEC Mobile & IOT Forum Copyright 2018 Young Paik, Samsung Senior Director Product Planning Disclaimer This presentation and/or accompanying

More information

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads

HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads HP ProLiant BladeSystem Gen9 vs Gen8 and G7 Server Blades on Data Warehouse Workloads Gen9 server blades give more performance per dollar for your investment. Executive Summary Information Technology (IT)

More information

The Xilinx XC6200 chip, the software tools and the board development tools

The Xilinx XC6200 chip, the software tools and the board development tools The Xilinx XC6200 chip, the software tools and the board development tools What is an FPGA? Field Programmable Gate Array Fully programmable alternative to a customized chip Used to implement functions

More information

Computer Architecture. Fall Dongkun Shin, SKKU

Computer Architecture. Fall Dongkun Shin, SKKU Computer Architecture Fall 2018 1 Syllabus Instructors: Dongkun Shin Office : Room 85470 E-mail : dongkun@skku.edu Office Hours: Wed. 15:00-17:30 or by appointment Lecture notes nyx.skku.ac.kr Courses

More information

OCP Engineering Workshop - Telco

OCP Engineering Workshop - Telco OCP Engineering Workshop - Telco Low Latency Mobile Edge Computing Trevor Hiatt Product Management, IDT IDT Company Overview Founded 1980 Workforce Approximately 1,800 employees Headquarters San Jose,

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs

More information

Welcome. Altera Technology Roadshow 2013

Welcome. Altera Technology Roadshow 2013 Welcome Altera Technology Roadshow 2013 Altera at a Glance Founded in Silicon Valley, California in 1983 Industry s first reprogrammable logic semiconductors $1.78 billion in 2012 sales Over 2,900 employees

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Programmable NICs. Lecture 14, Computer Networks (198:552)

Programmable NICs. Lecture 14, Computer Networks (198:552) Programmable NICs Lecture 14, Computer Networks (198:552) Network Interface Cards (NICs) The physical interface between a machine and the wire Life of a transmitted packet Userspace application NIC Transport

More information

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture

More information

Ettus Research Update

Ettus Research Update Ettus Research Update Matt Ettus Ettus Research GRCon13 Outline 1 Introduction 2 Recent New Products 3 Third Generation Introduction Who am I? Core GNU Radio contributor since 2001 Designed

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

Tile Processor (TILEPro64)

Tile Processor (TILEPro64) Tile Processor Case Study of Contemporary Multicore Fall 2010 Agarwal 6.173 1 Tile Processor (TILEPro64) Performance # of cores On-chip cache (MB) Cache coherency Operations (16/32-bit BOPS) On chip bandwidth

More information

Scaling Throughput Processors for Machine Intelligence

Scaling Throughput Processors for Machine Intelligence Scaling Throughput Processors for Machine Intelligence ScaledML Stanford 24-Mar-18 simon@graphcore.ai 1 MI The impact on humanity of harnessing machine intelligence will be greater than the impact of harnessing

More information

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel

More information

AI Requires Many Approaches

AI Requires Many Approaches By Linley Gwennap Principal Analyst July 2018 www.linleygroup.com By Linley Gwennap, Principal Analyst, The Linley Group Artificial intelligence (AI) is being applied to a wide range of problems, so no

More information

Express5800/ A2040c/A2020c/A2010c/A1040c Configuration Guide

Express5800/ A2040c/A2020c/A2010c/A1040c Configuration Guide Express5800/ A2040c/A2020c/A2010c/A1040c Configuration Guide Introduction This document contains product and configuration information that will enable you to configure your system. The guide will ensure

More information

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning, Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Enabling FPGAs in Hyperscale Data Centers

Enabling FPGAs in Hyperscale Data Centers J. Weerasinghe; IEEE CBDCom 215, Beijing; 13 th August 215 Enabling s in Hyperscale Data Centers J. Weerasinghe 1, F. Abel 1, C. Hagleitner 1, A. Herkersdorf 2 1 IBM Research Zurich Laboratory 2 Technical

More information

Field Program mable Gate Arrays

Field Program mable Gate Arrays Field Program mable Gate Arrays M andakini Patil E H E P g r o u p D H E P T I F R SERC school NISER, Bhubaneshwar Nov 7-27 2017 Outline Digital electronics Short history of programmable logic devices

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

The Transition to PCI Express* for Client SSDs

The Transition to PCI Express* for Client SSDs The Transition to PCI Express* for Client SSDs Amber Huffman Senior Principal Engineer Intel Santa Clara, CA 1 *Other names and brands may be claimed as the property of others. Legal Notices and Disclaimers

More information

FMS18 Invited Session 101-B1 Hardware Acceleration Techniques for NVMe-over-Fabric

FMS18 Invited Session 101-B1 Hardware Acceleration Techniques for NVMe-over-Fabric Flash Memory Summit 2018 Santa Clara, CA FMS18 Invited Session 101-B1 Hardware Acceleration Techniques for NVMe-over-Fabric Paper Abstract: The move from direct-attach to Composable Infrastructure is being

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Implementing Ultra Low Latency Data Center Services with Programmable Logic Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://algo-logic.com Solutions@Algo-Logic.com (408) 707-3740 2255-D Martin Ave.,

More information

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration

Implementing Flexible Interconnect Topologies for Machine Learning Acceleration Implementing Flexible Interconnect for Machine Learning Acceleration A R M T E C H S Y M P O S I A O C T 2 0 1 8 WILLIAM TSENG Mem Controller 20 mm Mem Controller Machine Learning / AI SoC New Challenges

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications

NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications NFS/RDMA over 40Gbps iwarp Wael Noureddine Chelsio Communications Outline RDMA Motivating trends iwarp NFS over RDMA Overview Chelsio T5 support Performance results 2 Adoption Rate of 40GbE Source: Crehan

More information

Flash Controller Solutions in Programmable Technology

Flash Controller Solutions in Programmable Technology Flash Controller Solutions in Programmable Technology David McIntyre Senior Business Unit Manager Computer and Storage Business Unit Altera Corp. dmcintyr@altera.com Flash Memory Summit 2012 Santa Clara,

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager

Characterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance

More information

Onto Petaflops with Kubernetes

Onto Petaflops with Kubernetes Onto Petaflops with Kubernetes Vishnu Kannan Google Inc. vishh@google.com Key Takeaways Kubernetes can manage hardware accelerators at Scale Kubernetes provides a playground for ML ML journey with Kubernetes

More information

Scaling Datacenter Accelerators With Compute-Reuse Architectures

Scaling Datacenter Accelerators With Compute-Reuse Architectures Scaling Datacenter Accelerators With Compute-Reuse Architectures Adi Fuchs and David Wentzlaff ISCA 2018 Session 5A June 5, 2018 Los Angeles, CA Scaling Datacenter Accelerators With Compute-Reuse Architectures

More information

Neuromorphic Data Microscope

Neuromorphic Data Microscope Neuromorphic Data Microscope CLSAC 16 October 28, 2016 David Follett Founder, CEO Lewis Rhodes Labs (LRL) david@lewis-rhodes.com 978-273-0537 Slide 1 History Neuroscience 1998-2012 Neuronal Spiking Models

More information