Scaling Datacenter Accelerators With Compute-Reuse Architectures


1 Scaling Datacenter Accelerators With Compute-Reuse Architectures Adi Fuchs and David Wentzlaff ISCA 2018 Session 5A June 5, 2018 Los Angeles, CA

2 Sources: "Cramming more components onto integrated circuits", G. E. Moore, Computer 1965; "Next-Gen Power Solutions for Hyperscale Data Centers", DataCenter Knowledge 2016


7 Transistor scaling stops. Chip specialization runs out of steam. What's Next?
Sources: "Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective", Hazelwood et al., HPCA 2018; Cloud TPU, Google; "FPGA Accelerated Computing Using AWS F1 Instances", David Pellerin, AWS Summit 2017; "Microsoft unveils Project Brainwave for real-time AI", Doug Burger; NVIDIA TESLA V100, NVIDIA

8 Observation I: The Density of Emerging Memories Is Projected to Increase
ITRS Logic Roadmap

10 Observation II: Datacenter Accelerators Perform Redundant Computations
Temporal locality introduces redundancy in video encoders (recurrent blocks in white)
o t=0 sec: 0% recurrence
o t=2 sec: 38% recurrence
o t=4 sec: 61% recurrence
Source: "Face recognition in unconstrained videos with matched background similarity", Wolf et al., CVPR

13 Observation II: Datacenter Accelerators Perform Redundant Computations
Search term commonality retrieves similar content
o "intercontinental downtown los angeles"
o "hotel in downtown los angeles near intercontinental"
Source: Google

14 Observation II: Datacenter Accelerators Perform Redundant Computations
Power laws suggest highly recurrent processing of popular content
Source: Twitter

18 Memoization: Tables store past computation outputs. Reuse outputs of recurring inputs instead of recomputing.
COREx: Compute-Reuse Architecture For Accelerators
[Diagram labels: Host Processors; Shared LLC / NoC; Acceleration Fabric; Accelerator Core; Input Lookup (lookup, hit); DMA Engine; Scratchpad Memory; Compute-Reuse Storage; signals: input, core result, fetched result, output]
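The memoization idea on this slide can be sketched in a few lines of Python. This is only a software illustration under stated assumptions, not the COREx hardware: the class name `MemoizedAccelerator`, the `sum_of_squares` stand-in for an accelerator core, and the unbounded dictionary used as the table are all hypothetical.

```python
from typing import Callable, Dict, Hashable


class MemoizedAccelerator:
    """Toy model: a table of past results placed in front of a compute core."""

    def __init__(self, core: Callable[[Hashable], object]) -> None:
        self.core = core                          # stand-in for the accelerator core
        self.table: Dict[Hashable, object] = {}   # past input -> past output
        self.hits = 0
        self.misses = 0

    def run(self, inp: Hashable) -> object:
        # Recurring input: reuse the stored output instead of recomputing.
        if inp in self.table:
            self.hits += 1
            return self.table[inp]
        # New input: compute on the core and remember the result.
        self.misses += 1
        result = self.core(inp)
        self.table[inp] = result
        return result


def sum_of_squares(block: tuple) -> int:
    """Hypothetical kernel standing in for real accelerator work."""
    return sum(x * x for x in block)


acc = MemoizedAccelerator(sum_of_squares)
outputs = [acc.run(block) for block in [(1, 2, 3), (4, 5, 6), (1, 2, 3)]]
print(outputs, acc.hits, acc.misses)
```

The third input repeats the first, so it is served from the table and the core runs only twice.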

22 Architectural Guidelines
Goal: Extend Specialization with Workload-Specific Memoization
Accelerator Memoization is Natural
o Little or no additional programming effort
o Built-in input-compute-output flow
But Not Straightforward!
o High lookup costs
o Unnecessary accesses
o High access costs
COREx Key Ideas:
o Hashing (reduce lookup costs)
o Lookup filtering (fewer accesses)
o Banking (reduce access costs)
[Diagram labels: Accelerator Core (DMA Engine, Scratchpad, Specialized Compute Lanes); General-Purpose CMP; Shared LLC; flow: Input, Compute, Output]

28 Top Level Architecture
New Modules:
o Input Hashing Unit (IHU)
o Input Lookup Unit (ILU)
o Computation History Table (CHT)
[Diagram labels: legend: Mem. Chip, Func. Block, Control, Datapath. COREx Interconnect: IHU; ILU (Associative Cache, Cache Ctrl.); CHT (RAM-Array Table, RAM-Array Ctrl.). SoC Interconnect: Accelerator Core (DMA Engine, Scratchpad, Specialized Compute Lanes); General-Purpose CMP; Shared LLC. Steps highlighted across the builds: Hashes, Fetch, Match Input, Use Output]
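The steps highlighted on these slides (Hashes, Fetch, Match Input, Use Output) can be walked through with a small software model. It is a sketch under several assumptions that the slides do not confirm: CRC32 stands in for the IHU hash, the ILU is modeled as a small LRU set of hashes, and the CHT is a direct-mapped array indexed by hash. The class name `COREx` and all sizes are illustrative only.

```python
import zlib
from collections import OrderedDict
from typing import Callable, List, Optional, Tuple


def ihu_hash(data: bytes) -> int:
    """IHU stand-in (assumption): CRC32 of the raw input bytes."""
    return zlib.crc32(data)


class COREx:
    """Toy software model of the IHU -> ILU -> CHT flow, not the real hardware."""

    def __init__(self, core: Callable[[bytes], int],
                 ilu_entries: int = 4, cht_rows: int = 8) -> None:
        self.core = core
        self.ilu: "OrderedDict[int, bool]" = OrderedDict()  # hashes, LRU order
        self.ilu_entries = ilu_entries
        self.cht: List[Optional[Tuple[bytes, int]]] = [None] * cht_rows
        self.reused = 0

    def run(self, data: bytes) -> int:
        h = ihu_hash(data)                         # Hashes: IHU hashes the input
        if h in self.ilu:                          # ILU lookup filters CHT accesses
            self.ilu.move_to_end(h)
            row = self.cht[h % len(self.cht)]      # Fetch: read the CHT row
            if row is not None and row[0] == data:  # Match Input: guard against collisions
                self.reused += 1
                return row[1]                      # Use Output: skip the core
        out = self.core(data)                      # miss: compute on the core
        self.cht[h % len(self.cht)] = (data, out)  # record input and output
        self.ilu[h] = True
        self.ilu.move_to_end(h)
        if len(self.ilu) > self.ilu_entries:       # evict least recently used hash
            self.ilu.popitem(last=False)
        return out


model = COREx(core=lambda d: sum(d))
first = model.run(b"abc")    # computed by the core
second = model.run(b"abc")   # served from the CHT
third = model.run(b"abd")    # different input, computed again
print(first, second, third, model.reused)
```

The full input comparison after the fetch is what keeps a hash collision, or an overwritten CHT row, from returning a wrong result.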

31 Building COREx
IHU = Input Hashing Unit. ILU = Input Lookup Unit. CHT = Computation History Table.
Case Study: Acceleration of Video Motion Estimation
Optimization Goals:
o Runtime, Energy, and Energy-Delay Product (EDP)
Baseline: highly-tuned accelerators
o Sweep space for design alternatives (Aladdin)
o Find optimal accelerator design for each goal
Runtime OPT: 5.8 µs. EDP OPT: 148.7 pJ·s. Energy OPT: 6.2 µJ.
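Picking one optimum per goal from a design sweep can be illustrated as follows. The three design points and their numbers are made up for illustration and are not Aladdin results; the only fact relied on is that EDP is energy multiplied by delay, so a runtime in µs times an energy in µJ gives pJ·s.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """One hypothetical accelerator configuration from a sweep."""
    name: str
    runtime_us: float
    energy_uj: float

    @property
    def edp_pjs(self) -> float:
        # 1 us * 1 uJ = 1e-6 s * 1e-6 J = 1e-12 J*s = 1 pJ*s
        return self.runtime_us * self.energy_uj


# Illustrative values only, chosen so that each goal picks a different design.
points = [
    DesignPoint("wide", runtime_us=6.0, energy_uj=40.0),
    DesignPoint("balanced", runtime_us=12.0, energy_uj=12.0),
    DesignPoint("narrow", runtime_us=50.0, energy_uj=7.0),
]

best_runtime = min(points, key=lambda p: p.runtime_us)
best_energy = min(points, key=lambda p: p.energy_uj)
best_edp = min(points, key=lambda p: p.edp_pjs)
print(best_runtime.name, best_energy.name, best_edp.name)
```

With these made-up points the fastest, the most frugal, and the best-EDP designs are three different configurations, which is why the slide reports a separate optimum per goal.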

33 Building COREx
Memoization-Layers Specialization
o Extract input traces, examine hit and miss rates of different ILU/CHT sizes.
o Integrate accelerators with emerging-memory-based ILU+CHT, and sweep the gains space.
Example: Resistive RAM based COREx
o Runtime Optimization: 2.7x Speedup. 512KB ILU, 32GB CHT
o EDP Optimization: 63.5% EDP Saved. 512KB ILU, 2GB CHT
o Energy Optimization: 56.6% Energy Saved. 64KB ILU, 8MB CHT
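The first step above, examining hit rates of an input trace for different table sizes, might look like the sketch below. The LRU replacement policy, the entry-count capacities, and the tiny trace are assumptions made for illustration; the slides do not state which policy or trace format the authors used.

```python
from collections import OrderedDict
from typing import Hashable, Sequence


def hit_rate(trace: Sequence[Hashable], capacity: int) -> float:
    """Fraction of trace inputs found in an LRU table holding `capacity` entries."""
    table: "OrderedDict[Hashable, None]" = OrderedDict()
    hits = 0
    for key in trace:
        if key in table:
            hits += 1
            table.move_to_end(key)         # mark as most recently used
        else:
            table[key] = None
            if len(table) > capacity:
                table.popitem(last=False)  # evict the least recently used entry
    return hits / len(trace)


# Hypothetical input trace: each letter stands for one distinct accelerator input.
trace = ["a", "b", "a", "c", "b", "a", "d", "a"]
rates = {capacity: hit_rate(trace, capacity) for capacity in (1, 2, 4)}
print(rates)
```

Sweeping the capacity this way shows where extra table space stops buying hits, which is the kind of information needed to size the ILU and CHT for a given goal.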

38 Experimental Setup
Workloads
Kernel | Domain | Use-Case | App Source | Input Source and Description
DCT | Video Encoding | Video Server | x264 | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS.
SAD | Video Encoding | Video Server | PARBOIL | YouTube Faces. 10 Videos, 10 Seconds, 24 FPS.
SNAPPY ("SNP") | Compression | Web-Server Traffic Compression | TailBench Snappy-C | Wikipedia Abstracts. 13 Million Search Queries.
SSSP ("SSP") | Graph Processing | Maps Service: Shortest Walking Route | Internal | DIMACS NYC Streets, 10 Million Zipfian Transactions.
BFS | Graph Processing | Online Retail | MachSuite | Amazon Co-Purchasing, 10 Million Zipfian Transactions.
RBM | Machine Learning | Collaborative Filtering | CortexSuite | Netflix Prize: 10 Million Zipfian Transactions.
Redundancy sources highlighted: Temporal Redundancy, Search Commonality, Content Popularity (75%, 90%, 95% Recurrence)
Methodology
o Evaluate ILU/CHT as ReRAM, STT-RAM, PCM, or Racetrack (Destiny)
o Integrate with highly-tuned accelerators (Aladdin)
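A Zipfian transaction trace and its recurrence can be approximated as below. This is a sketch only: the slides do not say how the traces were generated or how the 75%, 90%, and 95% recurrence levels were calibrated, so the weight formula, the `skew` parameter, the seed, and the definition of recurrence as "fraction of transactions whose input was seen earlier" are all assumptions.

```python
import random
from typing import Hashable, List, Sequence


def zipf_trace(num_items: int, length: int, skew: float, seed: int = 0) -> List[int]:
    """Draw `length` item IDs where the item of rank r has weight 1 / r**skew."""
    rng = random.Random(seed)  # fixed seed keeps the trace reproducible
    items = list(range(num_items))
    weights = [1.0 / (rank ** skew) for rank in range(1, num_items + 1)]
    return rng.choices(items, weights=weights, k=length)


def recurrence(trace: Sequence[Hashable]) -> float:
    """Fraction of transactions whose input already appeared earlier in the trace."""
    seen = set()
    repeats = 0
    for item in trace:
        if item in seen:
            repeats += 1
        else:
            seen.add(item)
    return repeats / len(trace)


trace = zipf_trace(num_items=100, length=1000, skew=1.0)
print(f"recurrence: {recurrence(trace):.2%}")
```

Raising `skew` concentrates the draws on fewer popular items, which is one plausible way to reach a target recurrence level.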

41 Results
Runtime-OPT: Avg x Speedup
o Negligible Differences Between Memories
EDP-OPT: Avg. 50%-68% Savings
o PCM/Racetrack: high write energy
o Gain less for low-bias apps (freq. updates)
Energy-OPT: Avg. 22%-50% Savings
o PCM unbeneficial for 75% bias SSSP/RBM
General Trends:
o Large CHTs (MBs-TBs) for Speedup, Smaller (KBs-GBs) for EDP, Smallest (KBs-MBs) for Energy

45 Conclusions
Memoization is Fit for Accelerators
o Memoization-Ready Programming Environment+Interface
Memoization is Fit for Datacenters
o Temporal Redundancy, Search Commonality, Content Popularity
COREx Extends Hardware Specialization
o Memoization-layer specialization tailored for the workload
COREx Opens New Opportunities for Future Architectures
o Shift compute from non-scaling CMOS to still-scaling memories

46 Scaling Datacenter Accelerators With Compute-Reuse Architectures Adi Fuchs David Wentzlaff


More information

OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches

OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches OAP: An Obstruction-Aware Cache Management Policy for STT-RAM Last-Level Caches Jue Wang, Xiangyu Dong, Yuan Xie Department of Computer Science and Engineering, Pennsylvania State University Qualcomm Technology,

More information

In-Memory Data Management Jens Krueger

In-Memory Data Management Jens Krueger In-Memory Data Management Jens Krueger Enterprise Platform and Integration Concepts Hasso Plattner Intitute OLTP vs. OLAP 2 Online Transaction Processing (OLTP) Organized in rows Online Analytical Processing

More information

Flash Controller Solutions in Programmable Technology

Flash Controller Solutions in Programmable Technology Flash Controller Solutions in Programmable Technology David McIntyre Senior Business Unit Manager Computer and Storage Business Unit Altera Corp. dmcintyr@altera.com Flash Memory Summit 2012 Santa Clara,

More information

Peeling the Power Onion

Peeling the Power Onion CERCS IAB Workshop, April 26, 2010 Peeling the Power Onion Hsien-Hsin S. Lee Associate Professor Electrical & Computer Engineering Georgia Tech Power Allocation for Server Farm Room Datacenter 8.1 Total

More information

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors

Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Meet in the Middle: Leveraging Optical Interconnection Opportunities in Chip Multi Processors Sandro Bartolini* Department of Information Engineering, University of Siena, Italy bartolini@dii.unisi.it

More information

15-740/ Computer Architecture Lecture 5: Project Example. Jus%n Meza Yoongu Kim Fall 2011, 9/21/2011

15-740/ Computer Architecture Lecture 5: Project Example. Jus%n Meza Yoongu Kim Fall 2011, 9/21/2011 15-740/18-740 Computer Architecture Lecture 5: Project Example Jus%n Meza Yoongu Kim Fall 2011, 9/21/2011 Reminder: Project Proposals Project proposals due NOON on Monday 9/26 Two to three pages consisang

More information

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs

Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs The 34 th IEEE International Conference on Computer Design Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs Shin-Ying Lee and Carole-Jean Wu Arizona State University October

More information

Rhythm: Harnessing Data Parallel Hardware for Server Workloads

Rhythm: Harnessing Data Parallel Hardware for Server Workloads Rhythm: Harnessing Data Parallel Hardware for Server Workloads Sandeep R. Agrawal $ Valentin Pistol $ Jun Pang $ John Tran # David Tarjan # Alvin R. Lebeck $ $ Duke CS # NVIDIA Explosive Internet Growth

More information

OpenCAPI and its Roadmap

OpenCAPI and its Roadmap OpenCAPI and its Roadmap Myron Slota, President OpenCAPI Speaker name, Consortium Title Company/Organization Name Join the Conversation #OpenPOWERSummit Industry Collaboration and Innovation OpenCAPI and

More information

MRAM, XPoint, ReRAM PM Fuel to Propel Tomorrow s Computing Advances

MRAM, XPoint, ReRAM PM Fuel to Propel Tomorrow s Computing Advances MRAM, XPoint, ReRAM PM Fuel to Propel Tomorrow s Computing Advances Jim Handy Objective Analysis Tom Coughlin Coughlin Associates The Market is at a Nexus PM 2 Emerging Memory Technologies MRAM: Magnetic

More information

Nonblocking Memory Refresh. Kate Nguyen, Kehan Lyu, Xianze Meng, Vilas Sridharan, Xun Jian

Nonblocking Memory Refresh. Kate Nguyen, Kehan Lyu, Xianze Meng, Vilas Sridharan, Xun Jian Nonblocking Memory Refresh Kate Nguyen, Kehan Lyu, Xianze Meng, Vilas Sridharan, Xun Jian Latency (ns) History of DRAM 2 Refresh Latency Bus Cycle Time Min. Read Latency 512 550 16 13.5 0.5 0.75 1968 DRAM

More information

Deep Learning Processing Technologies for Embedded Systems. October 2018

Deep Learning Processing Technologies for Embedded Systems. October 2018 Deep Learning Processing Technologies for Embedded Systems October 2018 1 Neural Networks Architecture Single Neuron DNN Multi Task NN Multi-Task Vehicle Detection With Region-of-Interest Voting Popular

More information

Building the Most Efficient Machine Learning System

Building the Most Efficient Machine Learning System Building the Most Efficient Machine Learning System Mellanox The Artificial Intelligence Interconnect Company June 2017 Mellanox Overview Company Headquarters Yokneam, Israel Sunnyvale, California Worldwide

More information

Jim Keller. Digital Equipment Corp. Hudson MA

Jim Keller. Digital Equipment Corp. Hudson MA Jim Keller Digital Equipment Corp. Hudson MA ! Performance - SPECint95 100 50 21264 30 21164 10 1995 1996 1997 1998 1999 2000 2001 CMOS 5 0.5um CMOS 6 0.35um CMOS 7 0.25um "## Continued Performance Leadership

More information

CS 537 Lecture 6 Fast Translation - TLBs

CS 537 Lecture 6 Fast Translation - TLBs CS 537 Lecture 6 Fast Translation - TLBs Michael Swift 9/26/7 2004-2007 Ed Lazowska, Hank Levy, Andrea and Remzi Arpaci-Dussea, Michael Swift Faster with TLBS Questions answered in this lecture: Review

More information

How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC

How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC How Might Recently Formed System Interconnect Consortia Affect PM? Doug Voigt, SNIA TC Three Consortia Formed in Oct 2016 Gen-Z Open CAPI CCIX complex to rack scale memory fabric Cache coherent accelerator

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 25: Parallel Databases CSE 344 - Winter 2013 1 Announcements Webquiz due tonight last WQ! J HW7 due on Wednesday HW8 will be posted soon Will take more hours

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

Introduction to Database Services

Introduction to Database Services Introduction to Database Services Shaun Pearce AWS Solutions Architect 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved Today s agenda Why managed database services? A non-relational

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation

Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Tiny GPU Cluster for Big Spatial Data: A Preliminary Performance Evaluation Jianting Zhang 1,2 Simin You 2, Le Gruenwald 3 1 Depart of Computer Science, CUNY City College (CCNY) 2 Department of Computer

More information

AMD Opteron Processors In the Cloud

AMD Opteron Processors In the Cloud AMD Opteron Processors In the Cloud Pat Patla Vice President Product Marketing AMD DID YOU KNOW? By 2020, every byte of data will pass through the cloud *Source IDC 2 AMD Opteron In The Cloud October,

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

Architectures for Scalable Media Object Search

Architectures for Scalable Media Object Search Architectures for Scalable Media Object Search Dennis Sng Deputy Director & Principal Scientist NVIDIA GPU Technology Workshop 10 July 2014 ROSE LAB OVERVIEW 2 Large Database of Media Objects Next- Generation

More information

Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications

Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications Dynamic Vertical Memory Scalability for OpenJDK Cloud Applications Rodrigo Bruno, Paulo Ferreira: INESC-ID / Instituto Superior Técnico, University of Lisbon Ruslan Synytsky, Tetiana Fydorenchyk: Jelastic

More information

Preparing GPU-Accelerated Applications for the Summit Supercomputer

Preparing GPU-Accelerated Applications for the Summit Supercomputer Preparing GPU-Accelerated Applications for the Summit Supercomputer Fernanda Foertter HPC User Assistance Group Training Lead foertterfs@ornl.gov This research used resources of the Oak Ridge Leadership

More information

Finding a needle in Haystack: Facebook's photo storage

Finding a needle in Haystack: Facebook's photo storage Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,

More information

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING

TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING TECHNICAL OVERVIEW ACCELERATED COMPUTING AND THE DEMOCRATIZATION OF SUPERCOMPUTING Table of Contents: The Accelerated Data Center Optimizing Data Center Productivity Same Throughput with Fewer Server Nodes

More information

Database Systems II. Secondary Storage

Database Systems II. Secondary Storage Database Systems II Secondary Storage CMPT 454, Simon Fraser University, Fall 2009, Martin Ester 29 The Memory Hierarchy Swapping, Main-memory DBMS s Tertiary Storage: Tape, Network Backup 3,200 MB/s (DDR-SDRAM

More information

The impact of 3D storage solutions on the next generation of memory systems

The impact of 3D storage solutions on the next generation of memory systems The impact of 3D storage solutions on the next generation of memory systems DevelopEX 2017 Airport City Israel Avi Klein Engineering Fellow, Memory Technology Group Western Digital Corp October 31, 2017

More information

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013

A Closer Look at the Epiphany IV 28nm 64 core Coprocessor. Andreas Olofsson PEGPUM 2013 A Closer Look at the Epiphany IV 28nm 64 core Coprocessor Andreas Olofsson PEGPUM 2013 1 Adapteva Achieves 3 World Firsts 1. First processor company to reach 50 GFLOPS/W 3. First semiconductor company

More information

Distributed systems: paradigms and models Motivations

Distributed systems: paradigms and models Motivations Distributed systems: paradigms and models Motivations Prof. Marco Danelutto Dept. Computer Science University of Pisa Master Degree (Laurea Magistrale) in Computer Science and Networking Academic Year

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu

Database Architecture 2 & Storage. Instructor: Matei Zaharia cs245.stanford.edu Database Architecture 2 & Storage Instructor: Matei Zaharia cs245.stanford.edu Summary from Last Time System R mostly matched the architecture of a modern RDBMS» SQL» Many storage & access methods» Cost-based

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Enabling Technology for the Cloud and AI One Size Fits All?

Enabling Technology for the Cloud and AI One Size Fits All? Enabling Technology for the Cloud and AI One Size Fits All? Tim Horel Collaborate. Differentiate. Win. DIRECTOR, FIELD APPLICATIONS The Growing Cloud Global IP Traffic Growth 40B+ devices with intelligence

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different

More information

IBM Education Assistance for z/os V2R2

IBM Education Assistance for z/os V2R2 IBM Education Assistance for z/os V2R2 Item: RSM Scalability Element/Component: Real Storage Manager Material current as of May 2015 IBM Presentation Template Full Version Agenda Trademarks Presentation

More information

GRVI Phalanx Update: A Massively Parallel RISC-V FPGA Accelerator Framework. Jan Gray CARRV2017: 2017/10/14

GRVI Phalanx Update: A Massively Parallel RISC-V FPGA Accelerator Framework. Jan Gray   CARRV2017: 2017/10/14 GRVI halanx Update: A Massively arallel RISC-V FGA Accelerator Framework Jan Gray jan@fpga.org http://fpga.org CARRV2017: 2017/10/14 FGA Datacenter Accelerators Are Almost Mainstream Catapult v2. Intel

More information

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search

Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search Jialiang Zhang, Soroosh Khoram and Jing Li 1 Outline Background Big graph analytics Hybrid

More information

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp

Interconnect Challenges in a Many Core Compute Environment. Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Interconnect Challenges in a Many Core Compute Environment Jerry Bautista, PhD Gen Mgr, New Business Initiatives Intel, Tech and Manuf Grp Agenda Microprocessor general trends Implications Tradeoffs Summary

More information

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim

Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Integrating NVIDIA Deep Learning Accelerator (NVDLA) with RISC-V SoC on FireSim Farzad Farshchi, Qijing Huang, Heechul Yun University of Kansas, University of California, Berkeley SiFive Internship Rocket

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information

Phase Change Memory An Architecture and Systems Perspective

Phase Change Memory An Architecture and Systems Perspective Phase Change Memory An Architecture and Systems Perspective Benjamin C. Lee Stanford University bcclee@stanford.edu Fall 2010, Assistant Professor @ Duke University Benjamin C. Lee 1 Memory Scaling density,

More information

Emerging NV Storage and Memory Technologies --Development, Manufacturing and

Emerging NV Storage and Memory Technologies --Development, Manufacturing and Emerging NV Storage and Memory Technologies --Development, Manufacturing and Applications-- Tom Coughlin, Coughlin Associates Ed Grochowski, Computer Storage Consultant 2014 Coughlin Associates 1 Outline

More information

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX

Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Scaling Internet TV Content Delivery ALEX GUTARIN DIRECTOR OF ENGINEERING, NETFLIX Inventing Internet TV Available in more than 190 countries 104+ million subscribers Lots of Streaming == Lots of Traffic

More information

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now?

CPS101 Computer Organization and Programming Lecture 13: The Memory System. Outline of Today s Lecture. The Big Picture: Where are We Now? cps 14 memory.1 RW Fall 2 CPS11 Computer Organization and Programming Lecture 13 The System Robert Wagner Outline of Today s Lecture System the BIG Picture? Technology Technology DRAM A Real Life Example

More information

Infrastructure Innovation Opportunities Y Combinator 2013

Infrastructure Innovation Opportunities Y Combinator 2013 Infrastructure Innovation Opportunities Y Combinator 2013 James Hamilton, 2013/1/22 VP & Distinguished Engineer, Amazon Web Services email: James@amazon.com web: mvdirona.com/jrh/work blog: perspectives.mvdirona.com

More information

Rethinking DRAM Power Modes for Energy Proportionality

Rethinking DRAM Power Modes for Energy Proportionality Rethinking DRAM Power Modes for Energy Proportionality Krishna Malladi 1, Ian Shaeffer 2, Liji Gopalakrishnan 2, David Lo 1, Benjamin Lee 3, Mark Horowitz 1 Stanford University 1, Rambus Inc 2, Duke University

More information