Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

1 Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das. M-Bits Research Group.

2 Can we transform the CPU into a neural accelerator? (Diagram: a CPU with its caches, alongside a GPU.)

3 Can we transform the CPU into a neural accelerator? Neural Cache repurposes the CPU's caches as the accelerator: ++ parallelism, -- data movement, relative to a GPU.

4 Transforming caches into massively parallel vector ALUs. Baseline: an 18-core Xeon processor with a 45 MB LLC organized as 18 LLC slices.

5 Transforming caches into massively parallel vector ALUs. Each 2.5 MB LLC slice (with its TMU and CBOX) contains Way 1 through Way 20; a way is built from 32 kB data banks, each holding 8 kB arrays. Running totals: 18 LLC slices, 360 ways.

6 Transforming caches into massively parallel vector ALUs. Zooming into one 8 kB SRAM array: wordlines (WL), bitlines (BL/BLB), and a row decoder. Running totals: 18 LLC slices, 360 ways, 5760 arrays.

7 Transforming caches into massively parallel vector ALUs. Operands are stored as bit-slices (Bit-Slice 0 through Bit-Slice 3) of Array A and Array B on the same bitlines; logic at the bottom of the bitlines computes A + B. Running totals: 18 LLC slices, 360 ways, 5760 arrays.

8 Transforming caches into massively parallel vector ALUs. Every bitline pair gets a bitline ALU: the sense amps produce A AND B (on BL), ~A AND ~B (on BLB), and A XOR B; a carry latch (D flip-flop, enabled by C_EN) holds C, and the sum is S = A ^ B ^ C with carry-out Cout. Running totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.

9 The passive last-level cache is transformed into nearly 1.5 million bit-serial active ALUs: Add, Multiply, and Divide at configurable precision, operating bit-serially at a GHz clock. Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.

10 Why bit-serial? First consider bit-parallel arithmetic for A + B inside the array: row decoders select the operands and logic below the bitlines (BL/BLB) computes the sum.

11 Why bit-serial? In the bit-parallel layout, Word 0 through Word 3 of Array A and Array B each lie along a wordline, spanning several bitlines; the logic computes A + B.

12 Why bit-serial? Activating wordlines WL1 and WL2 reads a word of A and a word of B; the logic produces the least-significant sum bit S first.

13 Why bit-serial? The carry C from each bit position must ripple to the neighboring bitline before the next sum bit S is valid: carry propagation across bitlines.

14 Why bit-serial? The carry keeps rippling across the bitlines, yielding the remaining sum bits one by one.

15 Why bit-serial? Verdict on bit-parallel in-array arithmetic: carry propagation across bitlines means high circuit complexity and a loss of throughput and efficiency.

16 Why bit-serial? Now consider bit-serial arithmetic for A + B in the same array (row decoders, BL/BLB, logic).

17 Why bit-serial? With a transposed data layout, Word 0 through Word 3 of Array A and Array B each lie along a bitline; every bitline has its own Sum (S) and Carry latch, so all words are added in parallel, one bit position at a time.

18 Why bit-serial? Cycle 1: wordlines WL1 and WL2 activate Bit-Slice 0 of A and B; every bitline computes its sum bit S.

19 Why bit-serial? Cycle 2: Bit-Slice 1 is processed; each bitline consumes the carry C latched in cycle 1 and latches a new one.

20 Why bit-serial? Cycle 3: Bit-Slice 2 is processed the same way, carries staying local to each bitline.

21 Why bit-serial? Cycle 4 completes the 4-bit addition. Bit-serial arithmetic offers low area complexity, high throughput, and configurable, high precision; a software sketch of the scheme follows.
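To make the scheme concrete, here is a minimal software model of bit-serial addition (an illustration written for this transcription, not code from the paper): words are stored transposed so that bit-slice b of every word forms one row, each bitline keeps a private carry latch across cycles, and an N-bit add completes in N + 1 cycles (the last cycle stores the final carry).

```python
# Minimal sketch: bit-serial addition across many "bitlines" (lanes).
def bit_serial_add(A, B, n_bits):
    """Add word vectors A and B, one bit position per cycle."""
    lanes = len(A)                      # one lane per bitline
    carry = [0] * lanes                 # per-bitline carry latch (C)
    out = [0] * lanes
    for b in range(n_bits):             # cycle b processes bit-slice b (LSB first)
        a_row = [(A[i] >> b) & 1 for i in range(lanes)]  # read slice b of A
        b_row = [(B[i] >> b) & 1 for i in range(lanes)]  # read slice b of B
        for i in range(lanes):          # in hardware all lanes run in parallel
            s = a_row[i] ^ b_row[i] ^ carry[i]           # S = A ^ B ^ C
            carry[i] = (a_row[i] & b_row[i]) | (carry[i] & (a_row[i] ^ b_row[i]))
            out[i] |= s << b            # write the sum row back
    for i in range(lanes):              # cycle N+1: store the final carry
        out[i] |= carry[i] << n_bits
    return out

print(bit_serial_add([3, 5, 7, 9], [1, 2, 3, 4], 4))  # -> [4, 7, 10, 13]
```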

22 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

23 In-SRAM Arithmetic. Recap of the organization: 8 kB SRAM arrays inside each 2.5 MB LLC slice, with a bitline ALU per bitline pair (A AND B, ~A AND ~B, A XOR B from the sense amps; carry latch with C_EN; S = A ^ B ^ C). Totals: 18 LLC slices, 360 ways, 5760 arrays, 1,474,560 ALUs.

24 In-SRAM Logical Operations: the required changes to the array are an additional row decoder, so two wordlines can be activated at once, and reconfigurable sense amplifiers, differential for normal reads and single-ended for compute, on bitlines BL/BLB through BLn/BLBn.

25 In-SRAM Logical Operations: with rows A and B activated together, the single-ended sense amplifier on bitline BL reads A AND B.

26 In-SRAM Logical Operations: simultaneously, the single-ended sense amplifier on BLB reads A NOR B (that is, ~A AND ~B).
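A sketch of the two-row activation trick just described, modeled with Python integers as bit-vectors (idealized behavior; the real circuit senses these values on the bitlines):

```python
# Activating two wordlines at once: BL discharges unless both cells hold 1
# (wired-AND); BLB discharges unless both cells hold 0 (wired-NOR).
def two_row_read(a, b, width=8):
    mask = (1 << width) - 1
    and_ = a & b                   # sensed single-ended on BL
    nor_ = ~(a | b) & mask         # sensed single-ended on BLB (~A & ~B)
    xor_ = ~(and_ | nor_) & mask   # derived: neither both-1 nor both-0
    return and_, nor_, xor_

print(two_row_read(0b1100, 0b1010, 4))  # -> (8, 1, 6), i.e. AND, NOR, XOR
```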

27 In-SRAM Addition: words A (bits A0, A1) and B (bits B0, B1) sit transposed on the same bitlines (256 bitlines per array), alongside Carry and Sum rows and result rows P0, P1, P2; the bitline ALU computes S = A ^ B ^ C and latches the carry via C_EN.

28 In-SRAM Addition [Cycle 1]: read A0 and B0, write the sum bit to P0, and latch the carry.

29 In-SRAM Addition [Cycle 2]: read A1 and B1, add the latched carry, and write the sum bit to P1.

30 In-SRAM Addition [Cycle 3]: write the final carry out to P2, completing the 2-bit addition in N + 1 = 3 cycles.

31 In-SRAM Multiplication: operands A (A0, A1) and B (B0, B1) share a bitline with partial-product rows P0 through P3, Carry and Sum rows, and a Tag bit that predicates writes.

32 In-SRAM Multiplication [Cycle 1]: to compute A x B for 2-bit operands (A1A0 x B1B0), the partial products A0·B0, A1·B0, A0·B1, A1·B1 must be accumulated into P0 through P3; this cycle loads the Tag from multiplier bit B0.

33 In-SRAM Multiplication [Cycle 2]: P0 <- A0·B0.

34 In-SRAM Multiplication [Cycle 3]: P1 <- A1·B0.

35 In-SRAM Multiplication [Cycle 4]: the Tag is reloaded from multiplier bit B1 to predicate the next partial product.

36 In-SRAM Multiplication [Cycle 5]: P1 <- A1·B0 + A0·B1, implemented with predication: if (B1), P <- P + A; else, P <- P.

37 In-SRAM Multiplication [Cycle 6]: P2 <- A1·B1 plus the carry from cycle 5.

38 In-SRAM Multiplication [Cycle 7]: P3 <- Cin, storing the final carry and completing the 2-bit multiply.
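The walkthrough above is predicated shift-and-add. A compact software model (illustrative only; it collapses the per-bit carry handling that the array spreads over several cycles):

```python
# Sketch: tag-predicated shift-and-add multiply across many lanes.
def bit_serial_mul(A, B, n_bits):
    lanes = len(A)
    P = [0] * lanes                     # partial-product rows P0..P(2N-1)
    for j in range(n_bits):             # one pass per multiplier bit B_j
        tag = [(B[i] >> j) & 1 for i in range(lanes)]   # Tag <- B_j
        for i in range(lanes):
            if tag[i]:                  # predicated: if (B_j) P <- P + (A << j)
                P[i] += A[i] << j       # this add itself runs bit-serially
    return P

print(bit_serial_mul([3, 7], [5, 6], 3))  # -> [15, 42]
```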

39 Supported Arithmetic. Operation | Cycles: ADD | N + 1; SUB | 2N + 1; MUL | N^2 + 5N - 2; DIV | ~1.5N^2; Comparison | 2N + 1. Overheads: 7.5% area for the synthesized array, 2% area at the processor-chip level.
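With these cycle counts and the 2.5 GHz array clock from the methodology slide, a back-of-envelope peak-throughput estimate (our arithmetic; the reconstructed ALU count is an assumption):

```python
ALUS = 1_474_560   # 5760 arrays x 256 bitlines (slide 8, reconstructed)
FREQ = 2.5e9       # array clock in Hz (methodology slide)

def peak_ops_per_sec(cycles):
    # each bit-serial ALU retires one operation every `cycles` cycles
    return ALUS * FREQ / cycles

N = 8              # Inception v3 uses 8-bit weights and inputs
print(f"peak 8-bit adds/s: {peak_ops_per_sec(N + 1):.2e}")          # ~4.1e14
print(f"peak 8-bit muls/s: {peak_ops_per_sec(N*N + 5*N - 2):.2e}")  # ~3.6e13
```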

40 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

41 Transpose: a Transpose Memory Unit (TMU), placed near the CBOX, is built from 8-T transpose bit-cells with both a row decoder and a column decoder. It supports regular read/write as well as transpose read/write, so words A0, A1, A2, ... and B0, B1, B2, ... can be streamed out as bit-slices (all MSBs together, down to all LSBs together).

42 Transpose: the TMU converts words A0-A2, B0-B2, C0-C2 between the regular layout (one word per row) and the transposed layout (one bit-slice per row) that the bitline ALUs consume.
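Functionally, the TMU performs the bit-matrix transpose sketched below (a software model; the hardware achieves it with 8-T bit-cells and dual decoders rather than loops):

```python
# Regular layout (one word per row) -> bit-serial layout
# (row b holds bit b of every word).
def transpose_words(words, n_bits):
    return [[(w >> b) & 1 for w in words] for b in range(n_bits)]

rows = transpose_words([0b1010, 0b0111, 0b0001], 4)
# rows[0] = the LSBs of A0..A2, ..., rows[3] = the MSBs,
# ready to be written one row per wordline for the bitline ALUs.
```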

43 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

44 A Convolutional Layer: M 3-D filters, each with C channels of R x S weights, slide over the input activations (C channels, H x W) to produce the output activations (M channels, E x F).

45 Mapping CNN to Neural Cache: within an 8 kB SRAM array (256 bitlines), each bitline holds a filter's R x S x 8-bit weights and the matching R x S x 8-bit input activations (gathered by unrolling over H and W), plus rows for the partial sums and the output. Each bitline performs its MACs bit-serially; a reduction across the channel dimension C then yields the output activation.

46 Mapping CNN to Neural Cache: the C = 256 channels of a filter map onto the 256 wordlines of an array (channel 1 through channel 256). Across a 2.5 MB LLC slice, bitlines are grouped into quads (Quad 1 through Quad 4), each serving a different output position, while the M = 32 filters spread across the ways (Way 1, Way 2, Way 3, ...).
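In scalar form, the dot product that each group of bitlines is responsible for looks like the sketch below (illustrative; shapes are assumptions based on the slide, and in hardware the multiplies run bit-serially in parallel, followed by a log-depth in-array reduction):

```python
# One output pixel for one filter: sum over the weight/activation pairs
# laid out along the wordline (channel) dimension.
def output_activation(weights, acts):
    prods = [w * a for w, a in zip(weights, acts)]   # parallel in-array MACs
    while len(prods) > 1:                            # log-depth reduction tree
        prods = [prods[i] + prods[i + 1] for i in range(0, len(prods), 2)]
    return prods[0]                                  # assumes a power-of-two count

print(output_activation([1, 2, 3, 4], [5, 6, 7, 8]))  # -> 70
```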

47 Mapping of Convolution to Array: the output positions (E x F) and filters (M) are spread across Ways 1-18 of every slice (Ways 19-20 are reserved) and across Slices 1 through 14.

48 Putting it together: (1) filter loading, (2) input loading, (3) MAC, (4) reduction, then output transfer. Cores 1-14 and LLC Slices 1-14 communicate with DRAM over the ring interconnect; within a slice, filter weights, input activations, and output activations occupy Ways 1-18 (Ways 19-20 reserved), organized as Quads 1 through 4. A pseudocode sketch of this flow follows.
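The per-layer flow from this slide as pseudocode, with hypothetical helper names (no such API exists in the paper; only the sequencing is meaningful):

```python
# Hypothetical sequencing sketch of one convolutional layer on Neural Cache.
def run_layer(layer, slices):
    for s in slices:
        s.load_filter_weights(layer.filters)       # (1) filter loading, once per layer
    for batch in layer.input_batches():            # inputs stream in from DRAM
        for s in slices:                           # all slices operate in parallel
            s.load_inputs(batch)                   # (2) input loading over the ring
            s.mac()                                # (3) bit-serial multiplies
            s.reduce()                             # (4) in-array reduction tree
        yield [s.read_outputs() for s in slices]   # output transfer over the ring
```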

49 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

50 Evaluation Methodology. DNN model: Inception v3 with 8-bit weights and inputs.
CPU (2 sockets): Intel Xeon E5 v3, 2.6 GHz, 28 cores, 56 threads; 70 MB on-chip (dual socket); 64 GB DRAM; TensorFlow tfprof for performance, Intel RAPL interface for energy.
GPU (1 card): Nvidia Titan Xp, 1.6 GHz, 3840 CUDA cores; 9.4 MB on-chip; 12 GB DRAM; TensorFlow tfprof for performance, NVIDIA System Management Interface for energy.
Neural Cache: 2.5 GHz compute SRAM with 1,032,192 bit-serial ALUs; 35 MB on-chip; 64 GB DRAM; cycle-accurate simulator plus C microbenchmarks for performance, SPICE simulation plus Intel RAPL interface for energy.

51 Outline: Motivation, Bit-Serial Arithmetic, Transpose, Mapping of Convolution to Array, Methodology, Results.

52 Throughput and Latency, for CPU (Xeon E5), GPU (Titan Xp), and Neural Cache. [Plots: throughput in inferences/sec versus batch size; latency in ms.] Takeaways: 2.2x throughput improvement and 7.7x latency improvement over the GPU.

53 Power/Energy Comparison. [Plots: total energy in joules and average power in watts for CPU, GPU, and Neural Cache.]

54 Neural Cache Summary: repurposes the cache as a data-parallel DNN accelerator, with massively parallel bit-serial in-SRAM arithmetic and a data layout tailored to CNNs. Large throughput, latency, and energy gains over a server-class CPU at 2% area overhead, and substantial gains over a server-class GPU (including 2.2x throughput and 7.7x latency).

55 Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks. Charles Eckert, Xiaowei Wang, Jingcheng Wang, Arun Subramaniyan, Ravi Iyer, Dennis Sylvester, David Blaauw, Reetuparna Das. M-Bits Research Group.

arXiv:1805.03718v1 [cs.AR], 9 May 2018. To appear in the 45th ACM/IEEE International Symposium on Computer Architecture (ISCA 2018).
