IN-MEMORY ASSOCIATIVE COMPUTING

Size: px
Start display at page:

Download "IN-MEMORY ASSOCIATIVE COMPUTING"

Transcription

1 IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY

2 AGENDA The AI computational challenge Introduction to associative computing Examples An NLP use case What s next?

3 THE CHALLENGE IN AI COMPUTING AI Requirement 32 bit FP Multi precision mining, etc. Scaling Sort-search speech, classify image/video Neural network learning Neural network inference, data Data center Top-K, recommendation, Heavy computation Non linearity, Softmax, exponent, normalize Bandwidth Use Case Example Required for speed and power

4 CURRENT SOLUTION Question CPU Answer General Purpose GPU Very Wide Bus DRAM Tens of Cores Bottleneck when register file data needs to be replaced on a regular basis Limits performance Increases power consumption Thousands of Cores Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that requires functions like Top-K and Softmax.

5 GPU VS CPU VS FPGA

6 GPU VS CPU VS FPGA VS APU

7 GSI S SOLUTION APU ASSOCIATIVE PROCESSING UNIT Question Millions Processors Simple CPU Simple & Narrow Bus Answer APU Associative Memory Computes in-place directly in the memory array removes the I/O bottleneck Significantly increases performances Reduces power

8 IN-MEMORY COMPUTING CONCEPT

9 THE COMPUTING MODEL FOR THE PAST 8 YEARS Read Write Address Decoder Read Write ALU Sense Amp /IO Drivers

10 THE CHANGE IN-MEMORY COMPUTING NOR NOR Read Read Read Write Write Read Simple Controller Patented in-memory logic using only Read/Write operations Any logic/arithmetic function can be generated internally

11 CAM/ ASSOCIATIVE SEARCH Records in the combines key goes to the read enable Values Duplicate vales with inverse data =match RE RE RE RE KEY: Search Duplicate the key with inverse. Move the original key next to the inverse data

12 TCAM SEARCH WITH STANDARD MEMORY CELLS Don t care Don t care Don t care 2

13 TCAM SEARCH WITH STANDARD MEMORY CELLS in the combines key goes to the read enable Insert zero instead of don t-care Duplicate data. Inverse only to those which are not don t care = match =match RE RE RE RE KEY: Search Duplicate the key with inverse. Move The original Key next to the inverse data 3

14 COMPUTING IN THE BIT LINES Vector A Vector B C=f(A,B) Each bit line becomes a processor and storage millions of bit lines = millions of processors

15 NEIGHBORHOOD COMPUTING Shift vector C=f(A,SL(B,)) Parallel shift of bit cycle sections Enables neighborhood operations such as convolutions

16 SEARCH & COUNT Search Count = 3 Search (binary or ternary) all bit lines in cycle 28 M bit lines => 28 Peta search/sec Key applications for search and count for predictive analytics: Recommender systems K-nearest neighbors (using cosine similarity search) Random forest Image histogram Regular expression

17 DATABASE SEARCH AND UPDATE Content-based search, record can be placed anywhere Update, modify, insert, delete is immediate Exact Match CAM/TCAM Similarity Match In-Place Aggregate

18 TRADITIONAL STORAGE CAN DO MUCH MORE Standard memory cell Standard memory cell bit 2 bits 2 input NOR, TCAM cell 3 Bits 3 input NOR, 2 Input NOR + Output 4 State CAM Standard memory cell Standard memory cell

19 CPU/GPGPU VS APU

20 ARCHITECTURE

21 SECTION COMPUTING TO IMPROVE PERFORMANCE MLB section 24 rows Memory control Connecting Mux MLB section Connecting mux... Instr. Buffer 2

22 COMMUNICATION BETWEEN SECTIONS Shift between sections enable neighborhood operations (filters, CNN etc.) Store, compute, search and move data anywhere 22

23 APU CHIP LAYOUT 2M bit processors or 28K vector processors runs at G Hz with up to 2 Peta OPS peak performance

24 EVALUATION BOARD PERFORMANCE Precision : Unlimited : from bit to 6 bits or more. 6.4 TOPS (FP) 8 Peta OPS for one bit computing or 6 bit exact search Similarity Search, Top-k, min, max, Softmax, O() complexity in μs, any size of K compared to ms with current solutions In-memory IO 2 Petabit/sec > X GPGPU/CPU/FPGA Sparse matrix multiplication > X GPGPU/CPU/FPGA

25 APU SERVER 64 APU chips, GByte DDR, From TFLOPS Up to 28 Peta OPS with peak performance 28TOPS/W O() Top-K, min, max, 32 Peta bits/sec internal IO < K Watts > X GPGPs on average Linearly scalable Currently 28nm process and scalable to 7nm or less Well suited to advanced memory technology such as non volatile ReRAM and more

26 EXAMPLE APPLICATIONS

27 K-NEAREST NEIGHBORS (K-NN) Simple example: N = 36, 3 groups, 2 dimensions (D = 2 ) for X and Y K = 4 Group Green selected as the majority For actual applications: N = Billions, D = Tens, K = Tens of thousands

28 K-NN USE CASE IN AN APU Item Item2 Item N Item N Item 3 Item Item 2 Features of item Features of item 2 Features of item N Item features and label storage Q Compute cosine distances for all N in parallel ( μs, assuming D=5 features) Computing Distribute data Area 2 ns (to all) K Mins at O() complexity ( 3μs) In-Place ranking Majority Calculation With the data base in an APU, computation for all N items done in.5 ms, independent of K

29 LARGE DATABASE EXAMPLE USING APU SERVERS Number of items: billions Features per item: tens to hundreds Latency: msec Throughput: Scales to M similarity searches/sec k-nn: Top, nearest neighbors

30 EXAMPLE K-NN FOR RECOGNITION Image Convolution Layer Feature Extractor (Neural Network) K-NN Classifier (Associative Memory) Text BOW, Word Embedding

31 K-MINS: O() ALGORITHM MSB LSB KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } C C C 2 N

32 K-MINS: THE ALGORITHM V N V N M D C[] KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } cnt=

33 K-MINS: THE ALGORITHM V N V N M D C[] KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } cnt=8

34 K-MINS: THE ALGORITHM KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } final output V C O() Complexity

35 DENSE (XN) VECTOR BY SPARSE NXM MATRIX Input Vector Sparse Matrix Output Vector APU Representations and Computing Column Row Search all columns for row = 2 :Distribute -2 : 2 Cy Search all columns for row = 3 :Distribute 3 : 2 Cy Search all columns for row = 4 :Distribute - : 2 Cy Multiply in Parallel : Cy Shift and Add all belonging to same column Complexity including IO : O (N +logβ) where β is the number of nonzero elements in the sparse matrix N << M in general for recommender systems

36 SPARSE MATRIX MULTIPLICATION PERFORMANCE ANALYSIS G3 circuit matrix.5m X.5M sparse matrix Roughly 8M nonzero elements 2 GFLOPS with GPGPU solution APU solution provides 64 TFLOPS using the same amount of power as the GPGPU solution above > 5x improvement with APU solution

37 ASSOCIATIVE MEMORY FOR NATURAL LANGUAGE PROCESSING (NLP) Q&A, dialog, language translation, speech recognition etc. Requires learning things from the past needs memory More memory, more accuracy i.e. Dan put the book in his car,.. Long story here. Mike took Dan s car Long story here. He drove to SF Q : Where is the book now? A: Car, SF

38 END-TO-END MEMORY NETWORKS End-To-End Memory Networks, (Weston et. al., NIPS 25). (a): Single hope, (b) 3 hops

39 Q&A : END TO END NETWORK

40 REQUIREMENTS FOR AUGMENTED MEMORY Vertical Embedding Multiplication Input features selectedto next coloumn, Columns to any and other horizontal location or sumlocation based on content Compute softmax to selected Output Cosine Similarity Search + Top-K

41 APU MEMORY FOR NLP 2 N- Control I O m i M T i Broadcast input I to selected columns Compute any function at selected columns Generate output Generate tags for selection i m i i m i

42 GSI SOLUTION FOR END TO END Constant time of 3 µsec per iteration, any memory size.

43 PROGRAMING MODEL

44 PROGRAMMING MODEL Application (C++, Python) HOST Framework (TensorFlow ) Graph Execution and Tasks Scheduling APU (Associative Processing Unit Hardware) Device

45 A TF EXAMPLE: MATMUL a = tf.placeholder(tf.int32, shape=[3,4]) b = tf.placeholder(tf.int32, shape=[4,6]) c = tf.matmul(a, b) # shape = [3,6] matmul c a b

46 A TF EXAMPLE: MATMUL GRAPH PREPARATION a = tf.placeholder(tf.int32, shape=[3, 4]) b = tf.placeholder(tf.int32, shape=[4, 6]) c = tf.matmul(a, b) APU DEVICE SPACE host apu device (tf+eigen) gnl_create_array(a) gnl_create_array(b) gnl_create_array(c) L4 a b c apuc s L MMB

47 A TF EXAMPLE: MATMUL GVML_SET, GVML_MUL with tf.session() as sess: result = sess.run(c, feed_dict= {a: [[ ] b: [[ ] [ -3 4] [ ] [ ]], [ ] [ ]]}) APU DEVICE SPACE gnlpd_dma_6b_start(gnlpd_sys_2_vmr, ) L a b c dma keeps loading data to apuc dma copy L4 to L gvml_set_6( ) gnlpd_mat_mul(c,a,b) gvml_set_6( ) gvml_mul_s6( ) apuc controller c while matmul is being computed in apuc X gvml_mul_s6( ) L MMB

48 TENSORFLOW ENHANCEMENT: FUSED OPERATIONS a = tf.placeholder(tf.int32, shape=[3,4]) b = tf.placeholder(tf.int32, shape=[4,6]) c = tf.matmul(a, b) # shape = [3,6] d = tf.nn.top_k(c, k=2) # shape = [3,2] fused The two operations are computed inside the apuc top_k Data stays in L d No IO operations between them Saves valuable data transfer time and power fused(matmul,top_k) matmul c a b

49 A TF EXAMPLE: MATMUL CODE EXAMPLE APU DEVICE c L MMB APL_FRAG add_u6(rn_reg x, RN_REG y, RN_REG t_xory, RN_REG t_ci) { SM_XFFFF: RL = SB[x]; // RL[-5] = x SM_XFFFF: RL = SB[y]; // RL[-5] = x y { SM_XFFFF: SB[t_xory] = RL; // t_xory[-5] = x y SM_XFFFF: RL = SB[x, y]; // RL[-5] = x&y } // Add init state: // : RL = co[] //..5: RL = x&y { (SM_X << ): SB[t_ci] = NRL; // t_ci[5,9,3] = x&y (SM_X << ): RL = SB[t_xory] & NRL; // RL[] = Cout[] = x&y ci(x y) // 5,9,3: RL = Cout[5,9,3] = x&y ci(x y) } { } { } { }... (SM_X << 4): RL = SB[t_xory]; (SM_X << 2): SB[t_ci] = NRL; // Propagate Cin (SM_X << 2): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 5): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 3): SB[t_ci] = NRL; // Propagate Cin (SM_X << 3): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 6): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 5): GL = RL; // RL[4,8,2] = Cout[4,8,2] = x&y &(x y) (SM_X << 4): SB[t_ci] = NRL; // t_ci[8,2,6] = Cout[7,, 5] SM_X: SB[t_ci] = GL; // t_ci[8,2,6] = Cout[7,, 5] (SM_X << 7): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 5): GL = RL;

50 FUTURE APPROACH NON VOLATILE CONCEPT

51 SOLUTIONS FOR FUTURE DATA CENTERS CPU Register File L/L2/L3 DRAM ASSOCIATIVE High endurance Full computing (floating points etc.) requires read & write : Low endurance Data search engines (read most of the time) Standard SRAM Based STT-RAM RAM Based PC-RAM Based ReRam Based Flash HDD Volatile Non Volatile Mid endurance Machine learning, malware detection detection etc., : Much more read and much less write

52 THANK YOU

In-Place Associative Computing

In-Place Associative Computing In-Place Associative Computing All Images are Public in the Web Avidan Akerib Ph.D. Vice President Associative Computing BU aakerib@gsitechnology.com Agenda Introduction to associative computing Use case

More information

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks

Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research

More information

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks

Recurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and

More information

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu

Mohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu Mohsen Imani University of California San Diego Winter 2016 Technology Trend for IoT http://www.flashmemorysummit.com/english/collaterals/proceedi ngs/2014/20140807_304c_hill.pdf 2 Motivation IoT significantly

More information

Memory technology and optimizations ( 2.3) Main Memory

Memory technology and optimizations ( 2.3) Main Memory Memory technology and optimizations ( 2.3) 47 Main Memory Performance of Main Memory: Latency: affects Cache Miss Penalty» Access Time: time between request and word arrival» Cycle Time: minimum time between

More information

In-Place Associative Computing:

In-Place Associative Computing: In-Place Associative Computing: 1 Page Abstract... 3 Overview... 3 Associative Processing Unit (APU) Card... 3 Host-Device interface... 4 The APU Card Controller... 4 Host to Device Interactions... 5 APU

More information

Revolutionizing the Datacenter

Revolutionizing the Datacenter Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant

More information

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019 Exploring the Structure of Data at Scale Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019 Outline Why exploration of large datasets matters Challenges in working with large data

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA

Inference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos

More information

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs

Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru

More information

CMOS Logic Circuit Design Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計

CMOS Logic Circuit Design   Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計 CMOS Logic Circuit Design http://www.rcns.hiroshima-u.ac.jp Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計 Memory Circuits (Part 1) Overview of Memory Types Memory with Address-Based Access Principle of Data Access

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Basic Organization Memory Cell Operation. CSCI 4717 Computer Architecture. ROM Uses. Random Access Memory. Semiconductor Memory Types

Basic Organization Memory Cell Operation. CSCI 4717 Computer Architecture. ROM Uses. Random Access Memory. Semiconductor Memory Types CSCI 4717/5717 Computer Architecture Topic: Internal Memory Details Reading: Stallings, Sections 5.1 & 5.3 Basic Organization Memory Cell Operation Represent two stable/semi-stable states representing

More information

Effect of memory latency

Effect of memory latency CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable

More information

PROGRAMMABLE MODULES SPECIFICATION OF PROGRAMMABLE COMBINATIONAL AND SEQUENTIAL MODULES

PROGRAMMABLE MODULES SPECIFICATION OF PROGRAMMABLE COMBINATIONAL AND SEQUENTIAL MODULES PROGRAMMABLE MODULES SPECIFICATION OF PROGRAMMABLE COMBINATIONAL AND SEQUENTIAL MODULES. psa. rom. fpga THE WAY THE MODULES ARE PROGRAMMED NETWORKS OF PROGRAMMABLE MODULES EXAMPLES OF USES Programmable

More information

Memory. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Memory. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University Memory Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University Big Picture: Building a Processor memory inst register file alu PC +4 +4 new pc offset target imm control extend =? cmp

More information

UNIT - V MEMORY P.VIDYA SAGAR ( ASSOCIATE PROFESSOR) Department of Electronics and Communication Engineering, VBIT

UNIT - V MEMORY P.VIDYA SAGAR ( ASSOCIATE PROFESSOR) Department of Electronics and Communication Engineering, VBIT UNIT - V MEMORY P.VIDYA SAGAR ( ASSOCIATE PROFESSOR) contents Memory: Introduction, Random-Access memory, Memory decoding, ROM, Programmable Logic Array, Programmable Array Logic, Sequential programmable

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences

More information

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES

PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES Greg Hankins APRICOT 2012 2012 Brocade Communications Systems, Inc. 2012/02/28 Lookup Capacity and Forwarding

More information

High Performance Computing Lecture 26. Matthew Jacob Indian Institute of Science

High Performance Computing Lecture 26. Matthew Jacob Indian Institute of Science High Performance Computing Lecture 26 Matthew Jacob Indian Institute of Science Agenda 1. Program execution: Compilation, Object files, Function call and return, Address space, Data & its representation

More information

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks

Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu

More information

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter

Latches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more

More information

SDA: Software-Defined Accelerator for Large- Scale DNN Systems

SDA: Software-Defined Accelerator for Large- Scale DNN Systems SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A

More information

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy

CSCI-UA.0201 Computer Systems Organization Memory Hierarchy CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Programmer s Wish List Memory Private Infinitely large Infinitely fast Non-volatile

More information

Caches. Hiding Memory Access Times

Caches. Hiding Memory Access Times Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY

More information

NeuroMem. A Neuromorphic Memory patented architecture. NeuroMem 1

NeuroMem. A Neuromorphic Memory patented architecture. NeuroMem 1 NeuroMem A Neuromorphic Memory patented architecture NeuroMem 1 Unique simple architecture NM bus A chain of identical neurons, no supervisor 1 neuron = memory + logic gates Context Category ted during

More information

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based

More information

Memory & Logic Array. Lecture # 23 & 24 By : Ali Mustafa

Memory & Logic Array. Lecture # 23 & 24 By : Ali Mustafa Memory & Logic Array Lecture # 23 & 24 By : Ali Mustafa Memory Memory unit is a device to which a binary information is transferred for storage. From which information is retrieved when needed. Types of

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols

Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols 1 Modern Memory System...... PROC.. 4 DDR3 channels 64-bit data channels

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 4: Memory Hierarchy Memory Taxonomy SRAM Basics Memory Organization DRAM Basics Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering

More information

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics

,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics ,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics The objectives of this module are to discuss about the need for a hierarchical memory system and also

More information

ECEN 449 Microprocessor System Design. Memories. Texas A&M University

ECEN 449 Microprocessor System Design. Memories. Texas A&M University ECEN 449 Microprocessor System Design Memories 1 Objectives of this Lecture Unit Learn about different types of memories SRAM/DRAM/CAM Flash 2 SRAM Static Random Access Memory 3 SRAM Static Random Access

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

COMPUTER ARCHITECTURES

COMPUTER ARCHITECTURES COMPUTER ARCHITECTURES Random Access Memory Technologies Gábor Horváth BUTE Department of Networked Systems and Services ghorvath@hit.bme.hu Budapest, 2019. 02. 24. Department of Networked Systems and

More information

Computer Architecture Memory hierarchies and caches

Computer Architecture Memory hierarchies and caches Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches

More information

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU Computation Strategies & Tricks. Ian Buck NVIDIA GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit

More information

Chapter Two - SRAM 1. Introduction to Memories. Static Random Access Memory (SRAM)

Chapter Two - SRAM 1. Introduction to Memories. Static Random Access Memory (SRAM) 1 3 Introduction to Memories The most basic classification of a memory device is whether it is Volatile or Non-Volatile (NVM s). These terms refer to whether or not a memory device loses its contents when

More information

Windowing System on a 3D Pipeline. February 2005

Windowing System on a 3D Pipeline. February 2005 Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April

More information

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory

NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory Dhananjoy Das, Sr. Systems Architect SanDisk Corp. 1 Agenda: Applications are KING! Storage landscape (Flash / NVM)

More information

Novel Hardware Architecture for Fast Address Lookups

Novel Hardware Architecture for Fast Address Lookups Novel Hardware Architecture for Fast Address Lookups Pronita Mehrotra Paul D. Franzon Department of Electrical and Computer Engineering North Carolina State University {pmehrot,paulf}@eos.ncsu.edu This

More information

Memory Hierarchy and Caches

Memory Hierarchy and Caches Memory Hierarchy and Caches COE 301 / ICS 233 Computer Organization Dr. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline

More information

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1> Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building

More information

Deep Learning Requirements for Autonomous Vehicles

Deep Learning Requirements for Autonomous Vehicles Deep Learning Requirements for Autonomous Vehicles Pierre Paulin, Director of R&D Synopsys Inc. Chipex, 1 May 2018 1 Agenda Deep Learning and Convolutional Neural Networks for Embedded Vision Automotive

More information

Architectural Support for Large-Scale Visual Search. Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin

Architectural Support for Large-Scale Visual Search. Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin Architectural Support for Large-Scale Visual Search Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin Motivation: Visual Data & Their Applications Rebooting the IT Revolution, SIA, September

More information

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017

Caches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017 Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the

More information

EGCP 1010 Digital Logic Design (DLD) Laboratory #6

EGCP 1010 Digital Logic Design (DLD) Laboratory #6 EGCP 11 Digital Logic Design (DLD) Laboratory #6 Four by Four (4 x 4) Sorting Stack Prepared By: Alex Laird on October 1st, 2 Lab Partner: Ryan Morehart Objective: The goal of this laboratory is to expose

More information

Toward a Memory-centric Architecture

Toward a Memory-centric Architecture Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains

More information

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA

Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School

More information

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center

Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating

More information

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015 CS 31: Intro to Systems Digital Logic Kevin Webb Swarthmore College February 3, 2015 Reading Quiz Today Hardware basics Machine memory models Digital signals Logic gates Circuits: Borrow some paper if

More information

COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity

More information

MANAGING MULTI-TIERED NON-VOLATILE MEMORY SYSTEMS FOR COST AND PERFORMANCE 8/9/16

MANAGING MULTI-TIERED NON-VOLATILE MEMORY SYSTEMS FOR COST AND PERFORMANCE 8/9/16 MANAGING MULTI-TIERED NON-VOLATILE MEMORY SYSTEMS FOR COST AND PERFORMANCE 8/9/16 THE DATA CHALLENGE Performance Improvement (RelaLve) 4.4 ZB Total data created, replicated, and consumed in a single year

More information

Computer Organization and Assembly Language (CS-506)

Computer Organization and Assembly Language (CS-506) Computer Organization and Assembly Language (CS-506) Muhammad Zeeshan Haider Ali Lecturer ISP. Multan ali.zeeshan04@gmail.com https://zeeshanaliatisp.wordpress.com/ Lecture 2 Memory Organization and Structure

More information

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

FINN: A Framework for Fast, Scalable Binarized Neural Network Inference FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),

More information

Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models

Lecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models 1 Refresh Every DRAM cell must be refreshed within a 64 ms window A row read/write automatically

More information

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016

CS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016 CS 31: Intro to Systems Digital Logic Kevin Webb Swarthmore College February 2, 2016 Reading Quiz Today Hardware basics Machine memory models Digital signals Logic gates Circuits: Borrow some paper if

More information

ECEN 449 Microprocessor System Design. Memories

ECEN 449 Microprocessor System Design. Memories ECEN 449 Microprocessor System Design Memories 1 Objectives of this Lecture Unit Learn about different types of memories SRAM/DRAM/CAM /C Flash 2 1 SRAM Static Random Access Memory 3 SRAM Static Random

More information

Device Placement Optimization with Reinforcement Learning

Device Placement Optimization with Reinforcement Learning Device Placement Optimization with Reinforcement Learning Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Mohammad Norouzi, Naveen Kumar, Rasmus Larsen, Yuefeng Zhou, Quoc Le, Samy Bengio, and

More information

The Memory Hierarchy Part I

The Memory Hierarchy Part I Chapter 6 The Memory Hierarchy Part I The slides of Part I are taken in large part from V. Heuring & H. Jordan, Computer Systems esign and Architecture 1997. 1 Outline: Memory components: RAM memory cells

More information

Machine Learning on VMware vsphere with NVIDIA GPUs

Machine Learning on VMware vsphere with NVIDIA GPUs Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology

More information

Very Large Scale Integration (VLSI)

Very Large Scale Integration (VLSI) Very Large Scale Integration (VLSI) Lecture 6 Dr. Ahmed H. Madian Ah_madian@hotmail.com Dr. Ahmed H. Madian-VLSI 1 Contents FPGA Technology Programmable logic Cell (PLC) Mux-based cells Look up table PLA

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research

Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)

More information

DRAM Main Memory. Dual Inline Memory Module (DIMM)

DRAM Main Memory. Dual Inline Memory Module (DIMM) DRAM Main Memory Dual Inline Memory Module (DIMM) Memory Technology Main memory serves as input and output to I/O interfaces and the processor. DRAMs for main memory, SRAM for caches Metrics: Latency,

More information

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,

Index. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning, Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110

More information

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design

Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL

Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL 1 THIS TALK Using mixed precision and Volta your networks can be: 1. 2-4x faster 2. half the

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain

Massively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,

More information

Deep Learning with Tensorflow AlexNet

Deep Learning with Tensorflow   AlexNet Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification

More information

Read Only Memory ROM

Read Only Memory ROM Read Only Memory ROM A read only memory have address inputs and data outputs With m address lines you can access the 2 m different memory addresses At each address, there is one data word with n bits Usually,

More information

The Impact of Persistent Memory and Intelligent Data Encoding

The Impact of Persistent Memory and Intelligent Data Encoding The Impact of Persistent Memory and Intelligent Data Encoding Or, How to Succeed with NVDIMMs Without Really Trying Rob Peglar SVP/CTO, Symbolic IO rpeglar@symbolicio.com @peglarr Wisdom Lower R/W Latency

More information

Lecture 8: Virtual Memory. Today: DRAM innovations, virtual memory (Sections )

Lecture 8: Virtual Memory. Today: DRAM innovations, virtual memory (Sections ) Lecture 8: Virtual Memory Today: DRAM innovations, virtual memory (Sections 5.3-5.4) 1 DRAM Technology Trends Improvements in technology (smaller devices) DRAM capacities double every two years, but latency

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

ENGIN 112 Intro to Electrical and Computer Engineering

ENGIN 112 Intro to Electrical and Computer Engineering ENGIN 112 Intro to Electrical and Computer Engineering Lecture 30 Random Access Memory (RAM) Overview Memory is a collection of storage cells with associated input and output circuitry Possible to read

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

CREATED BY M BILAL & Arslan Ahmad Shaad Visit:

CREATED BY M BILAL & Arslan Ahmad Shaad Visit: CREATED BY M BILAL & Arslan Ahmad Shaad Visit: www.techo786.wordpress.com Q1: Define microprocessor? Short Questions Chapter No 01 Fundamental Concepts Microprocessor is a program-controlled and semiconductor

More information

Brainchip OCTOBER

Brainchip OCTOBER Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA

ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong

More information

Cache/Memory Optimization. - Krishna Parthaje

Cache/Memory Optimization. - Krishna Parthaje Cache/Memory Optimization - Krishna Parthaje Hybrid Cache Architecture Replacing SRAM Cache with Future Memory Technology Suji Lee, Jongpil Jung, and Chong-Min Kyung Department of Electrical Engineering,KAIST

More information

Cloud Computing with FPGA-based NVMe SSDs

Cloud Computing with FPGA-based NVMe SSDs Cloud Computing with FPGA-based NVMe SSDs Bharadwaj Pudipeddi, CTO NVXL Santa Clara, CA 1 Choice of NVMe Controllers ASIC NVMe: Fully off-loaded, consistent performance, M.2 or U.2 form factor ASIC OpenChannel:

More information

Versatile RRAM Technology and Applications

Versatile RRAM Technology and Applications Versatile RRAM Technology and Applications Hagop Nazarian Co-Founder and VP of Engineering, Crossbar Inc. Santa Clara, CA 1 Agenda Overview of RRAM Technology RRAM for Embedded Memory Mass Storage Memory

More information

Resistive GP-SIMD Processing-In-Memory

Resistive GP-SIMD Processing-In-Memory Page 1 of 22 Transactions on Architecture and Code Optimization Resistive GP-SIMD Processing-In-Memory AMIR MORAD, Technion LEONID YAVITS, Technion SHAHAR KVATINSKY, Technion RAN GINOSAR, Technion GP-SIMD,

More information

From Silicon to Solutions: Getting the Right Memory Mix for the Application

From Silicon to Solutions: Getting the Right Memory Mix for the Application From Silicon to Solutions: Getting the Right Memory Mix for the Application Ed Doller Numonyx CTO Flash Memory Summit 2008 Legal Notices and Important Information Regarding this Presentation Numonyx may

More information

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal

Computer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs

More information

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy

Original PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy Competitors using generic parts Performance benefits to be had for custom design Original PlayStation: no vector processing or floating point support Geometry issues Photorealism at the core of design

More information

Implementing Ultra Low Latency Data Center Services with Programmable Logic

Implementing Ultra Low Latency Data Center Services with Programmable Logic Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://algo-logic.com Solutions@Algo-Logic.com (408) 707-3740 2255-D Martin Ave.,

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Overview. Memory Classification Read-Only Memory (ROM) Random Access Memory (RAM) Functional Behavior of RAM. Implementing Static RAM

Overview. Memory Classification Read-Only Memory (ROM) Random Access Memory (RAM) Functional Behavior of RAM. Implementing Static RAM Memories Overview Memory Classification Read-Only Memory (ROM) Types of ROM PROM, EPROM, E 2 PROM Flash ROMs (Compact Flash, Secure Digital, Memory Stick) Random Access Memory (RAM) Types of RAM Static

More information

SAE5C Computer Organization and Architecture. Unit : I - V

SAE5C Computer Organization and Architecture. Unit : I - V SAE5C Computer Organization and Architecture Unit : I - V UNIT-I Evolution of Pentium and Power PC Evolution of Computer Components functions Interconnection Bus Basics of PCI Memory:Characteristics,Hierarchy

More information

Structure of Computer Systems

Structure of Computer Systems 288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram

More information