IN-MEMORY ASSOCIATIVE COMPUTING
|
|
- Pierce Oliver
- 6 years ago
- Views:
Transcription
1 IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY
2 AGENDA The AI computational challenge Introduction to associative computing Examples An NLP use case What s next?
3 THE CHALLENGE IN AI COMPUTING AI Requirement 32 bit FP Multi precision mining, etc. Scaling Sort-search speech, classify image/video Neural network learning Neural network inference, data Data center Top-K, recommendation, Heavy computation Non linearity, Softmax, exponent, normalize Bandwidth Use Case Example Required for speed and power
4 CURRENT SOLUTION Question CPU Answer General Purpose GPU Very Wide Bus DRAM Tens of Cores Bottleneck when register file data needs to be replaced on a regular basis Limits performance Increases power consumption Thousands of Cores Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that requires functions like Top-K and Softmax.
5 GPU VS CPU VS FPGA
6 GPU VS CPU VS FPGA VS APU
7 GSI S SOLUTION APU ASSOCIATIVE PROCESSING UNIT Question Millions Processors Simple CPU Simple & Narrow Bus Answer APU Associative Memory Computes in-place directly in the memory array removes the I/O bottleneck Significantly increases performances Reduces power
8 IN-MEMORY COMPUTING CONCEPT
9 THE COMPUTING MODEL FOR THE PAST 8 YEARS Read Write Address Decoder Read Write ALU Sense Amp /IO Drivers
10 THE CHANGE IN-MEMORY COMPUTING NOR NOR Read Read Read Write Write Read Simple Controller Patented in-memory logic using only Read/Write operations Any logic/arithmetic function can be generated internally
11 CAM/ ASSOCIATIVE SEARCH Records in the combines key goes to the read enable Values Duplicate vales with inverse data =match RE RE RE RE KEY: Search Duplicate the key with inverse. Move the original key next to the inverse data
12 TCAM SEARCH WITH STANDARD MEMORY CELLS Don t care Don t care Don t care 2
13 TCAM SEARCH WITH STANDARD MEMORY CELLS in the combines key goes to the read enable Insert zero instead of don t-care Duplicate data. Inverse only to those which are not don t care = match =match RE RE RE RE KEY: Search Duplicate the key with inverse. Move The original Key next to the inverse data 3
14 COMPUTING IN THE BIT LINES Vector A Vector B C=f(A,B) Each bit line becomes a processor and storage millions of bit lines = millions of processors
15 NEIGHBORHOOD COMPUTING Shift vector C=f(A,SL(B,)) Parallel shift of bit cycle sections Enables neighborhood operations such as convolutions
16 SEARCH & COUNT Search Count = 3 Search (binary or ternary) all bit lines in cycle 28 M bit lines => 28 Peta search/sec Key applications for search and count for predictive analytics: Recommender systems K-nearest neighbors (using cosine similarity search) Random forest Image histogram Regular expression
17 DATABASE SEARCH AND UPDATE Content-based search, record can be placed anywhere Update, modify, insert, delete is immediate Exact Match CAM/TCAM Similarity Match In-Place Aggregate
18 TRADITIONAL STORAGE CAN DO MUCH MORE Standard memory cell Standard memory cell bit 2 bits 2 input NOR, TCAM cell 3 Bits 3 input NOR, 2 Input NOR + Output 4 State CAM Standard memory cell Standard memory cell
19 CPU/GPGPU VS APU
20 ARCHITECTURE
21 SECTION COMPUTING TO IMPROVE PERFORMANCE MLB section 24 rows Memory control Connecting Mux MLB section Connecting mux... Instr. Buffer 2
22 COMMUNICATION BETWEEN SECTIONS Shift between sections enable neighborhood operations (filters, CNN etc.) Store, compute, search and move data anywhere 22
23 APU CHIP LAYOUT 2M bit processors or 28K vector processors runs at G Hz with up to 2 Peta OPS peak performance
24 EVALUATION BOARD PERFORMANCE Precision : Unlimited : from bit to 6 bits or more. 6.4 TOPS (FP) 8 Peta OPS for one bit computing or 6 bit exact search Similarity Search, Top-k, min, max, Softmax, O() complexity in μs, any size of K compared to ms with current solutions In-memory IO 2 Petabit/sec > X GPGPU/CPU/FPGA Sparse matrix multiplication > X GPGPU/CPU/FPGA
25 APU SERVER 64 APU chips, GByte DDR, From TFLOPS Up to 28 Peta OPS with peak performance 28TOPS/W O() Top-K, min, max, 32 Peta bits/sec internal IO < K Watts > X GPGPs on average Linearly scalable Currently 28nm process and scalable to 7nm or less Well suited to advanced memory technology such as non volatile ReRAM and more
26 EXAMPLE APPLICATIONS
27 K-NEAREST NEIGHBORS (K-NN) Simple example: N = 36, 3 groups, 2 dimensions (D = 2 ) for X and Y K = 4 Group Green selected as the majority For actual applications: N = Billions, D = Tens, K = Tens of thousands
28 K-NN USE CASE IN AN APU Item Item2 Item N Item N Item 3 Item Item 2 Features of item Features of item 2 Features of item N Item features and label storage Q Compute cosine distances for all N in parallel ( μs, assuming D=5 features) Computing Distribute data Area 2 ns (to all) K Mins at O() complexity ( 3μs) In-Place ranking Majority Calculation With the data base in an APU, computation for all N items done in.5 ms, independent of K
29 LARGE DATABASE EXAMPLE USING APU SERVERS Number of items: billions Features per item: tens to hundreds Latency: msec Throughput: Scales to M similarity searches/sec k-nn: Top, nearest neighbors
30 EXAMPLE K-NN FOR RECOGNITION Image Convolution Layer Feature Extractor (Neural Network) K-NN Classifier (Associative Memory) Text BOW, Word Embedding
31 K-MINS: O() ALGORITHM MSB LSB KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } C C C 2 N
32 K-MINS: THE ALGORITHM V N V N M D C[] KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } cnt=
33 K-MINS: THE ALGORITHM V N V N M D C[] KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } cnt=8
34 K-MINS: THE ALGORITHM KMINS(int K, vector C){ M :=, V := ; FOR b = msb to b = lsb: D := not(c[b]); N := M & D; cnt = COUNT(N V) IF cnt > K: M := N; ELIF cnt < K: V := N V; ELSE: // cnt == K V := N V; EXIT; ENDIF ENDFOR } final output V C O() Complexity
35 DENSE (XN) VECTOR BY SPARSE NXM MATRIX Input Vector Sparse Matrix Output Vector APU Representations and Computing Column Row Search all columns for row = 2 :Distribute -2 : 2 Cy Search all columns for row = 3 :Distribute 3 : 2 Cy Search all columns for row = 4 :Distribute - : 2 Cy Multiply in Parallel : Cy Shift and Add all belonging to same column Complexity including IO : O (N +logβ) where β is the number of nonzero elements in the sparse matrix N << M in general for recommender systems
36 SPARSE MATRIX MULTIPLICATION PERFORMANCE ANALYSIS G3 circuit matrix.5m X.5M sparse matrix Roughly 8M nonzero elements 2 GFLOPS with GPGPU solution APU solution provides 64 TFLOPS using the same amount of power as the GPGPU solution above > 5x improvement with APU solution
37 ASSOCIATIVE MEMORY FOR NATURAL LANGUAGE PROCESSING (NLP) Q&A, dialog, language translation, speech recognition etc. Requires learning things from the past needs memory More memory, more accuracy i.e. Dan put the book in his car,.. Long story here. Mike took Dan s car Long story here. He drove to SF Q : Where is the book now? A: Car, SF
38 END-TO-END MEMORY NETWORKS End-To-End Memory Networks, (Weston et. al., NIPS 25). (a): Single hope, (b) 3 hops
39 Q&A : END TO END NETWORK
40 REQUIREMENTS FOR AUGMENTED MEMORY Vertical Embedding Multiplication Input features selectedto next coloumn, Columns to any and other horizontal location or sumlocation based on content Compute softmax to selected Output Cosine Similarity Search + Top-K
41 APU MEMORY FOR NLP 2 N- Control I O m i M T i Broadcast input I to selected columns Compute any function at selected columns Generate output Generate tags for selection i m i i m i
42 GSI SOLUTION FOR END TO END Constant time of 3 µsec per iteration, any memory size.
43 PROGRAMING MODEL
44 PROGRAMMING MODEL Application (C++, Python) HOST Framework (TensorFlow ) Graph Execution and Tasks Scheduling APU (Associative Processing Unit Hardware) Device
45 A TF EXAMPLE: MATMUL a = tf.placeholder(tf.int32, shape=[3,4]) b = tf.placeholder(tf.int32, shape=[4,6]) c = tf.matmul(a, b) # shape = [3,6] matmul c a b
46 A TF EXAMPLE: MATMUL GRAPH PREPARATION a = tf.placeholder(tf.int32, shape=[3, 4]) b = tf.placeholder(tf.int32, shape=[4, 6]) c = tf.matmul(a, b) APU DEVICE SPACE host apu device (tf+eigen) gnl_create_array(a) gnl_create_array(b) gnl_create_array(c) L4 a b c apuc s L MMB
47 A TF EXAMPLE: MATMUL GVML_SET, GVML_MUL with tf.session() as sess: result = sess.run(c, feed_dict= {a: [[ ] b: [[ ] [ -3 4] [ ] [ ]], [ ] [ ]]}) APU DEVICE SPACE gnlpd_dma_6b_start(gnlpd_sys_2_vmr, ) L a b c dma keeps loading data to apuc dma copy L4 to L gvml_set_6( ) gnlpd_mat_mul(c,a,b) gvml_set_6( ) gvml_mul_s6( ) apuc controller c while matmul is being computed in apuc X gvml_mul_s6( ) L MMB
48 TENSORFLOW ENHANCEMENT: FUSED OPERATIONS a = tf.placeholder(tf.int32, shape=[3,4]) b = tf.placeholder(tf.int32, shape=[4,6]) c = tf.matmul(a, b) # shape = [3,6] d = tf.nn.top_k(c, k=2) # shape = [3,2] fused The two operations are computed inside the apuc top_k Data stays in L d No IO operations between them Saves valuable data transfer time and power fused(matmul,top_k) matmul c a b
49 A TF EXAMPLE: MATMUL CODE EXAMPLE APU DEVICE c L MMB APL_FRAG add_u6(rn_reg x, RN_REG y, RN_REG t_xory, RN_REG t_ci) { SM_XFFFF: RL = SB[x]; // RL[-5] = x SM_XFFFF: RL = SB[y]; // RL[-5] = x y { SM_XFFFF: SB[t_xory] = RL; // t_xory[-5] = x y SM_XFFFF: RL = SB[x, y]; // RL[-5] = x&y } // Add init state: // : RL = co[] //..5: RL = x&y { (SM_X << ): SB[t_ci] = NRL; // t_ci[5,9,3] = x&y (SM_X << ): RL = SB[t_xory] & NRL; // RL[] = Cout[] = x&y ci(x y) // 5,9,3: RL = Cout[5,9,3] = x&y ci(x y) } { } { } { }... (SM_X << 4): RL = SB[t_xory]; (SM_X << 2): SB[t_ci] = NRL; // Propagate Cin (SM_X << 2): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 5): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 3): SB[t_ci] = NRL; // Propagate Cin (SM_X << 3): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 6): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 5): GL = RL; // RL[4,8,2] = Cout[4,8,2] = x&y &(x y) (SM_X << 4): SB[t_ci] = NRL; // t_ci[8,2,6] = Cout[7,, 5] SM_X: SB[t_ci] = GL; // t_ci[8,2,6] = Cout[7,, 5] (SM_X << 7): RL = SB[t_xory] & NRL; // Propagate Cout (SM_X << 5): GL = RL;
50 FUTURE APPROACH NON VOLATILE CONCEPT
51 SOLUTIONS FOR FUTURE DATA CENTERS CPU Register File L/L2/L3 DRAM ASSOCIATIVE High endurance Full computing (floating points etc.) requires read & write : Low endurance Data search engines (read most of the time) Standard SRAM Based STT-RAM RAM Based PC-RAM Based ReRam Based Flash HDD Volatile Non Volatile Mid endurance Machine learning, malware detection detection etc., : Much more read and much less write
52 THANK YOU
In-Place Associative Computing
In-Place Associative Computing All Images are Public in the Web Avidan Akerib Ph.D. Vice President Associative Computing BU aakerib@gsitechnology.com Agenda Introduction to associative computing Use case
More informationNeural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks
Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks Charles Eckert Xiaowei Wang Jingcheng Wang Arun Subramaniyan Ravi Iyer Dennis Sylvester David Blaauw Reetuparna Das M-Bits Research
More informationRecurrent Neural Networks. Deep neural networks have enabled major advances in machine learning and AI. Convolutional Neural Networks
Deep neural networks have enabled major advances in machine learning and AI Computer vision Language translation Speech recognition Question answering And more Problem: DNNs are challenging to serve and
More informationMohsen Imani. University of California San Diego. System Energy Efficiency Lab seelab.ucsd.edu
Mohsen Imani University of California San Diego Winter 2016 Technology Trend for IoT http://www.flashmemorysummit.com/english/collaterals/proceedi ngs/2014/20140807_304c_hill.pdf 2 Motivation IoT significantly
More informationMemory technology and optimizations ( 2.3) Main Memory
Memory technology and optimizations ( 2.3) 47 Main Memory Performance of Main Memory: Latency: affects Cache Miss Penalty» Access Time: time between request and word arrival» Cycle Time: minimum time between
More informationIn-Place Associative Computing:
In-Place Associative Computing: 1 Page Abstract... 3 Overview... 3 Associative Processing Unit (APU) Card... 3 Host-Device interface... 4 The APU Card Controller... 4 Host to Device Interactions... 5 APU
More informationRevolutionizing the Datacenter
Power-Efficient Machine Learning using FPGAs on POWER Systems Ralph Wittig, Distinguished Engineer Office of the CTO, Xilinx Revolutionizing the Datacenter Join the Conversation #OpenPOWERSummit Top-5
More informationDeep Learning Accelerators
Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationExploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019
Exploring the Structure of Data at Scale Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019 Outline Why exploration of large datasets matters Challenges in working with large data
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationInference Optimization Using TensorRT with Use Cases. Jack Han / 한재근 Solutions Architect NVIDIA
Inference Optimization Using TensorRT with Use Cases Jack Han / 한재근 Solutions Architect NVIDIA Search Image NLP Maps TensorRT 4 Adoption Use Cases Speech Video AI Inference is exploding 1 Billion Videos
More informationAccelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs
Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs Ritchie Zhao 1, Weinan Song 2, Wentao Zhang 2, Tianwei Xing 3, Jeng-Hau Lin 4, Mani Srivastava 3, Rajesh Gupta 4, Zhiru
More informationCMOS Logic Circuit Design Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計
CMOS Logic Circuit Design http://www.rcns.hiroshima-u.ac.jp Link( リンク ): センター教官講義ノートの下 CMOS 論理回路設計 Memory Circuits (Part 1) Overview of Memory Types Memory with Address-Based Access Principle of Data Access
More informationMulti-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture
The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationBasic Organization Memory Cell Operation. CSCI 4717 Computer Architecture. ROM Uses. Random Access Memory. Semiconductor Memory Types
CSCI 4717/5717 Computer Architecture Topic: Internal Memory Details Reading: Stallings, Sections 5.1 & 5.3 Basic Organization Memory Cell Operation Represent two stable/semi-stable states representing
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationPROGRAMMABLE MODULES SPECIFICATION OF PROGRAMMABLE COMBINATIONAL AND SEQUENTIAL MODULES
PROGRAMMABLE MODULES SPECIFICATION OF PROGRAMMABLE COMBINATIONAL AND SEQUENTIAL MODULES. psa. rom. fpga THE WAY THE MODULES ARE PROGRAMMED NETWORKS OF PROGRAMMABLE MODULES EXAMPLES OF USES Programmable
More informationMemory. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University
Memory Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University Big Picture: Building a Processor memory inst register file alu PC +4 +4 new pc offset target imm control extend =? cmp
More informationUNIT - V MEMORY P.VIDYA SAGAR ( ASSOCIATE PROFESSOR) Department of Electronics and Communication Engineering, VBIT
UNIT - V MEMORY P.VIDYA SAGAR ( ASSOCIATE PROFESSOR) contents Memory: Introduction, Random-Access memory, Memory decoding, ROM, Programmable Logic Array, Programmable Array Logic, Sequential programmable
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 22, 2016 Course Information Website: http://www.stat.ucdavis.edu/~chohsieh/teaching/ ECS289G_Fall2016/main.html My office: Mathematical Sciences
More informationPUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES
PUSHING THE LIMITS, A PERSPECTIVE ON ROUTER ARCHITECTURE CHALLENGES Greg Hankins APRICOT 2012 2012 Brocade Communications Systems, Inc. 2012/02/28 Lookup Capacity and Forwarding
More informationHigh Performance Computing Lecture 26. Matthew Jacob Indian Institute of Science
High Performance Computing Lecture 26 Matthew Jacob Indian Institute of Science Agenda 1. Program execution: Compilation, Object files, Function call and return, Address space, Data & its representation
More informationThroughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks
Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks Naveen Suda, Vikas Chandra *, Ganesh Dasika *, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, Yu
More informationLatches. IT 3123 Hardware and Software Concepts. Registers. The Little Man has Registers. Data Registers. Program Counter
IT 3123 Hardware and Software Concepts Notice: This session is being recorded. CPU and Memory June 11 Copyright 2005 by Bob Brown Latches Can store one bit of data Can be ganged together to store more
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationCSCI-UA.0201 Computer Systems Organization Memory Hierarchy
CSCI-UA.0201 Computer Systems Organization Memory Hierarchy Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Programmer s Wish List Memory Private Infinitely large Infinitely fast Non-volatile
More informationCaches. Hiding Memory Access Times
Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY
More informationNeuroMem. A Neuromorphic Memory patented architecture. NeuroMem 1
NeuroMem A Neuromorphic Memory patented architecture NeuroMem 1 Unique simple architecture NM bus A chain of identical neurons, no supervisor 1 neuron = memory + logic gates Context Category ted during
More informationDRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric
DRAF: A Low-Power DRAM-based Reconfigurable Acceleration Fabric Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna Malladi, Hongzhong Zheng, Bob Brennan, Christos Kozyrakis ISCA June 22, 2016 FPGA-Based
More informationMemory & Logic Array. Lecture # 23 & 24 By : Ali Mustafa
Memory & Logic Array Lecture # 23 & 24 By : Ali Mustafa Memory Memory unit is a device to which a binary information is transferred for storage. From which information is retrieved when needed. Types of
More informationXPU A Programmable FPGA Accelerator for Diverse Workloads
XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationLecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols
Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, multi-thread programming models, snooping-based protocols 1 Modern Memory System...... PROC.. 4 DDR3 channels 64-bit data channels
More informationECE 485/585 Microprocessor System Design
Microprocessor System Design Lecture 4: Memory Hierarchy Memory Taxonomy SRAM Basics Memory Organization DRAM Basics Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering
More information,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics
,e-pg PATHSHALA- Computer Science Computer Architecture Module 25 Memory Hierarchy Design - Basics The objectives of this module are to discuss about the need for a hierarchical memory system and also
More informationECEN 449 Microprocessor System Design. Memories. Texas A&M University
ECEN 449 Microprocessor System Design Memories 1 Objectives of this Lecture Unit Learn about different types of memories SRAM/DRAM/CAM Flash 2 SRAM Static Random Access Memory 3 SRAM Static Random Access
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCOMPUTER ARCHITECTURES
COMPUTER ARCHITECTURES Random Access Memory Technologies Gábor Horváth BUTE Department of Networked Systems and Services ghorvath@hit.bme.hu Budapest, 2019. 02. 24. Department of Networked Systems and
More informationComputer Architecture Memory hierarchies and caches
Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches
More informationGPU Computation Strategies & Tricks. Ian Buck NVIDIA
GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit
More informationChapter Two - SRAM 1. Introduction to Memories. Static Random Access Memory (SRAM)
1 3 Introduction to Memories The most basic classification of a memory device is whether it is Volatile or Non-Volatile (NVM s). These terms refer to whether or not a memory device loses its contents when
More informationWindowing System on a 3D Pipeline. February 2005
Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April
More informationNVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory
NVMFS: A New File System Designed Specifically to Take Advantage of Nonvolatile Memory Dhananjoy Das, Sr. Systems Architect SanDisk Corp. 1 Agenda: Applications are KING! Storage landscape (Flash / NVM)
More informationNovel Hardware Architecture for Fast Address Lookups
Novel Hardware Architecture for Fast Address Lookups Pronita Mehrotra Paul D. Franzon Department of Electrical and Computer Engineering North Carolina State University {pmehrot,paulf}@eos.ncsu.edu This
More informationMemory Hierarchy and Caches
Memory Hierarchy and Caches COE 301 / ICS 233 Computer Organization Dr. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline
More informationChapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>
Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building
More informationDeep Learning Requirements for Autonomous Vehicles
Deep Learning Requirements for Autonomous Vehicles Pierre Paulin, Director of R&D Synopsys Inc. Chipex, 1 May 2018 1 Agenda Deep Learning and Convolutional Neural Networks for Embedded Vision Automotive
More informationArchitectural Support for Large-Scale Visual Search. Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin
Architectural Support for Large-Scale Visual Search Carlo C. del Mundo Vincent Lee Armin Alaghi Luis Ceze Mark Oskin Motivation: Visual Data & Their Applications Rebooting the IT Revolution, SIA, September
More informationCaches and Memory Hierarchy: Review. UCSB CS240A, Fall 2017
Caches and Memory Hierarchy: Review UCSB CS24A, Fall 27 Motivation Most applications in a single processor runs at only - 2% of the processor peak Most of the single processor performance loss is in the
More informationEGCP 1010 Digital Logic Design (DLD) Laboratory #6
EGCP 11 Digital Logic Design (DLD) Laboratory #6 Four by Four (4 x 4) Sorting Stack Prepared By: Alex Laird on October 1st, 2 Lab Partner: Ryan Morehart Objective: The goal of this laboratory is to expose
More informationToward a Memory-centric Architecture
Toward a Memory-centric Architecture Martin Fink EVP & Chief Technology Officer Western Digital Corporation August 8, 2017 1 SAFE HARBOR DISCLAIMERS Forward-Looking Statements This presentation contains
More informationScalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA
Scalable and Modularized RTL Compilation of Convolutional Neural Networks onto FPGA Yufei Ma, Naveen Suda, Yu Cao, Jae-sun Seo, Sarma Vrudhula School of Electrical, Computer and Energy Engineering School
More informationExploiting the OpenPOWER Platform for Big Data Analytics and Cognitive. Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center
Exploiting the OpenPOWER Platform for Big Data Analytics and Cognitive Rajesh Bordawekar and Ruchir Puri IBM T. J. Watson Research Center 3/17/2015 2014 IBM Corporation Outline IBM OpenPower Platform Accelerating
More informationCS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 3, 2015
CS 31: Intro to Systems Digital Logic Kevin Webb Swarthmore College February 3, 2015 Reading Quiz Today Hardware basics Machine memory models Digital signals Logic gates Circuits: Borrow some paper if
More informationCOSC 6385 Computer Architecture - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity
More informationMANAGING MULTI-TIERED NON-VOLATILE MEMORY SYSTEMS FOR COST AND PERFORMANCE 8/9/16
MANAGING MULTI-TIERED NON-VOLATILE MEMORY SYSTEMS FOR COST AND PERFORMANCE 8/9/16 THE DATA CHALLENGE Performance Improvement (RelaLve) 4.4 ZB Total data created, replicated, and consumed in a single year
More informationComputer Organization and Assembly Language (CS-506)
Computer Organization and Assembly Language (CS-506) Muhammad Zeeshan Haider Ali Lecturer ISP. Multan ali.zeeshan04@gmail.com https://zeeshanaliatisp.wordpress.com/ Lecture 2 Memory Organization and Structure
More informationFINN: A Framework for Fast, Scalable Binarized Neural Network Inference
FINN: A Framework for Fast, Scalable Binarized Neural Network Inference Yaman Umuroglu (XIR & NTNU), Nick Fraser (XIR & USydney), Giulio Gambardella (XIR), Michaela Blott (XIR), Philip Leong (USydney),
More informationLecture: Memory, Coherence Protocols. Topics: wrap-up of memory systems, intro to multi-thread programming models
Lecture: Memory, Coherence Protocols Topics: wrap-up of memory systems, intro to multi-thread programming models 1 Refresh Every DRAM cell must be refreshed within a 64 ms window A row read/write automatically
More informationCS 31: Intro to Systems Digital Logic. Kevin Webb Swarthmore College February 2, 2016
CS 31: Intro to Systems Digital Logic Kevin Webb Swarthmore College February 2, 2016 Reading Quiz Today Hardware basics Machine memory models Digital signals Logic gates Circuits: Borrow some paper if
More informationECEN 449 Microprocessor System Design. Memories
ECEN 449 Microprocessor System Design Memories 1 Objectives of this Lecture Unit Learn about different types of memories SRAM/DRAM/CAM /C Flash 2 1 SRAM Static Random Access Memory 3 SRAM Static Random
More informationDevice Placement Optimization with Reinforcement Learning
Device Placement Optimization with Reinforcement Learning Azalia Mirhoseini, Anna Goldie, Hieu Pham, Benoit Steiner, Mohammad Norouzi, Naveen Kumar, Rasmus Larsen, Yuefeng Zhou, Quoc Le, Samy Bengio, and
More informationThe Memory Hierarchy Part I
Chapter 6 The Memory Hierarchy Part I The slides of Part I are taken in large part from V. Heuring & H. Jordan, Computer Systems esign and Architecture 1997. 1 Outline: Memory components: RAM memory cells
More informationMachine Learning on VMware vsphere with NVIDIA GPUs
Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology
More informationVery Large Scale Integration (VLSI)
Very Large Scale Integration (VLSI) Lecture 6 Dr. Ahmed H. Madian Ah_madian@hotmail.com Dr. Ahmed H. Madian-VLSI 1 Contents FPGA Technology Programmable logic Cell (PLC) Mux-based cells Look up table PLA
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationScaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research
Scaling Convolutional Neural Networks on Reconfigurable Logic Michaela Blott, Principal Engineer, Xilinx Research Nick Fraser (Xilinx & USydney) Yaman Umuroglu (Xilinx & NTNU) Giulio Gambardella (Xilinx)
More informationDRAM Main Memory. Dual Inline Memory Module (DIMM)
DRAM Main Memory Dual Inline Memory Module (DIMM) Memory Technology Main memory serves as input and output to I/O interfaces and the processor. DRAMs for main memory, SRAM for caches Metrics: Latency,
More informationIndex. Springer Nature Switzerland AG 2019 B. Moons et al., Embedded Deep Learning,
Index A Algorithmic noise tolerance (ANT), 93 94 Application specific instruction set processors (ASIPs), 115 116 Approximate computing application level, 95 circuits-levels, 93 94 DAS and DVAS, 107 110
More informationBandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design
Bandwidth-Centric Deep Learning Processing through Software-Hardware Co-Design Song Yao 姚颂 Founder & CEO DeePhi Tech 深鉴科技 song.yao@deephi.tech Outline - About DeePhi Tech - Background - Bandwidth Matters
More informationCS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS
CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationTraining Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL
Training Neural Networks with Mixed Precision MICHAEL CARILLI CHRISTIAN SAROFEEN MICHAEL RUBERRY BEN BARSDELL 1 THIS TALK Using mixed precision and Volta your networks can be: 1. 2-4x faster 2. half the
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationMassively Parallel Computing on Silicon: SIMD Implementations. V.M.. Brea Univ. of Santiago de Compostela Spain
Massively Parallel Computing on Silicon: SIMD Implementations V.M.. Brea Univ. of Santiago de Compostela Spain GOAL Give an overview on the state-of of-the- art of Digital on-chip CMOS SIMD Solutions,
More informationDeep Learning with Tensorflow AlexNet
Machine Learning and Computer Vision Group Deep Learning with Tensorflow http://cvml.ist.ac.at/courses/dlwt_w17/ AlexNet Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton, "Imagenet classification
More informationRead Only Memory ROM
Read Only Memory ROM A read only memory have address inputs and data outputs With m address lines you can access the 2 m different memory addresses At each address, there is one data word with n bits Usually,
More informationThe Impact of Persistent Memory and Intelligent Data Encoding
The Impact of Persistent Memory and Intelligent Data Encoding Or, How to Succeed with NVDIMMs Without Really Trying Rob Peglar SVP/CTO, Symbolic IO rpeglar@symbolicio.com @peglarr Wisdom Lower R/W Latency
More informationLecture 8: Virtual Memory. Today: DRAM innovations, virtual memory (Sections )
Lecture 8: Virtual Memory Today: DRAM innovations, virtual memory (Sections 5.3-5.4) 1 DRAM Technology Trends Improvements in technology (smaller devices) DRAM capacities double every two years, but latency
More informationNatural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu
Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward
More informationENGIN 112 Intro to Electrical and Computer Engineering
ENGIN 112 Intro to Electrical and Computer Engineering Lecture 30 Random Access Memory (RAM) Overview Memory is a collection of storage cells with associated input and output circuitry Possible to read
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationCREATED BY M BILAL & Arslan Ahmad Shaad Visit:
CREATED BY M BILAL & Arslan Ahmad Shaad Visit: www.techo786.wordpress.com Q1: Define microprocessor? Short Questions Chapter No 01 Fundamental Concepts Microprocessor is a program-controlled and semiconductor
More informationBrainchip OCTOBER
Brainchip OCTOBER 2017 1 Agenda Neuromorphic computing background Akida Neuromorphic System-on-Chip (NSoC) Brainchip OCTOBER 2017 2 Neuromorphic Computing Background Brainchip OCTOBER 2017 3 A Brief History
More informationMemory. Lecture 22 CS301
Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch
More informationESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA
ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong
More informationCache/Memory Optimization. - Krishna Parthaje
Cache/Memory Optimization - Krishna Parthaje Hybrid Cache Architecture Replacing SRAM Cache with Future Memory Technology Suji Lee, Jongpil Jung, and Chong-Min Kyung Department of Electrical Engineering,KAIST
More informationCloud Computing with FPGA-based NVMe SSDs
Cloud Computing with FPGA-based NVMe SSDs Bharadwaj Pudipeddi, CTO NVXL Santa Clara, CA 1 Choice of NVMe Controllers ASIC NVMe: Fully off-loaded, consistent performance, M.2 or U.2 form factor ASIC OpenChannel:
More informationVersatile RRAM Technology and Applications
Versatile RRAM Technology and Applications Hagop Nazarian Co-Founder and VP of Engineering, Crossbar Inc. Santa Clara, CA 1 Agenda Overview of RRAM Technology RRAM for Embedded Memory Mass Storage Memory
More informationResistive GP-SIMD Processing-In-Memory
Page 1 of 22 Transactions on Architecture and Code Optimization Resistive GP-SIMD Processing-In-Memory AMIR MORAD, Technion LEONID YAVITS, Technion SHAHAR KVATINSKY, Technion RAN GINOSAR, Technion GP-SIMD,
More informationFrom Silicon to Solutions: Getting the Right Memory Mix for the Application
From Silicon to Solutions: Getting the Right Memory Mix for the Application Ed Doller Numonyx CTO Flash Memory Summit 2008 Legal Notices and Important Information Regarding this Presentation Numonyx may
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationOriginal PlayStation: no vector processing or floating point support. Photorealism at the core of design strategy
Competitors using generic parts Performance benefits to be had for custom design Original PlayStation: no vector processing or floating point support Geometry issues Photorealism at the core of design
More informationImplementing Ultra Low Latency Data Center Services with Programmable Logic
Implementing Ultra Low Latency Data Center Services with Programmable Logic John W. Lockwood, CEO: Algo-Logic Systems, Inc. http://algo-logic.com Solutions@Algo-Logic.com (408) 707-3740 2255-D Martin Ave.,
More informationPerformance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals
Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of
More informationOverview. Memory Classification Read-Only Memory (ROM) Random Access Memory (RAM) Functional Behavior of RAM. Implementing Static RAM
Memories Overview Memory Classification Read-Only Memory (ROM) Types of ROM PROM, EPROM, E 2 PROM Flash ROMs (Compact Flash, Secure Digital, Memory Stick) Random Access Memory (RAM) Types of RAM Static
More informationSAE5C Computer Organization and Architecture. Unit : I - V
SAE5C Computer Organization and Architecture Unit : I - V UNIT-I Evolution of Pentium and Power PC Evolution of Computer Components functions Interconnection Bus Basics of PCI Memory:Characteristics,Hierarchy
More informationStructure of Computer Systems
288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram
More information