IN-MEMORY ASSOCIATIVE COMPUTING


IN-MEMORY ASSOCIATIVE COMPUTING AVIDAN AKERIB, GSI TECHNOLOGY AAKERIB@GSITECHNOLOGY.COM

AGENDA: The AI computational challenge; introduction to associative computing; examples; an NLP use case; what's next?

THE CHALLENGE IN AI COMPUTING. AI requirements and example use cases: 32-bit FP and multi-precision arithmetic (neural network learning); heavy computation such as non-linearity, Softmax, exponent, and normalization (neural network inference); sort/search, Top-K, and recommendation (speech, image/video classification, data mining, etc.); scaling (the data center); bandwidth (required for speed and power).

CURRENT SOLUTION. A question goes to a CPU (tens of cores) or a general-purpose GPU (thousands of cores) connected to DRAM over a very wide bus, which returns the answer. This becomes a bottleneck whenever register-file data needs to be replaced on a regular basis: it limits performance and increases power consumption, and it does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that require functions like Top-K and Softmax.

GPU VS CPU VS FPGA

GPU VS CPU VS FPGA VS APU

GSI'S SOLUTION: THE APU (ASSOCIATIVE PROCESSING UNIT). A question goes from a simple CPU over a simple, narrow bus to the APU, an associative memory with millions of processors, which returns the answer. The APU computes in-place, directly in the memory array: it removes the I/O bottleneck, significantly increases performance, and reduces power.

IN-MEMORY COMPUTING CONCEPT

THE COMPUTING MODEL FOR THE PAST 8 YEARS Read Write Address Decoder Read Write ALU Sense Amp /IO Drivers

THE CHANGE: IN-MEMORY COMPUTING. A simple controller drives patented in-memory logic (NOR-based, per the diagram) using only read and write operations; any logic/arithmetic function can be generated internally.
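As a rough illustration of this read-as-logic idea, here is a toy host-side simulation (my own sketch, not GSI code; it assumes that a simultaneous read of several rows on one bit line senses the NOR of the selected cells, from which other gates can be composed):

# Toy model: a multi-row read on one bit line returns the NOR of the selected cells.
def bitline_read_nor(column, rows):
    return 0 if any(column[r] for r in rows) else 1

def bitline_and(column, rows):
    # AND composed from NOR by storing inverted operands (De Morgan).
    inverted = [1 - b for b in column]
    return bitline_read_nor(inverted, rows)

col = [1, 0]                              # two cells on one bit line
print(bitline_read_nor(col, [0, 1]))      # NOR(1, 0) -> 0
print(bitline_and(col, [0, 1]))           # AND(1, 0) -> 0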

CAM / ASSOCIATIVE SEARCH. Records are stored in the memory columns, and the combined key is driven onto the read-enable (RE) lines. Values are duplicated with their inverse data, and the read result flags the matching records. Search: duplicate the key with its inverse and move the original key next to the inverse data.

TCAM SEARCH WITH STANDARD MEMORY CELLS (diagram: a stored word containing don't-care bits).

TCAM SEARCH WITH STANDARD MEMORY CELLS. The combined key is driven onto the read-enable (RE) lines. Insert zero instead of a don't-care; duplicate the data and invert only the bits that are not don't-care; the read result flags the matching records. Search: duplicate the key with its inverse and move the original key next to the inverse data.
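A host-side sketch of this ternary-match trick (an illustrative assumption of the semantics, not GSI code): each stored bit is kept together with its inverse, a don't-care stores zero in both cells, and a record matches when the key never hits a conflicting cell.

def encode_ternary(pattern):
    # pattern: string of '0', '1', 'x' -> list of (bit, inverse) cell pairs
    enc = []
    for ch in pattern:
        if ch == '0':
            enc.append((0, 1))
        elif ch == '1':
            enc.append((1, 0))
        else:                      # don't care: zero in both cells
            enc.append((0, 0))
    return enc

def tcam_match(key_bits, encoded):
    # Mismatch if a key bit hits the stored inverse, or the inverted key bit hits the stored bit.
    for k, (d, d_inv) in zip(key_bits, encoded):
        if (k and d_inv) or ((1 - k) and d):
            return False
    return True

rec = encode_ternary("1x0")
print(tcam_match([1, 0, 0], rec))   # True  (middle bit is don't-care)
print(tcam_match([1, 1, 1], rec))   # False (last bit conflicts)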

COMPUTING IN THE BIT LINES. Vector A and Vector B are stored along the bit lines and C = f(A, B) is computed in place: each bit line becomes both a processor and storage, so millions of bit lines = millions of processors.
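A minimal host-side analogue of this column-parallel model (NumPy stands in for the array of bit lines; the particular f is just an example):

import numpy as np

num_bitlines = 1_000_000                 # each column acts as a processor plus its own storage
A = np.random.randint(0, 2**16, num_bitlines, dtype=np.uint32)
B = np.random.randint(0, 2**16, num_bitlines, dtype=np.uint32)

C = (A & B) + (A ^ B)                    # C = f(A, B), evaluated in every column at once
print(C[:4])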

NEIGHBORHOOD COMPUTING. Shift vector: C = f(A, SL(B)). A parallel shift of the bit-line sections in a cycle enables neighborhood operations such as convolutions.
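To make the shifted-operand form concrete, a small sketch (my assumed semantics: SL shifts the vector of columns by one position, so each column can combine its own data with its neighbor's):

import numpy as np

def SL(v, n=1):
    # shift the vector of columns left by n positions (zero fill)
    out = np.zeros_like(v)
    out[:-n] = v[n:]
    return out

A = np.arange(8)
B = np.arange(8) * 10
C = A + SL(B, 1)      # a 2-tap neighborhood operation, the building block of convolution
print(C)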

SEARCH & COUNT. Search (binary or ternary) all bit lines in a cycle: 28M bit lines => 28 Peta searches/sec. The slide works a small example, searching a handful of stored values and finding Count = 3 matches. Key applications of search and count for predictive analytics: recommender systems, K-nearest neighbors (using cosine similarity search), random forest, image histogram, regular expressions.
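A host-side analogue of search-and-count (a NumPy comparison stands in for the one-cycle parallel search across all bit lines):

import numpy as np

records = np.random.randint(0, 256, 2_000_000)   # one stored value per bit-line group
key = 42

matches = (records == key)    # associative search: every record compared in parallel
print(int(matches.sum()))     # count of matching records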

DATABASE SEARCH AND UPDATE. Content-based search: a record can be placed anywhere. Update, modify, insert, and delete are immediate. Supports exact match (CAM/TCAM), similarity match, and in-place aggregation.

TRADITIONAL STORAGE CAN DO MUCH MORE. Standard memory cells can be grouped to do more than store data: one cell holds a bit; two cells hold 2 bits or act as a 2-input NOR / TCAM cell; three cells hold 3 bits or act as a 3-input NOR, a 2-input NOR plus output, or a 4-state CAM.

CPU/GPGPU VS APU

ARCHITECTURE

SECTION COMPUTING TO IMPROVE PERFORMANCE. The memory array is organized as MLB sections of 24 rows each, with memory control, connecting muxes between sections, and an instruction buffer.

COMMUNICATION BETWEEN SECTIONS. Shifts between sections enable neighborhood operations (filters, CNNs, etc.): store, compute, search, and move data anywhere.

APU CHIP LAYOUT. 2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 Peta OPS peak performance.

EVALUATION BOARD PERFORMANCE. Precision: unlimited, from 1 bit to 16 bits or more. 6.4 TOPS (FP); 8 Peta OPS for one-bit computing or 16-bit exact search. Similarity search, Top-K, min, max, and Softmax at O(1) complexity, in microseconds for any size of K, compared to milliseconds with current solutions. In-memory IO of 2 Petabit/sec, many times that of GPGPU/CPU/FPGA. Sparse matrix multiplication many times faster than GPGPU/CPU/FPGA.

APU SERVER. 64 APU chips with 256-512 GByte of DDR. From TFLOPS up to 128 Peta OPS peak performance; 28 TOPS/W; O(1) Top-K, min, and max; 32 Peta bits/sec of internal IO; under a kilowatt; many times faster than GPGPUs on average; linearly scalable. Currently a 28nm process, scalable to 7nm or less, and well suited to advanced memory technology such as non-volatile ReRAM and more.

EXAMPLE APPLICATIONS

K-NEAREST NEIGHBORS (K-NN). Simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, X and Y), K = 4; group Green is selected as the majority. For actual applications: N = billions, D = tens, K = tens of thousands.

K-NN USE CASE IN AN APU. Item features and labels (item 1 through item N) are stored across the associative memory. Steps: distribute the query data to all items (about 2 ns to reach all of them); compute cosine distances for all N items in parallel (microseconds, assuming D=5 features); find the K minima at O(1) complexity (about 3 μs); rank in place; compute the majority. With the database in an APU, the computation for all N items is done in about 0.5 ms, independent of K.
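A compact sketch of that flow on the host side, with NumPy standing in for the in-memory parallel steps (cosine similarity to all items at once, then a Top-K selection); sizes and names here are illustrative only:

import numpy as np

def knn_cosine(query, items, k):
    # items: (N, D) feature matrix, query: (D,) vector -> indices of the k nearest items
    items_n = items / np.linalg.norm(items, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    sims = items_n @ q_n                      # all N similarities computed "in parallel"
    top = np.argpartition(-sims, k)[:k]       # Top-K selection (the APU's O(1) K-mins step)
    return top[np.argsort(-sims[top])]

items = np.random.randn(100_000, 50).astype(np.float32)
query = np.random.randn(50).astype(np.float32)
print(knn_cosine(query, items, k=4))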

LARGE DATABASE EXAMPLE USING APU SERVERS. Number of items: billions. Features per item: tens to hundreds. Latency: milliseconds. Throughput: scales to millions of similarity searches per second. k-NN: Top-K nearest neighbors.

EXAMPLE: K-NN FOR RECOGNITION. An image goes through the convolution layers of a neural-network feature extractor, and text through BOW / word embeddings; the resulting features feed a K-NN classifier running in associative memory.

K-MINS: AN O(1) ALGORITHM (scan the value bits from MSB to LSB):

KMINS(int K, vector C) {
    M := 1, V := 0;
    FOR b = msb to b = lsb:
        D := not(C[b]);
        N := M & D;
        cnt = COUNT(N | V)
        IF cnt > K:
            M := N;
        ELIF cnt < K:
            V := N | V;
        ELSE: // cnt == K
            V := N | V;
            EXIT;
        ENDIF
    ENDFOR
}

K-MINS: THE ALGORITHM. Walkthrough of the first iteration on the slide's example columns (V, N | V, N, M, D, C[b]), showing the resulting cnt.

K-MINS: THE ALGORITHM. A later iteration of the same walkthrough; here cnt = 8.

K-MINS: THE ALGORITHM. At termination the final output is the marker vector V: O(1) complexity.
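For reference, a plain-Python rendering of the K-mins procedure walked through above. This is a sketch, not GSI device code: it assumes unsigned fixed-width values, and in the cnt < K branch it also narrows the candidate set M to columns with a 1 in the current bit, a step the condensed slide pseudocode leaves implicit.

def k_mins(values, k, bits=16):
    # return 0/1 markers selecting (roughly) the k smallest entries of `values`
    n = len(values)
    m = [1] * n                                       # M: candidate markers (all ones)
    v = [0] * n                                       # V: confirmed winners (all zeros)
    for b in range(bits - 1, -1, -1):                 # MSB -> LSB
        c = [(x >> b) & 1 for x in values]            # C[b]
        d = [1 - cb for cb in c]                      # D := not(C[b])
        nmk = [mi & di for mi, di in zip(m, d)]       # N := M & D
        cnt = sum(ni | vi for ni, vi in zip(nmk, v))  # COUNT(N | V)
        if cnt > k:
            m = nmk                                   # too many: keep only the 0-bit candidates
        elif cnt < k:
            v = [ni | vi for ni, vi in zip(nmk, v)]   # all 0-bit candidates are winners
            m = [mi & ci for mi, ci in zip(m, c)]     # remaining candidates have a 1 here (my addition)
        else:
            return [ni | vi for ni, vi in zip(nmk, v)]  # exactly k found
    return [mi | vi for mi, vi in zip(m, v)]          # ties can leave slightly more than k marked

print(k_mins([5, 2, 7, 1, 9, 0], k=3, bits=4))        # markers select 2, 1 and 0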

DENSE (1xN) VECTOR BY SPARSE NxM MATRIX. The sparse matrix is stored in the APU as (column, row, value) entries. For each element i of the dense input vector: search all columns for row = i and distribute the input value to the matching entries (2 cycles); then multiply all pairs in parallel (1 cycle); finally, shift and add all products belonging to the same output column. Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix; N << M in general for recommender systems.
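A host-side sketch of that dataflow using a COO (row, column, value) store; a NumPy mask stands in for the associative "search all columns for row = i" step, and all names are illustrative:

import numpy as np

def dense_times_sparse(x, rows, cols, vals, m):
    # y = x (1xN) times a sparse NxM matrix stored as COO triples
    y = np.zeros(m)
    products = np.zeros(len(vals))
    for i, xi in enumerate(x):            # for each dense input element
        hit = (rows == i)                 # "search all columns for row = i"
        products[hit] = xi * vals[hit]    # "distribute" x[i] and multiply in parallel
    np.add.at(y, cols, products)          # "shift and add" products in the same output column
    return y

# tiny example: a 3x4 sparse matrix with 4 nonzeros
rows = np.array([0, 0, 1, 2])
cols = np.array([0, 3, 1, 2])
vals = np.array([3.0, 5.0, 9.0, 7.0])
x = np.array([4.0, -2.0, 3.0])
print(dense_times_sparse(x, rows, cols, vals, m=4))   # [ 12. -18.  21.  20.]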

SPARSE MATRIX MULTIPLICATION PERFORMANCE ANALYSIS. G3_circuit matrix: a 1.5M x 1.5M sparse matrix with roughly 8M nonzero elements. 2 GFLOPS with a GPGPU solution; the APU solution provides 64 TFLOPS using the same amount of power as the GPGPU solution above, a > 5x improvement with the APU solution.

ASSOCIATIVE MEMORY FOR NATURAL LANGUAGE PROCESSING (NLP). Q&A, dialog, language translation, speech recognition, etc. require learning things from the past, which needs memory; more memory, more accuracy. For example: "Dan put the book in his car" ... long story here ... "Mike took Dan's car" ... long story here ... "He drove to SF." Q: Where is the book now? A: In the car, in SF.

END-TO-END MEMORY NETWORKS. End-To-End Memory Networks (Weston et al., NIPS 2015). (a): single hop, (b): 3 hops.

Q&A : END TO END NETWORK

REQUIREMENTS FOR AUGMENTED MEMORY. Vertical embedding multiplication; input features selected and moved to the next column, or columns moved to any other horizontal location or summed, based on content; compute softmax over the selected columns; output via cosine similarity search + Top-K.
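Putting those primitives together, here is a minimal sketch of one memory "hop" of the kind an end-to-end memory network performs (cosine-similarity addressing, Top-K slot selection, softmax over the selected columns, and a weighted aggregate); this is my illustrative composition, not GSI's NLP pipeline:

import numpy as np

def memory_hop(query, keys, values, k=32):
    # one hop: cosine similarity search, Top-K selection, softmax, weighted aggregate
    keys_n = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    q_n = query / np.linalg.norm(query)
    scores = keys_n @ q_n                      # cosine similarity over all memory columns
    top = np.argpartition(-scores, k)[:k]      # Top-K most similar slots
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                               # softmax over the selected columns
    return w @ values[top]                     # aggregated output vector

keys = np.random.randn(1000, 64)
values = np.random.randn(1000, 64)
print(memory_hop(np.random.randn(64), keys, values, k=32).shape)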

APU MEMORY FOR NLP. Broadcast the input I to selected columns; compute any function at the selected columns; generate the output; generate tags for selection.

GSI SOLUTION FOR END-TO-END. Constant time of 3 µsec per iteration, for any memory size.

PROGRAMMING MODEL

PROGRAMMING MODEL. Host: the application (C++, Python), a framework (TensorFlow), and graph execution and task scheduling. Device: the APU (Associative Processing Unit hardware).

A TF EXAMPLE: MATMUL

a = tf.placeholder(tf.int32, shape=[3,4])
b = tf.placeholder(tf.int32, shape=[4,6])
c = tf.matmul(a, b)  # shape = [3,6]

(Graph: a and b feed a matmul node that produces c.)

A TF EXAMPLE: MATMUL, GRAPH PREPARATION

a = tf.placeholder(tf.int32, shape=[3, 4])
b = tf.placeholder(tf.int32, shape=[4, 6])
c = tf.matmul(a, b)

On the host, the APU device layer (tf + eigen) allocates the arrays in APU device space: gnl_create_array(a), gnl_create_array(b), gnl_create_array(c). The arrays a, b, c are placed in L4 memory; the apucs hold L1 and the MMB.

A TF EXAMPLE: MATMUL, GVML_SET AND GVML_MUL

with tf.Session() as sess:
    result = sess.run(c, feed_dict={a: [...], b: [...]})  # small example integer matrices on the slide

In APU device space, gnlpd_dma_6b_start(gnlpd_sys_2_vmr, ...) starts the DMA, which keeps loading data to the apuc and copies L4 to L1. The controller then runs gnlpd_mat_mul(c, a, b), issuing gvml_set_6( ) and gvml_mul_s6( ) calls while the matmul is being computed in the apuc (L1 / MMB).

TENSORFLOW ENHANCEMENT: FUSED OPERATIONS

a = tf.placeholder(tf.int32, shape=[3,4])
b = tf.placeholder(tf.int32, shape=[4,6])
c = tf.matmul(a, b)       # shape = [3,6]
d = tf.nn.top_k(c, k=2)   # shape = [3,2]

The two operations are fused, fused(matmul, top_k), and computed inside the apuc: data stays in L1 and there are no IO operations between them, which saves valuable data transfer time and power.

A TF EXAMPLE: MATMUL, CODE EXAMPLE (APU DEVICE)

APL_FRAG add_u6(RN_REG x, RN_REG y, RN_REG t_xory, RN_REG t_ci)
{
    SM_XFFFF: RL = SB[x];            // RL[-5] = x
    SM_XFFFF: RL = SB[y];            // RL[-5] = x y
    {
        SM_XFFFF: SB[t_xory] = RL;   // t_xory[-5] = x y
        SM_XFFFF: RL = SB[x, y];     // RL[-5] = x&y
    }
    // Add init state:
    //   : RL = co[]
    //   ..5: RL = x&y
    {
        (SM_X << ): SB[t_ci] = NRL;          // t_ci[5,9,3] = x&y
        (SM_X << ): RL = SB[t_xory] & NRL;   // RL[] = Cout[] = x&y ci(x y)
                                             // 5,9,3: RL = Cout[5,9,3] = x&y ci(x y)
    }
    { } { } { } ...
    (SM_X << 4): RL = SB[t_xory];
    (SM_X << 2): SB[t_ci] = NRL;             // Propagate Cin
    (SM_X << 2): RL = SB[t_xory] & NRL;      // Propagate Cout
    (SM_X << 5): RL = SB[t_xory] & NRL;      // Propagate Cout
    (SM_X << 3): SB[t_ci] = NRL;             // Propagate Cin
    (SM_X << 3): RL = SB[t_xory] & NRL;      // Propagate Cout
    (SM_X << 6): RL = SB[t_xory] & NRL;      // Propagate Cout
    (SM_X << 5): GL = RL;                    // RL[4,8,2] = Cout[4,8,2] = x&y &(x y)
    (SM_X << 4): SB[t_ci] = NRL;             // t_ci[8,2,6] = Cout[7,,5]
    SM_X: SB[t_ci] = GL;                     // t_ci[8,2,6] = Cout[7,,5]
    (SM_X << 7): RL = SB[t_xory] & NRL;      // Propagate Cout
    (SM_X << 5): GL = RL;
}

FUTURE APPROACH: NON-VOLATILE CONCEPT

SOLUTIONS FOR FUTURE DATA CENTERS. The memory hierarchy (CPU register file, L1/L2/L3, DRAM, Flash, HDD) spans volatile to non-volatile technologies, and the associative approach maps onto each tier: standard SRAM-based (high endurance; full computing, floating point, etc., which requires read and write); STT-RAM, PC-RAM, and ReRAM-based (mid endurance; machine learning, malware detection, etc., with much more read and much less write); Flash-based (low endurance; data search engines, reading most of the time).

THANK YOU