In-Place Associative Computing
1 In-Place Associative Computing. All images are public on the web. Avidan Akerib, Ph.D., Vice President, Associative Computing BU
2 Agenda
- Introduction to associative computing
- Use case examples: similarity search, large-scale attention computing, few-shot learning
- Software model
- Future approaches
3 The Challenge in AI Computing (matrix multiplication is not enough!). AI requirements and matching use-case examples:
- High-precision floating point: neural network learning
- Multi-precision: real-time inference, saving memory
- Linearly scalable: big data
- Sort/search: top-K, recommendation, speech, image/video classification
- Heavy computation: non-linearity, softmax, exponent, normalization
- Bandwidth/power tradeoff: high speed at low power
4 Von Neumann Architecture. Memory: high density (repeated cells), slower, leveraging Moore's Law. CPU: lower density (lots of logic), faster.
5 Von Neumann Architecture. Memory: high density (repeated cells), slower. CPU: lower density (lots of logic), faster. CPU frequency outpaced memory speed, so caches had to be added. Both continue to leverage Moore's Law.
6 Since 2006, clock speeds have flattened sharply. Source: Intel
7 Thinking parallel: 2 cores and more. However, memory utilization becomes an issue.
8 More and more memory is added to solve the utilization problem: local and global memory.
9 Memory is still growing rapidly, becoming a larger part of each chip.
10 The same concept holds even with GPGPUs: very high power, a large die, and high cost. What's next?
11 Most of the power goes to bandwidth. Source: Song Han, Stanford University
12 Changing the rules of the game! Standard memory cells are smarter than we thought!
13 APU: Associative Processing Unit. A simple CPU sends a question over a simple, narrow bus, and millions of processors in the APU return the answer. Associative processing computes in place, directly in the memory array: it removes the I/O bottleneck, significantly increases performance, and reduces power.
14 How computers work today: an address decoder drives read/write enables (RE/WE) on the memory array, and sense amps / IO drivers move the data to the ALU.
15 Accessing multiple rows simultaneously: raising several read-enable lines at once causes bus contention, but bus contention is not an error! The shared bit line simply computes a NOR/NAND of the selected cells, satisfying De Morgan's law.
16 Truth Table Example. D = !A!C + BC = !!(!A!C + BC) = !(!(!A!C) & !(BC)) = NAND(NAND(!A,!C), NAND(B,C)). Every minterm takes one clock, and all bit lines execute their Karnaugh maps in parallel: Read(!A,!C), WRITE T1; Read(B,C), WRITE T2; then a final clock combines them: Read(T1,T2), WRITE D.
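The NAND-of-NANDs evaluation above can be sketched in software. This is a minimal model, not vendor code: a Python integer stands in for one row of the array, with each bit representing a different bit line, so a single bitwise operation acts on all "bit lines" at once.

```python
# Sketch (not vendor code): emulate the slide's NAND-of-NANDs
# computation D = !A!C + BC across many bit lines at once, using a
# Python integer as a row whose bits are the per-bit-line values.
MASK = 0b1111          # 4 simulated bit lines

def nand(x, y):
    return ~(x & y) & MASK

A, B, C = 0b1100, 0b1010, 0b0110   # one value of A, B, C per bit line

T1 = nand(~A & MASK, ~C & MASK)    # clock 1: T1 = !(!A & !C)
T2 = nand(B, C)                    #          T2 = !(B & C)
D = nand(T1, T2)                   # clock 2: D = !(T1 & T2) = !A!C + BC

print(format(D, "04b"))  # 0011
```

The same two-clock pattern scales to any sum-of-products: one clock per minterm, plus a combining step, regardless of how many bit lines participate.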
17 Vector Add Example: C[] = A[] + B[], with vector A(8 bits, 32M elements), vector B(8 bits, 32M), and vector C(9 bits, 32M). Number of clocks = 4 * 8 = 32; clocks per byte = 32/32M = 1/1M; OPS = 1 GHz x 1M = 1 PetaOPS.
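The slide's throughput arithmetic can be checked back-of-envelope. The 1 GHz clock and the 4-clocks-per-full-adder-bit figure are assumptions reconstructed from the slide, not measured values:

```python
# Back-of-envelope check of the slide's throughput claim, assuming a
# 1 GHz clock and 4 clocks per full-adder bit (figures taken from the
# slide's reconstruction, not measured).
bit_lines = 32_000_000            # one 8-bit element per bit line
clocks = 4 * 8                    # 4 clocks per bit x 8 bits = 32 clocks total
clock_hz = 1_000_000_000          # 1 GHz
adds_per_sec = clock_hz * bit_lines // clocks   # all 32M adds finish together
print(adds_per_sec)               # 1_000_000_000_000_000 -> 1 PetaOPS
```

The key point is that the 32 clocks cover all 32M element-wise additions simultaneously, which is where the PetaOPS figure comes from.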
18 CAM / Associative Search. The bits of the combined key go to the read-enable lines. Stored values are duplicated together with their inverse data; a bit line on which every selected cell reads 1 signals a match. Key steps: duplicate the key with its inverse, and place the original key next to the inverse data.
19 TCAM Search by Standard Memory Cells: entries may contain don't-care bits.
20 TCAM Search by Standard Memory Cells. The bits of the combined key go to the read-enable lines. Insert zero instead of each don't-care, and duplicate the data, inverting only the bits that are not don't-care; a bit line on which every selected cell reads 1 is a match. Key steps: duplicate the key with its inverse, and place the original key next to the inverse data.
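The matching semantics of the TCAM slides can be sketched at the record level. This is an illustrative model of ternary matching (a stored 'x' bit matches any key bit); the entries and key below are made up, not from the slides:

```python
# Sketch of ternary (TCAM-style) matching with don't-cares, modeled
# at the record level: a stored bit of 'x' matches any key bit.
# The entries and key below are illustrative only.
def tcam_match(key, entry):
    """entry is a string over '0', '1', 'x' (don't-care)."""
    return all(e == 'x' or k == e for k, e in zip(key, entry))

entries = ["10x1", "0xx0", "111x"]
key = "1011"
matches = [i for i, e in enumerate(entries) if tcam_match(key, e)]
print(matches)  # [0]
```

In the APU, this comparison happens on every bit line at once, so the match list for the whole table is produced in a constant number of cycles.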
21 Computing in the Bit Lines. Vector A (elements a0...a7) and vector B (b0...b7) sit one element per bit line, and C = f(A, B) is computed in place. Each bit line becomes a processor plus storage; millions of bit lines = millions of processors.
22 Neighborhood Computing. A parallel shift of bit-line sections implements shifted vectors, C = f(A, SL(B, 1)), enabling neighborhood operations such as convolutions.
23 Search & Count. Search (binary or ternary) all bit lines in one cycle and count the matches (count = 3 in the figure). 28M bit lines => 28 Peta searches/sec. Key applications of search and count for predictive analytics: recommender systems, k-nearest neighbors (using cosine similarity search), random forests, image histograms, regular expressions.
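The search-and-count primitive can be modeled column-wise, the way the array actually stores data: one integer per bit position, with one bit per record ("bit line"). Matching then takes one pass over the key bits, independent of the number of records. A small sketch with invented data:

```python
# Sketch: exact-match search over many records in parallel, modeled
# column-wise. columns[i] packs bit i of every record into one int,
# so one bitwise op per key bit matches all records at once.
records = ["1010", "1100", "1010", "0111", "1010"]
n = len(records)
columns = [int("".join(r[i] for r in records), 2) for i in range(4)]

key = "1010"
match = (1 << n) - 1                       # start: every bit line matches
for i, k in enumerate(key):
    col = columns[i]
    match &= col if k == "1" else (~col & ((1 << n) - 1))

count = bin(match).count("1")              # the "count" step
print(count)  # 3
```

The loop length depends only on the key width, which is why the slide can quote a fixed search rate regardless of table size.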
24 CPU vs GPU vs FPGA vs APU
25 CPU/GPGPU vs APU.
CPU/GPGPU (current solution): send an address to memory; fetch the data from memory and send it to the processor; compute serially per core (thousands of cores at most); write the data back to memory, further wasting IO resources; send data to each location that needs it.
In-place computing (APU): search by content; mark in place; compute in place on millions of processors (the memory itself becomes millions of processors); no need to write data back, since the result is already in the memory; if needed, distribute or broadcast at once.
26 ARCHITECTURE
27 Communication Between Sections. Shifting between sections enables neighborhood operations (filters, CNNs, etc.). Store, compute, search, and transport data anywhere.
28 Memory-Section Computing to Improve Performance: MLB sections of 24 rows, each with its own control, linked by connecting muxes and fed by an instruction buffer.
29 APU Chip Layout: 2M bit processors, or 28K vector processors, running at 1 GHz with up to 2 PetaOPS peak performance.
30 APU Layout vs GPU Layout: multi-functional, programmable blocks (APU) vs blocks that accelerate FP operations (GPU).
31 EXAMPLE APPLICATIONS
32 K-Nearest Neighbors (k-NN). A simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, for X and Y), K = 4; group green is selected as the majority. In actual applications: N = billions, D = tens, K = tens of thousands.
33 k-nn Use Case in an APU Item N Item 3 Item Item 2 Features of item Features of item 2 Features of item N Item features and label storage Q C p = Dp Q = n σ i= D p Q σn p 2 i= D i Di p Qi σi= n Q i Majority Calculation Compute cosine distances for all N in parallel ( s, assuming D=5 features) Distribute data 2 ns (to all) Computing Area K Mins at O() complexity ( 3 s) In-Place ranking With the data base in an APU, computation for all N items done in.5 ms, independent of K (X Improvement over current solutions) 33
34 K-MINS: An O(1) Algorithm (independent of N; one pass over the bit positions, MSB to LSB)

KMINS(int K, vector C) {
    M := 11...1; V := 00...0;
    FOR b = msb downto lsb:
        D := not(C[b]);
        N := M & D;
        cnt := COUNT(N | V);
        IF cnt > K:
            M := N;
        ELIF cnt < K:
            V := N | V;
        ELSE: // cnt == K
            V := N | V;
            EXIT;
        ENDIF
    ENDFOR
}
35 K-MINS: The Algorithm (worked example, first iteration; the figure tracks V, N | V, M, D, and C[b] with the running match count)
36 K-MINS: The Algorithm (worked example, a later iteration of the same figure, with the match count updated)
37 K-MINS: The Algorithm (final iteration). The final output is V. O(1) complexity, independent of N.
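The K-MINS pseudocode above can be run directly in software. This sketch uses Python sets to stand in for the bit-line masks M, V, N, and D; the test values are invented:

```python
# Sketch of the slides' bit-serial K-MINS (top-K minimum) algorithm,
# with Python sets standing in for bit-line masks.
def kmins(k, values, bits=8):
    idx = range(len(values))
    M = set(idx)          # candidate mask (all bit lines)
    V = set()             # confirmed minima
    for b in reversed(range(bits)):               # MSB -> LSB
        D = {i for i in idx if not (values[i] >> b) & 1}   # bit b is 0
        N = M & D
        cnt = len(N | V)
        if cnt > k:
            M = N
        elif cnt < k:
            V = N | V
        else:             # cnt == k: exactly K winners found
            return N | V
    return V              # ties may leave |V| < k; pad from M if needed

vals = [7, 3, 9, 1, 4, 12, 0, 5]
print(sorted(kmins(3, vals)))  # [1, 3, 6]: indices of 3, 1, 0
```

On the APU, each loop iteration is a constant number of array-wide operations, so the cost is linear in the bit width and independent of the number of stored items.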
38 Similarity Search and Top-K for Recognition. Images pass through a convolution-layer feature extractor (a neural network) into the database; text passes through word/sentence/document embedding. Every image/sentence/document has a label.
39 Dense (1xN) Vector by Sparse NxM Matrix. APU representation: the sparse matrix is stored as (column, row, value) records, one per bit line. For each input row (row = 2, 3, 4, ... in the figure), search all columns for that row and distribute the corresponding input value: 2 cycles each. Then multiply in parallel (1 cycle) and shift-and-add all products belonging to the same output column. Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix. N << M in general for recommender systems.
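The scheme on the slide can be sketched functionally: one (row, col, value) record per bit line, a broadcast of the matching input scalar to each record, a single parallel multiply, and a reduction by output column. The triples and input below are invented:

```python
# Sketch of the slide's sparse matvec: records are (row, col, value)
# triples, one per bit line. The multiply happens "everywhere at
# once"; the per-column sum models the shift-and-add step.
def sparse_matvec(x, triples, m):
    out = [0.0] * m
    products = [(col, val * x[row]) for row, col, val in triples]  # parallel step
    for col, p in products:                                        # shift-and-add
        out[col] += p
    return out

triples = [(0, 1, 2.0), (1, 0, -1.0), (2, 1, 4.0)]   # 3x2 sparse matrix
print(sparse_matvec([1.0, 3.0, 0.5], triples, 2))    # [-3.0, 4.0]
```

The serial cost here is in distributing the N input values; the multiplies and the per-column accumulation are the parts the APU does in bulk, giving the O(N + log β) bound.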
40 Two NxN Sparse Matrix Multiplication. The inputs (In-DB 1, In-DB 2) and the output (Out-DB) are stored as (row, column, value) tables. 1. Choose the next free entry from In-DB 1. 2. Read its row value. 3. Search and mark entries with the same row. 4. For all marked rows, search where Col(In-DB 1) = Row(In-DB 2). 5. Broadcast the selected value to the output-table bit lines. 6. Multiply in parallel. 7. Shift and add all products belonging to the same column. 8. Update Out-DB. 9. Go back to step 1 if there are more free entries; otherwise exit. Complexity including IO: O(β + log β), compared to O(β^0.7 N^1.2 + N^2) on a CPU, a large improvement.
41 Softmax. Used in many neural-network applications, especially attention networks. The softmax function takes an N-dimensional vector of scores and generates probabilities between 0 and 1, defined by

S_i = e^(z_i) / sum_{j=1..N} e^(z_j)

where z is the dot product between a query vector and a feature vector (for example, the word embedding of an English vocabulary).
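A reference implementation of the formula above, in the numerically stable form that subtracts the maximum score first (this directly addresses the dynamic-range difficulty listed on the next slide):

```python
# Sketch: numerically stable softmax. Subtracting the max score
# keeps exp() from overflowing while leaving the result unchanged.
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p, sum(p))  # probabilities in (0, 1), summing to 1
```

The shift works because e^(z_i - m) / sum_j e^(z_j - m) equals e^(z_i) / sum_j e^(z_j): the factor e^(-m) cancels.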
42 The Difficulties in Softmax Computing: 1. dot products for millions of vectors; 2. a nonlinear function (exp); 3. dependency: every score depends on all the others in the database; 4. dynamic range: fast overflow, requiring high-precision calculation; 5. speed and latency.
43 Taylor Series: e^x = 1 + x + x^2/2! + x^3/3! + ... Very expensive: good accuracy requires more than 20 coefficients and double precision.
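The cost claim can be illustrated by counting how many series terms are needed at a moderately large argument. The tolerance and the x = 10 test point are assumptions chosen for illustration:

```python
# Sketch: count Taylor terms of e^x needed for ~1e-7 relative
# accuracy at x = 10, illustrating why direct series evaluation
# is expensive. Tolerance and test point are illustrative choices.
import math

def terms_needed(x, tol=1e-7):
    target = math.exp(x)
    total, term, n = 1.0, 1.0, 0
    while abs(total - target) / target > tol and n < 200:
        n += 1
        term *= x / n        # builds x^n / n! incrementally
        total += term
    return n

print(terms_needed(10.0))  # dozens of terms, even in double precision
```

This is why the slides favor a lookup-based evaluation over evaluating the series term by term.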
44 1M-Softmax Performance. A proprietary algorithm leverages the APU's lookup capability to provide 1M high-accuracy, exact softmax values in under 5 µs, vs milliseconds on a GPU: more than 3 orders of magnitude improvement.
45 Associative Memory for Natural Language Processing (NLP): Q&A, dialog, language translation, speech recognition, etc. These tasks require learning past events and need a large array with attention capabilities.
46 Examples. Q&A: "Dan put the book in his car ... (long story) ... Mike took Dan's car ... (long story) ... He drove to SF." Q: Where is the book now? A: In the car, in SF. This is attention computing. Language translation: "The cow ate the hay because it was delicious" vs. "The cow ate the hay because it was hungry": resolving what "it" refers to also requires attention. Source: Łukasz Kaiser
47 Example of Associative Attention Computing. Input data (e.g., a sentence in English, for translation or Q&A) goes through an encoder (NN) to a feature-vector embedding: the sentence's feature representation (the key).
48 Example of Associative Attention Computing. The query is encoded (NN) into a feature vector; dot products are taken against all stored keys, softmax turns the scores into attention weights, and the top-K weighted values V1...V6 form the attention result passed to the next stage (encoder or decoder). Dot product: O(1); softmax: O(1); top-K: O(1).
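The dot-product / softmax / weighted-sum pipeline of this slide can be sketched end to end. The keys and values below are toy stand-ins for encoder outputs:

```python
# Sketch of the slide's attention step: score the query against
# every stored key, softmax the scores, then take the weighted
# sum of the stored values. Keys/values are toy stand-ins.
import math

def attention(query, keys, values):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]  # dot products
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]                                            # softmax
    dim = len(values[0])
    return [sum(w[i] * values[i][d] for i in range(len(values)))
            for d in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention([1.0, 0.0], keys, values))  # leans toward the first value
```

Each of the three stages maps to one of the O(1) array-wide primitives listed on the slide, which is why the whole lookup runs in constant time over the stored memory.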
49 Q&A: End-to-End Memory Network. Source: Weston et al.
50 GSI Associative Solution for End-to-End. Constant time of 3 µs per iteration, for any memory size: a few orders of magnitude improvement. Source: Weston et al.
51 Associative Computing for Low-Shot Learning. Gradient-based optimization has achieved impressive results on supervised tasks such as image classification, but these models need a lot of data, while people can learn efficiently from a few examples. Associative computing, like people, can measure similarity to features stored in memory, and can also create a new label for similar features in the future.
52 Zero-Shot Learning with k-NN. Input images with labels pass through a convolution-layer feature extractor into a features embedding; a similar image without a label is matched by cosine similarity search plus top-K to retrieve its label. Extract features using any pre-trained CNN, for example VGG/Inception trained on ImageNet. The new data set is embedded using the pre-trained model and stored in memory with its labels. Queries (test images) are input without labels, and their features are cosine-similarity searched to predict the label.
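The last stage of this pipeline, cosine-similarity search plus a top-K majority vote over stored (feature, label) pairs, can be sketched as follows. The feature vectors here are toy stand-ins for CNN embeddings:

```python
# Sketch of the zero-shot pipeline's final stage: cosine-similarity
# search plus top-K majority vote. Features are toy stand-ins for
# CNN embeddings.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def predict(query, db, k=3):
    # db: list of (feature_vector, label) pairs
    ranked = sorted(db, key=lambda rec: cosine(query, rec[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

db = [([1.0, 0.1], "cat"), ([0.9, 0.2], "cat"), ([0.1, 1.0], "dog"),
      ([0.2, 0.9], "dog"), ([0.95, 0.05], "cat")]
print(predict([1.0, 0.0], db))  # cat
```

In the APU the sort is replaced by the O(1) similarity search and K-MINS/top-K primitives described earlier, so prediction time does not grow with the database size.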
53 Dimension Reduction. The output of the convolution layer is large and very sparse (in VGG). A simple matrix or multi-layer nonlinear transformation is learned simply, with a loss function based on cosine distance: the difference between the distance of any two records before and after the transform. The network thus learns to preserve cosine distance through the transformation.
54 Low-Shot: Train the Network on Distance. Start with an untrained network; its output already serves as reduced-dimension keys for the k-NN database. Train the network only to keep similar-valued keys close.
55 Cut Short (associative k-NN database). Stop training when the system starts to converge ("cut short"), and use similarity search instead of a fully connected layer; this requires less complete training.
56 PROGRAMMING MODEL
57 Programming Model. Write the application on a standard host using the TensorFlow / Tensor2Tensor framework, which generates a TensorFlow graph for execution in device memory; the APU chip/card executes the graph using fused capabilities.
58 PCIe Development Boards: 4 APU chips; 8 million bit-line rows (processors); 8 Peta Boolean OPS; 6.4 TFLOPS; 2 Petabit/sec internal IO; 6-64 GB device memory; TensorFlow framework (basic functions); GNL (GSI Numeric Library).
59 FUTURE APPROACH: NON-VOLATILE CONCEPT
60 Computing in Non-Volatile Cells. For reads, select multiple lines (as NOR/NAND inputs) with Ref = V-read; the sense unit senses the bit line for logic 1 or 0. For writes (NOR/NAND results), select one or multiple lines with Ref = V-write; the write control generates a logic 1 or 0 for the bit line.
61 Solutions for Future Data Centers. Associative computing can sit at several levels of the hierarchy (CPU register file, L1/L2/L3, DRAM, storage):
- Standard SRAM-based (volatile): high endurance; full computing (floating point, etc.), which requires both read and write
- STT-RAM / PC-RAM / ReRAM-based (non-volatile): mid endurance; machine learning, malware detection, etc., with much more read and much less write
- Flash / HDD (non-volatile): low endurance; data search engines (read most of the time)
62 Summary. The APU enables state-of-the-art, next-generation machine learning: in-place computing, from basic Boolean algebra to complex algorithms; O(1) dot-product computation; O(1) min/max; O(1) top-K; O(1) softmax; ultra-high internal bandwidth (Petabit/sec scale); up to 2 PetaOPS of Boolean algebra in a single chip; fully scalable; fully programmable; efficient TensorFlow-based capabilities.
63 Summary. Extending Moore's Law and leveraging the growth of advanced memory technology. M.Sc./Ph.D. students who would like to collaborate on research, please contact me: aakerib@gsitechnology.com
64 Thank You! Any Questions?
CO20-320241 Computer Architecture and Programming Languages CAPL Lecture 15 Dr. Kinga Lipskoch Fall 2017 How to Compute a Binary Float Decimal fraction: 8.703125 Integral part: 8 1000 Fraction part: 0.703125
More informationDeep Learning Performance and Cost Evaluation
Micron 5210 ION Quad-Level Cell (QLC) SSDs vs 7200 RPM HDDs in Centralized NAS Storage Repositories A Technical White Paper Don Wang, Rene Meyer, Ph.D. info@ AMAX Corporation Publish date: October 25,
More informationLECTURE 10: Improving Memory Access: Direct and Spatial caches
EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses
More informationNeural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017
3/0/207 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/0/207 Perceptron as a neural
More informationCS 101, Mock Computer Architecture
CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically
More informationSlide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng
Slide Set 9 for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 369 Winter 2018 Section 01
More informationEnabling Technology for the Cloud and AI One Size Fits All?
Enabling Technology for the Cloud and AI One Size Fits All? Tim Horel Collaborate. Differentiate. Win. DIRECTOR, FIELD APPLICATIONS The Growing Cloud Global IP Traffic Growth 40B+ devices with intelligence
More informationChecking for duplicates Maximum density Battling computers and algorithms Barometer Instructions Big O expressions. John Edgar 2
CMPT 125 Checking for duplicates Maximum density Battling computers and algorithms Barometer Instructions Big O expressions John Edgar 2 Write a function to determine if an array contains duplicates int
More informationESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA
ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong
More informationCS/COE 0447 Example Problems for Exam 2 Spring 2011
CS/COE 0447 Example Problems for Exam 2 Spring 2011 1) Show the steps to multiply the 4-bit numbers 3 and 5 with the fast shift-add multipler. Use the table below. List the multiplicand (M) and product
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationSemiconductor Memories: RAMs and ROMs
Semiconductor Memories: RAMs and ROMs Lesson Objectives: In this lesson you will be introduced to: Different memory devices like, RAM, ROM, PROM, EPROM, EEPROM, etc. Different terms like: read, write,
More informationCaches. Hiding Memory Access Times
Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY
More informationKnowledge Organiser. Computing. Year 10 Term 1 Hardware
Organiser Computing Year 10 Term 1 Hardware Enquiry Question How does a computer do everything it does? Big questions that will help you answer this enquiry question: 1. What is the purpose of the CPU?
More informationMachine Learning on VMware vsphere with NVIDIA GPUs
Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology
More informationFacilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit
Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM Join the Conversation #OpenPOWERSummit Moral of the Story OpenPOWER is the best platform to
More informationCharacterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager
Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance
More informationA Deep Relevance Matching Model for Ad-hoc Retrieval
A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese
More informationDeep Learning Performance and Cost Evaluation
Micron 5210 ION Quad-Level Cell (QLC) SSDs vs 7200 RPM HDDs in Centralized NAS Storage Repositories A Technical White Paper Rene Meyer, Ph.D. AMAX Corporation Publish date: October 25, 2018 Abstract Introduction
More informationGrundlagen Microcontroller Memory. Günther Gridling Bettina Weiss
Grundlagen Microcontroller Memory Günther Gridling Bettina Weiss 1 Lecture Overview Memory Memory Types Address Space Allocation 2 Memory Requirements What do we want to store? program constants (e.g.
More informationRapid growth of massive datasets
Overview Rapid growth of massive datasets E.g., Online activity, Science, Sensor networks Data Distributed Clusters are Pervasive Data Distributed Computing Mature Methods for Common Problems e.g., classification,
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationHARDWARE. There are a number of factors that effect the speed of the processor. Explain how these factors affect the speed of the computer s CPU.
HARDWARE hardware ˈhɑːdwɛː noun [ mass noun ] the machines, wiring, and other physical components of a computer or other electronic system. select a software package that suits your requirements and buy
More informationDeep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur
Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 05 Classification with Perceptron Model So, welcome to today
More informationReal-time Object Detection CS 229 Course Project
Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection
More informationClass 6 Large-Scale Image Classification
Class 6 Large-Scale Image Classification Liangliang Cao, March 7, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual
More informationParallelism and Concurrency. COS 326 David Walker Princeton University
Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary
More informationMemory technology and optimizations ( 2.3) Main Memory
Memory technology and optimizations ( 2.3) 47 Main Memory Performance of Main Memory: Latency: affects Cache Miss Penalty» Access Time: time between request and word arrival» Cycle Time: minimum time between
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More information