In-Place Associative Computing
1 In-Place Associative Computing. All images are public on the web. Avidan Akerib, Ph.D., Vice President, Associative Computing BU
2 Agenda
- Introduction to associative computing
- Use case examples: similarity search, large-scale attention computing, few-shot learning
- Software model
- Future approaches
3 The Challenge in AI Computing (matrix multiplication is not enough!). AI requirements and matching use-case examples:
- High-precision floating point: neural network learning
- Multi-precision: real-time inference, saving memory
- Linearly scalable: big data
- Sort/search: top-K, recommendation, speech, image/video classification
- Heavy computation: non-linearity, softmax, exponent, normalization
- Bandwidth/power tradeoff: high speed at low power
4 Von Neumann Architecture. Memory: high density (repeated cells), slower, leveraging Moore's Law. CPU: lower density (lots of logic), faster.
5 Von Neumann Architecture. Memory: high density (repeated cells), slower. CPU: lower density (lots of logic), faster. CPU frequency outpaced memory speed, so caches had to be added. Both continue to leverage Moore's Law.
6 Since 2006, clock speeds have flattened sharply. Source: Intel
7 Thinking parallel: 2 cores and more. However, memory utilization becomes an issue.
8 More and more memory is added to solve the utilization problem: local and global memory.
9 Memory is still growing rapidly, becoming a larger part of each chip.
10 The same concept holds even with GPGPUs: very high power, a large die, and high cost. What's next?
11 Most of the power goes to bandwidth. Source: Song Han, Stanford University
12 Changing the rules of the game! Standard memory cells are smarter than we thought!
13 APU: Associative Processing Unit. A simple CPU sends a question over a simple, narrow bus, and millions of processors in the APU return the answer. Associative processing computes in place, directly in the memory array: it removes the I/O bottleneck, significantly increases performance, and reduces power.
14 How computers work today: an address decoder drives read/write enables (RE/WE) on the memory array, and sense amps / IO drivers move the data to the ALU.
15 Accessing multiple rows simultaneously: raising several read-enable lines at once causes bus contention, but bus contention is not an error! The shared bit line simply computes a NOR/NAND of the selected cells, satisfying De Morgan's law.
16 Truth Table Example. D = !A!C + BC = !!(!A!C + BC) = !(!(!A!C) & !(BC)) = NAND(NAND(!A,!C), NAND(B,C)). Every minterm takes one clock, and all bit lines execute their Karnaugh maps in parallel: Read(!A,!C), WRITE T1; Read(B,C), WRITE T2; then a final clock combines them: Read(T1,T2), WRITE D.
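The NAND-of-NANDs evaluation above can be sketched in software. This is a minimal model, not vendor code: a Python integer stands in for one row of the array, with each bit representing a different bit line, so a single bitwise operation acts on all "bit lines" at once.

```python
# Sketch (not vendor code): emulate the slide's NAND-of-NANDs
# computation D = !A!C + BC across many bit lines at once, using a
# Python integer as a row whose bits are the per-bit-line values.
MASK = 0b1111          # 4 simulated bit lines

def nand(x, y):
    return ~(x & y) & MASK

A, B, C = 0b1100, 0b1010, 0b0110   # one value of A, B, C per bit line

T1 = nand(~A & MASK, ~C & MASK)    # clock 1: T1 = !(!A & !C)
T2 = nand(B, C)                    #          T2 = !(B & C)
D = nand(T1, T2)                   # clock 2: D = !(T1 & T2) = !A!C + BC

print(format(D, "04b"))  # 0011
```

The same two-clock pattern scales to any sum-of-products: one clock per minterm, plus a combining step, regardless of how many bit lines participate.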
17 Vector Add Example: C[] = A[] + B[], with vector A(8 bits, 32M elements), vector B(8 bits, 32M), and vector C(9 bits, 32M). Number of clocks = 4 * 8 = 32; clocks per byte = 32/32M = 1/1M; OPS = 1 GHz x 1M = 1 PetaOPS.
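The slide's throughput arithmetic can be checked back-of-envelope. The 1 GHz clock and the 4-clocks-per-full-adder-bit figure are assumptions reconstructed from the slide, not measured values:

```python
# Back-of-envelope check of the slide's throughput claim, assuming a
# 1 GHz clock and 4 clocks per full-adder bit (figures taken from the
# slide's reconstruction, not measured).
bit_lines = 32_000_000            # one 8-bit element per bit line
clocks = 4 * 8                    # 4 clocks per bit x 8 bits = 32 clocks total
clock_hz = 1_000_000_000          # 1 GHz
adds_per_sec = clock_hz * bit_lines // clocks   # all 32M adds finish together
print(adds_per_sec)               # 1_000_000_000_000_000 -> 1 PetaOPS
```

The key point is that the 32 clocks cover all 32M element-wise additions simultaneously, which is where the PetaOPS figure comes from.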
18 CAM / Associative Search. The bits of the combined key go to the read-enable lines. Stored values are duplicated together with their inverse data; a bit line on which every selected cell reads 1 signals a match. Key steps: duplicate the key with its inverse, and place the original key next to the inverse data.
19 TCAM Search by Standard Memory Cells: entries may contain don't-care bits.
20 TCAM Search by Standard Memory Cells. The bits of the combined key go to the read-enable lines. Insert zero instead of each don't-care, and duplicate the data, inverting only the bits that are not don't-care; a bit line on which every selected cell reads 1 is a match. Key steps: duplicate the key with its inverse, and place the original key next to the inverse data.
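The matching semantics of the TCAM slides can be sketched at the record level. This is an illustrative model of ternary matching (a stored 'x' bit matches any key bit); the entries and key below are made up, not from the slides:

```python
# Sketch of ternary (TCAM-style) matching with don't-cares, modeled
# at the record level: a stored bit of 'x' matches any key bit.
# The entries and key below are illustrative only.
def tcam_match(key, entry):
    """entry is a string over '0', '1', 'x' (don't-care)."""
    return all(e == 'x' or k == e for k, e in zip(key, entry))

entries = ["10x1", "0xx0", "111x"]
key = "1011"
matches = [i for i, e in enumerate(entries) if tcam_match(key, e)]
print(matches)  # [0]
```

In the APU, this comparison happens on every bit line at once, so the match list for the whole table is produced in a constant number of cycles.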
21 Computing in the Bit Lines. Vector A (elements a0...a7) and vector B (b0...b7) sit one element per bit line, and C = f(A, B) is computed in place. Each bit line becomes a processor plus storage; millions of bit lines = millions of processors.
22 Neighborhood Computing. A parallel shift of bit-line sections implements shifted vectors, C = f(A, SL(B, 1)), enabling neighborhood operations such as convolutions.
23 Search & Count. Search (binary or ternary) all bit lines in one cycle and count the matches (count = 3 in the figure). 28M bit lines => 28 Peta searches/sec. Key applications of search and count for predictive analytics: recommender systems, k-nearest neighbors (using cosine similarity search), random forests, image histograms, regular expressions.
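The search-and-count primitive can be modeled column-wise, the way the array actually stores data: one integer per bit position, with one bit per record ("bit line"). Matching then takes one pass over the key bits, independent of the number of records. A small sketch with invented data:

```python
# Sketch: exact-match search over many records in parallel, modeled
# column-wise. columns[i] packs bit i of every record into one int,
# so one bitwise op per key bit matches all records at once.
records = ["1010", "1100", "1010", "0111", "1010"]
n = len(records)
columns = [int("".join(r[i] for r in records), 2) for i in range(4)]

key = "1010"
match = (1 << n) - 1                       # start: every bit line matches
for i, k in enumerate(key):
    col = columns[i]
    match &= col if k == "1" else (~col & ((1 << n) - 1))

count = bin(match).count("1")              # the "count" step
print(count)  # 3
```

The loop length depends only on the key width, which is why the slide can quote a fixed search rate regardless of table size.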
24 CPU vs GPU vs FPGA vs APU
25 CPU/GPGPU vs APU.
CPU/GPGPU (current solution): send an address to memory; fetch the data from memory and send it to the processor; compute serially per core (thousands of cores at most); write the data back to memory, further wasting IO resources; send data to each location that needs it.
In-place computing (APU): search by content; mark in place; compute in place on millions of processors (the memory itself becomes millions of processors); no need to write data back, since the result is already in the memory; if needed, distribute or broadcast at once.
26 ARCHITECTURE
27 Communication Between Sections. Shifting between sections enables neighborhood operations (filters, CNNs, etc.). Store, compute, search, and transport data anywhere.
28 Memory-Section Computing to Improve Performance: MLB sections of 24 rows, each with its own control, linked by connecting muxes and fed by an instruction buffer.
29 APU Chip Layout: 2M bit processors, or 28K vector processors, running at 1 GHz with up to 2 PetaOPS peak performance.
30 APU Layout vs GPU Layout: multi-functional, programmable blocks (APU) vs blocks that accelerate FP operations (GPU).
31 EXAMPLE APPLICATIONS
32 K-Nearest Neighbors (k-NN). A simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, for X and Y), K = 4; group green is selected as the majority. In actual applications: N = billions, D = tens, K = tens of thousands.
33 k-nn Use Case in an APU Item N Item 3 Item Item 2 Features of item Features of item 2 Features of item N Item features and label storage Q C p = Dp Q = n σ i= D p Q σn p 2 i= D i Di p Qi σi= n Q i Majority Calculation Compute cosine distances for all N in parallel ( s, assuming D=5 features) Distribute data 2 ns (to all) Computing Area K Mins at O() complexity ( 3 s) In-Place ranking With the data base in an APU, computation for all N items done in.5 ms, independent of K (X Improvement over current solutions) 33
34 K-MINS: An O(1) Algorithm (independent of N; one pass over the bit positions, MSB to LSB)

KMINS(int K, vector C) {
    M := 11...1; V := 00...0;
    FOR b = msb downto lsb:
        D := not(C[b]);
        N := M & D;
        cnt := COUNT(N | V);
        IF cnt > K:
            M := N;
        ELIF cnt < K:
            V := N | V;
        ELSE: // cnt == K
            V := N | V;
            EXIT;
        ENDIF
    ENDFOR
}
35 K-MINS: The Algorithm (worked example, first iteration; the figure tracks V, N | V, M, D, and C[b] with the running match count)
36 K-MINS: The Algorithm (worked example, a later iteration of the same figure, with the match count updated)
37 K-MINS: The Algorithm (final iteration). The final output is V. O(1) complexity, independent of N.
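The K-MINS pseudocode above can be run directly in software. This sketch uses Python sets to stand in for the bit-line masks M, V, N, and D; the test values are invented:

```python
# Sketch of the slides' bit-serial K-MINS (top-K minimum) algorithm,
# with Python sets standing in for bit-line masks.
def kmins(k, values, bits=8):
    idx = range(len(values))
    M = set(idx)          # candidate mask (all bit lines)
    V = set()             # confirmed minima
    for b in reversed(range(bits)):               # MSB -> LSB
        D = {i for i in idx if not (values[i] >> b) & 1}   # bit b is 0
        N = M & D
        cnt = len(N | V)
        if cnt > k:
            M = N
        elif cnt < k:
            V = N | V
        else:             # cnt == k: exactly K winners found
            return N | V
    return V              # ties may leave |V| < k; pad from M if needed

vals = [7, 3, 9, 1, 4, 12, 0, 5]
print(sorted(kmins(3, vals)))  # [1, 3, 6]: indices of 3, 1, 0
```

On the APU, each loop iteration is a constant number of array-wide operations, so the cost is linear in the bit width and independent of the number of stored items.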
38 Similarity Search and Top-K for Recognition. Images pass through a convolution-layer feature extractor (a neural network) into the database; text passes through word/sentence/document embedding. Every image/sentence/document has a label.
39 Dense (1xN) Vector by Sparse NxM Matrix. APU representation: the sparse matrix is stored as (column, row, value) records, one per bit line. For each input row (row = 2, 3, 4, ... in the figure), search all columns for that row and distribute the corresponding input value: 2 cycles each. Then multiply in parallel (1 cycle) and shift-and-add all products belonging to the same output column. Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix. N << M in general for recommender systems.
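The scheme on the slide can be sketched functionally: one (row, col, value) record per bit line, a broadcast of the matching input scalar to each record, a single parallel multiply, and a reduction by output column. The triples and input below are invented:

```python
# Sketch of the slide's sparse matvec: records are (row, col, value)
# triples, one per bit line. The multiply happens "everywhere at
# once"; the per-column sum models the shift-and-add step.
def sparse_matvec(x, triples, m):
    out = [0.0] * m
    products = [(col, val * x[row]) for row, col, val in triples]  # parallel step
    for col, p in products:                                        # shift-and-add
        out[col] += p
    return out

triples = [(0, 1, 2.0), (1, 0, -1.0), (2, 1, 4.0)]   # 3x2 sparse matrix
print(sparse_matvec([1.0, 3.0, 0.5], triples, 2))    # [-3.0, 4.0]
```

The serial cost here is in distributing the N input values; the multiplies and the per-column accumulation are the parts the APU does in bulk, giving the O(N + log β) bound.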
40 Two NxN Sparse Matrix Multiplication. The inputs (In-DB 1, In-DB 2) and the output (Out-DB) are stored as (row, column, value) tables. 1. Choose the next free entry from In-DB 1. 2. Read its row value. 3. Search and mark entries with the same row. 4. For all marked rows, search where Col(In-DB 1) = Row(In-DB 2). 5. Broadcast the selected value to the output-table bit lines. 6. Multiply in parallel. 7. Shift and add all products belonging to the same column. 8. Update Out-DB. 9. Go back to step 1 if there are more free entries; otherwise exit. Complexity including IO: O(β + log β), compared to O(β^0.7 N^1.2 + N^2) on a CPU, a large improvement.
41 Softmax. Used in many neural-network applications, especially attention networks. The softmax function takes an N-dimensional vector of scores and generates probabilities between 0 and 1, defined by

S_i = e^(z_i) / sum_{j=1..N} e^(z_j)

where z is the dot product between a query vector and a feature vector (for example, the word embedding of an English vocabulary).
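A reference implementation of the formula above, in the numerically stable form that subtracts the maximum score first (this directly addresses the dynamic-range difficulty listed on the next slide):

```python
# Sketch: numerically stable softmax. Subtracting the max score
# keeps exp() from overflowing while leaving the result unchanged.
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

p = softmax([2.0, 1.0, 0.1])
print(p, sum(p))  # probabilities in (0, 1), summing to 1
```

The shift works because e^(z_i - m) / sum_j e^(z_j - m) equals e^(z_i) / sum_j e^(z_j): the factor e^(-m) cancels.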
42 The Difficulties in Softmax Computing: 1. dot products for millions of vectors; 2. a nonlinear function (exp); 3. dependency: every score depends on all the others in the database; 4. dynamic range: fast overflow, requiring high-precision calculation; 5. speed and latency.
43 Taylor Series: e^x = 1 + x + x^2/2! + x^3/3! + ... Very expensive: good accuracy requires more than 20 coefficients and double precision.
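The cost claim can be illustrated by counting how many series terms are needed at a moderately large argument. The tolerance and the x = 10 test point are assumptions chosen for illustration:

```python
# Sketch: count Taylor terms of e^x needed for ~1e-7 relative
# accuracy at x = 10, illustrating why direct series evaluation
# is expensive. Tolerance and test point are illustrative choices.
import math

def terms_needed(x, tol=1e-7):
    target = math.exp(x)
    total, term, n = 1.0, 1.0, 0
    while abs(total - target) / target > tol and n < 200:
        n += 1
        term *= x / n        # builds x^n / n! incrementally
        total += term
    return n

print(terms_needed(10.0))  # dozens of terms, even in double precision
```

This is why the slides favor a lookup-based evaluation over evaluating the series term by term.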
44 1M-Softmax Performance. A proprietary algorithm leverages the APU's lookup capability to provide 1M high-accuracy, exact softmax values in under 5 µs, vs milliseconds on a GPU: more than 3 orders of magnitude improvement.
45 Associative Memory for Natural Language Processing (NLP): Q&A, dialog, language translation, speech recognition, etc. These tasks require learning past events and need a large array with attention capabilities.
46 Examples. Q&A: "Dan put the book in his car ... (long story) ... Mike took Dan's car ... (long story) ... He drove to SF." Q: Where is the book now? A: In the car, in SF. This is attention computing. Language translation: "The cow ate the hay because it was delicious" vs. "The cow ate the hay because it was hungry": resolving what "it" refers to also requires attention. Source: Łukasz Kaiser
47 Example of Associative Attention Computing. Input data (e.g., a sentence in English, for translation or Q&A) goes through an encoder (NN) to a feature-vector embedding: the sentence's feature representation (the key).
48 Example of Associative Attention Computing. The query is encoded (NN) into a feature vector; dot products are taken against all stored keys, softmax turns the scores into attention weights, and the top-K weighted values V1...V6 form the attention result passed to the next stage (encoder or decoder). Dot product: O(1); softmax: O(1); top-K: O(1).
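The dot-product / softmax / weighted-sum pipeline of this slide can be sketched end to end. The keys and values below are toy stand-ins for encoder outputs:

```python
# Sketch of the slide's attention step: score the query against
# every stored key, softmax the scores, then take the weighted
# sum of the stored values. Keys/values are toy stand-ins.
import math

def attention(query, keys, values):
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]  # dot products
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]                                            # softmax
    dim = len(values[0])
    return [sum(w[i] * values[i][d] for i in range(len(values)))
            for d in range(dim)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
print(attention([1.0, 0.0], keys, values))  # leans toward the first value
```

Each of the three stages maps to one of the O(1) array-wide primitives listed on the slide, which is why the whole lookup runs in constant time over the stored memory.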
49 Q&A: End-to-End Memory Network. Source: Weston et al.
50 GSI Associative Solution for End-to-End. Constant time of 3 µs per iteration, for any memory size: a few orders of magnitude improvement. Source: Weston et al.
51 Associative Computing for Low-Shot Learning. Gradient-based optimization has achieved impressive results on supervised tasks such as image classification, but these models need a lot of data, while people can learn efficiently from a few examples. Associative computing, like people, can measure similarity to features stored in memory, and can also create a new label for similar features in the future.
52 Zero-Shot Learning with k-NN. Input images with labels pass through a convolution-layer feature extractor into a features embedding; a similar image without a label is matched by cosine similarity search plus top-K to retrieve its label. Extract features using any pre-trained CNN, for example VGG/Inception trained on ImageNet. The new data set is embedded using the pre-trained model and stored in memory with its labels. Queries (test images) are input without labels, and their features are cosine-similarity searched to predict the label.
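The last stage of this pipeline, cosine-similarity search plus a top-K majority vote over stored (feature, label) pairs, can be sketched as follows. The feature vectors here are toy stand-ins for CNN embeddings:

```python
# Sketch of the zero-shot pipeline's final stage: cosine-similarity
# search plus top-K majority vote. Features are toy stand-ins for
# CNN embeddings.
import math
from collections import Counter

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def predict(query, db, k=3):
    # db: list of (feature_vector, label) pairs
    ranked = sorted(db, key=lambda rec: cosine(query, rec[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

db = [([1.0, 0.1], "cat"), ([0.9, 0.2], "cat"), ([0.1, 1.0], "dog"),
      ([0.2, 0.9], "dog"), ([0.95, 0.05], "cat")]
print(predict([1.0, 0.0], db))  # cat
```

In the APU the sort is replaced by the O(1) similarity search and K-MINS/top-K primitives described earlier, so prediction time does not grow with the database size.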
53 Dimension Reduction. The output of the convolution layer is large and very sparse (in VGG). A simple matrix or multi-layer nonlinear transformation is learned simply, with a loss function based on cosine distance: the difference between the distance of any two records before and after the transform. The network thus learns to preserve cosine distance through the transformation.
54 Low-Shot: Train the Network on Distance. Start with an untrained network; its output already serves as reduced-dimension keys for the k-NN database. Train the network only to keep similar-valued keys close.
55 Cut Short (associative k-NN database). Stop training when the system starts to converge ("cut short"), and use similarity search instead of a fully connected layer; this requires less complete training.
56 PROGRAMMING MODEL
57 Programming Model. Write the application on a standard host using the TensorFlow / Tensor2Tensor framework, which generates a TensorFlow graph for execution in device memory; the APU chip/card executes the graph using fused capabilities.
58 PCIe Development Boards: 4 APU chips; 8 million bit-line rows (processors); 8 Peta Boolean OPS; 6.4 TFLOPS; 2 Petabit/sec internal IO; 6-64 GB device memory; TensorFlow framework (basic functions); GNL (GSI Numeric Library).
59 FUTURE APPROACH: NON-VOLATILE CONCEPT
60 Computing in Non-Volatile Cells. For reads, select multiple lines (as NOR/NAND inputs) with Ref = V-read; the sense unit senses the bit line for logic 1 or 0. For writes (NOR/NAND results), select one or multiple lines with Ref = V-write; the write control generates a logic 1 or 0 for the bit line.
61 Solutions for Future Data Centers. Associative computing can sit at several levels of the hierarchy (CPU register file, L1/L2/L3, DRAM, storage):
- Standard SRAM-based (volatile): high endurance; full computing (floating point, etc.), which requires both read and write
- STT-RAM / PC-RAM / ReRAM-based (non-volatile): mid endurance; machine learning, malware detection, etc., with much more read and much less write
- Flash / HDD (non-volatile): low endurance; data search engines (read most of the time)
62 Summary. The APU enables state-of-the-art, next-generation machine learning: in-place computing, from basic Boolean algebra to complex algorithms; O(1) dot-product computation; O(1) min/max; O(1) top-K; O(1) softmax; ultra-high internal bandwidth (Petabit/sec scale); up to 2 PetaOPS of Boolean algebra in a single chip; fully scalable; fully programmable; efficient TensorFlow-based capabilities.
63 Summary. Extending Moore's Law and leveraging the growth of advanced memory technology. M.Sc./Ph.D. students who would like to collaborate on research, please contact me: aakerib@gsitechnology.com
64 Thank You! Any Questions?
CO20-320241 Computer Architecture and Programming Languages CAPL Lecture 15 Dr. Kinga Lipskoch Fall 2017 How to Compute a Binary Float Decimal fraction: 8.703125 Integral part: 8 1000 Fraction part: 0.703125
More informationDeep Learning Performance and Cost Evaluation
Micron 5210 ION Quad-Level Cell (QLC) SSDs vs 7200 RPM HDDs in Centralized NAS Storage Repositories A Technical White Paper Don Wang, Rene Meyer, Ph.D. info@ AMAX Corporation Publish date: October 25,
More informationLECTURE 10: Improving Memory Access: Direct and Spatial caches
EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses
More informationNeural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017
3/0/207 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/0/207 Perceptron as a neural
More informationCS 101, Mock Computer Architecture
CS 101, Mock Computer Architecture Computer organization and architecture refers to the actual hardware used to construct the computer, and the way that the hardware operates both physically and logically
More informationSlide Set 9. for ENCM 369 Winter 2018 Section 01. Steve Norman, PhD, PEng
Slide Set 9 for ENCM 369 Winter 2018 Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary March 2018 ENCM 369 Winter 2018 Section 01
More informationEnabling Technology for the Cloud and AI One Size Fits All?
Enabling Technology for the Cloud and AI One Size Fits All? Tim Horel Collaborate. Differentiate. Win. DIRECTOR, FIELD APPLICATIONS The Growing Cloud Global IP Traffic Growth 40B+ devices with intelligence
More informationChecking for duplicates Maximum density Battling computers and algorithms Barometer Instructions Big O expressions. John Edgar 2
CMPT 125 Checking for duplicates Maximum density Battling computers and algorithms Barometer Instructions Big O expressions John Edgar 2 Write a function to determine if an array contains duplicates int
More informationESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA
ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA Song Han 1,2, Junlong Kang 2, Huizi Mao 1, Yiming Hu 3, Xin Li 2, Yubin Li 2, Dongliang Xie 2, Hong Luo 2, Song Yao 2, Yu Wang 2,3, Huazhong
More informationCS/COE 0447 Example Problems for Exam 2 Spring 2011
CS/COE 0447 Example Problems for Exam 2 Spring 2011 1) Show the steps to multiply the 4-bit numbers 3 and 5 with the fast shift-add multipler. Use the table below. List the multiplicand (M) and product
More informationComputer Architectures for Deep Learning. Ethan Dell and Daniyal Iqbal
Computer Architectures for Deep Learning Ethan Dell and Daniyal Iqbal Agenda Introduction to Deep Learning Challenges Architectural Solutions Hardware Architectures CPUs GPUs Accelerators FPGAs SOCs ASICs
More informationSemiconductor Memories: RAMs and ROMs
Semiconductor Memories: RAMs and ROMs Lesson Objectives: In this lesson you will be introduced to: Different memory devices like, RAM, ROM, PROM, EPROM, EEPROM, etc. Different terms like: read, write,
More informationCaches. Hiding Memory Access Times
Caches Hiding Memory Access Times PC Instruction Memory 4 M U X Registers Sign Ext M U X Sh L 2 Data Memory M U X C O N T R O L ALU CTL INSTRUCTION FETCH INSTR DECODE REG FETCH EXECUTE/ ADDRESS CALC MEMORY
More informationKnowledge Organiser. Computing. Year 10 Term 1 Hardware
Organiser Computing Year 10 Term 1 Hardware Enquiry Question How does a computer do everything it does? Big questions that will help you answer this enquiry question: 1. What is the purpose of the CPU?
More informationMachine Learning on VMware vsphere with NVIDIA GPUs
Machine Learning on VMware vsphere with NVIDIA GPUs Uday Kurkure, Hari Sivaraman, Lan Vu GPU Technology Conference 2017 2016 VMware Inc. All rights reserved. Gartner Hype Cycle for Emerging Technology
More informationFacilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM. Join the Conversation #OpenPOWERSummit
Facilitating IP Development for the OpenCAPI Memory Interface Kevin McIlvain, Memory Development Engineer IBM Join the Conversation #OpenPOWERSummit Moral of the Story OpenPOWER is the best platform to
More informationCharacterization and Benchmarking of Deep Learning. Natalia Vassilieva, PhD Sr. Research Manager
Characterization and Benchmarking of Deep Learning Natalia Vassilieva, PhD Sr. Research Manager Deep learning applications Vision Speech Text Other Search & information extraction Security/Video surveillance
More informationA Deep Relevance Matching Model for Ad-hoc Retrieval
A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese
More informationDeep Learning Performance and Cost Evaluation
Micron 5210 ION Quad-Level Cell (QLC) SSDs vs 7200 RPM HDDs in Centralized NAS Storage Repositories A Technical White Paper Rene Meyer, Ph.D. AMAX Corporation Publish date: October 25, 2018 Abstract Introduction
More informationGrundlagen Microcontroller Memory. Günther Gridling Bettina Weiss
Grundlagen Microcontroller Memory Günther Gridling Bettina Weiss 1 Lecture Overview Memory Memory Types Address Space Allocation 2 Memory Requirements What do we want to store? program constants (e.g.
More informationRapid growth of massive datasets
Overview Rapid growth of massive datasets E.g., Online activity, Science, Sensor networks Data Distributed Clusters are Pervasive Data Distributed Computing Mature Methods for Common Problems e.g., classification,
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationHARDWARE. There are a number of factors that effect the speed of the processor. Explain how these factors affect the speed of the computer s CPU.
HARDWARE hardware ˈhɑːdwɛː noun [ mass noun ] the machines, wiring, and other physical components of a computer or other electronic system. select a software package that suits your requirements and buy
More informationDeep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur
Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 05 Classification with Perceptron Model So, welcome to today
More informationReal-time Object Detection CS 229 Course Project
Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection
More informationClass 6 Large-Scale Image Classification
Class 6 Large-Scale Image Classification Liangliang Cao, March 7, 2013 EECS 6890 Topics in Information Processing Spring 2013, Columbia University http://rogerioferis.com/visualrecognitionandsearch Visual
More informationParallelism and Concurrency. COS 326 David Walker Princeton University
Parallelism and Concurrency COS 326 David Walker Princeton University Parallelism What is it? Today's technology trends. How can we take advantage of it? Why is it so much harder to program? Some preliminary
More informationMemory technology and optimizations ( 2.3) Main Memory
Memory technology and optimizations ( 2.3) 47 Main Memory Performance of Main Memory: Latency: affects Cache Miss Penalty» Access Time: time between request and word arrival» Cycle Time: minimum time between
More informationOutline Marquette University
COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations
More information