Some fast and compact neural network solutions for artificial intelligence applications


1 Some fast and compact neural network solutions for artificial intelligence applications. Radu Dogaru, University Politehnica of Bucharest, ETTI, Dept. of Applied Electronics and Information Engineering, Natural Computing Laboratory, Bucharest, Romania

2 Artificial Intelligence Today. Architectures: Deep Learning (multiple-layer perceptrons) achieves very good accuracies and is useful in many big-data problems (image recognition, speech, etc.). Shallow networks are faster learners but less accurate; they can be used as sub-modules in deep classifiers. Challenges: hardware-oriented applications (e.g. intelligent sensors) need compact, low-complexity, yet accurate solutions.

3 LRF-based Deep Classifier (a faster approach to deep learning). In the convolutional layer we may use simplicial cells instead (morphological processors: nonlinear, low complexity, well suited to dedicated hardware). Shallow classifier (e.g. ELM or SFSVC, with fast, data-driven training). Nonlinear preprocessing unit adapted to the specific problem (includes convolutional and pooling layers).

4 A unique architecture (kernel network) used for all our compact solutions. All learning is done in a linear adaptive layer (ADALINE), giving fast and convergent learning (e.g. LMS, linear SVM, or Moore-Penrose pseudo-inverse, as in the Extreme Learning Machine). Functional capability is achieved by a proper nonlinear expansion (kernel modules), which can be optimized for specialized HW/SW implementations. No tuning of the hidden layer (only 1-2 generic parameters).
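
To make "all learning in the linear layer" concrete, here is a minimal NumPy sketch (an illustration under my own assumptions, not the author's code) of LMS training of the Adaline readout; the matrix `O` is assumed to come from some fixed nonlinear expansion of the inputs:

```python
import numpy as np

def lms_train(O, d, mu=0.01, epochs=20):
    """LMS (Widrow-Hoff) training of the Adaline readout.
    O: (N, m) nonlinearly expanded inputs; d: (N,) desired outputs."""
    w = np.zeros(O.shape[1])
    for _ in range(epochs):
        for o_k, d_k in zip(O, d):
            e = d_k - o_k @ w          # instantaneous output error
            w += mu * e * o_k          # Widrow-Hoff gradient step
    return w
```

The Moore-Penrose alternative mentioned above is a one-liner on the same data: `w = np.linalg.pinv(O) @ d`.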

5 Kernel networks are Adalines operating in a nonlinearly-expanded input space. The input $\mathbf{x} \in \mathbb{R}^n$ is fed to a nonlinear expander, which must conserve the input information while expanding it into a higher-dimensional space $\mathbf{o} \in \mathbb{R}^m$, $m > n$; an Adaline with weights $w_1, \ldots, w_m$ then produces the output $y$, trained against the desired output $d$. A theorem of Cover (1965): if $m \gg n$, a problem that is not linearly separable in the input space $\mathbf{x}$ may become linearly separable (thus learnable by an Adaline) in the expanded space $\mathbf{o}$. It only requires a proper choice of the kernel functions $\varphi_1(\mathbf{x}), \ldots, \varphi_j(\mathbf{x}), \ldots, \varphi_m(\mathbf{x})$.
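
A tiny illustration of Cover's theorem (my own example, not from the slides): XOR is not linearly separable in its 2-D input space, but after a simple nonlinear expansion a pseudo-inverse-trained Adaline solves it exactly:

```python
import numpy as np

# XOR targets: not linearly separable in the 2-D input space x.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([-1.0, 1.0, 1.0, -1.0])

# Nonlinear expansion o = [1, x1, x2, x1*x2]: m = 4 > n = 2.
O = np.column_stack([np.ones(4), X, X[:, 0] * X[:, 1]])

# Adaline trained in the expanded space by the pseudo-inverse.
w = np.linalg.pinv(O) @ d
print(np.sign(O @ w))        # [-1.  1.  1. -1.] -- matches d
```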

6 27 years old

7 (figure-only slide)

8 The Cover theorem as a basis for building fast learners. Why the need for speed? Because the (deep) learner has many additional parameters (e.g. hidden nodes, pooling size, etc.), one must try various possibilities, training again each time. Compact and technology-adapted: use specific kernels, designed so that no backprop tuning is necessary (the main trick to make it fast). The idea already exists and is implemented in widely known architectures (though not emphasized as such): SVM (support vector machine) - the gamma parameter; ELM (extreme learning machine) - the number of hidden units; Simplicial - the expanding factor of the input space; (S)FSVC - the radius of the RBF units centered on a selection from the training set.

9 Simplicial neural nets: what are they? What are they good for? A bit of history first.

10 A bit of history: CNNs and the need for universal cells. The convolutional layer is linear. In 1998 Chua asked: can you design a universal CNN cell, capable of representing arbitrary nonlinear local functions?

11 With (linear) convolution

12 One of the first CNN chips with simplicial cells, with nonlinear operators replacing convolution.

13 Steps towards universal CNN cells. 1999: the universal CNN cell (multi-nested PWL cell): universal, very compact, adaptive, but only for Boolean local functions (black/white images). We needed to generalize to gray-scale / color processing. Pedro Julián had previous work on simplicial decomposition; in 2000 we started to discuss the possibility of implementing this theory as a hardware (analogic) cell. Results: [1][2].

14 Simplicial cell theory. The trainable function $f$ and its parameters $c_j$ (the weights of an Adaline): $f(\mathbf{x}) = \sum_{j} c_j \varphi_j(\mathbf{x})$, where the $\varphi_j$ are the kernels. A particular problem is learned (via simple LMS) in the set of $2^n$ coefficients $c_j$. With a nested nonlinear expansion at the input, the above formula is a universal approximator. The simplex is selected by the input vector: the input vector can be decomposed as a linear (weighted) combination of the vertices (only $n+1$ vectors) of the simplex.
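
As a rough software sketch of the evaluation (my reconstruction of standard simplicial interpolation on the unit hypercube, not the chip's exact algorithm): sorting the input coordinates selects the simplex, and the resulting $n+1$ barycentric weights multiply the corresponding coefficients $c_j$:

```python
import numpy as np

def simplicial_eval(x, c):
    """f(x) = sum_k lambda_k * c[vertex_k] for x in [0,1]^n.
    c holds 2^n coefficients, one per hypercube vertex; bit i of the
    index is coordinate i of the vertex. Only n+1 of the 2^n
    coefficients are touched per evaluation."""
    n = len(x)
    order = np.argsort(-np.asarray(x))    # coordinates, decreasing
    xs = np.asarray(x)[order]
    lam = np.empty(n + 1)                 # barycentric weights
    lam[0] = 1.0 - xs[0]
    lam[1:n] = xs[:-1] - xs[1:]
    lam[n] = xs[-1]
    idx, y = 0, lam[0] * c[0]             # start at the zero vertex
    for k in range(n):
        idx |= 1 << order[k]              # next vertex of the simplex
        y += lam[k + 1] * c[idx]
    return y
```

For $n = 1$ this reduces to ordinary linear interpolation, $f(x) = (1-x)\,c_0 + x\,c_1$.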

15 It is a kernel network whose kernels can be computed very conveniently in a mixed-signal circuit. The circuit implementation of the existing algorithm was my idea together with P. Julián; we then investigated the performance of the S-cell for use either as a new kind of neural network or as a universal gray-level CNN cell.

16 Program (image processing demo)

17 (figure-only slide)

18 Can it be implemented in a digital system? Actually this is the case with all reported chips (simplicity comes at the price of sequential processing). With 6 bits/pixel, 2^6 = 64 processing cycles are needed; often 6 bits is enough (at 1 ns/cycle this gives 64 ns per frame, enough for most applications). From [6].

19 Learning image processing (feature extractors) with simplicial cells (demo program).

20 Applications (median and order-statistic filtering): a regression problem. For comparison, the same function requires 85 combinational logic blocks in a digital FPGA implementation.

21 Edge-detection training samples (shown at left). The binary gene can be realized with a simple multi-nested cell instead of the RAM (shown at right).

22 Can it learn other problems? Yes. Learning a sinusoidal (sin) function (a 1-input, 1-output regression problem), shown for m=3 and m=5. Nonlinear preprocessing of the input space is done using the multi-nested recursion on $u_k$, $k = 2, \ldots, m$ (each $u_k$ computed from $u_{k-1}$).

23 Simplicial cells are COMPACT, well suited to multiple-core implementations (visual microprocessors). Two memory organizations: distributed synaptic memories, or all synapses in a single memory module.

24 2018: the latest CNN chips with simplicial cells (only 0.81 TOPS/W in 2014); binary information processing is used to reduce power consumption.

25 Simplicial cell essentials. An apparently sophisticated theory (simplicial decomposition) fits quite well into a compact circuit. A kernel network can be implemented to approximate any arbitrary local function (mostly useful in image processing). This solution was already adopted in optical sensors with integrated processing capabilities (a first step towards an intelligent, fully integrated optical sensor). Best suited to implementing image feature detectors in a fully parallel mode (local, i.e. similar to convolutional layers).

26 Simplicial cells do local image processing; we still need a classifier in the output stage, with large input vectors (e.g. MNIST: 28x28 pixels, n=784). Usually SVM (support vector machine) or ELM (extreme learning machine) is used, but they are not fast enough. For this and other reasons we propose: the Fast Support Vector Classifier (FSVC), introduced in 1996 as RBF-M (a HW-oriented model) [9], and SFSVC (Super FSVC), with no Adaline training at all (2016) [10].

27 (Same kernel-network diagram as on slide 5: input $\mathbf{x} \in \mathbb{R}^n$, nonlinear expander to $\mathbf{o} \in \mathbb{R}^m$ with $m > n$, Adaline output $y$ trained against $d$.) ELM (Extreme Learning Machine) is widely popularized as the fastest, but training the Adaline layer with the Moore-Penrose algorithm is quadratic in $m$, so training times become excessive for large numbers of neurons. SVM shows similar behavior of the training algorithm, and in addition no hardware-oriented kernels are possible (usually Gaussian).
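
For contrast, a minimal ELM sketch (a common textbook formulation, given here under my own naming): the hidden layer is random and fixed, so the entire training cost sits in the pseudo-inverse of the hidden-activation matrix, which grows quickly with the number of hidden units $m$:

```python
import numpy as np

def elm_train(X, D, m, seed=0):
    """ELM: random fixed hidden layer + pseudo-inverse linear readout.
    X: (N, n) inputs; D: (N, c) targets; m: number of hidden neurons."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], m))   # random input-to-hidden weights
    b = rng.normal(size=m)                 # random biases
    H = np.tanh(X @ A + b)                 # (N, m) hidden activations
    W = np.linalg.pinv(H) @ D              # the expensive step for large m
    return A, b, W
```

Because of the random draw, several seeds are usually tried to reach the best accuracy, which is exactly the hidden cost noted on slide 33.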

28 Fast support vector classifiers

29 Challenges: Big Data, i.e. collections of annotated samples (e.g. hyper-spectral imagery, handwritten text, images of different objects), are becoming widely available. There is a need for classifiers capable of learning rapidly (and generalizing well) from such large amounts of data. Somewhat conflicting issues: speed (training time); classification performance; low complexity, for convenient integration into high-performance computing platforms (GPU, FPGA, specialized hardware embedded into sensors).

30 Actual solutions. DEEP LEARNING (multiple-layer neural architectures): very good accuracies, but relatively slow training and costly computational platforms; yet it is a solution widely adopted now by academia and industry alike (several chips implementing deep-learning solutions are available on the market). SHALLOW architectures (only a single nonlinear hidden layer): offer fast training (when using techniques other than back-propagation), but on image classification problems accuracies are generally lower than in deep learning. Still, using adequate additional layers (e.g. based on local receptive fields), the global accuracy can be improved while maintaining the speed advantage. Our approach, SFSVC, is a shallow architecture providing very fast training and accuracy comparable to any of the typical shallow architectures. Why is it fast? No output-layer tuning (can that work? Yes), and fast selection of support vectors using a supervised, novelty-based selection algorithm (no parameter adjusting is involved).

31 Shallow (kernel-based) neural networks. The general defining formula for a kernel neuron is $y = \sum_{k=0}^{m} w_k o_k$ (an Adaline output layer). It can be trained using LMS or the pseudo-inverse method; in our approach the weights $w_k$ are directly assigned values of +1 or 0 (no tuning). Here $o_k = \varphi_k(\mathbf{x})$ is a properly chosen nonlinear (basis) function, also called a kernel; it may also be called a hidden neuron, and there are many choices.

32 Kernel networks are Adalines operating in a nonlinearly-expanded input space (the same diagram and Cover-theorem statement as on slide 5): the input $\mathbf{x} \in \mathbb{R}^n$ is expanded into $\mathbf{o} \in \mathbb{R}^m$, $m \gg n$, where a problem that is not linearly separable may become linearly separable (thus learnable by an Adaline), given a proper choice of the kernel functions $\varphi_1(\mathbf{x}), \ldots, \varphi_m(\mathbf{x})$.

33 Shallow architectures. SVM selects support vectors for kernels from the training samples using an optimization algorithm minimizing the risk; training is relatively slow (although the widely used LIBSVM implementation is relatively well optimized). ELM (extreme learning machine), considered by its authors the fastest neural paradigm, uses the trick of randomly generating the weights (parameters) of the hidden layer, which leaves most of the training time to the pseudo-inverse training of the output linear layer. Note that for big data the pseudo-inverse requires large training times, proportional to $M^2$, where $M$ is the number of hidden nodes (also large when databases are large). Another problem with ELM: because of the randomness, several trials (often ignored by authors when reporting training speed) are necessary to reach maximum performance with the same architecture. NoProp, proposed recently by Widrow, is essentially an ELM where output-layer training is done faster via LMS instead of the pseudo-inverse. In our approach the hidden layer has no random elements (only support vectors selected, not computed, from the training dataset); when tuning is used, it uses LMS.

34 Datasets and classifier. A training set TR is used for learning; in addition, a test set TS is considered to evaluate the generalization performance. The SFSVC classifier takes the input feature vector $\mathbf{x}$, the list TIX, the radius, and the type of basis function, and outputs the predicted class (1,..,M). TIX is a list of integers (1,..,N): the indexes $k$ of the selected vectors in TR; it is the result of the training phase. Each class is assigned an output Adaline (the one with the highest activation indicates the predicted class).

35 (S)FSVC architecture and equations (RBF-M [9]). Simple LMS training (only for SFSVC-T), or weights directly assigned the $d_k$ values (1 or 0) in SFSVC. The centroids $\mathbf{c}_j$ are the support vectors selected from TR (according to TIX). The RBF formulation allows a wider variety of kernels, which need not satisfy Mercer's condition. Center selection is unsupervised in FSVC and supervised in SFSVC (which accelerates the algorithm and improves performance); for each class, a search for centers (support vectors) is done.
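
A hedged sketch of the resulting decision stage (my reading of the slides; the Manhattan/triangular choices and all names are assumptions): with the output weights fixed at 1 or 0, each class's Adaline simply sums the activations of that class's own centroids:

```python
import numpy as np

def sfsvc_predict(X, C, cls, r):
    """X: (N, n) inputs; C: (K, n) centroids selected from the training
    set; cls: (K,) class label of each centroid; r: common RBF radius."""
    d = np.abs(X[:, None, :] - C[None, :, :]).sum(-1)  # Manhattan distances
    act = np.maximum(0.0, 1.0 - d / r)                 # triangular RBF
    M = int(cls.max()) + 1
    scores = np.stack([act[:, cls == m].sum(1) for m in range(M)], axis=1)
    return scores.argmax(1)                # highest class Adaline wins
```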

36 Various distances and RBF functions. Manhattan distance: $d_m(\mathbf{x}, \mathbf{c}_j) = \sum_{i=1}^{n} |x_i - c_{ji}|$ (best for hardware-oriented applications). Euclidean distance: $d_e(\mathbf{x}, \mathbf{c}_j) = \sqrt{\sum_{i=1}^{n} (x_i - c_{ji})^2}$. Triangular kernel: $\varphi(d, r) = 1 - d/r$ if $d \le r$, else $0$. Gaussian kernel: $\varphi_g(d, r) = \exp\!\left(-\frac{d^2}{2 r^2}\right)$. Such choices do not significantly influence classification performance.
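
Written out as a short sketch (function names are mine), each choice is a couple of NumPy operations; the Manhattan/triangular pair avoids squares and exponentials, which is what makes it hardware-friendly:

```python
import numpy as np

def manhattan(x, c):   return np.abs(x - c).sum(-1)          # adds only
def euclidean(x, c):   return np.sqrt(((x - c) ** 2).sum(-1))

def triangular(d, r):  return np.maximum(0.0, 1.0 - d / r)   # PWL kernel
def gaussian(d, r):    return np.exp(-d ** 2 / (2 * r ** 2))
```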

37 Without tuning in the Adaline layer, the only limiting factor for training speed is the novelty-based support-vector finding algorithm (almost linear in the number of units).

38 Time (support-vector search) for the MNIST problem (60000 samples) and its dependence on the number of RBF units. To get the best performance, only the radius needs to be varied (in SVM one has gamma and C; in ELM, the number of hidden neurons).

39 For large datasets (like MNIST), a huge number of neurons gives the best accuracy. With proper additional layers (an LRF deep structure, for example) it can be dramatically improved (over 99%).

40 But in SVM and ELM, the dependence of training time on the number of hidden units (on the same PC) is much worse: ELM training runs to 900 seconds, and too many ELM neurons give an error; SVM, for any gamma and C, lasts hundreds of seconds (difficult to tune the model!).

41 The support-vector selection algorithm. If a new input sample has little overlap with the existing coverage, it becomes centroid number m+1; if it has much overlap with the existing coverage, it is not added. The overlap threshold is a measure of the degree of overlap between two RBF units: a small degree (e.g. 0.1) implies that an input vector placed in the middle, between the two RBF centers, will generate an overall RBF output close to 0, making it very difficult to discriminate such vectors. If the overlap becomes large (e.g. 10), the two RBF units are redundant.

42 Training algorithm = determining TIX. The steps are: compute the hidden-layer activity, then decide whether to select a new support vector, as sketched below. The overlap threshold matters at training (in SFSVC it can be as small as 1/128; for the tunable version SFSVC-T it is usually taken as 1). FSVC uses unsupervised selection (i.e. the class assignment of the input pattern does not matter); the supervised approach gives faster speed and better performance.
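
A hedged sketch of this one-pass selection (my reading of the slides; in particular the `overlap` measure below is an assumption, not the paper's exact formula):

```python
import numpy as np

def overlap(x, c, r):
    # Assumed overlap measure: radius relative to half the distance
    # between two centers; large when the two RBF units are redundant.
    return 2.0 * r / max(np.abs(x - c).sum(), 1e-12)

def select_centroids(X, y, r, ov):
    """Supervised novelty-based selection: sample i becomes a new
    centroid of its class only if its overlap with every previously
    selected centroid of that class stays below the threshold ov."""
    tix = []                               # the resulting TIX list
    for i in range(len(X)):
        same = (overlap(X[i], X[j], r) for j in tix if y[j] == y[i])
        if all(o < ov for o in same):
            tix.append(i)                  # novel enough: keep it
    return tix
```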

43 Results. Timings: $t_1$ = generating TIX, $t_2$ = computing the hidden and output layers, $t_3$ = Adaline training. Tuning adds significant time to compute the hidden/output layers and adjust the Adaline weights; the pseudo-inverse learning of ELM adds further training time. (The benchmark is a reduced-size MNIST.) From [10].

44 Comparison / speed-ups. Generally the accuracies are quite similar.

45 Newer results, to be published soon (MNIST). In [13] an investigation was done on solving the MNIST problem with SVM. Just as in SFSVC, where $r$ and the overlap threshold should be optimized for the best performance, in SVM one needs to optimize the regularization parameter C and the gamma parameter (related to the radius $r$). Using Python and likely a CPU similar to ours, they report training 48 different SVM models to reach the best accuracy of 98.5% (around 4608 seconds per model). On the other hand, 22 SFSVC models were tried (22 different radius values) in a total time of only 1504 seconds (around 68 seconds per model), achieving 97.78% accuracy. This is an important speed-up of about 67 times on a similar computational platform (Python with SCIKIT-LEARN), indicating the validity and efficiency of the SFSVC model.

46 SFSVC work in progress: we still want to improve accuracy without sacrificing too much training speed; other issues are also on the agenda. Encouraging preliminary results: Adaline tuning, but on a reduced subset of the training data. SFSVC is faster than the usual shallow classifiers given a proper implementation platform (we found Python with NUMPY & SCIPY very convenient); when hardware-oriented RBFs are used it can also be conveniently implemented on other platforms: FPGA, GPU, etc. (Gaussian kernels replaced with PWL ones).

47 Thank You! More questions?
