A User's Guide to Stochastic Encoder/Decoders

Dr S P Luttrell

The overall goal of this research is to develop the theory and practice of self-organising networks that can discover objects and correlations in data, and the application of this to the fusion of data derived from multiple sensors. The purpose of this report is to give a practical introduction to self-organising stochastic encoder/decoders, in which each input vector is encoded as a stochastic sequence of code indices, and then decoded as a superposition of the corresponding sequence of code vectors. Mathematica software for implementing this type of encoder/decoder is presented, and numerical simulations are run to illustrate a variety of emergent properties.

I. EXECUTIVE SUMMARY

Research aim: The overall goal of this research is to develop the theory and practice of self-organising networks that can discover objects and correlations in data, and the application of this to the fusion of data derived from multiple sensors.

Results: To reach the above overall goal it is necessary to identify and study an appropriate self-organising network structure. To this end, this report is a self-contained tutorial which demonstrates the use of self-organising stochastic encoder/decoder networks. A complete software suite, written in Mathematica, is presented, and many worked examples of how to run the software are given.

Conclusions: The main conclusion is that stochastic encoder/decoder networks are simple to implement in Mathematica, and they automatically (i.e. by a process of self-organisation) discover a wide range of useful ways of encoding data. These properties can then be put to use to address the problem of fusing data derived from multiple sensors.

Customer benefits: The main benefit is that this self-organising approach to designing encoder networks, and ultimately data fusion networks, will lead to large savings when it is applied to real-world problems. This benefit arises principally from the hands-off nature of the self-organising approach, in which the task of identifying objects and correlations in data is delegated to a computer, rather than being done manually by one or more expert humans.

Recommendations: Extend the approach advocated in this report to the case of a multi-layer encoder/decoder network, to allow discovery by the network of more complicated objects and correlations in data. This will move the research towards more realistic data fusion scenarios.

Key words: Encoder, Decoder, Stochastic Vector Quantiser, SVQ, Data Fusion, Self-Organisation

This paper appeared as DERA Technical Report DERA/S&P/SPI/TR990290, 18 October 1999. © Crown Copyright 1999, Defence Evaluation and Research Agency UK.

II. INTRODUCTION

A. Background to This Report

The overall goal of this research is to develop the theory and practice of self-organising networks that can discover objects and correlations in data, and to apply this to the fusion of data derived from multiple sensors. This self-organising approach to designing encoder networks, and ultimately data fusion networks, will lead to large savings when it is applied to real-world problems. This benefit arises principally from the hands-off nature of the self-organising approach, in which the task of identifying objects and correlations in data is delegated to a computer, rather than being done manually by one or more expert humans. This report focuses on a particular type of self-organising network which encodes/decodes data with minimum distortion.
A useful side effect of optimising this type of network is that it must discover objects and correlations in data, as is required of a network that is to be applied to data fusion problems. To visualise a minimum distortion encoder/decoder, consider a communication system that consists of a transmitter, a limited bandwidth communication channel, and a receiver. In order to send a signal from the transmitter to the receiver, it is necessary to encode it so that it can be accommodated within the limited bandwidth of the communication channel, and then to decode it at the receiver. Such encoding/decoding leads to the received signal being a distorted version of the original, and by carefully optimising this encoding/decoding scheme the associated distortion can be minimised. The simplest type of encoder/decoder is the vector quantiser (VQ) [1], which encodes each input vector as one of a finite number of possible integers, which is then transmitted along the communication channel, and then decoded as one of a finite number of possible reconstruction vectors. The set of reconstruction vectors (the code book) contains all of the information that is needed to specify the encoder/decoder. Thus the encoder applies a nearest neighbour algorithm to the code book to determine which of the reconstruction vectors is closest (in the Euclidean sense) to the input vector, and then transmits the index of that vector (the code index) along the communication channel to the decoder, which then uses it to look up the corresponding reconstruction vector in its own identical copy of the code book.
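As an illustration of this nearest neighbour encoding rule, the following minimal sketch implements a deterministic VQ; it is an illustration only, and is not part of the report's software suite in Appendix A.

(* Deterministic VQ: encode x as the index of the nearest row of the code
   book, and decode an index by looking up that row. Illustration only; not
   part of the Appendix A software suite. codeBook is an M x d matrix of
   reconstruction vectors. *)
VQEncode[x_, codeBook_] := First[Ordering[Map[Norm[x - #] &, codeBook], 1]];
VQDecode[index_, codeBook_] := codeBook[[index]];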

The code book must be optimised so that the average Euclidean reconstruction error is minimised. This type of encoder/decoder may be generalised to allow for corruption of the code index whilst in transit over the communication channel. The optimisation must now make information in the transmitted code index robust with respect to channel distortion [2]. In the simplest case, where the code index is transmitted as an analogue signal (a vector of voltages, say) and the communication channel distortion is an additive noise process, the optimisation process is very similar to the training algorithm used to optimise a topographic mapping [3]. For this reason, this type of encoder/decoder is known as a topographic vector quantiser.

This report discusses a further generalisation of this type of encoder/decoder, in which the encoder uses a stochastic (rather than deterministic) algorithm to pick the code index for transmission along the communication channel - this is called a stochastic vector quantiser (SVQ) (see the discussion in Section A 2). An obvious disadvantage of using a stochastic encoding algorithm is that it must discard more information about its input than would an optimal deterministic encoding algorithm. However, the main advantage of stochastic encoding is that when multiple code indices are sampled they do not all have to be the same (unlike in the case of deterministic encoding), so different samples can record different types of information about the input. This effect could also be achieved by using multiple deterministic encodings of the input, but this would be a manually designed approach that is anathema to the self-organised approach that is required here.

B. Layout of This Report

In Appendix A all of the Mathematica [4] routines that are required for simulating encoder/decoders are developed, and in Appendix B some complicated expressions are derived using Mathematica. The body of the report in Section III demonstrates how these routines may be used to obtain optimal encoder/decoders (all of which are SVQs) for various simple types of training data.

III. ENCODER/DECODER EXAMPLES

A. Preliminaries

If further background information is required, then Appendix A should be read first of all.

B. Notation

d - input dimensionality
x - input vector (dimensionality d)
M - total number of code indices
n - sampled number of code indices
w - matrix of weight vectors (dimensionality M × d)
b - vector of biasses (dimensionality M)
r - matrix of reconstruction vectors (dimensionality M × d)
A - partitioning matrix (unused) - see Section A 6 for details
L - leakage matrix (dimensionality M × M) - see Section A 6 for details
ε - update step size parameter - see Section A 10 for details
λ - weight decay parameter - see Section A 11 for details

C. Methodology

For each simulation a number of parameters have to be initialised. For instance, in the case of circular training data (see Section III D), the initialisation takes the form

d = 2; M = 4; n = 10; ε = 0.05;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

The first row initialises all of the scalar parameters d, M, n and ε. The second row initialises the elements of the M × d matrix of weight vectors w to small random values uniformly distributed in the interval [-0.05, +0.05]. The third row initialises the elements of the M-dimensional vector of biasses b to small random values uniformly distributed in the interval [-0.05, +0.05].
The fourth row initialises the elements of the M × d matrix of reconstruction vectors r to small random values uniformly distributed in the interval [-0.05, +0.05]. The fifth row initialises the partitioning matrix A to a default state which removes its effect (it is not used in this report), and initialises the leakage matrix L to a default state which removes its effect (in Section III L a different type of L is used). The simulation may then be run with these parameter values by invoking the following update routine (see Section A 10) as many times as required (with x = {Cos[#], Sin[#]} &[2π Random[]] in the case of circular training data).

{D12, w, b, r} = UpdateSVQ[x, w, b, r, n, ε];

For convenience it is useful to record the training history using the following code fragment (whose initialisation is {whistory, bhistory, rhistory, D12history} = {{}, {}, {}, {}}).

{whistory, bhistory, rhistory, D12history} = MapThread[Append, {{whistory, bhistory, rhistory, D12history}, {w, b, r, D12}}];

The training history contains all the information that is required to create graphical displays that show what the SVQ has been doing.
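Assembled into a single loop, a complete training run for the circular data of Section III D might look as follows. This is a minimal sketch: it assumes the parameter initialisation above has been executed, and that UpdateSVQ is the update routine from Appendix A; the 400-vector run length matches Section III D.

(* A complete training run for circular data: 400 update steps, recording
   the training history at each step. Assumes the initialisation above and
   the UpdateSVQ routine from Appendix A. *)
{whistory, bhistory, rhistory, D12history} = {{}, {}, {}, {}};
Do[
  x = {Cos[#], Sin[#]} &[2π Random[]];
  {D12, w, b, r} = UpdateSVQ[x, w, b, r, n, ε];
  {whistory, bhistory, rhistory, D12history} =
    MapThread[Append, {{whistory, bhistory, rhistory, D12history}, {w, b, r, D12}}],
  {400}];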

D. Circle

This first simulation is designed to show how the SVQ behaves with data that live on a curved manifold, which is typical of high-dimensional sensor data, such as images. A circle is the simplest type of curved manifold, so it can be used to explore the basic SVQ behaviour. Each input vector then has the form x = (cos θ, sin θ). Thus a circular manifold has one intrinsic coordinate (the θ parameter), but it is embedded in a 2-dimensional space (the (cos θ, sin θ) vector). An analytic solution to this particular example was given in [5], where it was shown how the circle is sliced up into overlapping arcs by the SVQ. The following simulation has all of the same behaviour as the analytic solution, even though it is suboptimal because of the limited functional form of the sigmoid functions used in the encoder.

Initialise the parameter values.

d = 2; M = 4; n = 10; ε = 0.05;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

These parameter values state that the input space is 2-dimensional (d = 2), the code book has 4 entries (M = 4), 10 code indices are sampled for each input vector (n = 10), the update step size is 0.05 (ε = 0.05), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Train on 400 vectors derived from {Cos[#], Sin[#]} &[2π Random[]], which generates points on the unit circle. The training history of the loss function D1 + D2 is shown in Figure 1, where every 10th sample is shown. This shows the expected downward trend towards convergence.

Figure 1: The training history of the loss function for n = 10. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 2, where every 10th sample is shown and the final state is highlighted with shaded circles. This shows the expected outward drift from the initial random state near the origin, to eventually jitter about just outside the unit circle. For each input vector on the unit circle, the reconstruction vector is a linear combination of the rows of the reconstruction matrix, where the coefficients of the linear combination are nonnegative and sum to unity; this explains why the rows of the reconstruction matrix lie outside the unit circle.

Figure 2: The training history of the rows of the reconstruction matrix for n = 10. Every 10th sample is shown and the final state is highlighted.

The posterior probabilities that each code index is selected for input vectors lying in the square [-1, +1]^2 (i.e. not only the unit circle) are shown in Figure 3. The only points in [-1, +1]^2 for which these posterior probabilities are actually used are those on the unit circle. The contour plots of the same posterior probabilities are shown in Figure 4. As in Figure 3, the only points in [-1, +1]^2 for which these posterior probabilities are actually used are those on the unit circle.

Figure 3: The posterior probabilities that each code index is selected for n = 10.

Figure 4: Contour plots of the posterior probabilities that each code index is selected for n = 10.

If the same simulation is repeated, but using n = 250, then the results are shown in Figure 5, Figure 6, Figure 7, and Figure 8. The loss function is smaller in Figure 5 than in Figure 1.
This is because more code indices are used in Figure 5, which preserves more information about the input vector, thus allowing a more accurate reconstruction to be made. The rows of the reconstruction matrix r are larger in Figure 6 than in Figure 2. Also, the posterior probabilities that each code index is selected overlap more with each other in Figure 8 than in Figure 4. As before, this effect is explained by the need for the reconstruction to lie near the unit circle when formed from a (constrained) linear combination of the rows of the reconstruction matrix.
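The constrained linear combination referred to above can be made concrete with a short sketch of how an SVQ encodes and decodes. The normalised-sigmoid form of the posterior probabilities below is an assumption made for illustration (consistent with the sigmoid encoder described in the text); the definitive definitions are those in Appendix A, and w, b, r, M, n and x are assumed to be initialised as above.

(* Sketch of stochastic encoding/decoding: compute posterior probabilities
   from normalised sigmoids (an assumed form; see Appendix A for the
   report's own definition), sample n code indices, and decode as the
   average of the corresponding rows of r - a convex combination of code
   book rows, with nonnegative coefficients that sum to unity. *)
posterior[x_] := #/Apply[Plus, #] &[1/(1 + Exp[-(w . x + b)])];
indices = RandomChoice[posterior[x] -> Range[M], n];
reconstruction = Apply[Plus, r[[indices]]]/n;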

Figure 5: The training history of the loss function for n = 250. Every 10th sample is shown.

Figure 6: The training history of the rows of the reconstruction matrix for n = 250. Every 10th sample is shown and the final state is highlighted.

If the same simulation is repeated, but using n = 2, then the results are shown in Figure 9, Figure 10, Figure 11, and Figure 12. When comparing Figure 9 to Figure 12 with Figure 1 to Figure 4, all the trends are the opposite of those observed when comparing Figure 5 to Figure 8 with Figure 1 to Figure 4, as expected. Ideally, in the n = 1 case a pure vector quantiser would be obtained, in which the circle is partitioned into non-overlapping arcs (each covering π/2 radians in this case), and the rows of the reconstruction matrix would lie just inside the unit circle at the centroids of each of these arcs. It should be noted that, for an encoder based on sigmoid functions, the ideal vector quantiser result cannot be obtained when the input data lives on an arbitrarily chosen manifold, because sigmoid functions have a highly restricted functional dependence on the input vector.

E. 2-Torus

The circular manifold used in Section III D has only one intrinsic coordinate, so it cannot be used to investigate any new SVQ behaviour that emerges when the data live on a higher dimensional curved manifold. However, if a pair of circles is used, so that each data vector is given by x = (cos θ1, sin θ1, cos θ2, sin θ2), then the manifold has two intrinsic coordinates, which may be used to investigate SVQ behaviour for data that live on a 2-dimensional curved manifold. Thus the manifold has two intrinsic coordinates (the (θ1, θ2) vector), but it is embedded in a 4-dimensional space (the (cos θ1, sin θ1, cos θ2, sin θ2) vector). A manifold that is formed in this way from two circular manifolds is a 2-torus, which has the familiar doughnut shape when it is projected down into only three dimensions.

Figure 7: The posterior probabilities that each code index is selected for n = 250.

Figure 8: Contour plots of the posterior probabilities that each code index is selected for n = 250.

Figure 9: The training history of the rows of the reconstruction matrix for n = 2. Every 10th sample is shown.

Figure 10: The training history of the rows of the reconstruction matrix for n = 2. Every 10th sample is shown and the final state is highlighted.

Figure 11: The posterior probabilities that each code index is selected for n = 2.

Figure 12: Contour plots of the posterior probabilities that each code index is selected for n = 2.

In [5] the case of a 2-torus was solved analytically, to reveal that the behaviour of an SVQ depends on the size M of the code book and the number n of sampled code indices. The following simulation has all of the same behaviour as the analytic solution, even though it is suboptimal because of the limited functional form of the sigmoid functions used in the encoder.

Initialise the parameter values.

d = 4; M = 8; n = 50; ε = 0.05; λ = 0.005;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

These parameter values state that the input space is 4-dimensional (d = 4), the code book has 8 entries (M = 8), 50 code indices are sampled for each input vector (n = 50), the update step size is 0.05 (ε = 0.05), the weight decay parameter is 0.005 (λ = 0.005), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off. Weight decay is used to impose a prior bias towards solutions that have few non-zero entries in the weight matrix, because it is known that the optimal solution in this case is a factorial encoder [5].

Train on 1000 vectors derived from Flatten[Table[{Cos[#], Sin[#]} &[2π Random[]], {2}]], which generates points on a 2-torus formed from the Cartesian product of a pair of unit circles. The training history of the loss function is shown in Figure 13. This shows the expected downward trend towards convergence.
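Weight decay can be implemented as a small shrinkage of the weights at each update step. The multiplicative form below is an assumption made for illustration; the report's own weight decay rule is the one defined in Section A 11.

(* One standard weight decay step, applied after each UpdateSVQ call:
   shrink the weight matrix towards zero at a rate set by λ. The
   multiplicative form is an assumption; see Section A 11 for the report's
   own rule. *)
w = (1 - ε λ) w;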

Figure 13: The training history of the loss function for n = 50. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 14, where the left hand plot is one circular subspace of the 2-torus, and the right hand plot is the other circular subspace. In both of these subspaces 4 of the rows of the reconstruction matrix behave in a similar way to the case of training data derived from a unit circle (e.g. see Figure 6), whereas the other 4 rows of the reconstruction matrix remain much closer to the origin. Also, the 4 large rows of the reconstruction matrix in the left hand plot pair up with the 4 small rows of the reconstruction matrix in the right hand plot. Weight decay has a symmetry breaking side effect, in which the large components of the rows of the reconstruction matrix drift around until they become axis aligned, as is clearly seen in Figure 14.

Figure 14: The training history of the rows of the reconstruction matrix for n = 50. The two circular subspaces of the 2-torus are shown separately.

The posterior probabilities that each code index is selected for input vectors lying on the 2-torus are shown in Figure 15. In each plot the horizontal and vertical axes wrap circularly to form a 2-torus.

Figure 15: Discretised versions of the posterior probabilities that each code index is selected for input vectors lying on the 2-torus and for n = 50.

The results shown in Figure 15 show that the code book operates as a factorial encoder, in which half of the code indices encode one of the circular subspaces, and the other half encode the other subspace. Each pair of vertical and horizontal stripes then intersects to define a small patch of the 2-torus, and input vectors lying in that small patch will be encoded mostly as samples of the corresponding pair of code indices, with a little overspill into other code indices in general. A factorial encoder operates by a process that is akin to triangulation, by slicing up the input space into intersecting subspaces. An SVQ can only do this provided that the number of code indices that are sampled is large enough to virtually guarantee that each of the subspaces has at least one code index associated with it.

If the same simulation is repeated, but using n = 2 and ε = 0.1, then the results are shown in Figure 16, Figure 17, and Figure 18. Note that to obtain the results shown in Figure 18 weight decay has been left switched on (although it is actually unnecessary in this case) in order to make sure that the change from factorial to joint encoder is genuinely caused by reducing the value of n. Also note that the value of ε is increased (relative to that used in the factorial encoder simulation) to offset the tendency for the training algorithm to become trapped in a frustrated configuration when n is small. This transition between joint and factorial encoding of a 2-torus has been correctly predicted by an exact analytic optimisation of the loss function [5]. However, the fact that it was possible to do an analytic calculation at all depended critically on the high degree of symmetry of the 2-torus. Such exact analytic calculations are not possible in the general case.

Figure 16: The training history of the rows of the reconstruction matrix for n = 2. Every 10th sample is shown.

Figure 17: The training history of the rows of the reconstruction matrix for n = 2. The two circular subspaces of the 2-torus are shown separately. Note the increase in scale by a factor of 2 compared with Figure 14.

Figure 18: Discretised versions of the posterior probabilities that each code index is selected for input vectors lying on the 2-torus and for n = 2.

The results shown in Figure 18 show that the code book operates as a joint encoder, in which each code index jointly encodes both of the circular subspaces of the 2-torus (each code index encodes a small patch of the 2-torus, as is clearly seen in Figure 18). This is because there are too few code indices (n = 2) being sampled to allow a factorial encoder a good chance of encoding both subspaces, and thus of achieving a small loss function.

F. Imaging Sensor

The simulations in Section III D (circular manifold) and Section III E (toroidal manifold) showed how an SVQ behaves when the input data lie on an idealised curved manifold. In this Section these simulations will be extended to more realistic data which lie on curved manifolds with either a circular or a toroidal topology. The manifolds studied in this Section have the same topology as the earlier idealised manifolds, but they do not have the same geometry. However, if only their local properties are considered, then the idealised manifolds are very good approximations to the more realistic manifolds.

In these simulations a target will be imaged by one or more 1-dimensional sensors. The output of each sensor will be a vector of pixel values which depends on the target's position relative to the sensor. For a target that lives on a 1-dimensional manifold, the vector of images (derived from all of the sensors) lies on a manifold with one intrinsic coordinate. This generalises straightforwardly to the case of a target that lives on a multidimensional manifold, and yet further to the case of multiple targets. The general rule is that the number of continuous parameters needed to describe the state of the system under observation is equal to the dimensionality of the manifold on which the sensor data live, even though the actual dimensionality of the sensor data is usually much higher (e.g. imaging sensors).
In Section III G the case of 1 target living in 1 dimension imaged by 1 sensor is simulated (e.g. a range profile in which 1 target is embedded), which is approximated by the circular case simulated in Section III D. In Section III H the case of 1 target living in 2 dimensions independently imaged by 2 sensors is simulated (e.g. a range profile and an azimuth profile in which 1 target is embedded), which is approximated by the toroidal case simulated in Section III E. In Section III I the case of multiple independent (but identical) targets living in 1 dimension imaged by 1 sensor is simulated (e.g. a range profile in which multiple independent targets are embedded).

Note that the case of 1 target living in 2 dimensions independently imaged by 2 sensors (see Section III H) is not identical to the case of 2 independent (but identical) targets living in 1 dimension imaged by 1 sensor. Differences between these two cases arise only when the images of the 2 targets on the 1 sensor overlap each other.

When the target is centred on the sensor it will be modelled as (where d is the number of sensor pixels, and s is the standard deviation of the Gaussian profile function that is used to represent the target)

target = Table[Exp[-(i - d/2)^2/(2 s^2)], {i, d}];

G. Imaging Sensor (circular topology)

Initialise the parameter values.

d = 20; M = 4; n = 10; ε = 0.1; s = 2;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

These parameter values state that the sensor has 20 pixels (d = 20), the code book has 4 entries (M = 4), 10 code indices are sampled for each input vector (n = 10), the update step size is 0.1 (ε = 0.1), the target half-width is 2 (s = 2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Train on 400 vectors derived from RotateRight[target, Random[Integer, {0, d - 1}]], which centres the target on a randomly selected sensor pixel. Circular wraparound is used. The training history of the loss function is shown in Figure 19, which should be compared with the roughly comparable case of training data derived from a unit circle in Figure 1. In this case the loss function is much noisier, because although the input manifold here is topologically equivalent to a circle (embedded in a 20-dimensional space), it is not geometrically a circle (embedded in a 2-dimensional space), which makes the encoding/decoding problem harder than before.

Figure 19: The training history of the loss function. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 20. Each image displays the entire training history of a single row reading down the page.

Figure 20: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page.

The posterior probabilities that each code index is selected as a function of target position are shown in Figure 21. Each code index responds smoothly to a localised range of target locations.

Figure 21: The posterior probabilities that each code index is selected for all possible locations of the target.
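For concreteness, the complete Section III G training run can be assembled along the same lines as the loop sketched in Section III C; as before, this is a sketch that assumes the UpdateSVQ routine from Appendix A.

(* The Section III G training run: 400 update steps on randomly shifted
   Gaussian targets with circular wraparound. A sketch assuming UpdateSVQ
   from Appendix A and the "target" profile defined above. *)
Do[
  x = RotateRight[target, Random[Integer, {0, d - 1}]];
  {D12, w, b, r} = UpdateSVQ[x, w, b, r, n, ε],
  {400}];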
H. Independent Imaging Sensors (2-toroidal topology)

Initialise the parameter values.

d = 40; M = 8; n = 50; ε = 0.1; λ = 0.005; σ = 2;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

These parameter values state that each sensor has 20 pixels (d = 40 = 2 × 20), the code book has 8 entries (M = 8), 50 code indices are sampled for each input vector (n = 50), the update step size is 0.1 (ε = 0.1), the weight decay parameter is 0.005 (λ = 0.005), the target half-width is 2 (σ = 2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training vector using

Flatten[Table[RotateRight[target, Random[Integer, {0, d/2 - 1}]], {2}]]

1. target centres the target on a sensor.
2. RotateRight[..., Random[Integer, {0, d/2 - 1}]] centres the target at a randomised position on a sensor using circular wraparound.
3. Table[..., {2}] generates two independently randomised instances of such targets.
4. Flatten[...] concatenates these into a single input vector.

Train on 400 vectors derived as above. The training history of the loss function is shown in Figure 22.

Figure 22: The training history of the loss function for n = 50. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 23. Each image displays the entire training history of a single row reading down the page.

Figure 23: The training history of the rows of the reconstruction matrix for n = 50. Each image displays the entire training history of a single row reading down the page.

The posterior probabilities that each code index is selected as a function of target position are shown in Figure 24, which should be compared with Figure 15.

Figure 24: The posterior probabilities that each code index is selected for all possible locations of the target on each sensor for n = 50.

The results shown in Figure 24 show that the code book operates as a factorial encoder for the same reasons as were discussed in the context of Figure 15. If the same simulation is repeated, but using n = 2 and increasing the number of training vectors to 1000, then the results are shown in Figure 25, Figure 26, and Figure 27.

Figure 25: The training history of the loss function for n = 2. Every 10th sample is shown.

Figure 26: The training history of the rows of the reconstruction matrix for n = 2. Each image displays the entire training history of a single row reading down the page.

The results shown in Figure 27 are not perfect, because the training process has got itself stuck in a trapped configuration. However, they show that the code book mainly operates as a joint encoder for the same reasons that were discussed in the context of Figure 18.

Figure 27: The posterior probabilities that each code index is selected for all possible locations of the target on each sensor for n = 2.

I. Imaging Sensor with Multiple Independent Targets

Initialise the parameter values.

d = 100; M = 15; n = 10; ε = 0.1; λ = 0.005; σ = 2;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

These parameter values state that the sensor has 100 pixels (d = 100), the code book has 15 entries (M = 15), 10 code indices are sampled for each input vector (n = 10), the update step size is 0.1 (ε = 0.1), the weight decay parameter is 0.005 (λ = 0.005), the target half-width is 2 (σ = 2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training vector using

Apply[Plus, Table[RotateRight[target, Random[Integer, {0, d - 1}]], {10}]]

1. target centres the target on a sensor.
2. RotateRight[..., Random[Integer, {0, d - 1}]] centres the target at a randomised position on a sensor using circular wraparound.
3. Table[..., {10}] generates 10 independently randomised instances of such targets.
4. Apply[Plus, ...] sums these to give a single input vector.

A typical example of such an input vector is shown in Figure 28.

Figure 28: Superposition of 10 Gaussian targets.

Train on 1000 vectors derived as above. The training history of the loss function is shown in Figure 29.

Figure 29: The training history of the loss function. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 30. Each image displays the entire training history of a single row reading down the page. After some initial confusion each code index begins to respond to a very localised region of the input space.

Figure 30: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page.

The posterior probabilities that each code index is selected as a function of the position of a single test target are shown in Figure 31. Each code index encodes a small patch of the input space, and there is a little overlap between adjacent patches.

Figure 31: The posterior probabilities that each code index is selected for all possible locations of a single test target.

The results shown in Figure 31 show that the code book operates very clearly as a factorial encoder, because despite the training data consisting of a large number of superimposed targets (see Figure 28), the code indices essentially code for single targets. In effect, the minimisation of the loss function has discovered the fundamental constituents out of which the training data have been built.
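Response curves like those in Figure 31 can be computed by sweeping a single test target across the sensor and evaluating the posterior probabilities at each position. The sketch below assumes a posterior function of the kind sketched in Section III D; the report's own routines for this are in Appendix A.

(* Sweep a single test target across all d sensor positions and tabulate
   the M posterior probabilities at each position; row k of the transpose
   is then the response curve of code index k. Assumes the "posterior"
   sketch from Section III D and the "target" profile from Section III F. *)
responses = Table[posterior[RotateRight[target, shift]], {shift, 0, d - 1}];
responseCurves = Transpose[responses];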

This behaviour is reminiscent of independent component analysis (ICA) [6], where an input that is an unknown mixture of a number of independent unknown components is analysed to discover both the mixing matrix and the components. In the above simulation the input is such an unknown mixture: the mixing matrix (whose entries are 0's and 1's in this case) determines which components are present, and each component is a target located at a particular position. Optimising the SVQ (on a training set of data) discovers the form of the components (as displayed in Figure 30), and subsequently using the SVQ (on a test set of data) to compute posterior probabilities effectively derives an estimate of the mixing matrix for each input vector (e.g. for a single test target this estimate can be deduced from Figure 31).

J. Noisy Bars

In this simulation the problem is to encode an image which consists of a number of horizontal or vertical (but not both at the same time) noisy bars, in which each image pixel independently has a large amount of multiplicative noise and a small amount of additive noise (this type of training set was proposed in [7]). This is a more complicated version of the multiple target simulation in Section III I, where each horizontal bar is one type of target and each vertical bar is another type of target, and only one of these two types of targets is allowed to be present in each image.

Initialise the parameter values.

K = 6; d = K^2; M = 2K; n = 20; ε = 0.1; ρ = 0.3; σ = 0.2;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

These parameter values state that the image is 6 by 6 pixels (K = 6), the total number of pixels is 36 (d = K^2 = 36), the code book has 12 entries (M = 2K = 12), 20 code indices are sampled for each input vector (n = 20), the update step size is 0.1 (ε = 0.1), the probability of a bar being present is 0.3 (ρ = 0.3), the background noise level is 0.2 (σ = 0.2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training image using

If[Random[] < 0.5, #, Transpose[#]] &[
  Map[# + σ Random[] &, #, {2}] &[
    Table[# Table[Random[], {K}], {K}] &[
      Table[If[Random[] < ρ, 1, 0], {K}]]]];

1. Table[If[Random[] < ρ, 1, 0], {K}] is a bit vector that decides at random whether each row of the image has a bar present with probability ρ.
2. Table[# Table[Random[], {K}], {K}] &[...] generates a whole image such that each column of the image is the product of the bit vector and an independent uniformly distributed random number in the interval [0, 1].
3. Map[# + σ Random[] &, #, {2}] &[...] adds an independent uniformly distributed random number in the range [0, σ] to each pixel value.
4. If[Random[] < 0.5, #, Transpose[#]] &[...] transposes the whole image with probability 1/2, which ensures that the generated image is equally likely to consist of horizontal bars or vertical bars.

Some typical examples of such input images are shown in Figure 32.

Figure 32: Some typical examples of noisy bar images.
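The generator above can be packaged as a function for repeated use. Flattening the K × K image into a d-dimensional vector before it is passed to the update routine is an assumption made here; the report's own driver code is the one in Appendix A.

(* The noisy-bar generator packaged as a function, plus flattening of the
   K x K image into a d-vector for training. The flattening step is an
   assumption; see Appendix A for the report's own driver code. *)
MakeBarImage[] := If[Random[] < 0.5, #, Transpose[#]] &[
   Map[# + σ Random[] &, #, {2}] &[
     Table[# Table[Random[], {K}], {K}] &[
       Table[If[Random[] < ρ, 1, 0], {K}]]]];
x = Flatten[MakeBarImage[]];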
Train on 4000 images derived as above. The training history of the loss function is shown in Figure 33. This has a very noisy behaviour, because the training data are very noisy, yet the code book is relatively small.

Figure 33: The training history of the loss function. Every 100th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 34.

Figure 34: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page. Every 100th sample is shown.

In order to make them easier to interpret, the rows of the reconstruction matrix may be displayed in image format as shown in Figure 35, where it is clear that each code index encodes exactly one horizontal or vertical bar.

Figure 35: The rows of the reconstruction matrix displayed in image format.

The results shown in Figure 35 show that the code book operates very clearly as a factorial encoder, because despite the training data consisting of images of a variable number of horizontal (or vertical) bars, the code indices essentially code for single horizontal (or vertical) bars. In effect, the minimisation of the loss function has discovered the fundamental constituents out of which the training data have been built.

If the training image generator is altered so that each pixel within a bar has the same amount of multiplicative noise (i.e. the multiplicative noise is correlated), then training tends to get trapped in frustrated configurations. This correlated multiplicative noise training image generator is basically the same as the one used in [7], and can be implemented by making the following replacements to the above uncorrelated multiplicative noise training image generator:

1. Table[If[Random[] < ρ, 1, 0], {K}] becomes Table[If[Random[] < ρ, Random[], 0], {K}]
2. Table[# Table[Random[], {K}], {K}] &[...] becomes Table[#, {K}] &[...]
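Applying these two replacements to the packaged generator gives the following correlated-noise variant; this assembly is offered for illustration, rather than being code quoted from [7] or Appendix A.

(* The correlated-multiplicative-noise variant obtained by applying the
   two replacements above: each bar now carries a single shared
   multiplicative noise value. Assembled here for illustration. *)
MakeCorrelatedBarImage[] := If[Random[] < 0.5, #, Transpose[#]] &[
   Map[# + σ Random[] &, #, {2}] &[
     Table[#, {K}] &[
       Table[If[Random[] < ρ, Random[], 0], {K}]]]];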

K. Stereo Disparity

In each simulation that has been presented thus far the data live on a manifold whose intrinsic coordinates are statistically independent of each other. The purpose of this simulation is to demonstrate what happens when the intrinsic coordinates are correlated.

In this simulation the problem is to encode a pair of 1-dimensional images of a target which derive from the two sensors of a stereoscopic imaging system. The location of the target on a sensor is specified by a single intrinsic coordinate, and the pair of such coordinates (one for each of the two sensors) are correlated with each other, because the target appears in similar positions on each sensor in the stereoscopic imaging system. Also, each image pixel independently has a large amount of multiplicative noise and a small amount of additive noise.

Initialise the parameter values.

K = 18; d = 2K; M = 24; n = 3; ε = 0.1; s = 2.0; a = 4.0; σ = 0.2;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};
L = IdentityMatrix[M];

These parameter values state that each 1-dimensional image has 18 pixels (K = 18), the total number of pixels is 36 (d = 2K = 36), the code book has 24 entries (M = 24), 3 code indices are sampled for each input vector (n = 3), the update step size is 0.1 (ε = 0.1), the target half-width is 2 (s = 2.0), the half-range of stereo disparities is 4.0 (a = 4.0), the background noise level is 0.2 (σ = 0.2), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix and the leakage matrix are initialised to a default state in which their effect is switched off.

Generate each training vector using

target = Map[Table[Exp[-(i - Floor[K/2] - #)^2/(2 s^2)], {i, K}] &, {0, Random[Real, {-a, a}]}];
Map[# + σ Random[] &, #, {2}] &[Map[# Random[] &, RotateRight[target, {0, Random[Integer, {0, K - 1}]}], {2}]]

1. target is a stereo image of a target obtained by centring a target on one of the sensors and generating a randomly shifted copy (the shift is uniformly distributed in [-a, a]) on the other sensor.
2. RotateRight[..., {0, Random[Integer, {0, K - 1}]}] centres the stereo image of the target at a randomised position on the sensors using circular wraparound.
3. Map[# Random[] &, ..., {2}] multiplies each pixel value by a random number uniformly distributed in [0, 1].
4. Map[# + σ Random[] &, ..., {2}] adds an independent uniformly distributed random number in the range [0, σ] to each pixel value.

Some typical examples of such input images are shown in Figure 36.

Figure 36: Some typical examples of stereo target images.

Train on 2000 stereo images derived as above. The training history of the loss function is shown in Figure 37.

Figure 37: The training history of the loss function for n = 2. Every 100th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 38.

Figure 38: The training history of the rows of the reconstruction matrix for n = 2. Each image displays the entire training history of a single row reading down the page. Every 100th sample is shown.

In order to make them easier to interpret, the rows of the reconstruction matrix may be displayed in image format as shown in Figure 39, where it is seen that each code index typically encodes a stereo image of a target at a given position and with a given stereo disparity.

Figure 39: The rows of the reconstruction matrix displayed in stereo image format for n = 2.

The posterior probabilities that each code index is selected as a function of the mean position of the two images (horizontal axis) and stereo disparity (vertical axis) of a test target are shown in Figure 40.

Figure 40: The posterior probabilities that each code index is selected as a function of the mean position of the two images (horizontal axis) and stereo disparity (vertical axis) of a test target for n = 2.

The results shown in Figure 40 show that the code book operates very clearly as a joint encoder, because each code index jointly encodes position and stereo disparity.

The disparity direction is resolved into approximately 3 disparities (positive, zero, and negative disparity), whereas the position direction is resolved into approximately 8 positions, giving a total of 24 (= 3 × 8) different possible codes. The measurement of stereo disparity (and position) to this resolution requires only one code index to be observed.

If the same simulation is repeated, but using n = 50, then the results are shown in Figure 41, Figure 42, Figure 43, and Figure 44.

Figure 41: The training history of the loss function for n = 50. Every 100th sample is shown.

Figure 42: The training history of the rows of the reconstruction matrix for n = 50. Each image displays the entire training history of a single row reading down the page. Every 100th sample is shown.

Figure 43: The rows of the reconstruction matrix displayed in stereo image format for n = 50.

Figure 44: The posterior probabilities that each code index is selected as a function of the position (horizontal axis) and stereo disparity (vertical axis) of a test target for n = 50.

The results shown in Figure 44 show that the code book operates very clearly as a factorial encoder, because each code index encodes a linear combination of position and stereo disparity. However, there are 2 subsets of code indices, one of which has a negative slope and the other of which has a positive slope (as seen in Figure 44). The intersections between these two subsets may be used to triangulate small patches on the input manifold by the same process that was described in the context of Figure 15. The measurement of stereo disparity requires a minimum of two code indices to be observed, which must belong to oppositely sloping subsets in Figure 44. In practice, many more than two code indices must be observed to virtually guarantee that there is at least one in each of the two subsets.

L. Topographic Map

The purpose of this simulation is to show how some degree of control can be exercised over the properties of each code index. Intuitively, if the code book wants to use a particular code index, but is frustrated in this attempt by the presence of random cross-talk between code indices, and is forced to randomly use one member of a set of code indices instead, then the amount of information that is carried by the code index that is actually (and randomly) selected is thereby reduced. However, if the code book can configure itself so that random cross-talk occurs only between code indices that code for similar inputs, then the information loss can be reduced to a minimum. Conversely, if a particular type of configuration is required in the code book, then it can be encouraged by deliberately introducing the appropriate type of random cross-talk.

In this report random cross-talk is introduced by the M × M leakage matrix L. This is a transition matrix, in which the elements of a given row are the probabilities that the corresponding code index gets randomly converted into each of the M possible code indices.

Here the problem is to encode a 2-dimensional image of a randomly placed target, and to encourage the code book to develop a 2-dimensional topology, such that the code indices can be viewed as living in a 2-dimensional space corresponding to the 2-dimensional manifold on which the target image lives.

This can be encouraged by arranging the code indices on a 2-dimensional square grid, and then introducing random cross-talk between code indices that are neighbours on the grid. As was shown in [2], this is closely related to the prescription for generating a topographic map [3]. Note that although the 2-dimensional input manifold is continuous, the 2-dimensional grid on which the code indices live is discrete. If only one code index is sampled (i.e. n = 1), then the optimum SVQ is a vector quantiser, which discontinuously maps the continuous input manifold onto a discrete code index. However, if more than one code index is sampled (i.e. n > 1) then this discontinuity is blurred, and when a sufficiently large number of code indices are sampled the discontinuity disappears altogether, and the continuous input manifold is effectively mapped onto a continuous output manifold. This is clearly seen in the limiting case n → ∞, where the output is effectively the frequency distribution of the number of times each code index is sampled, which is a continuous function of the input.

Initialise the parameter values.

K = 6; d = K^2; M0 = 12; M = M0^2; n = 2; ε = 0.1; s = 1.0;
w = Table[0.1(Random[] - 0.5), {M}, {d}];
b = Table[0.1(Random[] - 0.5), {M}];
r = Table[0.1(Random[] - 0.5), {M}, {d}];
A = {Table[1, {M}]};

These parameter values state that the image is 6 by 6 pixels (K = 6), the total number of pixels is 36 (d = K^2 = 36), the code book is 12 by 12 entries (M0 = 12), the total number of code book entries is 144 (M = M0^2 = 144), 2 code indices are sampled for each input vector (n = 2), the update step size is 0.1 (ε = 0.1), the target half-width is 1 (s = 1.0), the elements of the weight matrix, bias vector and reconstruction matrix are initialised to uniformly distributed random numbers in the interval [-0.05, +0.05], and the partitioning matrix is initialised to a default state in which its effect is switched off.

The leakage matrix is initialised thus:

L0 = Map[Flatten, Flatten[Table[Exp[-((i1 - i2)^2 + (j1 - j2)^2)/(2σ^2)], {i1, M0}, {j1, M0}, {i2, M0}, {j2, M0}], 1]];
L = Transpose[Map[#/Apply[Plus, #] &, Transpose[L0]]];

1. Table[Exp[-((i1 - i2)^2 + (j1 - j2)^2)/(2σ^2)], {i1, M0}, {j1, M0}, {i2, M0}, {j2, M0}] is a 4-dimensional list of unnormalised leakage matrix elements defining a 2-dimensional Gaussian neighbourhood with half-width σ. This acts as a transformation from a 2-dimensional image to a 2-dimensional image.
2. Flatten[..., 1] combines the i1 and j1 indices into a single index, leaving a 3-dimensional list overall.
3. Map[Flatten, ...] combines the i2 and j2 indices into a single index, leaving a 2-dimensional list overall. This acts as a transformation from a 1-dimensional "flattened" version of the image to a 1-dimensional "flattened" version of the image.
4. Transpose[Map[#/Apply[Plus, #] &, Transpose[L0]]] normalises the leakage matrix elements to ensure that probability is conserved.

Generate each training vector using

target = Table[Exp[-((i - #[[1]])^2 + (j - #[[2]])^2)/(2 s^2)], {i, K}, {j, K}] &[Table[Random[Real, {1, K}], {2}]];

1. Table[Random[Real, {1, K}], {2}] generates a list of 2 random numbers in [1, K], which is the location of the target.
2. Table[Exp[-((i - #[[1]])^2 + (j - #[[2]])^2)/(2 s^2)], {i, K}, {j, K}] &[...] generates a K by K image of pixel values of a Gaussian target.
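The effect of the leakage matrix L constructed above can be illustrated directly before training: applying L to a vector of posterior probabilities smears each probability over the corresponding grid neighbourhood. This is a sketch of the mechanism only; the place where L actually enters the update routine is defined in Section A 6.

(* How leakage acts on a vector p of the M posterior probabilities: row k
   of L gives the probabilities that code index k is randomly converted
   into each of the M indices, so the leaked distribution is p.L. A sketch
   of the mechanism; the definitive use of L is in Section A 6. *)
leaked = p . L;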
Some typical examples of such input images are shown in Figure 45.

Figure 45: Some typical examples of Gaussian target images.

Train on 100 images derived as above, using a leakage matrix defining a 2-dimensional Gaussian neighbourhood with half-width σ = 5.0. Then train on a further 100 images using half-width σ = 2.5. The training history of the loss function is shown in Figure 46.

Figure 46: The training history of the loss function. Every 10th sample is shown.

The training history of the rows of the reconstruction matrix is shown in Figure 47.

Figure 47: The training history of the rows of the reconstruction matrix. Each image displays the entire training history of a single row reading down the page. Every 10th sample is shown.

In order to make them easier to interpret, the rows of the reconstruction matrix may be displayed in image format as shown in Figure 48, where it is seen that the code indices are topographically organised, and each code index typically encodes an image of a target at a given position.

Figure 48: The rows of the reconstruction matrix displayed in image format.

The training history of the rows of the reconstruction matrix may be displayed in a much more vivid way. Compute the centroid of each of the rows of the reconstruction matrix (when arranged in image format as in Figure 48), and then draw vectors between the centroids of rows of the reconstruction matrix corresponding to neighbouring code indices. The result of this is shown in Figure 49.

Figure 49: Each step in the training history of the rows of the reconstruction matrix represented as a sequence of topographic maps, which should be read left to right, first row then second row. Every 20th sample is shown.

The evolution of the topographic map shown in Figure 49 starts from a small crumpled map, and then gradually unfolds to yield the final result. In the top row the leakage matrix elements define a 2-dimensional Gaussian neighbourhood with half-width 5.0, and then the half-width is reduced to 2.5 (i.e. the leakage is reduced) in the bottom row. Thus the first half of the simulation is run with a large amount of leakage in order to encourage the topographic map to develop smooth long-range order. If this is not done, then typically the topographic map gets trapped in a frustrated configuration in which it is folded or twisted over on itself.

The contraction which is observed at the edge of the topographic map can be alleviated by making the half-width of the Gaussian leakage neighbourhood (perpendicular to the edge of the map) in the vicinity of the edge of the map smaller than in the centre of the map. This refinement to the training algorithm is not explored here.

IV. CONCLUSIONS

It has been shown in this report how a stochastic encoder/decoder (specifically, a stochastic vector quantiser (SVQ)) may be used to discover useful ways of encoding data. The body of the report consists of a number of worked examples, each of which is carefully designed to illustrate a particular point, and the appendices give the complete Mathematica source code for implementing the SVQ approach.

The idealised simulations in Section III D (input data live on a circle) and Section III E (input data live on a 2-torus) can be used to understand the results obtained in the simulations in Section III F (input data are target(s) viewed by imaging sensor(s)). This underlines the usefulness of the results that were obtained in [5], where the encoding of circular and toroidal input manifolds was solved analytically using the algebraic manipulator Mathematica [4]. The simulations in Section III K illustrate how these results may be extended to the case of correlated sensors, by examining the case of stereo disparity.

The results presented in this report illustrate a variety of possible behaviours that a 2-layer encoder/decoder network can exhibit. Each behaviour can be interpreted as the discovery by the network of objects (e.g. targets) and correlations (e.g. stereo disparity) in data derived from one or more sensors.
This forms the basis of an approach to the fusion of data from multiple sensors.


More information

Using surface markings to enhance accuracy and stability of object perception in graphic displays

Using surface markings to enhance accuracy and stability of object perception in graphic displays Using surface markings to enhance accuracy and stability of object perception in graphic displays Roger A. Browse a,b, James C. Rodger a, and Robert A. Adderley a a Department of Computing and Information

More information

HOUGH TRANSFORM CS 6350 C V

HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM CS 6350 C V HOUGH TRANSFORM The problem: Given a set of points in 2-D, find if a sub-set of these points, fall on a LINE. Hough Transform One powerful global method for detecting edges

More information

Issues with Curve Detection Grouping (e.g., the Canny hysteresis thresholding procedure) Model tting They can be performed sequentially or simultaneou

Issues with Curve Detection Grouping (e.g., the Canny hysteresis thresholding procedure) Model tting They can be performed sequentially or simultaneou an edge image, nd line or curve segments present Given the image. in Line and Curves Detection 1 Issues with Curve Detection Grouping (e.g., the Canny hysteresis thresholding procedure) Model tting They

More information

Basics of Computational Geometry

Basics of Computational Geometry Basics of Computational Geometry Nadeem Mohsin October 12, 2013 1 Contents This handout covers the basic concepts of computational geometry. Rather than exhaustively covering all the algorithms, it deals

More information

Miquel dynamics for circle patterns

Miquel dynamics for circle patterns Miquel dynamics for circle patterns Sanjay Ramassamy ENS Lyon Partly joint work with Alexey Glutsyuk (ENS Lyon & HSE) Seminar of the Center for Advanced Studies Skoltech March 12 2018 Circle patterns form

More information

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907

Rowena Cole and Luigi Barone. Department of Computer Science, The University of Western Australia, Western Australia, 6907 The Game of Clustering Rowena Cole and Luigi Barone Department of Computer Science, The University of Western Australia, Western Australia, 697 frowena, luigig@cs.uwa.edu.au Abstract Clustering is a technique

More information

(Refer Slide Time: 00:03:51)

(Refer Slide Time: 00:03:51) Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture 17 Scan Converting Lines, Circles and Ellipses Hello and welcome everybody

More information

1 (5 max) 2 (10 max) 3 (20 max) 4 (30 max) 5 (10 max) 6 (15 extra max) total (75 max + 15 extra)

1 (5 max) 2 (10 max) 3 (20 max) 4 (30 max) 5 (10 max) 6 (15 extra max) total (75 max + 15 extra) Mierm Exam CS223b Stanford CS223b Computer Vision, Winter 2004 Feb. 18, 2004 Full Name: Email: This exam has 7 pages. Make sure your exam is not missing any sheets, and write your name on every page. The

More information

Image Sampling and Quantisation

Image Sampling and Quantisation Image Sampling and Quantisation Introduction to Signal and Image Processing Prof. Dr. Philippe Cattin MIAC, University of Basel 1 of 46 22.02.2016 09:17 Contents Contents 1 Motivation 2 Sampling Introduction

More information

Lab 2: Support Vector Machines

Lab 2: Support Vector Machines Articial neural networks, advanced course, 2D1433 Lab 2: Support Vector Machines March 13, 2007 1 Background Support vector machines, when used for classication, nd a hyperplane w, x + b = 0 that separates

More information

Image Sampling & Quantisation

Image Sampling & Quantisation Image Sampling & Quantisation Biomedical Image Analysis Prof. Dr. Philippe Cattin MIAC, University of Basel Contents 1 Motivation 2 Sampling Introduction and Motivation Sampling Example Quantisation Example

More information

6 Mathematics Curriculum

6 Mathematics Curriculum New York State Common Core 6 Mathematics Curriculum GRADE GRADE 6 MODULE 5 Table of Contents 1 Area, Surface Area, and Volume Problems... 3 Topic A: Area of Triangles, Quadrilaterals, and Polygons (6.G.A.1)...

More information

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford

Clustering. Mihaela van der Schaar. January 27, Department of Engineering Science University of Oxford Department of Engineering Science University of Oxford January 27, 2017 Many datasets consist of multiple heterogeneous subsets. Cluster analysis: Given an unlabelled data, want algorithms that automatically

More information

3 Nonlinear Regression

3 Nonlinear Regression CSC 4 / CSC D / CSC C 3 Sometimes linear models are not sufficient to capture the real-world phenomena, and thus nonlinear models are necessary. In regression, all such models will have the same basic

More information

Tilings of the Euclidean plane

Tilings of the Euclidean plane Tilings of the Euclidean plane Yan Der, Robin, Cécile January 9, 2017 Abstract This document gives a quick overview of a eld of mathematics which lies in the intersection of geometry and algebra : tilings.

More information

SPECIAL TECHNIQUES-II

SPECIAL TECHNIQUES-II SPECIAL TECHNIQUES-II Lecture 19: Electromagnetic Theory Professor D. K. Ghosh, Physics Department, I.I.T., Bombay Method of Images for a spherical conductor Example :A dipole near aconducting sphere The

More information

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS

COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS COMBINED METHOD TO VISUALISE AND REDUCE DIMENSIONALITY OF THE FINANCIAL DATA SETS Toomas Kirt Supervisor: Leo Võhandu Tallinn Technical University Toomas.Kirt@mail.ee Abstract: Key words: For the visualisation

More information

SETTLEMENT OF A CIRCULAR FOOTING ON SAND

SETTLEMENT OF A CIRCULAR FOOTING ON SAND 1 SETTLEMENT OF A CIRCULAR FOOTING ON SAND In this chapter a first application is considered, namely the settlement of a circular foundation footing on sand. This is the first step in becoming familiar

More information

Downloaded from

Downloaded from UNIT 2 WHAT IS STATISTICS? Researchers deal with a large amount of data and have to draw dependable conclusions on the basis of data collected for the purpose. Statistics help the researchers in making

More information

Adaptive Robotics - Final Report Extending Q-Learning to Infinite Spaces

Adaptive Robotics - Final Report Extending Q-Learning to Infinite Spaces Adaptive Robotics - Final Report Extending Q-Learning to Infinite Spaces Eric Christiansen Michael Gorbach May 13, 2008 Abstract One of the drawbacks of standard reinforcement learning techniques is that

More information

Stereo Vision. MAN-522 Computer Vision

Stereo Vision. MAN-522 Computer Vision Stereo Vision MAN-522 Computer Vision What is the goal of stereo vision? The recovery of the 3D structure of a scene using two or more images of the 3D scene, each acquired from a different viewpoint in

More information

9.1. K-means Clustering

9.1. K-means Clustering 424 9. MIXTURE MODELS AND EM Section 9.2 Section 9.3 Section 9.4 view of mixture distributions in which the discrete latent variables can be interpreted as defining assignments of data points to specific

More information

Error-Correcting Codes

Error-Correcting Codes Error-Correcting Codes Michael Mo 10770518 6 February 2016 Abstract An introduction to error-correcting codes will be given by discussing a class of error-correcting codes, called linear block codes. The

More information

Design Intent of Geometric Models

Design Intent of Geometric Models School of Computer Science Cardiff University Design Intent of Geometric Models Frank C. Langbein GR/M78267 GR/S69085/01 NUF-NAL 00638/G Auckland University 15th September 2004; Version 1.1 Design Intent

More information

Optimised corrections for finite-difference modelling in two dimensions

Optimised corrections for finite-difference modelling in two dimensions Optimized corrections for 2D FD modelling Optimised corrections for finite-difference modelling in two dimensions Peter M. Manning and Gary F. Margrave ABSTRACT Finite-difference two-dimensional correction

More information

The performance of xed block size fractal coding schemes for this model were investigated by calculating the distortion for each member of an ensemble

The performance of xed block size fractal coding schemes for this model were investigated by calculating the distortion for each member of an ensemble Fractal Coding Performance for First Order Gauss-Markov Models B E Wohlberg and G de Jager Digital Image Processing Laboratory, Electrical Engineering Department, University of Cape Town, Private Bag,

More information

Simultaneous surface texture classification and illumination tilt angle prediction

Simultaneous surface texture classification and illumination tilt angle prediction Simultaneous surface texture classification and illumination tilt angle prediction X. Lladó, A. Oliver, M. Petrou, J. Freixenet, and J. Martí Computer Vision and Robotics Group - IIiA. University of Girona

More information

Multimedia Computing: Algorithms, Systems, and Applications: Edge Detection

Multimedia Computing: Algorithms, Systems, and Applications: Edge Detection Multimedia Computing: Algorithms, Systems, and Applications: Edge Detection By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854, USA Part of the slides

More information

[7.3, EA], [9.1, CMB]

[7.3, EA], [9.1, CMB] K-means Clustering Ke Chen Reading: [7.3, EA], [9.1, CMB] Outline Introduction K-means Algorithm Example How K-means partitions? K-means Demo Relevant Issues Application: Cell Neulei Detection Summary

More information

BMVC 1996 doi: /c.10.41

BMVC 1996 doi: /c.10.41 On the use of the 1D Boolean model for the description of binary textures M Petrou, M Arrigo and J A Vons Dept. of Electronic and Electrical Engineering, University of Surrey, Guildford GU2 5XH, United

More information

THE preceding chapters were all devoted to the analysis of images and signals which

THE preceding chapters were all devoted to the analysis of images and signals which Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to

More information

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 36

Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras. Lecture - 36 Finite Element Analysis Prof. Dr. B. N. Rao Department of Civil Engineering Indian Institute of Technology, Madras Lecture - 36 In last class, we have derived element equations for two d elasticity problems

More information

Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure

Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure Challenge Problem 5 - The Solution Dynamic Characteristics of a Truss Structure In the final year of his engineering degree course a student was introduced to finite element analysis and conducted an assessment

More information

Graphics and Interaction Transformation geometry and homogeneous coordinates

Graphics and Interaction Transformation geometry and homogeneous coordinates 433-324 Graphics and Interaction Transformation geometry and homogeneous coordinates Department of Computer Science and Software Engineering The Lecture outline Introduction Vectors and matrices Translation

More information

Chapter - 2: Geometry and Line Generations

Chapter - 2: Geometry and Line Generations Chapter - 2: Geometry and Line Generations In Computer graphics, various application ranges in different areas like entertainment to scientific image processing. In defining this all application mathematics

More information

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR

Laser sensors. Transmitter. Receiver. Basilio Bona ROBOTICA 03CFIOR Mobile & Service Robotics Sensors for Robotics 3 Laser sensors Rays are transmitted and received coaxially The target is illuminated by collimated rays The receiver measures the time of flight (back and

More information

COMP30019 Graphics and Interaction Transformation geometry and homogeneous coordinates

COMP30019 Graphics and Interaction Transformation geometry and homogeneous coordinates COMP30019 Graphics and Interaction Transformation geometry and homogeneous coordinates Department of Computer Science and Software Engineering The Lecture outline Introduction Vectors and matrices Translation

More information

Non-linear dimension reduction

Non-linear dimension reduction Sta306b May 23, 2011 Dimension Reduction: 1 Non-linear dimension reduction ISOMAP: Tenenbaum, de Silva & Langford (2000) Local linear embedding: Roweis & Saul (2000) Local MDS: Chen (2006) all three methods

More information

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO-IEC JTC1/SC29/WG11

INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO-IEC JTC1/SC29/WG11 INTERNATIONAL ORGANIZATION FOR STANDARDIZATION ORGANISATION INTERNATIONALE DE NORMALISATION ISO-IEC JTC1/SC29/WG11 CODING OF MOVING PICTRES AND ASSOCIATED ADIO ISO-IEC/JTC1/SC29/WG11 MPEG 95/ July 1995

More information

Massachusetts Institute of Technology. Department of Computer Science and Electrical Engineering /6.866 Machine Vision Quiz I

Massachusetts Institute of Technology. Department of Computer Science and Electrical Engineering /6.866 Machine Vision Quiz I Massachusetts Institute of Technology Department of Computer Science and Electrical Engineering 6.801/6.866 Machine Vision Quiz I Handed out: 2004 Oct. 21st Due on: 2003 Oct. 28th Problem 1: Uniform reflecting

More information

Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland

Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland Calibrating a Structured Light System Dr Alan M. McIvor Robert J. Valkenburg Machine Vision Team, Industrial Research Limited P.O. Box 2225, Auckland New Zealand Tel: +64 9 3034116, Fax: +64 9 302 8106

More information

LAB 2: DATA FILTERING AND NOISE REDUCTION

LAB 2: DATA FILTERING AND NOISE REDUCTION NAME: LAB TIME: LAB 2: DATA FILTERING AND NOISE REDUCTION In this exercise, you will use Microsoft Excel to generate several synthetic data sets based on a simplified model of daily high temperatures in

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14

Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Computer Graphics Prof. Sukhendu Das Dept. of Computer Science and Engineering Indian Institute of Technology, Madras Lecture - 14 Scan Converting Lines, Circles and Ellipses Hello everybody, welcome again

More information

Chapter 18. Geometric Operations

Chapter 18. Geometric Operations Chapter 18 Geometric Operations To this point, the image processing operations have computed the gray value (digital count) of the output image pixel based on the gray values of one or more input pixels;

More information

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract

Clustering Sequences with Hidden. Markov Models. Padhraic Smyth CA Abstract Clustering Sequences with Hidden Markov Models Padhraic Smyth Information and Computer Science University of California, Irvine CA 92697-3425 smyth@ics.uci.edu Abstract This paper discusses a probabilistic

More information

Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps

Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps Visualization and Analysis of Inverse Kinematics Algorithms Using Performance Metric Maps Oliver Cardwell, Ramakrishnan Mukundan Department of Computer Science and Software Engineering University of Canterbury

More information

Design Intent of Geometric Models

Design Intent of Geometric Models School of Computer Science Cardiff University Design Intent of Geometric Models Frank C. Langbein GR/M78267 GR/S69085/01 NUF-NAL 00638/G Massey University 22nd September 2004; Version 1.0 Design Intent

More information

Basic Algorithms for Digital Image Analysis: a course

Basic Algorithms for Digital Image Analysis: a course Institute of Informatics Eötvös Loránd University Budapest, Hungary Basic Algorithms for Digital Image Analysis: a course Dmitrij Csetverikov with help of Attila Lerch, Judit Verestóy, Zoltán Megyesi,

More information

Lab 2: Support vector machines

Lab 2: Support vector machines Artificial neural networks, advanced course, 2D1433 Lab 2: Support vector machines Martin Rehn For the course given in 2006 All files referenced below may be found in the following directory: /info/annfk06/labs/lab2

More information

Rational Numbers and the Coordinate Plane

Rational Numbers and the Coordinate Plane Rational Numbers and the Coordinate Plane LAUNCH (8 MIN) Before How can you use the numbers placed on the grid to figure out the scale that is used? Can you tell what the signs of the x- and y-coordinates

More information

Geometric transformations assign a point to a point, so it is a point valued function of points. Geometric transformation may destroy the equation

Geometric transformations assign a point to a point, so it is a point valued function of points. Geometric transformation may destroy the equation Geometric transformations assign a point to a point, so it is a point valued function of points. Geometric transformation may destroy the equation and the type of an object. Even simple scaling turns a

More information

3 Nonlinear Regression

3 Nonlinear Regression 3 Linear models are often insufficient to capture the real-world phenomena. That is, the relation between the inputs and the outputs we want to be able to predict are not linear. As a consequence, nonlinear

More information

Edge and local feature detection - 2. Importance of edge detection in computer vision

Edge and local feature detection - 2. Importance of edge detection in computer vision Edge and local feature detection Gradient based edge detection Edge detection by function fitting Second derivative edge detectors Edge linking and the construction of the chain graph Edge and local feature

More information

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing

Prof. Fanny Ficuciello Robotics for Bioengineering Visual Servoing Visual servoing vision allows a robotic system to obtain geometrical and qualitative information on the surrounding environment high level control motion planning (look-and-move visual grasping) low level

More information

Multiple Model Estimation : The EM Algorithm & Applications

Multiple Model Estimation : The EM Algorithm & Applications Multiple Model Estimation : The EM Algorithm & Applications Princeton University COS 429 Lecture Dec. 4, 2008 Harpreet S. Sawhney hsawhney@sarnoff.com Plan IBR / Rendering applications of motion / pose

More information

Networks as Manifolds

Networks as Manifolds Networks as Manifolds Isabella Thiesen Freie Universitaet Berlin Abstract The aim of this project is to identify the manifolds corresponding to networks that are generated by simple substitution rules

More information

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation

Matrices. Chapter Matrix A Mathematical Definition Matrix Dimensions and Notation Chapter 7 Introduction to Matrices This chapter introduces the theory and application of matrices. It is divided into two main sections. Section 7.1 discusses some of the basic properties and operations

More information

Fall 2016 Semester METR 3113 Atmospheric Dynamics I: Introduction to Atmospheric Kinematics and Dynamics

Fall 2016 Semester METR 3113 Atmospheric Dynamics I: Introduction to Atmospheric Kinematics and Dynamics Fall 2016 Semester METR 3113 Atmospheric Dynamics I: Introduction to Atmospheric Kinematics and Dynamics Lecture 5 August 31 2016 Topics: Polar coordinate system Conversion of polar coordinates to 2-D

More information

Introduction to Homogeneous coordinates

Introduction to Homogeneous coordinates Last class we considered smooth translations and rotations of the camera coordinate system and the resulting motions of points in the image projection plane. These two transformations were expressed mathematically

More information

A Statistical Consistency Check for the Space Carving Algorithm.

A Statistical Consistency Check for the Space Carving Algorithm. A Statistical Consistency Check for the Space Carving Algorithm. A. Broadhurst and R. Cipolla Dept. of Engineering, Univ. of Cambridge, Cambridge, CB2 1PZ aeb29 cipolla @eng.cam.ac.uk Abstract This paper

More information

Chapter 3. Sukhwinder Singh

Chapter 3. Sukhwinder Singh Chapter 3 Sukhwinder Singh PIXEL ADDRESSING AND OBJECT GEOMETRY Object descriptions are given in a world reference frame, chosen to suit a particular application, and input world coordinates are ultimately

More information

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth

Depth. Common Classification Tasks. Example: AlexNet. Another Example: Inception. Another Example: Inception. Depth Common Classification Tasks Recognition of individual objects/faces Analyze object-specific features (e.g., key points) Train with images from different viewing angles Recognition of object classes Analyze

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

round decimals to the nearest decimal place and order negative numbers in context

round decimals to the nearest decimal place and order negative numbers in context 6 Numbers and the number system understand and use proportionality use the equivalence of fractions, decimals and percentages to compare proportions use understanding of place value to multiply and divide

More information

MEI Desmos Tasks for AS Pure

MEI Desmos Tasks for AS Pure Task 1: Coordinate Geometry Intersection of a line and a curve 1. Add a quadratic curve, e.g. y = x² 4x + 1 2. Add a line, e.g. y = x 3 3. Select the points of intersection of the line and the curve. What

More information

Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder]

Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder] Differential Geometry: Circle Patterns (Part 1) [Discrete Conformal Mappinngs via Circle Patterns. Kharevych, Springborn and Schröder] Preliminaries Recall: Given a smooth function f:r R, the function

More information

To graph the point (r, θ), simply go out r units along the initial ray, then rotate through the angle θ. The point (1, 5π 6

To graph the point (r, θ), simply go out r units along the initial ray, then rotate through the angle θ. The point (1, 5π 6 Polar Coordinates Any point in the plane can be described by the Cartesian coordinates (x, y), where x and y are measured along the corresponding axes. However, this is not the only way to represent points

More information

Unsupervised Learning

Unsupervised Learning Unsupervised Learning Learning without a teacher No targets for the outputs Networks which discover patterns, correlations, etc. in the input data This is a self organisation Self organising networks An

More information

UNIT 2 GRAPHIC PRIMITIVES

UNIT 2 GRAPHIC PRIMITIVES UNIT 2 GRAPHIC PRIMITIVES Structure Page Nos. 2.1 Introduction 46 2.2 Objectives 46 2.3 Points and Lines 46 2.4 Line Generation Algorithms 48 2.4.1 DDA Algorithm 49 2.4.2 Bresenhams Line Generation Algorithm

More information

ASSOCIATIVE MEMORY MODELS WITH STRUCTURED CONNECTIVITY

ASSOCIATIVE MEMORY MODELS WITH STRUCTURED CONNECTIVITY ASSOCIATIVE MEMORY MODELS WITH STRUCTURED CONNECTIVITY Simon Turvey, Steve Hunt, Ray Frank, Neil Davey Department of Computer Science, University of Hertfordshire, Hatfield, AL1 9AB. UK {S.P.Turvey, S.P.Hunt,

More information

Rectification and Distortion Correction

Rectification and Distortion Correction Rectification and Distortion Correction Hagen Spies March 12, 2003 Computer Vision Laboratory Department of Electrical Engineering Linköping University, Sweden Contents Distortion Correction Rectification

More information

Polar Coordinates. OpenStax. 1 Dening Polar Coordinates

Polar Coordinates. OpenStax. 1 Dening Polar Coordinates OpenStax-CNX module: m53852 1 Polar Coordinates OpenStax This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License 4.0 Abstract Locate points

More information

Capturing, Modeling, Rendering 3D Structures

Capturing, Modeling, Rendering 3D Structures Computer Vision Approach Capturing, Modeling, Rendering 3D Structures Calculate pixel correspondences and extract geometry Not robust Difficult to acquire illumination effects, e.g. specular highlights

More information

Image Processing. Image Features

Image Processing. Image Features Image Processing Image Features Preliminaries 2 What are Image Features? Anything. What they are used for? Some statements about image fragments (patches) recognition Search for similar patches matching

More information

3 Polygonal Modeling. Getting Started with Maya 103

3 Polygonal Modeling. Getting Started with Maya 103 3 Polygonal Modeling In Maya, modeling refers to the process of creating virtual 3D surfaces for the characters and objects in the Maya scene. Surfaces play an important role in the overall Maya workflow

More information

Visual Recognition: Image Formation

Visual Recognition: Image Formation Visual Recognition: Image Formation Raquel Urtasun TTI Chicago Jan 5, 2012 Raquel Urtasun (TTI-C) Visual Recognition Jan 5, 2012 1 / 61 Today s lecture... Fundamentals of image formation You should know

More information

Preliminaries: Size Measures and Shape Coordinates

Preliminaries: Size Measures and Shape Coordinates 2 Preliminaries: Size Measures and Shape Coordinates 2.1 Configuration Space Definition 2.1 The configuration is the set of landmarks on a particular object. The configuration matrix X is the k m matrix

More information

Assignment 3: Edge Detection

Assignment 3: Edge Detection Assignment 3: Edge Detection - EE Affiliate I. INTRODUCTION This assignment looks at different techniques of detecting edges in an image. Edge detection is a fundamental tool in computer vision to analyse

More information

A Flavor of Topology. Shireen Elhabian and Aly A. Farag University of Louisville January 2010

A Flavor of Topology. Shireen Elhabian and Aly A. Farag University of Louisville January 2010 A Flavor of Topology Shireen Elhabian and Aly A. Farag University of Louisville January 2010 In 1670 s I believe that we need another analysis properly geometric or linear, which treats place directly

More information

Vector Addition. Qty Item Part Number 1 Force Table ME-9447B 1 Mass and Hanger Set ME Carpenter s level 1 String

Vector Addition. Qty Item Part Number 1 Force Table ME-9447B 1 Mass and Hanger Set ME Carpenter s level 1 String rev 05/2018 Vector Addition Equipment List Qty Item Part Number 1 Force Table ME-9447B 1 Mass and Hanger Set ME-8979 1 Carpenter s level 1 String Purpose The purpose of this lab is for the student to gain

More information