Reconfigurable co-processor for Kanerva's sparse distributed memory


Microprocessors and Microsystems 28 (2004) 127-134, www.elsevier.com/locate/micpro, doi:10.1016/j.micpro.2004.01.003

Marcus Tadeu Pinheiro Silva (a), Antônio Pádua Braga (b,*), Wilian Soares Lacerda (c)

(a) Federal Center of Technological Education, Belo Horizonte, MG, Brazil
(b) Department of Electronics Engineering, Federal University of Minas Gerais, Caixa Postal 209, CEP 30.161-970, Belo Horizonte, MG, Brazil
(c) Department of Computing, Federal University of Lavras, Lavras, MG, Brazil

* Corresponding author. Tel.: +55-31-3499-4869; fax: +55-31-3499-4850. E-mail address: apbraga@cpdee.ufmg.br (A.P. Braga).

Received 5 August 2003; revised 28 January 2004; accepted 29 January 2004

Abstract

The implementation in hardware of the first layer of Kanerva's sparse distributed memory (SDM) is presented in this work. The hardware consists of a co-processor board for connection to the ISA standard bus of an IBM PC-compatible computer. The board, named reconfigurable co-processor for SDM (RC-SDM), comprises Xilinx FPGAs, local random access memory and bus interface circuits. Based on the in-system reconfiguration capability of FPGAs, RC-SDM easily allows the characteristics of the implemented SDM topology to be changed. First results show a speed-up of four times for RC-SDM in relation to a software implementation of the algorithm. © 2004 Elsevier B.V. All rights reserved.

Keywords: Associative memory; Neural networks; FPGAs

1. Introduction

Physical implementation of associative memories has been on the agenda for more than four decades and has been studied from different perspectives. From the computer architecture point of view, the term content addressable memories (CAM) [1] was coined to refer to those memory implementations that can retrieve information from partial clues of the content. The physical implementation of CAM was usually accomplished with conventional random access memories (RAM) and additional hardware. From the connectionist point of view [2], associative memories were implemented with neuron-like elements, or processing elements (PEs), connected by a network structure. Information was spread throughout the network and stored in the connections between processors.

The sparse distributed memory (SDM) [3] is an associative memory model that can be seen from both perspectives. From the computer architecture point of view, it can be seen as a generalization of a RAM, and from the connectionist perspective as an artificial neural network (ANN). Associative memory properties are achieved by spreading sparse decoders over the high dimensional input (address) space. The associative memory properties appear as the dimension of the input space increases, which is in fact a difficulty for its physical implementation. Computationally intensive sparse address decoding is one of the main bottlenecks of software and hardware implementations of SDMs, since it requires Hamming distance calculations between the high dimensional input address and all sparse decoders. (The Hamming distance is defined as the number of bits in which two binary words differ.)

This paper describes a hardware implementation of SDMs that is aimed at implementing its most computationally intensive calculations in a co-processor board. The circuit, implemented on an IBM PC platform with reconfigurable hardware, has 32 PEs that calculate the Hamming distances between the input address and 32 sparse decoders in parallel.
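
As a concrete illustration of the definition above, and of why this calculation becomes a bottleneck for long addresses, the following minimal C sketch (added here for illustration; not part of the original paper) computes the Hamming distance between two n-bit vectors packed into 32-bit words. The word width and the software bit-counting loop are assumptions; processors of that period had no population-count instruction, so counting the 1s has to be done in several steps.

```c
#include <stdint.h>

/* Hamming distance between two n-bit vectors packed into 32-bit words.
   nwords = n / 32 (n is assumed to be a multiple of 32, e.g. n = 256). */
unsigned hamming_distance(const uint32_t *a, const uint32_t *b, unsigned nwords)
{
    unsigned h = 0;
    for (unsigned w = 0; w < nwords; w++) {
        uint32_t diff = a[w] ^ b[w];   /* bitwise exclusive-or: 1 where the vectors differ */
        while (diff) {                 /* software population count of the differing bits  */
            h += diff & 1u;
            diff >>= 1;
        }
    }
    return h;
}
```
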
The co-processor board, named here the reconfigurable co-processor for SDMs, or simply RC-SDM, is in the category of reconfigurable machines with an extended instruction set [6], since a reconfigurable platform is established by connecting an FPGA-based co-processor to a host machine. Adaptive computation is carried out by an application program running on the host. This program configures the FPGAs and then transfers the data to be processed to the co-processor board.

Data processing in the co-processor board is accomplished in parallel with program execution by the host processor. The improved performance of the reconfigurable co-processor system is significant when compared to a program running in the host with optimized assembly code. The flexibility of the board allows for adaptive reconfiguration of the current implementation, and further improvements may be achieved by using an extension of the proposed architecture.

2. Description of SDM

The essence of SDMs is to use sparsely distributed decoders in a high dimensional Boolean space, so that any sparse decoder, or hard storage location, is accessed from anywhere in the space that is at a Hamming distance smaller than r bits from its base address. Therefore, each decoder responds to all the vectors inside a hyper-sphere, or circle in SDM's terminology, with radius r and centre at the location's base address. Depending on the selected value of r, input vectors may access more than one storage location at the same time, allowing data to be stored and retrieved concurrently to and from several memory storage locations.

Using Kanerva's analogy between a hyper-sphere in the n-dimensional Boolean space and a circle in the plane [5], an example of the effect of the chosen radius on the concurrent access by an input vector ξ to two sparse decoders z_m and z_n is shown in Fig. 1. The chosen radius r_a is large enough to allow access by ξ to both decoders z_m and z_n. If r_a were too small, ξ could have been located outside the two hyper-spheres and, consequently, could not have accessed either of them. Therefore, the condition for ξ to access both of the arbitrary sparse decoders z_m and z_n is that the distance from ξ to each of them is less than the chosen radius r; in other words, ξ must be inside the intersection of the hyper-spheres with centres at z_m and z_n [5]. The size of the radius r in SDMs must be such that the union of all the sets of elements inside the hyper-spheres S_i(z_i, r_i) includes all the 2^n elements of the space {0, 1}^n, so that any input vector accesses at least one sparse decoder.

Fig. 1. ξ is inside the intersection of the two circles and, consequently, accesses both decoders.

The distributed storage and retrieval of data in SDMs is responsible for its associative memory properties. Therefore, the intersection between two hyper-spheres S_i(z_i, r_i) and S_j(z_j, r_j) separated by the Hamming distance h is a basic issue in the analysis of SDMs, since it indicates the number of vectors in the space that simultaneously access both storage locations z_i and z_j. Of course, in the limit cases when r = 0 and r = n, the number of patterns that access a given hard location is, respectively, 1 and 2^n. For r in the range between 0 and n, it is possible to estimate the percentage of the Boolean space that lies in the intersection of two hyper-spheres [7] and thus predict associative memory performance.

The terms hard storage location and sparse decoder used in SDM terminology resemble those used to describe conventional computer RAMs. In fact, SDMs can be seen as a general case of a RAM, since, in the limit case when r is set to zero and there are 2^n sparse decoders, a SDM actually becomes a conventional RAM.

The main differences between a RAM and a SDM are:

- the address decoder of a RAM is selected only if its base address is presented at the inputs (r = 0), whereas a sparse decoder of a SDM is selected from anywhere within r bits Hamming distance of the base address (r > 0);
- a conventional RAM has 2^n decoder outputs and storage locations, where n is the size of the input address vector, whereas a SDM has only a portion M of the maximum 2^n sparse decoders and hard storage locations;
- a hard storage location of a SDM, instead of storing only one data vector, like the storage locations of a RAM, holds a statistical measure of all the data vectors that accessed that location during the storage process.

SDM is, therefore, an associative memory designed to store information coded in long strings of bits, corresponding to binary vectors with hundreds or thousands of coordinates. The basic principles behind Kanerva's SDM rest on the properties of the high dimensional Boolean space [3]. SDMs can be analyzed from two different perspectives: as a generalization of a conventional computer RAM with associative memory properties, and also as a two-layer feed-forward neural network [4].

There is no general expression for determining r as a function of the Boolean space dimension n. As in Kanerva's original work [3], n should be large and r should be approximately n/2. In most examples in his original work n = 1000 and r is a bit less than n/2; the value used in most examples was r = 451. The value of r should not be larger than n/2, since that would include most of the input space (more than 50%) and would cause too much overlap among sparse decoders. If r is too small, only a small percentage of the input space is covered by the hyper-spheres. This is the reason why r is normally chosen a bit smaller than n/2.
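The effect of the chosen radius can be quantified directly: a hard location with radius r covers sum_{i=0}^{r} C(n, i) of the 2^n possible addresses. The C sketch below (an illustration added to this text, not from the paper) evaluates that fraction using log-factorials to avoid overflow; for Kanerva's values n = 1000 and r = 451 it gives a fraction on the order of 10^-3, while r = n/2 already covers slightly more than half the space.

```c
#include <math.h>
#include <stdio.h>

/* Fraction of the n-dimensional Boolean space within Hamming radius r of a
   base address: sum_{i=0}^{r} C(n, i) / 2^n, computed via log-factorials. */
double coverage_fraction(int n, int r)
{
    double frac = 0.0;
    for (int i = 0; i <= r; i++) {
        double log_term = lgamma(n + 1.0) - lgamma(i + 1.0) - lgamma(n - i + 1.0)
                          - n * log(2.0);
        frac += exp(log_term);
    }
    return frac;
}

int main(void)
{
    /* Kanerva's example: n = 1000, r = 451 gives roughly 0.001,
       while r = n/2 = 500 already covers more than 50% of the space. */
    printf("r = 451: %.6f\n", coverage_fraction(1000, 451));
    printf("r = 500: %.6f\n", coverage_fraction(1000, 500));
    return 0;
}
```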

Fig. 2. Schematic view of the SDM.

Fig. 2 presents a schematic view of a SDM, from which both the neural network and the RAM views of a SDM can be observed. Anyone familiar with digital systems would easily associate input vector I with a RAM input address, and input data vector W with memory input data. The output data, for reading information from memory, would be associated with the output vector O. The address decoder and memory array of a conventional RAM would be directly associated with the sparse decoders S and the cells of memory H (hard storage locations). Similarly, the neural network view would characterize the structure of Fig. 2 as a two-layer neural network, with information flowing from input vector I to output vector O and with the threshold elements as the outputs of each layer.

2.1. Storage of information

The first array of a SDM, matrix S, corresponds to a decoder-like structure. Each output of the decoder, after threshold, is capable of activating only one location of memory (the corresponding row of array H), like a conventional computer RAM decoder. Nevertheless, an input address may activate more than one decoder output simultaneously, which causes data to be stored to or read from several memory locations at the same time.

The dimension n of I must be high, so that the properties of the high dimensional Boolean space become valid. Working with very high dimensional spaces is one of the fundamental principles behind SDMs, but it also represents a difficulty for their implementation. For n = 1000, for example, the number of possible input addresses is 2^1000, which makes it inconceivable to think of a memory location for each one of the 2^1000 input addresses. Instead, M decoders and memory locations are used in SDMs, where M is a very small fraction of 2^1000 (M ≪ 2^1000). The M sparse decoders with radius r spread over the input space are expected to include all the possible 2^1000 input addresses. A value for r is selected, based on the properties of the high dimensional Boolean space, so that each one of the possible input addresses accesses at least one memory location.

Each row of matrix H corresponds to a word of the associative memory, similarly to a conventional RAM, but, instead of a register for each storage location, SDM has a binary counter for each position. Each counter is incremented or decremented depending on whether a 1 or a 0 is stored; the storage of a 1 causes an increment and that of a 0 a decrement. Input data vector W accesses all the hard locations of matrix H in parallel, allowing data to be stored in more than one location. Since several memory words can be selected simultaneously by the same input address, several counters on different words are activated by a single input address. This storage process causes the information to be distributed over several memory locations. A single counter is normally activated several times when many words are stored.

2.2. Retrieval of information

Similar to the storage process, information retrieval starts by presenting an input vector I that is expected to select more than one hard location of matrix H. An adder circuit sums up the outputs of all the counters accessed during reading in order to generate the output vector O. The result of the sum can be zero, positive or negative. A negative value occurs when more 0s (counter decremented) than 1s (counter incremented) were written during storage, and a positive value occurs when there were more 1s than 0s. A zero output indicates either that the numbers of 0s and 1s stored were the same or that the selected locations were not accessed before. A threshold function is applied to the output of each adder, so that the output value is 1 if the sum is positive, or zero if the sum is negative. Since storage occurs in a distributed form, the read output value of the SDM is a statistical measure of the stored data, obtained by a majority rule, in order to effectively generate the output vector.

When information is written to a SDM, storage occurs in a distributed form over the locations activated by the corresponding address vector. Since the information is distributed, these locations can also be accessed when other input vectors are stored. As more and more words are added to the memory, more and more coincidences occur among the stored data until the maximum storage capacity of the SDM is reached, which is a fraction of M [3].

Vector d in Fig. 2 contains the results of the computed distances between the input vector I and all the M sparse decoders' base addresses. Between vectors d and y, threshold elements activate their outputs to 1 if the corresponding distances in vector d are within the range 0 to r. Therefore, the active elements of vector y are only those that correspond to the hard locations accessed by the input vector I.

3. Algorithmic representation of SDM's processing

SDM's processing for storage and retrieval of information is presented in the next two sections in algorithmic form. This helps in selecting the approach for the hardware implementation of the co-processor board.

3.1. Algorithm for the first layer

Processing in the first layer is the same for both storage and retrieval of information. The procedure presented in Algorithm 1 involves first the calculation of the Hamming distance h between input vector I and each of the M sparse decoders of matrix S. The bitwise exclusive-or operation between input vector I and every row of matrix S is carried out by the function ExclusiveOr(I, S(row)). The resulting vector DifVector contains 1 at every position where the two vectors I and S(row) differ. Counting the number of 1s in DifVector yields the Hamming distance h. This counting operation is carried out by the function CountOnes(DifVector). For every distance calculated, that is, for each row, the corresponding output of vector y is set to 1 if h ≤ r, where r is the radius of the sparse decoders. I is a 1 by n vector, S is an M by n matrix and y is an M by 1 vector, where n and M are user-defined parameters.

Algorithm 1. Algorithm of the first layer

for row = 0 to (M - 1) do
  DifVector = ExclusiveOr(I, S(row));
  h = CountOnes(DifVector);
  if h ≤ r then
    y(row) = 1;
  else
    y(row) = 0
  end if
end for
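
For reference, Algorithm 1 translates almost directly into C. The sketch below is an illustrative software version (the packing of addresses into 32-bit words and the helper names are assumptions of this text); it makes explicit the word-wise exclusive-or and bit-counting steps that Section 3.3 identifies as the costly part.

```c
#include <stdint.h>

#define N_BITS   256                 /* dimension n of the address space */
#define N_WORDS  (N_BITS / 32)       /* 32-bit words per address vector  */

/* First layer: compare input address I against the M sparse decoders in S
   and set y[row] = 1 for every decoder within Hamming radius r. */
void first_layer(const uint32_t I[N_WORDS],
                 const uint32_t S[][N_WORDS],   /* M rows of n bits each */
                 uint8_t y[], unsigned M, unsigned r)
{
    for (unsigned row = 0; row < M; row++) {
        unsigned h = 0;
        for (unsigned w = 0; w < N_WORDS; w++) {
            uint32_t diff = I[w] ^ S[row][w];   /* ExclusiveOr(I, S(row)) */
            while (diff) {                      /* CountOnes(DifVector)   */
                h += diff & 1u;
                diff >>= 1;
            }
        }
        y[row] = (h <= r) ? 1 : 0;
    }
}
```
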
3.2. Algorithm for the second layer

SDM's processing is different for reading and writing in the second layer, so two different procedures are presented in Algorithms 2 and 3.

The storage phase involves basically the addition of every element of the input data vector W to the corresponding counter of the selected rows, or hard locations, of matrix H. In order to accomplish the counting operation correctly, vector W must be represented with +1 and -1 elements. The retrieval phase involves basically summing every element of the selected rows of matrix H to generate vector v, to which a threshold is applied to obtain the output vector O. In Algorithms 2 and 3, H is a matrix with M rows and U columns, and W, v and O are vectors with 1 row and U columns.

Algorithm 2. Algorithm for the storage phase of the second layer

for row = 0 to (M - 1) do
  if y(row) = 1 then
    for col = 0 to (U - 1) do
      H(row, col) = H(row, col) + W(col)
    end for
  end if
end for

Algorithm 3. Algorithm for the reading phase of the second layer

v = 0
for row = 0 to (M - 1) do
  if y(row) = 1 then
    for col = 0 to (U - 1) do
      v(col) = H(row, col) + v(col)
    end for
  end if
end for
for col = 0 to (U - 1) do
  if v(col) > 0 then
    O(col) = 1
  else
    O(col) = 0
  end if
end for
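A software rendering of Algorithms 2 and 3 might look as follows. The 8-bit signed counters follow the sizing assumption of Section 3.3, and resolving a zero sum to an output of 0 is an arbitrary choice made here for illustration, since the text above notes that a zero sum is ambiguous.

```c
#include <stdint.h>

#define U_BITS 256   /* width of the data word W and output O */

/* Storage phase (Algorithm 2): add W (coded as +1/-1) to every selected row.
   H uses 8-bit signed counters, as suggested in Section 3.3. */
void store(int8_t H[][U_BITS], const uint8_t y[], unsigned M, const int8_t W[U_BITS])
{
    for (unsigned row = 0; row < M; row++)
        if (y[row])
            for (unsigned col = 0; col < U_BITS; col++)
                H[row][col] += W[col];          /* W[col] is +1 or -1 */
}

/* Retrieval phase (Algorithm 3): sum the selected rows and threshold. */
void retrieve(const int8_t H[][U_BITS], const uint8_t y[], unsigned M, uint8_t O[U_BITS])
{
    int32_t v[U_BITS] = {0};
    for (unsigned row = 0; row < M; row++)
        if (y[row])
            for (unsigned col = 0; col < U_BITS; col++)
                v[col] += H[row][col];
    for (unsigned col = 0; col < U_BITS; col++)
        O[col] = (v[col] > 0) ? 1 : 0;          /* zero sum resolved to 0 here */
}
```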

3.3. Hardware implementation considerations

The calculation of Hamming distances in Algorithm 1 assumes that the two functions ExclusiveOr(I, S(row)) and CountOnes(DifVector) are available, but these functions are not normally in the instruction repertoire of commercial microprocessors. Even for a software implementation of the two functions, we would also have to consider that microprocessors work with limited-length operands, in the range of 64 bits nowadays, while addresses in SDMs are binary vectors with hundreds of bits. For addresses that exceed the operand length limit, the operation would have to be executed in several steps. In addition, the Hamming distance must be calculated between I and all the M sparse decoders of S, which results in an algorithmic complexity of O(nM). Since these operations are computationally intensive for large values of n and M, which is normally the case for SDMs, they are attractive for parallel hardware implementation. Bitwise hardware exclusive-or operations could be implemented in parallel between I and all the M sparse decoders. Counting the number of 1s in the results of the exclusive-or operations would normally be a serial operation.

There is also an inherent parallelism in the operations of the second layer, since input data vector W is applied to all the selected hard locations. Nevertheless, the number of selected rows in matrix H would normally be a small portion of the total number M of hard locations, typically a fraction between 0.01 and 0.1 [4]. Parallel hardware would have to be implemented for the whole matrix H, but only a small part of it would be active at each memory access. The gain in performance of a possible hardware implementation of the second layer would also depend on the spatial position of the input vector I in relation to all the sparse decoders of matrix S. For an input vector that is outside the intersection of two or more sparse decoders, the gain in performance would be null. The performance would also depend on the selected value of r, which is a user-defined parameter. So, a parallel hardware implementation of the second layer would give a variable gain in performance. In addition, two different hardware implementations would be needed, one for reading and the other for writing.

It can also be observed that the memory requirements for matrix S are much smaller than those for matrix H. For a SDM configuration with n = 256, U = 256, r = 111 and M = 16,384, for example, the memory requirement for matrix S is 16,384 × 256 × 1 bit, which is equivalent to 4 Mbits. Considering now that each position of matrix H stores an integer number, and that 8-bit numbers are sufficient, the memory requirement for matrix H for the same SDM is 16,384 × 256 × 8 bits, which corresponds to 32 Mbits. In order to speed up local processing, fast and, consequently, lower-capacity static memory chips are used, which makes it difficult to implement large local memory banks, as would be needed for the second layer. Therefore, the first-layer algorithm was selected for hardware implementation.
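
The sizing argument above is easy to reproduce; a throwaway C snippet (illustrative only) confirms the 4 Mbit and 32 Mbit figures for the configuration n = 256, U = 256, M = 16,384.

```c
#include <stdio.h>

int main(void)
{
    long M = 16384, n = 256, U = 256;
    long s_bits = M * n * 1;          /* matrix S: one bit per decoder bit      */
    long h_bits = M * U * 8;          /* matrix H: one 8-bit counter per column */
    printf("S: %ld Mbit, H: %ld Mbit\n", s_bits / (1024 * 1024), h_bits / (1024 * 1024));
    return 0;
}
```
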
3.4. Hardware implementation of the first layer

The Hamming distance calculations between input vector I and all the sparse decoders of matrix S can be carried out independently and in parallel. In our design, the elements of I, S and of the output vector Y are stored in three separate banks of static memory. The outputs of banks I and S are connected to the inputs of the PEs in order to generate the input data for memory bank Y (Fig. 3). The PEs are therefore responsible for calculating the distances between I and all the elements of S and then generating vector Y, which indicates the hard locations selected from matrix H. The PEs are implemented within one Xilinx FPGA, which allows flexibility in the modification of some user-defined parameters: n (space dimension), r (radius) and M (number of hard locations). In the implementation of the PEs, a bit-serial strategy was adopted, as described in the next section.

Fig. 3. Static RAM connections to the Processing Elements (PEs).

3.5. Bit-serial processing element

A serial PE (Fig. 4), instead of a parallel approach, was chosen for the implementation of the algorithm. The first reason for this choice is that the pin count for the serial design is much smaller than for the parallel one. This is due to the need, in the parallel approach, for a ROM that would have to be implemented externally to the FPGA. In addition, many more serial PEs than parallel PEs can be built into an FPGA. However, the design concept of the bit-parallel PE could still be useful for a software implementation. The chosen approach therefore requires simpler circuits and leads to a design more coherent with current FPGA architectures than the parallel implementation.

Fig. 4. Bit-serial PE for the first layer of SDM.

The bit-serial PE is composed basically of a counter and a simple combinational circuit. Before processing starts, the counter (Difference Counter, DC) is reset and, on every new clock cycle, bit i of the input vector I is compared with bit i of address decoder j of matrix S by an exclusive-or gate. If the two bits I_i and S_ij are different, DC is enabled to count. The combinational circuit at the outputs of DC has the function of detecting whether I is within r bits of S_j. The output y of the PE goes to 0 when the counting value of DC is larger than r and is 1 otherwise. Therefore, at the beginning of processing, DC = 0 and y = 1; once counting starts, y may change to 0 if DC exceeds the pre-defined value of r. In order to speed up processing, counting is interrupted when y changes to 0, through the return path between y and the input AND gate. Since in SDMs the value of r is always less than n/2, DC only needs to count up to n/2.

Since the 32 PEs process the input address in parallel, the distance calculation for the current group of 32 decoders is interrupted when d > r for all of them. This allows the distances to the next 32 decoders to be calculated. A performance improvement is obtained because not all 256 bits of each decoder need to be scanned for every input address: only sparse decoders for which d ≤ r have all 256 bits considered.
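To make the PE's behaviour concrete, the following C sketch models one bit-serial PE as described above (a software model written for this text, not the actual FPGA design): the difference counter is incremented whenever the serial bits differ, the output y starts at 1, and counting stops once the count exceeds r.

```c
#include <stdint.h>

/* Software model of one bit-serial PE: streams the n bits of the input
   address I against one decoder base address s and reports y = 1 when the
   Hamming distance is within r. Counting stops early once the distance
   exceeds r, mirroring the AND-gate feedback path of the hardware PE. */
int pe_bit_serial(const uint8_t *I_bits, const uint8_t *s_bits, unsigned n, unsigned r)
{
    unsigned dc = 0;      /* Difference Counter (DC), reset before processing */
    int y = 1;            /* output starts at 1 (distance assumed within r)   */

    for (unsigned i = 0; i < n && y; i++) {   /* y == 0 disables further counting */
        if (I_bits[i] ^ s_bits[i])            /* bits differ: enable DC           */
            dc++;
        if (dc > r)                           /* combinational detector on DC     */
            y = 0;
    }
    return y;
}
```

On the board, 32 such PEs run in lock-step over the same bit index, so a group of 32 decoders can be abandoned as soon as all 32 counters have exceeded r.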

4. Architecture of the RC-SDM

The RC-SDM is composed basically of two FPGAs, three banks of memory and the interface circuits to the host processor, as shown in the diagram of Fig. 5. In the start-up phase, after the board is powered up, the two FPGAs remain idle with their pins in high impedance. Next, the configuration program running on the host processor loads into them the bitstream files with their internal configurations. The configuration program also loads into the memories the data to be processed. After processing is finished, the host reads back the results. Memory banks 1 and 2 are 32 bits wide, while bank 3 is 1 bit wide for the implementation of the PEs' bit-serial approach.

Fig. 5. Block diagram of the RC-SDM board.

It can be observed from Fig. 5 that the board structure is reasonably flexible and open. Except for some limitations of the available hardware, the user has reasonable flexibility regarding the implementation of the algorithm. For example, a bit-parallel approach could be used instead of the current serial approach by simply changing the configuration program. Although part of the control circuits for the communication between the board and the host is implemented in external programmable circuits (GALs), FPGA1 is also used for this purpose, which adds flexibility for changing the internal communication protocol.

The board structure was designed in such a way that the algorithm implementation is split between the two FPGAs. FPGA1 includes part of the communication circuits for exchanging information with the host, whereas FPGA2 is dedicated to the implementation of the PEs. This also gives flexibility for the implementation of the PEs. The chips used to implement FPGA1 and FPGA2 were the Xilinx XC4003E and XC4005E, with three and five thousand gates, respectively. The chip utilization of FPGA1 was 43% and that of FPGA2 was 92%, and the clock frequency was 20 MHz.

A block diagram of the board after configuration of the FPGAs is presented in Fig. 6. As can be observed, FPGA2 contains 32 PEs, which allows 32 sparse decoders to be processed in parallel. Control of the memories is carried out by FPGA1, which also interprets the following commands from the host: reading from and writing to the memories, initializing processing, reading the processing status, starting processing, and general initialization of the board.

Fig. 6. Block diagram of the RC-SDM board after configuration of the FPGAs.

5. Performance considerations

In order to evaluate the performance of RC-SDM, two programs, one that uses the board and one that does not, were implemented. In both programs, the SDM was set up with parameters n = 256 and M = 16,384. The machine used to run the programs and host the board was a PC microcomputer with a Pentium Celeron CPU running at 300 MHz. The programs were written in ANSI C with all SDM functions implemented in assembly. The main program written in C deals mainly with accesses to disk and screen, whereas the assembly routines handle the SDM processing; this holds for both the hardware and software versions of the program, so that the hardware is compared against as fast a software implementation as possible. The second program is a modification of the first, in which the assembly routines were replaced by processing on the board. The processing time, measured with a logic analyzer, was 49.8 ms for the assembly routine and 13.06 ms for the hardware processing on RC-SDM, which results in a speed-up of about four times in relation to the software implementation.

6. Conclusions

More than a proof-of-concept, the implementation of RC-SDM provided an efficient version of SDMs with improved performance when compared to the equivalent software implementation. The proposed architecture has shown itself to be appropriate for the current application and flexible enough for further improvements and extensions. Since most of the hardware is implemented in the FPGAs, reconfiguration and changes to the current implementation can easily be accomplished in future developments. Although current state-of-the-art FPGAs would allow a higher-density circuit for higher-dimensional SDMs, this work showed that, even for a small-scale problem, the architecture of RC-SDM is suitable for a wide range of implementations.

Acknowledgements

The authors would like to thank the Xilinx University Program, CNPq and FINEP for their support.

References

[1] A. Krikelis, C.C. Weems, Associative processing and processors, IEEE Computer (Guest Editorial), November 1994.
[2] C. Orovas, Cellular Associative Networks for Pattern Recognition, PhD thesis, University of York, England, 1999.
[3] P. Kanerva, Sparse Distributed Memory, Bradford/MIT Press, Cambridge, USA, 1988.
[4] P. Kanerva, Associative-memory models of the cerebellum, in: I. Aleksander, J. Taylor (Eds.), Artificial Neural Networks, Elsevier Science Publishers B.V., Amsterdam, 1992.
[5] P. Kanerva, Sparse distributed memory and related models, in: M.H. Hassoun (Ed.), Associative Neural Memories: Theory and Implementation, Oxford University Press, New York, 1993, pp. 50-76.
[6] R.W. Hartenstein et al., Custom computing machines vs. hardware/software co-design: from a globalized point of view, in: 6th International Workshop on Field Programmable Logic and Applications, Darmstadt, Germany, September 23-25, 1996.
[7] A.P. Braga, I. Aleksander, Geometrical treatment and statistical modeling of the distribution of patterns in the n-dimensional Boolean space, Pattern Recognition Letters 16 (1994) 507-515.