A Modularization Scheme for Feedforward Networks
Arnfried Ossen
Institut für Angewandte Informatik, Technical University of Berlin, Sekretariat FR -9, Franklin Str. 28/29, Berlin W-1000/19, FR Germany <ao@coma.cs.tuberlin.de>

Abstract

This article proposes a modularization scheme for feedforward networks based on controllable internal representations. Control is achieved by replacing hidden units with pretrained modules that constrain internal patterns of activity to desired subsets. In the case of auto-associative feedforward networks these subsets can be seen as module interfaces. If enough a priori knowledge about a system is available, hierarchical systems with separately trainable and exchangeable modules can be built.

1 INTRODUCTION

Feedforward networks with backpropagation as a learning procedure have been successfully used for many problems. Generally speaking, they can approximate continuous nonlinear functions to any desired precision. But there are deficiencies: learning time is at least of polynomial order; the generalization abilities of standard networks are insufficient; and the interpretation of network behavior is difficult, if not impossible.

Much work has been done on the optimization of learning time. The majority of the techniques developed are based on second-order methods that minimize the number of steps the gradient descent procedure has to take. But even if significant progress could be achieved, it would not help improve the generalization abilities of networks. Le Cun [1989] has pointed out that the number of training examples required for good generalization scales like the logarithm of the number of functions which a specific network architecture can implement. Assuming that a small network is less general than a bigger one, i.e. that it can implement fewer functions, provided it is still able to compute the desired function, we can conclude that small networks yield better generalization than bigger networks, given the same amount of training data. Learning complexity per learning cycle decreases, then, simply because small networks contain fewer links and units. On the other hand, overall learning efficiency may deteriorate because of a decreasing convergence rate.

A couple of problem-independent strategies are known for the construction of small nets. Either an auxiliary error term is added to the cost function in order to penalize network configurations with many active links and/or units [Chauvin, 1989], or the relevance of links/units with respect to the error is determined and the least relevant links/units are deleted [Mozer and Smolensky, 1989; le Cun et al., 1990]. The problem with minimal network construction using auxiliary error terms is the difficulty of weighting the terms relative to each other, which may cause convergence problems. Also, it might still be very difficult to understand the emerging patterns of activity in the hidden layers. A more general approach is to minimize the number of free parameters in the network, e.g. by incorporating equality constraints between the weights of the network based on a priori knowledge of the problem [le Cun, 1989]. A significant reduction in learning complexity and better generalization can be obtained. On an even more general level, it is now recognized that the optimal network architecture for a given problem should be the one that can be described with the least number of bits, that is, the one with the Minimum Description Length (MDL) proposed by Rissanen. A third approach is proposed in this article.
A priori knowledge is used to select a constraint space for internal representations. The goal is to use the constrained internal representations as flexible (within bounds) interfaces between the modules of feedforward networks. In the first place, a break-down into modules would result in reduced learning complexity. In addition, if enough a priori knowledge is available, it should also be possible to tailor the interfaces in such a way that the number of free parameters is also
reduced without losing the network's ability to implement the desired function, resulting in enhanced generalization abilities.

2 CONSTRAINING INTERNAL REPRESENTATIONS

This interface can be represented as a constraint space for patterns of activation at module boundaries. The patterns have to be controllable and interpretable, but must still be general enough to capture the underlying regularities of the environment of the module during the learning procedure. On the other hand, they have to facilitate the encoding of external values and the decoding of adopted patterns of activity in terms of external concepts. A promising way of defining these patterns is the use of coding schemes that restrict patterns of activation to specific subsets with well-defined values related to them. I have chosen a coarse coding [Hinton et al., 1986] scheme for scalar values. In comparison to value-unit codings, it shows improved convergence [Hancock, 1989]. It also provides the resolution needed for the development of internal representations. The intended scalar value can simply be represented by sampling a unimodal function centered at this value [Saund, 1989]. Decoding of scalars is possible via auxiliary networks that map scalar representations to an activation value, or by a least-squares error procedure.

2.1 A CONSTRAINED NETWORK

Constraints are enforced by replacing the hidden layer of a standard feedforward network with a pretrained auto-associative network. An auto-associative network m1 trained for identity mapping of any desired scalar representations will generalize to reasonable representations at the outputs if clamped to unknown data. If backpropagated to the input units, the error can be used to modify the inputs in a way that minimizes the error at the output units. If placed in the middle of a surrounding network m2, the adopted representations at the hidden units of m1 serve as encodings of the internal representations of m2, while the backpropagated errors can be used to modify the weights between the input layer of m2 and the input layer of m1 (see figure 1). Thus two optimizations take place: (a) the input layer of m1 will eventually generate almost perfect scalar representations without any weight changes in m1 proper; (b) m2 converges to its desired input/output mapping.

[Figure 1: An auto-associative module m1 as an abstraction of the hidden layer of m2.]

2.2 A CONSTRAINED ENCODER NETWORK

The scheme was tested on an encoder whose hidden layer had been replaced by a module. Three simulation runs were carried out (see figure 2): a) a standard system; b) a system with three hidden layers; c) an 8-[4-4-4]-8 system with pretrained abstract module. The weights of the links of the three module layers are copied from a system that has learned to auto-associate arbitrarily-positioned coarse-coded scalars in a separate learning procedure. After being copied, the weights in the module are fixed at their current values. Only the weights from the input layer of the system to the input layer of the module and from the output layer of the module to the output layer of the system, together with the respective biases, are subject to change by the backpropagation procedure. The module merely serves to propagate activations and backpropagate errors, thereby constraining the internal representations of the encoder to scalar representations. The convergence of the modular system is very similar to that of the original encoder and about one order of magnitude better than that of a system without separate training. Of course, the training effort for the pretrained module has to be taken into account, too.
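Before comparing training complexities, here is a concrete illustration of the coarse coding of scalars described above. This is a minimal sketch, not taken from the original paper: the four-unit layout, the interval [0, 1], and the use of the derivative of the sigmoid 1/(1 + e^(-x/t)) at temperature t = 1/8 as the unimodal function (as stated in appendix A) are assumptions made for the example.

```python
import numpy as np

T = 1.0 / 8.0                        # temperature of the sigmoid (see appendix A)
CENTERS = np.linspace(0.0, 1.0, 4)   # assumed: four units with preferred values in [0, 1]

def bump(x):
    """Unimodal function: the derivative of 1/(1 + exp(-x/T)), up to scale.
    s*(1-s) is proportional to that derivative; the factor 4 makes the peak 1."""
    s = 1.0 / (1.0 + np.exp(-x / T))
    return 4.0 * s * (1.0 - s)

def encode(v):
    """Coarse-code the scalar v as a pattern of activation over the units."""
    return bump(CENTERS - v)

def decode(pattern, grid=np.linspace(0.0, 1.0, 1001)):
    """Least-squares decoding: return the candidate value whose code best matches."""
    errors = [np.sum((encode(g) - pattern) ** 2) for g in grid]
    return grid[int(np.argmin(errors))]

p = encode(0.37)
print(np.round(p, 3), decode(p))     # the decoded value should be 0.37
```

Any sufficiently smooth unimodal kernel would serve equally well; the decoder here simply searches a grid, corresponding to the least-squares error procedure mentioned above.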
But since backpropagation is typically of polynomial order, the additional complexity of pretraining the module (40 free weights) is low in comparison to the complexity of the training cycles for the three-layer (120 free weights) and 8-[4-4-4]-8 (80 free weights) systems.
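The training regime of section 2.2 can be summarized in code. The sketch below is an illustration under assumptions, not the original implementation: it pretrains a small auto-associative module on coarse-coded scalars, fixes its weights, and then trains only the surrounding encoder weights. It reuses the encode helper from the previous sketch; the iteration counts, weight initialization, and learning rate of 0.2 (from appendix A) are choices made for the example.

```python
import numpy as np

# encode() is assumed from the coarse-coding sketch above.
rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer(n_in, n_out):
    """One weight matrix and bias vector with small random initial weights."""
    return [rng.normal(0.0, 0.3, (n_in, n_out)), np.zeros(n_out)]

def forward(layers, x):
    """Return the list of activations, input first, output last."""
    acts = [x]
    for W, b in layers:
        acts.append(sigmoid(acts[-1] @ W + b))
    return acts

def backward(layers, acts, err, lr, frozen):
    """On-line backpropagation step; layers whose index is in `frozen` only
    propagate the error and are not modified (the pretrained module)."""
    for i in reversed(range(len(layers))):
        W, b = layers[i]
        delta = err * acts[i + 1] * (1.0 - acts[i + 1])  # sigmoid derivative
        err = delta @ W.T                                # error for the layer below
        if i not in frozen:
            W -= lr * np.outer(acts[i], delta)
            b -= lr * delta

# 1) Pretrain a 4-4-4 auto-associative module on coarse-coded scalars.
module = [layer(4, 4), layer(4, 4)]
for _ in range(20000):
    p = encode(rng.uniform(0.0, 1.0))
    acts = forward(module, p)
    backward(module, acts, acts[-1] - p, lr=0.2, frozen=set())

# 2) Fix the module inside an 8-[4-4-4]-8 encoder; train only the outer weights.
net = [layer(8, 4)] + module + [layer(4, 8)]
for _ in range(20000):
    x = np.eye(8)[rng.integers(8)]       # one-of-eight encoder pattern
    acts = forward(net, x)
    backward(net, acts, acts[-1] - x, lr=0.2, frozen={1, 2})

print(np.round(forward(net, np.eye(8)[3])[-1], 2))  # should resemble the input
```

Only the set of updated layers differs between the two steps; the frozen module still takes part in both the forward and backward passes, which is what constrains the encoder's internal representations to scalar codes.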
[Figure 2: Convergence of the standard encoder, the three-layer encoder, and the 8-[4-4-4]-8 encoder with pretrained abstract module; total sum of squares plotted against training epochs, both on logarithmic scales.]

3 APPLICATIONS

3.1 NONLINEAR DIMENSIONALITY REDUCTION

This system can be applied to nonlinear dimensionality reduction problems [Saund, 1989], which turn out to be the special case of equal representations at the input and output layers. The constraining module has to be pretrained to auto-associate the chosen scalar coding, while the original modules auto-associate the higher-dimensional data. The constraining module pressures internal patterns of activation into the desired coding, without need for special convergence strategies.¹ Figure 3 shows a nonlinear dimensionality reduction from two-dimensional data to a one-dimensional constraint using a 16-[8-8-8]-16 architecture.

¹ To obtain convergence, Saund [1989] has to use a simulated annealing schedule (smoothness of the scalar coding) and a method of encouraging scalarized behavior by increasing peaks and decreasing valleys in a particular trial.

[Figure 3: Nonlinear dimensionality reduction from two-dimensional curve data to one-dimensional data denoting points along the curve.]

3.2 INTERFACE DEFINITION

In truly modular systems, e.g. large computer programs, there is typically a design phase requiring the strict definition of interfaces before the construction of modules can take place. This includes a fixed input/output relation for all data to be processed. In addition, the representation of data at module boundaries is given by data types, that is, the range of possible values a parameter can adopt. Once a design is completed, modules can be constructed separately, which in general results in a significant reduction in implementation complexity. Also, modules can be exchanged, e.g. if implemented inefficiently, at any time without interfering with other parts of the system.

How might this scheme be transferred to modular neural networks? An important feature of neural nets is their ability to learn from examples, that is, to develop internal representations according to the underlying regularities of the environment. If internal representations are to be used as interfaces, all internal representations that can emerge during learning would have to be known in advance in order to define module input/output relations, making the learning phase pointless. This dilemma can partly be overcome if interfaces are defined more loosely. One way of achieving a loose coupling is to take advantage of the characteristics of scalar codings. In the case of an auto-associative feedforward network, where internal representations are forced into scalar codings and where inputs/targets are also presented in the form of scalar codings, a sufficient approximation of the targets is only possible if the scalar codings at the hidden layer evolve in an orderly fashion. In other words, the internal codings are almost equally distributed over the given interval and are in ascending or descending order with respect to the corresponding inputs/targets, as shown above for the two-to-one dimensionality-reduction problem; a way to check this ordering numerically is sketched below.
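One way to make the orderly-evolution claim concrete is to decode the internal scalar of a trained auto-associator for a sweep of inputs and test that the decoded values are monotone and spread over the interval. The sketch below is illustrative: it assumes a network `assoc` trained, as described above, on scalar-coded inputs/targets with the constrained scalar-coded layer at activation index 1, and it reuses the encode, decode, and forward helpers from the earlier sketches.

```python
import numpy as np

def internal_scalar(assoc, v):
    """Decode the scalar adopted at the constrained internal layer for input v."""
    acts = forward(assoc, encode(v))   # forward() from the earlier sketch
    return decode(acts[1])             # assumed: acts[1] is the scalar-coded layer

probes = np.linspace(0.0, 1.0, 50)
codes = np.array([internal_scalar(assoc, v) for v in probes])

diffs = np.diff(codes)
ordered = bool(np.all(diffs >= 0) or np.all(diffs <= 0))  # ascending or descending
spread = codes.max() - codes.min()                         # coverage of the interval
print("order preserved:", ordered, "interval coverage:", round(float(spread), 2))
```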
The behavior very much resembles a simple topology-preserving mapping as known from Kohonen's self-organizing feature maps [Kohonen, 1989]. Given that, an interface can be defined by a type definition, that is, the range of valid scalar values, together with the assertion that a neighborhood relation between patterns of activation at the inputs/targets and the internal patterns will be preserved. On the other hand, a learning system consisting of more loosely coupled modules will lose some of the capabilities a strictly defined system has.

3.3 EXAMPLE

Let us consider a system where some raw data is to be processed in several steps. First, the data is compressed into compact encodings; then a second step is applied, e.g. feature values derived from the encodings are displayed; then further processing steps are carried out. Such a system could be broken down into modules loosely coupled via scalar-coded interfaces. Figure 4 shows an example. Two modules (m1, m2) of auto-associative feedforward networks learn to transform the raw data into scalar-coded internal representations. Assuming that the low-level configuration is appropriate for the task (a priori knowledge), two sets of scalar-coded sequences are learned. Since the possible values are known, it is sufficient to train one display module (m3) to indicate the adopted values of both low-level modules. The display module can therefore be trained separately and may be exchanged.

[Figure 4: A modular system with scalar-coded interfaces: two low-level modules m1 and m2 process raw data; a display module m3 reads their scalar-coded interfaces.]

4 DISCUSSION

Modularization is a promising way of combating learning time complexity in backpropagation networks. On the other hand, the above modular system would be of little use if absolute feature values were important for further processing: it would then not be possible to train higher-level modules separately, and no reduction in learning complexity would be achievable. However, the topology-preserving behavior of scalar-coded interfaces might be a sufficient justification here, at least for a subset of problems. For these cases, a loosely-coupled modular system not only allows the required learning epochs to be reduced; a means of interpreting patterns of activity at module interfaces is also achieved.

A LEARNING PROCEDURE ADJUSTMENTS

All simulations were carried out using the standard backpropagation algorithm with a fixed learning rate of 0.2, a fixed momentum of 0.1, weight update after each presentation (on-line mode), and sequential presentation of patterns. To speed up learning, the gradient was normalized before updating the weights. This follows from the empirical observation that the product of the optimal learning rate and the absolute value of the gradient remains almost constant over all learning cycles; see [Salomon, 1989]. Scalar codings were created using the derivative of the sigmoidal function 1/(1 + e^(-x/t)) at a temperature of t = 1/8.
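The normalization just described can be illustrated as follows. This is a sketch under assumptions, not the paper's code: it rescales the full gradient to unit length before applying the learning rate and momentum, and treating all weight arrays as one flattened vector is a choice made for the example.

```python
import numpy as np

def normalized_update(grads, velocity, lr=0.2, momentum=0.1):
    """One on-line update with the gradient rescaled to unit length.

    grads:    list of gradient arrays, one per weight array of the network
    velocity: list of arrays of the same shapes carrying the momentum term
    Returns the increments to add to the corresponding weight arrays."""
    norm = np.sqrt(sum(np.sum(g * g) for g in grads))   # length of the full gradient
    steps = []
    for g, v in zip(grads, velocity):
        v *= momentum
        v -= lr * g / max(norm, 1e-12)                  # normalized gradient step
        steps.append(v.copy())
    return steps
```

By construction, the product of the learning rate and the length of the applied gradient is then constant across learning cycles, which matches the empirical observation cited above.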
Acknowledgement

I would like to thank Albrecht Biedl for his helpful comments on early drafts of the paper. My special thanks go to Geoffrey Hinton for his critical comments during the revision of a previously published version of this paper.

References

[Chauvin, 1989] Yves Chauvin. A back-propagation algorithm with optimal use of hidden units. In David S. Touretzky, editor, Advances in Neural Information Processing Systems I. Morgan Kaufmann Publishers, San Mateo, California, 1989.

[Hancock, 1989] Peter J. B. Hancock. Data representation in neural nets: An empirical study. In David Touretzky, Geoffrey Hinton, and Terrence Sejnowski, editors, Proceedings of the 1988 Connectionist Models Summer School, pages 11-20, San Mateo, CA, 1989. Morgan Kaufmann Publishers.

[Hinton et al., 1986] Geoffrey E. Hinton, James L. McClelland, and David E. Rumelhart. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, chapter 3. MIT Press/Bradford Books, 1986.

[Kohonen, 1989] Teuvo Kohonen. Self-Organization and Associative Memory. Springer Verlag, 1989.

[le Cun et al., 1990] Yann le Cun, John S. Denker, and Sara A. Solla. Optimal brain damage. In David S. Touretzky, editor, Advances in Neural Information Processing Systems II. Morgan Kaufmann Publishers, San Mateo, California, 1990.

[le Cun, 1989] Yann le Cun. Generalization and network design strategies. In Rolf Pfeiffer, Zoltan Schreter, Françoise Fogelman-Soulié, and Luc Steels, editors, Proceedings Connectionism in Perspective. Swiss Group for Artificial Intelligence and Cognitive Science (SGAICO), Elsevier Science Publishers B.V., 1989.

[Mozer and Smolensky, 1989] Michael C. Mozer and Paul Smolensky. Skeletonization: A technique for trimming the fat from a network via relevance assessment. Technical Report CU-CS, University of Colorado at Boulder, Boulder, January 1989.

[Salomon, 1989] Ralf Salomon. Adaptiv geregelte Lernrate bei Back-propagation. Technical Report 89-24, Technische Universität Berlin, Forschungsberichte des Fachbereichs Informatik, 1989.

[Saund, 1989] Eric Saund. Dimensionality-reduction using connectionist networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(3):304-314, March 1989.