Attribute Grammar Encoding of the Structure and Behaviour of Artificial Neural Networks

by

Talib Sajad Hussain

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Doctor of Philosophy

Queen's University
Kingston, Ontario, Canada
August, 2003

Copyright Talib Sajad Hussain, 2003

Dedicated to my father, Dr. Matlub Hussain

Abstract

Current techniques for the abstract representation of complex artificial neural network architectures are limited in the variety and types of neural network characteristics that may be represented. The Network Generating Attribute Grammar Encoding (NGAGE) technique is introduced to address these limitations. NGAGE uses an attribute grammar to explicitly represent both topological and behavioural properties of a neural network, and uses a common neural interpreter to generate functional neural networks from a derivation of the grammar. Grammars that represent a wide variety of current and novel neural network architectures are presented. Together, these grammars demonstrate that the NGAGE technique has greater representational flexibility than current approaches. A novel evolutionary algorithm, Probabilistic Context-Free Grammar Genetic Programming (PCFG-GP), is introduced to enable a constrained evolutionary search of the space of context-free parse trees generated by an attribute grammar. Experimental results demonstrating the search behaviour of the PCFG-GP algorithm are presented. The NGAGE technique is shown to be a valuable tool for the representation and exploration of novel and existing neural network architectures.

Acknowledgements

I would like to thank my thesis advisor Dr. Roger A. Browse for his valuable guidance throughout my graduate career and for casually pointing out one day that the grammar structure I was trying to develop from scratch already had a better cousin in existence, namely attribute grammars. I would like to thank my mother, Mary Lou, for her valuable organizational help during the initial writing stages. My father, Matlub, and my mother have always inspired me to achieve my best, and they and all my family, my wife Eliana, my brother Tariq and my sister Tamara, have given me much love and support in so many ways. I would like to thank Satish Kumeta, Juergen Dingel, Alan Ableson and Hosein Isazadeh for many productive discussions on my research. I would like to thank Irene LaFleche, Debby Robertson and Lynda Moulton of the staff of the School of Computing at Queen's University for their always cheerful help. Finally, I would like to thank Penny Ellard, Richard Lazarus and Mark Berman, my managers at BBN Technologies, for their support and encouragement during the final stages of my thesis. Financial support was provided by the Ontario government, the Jack Catherall Scholarship Fund, Queen's University and BBN Technologies.

Statement of Originality

I hereby certify that this Ph.D. thesis is original and that all ideas and inventions attributed to others have been properly referenced.

Contents

Abstract
Acknowledgements
Statement of Originality
Contents
List of Figures
List of Tables

Chapter 1 Introduction
  Motivation
  Foundations
  Research

Chapter 2 Background
  Neural Networks
  Activation Processing
  Hebbian Learning
  Perceptron Learning
  Back-Propagation Learning
  2.1.5 Instar Processing and Learning
  Feedback Processing
  Structural Learning
  Evolutionary Algorithms
  Search Algorithms
  Biological Evolution
  Evolutionary Computation
  Genetic Algorithms
  Genetic Programming
  Strongly-Typed Genetic Programming
  Context-Free Grammar Genetic Programming
  Logic Grammar Based Genetic Programming
  Definite Clause Translation Grammar Genetic Programming
  Self-Adaptive Evolutionary Algorithms
  Baldwin Effect in Evolutionary Computation
  Modular Neural Network Models
  Redundancy
  Counter-Propagation
  ARTSTAR
  CALM Networks
  MINOS Networks
  Mixture of Experts Networks
  Auto-Associative Modules
  2.3.8 Short Connection Bias Network
  Evolution of Neural Networks
  Direct Encoding
  Structural Encoding
  Parametric Encoding
  Developmental Rule Encoding
  Grammatical Encoding
  Cellular Encoding
  Basic Cellular Encoding
  Syntactically Constrained Cellular Encoding
  Syntactically Constrained Geometry-Oriented Cellular Encoding
  Adaptation of Learning Rules
  Attribute Grammars
  Definition
  Ease of Representation
  Properties and Limitations

Chapter 3 Attribute Grammar Encoding
  Motivation
  Scalability
  Modularity and Scalability
  Structural Distinctiveness
  Modular Decomposition Technique
  3.1.3 Evolutionary Neural Network Systems
  Design Principles for Scaling Solutions
  Representation of Neural Network Models
  Design Principles for Effective Network Specification
  Attribute Grammars for the Specification of Neural Networks
  Basic Approach
  Network Generating Attribute Grammar Encoding
  Introduction
  Attribute Grammar Component
  Specific Neural Network Instance
  Genetic Manipulation of NGAGE Representations
  Design Methodology
  Basic Design Practices
  Hypotheses

Chapter 4 Attribute Grammar Encoding of Neural Network Topology
  Neural Foundations
  Timing Model
  Basic Neuron Model
  Connection Model
  Representation of Network Topology
  Primary Design Practices
  Discussion of Primary Design Practices
  Advanced Design Practices
  Discussion of Advanced Design Practices
  4.3 Representation of Multiple Signal Types
  Design Practices for Signal Types
  Discussion of Design Practices for Signal Types
  Representation of the Structure of Modules
  Modular Foundations
  Design Practices for Modular Topology
  Discussion of Design Practices for Modular Topology
  Summary
  Compactness
  Neural Heterogeneity
  Structural Variety
  Modular Composability
  Modular Decomposition
  Novel Architectures

Chapter 5 Attribute Grammar Encoding of Neural Behaviour
  Enhanced Neuron Model
  Representation of Neuron Behaviour
  Primary Design Practices
  Discussion of Primary Design Practices
  Compactness of Grammar
  Variety of Neurons
  Variable Grammar Design
  Representation of Both Network Topology and Neuron Behaviour
  5.2.3 Advanced Design Practices
  Discussion of Advanced Practices
  Representation of Behaviours for Multiple Signal Types
  Design Practices for Signal-Specific Behaviours
  Discussion of Design Practices for Signal-Specific Behaviours
  Representation of Complete Neural Network Architectures
  Representation of Learning Rules
  Neural Control Mechanisms
  Representation of the Behaviour of Modules
  Discussion of Modular Behaviour Practices
  Summary
  Behavioural Variety
  Architectural Behaviours
  Behavioural Detail
  Consistent Behavioural Interpretation
  Ease of Design
  Novel Behaviours

Chapter 6 Attribute Grammar Encoding as Genetic Representation
  Properties of NGAGE Parse Trees
  Probabilistic Context-Free Grammar Genetic Programming
  Production Probabilities
  Symbol Probabilities
  Probabilistic Genetic Operators
  6.2.4 Dynamic Probabilities
  Self-Adaptive Probabilistic Genetic Operators
  Experiment: Evolution of Backpropagation Networks
  PCFG-GP algorithm
  NGAGE Grammar
  Experimental Paradigm
  Experimental Conditions
  Results
  Discussion

Chapter 7 Conclusions
  Future Directions

References
Vita

13 List of Figures Figure 1: Perceptron Node...11 Figure 2: (a) One-Layer and (b) Two-Layer Perceptron Topologies...12 Figure 3: Instar Node...16 Figure 4: (a) Original and (b) Hecht-Nielsen Formulations of Back-Propagation Network...18 Figure 5: Search Landscape Features...23 Figure 6: Genetic Algorithm Representation Schemes...31 Figure 7: One-Point Crossover Operator...32 Figure 8: One-Bit Mutation Operator...32 Figure 9: Genetic Programming Representation Scheme...33 Figure 10: Subtree Crossover Operator...34 Figure 11: Subtree Mutation Operator...34 Figure 12: Strongly-Typed Genetic Programming Representation Scheme...37 Figure 13: Crossover Points in Strongly Typed Subtree Crossover...38 Figure 14: Context-Free Grammar of CFG-GP...41 Figure 15: CFG-GP Genetic Trees and Crossover Points...42 xiii

14 Figure 16: Selective CFG-Based Reproduction Operators...46 Figure 17: LOGENPRO Logic Grammar...46 Figure 18: LOGENPRO Parse Tree with Semantics for Variables...48 Figure 19: Definite Clause Translation Grammar Production...49 Figure 20: Gaussian Mutation...52 Figure 21: Tree-Based Selective Crossover...54 Figure 22: Probabilistic Prototype Tree...55 Figure 23: Modular ARTSTAR Network...64 Figure 24: Basic CALM Module...65 Figure 25: MINOS Network...67 Figure 26: Mixture of Experts Network...69 Figure 27: Network of Auto-Associative Modules...72 Figure 28: (a) Direct Encoding of a (b) Neural Network Architecture...77 Figure 29: (a) Inefficient and (b) Efficient Structural Encoding of a (c) Neural Network...79 Figure 30: Parametric Encoding of a Back-Propagation Network...80 Figure 31: Developmental Rule Encoding...82 Figure 32: Grammatical Encoding of Neural Paths...83 Figure 33: Neuron Model of Jacob (1994b)...85 Figure 34: Cellular Encoding Starting Graphs...88 Figure 35: SEQ Program Symbol...89 Figure 36: PAR Program Symbol...90 xiv

15 Figure 37: (a) Cellular Encoding Gene of (b) Neural Network and its (c) Decoding Steps...92 Figure 38: Cellular Encoding with REC Program Symbol of (b) Neural Network and its (c) Partial Decoding Steps...95 Figure 39: LSPLIT Program Symbol...96 Figure 40: List, Set and Array Grammar Constructs...97 Figure 41: Representation Savings of Recursion Range...98 Figure 42: Context-Free Grammar that Constrains Program Symbol Tree Figure 43: Attribute Grammar for Binary Numbers Figure 44: Attributed Parse Tree Figure 45: Simple NGAGE Grammar Figure 46: (a) Sample Parse Tree, (b) Associated Attributed Parse Tree and (c) Associated Neural Network Topology Figure 47: Typed Subtree Crossover using Context-Free Parse Trees of NGAGE Figure 48: Basic NGAGE Neuron Model Figure 49: NGAGE Grammar Illustrating the Representation of Identity Values Figure 50: Attributed Parse Tree with Identity Attribute Values Figure 51: NGAGE Grammar Illustrating Distinct Sets of Similar Neural Structures Figure 52: Neural Topology Arising from Distinct Representation of Sub-Structures Figure 53: NGAGE Grammar Illustrating Replication of Neural Structure using Set Operations Figure 54: Topology Replication xv

16 Figure 55: NGAGE Grammar Illustrating Explicit Representation of External Environment Figure 56: Neural Topology Including External Ports Figure 57: (a) Compound Production for Real Numbers, (b) Instance of Compound Production and (c) Deterministic Attribute Evaluation Process Figure 58: (a) Compound Production for Range of Integers, (b) Instance of Compound Production and (c) Deterministic Attribute Evaluation Process Figure 59: (a) NGAGE Grammar Illustrating Recurrent Connections and (b) Associated Recurrent Topology Figure 60: NGAGE Grammar Illustrating Equivalent Representation to Grammatical Encoding Figure 61: NGAGE Grammar Illustrating Equivalent Representation to Cellular Encoding Figure 62: NGAGE Grammar Illustrating Multiple Clone Operation Figure 63: NGAGE Grammar Illustrating the Use of Inherited Attributes to Constrain Topology Figure 64: (a) Attributed Parse Tree Illustrating "Ignoring" Mechanism and (b) Associated Neural Topology Figure 65: NGAGE Grammar Illustrating the Use of Inherited Attributes to Avoid Unnecessary Specification of Neural Components Figure 66: Subtree Generated Using Constraint to Ensure Meaningful Neural Specification Figure 67: NGAGE Grammar Illustrating the Use of NGAGE Parameter Values xvi

17 Figure 68: NGAGE Grammar Illustrating the Interaction of Synthesized and Inherited Attributes for Structural Constraints Figure 69: Attribute Parse Tree Illustrating Interaction Between Synthesized and Inherited Attributes Figure 70: NGAGE Grammar Illustrating Synchronization of Neural Structures Figure 71: Attributed Parse Tree Illustrating Synchronization using the "Filling" Mechanism Figure 72: Attributed Subtree Illustrating Synchronization using the "Ignoring" Mechanism Figure 73: Neural Topology Produced using Synchronization Figure 74: NGAGE Grammar Illustrating Use of Validity Mechanism to Minimize Semantic Redundancy Figure 75: NGAGE Grammar Illustrating Simplified Grammar Design Through Use of Validity Mechanism Figure 76: NGAGE Grammar Illustrating the Interaction of Synthesized and Inherited Attributes for Neural Structures Figure 77: NGAGE Grammar Illustrating Specialized Local Handling of Global Structures Figure 78: Attributed Parse Tree Illustrating Local Handling of Global Structures Figure 79: Recurrent Neural Topology With Connections of Differing Delay Value Figure 80: Construction of Grid with (a) Smaller Component Grids, (b) Desired Connectivity, and (c) Potential Connectivity using Unordered Sets xvii

18 Figure 81: NGAGE Grammar Illustrating Representation of Grid Neural Topology using Ordered Sets Figure 82: Growing a Grid using Ordered Set Operations Figure 83: NGAGE Grammar Illustrating Manipulation of Connections Equivalent to Cellular Encoding Figure 84: NGAGE Grammar Illustrating Multiple Pathways of Different Signal Types208 Figure 85: NGAGE Grammar Illustrating the Propagation of Consistent Signal Types.210 Figure 86: Neural Topology with Activation and Feedback Pathways Figure 87: NGAGE Grammar Illustrating Multiple Node Types with Different Transmission Properties Figure 88: NGAGE Grammar Illustrating Explicit Specification of Valid Signal Pathways Figure 89: Neural Topology Illustrating Feedback Pathways Specific to Neuron Type.215 Figure 90: NGAGE Grammar Illustrating Localized Sun-and-Planet Architecture for Back-Propagation Networks Figure 91: Simple NGAGE Module Figure 92: Nested NGAGE Module Figure 93: NGAGE Grammar Illustrating Combination of Module-Subgrammars to Form Hybrid Grammar Figure 94: NGAGE Grammar Illustrating Multiple Relationships Between Modules Figure 95: (a) Partial Context-Free Parse Tree and (b) Associated Modular Network with Multiple Module Types and Multiple Intermodular Relationships..228 xviii

19 Figure 96: NGAGE Grammar Illustrating Partitioning of Problem into Arbitrary Subtasks that are Solved by Distinct Modules Figure 97: NGAGE Productions Illustrating Non-Empty Partitions Figure 98: Network Illustrating Task Decomposition among Heterogeneous Modules.234 Figure 99: Network Illustrating Complex Modular Topologies Solving Decomposed Task Figure 100: NGAGE Grammar Illustrating Explicit Identification of Signal-Specific Module-Input and Module-Output Nodes Figure 101: NGAGE Grammar Illustrating the Representation of a CALM Module of Arbitrary Size Figure 102: Enhanced NGAGE Neuron Model Figure 103: Enhanced Neuron Representation of Perceptron Behaviour Figure 104: NGAGE Grammar Fragment Illustrating Concatenation of Function Strings to Form Neuron Specification Figure 105: XML Template Strings for Native Operators With Substitution Keywords 252 Figure 106: NGAGE Grammar Fragment Illustrating Completion of Neural Function Specification Using Templates and String Substitution Figure 107: NGAGE Grammar Fragment Illustrating Completion of Neuron Memory Specification Using Templates and String Substitution Figure 108: Attributed Parse Subtree Illustrating Completion of Neural Function and Memory Variable Specification Using Templates and String Substitution256 Figure 109: Attributed Parse Subtree Illustrating Function Nesting Figure 110: XML Function Specification Templates with Explicit Data Types xix

20 Figure 111: NGAGE Grammar Fragment Illustrating Manipulation of Parameters with Explicit Data Types Figure 112: NGAGE Grammar Illustrating Manipulation of Memory Variables with Explicit Data Types Figure 113: NGAGE Grammar Fragment that Incorporates Reserved Variables Figure 114: XML Specification of Environmental Function on Incoming Signals Figure 115: XML Specification of Environmental Function on Incoming and Outgoing Signals Figure 116: (a) XML Template String with Assumed Parameter Variable and (b) Simplified Grammar Production Figure 117: (a) XML Template String with Duplicate Parameter Keywords and (b) Simplified Grammar Production Figure 118: (a) XML Template String for Important Posting Behaviour using Assumed Out-Signal Memory Variable and (b) Associated Grammar Production.267 Figure 119: (a) XML Template String of Nested Function and (b) Associated Production Figure 120: (a) XML Template of Function Sequence and (b) Associated Production..270 Figure 121: Context-Free Productions of NGAGE Grammar for Neural Behaviours Figure 122: Parse Tree Representing an Unsupervised Instar Neuron Figure 123: (a) Grammar Productions and (b) Parse Tree that Generate Perceptron Activation Function Figure 124: NGAGE Grammar that Imposes Fixed Behavioural Form Upon Neurons..277 Figure 125: Productions that Impose Varying Degrees of Behavioural Assumptions xx

21 Figure 126: NGAGE Grammar Representing Both Topology and Behaviour Figure 127: XML Specification Strings for Memory Variables with Explicit Initialization Properties Figure 128: XML Specification Strings for Vector Memory Variables with Explicit Initialization Properties Figure 129: XML Template Strings for Variables and for Initialization Routines Figure 130: NGAGE Grammar Illustrating the Explicit Specification of Initialization Properties for Memory Variables Figure 131: NGAGE Grammar Illustrating Use of Inherited Attributes for Behavioural Homogeneity Figure 132: NGAGE Grammar Illustrating Interaction of Synthesized and Inherited Attributes for Shared Behavioural Properties Figure 133: XML Variable Specifications with Signal Types Figure 134: XML Specification of Reserved Variables with Signal Types Figure 135: NGAGE Grammar with Explicit Terminal Symbols for Assumed Signal Types Figure 136: (a) XML Template String with Assumed Variable Names and Unspecified Signal Types and (b) Associated Grammar Productions Figure 137: (a) XML Template for Generic Post Function and (b) Associated Grammar Production Figure 138: Signal Memory Templates Enhanced with Substitution Keywords for Signal Types xxi

22 Figure 139: (a) XML Template String with Unspecified Variable Names and Signal Types and (b) Associated Grammar Productions Figure 140: NGAGE Grammar Illustrating Perceptron Learning Rule Using Multiple Signal Types Figure 141: NGAGE Grammar Illustrating Explicit Specification of Control Structures302 Figure 142: Neural Control Structure Figure 143: NGAGE Grammar Illustrating Dependent Neural Control Structures Figure 144: Dependent Neural Control Structure Figure 145: NGAGE Grammar Illustrating Neural Behaviours of Sun-and-Planet Architecture for Back-Propagation Networks Figure 146: Context-Free Productions of an NGAGE Grammar for the Representation of Learning Rules with a Fixed Form Figure 147: NGAGE Grammar that Illustrates Neural Learning Rules that Explicitly use Control Signals Figure 148: NGAGE Grammar Illustrating Explicit Specification of Neural Control Structures Associated with Specific Modules Figure 149: NGAGE Grammar Illustrating the Use of Inherited Constraints for Consistent Neuron Behaviours within a Module Figure 150: Localized Architecture for Mixture of Experts Network Figure 151: NGAGE Grammar Illustrating Explicit Behaviour Relationships Between Component Modules of a Localized Mixtures of Experts Architecture..320 Figure 152: NGAGE Grammar for Parameterized Back-Propagation Networks xxii

List of Tables

Table 1: Experimental Results

Chapter 1 Introduction

1.1 Motivation

The field of artificial neural networks addresses the development, exploration and application of computational models derived from the principles of biological neural networks. These models typically define a family of related neural network architectures that share key topological and behavioural properties, but may vary in their details, such as size and internal structure. Since the resurgence of the field in the mid-1980's (Rumelhart and McClelland, 1986), a great number of neural network models have been proposed based upon a wide variety of design principles. Two conclusions may be made concerning the current state of the art. One is that the scalability of architectures within current models is generally poor. Most networks that are developed and tested are very small, especially in comparison to biological neural networks (e.g., 10's to 1000's of neurons in artificial networks versus billions of neurons in the human brain), and their learning capabilities diminish greatly as problem complexity and network size increase (Clark, 1995). A second conclusion is that the methods for the representation of neural networks are generally poor. Assumptions made when specifying current neural network architectures often vary significantly from model to model. Comparisons of the structure and behaviour of different network architectures can often be difficult due to these competing conventions of representation, and the creation of new architectures that extend and integrate useful principles of existing networks is often difficult due to the resulting lack of common techniques for identifying those principles. From these two conclusions, two relevant research needs are identified. The first is the need for methods that enable the consistent representation of large families of existing neural network architectures and facilitate theoretical comparisons among them. The second is the need for formal and computational techniques to guide the development of new models with improved scaling properties.

1.2 Foundations

The design of neural networks that exhibit good scaling has been addressed by many researchers in the field. One approach that has shown promise is the development of models exhibiting a high degree of modularity (Hrycej, 1992; Ronco et al., 1997). These modular models range from ad-hoc combinations of existing models, such as the counterpropagation network (Hecht-Nielsen, 1990), to building block models, such as the CALM network (Happel and Murre, 1992, 1994), to redundant and integration models, such as the Mixtures of Experts network (Jacobs et al., 1991a, 1991b). A second approach that has shown promise is the evolutionary optimization of the structure of a neural network (e.g., Jacob, 1994a, 1994b; Kitano, 1990). Not only the size of the network, but the internal arrangements of neurons and modules within the network may be explored (e.g.,

26 Gruau, 1994, 1995). From an analysis of current techniques that follow these approaches, seven design principles that contribute to the development of network models with good scaling properties may be identified: (1) [Hybrid systems] Neural networks may be combined with other adaptive techniques (e.g., evolutionary algorithms) to develop more effective neural solutions. (2) [Modular task decomposition] The use of modules and the specific arrangement of modules within a modular network is an important factor in the effectiveness of the resulting network. (3) [Heterogeneous modules] The use of different neural network architectures as modules within a single new architecture may produce more effective solutions to certain complex problems. (4) [Dynamic structure] The automatic adaptation of neural network structure enables discovery of solutions with appropriate size and more efficient structural resources. (5) [Dynamic functionality] The automatic adaptation of network functionality enables the most appropriate processing and learning mechanisms for a given problem to be used. (6) [Dynamic modular structure] The automatic adaptation of modular structure enables discovery of more effective task decompositions on generic problems. (7) [Dynamic modular functionality] The automatic adaptation of the processing and learning behaviours of modules within a modular network has not been explored but may enable better specialization of modules upon specific subtasks. 3

27 In the field of neural networks, there is a basic form of neural network representation - that of interconnected nodes with arbitrary local functionality sending signals to other nodes on connections with arbitrary transmission characteristics. All pure neural network models may be described solely using such a representation, but with numerous variations in the details. However, many researchers specify their models using algorithmic components that are not described in this basic nodal terminology. This is particularly true of the learning rules of network models, and such models require a translation, if possible, into the basic nodal form to ensure comparable analyses of architectural properties (Hecht-Nielsen, 1990). Other than the basic nodal representation, few alternative or complementary representation methods for neural networks have been suggested. Recently, though, researchers addressing the evolution of neural networks have developed a variety of genetic descriptions of neural networks. This work has led to several novel methods for representing neural networks, such as cellular encoding (Gruau, 1994). From an analysis of current representation techniques, five design principles that contribute to the development of representations that are robust and widely applicable may be identified: (1) [Clear assumptions] Any representation of neural network models will necessarily make certain assumptions. The clear identification of all assumptions made by a given representation is a necessary basis for a robust specification framework. (2) [Explicitness] The explicit representation of neural characteristics, both topological and behavioural, enables meaningful manipulation of those characteristics within a given neural network model. 4

(3) [Topological variety] The capability for specifying a variety of topological constraints, including modular constraints, enables the representation of a wide family of neural architectures. (4) [Behavioural variety] The capability for specifying a variety of neural behaviours enables the specification of a wide family of neural architectures. (5) [Consistency] The consistent use of the same representations for the same neural structures and/or behaviours, as well as a common basis for the interpretation of those representations, facilitates comparisons of the similarities and differences between different neural network models.

1.3 Research

A novel representation technique for the specification of neural network architectures using attribute grammars (Alblas, 1991; Bochmann, 1976; Knuth, 1968) is introduced to address the identified research needs. The technique, named the Network Generating Attribute Grammar Encoding (NGAGE), applies the twelve design principles above to accomplish these objectives. NGAGE (Hussain and Browse, 1998a, 1998b, 2000) builds upon prior research on context-free grammar representations of neural networks (Gruau, 1994, 1995; Jacob and Rehder, 1993) and on strongly-typed and grammar-based evolutionary algorithms (Montana, 1993, 1995; Ross, 2001; Whigham, 1995, 1996; Wong and Leung, 1997). An attribute grammar consists of a context-free grammar in which each symbol in a grammar production has a set of attributes associated with it, and each production specifies how the attributes of its right hand symbols and/or left hand symbol are to be computed. Within NGAGE, the productions of an attribute grammar define a space of possible neural network architectures. The specification of a given neural network architecture is computed within the attributes of the grammar symbols. The context-free parse trees produced by the grammar are used as a genetic representation within a strongly-typed, grammar-based evolutionary algorithm to explore the space of architectures defined by the attribute grammar. It is anticipated that the NGAGE technique will be a valuable aid to neural network researchers in the development and exploration of novel neural network architectures. In particular, seven hypotheses are made:

Hypothesis 1: NGAGE may be used to explicitly specify the topology and behaviour of the neural network architectures that comprise a neural network model.
Hypothesis 2: Functional neural networks capable of learning may be generated from an NGAGE representation.
Hypothesis 3: NGAGE may explicitly encode neural network modules with varied structure and behaviour.
Hypothesis 4: Existing neural network architectures may be represented within an NGAGE system.
Hypothesis 5: The class of neural networks represented by a given NGAGE representation may be automatically explored using genetic search.
Hypothesis 6: NGAGE enables the integration of multiple models and facilitates the systematic exploration of variations to a model.
Hypothesis 7: NGAGE may be used to evolve solutions to many classes and sizes of problems.
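The attribute grammar concept introduced above can be made concrete with a small sketch. The Python code below is illustrative only; the Node class, the evaluate function and the attribute names are hypothetical and are not the NGAGE formalism or code from the thesis. It evaluates synthesized attributes over a parse tree for the classic binary-number grammar (cf. Figure 43); NGAGE applies the same kind of attribute evaluation, together with inherited attributes, to compute a neural network specification within the attributes of the grammar symbols.

```python
# Illustrative sketch only: synthesized-attribute evaluation over a parse tree
# for the grammar  L -> L B | B,  B -> '0' | '1',  with attributes 'val' and 'len'.
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str                                   # grammar symbol, e.g. "L" or "B"
    children: list = field(default_factory=list)
    attrs: dict = field(default_factory=dict)     # synthesized attributes

def evaluate(node: Node) -> None:
    """Evaluate attributes bottom-up, as specified by each production."""
    for child in node.children:
        evaluate(child)
    if node.symbol == "B":                        # B -> '0' | '1'
        node.attrs["val"] = int(node.children[0].symbol)
        node.attrs["len"] = 1
    elif node.symbol == "L":
        if len(node.children) == 1:               # L -> B
            node.attrs.update(node.children[0].attrs)
        else:                                     # L -> L B
            left, bit = node.children
            node.attrs["val"] = 2 * left.attrs["val"] + bit.attrs["val"]
            node.attrs["len"] = left.attrs["len"] + 1

def make_bit(ch):
    return Node("B", [Node(ch)])

# Parse tree for the string "101"
tree = Node("L", [Node("L", [Node("L", [make_bit("1")]), make_bit("0")]), make_bit("1")])
evaluate(tree)
print(tree.attrs)   # {'val': 5, 'len': 3}
```

In NGAGE the attribute values are not numbers but fragments of a network specification, yet the division of labour is the same: the context-free productions fix the shape of the parse tree, and the attribute rules attached to each production compute the specification.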

30 In Chapter 2, descriptions of the key research areas and existing research techniques that form the basis for the NGAGE technique are presented. In Chapter 3, the strengths and weaknesses of current modular neural network and genetic representation techniques are discussed, and the basic NGAGE approach is introduced. In Chapters 4 and 5, the capabilities of NGAGE for the representation of neural network architectures are developed through the introduction of design practices that, together, apply and reinforce the key design principles above. The resulting properties of NGAGE representations are discussed in the context of both novel and existing neural network models. Chapter 4 focuses upon the representation of network structures and Chapter 5 focuses upon the representation of network behaviours. In Chapter 6, a novel evolutionary algorithm for effective exploration of NGAGE representations is introduced, and empirical evidence of the performance of the algorithm is presented. In Chapter 7, the general properties of the NGAGE technique are summarized, and useful avenues for future study and development are discussed. 7

Chapter 2 Background

2.1 Neural Networks

The field of artificial neural networks examines the design and benefits of information processing models that are based, to varying degrees, upon the physiological characteristics of the neural systems of natural organisms. The reader is assumed to have some familiarity with the field, though a brief introduction is presented here to introduce terminology. The reader is referred to Haykin (1994) or Arbib (1995) for a comprehensive treatment of the field. Hundreds of neural network models have been proposed (Arbib, 1995), yet no encompassing formal definition of a neural network has been developed. Haykin (1994) defines a neural network to be a massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in two respects: 1. Knowledge is acquired by the network through a learning process. 2. Interneuron connection strengths known as synaptic weights are used to store the knowledge. (p. 2) Arbib (1995) notes that there are radically different types of neurons in the human brain and endless variation

32 in neuron types of other species In neural computation (technology based on networks of neuron-like units), the artificial neurons are designed as variations on the abstractions (usually the simpler ones) of brain theory There is no such thing as a typical neuron. (p. 4, sic) Researchers of neural networks often adopt slightly different approaches as to what constitutes a neural network. The following working definition is adopted. A neural network is a processing device that contains a number of smaller processing units known as neurons or nodes. These nodes communicate with each other by transmitting signals along connections between the nodes. A neural network thus has a topological aspect and may be visualized as a graph. Each node receives information from a set of input connections. A node contains internal memory variables, the most common of which being a weight value that is assigned to each incoming connection. Each node may perform internal manipulations based upon the signals it has received, such as the application of a transfer function and/or the modification of internal memory variables. Finally, each node may transmit new information on a set of output connections. Most commonly, the identical signal is transmitted on all outgoing connections. A neural network architecture is a complete description of the network topology and the information processing characteristics of its constituent nodes. A neural network model may specify a family of possible architectures. Each architecture represents a set of possible neural network states, based upon the set of possible values for all memory variables in the network. A pure neural network architecture is one in which each node is informationally and functionally encapsulated. No memory is shared between nodes; nodes have no access to the internal particulars of other nodes; there are no global 9

variables or functions that apply across multiple nodes; and all computation in the network is the result of the computations performed within nodes. In this document, the treatment of neural network models is limited to those that follow a discrete-time processing approach. In a discrete-time network model, information is transmitted along connections at discrete time intervals. Each node processes all the signals it receives in a given time step and transmits its output for the next time step.

2.1.1 Activation Processing

The functionality of a neural network always includes activation processing. Activation processing typically occurs as follows. Each connection can carry a single activation signal value. In a given discrete time-step, a connection delivers its activation signal to its destination node. Each node accepts a set of signals from its incoming connections. Each node then applies an activation transfer function to those signals to produce a single resultant activation signal. The execution of the activation transfer function is considered to occur in less than a single time step. The same resultant signal value is then placed on all the outgoing connections from that node. The activation function may access any information stored locally in the node, and often uses the weights associated with the incoming connections. The signals on connections are state variables. One cycle of activation processing is completed when all nodes in the network have completed their local transfer functions.

The perceptron network (Rosenblatt, 1962) was one of the first neural network models proposed. A single perceptron (see Figure 1) is a node that accepts a set of n activation signals x_i, i = 1..n, contains a memory variable known as a weight associated with each connection, w_i, i = 1..n, and also contains a single memory variable known as the threshold θ.

[Figure 1: Perceptron Node. Input activation signals x_1 ... x_n arrive on connections with weights w_1 ... w_n; the node stores a threshold θ and produces an actual output activation signal y, with d denoting the desired output activation signal.]

During each time step, a perceptron computes the weighted sum of its inputs. If that weighted sum is less than the threshold value, the perceptron outputs an activation value of -1, otherwise it outputs a value of +1. This signum function represents a non-linear decision and is summarized below, where y is the output signal of the node.

y = \begin{cases} +1 & \text{if } \sum_{i=1}^{n} w_i x_i - \theta \geq 0 \\ -1 & \text{otherwise} \end{cases} \qquad (2-1)

A perceptron network is a network of interconnected perceptron nodes. The most common topology is a layered one in which all nodes in a given layer provide input to all nodes in the following layer. The final layer of nodes provides the output of the network. Figure 2 illustrates a one-layer and a two-layer perceptron network. The called-out node illustrates that each perceptron node maintains distinct memory variables. It also gives the typical subscript numbering used to identify weights, with the first subscript referring to the destination node and the second subscript referring to the source node. In a multi-layer network, all layers except the output layer are often referred to as hidden layers since the outputs of their nodes are not directly perceived by the external environment.

[Figure 2: (a) One-Layer and (b) Two-Layer Perceptron Topologies. In (a), the input pattern x_1, x_2, x_3 feeds output nodes y_4 ... y_7, with the called-out node showing its local memory variables (θ_4, w_41, w_42, w_43); in (b), the same inputs feed a hidden layer y_4 ... y_7, which in turn feeds output nodes y_8 and y_9.]

A one-layer network of perceptrons is theoretically capable of approximating any linearly separable function, given an appropriate selection of connection weights. This is a limited class of functions, though. For example, it cannot represent the simple exclusive-or (XOR) function. A two-layer network of perceptrons, on the other hand, is theoretically capable of approximating any piecewise-linear function, given an appropriate selection of hidden nodes and connection weights (Haykin, 1994, p. 182). This is a powerful result that makes multi-layer networks a useful tool for pattern analysis and function approximation. However, the question remains as to how the appropriate weight values are determined.
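As a concrete illustration of the node-and-connection view just described, the following minimal sketch (Python; the class and function names and the AND example are hypothetical, and this is not code from the thesis) implements the signum rule of (2-1) and a layered feed-forward pass of the kind shown in Figure 2.

```python
# Illustrative sketch: a perceptron node and one cycle of activation processing.

class PerceptronNode:
    def __init__(self, weights, threshold):
        self.weights = list(weights)     # one weight per incoming connection
        self.threshold = threshold       # local memory variable (theta)

    def activate(self, inputs):
        """(2-1): +1 if the weighted sum reaches the threshold, else -1."""
        weighted_sum = sum(w * x for w, x in zip(self.weights, inputs))
        return 1 if weighted_sum - self.threshold >= 0 else -1

def forward(layers, input_pattern):
    """Every node in a layer reads the signals produced by the previous layer
       and places the same resultant signal on all its outgoing connections."""
    signals = list(input_pattern)
    for layer in layers:
        signals = [node.activate(signals) for node in layer]
    return signals

# A two-input, one-node network that realizes logical AND on {-1, +1} inputs.
and_node = PerceptronNode(weights=[1.0, 1.0], threshold=1.5)
print(forward([[and_node]], [+1, +1]))   # [1]
print(forward([[and_node]], [+1, -1]))   # [-1]
```

The AND example shows one hand-chosen setting of weights and threshold realizing a linearly separable function; finding such values automatically is the role of the learning rules introduced next.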

2.1.2 Hebbian Learning

The functionality of a neural network usually includes learning. Learning involves making changes to the values of the memory variables of the network, particularly the weight values, in order to improve the network's future performance. Hebb's learning law (Hebb, 1949) is the oldest and most famous of all learning rules (Haykin, 1994, p. 49). Hebb's learning law reinforces correlated activity between two neurons. For example, when one neuron provides positive activation to a second neuron, and that neuron in turn produces positive activation, then the strength of the connection (i.e., weight) between those two neurons is increased. The Hebbian learning rule may be summarized as:

w_{ji} = w_{ji} + \lambda y_j x_i \qquad (2-2)

where 0 < λ ≤ 1 is a small value known as the learning rate parameter.

2.1.3 Perceptron Learning

The perceptron network is a network of layers of perceptron neurons, and was one of the first network models to incorporate supervised learning. In a supervised learning approach, it is known a priori what is the correct output pattern corresponding to each given input pattern. After the network is presented with an input pattern and the output nodes have produced their signals, the actual response y_j and the desired correct response d_j for each output node j may be identified. Each node then updates its weights w_{ji} and threshold value based upon the error-correction learning rule,

w_{ji} = w_{ji} + \lambda [d_j - y_j] x_i \qquad (2-3)

\theta_j = \theta_j + \lambda [d_j - y_j] \qquad (2-4)

where 0 < λ ≤ 1 is the learning rate, the difference d_j - y_j represents the error of the node and may hold the values {-1, 0, +1}, and x_i is the value of the input signal received by node j on the connection corresponding to w_{ji}. The effect of the learning rule is to update each weight by a small amount in the direction of the error made by the node and proportional to the magnitude of the contribution of the corresponding connection. The threshold value is updated by a small amount in the direction of the error made by the node. The learning rate determines the scale of the changes made to each weight value in each time step. After the network has performed learning on a given input pattern, if the network is presented with the same input pattern again, it will tend to produce an output pattern that is slightly closer to the desired one.

2.1.4 Back-Propagation Learning

The perceptron learning rule can only be applied to the output layer of a perceptron network, and cannot be used to adapt the weights for the hidden layers of nodes. The back-propagation learning rule was developed to rectify this deficiency (Rumelhart, Hinton and Williams, 1986). The back-propagation rule is typically applied to multi-layer networks of nodes that have real-valued output. Instead of using a hard threshold function to determine a binary output as is done in a perceptron, a sigmoid function is used to effect a non-linear soft threshold. The transfer function of a sigmoid node is:

y = \frac{1}{1 + e^{-\left( \sum_{i=1}^{n} w_i x_i - \theta \right)}} \qquad (2-5)
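A minimal sketch of the machinery just described, assuming a single node with hypothetical names and training data (this is not code from the thesis): it applies the error-correction updates of (2-3) and (2-4) as stated, and includes the sigmoid transfer of (2-5) used by back-propagation nodes.

```python
# Illustrative sketch: perceptron error correction and the sigmoid soft threshold.
import math

def signum_output(weights, threshold, inputs):
    """(2-1): hard-threshold output in {-1, +1}."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) - threshold >= 0 else -1

def perceptron_update(weights, threshold, inputs, desired, rate=0.1):
    """(2-3) and (2-4): move each weight and the threshold a small amount
       in the direction of the node's error d - y (signs follow the equations as given)."""
    actual = signum_output(weights, threshold, inputs)
    error = desired - actual
    weights = [w + rate * error * x for w, x in zip(weights, inputs)]
    threshold = threshold + rate * error
    return weights, threshold

def sigmoid_output(weights, threshold, inputs):
    """(2-5): the non-linear soft threshold of a back-propagation node."""
    net = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return 1.0 / (1.0 + math.exp(-net))

# Repeated presentation of a single training pair nudges the node toward the
# desired response, as described in the text.
w, theta = [0.0, 0.0], 0.0
for _ in range(10):
    w, theta = perceptron_update(w, theta, inputs=[1, -1], desired=-1)
print(signum_output(w, theta, [1, -1]))   # -1
```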

The back-propagation rule adjusts weights in a manner similar to the perceptron rule, but based upon the derivative of the error that was made, not upon the error itself. This uses the method of gradient descent to move the weights in the direction opposite to the instantaneous gradient of the error. All nodes apply the following weight update function:

w_{ji} = w_{ji} + \lambda \delta_j o_i \qquad (2-6)

\theta_j = \theta_j + \lambda \delta_j \qquad (2-7)

where δ_j is the local gradient of the node j, and o_i is the incoming signal received by the node j on the connection corresponding to w_{ji}. Note that for the output nodes, o_i is actually the output of a hidden node y_i, while for hidden nodes, o_i is a network input signal x_i. The computation of the gradients for the output nodes is based straightforwardly upon the derivative of the error between the desired output and the actual output. The gradient of the hidden nodes, however, is based upon the aggregate contribution a given hidden node made to the errors of all the output nodes.

\delta_j^{output} = y_j [1 - y_j] [d_j - y_j] \qquad (2-8)

\delta_j^{hidden} = y_j [1 - y_j] \sum_k \delta_k^{output} w_{kj} \qquad (2-9)

where the summation is performed over all output nodes k.

2.1.5 Instar Processing and Learning

An instar node (Grossberg, 1982) is similar to a perceptron node in that it contains a memory variable known as a weight associated with each incoming connection, and produces a single output based upon the weighted sum of the input activation signals (see Figure 3).

[Figure 3: Instar Node. Input activation signals x_1 ... x_n arrive on connections with weights w_1 ... w_n; the node stores an internal constant C and produces an actual output activation signal y, with d denoting the desired output activation signal.]

However, unlike a perceptron, the instar activation output is determined by a linear equation:

y = C \sum_{i=1}^{n} w_i x_i \qquad (2-10)

where C is an internal constant. The instar learning rule updates the weight w_i on each of the node's incoming connections using the equation:

w_i = \begin{cases} w_i + \lambda [x_i d - w_i] & \text{if } x_i > 0 \\ w_i & \text{otherwise} \end{cases} \qquad (2-11)

where λ > 0 is a small learning rate. The key aspect of the instar learning rule is that for each input element, the corresponding weight is updated only if that element is non-zero. As learning proceeds, each weight w_i approaches the value \overline{x_i d}, which denotes the time average of the product x_i d during those times when x_i > 0 (Hecht-Nielsen, 1990, p. 74). In other words, each weight learns the average importance its input element has on the instar's output. For example, after training, inputs that tend to be high in value when the output should be high will have large weights, while inputs that tend to be low in value when the output should be high will have small weights. An instar thus learns to respond to the presence of a particular feature in a given input pattern. As with the perceptron learning rule, instar learning is limited to a single layer of nodes. The instar node may also be used as an unsupervised structure by letting the desired output value be the actual activation output of the node (i.e., d = y) (Hecht-Nielsen, 1990).
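The instar equations can be illustrated with a short sketch (Python; the class name, learning rate and training signals are hypothetical, and this is not code from the thesis) implementing the linear activation of (2-10) and the conditional weight update of (2-11).

```python
# Illustrative sketch: instar activation and learning.

class InstarNode:
    def __init__(self, weights, constant=1.0, rate=0.1):
        self.weights = list(weights)
        self.constant = constant      # the internal constant C
        self.rate = rate              # learning rate lambda

    def activate(self, inputs):
        """(2-10): a linear weighted sum scaled by C."""
        return self.constant * sum(w * x for w, x in zip(self.weights, inputs))

    def learn(self, inputs, desired):
        """(2-11): move w_i toward x_i * d, but only where x_i is non-zero."""
        self.weights = [
            w + self.rate * (x * desired - w) if x > 0 else w
            for w, x in zip(self.weights, inputs)
        ]

# With repeated presentations, each active weight drifts toward the time
# average of x_i * d, so the node comes to respond to a recurring feature.
node = InstarNode(weights=[0.0, 0.0, 0.0])
for _ in range(50):
    node.learn(inputs=[1.0, 0.0, 0.5], desired=1.0)
print([round(w, 2) for w in node.weights])   # approaches [1.0, 0.0, 0.5]
```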

2.1.6 Feedback Processing

Networks with learning usually involve the processing of feedback signals. A simple perceptron network requires that the external environment provide feedback signals regarding the desired response for each output node. More complex learning networks also involve the processing and transmission of feedback signals among nodes. For example, the multi-layer back-propagation neural network model requires that nodes exchange information concerning the error gradients. In keeping with the notion of a pure neural network model, feedback processing should also involve only local computations based upon locally received and locally stored information. One drawback to many learning network models is that the activation processing mechanism is specified in neural terms and the feedback processing and learning mechanisms are presented algorithmically as global mechanisms. For example, the above presentation of the back-propagation network is a classic one (see Haykin, 1994; Widrow and Lehr, 1995; Rumelhart et al., 1986). However, the computation of the local gradients for the hidden nodes in (2-9) requires access to the weight memory values stored at the output nodes. The nodes are thus not informationally encapsulated. The early forms of the back-propagation architecture were, in fact, not neural networks. They violated the restriction that all of the processing that occurs within a processing element must be localized. (Hecht-Nielsen, 1990, p. 128)

[Figure 4: (a) Original and (b) Hecht-Nielsen Formulations of Back-Propagation Network. In (b), each node of the original formulation is replaced by a sun node (holding θ) and its attendant planet nodes (each holding a single weight w), with separate feedforward and feedback paths between them.]

Specification of the back-propagation architecture such that it adheres to a strict model of neural computation requires several additional neural structures. Hecht-Nielsen (1990) proposes an architectural variation of back-propagation in which a single node in the original formulation is replaced by a sun node and several planet nodes. Figure 4 illustrates the difference between the original and Hecht-Nielsen formulations. The large bold nodes are the sun nodes and the small light nodes are the planet nodes. Each planet node acts as a storage device for a single weight value. Each single feedforward connection in the original formulation is replaced by two feedforward connections in Hecht-Nielsen's formulation: one from a source sun node/network input element to a planet node and one from the planet node to the destination sun node. The planet node acts to modulate the activation signal it receives by the weight value it stores (2-13). The sun node performs a summation of the signals it receives from its planet nodes and applies a sigmoid function to produce an output activation signal (2-12). Together, the activation functions of the sun and planet nodes combine to implement (2-5).

y_j^{sun} = \frac{1}{1 + e^{-\left( \sum_i z_{ji}^{planet} - \theta_j \right)}} \qquad (2-12)

z_{ji}^{planet} = w_{ji}^{planet} x_i \qquad (2-13)

The benefit of the Hecht-Nielsen structural approach becomes apparent when the feedback paths are incorporated. Each sun node has a feedback connection to each of its attendant planet nodes. Along these connections, the sun node transmits the same feedback value (2-14). Each planet node has a feedback connection to the source sun node of the corresponding feedforward connection associated with that planet node. The planet node acts to modulate the feedback signal it receives by the weight value it stores (2-15). Together, the feedback functions of the sun and planet nodes combine to compute the weighted feedback signal of (2-9) using purely local operations (i.e., a sun node simply sums all incoming feedback signals with no knowledge of the internal weights of the source node).

\delta_j^{sun} = y_j^{sun} [1 - y_j^{sun}] \sum_k \phi_{kj}^{planet} \qquad (2-14)

\phi_{ji}^{planet} = w_{ji}^{planet} \delta_j^{sun} \qquad (2-15)

Finally, each planet node also receives the activation signal of its sun node so that it can execute the learning rule of (2-6). Learning rule (2-7) is applied by the sun node.

\theta_j^{sun} = \theta_j^{sun} + \lambda \delta_j^{sun} \qquad (2-16)

w_{ji}^{planet} = w_{ji}^{planet} + \lambda \delta_j^{sun} x_i \qquad (2-17)

where the learning rate has the same value in a sun node and its attendant planet nodes.
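To make the locality of the sun-and-planet formulation concrete, the sketch below (Python; the class names, parameter values and target are hypothetical, and this is not code from the thesis) exchanges activation and feedback values as explicit messages between a sun node and its planet nodes, following (2-12) through (2-17). Each object reads only its own memory and the signals delivered to it, which is exactly the informational encapsulation the original formulation lacked.

```python
# Illustrative sketch: local activation, feedback and learning in the
# Hecht-Nielsen sun-and-planet formulation of back-propagation.
import math

class PlanetNode:
    """Stores a single weight and modulates the signals that pass through it."""
    def __init__(self, weight):
        self.weight = weight
        self.last_input = 0.0

    def forward(self, x):                 # (2-13): z = w * x
        self.last_input = x
        return self.weight * x

    def feedback(self, delta_sun):        # (2-15): phi = w * delta_sun
        return self.weight * delta_sun

    def learn(self, delta_sun, rate):     # (2-17): w = w + rate * delta_sun * x
        self.weight += rate * delta_sun * self.last_input

class SunNode:
    """Stores only its threshold and sums whatever its planets deliver."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.output = 0.0

    def forward(self, planet_signals):    # (2-12): sigmoid of the summed signals
        net = sum(planet_signals) - self.threshold
        self.output = 1.0 / (1.0 + math.exp(-net))
        return self.output

    def gradient(self, feedback_signals): # (2-14): delta = y (1 - y) * sum(phi)
        return self.output * (1.0 - self.output) * sum(feedback_signals)

    def learn(self, delta, rate):         # (2-16): theta = theta + rate * delta
        self.threshold += rate * delta

# One sun with two planets: a full cycle of local activation, feedback, learning.
sun = SunNode(threshold=0.1)
planets = [PlanetNode(0.5), PlanetNode(-0.3)]
inputs = [1.0, 0.8]
y = sun.forward([p.forward(x) for p, x in zip(planets, inputs)])
delta = sun.gradient([1.0 - y])                      # output sun: environment supplies d - y
upstream_phi = [p.feedback(delta) for p in planets]  # (2-15): messages toward source suns
for p in planets:
    p.learn(delta, rate=0.1)
sun.learn(delta, rate=0.1)
```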

2.1.7 Structural Learning

In addition to learning algorithms based upon local computations in the nodes, researchers have proposed several global learning mechanisms that may be used to add and/or remove nodes and/or connections from the network with the goal of discovering the smallest network that is capable of solving the problem sufficiently well. The number of elements, particularly connections, in a network often determines the computational efficiency of the network. The reader is referred to Ash and Cottrell (1995), Haykin (1994) and Smieja (1993) for summaries of these mechanisms. Structural learning techniques are based upon one of two approaches. Growing algorithms start with a small network topology and add new nodes or weights. This often involves first training a network with a small topology using normal neural learning rules. If the performance of that network is insufficient, then new hidden nodes or new layers of hidden nodes are added and the new network topology is trained further. This cycle repeats until a sufficiently good network is discovered. The most popular growing method is cascade-correlation (Fahlman and Lebiere, 1990). In cascade-correlation, new nodes are added to the network structure one at a time. The new node is a hidden node that has incoming connections from all the inputs and from all existing hidden nodes. The weights on those connections are set to values that will on average reduce the error of the output unit (Smieja, 1993), and with each new node, the network's performance improves. Pruning algorithms involve first training a sufficiently good network topology. Connection weights are then removed from the network in an attempt to minimize the number of network elements. The Optimal Brain Surgeon is the most popular pruning method (LeCun, Denker and Solla, 1990; Hassibi, Stork and Wolff, 1993). It involves calculating an approximation to the actual effect on error of deleting each weight from the network (Ash and Cottrell, 1995, p. 993). Weights with the least impact are removed until further removal would reduce the network performance to unsatisfactory levels. For a variety of network models, pruning algorithms may be applied after normal training to improve the computational efficiency of the resulting networks. These structural learning mechanisms are not true neural mechanisms since they involve higher-level, global analysis of network performance and network elements. However, they are widely used in the field to develop efficient, practical neural solutions.

2.2 Evolutionary Algorithms

2.2.1 Search Algorithms

Consider a problem in which a number of possible solutions are available, there exists some evaluation function for determining how good each particular solution is, and

45 the goal is to find the best possible solution. This is a classic search problem, and the set of possible solutions defines a search space. The solution or solutions that have the highest value are optimal. All other solutions are sub-optimal. An ideal search algorithm will find the optimal solution in a reasonable amount of time. For a search space with only a small number of possible solutions, all the solutions can usually be examined in a reasonable amount of time and the optimal one found. This exhaustive search, however, quickly becomes impractical as the search space grows in size. A practical search algorithm seeks to find the best solution that it can within a certain amount of time. The result may be sub-optimal. Traditional search algorithms randomly sample or heuristically sample the search space one solution at a time in the hopes of finding the optimal solution. For example, a gradient-descent neural network learning rule, such as back-propagation, is a heuristic search algorithm. It specifies how the weights of the network change over time in response to the inputs provided to the network. The changes made during the training of a network may be regarded as a search through the space of possible network states to find the best-performing network configuration. During each phase of the search, only one possible configuration, or point in the search space, is considered. The next configuration is obtained based upon changes to the current one. Searches such as gradient descent are often referred to as iterative improvement algorithms. (Russell and Norvig, 1995) Network models with structural learning mechanisms also fall into this category. A particular search space and a particular evaluation function define a search landscape. The extra dimension added to the search space by the evaluations can be interpreted in terms of hills, valleys and other geographical features. For simplicity of 22

visualization, consider a neural network with a single real-valued weight that may range in value from -1 to +1 and an evaluation function that is the network's classification performance on a test set of data. Figure 5 illustrates several possible features that may exist in the resulting search landscape. Points a, c, e and h are the tops of hills and represent points in the landscape that are better than the immediately surrounding points. Higher values mean better solutions, so the solution that is at the top of the highest hill, c, is the global maximum. The tops of all other, lower, hills are local maxima. Points b, d and g are the bottoms of valleys and represent points in the landscape that are worse than the immediately surrounding points. Point f is the middle of a plateau and represents a point in the landscape that is indistinguishable in performance from all nearby points.

[Figure 5: Search Landscape Features. Network performance (% correct, from 0% to 100%) is plotted against the weight value (from -1 to +1); the labelled points a through h mark the hill tops, valley bottoms and plateau described above.]

The successive actions of an iterative improvement search algorithm can be viewed as moves around the landscape. For example, if a gradient descent algorithm starts with an initial configuration that is on the slope of a hill, it will typically move up the slope

until it reaches the top. However, since each successive move is based upon the previous position, it is quite possible that once the search reaches the top of a local maximum (i.e., a small hill), it may not be able to make a drastic enough move to get onto the slopes of a higher hill. This is known as a hill-climbing effect and is a serious drawback of many iterative improvement search algorithms (Russell and Norvig, 1995). Another hill-climbing effect is getting stuck on a plateau. This is an area of the landscape that is flat, and the algorithm may be unable to move far enough to get off the plateau. Some iterative improvement algorithms, in particular gradient descent, are regarded as strongly-biased search techniques. The solutions that are discovered by the search are highly dependent upon the initial starting point, since the search algorithm makes only small moves through the space at any given time and tends to remain relatively near the initial point.

2.2.2 Biological Evolution

The reader is assumed to have some familiarity with the theories of biological evolution, though a brief introduction is presented here to introduce principles discussed later in the document. In 1801, the naturalist Jean Baptiste Lamarck proposed the first popular theory of evolution, the doctrine that all species, including man, are descended from other species. (Darwin, 1859, p. 8) Nature has produced all the species of animals in succession, beginning with the most imperfect or simplest, and ending her work with the most perfect, so as to create a gradually increasing complexity in their organisation. (Lamarck, as reprinted in Belew and Mitchell, 1996, p. 56) Two of the key points of Lamarck's theory are that during its lifetime, an individual organism changes its characteristics, through use or disuse, in response to the demands of its environment; and

48 that these changes are subsequently preserved by reproduction and passed on to the offspring of that organism. Lamarck named these the first and second laws of nature, which lead, over many generations of successive changes, to the development of new races and, eventually, species of plants and animals. In 1859, Charles Darwin extended the principle of evolution with his theory of natural selection and its principle of the survival of the fittest. As did Lamarck, Darwin proposed that the plants and animals that exist today are the result of millions of years of adaptation to the demands of the environment. Darwin also proposed the key notion that at any given time, a species contains a number of individuals with slight variations in their characteristics, and that these organisms co-exist and compete for the same resources in an ecosystem. The organisms that are most capable of acquiring resources and successfully procreating are the ones whose descendants will tend to be numerous in the future. Organisms that are less capable, for whatever reason, will tend to have few or no descendants in the future. The former are said to be more fit than the latter, and the distinguishing characteristics that caused the former to be more fit are said to be selected for over the characteristics of the latter. Over time, the entire population of the ecosystem is said to evolve to contain organisms that, on average, are more fit than those of previous generations of the population because they exhibit more of those characteristics that tend to promote survival. It may metaphorically be said that natural selection is daily and hourly scrutinizing, throughout the world, the slightest variations; rejecting those that are bad, preserving and adding up all that are good; silently and insensibly working, whenever and wherever opportunity offers, at the improvement of each organic being in relation to its organic and inorganic conditions of life. (Darwin, 1859, p. 84) In keeping with the 25

49 available evidence and naturalist theories of his time, Darwin supported Lamarck s second law that acquired characteristics may be passed on to offspring, but viewed it as a distinct evolutionary process secondary to natural selection. We may conclude that habit, or use and disuse, have, in some cases, played a considerable part in the modification of the constitution and structure; but that the effects have often been largely combined with, and sometimes overmastered by, the natural selection of innate variations (Darwin, 1859 p. 136). The classic example used in the literature to distinguish between Lamarckian inheritance and Darwinian natural selection of characteristics concerns the long neck of the giraffe. A Lamarckian view is that a horse-like creature once stretched its neck to reach higher leaves on the trees. That acquired stretched neck was passed on to its offspring, some of which stretched even further to reach even higher leaves. Through a gradual process of such changes, the modern-day long-necked giraffe evolved. By contrast, a Darwinian view is that once a horse-like creature was born with a slightly longer neck. This proved to be an advantage in life since it could reach more food than its competitors. Some of its offspring were endowed with the longer neck characteristic as well, and over time their descendants born with longer and longer necks tended to survive better. Darwin and Lamarck both developed their theories before the discovery of genetics. In biological systems, there is a well-founded distinction between the genotype of an individual and the expressed phenotype of that individual. The genotype is a description of the basic structure of an individual, and for all living things on earth is specified by the information contained in nucleic acids, largely as chromosomes composed 26

50 of deoxyribonucleic acid (DNA). (Koza, 1994, p. 429). The information in a DNA molecule can be viewed as a character string over a four-character alphabet representing the four nucleotide bases, namely adenine (A), cytosine (C), guanine (G), and thymine (T) The genome of a biological individual is the sequence of nucleotide bases along the DNA of all its chromosomes. (Koza, 1994, p. 429) The phenotype is the actual embodiment of that description, namely the organism in nature. It is the phenotype that competes for survival and propagation. During the lifetime of most organisms, the organisms undergo various changes in response to their environment and thus adapt their characteristics over time. Their DNA, however, does not change. The process of evolution involves the propagation of the characteristics of parents to their children. Research on evolutionary genetics has shown that in biological systems, Lamarck s theory that acquired characteristics are inherited is wrong, and that Darwin s theory of natural selection is strongly supported, if the theory is logically updated to state that evolution selects for individuals based upon how fit the adapted phenotypes are, but creates new individuals based upon the unchanged genotypes. In many organisms, though, it is true that organisms acquire new characteristics and behaviours during their life that are not specified genetically. A remaining question is whether the evolutionary process is at all affected by the adaptation of the phenotypic characteristics during life. In 1896, J. Mark Baldwin proposed that although phenotypic adaptations are not directly passed on to offspring, they do influence the genetic characteristics that evolve in a population. This has come to be known as the Baldwin effect. Baldwin defines the term organic selection to refer to all adaptations that an individual makes to its behaviour and structure during lifetime in response to the demands of the environment. Through organic 27

51 selection, the organism ensures that it is kept alive during special circumstances and changes in its environment. By undergoing modifications of their congenital functions or of the structures which they get congenitally these creatures will live; while those which cannot, will not. (Baldwin, as reprinted in Belew and Mitchell, 1996, p. 62) In turn, organic selection ensures the survival and propagation of the genetic variations that do not directly specify a necessary adaptation but are close enough in the direction of an adaptation to allow the necessary modifications to be made during life. Over time, natural selection will operate upon these genetic variations that are close enough to produce an innate adaptation. Congenital variations, on the one hand, are kept alive and made effective by their use for adaptations in the life of the individual; and, on the other hand, adaptations become congenital by further progress and refinement of variation in the same lines of function as those which their acquisition by the individual called into play. (Baldwin, as reprinted in Belew and Mitchell, 1996, p. 64). Organic selection is a qualification of a positive kind that opens a new sphere for the application of the negative principle of natural selection upon organisms, i.e., with reference to what they can do, rather than to what they are; to the new use they make of their congenital functions, rather than to the mere possession of the functions. (Baldwin, as reprinted in Belew and Mitchell, 1996, p. 77) Evolutionary Computation Evolutionary computation techniques (Fogel, Owens and Walsh, 1966; Rechenberg, 1973; Holland, 1975; Koza, 1992, 1994; Angeline, 1996a; Fogel, 1998) abstract these evolutionary principles into algorithms that may be used to search for 28

52 optimal solutions to a problem. In a typical evolutionary algorithm, a genetic representation scheme is chosen by the researcher to define the set of solutions that form the search space for the algorithm. An individual solution in the space has a specific representation. A number of individual solutions are created to form an initial population. The following steps are then repeated iteratively until a solution has been found which satisfies a pre-determined termination criterion. Each individual is evaluated using a fitness function that is specific to the problem being solved. Based upon their fitness values, a number of individuals are chosen to be parents. New individuals, or offspring, are produced from those parents using reproduction operators. The reproduction operators act upon the information available in the representations of the parents to produce new individuals consistent with the representation scheme. These new individuals may be radically different from, slightly different from, or even the same as the parents. The fitness values of the offspring are determined. Finally, survivors are selected from the old population and the new offspring to form the new population of the next generation. The mechanisms determining which and how many parents to select, how many offspring to create, and which individuals will survive into the next generation together represent a selection method. Many different selection methods have been proposed in the literature, and they vary in complexity. Most selection methods ensure that the population of each generation is the same size. The key aspect distinguishing an evolutionary search algorithm from traditional search algorithms is that it is population-based. Rather than moving from one point in the search space to another during each phase of the search, as is done in iterative improvement algorithms, a population-based search moves from a set of points to another 29

53 set of points. At any given time, the points in the set may be sampled from different areas of the search space. The degree of difference between individuals is a measure of the diversity of the population. While some points in a population may be at local minima or plateaus, the odds of all the points being equally bad are, generally, low. Further, the set of points that form the next generation may be very different depending upon the reproduction and selection operators that are used. Thus, through the adaptation of successive generations of a large number of highly diverse individuals, evolutionary algorithms tend not to suffer badly from hill-climbing effects. Evolutionary algorithms are often regarded as weakly-biased search algorithms. The final set of solutions that are discovered by the search may be quite different from the initial set of solutions, and a variety of initial sets may lead to a similar final set so long as basic genetic characteristics that result in good performance are present somewhere in the starting population Genetic Algorithms The most popular technique in evolutionary computation research has been the genetic algorithm (GA) (Holland, 1975). In a conventional genetic algorithm, the representation scheme uses a fixed-length character string to describe the genotype. The alphabet may be of any size, but is often binary. Such a character string is an abstract encoding of the features of an individual phenotype. Each position may represent how a single feature is expressed in the phenotype. For example, Figure 6(a) illustrates a genetic representation scheme for interpreting a bit string of length 6 as a drinking glass. Sometimes, several positions together may represent a single feature. For example, Figure 6(b) illustrates a genetic representation scheme for interpreting a bit string of length 6 as a 30

person's age and income level. An alternate representation scheme for the latter problem could use a string of length 2 based upon an 8-character alphabet. The most important part of the representation scheme is the mapping that expresses each possible point in the search space of the problem as a particular fixed-length character string (i.e., as a chromosome) and each such chromosome as a point in the search space of the problem. (Koza, 1994, p. 22)

[Figure 6: Genetic Algorithm Representation Schemes. (a) Each of six bit positions encodes one feature of a drinking glass: Wide/Narrow, Tall/Short, Handle/No Handle, Clear/Opaque, New/Old, Circular/Oval. (b) Two 3-bit fields encode an Age Category (in years) and an Income Bracket (in $1,000).]

The main reproduction operator used in GAs is one-point crossover. Two parents are selected from the population. A single position in the bit string is randomly selected as the crossover point. All information to the left of that position in one parent is combined with all information to the right of that position in the other parent to form one offspring, and a similar swap of the information to the right and left, respectively, is made to form a second offspring. Figure 7 illustrates the combination of two parents (a) and (b) to produce two children (c) and (d) by crossing over the last two bits. Another popular operator is one-bit mutation, in which a single random bit in the string is flipped to form a new offspring string. Figure 8 illustrates the mutation of the fifth bit of parent (a) to produce child (b).

A variety of other operators have also been developed, and may be distinguished by whether or not they introduce any new information into the population. For example, mutation does, but crossover does not since it simply recombines information already available in the parents. All operators are constrained to produce strings that are legal in the given representation scheme.

[Figure 7: One-Point Crossover Operator — parents (a) and (b) exchange the information on either side of the crossover point to produce children (c) and (d).]

[Figure 8: One-Bit Mutation Operator — a single bit of parent (a) is flipped to produce child (b).]

The traditional selection method used in GAs chooses individuals to be parents probabilistically based upon their fitness values. Thus, every individual in the population that has a non-zero fitness has some chance of becoming a parent, though the best individuals are most likely to be chosen. Once a reproduction operator has been applied, the offspring that result replace their parents in the next generation. In designing a GA to solve a particular problem, a fitness function must be designed which is capable of evaluating any string that is part of the representation scheme and which provides a meaningful distinction between good solutions and bad ones. The choices of fitness function, representation scheme, reproduction operators, and selection methods all interact in determining the effectiveness of the GA, and many variations have been proposed.
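To make the interplay of representation, reproduction operators and selection concrete, the following sketch implements a minimal generational GA over 6-bit strings. The ones-counting fitness function, the population size and the other parameter values are illustrative assumptions only, not taken from the works cited above.

```python
import random

def fitness(bits):
    # Illustrative fitness: count of 1-bits (a stand-in for a problem-specific measure).
    return sum(bits)

def one_point_crossover(p1, p2):
    # Swap all bits to the right of a randomly chosen crossover point.
    point = random.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def one_bit_mutation(parent):
    # Flip a single randomly chosen bit.
    child = parent[:]
    i = random.randrange(len(child))
    child[i] = 1 - child[i]
    return child

def select_parent(population):
    # Fitness-proportional (roulette-wheel) selection.
    total = sum(fitness(ind) for ind in population)
    if total == 0:
        return random.choice(population)
    pick, running = random.uniform(0, total), 0.0
    for ind in population:
        running += fitness(ind)
        if running >= pick:
            return ind
    return population[-1]

def evolve(pop_size=20, length=6, generations=50):
    population = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = select_parent(population), select_parent(population)
            c1, c2 = one_point_crossover(p1, p2)
            offspring.append(one_bit_mutation(c1))
            if len(offspring) < pop_size:
                offspring.append(c2)
        population = offspring          # offspring replace their parents
    return max(population, key=fitness)

print(evolve())
```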

2.2.5 Genetic Programming

An increasingly popular evolutionary computation technique is that of genetic programming (GP) (Koza, 1992, 1994). GP and GA are very similar, except that GP uses a variable-sized tree representation rather than a fixed-length string. In a standard genetic program, the representation is a tree in which each internal node is a label from an available set of function labels and each external node is a label from an available set of datum labels. (Note that the term normally used in the literature for a datum is "terminal"; a different name is used here to avoid confusion with the grammar terminology introduced later in this document.) Each internal node represents a function and its children represent the parameters to that function. Each external node represents a single constant data value. The entire tree corresponds to a single, complex function that may be evaluated by traversing the tree in a left-most depth-first manner. A datum is evaluated as the corresponding value. A function is evaluated using as arguments the result of the evaluation of its children. Figure 9(a) illustrates a sample representation scheme, Figure 9(b) illustrates a sample gene, and Figure 9(c) illustrates the evaluation of that tree.

[Figure 9: Genetic Programming Representation Scheme — (a) a scheme with function set {+, −, ×, /} and data set {1, 2, 3, 4, 5, 6}; (b) a sample gene tree; (c) the left-most depth-first evaluation of that tree.]
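The evaluation procedure just described can be sketched as follows, with genes encoded as nested tuples over the function set and data set of Figure 9(a); the particular gene shown is an arbitrary example rather than the tree of Figure 9(b).

```python
import operator

# Function labels map to binary arithmetic operations; data labels are constants.
# '-' and '*' stand in for the − and × of Figure 9(a).
FUNCTIONS = {'+': operator.add, '-': operator.sub,
             '*': operator.mul, '/': operator.truediv}

def evaluate(node):
    # A gene is a nested tuple ('+', left, right) for a function, or a bare number for a datum.
    if not isinstance(node, tuple):
        return node                      # a datum evaluates to its own value
    label, left, right = node
    # Evaluate the children first (left-most depth-first), then apply the function.
    return FUNCTIONS[label](evaluate(left), evaluate(right))

gene = ('*', ('+', 3, 2), ('-', 5, 2))   # corresponds to (3 + 2) * (5 - 2)
print(evaluate(gene))                    # 15
```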

[Figure 10: Subtree Crossover Operator — parents (a) and (b) exchange the marked subtrees to produce offspring (c) and (d).]

[Figure 11: Subtree Mutation Operator — a marked subtree of parent (a) is replaced with a randomly generated subtree to produce child (b).]

Genetic programming reproduction operators are tailored for a tree representation. The most commonly used operator is subtree crossover. Two parents are chosen from the population. In each parent, a crossover node is randomly selected. The entire subtree rooted at the crossover node of one parent is swapped with the subtree rooted at the crossover node in the other parent to produce two new offspring. Figure 10 illustrates the

58 combination of two parents (a) and (b) to produce two offspring (c) and (d) by crossing the dotted subtrees. Notice that the resulting offspring differ in size from the parents. Subtree mutation is another common operator. One parent is chosen from the population. A single node in the tree is randomly chosen as a mutation point. The subtree rooted at that node is replaced with a randomly generated subtree to produce a single offspring. Figure 11 illustrates the mutation of parent (a) to produce child (b). In GP, each (datum), function and function parameter has an associated type in the traditional sense from computer programming. The return type of a (datum) is the form of a value it returns, for instance an integer. In general, a single return type is assumed for all (data), functions and function parameters. This is known as the closure principle. (Angeline, 1996a, p.7) The closure principle (Koza, 1992) allows any datum or function to be used as a parameter to any function. This ensures that operators such as subtree crossover will always produce legal offspring since any subtree is structurally on par with any other subtree. While this may be an unrealistic model of a program from a computer scientist s point of view, it is a prudent choice from an evolutionary perspective since it removes all type consideration from the structures being evolved. (Angeline, 1996a, p.7) In GP, a constraint is usually placed upon the maximum depth that a genetic tree may reach. This limit on the maximum depth is a parameter which keeps the search space finite and prevents trees from growing to an unmanageably large size. (Montana, 1993, p. 5) 35
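A minimal sketch of subtree crossover and subtree mutation over the nested-tuple genes used in the previous sketch is given below; the path-based node indexing, the probability of growing a datum and the depth bound are simplifying assumptions.

```python
import random

FUNCTION_LABELS = ['+', '-', '*', '/']
DATA = [1, 2, 3, 4, 5, 6]

def nodes(tree, path=()):
    # Enumerate every node as a path of child indices from the root.
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def get(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, subtree):
    if not path:
        return subtree
    parts = list(tree)
    parts[path[0]] = replace(parts[path[0]], path[1:], subtree)
    return tuple(parts)

def random_subtree(max_depth):
    # Grow a random subtree, bottoming out at a datum when the depth budget runs out.
    if max_depth <= 1 or random.random() < 0.3:
        return random.choice(DATA)
    label = random.choice(FUNCTION_LABELS)
    return (label, random_subtree(max_depth - 1), random_subtree(max_depth - 1))

def subtree_crossover(parent1, parent2):
    # Swap randomly chosen subtrees; closure guarantees both offspring are legal.
    p1 = random.choice(list(nodes(parent1)))
    p2 = random.choice(list(nodes(parent2)))
    child1 = replace(parent1, p1, get(parent2, p2))
    child2 = replace(parent2, p2, get(parent1, p1))
    return child1, child2

def subtree_mutation(parent, max_depth=3):
    # Replace a randomly chosen subtree with a freshly generated one.
    point = random.choice(list(nodes(parent)))
    return replace(parent, point, random_subtree(max_depth))

a = ('*', ('+', 3, 2), ('-', 5, 2))
b = ('/', 6, ('+', 1, 4))
print(subtree_crossover(a, b))
print(subtree_mutation(a))
```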

59 2.2.6 Strongly-Typed Genetic Programming Strongly-typed genetic programming (STGP) (Montana, 1993, 1995; Haynes, Wainwright, Sen and Schoenefeld, 1995; Haynes, Schoenefeld and Wainwright, 1996) is a variation of GP that relaxes the closure principle. In STGP, a type is associated with each possible function and each possible datum. Several functions or data may share the same type, while others may have distinct types. A given function is specified to accept parameters of specific types. A valid gene tree is one in which all functions have children that are of the correct type(s) and the root function returns the type required by the problem solution. To design a GP system that will manipulate typed genetic trees, the representation scheme, the process that generates the initial parse trees, and the reproduction operators that create new offspring must be defined with special care. The definition of the representation scheme requires a clear specification of the type of each datum as well as the parameter types required by and the return type of each function. Figure 12 gives an example of a representation developed by Montana (1993, 1995), in which there are three possible types, scalar values, two-dimensional vectors, and three-dimensional vectors. The creation of legal genetic trees requires that the type constraints of the parameters be taken into account, and that a (function) can be the root of a tree of maximum depth i if and only if all of its argument types can be generated by trees of maximum depth i-1. (Montana, 1993, p.8) Montana defines a types possibilities table that is calculated once the representation scheme has been defined, and is subsequently consulted during the generation of individuals. The table indicates for every possible depth, up to i, what return types can be generated by a tree of that depth. For the sample representation scheme of Figure 12, the table entry for depth of 1 is {VECTOR-2, 36

60 VECTOR-3}, while the table entries for all depths greater than 1 are {SCALAR, VECTOR-2, VECTOR-3}. Function set = {DOT_PRODUCT-2, DOT_PRODUCT-3, VECTOR-ADD-2, VECTOR-ADD-3, SCALAR-VEC-MULT-2, SCALAR-VEC-MULT-3} Data set = {V1, V2, V3} where type properties are, DOT-PRODUCT-2: (VECTOR-2, VECTOR-2) SCALAR DOT-PRODUCT-3: (VECTOR-3, VECTOR-3) SCALAR VECTOR-ADD-2: (VECTOR-2, VECTOR-2) VECTOR-2 VECTOR-ADD-3: (VECTOR-3, VECTOR-3) VECTOR-3 SCALAR-VEC-MULT-2: (SCALAR, VECTOR-2) VECTOR-2 SCALAR-VEC-MULT-3: (SCALAR, VECTOR-3) VECTOR-3 V1: VECTOR-3 V2: VECTOR-3 V3: VECTOR-2 Required return type: VECTOR-3 Genetic tree depth restriction: 3 Figure 12: Strongly-Typed Genetic Programming Representation Scheme The sample representation scheme illustrates that a naïve definition of functions results in multiple labels being required for the same basic function as applied to different parameter types. Montana (1993) defines generic functions that allow this to be avoided. For example, the generic function DOT-PRODUCT is defined to return a SCALAR and to accept the parameters {VECTOR-i, VECTOR-i}, where i is any positive integer. VECTOR-ADD accepts parameters {VECTOR-i, VECTOR-i} and returns type {VECTOR-i}. This changes the tree creation process somewhat. To be in a (gene) tree, a generic function must be instantiated. Once instantiated, an instance of a generic 37

61 function keeps the same argument types... (and) acts exactly like a standard strongly typed function (Montana, 1993, p. 11). Haynes et al (1996) extend the notion of generic functions to a hierarchy of types. Using the subtype principle, wherever a type may appear a descendant of it in the type tree may also appear.once an argument is instantiated to a specific type, an additional check must be performed to determine if a subtype is allowed. (Haynes et al, 1996, p. 363) VECTOR-ADD-3 SCALAR-VEC-MULT-3 V1 DOT-PRODUCT-2 VECTOR-ADD-3 V3 V3 V2 V1 (a) SCALAR-VEC-MULT-3 DOT-PRODUCT-3 VECTOR-ADD-3 V1 V2 V2 V2 (b) Figure 13: Crossover Points in Strongly Typed Subtree Crossover The reproduction operators used in STGP must preserve the constraints upon the tree imposed by the parameter type requirements and preserve the depth constraint. In strongly-typed subtree crossover, two subtrees may be swapped only if they are both rooted by a node of the same return type. This requires an appropriate selection of matching crossover points. Based upon the sample representation scheme, Figure 13 38

62 illustrates a pair of legal crossover points, in solid circles, a pair of illegal crossover points due to type mismatch, in single dotted lines, and a pair of illegal crossover points due to violation of the depth restriction, in double dotted circles. Haynes et al. (1996) use a crossover operator that continues to randomly select two nodes, one from each parent tree, until the symbols at those nodes match and the resulting cross preserves the depth constraint. If no valid crossover points are discovered after a pre-determined number of attempts have been made, no offspring are generated. This method has the drawback of potentially never finding a match even if several exist. Montana (1993, 1995) uses a crossover operator in which a node in one tree is selected randomly, the second tree is analyzed to extract every node with a matching symbol and then one of those nodes is randomly selected. Montana (1993) does not explicitly include a check for the preservation of the depth constraint, but it is necessarily implied. This method has the potential of not finding a match if the first symbol was poorly selected, in which case no offspring are generated. In strongly-typed subtree mutation, a node is selected randomly as a mutation point and the subtree rooted at that node is replaced with a randomly generated subtree that returns the same type as the original node at the mutation point. However, when randomly generating a new replacement subtree, it is necessary to preserve type constraints and consult the types possibilities table to ensure that a tree of excessive depth does not result. STGP exhibits several benefits over standard GP. STGP permits the development of complex, hierarchically organized structures and thus produces the solution to problems which would have been virtually impossible for standard genetic programming. 39

63 (Montana, 1993, p. 16) Typing has the effect of reducing the size of the search space further for some problems by excluding tree organizations that in a single type language using analogous primitives might be created. (Angeline, 1996a, p. 10) The reduced search space size has been shown... to decrease the search time" (Haynes et al, 1996, p. 362). Finally, it has been claimed that solutions produced by STGPs are in general more comprehensible than solutions produced by GPs (Haynes et al., 1995, p. 271) Context-Free Grammar Genetic Programming Whigham (1995, 1996) presents context-free grammar genetic programming (CFG-GP), a form of GP that is equivalent in principle to STGP, but which uses a context-free grammar to define the representation scheme. A context-free grammar (see Lewis and Papadimitriou, 1981) is a four-tuple (N, T, P, S), where N is an alphabet of non-terminal symbols; T is an alphabet of terminal symbols; the sets N and T are disjoint; P is a set of productions, where each production is of the form X 0 X 1 X n and maps a single non-terminal, X 0 N, into one or more terminals and non-terminals, X i (N T) - S for i > 0; and S N is a designated start symbol and does not appear on the right-hand side of any production. A sentence from the grammar is derived by first applying an appropriate production from P to S to obtain a string containing non-terminal and/or terminal symbols. Productions are then recursively applied to any non-terminals in the resulting strings until a single string is obtained containing only terminal symbols. The series of derivation steps may be represented as a parse tree (or derivation tree), where the root of the tree is S, the internal nodes of the tree contain only non-terminal symbols, 40

the external nodes contain only terminal symbols, and the children of a given node are ordered as determined by the production that was applied to expand that node.

N = {SCALAR, VECTOR-2, VECTOR-3}
T = {v1, v2, v3, dot-product-2, dot-product-3, vector-add-2, vector-add-3, scalar-vec-mult-2, scalar-vec-mult-3}
P = {
SCALAR → dot-product-2 VECTOR-2 VECTOR-2
SCALAR → dot-product-3 VECTOR-3 VECTOR-3
VECTOR-2 → vector-add-2 VECTOR-2 VECTOR-2
VECTOR-3 → vector-add-3 VECTOR-3 VECTOR-3
VECTOR-2 → scalar-vec-mult-2 SCALAR VECTOR-2
VECTOR-3 → scalar-vec-mult-3 SCALAR VECTOR-3
VECTOR-3 → v1
VECTOR-3 → v2
VECTOR-2 → v3 }
S = VECTOR-3
Figure 14: Context-Free Grammar of CFG-GP

CFG-GP uses the parse trees of a given context-free grammar as the genetic representations. A legal gene is any tree that may be formed legally using the grammar. The organization of non-terminals and terminals in the gene is thus constrained by the grammar productions. This is unlike normal GP, where a legal gene consists of any organization of functions and data as long as the former are internal nodes and the latter are external nodes. The initial population in CFG-GP is generated by creating random parse trees using the grammar. Figure 14 illustrates a CFG-GP representation scheme that represents the same space of solutions as the STGP scheme of Figure 12. Note that by convention, non-terminal symbols are written in upper-case and terminal symbols are written in lower-case. The non-terminals correspond to the types in STGP, the functions

65 are now terminals, the data are also terminals, and each production identifies the number and type of parameters accepted by the associated function. The grammar used in CFG- GP allows both the typing to be automatically maintained with program constructs and also the structure (i.e., how functions are combined) to be explicitly stated and controlled. (Whigham, 1996, p. 236) VECTOR-3 vector-add-3 VECTOR-3 VECTOR-3 scalar-vec-mult-3 SCALAR VECTOR-3 v1 dot-product-2 VECTOR-2 VECTOR-2 vector-add-3 VECTOR-3 VECTOR-3 v3 v3 v2 v1 (a) VECTOR-3 scalar-vec-mult-3 SCALAR VECTOR-3 dot-product-3 VECTOR-3 VECTOR-3 vector-add-3 VECTOR-3 VECTOR-3 v1 v2 v2 v2 (b) Figure 15: CFG-GP Genetic Trees and Crossover Points Figure 15 illustrates genetic trees, generated using the sample grammar, that are equivalent in meaning to those of Figure 13. CFG-GP uses reproduction operators that preserve the syntactic validity of the offspring. In CFG-based subtree crossover, subtrees 42

66 are swapped only if they are rooted by the same non-terminal symbol. The solid circles in Figure 15 illustrate a legal pair of crossover points, while the dotted circles indicate an illegal pair due to symbol mismatch. In CFG-based subtree mutation, the original subtree, rooted by a given non-terminal symbol (e.g., SCALAR), is replaced by a new subtree randomly created from the grammar using that symbol (e.g., SCALAR) as the start symbol. Whigham includes a mechanism for limiting the depth of the parse trees. Each production is analysed to determine the minimum depth tree that it can create. During the random generation of a gene, a production is selected only if it won t violate the remaining depth restriction. During crossover, if a cross results in a tree that is too deep, the entire crossover procedure is aborted. During mutation, only productions with appropriate depth requirements are selected. Whigham (1995, 1996) incorporates a refinement operator whereby new grammar productions are dynamically created during evolution. In each generation, the fittest individual in the population is analysed. The left-most, deepest terminal t in that tree is identified. The non-terminal symbol T that led uniquely to t in a linear path (i.e., no branching) is identified. A new production is created that is identical to the production that initially contained T, except that t replaces T. This effectively creates the potential for more compact trees in the future. For example, in Figure 15(b), the left-most, deepest non-terminal is the node v1. If this tree were the fittest individual in the population, then the newly created production would be SCALAR dot-product-3 v1 VECTOR-3. However, if t was generated by a production containing only other terminal symbols (e.g., SCALAR dot-product-3 v1 v2), then the refinement operator actually creates a new 43

non-terminal symbol to replace the entire right-hand side, and a new rule that leads to it (e.g., SCALAR → v4, where v4 = dot-product-3 v1 v2). Whigham introduces a search bias into his evolutionary algorithm through the use of biased population generation and biased CFG-based mutation. There may be a desire to express a bias towards the generation of certain program strings in the initial population. This bias is explicitly represented by associating a merit weighting with each production. The probability of a production being selected during the creation of the initial population is now directly proportional to the merit weighting of each production. (Whigham, 1996, p. 233) Specifically, if $n$ rules $P^A_i$, $1 \le i \le n$, expand the same non-terminal symbol $A$, and each rule has a merit weighting $mw^A_i$, then the probability of selection $\mathrm{prob}(P^A_i)$ for each rule is given by (2-18).

$$\mathrm{prob}(P^A_i) = \frac{mw^A_i}{\sum_{j=1}^{n} mw^A_j} \qquad (2\text{-}18)$$

During mutation, the new subtree is generated as biased by the merit weightings. These merit weightings are determined arbitrarily at the beginning of evolution. Whigham uses dynamic merit weightings based upon the refinement operator. Every time a new production is created by refinement, and that new production already exists in the grammar, the merit of that production is incremented. If it does not exist in the grammar, it is added to the grammar with a merit of 1. Over time, the grammar, through its new productions and merit values, comes to reflect (in a general sense) the form of a preferred solution (Whigham, 1995, p. 40).
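A sketch of how merit-weighted production selection (2-18) might drive biased tree generation is given below. The productions mirror part of the grammar of Figure 14, but the merit weightings, the derivation-tree representation and the use of a roulette-wheel draw are illustrative assumptions.

```python
import random

# Productions grouped by the non-terminal they expand; each carries a merit weighting.
# The weights are arbitrary illustrative values.
productions = {
    'VECTOR-3': [
        (('vector-add-3', 'VECTOR-3', 'VECTOR-3'), 2.0),
        (('scalar-vec-mult-3', 'SCALAR', 'VECTOR-3'), 1.0),
        (('v1',), 3.0),
        (('v2',), 3.0),
    ],
    'VECTOR-2': [(('v3',), 1.0)],
    'SCALAR': [(('dot-product-2', 'VECTOR-2', 'VECTOR-2'), 1.0)],
}

def choose_production(symbol):
    # Equation (2-18): selection probability proportional to the merit weighting.
    rules = productions[symbol]
    total = sum(mw for _, mw in rules)
    pick, running = random.uniform(0, total), 0.0
    for rhs, mw in rules:
        running += mw
        if running >= pick:
            return rhs
    return rules[-1][0]

def derive(symbol):
    # Build a parse tree: non-terminals are expanded, terminals become leaves.
    rhs = choose_production(symbol)
    children = [derive(s) if s in productions else s for s in rhs]
    return (symbol, children)

print(derive('VECTOR-3'))
```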

Finally, Whigham (1996) introduces selective reproduction operators. Each operator is defined to act upon a specific subset of the grammar non-terminals with a given probability. For example, as illustrated in Figure 16, selective CFG-based crossover is defined to operate upon the non-terminals {SCALAR, VECTOR-2} with a probability of 30%. There may also be more than one definition of each operator, each operating on different subsets with different probabilities. For example, a second crossover operator may be defined that operates upon the non-terminal {VECTOR-3} with probability 40%. Given two gene trees g1 and g2, selective crossover involves the following steps (a sketch of this procedure is given below):
1. Randomly select a node from g1 whose non-terminal symbol matches one of the applicable non-terminals. If no match exists, crossover is aborted.
2. Randomly select a node from g2 whose non-terminal symbol matches one of the applicable non-terminals. If no match exists, crossover is aborted.
3. Swap the subtrees rooted by the two crossover nodes to produce two new gene trees. If either resulting tree exceeds the depth requirement, crossover is aborted.
Selective CFG-based mutation may also be defined, as illustrated in Figure 16. Given a gene tree g, selective mutation involves randomly selecting a node from g whose non-terminal matches one of the applicable non-terminals and replacing the subtree rooted at that node as in normal CFG-based mutation. If no matching node exists, mutation is aborted.
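The following sketch illustrates one way the selective crossover steps above might be realized over parse trees stored as (non-terminal, children) pairs. The tree representation and depth calculation are assumptions, and the second crossover point is restricted here to the same non-terminal as the first so that the offspring remain derivable from the grammar, in keeping with the CFG-based crossover rule stated earlier.

```python
import random

# A parse-tree node is (non_terminal, [children]); leaves are terminal strings.
def nodes_with_symbol(tree, symbols, path=()):
    matches = []
    if isinstance(tree, tuple):
        symbol, children = tree
        if symbol in symbols:
            matches.append(path)
        for i, child in enumerate(children):
            matches.extend(nodes_with_symbol(child, symbols, path + (i,)))
    return matches

def get(tree, path):
    for i in path:
        tree = tree[1][i]
    return tree

def replace(tree, path, subtree):
    if not path:
        return subtree
    symbol, children = tree
    children = list(children)
    children[path[0]] = replace(children[path[0]], path[1:], subtree)
    return (symbol, children)

def depth(tree):
    if not isinstance(tree, tuple) or not tree[1]:
        return 1
    return 1 + max(depth(c) for c in tree[1])

def selective_crossover(g1, g2, applicable, max_depth):
    # Step 1: pick a node in g1 whose non-terminal is in the applicable set; abort if none.
    c1 = nodes_with_symbol(g1, applicable)
    if not c1:
        return None
    p1 = random.choice(c1)
    symbol = get(g1, p1)[0]
    # Step 2: pick a node in g2 rooted by a matching symbol; abort if none.
    c2 = nodes_with_symbol(g2, {symbol})
    if not c2:
        return None
    p2 = random.choice(c2)
    # Step 3: swap the rooted subtrees; abort if either offspring exceeds the depth limit.
    o1 = replace(g1, p1, get(g2, p2))
    o2 = replace(g2, p2, get(g1, p1))
    if depth(o1) > max_depth or depth(o2) > max_depth:
        return None
    return o1, o2
```

Returning None on any failed step mirrors the abort conditions listed above, leaving the parents unchanged for that reproduction attempt.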

Operator      Applicable Non-Terminals    Probability of Application
Crossover 1   {SCALAR, VECTOR-2}          30%
Crossover 2   {VECTOR-3}                  40%
Mutation 1    {SCALAR}                    5%
Mutation 2    {VECTOR-2}                  10%
Figure 16: Selective CFG-Based Reproduction Operators

Rather than simply allowing crossover or mutation to occur with equal probability at all nodes in a given gene, the use of selective operators allows the user to be more discriminating about how a program is modified (Whigham, 1996, p. 234).

2.2.8 Logic Grammar Based Genetic Programming

Wong and Leung (1997) introduce the LOgic grammar-based GENetic PROgramming system (LOGENPRO), in which a logic grammar is used to define a space of possible genes. A logic grammar is a generalization of a context-free grammar (CFG). A key difference is that the logic grammar symbols may include arguments. These arguments may be logic variables, functions or constants. Through the use of these parameters, a logic grammar may incorporate context-sensitivity. Figure 17 illustrates a simple logic grammar adapted from Wong and Leung (1997).

1. start → [(*], exp(W), exp(W), [)]
2. start → [(*], exp(Z), exp(Z), [)]
3. start → {member(?x, [W, Z])}, [(*], exp(?x), exp(?x), [)]
4. exp(?x) → {random(1, 2, ?y)}, [(/ ?x ?y)]
5. exp(W) → [(/ W 3)]
Figure 17: LOGENPRO Logic Grammar

70 As with a CFG, the logic grammar contains non-terminal and terminal symbols. The terminal symbols are enclosed in square brackets, such as [(*] in production 1 and [(/?x?y)] in production 4. Unlike a CFG, the non-terminal symbols, such as exp, may contain parameters, as indicated by exp(w) and exp(?x). A variable is prefaced by a?, such as?x and?y. Further, the productions of the grammar may also specify logic goals on the right-hand side. Logic goals are logical predicates and specify the conditions that must be satisfied before the rule can be applied (Wong and Leung, 1997, p. 146), such as {random(1, 2,?y)} and {member(?x [W, Z])}. The value of a given variable may be explicitly specified within a production. For instance, if production 1 is applied, the term exp(w) may be expanded by productions 4 or 5. In the case of production 4, the variable?x is automatically instantiated to the value W by the underlying logic interpreter. The value of a variable may also be determined by explicit logic goals. For instance, the goal {random(1, 2,?y)} will assign a random value between 1 and 2 to the variable?y if that variable has not yet been instantiated. The goal {member(?x [W, Z])} will randomly assign the value W or Z to the variable?x if that variable has not yet been instantiated. Once a variable is instantiated with a value, it is bound to that value within the scope of the production and all subsequent productions that expand it. As a result of the use of arguments, variables and logic goals, a single production may encompass a variety of possibilities. For instance, production 3 may produce an equivalent result to production 1 if?x is assigned a value of W, or it may produce an equivalent result to production 2 if?x is assigned a value of Z. A production may also incorporate context-sensitive dependencies. For instance, both the non-terminal exp 47

71 symbols within production 3 are interpreted in the same context (i.e., both are W or both are Z). Figure 18 illustrates a parse tree generated from the grammar of Figure 17. The instantiation of the variables is indicated in italics under the grammar symbols. start {member(?x, [W,Z]) } [(*] exp(?x) exp(?x) [)] {?x=w} {?x=w} {?x=w} {random(1, 2,?y)} {?y=1.3} [(/?x?y)] {?x=w} {?y=1.3} [(/ W 3)] Figure 18: LOGENPRO Parse Tree with Semantics for Variables Wong and Leung (1997) present genetic operators of crossover and mutation that operate upon the parse trees generated from a logic grammar, where the parse trees are decorated with the evaluations of the logical arguments. The operators must check the variable bindings and the conclusions deduced from all rules in the offspring to determine if the offspring are valid according to the grammar. In some cases, certain variable bindings may be changed appropriately in the new tree (e.g., changing the interpretation of?x from W to Z in a subtree formed using production 4), while in others a different variable binding may be illegal in the new tree (e.g., changing the interpretation of?x from W to Z in a subtree formed using production 5). Specific terminal and nonterminal symbols within a production may be designated as frozen and not subject to genetic manipulation. 48
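The kind of post-manipulation validity check described above can be caricatured as follows. The representation of bindings as per-node dictionaries and the hand-written constraints for productions 3 to 5 are simplifying assumptions; LOGENPRO itself relies on an underlying logic (Prolog) interpreter to perform this reasoning through unification.

```python
# Each derivation-tree node records which production was applied and the variable
# bindings in force at that node (a much-simplified stand-in for Prolog unification).
# The constraints mirror the sample grammar: production 3 requires ?x in {W, Z},
# production 4 requires ?y between 1 and 2, and production 5 only realizes exp(W).
PRODUCTION_CONSTRAINTS = {
    3: lambda b: b.get('?x') in ('W', 'Z'),
    4: lambda b: b.get('?x') in ('W', 'Z') and 1 <= b.get('?y', 1) <= 2,
    5: lambda b: b.get('?x') == 'W',
}

def valid(node):
    # node = (production_number, bindings, children)
    rule, bindings, children = node
    check = PRODUCTION_CONSTRAINTS.get(rule, lambda b: True)
    if not check(bindings):
        return False
    return all(valid(child) for child in children if isinstance(child, tuple))

# A subtree built with production 5 stays valid only where ?x is bound to W;
# grafting it into a context that binds ?x to Z must be rejected after crossover.
ok_tree  = (3, {'?x': 'W'}, [(5, {'?x': 'W'}, [])])
bad_tree = (3, {'?x': 'Z'}, [(5, {'?x': 'Z'}, [])])
print(valid(ok_tree))    # True
print(valid(bad_tree))   # False
```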

72 2.2.9 Definite Clause Translation Grammar Genetic Programming Ross (2001) presents the Definite Clause Translation Grammar Genetic Programming (DCTG-GP) system, in which a definite clause translation grammar (DCTG) is used to define a genetic representation. A DCTG is a logical implementation of an attribute grammar (Knuth, 1968). The core of the DCTG is a logic grammar, and as such may contain logical goals and parameters within the syntactic productions. In addition, within a DCTG a symbol of the grammar may have a semantic component and a production may contain semantic rules with logical goals and/or semantic goals that are based upon the semantic components of the symbols of the production. The semantic values associated with a specific non-terminal may vary at different locations in the parse tree. Unlike conventional attribute grammars, in which inherited and synthesized attributes are explicitly assigned to symbols and semantic rules that compute those attributes are explicitly incorporated in the productions of the grammar, the semantic components within a DCTG-GP grammar are associated with parse trees in the form of logical goals with no explicit distinction between inherited or synthesized, and the computation of these semantic components is determined by the underlying logic interpreter, which will deduce as needed across all goals of the entire tree until all semantic components are computed. choice expr A int B recognize(sum, Prob, TotalProb) ::- B construct(val), NewProb is (Val / Sum) * Prob, A recognize(newprob, TotalProb) Figure 19: Definite Clause Translation Grammar Production 49

73 Figure 19 illustrates a sample DCTG production adapted from Ross (2001). The production contains both syntactic and semantic definitions. The top portion of the production specifies the syntax. It contains two non-terminal symbols, expr and int. The semantic component associated with a non-terminal is identified through the use of the operator and a variable. Thus, expr A links the variable A with the semantic component of the symbol expr. The production also contains one semantic rule, indicated in the lower portion. Within the semantic rule, the operator is used to identify the semantic component within which context a semantic operation is applied. Thus, B construct(val) evaluates the operation construct within the context of the semantic component associated with the variable B. The underlying logic interpreter automatically deduces the values of the variables in a bottom-up and/or top-down fashion as required. For instance, the semantic rule may retrieve the integer value of the int symbol using the operator construct ; this value is stored in the variable Val. The rule may then compute a value for the NewProb variable, where the values for the arguments Sum and Prob may be supplied from higher levels of the parse tree. Finally, the expr term may be interpreted using the call A recognize(newprob). Successful interpretation will return a value for the overall probability TotalProb. Ross (2001) demonstrates that the use of DCTG semantic rules can define semantic properties which simplify the target language s grammar (Ross, p. 314), can represent context-sensitive information about the language, and can verify grammatical constructs for semantic viability during generation (Ross, p. 317). The DCTG-GP uses the context-free parse trees generated by a DCTG grammar as a genetic representation. Unlike LOGENPRO (Wong and Leung, 1997), typed subtree 50

74 crossover and mutation operators are used, but similar to the operators used in LOGENPRO, the semantic goals and Prolog goals of a production must be verified after crossover or mutation to determine if the offspring is valid. DCTG-GP also incorporates a mechanism to analyze the syntactic productions to determine the depth properties of the parse trees generated from any given non-terminal symbol, such as the minimum depth of trees generated by each production. These properties are used to generate populations of parse trees with specific depth and fullness characteristics Self-Adaptive Evolutionary Algorithms Self-adaptive evolutionary algorithms include mechanisms that modify the values for certain operational parameters while solving a problem. (Angeline, 1996b, p. 89) Most evolutionary algorithms have many different parameters that are usually set arbitrarily by the researcher and that remain fixed during the entire evolutionary process. These include the rates at which different reproduction operators are applied, the processes used to generate new individuals, the biases used to select mutation or crossover points, and the exact behaviour of each reproduction operator. Researchers have examined a variety of techniques for dynamically modifying these various parameters during the course of evolution (Angeline, 1996b; Angeline, Fogel and Fogel, 1996; Salustowicz and Schmidhuber, 1997; Saravan, Fogel and Nelson, 1995; Whigham, 1995). For example, Whigham (1995) has designed a self-adaptive evolutionary algorithm through his use of the refinement operator to change the grammar of the representation scheme, and his use of dynamic merit weightings based upon the application of the refinement operator. 51

[Figure 20: Gaussian Mutation — each element of parent vector (a) is perturbed by Gaussian noise to form offspring vector (b).]

Evolutionary strategies (Rechenberg, 1973) are a form of evolutionary algorithm in which the representation used is a fixed-length real-valued vector. As with the bit-strings of genetic algorithms, each position in the vector corresponds to a feature of the individual. The main reproduction operator used in evolutionary strategies is Gaussian mutation, in which a different random value from a Gaussian distribution with a fixed standard deviation $\sigma$, denoted $N(0, \sigma)$, is added to each element of an individual's vector to create a new offspring. Figure 20 illustrates the Gaussian mutation of parent a to form offspring b. Note that different elements have changed by differing amounts. The nature of the changes made by the mutation operator may be varied during evolution (Saravanan et al., 1995). Each element $x_i$ of a given gene may be associated with its own standard deviation value $\sigma_i$. The self-adaptive mutation operator determines the elements $x_i'$ of the offspring through the application of (2-19) to each element $x_i$ in the parent vector, and determines the standard deviation values $\sigma_i'$ of the offspring through the application of (2-20) to each standard deviation value $\sigma_i$ of the parent, where $\alpha$ is a fixed scaling factor. The $\sigma_i$ values are often referred to as strategy parameters.

$$x_i' = x_i + N_i(0, \sigma_i) \qquad (2\text{-}19)$$

$$\sigma_i' = \sigma_i + \alpha N_i(0, \sigma_i) \qquad (2\text{-}20)$$

Angeline (1996b) introduces a tree-based selective self-adaptive crossover operator for use in GP. Each individual gene tree in the population has its own parameter tree. The parameter tree has the same shape and size as its associated gene tree, but stores real-valued numbers instead of function and data labels.

For each node i in the gene tree, the probability of selecting that node as the crossover point, $\mathrm{probcross}(i)$, is determined by the value $\rho_i$ at the same location in the parameter tree according to (2-21), where $n$ is the total number of nodes in the tree.

$$\mathrm{probcross}(i) = \frac{\rho_i}{\sum_{j=1}^{n} \rho_j} \qquad (2\text{-}21)$$

The self-adaptive crossover works by first probabilistically selecting a crossover point in each parent based upon its respective parameter tree. The offspring gene trees are formed by crossing the selected subtrees from the parent gene trees, and the offspring parameter trees are formed by crossing the corresponding subtrees from the parent parameter trees. Once the crossover has been performed, every $\rho_i$ value in both offspring parameter trees is mutated slightly by adding Gaussian random noise according to (2-22), where $\rho_i'$ is the new parameter value and $\alpha$ is a fixed scaling factor.

$$\rho_i' = \rho_i + \alpha N_i(0, \rho_i) \qquad (2\text{-}22)$$

Figure 21 illustrates the crossover of two parent gene trees, a and b, composed of mathematical operations and numbers. Each parent has an associated parameter tree, a' and b'. Crossover is performed in both the gene trees and the associated parameter trees by swapping the subtrees illustrated in the dotted triangles. The result is the creation of two offspring gene trees, c and d, and their associated intermediate parameter trees, c' and d'. The intermediate parameter trees are then randomly mutated to form the final offspring parameter trees c'' and d''.

[Figure 21: Tree-Based Selective Crossover — parent gene trees (a) and (b), with parameter trees (a') and (b'), exchange subtrees to form offspring gene trees (c) and (d) and intermediate parameter trees (c') and (d'), which are then mutated into the final parameter trees (c'') and (d'').]
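The additive Gaussian self-adaptation shared by (2-19), (2-20) and (2-22) can be sketched as follows; the value of the scaling factor and the small positive floor on the strategy parameters are practical assumptions rather than part of the cited formulations.

```python
import random

def self_adaptive_mutation(x, sigma, alpha=0.1):
    # Equation (2-19): perturb each element with noise drawn using its own sigma_i.
    x_child = [xi + random.gauss(0.0, si) for xi, si in zip(x, sigma)]
    # Equation (2-20): perturb each strategy parameter in the same way, scaled by alpha.
    # (The small floor keeping sigma positive is a practical addition, not part of (2-20).)
    sigma_child = [max(1e-6, si + alpha * random.gauss(0.0, si)) for si in sigma]
    return x_child, sigma_child

parent_x = [0.5, -1.2, 3.0]
parent_sigma = [0.1, 0.5, 0.2]
child_x, child_sigma = self_adaptive_mutation(parent_x, parent_sigma)
print(child_x, child_sigma)
```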

Salustowicz and Schmidhuber (1997) present probabilistic incremental program evolution (PIPE), a self-adaptive GP algorithm in which gene tree creation is determined by a probabilistic prototype tree (PPT). A single PPT is maintained for an entire population of individuals. The PPT is a complete n-ary tree that is used to determine what function or terminal to select at each possible node location during gene creation.

If the representation scheme's function set F has f members and its data set D has d members, then each node i in the PPT contains a probability vector $\Phi_i$ of size f + d. Each element $\phi_i(I) \in \Phi_i$ represents the probability of choosing the instruction label $I \in F \cup D$ at node i. The magnitude of each node's probability vector is always 1. To generate a program tree, an instruction is probabilistically selected for the root node based upon the $\Phi$ vector stored in the root node. If that instruction is a data label, then the tree is complete. If it is a function label, then for each required argument of the selected function, a child of the root is expanded according to that child's probability vector. The process continues recursively until all external nodes are data labels. For example, Figure 22 illustrates a PPT of arity 2 for the representation scheme of Figure 9(a). The PPT is initialized so that all functions are equally likely compared to other functions, and all data labels are equally likely compared to other data labels (e.g., $\Phi_6$ in Figure 22). As well, initially all nodes in the PPT have identical probability vectors.

[Figure 22: Probabilistic Prototype Tree — a complete binary PPT with nodes $\Phi_0$ through $\Phi_{14}$; a sample probability vector for node 1 (e.g., $\phi_1(+) = 0.2$, $\phi_1(/) = 0.07$, $\phi_1(6) = 0.03$) and an initialized vector for node 6 ($\phi_6 = 0.16$ for each function label and $0.06$ for each data label) are shown.]

PIPE has two mechanisms for self-adapting the PPT tree. The learning from population operator updates the PPT after each generation to reflect more closely the best individual in that generation.

If $g_{max}$ is the individual gene with the highest fitness in the population, then $I_i^{max}$ is the instruction that was chosen at each node i of the PPT in order to create $g_{max}$. The probability vector for each node in the PPT that contributed to $g_{max}$ is updated using (2-23), where $\lambda$ is a fixed learning rate, and then re-normalized.

$$\phi_i(I_i^{max}) = \phi_i(I_i^{max}) + \lambda\left(1 - \phi_i(I_i^{max})\right) \qquad (2\text{-}23)$$

In order to achieve a meaningful modification of the PPT, (2-23) is applied in several cycles. The higher the fitness of $g_{max}$, the greater is the number of cycles. The more cycles, the more heavily the PPT comes to weight the choices that resulted in $g_{max}$. The prototype tree mutation operator is also guided by $g_{max}$. Only those nodes that contributed to $g_{max}$ may be mutated. The probability $M_i$ that a given node i is mutated is given by (2-24), where $\varphi$ is a fixed overall mutation probability and $n$ is the size of $g_{max}$.

$$M_i = \frac{\varphi}{(f + d)\, n} \qquad (2\text{-}24)$$

If a given node i is targeted for mutation, its probability vector is updated using (2-25) for each $I \in F \cup D$, where $\mu$ is the mutation rate, and then re-normalized.

$$\phi_i(I) = \phi_i(I) + \mu\left(1 - \phi_i(I)\right) \qquad (2\text{-}25)$$

Both operators bias the evolutionary search towards exploring the search landscape around $g_{max}$ in subsequent generations.

2.2.11 Baldwin Effect in Evolutionary Computation

Three categories of evolutionary algorithms may be identified based upon whether the genotype uniquely determines the phenotype whose fitness is evaluated and whether the genotype uniquely determines the offspring that are generated. In simple evolutionary algorithms (or Darwinian algorithms), both conditions hold true. A population consists of

80 individual genotypes. Each genotype may be interpreted and converted into a phenotype through a mapping process. That phenotype does not change once created and is used to determine a fitness evaluation. Based on the fitness results, certain individuals in the population are selected to propagate. Offspring are generated based on the original genotypes of the parents. In Baldwinian algorithms, the first condition is relaxed but the second still holds true. Phenotypes are generated from the genotype. However, the phenotypes change over time in response to an environment. In other words, the phenotypes learn or acquire traits that were not described in the genotypes. The final, changed phenotype is used to determine a fitness evaluation. The offspring are then generated based on the original genotypes of the parents. Finally, in Lamarckian algorithms, both conditions are relaxed. As with Baldwinian algorithms, genotypes lead to initial phenotypes that are then modified to produce final phenotypes that have their fitness evaluated. However, when offspring are generated, they are created by incorporating some of the acquired traits of the final phenotype of the parents, as well as traits of the original genotypes of the parents. Many researchers have developed Baldwinian evolutionary algorithms and a number have explicitly studied the properties of the Baldwin effect in evolutionary computation (Boers, Borst. and Sprinkhuizen-Kuyper, 1995; French and Messinger, 1994; Gruau and Whitley, 1993; Hinton and Nowlan, 1987; Luke and Spector, 1996; Nolfi, Elman and Parisi, 1994; Turney, 1996a, 1996b; Turney, Whitley and Anderson, 1996; Whitley, Gordon and Mathias, 1994) Most have shown that the Baldwin effect has a clear benefit upon the evolutionary search. Hinton and Nowlan (1987) have presented the first treatment of the Baldwin effect in genetic algorithms. French and Messinger (1994) in 57

81 particular have demonstrated some interesting qualitative properties of Baldwinian algorithms when the evolved individuals vary in their capability to learn, or in their phenotypic plasticity. Their arguments are summarised below. A certain feature that is represented in the genotype can be considered as a Good Gene if it results in a phenotypic feature, the Good Phene, that has a positive effect on the fitness evaluation. If the Good Phene has an extremely high positive effect, those individuals who somehow manage to acquire (the Good Phene) will almost invariably outcompete their non-(good Phene) rivals and reproduce more successfully The Baldwin effect is unnecessary when the quality of the Good Phene is very high. (French and Messinger, 1994, p. 280) If the Good Phene has a moderate positive effect on the fitness evaluation, any genotype with that feature will be desirable. However, the Good Gene, once discovered, may not be good enough to be selected using only a Darwinian evolution algorithm. Without phenotypic plasticity (i.e., when no learning is possible) the genotype of the population does not evolve towards the Good Gene. (French and Messinger, p. 280) When the individuals in the population are capable of learning, it may be true that a number of Okay Genes may produce phenotypes which initially do not demonstrate the Good Phene, but which are capable of learning it. Thus, after the learning process the genotypes containing the Good Gene may be indistinguishable from the genotypes containing one of the Okay Genes. If the phenotypic plasticity of the individuals is so high that almost all genes are Okay Genes, then no Baldwin effect is observed since all individuals will quickly exhibit the high fitness resulting from the Good Phene. 58

82 However, if phenotypic plasticity is moderate and the Good Phene has a moderate effect, the Baldwin Effect becomes apparent. The discovery of the Good Phene in the population is more likely since many more genes will lead to it after learning. Since discovery is more likely, the Good Phene will occur more often during the evolutionary search and thus has a higher chance of being selected for. Once the Good Phene has survived, a secondary process occurs. Since learning the Good Phene is possible, but not too easy, not all Okay Genes will produce the Good Phene. However, if the Good Gene appears in the population, it always guarantees the Good Phene. Since the evolutionary algorithm always selects for the Good Phene, it is more likely over time to select a Good Gene individual than an Okay Gene individual, and thus the Good Gene propagates through the population. The real evolutionary value of the Baldwin Effect therefore is that it gives good - but not extraordinarily good - genes an improved chance of remaining in the population. Extremely good genes will, in general, stay in a population. (French and Messinger, p. 279) Another result that is highly interesting is that when there are learning genes that are also allowed to evolve, it can be shown that the phenotypic plasticity (i.e., ease of learning) of any beneficial, learnable trait increases over time. (French and Messinger, p. 281) Nolfi et al. (1994) have also examined Baldwinian evolutionary algorithms as applied to the evolution of neural networks. They conclude that the learning fitness and evolutionary fitness may actually be based upon different properties and the system will still demonstrate the Baldwin effect. In other words, the evolutionary algorithm may be optimizing its search based upon one set of criteria and the individuals may be learning 59

83 based upon a different set of criteria, but the fact that the individuals learn will still improve the evolutionary search. Instead of selecting for individuals that are good both at the evolutionary task and at the learning task (there may be no such individuals), evolution appears to select for individuals located in subregions of weight space where the changes due to learning during life tend to increase fitness. (Nolfi et al., 1994, p. 26) Whitley et al. (1994) compare and contrast all three types of algorithms in a set of GA experiments. They conclude that the Baldwinian search strategy will sometimes converge to a global optimum when the Lamarckian strategy converges to a local optimum. (p. 14) In all cases, the simple GA with no learning performs worse than both Baldwinian and Lamarckian GAs. Interestingly, they have also discovered that a Lamarckian algorithm almost always converges to a solution much faster than a Baldwinian algorithm, though the solution may be worse. Ackley and Littman (1994) likewise observe that a Lamarckian evolution algorithm was much faster than a (Baldwinian) algorithm given the same resources (p. 4) Turney (1996a, 1996b) has focussed upon the tradeoffs involved with the Baldwin effect. Phenotypic plasticity, one form of which is learning, smoothes the fitness landscape, which can facilitate evolution (Turney, 1996b, p. 136). On the other hand, phenotypic rigidity can also be advantageous since it avoids the potentially costly time, energy and mistakes that required by learning. For example, there can be advantages to instinctively avoiding snakes, instead of learning this behavior by trial-and-error. (Turney, 1996b, p. 136) Thus, there is evolutionary pressure to find instinctive replacements for learned behaviors, in stable environments (Turney, 1996a, p. 272). 60
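To make the distinction between the three categories of algorithm concrete, the following sketch contrasts what each strategy evaluates and what it passes on to offspring. The hill-climbing "learning" step, the fitness function and all parameter values are illustrative assumptions.

```python
import random

def fitness(phenotype):
    # Illustrative fitness: closeness to an arbitrary target vector.
    target = [1.0, 2.0, 3.0]
    return -sum((p - t) ** 2 for p, t in zip(phenotype, target))

def learn(phenotype, steps=20, step_size=0.1):
    # Lifetime adaptation: simple hill climbing from the inherited phenotype.
    best = list(phenotype)
    for _ in range(steps):
        trial = [v + random.gauss(0.0, step_size) for v in best]
        if fitness(trial) > fitness(best):
            best = trial
    return best

def evaluate(genotype, strategy):
    phenotype = list(genotype)            # genotype maps directly onto the initial phenotype
    if strategy == 'darwinian':
        return fitness(phenotype), list(genotype)
    adapted = learn(phenotype)
    if strategy == 'baldwinian':
        # Fitness reflects the adapted phenotype, but the unchanged genotype reproduces.
        return fitness(adapted), list(genotype)
    if strategy == 'lamarckian':
        # Acquired traits are written back into the genotype passed to offspring.
        return fitness(adapted), adapted
    raise ValueError(strategy)

g = [0.0, 0.0, 0.0]
for strategy in ('darwinian', 'baldwinian', 'lamarckian'):
    score, heritable = evaluate(g, strategy)
    print(strategy, round(score, 3), [round(v, 2) for v in heritable])
```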

84 Research on Baldwinian evolutionary algorithms offers four conclusions. An evolutionary system with individuals that are capable of learning is generally more successful than a system with non-learning individuals and more successful than a single individual capable of learning. An evolutionary system with individuals that learn and which also explores changes to the learning capabilities of those individuals will be even more successful. The training of the individuals in a population does not need to be tied directly to the fitness evaluation of those individuals. Finally, Baldwinian algorithms, though effective, may converge very slowly. 2.3 Modular Neural Network Models Many researchers have proposed neural network models that have modular properties and several reviews on the topic have been written (Caelli, Guan and Wen, 1999; Gallinari, 1995; Haykin, 1994; Ronco, Gollee and Gawthrop, 1997). Different systems differ in the definition of modularity used, the motivation followed for incorporating modularity, and the techniques implemented. A brief review is presented Redundancy A common modular technique is to incorporate redundancy into the network structure by combining the outputs of multiple, similarly trained networks to form a single output response. The rationale behind this approach is that decisions taken by teams usually are better than decisions taken by individuals, provided that suitable methods for combining the individual responses are provided. (Battiti and Colla, 1994, p. 704) This simple form of modularity has been used in a variety of network applications. Generally, the different modules are similar (often identical) in structure and are trained on the same 61

85 data and same problem. The initial (random) differences between the module networks differentiate the errors of the networks, so that the networks will be making errors on different subsets of the input vector. (Hansen and Salamon, 1990, p. 994) In some applications, though, the modules may differ greatly in structure and/or may be trained on different subsets of the available database (Hansen and Salamon, 1990, p. 1001). One method for combining the results of the redundant modules is to take the average, simple or weighted, of the outputs of the modules to obtain the output for the entire network (Battiti and Colla, 1993; Lincoln and Skrzypek, 1990; Mani, 1991). In averaging, all the modules contribute towards the final outcome, although weighted averaging adds an extra level of fault tolerance by giving the judge the ability to bias the outputs based on the past reliability of the nets. (Lincoln and Skrzypek, 1990, p. 651) Weighted averaging adds an extra level of complexity to the network s learning. Another common method for combining redundant modules is classifier voting. In classifier voting schemes, all the modules analyse the data and output a classification. A vote is then taken among these responses to determine which class best describes the input. A number of voting schemes are possible (Battiti and Colla, 1994; Gargano, 1992), including the plurality scheme, in which the system returns the classification suggested by the most modules and the majority scheme, in which the system returns a classification if one is suggested by more than half of the modules (Hansen and Salamon, 1990) Counter-Propagation The counter-propagation network (Hecht-Nielsen, 1990) is essentially an ad-hoc combination of two other models. The hybrid model consists of two modules connected 62

86 serially. The first module is a Kohonen self-organizing map network (Kohonen, 1984) and the second is a layer of instar nodes. Training of the network proceeds in two stages. In the first phase, the Kohonen layer adapts to the set of training vectors and partitions the input space into equiprobable sections. In the second phase, the instar layer associates each section of the Kohonen map with a vector representing the average desired response to inputs categorized by that section. Counter-propagation is an example of an approach to designing new network models in which existing neural networks can be viewed as building block components that can be assembled into new configurations offering new and different information processing capabilities. (Hecht-Nielsen, 1990, p. 153) ARTSTAR The ARTSTAR network (Hussain and Browse, 1994) is another ad-hoc combination of two existing models. It consists of a layer of instar nodes that combines the outputs from a number of adaptive resonance theory networks (ART) (Carpenter and Grossberg, 1987). The outputs of the instar layer are in turn used to provide feedback that regulates the internal processing of the ART modules. Figure 23 illustrates the structure of an ARTSTAR network with three ART modules. The internal structure of the ART modules is typical. The τ-layers and φ-layers apply auxiliary threshold functions. The desired responses always contain a single non-zero entry (i.e., the filled circle in the desired response vector of Figure 23). During a cycle through the network, each ART network produces a single active output node (i.e., the filled nodes in top layer of the ART modules in Figure 23), the instar 63

layer combines those outputs, and instar learning is performed on the weights from the τ-nodes corresponding to the active ART output nodes. Those weights come to represent the likelihood that a given ART output node belongs to a given class, and the instar layer effectively collects votes from all the modules as to which of the output classes should be activated.

[Figure 23: Modular ARTSTAR Network — three ART modules (each with its own τ- and φ-layers) receive separate inputs; their outputs feed a fully connected instar layer, which also receives the desired response through one-to-one connections.]

2.3.4 CALM Networks

Happel and Murre (1992, 1994) present the Categorizing and Learning Module (CALM) model in which a network consists of a number of CALM modules connected to each other. Each module may vary in size, but has a fixed internal structure. All activation values are between 0.0 and 1.0, intramodular connections are either excitatory (fixed positive weight) or inhibitory (fixed negative weight), and intermodular connections

88 may be positive or negative. Learning occurs only upon the connections between modules, and no learning occurs within a CALM module. Weighted Intra- Modular Connection Inhibitory Connection Excitatory Connection Veto Nodes Arousal Node Rest of Network w Representation Nodes Noise Node Figure 24: Basic CALM Module Figure 24 illustrates the structure of a CALM module. A CALM module contains four different types of nodes: several representation (R) nodes, the same number of veto nodes, an arousal node, and a noise node. The size of a module is determined by the number of R-nodes. The R-nodes form the input and output interface of the module. Inputs to the module are provided by the network s inputs and/or by the R-nodes of other CALM modules. An R-node applies a sigmoid activation function. Within the module, the R-nodes are connected to the veto node in a one-to-one manner. The veto nodes form a recurrent layer in which they inhibit each other and all R-nodes. The result is a winnertake-all competitive process that results in convergence to a single active R-node. To resolve deadlocks (i.e., when two or more nodes are equally activated), the CALM module contains a state-dependent noise mechanism. All R-nodes and veto nodes send output to the arousal node. The arousal node computes a single value reflecting the number of R-nodes that are still active and passes this measure to the noise node. The 65

89 noise node in turn sends random activations to the R-nodes, where the overall magnitude of the noise is proportional to the arousal level. The connections within a CALM module are either unweighted or fixed in weight value, while those between modules are weighted and modifiable. When CALM modules are connected together to form a network, learning in the network occurs through the modification of these intermodular weights. During and following the categorization process, a form of Hebbian learning takes place and preserves input-output associations by adjusting the learning weights from the input nodes to the R-nodes of a module. (Happel and Murre, 1994, p. 990) By itself, a single CALM module operates as a self-organizing network which is capable of learning to associate a single R-node with a given input pattern class. A CALM module can represent only as many patterns as it has R-nodes. Each module represents a competitive local memory in which multiple templates are stored but only one is returned as output. The connections between the modules represent how those local templates interact and combine to form new, higher-level templates. In general, a CALM module can be used as a building block to form more complex networks. CALM modules can be interconnected to exploit redundancy (e.g., two slightly different modules may accept the same input pattern and pass their results up to third module), as well as to exploit sub-task specialization (e.g., two modules may view different subsets of the input pattern). CALM is dynamic in its method of storing internal representations, but is static in its intermodular topology. Generally, the user selects a number of modules and connects them arbitrarily, and that structure is fixed throughout learning. Happel and Murre (1992, 1994) have examined the automatic optimization of the intermodular topology of a CALM 66

90 network using a genetic algorithm. They use a fixed-length bit-string genetic representation to specify the number of CALM modules, their size and how they are connected to each other. Cho and Shimohara (1997) similarly use genetic programming to evolve the intermodular topology of CALM networks MINOS Networks Smieja (1991, 1994) and Smieja and Mühlenbein (1992) present the MINOS modular neural network model, in which each module outputs both a response vector and a measure of its self-confidence in that response. A single layer of MINOS modules is used, and each module is connected to the same set of inputs. An authority unit receives the outputs of the MINOS modules and controls the learning of the modules as well as provides the output of the network. Internally, a MINOS module consists of a worker component and a monitor component, both of which are independent back-propagation networks that are connected to the same inputs (see Figure 25). Output Vector Training Verdict Authority Response Vector Confidence Value Worker Monitor Worker Monitor MINOS Module Input Vector Figure 25: MINOS Network 67

91 During each learning event for a pattern, every module is first activated upon a given pattern. The authority selects the MINOS module that exhibits the lowest response error and allocates the learning event to that module as a positive instance (e.g., the left module in Figure 25 is selected for training through receipt of a positive training verdict). The selected MINOS module trains its worker accordingly and updates the response of the monitor to approach the value 1 for that pattern. The remaining modules perform no learning on their workers and update the response of their monitors to approach the value 0 for that pattern. The monitor thus learns to output a value indicating its confidence that the worker s response is associated with the pattern. During a recall event for a pattern, every module is activated upon the pattern and the authority selects response of the MINOS module with the highest self-confidence measure as the final output of the network. The purpose of this approach is to enable the modules to specialize upon different patterns, but to take into account the fact that an output vector that is high in strength (i.e., large magnitude of activation values) may be wrong while an output vector that is low in strength may be correct. The confidence measures allow the integrating unit to make an informed choice based upon each module s self-evaluation. The use of backpropagation networks within the MINOS modules is an arbitrary choice, and one might however envision other forms of monitor network, such as Kohonen net or ART nets (Smieja, 1991, p. 10). 68
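A minimal sketch of the authority's decisions may make the division of labour clearer. The worker and monitor are abstracted to callables here, and the helper names (train_worker, train_monitor) are assumptions for illustration rather than Smieja's implementation.

```python
import numpy as np

class MinosModule:
    """One MINOS module: a worker that maps an input to a response vector and
    a monitor that learns a confidence value in [0, 1] for that response."""
    def __init__(self, worker, monitor):
        self.worker, self.monitor = worker, monitor

def authority_train(modules, x, target, train_worker, train_monitor):
    # Activate every module on the pattern and find the lowest response error.
    errors = [np.sum((m.worker(x) - target) ** 2) for m in modules]
    winner = int(np.argmin(errors))
    for i, m in enumerate(modules):
        if i == winner:
            train_worker(m, x, target)     # positive training verdict
            train_monitor(m, x, 1.0)       # winner's monitor approaches 1
        else:
            train_monitor(m, x, 0.0)       # losers' monitors approach 0

def authority_recall(modules, x):
    # Return the response of the module reporting the highest self-confidence.
    confidences = [float(m.monitor(x)) for m in modules]
    return modules[int(np.argmax(confidences))].worker(x)
```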

2.3.6 Mixture of Experts Networks

The Mixture of Experts model is a modular network that uses competing modules to partition the input space and perform task decomposition (Jacobs, Jordan and Barto, 1991a; Jacobs, Jordan, Nowlan and Hinton, 1991b). A basic Mixture of Experts network system consists of three main components (see Figure 26). The first is a number of expert modules, each of which sees the same input pattern and processes it independently. The second is a gating module that accepts the same input vector seen by the expert modules and outputs one value for each expert module. This value reflects the probability that the corresponding module represents the correct output of the network. The third component is an integrator node that computes the output of the entire network as the sum of the outputs of the expert modules weighted by the probabilities from the gating module.

[Figure 26: Mixture of Experts Network — expert modules 1 to 3 and a gating module all receive the input; the integrator node combines the expert outputs $y_1, y_2, y_3$, weighted by the gating values $g_1, g_2, g_3$, to produce the output $\sum_i g_i y_i$.]

In a typical Mixture of Experts network with M expert modules and an output pattern of size N, each expert ($1 \le i \le M$) is a single layer of N neurons ($1 \le j \le N$) connected to the inputs ($x$) via a weight vector ($w_i^j$). Their outputs ($y_i^j$) are produced

using (2-26). The gating module is a single layer of M neurons connected to the inputs via a weight vector ($v_i$). The weighted inputs produced by (2-27) are normalized using (2-28) to produce the output of the gating module. (2-28) is the softmax function and is used to exaggerate small differences between values. This generally produces a winner-take-all effect when the output ($z$) of the network is computed using (2-29).

$y_i^j = x^{T} w_i^j$   (2-26)

$u_i = x^{T} v_i$   (2-27)

$g_i = \dfrac{e^{u_i}}{\sum_{k=1}^{M} e^{u_k}}$   (2-28)

$z = \sum_{i=1}^{M} g_i y_i$   (2-29)

During training, once the network's output has been computed, the weights of the expert modules and the gating module are updated based upon the error made by the network. Given the desired output vector ($d$), (2-30) computes the relative contributions ($h_i$) of each expert to the overall error of the network. The error of each expert's output vector ($y_i$) is weighted by its original contribution to the network's output (i.e., the corresponding gating output for that module). These are passed through a softmax function to exaggerate the values. The weights of the expert modules are then updated as in (2-31). The effect of the softmax is that the module that had the smallest error (i.e., the "winner") is rewarded with the greatest amount of learning. The weights of the gating module are adapted as in (2-32). The effect of this function is that if, on a given training pattern, the system's performance is significantly better than it has been in the past, then the weights of

the network are adjusted to make the output corresponding to the winning expert network increase towards 1 and the outputs corresponding to the losing expert networks decrease to 0. Alternatively, if the system's performance has not improved, then the network's weights are adjusted to move all of its outputs towards some neutral value. (Jacobs et al., 1991a, p. 228)

$h_i = \dfrac{g_i \, e^{-\frac{1}{2}\|d - y_i\|^2}}{\sum_{k=1}^{M} g_k \, e^{-\frac{1}{2}\|d - y_k\|^2}}$   (2-30)

$\Delta w_i^j = \lambda \, h_i \,(d^j - y_i^j)\, x$   (2-31)

$\Delta v_i = \lambda \,(h_i - g_i)\, x$   (2-32)

The learning rules of the Mixture of Experts network are designed to assign credit to expert modules according to their performance. The output of the gating module determines the magnitudes of the expert networks' error vectors, and therefore determines how much each expert network learns about each training pattern. (Jacobs et al., 1991a, p. 229) Through this process of training, a given expert module will come to represent a particular subset of the data space. During testing, the output of the module most closely associated with a data pattern will tend to overshadow the responses of the other modules. As typically presented, both the expert modules and gating module are single layers of neurons. In principle, however, the modules can be any arbitrary supervised architecture, such as a back-propagation network. Jordan and Jacobs (1993) present a hierarchical version of the system (Hierarchical Mixture of Experts) in which the experts are, in turn, other Mixture of Experts networks.
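The credit-assignment behaviour of (2-26) to (2-32) is compact enough to transcribe directly into code. The sketch below assumes single-layer experts and a single-layer gating module as described above; the array shapes and the learning-rate name lam (for λ) are illustrative choices, not the authors' implementation.

```python
import numpy as np

def moe_forward(x, W, V):
    """W: (M, N, D) expert weights; V: (M, D) gating weights; x: (D,) input."""
    y = W @ x                                 # (2-26): y[i, j] = x . w_i^j
    u = V @ x                                 # (2-27): gating net inputs
    g = np.exp(u) / np.sum(np.exp(u))         # (2-28): softmax gating values
    z = g @ y                                 # (2-29): weighted sum of expert outputs
    return y, g, z

def moe_update(x, d, W, V, lam=0.1):
    y, g, z = moe_forward(x, W, V)
    # (2-30): each expert's share of the credit, a softmax over gated errors.
    e = g * np.exp(-0.5 * np.sum((d - y) ** 2, axis=1))
    h = e / np.sum(e)
    # (2-31): experts learn in proportion to their credit h[i].
    W += lam * h[:, None, None] * (d - y)[:, :, None] * x[None, None, :]
    # (2-32): the gating module moves its outputs g towards the credits h.
    V += lam * (h - g)[:, None] * x[None, :]
    return z
```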

95 2.3.7 Auto-Associative Modules Ballard (1990) presents a modular neural network that uses auto-associative modules as building blocks. The modules are back-propagation networks, but they are trained such that the desired output is always the same as the input. In such a module, therefore, the hidden nodes learn to form an internal, usually more compact, representation of the input patterns. Ballard suggests that interesting modular networks can be formed by connecting such auto-associative modules to each other. Specifically, two modules are connected when the hidden nodes of one module provide part of the input (and thus desired output) for another module. The network can learn to form internal representations of input data and then learn to combine these internal representations further at each level. Abstract Module Sensory Module Motor Module Figure 27: Network of Auto-Associative Modules Figure 27 show an example that links a sensory image with a motor response. In the figure, double-lines represent full connectivity and single lines represent one-to-one connectivity. One module forms an internal representation of the sensory images and the other module forms an internal representation of the motor response. A third module then 72

96 accepts these two internal representations as input and learns to form an internal representation of them. After training, if input is given to one end of the network (e.g., the sensory module), activation will flow through the other modules resulting in an output at the far end (e.g., the motor response). There are several advantages exhibited by a modular network of such components. Firstly, the processing of the network is clearer to understand since the internal structure of the entire network is more open to examination than the single hidden layer in a standard back-propagation network. Secondly, the processing is more flexible since the component modules can be used to provide input to several modules, such as in a task that requires several processing streams. Thirdly, the network may also have a benefit in complexity and performance because of the compactness of the representations used. If multiple input sources are considered together in a single network, the crucial differences between data patterns may be difficult to learn because of their relatively small impact on the entire vector (e.g., a difference of only a few bits between two 1000-bit long vectors will tend to be difficult to notice). On the other hand, in a modular auto-associative network, each input source can be presented to a different module and a compact representation of it formed. As these representations are integrated in the network, the higher-level representations will generally remain compact and crucial differences in vectors will tend to be more easily noticed. Finally, the most important result of this transformation to a purely auto-associative system is that this system can be modularized in a way that is resistant to changes in problem scale. (Ballard, 1990, p. 143) 73
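The wiring idea can be sketched briefly. The module class below stands in for any auto-associative back-propagation network; its method names (fit, hidden_activation) and the way hidden codes are concatenated are assumptions for illustration, not Ballard's implementation.

```python
class AutoAssociativeModule:
    """Trained so that the desired output equals the input; the hidden layer
    therefore learns a compact internal code for its input patterns."""
    def __init__(self, net):
        self.net = net                              # any auto-associative network

    def train(self, pattern):
        self.net.fit(pattern, target=pattern)       # target is the input itself

    def code(self, pattern):
        return self.net.hidden_activation(pattern)  # the compact internal code

def train_abstract_module(sensory, motor, abstract, image, response):
    # The abstract module auto-associates the concatenated hidden codes of the
    # sensory and motor modules, thereby linking an image with a motor response.
    joint = list(sensory.code(image)) + list(motor.code(response))
    abstract.train(joint)
```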

97 2.3.8 Short Connection Bias Network Jacobs and Jordan (1992) present an interesting modular variation of backpropagation. Each node in the network is assigned a position in space, and the backpropagation learning algorithm is modified so that it reduces a cost function which includes the normal response error term as well as a weight decay term based on the distances between nodes. This second term has two main effects. First, weights on long connections are more strongly decayed than weights on short connections. Second, relatively small weights (roughly those whose magnitude is less than 1) are more strongly decayed than large weights. (Jacobs and Jordan, 1992, p. 328) Thus, as the network learns it tends to eliminate long, weak connections and preserve short, strong ones. The result is that the network effectively develops into modules. An implicit form of competition arises from the interaction between the supervised error-correction process and the locality constraint and that this competition leads to modular specialization. (Jacobs and Jordan, 1992, p. 324) The networks resulting from the short connection bias are highly capable of decomposing a problem into subtasks. For example, Jacobs and Jordan successfully apply the technique to the dual-task problem of identifying both what an object in an image is and where in the image it is located. The features developed by the network nodes tend to be more open to interpretation than those of normal back-propagation networks since units tend to interact significantly with many fewer nodes and develop local receptive fields instead of widely dispersed fields. In large networks, such as the human brain, an advantage of short connections is that they permit relatively rapid information processing (Jacobs and Jordan, 1992, p. 325) given the slow speed of neural impulses. It can also be 74

98 argued that due to spatial constraints, increases in brain size must be accompanied by decreases in the average percentage of cells with which any one cell directly communicates. (Jacobs and Jordan, 1992, p. 326) The short connection bias may lead to great savings in both time and space efficiency, especially if a network pruning algorithm is subsequently applied to optimize the network. 2.4 Evolution of Neural Networks Many researchers have applied evolutionary algorithms to evolve the topologies of populations of neural networks (Mühlenbein and Kindermann, 1989) and to evolve new functionality for use within neural network architectures (Bengio, Bengio, Cloutier and Gecsei, 1992), and several reviews on this research have been written (Balakrishnan and Honavar, 1995a, 1995b; Schaffer, Whitley and Eshelman, 1992; Yao, 1993, 1999). The genetic representation scheme used in an evolutionary neural network system (ENNS) is a crucial aspect of the design since neural networks are complex computational mechanisms that have many different structural and functional characteristics. A genetic encoding of a network provides a concise genetic representation of all the features of a neural network that the researcher wishes to explore at the evolutionary level. A typical ENNS is a Baldwinian evolutionary algorithm with five main stages. In the first stage, a population of genotypes is created from the genetic encoding. In the second stage, each individual genotype in the population is decoded into a functioning neural network phenotype. In the third stage, each phenotype network performs learning in the context of its environment. The network training paradigm is usually fixed and incidental to the design of the ENNS. In the fourth stage, the overall fitness of each phenotype is evaluated and 75

99 that value is assigned to its corresponding genotype. This fitness value is often based upon how well the neural network performs on a set of test data. In the fifth stage, the population of genotypes is manipulated through appropriate genetic operators, such as fitness-based selection, mutation and crossover, to produce a new population. Stages two through five are repeated for multiple generations until a termination criterion is reached, such as discovering a network that performs above a certain threshold. A number of encoding techniques for neural networks have been proposed and most can be categorized into one of six approaches (similar to categorization by Yao, 1999): direct encoding, structural encoding, parametric encoding, developmental rule encoding, grammatical encoding, or cellular encoding Direct Encoding Within a direct encoding (Heistermann, 1990; Miller et al., 1989), the details of a neural network are described in the gene such that the gene may be decoded directly into a functioning neural network. No random initialization is required. A direct encoding is used to represent the exact weight values of the connections in a network of fixed topology. For example, to represent a recurrent network of four fully-connected nodes, a 4x4 real-valued matrix may be used, as in Figure 28(a). Normal GA reproduction operators may be used by treating the matrix as a 16-element vector. Each node is assumed to have a specific pre-determined functionality. To decode the matrix-of-weights gene, the row is used as the source node and column as the destination node to get the network in Figure 28(b). The connections with weights of 0 are still present in the network, and thus those weights may change before fitness evaluation if learning occurs in 76

the phenotype network. Notice that the matrix contains no details concerning the functionality of the network. The encoding assumes a great deal of knowledge to be used in interpreting it as a neural network.

[Figure 28: (a) Direct Encoding of a (b) Neural Network Architecture — a 4x4 real-valued weight matrix and the corresponding fully connected four-node recurrent network.]

If weight values may range between −1 and 1, in increments of 0.05, then the genetic search space is of size $41^{16}$ and contains all possible neural networks with four fully connected nodes. Any permutation of the matrix in which the nodes of the structure are effectively renumbered will produce phenotypes that are functionally identical (e.g., if nodes 1 and 3 were renumbered 3 and 1 respectively in Figure 28(b)). The number of such permutations of a given matrix is 4!, or 24, if all values in the matrix are unique. Thus, the genetic search space is redundant by a factor of up to 24. Because of the permutation problem, genetic crossover operators may not behave well on direct encoding since the probability of producing a highly fit offspring by recombining (two functionally equivalent networks with different genotype representations) is often very low (Yao, 1999, p. 1431). Direct encoding also suffers from a problem of poor scalability. Since every possible connection weight must be specified, the representation string must necessarily be $n^2$ in size, where n is the number of nodes in the network.

2.4.2 Structural Encoding

Within a structural encoding, the presence of each connection is specified in the gene. The gene provides the necessary topology for a network, but may (Collins and Jefferson, 1990; Dasgupta and McGregor, 1992) or may not (Hancock and Smith, 1990; Kitano, 1990; Koza and Rice, 1991) provide the weight values. If not provided, a functioning network may be formed only once weight values have been randomly assigned to the connections. For example, to represent the topology of a recurrent network of four nodes, a 4x4 binary matrix may be used, as in Figure 29(a) (Kitano, 1990). Normal GA reproduction operators may be used by treating the matrix as a 16-element vector. Each node is assumed to have a specific pre-determined functionality. To decode the matrix-of-connections gene, the row is used as the source node and column as the destination node to get the network in Figure 29(c). If the matrix entry is a 1, a connection between the appropriate nodes is created, and if it is a 0, no connection is created. Thus, unless a topology-changing learning rule is used in the resultant network, that connection will never be present in the phenotype. However, the network in Figure 29(c) is not quite a functioning network, and during the network initialization phase connections are usually assigned random weight values. Since these initial values are not specified in the gene itself, a given gene actually represents a set of possible network structures. Two identical genes in the population may evaluate to very different fitness values. The genetic search space is of size $2^{16}$ and contains all possible neural network structures with four nodes or less. Such a structural encoding scales poorly in the same way as a direct encoding matrix-of-weights representation.

[Figure 29: (a) Inefficient and (b) Efficient Structural Encoding of a (c) Neural Network — a 4x4 binary connection matrix, the equivalent connection list {(1,1),(1,3),(2,1),(2,2),(2,3),(2,4),(3,2),(4,1)}, and the four-node network they describe.]

Other structural encoding approaches use variable-length genetic strings (Collins and Jefferson, 1990) or genetic trees (Koza and Rice, 1991) that are linear with respect to the number of connections that exist. For example, a list whose entries contain the source and destination nodes of each connection may be used, as in Figure 29(b).

2.4.3 Parametric Encoding

Within a parametric encoding (Polani and Uthmann, 1992; Schaffer, Caruana and Eshelman, 1990), certain important aspects of a neural network architecture are represented by a usually fixed number of parameters. The gene typically contains very high-level information and requires detailed hidden assumptions and initialization steps in order to be interpreted as a functioning neural network. Normal GA fixed-length string representations and reproduction operators may be used. For example, to represent a

103 family of three-layer back-propagation networks, the size of the first hidden layer, the size of the second hidden layer, and the learning rate may be encoded. Figure 30 illustrates such a gene as a string of three real-values. In the design of an ENNS, reasonable limits must be imposed upon the range of values and a sampling precision for each parameter. For example, an ENNS may represent only networks with 1 to 20 hidden nodes per layer and learning rates of 0.05 to 0.30 in increments of Note that these limitations must be enforced by the genetic operators of the evolutionary algorithm. Further, in transforming the gene to a functioning network phenotype, the system may make a variety of assumptions, such as assuming that the number of input nodes and output nodes is fixed and pre-determined by the task itself, that there is full connectivity between successive layers, that each node functions in the same way with the same learning rates and momentum values, and that the connections are initialized with random weight values. # Nodes in First Hidden Layer # Nodes in Second Hidden Layer Learning Rate Figure 30: Parametric Encoding of a Back-Propagation Network The size of the genetic search space is determined by the range of permitted values and sampling precision the researcher imposes on each parameter. In our sample, it is 20*20*36, or ~10 5. A parametric encoding is usually very compact and scales well to large networks. For example, in our sample encoding, a network may have arbitrarily large hidden layers and be still be represented by only three values. A parametric encoding typically exhibits very low redundancy; ours has none at all. One drawback to a parametric encoding is its inflexibility. The ENNS may only explore a limited subset of 80

104 the possible three-layer architectures. In our sample, because of the assumptions required to form a functioning network, only fully connected, three-layer networks are explored. Another serious drawback is that, because of the required initialization, each gene actually represents a significantly large number of possible structures with different weight configurations. This high one-to-many mapping of genotype to phenotype means that any given gene will have many possible fitness values. If the globally optimal topology appears in the population, there is no guarantee that the evaluated fitness will be high. In general, the parametric representation method will be most suitable when we know what kind of architectures we are trying to find (Yao, 1999, p. 1431) Developmental Rule Encoding Within a developmental rule encoding (Boers, Kuiper, Happel and Sprinkhuizen- Kuyper, 1993; Kitano, 1990; Voigt, Born and Santibáñez-Koref, 1993), the genetic representation does not contain details of topology, but rather contains a set of rules that may be followed to develop a neural network. The evolutionary search is thus not over topologies, but over the set of possible developmental rules. Most approaches encode the rules of a context-free string rewriting system known as an L-system (Lindenmayer, 1968; Jacob, 1994a). An L-system is defined by a starting symbol, a set of non-terminal symbols, a set of terminal symbols, a single context-free re-write rule for the start symbol and a single context-free re-write rule for each non-terminal symbol. Each re-write rule maps a single symbol into one or more non-terminal or terminal symbols. In a graph L- system, a single symbol is mapped into a matrix of non-terminal or terminal symbols (Kitano, 1990). A developmental rule encoding defines all the production rules as 81

sequences of symbols. Since there are no productions that share the same left-hand symbol, there is a unique interpretation of a given set of rules. For example, Figure 31(i) illustrates 16 graph-rewrite rules for the 16 possible 2x2 binary matrices. Figure 31(ii) illustrates a gene that defines three graph re-write rules, Figure 31(iii), that when applied accordingly to expand the start symbol produce the unique matrix of Figure 31(iv). This matrix, for example, could then be interpreted as the structural encoding of a neural network. Normal GA reproduction operators may be applied.

[Figure 31: Developmental Rule Encoding — (i) rewrite rules mapping the terminal symbols a through p to the 16 possible 2x2 binary matrices; (ii) the gene ABBAacpbapba; (iii) the three graph re-write rules it defines, S → [A B; B A], A → [a c; p b], B → [a p; b a]; (iv) the unique binary matrix produced by expanding the start symbol S.]

2.4.5 Grammatical Encoding

Within a grammatical encoding, a fixed grammar is designed and a sentence formed from the grammar is used to describe features of a neural network. The rules of a grammatical encoding may describe network topology and connection weights (Jacob and Rehder, 1993) as well as neuron functionality (Jacob, 1994b). The genetic representation may be the actual sentence formed from the grammar (Jacob and Rehder, 1993) or the parse tree used to derive a specific sentence. The level of network detail specified in the gene is determined by the nature of the production rules. The resulting sentence must be interpreted accordingly to form a working neural network.
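Returning briefly to the developmental-rule encoding of Figure 31, the decoding step amounts to repeated matrix rewriting, which the short sketch below illustrates. The non-terminal rules are the ones decoded from the gene ABBAacpbapba; the 2x2 blocks given for a, b, c and p are illustrative stand-ins rather than the figure's actual rule table, so only a few of the sixteen terminal rules are shown.

```python
import numpy as np

# Non-terminal rewrite rules, as decoded from a gene such as ABBAacpbapba.
NONTERMINALS = {
    "S": [["A", "B"], ["B", "A"]],
    "A": [["a", "c"], ["p", "b"]],
    "B": [["a", "p"], ["b", "a"]],
}
# Terminal rules map each lower-case symbol to a 2x2 binary block
# (illustrative values only; the full scheme defines all 16 blocks).
TERMINALS = {
    "a": [[0, 0], [0, 0]],
    "b": [[1, 0], [0, 0]],
    "c": [[0, 1], [0, 0]],
    "p": [[1, 1], [1, 1]],
}

def expand(symbol):
    """Recursively rewrite a symbol until only binary entries remain."""
    if symbol in TERMINALS:
        return np.array(TERMINALS[symbol])
    grid = NONTERMINALS[symbol]
    return np.block([[expand(s) for s in row] for row in grid])

connection_matrix = expand("S")   # usable as a structural encoding of a network
print(connection_matrix)
```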

[Figure 32: Grammatical Encoding of Neural Paths —
(a) the grammar:
    Topology → Topology PathList | Path
    PathList → ; Path
    Path → InputNeuron NeuronList OutputNeuron
    NeuronList → NeuronList NeuronList | CortexNeuron | OutputNeuron
    InputNeuron → an element of {i1, i2, ..., in}
    OutputNeuron → an element of {o1, o2, ..., om}
    CortexNeuron → an element of {c1, c2, ..., ck}
(b) a gene containing the sentence i1 c2 c1 o3 ; i1 o2 c2 o2 ; i1 c2 o3;
(c) the corresponding network phenotype.]

Jacob and Rehder (1993) present an encoding that uses a context-free grammar to specify network connectivity. A sentence in the grammar is a set of connection paths. Each path begins at an input node, passes through one or more internal or output nodes, and terminates at an output node. Several paths together specify a network. Figure 32(a) illustrates the grammar used by Jacob and Rehder, slightly adapted for presentation purposes. The symbol ; and the sets {i1, i2, ..., in}, {o1, o2, ..., om} and {c1, c2, ..., ck} are the terminals of the grammar. These sets represent the possible input, output and internal nodes, respectively. Figure 32(b) illustrates a gene containing a sentence that may be formed from the grammar, with n = 1, m = 3 and k = 2. Figure 32(c) illustrates the corresponding phenotype. The gene may contain overlapping sub-paths (e.g., i1 c2), but overlaps are represented only once in the phenotype. The resulting networks may contain

107 recurrent pathways (e.g., o 2 c 2 o 2 ), and all its paths result in output, though not all nodes in the network may be used (e.g., o 3 ). A primary goal of this design was to create a parameter coding that is easily interpretable by human experts without the need to explain complicated decoding algorithms. (Jacob and Rehder, 1993, p. 75) Special reproduction operators must be used to ensure that the offspring are valid sentences in the grammar. Jacob and Rehder (1993) introduce four mutation operators and a constrained two-point crossover operator. New path mutation creates a single new path using the grammar with Path as the start symbol, randomly selects a location in the gene which is at the beginning, at a ;, or at the end, and then inserts the new path. Remove path mutation randomly selects a single path from the list of paths in the gene and removes it. Insert neuron mutation randomly selects a path, and then randomly selects a point that is exclusively in the middle of that path. A new neuron is inserted at that point. Remove neuron mutation selects a path from the gene, randomly selects a neuron that is exclusively in the middle of that path, and removes it. Constrained two-point crossover randomly selects two crossover points in each parent. A valid crossover point may be at the beginning, at a ;, or at the end of the gene. The path(s) between those crossover points are then swapped between the parents to form two new offspring. Jacob and Rehder (1993) and Jacob (1994b) use this grammatical encoding within a hierarchical ENNS that separates network descriptions into three population levels. At the first level, the population consists of the grammar-based topology specifications as described above. At the second level, the population consists of a number of neurons that vary in their functionality. The members of this neuron pool form a dynamic CortexNeuron set {c 1, c 2, c k } that is used in the grammar of the first level. Each 84

108 neuron is represented by a string of three parameters that specify the input, activation, and output functions of the neuron. Outgoing Signal f out f act f in Output Function Activation Function Input Function Weighted Incoming Signals Figure 33: Neuron Model of Jacob (1994b) Figure 33 illustrates the neuron model used by Jacob (1994b). The input function, drawn from the set {scaled summation, scaled product, minimum, maximum}, collects the weighted incoming signals and passes the result to the activation function. The activation function, drawn from the set {linear step, sigmoid, tanh, sin, cos}, generates an internal signal which is then modified by an output function, drawn from the set {identity, threshold} and passed on to other neurons through the outgoing connections. Normal GA reproduction operators may be applied to the genes at this level. At the third level, the population consists of a set of weight bit-string pools. For each topology in the population at the first level, a separate pool of appropriately sized, fixed-length bit-string weight vectors is created. Each weights setting together with the topology then describes a fixed network structure which can now be evaluated for a predefined test environment For a fixed number of generations a genetic algorithm for weights evolution then tries to find an optimal weights setting within the given environment. Finally, a weights string evolves which lets the network solve its task in an optimal way, 85

109 and a fitness value serving as a performance measure for this network structure is returned to the topology module. (Jacob and Rehder, 1993, p. 77) Cellular Encoding Within a cellular encoding representation (Friedrich and Moraga, 1996, 1997; Gruau, 1994, 1995, 1996; Gruau and Whitley, 1993; Gruau, Whitley and Pyeatt, 1996; Kodjabachian and Meyer, 1995, 1998a, 1998b; Whitley, Gruau and Pyeatt, 1995), a fixed set of decoding instructions referred to as program symbols are defined and a tree of these program symbols is used to transform an initial graph containing one or more cells into a topology of functioning neurons. A cell represents a partially defined graph element with local memory. A cell has an input site and an output site. It is linked to other cells, with directed and ordered links that fan into the cell at the input site and fan out from the cell at the output site. (Gruau, 1995, p. 159) The local memory elements of a cell may vary between different cellular encoding representations. However, each cell always stores a duplicate copy of the complete gene tree and a reading head variable that points to a node in the gene. A program symbol may represent a cell division decoding action that transforms a single cell into multiple new cells, a modifying decoding action that changes the topology or memory of a single cell, or a terminating decoding action that transforms a single cell into a single neuron of specific functionality. A program symbol may require a parameter value. When such a program symbol is added to the gene, it is randomly assigned an associated parameter value. The specific set of program symbols used, the available memory elements of a cell, and the types of neurons that may be generated are the key defining components of a given cellular encoding representation. A cellular 86

110 encoding gene tree may be genetically manipulated in the same way as a gene tree in untyped genetic programming. All program symbols are considered to be interchangeable and normal subtree crossover and mutation operators may be applied Basic Cellular Encoding Gruau and Whitley (1993) and Gruau (1994) introduce the basic cellular encoding with a number of possible program symbols. A program symbol may have an arity of 0, 1 or 2, and all program symbols may occur at any location in a given gene so long as it has the appropriate number of offspring. In addition to storing the gene tree and reading head, each cell stores a threshold value, a link register value and a life counter. A connection is effectively considered to belong to its destination. Incoming links to the cell are numbered in terms of the order in which they are created, are given a weight of 1 or +1, and have a state of on or off. No information is associated with an outgoing link other than its destination. The decoding process begins with a starting graph containing input neurons, output neurons and one or more cells connected as appropriate to those neurons. The number of input and output neurons is determined by the problem being solved. Every cell in the starting graph has its reading head pointing to the root of its copy of the gene tree. Figure 34 illustrates two starting graphs, each consisting of a single cell a, denoted by a double-circle, whose reading head r a indicates the root of its gene tree. A starting graph may be acyclic and contain no recurrent links, as in Figure 34(a), or it may be cyclic, as in Figure 34(b). The labels SEQ and END refer to program symbols described below. 87
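The per-cell bookkeeping just described can be gathered into a small record. The following sketch is an illustrative data structure rather than Gruau's implementation; the Link fields mirror the weight, numbering and on/off state described above.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    source: object
    weight: int = +1        # basic cellular encoding uses weights of +1 or -1
    number: int = 1         # incoming links are numbered in creation order
    state: str = "on"       # only "on" links survive when the cell becomes a neuron

@dataclass
class Cell:
    gene: object            # the cell's own copy of the complete gene tree
    head: object            # reading head: the tree node to execute next
    threshold: float = 0.0
    link_register: int = 1
    life_counter: int = 0
    incoming: list = field(default_factory=list)   # list of Link records
    outgoing: list = field(default_factory=list)   # destination cells or neurons
```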

[Figure 34: Cellular Encoding Starting Graphs — (a) a gene tree (a SEQ node with two END children) and an acyclic starting graph containing a single cell a, whose reading head r_a points to the root of its gene tree; (b) the same gene tree with a cyclic starting graph.]

Decoding proceeds as follows. All cells in the starting graph are placed upon a first-in, first-out (FIFO) queue. The cell that is at the head of the queue is removed and the program symbol that is pointed to by its reading head is executed. If the instruction represents a cell division, then two new cells are produced with a specific topology. Each cell receives a complete copy of the gene, and places its reading head as specified by the program symbol. The new cells are then placed on the tail of the queue. For all other non-terminating instructions, the cell is modified and its reading head moved as specified by the program symbol, and the modified cell is placed on the tail of the queue. For terminating instructions, the queue is unchanged and thus the resulting neuron no longer may be changed. Following the execution of the program symbol and the updating of the queue, the next cell on the queue is removed and the process continues until the queue is empty. The result is a graph in which all cells have been transformed into neurons.
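Given cell records such as the one sketched earlier, the queue-driven process just described reduces to a short loop. The handler interface below (each program symbol maps to a function that returns the cells to place back on the queue) is an assumption for illustration; the division and modification rules themselves are detailed in the following pages.

```python
from collections import deque

def decode(starting_cells, handlers):
    """FIFO decoding: execute each cell's current program symbol until no
    cells remain, at which point every cell has been turned into a neuron."""
    queue = deque(starting_cells)
    while queue:
        cell = queue.popleft()
        symbol = cell.head.symbol             # program symbol under the reading head
        new_cells = handlers[symbol](cell)    # a division returns two cells, a
        queue.extend(new_cells)               # modification returns one, END returns none
```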

[Figure 35: SEQ Program Symbol — cell a is removed from the queue and split into cell b (which keeps a's input links) connected by a +1 link to cell c (which keeps a's output links); both new cells are placed on the queue with their reading heads on the left and right children of the SEQ node.]

The sequential division (SEQ) program symbol is a cell division instruction of arity 2 that involves the following steps, as illustrated in Figure 35:

1. Remove cell a from the queue. Let n be the highest numbered incoming connection of a.
2. Split cell a into two cells b and c.
3. Give b and c the same threshold, link register, and life counter values as a.
4. Set b's reading head to the left child of the node pointed to by a's reading head.
5. Set c's reading head to the right child of the node pointed to by a's reading head.
6. Give b the same input links as a, along with the same associated weight values, numbering and states.
7. Let c have the same output links as a. Let b replace a in the connection specifications at all destination cells or neurons.

8. Create a single connection from b to c with a weight of +1. Give this incoming connection of c the number 1 and a state of on.
9. Place b on the queue.
10. Place c on the queue.

[Figure 36: PAR Program Symbol — cell a is split into cells b and c, each duplicating a's input and output links; both new cells are placed on the queue.]

The parallel division (PAR) program symbol is a cell division instruction of arity 2, as illustrated in Figure 36. The first five steps are the same as with the SEQ operation. The remaining steps are:

6. Give b the same input and output links as a, along with the same weight values, numbering and states associated with the input connections. Let b replace a in the connection specifications at all destination cells or neurons.
7. Give c the same input and output links as a, along with the same weight values, numbering and states associated with the input connections. For each output link, give the destination cell or neuron a new incoming link with the appropriate weight and state. If the destination is a cell, assign a new number as well.
8. If a had a recurrent link, b and c will be linked to each other as well as to themselves. All four links will have the same weight and state as the self-link of a. The self-link of b and the link from b to c will have the same number as the self-link of a. The self-link of c and the link from c to b will be assigned the same new number.
9. Place b on the queue.
10. Place c on the queue.

The END program symbol is a terminating instruction that creates a neuron with a specific sigmoid activation function that has a threshold value equal to that stored in the original cell. If a connection of the cell has an on state, it is given to the final neuron with the associated weight value. If a connection of the cell has an off state, it is not given to the final neuron. A neuron may still be modified by other cells on the queue through the addition of new incoming connections. Figure 37 illustrates a simple gene tree (in (a)), its corresponding final network topology (in (b)) and the sequence of decoding steps that were followed (in (c)) given an acyclic starting graph with a single cell. Links are directed upwards, and memory values are ignored for simplification.

[Figure 37: (a) Cellular Encoding Gene of (b) Neural Network and its (c) Decoding Steps — a gene tree of SEQ, PAR and END program symbols (nodes labelled a through k), the final network topology it produces, and the sequence of intermediate graphs generated as each cell executes its program symbol.]

A cellular encoding representation also includes program symbols that change memory values, connectivity and weight values. The INCBIAS operation is an arity 1 instruction that increments the threshold bias memory variable of the cell. When the cell is finally terminated, that bias value will be used in the neuron's activation function. Similarly, the DECBIAS operation decrements the threshold value. The INCLR operation is an arity 1 instruction that increments the link register memory variable of the cell. The DECLR operation similarly decrements the link register value, to a minimum of 1. Let lr

be the value of the cell's link register memory variable. The CUT operation is an arity 1 instruction that sets the state of the cell's lr-th incoming connection to off. The VAL+ operation is an arity 1 instruction that sets the weight of the cell's lr-th incoming connection to +1. The VAL- operation is similar to VAL+, but sets the weight to −1. The reading head is moved to the child node upon completion of these operations. The program symbol set {SEQ, PAR, END, INCBIAS, DECBIAS, INCLR, DECLR, CUT, VAL-, VAL+} is sufficient to represent any arbitrary network topology when applied to a cyclic starting graph containing a single cell. In these topologies, all weights are +1 or −1, and all neurons will have the same activation function, differing only in threshold bias. The reader is referred to Gruau (1994) for a detailed discussion on the representation capability of program symbol sets. In brief, the SEQ and PAR operations may be used to generate a relatively full topology that contains within it all the desired neurons and connections, but possibly some undesired ones as well. The INCLR and CUT operations may then be used to prune undesired connections and to effectively prune undesired neurons by removing all their incoming connections. The INCLR, INCBIAS, DECBIAS, VAL+ and VAL- operations may be used to set the threshold biases and connection weights as desired. One of the main goals of Gruau (1994) is to enable the compact representation of neural network architectures. The drawback to the above program symbol set is that it will generate genetic codes that may contain $O(n^2)$ program symbols, where n is the number of nodes, since every connection may, in the extreme case, be specified independently. To achieve the goal of compactness, Gruau (1994) considers several other program symbols.
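To connect the modifying symbols to the cell record sketched earlier, their handlers can be written as follows. The attribute names are the same illustrative assumptions as before, not Gruau's code; each handler returns the modified cell so that the decoding loop places it back on the queue.

```python
def advance(cell):
    cell.head = cell.head.children[0]    # arity-1 symbols have a single child
    return [cell]                        # the modified cell goes back on the queue

def inc_bias(cell):                      # INCBIAS (DECBIAS would subtract 1)
    cell.threshold += 1
    return advance(cell)

def inc_lr(cell):                        # INCLR (DECLR decrements, to a minimum of 1)
    cell.link_register += 1
    return advance(cell)

def cut(cell):                           # CUT: switch off the lr-th incoming link
    cell.incoming[cell.link_register - 1].state = "off"
    return advance(cell)

def val_plus(cell):                      # VAL+: weight of the lr-th incoming link := +1
    cell.incoming[cell.link_register - 1].weight = +1
    return advance(cell)

def val_minus(cell):                     # VAL-: weight of the lr-th incoming link := -1
    cell.incoming[cell.link_register - 1].weight = -1
    return advance(cell)
```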

117 The clone division (CLONE) program symbol is similar to PAR, except that it is of arity 1. Two new cells are created with the same connectivity as the original. The reading head of both new cells is set to point to the only child of the node pointed to by original cell s reading head. The subtree in the genetic code will evaluate to two identical network components. The WAIT operation is an arity 1 instruction that makes no changes to the cell other than moving its reading head down to the next level and placing the cell on the tail of the queue. The WAIT operation affects the decoding process since certain operations (e.g., CUT) are dependent upon the order in which topological changes are made. In particular, connections are numbered based upon the order in which they are created, and a CUT operation that follows a WAIT operation may have a different effect. The REC operation is an arity 0 instruction that behaves differently depending upon the value of the cell s life counter. If the life counter is positive, REC moves the reading head of the cell to the root of the gene, decrements the life counter and places the cell back on the queue. If the life counter is 0, REC is treated as an END operation. The REC operation permits the instructions in the gene tree to be applied recursively to describe a neural topology with repeated structure. The life counter is used to avoid an infinite recursion. Figure 38 illustrates a gene tree that includes a REC program symbol (in (a)), the final topology that is generated (in (b)), and the initial decoding steps that are followed (in (c)). The numbers in the cells of Figure 38 (c) represent the life counter value of those cells. Note that the final topology shows a structure of three sequential nodes that is repeated several times with no extra program symbols required in the gene tree. 94

[Figure 38: (a) Cellular Encoding gene with a REC Program Symbol, (b) the Neural Network it generates, and (c) Partial Decoding Steps — the gene tree (nodes labelled a through m) contains a REC symbol that causes a three-node sequential structure to be repeated in the final topology.]

Gruau (1995) uses the same basic approach as Gruau (1994), but without link registers. To change weight values, he uses the parameterised program symbol I(n), that sets the weight of the n-th input connection to +1, and a similar operation D(n), that sets the weight of the n-th input connection to −1. The parameter n is determined randomly when the gene tree is created. For both operations, the reading head is moved to the child node upon completion. Gruau (1995) uses two terminating instructions, U and L, each producing a neuron with a slightly different sigmoid function.

Friedrich and Moraga (1996, 1997) use real-valued weights and introduce several new program symbols. For example, the LSPLIT operation is similar to a PAR operation, except that the input connections to the original cell are split evenly between the two child cells, as illustrated in Figure 39. If the number of incoming connections is odd, LSPLIT will copy the middle connection to both child cells. Friedrich and Moraga use several different terminating program symbols to develop architectures with mixed activation functions (Friedrich and Moraga, 1997, p. 153). In addition to a traditional sigmoid function, neurons may have a hyperbolic tangent, log, or Gaussian activation function.

[Figure 39: LSPLIT Program Symbol — cell a is split into cells b and c, with a's incoming connections divided evenly between them; both new cells are placed on the queue.]

Syntactically Constrained Cellular Encoding

Whitley et al. (1995), Gruau (1996) and Gruau et al. (1996) use a cellular encoding with real-valued weights. The weights are encoded by associating an integer between 256 and -255 with each connection, and calculating the weight as that integer divided by 256. Unlike Gruau (1994), they do not permit all possible trees of program symbols to be valid, and use a parameterised compact context-free grammar to limit the

120 space of legal program trees. The grammar is a context-free grammar with two variations that enable a compact set of production rules. The first variation is that the grammar contains several syntactic constructs that may be used on the right-hand side of a production to allow a single rule to encapsulate a set of related productions. A list range precedes a group of terminal and/or non-terminal grammar symbols. It identifies the possible number of child symbols that may be generated, and those symbols may be drawn, with replacement, from the specified group in any order. Figure 40(a) illustrates a production containing a list range structure and all the possible expansions produced by the rule. Note that the number of children may vary. <NET> list[1..2] of {CELL_CYC, CELL_ACYC} [(CELL_CYC) ; (CELL_CYC, CELL_CYC) ; (CELL_CYC, CELL_ACYC) ; (CELL_ACYC) ; (CELL_ACYC, CELL_CYC) ; (CELL_ACYC, CELL_ACYC)] (a) <NET> set[1..2] of {CELL_CYC, CELL_ACYC} [{CELL_CYC} ; {CELL_CYC, CELL_ACYC} ; {CELL_ACYC} ; {CELL_ACYC, CELL_CYC}] (b) <NET> array[1..2] of {CELL_CYC, CELL_ACYC} [(CELL_CYC 1 ) ; (CELL_CYC 1, CELL_CYC 2 ) ; (CELL_CYC 1, CELL_ACYC 2 ) ; (CELL_ACYC 1 ) ; (CELL_ACYC 1, CELL_CYC 2 ) ; (CELL_ACYC 1, CELL_ACYC 2 )] (c) Figure 40: List, Set and Array Grammar Constructs A set range is identical to the list range, except that child symbols are drawn without replacement (see Figure 40(b)). An array range is identical to a list range except that each child symbol is tagged with an appropriate index value from the range (see Figure 40(c)). An integer range produces a single terminal symbol with an integer value in the 97

121 given range. For example, integer[ ] may generate any integer in that range, inclusive. The second variation in the grammar is that a recursion range that may be used on the left-hand side of a production rule to limit the number of times that symbol may be expanded, and hence constrain the size of the generated parse trees. At first glance, the recursion range appears to be a form of context-sensitivity. However, since the value of the range is fixed for a given grammar, it merely represents a highly compact representation of a large number of alternate rules. Figure 41 illustrates how a single production that uses a recursion range and one non-terminal symbol (in (a)) can replace multiple productions requiring several non-terminal symbols (in (b)). <CELL>[0..3] PAR( <CELL>, <CELL> ) <NEURON> (a) <CELL_3> PAR( <CELL_2>, <CELL_2> ) <CELL_2> PAR( <CELL_1>, <CELL_1> ) PAR( <CELL_2>, <CELL_1> ) PAR( <CELL_1>, <CELL_0> ) PAR( <CELL_2>, <CELL_0> ) PAR( <CELL_0>, <CELL_1> ) PAR( <CELL_1>, <CELL_2> ) PAR( <CELL_0>, <CELL_0> ) PAR( <CELL_1>, <CELL_1> ) <NEURON> PAR( <CELL_1>, <CELL_0> ) <CELL_1> PAR( <CELL_0>, <CELL_0> ) PAR( <CELL_0>, <CELL_2> ) <NEURON> PAR( <CELL_0>, <CELL_1> ) <CELL_0> <NEURON> PAR( <CELL_0>, <CELL_0> ) <NEURON> (b) Figure 41: Representation Savings of Recursion Range The grammar is used to generate a family of constrained program symbol trees. However, what is encoded in the genome and stored in the GP population is not the (program symbol tree) itself, but a derivation of the (program symbol tree) using the 98

122 grammar. (Gruau, 1996, p. 381) A node in a given parse tree may be a single grammar symbol or a specific instance of a range structure (e.g., the list (CELL_ACYC)). The children of a list, set or array node are the elements of that structure. A list, set, array or integer is stored with its type, namely the instructions that created it (e.g., list[1..2] of {CELL_CYC, CELL_ACYC} ). The genetic operators are designed to produce offspring parse trees that are syntactically correct according to the grammar. Crossover behaves as follows. Initially, a node from each tree is chosen at random. Integer terminals are invalid crossover points. If the parents of both nodes are identical non-terminal symbols, then the subtrees at those nodes are swapped. If the parents of the nodes are range structures with the same type (e.g., both parents have type array[2..3] of {CELL} ), then crossover occurs within the structure like crossover between bit strings (Whitley et al., 1995, p. 462). Specifically, for a list, the selected node in one tree and all its siblings to the right are swapped with the selected node of the other tree and all its siblings to the right, subject to the range limits specified in the parent. The parent lists are updated with their new elements. For a set, the program symbols of the selected nodes and all their siblings are collected in a group and all repeated symbols are removed. Two new sets are randomly drawn from this group, subject to the range limits of the original parent sets. Each new set replaces the original parent set of one gene. The child subtrees of the new sets are selected from matching child subtrees in the original genes. For an array, if both selected nodes have the same associated index value, then they are swapped and each parent is updated with the new element. The only distinction between a list and an array is the behaviour of the crossover operator. 99
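The list-range crossover can be sketched as follows. Parse-tree nodes are reduced here to plain lists of child subtrees, and offspring that violate the range are simply rejected in favour of the parents; how the cited systems handle a violated range is not stated above, so that policy, like the simplified length-only check, is an assumption.

```python
import random

def list_crossover(parent1, parent2, lo, hi):
    """Swap the selected child and all siblings to its right between two
    list-range nodes of the same type, keeping both lists within [lo, hi]."""
    i = random.randrange(len(parent1))
    j = random.randrange(len(parent2))
    child1 = parent1[:i] + parent2[j:]
    child2 = parent2[:j] + parent1[i:]
    if lo <= len(child1) <= hi and lo <= len(child2) <= hi:
        return child1, child2
    return parent1, parent2          # reject offspring that violate the range

# Example with nodes of type list[1..2] of {CELL_CYC, CELL_ACYC}:
a, b = ["CELL_CYC"], ["CELL_ACYC", "CELL_CYC"]
print(list_crossover(a, b, 1, 2))
```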

123 Mutation behaves as follows. A node is chosen at random from the tree. If it is an integer value, its value is randomly mutated within the range constraints in the structure label. If the node is list, set or array, an appropriate new element is added or an existing element is removed, subject to the structure s constraints. A random subtree is generated for the new element using the grammar. If the node is a non-terminal symbol, the subtree rooted at that node is replaced with a random subtree generated from the grammar. In all cases, depth constraints are enforced Syntactically Constrained Geometry-Oriented Cellular Encoding Kodjabachian and Meyer (1998a, 1998b) present a geometry-oriented variation of the cellular encoding scheme and syntactic constraints that reduce the size of the genetic search space. (Kodjabachian and Meyer, 1998a, p. 211) As with Gruau (1994), the basic representation is a tree of program symbols, and each cell has a reading head that points to the symbol in the tree that it will execute next. Unlike Gruau, development of the network takes place within a two-dimensional grid. Initially, the grid is populated with one or more cells. Each cell in this starting grid is placed at a specific, random location, and no two cells may occupy the same grid square. An event list is used instead of a FIFO queue during the decoding process. Whenever a cell reads a node, it executes the corresponding instruction and records in an appropriate event list that the sub-nodes of that node are to be read after a given time interval. (Kodjabachian and Meyer, 1998a, p. 213) Each cell starts with its reading head pointing to the root of the program symbol tree, and the event of executing that instruction is scheduled at the same time for all starting cells. 100

New cells are created using the DIVIDE instruction, of arity 2. DIVIDE accepts two parameters, a direction α and a distance r. Using the location of the original cell as a starting point, the specified direction and distance are used to identify a target grid location. A single new cell is created in the closest available grid square to that location. The new cell has no connections. Two new events are placed on the list, one for the original cell to execute the left child node of the DIVIDE instruction and one for the new cell to execute the right child node. New connections are created with the GROW or DRAW instructions, both of arity 1. GROW accepts three parameters, a direction α, a distance r and a weight w. Using the location of the executing cell, c, as a starting point, a target location is calculated and the nearest cell, d, to that location is identified. No connection is created if the target location is outside the limits of the grid. Otherwise, a connection with weight w is made from c to d. If the cell itself is closest to its own target point, a recurrent connection is created on that cell. (Kodjabachian and Meyer, 1998a, p. 213) DRAW operates in identical fashion except that the connection is directed from d to c. A new event is placed upon the list for the cell c to execute the child node of the GROW or DRAW operation, as appropriate. No terminating program symbols (e.g., END from Gruau, 1994) are used. A cell halts its development when it reads a node with no sub-node. From that moment on it is called (a neuron). (Kodjabachian and Meyer, 1998a, p. 213) All neurons have the same activation function, which has two parameters. Each cell stores a value for each parameter, namely a time constant τ and a bias b. The value of those parameters may be changed with the parameterized instructions SETTAU τ or SETBIAS b, both of arity 1.
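A minimal sketch of the geometric connection-growing step described above (Python, not from the thesis; the use of continuous coordinates, radians for the direction α and a Euclidean nearest-cell rule are assumptions about details left open here):

    import math

    def grow(grid_size, cells, source, alpha, r, w, connections):
        # Compute the target point reached by moving distance r in direction alpha from
        # the executing cell, then connect it to the cell nearest that target point.
        x, y = cells[source]
        tx, ty = x + r * math.cos(alpha), y + r * math.sin(alpha)
        width, height = grid_size
        if not (0 <= tx < width and 0 <= ty < height):
            return                        # target outside the grid: no connection is created
        nearest = min(cells, key=lambda c: (cells[c][0] - tx) ** 2 + (cells[c][1] - ty) ** 2)
        connections.append((source, nearest, w))   # DRAW would instead append (nearest, source, w)

    cells = {"c": (1.0, 1.0), "d": (4.0, 1.0)}
    connections = []
    grow((8, 8), cells, "c", alpha=0.0, r=3.0, w=0.5, connections=connections)
    print(connections)                    # [('c', 'd', 0.5)]; a cell nearest its own target gets a recurrent link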

Finally, a cell and all its incoming and outgoing connections may be removed from the location grid using the instruction DIE.

Terminal Symbols: DIVIDE, GROW, DRAW, SETBIAS, SETTAU, DIE, SIMULT3, SIMULT4
Non-Terminal Symbols: Start, Level1, Level2, Neuron, Connex, Link
Production Rules:
    Start  → DIVIDE(Level1, Level1)
    Level1 → DIVIDE(Level2, Level2)
    Level2 → DIVIDE(Neuron, Neuron)
    Neuron → SIMULT3(SETBIAS, SETTAU, Connex)
    Neuron → DIE
    Connex → SIMULT4(Link, Link, Link, Link)
    Link   → GROW | DRAW
Starting symbol: Start

Figure 42: Context-Free Grammar that Constrains Program Symbol Tree

Kodjabachian and Meyer (1998a, 1998b) do not permit all possible program trees to be used as genomes, and use a context-free grammar to impose syntactic constraints on the development process. Figure 42 illustrates one such grammar. The terminals of the grammar represent program symbols. The non-terminals of the grammar are used to constrain the expansion of program symbol trees. Two new program symbols, SIMULT3 and SIMULT4 are introduced. These symbols are used to group instructions for special treatment by the genetic operators. The genome of an individual is the final program symbol tree generated using the grammar. The genetic operators of subtree mutation and subtree crossover are designed to ensure that all offspring program symbol trees are valid

126 according to the constraint grammar. The production rules limit the possible program symbol trees and corresponding neural networks; the final numbers of (neurons) and connections created by a program that is well-formed according to (the grammar) cannot exceed eight and 32 respectively (Kodjabachian and Meyer, 1998a, p. 4) Adaptation of Learning Rules The automated development of new learning rules that may be used within a neural network is an important area of research (Bengio, Bengio and Cloutier, 1994; Bengio, Bengio, Cloutier and Gecsei, 1992; Chalmers, 1990; Radi and Poli, 1997, 1998). Even though it is generally admitted that the learning rule has a crucial role, neural models commonly use ad hoc or heuristically designed rules; furthermore, these rules are independent of the learning problem to be solved. This may be one reason why most current models (some with sound mathematical foundation) have difficulties to deal with hard problems. (Bengio et al., 1992, sic) Chalmers (1990) and Bengio et al. (1992, 1994) propose the use and optimization of a general, parametric form of learning rule that computes the change in connection weight value as the sum of several simpler terms. Each term is a function of the local variables of one or more nodes, and each term has a coefficient that determines the relative contribution of that term to the overall change in weight value. A new learning rule can be found by changing the values of the coefficients in the general form of the rule. In other words, the coefficient values are the parameters of the general learning rule. Chalmers (1990) uses a single parametric learning rule with ten terms that is a linear function of four local node variables and their six pair-wise products. Bengio et al. 103

(1992, 1994) use several parametric learning rules, some with up to sixteen terms, whose terms are often taken from well-known neural learning rules. For example, the following equation is an example of a general learning rule with five terms, similar to one used in Bengio et al. (1992),

    Δw_ji = θ_0 + θ_1·x_i + θ_2·y_j + θ_3·y_j·x_i + θ_4·x_i·w_ji        (2-33)

where Δw_ji is the change in the current weight value w_ji on the connection from node i to node j that is calculated for a given learning trial; x_i and y_j are the output activations of nodes i and j, respectively; and θ_0..θ_4 are the coefficients, or parameters. The terms of the general learning rule are selected arbitrarily, and the fourth term, θ_3·y_j·x_i, is the Hebbian learning rule from (2-2) with θ_3 equivalent to the learning rate. Given a fixed general learning rule, a specific learning rule optimized for a particular problem may be automatically developed as follows. A fixed neural network topology of appropriate size for the problem is selected. In a given trial, a set of coefficient values is determined and the network is trained to solve the problem using the resultant learning rule, starting from a random configuration. The performance of the trained network is used to determine an evaluation of the effectiveness of the learning rule. The coefficients are then adapted accordingly for the next trial using an optimization algorithm. Chalmers (1990) uses a genetic algorithm with a fixed-length, binary encoding of the coefficients. Bengio et al. (1992) use two different optimization methods: gradient descent and simulated annealing; Bengio et al. (1994) additionally use a genetic algorithm with a fixed-length, binary encoding of the coefficients.
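A minimal sketch of how one candidate coefficient vector turns the general rule of equation (2-33) into a concrete weight update (Python, not from the thesis; the example coefficient values are arbitrary):

    def delta_w(x_i, y_j, w_ji, theta):
        # Five-term parametric rule of equation (2-33); theta holds (θ0, θ1, θ2, θ3, θ4).
        t0, t1, t2, t3, t4 = theta
        return t0 + t1 * x_i + t2 * y_j + t3 * y_j * x_i + t4 * x_i * w_ji

    # With θ = (0, 0, 0, 0.1, 0) the rule collapses to plain Hebbian learning with rate 0.1.
    print(delta_w(x_i=1.0, y_j=0.5, w_ji=0.2, theta=(0, 0, 0, 0.1, 0)))   # 0.05

An optimizer of the kind described above, whether a genetic algorithm, gradient descent or simulated annealing, would search over theta, retraining the fixed network with each candidate rule.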

In a similar approach, Bengio et al. (1994) and Radi and Poli (1997, 1998) use genetic programming to dynamically determine both the appropriate form for the learning rule and the appropriate values for the associated coefficients. The genetic program creates new rules from arithmetic operators (e.g., +, -, *, /), a set of local variables and constants. The local variables used are similar to those used in the fixed, parametric approach above (e.g., y_j, x_i, w_ji). A limit is imposed upon the depth of the genetic tree in order to promote the evolution of good learning rules with a reasonable number of terms. Both groups of researchers present experiments that show their systems develop learning rules similar to standard back-propagation when the variables used in the genetic representation include the variables used in the back-propagation rule.

2.5 Attribute Grammars

Definition

An attribute grammar (Knuth, 1968) uses a context-free grammar as its core and augments each symbol with a set of attributes and augments each production with a set of rules that compute values for the attributes of every symbol in the production. The attributes of a given symbol may be partitioned into two disjoint sets: inherited and synthesized. For a given production, the attribute evaluation rules compute the synthesized attributes of the left-hand symbol of the production and the inherited attributes of the right-hand symbols. The computations may depend only upon the values of attributes of symbols in that production. In general, the terminal symbols may only have attributes with fixed values and the start symbol may only have synthesized attributes. However, this may be relaxed (Alblas, 1991a). Formally, an attribute grammar is a three-tuple (G,A,R), where

    G = (N, T, P, S) defines a standard context-free grammar (see 2.2.7)

    A = ∪ A_s, s ∈ N ∪ T, where A_s = Inh_s ∪ Syn_s is a finite set of attributes associated with
    symbol s, Inh_s ∩ Syn_s = ∅, and A_s ∩ A_s' = ∅ for all s, s' ∈ N ∪ T, s ≠ s'. A specific
    attribute a ∈ A_s of a symbol s is denoted using the convention s.a.

    R = ∪ R_p, p ∈ P, where R_p = E(p, X_0.a_j) ∪ E(p, X_i.a_k), a_j ∈ Syn_X0, a_k ∈ Inh_Xi,
    1 ≤ i ≤ n, is a finite set of attribute evaluation rules E associated with production
    p: X_0 → X_1 ... X_n, such that E(p, a) = f((∪ A_Xm) − a), 0 ≤ m ≤ n.

An attribute grammar generates sentences that have both syntactic and semantic properties. Generation of the sentences requires two phases. In the first phase, the context-free grammar is used to generate a standard sentence and its corresponding parse tree. The tree initially contains syntactic information only. In the second phase, values are computed for all the attributes of all the symbols in the tree using an attribute evaluator that traverses the tree and executes the attribute evaluation rules of the associated productions. The result may be drawn as an attributed parse tree, which is the context-free parse tree with each symbol decorated with the attribute values computed for that symbol. A number of attribute evaluation algorithms with different properties and computational complexities have been presented (Alblas, 1991a, 1991b; Bochmann, 1976; Deransart, Jourdan and Borho, 1988). A traditional evaluator is described below: The task of an attribute evaluator is to compute the values of all attribute instances attached to the derivation tree, by executing the attribute evaluation instructions associated with these attribute instances. In general the order of evaluation is free, with the only restriction that an attribute evaluation instruction cannot be executed before its arguments are available. An attribute instance is available if its value is defined, otherwise it is unavailable. Initially, all attribute instances attached to the derivation tree are unavailable, with the exception of the inherited attribute instances attached to the root (containing information concerning the

130 environment of the program) and the synthesized attribute instances attached to the leaves (determined by the parser). At each step an attribute instance whose value can be computed is chosen. The evaluation process continues until all attribute instances in the tree are defined or until none of the remaining attribute instances can be evaluated. (Alblas, 1991a, p. 5) The attributed parse tree may be used in a variety of ways, such as to compute a single semantic meaning for the entire tree within the attributes of the root symbol (Knuth, 1968), to conclusively decide whether a sentence is semantically correct or not (Alblas, 1991a, p. 4), and to reduce ambiguity when a syntax is ambiguous, in the sense that some strings of the language have more than one derivation tree, the semantic rules give us one meaning for each derivation tree (Knuth, 1968, p. 144). In terms of their power of representation, attribute grammars are more powerful than context-free grammars, and have the same formal power as Turing machines (Knuth, 1968; Deransart et al., 1988) Ease of Representation A primary motivation behind the development of attribute grammars by Knuth (1968) was the benefit of using both inherited and synthesized attributes to improve the ease of design and the readability of the grammar, even though the use of synthesized attributes alone provides the same formal representational power. Synthesized attributes alone are sufficient to define the meaning associated with any derivation tree But this statement is very misleading, since semantic rules which do not use inherited attributes are often considerably more complicated (and more difficult to understand and to manipulate) than semantic rules which allow both kinds of attributes. The ability to allow the whole tree influence the attributes of each node of the tree often leads to rules of semantics which are much simpler and which correspond to the way in which we actually understand the meanings involved. (Knuth, 1968, p. 134) The importance of inherited attributes is that they arise naturally in practice and that they are dual to synthesized attributes in a straight- 107

131 forward manner there are many languages for which such a restriction (to only synthesized attributes) leads to a very awkward and unnatural definition of semantics. (Knuth, 1968, p. 131) Figure 43 illustrates an attribute grammar that encodes the syntax for binary notation (adapted from Knuth, 1968) and uses the attributes to define the semantics of a binary string in terms of its decimal value. The attribute grammar specification format used in this document is as follows. Every non-terminal symbol has its first letter capitalized and is enclosed in angle brackets for enhanced readability. Each non-terminal is listed with its synthesized attributes in parentheses and its inherited attributes in braces. The start symbol is boldface. Attribute names are all in lower-case and italicized. Terminal symbols are all in lower-case and may or may not be enclosed in angle brackets. The productions of the grammar each have three sections. The context-free rule is framed within a box. If two or more identical non-terminal symbols are used in the context-free production, they are distinguished through the use of sub-scripted numbers. Note that these numbers are for readability only and are not actual symbols in the grammar. The context-free rule is followed by the attribute evaluation rules for the synthesized attributes of the left-hand symbol. These are followed by the attribute evaluation rules for the inherited attributes of the right-hand symbol(s). If both types of evaluation rules are defined for a given production, a dotted line separates them. 108

Non-Terminal Symbols (synthesized attributes) {inherited attributes}:
    <Number>    : (value) {}
    <Bitstring> : (value, length) {scale}
    <Bit>       : (value) {scale}
Terminal Symbols: 0, 1, .

Productions and Attribute Evaluation Rules:

    <Number> → <Bitstring>_1 . <Bitstring>_2
        <Number>.value = <Bitstring>_1.value + <Bitstring>_2.value
        <Bitstring>_1.scale = 0
        <Bitstring>_2.scale = -1 × <Bitstring>_2.length

    <Number> → <Bitstring>
        <Number>.value = <Bitstring>.value
        <Bitstring>.scale = 0

    <Bitstring>_1 → <Bitstring>_2 <Bit>
        <Bitstring>_1.value = <Bitstring>_2.value + <Bit>.value
        <Bitstring>_1.length = <Bitstring>_2.length + 1
        <Bitstring>_2.scale = <Bitstring>_1.scale + 1
        <Bit>.scale = <Bitstring>_1.scale

    <Bitstring> → <Bit>
        <Bitstring>.value = <Bit>.value
        <Bitstring>.length = 1
        <Bit>.scale = <Bitstring>.scale

    <Bit> → 0
        <Bit>.value = 0

    <Bit> → 1
        <Bit>.value = 2^(<Bit>.scale)

Figure 43: Attribute Grammar for Binary Numbers

The semantic rules of the grammar in Figure 43 compute the true decimal value of every subtree in the value attribute. The binary numbers generated by the grammar may include both an integer and a fraction component. Decoding the fraction component into a decimal value requires different computations than those required for decoding the integer component. To accomplish this, the attribute evaluation rules feed information up the tree to the root about the length of the integer and fraction components. This total length is used to compute and feed information down the tree about the correct scale multiplier required for converting each particular bit to its true decimal value. The true value of all bits and bit-strings is computed and fed up the tree, with the value of the entire number computed at the root.
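The following minimal sketch (Python, not part of the thesis) mirrors the attribute rules of Figure 43 directly: the scale argument plays the role of the inherited attribute, and the values returned by the functions play the role of the synthesized value attribute.

    def bit_value(bit, scale):
        # <Bit> -> 0 : value = 0 ;  <Bit> -> 1 : value = 2^scale
        return 0.0 if bit == "0" else 2.0 ** scale

    def bitstring_value(bits, scale):
        # <Bitstring> -> <Bit> : the single bit inherits the bitstring's scale
        if len(bits) == 1:
            return bit_value(bits, scale)
        # <Bitstring>_1 -> <Bitstring>_2 <Bit> : the prefix inherits scale + 1, the last bit inherits scale
        return bitstring_value(bits[:-1], scale + 1) + bit_value(bits[-1], scale)

    def number_value(string):
        if "." in string:
            integer, fraction = string.split(".")
            # <Number> -> <Bitstring>_1 . <Bitstring>_2 : scales 0 and -length(fraction)
            return bitstring_value(integer, 0) + bitstring_value(fraction, -len(fraction))
        return bitstring_value(string, 0)   # <Number> -> <Bitstring> : scale 0

    print(number_value("1011.101"))         # 11.625, the value computed at the root in Figure 44

Here the order of the recursive calls stands in for the attribute evaluator's only constraint, namely that a rule may not fire until its argument attributes are available.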

[Figure 44 shows the attributed parse tree for the binary string 1011.101. Every <Bitstring> and <Bit> symbol is decorated with its inherited scale attribute and its synthesized value (and, for bit-strings, length) attributes: the integer subtree carries value 11 and length 4, the fraction subtree carries value .625 and length 3, and the root <Number> carries value 11.625.]

Figure 44: Attributed Parse Tree

Figure 44 illustrates a sample attributed parse tree generated using this attribute grammar. This example demonstrates how the attributes augment the context-free parse

134 tree, and clearly shows that an attribute grammar incorporates context-sensitive computations. The attribute values for each symbol are listed in a dashed box under the symbol, with the inherited attributes listed first and separated from the synthesized attributes by a dotted line. The two subtrees rooted by the bold, boxed symbols <Bitstring> are syntactically identical (both represent the bit-string 101). However, their semantics are clearly different (value = 10, scale = 1 versus value =.625, scale = -3) due to their different placement in the tree, and hence their different context. The interaction between inherited and synthesized attributes in the grammar allows a computation performed within one particular attribute evaluation rule to be influenced by attributes throughout the parse tree Properties and Limitations The key issue when designing an attribute grammar is to avoid circular dependencies between the attribute evaluation rules. Several algorithms have been presented for verifying whether a grammar is circular (Deransart, 1988; Knuth, 1968), and a traditional attribute evaluator, as described above, will not be capable of fully decorating every parse tree if the grammar is circular. A well-formed attribute grammar contains no circular dependencies among the attributes and successful evaluation of attributes is guaranteed. A normalized attribute grammar contains attribute evaluation rules that depend only upon attributes that have been computed outside of the production in question. In other words, only inherited attributes of the left-hand symbol and synthesized attributes of the right-hand symbols may be used within an attribute evaluation rule. Formally, 111

An attribute grammar is normalized if and only if

    ∀ E(p, a) ∈ R_p, R_p ∈ R:  E(p, a) = f(Inh_X0 ∪ Syn_Xi), 1 ≤ i ≤ n, where production p: X_0 → X_1 ... X_n.

A normalized attribute grammar is non-circular, and any well-formed attribute grammar may be normalized by means of a simple transformation of the semantic definitions (Deransart et al., p. 5). Attribute grammars are declarative. The designer specifies the rules that will be used to compute the values of the attributes, but he or she does not specify the order in which those rules will be applied. This order is determined by the attribute evaluation algorithm, and evaluation is guaranteed so long as the grammar is well-formed.

136 Chapter 3 Attribute Grammar Encoding This chapter first presents a critique of the scalability of current neural network models and the techniques that are used to represent complex neural network models. Based upon this analysis of the current state of the art, several key design issues concerning scaling and representation of neural networks are identified. A novel technique for representing families of neural network architectures using attribute grammars, the Network Generating Attribute Grammar Encoding (NGAGE) is introduced, and seven hypotheses concerning the capabilities of NGAGE are proposed. 3.1 Motivation The development of neural network systems capable of learning to solve problems of large size and high complexity in reasonable time with reasonable performance remains a continuing challenge hampered by the lack of consistent methods for representing, integrating and adapting neural network models. 113

137 3.1.1 Scalability The complexity of a problem may be defined in terms of the logical subtasks that must be solved before the original problem can be solved. A complete task decomposition identifies the subtasks used to solve the problem, the relationships between those subtasks and the problem s environment, and the relationships among the subtasks (Ronco et al., 1997). For example, in a given task decomposition, certain subtasks may perform similar functions, but operate upon different portions of the input space; certain subtasks may perform different functions, but operate upon similar portions of the input space; and/or certain subtasks may be dependent upon the results of other subtasks. A given problem may generally be decomposed in multiple ways, and a given task decomposition may solve only part of the original problem. Thus, different decompositions may vary in effectiveness. In particular, for a given problem, there may not exist a task decomposition that solves that problem with 100% effectiveness. In practical applications, an effective task decomposition for the problem being solved may or may not be known a priori, and a key factor in learning to solve the problem in the latter case is the discovery of such a decomposition. As a working definition, the complexity of a problem is defined by the number and type of subtasks required for a highly effective solution, and the intricacy of the relationships among those subtasks. For example, a simple problem may have a small number of similar subtasks that are solved independently of each other, while a complex problem may have a large number of different subtasks that are highly nested. Further, the complexity of learning to solve a given problem is defined in terms of the effort required to discover an effective task decomposition. Generally, the discovery of an effective task 114

138 decomposition for a simple, well-understood problem requires less effort than the discovery of one for a complex, poorly-understood problem. Early network models (Kohonen, 1984; Rumelhart and McClelland, 1986) followed the approach of adapting the weights of a static, generic neural structure (e.g., back-propagation has a fixed layered structure with full connectivity between layers). For small problems and for simple problems of moderate size, these networks can find effective solutions in a reasonable time since most of the resources (i.e., nodes and connections) of the network contribute meaningfully to the solution and the learning algorithm is able to effectively identify and adapt those resources pertinent to specific input patterns. However, for large, complex problems, the generic structure of the network often leads to poor learning behaviour. For example, in some cases the network contains many more resources than required and a great deal of wasteful computation adapting those resources occurs during learning. In other cases, the generic structure does not lend itself well to specialization on differing subtasks and the network fails to learn an effective task-decomposition. Subsequent neural network systems have used a variety of approaches to improve scalability. By far the most common approach is to manually tailor the neural algorithm and/or neural structure for a specific problem (Arbib, 1995). Significant performance improvements may result, but the issue of general scalability is not directly addressed. A second popular approach is the use of mechanisms that perform structural learning. Growing algorithms (Fahlman and Lebiere, 1990; Smieja, 1993) and pruning algorithms (Hassibi et al., 1993; LeCun et al., 1990) enable the development of neural solutions that have an appropriate number of resources for the problem at hand. This 115

139 helps reduce wasteful computation during learning, and aids in the discovery of better task decompositions. The most effective approaches to addressing scalability and solving complex problems are the use of modularity within neural network models and the use of evolutionary neural network systems (ENNSs), hybrid systems that use evolutionary algorithms to evolve neural network solutions Modularity and Scalability The introduction of modularity into neural network models is one of the most effective approaches to solving complex problems. (Hrycej, 1992; Ronco et al., 1997) The basic premise is that a neural network comprised of structurally distinct components, possibly with distinct processing and learning behaviours, is more naturally suited to effectively learning logically distinct subtasks than a homogeneous, uniformly connected network. In other words, better performance is obtained when the form of a network reflects the function of the problem at hand. The term intramodular topology will be used to refer to the nodes and connections within a module, and the term intermodular topology will be used to refer to the modules and connections between modules within a network. Modular neural network models may be distinguished along three dimensions: the structural distinctiveness of the modules, the types of modular decomposition that are performed, and the technique used to implement modularity. Each of these factors influences the scalability of the model. 116

140 Structural Distinctiveness The degree of structural distinctiveness has a direct, practical impact upon the scalability of a modular network. In general, intramodular topologies may be densely connected, but intermodular topologies should be, relatively speaking, sparsely connected. If such were not the case, the distinctiveness of the modules would disappear, or more simply, the network would not be modular by definition. Thus, a modular network should contain many fewer connections than a more traditional, densely-connected network with the same number of nodes, and should therefore generally have fewer wasteful resources and a better scaling potential. Most modular networks consist of modules that have a clearly defined intramodular topology and intramodular functionality, as well as strict limitations upon intermodular topology. Mixture of Experts networks (Jacobs et al., 1991a, 1991b), autoassociative networks (Ballard, 1990), counter-propagation networks (Hecht-Nielsen, 1990) and ARTSTAR networks (Hussain and Browse, 1994) all consist of modules that are complete neural models in their own right, and have clearly defined input and output nodes; connections between modules are limited to input and output nodes of the modules only. Hierarchical Mixture of Experts networks (Jordan and Jacobs, 1993), MINOS networks (Smieja, 1991, 1994; Smieja and Mühlenbein, 1992) and CALM networks (Happel and Murre, 1992, 1994) consist of modules that contain multiple internal components, each with distinct topology and functionality, but each module still has clearly defined input and output nodes, and intermodular connections are limited to those nodes only. 117

141 Some models, such as the short-connection bias network (Jacobs and Jordan, 1992) and the networks created using traditional cellular encoding (Gruau, 1994, 1995), do not impose strict boundaries between modules nor distinct functionality within modules. Rather, modules are regarded as sub-structures with a high degree of local connectivity, and nodes within such a module may connect arbitrarily (yet sparsely) with nodes from another module Modular Decomposition The modular structure of a neural network is generally used to implement one or more of several basic kinds of task decomposition. A redundant decomposition (Battiti and Colla, 1994; Gargano, 1992; Hansen and Salamon, 1990; Lincoln and Skrzypek, 1990; Mani, 1991) applies two or more similar modules to the exact same sub-task the same inputs are provided and the same outputs are expected. The expectation is that random differences between modules will result in slightly different performance characteristics, and that a decision based upon the combined results of the modules will be more effective. In essence, a redundant network samples multiple points in the search space of possible network configurations. The use of redundancy is at direct odds with the goal of improved scalability, and the training effort for the entire network grows linearly with the number of redundant modules. An input partition decomposition separates a given input vector into multiple, possibly overlapping, sub-vectors and assigns each sub-vector to a different module. The same outputs are expected from each module. ARTSTAR (Hussain and Browse, 1994) assigns distinct input vectors to each module, and the technique may be used to train 118

multi-resolution networks, in which each module views image data at a different resolution and extracts qualitatively different features. Networks created by cellular encoding using the LSPLIT operator (Friedrich and Moraga, 1996, 1997) explicitly separate the input vector into two halves, each to be processed by a distinct module. An input partition decomposition serves to simplify the complexity of the problem by reducing the size of the patterns and the number of features that must be learned by each module. This is a good approach to improving scalability since simpler modules may be used to solve the simpler subtasks. However, it assumes that the input patterns have natural structure that may be exploited. If important dependencies exist between the different sub-vectors of the partition, this approach may negatively affect the performance and scalability of the network. An output partition decomposition separates the output signals into multiple, possibly overlapping, subsets and assigns each to a different module. Each module is thus responsible for computing part of the problem output. This reduces the complexity of the problem being learned by each module since it only needs to identify and learn those patterns that affect its portion of the output. However, the approach is only effective if the output vectors contain locally independent results. Partitioning highly distributed output patterns can reduce the performance and scalability of the network. Networks created by cellular encoding using the USPLIT operator (Friedrich and Moraga, 1996, 1997) explicitly separate the output vector into two halves, each to be computed by a distinct module. A data space decomposition trains different modules on different subsets of the training data. The decomposition may be performed manually as part of the training

143 procedure; for instance, some variations upon redundant networks use the technique of training each redundant module on a different subset of data (Hansen and Salamon, 1990). More interestingly, the decomposition may be dynamically determined. The Mixture of Experts network and the MINOS network both use supervisory structures that allocate each specific data pattern to the most responsive module for further reinforcement. After training, the resulting modules each respond best to different types of patterns in the data. A data space decomposition is a very effective scaling solution when the subsets are well chosen. When each subset contains groups of similar patterns representing similar outputs, and these groups differ significantly from those in other subsets, then each module will tend to learn associations significantly different from those of the other modules. In other words, different modules specialize upon different, smaller subtasks, and thus tend to require fewer resources. When the subsets all contain similar distributions of patterns, a redundant decomposition occurs and no scaling benefit is achieved. A successive processing decomposition (Gallinari, 1995) uses multiple modules to solve the problem, each dependent upon the results of the previous module. Typically, the input vector is provided to one module, which then produces the output for the next module, which in turn produces the output of the network. Successive processing may improve the performance and scalability of a network when the problem has a natural, corresponding task decomposition. If the successive tasks require qualitatively different computations, then the approach is highly effective since the appropriate neural models (and associated processing and learning behaviours) may be used where necessary. For example, the counter-propagation network (Hecht-Nielsen, 1990) uses two successive 120

144 modules, and each module has a different architecture. The first performs a quantization of the input and the second performs an association of the resulting map with the desired outputs. A parallel processing decomposition uses multiple independent modules to solve the problem. Typically, each module receives the same input signals, but processes them differently in contributing to the network s output. Redundant and data space decompositions are instances of parallel processing decompositions that focus upon networks of modules with similar architectures. More generally, modules may differ significantly in their architecture. When a problem has independent subtasks that require qualitatively different computations, then the approach is highly effective since the appropriate neural models (and associated processing and learning behaviours) may be used where necessary (Ronco et al., 1997). These decompositions are not mutually exclusive and may be combined and exploited in a variety of ways, possibly multiple times within the same neural network. In particular, analogous decompositions may be applied to inputs and outputs that are actually internal to the network, resulting in a hierarchical decomposition. For example, one module may process the network s input signals and produce signals that are passed to two independent modules, which in turn provide the output of the network. This is a combination of successive and parallel processing decompositions. The networks generated using cellular encoding often exhibit highly varied modular topologies due to the explicit incorporation of redundancy (using the REC program symbol), successive processing (using the SEQ program symbol) and parallel processing (using the PAR 121

program symbol) (Gruau, 1994, 1995); as well as input and output signal partitioning (Friedrich and Moraga, 1996, 1997).

Technique

Ad-hoc modular models, such as counter-propagation and ARTSTAR, combine two or more existing network models into a single, specific network architecture. This produces a new model that is naturally suited to problems that contain that specific task decomposition, but is not generally applicable to other types of problems. Building-block modular models, such as auto-associative networks and CALM networks, define a single type of higher-level network component and allow multiple instances of this component to be connected in arbitrary ways in a network solution. The resulting networks are better suited to discovering new task decompositions since the distinct components may specialize more easily than simple nodes. However, in current models, the actual number and sizes of modules and the intermodular topology are specified in advance, and thus the task decomposition is largely determined by the developer and not by the system itself. Also, although the individual modules may vary in size, they all have the same structural properties and learning behaviours, and thus may not be very effective at performing certain types of subtasks. Integrative modular models, such as most redundant networks, the Mixture of Experts network and the MINOS network, define a specific integrative structure that is used to integrate the responses of and possibly regulate the learning behaviours of the modules. These models directly address automated task decomposition and module specialization. However, in current models the network structure is static, and thus the

146 network is limited by the structural choices of the developer. Most integrative models consist of only a single layer of modules and are accordingly limited, but some integrative models, such as the hierarchical Mixture of Experts network are capable of more complex decompositions. Dynamic modular models, such as the short connection bias network, adapt the intramodular and intermodular topology in response to the demands of the task being learned. This is a useful technique for eliminating unnecessary resources and directly discovering new task decompositions within a back-propagation network Evolutionary Neural Network Systems Another promising approach to improving neural network scalability is the use of evolutionary neural network systems (ENNSs), hybrid systems that use evolutionary algorithms (Fogel, Owens and Walsh, 1966; Rechenberg, 1973; Holland, 1975; Koza, 1992, 1994; Spears et al., 1993; Angeline, 1996a; Fogel, 1998) to evolve neural network solutions. Most neural network models are complex algorithms with many degrees of freedom. A neural network may vary in its topology, in the local memory values associated with the nodes and connections (e.g., threshold values and weights), in the transfer functions applied at the nodes, in the parameters used by the learning rules applied by the nodes, and in the learning rules themselves. In any ENNS, certain network characteristics are kept fixed while others are optimized during the evolutionary search. The choice of which properties to evolve is intimately related to the choice of genetic representation (Holland, 1975; Koza, 1992, 1994). 123

147 Early ENNSs used direct, structural or parametric encoding to genetically represent a network. A direct encoding (Heistermann, 1990; Miller et al., 1989) specifies only the weight values of a network. All other aspects of the network, including topology, are kept fixed. A structural encoding specifies the exact topology for a fixed number of nodes, and may (Collins and Jefferson, 1990; Dasgupta and McGregor, 1992) or may not (Hancock and Smith, 1990; Kitano, 1990) also specify the weight values. All other aspects of the network, such as transfer functions, learning rates and learning rules, are kept fixed. A parametric encoding (Polani and Uthmann, 1992; Schaffer et al., 1990) may specify a wide variety of network properties, but usually these are limited in number and may be specified only at a high-level. These earliest ENNSs were limited in the variety of network solutions that they could explore since they were often based upon traditional neural network models that were regular in topology, were not modular, and had homogeneous processing and learning behaviours (Yao, 1999). Recent ENNSs have also used developmental rule, grammatical or cellular encoding to genetically represent neural networks. A developmental rule encoding (Boers et al., 1993; Jacob, 1994a; Kitano, 1990; Voigt et al., 1993) represents a network as a small set of graph re-write rules that may be applied to an initial network configuration to produce a final topology that may be sparse and irregular at the macroscopic level. Due to the recursive nature of the graph-rewrite operations, the resulting networks often exhibit identifiable repeated substructures, and thus some degree of modularity, when examined closely. The evolutionary search is over a set of possible re-write rules, and thus a wide variety of irregular, modular network topologies may be explored. Functional aspects of the network remain fixed. 124
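To make the simplest of the encodings surveyed above concrete, the sketch below (Python, not from the thesis; the 2-3-1 topology and the random initialization are arbitrary illustrative choices) shows a direct encoding: the genome is nothing more than a flat vector of weights for a fixed, fully connected layered network, and nothing else is evolved.

    import random

    def decode_direct(genome, layer_sizes):
        # Direct encoding: only the weights are evolved; topology and behaviour stay fixed.
        weights, k = [], 0
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
            layer = [[genome[k + r * n_in + c] for c in range(n_in)] for r in range(n_out)]
            weights.append(layer)
            k += n_in * n_out
        return weights

    layer_sizes = [2, 3, 1]                      # a fixed 2-3-1 feed-forward topology
    genome = [random.uniform(-1, 1) for _ in range(2 * 3 + 3 * 1)]
    print(decode_direct(genome, layer_sizes))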

148 A grammatical encoding (Jacob, 1994b; Jacob and Rehder, 1993) defines a fixed set of production rules and encodes a network as a sentence from that grammar. On its own, the grammatical encoding has been used to represent the topology of a network. However, Jacob (1994b) has developed a hierarchical evolutionary system that combines the grammatical encoding of topology with a parametric representation of neuron functionality and a bit-string representation of weight values. Evolution proceeds at the three levels simultaneously, and the resulting system may produce networks that are highly irregular in topology and heterogeneous in functionality. The networks have no learning capabilities, however. Cellular encoding (Friedrich and Moraga, 1996, 1997; Gruau, 1994, 1995, 1996; Gruau and Whitley, 1993; Gruau, Whitley and Pyeatt, 1996; Kodjabachian and Meyer, 1995, 1998a, 1998b; Whitley, Gruau and Pyeatt, 1995) follows an approach similar to both developmental rule encoding and grammatical encoding. A fixed set of rules are defined that may encode variations in topology, weight values and neuron functionality. These rules may be applied to an initial network configuration as graph re-write operations. A specific derivation sequence of rules forms the genetic encoding. The original cellular encoding produces networks that are generally irregular in topology but often have identifiable modular or regular substructures. The degree of modularity or regularity is dependent upon the use of certain productions (e.g., REC and CLONE). Advanced cellular encoding approaches (Gruau, 1996; Gruau et al., 1996; Kodjabachian and Meyer, 1998a, 1998b; Whitley et al., 1995) use syntactic mechanisms to constrain the set of possible derivation sequences. The resulting networks may be highly modular and regular in topology at certain levels of structure, but highly irregular at other levels. 125

149 Functional variations in the network may be explored through the use of re-write rules that select neuron transfer functions from a pre-defined set of alternatives. Typically, the networks produced by cellular encoding exhibit minimal, fixed learning behaviours. Happel and Murre (1994) have also examined the evolution of modular structure. They use a parametric encoding to evolve the size and arrangement of CALM modules within the network. This enables the automatic discovery of good intermodular topologies within CALM networks. Most existing ENNSs focus upon the optimisation of neural topology and/or associated weight values, and leave the functional behaviour in the network constant. Systems based upon grammatical and cellular encoding explore some variations in activation processing characteristics of nodes, but the nature of these variations is often limited to a small number of fixed alternatives. For example, Jacob and Rehder (1993) and Jacob (1994b) allow ten different activation functions to be used in the neurons, and Friedrich and Moraga (1996, 1997) allow four different activation functions. Yao (1993, 1999) has identified a continuing weakness in the development of ENNSs that optimise learning behaviours. Bengio et al (1992, 1994) and Chalmers (1990) use a parametric genetic encoding to evolve complex learning rules from simpler component functions. Due to the use of a parametric encoding, the form of the learning rule is fixed in advance. Bengio et al (1994) and Radi and Poli (1997, 1998) use a genetic programming with a standard tree-based encoding that allows the form of the learning rule to be evolved as well. However, in each of these systems, a fixed network topology is used, the networks are not modular, only the learning rules are adapted during evolution, and a single learning rule is used throughout the network. 126

150 3.1.4 Design Principles for Scaling Solutions Each of these approaches to resolving the scaling issues of neural networks has merits and drawbacks, and from all of the techniques taken together, six positive design principles are identified: (1) [Hybrid systems] Neural networks may be combined with other adaptive techniques (e.g., evolutionary algorithms, structural learning) to develop more effective neural solutions. (2) [Modular task decomposition] The use of modules and the specific arrangement of modules within a modular network is an important factor in the effectiveness of the resulting network. (3) [Heterogeneous modules] The use of different neural network architectures as modules within a single new architecture may produce more effective solutions to certain complex problems. (4) [Dynamic structure] The automatic adaptation of neural network structure enables discovery of solutions with appropriate size and more efficient structural resources. (5) [Dynamic functionality] The automatic adaptation of network functionality enables the most appropriate processing and learning mechanisms for a given problem to be used. (6) [Dynamic modular structure] The automatic adaptation of modular structure enables discovery of more effective task decompositions on generic problems. 127

A limitation of almost every existing system is either that the network structure is too static, and thus the task decomposition is primarily determined by the developer's choice of structure, or that the network is too homogeneous in structure and behaviour, and thus the network is unable to discover and specialize upon important subtasks. Extending upon the positive design principles above, and drawing upon the arguments of Yao (1993, 1999) for increased exploration of dynamic functionality, an additional key design principle is proposed:

(7) [Dynamic modular functionality] The automatic adaptation of the processing and learning behaviours of modules within a modular network.

In other words, in addition to intra- and intermodular topology being tailored to a given problem (i.e., "form following function"), it is important for modular processing and learning behaviours to be tailored for the specific subtasks that are required to solve the problem (i.e., "behaviour following function"). The solution of complex problems using neural networks requires that new neural network models with enhanced scaling performance be developed. The seven positive design principles identified above are important factors in the development of these new models, and new tools that enable the systematic application and exploration of these design principles are required.

Representation of Neural Network Models

In the field of neural computation, all pure neural network architectures may be described using nodal representation, in which a network is described in terms of nodes with local functionality only sending signals to other nodes on connections with arbitrary

152 transmission characteristics. Numerous variations may exist in the details. For example, nodes may range from simple to complex in behaviour and be static or dynamic in nature; networks may range from unstructured to highly structured in connectivity and be homogenous or heterogeneous in nature. To further complicate the depiction of neural network architectures, many researchers specify their architectures using algorithmic components that are not described in this basic nodal terminology. This is particularly true of network learning rules, and such architectures require a translation, if possible, into the basic nodal form to ensure comparable analyses of architectural properties. For instance, the common back-propagation learning rule is not typically presented in a pure neural fashion, and requires significant additional neural structures to be purely implemented (Hecht-Nielsen, 1990). This basic nodal representation is insufficient for fully specifying a neural network model. Recall that a model encompasses a family of architectures. The specification of a model must depict all the possible nodes and associated behaviours that may exist in all valid architectures as well as all the connectivity constraints that define a valid architecture. The same specification may be made using a variety of techniques. The standard practice in the field of neural computing is to specify neural network models with an ad-hoc combination of nodal representation, textual descriptions, graphs and mathematical equations. For example, a perceptron network may be depicted by a nodal description of a single perceptron, and a textual description of the constraints that nodes are arranged in layers of a certain limiting size and number with full connectivity between layers. The back-propagation network as described by Hecht-Nielsen (1990), is depicted by a nodal description of two varieties of nodes (suns and planets), each 129

153 transmitting two types of connection signals (activation and feedback) and a textual description of the constraints that sun nodes are arranged in layers and a single planet modulates the activation and feedback signals between each pair of sun nodes. Modular neural networks are also specified using this standard practice. The primary consequences of this standard practice are that (1) constraints are often unclear; (2) two researchers may read the same specification differently due to differences in their own normal approach to specifying models; (3) different models are often difficult to compare without reformulating the specifications of each model into the same consistent format; and (4) exploration of variations to a model and/or combinations of models is difficult to automate. Research on evolutionary neural network systems has necessitated the development of formal representation methods for the specification of neural network models. A genetic encoding scheme is a formal representation of the model properties that are to be explored by the evolutionary process. The remaining model properties are generally assumed and left out of the representation. A particular genetic encoding contains an abstract representation of a neural network architecture, and may be viewed as a partial specification that is completed through the decoding process. Many encoding techniques, such as most direct and parametric encoding schemes, explicitly represent very limited architectural details; most of the specification is fixed and remaining details are provided by the decoding process. Some schemes, such as GP-encoding (Bengio et al., 1994; Koza and Rice, 1991; Radi and Poli, 1997,1998), grammatical encoding (Jacob and Rehder, 1993), cellular encoding (Gruau, 1995) and, especially, syntactically constrained cellular encoding (Gruau, 1996; Kodjabachian and Meyer, 1997), use context-free 130

154 grammars or mechanisms similar to context-free grammars (CFG-similar mechanisms) to explicitly represent a wide variety of architectural constraints and behavioural details; fewer specification details are provided by the decoding process. For example, a standard GP representation may be used to represent neural connectivity and weights (Koza and Rice, 1991) or neural learning rules (Bengio et al., 1994; Radi and Poli, 1997,1998). A GP representation obeying the closure principle (Koza, 1992), namely that any function or datum may be interchanged, is equivalent to a flat context-free grammar. For example, the GP representation may be reformulated as a CFG in which each datum is a terminal, each function is a non-terminal, and a single additional non-terminal symbol, say <T>, is introduced. Each function, say <F>, is specified by a production that simply identifies the arity of that function using an appropriate number of <T> symbols. The remaining productions map <T> to each terminal and datum. The resulting parse tree will be identical to the standard GP tree, except that every symbol in the latter will be preceded by a <T> in the former. The parsing mechanism used in decoding a GP tree is also context-free, and thus a standard GP representation is exactly equivalent in representational power to a (flat) context-free grammar. Basic cellular encoding is presented as a standard GP representation, and as such is similar to a context free grammar. However, a key distinguishing point of cellular encoding is its decoding mechanism. Generally, the traversal of a cellular gene tree is equivalent to a breadth-first traversal of a parse tree of a context-free grammar; the children of each symbol are placed at the end of the decoding queue and decoding of (most) symbols is context-free. For example, a subtree containing only SEC and PAR 131

155 symbols will always decode to the same, identical neural sub-structure. However, Gruau (1995) incorporates certain program symbols, such as REC, that actually directly influence the traversal of the gene tree. For example, the REC symbol represents a nested expansion of the entire tree, limited only by the value of the life parameter. Two identical parse trees will decode to different neural structures depending upon the value of the initial life parameter. Representation of a neural network does not necessarily imply that the system can generate functional neural networks; in other words, neural networks that may be executed, trained, tested and applied. The system must incorporate a decoding mechanism that accepts the specification provided by the representation and generates a functional network. Research on encoding of neural topology using gp-tree encoding (Koza and Rice, 1991) and grammatical encoding has clearly demonstrated that a family of complex neural network architectures may be represented using a context-free grammar (or equivalent mechanism). Productions of the grammar are used to define rules governing valid topological configurations, including connectivity and weight values, and valid behavioural alternatives among base neurons. A parse tree of the grammar (or equivalent tree), presents a concise partial representation of a single neural network architecture. Information in a given parse tree is decoded to a functioning neural network architecture through the incorporation of assumed knowledge about base neural behaviour and neural processes of the model (i.e., all remaining specifications that are not given in the parse tree). 132
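The reformulation of a closed GP representation as a flat context-free grammar, described above, can be written down mechanically. The sketch below (Python, not from the thesis) does so for a small hypothetical function set (PROGN, ADD, NEG) and hypothetical data (x_i, y_j, w_ji); none of these names are taken from the systems being surveyed.

    def gp_closure_to_cfg(functions, data):
        # Each function becomes a non-terminal whose production fixes its arity with <T> symbols;
        # the generic non-terminal <T> maps to every function and to every datum (terminal).
        productions = []
        for name, arity in functions.items():
            productions.append((f"<{name}>", ["<T>"] * arity))
        for name in functions:
            productions.append(("<T>", [f"<{name}>"]))
        for datum in data:
            productions.append(("<T>", [datum]))
        return productions

    for lhs, rhs in gp_closure_to_cfg({"PROGN": 2, "ADD": 2, "NEG": 1}, ["x_i", "y_j", "w_ji"]):
        print(lhs, "->", " ".join(rhs))

Any parse tree of this grammar corresponds to a standard GP tree in which, as noted above, every GP symbol is preceded by a <T>.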

156 Research on the encoding of learning rules (Bengio et al., 1994; Radi and Poli, 1997,1998) has clearly demonstrated that a family of complex neural learning functions may be represented using a context-free grammar. Productions of the grammar are used to define rules governing the combination of atomic functions and variables into complex compound functions. Information in a given parse tree is decoded to a functioning neural network architecture through the incorporation of assumed knowledge about network topology (which is generally fixed), base neural behaviour and neural processes (i.e., all remaining specifications that are not given in the parse tree) Design Principles for Effective Network Specification Each of these approaches to representing neural network models has merits and drawbacks, and from all of the techniques taken together, five positive design principles for a robust network specification framework are identified: (1) [Clear assumptions] Any representation of neural network models will necessarily make certain assumptions. The clear identification of all assumptions made by a given representation is a necessary basis for a robust specification framework. (2) [Explicitness] The explicit representation of neural characteristics, both topological and behavioural, enables meaningful manipulation of those characteristics within a given neural network model. (3) [Topological variety] The capability for specifying a variety of topological constraints, including modular constraints, enables the representation of a wide family of neural architectures 133

157 (4) [Behavioural variety] The capability for specifying a variety of neural behaviours enables the specification of a wide family of neural architectures. (5) [Consistency] The consistent use of the same representations for the same neural structures and/or behaviours, as well as a common basis for the interpretation of those representations, facilitates comparisons of the similarities and difference between different neural network models. Neural network models embodying most or all of the seven positive design principles for scaling solutions described above will almost certainly encompass large families of highly varied neural network architectures, with numerous different structural and behavioural possibilities and constraints. A representation framework embodying all of the five design principles for effective network specification described above will lead to benefits in the formal design and analysis of complex neural network architectures, and should provide the capability for specifying models that scale well. Further, as suggested by research on genetic encoding of neural networks, such a framework should provide a strong basis for the systematic exploration of new neural network solutions using techniques such as genetic programming. 3.2 Attribute Grammars for the Specification of Neural Networks Attribute grammars (Knuth, 1968) may be used as the basis for specifying highly varied and behaviourally complex families of neural networks. The following sections introduce the Network Generating Attribute Grammar Encoding (NGAGE) system. Within any grammatical system, the choice of the type of grammar used has a significant 134

158 effect upon the potential representation capabilities of the system. For instance, the context-sensitive capabilities of logic grammars, definite clause translation grammars (DCTG) and attribute grammars provide improved potential for the representation of complex languages over those of context-free grammars. However, the choice of the type of grammar in itself does not immediately provide practical solutions for the representation of effective languages within a given domain. Rather, experience with the peculiar requirements of that domain, and the development of appropriate grammar constructs and representation techniques are generally needed before effective languages may be developed. For instance, Wong and Leung (1997) introduce specific techniques for using logic grammars to emulate a variety of genetic programming representations of Koza (1992, 1994) and Ross (2001) introduce specific techniques for using DCTGs to represent guarded stochastic regular expressions. The NGAGE system introduces a number of techniques for the effective representation of a variety of neural structures and behaviours within a single attribute grammar. Those techniques are integrated within a consistent framework to provide a prototype language tool for the specification of neural networks, although it is not intended as a user-interface other than at the research level. Through the use of attribute grammars, NGAGE expands upon the success of neural systems that use context-free grammars as a basis for the abstract representation of neural network models, and addresses the key issue of the limited structural and behavioural complexity that may be currently represented in those systems. NGAGE exhibits a number of important properties, including the capability to represent a wide variety of neural network 135

159 architectures and the capability to provide an effective genetic representation for neural networks that may be manipulated using typed genetic programming techniques Basic Approach The design of an arbitrary context-free grammar is an open-ended problem in that, generally, many different grammars may be used to represent the same family of solutions. However, the most important strength of a CFG-based representation is that units of higher-level structure may be explicitly identified through the symbols of the grammar and the relationships between these units explicitly represented through the productions. Consider the classic problem of representing English sentences. A grammar may be used to simply represent not only the set of possible words, but also the higher-level components of a sentence (e.g., noun, verb, adjective), and how those components combine to form valid sentences (i.e., the phrase structure of the sentence). The key design principle of a CFG is that symbols have roles, and the key design issue is the identification of those roles and the hierarchical relationships between them. A weakness in existing CFG-based representations of neural networks is that the benefits of hierarchical grammatical structure are rarely exploited. In particular, the variety of roles played by the symbols of the grammar is often very limited, and the treatment of those different roles is often uniform. For example, in gp-encoding, all symbols represent topological structures and are treated interchangeably. In cellular encoding, symbols play both topological and compactness roles (e.g., symbols such as REC enable compact symbol trees), but all symbols are treated interchangeably. In syntactically-constrained geometry-oriented cellular encoding (Kodjabachian and Meyer, 136

160 1998a), all symbols represent topological structures but some hierarchical relationships are represented. These limitations arise because it is difficult to incorporate too many roles within a single CFG without elaborate explanation of the consequences of those roles. In other words, including symbols with semantically diverse roles is difficult in a context-free environment. An attribute grammar has two main components: the context-free symbols, with associated context-free productions, and the symbol attributes, with associated attribute evaluation rules. As with a CFG, a key design principle is that symbols have roles. However, the use of attributes adds even more complexity to the design process. The problem of assigning attributes to symbols and attribute evaluation rules to productions is also open-ended. The important benefit of attribute grammars over CFGs is that the semantics of a particular parse tree are explicitly computed within the attributes (i.e., an attributed parse tree provides both syntactic and semantic information). In the standard usage of attribute grammars, the attributes typically serve to identify semantically valid instances of the grammar or to compute the semantic value of a given parse tree. For instance, an attribute grammar that generates mathematical expressions within its context-free component may actually compute the value of that expression within the attributes (Alblas, 1991a). However, the number of possible attribute evaluation rules is unbounded, and the manner in which they may be used can vary significantly. For example, they may impose simple top-down constraints, perform context-sensitive computations, collect valuable information within the attributes or identify semantically invalid trees. Individual attributes likewise have a variety of uses, or semantic purposes, ranging from temporary storage of information for complicated 137

161 context-sensitive computations to the actual meaning of the entire parse tree. Thus, in an attribute grammar, symbols have roles and attributes have purposes. Attribute grammars for the representation of neural networks may be designed with different symbols representing distinct topological and behavioural roles at a variety of levels of detail. For example, a symbol may represent a class of activation functions or a set of related modular topologies. As another example, one symbol may represent the class of all supervised learning rules, and two additional symbols may represent different subsets of this class, such as the class of non-linear supervised learning rules and the class of linear supervised learning rules. The distinction between the roles played by these symbols will be clarified within the attributes of those symbols. Inherited and synthesized attributes may be used for a variety of purposes, including imposing required semantic constraints on the roles of the grammar symbols, identifying co-dependencies between various symbols, and collecting a single semantic specification of the entire parse tree that may be passed to an interpreter. Clear and consistent treatment of symbols/attributes in terms of their role/purpose within the neural network and in terms of their use in the grammar will enable a comprehensive framework for the representation of complex neural network models Network Generating Attribute Grammar Encoding Introduction This section presents a basic introduction to NGAGE. A simple NGAGE grammar is given that illustrates the key ways in which NGAGE may be used and manipulated, including how it may be used as a genetic representation within an evolutionary algorithm. 138

The grammar makes several simplifying assumptions concerning connections and neuron behaviour.

Attribute Grammar Component

Figure 45 presents a simple NGAGE grammar. The figure presents an attribute grammar using a standard notation. The productions and attribute evaluation rules are numbered for easy reference. The grammar has two non-terminal symbols and one terminal symbol, all with suggestive names. The terminal <perceptron> represents a single perceptron node, whose complete specification is stored in the attribute spec. The non-terminal <Layer> represents a layer of unconnected homogeneous perceptron nodes that has a minimum size of one node and no maximum size. The non-terminal <Network> represents a network with two connected layers of perceptrons. The attributes of the non-terminals illustrate how sets may be used to store information within attributes and how set operations may be used to manipulate those attributes. Within the attribute evaluation rule I.i, specifications of nodes from lower levels of a parse tree are collected using a synthesized attribute (i.e., all_nodes) and a set operation (i.e., union, written ∪). Note that in this grammar, all nodes have identical specifications, but for ease of presentation are treated as distinct set elements by the set union operation; this issue will be addressed later. The rule I.ii shows how new connections may be formed using set operations (i.e., cross-product, written ×) that operate only upon attributes of symbols within the rule. The complete network specification is given by the attributes all_nodes and all_connections of the root symbol.
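As an informal illustration of how these set-valued attributes are computed, the following Python sketch mimics rules I.i and I.ii of the grammar of Figure 45 for one particular derivation. The helper names and the stand-in specifications X1 to X7 are illustrative assumptions, not part of the NGAGE implementation.

from itertools import product

def eval_layer(perceptron_specs):
    # Productions II/III: a <Layer> synthesizes the set of its node specifications.
    return set(perceptron_specs)

def eval_network(layer1_nodes, layer2_nodes):
    # Production I: the <Network> root unions the node sets (rule I.i) and takes
    # their cross-product as the connection set (rule I.ii).
    all_nodes = layer1_nodes | layer2_nodes
    all_connections = set(product(layer1_nodes, layer2_nodes))
    return all_nodes, all_connections

# Four perceptrons in the first layer, three in the second; the labels keep
# otherwise identical specifications distinct, as in Figure 46(b).
nodes, conns = eval_network(eval_layer(["X1", "X2", "X3", "X4"]),
                            eval_layer(["X5", "X6", "X7"]))
assert len(conns) == 12   # every first-layer node feeds every second-layer node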

Non-Terminal Symbols: (synthesized attributes) {inherited attributes}
<Network>: (all_nodes, all_connections) {}
<Layer>: (nodes) {}
Terminal Symbols: (synthesized attributes)
<perceptron>: (spec)
Productions and Attribute Evaluation Rules:
I. <Network> → <Layer>1 <Layer>2
i. <Network>.all_nodes = <Layer>1.nodes ∪ <Layer>2.nodes
ii. <Network>.all_connections = <Layer>1.nodes × <Layer>2.nodes
II. <Layer>1 → <perceptron> <Layer>2
i. <Layer>1.nodes = <perceptron>.spec ∪ <Layer>2.nodes
III. <Layer> → <perceptron>
i. <Layer>.nodes = <perceptron>.spec
where
Each connection is a pair of nodes: (source, destination)
<perceptron>.spec = detailed specification of the behaviour of a perceptron neuron
Figure 45: Simple NGAGE Grammar

The sample grammar of Figure 45 illustrates a very simple family of neural network topologies: namely, the space of all two-layer perceptron networks, in which each layer is of arbitrary size (with at least one node), each node in the first layer has an outgoing connection to every node in the second layer, and all nodes are identical.

Specific Neural Network Instance

A given NGAGE grammar represents a family of neural networks. The context-free parse tree generated from the grammar defines a specific neural network architecture. The attributes associated with each symbol may be automatically computed using the attribute evaluation rules to produce an attributed parse tree. Figure 46(a) illustrates the context-free parse tree generated from the grammar of Figure 45.

Figure 46: (a) Sample Parse Tree, (b) Associated Attributed Parse Tree and (c) Associated Neural Network Topology. [Figure: panel (a) shows a context-free parse tree derived from the grammar of Figure 45, with four <perceptron> terminals under the first <Layer> chain and three under the second; panel (b) shows the same tree with evaluated attributes, in which the root's all_nodes attribute collects the seven node specifications X1 to X7 and its all_connections attribute contains the twelve pairs joining each of X1 to X4 with each of X5 to X7; panel (c) shows the resulting two-layer topology.]

165 Application of the attribute evaluation rules produces the attributed parse tree of Figure 46(b), where X refers abstractly to the value of <perceptron>.spec. The terminal symbols in the attributed parse tree are numbered for ease of readability, and these numbers are used as subscript X i to visually distinguish between the multiple identical X values. Note the use of concise roles for each symbol. The terminal symbol <perceptron> represents the complete specification of a perceptron node; the <Layer> symbol collects the specification of all the nodes of the layer within its attributes; and the <Network> symbol collects the specification of all the nodes and connections within the network. The specification collected in the root symbol produces the network topology illustrated in Figure 46(c), in which nodes are numbered to correspond to the terminal symbol numbering in Figure 46(b) Genetic Manipulation of NGAGE Representations The context-free parse tree generated from an NGAGE grammar is ideal for use as a genetic representation of neural networks. Figure 47 illustrates how the genetic operator of typed subtree crossover may be applied to parse trees generated from the grammar of Figure 45. In Figure 47, the original parse trees (a) and (b) are legal parse trees according to the grammar. A crossover point selected from each tree is indicated with a dotted circle, and the symbols selected are identical (i.e., have the same type). The result of crossing the subtrees below the crossover points produces the new parse trees (c) and (d), both of which are legal parse trees according to the grammar. 142
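The following sketch shows the essence of typed subtree crossover on such parse trees. The tree representation and function names are illustrative assumptions rather than the thesis implementation, but the type constraint is the one described above: subtrees are exchanged only between nodes labelled with the same grammar symbol, so both offspring remain legal parse trees.

import random

class ParseNode:
    def __init__(self, symbol, children=None):
        self.symbol = symbol
        self.children = children if children is not None else []

def all_nodes(tree):
    # Depth-first enumeration of every node in the tree.
    yield tree
    for child in tree.children:
        yield from all_nodes(child)

def typed_subtree_crossover(parent_a, parent_b, rng=random):
    # Choose a crossover point in parent_a, then restrict the choice in parent_b
    # to nodes carrying the same symbol; because the two points have identical
    # type, exchanging the material below them is equivalent to exchanging the
    # subtrees rooted at them.
    point_a = rng.choice(list(all_nodes(parent_a)))
    compatible = [n for n in all_nodes(parent_b) if n.symbol == point_a.symbol]
    if not compatible:
        return parent_a, parent_b          # no type-compatible point found
    point_b = rng.choice(compatible)
    point_a.children, point_b.children = point_b.children, point_a.children
    return parent_a, parent_b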

Figure 47: Typed Subtree Crossover using Context-Free Parse Trees of NGAGE. [Figure: parse trees (a) and (b) are legal derivations of the grammar of Figure 45; a crossover point of the same symbol type is circled in each, and exchanging the subtrees below those points yields the new legal parse trees (c) and (d).]

Design Methodology

The NGAGE system addresses the development of a consistent technique for the design of attribute grammar representations of neural networks (i.e., it is not, by contrast, concerned with the development of a single, specific attribute grammar or set of specific attribute grammars). As such, several levels of design issues must be considered. At the

167 most basic level, the grammars formed using the technique must be valid (i.e., obey the limitations upon attribute naming and the use of attributes within attribute evaluation rules) and well-formed (i.e., non-circular). NGAGE grammars must also satisfy a variety of principles concerning the design of effective, scalable neural network architectures and the design of useful representations of such architectures. For example, an important principle is that different roles/purposes must be played by the symbols/attributes. However, as new roles/purposes are introduced, any new assumptions must be clearly identified and the underlying neural models that are used must be clearly defined. To address these many design issues, a number of design practices are introduced. These practices are followed when creating NGAGE representations. Individually, these practices provide a basis for the representation of specific neural structures and behaviours within a grammar. Collectively, these provide a consistent framework for comprehensive representation of numerous complex neural topologies and behaviours within a grammar. It is important to note that each design practice is not the only possible solution, but provides the benefit of narrowing the open-ended nature of grammar design. No claim is made as to broader applicability of these design practices to problems other than the representation and manipulation of neural network architectures. NGAGE uses the traditional attribute grammar of Knuth (1968) and a standard attribute evaluator (Bochmann, 1976; Knuth, 1968), and does not address the efficiency or limitations of those techniques (These are key research issues in the computational theory field of attribute grammars). 144
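As a rough illustration of the well-formedness requirement, the sketch below checks that the attribute dependencies induced on a particular parse tree contain no cycle, so that some evaluation order exists. This per-tree check is a simplification introduced here for illustration and should not be confused with Knuth's full static circularity test, which analyses all possible parse trees of the grammar; the attribute-instance labels are hypothetical.

def is_non_circular(dependencies):
    # dependencies: dict mapping an attribute instance, e.g. ("<Layer>@2", "nodes"),
    # to the attribute instances read by its evaluation rule. Returns True when the
    # induced graph has no cycle, i.e. when some attribute evaluation order exists.
    GREY, BLACK = 1, 2
    state = {}

    def visit(attr):
        state[attr] = GREY
        for dep in dependencies.get(attr, ()):
            if state.get(dep) == GREY:     # back edge: circular dependency
                return False
            if state.get(dep) is None and not visit(dep):
                return False
        state[attr] = BLACK
        return True

    return all(state.get(attr) == BLACK or visit(attr) for attr in dependencies)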

168 3.2.4 Basic Design Practices To illustrate the use of design practices in NGAGE, consider again the grammar from Figure 45. In that example, the basic form of the grammar is defined through adherence to the following fundamental design practices. Practice 1 Assign meaningful roles to symbols and purposes to attributes and reflect these roles/purposes in their names One of the primary design principles of NGAGE is the assigning of roles to different symbols and purposes to different attributes. To reinforce this principle, as well as to enhance the readability of the grammar, the names given to symbols and attributes will, as much as possible, reflect those roles and purposes. For example, the symbol <Layer> is used to refer to a layer of nodes, and is named in a manner suggestive of this role; the attribute <Layer>.nodes is used to refer to all the nodes within a layer, and is named in a manner suggestive of this purpose. Practice 2 Maintain consistent naming of attributes Within an attribute grammar, all attributes of a given symbol are completely independent of the attributes of another symbol, except for explicitly defined relationships. Thus, two attributes with the same name in different symbols are, in principle, completely independent. Thus, for example, one may be a synthesized attribute and one may be an inherited attribute; or they may refer to highly different classes of neural components. However, in terms of readability, it is highly 145

169 advantageous to use attributes with the same name in the same way, even across multiple symbols. For example, consider a non-terminal symbol <Instar-Layer> that is added to the grammar of Figure 45, refers to a layer of instar nodes, and has an attribute <Instar-Layer>.all_nodes. In keeping with the attribute naming of the <Layer> symbol, that attribute should represent a synthesized attribute that refers to all nodes within that layer. Practice 3 Neural network specification is stored within the attributed parse tree The benefits of attribute grammars over context-free grammars are realized only through the values of the attributes that are automatically computed while evaluating a given parse tree. A key purpose of some of the attributes will be to store a portion of the complete, final specifications of the neural network. These specifications will be directly extracted from those attributes by the decoder. Practice 4 Complete network specification is computed within attributes of root symbol Within an attributed parse tree, all nodes of the tree could potentially contain final specifications in their attributes. However, in NGAGE the complete specification of the neural network will be collected within the attributes of the root symbol. This enhances readability by providing a human with a concise, easily identified summary of the meaning of the tree. It also simplifies the decoding process since the decoder need only inspect a single point of the tree to extract the 146

170 network specification. The approach is feasible since information may flow both up and down a given parse tree during attribute evaluation. Any final specifications may be easily synthesized up to the root symbol. For example, a basic network may be specified in terms of a set of nodes and a set of connections among those nodes. The root symbol, say <Network>, may contain two synthesized attributes <Network>.all_nodes and <Network>.all_connections that, together, concisely specify the entire neural network. Practice 5 Use sets and set operations Details that specify the topology and behaviour of the network may be stored within the attributes as sets. Attribute evaluation rules may manipulate these sets as appropriate using set operations. A key limitation of any attribute evaluation rule is that it may not incorporate information that is not directly (and deterministically) derived from the attributes of the symbols within the rule. Thus, no random set operations are used. Practice 6 Where appropriate to improve readability, represent complex operations within an attribute evaluation rule as a single helper operation An attribute evaluation rule may be any arbitrary algorithm, encapsulated to the attributes within the production. To enhance legibility of the rules, complex deterministic algorithms can be represented as simple function calls. This enables the same operation to be used multiple times, but the description of the algorithms 147

171 to be given only once. These deterministic helper operations may be named in a manner suggestive of their purpose, and the same operation may be used consistently within the grammar to refer to a single algorithm. However, in the grammar of Figure 45, several characteristics are specified very roughly, and as a result, ambiguously. For instance, the treatment of terminal symbols is basic - what exactly is stored within the <perceptron>.spec attribute? If all the nodes are identical (i.e., recall that the numbering of the nodes was a convenience of presentation, and not inherent in the representation), how are they distinguished in the sets <Layer>.nodes and <Network>.all_nodes? How do the node-pairs created in the attribute <Network>.all_connections relate to the nodes in <Network>.all_nodes? Further, the networks formed by the grammar are very simple and the assumptions underlying the network are unclear. How can a network with multiple nodes types be represented? Is a node-pair sufficient representation for a connection? Is there any learning in the network, and if so, how does it work? To resolve these ambiguities and enable the representation of more complex neural networks, a clear definition of the underlying neural models is needed and additional design practices must be followed Hypotheses The Network Generating Attribute Grammar Encoding (NGAGE) system is proposed to satisfy seven key hypotheses. For each, a focus is placed upon both the technique and the design practice(s) that are used in the proposed solution. Motivations for these hypotheses are clarified where appropriate. 148

172 Hypothesis 1: NGAGE may be used to explicitly specify the topology and behaviour of the neural network architectures that comprise a neural network model. The key motivation of this hypothesis is the issue that current neural network specification techniques represent only a small variety of neural characteristics and make significant assumptions concerning the remaining neural characteristics that comprise the complete model (i.e., representation design principles 1, 2, 3 and 4 above). Further, many representation techniques implicitly specify the rules that must be followed for decoding a particular representation. NGAGE should be capable of specifying a wider variety of neural characteristics, within a single representation, than previous techniques and thereby require far fewer assumptions than previous representations (e.g., context-free grammar based systems). Hypothesis 2: Functional neural networks capable of learning may be generated from an NGAGE representation. The key motivation of this hypothesis is that the flexibility of any given representation technique is directly limited by the flexibility of the interpretation mechanism that is used to derive functional neural networks (i.e., networks that may be executed, trained and tested) from specific representation (i.e., representation design principle 5 above). In practice, previous systems generally follow the approach of designing an idiosyncratic interpreter for each representation. For example, a direct encoding that represents the topology of a back-propagation network (e.g., Kitano, 1990) requires an interpreter that can 149

decode the matrix of weights into a back-propagation topology and provide all the back-propagation behaviours. An attribute grammar that specifies behavioural as well as topological characteristics requires an interpreter that creates functional networks that may vary greatly in their behaviours. Thus, NGAGE requires an interpreter that is highly generic.

Hypothesis 3: NGAGE may explicitly encode neural network modules with varied structure and behaviour.

This hypothesis addresses the key scaling issue that modularity is an important factor in the effectiveness of neural solutions (i.e., scaling design principles 2, 3, 6 and 7 above). Explicit representation of modular topology and behaviour is a necessary step in achieving effective modular solutions. Existing network models and network specification techniques are limited in their treatment of modularity. Most modular models impose topological constraints that enable only one or two of the many possible varieties of task decomposition, namely redundant, input and output partitioning, data space, successive, parallel and hierarchical decompositions. However, as demonstrated by cellular encoding systems, it is possible to enable a single network to incorporate almost all of these decompositions, as appropriate. No previous technique allows for modules that vary arbitrarily in behaviour. NGAGE should surpass the modular capabilities of most previous systems, and in particular cellular encoding. An NGAGE grammar should be capable of

constraining topology and behaviour in a modular fashion to enable networks with heterogeneous modules that perform a wide variety of task decompositions.

Hypothesis 4: Existing neural network architectures may be represented within an NGAGE system.

The key motivation of this hypothesis is that previous neural network specification techniques are highly rigid and are capable of representing only a limited range of neural network architectures. In particular, they either represent simple variations within a single existing model (e.g., Kitano's direct encoding uses back-propagation), or define an idiosyncratic model that does not correspond to any of the popular models (e.g., Gruau's cellular encoding). Extending a given technique to a new model generally requires a change to the underlying interpreter. NGAGE should be capable of representing multiple existing models without requiring any changes to the underlying interpreter.

Hypothesis 5: The class of neural networks represented by a given NGAGE representation may be automatically explored using genetic search.

The key motivation of this hypothesis is to incorporate the benefits of hybrid, adaptive systems by building upon the success of grammar-based genetic encoding of neural networks (i.e., scaling design principles 1, 4, 5, 6 and 7 above). The space of parse trees that are generated by an NGAGE grammar should naturally and effectively be explored using genetic search.

175 Hypothesis 6: NGAGE enables the integration of multiple models and facilitates the systematic exploration of variations to a model. The key motivation of this hypothesis is that NGAGE, in addition to its usefulness in representing complex neural network models, is also valuable as a tool for the development of new models. There currently exists no comprehensive framework for the systematic development of new neural network models. Most new models are developed ad-hoc, and comparisons among models are often difficult due to the limitations of previous specification techniques. Through the capabilities demonstrated in the first five hypotheses, NGAGE should be capable of representing multiple novel models without requiring any changes to the underlying interpreter. Hypothesis 7: NGAGE may be used to evolve solutions to many classes and sizes of problems. The key motivation of this hypothesis is to stress the flexibility and applicability of the NGAGE representation framework. Most existing neural networks are limited to certain types of problems due to the limited neural behaviours and task decompositions provided by each network. Most network models are also limited to problems of certain sizes. By increasing the variety of neural behaviours, task decompositions and internal structures that may be represented, NGAGE should be applicable to a wider variety of problems than current representation techniques. 152

176 Chapter 4 Attribute Grammar Encoding of Neural Network Topology In this chapter, techniques for the representation of neural network topology within NGAGE are presented. The complexity of the neural topologies increases throughout the chapter, from simple feed-forward topologies to complex modular topologies with multiple types of signals, such as activation and feedback signals. The treatment of neural network topology in this chapter will make certain simplifying assumptions. In particular, the treatment of the internal functionality of neurons is simplified; neurons are primarily represented as atomic terminal symbols. 4.1 Neural Foundations To enable flexible representation in NGAGE, consistent underlying neural foundations are assumed. The foundations that NGAGE uses in its representation of neural topology are summarized below. These will be extended appropriately as the treatment of topological and behavioural characteristics is expanded. 153

177 4.1.1 Timing Model Only networks that follow a discrete time-step processing model of behaviour are considered. The specific model followed uses three levels of timing. Iteration timing determines how long a given input pattern is made available to the network s input sources before the output of the network is extracted. Transmission timing determines the frequency with which a signal may be transmitted across a network connection. Processing timing determines the amount of time required for a single node to process its inputs and provide its output. In NGAGE, all timing is based upon a time step of consistent size. Every node is considered to perform its internal processing in less than one time step. Every connection is capable of transmitting at most one signal per time step. Each iteration is assumed to consist of an arbitrary number of time steps. All operations in a neural network are synchronized by time step. All operations that occur within an iteration are determined by the model. All operations that occur between iterations are considered implementation details that are outside the model Basic Neuron Model In the NGAGE system, a neuron is defined as a processing element that may receive signals of multiple types and may transmit signals of multiple types (see Figure 48). This is in contrast to the typical neuron model in which there is only one signal type (i.e., activation). All signals of all types are real values in the range [-1,1]. A neural network architecture is defined as a purely localized structure. The domain of a neuron is defined as all information that is stored in its internal memory and all information that a 154

neuron receives as input signals during a given time step. Any computations that occur in the network must occur in the neurons, and the computations performed in any given neuron must be based solely upon information local to the domain of that neuron. Thus, for instance, all learning computations and weight updates must be performed locally within neurons based solely upon memory values and signals received.

Figure 48: Basic NGAGE Neuron Model. [Figure: a neuron's domain comprises its internal memory (e.g., weights) and multiple, arbitrary internal functions; it may receive incoming signals of multiple types, transmit outgoing signals of multiple types, and may perform multiple actions per iteration dependent upon the specific input signal types received.]

Connection Model

In NGAGE, a connection is defined as a directed link that (1) connects a single node or input source of the network to a single node (possibly the same node) or output source of the network, such that (2) the link is capable of carrying signals of a single, fixed, specific type; (3) the source node must be capable of transmitting and the destination node must be capable of receiving a signal of that type; (4) the source node of the link may place exactly one signal, of the appropriate type, onto the connection in a single time step; (5) in each time step at most one signal may be delivered to the destination node; (6) the link has a fixed delay value that is an integer greater than 0 and

that represents the number of time steps between the time a signal is placed on the link and the time it is delivered; and (7) the signal value that is finally delivered is identical to the signal value that was placed on the link.

4.2 Representation of Network Topology

The connection model described above permits the transmission of multiple signal types with varying delays. To facilitate the introduction of attribute grammar representations of neural network topology, the examples in this section primarily assume simplified connections with only one signal type and no delays. Details of neuron functionality are not given. The primary design practices used within NGAGE for the representation of simple network topologies are first presented, and additional, advanced practices that exploit unique properties of attribute grammars for the representation of complex network topologies are then presented. The NGAGE capabilities resulting from the primary and advanced practices are discussed and compared to previous network representation techniques.

Primary Design Practices

Practice 7 Every node created within the grammar has a unique identity value.

In order to enable references to nodes within multiple attributes of a symbol (e.g., the specification of connections), it must be possible to uniquely identify and refer to each node. Ideally, this may be done explicitly within the

attributes. For example, an inherited attribute id of the left-hand symbol may store a unique identity value (e.g., as a string). A unique id may be deterministically created for all right-hand symbols (e.g., through string operations such as concatenation) and passed down a parse tree. The identity value may be used as the basis for uniquely identifying all new nodes created within the production.

Non-Terminal Symbols:
<Network>: (all_nodes, all_connections) {}
<Layer>: (nodes) {id}
Terminal Symbols:
<perceptron>: (spec)
Helper Operations: get_id
Productions and Attribute Evaluation Rules:
I. <Network> → <Layer>1 <Layer>2
i. <Network>.all_nodes = <Layer>1.nodes ∪ <Layer>2.nodes
ii. <Network>.all_connections = get_id(<Layer>1.nodes) × get_id(<Layer>2.nodes)
a. <Layer>1.id = "1.1"
b. <Layer>2.id = "1.2"
II. <Layer>1 → <perceptron> <Layer>2
i. <Layer>1.nodes = <Layer>2.nodes ∪ [concatenate(<Layer>1.id, ".1"), <perceptron>.spec]
a. <Layer>2.id = concatenate(<Layer>1.id, ".2")
III. <Layer> → <perceptron>
i. <Layer>.nodes = [concatenate(<Layer>.id, ".1"), <perceptron>.spec]
where
get_id(A): A is a set of (id, *) pairs. The function returns a set containing all the id values only, in the same order as in the original set.
Figure 49: NGAGE Grammar Illustrating the Representation of Identity Values

Representation of identity values is illustrated in Figure 49. A helper operation, get_id, is used as well as the set operations of cross-product and union and the string operation of concatenation. The attribute evaluation rules I.a and I.b demonstrate how inherited attributes may be used to assign unique identity values to each right-hand symbol as strings. Rule II.ii demonstrates how the concatenation operator may be used to assign a unique identity value to the right-

hand non-terminal symbol. Rule II.i demonstrates how a unique identity value may be assigned to the node represented by the right-hand terminal, and how a node may be stored as an (id, spec) pair. This allows proper synchronization between the connections in <Network>.all_connections and the nodes in <Network>.all_nodes. An attributed parse tree generated from this grammar is illustrated in Figure 50. The parse tree corresponds to that of Figure 46. Unlike in Figure 46(b), the identical node specifications X are explicitly distinguished by identity strings rather than by an ad-hoc visualization aid (namely the use of subscripts).

Alternative schemes for determining unique identity values are possible. For example, all nodes could be numbered sequentially (i.e., 1, 2, 3, ...) through the combined use of inherited and synthesized attributes. This would produce more readable parse trees, but would make certain attribute manipulations more difficult (e.g., the clone operation of Practice 9 below). Further, depending upon the implementation, it may be easier to assign unique identity values based upon object references that are automatically created by the language (e.g., as in Java). This avoids the need for rigorous treatment of identity values within the attributes, but is an implicit specification that violates the encapsulation property of attribute grammar productions.
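A small sketch of the identity scheme of Figure 49 may help fix ideas. The Python helpers below are illustrative assumptions, not the NGAGE implementation; they simply follow rules I, II and III, extending the inherited id by ".1" for the node introduced at each <Layer> and by ".2" for the id passed to the next <Layer>.

from itertools import product

def get_id(pairs):
    # Return just the identity values of a collection of (id, spec) pairs.
    return [identity for identity, _ in pairs]

def eval_layer(depth, inherited_id, spec="X"):
    # Productions II/III: a chain of <Layer> symbols of the given depth, each
    # contributing one node identified by its own id extended with ".1".
    nodes = []
    current_id = inherited_id
    for _ in range(depth):
        nodes.append((current_id + ".1", spec))   # node introduced by this <Layer>
        current_id = current_id + ".2"            # id inherited by the next <Layer>
    return nodes

def eval_network(depth1, depth2):
    # Production I: the two layers inherit ids "1.1" and "1.2" (rules I.a, I.b);
    # connections pair ids of the first layer with ids of the second (rule I.ii).
    layer1, layer2 = eval_layer(depth1, "1.1"), eval_layer(depth2, "1.2")
    all_nodes = layer1 + layer2
    all_connections = list(product(get_id(layer1), get_id(layer2)))
    return all_nodes, all_connections

nodes, conns = eval_network(4, 3)
# yields ids "1.1.1", "1.1.2.1", ..., "1.1.2.2.2.1", "1.2.1", ... and 12 connections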

Figure 50: Attributed Parse Tree with Identity Attribute Values. [Figure: the parse tree of Figure 46 is shown with attributes evaluated under the grammar of Figure 49; each <perceptron> node is stored as an (id, X) pair whose hierarchical identity string (e.g., "1.1.1", "1.2.1") is built by concatenation down the tree, and the root's all_connections attribute pairs the identity strings of the first layer's nodes with those of the second layer's nodes.]

Practice 8 Use distinct attributes to explicitly distinguish among similar neural components that may be treated differently in different contexts

In a complex neural network topology, it is often the case that similar neural components are treated differently in the context of other components. For example, only certain nodes in one topological structure may connect to the nodes in another topological structure. The approach followed in NGAGE is to explicitly differentiate such neural components using distinct attributes.
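Before turning to the grammar of Figure 51, the following sketch illustrates the intent of this practice in isolation. The class and method names are assumptions introduced for illustration, but the behaviour mirrors production II below: a newly appended layer is wired only to the distinguished output_nodes subset, which is then updated to the new layer.

from itertools import product

class MultiLayer:
    def __init__(self, first_layer_ids):
        # Production III: a single layer; every node is both an internal node
        # and an output node of the structure so far.
        self.all_nodes = list(first_layer_ids)
        self.output_nodes = list(first_layer_ids)
        self.all_connections = []

    def append_layer(self, new_layer_ids):
        # Production II: connect the previous output layer fully to the new layer,
        # then the new layer becomes the output layer (rules II.i to II.iii).
        self.all_connections += list(product(self.output_nodes, new_layer_ids))
        self.all_nodes += list(new_layer_ids)
        self.output_nodes = list(new_layer_ids)
        return self

net = MultiLayer(["1.1", "1.2"]).append_layer(["2.1", "2.2", "2.3"]).append_layer(["3.1"])
# 2*3 + 3*1 = 9 connections in total, while output_nodes is now just ["3.1"]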

183 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Multi-Layer>: (all_nodes, output_nodes, all_connections) {id} <Layer>: (nodes) {id} Terminal Symbols: <perceptron>: (spec) Helper Operations: get_id Productions and Attribute Evaluation Rules: I. <Network> <Multi-Layer> i <Network>.all_nodes = <Multi-Layer>.all_nodes ii <Network>.all_connections = <Multi-Layer>.all_connections a <Multi-Layer>.id = 1.1 II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> i <Multi-Layer> 1.all_nodes = <Multi-Layer> 2.all_nodes 4 <Layer>.nodes ii <Multi-Layer> 1.all_connections = <Multi-Layer> 2.all_connections 4 get_id(<multi-layer> 2.output_nodes) % get_id(<layer>.nodes) iii <Multi-Layer> 1.output_nodes = <Layer>.nodes a <Multi-Layer> 2. id = concatenate(<multi-layer> 1.id,.1 ) b <Layer>.id = concatenate(<multi-layer> 1.id,.2 ) III. <Multi-Layer> <Layer> i <Multi-Layer>.all_nodes = <Layer>.nodes ii <Multi-Layer>.all_connections = {} iii <Multi-Layer>.output_nodes = <Layer>.nodes a <Layer>.id = concatenate(<multi-layer>.id,.1 ) IV. <Layer> 1 <perceptron> <Layer> 2 i <Layer> 1.nodes = <Layer> 2.nodes 4 [concatenate(<layer> 1.id,.1 ), <perceptron>.spec] a <Layer> 2.id = concatenate(<layer> 1.id,.2 ) V. <Layer> <perceptron> i <Layer>.nodes = [concatenate(<layer>.id,.1 ), <perceptron>.spec] Figure 51: NGAGE Grammar Illustrating Distinct Sets of Similar Neural Structures Figure 51 illustrates a grammar that generates networks with an arbitrary number of layers. The symbol <Multi-Layer> represents a set of sequentially connected layers of nodes. The attribute <Multi-Layer>.output_nodes maintains a subset of all the nodes in <Multi-Layer>.all_nodes that represents the last layer in that multi-layered structure. This enables rule II.ii to use a straightforward set operation (%) to add a new layer of nodes to the multi-layer structure by 160

connecting the former to only the output layer of nodes of the latter. The topological effects of production II are illustrated in Figure 52.

Figure 52: Neural Topology Arising from Distinct Representation of Sub-Structures. [Figure: the nodes in <Multi-Layer>2.all_nodes and the connections in <Multi-Layer>2.all_connections are shown with the subset <Multi-Layer>2.output_nodes highlighted; the new connections added by get_id(<Multi-Layer>2.output_nodes) × get_id(<Layer>.nodes) join only that output layer to the nodes in <Layer>.nodes.]

The use of distinct attributes to distinguish among neural components that are treated differently is consistent with the principle of assigning clear roles to attributes. The approach provides useful functionality and a clear specification. Multiple attributes may, as a result, refer to the same structures, and care must be taken to maintain consistency across these different attributes (e.g., consistent identity references).

For argument's sake, consider an alternative approach to this problem. It is possible to use set operations that inspect the internals of sets, and use the values of the set members to identify a particular subset that is to be treated differently. For example, each node could be stored with a value indicating, say, node type, and the attribute evaluation rules could be complex algorithms that treated each node differently according to its type. This would obviate the need for storing the same node in multiple attributes. However, in the general case, this approach requires the rule to understand the meaning of the set members (e.g., all the possible variations in values). As such, it limits the flexibility of the grammar design (e.g., each time a new manipulation is desired, a new type must be created

185 and the algorithms changed) as well as the readability of the grammar since the role of each attribute becomes unclear. Practice 9 Neural structures may be replicated within the attributes using set operations A neural structure may easily be replicated using set operations upon the corresponding sets of nodes and sets of connections among those nodes. Recalling Practice 7, nodes are distinguished by unique identifiers and each reference to that node uses that identifier. An operation to replicate a set of nodes must methodically replicate each node in the set and assign a distinct, unique identifier to each copy. An operation to replicate both nodes and the connections among them must also replicate each connection and replace the original identifiers with the same new unique identifiers of the corresponding replicated nodes. Figure 53 illustrates a grammar that augments the grammar of Figure 51. The helper operation set_concatenate is used to copy a set and replace the identity values with new unique values. Uniqueness is ensured through the use of a prefix based upon the identity value of the left-hand symbol and an affix (i.e.,.2 ) that is distinct from the affix used to determine the inherited identity value of the righthand symbol (i.e.,.1 ). The topological effects of production VI are illustrated in Figure

186 Grammar fragment augmenting grammar of Figure 51. Terminal Symbols: duplicate ( ) Productions and Attribute Evaluation Rules: VI. <Multi-Layer> 1 duplicate ( <Multi-Layer> 2 ) i <Multi-Layer> 1.all_nodes = <Multi-Layer> 2.all_nodes 4 set_concatenate(<multi-layer> 1.id,.2., <Multi-Layer> 2.all_nodes, 1) ii <Multi-Layer> 1.all_connections = <Multi-Layer> 2.all_connections 4 set_concatenate(<multi-layer> 1.id,.2., set_concatenate (<Multi- Layer> 1.id,.2., <Multi-Layer> 2.all_connections, 1), 2) iii <Multi-Layer> 1.output_nodes = <Multi-Layer> 2.output_nodes 4 set_concatenate(<multi-layer> 1.id,.2., <Multi-Layer> 2.output_nodes, 1) a <Multi-Layer> 2. id = concatenate(<multi-layer> 1.id,.1 ) where set_concatenate(a,b,c,d): C is a set of n-tuples, with the d th element being an identity string value (e.g., if n=2, d = 1, then C is a set of (id,*) pairs; or if n=2, d = 2, then C is a set of (*,id) pairs). Function replaces d th element, id, of every tuple of C with concatenate(a,b,id). Figure 53: NGAGE Grammar Illustrating Replication of Neural Structure using Set Operations <Multi-Layer> 2.id = <Multi-Layer> 2.output_nodes <Multi-Layer> 2.all_connections = {( , ), } (a) <Multi-Layer> 1.id = 1.1 <Multi-Layer> 1.output_nodes <Multi-Layer> 1.all_connections = {( , ), ( , ), } (b) Figure 54: Topology Replication Practice 10 Explicitly represent the input sources and output targets of the network. All neural networks operate within a larger environment. Input patterns are presented to the network, and output patterns are returned. Communication 163

with the external environment may be regarded as occurring through input ports and output ports. It is important to explicitly identify how many ports of each type are available and indicate which internal nodes of the network receive signals from which input ports, and which internal nodes of the network send signals to which output ports. For example, ports may be represented as terminal symbols and manipulated within the attributes of the grammar. For enhanced readability, connections between neurons and ports are specified primarily within the top-level productions that expand the start symbol of the grammar.

Figure 55 illustrates how the network's interaction with the external environment may be explicitly specified. To simplify presentation, calls to get_id are avoided through the use of helper operations, direct_connect_simple and full_connect_simple, that extract the node identity values accordingly when creating connections. Expanding upon the earlier example of distinguishing the output nodes of a <Multi-Layer> structure, rules II.iii and III.iii propagate the input nodes of the <Multi-Layer> upwards. Production I explicitly includes non-terminal symbols for the input and output port layers, and specifies the size of those layers using inherited attributes. For clarity, the number of input and output ports is also reflected syntactically, with (<Input-Port-Layer>, 3) indicating that there are 3 input ports. Within productions VI and VII, a helper operation, replicate, is used to create a layer of multiple ports, each with a unique identity value.
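Since the helper operations carry much of the topological work in Figure 55, the following hedged Python sketches of direct_connect_simple, full_connect_simple, set_concatenate and replicate follow their verbal definitions in Figures 53 and 55; the argument names and the concrete data in the example are illustrative only, and the thesis treats these operations simply as deterministic black boxes inside attribute evaluation rules.

from itertools import product

def full_connect_simple(a, b):
    # Connect every node of A to every node of B; A and B are sets of (id, spec) pairs.
    return [(ai, bj) for (ai, _), (bj, _) in product(a, b)]

def direct_connect_simple(a, b):
    # Connect A and B one-to-one, up to the size of the smaller set.
    return [(ai, bj) for (ai, _), (bj, _) in zip(a, b)]

def set_concatenate(prefix, affix, tuples, position):
    # Replace the identity string at the given (1-based) position of every tuple
    # with prefix + affix + id, yielding a uniquely renamed copy of the structure.
    renamed = []
    for t in tuples:
        t = list(t)
        t[position - 1] = prefix + affix + t[position - 1]
        renamed.append(tuple(t))
    return renamed

def replicate(base, num, prefix, element_set, position):
    # Make num renamed copies of element_set, numbered base .. base+num-1;
    # used in productions VI and VII to build a layer of distinct ports.
    copies = []
    for k in range(base, base + num):
        copies += set_concatenate(prefix, "." + str(k) + ".", element_set, position)
    return copies

ports = replicate(1, 3, "1.1", [("1", "in-port spec")], 1)
# produces port ids "1.1.1.1", "1.1.2.1" and "1.1.3.1" for the three input ports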

188 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Multi-Layer>: (all_nodes, input_nodes, output_nodes, all_connections) {id} <Layer>: (nodes) {id} <Input-Port-Layer>: (all_ports) {id, size} <Output-Port-Layer>: (all_ports) {id, size} Terminal Symbols: <perceptron> <in-port> <out-port> fill 4 5 (, ) Helper Operations: direct_connect_simple, full_connect_simple, replicate Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, 3) <Multi-Layer> (<Output-Port-Layer>, 4) i <Network>.all_nodes = <Multi-Layer>.all_nodes 4 <Input-Port-Layer>.all_ports 4 <Output-Port-Layer>.all_ports ii <Network>.all_connections = <Multi-Layer>.all_connections 4 full_connect_simple(<input-port-layer>.all_ports, <Multi-Layer>.input_nodes) 4 direct_connect_simple(<multi-layer>.output_nodes, <Output-Port-Layer>.all_ports) a <Input-Port-Layer>.id = 1.1 b <Input-Port-Layer>.size = 3 c <Multi-Layer>.id = 1.2 d <Output-Port-Layer>.id = 1.3 e <Output-Port-Layer>.size = 4 II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> i <Multi-Layer> 1.all_nodes = <Multi-Layer> 2.all_nodes 4 <Layer>.nodes ii <Multi-Layer> 1.all_connections = <Multi-Layer> 2.all_connections 4 full_connect_simple(<multi-layer> 2.output_nodes, <Layer>.nodes) iii <Multi-Layer> 1.input_nodes = <Multi-Layer> 2.input_nodes iv <Multi-Layer> 1.output_nodes = <Layer>.nodes a <Multi-Layer> 2.id = concatenate(<multi-layer> 1.id,.1 ) b <Layer>.id = concatenate(<multi-layer> 1.id,.2 ) III. <Multi-Layer> <Layer> i <Multi-Layer>.all_nodes = <Layer>.nodes ii <Multi-Layer>.all_connections = {} iii <Multi-Layer>.input_nodes = <Layer>.nodes iv <Multi-Layer>.output_nodes = <Layer>.nodes a <Layer>.id = concatenate(<multi-layer>.id,.1 ) IV. <Layer> 1 <perceptron> <Layer> 2 i <Layer> 1.nodes = <Layer> 2.nodes 4 [concatenate(<layer> 1.id,.1 ),<perceptron>.spec] a <Layer> 2.id = concatenate(<layer> 1.id,.2 ) V. <Layer> <perceptron> i <Layer>.nodes = [concatenate(<layer>.id,.1 ), <perceptron>.spec] VI. <Input-Port-Layer> fill ( <in-port> ) i <Input-Port-Layer>.all_ports = replicate(1, <Input-Port-Layer>.size, <Input-Port-Layer>.id, [ 1, <in-port>.spec], 1) VII. <Output-Port-Layer> fill ( <out-port> ) i <Output-Port-Layer>.all_ports = replicate(1, <Output-Port-Layer>.size, <Output-Port- Layer>.id, [ 1, <out-port>.spec], 1) continued

189 where <in_port>.spec = detailed specification of an input port <out_port>.spec = detailed specification of an output port...continuation direct_connect_simple(a,b): Given two sets of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) }and B = {(b 1,*), (b 2,*) (b m,*)}, returns the set of connections { (a 1, b 1 ), (a 2, b 2 ), (a p, b p ) } where p is the smaller of m, n full_connect_simple(a,b): Given two sets of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) }and B = {(b 1,*), (b 2,*) (b m,*)}, returns set of connections [(a i, b j )] for all a i A, b j B replicate(base, num, a, C,d): Makes num calls to set_concatenate(a,b,c,d), where b is created by looping through the value range [base,base+num), and adding a. on each end. Result is a new set containing num duplications of the set C, with unique values. Figure 55: NGAGE Grammar Illustrating Explicit Representation of External Environment id values produced by replicate operation <Output-Port-Layer>.all_ports <Multi-Layer>.output_nodes new connections added by: direct_connect_simple(<multi-layer>.output_nodes, <Output-Port-Layer>.all_ports) <Multi-Layer>.input_nodes <Input-Port-Layer>.all_ports new connections added by: full_connect_simple(<input-port-layer>.all_ports, <Multi-Layer>.input_nodes) id values produced by replicate operation Figure 56: Neural Topology Including External Ports Rule I.ii maps the external environment to the input and output nodes of the network. Note that in this grammar, output nodes are mapped on a one-to-one basis to the output ports (using direct_connect_simple), but there is no guarantee that the two sets are the same size. As such, not all output nodes may be externally visible. The topological effects of production I are illustrated in Figure 56. Input ports are represented as small clear squares and output ports as small 166

190 shaded squares. Note that the bold output node of the <Multi-Layer> structure is not connected to any output port. Practice 11 NGAGE grammar productions that represent a range of numeric values may be abstracted as a single compound production. A key limitation of any attribute evaluation rule is that it may not incorporate information that is not directly (and deterministically) derived from the attributes of the symbols within the rule. Thus, as discussed in Practice 5, no random set operations are used within NGAGE. However, it is often valuable to compactly represent a range of values within a grammar. For instance, multiple grammar productions may be identical in all respects, yet vary only in the value of a single numeric terminal symbol. In the simplest case, the number of possible values that may be used is small, and a few grammar productions and terminal symbols suffice and are convenient to create. In the general case, however, the number of possible values may be very large, and creating a distinct production and terminal symbol for each possible value leads to an unwieldy grammar. Within NGAGE, a range of values may be represented within a single compound grammar production. The compound production is not, in itself, a legal attribute grammar production. Rather, it is a meta-level production that is used to generate many different legal attribute grammar productions and terminal symbols. Each compound production represents a range of values with a specific distribution. During the generation of a parse tree, the selection of a compound production results in the automatic creation of a terminal symbol with an attribute 167

191 that stores a random value drawn from the corresponding distribution. For clarity, the name of the terminal symbol identifies the value as well, prefixed by a special reserved character (i.e., # ) to eliminate conflicts with other user-defined terminals symbols. Within the current NGAGE, the random selection function is defined within native code and may accept parameter values. I. <Real> <#:uniformreal> i <Real>.value = <#:uniformreal>.value (a) Ia. <Real> <#3.44> i <Real>.value = <#3.44>.value where <#3.44>.value = 3.44 <Real> (b) <Real> value = 3.44 <#3.44> value = 3.44 (c) <#3.44> value = 3.44 Figure 57: (a) Compound Production for Real Numbers, (b) Instance of Compound Production and (c) Deterministic Attribute Evaluation Process Figure 57(a) illustrates a sample compound production (I) that generates a random real value from a uniform distribution using the native operation uniformreal. A double arrow ( ) is used to indicate a compound production, and the notation <#:selection-operation> is used to indicate the random value that is generated by the indicated native operation. Figure 57(b) illustrates the result of new step that is required during the creation of a parse tree. The uniformreal operation is applied and a random number (i.e., 3.44) is generated. A new terminal 168

192 symbol is created with the corresponding reserved name (i.e., #3.44 ), and its value attribute is set to the generated number. A new specific instance of the compound production (Ia) is effectively formed using the new terminal symbol. Figure 57(c) illustrates a sequence of attribute evaluations, applied to the corresponding parse tree; the evaluation of attributes remains deterministic. I. <IntRange> <#:uniforminteger ( 1, 100 ) > i <IntRange>.value = <#:uniforminteger(1, 100)>.value (a) Ia. <IntRange> <#98> i <IntRange>.value = <#98>.value where <#98>.value = 98 <IntRange> (b) <IntRange> value = 98 <#98> value = 98 (c) <#98> value = 98 Figure 58: (a) Compound Production for Range of Integers, (b) Instance of Compound Production and (c) Deterministic Attribute Evaluation Process Figure 58(a) illustrates a sample compound production (I) that generates a random integer value from a uniform distribution using the native operation uniforminteger. The operation also accepts two values as parameters (i.e., 1 and 100), as indicated in the notation <#:selection-operation (parameter1, parameter 2 )>. Figure 58(b) illustrates the result of applying the uniforminteger operation passing the given parameters. A random integer (i.e., 98) is generated, a new terminal symbol is created with the corresponding reserved name (i.e., #98 ), 169

193 and its value attribute is set to the generated number. A new specific instance of the compound production (Ia) is effectively formed using the new terminal symbol. Figure 58(c) illustrates a sequence of attribute evaluations, applied to the corresponding parse tree; as before, the evaluation of attributes remains deterministic. A compound production incorporates randomness into the grammar, but ensures the generation of parse trees whose attribute evaluation processes are completely deterministic. Grammar short-hand notations and meta-rules (Gruau, 1996) also accomplish concise representations of a range of values. A distinction of the NGAGE approach is the capability to select a value from a specific distribution as well (i.e., not all uniform), as determined by the behaviours of the native operations Discussion of Primary Design Practices Through the application of Practice 1 to Practice 11, NGAGE is capable of representing a variety of network topologies that use only activation signals. For example, Figure 56 illustrates a classic layered feedforward topology. With slight variations, simple recurrent networks may also be readily represented. For example, Figure 59(a) illustrates an NGAGE grammar that augments the grammar of Figure 55. Through a slight change to the attribute evaluation rule that forms the connections in production I (indicated in bold), a recurrent network is created, as illustrated in Figure 59(b). Practice 1 to Practice 11 also enable NGAGE to form representations that are comparable to those of grammatical encoding (Jacob and Rehder, 1993) and simple 170

194 cellular encoding (Friedrich and Moraga, 1996, 1997; Gruau, 1994, 1995; Gruau and Whitley, 1993). Further, NGAGE immediately offers some benefits over these two approaches. Grammar fragment augmenting grammar of Figure 55. I. <Network> (<Input-Port-Layer>, 3) recurrent ( <Multi-Layer> ) (<Output-Port-Layer>, 5) ii <Network>.all_connections = <Multi-Layer>.all_connections 4 full_connect_simple(<input-port-layer>.all_ports, <Multi-Layer>.input_nodes) 4 direct_connect_simple(<multi-layer>.output_nodes, <Output-Port-Layer>.all_ports) 4 full_connect_simple(<multi-layer>.output_nodes, <Multi-Layer>.input_nodes) e <Output-Port-Layer>.size = (a) <Multi-Layer>.output_nodes new connections added by: full_connect_simple(<multi-layer>.output_nodes, <Multi-Layer>.input_nodes) <Multi-Layer>.input_nodes (b) Figure 59: (a) NGAGE Grammar Illustrating Recurrent Connections and (b) Associated Recurrent Topology Figure 60 illustrates an NGAGE grammar that is equivalent to the grammatical encoding example presented in Figure 32 (with 5 possible input neurons, 10 possible output neurons and 7 possible cortex neurons). The context-free productions of both grammars are identical (barring slight notational differences). In the compound 171

195 productions VII, IX, and X, neurons are drawn from a set of possible neurons that are distinguished using unique identity strings. The same neuron (e.g., the fifth input neuron) may be specified multiple times throughout the tree (e.g., always with the production instance <InputNeuron> <in-port> (<#:5>) ) and always be assigned the same identity value (e.g., 1.5 ). The NGAGE grammar, unlike the grammatical encoding, explicitly collects the set of nodes and connections in the attributes of the root symbol (i.e., <Topology>.all_nodes and <Topology>.all_connections). This eliminates the need for an extra decoding step, namely extracting connections from the path lists, and offers the possibility of using a generic interpreter. Non-Terminal Symbols: <Topology>: (all_nodes, all_connections) {} <InputNeuron>: (node) {} <PathList>: (all_nodes, all_connections) {} <OutputNeuron>: (node) {} <Path>: (all_nodes, all_connections) {} <CortexNeuron>: (node) {} <NeuronList>: (all_nodes, all_connections, first_node, last_node) {} Terminal Symbols: <in-port> <out-port> <cortex> ; (, ) Helper Operations: connect Productions and Attribute Evaluation Rules: I. <Topology> 1 <Topology> 2 <PathList> i <Topology> 1.all_nodes = <Topology> 2.all_nodes 4 <PathList>.all_nodes ii <Topology> 1.all_connections = <Topology> 2.all_connections 4 <PathList>.all_connections II. <Topology> <Path> i <Topology>.all_nodes = <Path>.all_nodes ii <Topology>.all_connections = <Path>.all_connections III. <PathList> ; <Path> i <PathList>.all_nodes = <Path>.all_nodes ii <PathList>.all_connections = <Path>.all_connections IV. <Path> <InputNeuron> <NeuronList> <OutputNeuron> i <Path>.all_nodes = <InputNeuron>.node 4 <NeuronList>.all_nodes 4 <OutputNeuron>.node ii <Path>.all_connections = <NeuronList>.all_connections 4 connect(<inputneuron>.node, <NeuronList>.first_node) 4 connect(<neuronlist>.last_node,<outputneuron>.node) V. <NeuronList> 1 <NeuronList> 2 <NeuronList> 3 i <NeuronList> 1.all_nodes = <NeuronList> 2.all_nodes 4 <NeuronList> 3.all_nodes ii <NeuronList> 1.all_connections = <NeuronList> 2.all_connections 4 <NeuronList> 3.all_connections 4 connect(<neuronlist> 2.last_node, <NeuronList> 3.first_node) iii <NeuronList> 1.first_node = <NeuronList> 2.first_node iv <NeuronList> 1.last_node = <NeuronList> 3.last_node continued

196 ...continuation VI. <NeuronList> <CortexNeuron> i <NeuronList>.all_nodes = { <CortexNeuron>.node } ii <NeuronList>.all_connections = {} iii <NeuronList>.first_node = { <CortexNeuron>.node } iv <NeuronList>.last_node = { <CortexNeuron>.node } VII. <NeuronList> <OutputNeuron> i <NeuronList>.all_nodes = { <OutputNeuron>.node } ii <NeuronList>.all_connections = {} iii <NeuronList>.first_node = { <OutputNeuron>.node } iv <NeuronList>.last_node = { <OutputNeuron>.node } VIII. <InputNeuron> <in-port> ( <#:uniforminteger(1,5)> ) i <InputNeuron>.node = [concatenate( 1.,<#uniformInteger(1,5)>.value), <in-port>.spec] IX. <OutputNeuron> <out-port> ( <#:uniforminteger(1,10)> ) i <OutputNeuron>.node = [concatenate( 2.,<#uniformInteger(1,10)>.value), <out-port>.spec] X. <CortexNeuron> <cortex> ( <#:uniforminteger(1,7)> ) i <CortexNeuron>.node = [concatenate( 3.,<#uniformInteger(1,7)>.value), <cortex>.spec] where <in_port>.spec = detailed specification of an input port <out_port>.spec = detailed specification of an output port <cortex>.spec = detailed specification of a cortex neuron connect (a,b): Given two node pairs, (id a,*) and (id b,*), returns the set of connections {(id a,id b )} Figure 60: NGAGE Grammar Illustrating Equivalent Representation to Grammatical Encoding Figure 61 illustrates an NGAGE grammar that is equivalent to a basic cellular encoding representation. The interchangeability of all cellular encoding program symbols is represented through the use of a single non-terminal symbol, <Cell>, that is expanded in all productions except the first. Production I introduces a starting non-terminal symbol, <Network>, to enhance the clarity of the representation. Productions II and III represent two common cellular encoding program symbols, SEQ and PAR, as illustrated in Figure 35 and Figure 36. Production IV represents the program symbol CLONE, and uses the structure replication technique (as in Figure 53) to duplicate an entire cell structure. Productions V and VI represent the terminating program symbol END and illustrate the use of multiple neuron types with different sigmoid activation functions, as introduced by 173

197 Gruau (1995) and Friedrich and Moraga (1996, 1997). Unlike the cellular encoding approach, in which the exact connectivity patterns associated with a program symbol tree can be unclear, an NGAGE representation distinguishes between all neurons (using unique identity strings) and clearly identifies the pair of nodes that form each connection. These connections are collected in the attributes of the root symbol and are readily accessible. Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Cell>: (all_nodes, input_nodes, output_nodes, all_connections) {id} <Input-Port-Layer>: (all_ports) {id,size} <Output-Port-Layer>: (all_ports) {id,size} Terminal Symbols: <neurona> <neuronb> SEQ PAR END CLONE fill (, ) Helper Operations: full_connect_simple, concatenate, set_concatenate, replicate Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, 3) <Cell> (<Output-Port-Layer>, 4) i <Network>.all_nodes = <Cell>.all_nodes ii <Network>.all_connections = <Cell>.all_connections 4 full_connect_simple(<input-port- Layer>.all_ports, <Cell>.input_nodes) 4 full_connect_simple(<cell>.output_ports, <Output-Port-Layer>.all_ports) a <Input-Port-Layer>.id = 1.1 b <Input-Port-Layer>.size = 3 c <Cell>.id = 1.2 d <Output-Port-Layer>.id = 1.3 e <Output-Port-Layer>.size = 4 II. <Cell> 1 SEQ ( <Cell> 2, <Cell> 3 ) i <Cell> 1.all_nodes = <Cell> 2.all_nodes 4 <Cell> 3.all_nodes ii <Cell> 1.all_connections = <Cell> 2.all_connections 4 <Cell> 3.all_connections 4 full_connect_simple(<cell> 2.output_nodes, <Cell> 3.input_nodes) iii <Cell> 1.input_nodes = <Cell> 2.input_nodes iv <Cell> 1.output_nodes = <Cell> 3.output_nodes a <Cell> 2.id = concatenate(<cell> 1.id,.1 ) b <Cell> 3.id = concatenate(<cell> 1.id,.2 ) III. <Cell> 1 PAR ( <Cell> 2, <Cell> 3 ) i <Cell> 1.all_nodes = <Cell> 2.all_nodes 4 <Cell> 3.all_nodes ii <Cell> 1.all_connections = <Cell> 2.all_connections 4 <Cell> 3.all_connections iii <Cell> 1.input_nodes = <Cell> 2.input_nodes 4 <Cell> 3.input_nodes iv <Cell> 1.output_nodes = <Cell> 2.output_nodes 4 <Cell> 3.output_nodes a <Cell> 2.id = concatenate(<cell> 1.id,.1 ) b <Cell> 3.id = concatenate(<cell> 1.id,.2 ) IV. <Cell> 1 CLONE ( <Cell> 2 ) i <Cell> 1.all_nodes = <Cell> 2.all_nodes 4 set_concatenate(<cell> 1.id,.2., <Cell> 2.all_nodes, 1) ii <Cell> 1.all_connections = <Cell> 2.all_connections 4 set_concatenate(<cell> 1.id,.2., set_concatenate (<Cell> 1.id,.2., <Cell> 2.connections, 1), 2) iii <Cell> 1.input_nodes = <Cell> 2.input_nodes 4 set_concatenate(<cell> 1.id,.2., <Cell> 2.input_nodes, 1) iv <Cell> 1.output_nodes = <Cell> 2.output_nodes 4 set_concatenate(<cell> 1.id,.2., <Cell> 2.output_nodes, 1) a <Cell> 2. id = concatenate(<cell> 1.id,.1 ) continued

198 ...continuation V. <Cell> END ( <neurona> ) i <Cell>.all_nodes = { [concatenate(<cell>.id,.1 ), <neurona>.spec] } ii <Cell>.all_connections = {} iii <Cell>.input_nodes = <Cell>.all_nodes iv <Cell>.output_nodes = <Cell>.all_nodes VI. <Cell> END ( <neuronb> ) i <Cell>.all_nodes = { [concatenate(<cell>.id,.1 ), <neuronb>.spec] } ii <Cell>.all_connections = {} iii <Cell>.input_nodes = <Cell>.all_nodes iv <Cell>.output_nodes = <Cell>.all_nodes VII. <Input-Port-Layer> fill ( <in-port> ) i <Input-Port-Layer>.all_ports = replicate(1, <Input-Port-Layer>.size, <Input-Port- Layer>.id, [ 1, <in-port>.spec], 1) VIII. <Output-Port-Layer> fill ( <out-port> ) i <Output-Port-Layer>.all_ports = replicate(1, <Output-Port-Layer>.size, <Output-Port- Layer>.id, [ 1, <out-port>.spec], 1) where <neurona>.spec = detailed specification of a neuron with sigmoid function A <neuronb>.spec = detailed specification of a neuron with sigmoid function B Figure 61: NGAGE Grammar Illustrating Equivalent Representation to Cellular Encoding The grammar of Figure 61 does not represent all cellular encoding program symbols. Some program symbols, such as INCBIAS and DECBIAS, simply modify the value of a property of the cell. These may be readily implemented within NGAGE through the definition of a new attribute (e.g., <Cell>.bias) and the appropriate manipulation of the value of that attribute (e.g., <Cell> 2.bias = <Cell> 1.bias + 1, for INCBIAS). Other symbols, such as WAIT and REC, do not have exactly equivalent representations within NGAGE since these program symbols directly affect the parsing mechanism of cellular encoding, whereas the parsing mechanism of an attribute grammar is fixed. However, NGAGE may represent some of these symbols in an approximate manner. For example, REC is roughly equivalent to a CLONE operation that is applied multiple times. Production IX of Figure 62 illustrates an extension to production IV of Figure 61 in which multiple copies of a cell structure are made based upon an inherited 175

199 attribute value (i.e., <Cell>.number_duplicates). The resulting CLONE-N replication operation is analogous to the behaviour of REC and its associated life parameter and achieves the same goal of compactness within the representation (Gruau, 1994). The CLONE_N operation also demonstrates some advantages over the REC program symbol. CLONE-N permits replication of arbitrary subtrees, whereas REC permits replication only of the entire tree. CLONE-N also permits different degrees of replication at different subtrees through the use of different inherited attribute values (e.g., as determined by productions such as production X in Figure 62), whereas every REC operation performs the same degree of replication (as determined by the single life parameter). Grammar fragment that augments grammar of Figure 61 IX. <Cell> 1 CLONE-N ( <Cell> 2 ) i <Cell> 1.all_nodes = <Cell> 2.all_nodes 4 replicate(2, <Cell> 1.number_duplicates, <Cell> 1.id, <Cell> 2.all_nodes, 1) ii <Cell> 1.all_connections = <Cell> 2.all_connections 4 replicate(2, <Cell> 1.number_duplicates, <Cell> 1.id, replicate(2, <Cell> 1.number_duplicates, <Cell> 1.id, <Cell> 2.connections, 1), 2) iii <Cell> 1.input_nodes = <Cell> 2.input_nodes 4 replicate(2, <Cell> 1.number_duplicates, <Cell> 1.id, <Cell> 2.input_nodes, 1) iv <Cell> 1.output_nodes = <Cell> 2.output_nodes 4 replicate(2, <Cell> 1.number_duplicates, <Cell> 1.id, <Cell> 2.output_nodes, 1) a <Cell> 2.id = concatenate(<cell> 1.id,.1 ) b <Cell> 2.number_duplicates = <Cell> 1.number_duplicates X. <Cell> 1 set_duplicates ( <Cell> 2, <#:uniforminteger(1,10)> ) i <Cell> 1.all_nodes = <Cell> 2.all_nodes ii <Cell> 1.all_connections = <Cell> 2.all_connections iii <Cell> 1.input_nodes = <Cell> 2.input_nodes iv <Cell> 1.output_nodes = <Cell> 2.output_nodes a <Cell> 2.id = concatenate(<cell> 1.id,.1 ) b <Cell> 2.number_duplicates = <#:uniforminteger(1,10)>.value Figure 62: NGAGE Grammar Illustrating Multiple Clone Operation The remaining program symbols within cellular encoding, such as CUT and LSPLIT (see Figure 39), perform manipulations of connections that are incoming to a given cellular structure; the source nodes of those connections are external to the 176

200 structure. The approach followed in grammar of Figure 61 for computing and collecting the connections is insufficient for the representation of such program symbols; only the incoming and outgoing nodes that form the interface of a cellular structure are represented; these nodes are within the structure. Representation of cellular encoding program symbols such as CUT and LSPLIT within NGAGE is possible, but requires a more complex use of synthesized and inherited attributes Advanced Design Practices The primary design practices presented above demonstrate the basic capabilities of NGAGE for the representation of neural structures. However, within an attribute grammar many manipulations of and interactions between attributes are possible. These properties may be exploited within NGAGE to enable the representation of a robust family of neural structures. Practice 12 Inherited attributes may be used to propagate topological constraints upon the network. An important function of the attributes of an attribute grammar is to limit, effectively, the expansion of a given parse tree. Within an attribute grammar that permits unbounded syntactic expansion of a parse tree (e.g., the context-free productions II and III of the grammar of Figure 55 together enable parse trees that are arbitrarily deep and form networks with an arbitrary number of layers of arbitrary size), it is important to be able to constrain the semantic interpretation of the parse tree when necessary. Inherited attributes may be used to impose semantic constraints upon the evaluation of certain subtrees. Thus, the space of 177

201 parse trees may be syntactically unbounded, while the space of corresponding neural networks may be semantically bounded. Grammar fragment that augments grammar of Figure 55. Non-Terminal Symbols: <Multi-Layer>: (all_nodes, input_nodes, output_nodes, all_connections) {id, max_layers} Terminal Symbols: 2 Helper Operations: fully_connect_simple Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, 5) (<Multi-Layer>, 2) (<Output-Port-Layer>, 4) f <Multi-Layer>.max_layers = 2 II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> i <Multi-Layer> 1.all_nodes = if (<Multi-Layer> 1.max_layers > 1) then <Multi-Layer> 2.all_nodes 4 <Layer>.nodes else if (<Multi-Layer> 1.max_layers == 1) then <Layer>.nodes else {} ii <Multi-Layer> 1.all_connections = if (<Multi-Layer> 1.max_layers > 1) then <Multi-Layer> 2.all_connections 4 fully_connect_simple(<multi-layer> 2.output_nodes,<Layer>.nodes) else {} iii <Multi-Layer> 1.input_nodes = if (<Multi-Layer> 1.max_layers > 1) then <Multi-Layer> 2.input_nodes else if (<Multi-Layer> 1.max_layers == 1) then <Layer>.nodes else {} iv <Multi-Layer> 1.output_nodes = if (<Multi-Layer> 1.max_layers > 0) then <Layer>.nodes else {} c <Multi-Layer> 2.max_layers = max(<multi-layer> 1.max_layers 1, 0) III. <Multi-Layer> <Layer> i <Multi-Layer>.all_nodes = if (<Multi-Layer>.max_layers > 0) then <Layer>.nodes else {} ii <Multi-Layer>.all_connections = {} iii <Multi-Layer>.input_nodes = <Multi-Layer>.all_nodes iv <Multi-Layer>.output_nodes = <Multi-Layer>.all_nodes Figure 63: NGAGE Grammar Illustrating the Use of Inherited Attributes to Constrain Topology 178

202 Figure 63 illustrates a grammar fragment that augments the grammar of Figure 55. An additional inherited attribute, max_layers, is associated with each <Multi-Layer> non-terminal symbol. This attribute is used to impose a limit upon the number of nodes in a network, and is decremented further the deeper a parse tree becomes. The grammar behaves exactly as that in Figure 55 until the attribute max_layers reaches a value of 0. Once this limit is reached at a given <Multi- Layer> symbol in a given parse tree, the subtree rooted by that symbol and hence all lower subtrees are known to contribute no more nodes to the semantics of the network. Thus, when the max_layers attribute has a value of 0, an empty specification is returned for the left-hand symbol <Multi-Layer> of productions II and III (i.e., all node and connection sets are empty). In other words, the entire subtree below that point is effectively ignored. When max_layers has a value of 1, only certain specifications from the lower subtrees are ignored. Figure 64(a) presents an attributed parse tree generated from the grammar of Figure 63. Note the difference in the treatment of the attribute values of the two circled <Layer> symbols, which each have syntactically identical subtrees and virtually identical semantic evaluations (i.e., neural specifications differing only in identity values). The neural specification of one symbol, namely the <Layer>.nodes attribute indicated in bold-face in the shaded subtree, is effectively ignored in the attribute evaluation of its parent, while that of the other symbol is propagated upwards in its entirety. Figure 64(b) illustrates the neural topology produced by this parse tree. Note that two output ports are unconnected to the network since there are too few nodes in the top layer of the network. 179
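To make the behaviour of the inherited max_layers constraint concrete, the following short Python sketch (not part of the thesis) evaluates a hand-built chain of <Multi-Layer>/<Layer> symbols in the manner of productions II and III of Figure 63. The tuple-based tree encoding, the simplified identity strings and the stand-in full_connect_simple are assumptions introduced only for illustration; the attribute rules themselves follow the figure.

# Sketch of the "ignoring" mechanism of Figure 63 (simplified; not the NGAGE implementation).
# A <Multi-Layer> node is ('multi', lower_multi_layer_or_None, layer_size).

def full_connect_simple(sources, targets):
    # Stand-in for the helper operation: one connection from every source to every target.
    return {(s, t) for s in sources for t in targets}

def eval_multi_layer(tree, ident, max_layers):
    """Return (all_nodes, all_connections, input_nodes, output_nodes) for a <Multi-Layer> subtree."""
    _kind, child, layer_size = tree
    layer_nodes = {f"{ident}.2.{k}" for k in range(1, layer_size + 1)}   # simplified identities
    if child is None:                                   # production III: <Multi-Layer> -> <Layer>
        if max_layers > 0:
            return layer_nodes, set(), layer_nodes, layer_nodes
        return set(), set(), set(), set()               # constraint exhausted: the layer is ignored
    # production II: <Multi-Layer>1 -> <Multi-Layer>2 <Layer>
    sub_nodes, sub_conns, sub_in, sub_out = eval_multi_layer(child, ident + ".1", max(max_layers - 1, 0))
    if max_layers > 1:
        conns = sub_conns | full_connect_simple(sub_out, layer_nodes)
        return sub_nodes | layer_nodes, conns, sub_in, layer_nodes
    if max_layers == 1:
        return layer_nodes, set(), layer_nodes, layer_nodes
    return set(), set(), set(), set()

# Syntactically three layers deep (sizes 2, 3 and 1 from the top down), but max_layers = 2:
tree = ('multi', ('multi', ('multi', None, 1), 3), 2)
nodes, connections, _, _ = eval_multi_layer(tree, "1.2", max_layers=2)
print(len(nodes), "nodes,", len(connections), "connections")   # 5 nodes, 6 connections

Running the sketch on a tree that is syntactically three layers deep with max_layers = 2 reports five nodes and six connections; the deepest layer is present in the parse tree but contributes nothing to the network, which is the ignoring behaviour shown in Figure 64.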

[Figure 64(a): attributed parse tree generated from the grammar of Figure 63, with an input port layer of size 3, max_layers = 2 and an output port layer of size 4; the circled <Layer> symbols have syntactically identical subtrees, but the nodes of the deeper one are ignored in the evaluation of its parent. Figure 64(b): the associated neural topology, in which two output ports remain unconnected.]

Figure 64: (a) Attributed Parse Tree Illustrating "Ignoring" Mechanism and (b) Associated Neural Topology

The use of constraints such as max_layers clearly demonstrates that an NGAGE grammar is context-sensitive since identical subtrees may have radically

204 different evaluations. However, such constraints increase the redundancy of the grammar since (infinitely) many different parse trees may produce the same network specification. Practice 13 The neural specification of a given subtree of a parse tree should, as much as possible, be consistent with its contribution to neural specification of the entire tree The grammar of Figure 63 above illustrates a powerful mechanism the capability of ignoring certain attribute values of a symbol. In particular, the inherited max_layers constraint can result in a radically different evaluation of a given subtree. However, this mechanism should be used with care since it reduces the clarity of the representation. A better approach to achieving the same semantic effect (i.e., a null specification based upon an inherited attribute) is to propagate that constraint further downwards and ensure that, as much as possible, the neural specification of any given subtree is meaningful in the context of the rest of the parse tree. This also reinforces Practice 1 and emphasizes the role of each grammar symbol. Figure 65 illustrates a new grammar that propagates the constraint to the <Layer> symbol using the attribute is_empty. Rules II.d and III.b propagate a true value for <Layer>.is_empty when <Multi-Layer>.max_layers reaches a value of 0. When is_empty is true, an empty specification for that layer is returned. 181
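The difference between the two treatments can also be seen in a few lines of Python (an illustrative sketch only; the identity strings and function names are simplified stand-ins for the attribute rules of Figures 63 and 65): under the first grammar the subtree still synthesizes its nodes and the parent discards them, whereas under the second the is_empty flag reaches the <Layer> symbol and the subtree's own specification is already empty.

def layer_nodes(ident, size):
    return {f"{ident}.{k}" for k in range(1, size + 1)}

# Figure 63: the <Layer> always synthesizes its nodes; a <Multi-Layer> parent whose
# max_layers constraint is exhausted simply ignores them.
def parent_of_exhausted_subtree(child_spec, max_layers):
    return child_spec if max_layers > 0 else set()

# Figure 65: the constraint is propagated down as <Layer>.is_empty, so the layer
# itself synthesizes an empty (and therefore consistent) specification.
def layer_with_is_empty(ident, size, is_empty):
    return set() if is_empty else layer_nodes(ident, size)

subtree_spec = layer_nodes("1.2.1.2", 2)
print(parent_of_exhausted_subtree(subtree_spec, max_layers=0))   # set(), although the subtree holds 2 nodes
print(layer_with_is_empty("1.2.1.2", 2, is_empty=True))          # set(), and the subtree itself is empty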

205 Grammar fragment that augments the grammar of Figure 63 Non-Terminal Symbols: <Layer>: (nodes, is_empty) {id} Productions and Attribute Evaluation Rules: II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> i <Multi-Layer> 1.all_nodes = <Multi-Layer> 2.all_nodes 4 <Layer>.nodes ii <Multi-Layer> 1.all_connections = <Multi-Layer> 2.all_connections 4 fully_connect_simple(<multi-layer> 2.output_nodes,<Layer>.nodes) iii <Multi-Layer> 1.input_nodes = if (<Multi-Layer> 1.max_layers > 1) then <Multi-Layer> 2.input_nodes else <Layer>.nodes iv <Multi-Layer> 1.output_nodes = <Layer>.nodes d <Layer>.is_empty = if (<Multi-Layer>.max_layers > 0) then false else true III. <Multi-Layer> <Layer> i <Multi-Layer>.all_nodes = <Layer>.nodes ii <Multi-Layer>.all_connections = {} iii <Multi-Layer>.input_nodes = <Multi-Layer>.all_nodes iv <Multi-Layer>.output_nodes = <Multi-Layer>.all_nodes b <Layer>.is_empty = if (<Multi-Layer>.max_layers > 0) then false else true IV. <Layer> 1 <perceptron> <Layer> 2 i <Layer> 1.nodes = if (<Layer> 1.is_empty == false) then <Layer> 2.nodes 4 [concatenate(<layer> 1.id,.1 ),<perceptron>.spec] else {} b <Layer> 2.is_empty = <Layer> 1.is_empty V. <Layer> <perceptron> i <Layer> 1.nodes = if (<Layer> 1.is_empty == false) then [concatenate(<layer> 1.id,.1 ),<perceptron>.spec] else {} Figure 65: NGAGE Grammar Illustrating the Use of Inherited Attributes to Avoid Unnecessary Specification of Neural Components Figure 66 illustrates the main semantic difference between the corresponding parse trees from the grammars of Figure 63 and Figure 65. The key difference is in the evaluation of the attributes of the shaded subtree of Figure 64. Using the additional <Layer>.is_empty constraint, the <Layer>.nodes value in the shaded subtree is evaluated as the empty set. The result is a neural specification for the subtree that is consistent with its contribution to the overall topology (i.e., 182

206 no contribution). The attributed parse trees produced by the grammar of Figure 65 are thus cleaner and easier to read and understand. <Layer> <perceptron> spec = X id = is_empty = true nodes = {} Figure 66: Subtree Generated Using Constraint to Ensure Meaningful Neural Specification Practice 14 NGAGE grammars may be parameterized for concise representation of highly similar grammars that differ only in toplevel constraints As illustrated in Figure 63, productions that expand the root symbol of an NGAGE grammar may impose certain constraint values (i.e., 2) within the attribute evaluation rules (i.e., rule I.b), as determined through the use of a specific terminal symbol (i.e., 2). With this approach, specification of a new grammar that differs only in that one constraint value requires the definition of a new production that differs only in that one terminal symbol (and associated semantic value). To enable the concise representation of multiple NGAGE grammars that differ only in such top-level constraint values, the notion of NGAGE grammar parameters is introduced. Figure 67 illustrates a simple change to the grammar of Figure 63 that incorporates the use of parameter values. The convention for specifying a parameter is to use the same name, in all capital letters, to refer to the parameter, its semantic value, and its use as a terminal symbol. The parameters IN_SIZE, 183

207 OUT_SIZE and MAX_LAYERS are applied in the rules of production I to propagate inherited constraints. Grammar fragment that augments grammar of Figure 63. Grammar Parameters: IN_SIZE, OUT_SIZE, MAX_LAYERS Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, IN_SIZE) (<Multi-Layer>, MAX_LAYERS) (<Output-Port-Layer>, OUT_SIZE) b e f <Input-Port-Layer>.size = IN_SIZE <Output-Port-Layer>.size = OUT_SIZE <Multi-Layer>.max_layers = MAX_LAYERS Figure 67: NGAGE Grammar Illustrating the Use of NGAGE Parameter Values Practice 15 Synthesized and inherited attributes may be used in combination to pass structural information and constraints throughout a parse tree The practices described above demonstrate that in an NGAGE grammar it is possible to pass structural information up to the root as well as pass structural constraints down a given parse tree. Further, these have been treated largely as separate processes. However, in an attribute grammar, synthesized and inherited attributes may interact in a variety of complex ways. Context-sensitive constraints may be passed down using inherited attributes, but the values of those attributes may be influenced by structural information that is passed up using synthesized attributes. Figure 68 illustrates a grammar fragment (augmenting the grammars of Figure 55 and Figure 63) in which a constraint is placed upon the maximum number of nodes upon the network. In the general case, the expansion of any one 184

208 branch of a parse tree may produce arbitrarily many nodes, thereby constraining the remaining number of nodes that may be created in another branch (i.e., before the overall limit is exceeded). The attribute max_size passes down a constraint that limits the number of nodes that may be created in any given branch, and the attribute true_size passes up the actual number of nodes created in that branch. The true_size of a left branch may in turn be used to modulate the max_size constraint on the right branch, as in rule II.e. Note that a value of 0 for the <Multi-Layer>.max_size or <Layer>.max_size attribute is propagated downwards when either the maximum number of layers or the maximum number of nodes has been reached. In the former case, a value of 0 for <Layer>.max_size has the same effect as a true value for <Layer>.is_empty in Figure 65. Figure 69 illustrates an attributed parse tree generated from the grammar. The numbers next to the max_size and true_size attributes indicate the order in which those attributes are evaluated. Due to the dependencies between the attributes, the evaluation proceeds in a right-to-left manner, with the max_size attributes evaluated on the way down and the true_size attributes evaluated on the way up. The two shaded attribute pairs illustrate the two conditions under which max_size is assigned a value of 0. The right pair (indexed by 8 and 9) represents the condition when the maximum number of nodes has been met. The left pair (indexed by 12 and 15) represents the condition when the maximum number of layers has been met. Topologically, the higher layers of the networks produced by this grammar will be filled before the lower layers. 185

209 Grammar fragment that augments combined grammar of Figure 55 and Figure 63 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Multi-Layer>: (all_nodes, input_nodes, output_nodes, all_connections, true_size) {id, max_layers, max_size } <Layer>: (nodes, true_size) {id, max_size } Terminal Symbols: <perceptron> Grammar Parameters: MAX_LAYERS, MAX_SIZE, IN_SIZE, OUT_SIZE Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, IN_SIZE) (<Multi-Layer>, MAX_LAYERS, MAX_SIZE) (<Output-Port-Layer>, OUT_SIZE) g <Multi-Layer>.max_size = MAX_SIZE II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> v <Multi-Layer> 1.true_size = <Multi-Layer> 2.true_size + <Layer>.true_size d <Multi-Layer> 2.max_size = if (<Multi-Layer> 1.max_layers > 1) then <Multi-Layer> 1.max_size - <Layer>.true_size else 0 e <Layer>.max_size = if (<Multi-Layer> 1.max_layers > 0) then <Multi-Layer> 1.max_size else 0 III. <Multi-Layer> <Layer> v <Multi-Layer>.true_size = <Layer>.true_size b <Layer>.max_size = if (<Multi-Layer>.max_layers > 0) then <Multi-Layer>.max_size else 0 IV. <Layer> 1 <perceptron> <Layer> 2 i <Layer> 1.nodes = if (<Layer> 1.max_size > 0) then <Layer> 2.nodes 4 [concatenate(<layer> 1.id,.1 ), <perceptron>.spec] else {} ii <Layer> 1.true_size = if (<Layer> 1.max_size > 0) then <Layer> 2.true_size + 1 else 0 b <Layer> 2.max_size = max(<layer> 1.max_size 1, 0) V. <Layer> <perceptron> i <Layer>.nodes = if (<Layer>.max_size > 0) then [concatenate(<layer>.id,.1 ), <perceptron>.spec] else {} ii <Layer>.true_size = if (<Layer>.max_size > 0) then 1, else 0 Figure 68: NGAGE Grammar Illustrating the Interaction of Synthesized and Inherited Attributes for Structural Constraints 186
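A minimal Python rendering of this interaction (an illustrative sketch, not the thesis implementation; the list encoding of layer sizes and the function names are assumptions) makes the evaluation order visible: the <Layer> child must be evaluated before the lower <Multi-Layer> child, because rule II.d needs <Layer>.true_size in order to compute the remaining max_size.

def eval_layer(syntactic_size, max_size):
    # Productions IV and V of Figure 68: perceptrons are kept only while max_size stays positive.
    return min(syntactic_size, max(max_size, 0))

def eval_multi_layer(layer_sizes, max_layers, max_size):
    """layer_sizes lists the syntactic sizes of the <Layer> children from the top of the tree down;
    the function returns the synthesized true_size of the whole <Multi-Layer> subtree."""
    layer_true = eval_layer(layer_sizes[0], max_size if max_layers > 0 else 0)   # rules II.e / III.b
    if len(layer_sizes) == 1:                                                    # production III
        return layer_true
    remaining = max_size - layer_true if max_layers > 1 else 0                   # rule II.d
    return layer_true + eval_multi_layer(layer_sizes[1:], max_layers - 1, remaining)

# MAX_LAYERS = 2 and MAX_SIZE = 3, with syntactic layer sizes 2, 1 and 1 from the top down:
print(eval_multi_layer([2, 1, 1], max_layers=2, max_size=3))   # 3: the higher layers are filled first

Because the budget left for a lower subtree depends on what the layer above it actually produced, the attribute dependencies, rather than a fixed traversal, dictate the right-to-left order observed in Figure 69.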

[Attributed parse tree generated from the grammar of Figure 68 with MAX_LAYERS = 2 and MAX_SIZE = 3; the numbers beside the max_size and true_size attributes indicate the order in which they are evaluated, and the two shaded attribute pairs mark the points at which max_size is forced to 0.]

Figure 69: Attribute Parse Tree Illustrating Interaction Between Synthesized and Inherited Attributes

When inherited and synthesized constraints are used together in this way, care must be taken to avoid circular dependencies. Note that the evaluation path followed in the parse tree depends upon the relationships among the attributes, and not upon a fixed, a priori traversal method, such as depth-first traversal.

Practice 16 Attributes may be used to synchronize neural structures.

Within a neural network topology, it is often important that different structures, such as two layers, obey a common constraint, such as the same size.

211 For instance, recall Figure 56, in which the layer of output nodes of the network was larger than the layer of output ports, resulting in one output node being unconnected to the external environment, and Figure 64b, in which the layer of output nodes was smaller than the layer of output ports, resulting in two output ports being unconnected to the network. Figure 70 illustrates a grammar in which the number of output nodes of the network is constrained to match the number of output ports, which in turn is determined by a grammar parameter. The <Multi-Layer>.min_output_size attribute provides a lower limit on the size of output layer of each multi-layer structure and the <Multi-Layer>.max_output_size attribute provides an upper limit. These limits are used to synchronize the top layer of the network with the output ports. Rules II.g and II.h propagate the limits, as inherited, to the <Layer>.min_size and <Layer>.max_size attributes. Rules II.e and II.f, however, relax the limits to the range of 1..max_size since synchronization is only required for the top layer of the network. As before, productions IV and V use the ignoring mechanism when the layer has a maximum size of 0. Production V further uses the replicate operator to create enough <perceptron> nodes to meet the minimum required by the min_size attribute. In the general case, this attribute will either have a value of 1 (normally) or 0 (when the maximum size of the layer has been reached). However, for the top layer of the network, this attribute may have a value greater than 1, in which case the replicate operator behaves as a filling mechanism. Proper synchronization is achieved through the appropriate use of the ignoring and filling mechanisms. 188

212 Grammar fragment that augments combined grammar of Figure 55 and Figure 63 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Multi-Layer>: (all_nodes, input_nodes, output_nodes, all_connections, true_size) {id, max_layers, max_size, min_output_size, max_output_size } <Layer>: (nodes, true_size) {id, min_size, max_size } Terminal Symbols: <perceptron> fill ( ) Grammar Parameters: MAX_LAYERS, MAX_SIZE, IN_SIZE, OUT_SIZE Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, IN_SIZE) (<Multi-Layer>, MAX_LAYERS, MAX_SIZE, OUT_SIZE) (<Output-Port-Layer>, OUT_SIZE) h <Multi-Layer>.max_output_size = OUTSIZE i <Multi-Layer>.min_output_size = OUTSIZE II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> e <Multi-Layer> 2.max_output_size = if (<Multi-Layer> 1.max_layers > 0) then <Multi-Layer> 2.max_size else 0 f <Multi-Layer> 2.min_output_size = if (<Multi-Layer> 1.max_layers > 0) then min (1, <Multi-Layer> 2.max_size) else 0 g <Layer>.max_size = <Multi-Layer> 1.max_output_size h <Layer>.min_size = <Multi-Layer> 1.min_output_size III. <Multi-Layer> <Layer> b <Layer>.max_size = <Multi-Layer>.max_output_size h <Layer>.min_size = <Multi-Layer>.min_output_size IV. <Layer> 1 <perceptron> <Layer> 2 i <Layer> 1.nodes = if (<Layer> 1.max_size > 0) then <Layer> 2.nodes 4 [concatenate(<layer> 1.id,.1 ), <perceptron>.spec] else {} ii <Layer> 1.true_size = if (<Layer> 1.max_size > 0) then <Layer> 2.true_size + 1 else 0 b <Layer> 2.max_size = max(0, <Layer> 1.max_size 1) c <Layer> 2.min_size = max(0, <Layer> 1.min_size - 1) V. <Layer> fill ( <perceptron> ) i <Layer>.nodes = if (<Layer>.max_size > 0 and <Layer>.min_size > 0) then replicate(1, <Layer>.min_size, <Layer>.id, [ 1, <perceptron>.spec], 1) else {} ii <Layer>.true_size = if (<Layer>.max_size > 0 and <Layer>.min_size > 0) then <Layer>.min_size else 0 Figure 70: NGAGE Grammar Illustrating Synchronization of Neural Structures 189
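The following Python sketch (illustrative only; the tuple encoding of the layer subtree, the simplified replicate and the identity strings are assumptions) mirrors productions IV and V of Figure 70 and shows the filling and ignoring mechanisms producing a layer of exactly the required size from syntactically different subtrees.

def replicate(count, ident, spec):
    # Simplified stand-in for the replicate helper: `count` nodes with fresh identity strings.
    return {(f"{ident}.{k}", spec) for k in range(1, count + 1)}

def eval_layer(subtree, ident, min_size, max_size, spec="X"):
    """subtree is 'fill' for production V, or ('perceptron', lower_subtree) for production IV."""
    if subtree == 'fill':                                   # production V: the filling mechanism
        if max_size > 0 and min_size > 0:
            return replicate(min_size, ident, spec)
        return set()
    _tag, lower = subtree                                   # production IV: one explicit <perceptron>
    if max_size <= 0:
        return set()                                        # the ignoring mechanism
    own = {(f"{ident}.1", spec)}
    return own | eval_layer(lower, ident + ".2", max(min_size - 1, 0), max_size - 1, spec)

# Top layer of the network, synchronized to the output ports: min_size = max_size = OUT_SIZE = 4.
shallow = ('perceptron', 'fill')                            # one perceptron, then fill
deep = ('perceptron', ('perceptron', ('perceptron', ('perceptron', ('perceptron', 'fill')))))
print(len(eval_layer(shallow, "1.2.2", min_size=4, max_size=4)))   # 4: filled up to the minimum
print(len(eval_layer(deep, "1.2.2", min_size=4, max_size=4)))      # 4: truncated to the maximum

Either subtree therefore yields a top layer with exactly OUT_SIZE nodes, which is the synchronization illustrated in Figures 71 to 73.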

[Attributed parse tree generated from the grammar of Figure 70 with IN_SIZE = 3, MAX_LAYERS = 3, MAX_SIZE = 12 and OUT_SIZE = 4; the min_output_size and max_output_size attributes of the topmost <Multi-Layer> symbol are both fixed to OUT_SIZE, while those of the lower <Multi-Layer> symbols are relaxed, and the fill productions create layers of the required sizes.]

Figure 71: Attributed Parse Tree Illustrating Synchronization using the "Filling" Mechanism

Figure 71 illustrates an attributed parse tree that is generated from the grammar. Note the qualitative difference between the min_output_size and max_output_size attributes of the top <Multi-Layer> symbol and those of the lower <Multi-Layer> symbols, as indicated in bold-face. The shaded subtree

illustrates how two nodes are created within a single <Layer> symbol using the filling mechanism.

[Attributed subtree in which successive <Layer> symbols carry min_size and max_size values that decrease from 2 to 0, so that the lowest fill(<perceptron>) production contributes no nodes.]

Figure 72: Attributed Subtree Illustrating Synchronization using the "Ignoring" Mechanism

[Topology diagram showing an exact one-to-one mapping between <Multi-Layer>.output_nodes and <Output-Port-Layer>.all_ports, labelled with their identity values.]

Figure 73: Neural Topology Produced using Synchronization

Figure 72 illustrates a second subtree that may be generated from the grammar at the same position as the shaded subtree of Figure 71. The second

215 subtree is deeper than the first, and illustrates the ignoring mechanism in its lowest <Layer> symbol. As with the first subtree, it also produces exactly two nodes, although the identity values indicated in bold-face differ slightly from those of the first subtree. Both subtrees result in a neural topology that, as desired, has a top layer with exactly OUT_SIZE nodes, as illustrated in Figure 73. Practice 17 Attributes may be used to ensure that only valid neural networks are generated. An NGAGE grammar specifies a family of neural network topologies. A given grammar must therefore include all networks that are part of a given family as well as exclude all networks that are not part of the family. An attribute grammar has a context-free grammar base. This provides a space of syntactically valid context-free parse trees. A key benefit of the use of attributes is the ability to reduce that space to a space of semantically valid parse trees. A parse tree that is both syntactically and semantically valid is a member of the language defined by the attribute grammar. A parse tree that is syntactically valid but semantically invalid is not a member. In all the example grammars presented above, it is assumed that all parse trees are in fact semantically valid. In other words, the attributes of the root symbol are assumed to define a valid neural topology. Certain techniques, such as the ignoring and filling mechanisms, are used to transform a parse tree that is seemingly invalid into a valid tree by constraining the neural topology that is 192

216 generated. This increases the semantic redundancy of the space of valid trees, in that multiple trees with different syntax may produce the same neural topology. Grammar fragment that augments grammar of Figure 70 Non-Terminal Symbols: <Network>: (all_nodes, all_connections, is_valid) {} <Multi-Layer>: (all_nodes, input_nodes, output_nodes, all_connections, true_size, is_valid) {id, max_layers, max_size, min_output_size, max_output_size } <Layer>: (nodes, true_size, is_valid) {id, min_size, max_size } Terminal Symbols: <perceptron> fill ( ) Grammar Parameters: MAX_LAYERS, MAX_SIZE, IN_SIZE, OUT_SIZE Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, IN_SIZE) (<Multi-Layer>, MAX_LAYERS, MAX_SIZE, OUT_SIZE) (<Output-Port-Layer>, OUT_SIZE) iii <Network>.is_valid = <Multi-Layer>.is_valid II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> v <Multi-Layer> 1.is_valid =<Multi-Layer> 2.is_valid AND <Layer>.is_valid III. <Multi-Layer> <Layer> v <Multi-Layer>.is_valid = <Layer>.is_valid IV. <Layer> 1 <perceptron> <Layer> 2 iii <Layer> 1.is_valid = if (<Layer> 1.max_size > 0) then <Layer> 2.is_valid else false V. <Layer> fill( <perceptron> ) iii <Layer>.is_valid = if (<Layer>.max_size > 0 and <Layer>.min_size = 1) then true else false Figure 74: NGAGE Grammar Illustrating Use of Validity Mechanism to Minimize Semantic Redundancy An alternative approach to ensuring that only valid neural topologies are produced by the grammar is to define a root-level attribute that indicates the semantic validity of the entire parse tree. The value of this attribute may be determined by the validity or invalidity of the subtrees of the root, their subtrees, and so on. The use of such attributes forms a validity mechanism. Any parse trees that evaluate as invalid are, by definition, not instances of the space of 193

217 solutions defined by the attribute grammar, and thus represent invalid neural topologies. Figure 74 illustrates a grammar fragment that augments the grammar of Figure 70 in which the non-terminal symbols <Network>, <Multi-Layer> and <Layer> all have an attribute is_valid that is used to identify the semantic validity of the neural specification computed within the attributes of that symbol. The grammar includes all the ignoring and filling mechanisms from Figure 70, but also includes a validity mechanism that minimizes redundancy. Any subtree that would result in a violation of a constraint were it not for the use of the ignoring or filling mechanisms is flagged as invalid, and that invalidity value is propagated upwards through the tree to the root. The validity mechanism may also be used to reduce the design complexity of the grammar. A disadvantage to the ignoring and filling mechanisms is that they may require the coordinated use of several different attributes within several non-terminal symbols in order to ensure correctness. For certain constraints, a simple validity check at a higher level may be used instead. For example, Figure 75 illustrates a grammar fragment that augments the grammar of Figure 55. Unlike the grammar of Figure 74, no constraints are used at all. Instead, a single validity check is performed within the is_valid attribute of the root symbol. The helper operation cardinality further avoids the need for the true_size attributes used above. The result is a very simple grammar that defines the desired space of neural topologies with minimal redundancy. Note that this simplicity of design is achieved at the cost of not following Practice

218 Grammar fragment that augments grammar of Figure 55 Non-Terminal Symbols: <Network>: (all_nodes, all_connections, is_valid) {} <Multi-Layer>: (all_nodes, input_nodes, output_nodes, all_connections, num_layers) {id} <Layer>: (nodes) {id} Terminal Symbols: <perceptron> Helper Operations: cardinality Grammar Parameters: MAX_LAYERS, MAX_SIZE, IN_SIZE, OUT_SIZE Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, IN_SIZE) (<Multi-Layer>, MAX_LAYERS, MAX_SIZE, OUT_SIZE) (<Output-Port-Layer>, OUT_SIZE) iii <Network>.is_valid = if (<Multi-Layer>.num_layers <= MAX_LAYERS) AND cardinality(<multi-layer>.all_nodes) <= MAX_SIZE) AND cardinality(<multi-layer>.output_nodes) = OUT_SIZE) then true else false II. <Multi-Layer> 1 <Multi-Layer> 2 <Layer> v <Multi-Layer> 1.num_layers =<Multi-Layer> 2.num_layers + 1 III. <Multi-Layer> <Layer> v <Multi-Layer>.num_layers = 1 where cardinality(a): Returns the number of elements in the set A Figure 75: NGAGE Grammar Illustrating Simplified Grammar Design Through Use of Validity Mechanism Practice 18 Synthesized and inherited attributes may be used in combination to enable specialized local handling of global structures In addition to passing structural constraints throughout a given parse tree, synthesized and inherited attributes may be used in combination to pass neural structures throughout the parse tree. This is of particular importance when productions in a parse tree must perform locally unique operations upon neural structures created elsewhere in the tree. 195

219 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Recurrent-Layer>: (all_nodes, all_connections) {id, nodes_to_connect} <Neuron>: (node, incoming_connections) {id, nodes_to_connect} Terminal Symbols: <neuron> Helper Operations: connect_from_all_simple, direct_connect_simple Productions and Attribute Evaluation Rules: I. <Network> <Recurrent-Layer> i <Network>.all_nodes = <Recurrent-Layer>.all_nodes ii <Network>.all_connections = <Recurrent-Layer>.all_connections a <Recurrent-Layer>.id = 1.1 b <Recurrent-Layer>.nodes_to_connect = <Network>.all_nodes II. <Recurrent-Layer> 1 <Recurrent-Layer> 2 <Neuron> i <Recurrent-Layer> 1.all_nodes = <Recurrent-Layer> 2.all_nodes 4 <Neuron>.node ii <Recurrent-Layer> 1.all_connections = <Recurrent-Layer> 2.all_connections 4 <Neuron>.incoming_connections a <Recurrent-Layer> 2.id = concatenate(<recurrent-layer> 1.id,.1 ) b <Recurrent-Layer> 2.nodes_to_connect = <Recurrent-Layer> 1.nodes_to_connect c <Neuron>.id = concatenate(<recurrent-layer> 1.id,.2 ) d <Neuron>.nodes_to_connect = <Recurrent-Layer> 1.nodes_to_connect - <Neuron>.node III. <Recurrent-Layer> <Neuron> i <Recurrent-Layer> 1.all_nodes = <Neuron>.node ii <Recurrent-Layer> 1.all_connections = <Neuron>.incoming_connections a <Neuron>.id = concatenate(<recurrent-layer> 1.id,.1 ) b <Neuron>.nodes_to_connect = <Recurrent-Layer> 1.nodes_to_connect - <Neuron>.node IV. <Neuron> <neuron> i <Neuron>.node = [<Neuron>.id, <neuron>.spec] ii <Neuron>.incoming_connections = connect_from_all_simple( <Neuron>.nodes_to_connect, <Neuron>.node) where <neuron>.spec = detailed specification of the behaviour of an (arbitrary) neuron connect_from_all_simple(a,b): Given the set of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) } and the (id,*) pair, b = (b id,*), returns the set of connections (a i, b id ) for all a i A Figure 76: NGAGE Grammar Illustrating the Interaction of Synthesized and Inherited Attributes for Neural Structures Consider the specification of a recurrent layer of nodes, in which nodes form a single layer and have outgoing connections to every other node in that layer, with no self-connections. Figure 76 illustrates a grammar for such a recurrent topology that demonstrates how neural structures may be passed both up and down a parse tree. Nodes are passed up the tree and collected in the 196

220 all_nodes attribute of the root as normal. However, these nodes are then passed down the tree using the inherited attribute nodes_to_connect. At the leaves, the recurrent connections to the specific neuron are formed in the <Neuron>.incoming_connections attribute and collected back up the tree. Note that self-connections are avoided through the use of the set subtraction operation [<Recurrent-Layer>.nodes_to_connect - <Neuron>.node]. The benefit of this approach is not immediately apparent from the grammar of Figure 76. The same family of structures may be easily represented using other approaches. For instance, the connections between all the nodes could simply be created at the root level (i.e., in production I). Alternatively, each time a node is created, a connection between it and every other node, and vice versa, could be created (i.e., in production II). However, these approaches, while elegant and easy to understand, do not readily provide the capability to create connections that are specialized for each node. Figure 77 illustrates a grammar fragment (augmenting the grammar of Figure 76) in which two types of connections, differing in delay value, may be created between nodes. The delay value of connections is explicitly defined within the connection specification. Productions IV and V are identical except that the connections created in each have a differing delay of 0 and 1, respectively. In the resultant topologies, all recurrent connections to a given node are assigned the same delay value. 197
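The up-then-down flow can be summarized in a short Python sketch (illustrative; the (id, spec) pair representation and the explicit loop over neurons stand in for the attribute evaluation of Figure 76): the full node set is synthesized first, handed back down as nodes_to_connect, and each neuron forms its own incoming connections with its own node subtracted.

def connect_from_all_simple(sources, target_id):
    # Helper operation of Figure 76: one incoming connection to target_id from every source node.
    return {(source_id, target_id) for (source_id, _spec) in sources}

def recurrent_layer(neuron_ids, spec="W"):
    all_nodes = {(nid, spec) for nid in neuron_ids}           # synthesized up to the root
    all_connections = set()
    for nid in neuron_ids:                                    # inherited back down to each <Neuron>
        nodes_to_connect = all_nodes - {(nid, spec)}          # the set subtraction of rules II.d and III.b
        all_connections |= connect_from_all_simple(nodes_to_connect, nid)
    return all_nodes, all_connections

nodes, connections = recurrent_layer(["1.1.2", "1.1.1.2", "1.1.1.1.1"])
print(len(connections))                                       # 6: a fully recurrent 3-neuron layer
assert all(src != dst for (src, dst) in connections)          # and no self-connections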

221 Grammar fragment that augments the grammar of Figure 76 Terminal Symbols: <neuron> 0 1 (, ) Helper Operations: connect_from_all_delay, direct_connect_delay Productions and Attribute Evaluation Rules: IV. <Neuron> (<neuron>, 0) i <Neuron>.node = [Neuron.id, <neuron>.spec] ii <Neuron>.incoming_connections = connect_from_all_delay(<neuron>.nodes_to_connect, Neuron.node, 0) V. <Neuron> (<neuron>, 1) i <Neuron>.node = [Neuron.id, <neuron>.spec] ii <Neuron>.incoming_connections = connect_from_all_delay(<neuron>.nodes_to_connect, Neuron.node, 1) where Each connection includes a delay value: [(source_id, destination_id), delay] direct_connect_delay(a,b,x): Given two sets of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) }and B = {(b 1,*), (b 2,*) (b m,*)} and the delay value x, returns the set of connections { [(a 1, b 1 ),x], [(a 2, b 2 ),x], [(a p, b p ), x] } where p is the smaller of m, n connect_from_all_delay(a,b,x): Given the set of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) }, the (id,*) pair, b = (b id,*), and the delay value x, returns the set of connections [(a i, b id ),x] for all a i A Figure 77: NGAGE Grammar Illustrating Specialized Local Handling of Global Structures Neither of the two alternative approaches above would easily enable specification of such varying delays since there would be no way to distinguish which nodes required which delay. A fourth approach that would work, following Practice 9, is the use of different attributes to store nodes requiring different delays, and the creation of connections at the root symbol. However, as the number of different specialized handling requirements increases (e.g., if many different delays values were possible), the readability of the grammar may be adversely affected using this fourth approach. 198
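The delay-carrying helper defined in the where-clause of Figure 77 translates almost directly into code; the sketch below is an assumption-laden rendering (the (id, spec) pairs and the set-of-tuples connection format are simplifications) rather than the NGAGE implementation.

def connect_from_all_delay(sources, target, delay):
    # connect_from_all_delay(A, b, x) of Figure 77: the connection [(a_i, b_id), x] for every a_i in A.
    target_id, _target_spec = target
    return {((source_id, target_id), delay) for (source_id, _spec) in sources}

layer = [("1.1.2", "W"), ("1.1.1.2", "W"), ("1.1.1.1.1", "W")]
others = [n for n in layer if n[0] != "1.1.2"]
print(connect_from_all_delay(others, ("1.1.2", "W"), delay=0))
# {(('1.1.1.2', '1.1.2'), 0), (('1.1.1.1.1', '1.1.2'), 0)}  (set order may vary)

Whether a given <Neuron> symbol uses delay 0 or delay 1 is decided purely by which of productions IV and V expands it, which is what makes the handling local to each leaf.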

[Attributed parse tree generated from the grammar of Figure 77 for a three-neuron recurrent layer; each <Neuron> symbol inherits a nodes_to_connect value that excludes its own node, and the incoming_connections it synthesizes carry the delay value (0 or 1) chosen by the production that expands it.]

Figure 78: Attributed Parse Tree Illustrating Local Handling of Global Structures

223 Figure 78 illustrates an attributed parse tree that may be generated from the grammar of Figure 77. Note that each <Neuron> symbol is passed a slightly different attribute value for nodes_to_connect, as indicated in bold-face, which enables the local creation of recurrent connections with no self-links. Figure 79 illustrates the associated recurrent layer topology, where the number next to each link represents the delay value of that link Figure 79: Recurrent Neural Topology With Connections of Differing Delay Value 1 Practice 19 Ordered sets and ordered set operations may be used to represent topologies with natural structural regularities and synchronized substructures. The use and manipulation of unordered sets within attributes, as discussed in Practice 5 and demonstrated in the sample grammars above, enables the representation of relatively complex neural topologies. Many neural topologies, however, require specific patterns of connectivity that are synchronized across different neural components. For example, the Kohonen network (Kohonen, 1984) is based upon a two-dimensional grid of neurons. The connections within such a grid are dependent upon the relative locations of the neurons; each neuron is connected to the neurons horizontally and vertically next to it. Figure 80 illustrates the potential problem that may occur when using unordered sets to construct a grid from two smaller grids. If the two grids of Figure 80(a) are stored 200

224 in unordered sets, then multiple outcomes are possible when the two sets are joined in a one-to-one fashion, including the desired grid topology Figure 80(b) and incorrect grid topology Figure 80(c). A simple approach to constructing such a grid within the attributes of an NGAGE grammar is to use ordered sets and ordered set manipulations. (a) (b) (c) Figure 80: Construction of Grid with (a) Smaller Component Grids, (b) Desired Connectivity, and (c) Potential Connectivity using Unordered Sets Figure 81 illustrates a grammar that represents a family of square neural grids of arbitrary size. Through the explicit use of ordered sets in the attributes <Grid>.last_row, <Grid>.last_column and <SerialLayer>.ordered_nodes, the grammar ensures the desired grid connectivity pattern in rule II.ii. In the grammar, the use of ordered sets is necessary to ensure that the serial duplex connections are created properly, that the serial layers are connected properly with the last row and last column of the smaller grid, that the last element of each serial layer is next to the corner node, and that the new last row and last column are properly defined. 201
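The ambiguity is easy to reproduce: pairing two unordered sets "one-to-one" leaves the bijection unspecified, so any of the outcomes in Figure 80 is possible, whereas ordered sequences pin down the intended correspondence. The snippet below is only a small illustration of that point (Python; the neuron identities are invented).

left_column = {"a1", "a2", "a3"}                  # unordered sets of neuron identities
right_column = {"b1", "b2", "b3"}
arbitrary_pairing = set(zip(left_column, right_column))      # depends on iteration order: any bijection is possible

left_ordered = ["a1", "a2", "a3"]                 # ordered sets fix the i-th-to-i-th correspondence
right_ordered = ["b1", "b2", "b3"]
intended_pairing = list(zip(left_ordered, right_ordered))    # [('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')]
print(arbitrary_pairing, intended_pairing)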

225 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Grid>: (all_nodes, all_connections, last_column, last_row, length, width) {id} <SerialLayer>: (all_nodes, ordered_nodes, all_connection) {id, size} <Neuron>: (node) {id} Terminal Symbols: <instar> duplicate_and_connect ( ) Helper Operations: duplex_connect, ordered_duplex_1to1_connect, last_element, append, order, duplex_serial_connect Productions and Attribute Evaluation Rules: I. <Network> <Grid> i <Network>.all_nodes = <Grid>.all_nodes ii <Network>.all_connections = <Grid>.all_connections a <Grid>.id = 1.1 II. <Grid> 1 <Grid> 2 <SerialLayer> 1 <SerialLayer> 2 <Neuron> i <Grid> 1.all_nodes = <Grid> 2.all_nodes 4 <SerialLayer> 1.all_nodes 4 <SerialLayer> 2.all_nodes 4 <Neuron>.node ii <Grid> 1.all_connections = <Grid> 2.all_connections 4 <SerialLayer> 1.all_connections 4 <SerialLayer> 2.all_connections 4 ordered_duplex_1to1_connect(<seriallayer> 1.ordered_nodes,<Grid> 2.last_row) 4 ordered_duplex_1to1_connect(<seriallayer> 2.ordered_nodes,<Grid> 2.last_column) 4 duplex_connect(<neuron>.id, get_id(last_element(<seriallayer> 1.ordered_nodes))) 4 duplex_connect(<neuron>.id, get_id( last_element(<seriallayer> 2.ordered_nodes))) iii <Grid> 1.last_column = append(<seriallayer> 1.ordered_nodes, <Neuron>.id) iv <Grid> 1.last_row = append(<seriallayer> 2.ordered_nodes, <Neuron>.id) v <Grid> 1.length = <Grid> 2.length + 1 vi <Grid> 1.width = <Grid> 2.width + 1 a <Grid> 2.id = concatenate(<grid> 1.id,.1 ) b <SerialLayer> 1.id = concatenate(<grid> 1.id,.2 ) c <SerialLayer> 1.size = <Grid> 2.length d <SerialLayer> 2.id = concatenate(<grid> 1.id,.3 ) e <SerialLayer> 1.size = <Grid> 2.width f <Neuron>.id = concatenate(<grid> 1.id,.4 ) III. <Grid> <Neuron> i <Grid>.all_nodes = <Neuron>.node ii <Grid>.all_connections = {} iii <Grid>.last_column = order( <Neuron>.id ) iv <Grid>.last_row = order(<neuron>.id ) v <Grid>.length = 1 vi <Grid>.width = 1 a <Neuron>.id = concatenate(<grid>.id,.1 ) IV. <SerialLayer> duplicate_and_connect( <Neuron> ) i <SerialLayer>.all_nodes = replicate(1, <SerialLayer>.size, <Neuron>.id, <Neuron>.node, 1) ii <SerialLayer>.ordered_nodes = order(<seriallayer>.all_nodes) iii <SerialLayer>.all_connections = duplex_serial_connect(<seriallayer>.ordered_nodes) a <Neuron>.id = concatenate(<seriallayer>.id,.1 ) V. <Neuron> <kohonen> i <Neuron>.node = [<Neuron>.id, <kohonen>.spec ] continued

226 ...continuation where <kohonen>.spec = complete specification of the behaviour of a kohonen neuron duplex_connect(a,b): Given two id s a and b, returns the connections (a,b) and (b,a). ordered_duplex_direct_connect(a,b): Given two ordered sets of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) }and B = {(b 1,*), (b 2,*) (b m,*)}, returns the set of connections {(a 1, b 1 ), (b 1, a 1 ), (a 2, b 2 ), (b 2, a 2 ) (a p, b p ), (b p, a p )} where p is the smaller of m, n last_element(a): Given an ordered set of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*)}, returns the last element of the set, (a n,*) append(a,b): Given an ordered set of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*)} and a single (id b,*) pair, returns the ordered set{(a 1,*), (a 2,*), (a n,*), (id b,*) } duplex_serial_connect(a): Given an ordered set of (id,*) pairs, A={(a 1,*), (a 2,*), (a 3,*) (a n- 1,*), (a n,*)}, returns the set of connections {(a 1, a 2 ), (a 2, a 1 ), (a 2, a 3 ), (a 3, a 2 ) (a n-1, a n ), (a n, a n-1 )} order(a): Given a set A, returns an ordered set containing all the same elements. Order is determined deterministically from A. Figure 81: NGAGE Grammar Illustrating Representation of Grid Neural Topology using Ordered Sets Figure 82 illustrates a sample application of Rule II.ii, in which a grid of length and width 3 in Figure 82(a) is extended on the sides using two serial layers and a new corner node. The corner node is connected to the last element of each layer, indicated by the bold neurons. The result is a grid of length and width 4 in Figure 82(b). <Grid> 2 <SerialLayer> 1 <Grid> 1.last_column <Neuron> <SerialLayer> 2 <Grid> 1.last_row (a) (b) Figure 82: Growing a Grid using Ordered Set Operations 203
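A compact Python sketch of one growth step may help to fix the idea; it follows the geometric description of Figure 82 rather than the exact attribute rules of Figure 81, and the helper bodies, identity prefixes and list-based ordered sets are simplifying assumptions.

def duplex_connect(a, b):
    # Both directed connections between two neuron identities.
    return {(a, b), (b, a)}

def duplex_serial_connect(ordered):
    # duplex_serial_connect of Figure 81: bidirectional links along an ordered chain of identities.
    return {c for x, y in zip(ordered, ordered[1:]) for c in duplex_connect(x, y)}

def ordered_duplex_1to1_connect(ordered_a, ordered_b):
    # ordered_duplex_1to1_connect of Figure 81: pair the i-th elements of the two ordered sets.
    return {c for x, y in zip(ordered_a, ordered_b) for c in duplex_connect(x, y)}

def grow(last_row, last_col, connections, prefix):
    """One growth step in the spirit of rule II.ii and Figure 82: a serial layer along the last row,
    a serial layer along the last column, and a corner neuron joined to the end of each layer."""
    n = len(last_row)
    row_layer = [f"{prefix}.2.{k}" for k in range(1, n + 1)]
    col_layer = [f"{prefix}.3.{k}" for k in range(1, n + 1)]
    corner = f"{prefix}.4"
    connections = (connections
                   | duplex_serial_connect(row_layer) | duplex_serial_connect(col_layer)
                   | ordered_duplex_1to1_connect(row_layer, last_row)
                   | ordered_duplex_1to1_connect(col_layer, last_col)
                   | duplex_connect(corner, row_layer[-1])
                   | duplex_connect(corner, col_layer[-1]))
    return row_layer + [corner], col_layer + [corner], connections

row, col, conns = ["n0"], ["n0"], set()            # a 1 x 1 "grid"
for step in (1, 2):
    row, col, conns = grow(row, col, conns, f"g{step}")
print(len(conns))                                  # 24 directed connections: the 12 duplex links of a 3 x 3 grid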

In the grammar of Figure 81, ordered sets are explicitly defined (e.g., ordered_nodes) and are distinct from the unordered sets (e.g., all_nodes, all_connections). However, using both ordered and unordered sets reduces the clarity of the grammar, since it becomes unclear which sets are ordered, which are not, and how they interact. In general, it suffices to assume that all sets are ordered, that certain set manipulations make explicit use of that ordering while others do not, and that all set manipulations are deterministic. This assumption will be followed in subsequent NGAGE grammars.

Discussion of Advanced Design Practices

Practice 12 to Practice 19 enable NGAGE to represent a wide variety of topologies with arbitrary degrees of regularity and complexity through the intelligent use of inherited and synthesized attributes. Topologies may range from layered feedforward networks, to recurrent layers of neurons, to grids of neurons, to networks of neurons with arbitrary connectivity patterns. The topologies may be tightly constrained in size, or unconstrained; they may consist of a single type of neuron or be highly heterogeneous in behaviour; connections may be uniform in their transmission properties, or may vary. The networks, however, may consist of only a single type of signal (i.e., activation). Representing networks that require external feedback signals, such as the back-propagation network, therefore requires additional multiple-signal capabilities within NGAGE.

228 Grammar that augments grammar of Figure 61 Non-Terminal Symbols: <Cell>: (all_nodes, input_nodes, output_nodes, all_connections) {id, incoming_sources} Helper Operations: first_half, second_half, eliminate_element Grammar Parameters: IN_SIZE, OUT_SIZE Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer>, IN_SIZE) <Cell> (<Output-Port-Layer>, OUT_SIZE) ii <Network>.all_connections = <Cell>.all_connections 4 full_connect_simple(<cell>.output_nodes, <Output-Port-Layer>.all_ports) b <Input-Port-Layer>.size = IN_SIZE d <Output-Port-Layer>.size = OUT_SIZE f <Cell>.incoming_sources = <Input-Port-Layer>.all_ports II. <Cell> 1 SEQ ( <Cell> 2, <Cell> 3 ) ii <Cell> 1.all_connections = <Cell> 2.all_connections 4 <Cell> 3.all_connections c <Cell> 2.incoming_sources = <Cell> 1.incoming_sources d <Cell> 3.incoming_sources = <Cell> 2.output_nodes III. <Cell> 1 PAR ( <Cell> 2, <Cell> 3 ) ii <Cell> 1.all_connections = <Cell> 2.all_connections 4 <Cell> 3.all_connections c <Cell> 2.incoming_sources = <Cell> 1.incoming_sources d <Cell> 3.incoming_sources = <Cell> 1.incoming_sources IV. <Cell> 1 LSPLIT ( <Cell> 2, <Cell> 3 ) i <Cell> 1.all_nodes = <Cell> 2.all_nodes 4 <Cell> 3.all_nodes ii <Cell> 1.all_connections = <Cell> 2.all_connections 4 <Cell> 3.all_connections iii <Cell> 1.input_nodes = <Cell> 2.input_nodes 4 <Cell> 3.input_nodes iv <Cell> 1.output_nodes = <Cell> 2.output_nodes 4 <Cell> 3.output_nodes a <Cell> 2.id = concatenate(<cell> 1.id,.1 ) b <Cell> 3.id = concatenate(<cell> 1.id,.2 ) c <Cell> 2.incoming_sources = first_half(<cell> 1.incoming_sources) d <Cell> 3.incoming_sources = second_half(<cell> 1.incoming_sources) V. <Cell> 1 CUT <#:uniforminteger(1,100)> ( <Cell> 2 ) i <Cell> 1.all_nodes = <Cell> 2.all_nodes ii <Cell> 1.all_connections = <Cell> 2.all_connections iii <Cell> 1.input_nodes = <Cell> 2.input_nodes iv <Cell> 1.output_nodes = <Cell> 2.output_nodes a <Cell> 2.id = concatenate(<cell> 1.id,.1 ) b <Cell> 2.incoming_sources = eliminate_element(<cell> 1.incoming_sources, <#:uniforminteger(1,100)>.value) VI. <Cell> END ( <neurona> ) ii <Cell>.all_connections = full_connect_simple(<cell>.incoming_sources, <Cell>.all_nodes) VII. <Cell> END ( <neuronb> ) ii <Cell>.all_connections = full_connect_simple(<cell>.incoming_sources, <Cell>.all_nodes) where first_half(a): Given set A of size n, returns the subset of A containing the first n/2 elements second_half(a): Given set A of size n, returns the subset of A containing the last n/2 elements eliminate_element(a,m): Given set A, return the same set with the m th element removed. Figure 83: NGAGE Grammar Illustrating Manipulation of Connections Equivalent to Cellular Encoding 205
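The where-clause of Figure 83 defines first_half, second_half and eliminate_element only informally. A minimal Python sketch of how these helpers could manipulate an incoming_sources list is given below; modelling incoming_sources as a list of node ids, the rounding behaviour for odd-sized sets, and the wrapping of the CUT index onto the set size are all assumptions made for this example rather than details taken from the thesis.

# Hypothetical sketch of the set-manipulation helpers of Figure 83.

def first_half(elements):
    # "First n/2 elements"; the rounding for odd-sized sets is assumed to be floor.
    return elements[: len(elements) // 2]

def second_half(elements):
    return elements[len(elements) // 2 :]

def eliminate_element(elements, m):
    # Remove the m-th element (1-based). Production V draws m from uniforminteger(1,100),
    # so wrapping it onto the actual set size is assumed here to keep the index valid.
    index = (m - 1) % len(elements)
    return elements[:index] + elements[index + 1 :]

# Example: an LSPLIT-like production passes each child one half of the incoming
# sources, and a CUT-like production removes one source before the terminating END
# production connects the survivors to the cell's own nodes.
incoming = ["in.1", "in.2", "in.3", "in.4"]
print(first_half(incoming))            # ['in.1', 'in.2']
print(second_half(incoming))           # ['in.3', 'in.4']
print(eliminate_element(incoming, 3))  # ['in.1', 'in.2', 'in.4']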

Practice 12 to Practice 19 also enable an NGAGE grammar to represent most cellular encoding program symbols. Figure 83 illustrates an NGAGE grammar that extends the NGAGE representation of cellular encoding from Figure 61. To enable direct manipulation of connections that are incoming to a given cellular structure, a new inherited attribute, <Cell>.incoming_sources, is defined. As before, the cells are expanded and their input and output interface nodes are computed within the attributes <Cell>.input_nodes and <Cell>.output_nodes. Unlike before, however, connections are not computed directly from these attributes. Instead, once all cells have been expanded and all neurons have been created, the <Cell>.incoming_sources attribute is used to compute information about which external nodes can potentially connect to the input nodes of each cellular structure. At each production, this information is updated to reflect the topological manipulation of that production. For example, in productions IV and V, operations equivalent to the cellular program symbols LSPLIT and CUT are implemented; instead of propagating down all possible incoming sources, only a subset is passed down. At the terminating productions, VI and VII, connections are finally formed between that cell and the final set of incoming sources to that cell. These connections are propagated upwards and collected in the root symbol.

4.3 Representation of Multiple Signal Types

NGAGE enables the specification of neural networks in which nodes may process multiple signal types and in which these multiple types of signals are transmitted along distinct pathways. This is important because many neural network architectures, especially supervised networks such as back-propagation, require nodes to

230 receive feedback signals from the external environment and/or from other nodes. These feedback signals are treated differently than normal activation signals and are transmitted along their own set of connections, distinct from those connections carrying activation signals (Haykin, 1994; Hecht-Nielsen, 1990). In all the example grammars above, only one signal type was used and its type and characteristics were assumed. Based upon the underlying NGAGE neuron and connection models described earlier, effective specification of multi-signal topologies may be achieved using the following additional practices. The specification of each connection is expanded further (from that of Figure 77) to include a signal type (i.e., [(source_id, destination_id), delay, signal_type]) Design Practices for Signal Types Practice 20 A signal type is represented as a terminal of the grammar. Explicit specification of multiple signal types is achieved in NGAGE through the use of distinct terminal symbols for each signal type. The terminal symbol provides detailed specifications concerning the signal type that are required by the operations that create connections within the attributes. Different families of networks may be obtained through the use of different sets of terminal symbols. Practice 21 Explicitly distinguish input sources and output targets that correspond to different signal types. This extends Practice 10. In a typical network, different inputs to the network may be treated differently. For instance, the desired output pattern is 207

231 often used in a supervised network to provide feedback. That pattern is actually an input to the network that is used to generate feedback signals. These feedback inputs must be distinguished from normal inputs. Grammar fragment that augments grammar of Figure 55 Terminal Symbols: <activation> <feedback> Helper Operations: full_connect, ordered_direct_connect Grammar Parameters: IN_SIZE, OUT_SIZE Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer> 1, IN_SIZE, <activation>) (<Input-Port-Layer> 2, OUT_SIZE, <feedback>) <Multi-Layer> (<Output-Port-Layer>, OUT_SIZE, <activation>) i. <Network>.all_nodes = <Multi-Layer>.all_nodes 4 <Input-Port-Layer> 1.all_ports 4 <Input-Port-Layer> 2.all_ports 4 <Output-Port-Layer>.all_ports ii <Network>.all_connections = <Multi-Layer>.all_connections 4 full_connect(<input-port-layer> 1.all_ports, <Multi-Layer>.input_nodes, 0, <activation>.name) 4 ordered_direct_connect(<multi-layer>.output_nodes, <Output-Port-Layer>.all_ports, 0, <activation>.name) 4 ordered_direct_connect(<input-port-layer> 2.all_ports, <Multi-Layer>.output_nodes, 0, <feedback>.name) a <Input-Port-Layer> 1.id = 1.1 b <Input-Port-Layer> 1.size = IN_SIZE c <Input-Port-Layer> 2.id = 1.2 d <Input-Port-Layer> 2.size = OUT_SIZE e <Multi-Layer>.id = 1.3 f <Output-Port-Layer>.id = 1.4 g <Output-Port-Layer>.size = OUT_SIZE where <activation>.name = Activation signal <feedback>.name = Feedback signal ordered_direct_connect(a,b,x,t): Given two sets of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) }and B = {(b 1,*), (b 2,*) (b m,*)}, delay value x, and signal-type t, returns the set of connections { [(a 1, b 1 ),x,t], [(a 2, b 2 ),x,t], [(a p, b p ),x,t] } where p is the smaller of m, n full_connect(a,b,x,t): Given two sets of (id,*) pairs, A={(a 1,*), (a 2,*), (a n,*) }and B = {(b 1,*), (b 2,*) (b m,*)}, delay value x, and signal-type t, returns set of connections [(a i, b j ),x,t] for all a i A, b j B Figure 84: NGAGE Grammar Illustrating Multiple Pathways of Different Signal Types For instance, Figure 84 illustrates a grammar fragment (augmenting the grammar of Figure 55) that defines a family of networks in which two input patterns are presented to the network, one corresponding to a normal input pattern 208

and one corresponding to a supervisory input pattern. The latter pattern is the same size as the output pattern of the network. In the grammar, the two input patterns are presented to the network in two distinct layers of input ports. Two different signal types, <activation> and <feedback>, are used in the grammar. The attribute evaluation rule I.ii connects the normal input ports to the first layer of the network using <activation> signals, and connects, in a one-to-one manner, the supervisory input ports to the last layer of the network using <feedback> signals. This last layer of the network also provides <activation> signals to the output ports of the network. Note that the assumption regarding ordered sets plays a key role here: it guarantees that the feedback corresponding to a specific output port will be delivered to the same output node that is connected to that port.

Practice 22 Signal types may be propagated through a parse tree to ensure creation of consistent pathways.

Connections within an NGAGE grammar may be created in a number of different grammar productions. Each production may potentially use different signal types when creating connections. This may, in fact, be desirable in some architectures. However, almost all architectures use the same types of signals throughout the network. To ensure consistency, and to reduce redundant use of symbols, the same set of signal types may be propagated through a parse tree within the attributes. For example, in the grammar of Figure 85, the attributes activation_type and feedback_type of the symbol <Multi-Layer> serve to propagate a consistent

set of signal types. Rule II.ii creates a fully-connected forward path of activation_type signals and a fully-connected backward path of feedback_type signals between layers. The resulting network has two distinct pathways, each carrying a different type of signal.

Grammar fragment that augments grammar of Figure 84
Non-Terminal Symbols:
<Multi-Layer>: (all_nodes, input_nodes, output_nodes, all_connections, activation_type, feedback_type) {id}
Terminal Symbols: bi-directional (, )
Productions and Attribute Evaluation Rules:
I. <Network> → (<Input-Port-Layer> 1, IN_SIZE, <activation>) (<Input-Port-Layer> 2, OUT_SIZE, <feedback>) <Multi-Layer> (<Output-Port-Layer>, OUT_SIZE, <activation>)
  h <Multi-Layer>.activation_type = <activation>.name
  i <Multi-Layer>.feedback_type = <feedback>.name
II. <Multi-Layer> 1 → bi-directional (<Multi-Layer> 2, <Layer>)
  ii <Multi-Layer> 1.all_connections = <Multi-Layer> 2.all_connections ∪ full_connect(<Multi-Layer> 2.output_nodes, <Layer>.nodes, 0, <Multi-Layer> 1.activation_type) ∪ full_connect(<Layer>.nodes, <Multi-Layer> 2.output_nodes, 0, <Multi-Layer> 1.feedback_type)
  c <Multi-Layer> 2.activation_type = <Multi-Layer> 1.activation_type
  d <Multi-Layer> 2.feedback_type = <Multi-Layer> 1.feedback_type

Figure 85: NGAGE Grammar Illustrating the Propagation of Consistent Signal Types

Figure 86 illustrates a neural topology that is generated using this grammar. The network has two sets of input ports and one set of output ports. Connections carrying <activation> signals form a pathway from the first set of input ports through to the output ports. Connections carrying <feedback> signals form a pathway from the second set of input ports through to the first layer of nodes in the network. The result is a topology that is typical of supervised multi-layer networks.
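As a rough Python sketch (not the thesis implementation) of how such typed connections might be produced, the helpers below follow the expanded connection specification [(source_id, destination_id), delay, signal_type]; the tuple encoding and the example port and node identifiers are assumptions made for illustration.

# Hypothetical sketch of the signal-typed connection helpers of Figures 84 and 85.
# Nodes are (id, spec) pairs; a connection is ((source_id, destination_id), delay, signal_type).

def full_connect(a_nodes, b_nodes, delay, signal_type):
    # Connect every node of A to every node of B with the given delay and signal type.
    return [((a_id, b_id), delay, signal_type)
            for a_id, _ in a_nodes for b_id, _ in b_nodes]

def ordered_direct_connect(a_nodes, b_nodes, delay, signal_type):
    # Connect the i-th node of A to the i-th node of B, up to the shorter list.
    return [((a_id, b_id), delay, signal_type)
            for (a_id, _), (b_id, _) in zip(a_nodes, b_nodes)]

# Example in the spirit of rule I.ii of Figure 84: activation flows from the normal
# input ports into the first layer, while the supervisory ports map one-to-one onto
# the last layer as feedback.
input_ports = [("1.1.1", "in-port"), ("1.1.2", "in-port"), ("1.1.3", "in-port")]
feedback_ports = [("1.2.1", "in-port"), ("1.2.2", "in-port")]
first_layer = [("1.3.1", "neuron"), ("1.3.2", "neuron")]
last_layer = [("1.3.8", "neuron"), ("1.3.9", "neuron")]

connections = (full_connect(input_ports, first_layer, 0, "Activation signal")
               + ordered_direct_connect(feedback_ports, last_layer, 0, "Feedback signal"))
print(len(connections))  # 3 * 2 activation connections + 2 feedback connections = 8

Because ordered_direct_connect relies purely on list order, it is the ordered-set assumption introduced earlier that guarantees each feedback port reaches the node feeding the corresponding output port.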

[Figure 86 shows a topology with IN_SIZE = 3 and OUT_SIZE = 4; its legend labels <Input-Port-Layer> 1.all_ports, <Input-Port-Layer> 2.all_ports, <Output-Port-Layer>.all_ports, connections carrying <activation> signals, and connections carrying <feedback> signals.]
Figure 86: Neural Topology with Activation and Feedback Pathways

Practice 23 To ensure the specification of valid connections, explicitly associate each neuron with the signal types that it is capable of receiving and those that it is capable of transmitting.

The grammar of Figure 85, through its rules I.ii and II.ii, assumes that the nodes of <Multi-Layer>.input_nodes and <Multi-Layer>.output_nodes are capable of sending and receiving both <activation> and <feedback> signals. However, following the NGAGE neuron model, in the general case, this assumption may not hold. For example, consider the grammar fragment of Figure 87. The resultant networks may contain perceptron nodes, which process and transmit both activation and feedback signals, and/or unsupervised instar nodes (i.e., an instar node in which the desired output value is the actual output value), which process and transmit only activation signals. If this fragment is incorporated into the grammar of Figure 85, the resultant connections created in rules I.ii and II.ii may

not be valid: an unsupervised instar node cannot be part of a feedback pathway. To ensure the specification of valid connections, each node may be explicitly included in attributes that correspond to specific pathways.

Grammar fragment that augments grammar of Figure 85
Non-Terminal Symbols:
<Layer>: (nodes) {id}
<Neuron>: (node) {id}
Terminal Symbols: <perceptron> <unsupervised-instar>
Productions and Attribute Evaluation Rules:
IV. <Layer> 1 → <Neuron> <Layer> 2
  i <Layer> 1.nodes = <Layer> 2.nodes ∪ <Neuron>.node
  a <Neuron>.id = concatenate(<Layer> 1.id, ".1")
  b <Layer> 2.id = concatenate(<Layer> 1.id, ".2")
V. <Layer> → <Neuron>
  i <Layer>.nodes = <Neuron>.node
  a <Neuron>.id = concatenate(<Layer>.id, ".1")
VI. <Neuron> → <perceptron>
  i <Neuron>.node = [<Neuron>.id, <perceptron>.spec]
VII. <Neuron> → <unsupervised-instar>
  i <Neuron>.node = [<Neuron>.id, <unsupervised-instar>.spec]
where
<unsupervised-instar>.spec = detailed specification of the behaviour of an unsupervised instar neuron

Figure 87: NGAGE Grammar Illustrating Multiple Node Types with Different Transmission Properties

Figure 88 presents an NGAGE grammar for a family of supervised multi-layered neural network architectures that contain both unsupervised instar and supervised perceptron nodes. The signal transmission properties of each node are captured within the attributes of the <Neuron> symbol (e.g., attribute feedback_out), and these properties are propagated up a given parse tree. Connections of a specific signal type are formed only between source nodes that are capable of sending that type and target nodes that are capable of receiving that type.
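A minimal sketch of this capability filtering, assuming a simple per-node-type lookup table rather than the per-neuron attributes that Figure 88 actually propagates, is given below; the table contents, node ids and function name are hypothetical.

# Hypothetical sketch of Practice 23: create connections of a given signal type only
# between nodes able to transmit it and nodes able to receive it. The capability table
# stands in for the activation_in/out and feedback_in/out attributes of Figure 88 and
# is an assumption made for this example.

NODE_CAPABILITIES = {
    # node type: (signal types it can receive, signal types it can transmit)
    "perceptron": ({"activation", "feedback"}, {"activation", "feedback"}),
    "unsupervised-instar": ({"activation"}, {"activation"}),
}

def typed_full_connect(sources, targets, delay, signal_type):
    # Fully connect sources to targets, skipping nodes that cannot carry signal_type.
    valid_sources = [(i, t) for i, t in sources if signal_type in NODE_CAPABILITIES[t][1]]
    valid_targets = [(i, t) for i, t in targets if signal_type in NODE_CAPABILITIES[t][0]]
    return [((s_id, t_id), delay, signal_type)
            for s_id, _ in valid_sources for t_id, _ in valid_targets]

# Example: only the perceptron nodes take part in the feedback pathway, mirroring the
# shaded/clear distinction of Figure 89.
upper = [("1.3.1", "perceptron"), ("1.3.2", "unsupervised-instar")]
lower = [("1.2.1", "perceptron"), ("1.2.2", "unsupervised-instar")]
print(typed_full_connect(upper, lower, 0, "feedback"))
# [(('1.3.1', '1.2.1'), 0, 'feedback')]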

236 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Multi-Layer>: (all_nodes, all_connections, activation_in_nodes, activation_out_nodes, feedback_in_nodes, feedback_out_nodes, activation_type, feedback_type) {id} <Layer>: (nodes, activation_in_nodes, activation_out_nodes, feedback_in_nodes, feedback_out_nodes) {id} <Neuron>: (node, activation_in, activation_out, feedback_in, feedback_out) {id} Terminal Symbols: <perceptron> <unsupervised-instar> <in_port> <out_port> <activation> <feedback> bi-directional (, ) Helper Operations: full_connect, ordered_direct_connect Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer> 1, IN_SIZE, <activation>) (<Input-Port-Layer> 2, OUT_SIZE, <feedback>) <Multi-Layer> (<Output-Port-Layer>, OUT_SIZE, <activation>) i <Network>.all_nodes = <Multi-Layer>.all_nodes 4 <Input-Port-Layer> 1.all_ports 4 <Input-Port-Layer> 2.all_ports 4 <Output-Port-Layer>.all_ports ii <Network>.all_connections = <Multi-Layer>.all_connections 4 full_connect(<input-port-layer> 1.all_ports, <Multi-Layer>.activation_in_nodes, 0, <activation>.name) 4 ordered_direct_connect(<multi-layer>.activation_out_nodes, <Output-Port- Layer>.all_ports, 0, <activation>.name) 4 ordered_direct_connect(<input-port-layer> 2.all_ports, <Multi- Layer>.feedback_in_nodes, 0, <feedback>.name) a <Input-Port-Layer> 1.id = 1.1 b <Input-Port-Layer> 1.size = IN_SIZE c <Input-Port-Layer> 2.id = 1.2 d <Input-Port-Layer> 2.size = OUT_SIZE e <Multi-Layer>.id = 1.3 f <Multi-Layer>.activation_type = <activation>.name g <Multi-Layer>.feedback_type = <feedback>.name h <Output-Port-Layer>.id = 1.4 i <Output-Port-Layer>.size = OUT_SIZE II. <Multi-Layer> 1 bi-directional (<Multi-Layer> 2, <Layer>) i <Multi-Layer> 1.all_nodes = <Multi-Layer> 2.all_nodes 4 <Layer>.nodes ii <Multi-Layer> 1.all_connections = <Multi-Layer> 2.all_connections 4 full_connect(<multi-layer> 2.activation_out_nodes, <Layer>.activation_in_nodes, 0, <Multi-Layer> 1.activation_type) 4 full_connect(<layer>.feedback_out_nodes, <Multi-Layer> 2.feedback_in_nodes, 0, <Multi-Layer> 1.feedback_type) iii <Multi-Layer> 1.activation_in_nodes = <Multi-Layer> 2.activation_in_nodes iv <Multi-Layer> 1.activation_out_nodes = <Layer>. activation_out_nodes v <Multi-Layer> 1.feedback_in_nodes = <Layer>. feedback_in_nodes vi <Multi-Layer> 1.feedback_out_nodes = <Multi-Layer> 2.feedback_out_nodes a <Multi-Layer> 2.id = concatenate(<multi-layer> 1.id,.1 ) b <Layer>.id = concatenate(<multi-layer> 1.id,.2 ) c <Multi-Layer> 2.activation_type = <Multi-Layer> 1.activation_type d <Multi-Layer> 2.feedback_type = <Multi-Layer> 1.feedback_type continued

237 ...continuation III. <Multi-Layer> <Layer> i <Multi-Layer>.all_nodes = <Layer>.nodes ii <Multi-Layer>.all_connections = {} iii <Multi-Layer>.activation_in_nodes = <Layer>. activation_in_nodes iv <Multi-Layer>.activation_out_nodes = <Layer>. activation_out_nodes v <Multi-Layer>.feedback_in_nodes = <Layer>. feedback_in_nodes vi <Multi-Layer>.feedback_out_nodes = <Layer>. feedback_out_nodes a <Layer>.id = concatenate(<multi-layer>.id,.1 ) IV. <Layer> 1 <Neuron> <Layer> 2 i <Layer> 1.nodes = <Layer> 2.nodes 4 <Neuron>.node ii <Layer> 1.activation_in_nodes = <Layer> 2.activation_in_nodes 4 <Neuron>.activation_in iii <Layer> 1.activation_out_nodes = <Layer> 2.activation_out_nodes 4 <Neuron>.activation_out iv <Layer> 1.feedback_in_nodes = <Layer> 2.feedback_in_nodes 4 <Neuron>.feedback_in v <Layer> 1.feedback_out_nodes = <Layer> 2.feedback_out_nodes 4 <Neuron>.feedback_out a <Neuron>.id = concatenate(<layer> 1.id,.1 ) b <Layer> 2.id = concatenate(<layer> 1.id,.2 ) V. <Layer> <Neuron> i <Layer>.nodes = <Neuron>.node ii <Layer>.activation_in_nodes = <Neuron>.activation_in iii <Layer>.activation_out_nodes = <Neuron>.activation_out iv <Layer>.feedback_in_nodes = <Neuron>.feedback_in v <Layer>.feedback_out_nodes = <Neuron>.feedback_out a <Neuron>.id = concatenate(<layer>.id,.1 ) VI. <Neuron> <perceptron> i <Neuron>.node = [<Neuron>.id, <perceptron>.spec] ii <Neuron>.activation_in = {<Neuron>.node} iii <Neuron>.activation_out = {<Neuron>.node} iv <Neuron>.feedback_in = {<Neuron>.node } v <Neuron>.feedback_out = {<Neuron>.node } VII. <Neuron> <unsupervised-instar> i <Neuron>.node = [<Neuron>.id, <unsupervised-instar>.spec] ii <Neuron>.activation_in = {<Neuron>.node} iii <Neuron>.activation_out = {<Neuron>.node} iv <Neuron>.feedback_in = {} v <Neuron>.feedback_out = {} Figure 88: NGAGE Grammar Illustrating Explicit Specification of Valid Signal Pathways Figure 89 illustrates a neural topology that is produced from the grammar in which the shaded nodes, representing unsupervised instar neurons, do not send or receive any feedback connections, while the clear nodes, representing perceptron neurons, do both. Note that the networks generated by this grammar 214

do not ensure the connectivity of the feedback path (e.g., if a layer is composed entirely of instar nodes, the feedback path would be broken).

[Figure 89 shows a topology with IN_SIZE = 3 and OUT_SIZE = 3; its legend labels <Input-Port-Layer> 1.all_ports, <Input-Port-Layer> 2.all_ports, <Output-Port-Layer>.all_ports, connections carrying <activation> and <feedback> signals, and <unsupervised-instar> versus <perceptron> neurons.]
Figure 89: Neural Topology Illustrating Feedback Pathways Specific to Neuron Type

Discussion of Design Practices for Signal Types

Practice 20 to Practice 23 enable NGAGE to represent most supervised topologies through the explicit creation of feedback pathways. Current genetic encoding techniques for neural networks, including grammatical encoding and cellular encoding, do not demonstrate this capability. NGAGE therefore offers better promise as a generic tool for the representation of models of neural networks. It is important, of course, to note that the creation of feedback pathways is not by itself sufficient to model some neural network architectures. For example, some architectures, such as the back-propagation network, also require additional structures to properly implement the model in a purely nodal fashion.

239 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Top-Multi-Layer>: (all_nodes, all_connections, sun_nodes, delta_nodes, input_nodes) {id, size} <Multi-Layer>: (all_nodes, all_connections, input_nodes, output_nodes) {id} <Sun-Layer>: (all_nodes) {id} <Planet-Layer>: (all_nodes, all_connections) {id, lower_nodes, upper_nodes} <Input-Port-Layer>: (all_ports) {id, size} <Output-Port-Layer>: (all_ports) {id, size} Terminal Symbols: <activation> <desired> <delta> <sun-neuron> <planet-neuron> <deltaneuron> <in-port> <out-port> fill grouped-fill (, ) Helper Operations: concatenate, replicate, group_elements, size, full_over_direct_connect, direct_over_full_connect Grammar Parameters: IN_SIZE, OUT_SIZE Productions and Attribute Evaluation Rules: I. <Network> (<Input-Port-Layer> 1, IN_SIZE, <activation>) (<Input-Port-Layer> 2, OUT_SIZE, <desired>) <Planet-Layer> <Top-Multi-Layer> (<Output- Port-Layer>, OUT_SIZE, <activation>) i <Network>.all_nodes = <Multi-Layer>.all_nodes 4 <Planet-Layer>.all_nodes 4 <Input- Port-Layer> 1.all_ports 4 <Input-Port-Layer> 2.all_ports 4 ii <Output-Port-Layer>.all_ports <Network>.all_connections = <Multi-Layer>.all_connections 4 <Planet- Layer>.all_connections 4 ordered_direct_connect(<multi-layer>.sun_nodes, <Output-Port-Layer>.all_ports, 0, <activation>.name) 4 ordered_direct_connect (<Input-Port-Layer> 2.all_ports, <Multi-Layer>.delta_nodes, 0, <desired>.name) a <Input-Port-Layer> 1.id = 1.1 b <Input-Port-Layer> 1.size = IN_SIZE c <Input-Port-Layer> 2.id = 1.2 d <Input-Port-Layer> 2.size = OUT_SIZE e <Output-Port-Layer>.id = 1.3 f <Output-Port-Layer>.size = OUT_SIZE g <Top-Multi-Layer>.id = 1.4 h <Top-Multi-Layer>.size = OUT_SIZE i <Planet-Layer>.id = 1.5 j <Planet-Layer>.lower_nodes = <Input-Port-Layer> 1.all_ports k <Planet-Layer>.upper_nodes = <Multi-Layer>.input_nodes II. <Top-Multi-Layer> <Multi-Layer> <Planet-Layer> fill( <sun-neuron>) fill( <deltaneuron>) <activation> <delta> i <Top-Multi-Layer>.all_nodes = <Multi-Layer>.all_nodes 4 <Planet-Layer>.all_nodes 4 ii iii <Top-Multi-Layer>.sun_nodes 4 <Top-Multi--Layer>.delta_nodes <Top-Multi-Layer>.sun_nodes = replicate(1, <Top-Multi-Layer>.size, <Top-Multi- Layer>.id, [ 1, <sun-neuron>.spec], 1) <Top-Multi-Layer>.delta_nodes = replicate(1, <Top-Multi-Layer>.size, <Top-Multi- Layer>.id, [ 2, <delta-neuron>.spec], 1) iv <Top-Multi-Layer>.all_connections = <Multi-Layer>.all_connections 4 <Planet-Layer>.all_connections 4 ordered_direct_connect(<top-multi- Layer>.sun_nodes, <Top-Multi-Layer>.delta_nodes, 0, <activation>.name) 4 ordered_direct_connect(<top-multi-layer>.delta_nodes, <Top-Multi- Layer>.sun_nodes, 0, <delta>.name) v <Top-Multi-Layer>.input_nodes = <Multi-Layer>.input_nodes a <Multi-Layer>.id = concatenate(<top-multi-layer>.id,.3 ) b <Planet-Layer>.id = concatenate(<top-multi-layer>.id,.4 ) c <Planet-Layer>.lower_nodes = <Multi-Layer>.output_nodes d <Planet-Layer>.upper_nodes = <Top-Multi-Layer>.sun_nodes continued

240 ...continuation III. <Top-Multi-Layer> fill( <sun-neuron> ) fill( <delta-neuron> ) <activation> <delta> i <Top-Multi-Layer>.all_nodes = <Top-Multi-Layer>.sun_nodes 4 <Top-Multi-- Layer>.delta_nodes ii <Top-Multi-Layer>.sun_nodes = replicate(1, <Top-Multi-Layer>.size, <Top-Multi- Layer>.id, [ 1, <sun-neuron>.spec], 1) iii <Top-Multi-Layer>.delta_nodes = replicate(1, <Top-Multi-Layer>.size, <Top-Multi- Layer>.id, [ 2, <delta-neuron>.spec], 1) iv <Top-Multi-Layer>.all_connections = ordered_direct_connect(<top-multi- Layer>.sun_nodes, <Top-Multi-Layer>.delta_nodes, 0, <activation>.name) 4 ordered_direct_connect(<top-multi-layer>.delta_nodes, <Top-Multi- Layer>.sun_nodes, 0, <delta>.name) v <Top-Multi-Layer>.input_nodes = <Top-Multi-Layer>.sun_nodes IV. <Multi-Layer> 1 <Multi-Layer> 2 <Planet-Layer> <Sun-Layer> i <Multi-Layer> 1.all_nodes = <Multi-Layer> 2.all_nodes 4 <Planet-Layer>.all_nodes 4 <Sun-Layer>.all_nodes ii <Multi-Layer> 1.all_connections = <Multi-Layer> 2.all_connections 4 <Planet-Layer>.all_connections iii <Multi-Layer> 1.input_nodes = <Multi-Layer> 2.input_nodes iv <Multi-Layer> 1.output_nodes = <Sun-Layer>.all_nodes a <Multi-Layer> 2.id = concatenate(<multi-layer> 1.id,.1 ) b <Planet-Layer>.id = concatenate(<multi-layer> 1.id,.2 ) c <Planet-Layer>.lower_nodes = <Multi-Layer> 2.output_nodes d <Planet-Layer>.upper_nodes = <Sun-Layer>.all_nodes e <Sun-Layer>.id = concatenate(<multi-layer> 1.id,.3 ) V. <Multi-Layer> <Sun-Layer> i <Multi-Layer>.all_nodes =<Sun-Layer>.all_nodes ii <Multi-Layer>.all_connections = {} iii <Multi-Layer>.input_nodes = <Sun-Layer>.all_nodes iv <Multi-Layer>.output_nodes = <Sun-Layer>.all_nodes a <Sun-Layer>.id = concatenate(<multi-layer> 1.id,.1 ) VI. <Sun-Layer> 1 <sun-neuron> <Sun-Layer> 2 i <Sun-Layer> 1.all_nodes = <Sun-Layer> 2.all_nodes 4 [concatenate(<sun-layer> 1.id,.1 ), <sun-neuron>.spec] a <Sun-Layer> 2.id = concatenate(<sun-layer> 1.id,.2 ) VII. <Sun-Layer> <sun-neuron> i <Sun-Layer>.all_nodes = [concatenate(<sun-layer>.id,.1 ), <sun-neuron>.spec] VIII. <Planet-Layer> grouped_fill ( <planet-neuron> ) <activation> <delta> i <Planet-Layer>.all_nodes = replicate(1, size(<planet-layer>.upper_nodes) * size(<planet- Layer>.lower_nodes), <Planet-Layer>.id, [ 1,<planet-neuron>.spec], 1) ii <Planet-Layer>.grouped_nodes = group_elements(<planet-layer>.all_nodes, size(<planet- Layer>.upper_nodes) ) iii <Planet-Layer>.all_connections = direct_over_full_connect(<planet-layer>.grouped_nodes, <Planet-Layer>.upper_nodes, 0, <activation>.name) 4 direct_over_full_connect( <Planet-Layer>.upper_nodes, <Planet-Layer>.grouped_nodes, 0, <delta>.name) 4 full_over_direct_connect(<planet-layer>.lower_nodes, <Planet- Layer>.grouped_nodes, 0, <activation>.name) 4 full_over_direct_connect <Planet-Layer>.grouped_nodes, <Planet-Layer>.lower_nodes, 0, <delta>.name) continued

241 ...continuation IX. <Input-Port-Layer> fill ( <in-port> ) i <Input-Port-Layer>.all_ports = replicate(1, <Input-Port-Layer>.size, <Input-Port-Layer>.id, [ 1, <in-port>.spec], 1) X. <Output-Port-Layer> fill ( <out-port> ) i <Output-Port-Layer>.all_ports = replicate(1, <Output-Port-Layer>.size, <Output-Port- Layer>.id, [ 1, <out-port>.spec], 1) where <sun-neuron>.spec = detailed specification of Hecht-Nielsen s sun neuron <planet-neuron>.spec = detailed specification of Hecht-Nielsen s planet neuron <delta-neuron>.spec = detailed specification of neuron that accepts a single activation value and a single desired value and returns the difference as a delta error value <activation>.name = Activation signal that carries activation values <delta>.name = Delta signal that carries delta error values <desired>.name = Desired signal that carries the desired output values size(a): Given the set A, returns the number of elements in that set. group_elements(a,n): Given the set A and number n, returns the set of n subsets such that each subset is of equal size, and every element of A is a member of one and only one subset. direct_over_full_connect (A, B, x,t): Given two sets A and B such that one is a set of individual elements { e 1, e 2,... e n }and one is a set of equal sized groups of elements { (e 1, e 2, e 3 ) (e 4, e 5, e 6 ), (e m-2, e m-1, e m )}. A and B are of equal size n at top level (i.e., n elements or n groups). Performs n calls to full_connect(a i, B i, x, t). Returns all results. [1 i n] full_over_direct_connect (A, B, x,t): Given two sets A and B (as above). The size n of the individual set equals the size of each group in the other set (with m groups). If A is the group set, performs m calls to direct_connect(a j, B, x, t), otherwise performs m calls to direct_connect(a, B j, x, t). Returns all results. [1 j m] Figure 90: NGAGE Grammar Illustrating Localized Sun-and-Planet Architecture for Back-Propagation Networks Figure 90 illustrates an NGAGE grammar that creates localized back-propagation networks using the sun-and-planet approach of Hecht-Nielsen (1990). The network is comprised of three types of neurons whose behaviours are captured within terminal symbols (i.e., <sun-neuron>, <planet-neuron> and <delta-neuron>). In order to connect two layers of <sun-neuron>s or a layer of <in-port>s and a layer of <sun-neuron>s, the grammar must create an appropriate number of <planet-neuron>s that serve as an interface between the two layers. Production VIII performs this task given two layers of neurons in separate inherited attributes (<Planet-Layer>.lower_nodes and <Planet- Layer>.upper_nodes). For each neuron in the upper layer (say, of size n), a distinct planet 218

node is created to interface with each neuron in the lower layer (say, of size m, for a total of m × n planet neurons). In attribute evaluation rules VIII.i and VIII.ii, these planet neurons are created and grouped. In attribute evaluation rule VIII.iii, the upper and lower layers are connected to the planet neurons using appropriate helper operations. The network contains three types of signals: <activation>, <desired>, and <delta>. The top-most layer of the network is created in a special manner in productions II and III. To ensure that the network's output is of the right dimension, the top-most layer of sun neurons is created according to an inherited <Top-Multi-Layer>.size attribute. To ensure the proper computation of network error for the top-most layer of sun neurons, a layer of <delta-neuron>s is added to the network. Each delta neuron computes the difference between the actual <activation> output of a single sun neuron and the <desired> feedback for that sun neuron. The result is passed to the sun neuron as a <delta> signal. This approach enables the use of a single type of sun neuron even at the top-most level. The <activation> pathway flows forward through the network from the activation input ports to planet neurons to sun neurons to planet neurons to sun neurons to the activation output ports. The <desired> pathway flows backward from the desired-signal input ports to the delta neurons. The <delta> pathway flows backward from the delta neurons to sun neurons to planet neurons to sun neurons to planet neurons.

4.4 Representation of the Structure of Modules

Modular Foundations

In the NGAGE system, a neural network that exhibits modular topology may be defined as a 5-tuple (I, O, M, N, C), where

I is the set of input sources of the network. Each source provides a signal of a specific type, though different sources may provide different types.
O is the set of output targets of the network. Each target receives a signal of a specific type, though different targets may receive different types.
M is a non-empty set of modules.
N is a possibly empty set of nodes that consists of all the nodes of the network that are not within M.
C is the set of all connections of the network that are not within M.
Each module within M may in turn be defined as a 5-tuple (MI, MO, MH, MM, MC) representing a group of nodes and nested modules and directed connections among them, where
MI is a non-empty subset of nodes, the module-input nodes, that form an input interface and are the only nodes in the module capable of receiving extramodular signals (i.e., from nodes that are outside the module and/or from the input sources of the network).
MO is a non-empty subset of the nodes, the module-output nodes, that form an output interface and are the only nodes in the module capable of sending extramodular signals (i.e., to nodes that are outside the module and/or to the output targets of the network). MI ∩ MO may be non-empty.
MH is a possibly-empty set of nodes, the module-hidden nodes, that consists of all nodes in the module such that MH ∩ MI = ∅ and MH ∩ MO = ∅.
MM is a possibly empty set of modules. Let MI_i, MO_i and MH_i refer to the module-input, module-output and module-hidden nodes, respectively, of the i-th

nested module within MM. A module and its nested module may share certain nodes, but may not share the hidden nodes of the nested module. Specifically, ⋃_i (MI_i ∪ MO_i) ⊆ MI ∪ MO ∪ MH, and (⋃_i MH_i) ∩ (MI ∪ MO ∪ MH) = ∅.
MC is the set of directed connections, the intramodular connections, between pairs of nodes within the module, such that C_ab ∈ MC implies a, b ∈ MI ∪ MO ∪ MH, where C_ab is a directed connection from node a to node b.

Figure 91 illustrates the topology of a simple NGAGE module. Note that certain nodes lie in both the input and output interface (indicated by solid colour and bold line together). The internal connectivity pattern is arbitrary, and may contain links between any neurons within the module, including recurrent self-links.

[Figure 91 legend: intramodular connections MC, extramodular input and output connections, module-input nodes MI, module-output nodes MO, module-hidden nodes MH.]
Figure 91: Simple NGAGE Module

Figure 92 illustrates a module containing one nested module. Note that nodes in the input and output interface of the nested module may be shared as part of the input and output interface of the main module.
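To make the definition concrete, the following Python sketch expresses a module as a small data structure and checks the constraints above; the class name, field names and the validity check are illustrative choices made for this example, not part of the NGAGE system itself.

# Hypothetical sketch of the (MI, MO, MH, MM, MC) module definition as a data structure.

from dataclasses import dataclass, field
from typing import List, Set, Tuple

@dataclass
class Module:
    module_input: Set[str]                                           # MI
    module_output: Set[str]                                          # MO
    module_hidden: Set[str] = field(default_factory=set)             # MH
    nested: List["Module"] = field(default_factory=list)             # MM
    connections: Set[Tuple[str, str]] = field(default_factory=set)   # MC (directed pairs)

    def nodes(self):
        return self.module_input | self.module_output | self.module_hidden

    def is_valid(self):
        if not self.module_input or not self.module_output:
            return False                      # MI and MO must be non-empty
        if self.module_hidden & (self.module_input | self.module_output):
            return False                      # MH is disjoint from MI and MO
        if any(a not in self.nodes() or b not in self.nodes()
               for a, b in self.connections):
            return False                      # MC joins only nodes of the module
        for sub in self.nested:
            if sub.module_hidden & self.nodes():
                return False                  # hidden nodes of nested modules are not shared
        return True

# Example: as in Figure 91, a node ("b") may lie in both interfaces, and recurrent
# self-links are allowed.
m = Module(module_input={"a", "b"}, module_output={"b", "c"},
           module_hidden={"h"}, connections={("a", "h"), ("h", "c"), ("b", "b")})
print(m.is_valid())  # True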

[Figure 92 legend: the outer module's intramodular connections MC, extramodular input and output connections, module-input nodes MI, module-output nodes MO, and module-hidden nodes MH; the nested module i within MM, with its intramodular connections MC_i, module-input nodes MI_i and module-output nodes MO_i; and nodes shared between MI and MI_i and between MO and MO_i.]
Figure 92: Nested NGAGE Module

Within NGAGE, the behaviour of a modular network is emergent from the behaviours of its constituent modules and nodes and the communication relationships formed by connections among those modules and nodes, as well as between them and the network input sources and output targets. The behaviour of a module is emergent from the behaviours of its constituent neurons and nested modules and the communication relationships formed by the intramodular connections.

Design Practices for Modular Topology

Practice 24 Complete module specification is computed within the attributes of a single grammar symbol, the module-root symbol.

This practice is analogous to Practice 4. Following the goal of assigning roles to symbols, the specification of a module is collected within the attributes of a single symbol. That symbol is referred to as a module-root symbol. In

246 particular, all nodes of a module and all connections between those nodes are collected within those attributes. Practice 25 Explicitly represent the module-input and module-output nodes of a module. This practice is analogous to Practice 10. As described earlier, an NGAGE module consists of nodes that send and receive extramodular signals. Explicit representation of those nodes enables the module to be easily embedded within a larger topology without violating the internal constraints of the module. Practice 26 Specification of the internal topology of a module is encapsulated within the subtree of the attributed parse tree rooted by the module-root symbol. Following Practice 24, collecting the specification of a module within the attributes of a single symbol clarifies the role of that symbol. However, a neural module is only part of a larger neural structure, and within an NGAGE grammar the potential exists for computing neural structures at any location in a given parse tree and collecting those at any other location. To improve readability and reinforce the encapsulation of modules, all internal nodes and all internal connections should be created exclusively within the productions that expand the module-root symbol (i.e., in the subtree below the module-root). A given module-root symbol and the grammar productions that expand that symbol collectively form a module-subgrammar. The productions of 223

247 the grammar that are not part of the module-subgrammar should preserve the distinction between the internal and external module components. Note that this practice does not prevent the module-root symbol from inheriting structural constraints from higher levels of the grammar (e.g., the maximum size of the module) or passing up structural information (e.g., the actual size of the module). Practice 27 Multiple types of modules may be represented within a single grammar through multiple module-subgrammars. Within a single neural network, it may be desirable to incorporate multiple types of modules. For example, one type of module may differ from another in architecture, topological constraints, types of neurons, and so on. Within NGAGE, this capability is readily achieved through the use of distinct module-root symbols. An alternative view is that two NGAGE grammars, each of which defines a complete neural architecture, may be composed to form a single hybrid NGAGE grammar representing a modular architecture. This may be achieved by assigning a distinct module-root symbol to each of the original grammars and adding appropriate productions that specify the relationships between those symbols. For example, Figure 93 illustrates an NGAGE grammar that includes multilayer perceptrons and grids of Kohonen nodes. The grammar productions of Figure 55 that define and expand the <Multi-Layer> symbol are included in their entirety. Similarly, the grammar productions of Figure 81 that define and expand the <Grid> symbol are included in their entirety. The new grammar represents a 224

248 space of networks in which the inputs are fed to a multi-layer perceptron module, the outputs of the multi-layer perceptron module are fed to a grid of Kohonen neurons, whose output forms the output of the network. Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Module>: (all_nodes, all_connections, input_nodes, output_nodes) {id} <Multi-Layer>: (all_nodes, all_connections, input_nodes, output_nodes) {id} <Grid>: (all_nodes, all_connections, last_column, last_row, length, width) {id} Productions and Attribute Evaluation Rules: I. <Network> <Input-Port-Layer> <Module> <Output-Port-Layer> i <Network>.all_nodes = <Module>.all_nodes 4 <Input-Port-Layer>.all_ports 4 <Output- Port-Layer>.all_ports ii <Network>.all_connections = <Module>.all_connections 4 full_connect(<input-port-layer> 1.all_ports, <Module>.input_nodes) 4 ordered_direct_connect(<module>.output_nodes, <Output-Port-Layer>.all_ports) 4 a <Input-Port-Layer>.id = 1.1 b <Multi-Layer>.id = 1.2 c <Output-Port-Layer>.id = 1.3 II. <Module> <Multi-Layer> <Grid> i <Module>.all_nodes = <Multi-Layer>.all_nodes 4 <Grid>.all_nodes ii <Module>.all_connections = <Multi-Layer>.all_connections 4 <Grid>.all_connections 4 full_connect(<multi-layer>.output_nodes, <Grid>.all_nodes) iii <Module>.input_nodes = <Multi-Layer>.input_nodes iv <Module>.output_nodes = <Grid>.all_nodes a <Multi-Layer>.id = concatenate(<module>.id,.1 ) b <Grid>.id = concatenate(<module>.id,.2 ) where <Multi-Layer> defined as in Figure 55 <Grid> defined as in Figure 81 Figure 93: NGAGE Grammar Illustrating Combination of Module-Subgrammars to Form Hybrid Grammar When combining different types of modules within a single grammar, it may be the case that certain neural structures in one module architecture are the same as those in another. To reduce redundancy and improve compactness of the grammar, the module subgrammars for those two types of modules may share some of the same productions. Note that this is only possible if all constraints within the shared productions are common to both types of modules 225

249 Practice 28 Explicitly represent different topological relationships between modules in distinct productions. In a modular network, a number of different topological relationships between modules are possible. For example, a common relationship is a successive decomposition in which the output nodes of one module are fully connected to the input nodes of another module. This is typical of a building block approach to modular networks, such as the CALM network (Happel and Murre, 1992, 1994). Models may instead impose highly specific connectivity patterns between several modules, as in the ARTSTAR network (Hussain and Browse, 1994) in which four modules of different types (i.e., ART module, instar layer and threshold layer) are connected in a specific sequence with some modules fully connected and others connected one-to-one. Other possibilities include parallel processing decompositions, input and output partition decompositions, and so on. Within NGAGE, a wide variety of module relationships may be represented. Further, within a single grammar, multiple relationships between the same types of modules may be specified. For example, Figure 94 illustrates an NGAGE grammar in which three types of modules (i.e., <Multi-Layer>, <Grid> and <Recurrent-Layer> from Figure 76) may be connected in three different ways in productions II, III and IV. Further, the modules may nest arbitrarily deep, enabling the representation of highly complex modular networks. 226
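A rough Python sketch of these three relationships, using a simple dictionary representation of a module (an assumption made for this example, not the NGAGE attribute machinery, and with hypothetical helper names), is shown below.

# Hypothetical sketch of the intermodular relationships of Figure 94. A module is
# modelled as a dictionary with node, connection, input-interface and output-interface
# lists; this flat representation is an assumption made for illustration.

def _compose(m1, m2, bridge, inputs, outputs):
    return {"nodes": m1["nodes"] + m2["nodes"],
            "connections": m1["connections"] + m2["connections"] + bridge,
            "inputs": inputs, "outputs": outputs}

def successive_full(m1, m2):
    # successive-full: fully connect the outputs of m1 to the inputs of m2.
    bridge = [(a, b) for a in m1["outputs"] for b in m2["inputs"]]
    return _compose(m1, m2, bridge, m1["inputs"], m2["outputs"])

def successive_1to1(m1, m2):
    # successive-1to1: connect the i-th output of m1 to the i-th input of m2.
    bridge = list(zip(m1["outputs"], m2["inputs"]))
    return _compose(m1, m2, bridge, m1["inputs"], m2["outputs"])

def parallel(m1, m2):
    # parallel: place the modules side by side with no bridging connections.
    return _compose(m1, m2, [], m1["inputs"] + m2["inputs"],
                    m1["outputs"] + m2["outputs"])

# Example: nest the operators, roughly in the spirit of Figure 95(b).
grid = {"nodes": ["g1", "g2"], "connections": [], "inputs": ["g1", "g2"], "outputs": ["g1", "g2"]}
recurrent = {"nodes": ["r1"], "connections": [("r1", "r1")], "inputs": ["r1"], "outputs": ["r1"]}
mlp = {"nodes": ["p1", "p2"], "connections": [("p1", "p2")], "inputs": ["p1"], "outputs": ["p2"]}

network = successive_full(mlp, parallel(grid, recurrent))
print(len(network["connections"]))  # 2 internal + 3 bridging connections = 5

Because each operator returns the same dictionary shape it consumes, the operators nest to arbitrary depth, mirroring the recursive <Module> productions of the grammar.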

250 Grammar that augments grammar of Figure 93. Non-Terminal Symbols: <Recurrent-Layer>: (all_nodes, all_connections) {id, nodes_to_connect} Terminal Symbols: successive-full successive-1to1 parallel Productions and Attribute Evaluation Rules: II. <Module> 1 successive-full ( <Module> 2, <Module> 3 ) i <Module> 1.all_nodes = <Module> 2.all_nodes 4 <Module> 3.all_nodes ii <Module> 1.all_connections = <Module> 2.all_connections 4 <Module> 3.all_connections 4 fully_connect(<module> 2.output_nodes, <Module> 3.input_nodes) iii <Module> 1.input_nodes = <Module> 2.input_nodes iv <Module> 1.output_nodes = <Module> 3.output_nodes a <Module> 2.id = concatenate(<module> 1.id,.1 ) b <Module> 3.id = concatenate(<module> 1.id,.2 ) III. <Module> 1 successive-1to1 ( <Module> 2, <Module> 3 ) i <Module> 1.all_nodes = <Module> 2.all_nodes 4 <Module> 3.all_nodes ii <Module> 1.all_connections = <Module> 2.all_connections 4 <Module> 3.all_connections 4 ordered_direct_connect(<module> 2.output_nodes, <Module> 3.input_nodes) iii <Module> 1.input_nodes = <Module> 2.input_nodes iv <Module> 1.output_nodes = <Module> 3.output_nodes a <Module> 2.id = concatenate(<module> 1.id,.1 ) b <Module> 3.id = concatenate(<module> 1.id,.2 ) IV. <Module> 1 parallel ( <Module> 2, <Module> 3 ) i <Module> 1.all_nodes = <Module> 2.all_nodes 4 <Module> 3.all_nodes ii <Module> 1.all_connections = <Module> 2.all_connections 4 <Module> 3.all_connections iii <Module> 1.input_nodes = <Module> 2.input_nodes 4 <Module> 3.input_nodes iv <Module> 1.output_nodes = <Module> 2.output_nodes 4 <Module> 3.output_nodes a <Module> 2.id = concatenate(<module> 1.id,.1 ) b <Module> 3.id = concatenate(<module> 1.id,.2 ) V. <Module> <Multi-Layer> i <Module>.all_nodes = <Multi-Layer>.all_nodes ii <Module>.all_connections = <Multi-Layer>.all_connections iii <Module>.input_nodes = <Multi-Layer>.input_nodes iv <Module>.output_nodes = <Multi-Layer>.output_nodes a <Multi-Layer>.id = concatenate(<module>.id,.1 ) VI. <Module> <Grid> i <Module>.all_nodes = <Grid>.all_nodes ii <Module>.all_connections = <Grid>.all_connections iii <Module>.input_nodes = <Grid>.all_nodes iv <Module>.output_nodes = <Grid>.all_nodes a <Grid>.id = concatenate(<module>.id,.1 ) VII. <Module> <Recurrent-Layer> i <Module>.all_nodes = <Recurrent-Layer>.all_nodes ii <Module>.all_connections = <Recurrent-Layer>.all_connections iii <Module>.input_nodes = <Recurrent-Layer>.all_nodes iv <Module>.output_nodes = <Recurrent-Layer>.all_nodes a <Recurrent-Layer>.id = concatenate(<module>.id,.1 ) b <Recurrent-Layer>.nodes_to_connect = <Recurrent-Layer>.all_nodes where <Recurrent-Layer> defined as in Figure 76 Figure 94: NGAGE Grammar Illustrating Multiple Relationships Between Modules 227

[Figure 95(a) shows the context-free portion of a parse tree in which the <Module> symbol below <Network> is expanded through the successive-1to1, successive-full and parallel productions into <Multi-Layer>, <Grid> and <Recurrent-Layer> submodules; Figure 95(b) shows the resulting modular network with those intermodular relationships labelled.]
Figure 95: (a) Partial Context-Free Parse Tree and (b) Associated Modular Network with Multiple Module Types and Multiple Intermodular Relationships

Figure 95(a) illustrates the context-free portion of a parse tree generated from the grammar of Figure 94. The tree is only partially defined, and does not present the subtrees generated by the module-subgrammars. Figure 95(b) illustrates the modular network that is produced by this parse tree. The network contains four modules of three different types. The connectivity relationships between the modules decompose the problem in several ways. The result is a novel network solution that partitions the problem into different subtasks and solves each of these subtasks using specialized architectures.

Practice 29 A module may be explicitly associated with a specific subtask of the main problem.

An important reason for applying a modular network to a given task is the ability to decompose the problem and apply distinct modules to solve those tasks. Within NGAGE, such decompositions may be readily and explicitly defined within the productions of the grammar. For example, Figure 96 illustrates an NGAGE grammar in which the input and output vectors that define the problem are explicitly decomposed to form an arbitrary number of subtasks, each defined by a (potentially) different input-output mapping. In production I, the entire input-output mapping solved by the network is explicitly assigned to the non-terminal symbol <Task>. In productions II-V, the task is recursively decomposed into a number of subtasks by partitioning the input-output mapping into two new mappings. Two subtasks may be identical (as in production V), share the same input set (as in production IV), share the same output set (as in production III) or

253 define distinct mappings (as in production II). Once a subtask has been fully defined, production VI identifies a module to solve that subtask, and creates the appropriate connections between that module and the elements of the subtask s input-output mapping. Productions VII-XIII are supporting productions that are used to manipulate and partition sets of elements. Given a specific set of elements, productions VII and VIII partition it into two subsets such that all members of the original set are included in at least one of the new sets. As a result, all the ports in the original input-output mapping are included in at least one of the subtasks generated by the grammar. Productions IX-XI may be applied to generate all possible subsets of a given set of elements. Productions XII and XIII may be applied to generate partitions similar to those made by the cellular encoding program symbols USPLIT and LSPLIT (Friedrich and Moraga, 1996, 1997). A drawback to the grammar of Figure 96 is that it may return a subtask in which the input or output set is empty. This may result in an unconnected network and/or in modules that are ineffective. To ensure that a module always has at least one input source and one output target, productions VII and VIII may be expanded as in Figure 97. In each production, a single set element is identified as a backup element. If either of the two partitions created by the production (e.g., <Subset> and its complement in production VIII) would otherwise be empty, the backup element is returned for that partition instead of an empty set. 230
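A small Python sketch of this partitioning behaviour is given below; modelling ports as a plain list, the exact percentage-to-index rounding of select_subset, and approximating the backup terminal of Figure 97 with a random fallback element are all assumptions made for this example.

# Hypothetical sketch of the subtask-partitioning helpers behind Figures 96 and 97.

import random

def select_subset(elements, x, y):
    # Return the members whose positions fall within the percentage range [x, y];
    # the rounding used here is an assumption.
    n = len(elements)
    lo = (x - 1) * n // 100
    hi = y * n // 100
    return elements[lo:hi]

def non_empty_partition(original, chosen):
    # Split `original` into (chosen, complement); Figure 97's backup element is
    # approximated here by a random fallback whenever a side would otherwise be empty.
    first = list(chosen) or [random.choice(original)]
    second = [e for e in original if e not in first] or [random.choice(original)]
    return first, second

# Example: productions XII and XIII correspond to the 1-50% and 51-100% ranges.
output_ports = ["o1", "o2", "o3", "o4"]
first = select_subset(output_ports, 1, 50)              # ['o1', 'o2']
print(non_empty_partition(output_ports, first))         # (['o1', 'o2'], ['o3', 'o4'])
print(non_empty_partition(output_ports, output_ports))  # second side falls back to a backup element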

254 Non-Terminal Symbols: <Network>: (all_nodes, all_connections) {} <Task>: (all_nodes, all_connections) {id, input_set, output_set} <Partition>: (first_set, second_set) {id, original_set} <Subset>: (actual_members) {potential_members} Terminal Symbols: input output complement overlap select selected-complement all none first-half second-half + (, ) Helper Operations: select_subset, concatenate, full_connect, ordered_direct_connect Productions and Attribute Evaluation Rules: I. <Network> <Input-Port-Layer> <Task> <Output-Port-Layer> i <Network>.all_nodes = <Task>.all_nodes 4 <Input-Port-Layer>.all_ports 4 <Output-Port- Layer>.all_ports ii <Network>.all_connections = <Task>.all_connections a <Input-Port-Layer>.id = 1.1 b <Task>.id = 1.2 c <Task>.input_ports = <Input-Port-Layer>.all_ports d <Task>.output_ports = <Output-Port-Layer>.all_ports e <Output-Port-Layer>.id = 1.3 II. <Task> 1 (input <Partition> 1 ) <Task> 2 <Task> 3 (output <Partition> 2 ) i <Task> 1.all_nodes = <Task> 2.all_nodes 4 <Task> 3.all_nodes ii <Task> 1.all_connections = <Task> 2.all_connections 4 <Task> 3.all_connections a <Task> 2.id = concatenate(<task> 1.id,.1 ) b <Task> 2.input_ports = <Partition> 1.first_set c <Task> 2.output_ports = <Partition> 2.first_set d <Task> 3.id = concatenate(<task> 1.id,.2 ) e <Task> 3.input_ports = <Partition> 1.second_set f <Task> 3.output_ports = <Partition> 2.second_set g <Partition> 1.original_set = <Task> 1.input_ports h <Partition> 2.original_set = <Task> 1.output_ports III. <Task> 1 (input <Partition>) <Task> 2 <Task> 3 i <Task> 1.all_nodes = <Task> 2.all_nodes 4 <Task> 3.all_nodes ii <Task> 1.all_connections = <Task> 2.all_connections 4 <Task> 3.all_connections a <Task> 2.id = concatenate(<task> 1.id,.1 ) b <Task> 2.input_ports = <Partition>.first_set c <Task> 2.output_ports = <Task> 1.output_ports d <Task> 3.id = concatenate(<task> 1.id,.2 ) e <Task> 3.input_ports = <Partition>.second_set f <Task> 3.output_ports = <Task> 1.output_ports g <Partition>.original_set = <Task> 1.input_set IV. <Task> 1 <Task> 2 <Task> 3 (output <Partition>) i <Task> 1.all_nodes = <Task> 2.all_nodes 4 <Task> 3.all_nodes ii <Task> 1.all_connections = <Task> 2.all_connections 4 <Task> 3.all_connections a <Task> 2.id = concatenate(<task> 1.id,.1 ) b <Task> 2.input_ports = <Task> 1.input_ports c <Task> 2.output_ports = <Partition>.first_set d <Task> 3.id = concatenate(<task> 1.id,.2 ) e <Task> 3.input_ports = <Task> 1.input_ports f <Task> 3.output_ports = <Partition>.second_set g <Partition>.original_set = <Task> 1.output_set continued

255 ...continuation V. <Task> 1 redundant ( <Task> 2 <Task> 3 ) i <Task> 1.all_nodes = <Task> 2.all_nodes 4 <Task> 3.all_nodes ii <Task> 1.all_connections = <Task> 2.all_connections 4 <Task> 3.all_connections a <Task> 2.id = concatenate(<task> 1.id,.1 ) b <Task> 2.input_ports = <Task> 1.input_ports c <Task> 2.output_ports = <Task> 1.output_ports d <Task> 3.id = concatenate(<task> 1.id,.2 ) e <Task> 3.input_ports = <Task> 1.input_ports f <Task> 3.output_ports = <Task> 1.output_ports VI. <Task> <Module> i <Task>.all_nodes = <Module>.all_nodes ii <Task>.all_connections = <Module>.all_connections 4 full_connect(<task>.input_ports, <Module>.input_nodes) 4 ordered_direct_connect(<module>.output_nodes, <Task>.output_ports) a <Module>.id = concatenate(<task>.id,.1 ) VII <Partition> ( <Subset>, complement ) i <Partition>.first_set = <Subset>.actual_members ii <Partition>.second_set = <Partition>.original_set <Subset>.actual_members a <Subset>.potential_members = <Partition>.original_set VIII. <Partition> ( <Subset> 1, complement + overlap(<subset> 2 ) ) i <Partition>.first_set = <Subset> 1.actual_members ii <Partition>.second_set = <Subset> 2.actual_members 4 (<Partition>.original_set <Subset> 1.actual_members) a <Subset> 1.potential_members = <Partition>.original_set b <Subset> 2.potential_members = <Subset> 1.actual_members IX. <Subset> 1 select ( <#uniforminteger(1,100)> ) + selected-complement ( <Subset> 2 ) i <Subset>.actual_members = <Subset> 2.actual_members 4 select_subset(<subset> 1.potential_members, <#uniforminteger(1,100)>.value, <#uniforminteger(1,100)>.value) a <Subset> 2.potential_members = <Subset> 1.potential_members select_subset(<subset> 1.potential_members, <#uniforminteger(1,100)>.value, <#uniforminteger(1,100)>.value) X. <Subset> all i <Subset>.actual_members = <Subset>.potential_members XI. <Subset> none i <Subset>.actual_members = {} XII. <Subset> first-half i <Subset>.actual_members = select_subset(<subset>.potential_members, 1, 50) XIII. <Subset> second-half i <Subset>.actual_members = select_subset(<subset>.potential_members, 51, 100) where select_subset(a,x,y): Given set A and two percent values (integers ), return all members of A in the given range [x..y] <Module> as defined in Figure 94 Figure 96: NGAGE Grammar Illustrating Partitioning of Problem into Arbitrary Subtasks that are Solved by Distinct Modules 232

256 Grammar that augments grammar of Figure 96 Terminal Symbols: backup VII. <Partition> ( <Subset>, complement ) backup(<#uniforminteger(1,100)>) i <Partition>.first_set = if ( <Subset>.actual_members == {} ) select_subset(<partition>.original_set, <#uniforminteger(1,100)>.value, <#uniforminteger(1,100)>.value) else <Subset>.actual_members ii <Partition>.second_set = if (<Partition>.first_set == <Partition>.original_set) select_subset(<partition>.original_set, <#uniforminteger(1,100)>.value, <#uniforminteger(1,100)>.value) else <Partition>.original_set <Partition>.first_set a <Subset>.potential_members = <Partition>.original_set VIII. <Partition> ( <Subset> 1, complement + overlap(<subset> 2 ) ) backup(<#uniforminteger(1,100)>) i <Partition>.first_set = if ( <Subset> 1.actual_members == {} ) select_subset(<partition>.original_set, <#uniforminteger(1,100)>.value, <#uniforminteger(1,100)>.value) else <Subset> 1.actual_members ii <Partition>.second_set = if ( <Partition>.first_set == <Partition>.original_set and <Subset> 2.actual_members == {} ) select_subset(<partition>.original_set, <#uniforminteger(1,100)>.value, <#uniforminteger(1,100)>.value) else (<Partition>.original_set <Partition>.first_set) 4 <Subset> 2.actual_members a <Subset> 1.potential_members = <Partition>.original_set b <Subset> 2.potential_members = <Subset> 1.actual_members Figure 97: NGAGE Productions Illustrating Non-Empty Partitions Figure 98 illustrates a sample neural network structure that may be generated from the grammars of Figure 96 and Figure 97. The original task is decomposed into 3 subtasks. Each subtask is defined by a different input-output mapping, although there is some overlap on certain nodes. Each subtask is assigned to a different module (indicated by the dashed squares), each of which in turn has a different internal composition. The leftmost module, for example, 233

solves the subtask that maps the first, third and fourth input elements to the first five output elements; internally, it consists of a single multi-layer submodule. The rightmost module, by contrast, solves the subtask that maps the second half of the input vector to the four rightmost output elements; internally, it consists of a multi-layer submodule and a recurrent-layer submodule arranged in a successive one-to-one relationship.

Figure 98: Network Illustrating Task Decomposition among Heterogeneous Modules

Figure 99 illustrates a complex modular network that may be generated from the grammars of Figure 96 and Figure 97, but modified such that each module is represented by a single internal node and its input and output interface. The network is drawn using the Visualizing Graphs with Java (VGJ) graph drawing package (McCreary and Barowski, 1998). The network decomposes the problem into 13 subtasks. Each subtask is solved in a different way, with different numbers of modules arranged in a variety of intermodular relationships. For instance, the leftmost subtask is solved by 73 modules arranged in an irregular

Figure 99: Network Illustrating Complex Modular Topologies Solving Decomposed Task

Practice 30: Explicitly distinguish module-input and module-output nodes by each signal type received and transmitted by the module.

A given module may contain nodes in its interface that accept and transmit signals of different types to/from nodes external to the module. Extending Practice 23 and Practice 25, all module-input (and module-output) nodes of a given signal type may be stored in a distinct attribute. This helps ensure the specification of valid extramodular input and output connections and the formation of valid pathways between all modules of a network. Figure 100 illustrates an NGAGE grammar fragment in which two modules may be connected using a successive decomposition that forms both valid feedforward activation connections and valid feedback connections. Productions III and IV define a module to be one of two types, <Multi-Layer> or <Grid>.
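A rough Python sketch of the idea behind Practice 30: the module interface keeps a separate pool of input and output nodes per signal type, and connection helpers pair only nodes of the same type. The class and helper names below are assumptions for illustration, not the grammar's actual attribute definitions.

    from collections import defaultdict

    class ModuleInterface:
        # Module-input and module-output nodes grouped by the signal type they
        # receive or transmit (e.g. 'activation' for feedforward, 'feedback').
        def __init__(self):
            self.input_nodes = defaultdict(list)   # signal type -> module-input nodes
            self.output_nodes = defaultdict(list)  # signal type -> module-output nodes

    def full_connect_by_type(source, target, signal_type):
        # Form a connection from every source output node of the given signal type
        # to every target input node that accepts that same type.
        return [(s, t) for s in source.output_nodes[signal_type]
                       for t in target.input_nodes[signal_type]]

Connecting two such modules once with signal_type='activation' and once with signal_type='feedback' yields type-consistent feedforward and feedback pathways of the kind the grammar fragment of Figure 100 is meant to guarantee.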
