A Graph-Based Evolutionary Algorithm: Genetic Network Programming (GNP) and Its Extension Using Reinforcement Learning


Shingo Mabu
Graduate School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, Japan

Kotaro Hirasawa
Graduate School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, Japan

Jinglu Hu
Graduate School of Information, Production and Systems, Waseda University, Hibikino 2-7, Wakamatsu-ku, Kitakyushu, Fukuoka, Japan

Abstract

This paper proposes a graph-based evolutionary algorithm called Genetic Network Programming (GNP). Our goal is to develop GNP, which can deal with dynamic environments efficiently and effectively, based on the distinguished expression ability of the graph (network) structure. The characteristics of GNP are as follows. 1) GNP programs are composed of a number of nodes which execute simple judgment/processing, and these nodes are connected to each other by directed links. 2) The graph structure enables GNP to re-use nodes, so the structure can be very compact. 3) The node transition of GNP is executed according to its node connections without any terminal nodes, so the past history of the node transition affects the current node to be used; this characteristic works as an implicit memory function. These structural characteristics are useful for dealing with dynamic environments. Furthermore, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL), which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. In this paper, we applied GNP to the problem of determining agent behavior in order to evaluate its effectiveness. Tileworld was used as the simulation environment. The results show some advantages of GNP over conventional methods.

Keywords: Evolutionary computation, graph structure, reinforcement learning, agent, tileworld.

1 Introduction

A large number of studies have been conducted on evolutionary optimization techniques. Genetic Algorithm (GA) (Holland, 1975), Genetic Programming (GP) (Koza, 1992, 1994) and Evolutionary Programming (EP) (Fogel et al., 1966; Fogel, 1994) are typical evolutionary algorithms. GA evolves strings and is mainly applied to optimization problems. GP was devised later in order to expand the expression ability of GA by using tree structures. EP is a graph-structural system that creates finite state machines by evolution.

© 2007 by the Massachusetts Institute of Technology. Evolutionary Computation 15(3)

In this paper, a new graph-based evolutionary algorithm named Genetic Network Programming (GNP) (Katagiri et al., 2000, 2001; Hirasawa et al., 2001; Mabu et al., 2002, 2004) is described. Our aim in developing GNP is to deal with dynamic environments efficiently and effectively by using the higher expression ability of the graph structure and the functions inherently equipped in it. The distinguishing functions of the GNP structure are directed graph expression, reusability of nodes, and an implicit memory function. The directed graph expression can realize some repetitive processes, and it can be effective because it works like the Automatically Defined Functions (ADFs) in GP. The node transition of GNP starts from a start node and continues based on the node connections, so it can be said that an agent's(1) actions in the past are implicitly memorized in the network flow.

In addition, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL), which combines evolution and reinforcement learning in order to create effective graph structures and obtain better results in dynamic environments. The distinguishing features of GNP-RL are the combination of offline and online learning, and of diversified and intensified search. Although we have already proposed a method, online learning of GNP (Mabu et al., 2003), which uses only reinforcement learning to select the node connections, this method has a problem in that the Q table becomes very large, and the calculation time and the memory occupation also become large. Thus, in this paper we propose a method using both evolution and reinforcement learning. Evolutionary algorithms are superior in terms of wide-space search ability because they continue to evolve various individuals and select better ones (offline learning), while RL can learn incrementally, based on rewards obtained during task execution (online learning). Therefore, the combination of evolution and RL can cooperatively make good graph structures. In fact, the proposed evolutionary algorithm (diversified search) makes rough graph structures by selecting some important functions among many kinds of functions and connecting them based on fitness values after task execution, so the Q table becomes quite compact. Then RL (intensified search) selects the best function during task execution, i.e., it determines an appropriate node transition.

This paper is organized as follows. Section 2 provides the related work and comparisons with GNP. In Section 3, the details of GNP and GNP-RL are described. Section 4 explains the Tileworld problem and the available function sets, and shows the simulation results. Section 5 discusses future work and remaining problems. Section 6 is devoted to conclusions.

2 Related Work and Comparisons

A GP tree can be used as a decision tree when all function nodes are if-then type functions and all terminal nodes are concrete action functions. In this case, a tree is executed from the root node to a certain terminal node in each step, so the behaviors of agents are determined mainly by the current information. On the other hand, since GNP has an implicit memory function, GNP can determine an action using not only the current, but also the past information. The most important problem for GP is the bloat of the tree. The increase in depth causes an exponential enlargement of the search space, the occupation of large amounts of memory, and an increase in calculation time. Constraining the depth of the tree is one of the ways to overcome the bloat problem.
Since the graph structure of GNP has an implicit memory function and the ability to re-use nodes, GNP is expected to use necessary nodes repeatedly and create compact structures.

(1) An agent is a computer system that is situated in some environment, and it is capable of autonomous action in this environment in order to meet its design objectives (Weiss, 1999). In this paper, the autonomous action is determined by GNP.

We will have a discussion on the program size in the simulation section.

Evolutionary Programming (EP) is a graph-structural system used for the automatic synthesis of finite state machines (FSMs). For example, FSM programs are evolved for the iterated prisoner's dilemma game (Fogel, 1994; Angeline, 1994) and the ant problem (Angeline and Pollack, 1993). However, there are some essential differences between EP and GNP. Generally, FSMs must define their transition rules for all combinations of states and possible inputs, so an FSM program will become large and complex when the number of states and inputs is large. In GNP, the nodes are connected by necessity, so it is possible that only the essential inputs obtained in the current situation are used in the network flow. As a result, the graph structure of GNP can be quite compact.

PADO (Teller and Veloso, 1995, 1996; Teller, 1996) is also a graph-based algorithm, but its fundamental concept is different from that of GNP. Each node in a PADO program has two functional parts, an action part and a branch-decision part, and PADO also has both a start node and an end node. The state transition of PADO is based on a stack and indexed memory. Since PADO has been successfully applied to image and sound classification problems, it can be said that PADO has a splendid ability for static problems. GNP is designed mainly to deal with problems in dynamic environments. First, the main concept of GNP is to make use of the implicit memory function; therefore, GNP does not presuppose the use of explicit memories such as stack and indexed memories. Second, GNP has judgment nodes and processing nodes, which correspond to the branch-decision parts and action parts of PADO, respectively. Note that GNP separates judgment and processing functions, while both functions of PADO are contained in one node. Therefore GNP can create more complex combinations/rules of judgment and processing. Finally, the nodes of GNP have a unique node number and the number of nodes is the same in all the individuals. This characteristic contributes to executing RL effectively in GNP-RL (see Section 3.6). Some information on the techniques of graph- or network-based GP is given in Luke and Spector (1996).

Finally, we explain the methods that combine evolution and learning. In Downing (2001), special terminal nodes for learning are introduced to GP, and the contents of the terminal nodes, i.e., the actions of the agents, are determined by Q learning. In Iba (1998), a Q table is produced by GP to make an efficient state space for Q learning; e.g., if the GP program combines two subtrees, one over x and y and one computing z + 5, it represents a 2-dimensional Q table having those two expressions as its axes. In Kamio and Iba (2005), the terminal nodes of GP select appropriate Q tables and the agent action is determined by the selected Q table. The most important difference between these methods and GNP-RL is how the state-action spaces (Q tables) are created. GNP creates Q tables using its graph structures. Concretely speaking, an activated node corresponds to the current state, and the selection of a function in the activated node corresponds to an action. In the other methods, the current state is determined by the combination of the inputs, and the actions are actual actions such as move forward, turn right and so on.

3 Genetic Network Programming

In this section, Genetic Network Programming is explained in detail. GNP is an extension of GP in terms of gene structures. The original motivation for developing GNP is based on the more general representation ability of graphs as opposed to that of trees in dynamic environments.

Figure 1: Basic structure of GNP (directed graph structure, node gene and connection gene layout; J: judgment function, P: processing function).

3.1 Basic Structure of GNP

3.1.1 Components

Fig. 1 shows the basic structure of GNP. A GNP program is composed of one start node, plural judgment nodes and plural processing nodes. In Fig. 1, there are one start node, two judgment nodes and two processing nodes, and they are connected to each other. The start node has no function and no conditional branch; its only role is to determine the first node to be executed. Judgment nodes have conditional branch-decision functions. Each judgment node returns a judgment result and determines the next node to be executed. Processing nodes work as action/processing functions. For example, processing nodes determine the agent's actions such as go forward, turn right and turn left. In contrast to judgment nodes, processing nodes have no conditional branch. By separating processing and judgment functions, GNP can handle various combinations of judgment and processing. That is, how many judgments and which kinds of judgment should be used can be determined by evolution. Suppose there are eight kinds of judgment nodes (J_1, ..., J_8) and four kinds of processing nodes (P_1, ..., P_4). Then GNP can make a node transition by selecting the necessary nodes, e.g., J_1 -> J_5 -> J_3 -> P_1. This says that judgment nodes J_1, J_5 and J_3 are needed for processing node P_1. By selecting only the necessary nodes, a GNP program can be quite compact and can be evolved efficiently.

In this paper, as described above, each processing node determines an agent's action such as "go forward" or "turn right", and each judgment node determines the next node after judging "what is in front?", "what is on the right?" and so on. However, in other applications they could be applied to other functions, for example judging sensor values (judgment) and determining wheel speeds (processing) of a Khepera robot (by K-Team Corp.), or judging whether stocks rise or drop (judgment) and determining a buy or sell strategy (processing) in stock markets. (A minimal sketch of these components and the resulting node transition is given below.)
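As an illustration of these components, the following minimal Python sketch (our own, not the authors' implementation; class and field names are invented for clarity) encodes judgment and processing nodes as a directed graph and follows the connections exactly as described above:

```python
# Minimal sketch of a GNP-style directed graph (illustrative names, not the authors' code).
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Node:
    kind: str                      # "start", "judgment" or "processing"
    func: Callable = None          # judgment returns a result label; processing acts on the agent
    branches: Dict[str, int] = field(default_factory=dict)  # judgment: result label -> next node
    next_node: int = -1            # start/processing: single outgoing connection

def run(nodes: List[Node], agent, max_transitions: int = 10) -> None:
    """Follow the node connections starting from the start node (node 0)."""
    current = nodes[0].next_node                 # the start node only chooses the first node
    for _ in range(max_transitions):
        node = nodes[current]
        if node.kind == "judgment":
            result = node.func(agent)            # e.g. "A", "B", ... judgment result
            current = node.branches[result]      # branch selected by the judgment result
        else:                                    # processing node: act, then follow its one connection
            node.func(agent)
            current = node.next_node
```

Because a processing node simply points back into the graph instead of terminating, the same few nodes can be visited again and again, which is the re-usability property the text emphasizes.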

GNP evolves the graph structure with a predefined number of nodes, so it never causes bloat(2). In addition, GNP has the ability to use certain judgment/processing nodes repeatedly to achieve a task. Therefore, even if the number of nodes is predefined and small, GNP can perform well by making effective node connections based on re-using nodes. As a result, we do not have to prepare an excessive number of nodes. The compact structure of GNP is a quite important and distinguishing characteristic, because it contributes to saving memory consumption and calculation time.

3.1.2 Memory Function

The node transition begins from a start node, but there are no terminal nodes. After the start node, the current node is transferred according to the node connections and judgment results; in other words, the selection of the current node is influenced by the node transitions of the past. Therefore, the graph structure itself has an implicit memory function of the past agent actions. Although a judgment node is a conditional branch-decision function, a GNP program is not merely an aggregate of if-then rules, because it includes information about past judgment and processing. For example, in Fig. 1, after node 1 (processing node P_1) is executed, the next node becomes node 2 (judgment node J_2). Therefore, when the current node is node 2, we know that the previous processing was P_1.

The node transition of GNP ends when the end condition is satisfied, e.g., when the time step reaches the preassigned one or the GNP program completes the given task.

3.1.3 Time Delays

GNP has two kinds of time delays: the time delay GNP spends on judgment or processing, and the one it spends on node transitions. In real-world problems, when agents judge environments, prepare for actions and take actions, they need time. For example, when a man is walking and sees a puddle before him, he will avoid it. At that moment, it takes some time to judge the puddle (time delay of judgment), to put the judgment into action (time delay of the transition from judgment to processing) and to avoid the puddle (time delay of processing). Since the time delays are listed in each node gene and are unique attributes of each node, GNP can evolve flexible programs that take time delays into account.

In this paper, the time delay of each node transition is set at zero time units, that of each judgment node at one time unit, that of each processing node at five time units, and that of the start node at zero time units. In addition, one step of an agent's behavior is defined in such a way that the step ends when the agent has used five or more time units. Thus an agent performs fewer than five judgments and one processing, or five judgments, in one step. Suppose there are three agents (agent 0, agent 1, agent 2) in an environment. During one step, first agent 0 takes an action, then agent 1, and finally agent 2. In this way, the agents repeatedly take actions until reaching the maximum preassigned number of steps. (A small sketch of this step accounting follows.)

Another important role of the time delays and steps is to prevent the program from falling into deadlocks. For example, if an agent cannot execute processing because of a judgment loop, then one step ends after five judgments. Such a program is removed from the population in the evolutionary process, or its node transition is changed by the learning process of GNP-RL, as described later.

(2) Bloat is the phenomenon that the program size, i.e., the number of nodes, becomes too large as the generations go on.
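The step accounting can be sketched as follows (our own illustration, reusing the Node layout sketched in Section 3.1 and the delay values quoted above):

```python
# Sketch of the one-step time accounting described above (illustrative only).
JUDGMENT_DELAY, PROCESSING_DELAY, TRANSITION_DELAY = 1, 5, 0

def run_one_step(nodes, current, agent):
    """Execute nodes until >= 5 time units are used; return the node to resume from next step."""
    used = 0
    while used < 5:
        node = nodes[current]
        if node.kind == "judgment":
            used += JUDGMENT_DELAY
            result = node.func(agent)
            current = node.branches[result]
        else:                          # processing node
            used += PROCESSING_DELAY   # a single processing already fills the step
            node.func(agent)
            current = node.next_node
        used += TRANSITION_DELAY       # zero in this paper
    return current
```

With these values, one step holds at most four judgments plus one processing, or five judgments if the program stays inside a judgment loop, which is exactly the deadlock-limiting behavior described above.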

3.2 Gene Structure

The graph structure of GNP is determined by the combination of the following node genes. The genetic code of node i (0 ≤ i ≤ n-1)(3) is also shown in Fig. 1. K_i represents the node type: K_i = 0 means a start node, K_i = 1 means a judgment node and K_i = 2 means a processing node. ID_i represents the identification number of the node function; e.g., K_i = 1 and ID_i = 2 mean the node is J_2. d_i is the time delay spent on judgment or processing. C^A_i, C^B_i, ... show the node numbers connected from node i. d^A_i, d^B_i, ... are the time delays spent on the transition from node i to node C^A_i, C^B_i, ..., respectively. Judgment nodes determine the upper suffix of the connection genes to refer to depending on their judgment results. For example, if the judgment result is B, GNP refers to C^B_i and d^B_i. However, the start node and processing nodes use only C^A_i and d^A_i, because they have no conditional branch.

3.3 Initialization of a GNP Population

Fig. 2 shows the whole flowchart of GNP. An initial population is produced according to the following rules. First, we determine the number of each kind of node(4); therefore all programs in a population have the same number of nodes, and the nodes with the same node number have the same function. However, the extended algorithm GNP-RL, described later, determines the node functions automatically, so there we only need to determine the numbers of judgment nodes and processing nodes, e.g., 40 judgment nodes and 20 processing nodes. The connection genes C^A_i, C^B_i, ... are set at values selected randomly from {0, 1, ..., n-1} (except i, in order to avoid self-loops).

3.4 Run of a GNP Program

The node transition of GNP is based on C_i. If the current node i is a judgment node, GNP executes the judgment function ID_i and determines the next node using its result. For example, when the judgment result is B, the next node becomes C^B_i. When the current node is a processing node, after executing the processing function ID_i, the next node becomes C^A_i. (A sketch of this gene encoding and its initialization appears below.)

(3) Each node in a program has a unique node number from 0 to n-1 (n: total number of nodes).
(4) Five of each kind in this paper. It could be determined experimentally; however, in this paper, previous experience indicates that five nodes per kind (J_1, J_2, ..., P_1, P_2, ...) can keep a reasonable balance between expression ability and search speed.
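As a concrete (and purely illustrative) rendering of this encoding, each node gene can be stored as a record {K_i, ID_i, d_i} together with its connection genes, and the random initialization of Section 3.3 just fills every connection with a node number other than i. The five branch labels below are an assumption matching the five-valued judgments used later in this paper:

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class NodeGene:
    K: int                       # 0: start, 1: judgment, 2: processing
    ID: int                      # identification number of the node function
    d: int                       # time delay of judgment/processing
    C: Dict[str, int] = field(default_factory=dict)   # connection genes C_i^A, C_i^B, ...
    dC: Dict[str, int] = field(default_factory=dict)  # transition delays d_i^A, d_i^B, ...

def init_connections(genes: List[NodeGene], branch_labels=("A", "B", "C", "D", "E")) -> None:
    """Set every connection gene to a random node number other than the node itself."""
    n = len(genes)
    for i, g in enumerate(genes):
        labels = branch_labels if g.K == 1 else ("A",)     # start/processing: one branch only
        for lab in labels:
            g.C[lab] = random.choice([j for j in range(n) if j != i])
            g.dC[lab] = 0                                   # transition delay (zero in this paper)
```

Running a program then only requires looking up C^A_i (start and processing nodes) or the C^X_i selected by the judgment result X, as in the transition loop sketched earlier.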

Figure 2: Flowchart of the GNP system.

3.5 Genetic Operators

In each generation, the elite individuals are preserved and the rest of the individuals are replaced with new ones generated by crossover and mutation. In Simulation I (Section 4.3), first, 179 individuals are selected from the population by tournament selection(5) and their genes are changed by mutation. Then, 120 individuals are also selected from the population and their genes are exchanged by crossover. Finally, the 299 individuals generated by mutation and crossover and the one elite individual form the next population.

3.5.1 Mutation

Mutation is executed in one individual and a new one is generated [Fig. 3]. The procedure of mutation is as follows.

1. Select one individual using tournament selection and reproduce it as a parent.
2. Each connection of each node (C_i) is selected with probability P_m. Each selected C_i is changed to another value (node number) randomly.
3. The generated new individual becomes a member of the next generation.

3.5.2 Crossover

Crossover is executed between two parents and generates two offspring [Fig. 4]. The procedure of crossover is as follows.

1. Select two individuals using tournament selection twice and reproduce them as parents.
2. Each node i is selected as a crossover node with probability P_c.
3. The two parents exchange the genes of the corresponding crossover nodes, i.e., the nodes with the same node number.
4. The generated new individuals become members of the next generation.

Fig. 4 shows a crossover example of a graph structure with three processing nodes for simplicity. If GNP exchanges the genes of judgment nodes, it must exchange all the genes with suffixes A, B, C, ... simultaneously. (A sketch of both operators follows.)

(5) The calculation cost of tournament selection is relatively small, because it simply compares the fitness values of some individuals, and we can easily adjust the selection pressure via the tournament size. Thus, we use tournament selection in this paper. The tournament size is set at six.
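Both operators can be sketched in a few lines (our own illustration on top of the NodeGene record above; P_m and P_c are the per-connection and per-node selection probabilities of the text):

```python
import copy
import random

def mutate(parent, Pm: float):
    """Connection mutation: each connection gene is re-drawn with probability Pm."""
    child = copy.deepcopy(parent)                      # parent: list of NodeGene, indexed by node number
    n = len(child)
    for i, gene in enumerate(child):
        for label in gene.C:
            if random.random() < Pm:
                gene.C[label] = random.choice([j for j in range(n) if j != i])
    return child

def crossover(parent1, parent2, Pc: float):
    """Uniform node-wise crossover: nodes with the same node number swap their whole genes."""
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    for i in range(len(child1)):
        if random.random() < Pc:
            child1[i], child2[i] = child2[i], child1[i]   # swap all genes of node i (all suffixes)
    return child1, child2
```

Because an individual is a list indexed by node number, swapping whole node records exchanges all the suffixed connection genes of a judgment node at once, as required.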

Figure 3: Mutation.

3.6 Extended Algorithm: GNP with Reinforcement Learning (GNP-RL)

In this subsection, we propose an extended algorithm, GNP with Reinforcement Learning (GNP-RL). Standard GNP (SGNP), described in the previous section, is based on the general evolutionary framework of selection, crossover and mutation. GNP-RL is based on evolution and reinforcement learning (Sutton and Barto, 1998). The aim of combining RL with evolution is to produce programs using the current information (state and reward) during task execution. Evolution-based methods change their programs mainly after task execution or after enough trials, i.e., offline learning. On the other hand, GNP-RL can change its programs incrementally based on rewards obtained during task execution, i.e., online learning. For example, when an agent takes a good action with a positive reward at a certain state, the action is reinforced and will be adopted with higher probability when the state is visited again. Online learning is one of the advantages of GNP-RL. The other advantage is the combination of the diversified search of evolution and the intensified search of RL. The role of evolution is to make rough structures, i.e., plural paths of node transition, through selection, crossover and mutation. The role of RL is to determine one appropriate path in a structure made by evolution. Because RL is executed based on immediate rewards obtained after taking actions, intensified search, i.e., local search, can be executed efficiently. Evolution changes the programs more largely than RL, so the programs (solutions) can escape from local minima; this is why we call evolution a diversified search.

The nodes of GNP have a unique node number and the number of nodes (states) is the same in all the individuals. In addition, the crossover operator exchanges the nodes with the same node number. Therefore, large changes of the Q tables do not occur, and the knowledge obtained in the previous generation can be used effectively in the current generation.

3.6.1 Basic Structure of GNP-RL

Fig. 5 shows the basic structure of GNP-RL. The difference between GNP-RL and SGNP is whether or not plural functions exist in a node. Each node of SGNP has one function, but a node of GNP-RL has several functions, one of which is selected based on a policy. K_i represents the node type, which is the same as in SGNP. ID_ip (1 ≤ p ≤ m_i)(6) shows the identification number of the node function. In Fig. 5, m_i of all nodes is set at 2, i.e., GNP can select the node function ID_i1 or ID_i2. Q_ip is a Q value, which is assigned to each state and action pair.

(6) m_i (1 ≤ m_i ≤ M; M: maximum number of functions in a node, e.g., M = 4) shows the number of node functions GNP can select at the current node i. m_i is determined randomly at the beginning of the first generation, but it can be changed by mutation.

Figure 4: Crossover.

In reinforcement learning, state and action must be defined. Generally, the current state is determined by the combination of the current information, e.g., sensor inputs, and an action is an actual action the agent takes, e.g., go forward. However, in GNP-RL, the current state is defined as the current node, and the selection of a node function (ID_ip) is defined as an action. d_ip is the time delay spent on judgment or processing. C^A_ip, C^B_ip, ... show the node number of the next node. d^A_ip, d^B_ip, ... are the time delays spent on the transition from node i to node C^A_ip, C^B_ip, ..., respectively.

3.6.2 Run of GNP with Reinforcement Learning

The node transition of GNP-RL also starts from a start node and continues depending on the node connections and judgment results.

If the current node i is a judgment node, first one Q value is selected from Q_i1, ..., Q_imi based on the ε-greedy policy. That is, the maximum Q value among Q_i1, ..., Q_imi is selected with probability 1 - ε, or a random one is selected with probability ε; then the corresponding ID_ip is selected. GNP executes the selected judgment function ID_ip and determines the next node depending on the judgment result. For example, if the selected function is ID_i2 and the judgment result is B, the next node becomes node C^B_i2.

If the current node is a processing node, GNP selects and executes a processing function in the same way as for judgment nodes, and the next node becomes node C^A_i2 when the selected function is ID_i2.

Here, a concrete example of a node transition is explained using Fig. 6. The first node is judgment node 2, which holds the functions JF and TD (see Table 1). Suppose JF is selected based on the ε-greedy policy and the judgment result is D (= floor). Then the next node number becomes C^D_21 = 4. In node 4, the processing function MF is selected, so the agent moves forward, and the next node becomes node C_42 = 9. (A sketch of this ε-greedy function selection is given below.)
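A minimal sketch of this ε-greedy selection and of one GNP-RL transition follows (our own illustration; here each node stores parallel lists Q and funcs, plus per-function connection genes C[p], which are assumptions about the data layout rather than the authors' code):

```python
import random

def select_function(node, epsilon: float = 0.1) -> int:
    """Epsilon-greedy selection of a node function: return the index p of the chosen function."""
    if random.random() < epsilon:
        return random.randrange(len(node.Q))                      # explore: random function
    return max(range(len(node.Q)), key=lambda p: node.Q[p])       # exploit: function with max Q

def step(nodes, i, agent, epsilon: float = 0.1):
    """Execute the current node i of a GNP-RL program; return (next node, chosen function index)."""
    node = nodes[i]
    p = select_function(node, epsilon)
    if node.kind == "judgment":
        result = node.funcs[p](agent)             # judgment result, e.g. "A".."E"
        return node.C[p][result], p               # branch of function p selected by the result
    node.funcs[p](agent)                          # processing: act on the environment
    return node.C[p]["A"], p                      # processing nodes use only the single A branch
```

With ε = 0.1, the currently best function of a node is executed about 90% of the time while the remaining functions keep being explored.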

Figure 5: Basic structure of GNP with Reinforcement Learning (node gene and connection gene layout for judgment and processing nodes in the case m_i = 2).

3.6.3 Genetic Operators

The crossover operator in GNP-RL is the same as in SGNP, i.e., all the genes of the selected nodes are exchanged. However, GNP-RL has its own mutation operators. The procedure is as follows.

1. Select one individual using tournament selection and reproduce it as a parent.
2. Mutation operator: there are three kinds of mutation operators [Fig. 7], and one of them, selected uniformly, is executed.
   (a) Connection of functions: each node connection is re-connected to another node (C_ip is changed to another node number) with probability P_m.
   (b) Content of functions: each function is selected with probability P_m and changed to another function, i.e., ID_ip and d_ip are both changed.

Figure 6: An example of a node transition using the nodes for the Tileworld problem (JF: judge forward; TD: direction of the nearest tile from the agent; MF: move forward; TL: turn left).

   (c) Number of functions: each node i is selected with probability P_m, and the number of functions m_i is changed to 1, ..., or M randomly. If the revised m_i is larger than the previous m_i, then one or more new functions selected from the LIBRARY are added to the node so that the number of functions becomes the revised m_i. If the revised m_i is smaller, then one or more functions are deleted from the node.

3. The generated new individual becomes a member of the next generation.

3.7 Learning Phase

Reinforcement learning is carried out while the agents are carrying out their tasks, and it terminates when the time step reaches the predefined number of steps. The learning phase of GNP is based on the basic Sarsa algorithm (Sutton and Barto, 1998). Sarsa calculates Q values, which are functions of state s and action a. Q values estimate the sum of the discounted rewards obtained in the future. Suppose that an agent selects an action a_t at state s_t at time t, a reward r_t is obtained, and an action a_{t+1} is taken at the next state s_{t+1}. Then Q(s_t, a_t) is updated as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]   (1)

α is a step-size parameter, and γ is a discount rate which determines the present value of future rewards: a reward received k time steps later is worth only γ^(k-1) times the reward supposed to be received at the current step.

As described before, a state means the current node and an action means the selection of a function. Here the procedure for updating the Q values is explained using Fig. 8, which shows states, actions and an example of a node transition.

1. At time t, GNP refers to Q_i1, Q_i2, ..., Q_imi and selects one of them based on ε-greedy. Suppose that GNP selects Q_ip and the corresponding function ID_ip.
2. GNP executes the function ID_ip, gets the reward r_t, and the next node j becomes C_ip.
3. At time t+1, GNP selects one Q value in the same way as in step 1. Suppose that Q_jp' is selected.
4. The Q value is updated as follows:

   Q_ip ← Q_ip + α [r_t + γ Q_jp' - Q_ip]

5. t ← t+1, i ← j, p ← p', then return to step 2.

In this example, node i is a processing node, but if it is a judgment node, the next current node is selected among C^A_ip, C^B_ip, ... depending on the judgment result. (A sketch of this update loop appears below.)
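The update loop can be sketched as follows (our own illustration, not the authors' code; select_function is the ε-greedy helper sketched in Section 3.6.2, and env.execute is a hypothetical stand-in that runs the chosen function ID_ip and returns the reward and the next node C_ip, handling the judgment branching internally):

```python
def sarsa_episode(nodes, first_node, env, alpha=0.9, gamma=0.9, epsilon=0.1, max_steps=300):
    """Online Sarsa over a GNP-RL node transition (sketch).
    State = current node, action = choice of a function in that node."""
    i = first_node                               # node chosen by the start node
    p = select_function(nodes[i], epsilon)       # action a_t: function index p
    for _ in range(max_steps):
        reward, j = env.execute(nodes[i], p)     # run ID_ip, observe r_t and next node C_ip
        p_next = select_function(nodes[j], epsilon)
        # Sarsa update of Eq. (1): Q_ip <- Q_ip + alpha * (r_t + gamma * Q_jp' - Q_ip)
        nodes[i].Q[p] += alpha * (reward + gamma * nodes[j].Q[p_next] - nodes[i].Q[p])
        i, p = j, p_next
```

The Q values live inside the nodes themselves, so the state-action table is exactly as large as the graph, which is the compactness argument made in the introduction.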

Figure 7: Mutation of GNP with Reinforcement Learning (mutation of the connections, of the content of functions, and of the number of functions).

4 Simulations

To confirm the effectiveness of the proposed method, simulations determining the agents' behavior in the Tileworld problem (Pollack and Ringuette, 1990) are described in this section.

4.1 Tileworld

Tileworld is well known as a testbed for agent problems. Fig. 9 shows an example of a Tileworld, which is a 2D grid world containing multiple agents, obstacles, tiles, holes and floor cells. An agent can move to a contiguous cell in one step.

Figure 8: An example of a node transition (states s_t, s_{t+1}, actions a_t, a_{t+1} and rewards r_t, r_{t+1} along the transition from node i to node j = C_ip).

Moreover, an agent can push a tile to a neighboring cell, except when an obstacle or another agent occupies that cell. When a tile is dropped into a hole, the hole and the tile vanish, i.e., the hole is filled with the tile. Agents have some sensors and action abilities, and their aim is to drop as many tiles into holes as fast as possible. Therefore, agents are required to use their sensors and take actions properly according to their situations. Since the given sensors and simple actions are not enough to achieve the task directly, agents must make clever combinations of judgment and processing.

The nodes used by the agents are shown in Table 1. The judgment nodes {JF, JB, JL, JR} return {tile, hole, obstacle, floor, agent}, and {TD, HD, THD, STD} return {forward, backward, left, right, nothing} as judgment results, corresponding to A, B, ... in Fig. 1. Fig. 10 shows the four directions an agent can perceive when it faces north.

4.1.1 Fitness and Reward

A trial ends when the time step reaches the preassigned number of steps, and then the fitness is calculated. Fitness is used in the evolutionary processes and Reward is used in the learning phase of GNP-RL.

Fitness = the number of dropped tiles
Reward = 1 (when an agent drops a tile into a hole)

4.2 Simulation Conditions

The simulation conditions are shown in Table 2. For comparison, the simulations are carried out by SGNP, GNP-RL, standard GP, GP with ADFs, and EP evolving FSMs.

4.2.1 Conditions of GNP

As shown in Table 2, the number of nodes in a program is 61. In the case of SGNP, the number of each kind of judgment and processing node is fixed at five. In the case of GNP-RL, 40 judgment nodes and 20 processing nodes are used, but the numbers of the different kinds of nodes (ID_i) change through the evolution. At the first generation, all nodes have functions randomly selected from the LIBRARY, but the ID_i are exchanged by crossover and also changed by mutation; thus the appropriate kinds of nodes are selected as a result of evolution. (See the sketch below.)
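As an illustration of how the function set of Table 1 and the random initial assignment of Section 4.2.1 fit together, the following sketch (our own; the function entries are just names, since the actual sensing and acting code is environment-specific) builds the LIBRARY and assigns 1 to M randomly chosen functions with zero-initialized Q values to every GNP-RL node:

```python
import random

# Function LIBRARY of Table 1 (names only; sensing/acting implementations are environment-specific).
JUDGMENTS   = ["JF", "JB", "JL", "JR", "TD", "HD", "THD", "STD"]   # each returns one of 5 results
PROCESSINGS = ["MF", "TR", "TL", "ST"]                             # move forward, turn right/left, stay

def init_gnp_rl_nodes(n_judgment=40, n_processing=20, M=4):
    """Assign 1..M randomly chosen functions (with zero-initialized Q values) to every node."""
    nodes = []
    for kind, library, count in (("judgment", JUDGMENTS, n_judgment),
                                 ("processing", PROCESSINGS, n_processing)):
        for _ in range(count):
            m_i = random.randint(1, M)                        # number of functions in this node
            funcs = [random.choice(library) for _ in range(m_i)]
            nodes.append({"kind": kind, "IDs": funcs, "Q": [0.0] * m_i})
    return nodes
```

The fitness of an individual is then simply the number of tiles dropped during a trial, and a reward of 1 is passed to the learning phase each time a tile is dropped into a hole.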

Figure 9: Tileworld.
Figure 10: The four directions agents perceive (forward, backward, left and right relative to the facing direction; the example shows an agent facing north).

The numbers of elite individuals and of offspring generated by crossover and mutation are predefined. Simulation I uses the environment shown in Fig. 9, where the positions of the tiles and holes are fixed, so the same environment is used every generation. On the other hand, in Simulation II the positions of the tiles and holes are determined randomly, so the problem becomes more difficult and complex. In Simulation I, the best individual is preserved as the elite one, but in Simulation II five elite individuals are preserved, because the environment changes generation by generation. In fact, the best individual of the previous generation does not always show a good result in the current generation. Therefore, in order to make the performance of each method stable, we preserve five good individuals in Simulation II. In GNP-RL, when creating offspring in the evolution, the offspring inherit the Q values of their parents and use them as initial values. That is, the offspring generated by mutation have the same Q values as their parents, because mutation does not operate on Q values; hence the Q values of an elite individual carry over to the next generation. Furthermore, the offspring generated by crossover have the exchanged Q values of the parents. In addition, the three agents in the Tileworld share the Q values.

The crossover rate P_c and mutation rate P_m are determined appropriately through our experiments, so as to maintain the variation of the population without changing the programs too much. The settings of the parameters used in the learning phase of GNP-RL are as follows. The step-size parameter α is set at 0.9 in order to find solutions quickly, and the discount rate γ is set at 0.9 in order to sufficiently consider future rewards. ε is set at 0.1 experimentally, which balances exploitation and exploration. In fact, programs with a lower epsilon fall into local minima with higher probability, and those with a higher epsilon take too many random actions. M (the maximum number of functions in a node) is set at the best value among M = 2, 3 and 4.

4.2.2 Conditions of GP

We use GP as a decision maker, so the function nodes are used as if-then type branch-decision functions, and the terminal nodes are used as action-selection functions.

Table 1: Function set.

  Judgment nodes (J)
    J_1  JF   judge forward
    J_2  JB   judge backward
    J_3  JL   judge left side
    J_4  JR   judge right side
    J_5  TD   direction of the nearest tile from the agent
    J_6  HD   direction of the nearest hole from the agent
    J_7  THD  direction of the nearest hole from the nearest tile
    J_8  STD  direction of the second nearest tile from the agent

  Processing nodes (P)
    P_1  MF   move forward
    P_2  TR   turn right
    P_3  TL   turn left
    P_4  ST   stay

The terminal nodes of standard GP are composed of the processing nodes of GNP, {MF, TL, TR, ST}, and the function nodes are the judgment nodes of GNP, {JF, JB, JL, JR, TD, HD, THD, STD}. Terminal nodes have no arguments, and function nodes have five arguments corresponding to the judgment results. In the case of GP with ADFs, the main tree uses {ADF1, ..., ADF10}(7) as terminal nodes in addition to the terminal and function nodes of standard GP. The ADF trees use the same nodes as standard GP. The genetic operators of GP used in this paper are crossover (Poli and Langdon, 1998, 1997), mutation and inversion (Koza, 1992). In the simulations, the maximum depth of the trees is fixed in order to avoid bloat, but the setting of the maximum depth is very important, because the expression ability improves as the depth becomes larger, while the search space also increases. Therefore, we try various depths in the range permitted by machine memory and calculation time, and use the "full" and "ramped half-and-half" initialization methods (Koza, 1992) in order to produce trees with various sizes and shapes in the initial population.

4.2.3 Conditions of EP

EP uses the same sensor information as the judgment nodes of GNP, and its outputs are the same as the contents of the processing nodes. Generally, EP must define transitions and outputs for all combinations of states and inputs. Here, we discuss how the complexity of the EP and GNP programs differs depending on the problem. Table 3 shows the number of outputs/connections for each individual in EP and GNP, and Fig. 11 shows the number of outputs at each state/node. In case 1, there is only one sensor, which can distinguish two objects, and the number of states/nodes is 60(8). Then the number of outputs of EP becomes 120, that of SGNP becomes 100, and that of GNP-RL becomes 100-400 (variable).

(7) The number of ADFs in each individual is 10, and each ADF is called by the terminal nodes of the main tree.
(8) The start node of GNP is not counted because it has only one branch, determining the first judgment or processing node, and does not have any function. EP has a branch determining the first state, but it is not counted as an output.

Table 2: Simulation conditions ([ ]: conditions in Simulation II).

  Population size (number of individuals): 300 for every method, composed of:
    crossover offspring: 120 (GNP-RL, SGNP, GP, GP-ADFs); mutation offspring: 179 [175] (GNP-RL, SGNP), 119 [115] (GP, GP-ADFs), 299 [295] (EP); inversion offspring: 60 (GP, GP-ADFs); elite: 1 [5].
  Program size: 61 nodes including one start node (GNP-RL, SGNP); max tree depth 3-5 (GP); main tree depth 3-4 and ADF tree depth 2-3 (GP-ADFs); max number of states 6, 3, 5 and number of inputs 1-4 (EP).
  Crossover rate P_c: 0.1 (GNP-RL), 0.1 (SGNP).
  Mutation rate P_m: 0.1, 0.2 [0.1], 0.1 [0.1], 0.1 (in the method order above).
  Tournament size: 6.
  Other parameters (GNP-RL): α = 0.9, γ = 0.9, ε = 0.1, M = 4 [3].

Figure 11: The number of outputs from a node/state (in GNP, a processing node has one connection and a judgment node Y connections; in EP, each state has Y^X outputs).

However, as the number of sensors (inputs) and the number of objects each sensor can distinguish increase, the number of outputs of EP becomes exponentially large (case 2 to case 4). On the other hand, the number of connections in GNP does not become exponentially large, because each judgment node deals with only one sensor and does not need to consider all the combinations of the inputs (see the sketch below).

Case 4 in Table 3 shows the case of the Tileworld problem: the total number of outputs of an EP program is 23,437,500 (= 60 x 5^8). This is impractical to use, so we limit the number of sensors (inputs) used at each state to a certain number (see Table 2). However, which inputs are used at each state is a very important matter; thus, in order to find the necessary kind(s) of input(s), an additional mutation operator is introduced which can change the kind(s) of input(s) used at each state. Fig. 12 shows an example of an EP program using two states and one sensor at each state for simplicity. In this example, the input used at state 1 is the judgment result of JF and that of state 2 is the judgment result of TD. Then the number of transitions and outputs is five each, corresponding to the judgment results, and each output shows the next action the agent should take.
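The counting rule of Fig. 11 can be checked directly with a few lines (our own illustration; the printed numbers reproduce the case 1 and case 4 (Tileworld) entries discussed around Table 3):

```python
def ep_outputs(states: int, sensors: int, objects: int) -> int:
    """EP: every state enumerates all combinations of the sensor readings."""
    return states * objects ** sensors

def gnp_connections(judgment_nodes: int, processing_nodes: int, objects: int) -> int:
    """SGNP: each judgment node has one branch per object, each processing node one connection."""
    return judgment_nodes * objects + processing_nodes

# Case 1: one sensor distinguishing two objects, 60 states/nodes (40 judgment + 20 processing).
print(ep_outputs(60, 1, 2), gnp_connections(40, 20, 2))    # 120 100
# Case 4 (Tileworld): eight sensors, five distinguishable results each.
print(ep_outputs(60, 8, 5), gnp_connections(40, 20, 5))    # 23437500 220
```

For GNP-RL with up to M = 4 functions per node, the connection count grows by at most a factor of M, i.e., it stays linear in the number of sensors instead of exponential.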

Table 3: The relation between the number of inputs and outputs in EP and GNP (-: entry not legible in this transcription).

                        case 1                 case 2   case 3   case 4 (Tileworld)
  X                     1                      -        -        8
  Y                     2                      -        -        5
  Z: EP                 120                    -        -        23,437,500
  Z: SGNP               100                    -        -        220
  Z: GNP-RL (M = 4)     100-400 (variable)     -        -        220-880 (variable)

  X: the number of inputs (sensors); Y: the number of objects each sensor can distinguish; Z: the total number of outputs (connections). The number of states/nodes is 60 (judgment nodes: 40, processing nodes: 20 in the case of GNP).

Figure 12: An example of an EP program (two states, one input dealt with at each state; transitions are labeled "judgment result / action", e.g., tile/MF, hole/TR, left/TL).

If the number of inputs at each state is two instead of one, the number of transitions and outputs becomes 25 each.

At the beginning of the first generation, the predefined number of sensors is assigned to each state randomly, but the types of sensors are changed by the mutation operator as the generations go on. As shown in Table 2, the number of sensors (inputs) and the maximum number of states are set at several values in order to find good settings for EP. The mutation operators used in the simulations are {ADD STATE, REMOVE STATE, CHANGE TRANSITION, CHANGE OUTPUT, CHANGE INITIAL STATE, CHANGE INPUT}. The last operator is the additional one adopted in this paper; the others are the same as those used in Angeline (1994).

4.3 Simulation I

Simulation I uses the environment shown in Fig. 9, where there are 30 tiles and 30 holes. The three agents have the same program made by GNP, GP or EP. In this environment, since each input to the judgment nodes is not complete information able to distinguish the various situations, each method is required to judge its situation and take actions properly by combining various kinds of judgment and processing nodes. The maximum number of time steps is set at 150.

Table 4: The fitness of the best individuals at the last generation in Simulation I: average, standard deviation, and one-sided t-test p-values of GNP-RL and of SGNP against the other methods (GNP-RL, SGNP, GP-ADFs, GP, EP). The numeric entries are not legible in this transcription.

Figs. 13, 14 and 15 show the fitness curves of the best individuals at each generation, averaged over 30 independent simulations. From the results, GNP-RL shows the best fitness value at generation 5,000. In the early generations, SGNP exhibited better fitness, because the Q values of GNP-RL are set at zero in the first generation and must be updated gradually. However, GNP-RL produces the better result in the later generations thanks to appropriately learned Q values.

Table 4 shows the averaged fitness values of the best individuals over the 30 simulations(9) at the last generation, their standard deviations, and the results of a t-test (one-sided test). The results of the t-test show the p-values between GNP-RL and the other methods, and between SGNP and the other methods. There are significant differences between GNP-RL and the other methods, and between SGNP and GP, GP with ADFs, and EP. Although it seems natural that the method using RL can obtain better solutions than the other methods without it, the aim of developing GNP-RL is to solve problems faster than the others within the same time limit of actions (the same number of steps). In other words, GNP-RL aims to make full use of the information obtained during task execution in order to make appropriate node transitions.

From Fig. 14, we see that standard GP of depth four, initialized by the full method (GP-full4), shows better results than the other standard GP programs, and GP with ADFs of depth three (main tree) and depth two (ADF tree) initialized by the full method (GP-ADF-full3-2) produces the best result of all the GP programs. However, in this problem the arity of the function nodes of GP is relatively large (five), so the total number of nodes of GP becomes quite large as the depth increases. For example, GP-full4 has 781 nodes, GP-full5 has 3,906 nodes, and GP-full6 has 19,531 nodes. Although GP programs can have higher expression ability as the number of nodes increases, they take much time to explore and much memory is needed. For example, GP-full6 takes too much time to execute the programs, and GP (depth 7) cannot be executed because of the lack of memory in our machine (Pentium 4, 2.5 GHz, DDR-SDRAM PC2100, 512 MB). On the other hand, GNP can obtain good results using a relatively small number of nodes.

As shown in Fig. 15, EP using three inputs and five states shows better results, so this setting is suitable for the environment. EP uses a graph structure, so it can also execute state transitions considering the past agents' actions. Furthermore, as the number of states increases, EP can implicitly memorize longer past action sequences. However, if there are many inputs, this causes a large number of outputs and state-transition rules, and the programs then become impractical to explore and execute. The structure of GNP does not become exponentially large even if the number of inputs increases, as described in Section 4.2.3; therefore many more states (nodes) can be used in GNP than in EP. As a result, the implicit memory function of GNP becomes more effective in dynamic environments than that of EP.

(9) The results of the best settings, i.e., GP-full4, GP-ADF-full3-2 and EP-input3-state5. Tables 5, 6 and 7 also show the results of the best settings.

Figure 13: Fitness curves of GNP in Simulation I (fitness at the last generation: GNP-RL 21.23, SGNP ≈18.0).

Figure 14: Fitness curves of GP in Simulation I (fitness at the last generation: GP-ADF-full3-2 15.43, GP-ADF-full4-3 14.46, GP-full4 ≈14.0, GP-ADF-ramp3-2 13.86, GP-full5 13.76, GP-ADF-ramp4-3 ≈13.5, with GP-ramp4 and GP-ramp5 lowest).

Figure 15: Fitness curves of EP in Simulation I (fitness at the last generation: EP-input3-state5 ≈16.3, EP-input4-state5 14.93, EP-input2-state3 ≈13.7, EP-input1-state6 ≈13.3).

Table 5: Calculation time for 5,000 generations in Simulation I.

  Method                 GNP-RL   SGNP     GP-ADFs   GP      EP
  Calculation time [s]   1,364    ≈1,190   3,252     3,000   (not legible)
  Ratio to SGNP          (values not legible)

Figure 16: Change of the average number of functions m_i in each node of GNP-RL over the generations.

Table 5 shows the calculation time for 5,000 generations. SGNP is the fastest, GNP-RL is second and EP is third. GNP-RL takes more time than SGNP because it executes RL during the tasks; however, it does not take that much more time. The maximum number of functions (M) in each node is 4 and one of them is selected as an action; this procedure does not take much time. In addition, and more importantly, m_i tends to decrease as the generations go on, as shown in Fig. 16 (1.75 at the last generation), because the appropriate number and contents of the functions are selected automatically in the evolutionary process. Therefore, reinforcement learning just selects one function from 1.75 functions on average. This tendency contributes to saving time. In addition, the relatively large epsilon (= 0.1) succeeded in achieving the tasks thanks to this tendency, because fewer than two functions (actions) remain in a node on average at the last generation, while the relatively large epsilon is useful at the beginning of the generations, because the agents can try many kinds of actions and find good ones by RL. EP takes more time than GNP and GNP-RL, but it saves calculation time compared with ordinary EP, because the number of inputs is limited to three and the structure becomes compact. In fact, the calculation time of EP using four inputs is 4,184 seconds. GP and GP with ADFs have many nodes, thus they take more time than the others in the evolutionary processes.

Fig. 17(a) shows the typical node transition of the upper-left agent (in Fig. 9) operated by SGNP. The x-axis shows the symbols of the nodes, and the y-axis distinguishes the nodes of the same kind, i.e., there are five nodes per kind/symbol(10), numbered 0, 1, 2, 3 and 4. For example, (x, y) = (4, 1) denotes the second JF node. From the figure, we can see that specific nodes are repeatedly used.

(10) Five JF nodes, five JB nodes, ..., five MF nodes, five TL nodes, ... are used in SGNP.

Figure 17: Node transition of standard GNP. (a) The whole node transition for 150 steps (x-axis: node symbols MF, TL, TR, ST, JF, JB, JL, JR, TD, HD, THD, STD; y-axis: index 0-4 among the five nodes of each kind). (b) A partial node transition extracted from (a), annotated with judgment results and rewards.

Fig. 17(b) shows a partial node transition extracted from the whole node transition [Fig. 17(a)]. In Fig. 17(b), the first node is MF(0,1), so the agent moves forward, and the next node becomes JF(4,1) according to the node connection from MF(0,1). Thus the points (0,1) and (4,1) in Fig. 17(a) are connected with a line. Next, the judgment JF(4,1) is executed and the judgment result is "floor", so the corresponding node branch (connected to TD(8,0)) is selected. Then the points (4,1) and (8,0) in Fig. 17(a) are connected. After executing the judgment TD(8,0) (judgment result: forward), the agent goes forward, judges forward (judgment result: obstacle), judges the tile direction (judgment result: right), and so on.

Finally, the simulations using Environments A, B and C [Figs. 18, 19 and 20] are carried out. The condition of each method is the same as the one showing the best result in the previous environment [Fig. 9]. From Figs. 21, 22 and 23, GNP-RL and SGNP show better results than the other methods.

Figure 18: Environment A. Figure 19: Environment B. Figure 20: Environment C.

4.4 Simulation II

In Simulation II, we use an environment whose size (20x20) and distribution of obstacles are the same as in Simulation I. However, 20 tiles and 20 holes are set at random positions at the beginning of each generation. In addition, when an agent drops a tile into a hole, the tile and the hole disappear, but a new tile and a new hole appear at random positions. Therefore, the individuals obtained in the previous generation are required to show good performance in a new, unexperienced environment. This problem is more dynamic and more suitable than Simulation I in terms of confirming the generalization ability of each method. The maximum number of steps is set at 300.

Figs. 24, 25(11) and 26 show the fitness curves of the best individuals averaged over 30 independent simulations at each generation. From the figures, we can see that GNP-RL obtains the highest fitness value at the last generation, because the information obtained during task execution is used to make the node transitions efficiently. From Table 6, we can also see that there are significant differences between GNP-RL and the other methods. SGNP obtains a better fitness value than GP and GP-ADFs at the last generation but, from Table 6, it is found that there is no significant difference between SGNP and EP-input1-state6. In the case of EP, it is interesting to note that the programs in Simulation II show the opposite results to those in Simulation I, i.e., the program using one input shows better results in Simulation II, while the one using three inputs shows better results in Simulation I. Therefore, for EP in this environment, it is recommended that an action be determined by one input and that a relatively large number of states be used. In other words, this EP makes many simple rules and combines them considering the past state transitions. This special structure of EP is similar to that of GNP. However, the advantage of GNP is that it automatically selects the necessary number of inputs and actions depending on the situation; moreover, GNP programs with 61 nodes show good results in both Simulation I and II, so we do not have to worry about the setting of the number of nodes. In fact, there are significant differences between SGNP and EP-input3-state5 (which shows the best fitness value in Simulation I), EP-input2-state3, and EP-input4-state5.

In the case of GP, it is difficult to find effective programs, because the environment changes randomly generation by generation. In addition, GP has relatively complex structures and a wide search space compared to GNP and EP, thus it is more difficult for GP to explore solutions.

(11) Fig. 25 shows the fitness curves of GP-full5 and GP-ADF-full3-2, and the fitness values of the other settings at the last generation. Because the fitness curves of the GP settings overlap each other, only the best two results (GP-full5 and GP-ADF-full3-2) are shown.


More information

Applied Cloning Techniques for a Genetic Algorithm Used in Evolvable Hardware Design

Applied Cloning Techniques for a Genetic Algorithm Used in Evolvable Hardware Design Applied Cloning Techniques for a Genetic Algorithm Used in Evolvable Hardware Design Viet C. Trinh vtrinh@isl.ucf.edu Gregory A. Holifield greg.holifield@us.army.mil School of Electrical Engineering and

More information

Revision of a Floating-Point Genetic Algorithm GENOCOP V for Nonlinear Programming Problems

Revision of a Floating-Point Genetic Algorithm GENOCOP V for Nonlinear Programming Problems 4 The Open Cybernetics and Systemics Journal, 008,, 4-9 Revision of a Floating-Point Genetic Algorithm GENOCOP V for Nonlinear Programming Problems K. Kato *, M. Sakawa and H. Katagiri Department of Artificial

More information

Learning Composite Operators for Object Detection

Learning Composite Operators for Object Detection Real-World Applications Learning Composite Operators for Object Detection Bir Bhanu and Yingqiang Lin Center for Research in Intelligent Systems University of California, Riverside, CA, 92521, USA Email:

More information

Review: Final Exam CPSC Artificial Intelligence Michael M. Richter

Review: Final Exam CPSC Artificial Intelligence Michael M. Richter Review: Final Exam Model for a Learning Step Learner initially Environm ent Teacher Compare s pe c ia l Information Control Correct Learning criteria Feedback changed Learner after Learning Learning by

More information

The Genetic Algorithm for finding the maxima of single-variable functions

The Genetic Algorithm for finding the maxima of single-variable functions Research Inventy: International Journal Of Engineering And Science Vol.4, Issue 3(March 2014), PP 46-54 Issn (e): 2278-4721, Issn (p):2319-6483, www.researchinventy.com The Genetic Algorithm for finding

More information

Genetic Programming: A study on Computer Language

Genetic Programming: A study on Computer Language Genetic Programming: A study on Computer Language Nilam Choudhary Prof.(Dr.) Baldev Singh Er. Gaurav Bagaria Abstract- this paper describes genetic programming in more depth, assuming that the reader is

More information

A New Selection Operator - CSM in Genetic Algorithms for Solving the TSP

A New Selection Operator - CSM in Genetic Algorithms for Solving the TSP A New Selection Operator - CSM in Genetic Algorithms for Solving the TSP Wael Raef Alkhayri Fahed Al duwairi High School Aljabereyah, Kuwait Suhail Sami Owais Applied Science Private University Amman,

More information

Automata Construct with Genetic Algorithm

Automata Construct with Genetic Algorithm Automata Construct with Genetic Algorithm Vít Fábera Department of Informatics and Telecommunication, Faculty of Transportation Sciences, Czech Technical University, Konviktská 2, Praha, Czech Republic,

More information

GENETIC NETWORK PROGRAMMING-REINFORCEMENT LEARNING BASED SAFE AND SMOOTH MOBILE ROBOT NAVIGATION IN UNKNOWN DYNAMIC ENVIRONMENTS

GENETIC NETWORK PROGRAMMING-REINFORCEMENT LEARNING BASED SAFE AND SMOOTH MOBILE ROBOT NAVIGATION IN UNKNOWN DYNAMIC ENVIRONMENTS GENETIC NETWORK PROGRAMMING-REINFORCEMENT LEARNING BASED SAFE AND SMOOTH MOBILE ROBOT NAVIGATION IN UNKNOWN DYNAMIC ENVIRONMENTS 1,2 AHMED H. M. FINDI, 3 MOHAMMAD H. MARHABAN, 4 RAJA KAMIL, 5 MOHD KHAIR

More information

Experimental Comparison of Different Techniques to Generate Adaptive Sequences

Experimental Comparison of Different Techniques to Generate Adaptive Sequences Experimental Comparison of Different Techniques to Generate Adaptive Sequences Carlos Molinero 1, Manuel Núñez 1 and Robert M. Hierons 2 1 Departamento de Sistemas Informáticos y Computación, Universidad

More information

Ensemble Image Classification Method Based on Genetic Image Network

Ensemble Image Classification Method Based on Genetic Image Network Ensemble Image Classification Method Based on Genetic Image Network Shiro Nakayama, Shinichi Shirakawa, Noriko Yata and Tomoharu Nagao Graduate School of Environment and Information Sciences, Yokohama

More information

Genetic Algorithms for Vision and Pattern Recognition

Genetic Algorithms for Vision and Pattern Recognition Genetic Algorithms for Vision and Pattern Recognition Faiz Ul Wahab 11/8/2014 1 Objective To solve for optimization of computer vision problems using genetic algorithms 11/8/2014 2 Timeline Problem: Computer

More information

CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM

CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM 20 CHAPTER 2 CONVENTIONAL AND NON-CONVENTIONAL TECHNIQUES TO SOLVE ORPD PROBLEM 2.1 CLASSIFICATION OF CONVENTIONAL TECHNIQUES Classical optimization methods can be classified into two distinct groups:

More information

Computational Intelligence

Computational Intelligence Computational Intelligence Module 6 Evolutionary Computation Ajith Abraham Ph.D. Q What is the most powerful problem solver in the Universe? ΑThe (human) brain that created the wheel, New York, wars and

More information

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS

CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS CHAPTER 6 REAL-VALUED GENETIC ALGORITHMS 6.1 Introduction Gradient-based algorithms have some weaknesses relative to engineering optimization. Specifically, it is difficult to use gradient-based algorithms

More information

Reinforcement Learning (2)

Reinforcement Learning (2) Reinforcement Learning (2) Bruno Bouzy 1 october 2013 This document is the second part of the «Reinforcement Learning» chapter of the «Agent oriented learning» teaching unit of the Master MI computer course.

More information

Santa Fe Trail Problem Solution Using Grammatical Evolution

Santa Fe Trail Problem Solution Using Grammatical Evolution 2012 International Conference on Industrial and Intelligent Information (ICIII 2012) IPCSIT vol.31 (2012) (2012) IACSIT Press, Singapore Santa Fe Trail Problem Solution Using Grammatical Evolution Hideyuki

More information

Genetic algorithms and finite element coupling for mechanical optimization

Genetic algorithms and finite element coupling for mechanical optimization Computer Aided Optimum Design in Engineering X 87 Genetic algorithms and finite element coupling for mechanical optimization G. Corriveau, R. Guilbault & A. Tahan Department of Mechanical Engineering,

More information

Genetic Image Network for Image Classification

Genetic Image Network for Image Classification Genetic Image Network for Image Classification Shinichi Shirakawa, Shiro Nakayama, and Tomoharu Nagao Graduate School of Environment and Information Sciences, Yokohama National University, 79-7, Tokiwadai,

More information

Artificial Intelligence

Artificial Intelligence Artificial Intelligence Informed Search and Exploration Chapter 4 (4.3 4.6) Searching: So Far We ve discussed how to build goal-based and utility-based agents that search to solve problems We ve also presented

More information

Optimization of Association Rule Mining through Genetic Algorithm

Optimization of Association Rule Mining through Genetic Algorithm Optimization of Association Rule Mining through Genetic Algorithm RUPALI HALDULAKAR School of Information Technology, Rajiv Gandhi Proudyogiki Vishwavidyalaya Bhopal, Madhya Pradesh India Prof. JITENDRA

More information

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES

CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES CHAPTER 6 HYBRID AI BASED IMAGE CLASSIFICATION TECHNIQUES 6.1 INTRODUCTION The exploration of applications of ANN for image classification has yielded satisfactory results. But, the scope for improving

More information

International Journal of Digital Application & Contemporary research Website: (Volume 1, Issue 7, February 2013)

International Journal of Digital Application & Contemporary research Website:   (Volume 1, Issue 7, February 2013) Performance Analysis of GA and PSO over Economic Load Dispatch Problem Sakshi Rajpoot sakshirajpoot1988@gmail.com Dr. Sandeep Bhongade sandeepbhongade@rediffmail.com Abstract Economic Load dispatch problem

More information

Simultaneous Optimization of a Wheeled Mobile Robot Structure and a Control Parameter

Simultaneous Optimization of a Wheeled Mobile Robot Structure and a Control Parameter Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Simultaneous Optimization of a Wheeled Mobile

More information

GENETIC ALGORITHM with Hands-On exercise

GENETIC ALGORITHM with Hands-On exercise GENETIC ALGORITHM with Hands-On exercise Adopted From Lecture by Michael Negnevitsky, Electrical Engineering & Computer Science University of Tasmania 1 Objective To understand the processes ie. GAs Basic

More information

Introduction to Genetic Algorithms. Based on Chapter 10 of Marsland Chapter 9 of Mitchell

Introduction to Genetic Algorithms. Based on Chapter 10 of Marsland Chapter 9 of Mitchell Introduction to Genetic Algorithms Based on Chapter 10 of Marsland Chapter 9 of Mitchell Genetic Algorithms - History Pioneered by John Holland in the 1970s Became popular in the late 1980s Based on ideas

More information

Escaping Local Optima: Genetic Algorithm

Escaping Local Optima: Genetic Algorithm Artificial Intelligence Escaping Local Optima: Genetic Algorithm Dae-Won Kim School of Computer Science & Engineering Chung-Ang University We re trying to escape local optima To achieve this, we have learned

More information

Homework 2: Search and Optimization

Homework 2: Search and Optimization Scott Chow ROB 537: Learning Based Control October 16, 2017 Homework 2: Search and Optimization 1 Introduction The Traveling Salesman Problem is a well-explored problem that has been shown to be NP-Complete.

More information

Evolving SQL Queries for Data Mining

Evolving SQL Queries for Data Mining Evolving SQL Queries for Data Mining Majid Salim and Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK {msc30mms,x.yao}@cs.bham.ac.uk Abstract. This paper

More information

Global Optimal Analysis of Variant Genetic Operations in Solar Tracking

Global Optimal Analysis of Variant Genetic Operations in Solar Tracking Australian Journal of Basic and Applied Sciences, 6(6): 6-14, 2012 ISSN 1991-8178 Global Optimal Analysis of Variant Genetic Operations in Solar Tracking D.F.Fam, S.P. Koh, S.K. Tiong, K.H.Chong Department

More information

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding

Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding e Scientific World Journal, Article ID 746260, 8 pages http://dx.doi.org/10.1155/2014/746260 Research Article Path Planning Using a Hybrid Evolutionary Algorithm Based on Tree Structure Encoding Ming-Yi

More information

Automated Test Data Generation and Optimization Scheme Using Genetic Algorithm

Automated Test Data Generation and Optimization Scheme Using Genetic Algorithm 2011 International Conference on Software and Computer Applications IPCSIT vol.9 (2011) (2011) IACSIT Press, Singapore Automated Test Data Generation and Optimization Scheme Using Genetic Algorithm Roshni

More information

Improving Prediction Power in Simulation of Brassiere-Wearing Figures

Improving Prediction Power in Simulation of Brassiere-Wearing Figures Improving Prediction Power in Simulation of Brassiere-Wearing Figures Choi Dong-Eun*, Nakamura Kensuke** and Kurokawa Takao** * Venture Laboratory, Kyoto Institute of Technology **Graduate School of Science

More information

Ant colony optimization with genetic operations

Ant colony optimization with genetic operations Automation, Control and Intelligent Systems ; (): - Published online June, (http://www.sciencepublishinggroup.com/j/acis) doi:./j.acis.. Ant colony optimization with genetic operations Matej Ciba, Ivan

More information

Chapter 5 Components for Evolution of Modular Artificial Neural Networks

Chapter 5 Components for Evolution of Modular Artificial Neural Networks Chapter 5 Components for Evolution of Modular Artificial Neural Networks 5.1 Introduction In this chapter, the methods and components used for modular evolution of Artificial Neural Networks (ANNs) are

More information

Marco Wiering Intelligent Systems Group Utrecht University

Marco Wiering Intelligent Systems Group Utrecht University Reinforcement Learning for Robot Control Marco Wiering Intelligent Systems Group Utrecht University marco@cs.uu.nl 22-11-2004 Introduction Robots move in the physical environment to perform tasks The environment

More information

COMP SCI 5401 FS Iterated Prisoner s Dilemma: A Coevolutionary Genetic Programming Approach

COMP SCI 5401 FS Iterated Prisoner s Dilemma: A Coevolutionary Genetic Programming Approach COMP SCI 5401 FS2017 - Iterated Prisoner s Dilemma: A Coevolutionary Genetic Programming Approach Daniel Tauritz, Ph.D. November 17, 2017 Synopsis The goal of this assignment set is for you to become familiarized

More information

A TSK-Type Recurrent Fuzzy Network for Dynamic Systems Processing by Neural Network and Genetic Algorithms

A TSK-Type Recurrent Fuzzy Network for Dynamic Systems Processing by Neural Network and Genetic Algorithms IEEE TRANSACTIONS ON FUZZY SYSTEMS, VOL. 10, NO. 2, APRIL 2002 155 A TSK-Type Recurrent Fuzzy Network for Dynamic Systems Processing by Neural Network and Genetic Algorithms Chia-Feng Juang, Member, IEEE

More information

Using Genetic Programming to Evolve Robot Behaviours

Using Genetic Programming to Evolve Robot Behaviours Proceedings of the 3rd British Conference on Autonomous Mobile Robotics & Autonomous Systems, Manchester, 5th April 2001 Using Genetic Programming to Evolve Robot Behaviours Christopher Lazarus and Huosheng

More information

Genetic Algorithms. Kang Zheng Karl Schober

Genetic Algorithms. Kang Zheng Karl Schober Genetic Algorithms Kang Zheng Karl Schober Genetic algorithm What is Genetic algorithm? A genetic algorithm (or GA) is a search technique used in computing to find true or approximate solutions to optimization

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Approximation Algorithms and Heuristics November 21, 2016 École Centrale Paris, Châtenay-Malabry, France Dimo Brockhoff Inria Saclay Ile-de-France 2 Exercise: The Knapsack

More information

Genetic Algorithms. PHY 604: Computational Methods in Physics and Astrophysics II

Genetic Algorithms. PHY 604: Computational Methods in Physics and Astrophysics II Genetic Algorithms Genetic Algorithms Iterative method for doing optimization Inspiration from biology General idea (see Pang or Wikipedia for more details): Create a collection of organisms/individuals

More information

3.6.2 Generating admissible heuristics from relaxed problems

3.6.2 Generating admissible heuristics from relaxed problems 3.6.2 Generating admissible heuristics from relaxed problems To come up with heuristic functions one can study relaxed problems from which some restrictions of the original problem have been removed The

More information

Introduction to Optimization

Introduction to Optimization Introduction to Optimization Approximation Algorithms and Heuristics November 6, 2015 École Centrale Paris, Châtenay-Malabry, France Dimo Brockhoff INRIA Lille Nord Europe 2 Exercise: The Knapsack Problem

More information

1 Lab 5: Particle Swarm Optimization

1 Lab 5: Particle Swarm Optimization 1 Lab 5: Particle Swarm Optimization This laboratory requires the following: (The development tools are installed in GR B0 01 already): C development tools (gcc, make, etc.) Webots simulation software

More information

Path Planning for a Robot Manipulator based on Probabilistic Roadmap and Reinforcement Learning

Path Planning for a Robot Manipulator based on Probabilistic Roadmap and Reinforcement Learning 674 International Journal Jung-Jun of Control, Park, Automation, Ji-Hun Kim, and and Systems, Jae-Bok vol. Song 5, no. 6, pp. 674-680, December 2007 Path Planning for a Robot Manipulator based on Probabilistic

More information

Evolutionary Computation Algorithms for Cryptanalysis: A Study

Evolutionary Computation Algorithms for Cryptanalysis: A Study Evolutionary Computation Algorithms for Cryptanalysis: A Study Poonam Garg Information Technology and Management Dept. Institute of Management Technology Ghaziabad, India pgarg@imt.edu Abstract The cryptanalysis

More information

A Fuzzy Reinforcement Learning for a Ball Interception Problem

A Fuzzy Reinforcement Learning for a Ball Interception Problem A Fuzzy Reinforcement Learning for a Ball Interception Problem Tomoharu Nakashima, Masayo Udo, and Hisao Ishibuchi Department of Industrial Engineering, Osaka Prefecture University Gakuen-cho 1-1, Sakai,

More information

THE EFFECT OF SEGREGATION IN NON- REPEATED PRISONER'S DILEMMA

THE EFFECT OF SEGREGATION IN NON- REPEATED PRISONER'S DILEMMA THE EFFECT OF SEGREGATION IN NON- REPEATED PRISONER'S DILEMMA Thomas Nordli University of South-Eastern Norway, Norway ABSTRACT This article consolidates the idea that non-random pairing can promote the

More information

Comparison of Evolutionary Multiobjective Optimization with Reference Solution-Based Single-Objective Approach

Comparison of Evolutionary Multiobjective Optimization with Reference Solution-Based Single-Objective Approach Comparison of Evolutionary Multiobjective Optimization with Reference Solution-Based Single-Objective Approach Hisao Ishibuchi Graduate School of Engineering Osaka Prefecture University Sakai, Osaka 599-853,

More information

Clustering Analysis of Simple K Means Algorithm for Various Data Sets in Function Optimization Problem (Fop) of Evolutionary Programming

Clustering Analysis of Simple K Means Algorithm for Various Data Sets in Function Optimization Problem (Fop) of Evolutionary Programming Clustering Analysis of Simple K Means Algorithm for Various Data Sets in Function Optimization Problem (Fop) of Evolutionary Programming R. Karthick 1, Dr. Malathi.A 2 Research Scholar, Department of Computer

More information

Evolutionary Computation. Chao Lan

Evolutionary Computation. Chao Lan Evolutionary Computation Chao Lan Outline Introduction Genetic Algorithm Evolutionary Strategy Genetic Programming Introduction Evolutionary strategy can jointly optimize multiple variables. - e.g., max

More information

Boolean network robotics

Boolean network robotics Boolean network robotics An example of ongoing robotics research Andrea Roli andrea.roli@unibo.it DISI - Dept. of Computer Science and Engineering Campus of Cesena Alma Mater Studiorum Università di Bologna

More information

The k-means Algorithm and Genetic Algorithm

The k-means Algorithm and Genetic Algorithm The k-means Algorithm and Genetic Algorithm k-means algorithm Genetic algorithm Rough set approach Fuzzy set approaches Chapter 8 2 The K-Means Algorithm The K-Means algorithm is a simple yet effective

More information

CHAPTER 4 GENETIC ALGORITHM

CHAPTER 4 GENETIC ALGORITHM 69 CHAPTER 4 GENETIC ALGORITHM 4.1 INTRODUCTION Genetic Algorithms (GAs) were first proposed by John Holland (Holland 1975) whose ideas were applied and expanded on by Goldberg (Goldberg 1989). GAs is

More information

A Method Based Genetic Algorithm for Pipe Routing Design

A Method Based Genetic Algorithm for Pipe Routing Design 5th International Conference on Advanced Engineering Materials and Technology (AEMT 2015) A Method Based Genetic Algorithm for Pipe Routing Design Changtao Wang 1, a, Xiaotong Sun 2,b,Tiancheng Yuan 3,c

More information

A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS

A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS A GENETIC ALGORITHM FOR CLUSTERING ON VERY LARGE DATA SETS Jim Gasvoda and Qin Ding Department of Computer Science, Pennsylvania State University at Harrisburg, Middletown, PA 17057, USA {jmg289, qding}@psu.edu

More information

Applying genetic algorithm on power system stabilizer for stabilization of power system

Applying genetic algorithm on power system stabilizer for stabilization of power system Applying genetic algorithm on power system stabilizer for stabilization of power system 1,3 Arnawan Hasibuan and 2,3 Syafrudin 1 Engineering Department of Malikussaleh University, Lhokseumawe, Indonesia;

More information

Using Genetic Algorithms in Integer Programming for Decision Support

Using Genetic Algorithms in Integer Programming for Decision Support Doi:10.5901/ajis.2014.v3n6p11 Abstract Using Genetic Algorithms in Integer Programming for Decision Support Dr. Youcef Souar Omar Mouffok Taher Moulay University Saida, Algeria Email:Syoucef12@yahoo.fr

More information

Gauss-Sigmoid Neural Network

Gauss-Sigmoid Neural Network Gauss-Sigmoid Neural Network Katsunari SHIBATA and Koji ITO Tokyo Institute of Technology, Yokohama, JAPAN shibata@ito.dis.titech.ac.jp Abstract- Recently RBF(Radial Basis Function)-based networks have

More information

Solving Traveling Salesman Problem Using Parallel Genetic. Algorithm and Simulated Annealing

Solving Traveling Salesman Problem Using Parallel Genetic. Algorithm and Simulated Annealing Solving Traveling Salesman Problem Using Parallel Genetic Algorithm and Simulated Annealing Fan Yang May 18, 2010 Abstract The traveling salesman problem (TSP) is to find a tour of a given number of cities

More information

Previous Lecture Genetic Programming

Previous Lecture Genetic Programming Genetic Programming Previous Lecture Constraint Handling Penalty Approach Penalize fitness for infeasible solutions, depending on distance from feasible region Balanace between under- and over-penalization

More information

Radio Network Planning with Combinatorial Optimisation Algorithms

Radio Network Planning with Combinatorial Optimisation Algorithms Author manuscript, published in "ACTS Mobile Telecommunications Summit 96, Granada : Spain (1996)" Radio Network Planning with Combinatorial Optimisation Algorithms P. Calégari, F. Guidec, P. Kuonen, EPFL,

More information

GENETIC ALGORITHM METHOD FOR COMPUTER AIDED QUALITY CONTROL

GENETIC ALGORITHM METHOD FOR COMPUTER AIDED QUALITY CONTROL 3 rd Research/Expert Conference with International Participations QUALITY 2003, Zenica, B&H, 13 and 14 November, 2003 GENETIC ALGORITHM METHOD FOR COMPUTER AIDED QUALITY CONTROL Miha Kovacic, Miran Brezocnik

More information

INTERACTIVE MULTI-OBJECTIVE GENETIC ALGORITHMS FOR THE BUS DRIVER SCHEDULING PROBLEM

INTERACTIVE MULTI-OBJECTIVE GENETIC ALGORITHMS FOR THE BUS DRIVER SCHEDULING PROBLEM Advanced OR and AI Methods in Transportation INTERACTIVE MULTI-OBJECTIVE GENETIC ALGORITHMS FOR THE BUS DRIVER SCHEDULING PROBLEM Jorge PINHO DE SOUSA 1, Teresa GALVÃO DIAS 1, João FALCÃO E CUNHA 1 Abstract.

More information

Offspring Generation Method using Delaunay Triangulation for Real-Coded Genetic Algorithms

Offspring Generation Method using Delaunay Triangulation for Real-Coded Genetic Algorithms Offspring Generation Method using Delaunay Triangulation for Real-Coded Genetic Algorithms Hisashi Shimosaka 1, Tomoyuki Hiroyasu 2, and Mitsunori Miki 2 1 Graduate School of Engineering, Doshisha University,

More information

Genetic Programming Part 1

Genetic Programming Part 1 Genetic Programming Part 1 Evolutionary Computation Lecture 11 Thorsten Schnier 06/11/2009 Previous Lecture Multi-objective Optimization Pareto optimality Hyper-volume based indicators Recent lectures

More information

How Santa Fe Ants Evolve

How Santa Fe Ants Evolve How Santa Fe Ants Evolve Dominic Wilson, Devinder Kaur, Abstract The Santa Fe Ant model problem has been extensively used to investigate, test and evaluate evolutionary computing systems and methods over

More information

Variations on Genetic Cellular Automata

Variations on Genetic Cellular Automata Variations on Genetic Cellular Automata Alice Durand David Olson Physics Department amdurand@ucdavis.edu daolson@ucdavis.edu Abstract: We investigated the properties of cellular automata with three or

More information

The movement of the dimmer firefly i towards the brighter firefly j in terms of the dimmer one s updated location is determined by the following equat

The movement of the dimmer firefly i towards the brighter firefly j in terms of the dimmer one s updated location is determined by the following equat An Improved Firefly Algorithm for Optimization Problems Amarita Ritthipakdee 1, Arit Thammano, Nol Premasathian 3, and Bunyarit Uyyanonvara 4 Abstract Optimization problem is one of the most difficult

More information

Immune Optimization Design of Diesel Engine Valve Spring Based on the Artificial Fish Swarm

Immune Optimization Design of Diesel Engine Valve Spring Based on the Artificial Fish Swarm IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-661, p- ISSN: 2278-8727Volume 16, Issue 4, Ver. II (Jul-Aug. 214), PP 54-59 Immune Optimization Design of Diesel Engine Valve Spring Based on

More information

Neural Network Weight Selection Using Genetic Algorithms

Neural Network Weight Selection Using Genetic Algorithms Neural Network Weight Selection Using Genetic Algorithms David Montana presented by: Carl Fink, Hongyi Chen, Jack Cheng, Xinglong Li, Bruce Lin, Chongjie Zhang April 12, 2005 1 Neural Networks Neural networks

More information

4/22/2014. Genetic Algorithms. Diwakar Yagyasen Department of Computer Science BBDNITM. Introduction

4/22/2014. Genetic Algorithms. Diwakar Yagyasen Department of Computer Science BBDNITM. Introduction 4/22/24 s Diwakar Yagyasen Department of Computer Science BBDNITM Visit dylycknow.weebly.com for detail 2 The basic purpose of a genetic algorithm () is to mimic Nature s evolutionary approach The algorithm

More information

1 Lab + Hwk 5: Particle Swarm Optimization

1 Lab + Hwk 5: Particle Swarm Optimization 1 Lab + Hwk 5: Particle Swarm Optimization This laboratory requires the following equipment: C programming tools (gcc, make), already installed in GR B001 Webots simulation software Webots User Guide Webots

More information