Evolving caching algorithms in C by genetic programming

Norman Paterson
Computer Science Division
University of St Andrews
ST ANDREWS, Fife, KY16 9SS, Scotland
norman@dcs.st-and.ac.uk
http://www.dcs.st-and.ac.uk/~norman/

Mike Livesey
Computer Science Division
University of St Andrews
ST ANDREWS, Fife, KY16 9SS, Scotland
mjl@dcs.st-and.ac.uk
http://www.dcs.st-and.ac.uk/~mjl/

ABSTRACT

This paper outlines current work in ontogenic-mapped and language-abstracted genetic programming. Developments in the genetic algorithm for deriving software (GADS) technique are described and tested in a series of experiments to generate caching algorithms in C. GADS quickly finds over-fitted solutions which perform better than designed solutions, but only in one niche. The need for a scalable approach in GADS to deal with language definitions involving more productions is demonstrated.

1 Introduction

Genetic algorithms (GAs) were introduced in Holland (1975). Genetic programming (GP) was introduced in Koza (1993). GP is the application of the GA technique to the problem of producing computer programs. Koza (1993) does this by extending GA techniques to deal with genotypes of type tree, LISP trees in particular. Michalewicz (1994) describes how the GA has been extended to genotypes of various other data types. The original GA data type is best described as a tuple.

A current line of research is to abstract the language in which the phenotype is written from the GP process. Keller & Banzhaf (1996) describe a many-to-one mapping from the chromosome to the terminal symbols of the phenotype language. Hörner (1996) describes a GP system in which the genotype is a derivation tree rather than a parse tree. Paterson & Livesey (1996) describe GADS, an implementation of GP that uses tuple genotypes. These approaches are described and compared in more detail in 2.

The main body of this paper reports recent developments of the GADS technique. The target language is changed from first-order Lisp to a subset of C. Evaluating fitness involves compiling and running the phenotype. A more general repair mechanism is introduced, in which a default terminal string is specified for each nonterminal as part of the BNF definition of the phenotype language. GA parameters are refined: uniform crossover is replaced by single-point crossover. This results in a marked improvement from generation to generation that was lacking in previous work.

A series of experiments on caching algorithms is described. Caching was chosen because the problem is small, has no closed solution, and an improved solution would be of value outside the GP community. The work was carried out with the machine acting as a servant under human direction.

2 Related work

GADS differs from current mainstream GP in two main ways: it distinguishes between the genotype and phenotype, and the phenotype language is abstracted so that phenotypes can be generated in any language of choice. This section describes two other systems with these properties and compares them with GADS.

2.1 Linear genomes (LG)

Keller & Banzhaf (1996) introduce a genotype-phenotype mapping (called GPM or genetic code) based on a theory of molecular evolution. The chromosome is a tuple of codons, each represented by the same number of bits. The difference between a codon and a gene is that there is a many-to-one mapping from codons to the terminal symbols of the phenotype language, and a notion of distance or similarity between codons is defined, based on the codon value rather than on its position in the chromosome.
For example, the genotype 000 001 011 might map to the sequence ab*. This is now scanned for possible repair, according to the rules of the phenotype language. The symbol a is valid, but b must not immediately follow a; the valid symbols to follow a are * and +. The repair is effected on the basis of the codon values: the invalid codon is 001, and the "nearest" valid replacement is 011, so the sequence is repaired to a**. Scanning and repair continue in this way, leading to the fully repaired sequence a*b. This is then subject to wrapping and fitness evaluation.

The codons in the initial population are randomly generated with a uniform distribution. Two mutation operators that modify bits within a codon are defined. It is claimed that LG is closer to natural mutation, since only a single codon is modified, rather than the entire syntactic units that are changed in Koza (1993). However, limiting mutation to a single codon is only half the story. The full effect of a change depends also on the genotype-phenotype mapping and the grammar of the phenotype language, so the effect of changing one codon is not restricted to a single point in the phenotype.
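The scan-and-repair step is easy to make concrete. The following is a hypothetical reconstruction, not Keller & Banzhaf's actual system: the 3-bit codon table CODE and the validity rule (operands a and b must alternate with operators * and +, operand first) are invented here, chosen only so that the worked example above comes out the same way.

    #include <stdio.h>

    /* Hypothetical many-to-one genetic code: 3-bit codons onto {a, b, *, +} */
    static const char CODE[8] = {'a', 'b', 'b', '*', '*', '+', 'a', '+'};

    static int operand(char s) { return s == 'a' || s == 'b'; }

    /* Toy grammar rule: operands and operators alternate, operand first.
       In particular, b may not immediately follow a; only * or + may. */
    static int valid(char prev, char next)
    {
        return (prev == 0 || !operand(prev)) ? operand(next) : !operand(next);
    }

    /* Repair an invalid codon to the nearest codon value with a valid symbol */
    static int repair(int codon, char prev)
    {
        for (int d = 1; d < 8; d++) {
            if (codon - d >= 0 && valid(prev, CODE[codon - d])) return codon - d;
            if (codon + d <= 7 && valid(prev, CODE[codon + d])) return codon + d;
        }
        return codon;   /* no valid neighbour: leave unrepaired */
    }

    int main(void)
    {
        int genome[] = {0, 1, 3};   /* the worked example: 000 001 011 */
        char prev = 0;
        for (int i = 0; i < 3; i++) {
            if (!valid(prev, CODE[genome[i]]))
                genome[i] = repair(genome[i], prev);
            prev = CODE[genome[i]];
            putchar(prev);
        }
        putchar('\n');              /* prints the repaired sequence a*b */
        return 0;
    }

Under this code, the invalid 001 is repaired to 011 (giving the intermediate a**), and the now-invalid trailing codon is in turn repaired to 010, yielding a*b.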

2.2 Genetic Programming Kernel (GPK)

Hörner (1996) introduces the Vienna University of Economics Genetic Programming Kernel (GPK). The chromosomes of GPK are derivation trees of a specific grammar. All the trees in the population are complete, so no repair of incomplete or invalid phenotypes is necessary. However, ensuring this is not trivial. Initialisation is relatively complicated: generating derivation trees by repeatedly replacing nonterminals with one of their possible derivations chosen at random leads (a) to a preponderance of short derivations, and (b) to many derivation trees which are incomplete when the size limit is reached. Considerable effort is needed to overcome these problems. Whether this is effort well spent, considering that only generation 0 is involved, is not discussed.

The genetic operators are designed to maintain the property that all trees in the population are complete. Crossover swaps subtrees which have the same nonterminal root. To mutate an individual, GPK creates a new, random individual and applies the crossover operator to the individual to be mutated and the random individual. The phenotype language is provided to GPK in the form of a BNF definition.
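GPK's mutation-by-crossover is simple enough to sketch. The outline below is illustrative C, not GPK's actual C++ classes: the Node type, the uniform choice of crossover point, and the caller-supplied random_tree generator are all assumptions.

    #include <stdlib.h>

    #define MAX_KIDS 4

    /* A derivation-tree node; sym is the grammar symbol at its root.
       For simplicity this sketch picks any node as a crossover point;
       a real system would restrict the choice to nonterminal nodes. */
    typedef struct Node {
        int sym;
        int nkids;
        struct Node *kid[MAX_KIDS];
    } Node;

    static int size(const Node *t)              /* nodes in subtree t */
    {
        int n = 1;
        for (int i = 0; i < t->nkids; i++) n += size(t->kid[i]);
        return n;
    }

    static Node *pick(Node *t, int *k)          /* k-th node, preorder */
    {
        if ((*k)-- == 0) return t;
        for (int i = 0; i < t->nkids; i++) {
            Node *r = pick(t->kid[i], k);
            if (r) return r;
        }
        return NULL;
    }

    static int count(const Node *t, int sym)    /* subtrees rooted at sym */
    {
        int n = t->sym == sym;
        for (int i = 0; i < t->nkids; i++) n += count(t->kid[i], sym);
        return n;
    }

    static Node *nth(Node *t, int sym, int *k)  /* k-th such subtree */
    {
        if (t->sym == sym && (*k)-- == 0) return t;
        for (int i = 0; i < t->nkids; i++) {
            Node *r = nth(t->kid[i], sym, k);
            if (r) return r;
        }
        return NULL;
    }

    /* Crossover: swap a random subtree of a with a subtree of b that has
       the same root symbol, so that both trees remain complete. */
    static void crossover(Node *a, Node *b)
    {
        int k = rand() % size(a);
        Node *x = pick(a, &k);
        int n = count(b, x->sym);
        if (n == 0) return;                 /* b has no matching point */
        k = rand() % n;
        Node *y = nth(b, x->sym, &k);
        Node tmp = *x; *x = *y; *y = tmp;   /* swapping contents swaps subtrees */
    }

    /* GPK-style mutation: cross the individual with a fresh random tree */
    void mutate(Node *ind, Node *(*random_tree)(void))
    {
        Node *r = random_tree();    /* generated from the BNF; not shown */
        crossover(ind, r);
        /* r, which now contains the displaced subtree, should be freed */
    }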
2.3 Comparison to GADS

GADS (GA for deriving software) was introduced by Paterson & Livesey (1996); a description is also given in 3 of this paper. The main features that GADS has in common with the two systems described above are these. First, the genotype and phenotype are distinct, and an ontogenic mapping is necessary to convert from genotype to phenotype; this is not new to GADS, and Michalewicz (1994) gives many examples in contexts other than GP. Second, the genetic process is more or less orthogonal to the phenotype language: the language definition is abstracted from the genetic process.

GADS differs from both in that it uses a standard GA engine with standard chromosomes, genes and genetic operators. Very little effort is involved in setting up GADS on top of a GA engine. From its description, LG could also be implemented on a standard GA engine, provided that the specialised mutation could be supported. GPK is a sophisticated system with many features additional to those described here; it provides its own, specialised, GA engine.

Like GPK, GADS accepts a language specification in BNF. The GADS BNF is extended to include a default terminal string for each nonterminal, and production weights. These are seen as language definition features rather than tuning parameters. In LG, the phenotype language is represented by the genetic code and scanning mechanism. This introduces a whole new class of parameter, since the genetic code defines how one codon is repaired into another. Of course, it may be that this has little effect on the general performance of the system, and that it can be automated.

Both LG and GPK use mutation. GPK without mutation would be limited to rearranging the parts that were present in generation 0. GADS, at least in this paper, sets the mutation rate to 0. A masking effect means that forms can appear in one generation whose very components, not just the forms themselves, were absent in previous generations. Mutation is not necessary for this.

3 Experimental design

This section describes the problem that was studied and the system that was used to study it. The problem was to find an efficient caching algorithm. The nature of this problem and its terminology are described in 3.1. The form of solution we are looking for is a few lines of C source; the wrapper used to evaluate its fitness is described in 3.2.

The experiment was conducted as a series of GP runs, using the GADS technique. GADS is introduced in Paterson & Livesey (1996). The main components of the system are the GA engine, the ontogenic mapping and the fitness evaluation. The GA engine maintains a population of chromosomes, which are arrays of integers; it is described in 3.3. The ontogenic mapping converts chromosomes into phenotypes, which are program fragments in C; it is described in 3.4. Fitness evaluation simulates cache operation using the phenotype, to measure how effective the phenotype is at avoiding cache misses; it is described in 3.5.

3.1 The problem

As a program executes, it accesses locations in memory. However, programs do not access all memory locations equally. They tend to access a small subset of their address space (the working set) and ignore the rest, the working set changing slowly (compared to the rate of accesses) over the life of the program. This can be exploited to improve the performance of the program. A cache memory is a small, fast memory (fast implies expensive, which in turn implies small), large enough to hold the working set the program is using at any time. Instead of accessing main memory, the program accesses cache memory most of the time, and therefore runs faster.

The following terminology is used. The executing program emits a stream of requests (addresses of main memory locations to access). A recorded stream is a trace. The set of distinct requests in a stream is its alphabet. A consecutive series of identical requests is a run. The cache has a fixed number of lines, each of one word. Initially the cache is empty, ie all lines are unoccupied. If a request is met from the cache, it is called a hit; otherwise it is a miss. When a miss occurs, the requested word is copied from main memory to the cache. If the cache is not yet full, the first empty line is used. If the cache is full, the caching algorithm is used to choose a victim (ie a word to evict), so that its line can be used for the requested word. A typical caching algorithm is LRU, which replaces the word that was least recently used.

In general a caching algorithm needs some management information on which to base its decision. This information is provided in various ways by different systems; for example, the lines may be held in a linked list where the order is significant. In this implementation, we provided an integer array info [ ], with one element per line. This is sufficient to implement an algorithm such as LRU, but general enough to find other solutions as well.
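The cache model just described is small enough to simulate directly. The sketch below is ours, not the authors' wrapper.cc: it reads a trace of addresses (one per line, assumed nonnegative) from standard input and counts hits and misses under a hard-wired LRU policy. In the authors' system the victim choice is instead delegated to the evolved phenotype, as described in 3.2.

    #include <stdio.h>

    #define CACHESIZE 20

    int main(void)
    {
        long cache[CACHESIZE], stamp[CACHESIZE];
        long r, tick = 0, hits = 0, misses = 0;
        int used = 0;

        while (scanf("%ld", &r) == 1) {
            int line = -1;
            for (int i = 0; i < used; i++)      /* is the word cached? */
                if (cache[i] == r) { line = i; break; }
            if (line >= 0) {
                hits++;
            } else {
                misses++;
                if (used < CACHESIZE) {
                    line = used++;              /* first empty line */
                } else {
                    line = 0;                   /* LRU victim: oldest stamp */
                    for (int i = 1; i < CACHESIZE; i++)
                        if (stamp[i] < stamp[line]) line = i;
                }
                cache[line] = r;
            }
            stamp[line] = tick++;               /* most recently used */
        }
        printf("hits %ld, misses %ld\n", hits, misses);
        return 0;
    }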

3.2 Fitness evaluation: the wrapper

The problem to be solved is not to generate a whole caching system, but just the caching algorithm which lies at its heart. A phenotype takes the form of a file, called phenotype.cc, containing some statements in the C programming language, whose effect is to choose a victim on the basis of some calculation. These statements are inserted (by a #include compiler directive) into a function, called phenotype, that is the immediate wrapper of the phenotype. The function is shown below:

    long phenotype (
        long request,   /* address of word to access */
        boolean miss,   /* a miss? 0 = n, 1 = y */
        long line_no    /* of this word; -1 if miss */
    )
    {
        long victim = 0;
    #include "phenotype.cc"
        /* Ensure result is in range */
        return abs (victim) % CACHESIZE;
    }

The function phenotype is part of a cache simulator called wrapper.cc. When function phenotype is called, the phenotype being evaluated calculates a value for victim (or leaves it at 0). The return statement maps the value of victim into the valid range. Function phenotype is called by the cache simulator once per request, whether the request is a hit or a miss; this allows the phenotype to update the array info [ ]. The result is only used to update the cache if the request is a miss.

wrapper.cc also provides a small selection of constants, variables, information functions and calculation functions, some protected, which the phenotype can use. These are shown in table 1, and are built into the language definition.

Table 1: Phenotype support facilities

    write_x (i, v)   sets info [i] = v
    read_x (i)       info [i]
    small_x ()       index of smallest element of info [ ]
    large_x ()       index of greatest element of info [ ]
    random_x ()      index of random element of info [ ]
    counter ()       successive values 0, 1, 2 etc
    CACHESIZE        count of lines in cache
    div (x, y)       if y == 0 then 1 else x / y
    rem (x, y)       if y == 0 then 1 else x % y

Apart from the protected operations, this language is not tailored for GP. It is designed to be sufficiently powerful to implement a range of typical caching algorithms. For example, LRU could be implemented as follows:

    write_x (line_no, counter ());
    victim = small_x ();

In fact the function phenotype is slightly more sophisticated than shown above: in addition to the #included phenotype, it contains a range of designed caching algorithms such as LRU, and a switch statement to choose between them under control of a command line argument. This allows for calibration and comparison. The designed algorithms play no part in fitness evaluation.
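As a further illustration of the language's generality (this variant is ours, not from the paper), a rough least-frequently-used policy can be written with the same facilities, assuming write_x tolerates the line_no of -1 passed on a miss, as the LRU example above also requires:

    write_x (line_no, read_x (line_no) + 1L);   /* bump this line's use count */
    victim = small_x ();                        /* evict the least-used line */

Each call increments the accessed line's use count and the least-used line is chosen as victim; since the count of a newly filled line is not reset, this is only an approximation to LFU.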
3.3 The GA engine

The GA engine used for this experiment is GAGS-1.0, by Merelo (1996). GAGS is a general-purpose GA system, not especially designed for GADS. It provides a C++ environment in which a programmer can set up a wide range of GA environments. GADS is implemented as a C++ program called ga. The ga program is unspecialised and merely provides a GA environment.

Genes are bit strings, but GAGS enables the program to view them as objects of any type. In this experiment, genes were viewed simply as integers in the range 0 to 2^n - 1, where n was as small as possible (5 or 6) but large enough to specify any production in the syntax. Chromosomes were fixed at 500 genes; this length was chosen after some brief experiments to generate sentences from the BNF file. GAGS supports variable-length chromosomes, but this feature was not used. The population was fixed at 500 individuals. The genes of the initial population are randomly generated with a uniform distribution. The same random seed was used in all runs.

Phenotypes are generated in C by the ontogenic mapping procedure (see 3.4) and evaluated using a cache simulator (see 3.5).

The number of generations was limited to 10 or 20; this was found to be sufficient to show convergence. Mutation was not used. Elitism was set at 20%, so that the best 20% of each generation was carried on to the next. The best individual from each generation was noted, and the result of the run was the best individual from the last generation.
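For concreteness, the GA cycle just described can be outlined as follows. This is a hypothetical skeleton, not GAGS itself: the breeding-from-the-elite selection scheme and the stub fitness function are our assumptions; in the real system, evaluate would run the ontogenic mapping and cache simulation of 3.4 and 3.5.

    #include <stdlib.h>
    #include <string.h>

    #define POP   500           /* population size */
    #define LEN   500           /* genes per chromosome */
    #define GMAX  64            /* gene values 0 .. 2^6 - 1 */
    #define ELITE (POP / 5)     /* 20% elitism */

    typedef struct { int gene[LEN]; double fit; } Ind;

    /* Stub: the real fitness maps the chromosome to C, compiles it into
       the wrapper and counts the misses avoided (see 3.4 and 3.5). */
    static double evaluate(const int *g) { return (double)g[0]; }

    static int by_fitness(const void *a, const void *b)
    {
        double d = ((const Ind *)b)->fit - ((const Ind *)a)->fit;
        return (d > 0) - (d < 0);
    }

    static void cross(const Ind *p, const Ind *q, Ind *c)
    {
        int cut = rand() % LEN;             /* single-point crossover */
        memcpy(c->gene, p->gene, cut * sizeof(int));
        memcpy(c->gene + cut, q->gene + cut, (LEN - cut) * sizeof(int));
    }

    int main(void)
    {
        static Ind pop[POP], next[POP];
        srand(1);                           /* same seed in all runs */

        for (int i = 0; i < POP; i++)       /* uniform random generation 0 */
            for (int j = 0; j < LEN; j++)
                pop[i].gene[j] = rand() % GMAX;

        for (int gen = 0; gen < 10; gen++) {
            for (int i = 0; i < POP; i++)
                pop[i].fit = evaluate(pop[i].gene);
            qsort(pop, POP, sizeof(Ind), by_fitness);   /* best first */
            memcpy(next, pop, ELITE * sizeof(Ind));     /* elite survive */
            for (int i = ELITE; i < POP; i++)           /* breed the rest */
                cross(&pop[rand() % ELITE], &pop[rand() % ELITE], &next[i]);
            memcpy(pop, next, sizeof pop);
            /* no mutation, as in the runs reported here */
        }
        return 0;
    }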

3.4 Ontogenic mapping

This section describes how the genotype, which is a list of integers, is converted into a fragment of C program. The ontogenic mapping procedure is called by the ga program as the first step in evaluating the fitness of a chromosome. The ontogenic mapping uses a specification of the phenotype language in an extended BNF (Backus normal form). For example, the conventional BNF definitions:

    <frag> ::= <stmts>;
    <stmts> ::= <stmt> | <stmt>; <stmts>

are written as:

    <frag> {}
        ::= <stmts>;
    <stmts> {}
        ::= <stmt>
        ::= <stmt>; <stmts>

The terminal string {} is the default value of both <frag> and <stmts>. In all cases, a minimal default string was used. For example, <fun> represents a function call, but an unevaluated <fun> is repaired to the constant 0L.

Space limitations preclude listing the actual BNF file used. The range of statement forms included a conditional, a write to info [ ], and an assignment to the variable victim. Arithmetic expressions can be combined in the usual ways; protected functions are used for division and remainder. The base cases for expressions include the defined term CACHESIZE (ie the number of lines in the cache), numbers composed of a mantissa (1, 2 or 5) followed by any number of zeros, variables describing the conditions at the time of the call, and a range of protected functions to access info [ ]. Most of these are described in table 1.

Before the GA engine is compiled, a pre-processor (written in perl) translates the BNF file into a C++ data structure, with the BNF as an array of productions. This data structure is #included into the ontogenic mapping procedure. During execution, each phenotype is generated from a chromosome as a parse tree. The tree is initialised to a single node which represents the unexpanded start symbol <frag>. Each gene is then used in turn. If the value of the gene is in the subscript range of the BNF array, it identifies a production. If the nonterminal left hand side of that production can be found in the parse tree, it is expanded to the right hand side, which in general involves adding a mixture of terminal strings and new unexpanded nonterminal nodes to the parse tree. After all genes in the chromosome have been used, the parse tree is traversed and the terminal strings are printed to a file; residual nonterminal nodes are replaced by their default terminal strings as specified in the BNF. The end result is a file containing a number of statements in the phenotype language, C.

A weakness in this approach is that as the number of productions grows, the chance of any particular production being chosen by a gene falls. This can be offset by increasing the population size, or the chromosome length, or by replicating selected productions in the BNF. All of these strategies were adopted to some degree; for example, a total of 6 identical productions for <frag> were used in the actual BNF file.
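The gene-by-gene expansion can be illustrated with a toy version of the mapping. The micro-grammar, the defaults, and the flat string standing in for the authors' parse tree are all ours; uppercase letters denote unexpanded nonterminals.

    #include <stdio.h>
    #include <string.h>

    #define NPROD 7
    #define MAX   4096

    /* Toy BNF array: F = <frag>, S = <stmts>, T = <stmt>, E = <expr> */
    static const struct { char lhs; const char *rhs; } prod[NPROD] = {
        { 'F', "S" },               /* <frag>  ::= <stmts>          */
        { 'S', "T" },               /* <stmts> ::= <stmt>           */
        { 'S', "TS" },              /* <stmts> ::= <stmt> <stmts>   */
        { 'T', "victim = E;\n" },   /* <stmt>  ::= victim = <expr>; */
        { 'E', "E + E" },
        { 'E', "counter ()" },
        { 'E', "0L" },
    };

    /* Default terminal strings for residual nonterminals */
    static const char *deflt(char nt) { return nt == 'E' ? "0L" : ""; }

    static char form[MAX] = "F";    /* start symbol <frag> */

    /* Expand the first occurrence of nonterminal nt into rhs; 0 if absent */
    static int expand(char nt, const char *rhs)
    {
        char *p = strchr(form, nt);
        if (!p || strlen(form) + strlen(rhs) >= MAX) return 0;
        memmove(p + strlen(rhs), p + 1, strlen(p + 1) + 1);
        memcpy(p, rhs, strlen(rhs));
        return 1;
    }

    int main(void)
    {
        int genome[] = { 0, 2, 3, 4, 5, 9, 3, 6 };     /* a made-up chromosome */

        for (int i = 0; i < 8; i++) {
            int g = genome[i];
            if (g < NPROD)              /* out-of-range genes are ignored */
                expand(prod[g].lhs, prod[g].rhs);   /* no-op if lhs absent */
        }

        for (const char *p = "FSTE"; *p; p++)   /* repair what remains */
            while (expand(*p, deflt(*p))) {}

        fputs(form, stdout);    /* prints: victim = counter () + 0L; */
        return 0;
    }

Note how the gene value 9 (out of range) and the second occurrence of production 3 (no <stmt> left in the sentential form) are wasted; as the BNF grows, a larger fraction of genes are wasted in this way, which is the scaling weakness just discussed.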
3.5 Fitness evaluation

To evaluate the fitness of a phenotype, we compile it into a cache simulator called wrapper, and simulate the cache with a trace file. The effectiveness of a caching algorithm is usually measured by its hit rate, number of hits / number of requests, which can conveniently be expressed as a percentage. For GP, a slightly different measurement was used: the raw fitness of a caching algorithm is defined as the number of misses it avoids.

The number of misses that can occur is bounded both above and below. The actual number depends on the trace file, the cache size and the caching algorithm, but the bounds themselves depend only on the trace file, which is part of the problem definition; the cache size and caching algorithm are part of the solution. The upper bound on the misses is the number of runs in the trace; it is achieved when the cache size is 1, which leaves no choice about which word to evict. The lower bound is the size of the alphabet, since each word must be read at least once; a perfect caching algorithm would not read any word more than once. Whether this is achievable for any given trace file and cache size is moot. An alternative interpretation of this bound is that it corresponds to a cache large enough to hold all the words of main memory. The actual performance for a given trace file depends on the cache size and caching algorithm, but must lie between the bounds. The raw fitness of a caching algorithm is therefore measured as:

    number of runs - number of misses

The better the algorithm, the fewer the misses, and so the larger the fitness.

Three trace files from Flanagan et al (1992) were used. One of these (ken2.00200) was used for fitness evaluation; it comprises about 400 000 requests with about 250 000 distinct addresses. The other two traces (ken2.00201 and ken2.00202) were used to compare the algorithms outside the GP process. To cut down the cache simulator's memory requirements, the addresses were grouped into blocks of 256 words by removing the low order 10 bits. This reduced the alphabet size to about 500; however, we suspect that this may have been a false economy. It would have been possible to generate trace data randomly, but any conclusions would then have had to be ratified with real data; to avoid introducing this step we used real data from the start.
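The quantities in this fitness measure are straightforward to extract from a trace. The following sketch is ours (the paper does not show its tooling): it reads a trace of nonnegative addresses from standard input and reports the number of requests, runs and distinct addresses, and hence the bounds on the misses.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        long r, prev = -1;          /* -1 sentinel: assumes addresses >= 0 */
        long requests = 0, runs = 0, alphabet = 0;
        long *seen = NULL;

        while (scanf("%ld", &r) == 1) {
            requests++;
            if (r != prev) runs++;  /* a new run starts */
            prev = r;
            long i;                 /* linear scan: fine for a sketch,
                                       too slow for 400 000 requests */
            for (i = 0; i < alphabet && seen[i] != r; i++) {}
            if (i == alphabet) {
                seen = realloc(seen, (alphabet + 1) * sizeof *seen);
                seen[alphabet++] = r;
            }
        }
        printf("requests %ld, runs %ld, alphabet %ld\n",
               requests, runs, alphabet);
        printf("misses: at least %ld, at most %ld\n", alphabet, runs);
        /* raw fitness of an algorithm with m misses: runs - m */
        free(seen);
        return 0;
    }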

4 Experimental results

The experiment consisted of a series of GP runs.

4.1 GP1: small cache size

In this run, an exceptionally small cache size of 20 lines was used. The best-of-run, with 14 168 misses, appeared in generation 1:

    victim = random_x ();

No further improvement was shown in the following 8 generations of this run. Random is the best of the designed algorithms for this size of cache.

4.2 GP2: large cache size

Since a caching algorithm works by exploiting patterns in the request stream, and the cache size is the window in which these patterns can be detected, smaller cache sizes make pattern detection harder. It is therefore no surprise that random is the best we can do. Table 9 shows that as the cache size grows, more intelligent caching algorithms outperform random. The second run was therefore based on increasing the cache size from 20 to 200. The best-of-run, with 1 365 misses, appeared in generation 8:

    victim = rem ((rem ((rem ((CACHESIZE), read_x (request + miss))),
        counter ())), read_x (miss + small_x ()));
    victim = div (CACHESIZE, CACHESIZE);
    victim = counter () * (read_x (2L) - 5L);

It would appear that only the last assignment to victim has any effect; however, the earlier assignments use counter (), which has a side effect. No further generations were evolved, even though improvement looked possible. At 1 365 misses, the generated caching algorithm outperforms LRU by almost 14%.

4.3 GP3: no constants

A concern about the quality of evolved caching algorithms is that they are over-specialised for the particular cases used in their fitness evaluation. To make it harder for GADS to find an over-fitted algorithm, the next experiment removes the capacity to generate numbers, making arbitrary constants less likely to evolve. This is done simply by deleting the relevant lines of the BNF file. The best-of-run, with 1 345 misses, appeared in generation 4:

    victim = random_x () + counter ();

Surprisingly, imposing this constraint improved the final performance. At 1 345 misses, the generated caching algorithm outperforms LRU by over 15%.

4.4 GP4: using info [ ]

Although the algorithms discovered so far outperform LRU, none of them makes any use of the info [ ] array. To see if this array is useful, we modify the syntax further to force its use. This is done by redefining <frag> as follows:

    <frag> {}
        ::= write_x (<expr>, <expr>);
        ::= victim = read_x (<expr>);

The best-of-run, with 1 341 misses, appeared in generation 3:

    write_x ((CACHESIZE + (CACHESIZE + (large_x () * CACHESIZE))),
        counter () + CACHESIZE);
    victim = read_x (rem ((div (((div (large_x (), line_no)) - miss),
        (div ((CACHESIZE - (counter () * (read_x (div (random_x (),
        (div (((div (large_x (), line_no)) - miss), (div ((0L -
        (counter () * (read_x (div (small_x (), miss)) * read_x (0L -
        (div (line_no, counter ())))))), line_no)))))) * counter ()))),
        line_no)))), small_x ()));

The experiment was terminated at generation 3 because it was taking such a long time to execute; but in just 3 generations, GADS had produced an even better performance. At 1 341 misses, the generated caching algorithm outperforms LRU by over 15%.

5 Comparisons

Each run used a particular cache size and trace file. When the evolved algorithms were compared using other cache sizes or trace files, they did not perform as well as designed algorithms. Figure 1 shows a typical case: designed algorithm LRU versus algorithm GP4 (evolved for cache size 200).

[Figure 1: Comparison of GP4 and LRU. Misses (0 to 70 000) plotted against cache size (0 to 600) for the evolved GP4 and the designed LRU.]

The designed algorithm has fewer misses than the evolved algorithm everywhere except at the particular cache size. The evolved algorithm is over-fitted.

Table 2 shows the fitness of all algorithms, averaged over all cache sizes and all trace files. Fitness is measured as:

    (number of runs - number of misses) / (number of runs - size of alphabet)

which lies in the range [0, 1]; 0% is the worst and 100% is the best that can be achieved.

Table 2: Summary comparison

    LRU         95.39%
    counter     95.39%
    random      95.40%
    constant 0  43.03%
    GP1         95.40%
    GP2         71.73%
    GP3         92.11%
    GP4         81.22%

It is clear from this that although the GP algorithms do extremely well in specific cases, their average performance is not particularly good.
However, the fitness figures for the designed algorithms are unexpectedly close; the possibility of a problem in the simulation is discussed in 6.

6 Conclusions

One criticism raised in review of the experimental design is that it has two separate aims, namely to introduce a new GP technique and to solve a genuine problem. This was deliberate, as success would have validated the technique in one step.

Success with a toy problem is less than convincing. In the end the evolved algorithms are not as good as the designed algorithms, but for other reasons we cannot conclude that the technique itself is at fault.

First, the GA parameters may have been unreasonable. One reviewer thought 1% elitism more realistic than 20%; another questioned the zero mutation rate. To some extent, these are criticisms of the technique rather than of the particular experiment. Too many experiments seem to solve the same problem many times over in order to find "optimal" parameter settings. If optimal settings are necessary for a solution, the technique is of limited value, because in many problems we do not know what the solution is and therefore have no way to know when the settings are optimal. To that extent, this experiment shows GADS to be of limited value. However, it is possible that GADS would operate well over a wide range of settings, and that the particular settings we chose are simply unreasonable. Further experiments would answer this question.

Second, modifying the traces by removing the low order 10 bits may have destroyed their value. The trace ken2.00200 has 400 000 requests and an alphabet of about 250 000 addresses, giving an average run length of 1.6 requests. After removing 10 bits, the alphabet drops to about 500 addresses, giving an average run length of 800. A change of such magnitude might be expected to cause problems; but it was not until the comparison of designed algorithms was summarised in table 2 that doubts were expressed. The designed algorithms' performances are too close.

In short, there are flaws in the experimental design. Nonetheless some valuable conclusions can be drawn. First, getting the experimental design right may take more than one attempt. This can hardly be a new conclusion, but it is surely worth mentioning, especially given that some effort was spent in the hope of getting this one right first time!

Second, the over-fitting shows clearly that evolution is taking place. GADS succeeds, in very few generations, in exploiting patterns in the trace. Over-fitting is usually seen as a problem, but in fact it could be useful. The advantage of designed caching algorithms is that they work well in a wide range of situations; the advantage of over-fitted algorithms is that they work even better in certain niches. An adaptive system that could exploit this might be of value. For example, an adaptive caching system could maintain a steady-state population of caching algorithms, using the better ones to deal with incoming requests, modifying their fitness, and breeding them. As the request stream changed, algorithms that had been less fit would become more suited and would rise in the population. Mutation would prevent the population from converging.

Third, the ease of changing languages with GADS is clearly demonstrated. Minor changes to a BNF file were all that was required to produce a range of tailored solutions.

Fourth, GADS must be modified to cope with larger language definitions. The current approach cannot scale up, because the probability of a gene selecting a production with a particular left hand side becomes too low as the number of productions increases.

Fifth, GADS has shown that it can be used on a "real" problem. The caching algorithm problem was chosen because it is of interest outside the GP community.
It was attacked by a combination of human skills: choosing the phenotype language (in particular the function set), directing the GADS runs, interpreting the results, and modifying the approach. GADS was used to explore areas chosen by the human. GADS did not provide the turnkey service that was originally hoped for, but it was a very effective junior partner.

Bibliography

Flanagan, K, Grimsrud, K, Archibald, J & Nelson, B. 1992. BACH: BYU Address Collection Hardware. Technical report TR-A150-92.1. Electrical and Computer Engineering Department, Brigham Young University.

Holland, John. 1975. Adaptation in natural and artificial systems. MIT Press. ISBN 0-262-08213-6.

Hörner, Helmut. 1996. A C++ class library for genetic programming. Vienna University of Economics. http://aif.wu-wien.ac.at/~geyers/archive/gpk/vuegpk.html

Keller, Robert E & Banzhaf, Wolfgang. 1996. Genetic programming using mutation, reproduction and genotype-phenotype mapping from linear binary genomes into linear LALR(1) phenotypes. In Koza, Goldberg, Fogel & Riolo, eds. Genetic Programming 1996: Proceedings of the First Annual Conference. MIT Press. ISSN 1088-4750. ISBN 0-262-61127-9. Pp 116-122.

Koza, John R. 1993. Genetic programming. MIT Press. ISBN 0-262-11170-5.

Merelo, J J. 1996. Genetic algorithms from Granada, Spain. ftp://kal-el.ugr.es/gags/gags-1.0.tar.gz

Michalewicz, Zbigniew. 1994. Genetic algorithms + Data structures = Evolution programs. Springer-Verlag. ISBN 3-540-58090-5.

Paterson, Norman & Livesey, Mike. 1996. Distinguishing genotype and phenotype in genetic programming. In Koza, Goldberg, Fogel & Riolo, eds. Late Breaking Papers at GP 1996. MIT Press. ISBN 0-18-201-031-7.