Algorithm Selection using Reinforcement Learning


Michail G. Lagoudakis, Department of Computer Science, Duke University, Durham, NC 27708, USA — MGL@CS.DUKE.EDU
Michael L. Littman, Shannon Laboratory, AT&T Labs Research, Florham Park, NJ 07932, USA; Department of Computer Science, Duke University, Durham, NC 27708, USA — MLITTMAN@RESEARCH.ATT.COM

Abstract

Many computational problems can be solved by multiple algorithms, with different algorithms fastest for different problem sizes, input distributions, and hardware characteristics. We consider the problem of algorithm selection: dynamically choose an algorithm to attack an instance of a problem with the goal of minimizing the overall execution time. We formulate the problem as a kind of Markov decision process (MDP), and use ideas from reinforcement learning to solve it. This paper introduces a kind of MDP that models the algorithm selection problem by allowing multiple state transitions. The well-known Q-learning algorithm is adapted for this case in a way that combines both Monte-Carlo and Temporal Difference methods. Also, this work uses, and extends in a way to control problems, the Least-Squares Temporal Difference algorithm (LSTD(λ)) of Boyan. The experimental study focuses on the classic problems of order statistic selection and sorting. The encouraging results reveal the potential of applying learning methods to traditional computational problems.

1. Introduction

When performing a repetitive task, people often find ways of optimizing their behavior to make it faster, cheaper, safer, or more reliable. Computer systems execute tasks that are far more repetitive and could benefit considerably from optimization. Programmers and source-level compilers work hard to reorganize computations to make them more efficient, but as computer systems become more complex and mobile programs are expected to run efficiently on a wide variety of hardware platforms, squeezing maximum performance out of a program requires run-time information.
A challenging research goal is to design a run-time system that can repeatedly execute a program, learning over time to make decisions that speed up the overall execution time. Since the right decisions may depend on the problem size and parameters, the machine characteristics and load, the data distribution, and other uncertain factors, this can be quite challenging. As a first attempt, we attack the following algorithm selection problem. We require that the programmer provide (a) a set of algorithms that are equivalent in terms of the problem they solve, but can differ in, for example, how their running time scales with problem size, and (b) a set of instance features, such as problem size, that can be used to select the most appropriate algorithm from the set for a given problem instance. We show how a reinforcement learning approach can be used to select the right algorithm for each instance at run-time based on the instance features.

Recall that a recursive algorithm is one that solves a problem by doing some preprocessing to reduce the input problem to one or more subproblems from the same class, solves the subproblems, then performs some postprocessing to turn the solutions to the subproblems into a solution for the original problem. Because each of the subproblems generated by a recursive algorithm belongs to the same class as the original problem, each gives rise to a new algorithm selection problem. Thus, when recursive algorithms are included in the algorithm set, the algorithm selection problem becomes a sequential decision problem. Related work (Lobjois & Lemaître, 1998; Fink, 1998) treats algorithms in a black-box manner: each time a single algorithm is selected and applied to the given instance. Our focus is on algorithm selection while the instance is being solved. In that sense, each instance is solved by a mixture of algorithms formed dynamically at run-time. The remainder of this section develops a simple example to clarify the definition of the algorithm selection problem.
Section 2 connects the problem to that of solving a Markov decision process and Section 3 explains how a learning algorithm can be applied to improve performance. Section 4 discusses approximation methods for the value function, and, finally, Section 5 provides results for two initial studies using the problems of order statistic selection and sorting.

As a simple concrete example, let's consider creating a system for sorting. We write two algorithms: shellsort and bubblesort. Shellsort has a bit more overhead, and thus can run a bit more slowly for small problems. However, its asymptotic running time for a list of n items is O(n^(3/2)) in contrast to bubblesort's O(n^2), so we'd expect shellsort to be preferable for large problems. If we use only problem size, n, to decide which algorithm to run, the algorithm selection problem reduces to finding an optimal cutoff n0 such that we sort lists of fewer than n0 items with bubblesort and longer lists with shellsort.

Now, consider adding mergesort to our algorithm set. Mergesort is an O(n log n) recursive algorithm. It takes a list of n items, separates it into two lists of size ceil(n/2) and floor(n/2), sorts them individually, and finally combines the two small sorted lists into a single sorted list. Since mergesort is the most efficient algorithm in the set for large lists, a large list will be sorted by applying mergesort repeatedly until the resulting subproblems are sufficiently small. At this point, either shellsort or bubblesort should be applied.

2. Algorithm Selection as an MDP

The algorithm selection problem can be encoded as a kind of Markov decision process (Puterman, 1994) (MDP) consisting of states, actions, costs, transitions, and an objective function. The state of the MDP is represented by the current instantiation of the instance features. To fully satisfy the Markov property, some unknown factors, like data distribution and machine characteristics, should be part of the process state. However, such information is not only unavailable, but would also make the state space extremely large and perhaps overly expensive to manipulate on the fly. We treat such factors as unmodeled hidden state and assume their influence is negligible.
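The fixed-cutoff policy from the sorting example in Section 1 can be sketched as code. This is a minimal illustration, not from the paper: the cutoff value and function names are assumed, and the paper's point is precisely that such a cutoff should be learned rather than hard-coded.

```python
# Hypothetical sketch of a fixed-cutoff hybrid sort: apply the recursive
# algorithm (mergesort) until subproblems are small, then switch to the
# non-recursive one (bubblesort). CUTOFF is an assumed, illustrative value.

CUTOFF = 30

def bubblesort(a):
    a = list(a)
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

def hybrid_sort(a):
    if len(a) <= CUTOFF:                 # terminal action: non-recursive algorithm
        return bubblesort(a)
    mid = len(a) // 2                    # recursive action: mergesort step
    left, right = hybrid_sort(a[:mid]), hybrid_sort(a[mid:])
    out, i, j = [], 0, 0                 # merge step (postprocessing)
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```

Replacing the hard-coded size test with a learned policy over instance features is exactly the generalization the following sections develop.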
Actions are the different algorithms that can be selected. Non-recursive algorithms are terminal in that they are not followed by a state transition and the corresponding process terminates. In contrast, recursive algorithms cause transitions to other states, which correspond to the subproblems created by the recursive algorithm. These state transitions are non-deterministic in general, especially when randomization is used as part of the recursive algorithm. The immediate cost for choosing some algorithm (action) on some problem (state) is precisely the real time taken for that execution, excluding any time taken in recursive calls. The total (undiscounted) cost accumulated while fully solving a problem is exactly the total time taken to solve the problem (see Figure 1).

Figure 1. For each (sub)problem the shaded part of the running time indicates the immediate cost.

The objective is to find a policy, a mapping from values of instance features to algorithms, such that the expected total execution time is minimized. For a fixed policy, the value of a state s is the expected time to solve a problem described by state s using the algorithms selected by the policy. Note that the cost function is unknown and non-deterministic in general, since it may depend on several uncertain and hidden factors.

State transitions are a bit more complex in the case of recursive algorithms. From the MDP point of view, the multiple subproblems that are created and solved by a recursive algorithm result in transitions to multiple states; this violates the standard MDP definition. For example, mergesort divides the input to be sorted into two pieces, each corresponding to a different state, yielding a 1-to-2 state transition. However, as long as a sequential model of computation is used, we can safely treat each of these transitions to a new state independently, and the total cost will be the sum of the individual total costs for each subproblem.
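The cost decomposition above can be made concrete with a toy instrumentation: count, for each mergesort call, only the work done at that level (here, comparisons during its own merge, excluding recursive calls) as its "immediate cost". The total cost of solving the instance is then the sum of the immediate costs over all (sub)problems, mirroring Figure 1. Names and the comparison-count cost model are illustrative, not from the paper.

```python
# Instrumented mergesort: each call appends its own immediate cost
# (comparisons in its merge step, excluding recursive calls).

immediate_costs = []

def mergesort(a):
    if len(a) <= 1:
        immediate_costs.append(0)     # terminal state: no further transitions
        return list(a)
    mid = len(a) // 2
    left, right = mergesort(a[:mid]), mergesort(a[mid:])  # 1-to-2 transition
    out, i, j, comparisons = [], 0, 0, 0
    while i < len(left) and j < len(right):
        comparisons += 1              # cost incurred at THIS level only
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    immediate_costs.append(comparisons)
    return out + left[i:] + right[j:]
```

Summing `immediate_costs` after a call gives the total (undiscounted) cost of the whole recursion tree.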
One can think of it as cloning the MDP and generating one copy for each transition. There is a strong relation between the recurrence equations used to analyze the running time of recursive algorithms and the Bellman equation for the algorithm selection problem. The standard recurrence for mergesort is (Cormen et al., 1990)

T(n) = 2 T(n/2) + Θ(n),    T(1) = Θ(1),

where T(n) represents the running time on an instance of size n. The Bellman equation for the state value function of the Markov chain underlying mergesort is

V(s_n) = 2 V(s_{n/2}) + R(s_n, a_m),    V(s_1) = 0,

where R(s_n, a_m) is the cost for choosing mergesort in state s_n that corresponds to an instance of size n;¹ the Bellman

¹ Size is the most crucial instance feature for most problems. In presenting our method we assume that the state of the MDP consists solely of the instance size, but, in general, several other features may be used.

equation captures the underlying structure of the recursive algorithm. In the most general case, the average running time T(n) of a recursive algorithm that creates k subproblems of sizes n_1, n_2, ..., n_k is described by the recurrence

T(n) = E[ Σ_{j=1}^{k} T(n_j) ] + t(n),

where t(n) is the preprocessing and postprocessing time. On the other hand, the value of a state s_n under a fixed deterministic policy would be expressed as follows:

V(s_n) = E[ Σ_{j=1}^{k_a} V(s_{n_j}) ] + R(s_n, a),

where a is the algorithm chosen by the policy for state s_n, the s_{n_j} are the states describing the resulting subproblems, and R(s_n, a) is the cost for choosing a in state s_n. Although T(n) corresponds to V(s_n), it is expected that V(s_n) < T(n), that is, the expected time for the combined algorithm is less than the time for the recursive algorithm alone.

In general, there is no model of the MDP available and thus, in order to act optimally, either a model must be learned by experience, or a model-free approach must be used. We choose the second track and focus on learning the state-action value function Q(s, a). In this case, the Bellman optimality equation becomes

Q(s_n, a) = E[ Σ_{j=1}^{k_a} min_{a'} Q(s_{n_j}, a') ] + R(s_n, a).

3. Learning Mechanism

Our learning mechanism is a variation of the well-known Q-learning algorithm (Watkins & Dayan, 1992), adapted to account for multiple state transitions. The general (undiscounted) update equation of Q-learning is

Q^{(t+1)}(s_t, a_t) = (1 − α) Q^{(t)}(s_t, a_t) + α [ R_{t+1} + min_a Q^{(t)}(s_{t+1}, a) ],

where s_t is the state at time t, a_t is the action taken at time t, R_{t+1} is the one-step cost for that decision, and α is the learning rate. If a_t is a non-recursive algorithm, the resulting state is terminal and has a cost of 0, so the update rule reduces to

Q^{(t+1)}(s_t, a_t) = (1 − α) Q^{(t)}(s_t, a_t) + α R(s_t, a_t).

For recursive algorithms the learning rule is a little more involved.
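The terminal (non-recursive) case just described reduces to a plain running average of observed costs. A minimal sketch, with assumed names and an assumed zero initialization for unseen state-action pairs:

```python
# Tabular Q-update for a terminal action: the target is simply the
# observed one-step cost R(s, a), since the successor state is terminal.

def q_update_terminal(Q, s, a, cost, alpha):
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * cost
    return Q[(s, a)]
```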
For the sake of simplicity, let's consider a recursive algorithm that generates only two subproblems (generalization to more subproblems is easy). In this case, the Q-learning rule is

Q^{(t+1)}(s_t, a_t) = (1 − α) Q^{(t)}(s_t, a_t) + α [ R(s_t, a_t) + min_a Q^{(t)}(s_1, a) + min_a Q^{(t)}(s_2, a) ],

where s_1 and s_2 are the states (at time t + 1) corresponding to the two subproblems. Notice that the target value depends on two estimates, which can introduce significant bias depending on the accuracy of the value function. In addition, multiple bootstrapping can easily cause divergence of the value function to wrong estimates if a function approximator is used (Boyan & Moore, 1995). The two resulting states must be visited individually in turn, as both subproblems must be solved. That means that it is necessary to store state information for all the pending states along the current path in the recursion tree.

The update rule above makes use of previous estimates in updating the value of the current state-action pair in the spirit of Temporal Difference (TD) algorithms (Sutton & Barto, 1998). Alternatively, one could unfold (solve completely) each of the two subproblems, adding the individual costs at each step. This is the Monte-Carlo return, R^π(s) = Σ_t R(s_t, a_t), and it expresses the sum of all individual costs when starting with a subproblem corresponding to state s and following the policy π until the subproblem has been fully solved. To get good estimates, the policy π should not take any exploratory actions. Typically, π is the greedy policy with respect to the current value function. Although R^π(s) is an unbiased estimate of the target value of Q(s, a), it has high variance as it depends on several returns and is not available before the end of the episode. Unfolding both subproblems would result in a pure Monte-Carlo (MC) algorithm with the following update rule and the shortcomings just mentioned:

Q^{(t+1)}(s_t, a_t) = (1 − α) Q^{(t)}(s_t, a_t) + α [ R(s_t, a_t) + R^π(s_1) + R^π(s_2) ].
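The two-subproblem TD rule above can be sketched directly in code. This is an illustrative fragment with assumed names; unseen state-action pairs default to an assumed initial value of zero.

```python
# TD-style Q-update for a recursive action that yields two subproblem
# states s1 and s2: bootstrap on the value estimates of BOTH successors.

def q_update_two_subproblems(Q, actions, s, a, cost, s1, s2, alpha):
    target = (cost
              + min(Q.get((s1, b), 0.0) for b in actions)
              + min(Q.get((s2, b), 0.0) for b in actions))
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
    return Q[(s, a)]
```

The double bootstrap in `target` is exactly the source of the bias and divergence risk noted above.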
Our learning rule combines the TD and MC rules above, by taking the MC approach on one subproblem and the TD approach on the other. In other words, one subproblem (say, the smallest one) is unfolded and its Monte-Carlo return is added to the current one-step return, before bootstrapping and recursing on the other. This is a viable alternative in this problem because of the one-to-many state transitions. The update rule takes the form

Q^{(t+1)}(s_t, a_t) = (1 − α) Q^{(t)}(s_t, a_t) + α [ R(s_t, a_t) + R^π(s_1) + min_a Q^{(t)}(s_2, a) ]

(the term R(s_t, a_t) + R^π(s_1) together plays the role of the one-step cost R_{t+1}),

where s_2 is the state corresponding to the subproblem we recurse on. By choosing s_2 to be the largest one (or the hardest to solve, in general), we achieve several things: (1) more opportunities for later exploration, (2) less variance in R_{t+1}, and (3) a small recursion stack (for unfolding the small subproblem). In addition, our problem becomes an ordinary MDP with single state transitions, with the extra transition effectively pushed into the cost function. Also, unlike the pure TD approach, there is no need for extra state information storage. Figure 2 clarifies all these issues. This learning rule is used in all our experiments.

An issue related to our learning rule concerns the availability of R(s, a) during learning. If the last step of the recursive algorithm is one or more recursive calls, then R(s, a) is immediately available before any attempt to solve the subproblems is made. Thus, the system can learn about the current state by immediately applying the learning rule and then continuing independently with the subproblems, discarding the current state information. This is similar to the use of tail recursion to improve the efficiency of recursive calls. However, if the algorithm requires some amount of postprocessing work after one or more subproblems are solved, then the return R(s, a) is delayed until these subproblems have been completed. Clearly, learning is delayed in this case and state information storage is necessary. In our experiments in Section 5, we take advantage of the tail recursion as this is allowed by the algorithms we explored.

4. Generalization and Approximation

In this initial study, we have used both table-based and approximation methods to represent the value function and cope with the size of the state space. In particular, we make use of state aggregation and linear architectures. State aggregation is primarily used to compress specific instance features, like problem size.
The rationale is that although the running time of an algorithm might be significantly different for small feature values, this relative difference fades out as values become large. For example, sorting 200 elements is relatively more expensive than sorting 100 elements, but there is almost no relative difference between sorting 5200 and 5100 elements. So, in order to avoid an explosion of the state space, in our experiments we use logarithmic compression that allows for high resolution at small feature values and progressively lower resolution as values grow. In particular, the value v of an instance feature is mapped to v′ according to v′ = ⌈log_{1.1}(v + 1)⌉. This formula² maps 100, 1000, and 10000 to 49, 73, and 97 respectively. The unit increment is used to overcome state values of 0.

² The base 1.1 of the logarithm is an empirically-derived value that simply provides the desired resolution.

Figure 2. The learning mechanism. The Monte-Carlo return from the smaller subproblem is included in the cost of its parent, followed by a transition to the bigger subproblem. The same pattern applies recursively at all levels/time steps. Notice that once exploration is prevented at some node, it is prevented in the whole subtree under the node. The bold arrows show the trajectory of the standard MDP.

Linear architectures are used to approximate the value function. Recall that such an approximator represents Q(s, a) as a linear combination φ(s, a)·w of k basis functions φ(s, a) with coefficients (or weights) w. The k weights w are estimated in a way that minimizes discrepancy with the observed data in the least-squares sense. The observed data take the form {s_t, a_t, Q^{(t+1)}(s_t, a_t)} for t = 1, 2, ..., where Q^{(t+1)}(s_t, a_t) is the new (updated) value given by our learning rule in Section 3. Ideally, we would like φ(s_t, a_t)·w = Q^{(t+1)}(s_t, a_t) to be true for all data.
Using Φ to denote the matrix with rows φ(s_t, a_t)ᵀ and q to denote the vector with components Q^{(t+1)}(s_t, a_t), the least-squares solution for w is given by solving the k × k linear system

(ΦᵀΦ) w = Φᵀ q  ⟹  w = (ΦᵀΦ)⁻¹ Φᵀ q.

The matrix Φ and the vector q can become extremely big as data accumulate. Fortunately, we need only maintain the k × k matrix A = ΦᵀΦ and the k-dimensional vector b = Φᵀ q, which can be incrementally updated with new data as follows:

A^{(t+1)} = A^{(t)} + φ(s_{t+1}, a_{t+1}) φ(s_{t+1}, a_{t+1})ᵀ,
b^{(t+1)} = b^{(t)} + φ(s_{t+1}, a_{t+1}) Q^{(t+2)}(s_{t+1}, a_{t+1}).
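The bookkeeping above can be sketched in a few lines. This is an illustrative implementation (class and method names assumed) of maintaining A = ΦᵀΦ and b = Φᵀq incrementally and solving for the weights on demand:

```python
import numpy as np

# Incremental least-squares: accumulate A and b one sample at a time,
# then solve A w = b whenever the weights are needed.

class IncrementalLSQ:
    def __init__(self, k):
        self.A = np.zeros((k, k))   # A = Phi^T Phi
        self.b = np.zeros(k)        # b = Phi^T q

    def add(self, phi, q_target):
        # phi: feature vector phi(s, a); q_target: updated Q value
        self.A += np.outer(phi, phi)
        self.b += phi * q_target

    def weights(self):
        return np.linalg.solve(self.A, self.b)
```

Because only A (k × k) and b (k) are stored, memory use is independent of the number of samples, which is the point of the derivation above.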

The weights can be updated by w^{(t)} = (A^{(t)})⁻¹ b^{(t)} whenever needed. In this work, we use a separate linear architecture for each algorithm a, each one having its own set of weights, that is, w = w(a). This least-squares approach is similar to the one described by Boyan (1999), and actually extends the LSTD(λ) algorithm to general MDPs for λ = 0.

5. Results

We have applied the ideas above on two fundamental computational problems: order statistic selection and sorting. These early experimental results³ reveal that there is potential for getting the most out of well-known algorithms by combining them as suggested in the previous sections.

5.1 Order Statistic Selection

For the order statistic selection problem, we are given an array of n (unordered) numbers and some integer index i, 1 ≤ i ≤ n. We would like to select the number that would rank i-th in the array if the numbers were sorted in ascending order. There are several algorithms for order statistic selection. We picked two of them such that neither is best in all cases, otherwise learning would not really help.⁴

DETERMINISTIC SELECT (Cormen et al., 1990) is a recursive worst-case linear-time algorithm. It finds a good partitioning element by making a recursive call to find the median of a subset of the input. That subset consists of the medians of every five elements of the input, and therefore its size is a fifth of the original size. Then, the original input is partitioned and a recursive call is made to the appropriate (left or right) subproblem. The size of this subproblem varies, but it is no less than 3n/10 − 6 and no more than 7n/10 + 6, if n is the original size. Hence, two subproblems are solved at each recursive call. The recursion continues until the desired element is restricted to a subset of size less than or equal to 5, from where it can be easily isolated. The performance of the algorithm is almost invariant with respect to the value of the index (assuming fixed array size).

HEAP SELECT is a (non-recursive) algorithm with O(n log n) worst-case running time. The basic step of this algorithm is the construction of a binary heap between the position i and the closest end of the array. Without loss of generality, assume that i is closer to the left end, i.e., i ≤ n/2 (the other case is symmetric). All the elements between positions 1 and i are organized into a heap, whose root is located at position i and holds the maximum element. Then, the algorithm iterates through the remaining elements; if an element is smaller than the root element of the heap, the two elements are exchanged and the new root is pushed into the heap to maintain the heap property. At the end, the desired element is located at the root of the heap. To see this, notice that all elements in the heap are smaller than or equal to the elements outside the heap. Obviously, the closer the index to the left end, the smaller the heap and the faster the algorithm, since T(n, i) = Θ(i) + O((n − i) log i) for i ≤ n/2.

Figure 3 (in addition to other information) shows the average running time of the two algorithms for randomly generated instances of fixed size (10000) and varying index (1–10000). As expected, HEAP SELECT performs much better than DETERMINISTIC SELECT for indices close to the ends. However, for indices close to the middle (e.g., medians) DETERMINISTIC SELECT outperforms HEAP SELECT. A similar picture holds for other array sizes as well. Thus, there is potential for a better average running time if the two algorithms are combined.

As a first attempt to learn how to combine the two algorithms, we used a tabular approach. The state of the process, in this case, consists of two instance features, namely the size n of the input and the distance d of the index from the closest end of the array (d = min{i, n − i + 1}).

³ All experiments were performed on a Sun Ultra 5 machine using MATLAB code. All running time plots represent averages of 100 runs per data point. Learning was turned off during performance testing.

⁴ We excluded RANDOMIZED SELECT because it was consistently best in our initial studies.
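The idea behind HEAP SELECT can be sketched compactly with Python's `heapq`. This is an illustrative variant, not the paper's in-place implementation: it keeps a separate max-heap of the i smallest elements seen so far instead of building the heap inside the array.

```python
import heapq

# Sketch of HEAP SELECT's idea: maintain a max-heap of the i smallest
# candidates (heapq is a min-heap, so values are negated). After one pass,
# the heap's root is the i-th smallest element.

def heap_select(a, i):
    """Return the element at (1-based) rank i of a, as if a were sorted."""
    heap = [-x for x in a[:i]]
    heapq.heapify(heap)
    for x in a[i:]:
        if x < -heap[0]:            # smaller than current maximum: swap it in
            heapq.heapreplace(heap, -x)
    return -heap[0]
```

As in the paper's description, the work is Θ(i) to build the heap plus O((n − i) log i) for the scan, so small ranks are cheap and ranks near the middle are the most expensive.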
We assume that the problem is symmetric with respect to the middle of the array to reduce the range of d: selecting, say, the 10th element is equivalent to selecting the 91st one out of 100 elements. Also, the difference between selecting the 400th element among 10000 elements and the 401st element in 9995 elements is so small that discriminating between these two cases is not of much help. So, in order to avoid an explosion of the state space, we logarithmically compress the two features as described in Section 4. Given that, half a table of size 90 × 83 is sufficient to represent the value function.

We trained the system on thousands of randomly generated instances (about 100,000) of fixed size (10000) and varying index (1–10000). To facilitate training, we first trained on several instances of smaller size. An ε-greedy policy with a high degree of exploration (ε = 0.6) was used during training. Two decreasing learning rates were used, one for DETERMINISTIC SELECT (α₁ = 0.4 initially) and one for HEAP SELECT (α₂ = 0.7 initially). DETERMINISTIC SELECT has varying cost for a given state because of the non-deterministic transitions, whereas HEAP SELECT is quite invariant. This difference is reflected in the two learning rates.

The results are shown in Figure 3. The cut-off point algorithm selects HEAP SELECT when the index is within the first 13% or the last 7% of the input, and DETERMINISTIC SELECT otherwise. The two cut-off points were determined directly from the crossover points in Figure 3.

Figure 3. Results for order statistic selection (tabular case).
Figure 4. Order statistic selection (linear architecture).

Thus, it implements an empirically derived policy, typical of those found in optimized software implementations. The learned algorithm, however, performs better, because the ideal cut-off points differ by problem size. The exception close to index value 10000 is due to the lack of the assumed perfect symmetry. The system is forced to arrive at a compromise using symmetric cut-off points.

Although the tabular approach reveals a performance gain, it comes with several disadvantages: it uses a huge amount of storage, it imposes upper limits on instance features (e.g., size), and it takes a long time to train (several days for the case above). This is mostly due to the lack of good generalization. The key observation here is that the running time of an algorithm typically varies smoothly as some instance feature changes smoothly. That makes generalization much easier compared to other domains.

Our second approach to learning makes use of linear architectures to represent the value function. The state s = (n, d) in this case consists of the problem size n and the signed distance d of the index i from the midpoint of the array (d = i − n/2). Using our knowledge about the shape of the value function, and after many trials, we found that the value function Q(n, d, a) can be approximated by the following two parametric functions (one for each action/algorithm):

Q(n, d, a_D) = w_{1D} n sqrt(1 − (2d/n)²) + w_{2D} n (2d/n)²,
Q(n, d, a_H) = w_{1H} n sqrt(1 − (2d/n)⁴) + w_{2H} n (2d/n)²,

where a_D, a_H are the actions of selecting DETERMINISTIC SELECT and HEAP SELECT respectively, and w_{1D}, w_{2D}, w_{1H}, w_{2H} are the parameters (weights).
Briefly, these fuctios represet a liear combiatio of a semiellipse ad a parabola (for costat ). The amout of storage required i this case (see Sectio 4) is a 2 2 matrix (A (t) )adtwo2 1arrays (q (t) ad w (t) ) for each equatio. That gives a total of 16 real umbers which compares favorably with the 3735 umbers of the tabular case. We traied the system o 2,4 radomly geerated istaces of differet sizes distributed uiformly i the rage [2, 1] with a schedule that starts with smaller sizes ad moves toward larger sizes. The idex was also varied uiformly withi the available rage for each size. We set the learig rate α to 1. for both actios to prevet use of wrog estimates ad divergece of the value fuctio. With α =1., oly estimates of smaller sizes are used, sice the resultig subproblems ca oly be smaller. As log as the traiig schedule is from smaller to larger sizes, it is guarateed that these estimates will be fairly accurate, because traiig has bee completed for smaller sizes. This idea is similar to the Grow-Support algorithm of Boya ad Moore (Boya & Moore, 1995). Exploratio was set to maximum (ɛ =1) so that both actios get approximately the same amout ad distributio of data poits. We used the leastsquares approach, described i Sectio 4, to estimate the weights at each step durig traiig. The mai advatage of the liear architecture is that the value fuctio is defied for ay state, eve for states the system has ot bee traied o. Also, the learig time was less tha a hour i this case. Overall, this secod approach overcomes all the difficulties of the tabular approach with oly a small degrade i performace (the best cut-offs caot be estimated precisely due to the restricted form of approximatio). Figure 4 shows performace results for fixed size ( = 1) ad Figure 5 results for

fixed index (d = 0, the median) and size up to 100,000. Note that the system was trained only on instances of size up to 100. These initial results revealed that our approach to the algorithm selection problem is feasible and encouraged experimentation with other problems.

Figure 5. Order statistic selection (median, linear architecture).

5.2 Sorting

The sorting problem is to rearrange an array of n (unordered) numbers in ascending order. This is probably the best known computational problem and there exist numerous sorting algorithms. QUICKSORT (Cormen et al., 1990) is a recursive randomized sorting algorithm with O(n²) worst-case running time and O(n log n) expected running time. It picks a partitioning element from the array at random and partitions the input in two parts such that all elements in the first part are less than or equal to the elements in the second part. Then, the two parts are sorted recursively. QUICKSORT is extremely efficient for large arrays. INSERTIONSORT (Cormen et al., 1990) is a non-recursive algorithm with O(n²) worst-case running time. It starts with the first element as the initial sorted list and iteratively inserts the other elements one by one at their correct position by shifting elements that are greater to the right. INSERTIONSORT is very efficient for small arrays and for inputs that are almost sorted.

A common approach is to run QUICKSORT for large sizes and switch to INSERTIONSORT when the size falls below some cut-off point. However, the optimal cut-off point may depend on several uncertain factors and it is unlikely to be fixed. Using our approach, it is possible to figure out the best cut-off point on the fly. The state of the process consists of the size n of the input. Using our knowledge of the asymptotic running times, we approximate the value function (that is, the expected running time) by the following parametric functions (one for each algorithm):
Q(n, a) = w_1^(a) · n² + w_2^(a) · n · log₂ n + w_3^(a) · n,

where a is either a_Q or a_I. The constant term is omitted, because Q(0, a) = 0 by definition. The weights for these linear architectures are estimated by the least-squares approach of Section 4. We trained the system on 400 randomly generated instances only, 200 with size in [1, 100], and 200 in [100, 1000], starting from smaller and moving toward larger sizes. We focus on this small range because the cut-off point lies somewhere in that range. The learning rate was set to 1.0 for the reasons mentioned in the order statistic selection case. The learned weights were w^(Q) = (.6, .85, 5.969) × 10⁻⁴ and w^(I) = (.142, .54, 3.539) × 10⁻⁴. Figure 6 shows the learned value function along with actual running times for the individual algorithms. As expected, the value function for QUICKSORT is less than the actual running time of pure QUICKSORT, because of the ability to invoke INSERTIONSORT as needed.

Notice that the cut-off point suggested by the learning algorithm is much lower than the point where the running-time curves cross each other. This leads to an interesting insight: once a cut-off point is employed (say, the point where the running times cross), QUICKSORT becomes better overall, but INSERTIONSORT does not change. Thus, QUICKSORT can now be faster for instances right below the chosen cut-off point, where it was not faster before. That gives a new cut-off point, the same reasoning applies again and again, and the cut-off point moves lower and lower until it eventually converges. This is captured by the learning algorithm, but is difficult to work out empirically offline.

Performance results are shown in Figure 7 for sizes up to 1,000. The cut-off point algorithm is the one that uses the crossover point (size = 47) of the running-time curves. The learned algorithm, whose policy sets the cut-off point to size = 35, performs around 15% better. The learned policy was precomputed beforehand to eliminate the overhead of evaluating the value function at each step.
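In deployment, the learned policy reduces to an ordinary hybrid sort with the cut-off precomputed. A minimal sketch (the cut-off value 35 comes from the text; the QUICKSORT/INSERTIONSORT bodies are standard textbook versions, not the authors' code):

```python
import random

CUTOFF = 35  # learned cut-off from the text; 47 is the naive crossover

def insertion_sort(a, lo, hi):
    """Sort a[lo..hi] in place; fast on small or nearly sorted ranges."""
    for i in range(lo + 1, hi + 1):
        key, j = a[i], i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def hybrid_sort(a, lo=0, hi=None):
    """QUICKSORT that hands ranges at or below CUTOFF to INSERTIONSORT."""
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= CUTOFF:
        insertion_sort(a, lo, hi)
        return
    # Randomized Lomuto partition around a randomly chosen pivot.
    p = random.randint(lo, hi)
    a[p], a[hi] = a[hi], a[p]
    pivot, store = a[hi], lo
    for i in range(lo, hi):
        if a[i] <= pivot:
            a[i], a[store] = a[store], a[i]
            store += 1
    a[store], a[hi] = a[hi], a[store]
    hybrid_sort(a, lo, store - 1)
    hybrid_sort(a, store + 1, hi)
```

Because the policy here is a pure size threshold, evaluating it costs one comparison per recursive call; this is why precomputing the policy removes the overhead of evaluating the value function at each step.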
6. Future Work and Conclusions

In this paper, we have ignored the distribution of the input data; all data come from the same uniform random distribution. Ideally, a learning system should be able to adapt rapidly to changes in the underlying distribution. To this end, it is required that (1) learning is continuously on, (2) some exploration is allowed, and (3) the most recent data overshadow old data. We are currently experimenting with exponential windowing in our least-squares approach to exponentially discount old data.

Figure 6. Value function, actual running time, and cut-off points (from running times: size = 47; from the value function: size = 35).
Figure 7. Running times for sorting.

Allowing continuous exploration might lead to a cost penalty or to the discovery of a change. We have no clear solution to that problem, but in order to avoid unnecessary time penalties, we need more control over the algorithms. For example, we could terminate a selected (terminal) algorithm if its current running time significantly exceeds the estimate of the value function, and select another algorithm. We are currently investigating these ideas on sorting. We plan to add more algorithms to the algorithm set and to target rapid online adaptation. We also plan to apply the proposed ideas to other problems, like convex hull and graph problems, where algorithm selection may induce significant savings.

The long-term goal and potential contribution of the work presented in this paper is twofold. First, from a computer science point of view, we envision an era where a computational problem is solved not by an isolated algorithm selected on the basis of its theoretical properties, but by an adaptive system that encapsulates the available repertoire of algorithms for that problem and selects among them based mostly on their practical performance. We believe that such systems will be more efficient in applications that involve a wide and diverse range of problem instances. Second, from a machine learning point of view, the real-time constraint (learning is part of solving the problem) calls for learning algorithms that generalize and adapt rapidly while consuming minimum computational resources (especially time).
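The exponential windowing mentioned above can be sketched as least squares with a forgetting factor: the accumulated statistics A = Σ φφᵀ and q = Σ φ·t are discounted before each new sample is folded in. This is an illustration under assumed notation, not the authors' code; λ = 0.9 is an arbitrary choice, and λ = 1 reduces to the plain accumulation of Section 4.

```python
def ew_update(A, q, phi, t, lam=0.9):
    """One exponentially windowed least-squares step: existing statistics
    are discounted by lam, so recent data overshadow old data."""
    k = len(phi)
    A = [[lam * A[i][j] + phi[i] * phi[j] for j in range(k)] for i in range(k)]
    q = [lam * q[i] + phi[i] * t for i in range(k)]
    return A, q

# Single-feature demo: the measured time switches from 2 to 5; the
# discounted estimate tracks the new regime instead of averaging both.
A, q = [[0.0]], [0.0]
for t in [2.0] * 50 + [5.0] * 200:
    A, q = ew_update(A, q, [1.0], t)        # lam = 0.9: forgets the old regime
w = q[0] / A[0][0]

A1, q1 = [[0.0]], [0.0]
for t in [2.0] * 50 + [5.0] * 200:
    A1, q1 = ew_update(A1, q1, [1.0], t, lam=1.0)  # plain least squares
w_plain = q1[0] / A1[0][0]
```

Here w converges to the new value 5, while the undiscounted estimate w_plain settles on the 50/200 mixture of the two regimes (4.4), illustrating why discounting is needed for rapid adaptation to a changing input distribution.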
The challenge of real-time learning becomes more and more important for the success of learning systems in real-world applications. The results presented here are the first steps along these directions and toward these goals.

Acknowledgments

The first author would like to thank the Lilian-Boudouri Foundation in Greece for financial support. The second author is supported in part by NSF-IRI-97-2576-CAREER.

References

Boyan, J. A. (1999). Least-squares temporal difference learning. Machine Learning: Proceedings of the Sixteenth International Conference (pp. 49-56). San Francisco, CA: Morgan Kaufmann.

Boyan, J. A., & Moore, A. W. (1995). Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems 7 (pp. 369-376). Cambridge, MA: The MIT Press.

Cormen, T. H., Leiserson, C. E., & Rivest, R. L. (1990). Introduction to algorithms. Cambridge, MA: The MIT Press.

Fink, E. (1998). How to solve it automatically: Selection among problem-solving methods. Proceedings of the Fourth International Conference on Artificial Intelligence Planning Systems (pp. 128-136). AAAI Press.

Lobjois, L., & Lemaître, M. (1998). Branch and bound algorithm selection by performance prediction. Proceedings of the Fifteenth National Conference on Artificial Intelligence (pp. 353-358). Menlo Park, CA: AAAI Press.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York, NY: John Wiley & Sons, Inc.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.

Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.