Pruning Game Tree by Rollouts


Bojun Huang
Microsoft Research

Abstract

In this paper we show that the α-β algorithm and its successor MT-SSS*, as two classic minimax search algorithms, can be implemented as rollout algorithms, a generic algorithmic paradigm widely used in many domains. Specifically, we define a family of rollout algorithms, in which the rollout policy is restricted to select successor nodes only from a certain subset of the children list. We show that any rollout policy in this family (either deterministic or randomized) is guaranteed to evaluate the game tree correctly with a finite number of rollouts. Moreover, we identify simple rollout policies in this family that implement α-β and MT-SSS*. Specifically, given any game tree, the rollout algorithms with these particular policies always visit the same set of leaf nodes in the same order as α-β and MT-SSS*, respectively. Our results suggest that traditional pruning techniques and the recent Monte Carlo Tree Search algorithms, as two competing approaches for game tree evaluation, may be unified under the rollout paradigm.

Introduction

Game tree evaluation formulates a logic process to make optimal worst-case plans for sequential decision making problems, a research topic usually benchmarked by two-player board games. Historically, classic game-tree search algorithms like α-β and its successors have successfully demonstrated human-champion-level performance in tactical games like Chess, but their computational complexity suffers from exponential growth with respect to the depth of the game tree. In order to reduce the size of the game tree under consideration, these traditional game-tree search algorithms have to be complemented by domain-specific evaluation functions, a task that is very difficult in some domains, such as GO and General Game Playing. Recently, rollout algorithms have been introduced as a new paradigm to evaluate large game trees, partially because of their independence of domain-specific evaluation functions.
In general, rollout is a generic algorithmic paradigm that has been widely used in many domains, such as combinatorial optimization (Bertsekas, Tsitsiklis, and Wu 1997) (Glover and Taillard 1993), stochastic optimization (Bertsekas and Castanon 1999), planning in Markov Decision Processes (Péret and Garcia 2004) (Kocsis and Szepesvári 2006), and game playing (Tesauro and Galperin 1996) (Abramson 1990). In the context of game tree evaluation, a rollout is a process that simulates the game from the current state (the root of the game tree) to a terminating state (a leaf node), following a certain rollout policy that determines each move of the rollout in the state space. A rollout algorithm is a sequence of rollout processes, where information obtained from one rollout process can be utilized to reinforce the policies of subsequent rollouts, such that the long-term performance of the algorithm may converge towards the (near-)optimal policy.

In particular, a specific class of rollout algorithms, called Monte Carlo Tree Search (MCTS), has substantially advanced the state of the art in Computer GO (Gelly et al. 2012) and General Game Playing (Finnsson and Björnsson 2008). The key idea of MCTS is to use the average outcome of rollouts to approximate the minimax value of a game tree, which is shown to be effective in dynamic games (Coulom 2006). On the other hand, however, people have also observed that existing MCTS algorithms appear to lack the capability of making narrow lines of tactical plans as traditional game-tree search algorithms do. For example, Ramanujan, Sabharwal, and Selman (2012) show that UCT, a popular rollout algorithm in game playing, is prone to being misled by over-estimated nodes in the tree, thus often being trapped in sub-optimal solutions in some situations. As a result, researchers have been trying to combine MCTS algorithms with traditional game tree search algorithms, in order to design unified algorithms that share the merits of both sides (Baier and Winands 2013) (Lanctot et al. 2013) (Coulom 2006).

A shorter version of this paper is published in AAAI.
Unfortunately, one of the major difficulties for the unification is that most traditional α-β-like algorithms are based on minimax search, which seems to be a different paradigm from rollout. In this paper, we show that two classic game-tree search algorithms, α-β and MT-SSS*, can be implemented as rollout algorithms. The observation offers a new perspective to understand these traditional game-tree search algorithms, one that unifies them with their modern competitors under the generic framework of rollout. Specifically, we define a broad family of rollout algorithms, and prove that any algorithm in this family is guaranteed to correctly evaluate game trees by visiting each leaf

node at most once. The correctness guarantee is policy-oblivious, which applies to arbitrary rollout policies in the family, either deterministic or probabilistic. We then identify two simple rollout policies in this family that implement the classic minimax search algorithms α-β and MT-SSS* under our rollout framework. We prove that given any game tree, the rollout algorithms with these particular policies always visit the same set of leaf nodes in the same order as α-β and an augmented version of MT-SSS*, respectively. As a by-product, the augmented versions of these classic algorithms identified in our equivalence analysis are guaranteed to outprune their original versions.

Preliminaries

Game tree model

A game tree $T$ is defined by a tuple $(S, C, V)$, where $S$ is a finite state space, $C(\cdot)$ is the successor function that defines an ordered children list $C(s)$ for each state $s \in S$, and $V(\cdot)$ is the value function that defines a minimax value for each state $s \in S$. We assume that $C(\cdot)$ imposes a tree topology over the state space, which means there is a single state that does not show up in any children list, and this state is identified as the root node of the tree. A state $s$ is a leaf node if its children list is empty (i.e. $C(s) = \emptyset$); otherwise it is an internal node. The value function $V(\cdot)$ of a given game tree can be specified by labeling each state $s \in S$ as either MAX or MIN, and further associating each leaf-node state $s$ with a deterministic reward $R(s)$. The minimax value $V(s)$ of each state $s \in S$ is then defined as

$$V(s) = \begin{cases} R(s) & \text{if } s \text{ is Leaf;} \\ \max_{c \in C(s)} V(c) & \text{if } s \text{ is Internal \& MAX;} \\ \min_{c \in C(s)} V(c) & \text{if } s \text{ is Internal \& MIN.} \end{cases} \tag{1}$$

To align with previous work, we assume that the reward $R(s)$ for any $s$ ranges over a finite set of integers, and we use the symbols $+\infty$ and $-\infty$ to denote a finite upper bound and lower bound that are larger and smaller than any possible value of $R(s)$, respectively.¹ Explicit game trees often admit compact specifications in the real world (for example, as the rules of the game). Given the specification of a game tree, our goal is to compute the minimax value of the root node, denoted by $V(root)$, as quickly as possible.
Specifically, we measure the efficiency of the algorithm by the number of times the algorithm calls the reward function $R(\cdot)$, i.e. the number of leaf-node evaluations, which is often an effective indicator of the computation time of the algorithm in practice (Marsland 1986).

As a convenient abstraction, we suppose that any algorithm in the game tree model can access an external storage (in unit time) to retrieve/store a closed interval $[v_s^-, v_s^+]$ as the value range for any specified state $s \in S$. Initially, we have $[v_s^-, v_s^+] = [-\infty, +\infty]$ in the storage for all states. We remark that such a whole-space storage only serves as a conceptually unified interface that simplifies the presentation and analysis in this paper. In practice, we do not need to allocate physical memory for nodes with the trivial bound $[-\infty, +\infty]$. For example, the storage is physically empty if the algorithm does not access the storage at all. Such a storage can be easily implemented based on standard data structures such as a transposition table (Zobrist 1970).

¹ The algorithms and analysis proposed in this paper also apply to more general cases, such as when $R(s)$ ranges over an infinite set of integers or over the real field.

Depth-first algorithms and α-β pruning

Observe that Eq. (1) implies that there must exist at least one leaf node in the tree that has the same minimax value as the root. We call any such leaf node a critical leaf. A game-tree search algorithm computes $V(root)$ by searching for a critical leaf in the given game tree. Depth-first algorithms are a specific class of game-tree search algorithms that compute $V(root)$ through a depth-first search. They evaluate the leaf nodes strictly from left to right, under the order induced by the successor function $C(\cdot)$. It is well known that the α-β algorithm is the optimal depth-first algorithm in the sense that, for any game tree, no depth-first algorithm can correctly compute $V(root)$ by evaluating fewer leaf nodes than α-β does (Pearl 1984).
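As a point of reference for the pruning algorithms discussed in this paper, Eq. (1) and the leaf-evaluation cost measure can be illustrated by an exhaustive depth-first evaluation. The sketch below is not from the paper: it assumes a hypothetical nested-list encoding in which an integer is a leaf reward $R(s)$, a non-empty list is an internal node, the root is a MAX node, and MAX and MIN levels alternate.

```python
def minimax(node, is_max=True, evaluated=None):
    """Exhaustively compute V(node) per Eq. (1), recording each leaf evaluation.

    Assumed encoding (not from the paper): an int is a leaf reward R(s),
    a non-empty list is an internal node, and MAX/MIN levels alternate.
    """
    if evaluated is None:
        evaluated = []
    if isinstance(node, int):          # leaf: V(s) = R(s)
        evaluated.append(node)
        return node, evaluated
    # internal node: evaluate children left to right, then take max or min
    values = [minimax(child, not is_max, evaluated)[0] for child in node]
    value = max(values) if is_max else min(values)
    return value, evaluated


tree = [[3, 5], [2, 9], [0, 7]]        # MAX root over three MIN nodes
value, evaluated = minimax(tree)
print(value, len(evaluated))
```

Without pruning, every leaf is evaluated, so the length of `evaluated` equals the number of leaves; this is the baseline cost that the pruning algorithms aim to reduce.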
The key idea of the α-β algorithm is to maintain an open interval $(\alpha, \beta)$ when visiting any tree node $s$, such that a critical leaf can be located in the subtree under $s$ only if $V(s) \in (\alpha, \beta)$. In other words, whenever we know that the value of $s$ falls outside the interval $(\alpha, \beta)$, we can skip over all leaf nodes under $s$ without compromising correctness. Algorithm 1 gives the pseudocode of the α-β algorithm, which follows Figure 2 in (Plaat et al. 1996). Given a tree node $s$ and an open interval $(\alpha, \beta)$, the alphabeta procedure returns a value $g$, which equals the exact value of $V(s)$ if $\alpha < g < \beta$, but only encodes an upper bound of $V(s)$ if $g \le \alpha$ (a situation called fail-low), or a lower bound of $V(s)$ if $g \ge \beta$ (a situation called fail-high), in agreement with the bound updates at Line 22 of Algorithm 1. Note that Algorithm 1 accesses the external storage at Line 4 and Line 22, which is unnecessary because the algorithm never visits a tree node more than once. Indeed, the basic version of the α-β algorithm does not require the external storage at all. Here, we provide the storage-enhanced version of α-β for the sake of introducing its successor algorithm MT-SSS*.

Best-first algorithms and MT-SSS*

To a great extent, the pruning effect of the α-β algorithm depends on the order in which the leaf nodes are arranged in the tree. In general, identifying and evaluating the best node early tends to narrow down the $(\alpha, \beta)$ window more quickly, thus more effectively pruning suboptimal nodes in the subsequent search. In the best case, the optimal child of any internal node is ordered first, and thus is visited before any other sibling node in the depth-first search. Knuth and Moore (1975) prove that in this case the α-β algorithm only needs to evaluate $n^{1/2}$ leaf nodes, assuming $n$ leaves in total. In comparison, Pearl (1984) shows that this number degrades to around $n^{3/4}$ if the nodes are randomly ordered. In the worst case, it is possible to arrange the nodes in such

a way that the α-β algorithm has to evaluate all of the $n$ leaf nodes in the tree.

Best-first algorithms are a class of algorithms that always try to evaluate the node that currently looks most promising (to be or contain a critical leaf). In particular, the SSS* algorithm, first proposed by Stockman (1979) and later revised by Campbell (1983), is a classic best-first algorithm that is guaranteed to outprune the α-β algorithm in the following sense: SSS* never evaluates a leaf node that is not evaluated by α-β, while for some problem instances SSS* manages to evaluate fewer leaf nodes than α-β. The SSS* algorithm is based on the notion of solution tree. The basic idea is to treat each MAX node as a cluster of solution trees, and to always prefer to search the leaf node that is nested in the solution-tree cluster currently with the best upper bound.

Interestingly, Plaat et al. (1996) show that SSS*, as a best-first algorithm, can be implemented as a series of storage-enhanced depth-first searches. Each pass of such a depth-first search is called a Memory-enhanced Test, so this version of SSS* is also called MT-SSS*. Plaat et al. prove that for any game tree, MT-SSS* visits the same set of leaf nodes in the same order as the original SSS* algorithm proposed by Stockman. Algorithm 2 gives the pseudocode of MT-SSS*. Instead of directly determining the exact value of $V(root)$ in a single pass, the algorithm iteratively calls the alphabeta procedure (in Algorithm 1) to refine the value bounds of the root until the gap between the bounds is closed. Each pass of the alphabeta procedure examines the same question: Is the current upper bound of the root tight? The algorithm calls the alphabeta procedure with the minimum window $(v_{root}^+ - 1, v_{root}^+)$,² which forces the procedure to only visit the leaf nodes relevant to this question. In this case the alphabeta procedure answers the question by returning either a new upper bound that is lower than the original bound $v_{root}^+$, or a matching lower bound that equals $v_{root}^+$. Note that the alphabeta procedure stores value bounds in the external storage, so later iterations can re-use the results gained in previous iterations, avoiding repeated work. Since SSS* has been proven to outprune α-β, the total number of leaf-node evaluations over all passes of minimum-window searches in MT-SSS* will never exceed the number made by a single pass of full-window search. On the other hand, in cases when the given game tree is in a bad order, the best-first working style of MT-SSS* can help to significantly reduce the number of leaf-node evaluations.

Algorithm 1: The α-β algorithm enhanced with storage.
 1  g ← alphabeta(root, −∞, +∞);
 2  return g
 3  Function alphabeta(s, α, β)
 4      retrieve [v_s^-, v_s^+]; if v_s^- ≥ β then return v_s^-; if v_s^+ ≤ α then return v_s^+;
 5      if s is a leaf node then
 6          g ← R(s);
 7      else if s is a MAX node then
 8          g ← −∞; α_s ← α;
 9          foreach c ∈ C(s) do
10              g ← max{ g, alphabeta(c, α_s, β) };
11              α_s ← max{ α_s, g };
12              if g ≥ β then break;
13          end
14      else if s is a MIN node then
15          g ← +∞; β_s ← β;
16          foreach c ∈ C(s) do
17              g ← min{ g, alphabeta(c, α, β_s) };
18              β_s ← min{ β_s, g };
19              if g ≤ α then break;
20          end
21      end
22      if g < β then v_s^+ ← g; if g > α then v_s^- ← g; store [v_s^-, v_s^+];
23      return g

Algorithm 2: The MT-SSS* algorithm.
 1  [v_s^-, v_s^+] ← [−∞, +∞] for every s ∈ S;
 2  while v_root^- < v_root^+ do
 3      alphabeta(root, v_root^+ − 1, v_root^+);
 4  end
 5  return v_root^-

Monte Carlo Tree Search and UCT

While α-β and MT-SSS* can offer substantial improvement over exhaustive tree search, both of them still have to run in time exponential in the depth of the tree (Pearl 1984), which limits the tree size they can directly deal with. In practice, these algorithms typically need to be complemented with a static evaluation function that can make heuristic estimations of the minimax value of an arbitrarily given non-leaf node, resulting in the bounded look-ahead paradigm (Reinefeld and Marsland 1994).
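Looking back at Algorithms 1 and 2, the storage-enhanced alphabeta procedure and the MT-SSS* driver can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation. It assumes a hypothetical nested-list tree encoding (an integer is a leaf reward, a list is an internal node, levels alternate starting with MAX at the root), identifies nodes by their index path from the root, uses a dictionary as the external storage, and uses the finite sentinels `NEG` and `POS` in place of $-\infty$ and $+\infty$, assuming all rewards lie strictly between them.

```python
NEG, POS = -10**9, 10**9   # finite stand-ins for -inf / +inf (assumed reward range)


def alphabeta(node, alpha, beta, is_max, path, store, evaluated):
    """Storage-enhanced alpha-beta, following the structure of Algorithm 1."""
    lo, hi = store.get(path, (NEG, POS))        # retrieve [v-, v+]
    if lo >= beta:
        return lo
    if hi <= alpha:
        return hi
    if isinstance(node, int):                   # leaf: call the reward function
        evaluated.append(path)
        g = node
    elif is_max:
        g, a = NEG, alpha
        for i, child in enumerate(node):
            g = max(g, alphabeta(child, a, beta, False, path + (i,), store, evaluated))
            a = max(a, g)
            if g >= beta:                       # fail-high cut-off
                break
    else:
        g, b = POS, beta
        for i, child in enumerate(node):
            g = min(g, alphabeta(child, alpha, b, True, path + (i,), store, evaluated))
            b = min(b, g)
            if g <= alpha:                      # fail-low cut-off
                break
    if g < beta:
        hi = g                                  # g is an upper bound of V(s)
    if g > alpha:
        lo = g                                  # g is a lower bound of V(s)
    store[path] = (lo, hi)
    return g


def mt_sss(tree):
    """MT-SSS* driver, following Algorithm 2: repeated minimum-window passes."""
    store, evaluated = {}, []
    while True:
        lo, hi = store.get((), (NEG, POS))
        if lo >= hi:                            # root bounds have met
            return lo, evaluated
        alphabeta(tree, hi - 1, hi, True, (), store, evaluated)


tree = [[3, 5], [2, 9], [0, 7]]
ab_evaluated = []
ab_value = alphabeta(tree, NEG, POS, True, (), {}, ab_evaluated)
sss_value, sss_evaluated = mt_sss(tree)
print(ab_value, ab_evaluated)
print(sss_value, sss_evaluated)
```

The `evaluated` lists record every call to the reward function, so in this literal reading of Algorithm 1 a leaf may appear more than once in the MT-SSS* list when a later pass visits it again with a different window.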
In such a paradigm, an internal node $s$ may be considered as a virtual leaf node (or frontier node) under certain conditions, and in that case the evaluation function is applied to give the reward value of this virtual leaf node, with the whole sub-tree under $s$ being cut off from the tree search. The hope is that the evaluation function can reasonably approximate $V(s)$ at these virtual leaf nodes such that the result of searching only the partial tree above these virtual leaves is similar to the result of a complete tree search. However, depending on the domain, it can sometimes be highly challenging to design a satisfactory evaluation function and cut-off conditions.

² Recall that the rewards are assumed to be integers, in which case the open interval $(x - 1, x)$ is essentially empty if $x$ is an integer.

Monte Carlo Tree Search (MCTS) is an alternative algorithmic paradigm that can evaluate large game trees without sophisticated evaluation functions (Coulom 2006). As a specific class of rollout algorithms, MCTS algorithms repeatedly perform rollouts in the given game tree and use the average outcome of the rollouts to approximate the minimax value of the tree (Abramson 1990). Among others, the UCT algorithm is a particular instance of MCTS algorithms that has drawn a lot of attention in the community (Kocsis and Szepesvári 2006). Algorithm 3 shows the pseudo-code of UCT. At a tree node $s$, the algorithm uses a deterministic rollout policy that computes for each successor node $c$ a score $UCT(c) = \mu_c + \lambda \sqrt{\ln n_s / n_c}$, where $\mu_c$ is the average reward of the previous rollouts passing through $c$, $n_c$ is the number of such rollouts, and $n_s$ is the number of rollouts through $s$. Then the algorithm simply chooses the successor node with the highest score. One can check that the UCT score approaches the average reward $\mu_c$ if the node $c$ has been extensively visited. In contrast, when the sample size $n_c$ is small, less-visited successor nodes can get a substantial bonus in their scores, and thus may get a chance to be explored even if they have a low average reward $\mu_c$. The trade-off between exploitation and exploration can be controlled by fine-tuning the parameter $\lambda$, such that the resulting footprints of the rollouts are softly biased to the most promising variations. Kocsis and Szepesvári (2006) proved that the outcome of UCT always converges to the minimax value $V(root)$ if given infinite time.

Algorithm 3: The UCT algorithm, under given time budget T and parameter λ.
 1  n_s ← 0, µ_s ← 0 for every s ∈ S;
 2  while n_root < T do rollout(root);
 3  return µ_root
 4  Function rollout(s)
 5      if s is a leaf node then
 6          g ← R(s);
 7      else if s is a MAX node then
 8          c* ← arg max_{c ∈ C(s)} µ_c + λ sqrt(ln n_s / n_c);
 9          g ← rollout(c*);
10      else if s is a MIN node then
11          c* ← arg min_{c ∈ C(s)} µ_c − λ sqrt(ln n_s / n_c);
12          g ← rollout(c*);
13      end
14      µ_s ← n_s/(n_s + 1) · µ_s + 1/(n_s + 1) · g;
15      n_s ← n_s + 1;
16      return g

A Family of Rollout Algorithms

Rollout algorithms perform a rollout by iteratively selecting a successor node at each node $s$ along a top-down path starting from the root. In general, a rollout algorithm may select the successor node according to an arbitrary probability distribution over the whole children list $C(s)$.
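As one concrete instance of such a top-down rollout, the UCT policy of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the nested-list tree encoding (an integer is a leaf reward, a list is an internal node, MAX at the root with alternating levels) and the path-keyed dictionaries `n` and `mu` are illustrative choices, and treating unvisited children as having an infinite score (so that they are tried first) is an assumption made here to avoid dividing by zero.

```python
import math


def uct(tree, budget, lam=1.0):
    """Run `budget` UCT rollouts from the root; return the root's mean reward."""
    n, mu = {}, {}                       # visit counts and mean rewards, keyed by path

    def rollout(node, path, is_max):
        if isinstance(node, int):        # leaf: observe the reward
            g = node
        else:
            def score(i):
                c = path + (i,)
                if n.get(c, 0) == 0:     # assumption: unvisited children are tried first
                    return math.inf
                bonus = lam * math.sqrt(math.log(n[path]) / n[c])
                # MAX maximises mu + bonus; MIN minimises mu - bonus
                return mu[c] + bonus if is_max else -(mu[c] - bonus)

            best = max(range(len(node)), key=score)
            g = rollout(node[best], path + (best,), not is_max)
        k = n.get(path, 0)
        mu[path] = (k * mu.get(path, 0.0) + g) / (k + 1)   # running average
        n[path] = k + 1
        return g

    while n.get((), 0) < budget:
        rollout(tree, (), True)
    return mu[()]


tree = [[3, 5], [2, 9], [0, 7]]
estimate = uct(tree, budget=200)
print(estimate)
```

The returned value is an average of observed leaf rewards, so with a finite budget it is only an estimate of the minimax value and need not equal it.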
However, the idea of α-β pruning suggests that it may be unnecessary to consider every successor node for computing the value of $V(root)$. In this section we present a family of rollout algorithms that follows this observation by restricting the successor node selection to be over a subset of $C(s)$. As shown later, this family of rollout algorithms naturally encompasses the ideas of traditional minimax search algorithms.

Recall that at any time we know from the external storage a value range $[v_s^-, v_s^+]$ for each tree node $s \in S$. We start by identifying an important property of the knowledge in the storage. Specifically, we say that the storage is valid with respect to a given game tree $T$ if $v_s^- \le V(s) \le v_s^+$ for any node $s$ in $T$. Moreover, we define that a valid storage is coherent to the given game tree if its validity is robust to the uncertainty that the storage itself claims.

Definition 1. Given a game tree $T = (S, C, V)$, a storage $M = \{[v_s^-, v_s^+]\}_{s \in S}$ is coherent, with respect to $T$, if i) $M$ is valid to $T$; and ii) for any leaf node $s \in S$ and for any $r \in [v_s^-, v_s^+]$, letting $T' = (S, C, V')$ be the game tree obtained by setting $R(s) = r$ in the original tree $T$, $M$ is valid to $T'$.

A coherent storage enables a sufficient condition to ignore a tree node $s$ (as well as the subtree rooted at $s$) in the search for the value of $V(root)$. Specifically, let $P(s)$ be the set of tree nodes between and including the root node and node $s$ (so $P(root) = \{root\}$). For each tree node $s$, define $[\alpha_s, \beta_s]$ as the intersection interval of the value ranges of all nodes in $P(s)$. That is,

$$[\alpha_s, \beta_s] = \bigcap_{t \in P(s)} [v_t^-, v_t^+], \tag{2}$$

or equivalently, in practice we can compute $\alpha_s$ and $\beta_s$ by

$$\alpha_s = \max_{t \in P(s)} v_t^-, \qquad \beta_s = \min_{t \in P(s)} v_t^+. \tag{3}$$

The following lemma shows that if the uncertainty in the storage is coherent, then under certain conditions, not only is the value range $[v_{root}^-, v_{root}^+]$ stable to the disturbance from lower layers of the tree, but the exact minimax value $V(root)$ is also stable. The key insight is to see that both max and min are monotone value functions.

Lemma 1. Given any game tree $T = (S, C, V)$, let $M = \{[v_s^-, v_s^+]\}_{s \in S}$ be a storage coherent to $T$.
For any leaf node $s \in S$ and for any $r \in [v_s^-, v_s^+]$, let $T' = (S, C, V')$ be the game tree obtained by setting $R(s) = r$ in the original tree $T$; then $V'(root) = V(root)$ if $\alpha_s \ge \beta_s$.

Proof. First observe that the lemma trivially holds when the leaf node $s$ is at depth 1, i.e. when it is the root node. In that case we have $[\alpha_s, \beta_s] = [v_{root}^-, v_{root}^+]$, and thus $\alpha_s \ge \beta_s$ implies $v_{root}^- = v_{root}^+$ if the storage $M$ is valid. For leaf nodes with a depth larger than 1, by induction we only need to prove the lemma assuming that it holds for all ancestor nodes of $s$ (or equivalently, for each such ancestor node we assume the lemma holds for another tree in which the subtree of this ancestor node is pruned). Let $p$ be the parent node of $s$. For contradiction assume $s$ could affect the value of the root, i.e., for leaf node $s$ with $\alpha_s \ge \beta_s$, $V'(root) \ne V(root)$ when $R(s)$ changes to some $r \in [v_s^-, v_s^+]$. In that case we must have $\alpha_p < \beta_p$, because otherwise $p$ cannot affect the value of the root (which is assumed by induction), and neither can its successor node $s$.

Figure 1: Illustration of one downward step in the rollout process of the algorithm family (panels (a), (b), and (c)).

Now we have $\alpha_p < \beta_p$ and $\alpha_s \ge \beta_s$. Recall that $[\alpha_s, \beta_s] = [\alpha_p, \beta_p] \cap [v_s^-, v_s^+]$, which means we have either (i) $v_s^- = v_s^+$, in which case the lemma trivially holds; or (ii) $\beta_p \le v_s^-$; or (iii) $v_s^+ \le \alpha_p$. We only discuss case (ii) in the following; the argument for case (iii) is symmetrical.

Since $\beta_p \le v_s^-$, and by Eq. (3), $\beta_p = \min_{t \in P(p)} v_t^+$, there must exist a node $t \in P(p)$ such that $v_t^+ \le v_s^-$. That is, there is no overlap between the value ranges of $t$ and $s$ (except at the boundary). Now we suppose $R(s) = v_s^-$ in $T$, and prove that $V'(t) = V(t)$ if $R(s)$ increases from $v_s^-$ to any $r > v_s^-$. In that case it immediately follows that the value of $t$ must also remain constant if $R(s)$ further changes between such values of $r$ (because each of them yields the same value as $V(t)$ when $R(s) = v_s^-$). The key insight is to see that both the max function and the min function are monotone, and so is any recursive function defined by Eq. (1). Specifically, because $v_t^+ \le v_s^-$, and because $v_t^+$ is a valid upper bound, we have $V(t) \le v_s^-$.

Case 1: When $V(t) = v_s^-$. Because $V'(t)$ is monotone and $r > v_s^-$, we have $V'(t) \ge V(t) = v_s^-$. On the other hand, recall that we already have $v_t^+ \le v_s^-$, and because the coherent storage $M$ is valid to $T'$, we must have $V'(t) \le v_t^+ \le v_s^-$. For both inequalities about $V'(t)$ to be true, the only possibility is $V'(t) = v_s^-$, and thus $V'(t) = V(t)$.

Case 2: When $V(t) < v_s^-$. Because $V(s) = v_s^-$, we have $V(t) < V(s)$. One can check that in this case there must exist a node $u$ and its successor node $u_1$, both on the path between $t$ and $s$ (inclusive), such that $V(u) < v_s^- \le V(u_1)$. Notice that this can only happen when $u$ is a MIN node and there is another successor node of $u$, denoted by $u_2$, such that $V(u_2) = V(u) < V(u_1)$. In other words, the value of $u$ must be currently dominated by $u_2$, and not by $u_1$. Again, due to the monotonicity of the minimax function, when $R(s)$ increases from $v_s^-$ to $r > v_s^-$, the value of $u_1$ must become even larger, if it changes at all. On the other hand, all the other successor nodes of $u$, including $u_2$, are not ancestors of $s$, so their values will not change when $R(s)$ changes.
Therefore, we know that the value of $u$ must still be dominated by $u_2$ after $R(s)$ changes, i.e. $V'(u) = V'(u_2) = V(u_2) = V(u)$. Since $u$ is on the path between $t$ and $s$, we also have $V'(t) = V(t)$.

Finally, since we have proven that the leaf node $s$ cannot affect the value of its ancestor node $t$, it immediately follows that $s$ cannot affect the root node either, a contradiction to the assumption made at the beginning of the proof.

Since Lemma 1 guarantees that a successor node $c$ with $\alpha_c \ge \beta_c$ cannot affect $V(root)$ (nor can any node in the subtree of $c$), it is safe for rollout algorithms to select successor nodes, at any node $s$, only from the collection of successor nodes with $\alpha_c < \beta_c$, denoted by

$$A_s = \{\, c \in C(s) \mid \alpha_c < \beta_c \,\}. \tag{4}$$

Algorithm 4 presents a family of rollout algorithms that embodies this idea. Algorithms in this family keep performing rollouts until the value range $[v_{root}^-, v_{root}^+]$ is closed. In each round of rollout, the specific rollout trajectory depends on the SelectionPolicy() routine, which selects a successor node $c^*$ from the subset $A_s$ for the next move. The selection of $c^*$ can be either based on deterministic rules or sampled from a probability distribution over $A_s$. An algorithm instance of this family is fully specified once the SelectionPolicy routine is concretely defined.

To keep the storage coherent, Algorithm 4 updates the value range $[v_s^-, v_s^+]$ for each node $s$ along the trajectory of the rollout, in a bottom-up order. The value bounds are updated directly based on the minimax function defined by Eq. (1). It is not hard to see that the storage of Algorithm 4 is always in a coherent state after each round of rollout. Finally, Algorithm 4 computes $\alpha_s$ and $\beta_s$ in an incremental way, as illustrated by Figure 1.

In the following, we present some nice properties that are shared by all algorithms in the family of Algorithm 4. First, Lemma 2 shows that all data structures used in Algorithm 4 change monotonically over time.

Lemma 2. Given any game tree $T$, and under any selection policy (deterministic or randomized), for any $s \in S$, the set $A_s$ in Algorithm 4 is non-increasing over time, and so are the intervals $[v_s^-, v_s^+]$ and $[\alpha_s, \beta_s]$.

Proof.
It can be directly seen that the interval $[v_s^-, v_s^+]$ can never increase in Algorithm 4. By definition, i.e. Eq. (3), this implies the non-increasing monotonicity of $[\alpha_s, \beta_s]$, which in turn implies the monotonicity of $A_s$, due to Eq. (4).

Lemma 2 suggests that once a node is excluded from $A_s$, it cannot come back. In that sense, the rollout algorithms of Algorithm 4 are indeed using $A_s$ to prune the game tree. On the other hand, one might worry that some algorithm in this family could potentially get stuck at some point, when there is no candidate in the subset $A_s$. It turns out that this can never happen, regardless of the selection policy the algorithm is using.
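The rollout family described in this section, with the window of Eq. (3) computed incrementally and the candidate set of Eq. (4), can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation. It assumes a hypothetical nested-list tree encoding (an integer is a leaf reward, a non-empty list is an internal node, MAX at the root with alternating levels), path-keyed dictionary storage, and finite sentinels `NEG` and `POS` in place of $-\infty$ and $+\infty$. The three selection policies shown are example choices made here; each receives the candidates as `(child index, alpha, beta)` triples.

```python
import random

NEG, POS = -10**9, 10**9   # finite stand-ins for -inf / +inf (assumed reward range)


def rollout_search(tree, policy):
    """Evaluate `tree` with the rollout family; `policy` picks from the candidates."""
    store, evaluated = {}, []            # store: path -> (v-, v+)

    def bounds(path):
        return store.get(path, (NEG, POS))

    def rollout(node, path, is_max, alpha, beta):
        if isinstance(node, int):        # leaf: its value range closes to R(s)
            evaluated.append(path)
            store[path] = (node, node)
            return
        candidates = []                  # the subset A_s of Eq. (4)
        for i in range(len(node)):
            lo, hi = bounds(path + (i,))
            a, b = max(alpha, lo), min(beta, hi)   # incremental form of Eq. (3)
            if a < b:
                candidates.append((i, a, b))
        i, a, b = policy(candidates)
        rollout(node[i], path + (i,), not is_max, a, b)
        agg = max if is_max else min     # bottom-up update per Eq. (1)
        kids = [bounds(path + (j,)) for j in range(len(node))]
        store[path] = (agg(k[0] for k in kids), agg(k[1] for k in kids))

    while True:
        lo, hi = bounds(())
        if lo >= hi:                     # root value range is closed
            return lo, evaluated
        rollout(tree, (), True, lo, hi)


def leftmost(candidates):                # deterministic: first candidate
    return candidates[0]


def largest_beta(candidates):            # deterministic: leftmost with largest beta
    return max(candidates, key=lambda c: c[2])


tree = [[3, 5], [2, 9], [0, 7]]
results = {
    "leftmost": rollout_search(tree, leftmost),
    "largest_beta": rollout_search(tree, largest_beta),
    "random": rollout_search(tree, random.choice),
}
for name, (value, leaves) in results.items():
    print(name, value, leaves)
```

All three policies return the same root value on this tree and never evaluate a leaf twice; they differ only in which leaves are evaluated and in what order.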

Algorithm 4: A family of rollout algorithms.
 1  [v_s^-, v_s^+] ← [−∞, +∞] for every s ∈ S;
 2  while v_root^- < v_root^+ do
 3      rollout(root, v_root^-, v_root^+);
 4  end
 5  return v_root^-
 6  Function rollout(s, α_s, β_s)
 7      if C(s) ≠ ∅ then
 8          foreach c ∈ C(s) do
 9              [α_c, β_c] ← [max{α_s, v_c^-}, min{β_s, v_c^+}];
10          end
11          A_s = { c ∈ C(s) | α_c < β_c };
12          c* ← SelectionPolicy(A_s); rollout(c*, α_c*, β_c*);
13      end
14      v_s^- ← R(s) if s is Leaf; max_{c ∈ C(s)} v_c^- if s is Internal & MAX; min_{c ∈ C(s)} v_c^- if s is Internal & MIN;
15      v_s^+ ← R(s) if s is Leaf; max_{c ∈ C(s)} v_c^+ if s is Internal & MAX; min_{c ∈ C(s)} v_c^+ if s is Internal & MIN;
16      return

Lemma 3. Given any game tree $T$, and under any selection policy (deterministic or randomized), Algorithm 4 always runs the rollout() procedure on nodes $s$ such that $A_s = \emptyset$ if and only if $C(s) = \emptyset$.

Proof. It is sufficient to prove that if Algorithm 4 visits a non-leaf node $s$ (i.e. $C(s) \ne \emptyset$), then $A_s \ne \emptyset$. We prove this by showing that for any non-leaf node $s$ that Algorithm 4 visits, there always exists at least one successor node $c^* \in C(s)$ such that $[v_{c^*}^-, v_{c^*}^+] \supseteq [v_s^-, v_s^+]$, i.e. $c^*$ has a wider (or the same) value range than $s$. We only discuss the case when $s$ is a non-root MAX node. The argument is similar in other cases.

If $s$ is an internal MAX node, according to Algorithm 4 we have $v_s^+ = \max_{c \in C(s)} v_c^+$, so there exists a successor node $c^*$ such that $v_{c^*}^+ = v_s^+$. On the other hand, since $v_s^- = \max_{c \in C(s)} v_c^-$, we have $v_{c^*}^- \le v_s^-$. Thus, $[v_{c^*}^-, v_{c^*}^+] \supseteq [v_s^-, v_s^+]$. Given such a successor node $c^*$, let $t$ be the parent node of $s$. According to Algorithm 4 we have $\alpha_s = \max\{\alpha_t, v_s^-\} = \max\{\alpha_t, v_s^-, v_{c^*}^-\} = \max\{\alpha_s, v_{c^*}^-\} = \alpha_{c^*}$. Similarly we also have $\beta_{c^*} = \beta_s$. Because Algorithm 4 visits node $s$, we must have $\alpha_s < \beta_s$, thus $\alpha_{c^*} < \beta_{c^*}$, thus $c^*$ is in $A_s$.

Finally, we observe that the family of Algorithm 4 is consistent, in the sense that all algorithm instances in the family always return the same result, and this policy-independent result of Algorithm 4 is always correct, as Theorem 1 shows.

Theorem 1. Given any game tree $T$ and under any selection policy (deterministic or randomized), Algorithm 4 never visits a leaf node more than once, and always terminates with $v_{root}^- = v_{root}^+ = V(root)$.

Proof. First, we see that Algorithm 4 never visits a node $s$ with $v_s^- = v_s^+$, because in that case $\alpha_s \ge \beta_s$.
The first time Algorithm 4 visits a leaf node $s$, it sets $v_s^- = v_s^+ = R(s)$, so the algorithm never re-visits a leaf node, which means it has to terminate, at the latest, after visiting every leaf node. According to Line 2, we have $v_{root}^- = v_{root}^+$ at that time. Since the way Algorithm 4 updates the value bounds guarantees that they are always valid (that is, at any time we have $v_s^- \le V(s) \le v_s^+$ for any node $s$), we know that $v_{root}^-$ must equal $V(root)$ when $v_{root}^- = v_{root}^+$.

Note that Theorem 1 shows a stronger consistency property than that of some other rollout algorithms, such as UCT (Kocsis and Szepesvári 2006), which is only guaranteed to converge to $V(root)$ if given infinite time. In contrast, Algorithm 4 never re-visits a leaf even under a probabilistic rollout policy, thus always terminating in finite time.

Two Rollout Policies: α-β and MT-SSS*

Since the rollout family of Algorithm 4 uses an α-β window to prune tree nodes during the computation, one may wonder how the classic game-tree pruning algorithms compare to Algorithm 4. In this section we show that two simple greedy policies in the family of Algorithm 4 are equivalent, in a strict way, to an augmented version of the classic α-β and MT-SSS* algorithms, respectively.

To establish the strict equivalence, we introduce a variant of the classic alphabeta procedure, as Algorithm 5 shows, which differs from the alphabeta procedure of Algorithm 1 in only two places: (1) The classic alphabeta procedure in Algorithm 1 only returns a single value $g$, but the alphabeta2 procedure in Algorithm 5 transmits a pair of values $\{g^-, g^+\}$ between its recursive calls. (2) The classic alphabeta procedure in Algorithm 1 initializes the $(\alpha_s, \beta_s)$ window directly with the received arguments $(\alpha, \beta)$, while the alphabeta2 procedure in Algorithm 5 further trims $[\alpha_s, \beta_s]$ with the value range $[v_s^-, v_s^+]$.

Before comparing these two versions more carefully, we first establish the equivalence between the alphabeta2 procedure, as well as the MT-SSS* algorithm based on the alphabeta2 procedure, and two simple rollout policies of the algorithm family proposed in the last section.
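The two differences just listed (the returned pair of bounds and the trimmed window) can be made concrete with a Python sketch of the alphabeta2 variant. This is an illustrative reading of Algorithm 5, not the authors' code. It assumes a hypothetical nested-list tree encoding (an integer is a leaf reward, a list is an internal node, MAX at the root with alternating levels), path-keyed dictionary storage, and finite sentinels `NEG` and `POS` in place of $-\infty$ and $+\infty$.

```python
NEG, POS = -10**9, 10**9   # finite stand-ins for -inf / +inf (assumed reward range)


def alphabeta2(node, alpha, beta, is_max, path, store, evaluated):
    """Variant of alpha-beta that returns a pair of bounds (after Algorithm 5)."""
    lo, hi = store.get(path, (NEG, POS))            # retrieve [v-, v+]
    a, b = max(alpha, lo), min(beta, hi)            # trim the window with the value range
    if a >= b:
        return lo, hi                               # pruned: report the stored range
    if isinstance(node, int):                       # leaf: both bounds equal the reward
        evaluated.append(path)
        g_lo = g_hi = node
    elif is_max:
        g_lo = g_hi = NEG
        for i, child in enumerate(node):
            c_lo, c_hi = alphabeta2(child, a, b, False, path + (i,), store, evaluated)
            g_lo, g_hi = max(g_lo, c_lo), max(g_hi, c_hi)
            a = max(a, c_lo)                        # raise alpha with the child's lower bound
    else:
        g_lo = g_hi = POS
        for i, child in enumerate(node):
            c_lo, c_hi = alphabeta2(child, a, b, True, path + (i,), store, evaluated)
            g_lo, g_hi = min(g_lo, c_lo), min(g_hi, c_hi)
            b = min(b, c_hi)                        # lower beta with the child's upper bound
    store[path] = (g_lo, g_hi)
    return g_lo, g_hi


tree = [[3, 5], [2, 9], [0, 7]]
store, evaluated = {}, []
root_bounds = alphabeta2(tree, NEG, POS, True, (), store, evaluated)
print(root_bounds, evaluated)
```

There is no explicit `break` in this variant: once the window at a node closes, each remaining child returns immediately from the pruning test with its stored range, which has the same effect as a cut-off.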
Specifically, Theorem 2 shows that the alphabeta2 procedure is equivalent to a left-first policy of Algorithm 4, in the sense not only that they have the same footprints in leaf evaluation, but also that given any coherent storage (not necessarily an empty storage), they always leave identical content in their respective storage when terminating. This means they are still equivalent in a reentrant manner, even when used as subroutines by other algorithms.

Theorem 2. Given any game tree $T = (S, C, V)$ and any coherent storage $M = \{[v_s^-, v_s^+]\}_{s \in S}$, Algorithm 4 always evaluates the same set of leaf nodes in the same order as the augmented α-β algorithm (Algorithm 5) does, if Algorithm 4 is using the following selection policy:

$$c^* = \text{the leftmost } c \text{ in } A_s. \tag{5}$$

Moreover, let $M_{rollout}$ and $M_{\alpha\beta}$ be the storage states when Algorithm 4 and Algorithm 5 terminate, respectively. We have $M_{rollout} = M_{\alpha\beta}$ when the policy of Eq. (5) is used.

The key insight here is to see that under the left-first policy, Algorithm 4 can locally compute the new window $[\alpha_s, \beta_s]$ of the next round right at node $s$, without updating the ancestor nodes of $s$ through back-propagation, a trick that turns the rollout algorithm into a backtracking algorithm. The complete proof consists of a series of equivalent transformations of algorithms, and is given in a later section due to its length.

It is easy to see from Theorem 2 that the MT-SSS* algorithm, if using the augmented alphabeta2 procedure, can also be implemented by a sequence of rollouts with the left-first policy, although such a rollout algorithm will not belong to the family of Algorithm 4. Interestingly, however, it turns out that the augmented MT-SSS* algorithm can be encompassed by the rollout paradigm in a more unified way. In fact, Theorem 3 shows that the MT-SSS* algorithm is strictly equivalent to another policy of the same rollout family of Algorithm 4. Instead of selecting the leftmost node as the rollout policy of α-β does, the rollout policy of MT-SSS* selects the node with the largest $\beta_c$.

Theorem 3. Given any game tree $T = (S, C, V)$ and any coherent storage $M = \{[v_s^-, v_s^+]\}_{s \in S}$, Algorithm 4 always evaluates the same set of leaf nodes in the same order as the augmented MT-SSS* algorithm (Algorithm 2 + Algorithm 5) does, if Algorithm 4 is using the following selection policy:

$$c^* = \text{the leftmost } c \text{ in } \arg\max_{c \in A_s} \beta_c. \tag{6}$$

Moreover, let $M_{rollout}$ and $M_{SSS}$ be the storage states when Algorithm 4 and the augmented MT-SSS* algorithm terminate, respectively. We have $M_{rollout} = M_{SSS}$ when the policy of Eq. (6) is used.

The augmented α-β and MT-SSS*

Given the strict equivalence between the rollout algorithms of Algorithm 4 and classic tree-search algorithms based on the alphabeta2 procedure, we now examine the relationship between the two variants of α-β presented in Algorithms 1 and 5.
As mentioned before, the alphabeta2 procedure in Algorithm 5 captures every aspect of the original alphabeta procedure in Algorithm 1 except for two differences.

First, the alphabeta procedure in Algorithm 1 returns a single value $g$, while the alphabeta2 procedure in Algorithm 5 returns a value pair $\{g^-, g^+\}$. From the pseudo-code one can check that $g = g^- = v_s^-$ when it is a fail-high and $g = g^+ = v_s^+$ when it is a fail-low; otherwise $g = g^- = g^+ = V(s)$. This is consistent with the well-known protocol of the original α-β algorithm. The difference is that the alphabeta2 procedure tries to update both bounds even in fail-high and fail-low cases. As a result, we can expect that in some cases the alphabeta2 procedure will result in a storage with tighter value bounds than that of the classic alphabeta procedure.

Second, the classic alphabeta procedure in Algorithm 1 initializes the $(\alpha_s, \beta_s)$ window directly with the received arguments $(\alpha, \beta)$, while the alphabeta2 procedure in Algorithm 5 further trims $[\alpha_s, \beta_s]$ with the value range $[v_s^-, v_s^+]$. In other words, even given the same storage, the alphabeta2 procedure may have a tighter pruning window than its counterpart.

Algorithm 5: A variant of the alphabeta procedure, which returns a pair of value bounds.
 1  return alphabeta2(root, −∞, +∞)
 2  Function alphabeta2(s, α, β)
 3      retrieve [v_s^-, v_s^+];
 4      [α_s, β_s] ← [max{α, v_s^-}, min{β, v_s^+}];
 5      if α_s ≥ β_s then return [v_s^-, v_s^+];
 6      if s is a Leaf node then
 7          [g^-, g^+] ← [R(s), R(s)];
 8      else if s is a MAX node then
 9          [g^-, g^+] ← [−∞, −∞];
10          foreach c ∈ C(s) do
11              {g_c^-, g_c^+} ← alphabeta2(c, α_s, β_s);
12              [g^-, g^+] ← [max{g^-, g_c^-}, max{g^+, g_c^+}];
13              [α_s, β_s] ← [max{α_s, g_c^-}, β_s];
14          end
15      else if s is a MIN node then
16          [g^-, g^+] ← [+∞, +∞];
17          foreach c ∈ C(s) do
18              {g_c^-, g_c^+} ← alphabeta2(c, α_s, β_s);
19              [g^-, g^+] ← [min{g^-, g_c^-}, min{g^+, g_c^+}];
20              [α_s, β_s] ← [α_s, min{β_s, g_c^+}];
21          end
22      end
23      [v_s^-, v_s^+] ← [g^-, g^+]; store [v_s^-, v_s^+];
24      return {g^-, g^+}

While the second difference may look like a small trick at the implementation level, we believe that the single-bound vs.
double-bound disparity is an inherent difference between the two versions. Since the storage-enhanced α-β algorithm requires maintaining a pair of bounds anyway (even for the single-bound version), it makes sense to update both of them effectively at run time.

Interestingly, the single-bound version of the alphabeta procedure performs exactly as well as its double-bound version if they are working on an empty storage with only one pass. This follows directly from the well-known fact that the classic alphabeta procedure is per-instance optimal among all directional algorithms (Pearl 1984). However, when used as subroutines, they behave differently. It turns out that the MT-SSS* algorithm using the alphabeta2 procedure can outprune the one based on the single-bound version, in the same way as the SSS* algorithm outprunes the classic α-β algorithm.

Theorem 4. Given any game tree $T$, let $L$ be the sequence of leaf nodes evaluated by the original MT-SSS* algorithm that calls Algorithm 1, and let $L^+$ be the sequence of leaf nodes evaluated by the augmented MT-SSS* algorithm that calls Algorithm 5; then $L^+$ is a subsequence of $L$.

Proof of Theorem 2

In this section we prove Theorem 2, which states that the left-first policy in the family of Algorithm 4 is equivalent to the augmented α-β algorithm shown by Algorithm 5. The key insight here is that under the left-first policy, Algorithm 4 can locally compute the new window $[\alpha_s, \beta_s]$ of the next round right at node $s$, without updating the ancestor nodes of $s$ through back-propagation, a trick that turns the rollout algorithm into a backtracking algorithm. The complete proof consists of a series of equivalent transformations of algorithms, from Algorithm 6 through Algorithm 10.

Proof. To prepare the proof, we need to define a wrapper procedure that generalizes Algorithm 4 a little bit, as Algorithm 6 shows. It is easy to see that Algorithm 6 is identical to Algorithm 4 when $s = root$, $\alpha_s = -\infty$, $\beta_s = +\infty$. On the other hand, Algorithm 6 may terminate without closing the range $[v_{root}^-, v_{root}^+]$ if $V(root)$ is outside the open interval $(\alpha_s, \beta_s)$.

Algorithm 6: A wrapper procedure of Algorithm 4.
1 Function wrapper1(s, α_s, β_s)
2   while α_s < β_s do
3     rollout(s, α_s, β_s) ;
4     [α_s, β_s] ← [max{α_s, v_s^-}, min{β_s, v_s^+}] ;
5   end
6   return

In this proof, we say two algorithms $A_1$ and $A_2$ are equivalent if for any input $(T, M, \alpha, \beta)$ their behaviors, including both the leaf-node footprint and the terminating state of the storage, are identical. That is, let $L_{A_1}$ and $L_{A_2}$ be the sequences of leaves evaluated by $A_1$ and $A_2$ (respectively), and let $M_{A_1}$ and $M_{A_2}$ be the states of the storage when $A_1$ and $A_2$ terminate (respectively); we say $A_1$ and $A_2$ are equivalent if $(L_{A_1}, M_{A_1}) = (L_{A_2}, M_{A_2})$ for any input $(T, M, \alpha, \beta)$. As another shortcut, we say a tree node $s$ is active if at the given time we have $\alpha_s < \beta_s$.

To prove the theorem, it is sufficient to prove that the wrapper1 procedure in Algorithm 6 is equivalent to the alphabeta2 procedure in Algorithm 5 if $[\alpha, \beta] \subseteq [v_s^-, v_s^+]$ in the alphabeta2 procedure. Observe that Algorithm 5 works in the backtracking manner, while Algorithm 6 works in the rollout manner.
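The notion of equivalence used here (same leaf sequence, same final storage) can be checked empirically on a small tree. The sketch below is an illustration under stated assumptions rather than the paper's implementation: the `Node` class, the helper names, and the demo tree are hypothetical, `alphabeta2` follows the pseudo-code of Algorithm 5, and `rollout` assumes Algorithm 4 trims each child window as $[\max\{\alpha_s, v_c^-\}, \min\{\beta_s, v_c^+\}]$ and backs up bounds by max/min over all children.

```python
import math

INF = math.inf


class Node:
    """Game-tree node with stored value bounds [lo, hi] = [v^-, v^+]."""
    def __init__(self, name, kind, children=(), reward=None):
        self.name, self.kind = name, kind          # kind: "MAX", "MIN" or "LEAF"
        self.children, self.reward = list(children), reward
        self.lo, self.hi = -INF, INF               # empty storage


def alphabeta2(s, alpha, beta, visited):
    """Backtracking search following Algorithm 5; returns the pair (g^-, g^+)."""
    a, b = max(alpha, s.lo), min(beta, s.hi)       # trim window with stored bounds
    if a >= b:
        return s.lo, s.hi
    if s.kind == "LEAF":
        visited.append(s.name)
        g_lo = g_hi = s.reward
    else:
        is_max = s.kind == "MAX"
        agg = max if is_max else min
        g_lo = g_hi = -INF if is_max else INF
        for c in s.children:
            c_lo, c_hi = alphabeta2(c, a, b, visited)
            g_lo, g_hi = agg(g_lo, c_lo), agg(g_hi, c_hi)
            if is_max:
                a = max(a, g_lo)
            else:
                b = min(b, g_hi)
    s.lo, s.hi = g_lo, g_hi                        # store the updated bounds
    return g_lo, g_hi


def rollout(s, a, b, visited):
    """One rollout from s using the left-first policy of Eq. (5)."""
    for c in s.children:
        ac, bc = max(a, c.lo), min(b, c.hi)
        if ac < bc:                                # leftmost active child
            rollout(c, ac, bc, visited)
            break
    if s.kind == "LEAF":
        visited.append(s.name)
        s.lo = s.hi = s.reward
    else:                                          # back up bounds from all children
        agg = max if s.kind == "MAX" else min
        s.lo = agg(c.lo for c in s.children)
        s.hi = agg(c.hi for c in s.children)


def evaluate_by_rollouts(root):
    """Repeat rollouts from the root until its value range is closed."""
    visited = []
    while root.lo < root.hi:
        rollout(root, root.lo, root.hi, visited)
    return root.lo, visited


def storage(s):
    """Flatten the stored bounds of the subtree rooted at s (pre-order)."""
    out = [(s.name, s.lo, s.hi)]
    for c in s.children:
        out.extend(storage(c))
    return out


def demo_tree():
    """Hypothetical two-level tree: MAX root over two MIN nodes."""
    def leaf(name, r):
        return Node(name, "LEAF", reward=r)
    a = Node("A", "MIN", [leaf("a1", 3), leaf("a2", 5)])
    b = Node("B", "MIN", [leaf("b1", 2), leaf("b2", 9)])
    return Node("root", "MAX", [a, b])


t1, t2 = demo_tree(), demo_tree()
seq_ab = []
alphabeta2(t1, -INF, INF, seq_ab)
value, seq_ro = evaluate_by_rollouts(t2)
print(seq_ab, seq_ro, value, storage(t1) == storage(t2))
```

On this tree both procedures skip leaf `b2`, and the pruned nodes keep identical partially open bounds in both storages, which is the sense in which the two manners of search coincide.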
Consider the moment when the execution of the rollout procedure at a node $s$ is about to end (for example, imagine we are at Line 16 of Algorithm 4). According to the rollout paradigm, the algorithm will now update the value bounds of all ancestor nodes of $s$, then restart another rollout round from the root. Under the specific policy of Eq. (5), the rollout process will always select, at each layer of the tree, the leftmost active node. Notice that $s$ is the chosen node at its own layer for the current round of rollout, which means all nodes at the left side of $s$ (in the same layer) are already inactive. Since Lemma 2 shows that the $[\alpha, \beta]$ window is non-increasing for any node, we know that the current node $s$ will still be chosen in the next round of rollout if and only if $\alpha_s < \beta_s$ in the next round.

The key insight of the proof is that for the rollout algorithm of Algorithm 4, we can locally compute at node $s$ the new window $[\alpha_s, \beta_s]$ of the next round, without updating any ancestor node of $s$ through back-propagation. This enables us to make lossless decisions at node $s$: if we foresee that $\alpha_s < \beta_s$ in the next round, we can immediately start a new rollout from node $s$ (rather than from the root node), as the rollout from the root will go through $s$ anyway; otherwise, if $\alpha_s \ge \beta_s$ in the next round, we just leave node $s$ and continue the back-up phase, in which case we know that the rollout will never come back to $s$ later. Note that the nodes below $s$ can also play this trick, pretending that each is running a single round of rollout in the view of $s$ and other ancestors. Extending this to the whole tree, we essentially rewrite the original rollout algorithm into a recursive algorithm. Further combined with some other optimizations, we finally arrive at the alphabeta2 procedure in Algorithm 5.

The complete proof consists of a series of equivalent transformations between algorithms. We start with transforming Algorithm 6 to the Algorithm 7 shown below, which simply replaces the rollout procedure in Algorithm 6 with the left-first policy of Eq. (5). Algorithm 7 runs in a loop until $s$ becomes inactive.
In each round, the algorithm iterates over $C(s)$ to find the leftmost active node $c$, issues a rollout on $c$, then updates $[v_s^-, v_s^+]$ and $[\alpha_s, \beta_s]$ immediately. The sentence $[v_s^-, v_s^+] \leftarrow [V^-(s), V^+(s)]$ is a shortcut for Lines 14 and 15 of Algorithm 4.

Now consider Algorithm 8, as shown below. The algorithm works by iterating over $C(s)$, from left to right. For each active node $c \in C(s)$, the algorithm keeps running rollouts on $c$ until $c$ becomes inactive. It terminates when all successor nodes are inactive. The intervals $[v_s^-, v_s^+]$ and $[\alpha_s, \beta_s]$ are updated only when the algorithm switches to another node. Finally, the value range $[v_s^-, v_s^+]$ is updated again before terminating, in case $s$ is a leaf node.

Algorithm 7: An implementation of Algorithm 6 when using the left-first policy.
1 Function wrapper1(s, α_s, β_s)
2   while α_s < β_s do
3     foreach c ∈ C(s) do
        [α_c, β_c] ← [max{α_s, v_c^-}, min{β_s, v_c^+}] ;
        if α_c < β_c then
          rollout(c, α_c, β_c) ;
          break ;
        end
      end
      [v_s^-, v_s^+] ← [V^-(s), V^+(s)] ;
4     [α_s, β_s] ← [max{α_s, v_s^-}, min{β_s, v_s^+}] ;
5   end
6   return {v_s^-, v_s^+}

Algorithm 8: Another implementation of Algorithm 7.
 1 Function wrapper2(s, α_s, β_s)
 2   foreach c ∈ C(s) do
 3     [α_c, β_c] ← [max{α_s, v_c^-}, min{β_s, v_c^+}] ;
 4     if α_c < β_c then
 5       while α_c < β_c do
           rollout(c, α_c, β_c) ;
           [α_c, β_c] ← [max{α_c, v_c^-}, min{β_c, v_c^+}] ;
         end
 6       [v_s^-, v_s^+] ← [V^-(s), V^+(s)] ;
 7       [α_s, β_s] ← [max{α_s, v_s^-}, min{β_s, v_s^+}] ;
 8     end
 9   end
10   [v_s^-, v_s^+] ← [V^-(s), V^+(s)] ;
11   return {v_s^-, v_s^+}

Proposition 1. Algorithm 8 is equivalent to Algorithm 7.

Proof. In fact, it is sufficient to prove that in each round and for the same node $c$, the window $[\alpha_c, \beta_c]$ computed at Line 5 of Algorithm 8 is always identical to the window $[\alpha_c, \beta_c]$ computed at Line 3 of Algorithm 7. Note that the same $[\alpha_c, \beta_c]$ will lead to the same node chosen for the rollout, and also to the same terminating condition between the two algorithms: by definition $[\alpha_c, \beta_c] \subseteq [\alpha_s, \beta_s]$, so the node $s$ must be active if some successor node $c$ is active; the reverse is also true, due to Lemma 3.

Since Algorithm 8 will update $[v_s^-, v_s^+]$ and $[\alpha_s, \beta_s]$ when it changes the successor node for rollout, we only need to prove that the $[\alpha_c, \beta_c]$ in Algorithm 8 is consistent with the one in Algorithm 7 from the second time $c$ is chosen for rollout. For any such round $t$, by mathematical induction we can assume that the window $[\alpha_c, \beta_c]$ is consistent in all previous rounds, in particular for the last round $t-1$. Notice that in Algorithm 7 we have

$[\alpha_c^{(t)}, \beta_c^{(t)}] = [\alpha_s^{(t-1)}, \beta_s^{(t-1)}] \cap [v_s^{-(t)}, v_s^{+(t)}] \cap [v_c^{-(t)}, v_c^{+(t)}]$,    (7)

and in Algorithm 8 we have

$[\alpha_c^{(t)}, \beta_c^{(t)}] = [\alpha_s^{(t-1)}, \beta_s^{(t-1)}] \cap [v_c^{-(t-1)}, v_c^{+(t-1)}] \cap [v_c^{-(t)}, v_c^{+(t)}]$.    (8)

By Lemma 2, $[v_c^{-(t)}, v_c^{+(t)}] \subseteq [v_c^{-(t-1)}, v_c^{+(t-1)}]$, so the $[v_c^{-(t-1)}, v_c^{+(t-1)}]$ in Eq. (8) is unnecessary, and so to prove the equivalence of Algorithms 7 and 8, we only need to prove that in Algorithm 7 we always have

$[\alpha_s^{(t-1)}, \beta_s^{(t-1)}] \cap [v_s^{-(t)}, v_s^{+(t)}] \cap [v_c^{-(t)}, v_c^{+(t)}] = [\alpha_s^{(t-1)}, \beta_s^{(t-1)}] \cap [v_c^{-(t)}, v_c^{+(t)}]$.    (9)

That is, we only need to prove that $[v_s^{-(t)}, v_s^{+(t)}]$ is useless for updating $[\alpha_c^{(t)}, \beta_c^{(t)}]$ in Algorithm 7. This can be shown by observing that for both the max and min functions, when one of the arguments changes, the value of the function either remains unchanged or is equal to the value of the new argument. So, when $[v_c^{-(t-1)}, v_c^{+(t-1)}]$ changes to $[v_c^{-(t)}, v_c^{+(t)}]$, each bound of $[v_s^{-(t)}, v_s^{+(t)}]$ either remains unchanged, or is equal to the corresponding bound of $[v_c^{-(t)}, v_c^{+(t)}]$. In the former case $[v_s^{-(t)}, v_s^{+(t)}]$ is masked by $[\alpha_s^{(t-1)}, \beta_s^{(t-1)}]$ in Eq. (9), while in the latter case $[v_s^{-(t)}, v_s^{+(t)}]$ is masked by $[v_c^{-(t)}, v_c^{+(t)}]$ in Eq. (9).

Now, observe that Line 5 of Algorithm 8 (the shadowed part) is actually identical to the wrapper1 procedure in Algorithm 6, which we have just proven to be equivalent to the wrapper2 procedure. As a result, we can replace the logic block with a subroutine call of wrapper2(c, α_c, β_c), as shown by Algorithm 9. Note that this replacement has turned Algorithm 9 into a recursive procedure. In the following we further transform the wrapper2 procedure in Algorithm 9 to the wrapper3 procedure in Algorithm 10.

Algorithm 9: Recursion-based version of Algorithm 8.
 1 Function wrapper2(s, α_s, β_s)
 2   foreach c ∈ C(s) do
 3     [α_c, β_c] ← [max{α_s, v_c^-}, min{β_s, v_c^+}] ;
 4     if α_c < β_c then
 5       wrapper2(c, α_c, β_c) ;
 6       [v_s^-, v_s^+] ← [V^-(s), V^+(s)] ;
         [α_s, β_s] ← [max{α_s, v_s^-}, min{β_s, v_s^+}] ;
 7     end
 8   end
 9   [v_s^-, v_s^+] ← [V^-(s), V^+(s)] ;
10   return {v_s^-, v_s^+}

Algorithm 10: Another implementation of Algorithm 9.
 1 Function wrapper3(s, α_s, β_s)
 2   foreach c ∈ C(s) do
 3     [α_c, β_c] ← [max{α_s, v_c^-}, min{β_s, v_c^+}] ;
 4     if α_c < β_c then
 5       wrapper3(c, α_c, β_c) ;
 6       if s is a MAX node then
           [α_s, β_s] ← [max{α_s, v_c^-}, β_s] ;
         else if s is a MIN node then
           [α_s, β_s] ← [α_s, min{β_s, v_c^+}] ;
         end
 7     end
 8   end
 9   v_s^- ← R(s) if s is Leaf ; max_{c ∈ C(s)} v_c^- if s is Internal & MAX ; min_{c ∈ C(s)} v_c^- if s is Internal & MIN ;
10   v_s^+ ← R(s) if s is Leaf ; max_{c ∈ C(s)} v_c^+ if s is Internal & MAX ; min_{c ∈ C(s)} v_c^+ if s is Internal & MIN ;
11   return {v_s^-, v_s^+}

Proposition 2. Algorithm 9 is equivalent to Algorithm 10.

Proof. Algorithm 10 differs from Algorithm 9 only in the method for updating $[\alpha_s^{(t)}, \beta_s^{(t)}]$ when the algorithm switches the node for rollout. Without loss of generality, we only discuss the case in which $s$ is a MAX node, and there is an active node $c'$ that is behind $c$ in $C(s)$ and that is the node for rollout in the next round. We see that Algorithm 10 does not use $[v_s^{-(t)}, v_s^{+(t)}]$ to update $[\alpha_s^{(t)}, \beta_s^{(t)}]$ at all. Instead, it updates with $\alpha_s = \max\{\alpha_s, v_c^-\}$. This is equivalent to $\alpha_s = \max\{\alpha_s, v_s^-\}$, again because $v_s^-$ can either be $v_c^-$ or be itself (unchanged), in which latter case it is masked by $\alpha_s$. To see why $\beta_s$ does not need to be updated at all, observe that our goal is to correctly compute $[\alpha_{c'}, \beta_{c'}]$ in the next round, and at that time $v_s^+$ is masked either by $\beta_s$ or by $v_{c'}^+$, depending on whether $v_s^+ > v_{c'}^+$ or $v_s^+ \le v_{c'}^+$.

So far we have made a series of equivalent transformations from Algorithm 6 to Algorithm 10. As the final step, just by following the code it is straightforward to verify the equivalence between the wrapper3 procedure in Algorithm 10 and the alphabeta2 procedure in Algorithm 5 if $[\alpha, \beta] \subseteq [v_s^-, v_s^+]$.

Interestingly, by comparing Algorithm 10 and Algorithm 5 we find that the window $(\alpha_s, \beta_s)$ used in the rollout procedure is conceptually different from the classic $(\alpha, \beta)$ window used in the alphabeta procedure. Specifically, we have

$[\alpha_s, \beta_s]_{rollout} = [\alpha, \beta]_{minimax} \cap [v_s^-, v_s^+]$.    (10)

Proof of Theorem 3

In this section we prove Theorem 3, which states that the max-β policy in the family of Algorithm 4 is equivalent to the MT-SSS* algorithm based on the alphabeta2 procedure defined in Algorithm 5.

Proof. Since we have already proved the strict equivalence between the wrapper1 procedure in Algorithm 6 and the alphabeta2 procedure in Algorithm 5, it is easy to first rewrite MT-SSS* as a rollout algorithm, as Algorithm 11 shows, which repeatedly calls wrapper1(root, $v_{root}^+ - 1$, $v_{root}^+$) until the range $[v_{root}^-, v_{root}^+]$ is closed. Note that the null window guarantees the condition that $[\alpha, \beta] \subseteq [v_{root}^-, v_{root}^+]$.
Algorithm 11: A rollout version of MT-SSS*, which is based on the wrapper1 procedure in Algorithm 6.
1 while v_root^- < v_root^+ do
2   wrapper1(root, v_root^+ - 1, v_root^+) ;
3 end

Now we only need to prove that, starting from the empty storage where $[v_s^-, v_s^+] = [-\infty, +\infty]$ for all $s \in S$, and for every round of rollout, Algorithm 11 chooses exactly the same rollout trajectory as Algorithm 4 if they are using the policies of Eq. (5) and Eq. (6), respectively. Note that both of them call the same rollout procedure to update the storage, which must act the same given the same rollout trajectory.

On the side of Algorithm 11, because it is using a minimal window $[v_{root}^+ - 1, v_{root}^+]$, no active node selected in the rollout can further reduce the upper-influence-bound $\beta$; thus the algorithm is always selecting the leftmost node with $v_c^+ \ge v_{root}^+$. Moreover, recall that Lemma 3 guarantees that such an active node can always be found along the rollout (as long as the value range of the root is open).

On the other side, consider how Algorithm 4 selects successor nodes in the rollout. The root node is a MAX node, so $v_{root}^+ = \max_{c \in C(root)} v_c^+$, which means there exists an active successor node of the root with $v_c^+ = v_{root}^+$. The leftmost of them will be selected by the algorithm, with the upper-influence-bound $\beta_c = v_{root}^+$. At the node $c$ thus selected, because $c$ is a MIN node, every successor node $c'$ of $c$ will have $v_{c'}^+ \ge v_c^+$. Thus, the algorithm will just select the leftmost one in $C(c)$, still with the upper-influence-bound $\beta_{c'} = v_{root}^+$. In that way, it is easy to see that Algorithm 4, under the policy of Eq. (6), will also always select the leftmost node with $v_c^+ \ge v_{root}^+$, thus having exactly the same rollout trajectory.

Related Work

The game tree model encompasses computational problems that are inherently hard. It is known that the problems of evaluating the game trees of Chess and GO are both EXPTIME-complete (Fraenkel and Lichtenstein 1981) (Robson 1983), implying a provable intractability of solving them in polynomial time, due to the Time Hierarchy Theorem (Demaine 2001).
As an important special case, if each MIN internal node has only a single child node, the game tree degenerates to a backtracking tree of combinatorial optimization. In this case the problem could still be polynomial-time intractable, due to the widely believed P ≠ NP hypothesis (Fortnow 2013).

Historically, practical game tree evaluation algorithms were mostly in the depth-first search style at the early stage, partially because at that time computers had very limited physical memory. For example, the original implementation of SSS* in (Stockman 1979) needs to explicitly maintain a list of open nodes, which is a huge burden on computation resources, enough to make it effectively impractical at the time when it was proposed (Pearl 1984). However, the memory size of computers has grown exponentially for decades since then, and modern computers are now often equipped with enough memory to match the storage demand of online planning (typically in the order of millions or billions). Indeed, it has become standard practice to enhance even depth-first algorithms with external storage in their modern implementations (Plaat et al. 1996). In particular, the transposition table is one of the most popular data structures in such storage. The original purpose of the transposition table is to transpose searching to avoid repeatedly visiting a state when the state corresponds to multiple nodes in the game tree (in which case the topology of the state space is not strictly a tree, but actually a DAG). Meanwhile, the transposition table can also be used in the iterative deepening paradigm to store search results for improving move ordering in later iterations (Reinefeld and Marsland 1994).

On the other hand, rollout algorithms were originally used as a straightforward sampling method to estimate the expected outcome of stochastic games (Tesauro and Galperin 1996), where the policy at player nodes consists of heuristic rules and the policy at chance nodes just follows the probability distribution prescribed by the rules of the game. But it turns out that rollout algorithms can also be useful even for deterministic games. In 1990, Abramson reported that for several popular deterministic games there is an interesting correlation between the minimax value and the average outcome of random rollouts that simply choose successor states according to a uniform distribution (Abramson 1990). This observation implies that repeated simulation based on the random rollout policy may approximately evaluate the minimax value of deterministic game trees. Later on, researchers further developed adaptive rollout policies that may dynamically change their rollout preference given outcomes from previous rollout simulations (Bouzy 2006), coined as Monte Carlo Tree Search in (Coulom 2006). Similar rollout ideas were also proposed in other domains, such as metaheuristics for combinatorial optimization (Bertsekas, Tsitsiklis, and Wu 1997) (Glover and Taillard 1993).

A popular way to design a rollout policy is by recasting successor node selection as a Multi-Armed Bandit (MAB) problem: every time the rollout simulation visits a MAX node $s$, we select one successor node from the children list $C(s)$ and obtain a reward for this choice (from the rollout outcome of this round), and the goal is to maximize the average reward of repeated rollouts in the long term. A symmetric formulation applies to MIN nodes. By induction it is not hard to see that the average reward of each node $s$ under such a bandit policy will eventually converge to $V(s)$ as long as the average reward of every successor node $c \in C(s)$ converges to $V(c)$, which is true for the leaf nodes. In particular, the rollout policy of UCT was directly borrowed from the UCB algorithm, a renowned algorithm for the stochastic MAB problem.
In the specific model of stochastic MAB, the UCB index of any bandit arm, which is identical to the UCT score, is guaranteed to be an upper bound of the expected reward of the arm, with the confidence level of $1 - 1/n^{\lambda^2}$. Note that the confidence levels for different bandit arms are the same. Then by simply choosing the bandit arm with the best confidence-upper-bound, the UCB algorithm manages to achieve sublinear regret for any problem instance of stochastic MAB, and was proven to achieve the asymptotically optimal performance of the problem (Auer, Cesa-Bianchi, and Fischer 2002). [3]

[3] However, we note that the underlying theoretical assumptions of game tree evaluation are different from those of stochastic MAB, which means that the UCT algorithm needs a different mathematical justification of its performance from that of UCB.

In general, such a best-upper-bound-first strategy has been widely used in combinatorial optimization and single-player games, such as in the classic A* algorithm. In many domains the upper bound of each node is typically computed by solving a relaxed (and easier) problem, which is thus inherently domain-specific. In the game tree model, however, these upper bounds can be bootstrapped directly from the tree search itself, without any domain-specific heuristic.

In terms of the techniques used in this paper, the idea of maintaining pairs of value bounds in an in-memory tree (and updating the bounds in a bottom-up manner) was proposed by Hans Berliner in the B* algorithm (Berliner 1979). More recently, Walsh, Goschin, and Littman (2010) proposed the FSSS algorithm, a rollout algorithm that updates the value bounds $[v_s^-, v_s^+]$ in the same way as Algorithm 4, in order to have a theoretical guarantee of its performance when used in reinforcement-learning applications. An algorithm with similar ideas was also proposed in the context of game-tree evaluation (Cazenave and Saffidine 2011). Weinstein, Littman, and Goschin (2012) further adapted the FSSS algorithm into the game tree model and proposed a rollout algorithm that outprunes the α-β algorithm.
Their algorithm also uses an $(\alpha, \beta)$ window to filter successor nodes, but the window is manipulated in a different way from the algorithm family proposed in this paper. Furthermore, there is no domination between MT-SSS* and their algorithm, while in this paper we argue that α-β and MT-SSS* themselves can be unified under the rollout framework of Algorithm 4. Chen et al. (2014) have recently presented a rollout algorithm that captures the idea of the MT-SSS* algorithm. Their algorithm uses the null window $(v_{root}^+ - 1, v_{root}^+)$ to filter nodes, and thus does not manipulate the window at all during rollouts. Besides, they did not formally characterize the relationship between MT-SSS* and their null-window rollout algorithm. Interestingly, the analysis of this paper suggests that their algorithm is not exactly equivalent to the original MT-SSS* algorithm.

Conclusion

Results from this paper suggest that the rollout paradigm could serve as a unified framework to study game-tree evaluation algorithms. In particular, Theorems 2 and 3 show that some classic minimax search algorithms can be implemented by rollouts. This observation implies that we could collect information in a single rollout for both minimax pruning and MCTS sampling. In light of this, we may design new hybrid algorithms that naturally combine MCTS algorithms with traditional game-tree search algorithms.

For example, by Theorem 3 we see that MT-SSS* corresponds to a rollout algorithm that prefers successor nodes with the largest absolute upper-influence-bound $\beta_c$, which is powerful in completely pruning nodes without compromising correctness. But the value of $\beta_c$ propagates upward slowly in large game trees. As a result, most nodes in the upper layers of the tree may be left with the non-informative bound $+\infty$ at the early running stage of the algorithm, in which case the MT-SSS* policy is essentially blind.
On the other hand, the UCT score can be seen as an upper bound with some confidence, which is able to provide informative guidance with far fewer rollouts, but could respond slowly to the discriminative knowledge collected in the search, probably due to its amortizing nature (Ramanujan, Sabharwal, and Selman 2012). In light of this complementary role between $\beta_c$ and the UCT score, Algorithm 12 demonstrates a natural way to combine the ideas of UCT and MT-SSS*. The algorithm
