Taming Decentralized POMDPs: Towards Efficient Policy Computation for Multiagent Settings


R. Nair and M. Tambe
Computer Science Dept., University of Southern California, Los Angeles CA

M. Yokoo
Coop. Computing Research Grp., NTT Comm. Sc. Labs, Kyoto, Japan
yokoo@cslab.kecl.ntt.co.jp

D. Pynadath, S. Marsella
Information Sciences Institute, University of Southern California, Marina del Rey CA

Abstract

The problem of deriving joint policies for a group of agents that maximize some joint reward function can be modeled as a decentralized partially observable Markov decision process (POMDP). Yet, despite the growing importance and applications of decentralized POMDP models in the multiagents arena, few algorithms have been developed for efficiently deriving joint policies for these models. This paper presents a new class of locally optimal algorithms called Joint Equilibrium-based Search for Policies (JESP). We first describe an exhaustive version of JESP and subsequently a novel dynamic programming approach to JESP. Our complexity analysis reveals the potential for exponential speedups due to the dynamic programming approach. These theoretical results are verified via empirical comparisons of the two JESP versions with each other and with a globally optimal brute-force search algorithm. Finally, we prove piecewise linear and convexity (PWLC) properties, thus taking steps towards developing algorithms for continuous belief states.

1 Introduction

As multiagent systems move out of the research lab into critical applications such as multi-satellite control, researchers need to provide high-performing, robust multiagent designs that are as nearly optimal as feasible. To this end, researchers have increasingly resorted to decision-theoretic models as a framework in which to formulate and evaluate multiagent designs. Given a group of agents, the problem of deriving separate policies for them that maximize some joint reward can be modeled as a decentralized POMDP (Partially Observable Markov Decision Process).
In particular, the DEC-POMDP (Decentralized POMDP) [Bernstein et al., 2000] and MTDP (Markov Team Decision Problem [Pynadath and Tambe, 2002]) are generalizations of a POMDP to the case where there are multiple, distributed agents basing their actions on their separate observations. These frameworks allow a variety of multiagent analysis. Of particular interest here, they allow us to formulate what constitutes an optimal policy for a multiagent system and in principle derive that policy.

However, with a few exceptions, effective algorithms for deriving policies for decentralized POMDPs have not been developed. Significant progress has been achieved in efficient single-agent POMDP policy generation algorithms [Monahan, 1982; Cassandra et al., 1997; Kaelbling et al., 1998]. However, it is unlikely such research can be directly carried over to the decentralized case. Finding optimal policies for decentralized POMDPs is NEXP-complete [Bernstein et al., 2000]. In contrast, solving a POMDP is PSPACE-complete [Papadimitriou and Tsitsiklis, 1987]. As Bernstein et al. [2000] note, this suggests a fundamental difference in the nature of the problems. The decentralized problem cannot be treated as one of separate POMDPs in which individual policies can be generated for individual agents because of possible cross-agent interactions in the reward, transition or observation functions. (For any one action of one agent, there may be many different rewards possible, based on the actions that other agents may take.)

In some domains, one possibility is to simplify the nature of the policies considered for each of the agents. For example, Chadès et al. [2002] restrict the agent policies to be memoryless (reactive) policies. Further, as an approximation, they define the reward function and the transition function over observations instead of over states, thereby simplifying the problem to solving a multiagent MDP [Boutilier, 1996]. Xuan et al. [2001] describe how to derive decentralized MDP (not POMDP) policies from a centralized MDP policy.
Their algorithm, which starts with an assumption of full communication that is gradually relaxed, relies on instantaneous and noise-free communication. Such simplifications reduce the applicability of the approach and essentially sidestep the question of solving decentralized POMDPs. Peshkin et al. [2000] take a different approach by using gradient descent search to find locally optimal finite controllers with bounded memory. Their algorithm finds locally optimal policies from a limited subset of policies, with an infinite planning horizon, while our algorithm finds locally optimal policies from an unrestricted set of possible policies, with a finite planning horizon.

Thus, there remains a critical need for new efficient algorithms for generating optimal policies in distributed POMDPs. In this paper, we present a new class of algorithms for solving decentralized POMDPs, which we refer to as Joint Equilibrium-based Search for Policies (JESP). JESP iterates through the agents, finding an optimal policy for each agent assuming the policies of the other agents are fixed. The iteration continues until no improvement to the joint reward is achieved. Thus JESP achieves a local optimum similar to a Nash equilibrium. We discuss Exhaustive-JESP, which uses exhaustive search to find the best policy for each agent. Since this exhaustive search for even a single agent's policy can be very expensive, we also present DP-JESP, which improves on Exhaustive-JESP by using dynamic programming to incrementally derive the policy. We conclude with several empirical evaluations that contrast JESP against a globally optimal algorithm that derives the globally optimal policy via a full search of the space of policies. Finally, we prove piecewise linear and convexity (PWLC) properties, thus taking steps towards developing algorithms for continuous initial belief states.

2 Model

We describe the Markov Team Decision Problem (MTDP) [Pynadath and Tambe, 2002] framework in detail here to provide a concrete illustration of a decentralized POMDP model. However, other decentralized POMDP models could potentially also serve as a basis [Bernstein et al., 2000; Xuan et al., 2001].

Given a team of $n$ agents, an MTDP [Pynadath and Tambe, 2002] is defined as a tuple $\langle S, A, P, \Omega, O, R \rangle$. $S$ is a finite set of world states $\{s_1, \ldots, s_m\}$. $A = \times_{1 \le i \le n} A_i$, where $A_1, \ldots, A_n$ are the sets of actions for agents 1 to $n$. A joint action is represented as $\langle a_1, \ldots, a_n \rangle$. $P(s_i, \langle a_1, \ldots, a_n \rangle, s_f)$, the transition function, represents the probability that the current state is $s_f$ if the previous state is $s_i$ and the previous joint action is $\langle a_1, \ldots, a_n \rangle$. $\Omega = \times_{1 \le i \le n} \Omega_i$, where $\Omega_1, \ldots, \Omega_n$ are the sets of observations for agents 1 to $n$. $O(s, \langle a_1, \ldots, a_n \rangle, \omega)$, the observation function, represents the probability of joint observation $\omega \in \Omega$ if the current state is $s$ and the previous joint action is $\langle a_1, \ldots, a_n \rangle$. The agents receive a single, immediate joint reward $R(s, \langle a_1, \ldots, a_n \rangle)$, which is shared equally.

Practical analyses using models like MTDP often assume that the observations of each agent are independent of the other agents' observations.
Thus the observation function can be expressed as $O(s, \langle a_1, \ldots, a_n \rangle, \omega) = O_1(s, \langle a_1, \ldots, a_n \rangle, \omega_1) \cdot \ldots \cdot O_n(s, \langle a_1, \ldots, a_n \rangle, \omega_n)$.

Each agent $i$ chooses its actions based on its local policy, $\pi_i$, which is a mapping of its observation history to actions. Thus, at time $t$, agent $i$ will perform action $\pi_i(\vec{\omega}_i^t)$, where $\vec{\omega}_i^t = \omega_i^1, \ldots, \omega_i^t$. $\pi = \langle \pi_1, \ldots, \pi_n \rangle$ refers to the joint policy of the team of agents. The important thing to note is that in this model, execution is distributed but planning is centralized. Thus agents don't know each other's observations and actions at runtime, but they know each other's policies.

3 Example Scenario

For illustrative purposes it is useful to consider a familiar and simple example, yet one that is capable of bringing out key difficulties in creating optimal policies for MTDPs. To that end, we consider a multiagent version of the classic tiger problem used in illustrating single-agent POMDPs [Kaelbling et al., 1998] and create an MTDP ($\langle S, A, P, \Omega, O, R \rangle$) for this example. In our modified version, two agents are in a corridor facing two doors: "left" and "right". Behind one door lies a hungry tiger and behind the other lies untold riches, but the agents do not know the position of either. Thus, $S = \{SL, SR\}$, indicating behind which door the tiger is present. The agents can jointly or individually open either door. In addition, the agents can independently listen for the presence of the tiger. Thus, $A_1 = A_2 = \{$'OpenLeft', 'OpenRight', 'Listen'$\}$. The transition function, $P$, specifies that every time either agent opens one of the doors, the state is reset to SL or SR with equal probability, regardless of the action of the other agent, as shown in Table 1. However, if both agents listen, the state remains unchanged. After every action each agent receives an observation about the new state. The observation function, $O_1$ or $O_2$ (shown in Table 2), will return either HL or HR with different probabilities depending on the joint action taken and the resulting world state.
For example, if both agents listen and the tiger is behind the left door (state is SL), each agent receives the observation HL with probability […] and HR with probability […].

Table 1: Transition function

Action              SL->SL  SL->SR  SR->SR  SR->SL
<OpenRight, *>      0.5     0.5     0.5     0.5
<OpenLeft, *>       0.5     0.5     0.5     0.5
<*, OpenLeft>       0.5     0.5     0.5     0.5
<*, OpenRight>      0.5     0.5     0.5     0.5
<Listen, Listen>    1.0     0.0     1.0     0.0

Table 2: Observation function for each agent

Action              State   HL      HR
<Listen, Listen>    SL      […]     […]
<Listen, Listen>    SR      […]     […]
<OpenRight, *>      *       […]     […]
<OpenLeft, *>       *       […]     […]
<*, OpenLeft>       *       […]     […]
<*, OpenRight>      *       […]     […]

If either of them opens the door behind which the tiger is present, they are both attacked (equally) by the tiger (see Table 3). However, the injury sustained if they opened the door to the tiger is less severe if they open that door jointly than if they open the door alone. Similarly, they receive wealth, which they share equally, when they open the door to the riches, in proportion to the number of agents that opened that door. The agents incur a small cost for performing the 'Listen' action.

Clearly, acting jointly is beneficial (e.g., $A_1 = A_2 =$ 'OpenLeft') because the agents receive more riches and sustain less damage by acting together. However, because the agents receive independent observations (they do not share observations), they need to consider the observation histories of the other agent and what action they are likely to perform.
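The scenario's transition and observation functions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the `step` helper are hypothetical, the listening accuracy `LISTEN_ACCURACY = 0.85` is an assumed value rather than one read from Table 2, and observations after a door is opened are assumed to be uninformative.

```python
import random

STATES = ("SL", "SR")
ACTIONS = ("OpenLeft", "OpenRight", "Listen")
OBSERVATIONS = ("HL", "HR")
LISTEN_ACCURACY = 0.85  # assumption: illustrative value, not taken from Table 2


def transition(state, joint_action, next_state):
    """P(s, <a1, a2>, s'): unchanged if both listen, otherwise a uniform reset."""
    if joint_action == ("Listen", "Listen"):
        return 1.0 if next_state == state else 0.0
    return 0.5


def observation(next_state, joint_action, obs):
    """O_i(s', <a1, a2>, omega_i) for one agent; the same function for both agents."""
    if joint_action == ("Listen", "Listen"):
        correct = "HL" if next_state == "SL" else "HR"
        return LISTEN_ACCURACY if obs == correct else 1.0 - LISTEN_ACCURACY
    return 0.5  # assumption: no information once any door has been opened


def step(state, joint_action, rng):
    """Sample the next state and one independent observation per agent."""
    weights = [transition(state, joint_action, s) for s in STATES]
    next_state = rng.choices(STATES, weights=weights)[0]
    obs_weights = [observation(next_state, joint_action, o) for o in OBSERVATIONS]
    observations = tuple(
        rng.choices(OBSERVATIONS, weights=obs_weights)[0] for _ in range(2)
    )
    return next_state, observations
```

Calling `step("SL", ("Listen", "Listen"), random.Random(0))` keeps the state at SL and draws each agent's observation independently, which is the independence assumption of Section 2.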

Table 3: Reward function A

Action/State             SL      SR
<OpenRight, OpenRight>   […]     […]
<OpenLeft, OpenLeft>     […]     […]
<OpenRight, OpenLeft>    […]     […]
<OpenLeft, OpenRight>    […]     […]
<Listen, Listen>         -2      -2
<Listen, OpenRight>      […]     […]
<OpenRight, Listen>      […]     […]
<Listen, OpenLeft>       […]     […]
<OpenLeft, Listen>       […]     […]

We also consider another case of the reward function, where we vary the penalty for jointly opening the door to the tiger (see Table 4).

Table 4: Reward function B

Action/State             SL      SR
<OpenRight, OpenRight>   […]     […]
<OpenLeft, OpenLeft>     […]     […]
<OpenRight, OpenLeft>    […]     […]
<OpenLeft, OpenRight>    […]     […]
<Listen, Listen>         -2      -2
<Listen, OpenRight>      […]     […]
<OpenRight, Listen>      […]     […]
<Listen, OpenLeft>       […]     […]
<OpenLeft, Listen>       […]     […]

4 Optimal Joint Policy

When agents do not share all of their observations, they must instead coordinate by selecting policies that are sensitive to their teammates' possible beliefs, of which each agent's entire history of observations provides some information. The problem facing the team is to find the optimal joint policy, i.e. a combination of individual agent policies that produces behavior that maximizes the team's expected reward.

One sure-fire method for finding the optimal joint policy is to simply search the entire space of possible joint policies, evaluate the expected reward of each, and select the policy with the highest such value. To perform such a search, we must first be able to determine the expected reward of a joint policy. We compute this expectation by projecting the team's execution over all possible branches on different world states and different observations. We present here the 2-agent version of this computation, but the results easily generalize to arbitrary team sizes. At each time step, we can compute the expected value of a joint policy, $\pi = \langle \pi_1, \pi_2 \rangle$, for a team starting in a given state, $s^t$, with a given set of past observations, $\vec{\omega}_1^t$ and $\vec{\omega}_2^t$, as follows:

$$V_\pi^t(s^t, \vec{\omega}_1^t, \vec{\omega}_2^t) = R(s^t, \langle \pi_1(\vec{\omega}_1^t), \pi_2(\vec{\omega}_2^t) \rangle) + \sum_{s^{t+1} \in S} P(s^t, \langle \pi_1(\vec{\omega}_1^t), \pi_2(\vec{\omega}_2^t) \rangle, s^{t+1}) \cdot \sum_{\omega_1^{t+1} \in \Omega_1} O_1(s^{t+1}, \langle \pi_1(\vec{\omega}_1^t), \pi_2(\vec{\omega}_2^t) \rangle, \omega_1^{t+1}) \cdot \sum_{\omega_2^{t+1} \in \Omega_2} O_2(s^{t+1}, \langle \pi_1(\vec{\omega}_1^t), \pi_2(\vec{\omega}_2^t) \rangle, \omega_2^{t+1}) \cdot V_\pi^{t+1}(s^{t+1}, \langle \vec{\omega}_1^t, \omega_1^{t+1} \rangle, \langle \vec{\omega}_2^t, \omega_2^{t+1} \rangle) \qquad (1)$$

At each time step, the computation of $V_\pi^t$ performs a summation over all possible world states and agent observations, so the time complexity of this algorithm is $O((|S| \cdot |\Omega_1| \cdot |\Omega_2|)^T)$.
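The recursion of Equation 1 can be sketched directly as a recursive function. The sketch below is not the authors' implementation: the names `P`, `O`, `R` and `value` are chosen here for illustration, the listening accuracy of 0.85 is assumed, and the reward values in `R` are hypothetical stand-ins for Tables 3 and 4 (only the listening cost of 2 per joint 'Listen' is taken from the text). Policies are represented as functions from an observation-history tuple to an action.

```python
from itertools import product

STATES = ("SL", "SR")
OBS = ("HL", "HR")
ACCURACY = 0.85  # assumption: illustrative listening accuracy


def P(s, joint, s_next):
    """Transition function: the state persists only when both agents listen."""
    if joint == ("Listen", "Listen"):
        return 1.0 if s_next == s else 0.0
    return 0.5


def O(s_next, joint, obs):
    """Per-agent observation function (assumed identical for both agents)."""
    if joint == ("Listen", "Listen"):
        correct = "HL" if s_next == "SL" else "HR"
        return ACCURACY if obs == correct else 1.0 - ACCURACY
    return 0.5


def R(s, joint):
    """Hypothetical joint reward; these numbers are NOT those of Tables 3 and 4."""
    if joint == ("Listen", "Listen"):
        return -2.0
    tiger_door = "OpenLeft" if s == "SL" else "OpenRight"
    return sum(-50.0 if a == tiger_door else 10.0 for a in joint if a != "Listen")


def value(pi1, pi2, s, h1, h2, t, T):
    """Expected reward from step t to T given state s and histories h1, h2."""
    joint = (pi1(h1), pi2(h2))
    total = R(s, joint)
    if t == T:  # last step: no future reward
        return total
    for s_next, o1, o2 in product(STATES, OBS, OBS):
        p = P(s, joint, s_next) * O(s_next, joint, o1) * O(s_next, joint, o2)
        if p > 0.0:
            total += p * value(pi1, pi2, s_next, h1 + (o1,), h2 + (o2,), t + 1, T)
    return total
```

Each call branches over every next state and every pair of observations, which is where the exponential dependence on the horizon comes from.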
The overall search performs this computation for each and every possible joint policy. Since each policy specifies different actions over possible histories of observations, the number of possible policies for an individual agent $i$ is $O\left(|A_i|^{\frac{|\Omega_i|^T - 1}{|\Omega_i| - 1}}\right)$. The number of possible joint policies for $n$ agents is thus $O\left(|A_*|^{\frac{n(|\Omega_*|^T - 1)}{|\Omega_*| - 1}}\right)$, where $A_*$ and $\Omega_*$ correspond to the largest individual action and observation spaces, respectively, among the agents. The time complexity for finding the optimal joint policy by searching this space is thus:

$$O\left(|A_*|^{\frac{n(|\Omega_*|^T - 1)}{|\Omega_*| - 1}} \cdot (|S| \cdot |\Omega_*|^n)^T\right)$$

5 Joint Equilibrium-based Search for Policies

Given the complexity of exhaustively searching for the optimal joint policy, it is clear that such methods will not be successful when the amount of time to generate the policy is restricted. In this section, we will present algorithms that are guaranteed to find a locally optimal joint policy. We refer to this category of algorithms as JESP (Joint Equilibrium-Based Search for Policies). Just like the solution in Section 4, the solution obtained using JESP is a Nash equilibrium. In particular it is a locally optimal solution to a partially observable identical payoff stochastic game (POIPSG) [Peshkin et al., 2000]. The key idea is to find the policy that maximizes the joint expected reward for one agent at a time, keeping the policies of all the other agents fixed. This process is repeated until an equilibrium is reached (a local optimum is found). The problem of which optimum the agents should select when there are multiple local optima is not encountered since planning is centralized.

5.1 Exhaustive approach (Exhaustive-JESP)

The algorithm below describes an exhaustive approach for JESP. Here we consider that there are $n$ cooperative agents. We modify the policy of one agent at a time, keeping the policies of the other $n - 1$ agents fixed.
The function bestPolicy returns the joint policy that maximizes the expected joint reward, obtained by keeping $n - 1$ agents' policies fixed and exhaustively searching in the entire policy space of the agent whose policy is free. Therefore at each iteration, the value of the modified joint policy will always either increase or remain unchanged. This is repeated until an equilibrium is reached, i.e. the policies of all $n$ agents remain unchanged. This policy is guaranteed to be a local maximum since the value of the new joint policy at each iteration is non-decreasing.

Algorithm 1 EXHAUSTIVE-JESP()
 1: prev <- random joint policy, conv <- 0
 2: while conv != n - 1 do
 3:   for i <- 1 to n do
 4:     fix policy of all agents except i
 5:     policySpace <- list of all policies for i
 6:     new <- bestPolicy(i, policySpace, prev)
 7:     if new.value = prev.value then
 8:       conv <- conv + 1
 9:     else
10:       prev <- new, conv <- 0
11:     if conv = n - 1 then
12:       break
13: return new

The best policy cannot remain unchanged for more than $n - 1$ iterations without convergence being reached, and in the worst case, each and every joint policy is the best policy for at least one iteration. Hence, this algorithm has the same worst-case complexity as the exhaustive search for a globally optimal policy. However, it could do much better in practice, as illustrated in Section 6. Although the solution found by this algorithm is a local optimum, it may be adequate for some applications. Techniques like random restarts or simulated annealing can be applied to perturb the solution found to see if it settles on a different, higher value.

The exhaustive approach to Steps 5 and 6 of the Exhaustive-JESP algorithm enumerates and searches the entire policy space of a single agent, $i$. There are $O\left(|A_i|^{\frac{|\Omega_i|^T - 1}{|\Omega_i| - 1}}\right)$ such policies, and evaluating each incurs a time complexity of $O((|S| \cdot |\Omega_*|^n)^T)$. Thus, using the exhaustive approach incurs an overall time complexity in Steps 5 and 6 of $O\left(|A_i|^{\frac{|\Omega_i|^T - 1}{|\Omega_i| - 1}} \cdot (|S| \cdot |\Omega_*|^n)^T\right)$. Since we incur this complexity cost in each and every pass through the JESP algorithm, a faster means of performing the bestPolicy function call of Step 6 would produce a big payoff in overall efficiency. We describe a dynamic programming alternative to this exhaustive approach for doing JESP next.
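The alternating-maximization loop of Exhaustive-JESP can be sketched generically, assuming each agent's policy space is given as an explicit list and that a `value` function scores a joint policy. The names `exhaustive_jesp`, `PAYOFF` and `toy_value` are illustrative, and the two-agent one-shot game is a hypothetical example rather than the tiger domain. One deliberate difference from Algorithm 1: this sketch stops only after $n$ consecutive agents fail to improve, a slightly more conservative test than the $n - 1$ counter, so that every agent is checked at least once even when the random initial policy is already an equilibrium.

```python
import random


def exhaustive_jesp(policy_spaces, value, rng):
    """Alternating best response over explicit per-agent policy lists."""
    n = len(policy_spaces)
    prev = [rng.choice(space) for space in policy_spaces]  # random joint policy
    prev_value = value(prev)
    conv = 0  # number of consecutive agents that could not improve
    while conv < n:
        for i in range(n):
            # Exhaustively search agent i's policies with the others held fixed.
            best, best_value = prev, prev_value
            for candidate in policy_spaces[i]:
                trial = prev[:i] + [candidate] + prev[i + 1:]
                trial_value = value(trial)
                if trial_value > best_value:
                    best, best_value = trial, trial_value
            if best_value == prev_value:
                conv += 1
            else:
                prev, prev_value, conv = best, best_value, 0
            if conv == n:
                break
    return prev, prev_value


# Hypothetical identical-payoff game with two equilibria of different value.
PAYOFF = {("L", "L"): 5.0, ("R", "R"): 10.0}


def toy_value(joint_policy):
    """Joint reward of the toy game; miscoordination yields 0."""
    return PAYOFF.get(tuple(joint_policy), 0.0)
```

For instance, `exhaustive_jesp([["L", "R"], ["L", "R"]], toy_value, random.Random(0))` returns an equilibrium of the toy game; depending on the random start it may be the lower-valued one, which is the local-optimum behavior that motivates random restarts.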
5.2 Dynamic Programming (DP-JESP)

If we examine the single-agent POMDP literature for inspiration, we find algorithms that exploit dynamic programming to incrementally construct the best policy, rather than simply search the entire policy space [Monahan, 1982; Cassandra et al., 1997; Kaelbling et al., 1998]. These algorithms rely on a principle of optimality that states that each sub-policy of an overall optimal policy must also be optimal. In other words, if we have a $T$-step optimal policy, then, given the history over the first $t$ steps, the portion of that policy that covers the last $T - t$ steps must also be optimal over the remaining $T - t$ steps. In this section, we show how we can exploit an analogous optimality property in the multiagent case to perform more efficient construction of the optimal policy within our JESP algorithm.

To support such a dynamic-programming algorithm, we must define belief states that summarize an agent's history of past observations, so that they allow the agents to ignore the actual history of past observations, while still supporting construction of the optimal policy over the possible future. In the single-agent case, a belief state that stores the distribution, $B^t = \Pr(s^t \mid \vec{\omega}^t)$, is a sufficient statistic, because the agent can compute an optimal policy based on $B^t$ without having to consider the actual observation sequence, $\vec{\omega}^t$ [Sondik, 1971].

In the multiagent case, an agent faces a complex but normal single-agent POMDP if the policies of all other agents are fixed. However, $\Pr(s^t \mid \vec{\omega}_i^t)$ is not sufficient, because the agent must also reason about the action selection of the other agents and hence about the observation histories of the other agents. Thus, at each time $t$, the agent $i$ reasons about the tuple $e_i^t = \langle s^t, \vec{\omega}_{-i}^t \rangle$, where $\vec{\omega}_{-i}^t = \langle \vec{\omega}_1^t, \ldots, \vec{\omega}_{i-1}^t, \vec{\omega}_{i+1}^t, \ldots, \vec{\omega}_n^t \rangle$ is the joint observation histories of all the agents except $i$. By treating $e_i^t$ as the state of the agent $i$ at time $t$, we can define the transition function and observation function for the single-agent POMDP for agent $i$ as follows:

$$P'(e_i^t, a_i^t, e_i^{t+1}) = \Pr(e_i^{t+1} \mid e_i^t, a_i^t) = P(s^t, \langle \pi_{-i}(\vec{\omega}_{-i}^t), a_i^t \rangle, s^{t+1}) \cdot O_{-i}(s^{t+1}, \langle \pi_{-i}(\vec{\omega}_{-i}^t), a_i^t \rangle, \omega_{-i}^{t+1}) \qquad (2)$$

$$O'(e_i^{t+1}, a_i^t, \omega_i^{t+1}) = \Pr(\omega_i^{t+1} \mid e_i^{t+1}, a_i^t) = O_i(s^{t+1}, \langle \pi_{-i}(\vec{\omega}_{-i}^t), a_i^t \rangle, \omega_i^{t+1}) \qquad (3)$$

where $\pi_{-i} = \langle \pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n \rangle$ is the joint policy for all agents except $i$.

We now define the novel multiagent belief state for an agent $i$ given the distribution over the initial state, $b(s) = \Pr(S^1 = s)$:

$$B_i^t(e_i^t) = \Pr(e_i^t \mid \vec{\omega}_i^t, \vec{a}_i^{t-1}, b) \qquad (4)$$

In other words, when reasoning about an agent's policy in the context of other agents, we maintain a distribution over $e_i^t$, rather than simply the current state. Figure 1 shows different belief states for agent 1 in the tiger domain. Each belief state is a probability distribution over tuples $e_1^t = \langle s^t, \vec{\omega}_2^t \rangle$, in which $\vec{\omega}_2^t$ is the history of agent 2's observations while $s^t$ (for instance, SL) is the current state. Section 5.3 demonstrates how we can use this multiagent belief state to construct a dynamic program that incrementally constructs the optimal policy for agent $i$.

Figure 1: Trace of Tiger Scenario

5.3 The Dynamic Programming Algorithm

Following the model of the single-agent value-iteration algorithm, our dynamic program centers around a value function over a $T$-step finite horizon. For readability, this section presents the derivation for the dynamic program in the two-agent case; the results easily generalize to the $n$-agent case. Having fixed the policy of agent 2, our value function, $V_t(B^t)$, represents the expected reward that the team will receive when agent 1 follows its optimal policy from the $t$-th step onwards when starting with a current belief state, $B^t$. We start at the end of the time horizon (i.e., $t = T$), and then work our way back to the beginning. Along the way, we construct the optimal policy by maximizing our value function over possible action choices:

$$V_t(B^t) = \max_{a_1 \in A_1} V_t^{a_1}(B^t) \qquad (5)$$

We can define the action value function, $V_t^{a_1}$, recursively:

$$V_t^{a_1}(B^t) = R(a_1, B^t) + \sum_{\omega_1^{t+1} \in \Omega_1} \Pr(\omega_1^{t+1} \mid B^t, a_1) \cdot V_{t+1}(B^{t+1}) \qquad (6)$$

The first term in Equation 6 refers to the expected immediate reward, while the second term refers to the expected future reward. $B^{t+1}$ is the belief state updated after performing action $a_1$ and observing $\omega_1^{t+1}$. In the base case, $t = T$, the future reward is 0, leaving us with:

$$V_T^{a_1}(B^T) = R(a_1, B^T) \qquad (7)$$

The calculation of expected immediate reward breaks down as follows:

$$R(a_1, B^t) = \sum_{e^t = \langle s^t, \vec{\omega}_2^t \rangle} B^t(e^t) \cdot R(s^t, \langle a_1, \pi_2(\vec{\omega}_2^t) \rangle) \qquad (8)$$

Thus, we can compute the immediate reward using only the agent's current belief state and the primitive elements of our given MTDP model (see Section 2).

Computation of the expected future reward (the second term in Equation 6) depends on our ability to update agent 1's belief state from $B^t$ to $B^{t+1}$ in light of the new observation, $\omega_1^{t+1}$. For example, in Figure 1, one belief state is updated to the next when agent 1 performs an action and receives an observation. We now derive an algorithm for performing such an update, as well as computing the remaining $\Pr(\omega_1^{t+1} \mid B^t, a_1)$ term from Equation 6. The initial belief state, based on the distribution over the initial state, $b$, is:

$$B^1(\langle s, \langle \rangle \rangle) = b(s) \qquad (9)$$

For $e^{t+1} = \langle s^{t+1}, \langle \vec{\omega}_2^t, \omega_2^{t+1} \rangle \rangle$, the updated $B^{t+1}$ is obtained using Equations 2 and 3 and Bayes' rule, and is given as follows:

$$B^{t+1}(e^{t+1}) = \frac{\sum_{s^t} B^t(e^t) \cdot P(s^t, \langle a_1, \pi_2(\vec{\omega}_2^t) \rangle, s^{t+1}) \cdot O_1(s^{t+1}, \langle a_1, \pi_2(\vec{\omega}_2^t) \rangle, \omega_1^{t+1}) \cdot O_2(s^{t+1}, \langle a_1, \pi_2(\vec{\omega}_2^t) \rangle, \omega_2^{t+1})}{\Pr(\omega_1^{t+1} \mid B^t, a_1)} \qquad (10)$$

We treat the denominator of Equation 10 (i.e., $\Pr(\omega_1^{t+1} \mid B^t, a_1)$) as a normalizing constant to bring the sum of the numerator over all $e^{t+1}$ to be 1.
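The belief update of Equation 10 can be sketched for the two-agent tiger domain as follows. The representation is an assumption of this sketch: a belief state is a dictionary from tuples (state, agent 2's observation history) to probabilities, `pi2` is a function from agent 2's history to its action, and the listening accuracy of 0.85 is an illustrative value. The function returns both the updated belief and the normalizing constant, since the latter is the $\Pr(\omega_1^{t+1} \mid B^t, a_1)$ term needed in Equation 6.

```python
from itertools import product

STATES = ("SL", "SR")
OBS = ("HL", "HR")
ACCURACY = 0.85  # assumption: illustrative listening accuracy


def P(s, joint, s_next):
    """Transition function: the state persists only when both agents listen."""
    if joint == ("Listen", "Listen"):
        return 1.0 if s_next == s else 0.0
    return 0.5


def O(s_next, joint, obs):
    """Per-agent observation function (assumed identical for both agents)."""
    if joint == ("Listen", "Listen"):
        correct = "HL" if s_next == "SL" else "HR"
        return ACCURACY if obs == correct else 1.0 - ACCURACY
    return 0.5


def initial_belief(b):
    """Equation 9: agent 2 starts with an empty observation history."""
    return {(s, ()): p for s, p in b.items() if p > 0.0}


def update(belief, a1, o1, pi2):
    """Equation 10: return (B^{t+1}, Pr(o1 | B^t, a1)) for agent 1."""
    unnormalised = {}
    for (s, h2), weight in belief.items():
        joint = (a1, pi2(h2))  # agent 2's action follows from its fixed policy
        for s_next, o2 in product(STATES, OBS):
            p = weight * P(s, joint, s_next) * O(s_next, joint, o1) * O(s_next, joint, o2)
            if p > 0.0:
                key = (s_next, h2 + (o2,))
                unnormalised[key] = unnormalised.get(key, 0.0) + p
    norm = sum(unnormalised.values())  # the denominator of Equation 10
    return {e: p / norm for e, p in unnormalised.items()}, norm
```

Starting from a uniform `initial_belief({"SL": 0.5, "SR": 0.5})` with agent 2 always listening, one `update` call for agent 1's action 'Listen' and observation 'HL' yields a distribution over four tuples, one per combination of state and agent 2's possible observation.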
This normalizing constant also enters into our computation of future expected reward in the second term of Equation 6. Thus, we can compute the agent's new belief state (and the future expected reward and the overall value function, in turn) using only the agent's current belief state and the primitive elements of our given MTDP model. Having computed the overall value function, $V_t$, we can also extract a form of the optimal policy, $\pi_1$, that maps observation histories into actions, as required by Equations 8 and 10.

Algorithm 2 presents the pseudo-code for our overall dynamic programming algorithm. Lines 1-6 generate all of the belief states reachable from a given initial belief state, $b$. Since there is a possibly unique belief state for every sequence of actions and observations by agent 1, there are $O((|A_1| \cdot |\Omega_1|)^T)$ reachable belief states. This reachability analysis uses our belief update procedure (Algorithm 3), which itself has time complexity $O(|S|^2 \cdot |\Omega_2|^t)$ when invoked on a belief state at time $t$. Thus, the overall reachability analysis phase has a time complexity of $O(|S|^2 \cdot (|A_1| \cdot |\Omega_1| \cdot |\Omega_2|)^T)$. Lines 7-22 perform the heart of our dynamic programming algorithm, which also has a time complexity of $O(|S|^2 \cdot (|A_1| \cdot |\Omega_1| \cdot |\Omega_2|)^T)$. Lines 23-27 translate the resulting value function into an agent policy defined over observation sequences, as required by our algorithm (i.e., the $\pi_2$ argument). This last phase has a lower time and space complexity, $O(|S|^2 \cdot |\Omega_1|^T \cdot |\Omega_2|^T)$, than our other two phases, since it considers only optimal actions for agent 1. Thus, the overall time complexity of our algorithm is $O(|S|^2 \cdot (|A_1| \cdot |\Omega_1| \cdot |\Omega_2|)^T)$. The space complexity of the resulting value function and policy is essentially the product of the number of reachable belief states and the size of our belief state representation: $O(|S| \cdot (|A_1| \cdot |\Omega_1| \cdot |\Omega_2|)^T)$.

5.4 Piecewise Linearity and Convexity of Value Function

Algorithm 2 computes a value function over only those belief states that are reachable from a given initial belief state, which is a subset of all possible probability distributions over $s^t$ and $\vec{\omega}_2^t$.
To use dynamic programming over the entire set, we must show that our chosen value function is piecewise linear and convex (PWLC). Each agent is faced with solving a single-agent POMDP if the policies of all other agents are fixed, as shown in Section 5.2. Sondik [1971] showed that the value function for a single-agent POMDP is PWLC. Hence the value function in Equation 5 is PWLC. Thus, in addition to supporting the more efficient dynamic programming of Algorithm 2, our novel choice of belief state space and value function can potentially support a dynamic programming algorithm over the entire continuous space of possible belief states.

Algorithm 2 OPTIMALPOLICYDP(b, π_2, T)
 1: reachable(1) <- {b}
 2: for t <- 1 to T - 1 do
 3:   reachable(t+1) <- {}
 4:   for all B^t in reachable(t) do
 5:     for all a_1 in A_1, ω_1^{t+1} in Ω_1 do
 6:       reachable(t+1) <- reachable(t+1) ∪ {UPDATE(B^t, a_1, ω_1^{t+1})}
 7: for t <- T downto 1 do
 8:   for all B^t in reachable(t) do
 9:     V_t(B^t) <- -∞
10:     for all a_1 in A_1 do
11:       V_t^{a_1}(B^t) <- 0
12:       for all e^t = <s^t, ω_2^t history> do   {Equation 8}
13:         V_t^{a_1}(B^t) <- V_t^{a_1}(B^t) + B^t(e^t) · R(s^t, <a_1, π_2(ω_2^t history)>)
14:       if t < T then   {compute future reward}
15:         for all ω_1^{t+1} in Ω_1 do
16:           prob <- 0
17:           for all e^t = <s^t, ω_2^t history>, s^{t+1} in S do
18:             a_2 <- π_2(ω_2^t history)
19:             prob <- prob + B^t(e^t) · P(s^t, <a_1, a_2>, s^{t+1}) · O_1(s^{t+1}, <a_1, a_2>, ω_1^{t+1})
20:           V_t^{a_1}(B^t) <- V_t^{a_1}(B^t) + prob · V_{t+1}(UPDATE(B^t, a_1, ω_1^{t+1}))   {Equation 6}
21:       if V_t^{a_1}(B^t) >= V_t(B^t) then
22:         V_t(B^t) <- V_t^{a_1}(B^t)
23: for all observation sequences of agent 1 up to horizon T do
24:   B^1 <- b, π_1(<>) <- argmax_{a_1} V_1^{a_1}(B^1)
25:   for t <- 2 to T do
26:     B^t <- UPDATE(B^{t-1}, π_1(ω_1^{t-1} history), ω_1^t)
27:     π_1(ω_1^t history) <- argmax_{a_1} V_t^{a_1}(B^t)
28: return π_1

Algorithm 3 UPDATE(B^t, a_1, ω_1^{t+1})
 1: for all e^{t+1} = <s^{t+1}, <ω_2^t history, ω_2^{t+1}>> do
 2:   B^{t+1}(e^{t+1}) <- 0, a_2 <- π_2(ω_2^t history)
 3:   for all s^t in S do   {Equation 10}
 4:     B^{t+1}(e^{t+1}) <- B^{t+1}(e^{t+1}) + B^t(e^t) · P(s^t, <a_1, a_2>, s^{t+1}) · O_1(s^{t+1}, <a_1, a_2>, ω_1^{t+1}) · O_2(s^{t+1}, <a_1, a_2>, ω_2^{t+1})
 5: normalize B^{t+1}
 6: return B^{t+1}

6 Experimental Results

In this section, we perform an empirical comparison of the algorithms described in Sections 4 and 5 using the Tiger Scenario (see Section 3) in terms of time and performance. Figure 2 shows the results of running the globally optimal algorithm and the Exhaustive-JESP algorithm for two different reward functions (Tables 3 and 4). Finding the globally optimal policy is extremely slow and is doubly exponential in the finite horizon, $T$, and so we evaluate the algorithms only for finite horizons of 2 and 3. We ran the JESP algorithm for 3 different randomly selected initial policy settings and compared the performance of the algorithms in terms of the number of policy evaluations (on the Y-axis, using log scale) that were necessary. As can be seen from this figure, for $T = 2$ the JESP algorithm requires far fewer evaluations to arrive at an equilibrium. The difference in the run times of the globally optimal algorithm and the JESP algorithm is even more apparent when $T = 3$.
Here the globally optimal algorithm performed […] million policy evaluations while the JESP algorithm did […] evaluations. For reward function A, JESP succeeded in finding the globally optimal policies for both $T = 2$ (expected reward […]) and $T = 3$ (expected reward […]). However, this is not always the case. Using reward function B for $T = 2$, the JESP algorithm sometimes settles on a locally optimal policy (expected reward […]) that is different from the globally optimal policy (expected reward […]). However, when random restarts are used, the globally optimal reward can be obtained.

Figure 2: Evaluation Results. Y-axis: number of policies evaluated (log scale). Groups: Reward A, T=2; Reward A, T=3; Reward B, T=2. Series: Globally Optimal Policy Search, JESP (setting 1), JESP (setting 2), JESP (setting 3).

Based on Figure 2, we can conclude that the Exhaustive-JESP algorithm performs better than an exhaustive search for the globally optimal policy but can sometimes settle on a policy that is only locally optimal. This could be sufficient for problems where the difference between the locally optimal policy's value and the globally optimal policy's value is small and it is imperative that a policy be found quickly. Alternatively, the JESP algorithm could be altered so that it doesn't get stuck in a local optimum, via random restarts.

Table 5 presents experimental results from a comparison of Exhaustive-JESP with our dynamic programming approach (DP-JESP). These results, also from the tiger domain, show run time in milliseconds (ms) for the two algorithms with increasing horizon. DP-JESP is seen to obtain significant speedups over Exhaustive-JESP. For time horizons of 2 and 3, DP-JESP run time is essentially 0 ms, compared to the significant run times of Exhaustive-JESP. As we increased the horizon to […], we could not run Exhaustive-JESP at all, while DP-JESP could be easily run up to a horizon of […].

Table 5: Run time (ms) for various T with Pentium 4, 2.0GHz, 1GB memory, Linux Redhat 7.1, Allegro Common Lisp 6.0

Method            Run time (ms)
Exhaustive-JESP   10    17,800
DP-JESP           ,60   0,00

7 Summary and Conclusion

With the growing importance of decentralized POMDPs in the multiagents arena, for both design and analysis, it is critical to develop efficient algorithms for generating joint policies. Yet, there is a significant lack of such efficient algorithms. There are three novel contributions in this paper to address this shortcoming. First, given the complexity of the exhaustive policy search algorithm (doubly exponential in the number of agents and time), we describe a class of algorithms called Joint Equilibrium-based Search for Policies (JESP) that search for a local optimum rather than a global optimum. In particular, we provide detailed algorithms for Exhaustive-JESP and dynamic programming JESP (DP-JESP). Second, we provide complexity analysis for DP-JESP, which illustrates a potential for exponential speedups over Exhaustive-JESP. We have implemented all of our algorithms, and empirically verified the significant speedups they provide. Third, we provide a proof that the value function for individual agents is piecewise linear and convex (PWLC) in their belief states. This key result could pave the way to a new family of algorithms that operate over continuous belief states, increasing the range of applications that can be attacked via decentralized POMDPs, and is now a major issue for our future work.

References

[Bernstein et al., 2000] D. Bernstein, S. Zilberstein, and N. Immerman. The complexity of decentralized control of MDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.

[Boutilier, 1996] C. Boutilier. Planning, learning & coordination in multiagent decision processes. In Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge, 1996.

[Cassandra et al., 1997] A. Cassandra, M. Littman, and N. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997.

[Chadès et al., 2002] I. Chadès, B. Scherrer, and F. Charpillet. A heuristic approach for solving decentralized-POMDP: Assessment on the pursuit problem. In Proceedings of the Sixteenth ACM Symposium on Applied Computing, 2002.

[Kaelbling et al., 1998] L. Kaelbling, M. Littman, and A. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101, 1998.

[Monahan, 1982] G. Monahan. A survey of partially observable Markov decision processes: Theory, models and algorithms. Management Science, 101(1):1-16, January 1982.

[Papadimitriou and Tsitsiklis, 1987] C. Papadimitriou and J. Tsitsiklis. Complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 1987.

[Peshkin et al., 2000] L. Peshkin, N. Meuleau, K. E. Kim, and L. Kaelbling. Learning to cooperate via policy search. In UAI, 2000.

[Pynadath and Tambe, 2002] D. Pynadath and M. Tambe. The communicative multiagent team decision problem: Analyzing teamwork theories and models. JAIR, 2002.

[Sondik, 1971] Edward J. Sondik. The optimal control of partially observable Markov processes. Ph.D. Thesis, Stanford, 1971.

[Xuan et al., 2001] P. Xuan, V. Lesser, and S. Zilberstein. Communication decisions in multiagent cooperation. In Proceedings of the Fifth International Conference on Autonomous Agents, 2001.

Acknowledgments

We thank Piotr Gmytrasiewicz for discussions related to the paper. This research was supported by NSF grant # and DARPA award no. F
