Robotic manipulation of multiple objects as a POMDP


Joni Pajarinen, Joni.Pajarinen@aalto.fi
Ville Kyrki, Ville.Kyrki@aalto.fi
Department of Electrical Engineering and Automation, Aalto University, Finland

arXiv: v2 [cs.RO] 8 Jul 2014

Abstract

This paper investigates manipulation of multiple unknown objects in a crowded environment. Because of incomplete knowledge due to unknown objects and occlusions in visual observations, object observations are imperfect and action success is uncertain, making planning challenging. We model the problem as a partially observable Markov decision process (POMDP), which allows a general reward based optimization objective and takes uncertainty in temporal evolution and partial observations into account. In addition to occlusion dependent observation and action success probabilities, our POMDP model also automatically adapts object specific action success probabilities. To cope with the changing system dynamics and performance constraints, we present a new online POMDP method based on particle filtering that produces compact policies. The approach is validated both in simulation and in physical experiments in a scenario of moving dirty dishes into a dishwasher. The results indicate that: 1) a greedy heuristic manipulation approach is not sufficient; multi-object manipulation requires multi-step POMDP planning, and 2) online planning is beneficial since it allows the adaptation of the system dynamics model based on actual experience.

1 Introduction

For a service robot, physical interaction with its environment is an essential capability. As the application areas of service robotics are extending to complex unstructured environments, robotic manipulation has become an important focus area within robotics research. In complex environments, the robot's knowledge about its environment is incomplete and uncertain. To operate in such environments, robots can employ different mechanisms. First, a robot may use online sensing to attempt to gain more information about the environment. Second, sensory measurements can be used directly in a feedback control loop to adapt to small disturbances.
Third, the uncertainty can be taken into account on the level of planning the actions. Planning under uncertainty with imperfect sensing can be modelled as a partially observable Markov decision process (POMDP). While POMDPs have been one of the actively pursued research directions within AI, they have not been applied very widely in robotic manipulation. This is partly because the

manipulation planning problems have intrinsic characteristics such as continuous state spaces which make the application of POMDP solvers less straightforward, and partly because the manipulation planning approaches have only recently advanced to the point where the explicit modeling of uncertainty becomes tractable. The question of which robotic manipulation problems benefit from the explicit modeling of uncertainty in a POMDP framework remains open to a large extent.

In this paper, we consider a multi-object manipulation planning problem with the environment state estimated by imperfect sensors. The dynamics of the system are considered to be partly unknown, which supports the goal of long term autonomy of the robotic system. Our high-level research question is in which situations explicit planning under uncertainty is beneficial. The manipulation planning problem is considered on task level, that is, the desired result of planning is the best action to be performed. While motion planning for an individual action, such as grasping an object, is out of scope for this paper, the relationship of such an individual action to observations of the overall system state and the world model dynamics is modeled and learned during operation.

The theoretical contributions of the paper are two-fold. First, a POMDP model for multi-object manipulation is proposed. As a particular novel contribution, the model considers the effect of visual occlusions on observations and success of actions. Moreover, the dynamics of the world model, in particular related to the success of actions, are updated during operation as more information is gained. Second, we propose a new POMDP method which is applicable to manipulation planning. In particular, it does not require a discrete world model but instead samples the world model to construct policies. The method is also scalable, does not require heuristics, can handle uncertainty in the world model, and allows online planning, which is important when the world model is not accurate. Furthermore, the method produces compact policies in a predefined time.
This can be beneficial in a robotic setting where easy to inspect policies may give new insights into the problem.

The proposed approaches are experimentally evaluated both in simulation and with a real robot. The experiments demonstrate that multi-step, longer horizon planning is beneficial in complex environments with clutter. In particular, POMDPs are beneficial if a particular problem has some of the following characteristics: 1) the problem requires weighting the value of information gathering versus collecting immediate rewards, such as lifting objects to get a better view on other objects, 2) the world model is uncertain and thus it should be updated, for example when some objects are harder to grasp than others, or 3) the sequence of actions matters, such as when objects occlude each other even partially. Altogether, the paper is the first to propose long term POMDP planning for manipulating many objects in a high dimensional, unknown, and cluttered environment.

2 Related work

2.1 Partially observable Markov decision processes

A partially observable Markov decision process (POMDP) [15] defines the optimal policy for a sequential decision making problem while taking into account uncertain state transitions and partially observable states. This makes POMDPs applicable to diverse application domains such as robotics [42], elder care [11], tiger conservation [5], and wireless networking [30]. However, versatility comes at a price: the computational complexity of finite-horizon POMDPs is PSPACE-complete [32] in the worst case. Because of the high computational complexity, state-of-the-art POMDP methods [41, 18, 40, 2, 39] use different kinds of approximations.

There are at least two causes for the intractability of POMDPs: 1) state space size, and 2) policy size. State-of-the-art POMDP methods yield good policies even for POMDP problems with hundreds of thousands of states [41, 18] by trying to limit policy search to state space parts that are reachable and relevant for finding good policies. However, in complex real-world problems the state space can still be much larger. In POMDPs with discrete variables, the state space size grows exponentially w.r.t. the number of state variables. In order to make POMDPs with large state spaces tractable, there are a few approaches: compressing probability distributions into a lower dimension [33], using factored probability distributions [26, 29], or using particle filtering to represent probability distributions [40, 2]. Particle filtering is particularly attractive, because an explicit probability model of the problem is not needed. In fact, in order to cope with a complex state space, we use particle filtering in the online POMDP method presented in more detail in Section 3.2.

Assuming the problem of state space size solved, the problem of policy size still remains. In the worst case, the size of a POMDP policy grows exponentially with the planning horizon. Offline POMDP methods [39] take advantage of the piecewise linear convexity (PWLC) of the POMDP value function to keep the size of the policy reasonable.
Some offline POMDP methods use fixed size policies. A common approach is to use a fixed size (stochastic) finite state controller [34, 1] as a policy. The monotonic policy graph improvement method [31] utilizes a fixed size policy graph, an idea which is also adopted here.

Contrary to offline approaches, online POMDP methods [38] compute a new policy at each time step. Online planning starts from the current belief and can thus concentrate on only the part of the search space that is currently reachable. Moreover, restarting planning from the current belief allows the online planner to correct planning mistakes, which an inaccurate world model caused in earlier time steps. This is especially relevant in robotic manipulation in service settings where an accurate world model is difficult to estimate, for example due to unknown objects.

Online POMDP methods usually represent the policy as a policy tree. Techniques such as pruning can be used in order to reduce the size of the policy tree, but this does not solve the problem of exponential growth of the policy tree w.r.t. the planning horizon. The online POMCP method of Silver et al. [40] uses particle filtering to address the state space size problem and Monte-Carlo tree search to explore the policy space. However, for best results, POMCP requires a problem specific heuristic [40]. POMCP is also not designed to produce compact policies. The online POMDP approach that we

propose allows for long planning horizons by using a compact, fixed-size policy graph [31]. Because of the compact policy size, the policy can be inspected by a domain expert.

In manipulation, and more generally in robotics, the world model is often uncertain and thus it is necessary to learn the world model during online operation. The goal then is to maximize total reward while taking into account that the current world model is not accurate and that actions can yield information about the world model. For a discrete POMDP, a natural Bayesian approach is to model transition and observation probabilities with Dirichlet distributions [8, 35, 37]. Following this, our probability model uses Beta distributions to model uncertainty in object specific grasp probabilities. Few papers [36, 3] exist on using a POMDP model with uncertain probabilities in robotics. Ross et al. [36] apply an online POMDP approach to simulated robot navigation with Gaussian distributions with unknown parameters. Bai et al. [3] present an offline POMDP planning approach for robot motion planning with unknown model parameters and validate their approach in simulation. However, we are not aware of prior robotic manipulation research that uses a POMDP model with uncertain probabilities.

2.2 Manipulation under uncertainty

Manipulation planning under uncertainty is not a new problem in robotics. Already in the early 1980s, Lozano-Pérez et al. [23] considered the automatic synthesis of fine-motion actions under initial robot pose uncertainty using compliant motion, preimages and backward chaining. The originally sensorless decision-theoretic line of research can be seen to continue to this day with extensions over the years to e.g. grasping [4] and a probabilistic setting [22]. A good summary of the current state in this line of work can be found in [21].

There is a recent trend to integrate task and motion planning, see [17] for an overview. However, these approaches do not directly address the problem of uncertainty.
The only exception is the recent work by Kaelbling and Lozano-Pérez [16], where preimage backchaining is used for belief-space planning in a hierarchical framework. The approach handles probabilistic uncertainty by using a deterministic approximation of the domain and replanning after each time step. Our work differs from this approach: we consider the interactions of the manipulation actions, which often occur in multi-object manipulation, and update the world model based on online experience.

There is a tradition to formulate robot navigation problems as POMDPs; for an overview see [43], or a recent study for long time horizon POMDP planning [19]. In manipulation, the use of the formulation is not common. Hsiao et al. [13] proposed the partitioning of the configuration space of grasping with one uncertain degree of freedom to yield a discrete POMDP which can be solved for an optimal policy. In grasp planning, the state-of-the-art includes probabilistic approaches with a short time horizon. The goal can be formulated either as positioning the robot accurately as in [14] or maximizing the probability of a successful grasp as in [12, 20]. The short-term planning can also be extended to include information gathering actions [28]. In contrast to the above, this paper considers manipulation of multiple objects which are unknown and where the sequence of actions has a significant effect.

Recent work by Dogar and Srinivasa [7] proposes manipulation of multiple objects using grasping and pushing primitives. The approach uses pushing to collapse the uncertainties of the object locations as well as to clear clutter in the scene. The planning is performed at the level of object poses. Monso et al. [27] proposed to formulate clothes separation as a POMDP. In contrast to our work, the approach of Monso et al. is environment specific. Monso et al. rely on a clothes separation specific state space definition, which models the number of clothes in each area. We model object attributes, associated probabilities, and grasp probabilities, in any kind of environment.

3 Multi-object manipulation: a POMDP

In multi-object manipulation, a robot performs actions on several objects. In particular, the robot may grasp objects, move them, or use them in another way to accomplish some predefined goals. In this paper, we focus on the problem of deciding how to manipulate unknown objects in a crowded environment. Because the environment is crowded, only parts of the objects can be observed by visual sensors. In addition to uncertain observations, real-world manipulation problems have uncertain action consequences, especially when the robot does not have a model of the objects beforehand, or when the robot does not observe the objects well. For example, grasping or moving an object may fail because the shape or location of the object differs from the observed one.

Real-world problems often have several (possibly conflicting) goals. As a practical example consider putting dirty dishes from a table full of dishes into a dishwasher: the goal is to maximize the number of dirty dishes in the dishwasher, minimize the number of clean dishes in the dishwasher, and minimize the execution time. In order to address the issue of complex objectives, uncertain observations, and uncertain action effects, this paper models the problem of manipulating multiple unknown objects as a partially observable Markov decision process (POMDP).
Planning of manipulation, for instance grasp planning, is traditionally considered as a geometrical problem. However, in unstructured environments with unknown objects the current state-of-the-art approaches often plan individual actions (e.g., a grasp) directly based on the observed environment [10]. We follow the same idea so that the planning is performed on the level of semantic actions and locations, while the execution of individual actions is then performed based on the currently observed scene. However, our approach also models how the immediate sensor measurements affect both the observation and system models. For example, the rate of visual occlusion modulates the probability of correct observations and successful completion of actions.

We begin by defining a POMDP, then describe a new online POMDP planning method which is suitable for complex problems such as multi-object manipulation, and finally describe how to model multi-object manipulation in crowded environments as a POMDP with an application of moving dirty dishes into a dishwasher.

3.1 What is a POMDP?

A POMDP is a model that defines optimal behavior for a given Markovian problem, taking into account uncertainty in observations as well as action effects

over a potentially long time horizon. In a POMDP, a set of hidden Markov models, one for each action choice, describes the temporal dynamics of the problem and the optimization objective is defined by assigning a reward to each action in each possible situation. In a specific application, rewards should reflect real value, e.g. monetary cost.

Formally a POMDP is defined by the tuple $\langle S, A, O, P, R, O, b_0 \rangle$, where $S$ is the set of states, $A$ is the set of actions, and $O$ is the set of observations. The state set includes all possible states of the world in which the agent is assumed to operate. $P(s' \mid s, a)$ is the transition probability to move from state $s$ to the next time step state $s'$, when action $a$ is executed. $R(s, a)$ yields the real-valued reward for executing action $a$ in state $s$ and $O$ denotes the observation probabilities $P(o \mid s', a)$, where $o$ is the observation made by the agent, when action $a$ was executed and the world moved to the state $s'$. Lastly, $b_0(s)$ is the initial state probability distribution, also known as the initial belief.

In a finite-horizon POMDP, the goal is to optimize the expected reward

\[
E\left[ \sum_{t=0}^{T-1} R(s(t), a(t)) \,\middle|\, \pi \right], \tag{1}
\]

where $T$ is the horizon, $s(t)$ is the state, and $a(t)$ the action chosen at time step $t$ by the policy $\pi$. Because the states are not fully observable, the current state cannot be used for decision making as in fully observed models. Instead, the belief $b(s)$, a probability distribution over world states, is maintained to make (optimal) decisions at each time step. Starting from the initial belief $b_0(s)$, the belief is updated at each time step. After performing action $a$ and observing $o$, the updated belief $b' = b'(s' \mid b, a, o)$ can be obtained from the current belief $b = b(s)$ using the Bayes formula

\[
b'(s' \mid b, a, o) = \frac{P(o \mid s', a)}{C} \sum_{s} P(s' \mid s, a)\, b(s),
\]

where $C$ is a normalizing constant.

To give an intuitive idea of how POMDPs can be applied in practice, we will now give short examples for the transition and observation probabilities, and the reward function.
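As a minimal sketch of the Bayes formula above for a small discrete POMDP, the following Python snippet computes the updated belief. The array layout (`P_trans[a][s, s']`, `P_obs[a][s', o]`), the function name `belief_update`, and the toy numbers for a single cup are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def belief_update(belief, action, observation, P_trans, P_obs):
    """Return b'(s') = P(o | s', a) / C * sum_s P(s' | s, a) b(s)."""
    # Assumed layout: P_trans[a][s, s'] = P(s' | s, a), P_obs[a][s', o] = P(o | s', a).
    predicted = belief @ P_trans[action]              # sum_s P(s' | s, a) b(s)
    unnormalized = P_obs[action][:, observation] * predicted
    C = unnormalized.sum()                            # normalizing constant
    if C == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / C

# Toy example (hypothetical numbers): one cup, states 0 = clean, 1 = dirty.
# Action 0 = "look at cup" leaves the state unchanged; observations 0 = clean, 1 = dirty.
P_trans = {0: np.eye(2)}
P_obs = {0: np.array([[0.8, 0.2],    # clean cup is observed as clean with probability 0.8
                      [0.2, 0.8]])}  # dirty cup is observed as dirty with probability 0.8
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=1, P_trans=P_trans, P_obs=P_obs)
print(b1)
```

Starting from a uniform belief, a single "dirty" observation shifts the belief toward the dirty state in proportion to the observation probabilities.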
In a POMDP, the transition probability $P(s' \mid s, a)$ models the uncertainty in action effects: what is the probability to move a cup successfully from a table (part of state $s$) into a dishwasher (part of $s'$), when the action $a$ is "move cup into dishwasher"? The observation probability $P(o \mid s', a)$ models the uncertainty in observations: what is the probability of observing a cup as dirty (observation $o$), when it is dirty (part of state $s'$) and we are executing action $a$, "look at cup"? Finally, the reward $R(s, a)$ explicitly specifies the optimization goal: gain positive reward for moving (action $a$) a dirty cup (part of state $s$) into the dishwasher.

3.2 Online policy graph POMDP using monotonic value improvement

In this paper, as often in robotic applications, the state space of the robotic manipulation task is high dimensional. The state space has exponential size in the number of discrete state variables, and includes uncertain grasp success probability distributions. Because of the complex state space, POMDP methods based on exact probability representations are not applicable. We present a new online POMDP method based on the monotonic policy value improvement

algorithm [31] proposed by us earlier. The next subsection briefly introduces the method from [31], followed by the extensions: 1) the new method uses particle filtering to represent probability distributions and estimate values in a way that takes advantage of the policy graph (see the subsection "Particle filtering"), instead of using a discrete tabular probability distribution representation [31], and 2) the new method is transformed from an offline [31] method into an online method in a policy graph specific way (see the subsection "From offline to online").

POMDP policy and method

We represent the policy of the agent (the robot) as a policy graph $G$ (see Fig. 1 for an example policy graph) beginning from time step $t = 0$ and ending at the planning horizon $t = T - 1$. Each graph node defines a conditional plan for the robot to follow: which action to perform, and depending on the observation made, to which next layer node to transition next.

Figure 1: Illustration of a policy graph node update in the monotonic policy graph value improvement algorithm for POMDPs.

The policy improvement approach in [31] uses dynamic programming to improve one policy graph layer at a time. First, the approach computes the belief $b_t(s, q)$ at each layer $t$ and each graph node $q$ starting from the initial belief $b_0(s)$ in the first layer. Then, starting from the last layer and moving one layer at a time towards the first layer, the approach computes for each node in the layer a new policy (action, observation edges), which maximizes the expected reward for the belief at the current node. The expected reward is computed from the immediate reward and the next layer value function $V_{t+1}(s, q)$, which yields the expected reward when starting from state $s$ and graph node $q$ in layer $t + 1$ and following the policy graph until layer $T - 1$. This procedure guarantees monotonic improvement of policy value. For algorithmic details, see [31].

In order to keep computations tractable, we use a policy graph with fixed width and depth.
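To make the layered policy graph and its value function concrete, here is a minimal Python sketch that evaluates a fixed policy graph on a small discrete POMDP by a backward pass over the layers. It illustrates only the value recursion, not the improvement step of [31]. The `Node` class, the array layout (`P_trans[a][s, s']`, `P_obs[a][s', o]`, `R[s, a]`), and the one-state example are assumptions for illustration.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Node:
    action: int
    # observation -> index of the successor node in the next layer (empty in the last layer)
    edges: dict = field(default_factory=dict)

def evaluate(layers, P_trans, P_obs, R):
    """Backward pass; values[t][q, s] is the expected reward from state s at node q of layer t."""
    n_states = R.shape[0]
    V_next = None
    values = []
    for layer in reversed(layers):
        V = np.zeros((len(layer), n_states))
        for q, node in enumerate(layer):
            a = node.action
            V[q] = R[:, a]  # immediate reward R(s, a)
            if V_next is not None:
                for o, q_next in node.edges.items():
                    # sum over s' of P(s' | s, a) P(o | s', a) V_{t+1}(s', q_next)
                    V[q] += P_trans[a] @ (P_obs[a][:, o] * V_next[q_next])
        values.append(V)
        V_next = V
    return values[::-1]

# Hypothetical one-state, one-action, one-observation problem with reward 1 per step.
P_trans = np.array([[[1.0]]])
P_obs = np.array([[[1.0]]])
R = np.array([[1.0]])
layers = [[Node(0, {0: 0})], [Node(0, {0: 0})], [Node(0)]]
values = evaluate(layers, P_trans, P_obs, R)
print(values[0][0, 0])
```

With three layers and a reward of 1 at every step, the value at the first layer equals the horizon length, which is an easy sanity check for the recursion.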
This circumvents the problem of exponential growth of a search tree, allows for manual inspection of a compact policy, and enables us to convert the offline approach to an online one.

Particle filtering

The method in [31] assumes a discrete flat POMDP. In order to deal with a large state space, we use particle filtering to approximate beliefs and for estimating values.

Belief representation and update. We represent a belief $b(s)$, a probability distribution over $s$, as a finite set of particles [43], that is, a weighted set

of state instances $s_j$. The belief is $b(s) = \sum_j w_j \delta(s, s_j)$; $\sum_j w_j = 1$; $0 \le w_j \le 1$, where $w_j$ is the particle weight and $\delta(s, s_j) = 1$ when $s = s_j$ and zero otherwise. What a state actually is depends on the application: Section 3.3 defines a state for multi-object manipulation.

We use two kinds of belief updates. The first one is the commonly used update of the current belief $b(s)$, when an action has been executed and an observation made. This belief update is used in initializing the policy graph and for sampling new beliefs for redundant policy graph nodes in order to reoptimize them (if the optimized policy at a policy graph node, that is, the action and connections to the next layer, is identical to the policy of another node in the layer, we sample a new belief over world states, and re-optimize the node for the new belief. This compresses the policy graph without changing its value [31]). The second kind of belief update projects the belief $b_t(s, q)$ to the next layer belief $b_{t+1}(s, q)$, using the current policy. The second belief update is used in each improvement round.

In the first belief update, the action $a(t)$ and observation $o(t + 1)$ are given. We sample a next time step state $s_j(t + 1)$ for each current state $s_j(t)$ according to the application specific dynamics (state transition) model $P(s_j(t + 1) \mid s_j(t), a(t))$. We then compute the new particle weight $w_j(t + 1)$ as the product of the old weight and the observation probability: $w_j(t + 1) = w_j(t) P(o(t + 1) \mid s_j(t + 1), a(t))$. As usual, to prevent particle impoverishment, we resample particles when the effective sample size drops below a threshold (0.1 in the experiments).

In the second belief update, for updating the belief $b_t(s, q)$, a particle consists of a weight $w_j(t)$ and a state/node pair $(s_j(t), q_j(t))$. To sample a new particle $(s_j(t + 1), q_j(t + 1))$ using a $(s_j(t), q_j(t))$ pair, we first get the action $a(t)$ for node $q_j(t)$. Then, we sample a new state $s_j(t + 1)$ from $P(s_j(t + 1) \mid s_j(t), a(t))$. Next, we sample an observation $o(t + 1)$ from $P(o(t + 1) \mid s_j(t + 1), a(t))$.
Finally, the observation edge for observation $o(t + 1)$ of the graph node $q_j(t)$ yields the new graph node $q_j(t + 1)$. In this update, the particle weights do not change.

As a side remark, note that our approach differs from existing particle filtering based approaches. In order to improve the policy, we use the current policy for finding a belief distribution over graph nodes, but other state-of-the-art POMDP methods based on particle filtering [40, 2] select an action and observation to find a new belief for which to compute a policy. In other words, other POMDP methods use a constant number of particles to represent a single belief, but we use a constant number of particles to represent the belief over a policy graph layer (a time step), and each graph node is assigned particles proportional to the probability of the graph node. We believe this will result in a more efficient use of the computational resources.

Value estimation. In order to determine the best action and observation edges for a policy graph node, the method has to estimate the value for each action-observation-next node triplet. From these triplets the method can then select for each action the highest value observation-next node pairs and, based on these, select the highest value action. To do this efficiently, we follow Algorithm 1 in [2]. The algorithm samples state transitions and observations for each action and for the sampled observation simulates the value for each next controller node. Bai et al. [2] represent the policy as a possibly cyclic finite state controller, but we use instead an acyclic policy graph. However, no significant modifications

are necessary because the algorithm is based on simulation. Furthermore, the bound for the approximation error induced by sampling, shown in Theorem 1 in [2], also applies here: the error is bounded by a term that decreases at the rate of $O(1/\sqrt{N})$, where $N$ is the number of samples. In the implementation we do not actually sample states from a belief, but just go through all particles, one at a time, and utilize the particle weight for value estimation. In the policy improvement round, this is more efficient than sampling states, because particles usually have identical weights.

Complexity. The worst case complexity of one policy improvement round of the POMDP method is quadratic w.r.t. the planning horizon because the method simulates state trajectories up to the planning horizon for each policy graph layer. In the experiments in Section 4, the method performed well. In the future, one could parallelize the algorithm to utilize multiple CPU cores (easily, because of the particle representation of probabilities), or use a fixed sampling depth.

From offline to online

Because of the computational and modeling restrictions discussed previously, we transform the offline POMDP method into an online one. Similarly to the receding horizon control (RHC) approach [25, 6] in automatic control, we re-plan at each time step up to a finite horizon. Intuitively, we use a moving window that at each time step shifts one step to the right over the policy graph (imagine this with the help of the policy graph in Fig. 1), discards the first layer, and adds a new layer at the end. At the beginning, the agent optimizes the policy graph for several improvement rounds for the initial belief. Then in following time steps the agent estimates the new belief and constructs a new policy for the belief, as follows: 1) initialize the new policy graph with the layers $2, \ldots, T - 1$ of the previous policy graph; 2) add a new last layer to the policy graph with random actions, and add random observation edges to the layer preceding the last layer; 3) use the regular policy graph improvement method on the new policy graph.
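Steps 1) and 2) above can be sketched in Python as follows. This is a minimal illustration under assumptions: a node is stored as a hypothetical `(action, edges)` tuple with `edges` mapping each observation to a node index in the next layer, all layers have the same width, and the graph has at least two layers. Step 3), the actual policy improvement, is not shown.

```python
import random

def shift_policy_graph(layers, n_actions, n_observations, rng=random):
    """Discard the first layer, rewire the old last layer, and append a random new last layer."""
    width = len(layers[-1])
    # Step 1: reuse every layer of the previous graph except the first one.
    new_layers = [list(layer) for layer in layers[1:]]
    # Step 2a: the old last layer had no outgoing edges; give it random observation edges.
    new_layers[-1] = [
        (action, {o: rng.randrange(width) for o in range(n_observations)})
        for action, _ in new_layers[-1]
    ]
    # Step 2b: the new last layer gets random actions and no outgoing edges.
    new_layers.append([(rng.randrange(n_actions), {}) for _ in range(width)])
    return new_layers

# Hypothetical graph: 3 layers, width 2, 2 actions, 2 observations.
old_graph = [
    [(0, {0: 0, 1: 1}), (1, {0: 1, 1: 1})],
    [(1, {0: 0, 1: 0}), (0, {0: 1, 1: 0})],
    [(0, {}), (1, {})],
]
new_graph = shift_policy_graph(old_graph, n_actions=2, n_observations=2,
                               rng=random.Random(0))
print(len(new_graph))
```

The shifted graph keeps the planning horizon fixed, so the subsequent improvement rounds start from a policy that is already partly optimized.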
The basic idea here is to initialize the current policy graph using the policy graph of the previous time step, and then optimize the policy graph for the current belief. Because of the initialization, fewer improvement rounds are required during online operation than in offline optimization.

3.3 Multi-object manipulation as a POMDP

We now discuss a general POMDP framework for modeling multi-object manipulation. Later, in Section 3.3.1, we show how the POMDP framework can be applied to the problem of moving dirty dishes into a dishwasher.

In multi-object manipulation, the robot has to decide at each time instance which object to manipulate. We consider problems where the world consists of $N$ objects with varying attributes. The total number of actions is $\sum_i A_i$, where $A_i$ denotes the number of possible actions for object $i$. In each time step, the action of the robot changes the spatial locations and poses of the objects, and the robot makes an observation about the changed state of the world.

Our POMDP model uses discrete actions and observations. However, instead of forcing the robotic planning problem into a manageable discrete state

space as is done e.g. in [27], we use a POMDP method based on particle filtering (discussed in Sec. 3.2) that allows us to maintain complex object information required for efficient multi-object manipulation.

State space and actions. The state space consists of semantic object locations (e.g. "on table", "in a dishwasher"), object attributes, and historical data of observations and action successes for each object. The model assumes that the semantic location of an object is constant over time unless a manipulation action successfully changes it. However, because an online planning approach is used, the planning always restarts from the current belief, taking into account the most recent measurements.

Formally, the POMDP state $s = (s_1, s_2, \ldots, s_N)$ is a combination of object states $s_i = (s_i^{\mathrm{loc}}, s_i^{\mathrm{attr}}, s_i^{\mathrm{hist}})$, where $s_i^{\mathrm{loc}}$ is the semantic object location, $s_i^{\mathrm{attr}}$ the object attributes, and $s_i^{\mathrm{hist}}$ compressed historical information of action successes and object attribute observations. The action success information consists of a count of succeeded $n_i^{\mathrm{succ}}$ and failed $n_i^{\mathrm{fail}}$ grasps for each object. Because of the finite number of objects, the number of action counts is finite. Similarly, as discussed in more detail below, the number of different object attribute observations is finite. Therefore, $s_i^{\mathrm{hist}}$ has finite dimensionality, and the POMDP state can be stored and operated on efficiently. Note that the POMDP states have the Markov property because the probability for the next state depends only on the current state (and action).

The observation history contains information of past observations of object attributes. Past object attribute observations can be used to compute the probability distribution over an object's attributes. Additionally, these are needed during planning because future observations of the attributes cannot be assumed statistically independent, because the main source of observation uncertainty is occlusion.
In contrast, unless the occlusion changes, we assume that an identical observation of the attribute is made (note that we assume differently occluded observations independent). We assume that the probability of making the correct observation depends on how occluded the object is (we discuss this in more detail shortly). In more detail, $s_i^{\mathrm{hist}}$ contains the observation made in each occlusion setting. For example, in the experiments objects can be temporarily lifted: in addition to the current occlusion setting, we store the observation for each object which was temporarily lifted and which is otherwise in front of the observed object. Note that because of the finite number of objects the number of occlusion settings is finite, and thus the observation history has finite size.

Occlusion ratio. The action success probability and the observation probability of an object depend on how occluded the object is. Because we do not have models for the objects, the occlusion is modeled using a model free occlusion ratio. The reasoning is that the higher the occlusion, the smaller the probability of success in actions or observations. In the experiments, we capture a point cloud, segment the point cloud into objects, compute edges for all objects using 2-D information, and then find out how much the edges of objects touch each other. The right hand side image in Fig. 2 shows edges found for segmented objects in a scene. When the edge of object A, which is closer to the visual sensor, touches the edge of object B, object A occludes object B. Consider computing the occlusion ratio for object B. Denote with TOT the perimeter of the 2D contour of object B, that is, the total number of 2D pixels for which the number of neighboring 2D pixels, which are part of object B, is less than eight. Denote with TOU the touching edge between A and B, that


is, the number of 2D pixels in B which have at least one neighboring 2D pixel in object A (when B is occluded by several objects, just use the 2D pixels of the occluding objects). The occlusion ratio for object B is 1 when $\mathrm{TOT} - \mathrm{TOU}$ is smaller than TOU, 0 when $\mathrm{TOU} = 0$, and otherwise $\mathrm{TOU} / (\mathrm{TOT} - \mathrm{TOU})$. The reasoning is that when an object almost completely occludes another object, TOT is roughly double TOU. Thus an occlusion ratio of 1 corresponds to totally occluded and an occlusion ratio of 0 to no occlusion at all. The convenience variable $s_i^{\mathrm{occl}}$ denotes the occlusion ratio of object $i$.

Figure 2: Experimental setup. Left: A Kinova Jaco robotic arm manipulates objects placed on the table. A Microsoft XBOX Kinect acts as a monocular visual sensor for capturing RGB-D point clouds. In the experiments, the goal is to pick up dirty objects, here cups marked with a green color, from the table and place them into the dishwasher, represented by the blue box on the far left. Right: An image captured by the Kinect sensor. Object edges are depicted in blue.

In this paper, POMDP state transitions are based on sampling. When an object A is sampled to be moved, so that it does not occlude another object B anymore, it is straightforward to update the occlusion ratio of B by removing the touching edge between A and B. However, if there is an object C, which occludes A (edges of A and C touch), but not B, and A is moved away, then there is a possibility that C could occlude B after the removal of A. We call this occlusion inheritance. For simplicity, we do not take occlusion inheritance into account in the experiments and leave it as future work.

Grasp probability. We assume that occlusion affects the grasp probability of all objects in a similar way, but, in addition, we assume that each object has unknown properties that affect the grasp probability of that specific object: we do not know beforehand what kind of grasp properties each object has. For example, a cup that has fallen down may be harder to grasp than another cup which is standing upright (see Fig.
8b for an example). The probablty of a successful grasp s modeled as P (grasp succeeded s occl p succ Beta(p, s hst succ pror ) = E[p succ ] n pror + n succ, (1 p succ pror )n pror + n fal ), (2) succ pror where p s the occluson rato specfc grasp success pror probablty and n pror s the strength of the pror. In the experments, we mapped the occluson succ pror rato to the grasp success pror probablty p usng a smple exponental functon succ pror p = exp( θ G1 s occl + θ G2 ), (3) 11

where $\theta_{G1}$ and $\theta_{G2}$ are parameters that can be experimentally estimated from object grasps, for example, using two different occlusion ratios. Note that we model the grasp probability as the mean of the Beta distributed random variable $p_i^{succ}$. It would be possible to use a more complex model during planning, in which one would sample the grasp probability from the Beta distribution, but we expect this would increase the number of particles needed for planning.

Observations. We assume that the semantic locations and dependencies (which cup is in front of which cup) are fully observed and that grasp success is also fully observed. At each time step the agent observes whether the grasp succeeded and makes an observation about object attributes. Using these observations, we can compute a probability distribution over object attributes, which is needed for sampling the initial POMDP belief and for displaying attribute probabilities. Note that if grasp success or semantic locations are not fully observed, then we cannot estimate the initial POMDP belief directly using grasp success and object attribute observations. Instead, we could update at each time step an (approximate) belief according to the current action and observation and use that as the initial POMDP belief. However, in many applications, including the dishwasher application further down, semantic locations such as object on table and object in dishwasher, and thus also grasp success, are fully observed. As discussed earlier, we assume that the robot observes an object identically unless the occlusion changes.
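The occlusion ratio rule and the grasp model of Equations 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, `tot` and `tou` stand for the TOT and TOU pixel counts from the text, and the default parameter values (including the negative sign on `theta_g2`, chosen so that the prior stays below 1) are assumptions made for the example.

```python
import math


def occlusion_ratio(tot: int, tou: int) -> float:
    """Occlusion ratio from the TOT and TOU pixel counts described in the text."""
    if tou == 0:
        return 0.0            # nothing touches the object: no occlusion
    if tot - tou < tou:
        return 1.0            # treated as totally occluded
    return tou / (tot - tou)


def grasp_prior(s_occl: float, theta_g1: float = 0.904, theta_g2: float = -0.087) -> float:
    """Exponential prior of Equation 3; parameter signs are assumed, not from the source."""
    return min(1.0, math.exp(-theta_g1 * s_occl + theta_g2))


def grasp_probability(s_occl: float, n_succ: float, n_fail: float, n_prior: float = 0.5) -> float:
    """Mean of the Beta distribution in Equation 2."""
    p_prior = grasp_prior(s_occl)
    alpha = p_prior * n_prior + n_succ
    beta = (1.0 - p_prior) * n_prior + n_fail
    return alpha / (alpha + beta)


if __name__ == "__main__":
    s = occlusion_ratio(tot=100, tou=20)
    print("occlusion ratio:", s)
    print("no history:", grasp_probability(s, n_succ=0, n_fail=0))
    print("after one failed grasp:", grasp_probability(s, n_succ=0, n_fail=1))
```

With no grasp history the Beta mean reduces to the occlusion based prior, and each observed failure pulls the estimate down at a rate controlled by the prior strength `n_prior`.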
Denote with $o_i^j$ the observation for object i when in the jth occlusion setting, and with $a^j$ the action performed when observing $o_i^j$. Then the attribute probability given the history is

$$P(s_i^{attr} \mid o_i^1, \ldots, o_i^M, a^1, \ldots, a^M) = \frac{P(o_i^1, \ldots, o_i^M \mid s_i^{attr}, a^1, \ldots, a^M)\, P(s_i^{attr} \mid a^1, \ldots, a^M)}{P(o_i^1, \ldots, o_i^M \mid a^1, \ldots, a^M)} = \frac{P(s_i^{attr} \mid a^1, \ldots, a^M) \prod_{j=1}^{M} P(o_i^j \mid s_i^{attr}, a^j)}{\sum_{s_i^{attr\prime}} P(s_i^{attr\prime} \mid a^1, \ldots, a^M) \prod_{j=1}^{M} P(o_i^j \mid s_i^{attr\prime}, a^j)}, \qquad (4)$$

where we assumed that observations are conditionally independent given the object attributes, but if needed and computationally possible one can use joint probabilities. We assume that attributes (e.g. color) do not change over time, and thus actions do not influence object attributes: $P(s_i^{attr} \mid a^1, \ldots, a^M) = P(s_i^{attr})$. In the experiments, we assumed $P(s_i^{attr})$ is uniform.

3.3.1 Dirty cups into dishwasher

We now demonstrate how the framework can be used to model the problem of moving dirty cups from a table into a dishwasher as a POMDP (another realistic application could be moving dishwasher-safe cups, instead of dirty cups, into the dishwasher). In this problem, the robot can gain more information about attributes by removing occlusions, and gain information about the object specific grasp probability through successful and failed grasps.

State space and actions. In addition to the grasping and observation history discussed in Section 3.3, the world state consists of the semantic location $s_i^{loc}$ = {table, dishwasher}, and the attributes $s_i^{attr}$ of an object include dirtiness $s_i^{dirty}$ = {clean, dirty}. The robot can perform three kinds of actions. The FINISH action terminates the robot actions and assigns a negative reward to dirty dishes remaining on the table. The LIFT action tries to lift an object to expose the objects behind it and allows the agent to gather more information about the occluded objects. A small negative reward representing time cost is associated with the action. Note that the action takes less time than moving the object into the dishwasher. The WASH action tries to move an object into the dishwasher (in the experiments, a box). If the move succeeds, the state of the object changes from table to dishwasher. If the move succeeds and the moved object is dirty, then a large reward is obtained. If the move succeeds and the object is clean, a large negative reward is obtained. Failed grasps cause a small negative reward accounting for the time cost. Note that when implementing the model, we can compute the reward for the WASH action as the expected next time step reward using the grasp success probability, instead of deferring reward computation until the grasp has happened in the next time step.

Observations. At each time step the agent observes whether the grasp succeeded and the dirtiness of the k nearest objects (in the experiments k = 2) which were occluded by the moved cup. In total, there are $2^{k+1}$ possible observations. In the experiments, we model the conditional probability of observing cup i as dirty when it is dirty with

$$P(o_i = \text{dirty} \mid s_i^{dirty} = \text{dirty}, s_i^{occl}) = \exp(-\theta_{D1}\, s_i^{occl} + \theta_{D2}), \qquad (5)$$

where $\theta_{D1}$ and $\theta_{D2}$ are parameters that can be, similarly to the grasp probability, experimentally estimated from captured point clouds and object labels. The probability of observing a cup as dirty when it is clean is modeled identically with

$$P(o_i = \text{dirty} \mid s_i^{dirty} = \text{clean}, s_i^{occl}) = \exp(-\theta_{C1}\, s_i^{occl} + \theta_{C2}), \qquad (6)$$

where parameters $\theta_{C1}$ and $\theta_{C2}$ are also estimated in the same way.

4 Experiments

The experiments follow the scenario described above.
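The attribute posterior of Equation 4, specialized to the binary dirtiness attribute, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name and the tuple layout are hypothetical, and the per-observation likelihoods are passed in directly (in the model they would come from Equations 5 and 6 evaluated at the occlusion ratio of each occlusion setting). The numeric likelihoods in the usage example are made up for illustration.

```python
from typing import Iterable, Tuple

# One entry per occlusion setting j:
# (observed_dirty, P(o=dirty | dirty, occlusion_j), P(o=dirty | clean, occlusion_j))
Observation = Tuple[bool, float, float]


def dirty_posterior(observations: Iterable[Observation], prior_dirty: float = 0.5) -> float:
    """P(dirty | observation history), assuming conditionally independent observations."""
    weight_dirty = prior_dirty
    weight_clean = 1.0 - prior_dirty
    for observed_dirty, p_obs_dirty_if_dirty, p_obs_dirty_if_clean in observations:
        if observed_dirty:
            weight_dirty *= p_obs_dirty_if_dirty
            weight_clean *= p_obs_dirty_if_clean
        else:
            weight_dirty *= 1.0 - p_obs_dirty_if_dirty
            weight_clean *= 1.0 - p_obs_dirty_if_clean
    total = weight_dirty + weight_clean
    if total == 0.0:
        raise ValueError("observation history has zero probability under the model")
    return weight_dirty / total


if __name__ == "__main__":
    # Hypothetical likelihoods: an occluded cup is first seen as clean,
    # then seen as dirty once the occluding cup has been lifted.
    occluded = (False, 0.4, 0.1)
    exposed = (True, 0.9, 0.05)
    print(dirty_posterior([occluded]))
    print(dirty_posterior([occluded, exposed]))
```

The uniform default prior mirrors the assumption stated in the text; the normalization over both attribute values corresponds to the denominator of Equation 4.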
The scene is observed by an RGB-D sensor (Microsoft Kinect), and a 6-DOF Kinova Jaco arm with an integrated 3-fingered hand is used to manipulate the objects. The objects belong to two classes: clean white cups and cups with green dirt representing dirty objects. Fig. 2 illustrates the experimental setup: to the left a picture of the setup, and to the right an image captured by the Kinect sensor.

Rewards. The robot receives a reward at each time step. The reward depends on the action executed and the current state of the world. As discussed in Section 3.3.1, the robot can execute three different kinds of actions. The FINISH action terminates the problem and accumulates a reward of -5 for each dirty cup on the table. Similarly, to limit experiment run times, after ten time steps the problem is terminated and a reward of -5 is given for each dirty cup on the table. The LIFT action lifts an object up and yields a reward of -0.5 for both failed and succeeded grasps. The WASH action moves an object into the dishwasher. If the move succeeds, then the reward is +5 for a dirty object and -10 for a clean object. For a failed move the reward is -0.5. In our dishwasher application, there was no well determined objective. Rewards were designed based on the researchers' understanding of the application.

Methods. The POMDP planning method described in Section 3.2 is initialized by 10 offline policy improvement rounds. Then, at each time step, 4 improvement rounds for the current belief are executed. To evaluate the benefit of planning under uncertainty, the POMDP approach is also compared against heuristic decision making. The heuristic manipulation method assumes that observations are accurate and deterministic. It tries to move the dirty cup that has the highest grasp success probability into the dishwasher. If no cup is observed dirty, it performs the FINISH action. In the experiments, we used two versions of the heuristic method: one which updates grasp probabilities according to the grasp success history and another which does not remember any grasp history.

Point cloud into a world model. In the experiments, the visual sensor captures a point cloud, from which we extract objects, their color, and information on how they occlude each other. From these we estimate grasp and observation probabilities and use these probabilities to plan which action to perform. In more detail, first the Kinect sensor captures an RGB-D point cloud of the visual scene. Without using prior information we segment the point cloud into objects (see footnote 1). From the 2D image, we determine the edge of each object and how much it touches other objects' edges (see edges in the right hand side image of Fig. 2). Using the object edges, we compute occlusion ratios as discussed in Section 3.3. However, because of occlusion, segmentation may produce multiple objects for one complete object. Therefore, we merge objects that occlude each other and are close (occlusion ratio above 0.5 and centroid distance below 8 cm) into one object and re-compute its occlusion ratio.
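The reward scheme for the dishwasher task, including the expected WASH reward mentioned in the model description, can be written out as a short sketch. The function names are hypothetical, and the reward magnitudes are the ones given in the text with penalties taken as negative values (+5 for washing a dirty cup, -10 for washing a clean cup, -0.5 for LIFT or a failed move, -5 per dirty cup left on the table).

```python
R_WASH_DIRTY = 5.0     # successful move of a dirty cup
R_WASH_CLEAN = -10.0   # successful move of a clean cup
R_TIME_COST = -0.5     # LIFT (success or failure) and failed WASH
R_DIRTY_LEFT = -5.0    # per dirty cup remaining on the table at termination


def expected_wash_reward(p_grasp: float, p_dirty: float) -> float:
    """Expected immediate reward of WASH for one cup under the current belief."""
    success = p_dirty * R_WASH_DIRTY + (1.0 - p_dirty) * R_WASH_CLEAN
    return p_grasp * success + (1.0 - p_grasp) * R_TIME_COST


def lift_reward() -> float:
    """LIFT always costs the same small time penalty."""
    return R_TIME_COST


def expected_finish_reward(expected_dirty_on_table: float) -> float:
    """Expected terminal penalty given the expected number of dirty cups left."""
    return R_DIRTY_LEFT * expected_dirty_on_table


if __name__ == "__main__":
    print(expected_wash_reward(p_grasp=0.9, p_dirty=0.8))
    print(expected_wash_reward(p_grasp=0.9, p_dirty=0.5))
    print(expected_finish_reward(1.5))
```

With these values a successful WASH only pays off in expectation when the dirtiness probability exceeds 2/3, which is one way to see why lifting an occluding cup to sharpen the dirtiness estimate can be worth its time cost.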
Next, we compute object specific grasp probabilities (Equation 2) and observation probabilities (Equations 5 and 6) using the occlusion ratios, observation history, and initially estimated parameters. We set the grasp prior count $n^{prior}$ = 0.5. Finally, we make an observation of whether an object is dirty or clean based on the distance of the object color to precomputed color prototypes.

Grasping. Grasping an unknown object is performed by executing a top grasp, closing the fingers around the centroid of the point cloud of the object to grasp, similar to e.g. [9].

Estimating initial parameters. Before the actual experimental runs, we experimentally estimated the parameters of the grasping and observation probability functions defined in Equations 3, 5, and 6. To estimate grasp parameters we attempted to lift cups positioned on the table using the robot arm, both when the cups were occluded and when not, and estimated grasp parameters ($\theta_{G1}$ = 0.904, $\theta_{G2}$ = 0.087) from the recorded success rates. For the occluded case we used the average occlusion ratio. We estimated observation function parameters for dirty ($\theta_{D1}$ = 0.895, $\theta_{D2}$ = 0.087) and clean ($\theta_{C1}$ = 0.193, $\theta_{C2}$ = 0.0) cups similarly, but instead of the lifting success rate, we used the observation success rate.

Footnote 1: For segmentation we use organized multiplane segmentation and organized Euclidean cluster extraction, part of the Point Cloud Library.
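As a worked illustration of estimating the two parameters of an exponential model of the form $\exp(-\theta_1 s + \theta_2)$ from success rates measured at two occlusion ratios, the sketch below solves the two resulting log-linear equations exactly. The function name and the example success rates are hypothetical and are not the measurements used in the paper.

```python
import math
from typing import Tuple


def fit_exponential_model(s_a: float, rate_a: float, s_b: float, rate_b: float) -> Tuple[float, float]:
    """Solve rate = exp(-theta1 * s + theta2) for (theta1, theta2) from two data points.

    Taking logarithms gives two linear equations:
        log(rate_a) = -theta1 * s_a + theta2
        log(rate_b) = -theta1 * s_b + theta2
    """
    if s_a == s_b:
        raise ValueError("the two occlusion ratios must differ")
    if rate_a <= 0.0 or rate_b <= 0.0:
        raise ValueError("success rates must be positive")
    theta1 = -(math.log(rate_b) - math.log(rate_a)) / (s_b - s_a)
    theta2 = math.log(rate_a) + theta1 * s_a
    return theta1, theta2


if __name__ == "__main__":
    # Hypothetical rates: 90% success without occlusion, 60% at an average occlusion ratio of 0.5.
    theta1, theta2 = fit_exponential_model(0.0, 0.9, 0.5, 0.6)
    print(theta1, theta2)
    print(math.exp(-theta1 * 0.5 + theta2))
```

With more than two measured occlusion ratios, a least-squares fit on the log success rates would be the natural generalization of this two-point solve.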


4.1 Experiments with simulated dynamics

In multi-object manipulation, robot actions may have far reaching consequences: lifting first cup A and then cup B may increase the probability of cup C being observed dirty from low to high by exposing it more fully. The robot has to consider at each time step whether the information gain from lifting a cup yields more reward in the long run than executing an action which may yield higher immediate reward. Of course, because of the uncertainty in actions and observations, the real decision making problem can be even more complicated than this simple example implies. Consequently, our hypothesis is that a heuristic greedy manipulation approach is not sufficient and that planning several time steps into the future is needed. In order to study this hypothesis, we experimentally compared heuristic manipulation and the proposed POMDP approach with different planning horizons. Note that even though we simulate world dynamics, we estimate the grasp and observation probabilities using the physical robot arm and real observed occlusions. Moreover, we estimate the occlusions and locations of objects from point clouds captured by the Kinect sensor.

Figure 3: Cropped Kinect images of cup configurations used in the experiments. Each configuration contains four dirty (partly green color) and four clean cups. The green color on some cups is occluded.

In the simulated dynamics experiments, we used the ten different captured point clouds shown in Fig. 3 as the starting point for simulations. In these experiments, we form a world model from the point cloud and then repeatedly sample an initial belief and simulate the system using the probability model for 10 time steps. To get an initial belief, we sample particles using the cup dirtiness probability, which depends on past observations and which is defined in Equation 4 (dirtiness is an object attribute). For evaluation purposes we also sample hidden object specific grasp success probabilities.
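Sampling an initial particle belief from per-cup dirtiness probabilities, as described above, might look like the following sketch. The function name, the particle representation (one tuple of booleans per particle), and the example probabilities are assumptions made for illustration; the paper's particles carry additional state such as locations and grasp history.

```python
import random
from typing import List, Optional, Sequence, Tuple


def sample_initial_belief(
    p_dirty: Sequence[float], n_particles: int, seed: Optional[int] = None
) -> List[Tuple[bool, ...]]:
    """Draw particles; each particle assigns dirty (True) or clean (False) to every cup."""
    rng = random.Random(seed)
    return [
        tuple(rng.random() < p for p in p_dirty)  # independent draw per cup
        for _ in range(n_particles)
    ]


if __name__ == "__main__":
    # Hypothetical dirtiness probabilities for three cups.
    particles = sample_initial_belief([0.9, 0.3, 0.5], n_particles=1000, seed=0)
    for cup in range(3):
        fraction = sum(p[cup] for p in particles) / len(particles)
        print(f"cup {cup}: fraction of particles with dirty cup = {fraction:.3f}")
```

The fraction of particles in which a cup is dirty approximates that cup's dirtiness probability, so the particle set represents the belief the planner starts from.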
In more detail, we sample for object i the total amount of observed grasps $n_i = n_i^{succ} + n_i^{fail}$ from a Gamma probability distribution with shape 0.2 and scale 5.0, that is, a probability distribution where small $n_i$ are common, but large $n_i$ are also possible. We sample $n_i^{succ}$ from the uniform distribution between 0 and $n_i$, and keep $n_i^{succ}$ and $n_i^{fail}$ constant during each simulation run. Note that the magnitude of $n_i$ determines how much object specific grasp properties affect the grasp success probability compared to occlusion.

Results

Fig. 4 compares POMDP planning with different planning horizons, ranging from two to five, with the heuristic manipulation approach. The POMDP policy graph had a width of three. Fig. 4 shows the average total reward over 100 simulation runs for each of the ten different cup configurations shown in Fig. 3. Overall, POMDP planning achieves higher reward than the heuristic manipulation approach. Interestingly, the performance difference between the heuristic approach with and without grasp history is not significant. To study this further, we ran over 2000 simulation runs for the scene shown in the third image, upper row, in Fig. 3. In this scene dirty cups are in front and thus the heuristic approach can select between several cups to move. Not surprisingly, the approach utilizing grasp history performed better (with non-overlapping average reward confidence intervals; not shown in Fig. 3).

Figure 4 (plot not reproduced; vertical axis: reward over ten time steps; horizontal axis: planning horizon; series: Simple, Simple+history, POMDP planning): The average reward sum and its 95% confidence interval (computed using bootstrapping) for the heuristic manipulation approach, the heuristic manipulation approach utilizing grasp history information, and the POMDP planning method.

It is also interesting that a POMDP planning horizon of three works significantly better than a horizon of two. Intuitively, one could imagine that short conditional plans, such as "lift a cup, and then, if the cup behind the lifted cup is dirty, move it into the dishwasher", would already perform very well. However, the results suggest that many problems require a complex policy to gain high reward. Fig. 5 shows a compact policy graph computed by the POMDP method for the first scene in Fig. 3. The policy illustrates information gathering through lifting cups, the effect of failed grasps, and complex conditional planning. In the policy graph, the agent lifts e.g. cups 8 and 12 (for reference, the first RGB image in Fig. 3 shows cups 2, 4, 8, and 12) in order to gain information, and then, when observing cups 4 or 2 as dirty, tries to move them into the dishwasher. In time step two, when the move of cup 4 into the dishwasher fails, the grasp probability of cup 4 decreases. In time step three, the agent tries to move cup 4 again. This highlights the important feature of principled uncertainty handling in POMDP planning. Even though grasping failed previously, the planner tries to move the same cup, because compared to the alternatives the grasp probability is still high enough.
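The sampling of hidden, object specific grasp histories used in the simulations (a Gamma distributed total count with shape 0.2 and scale 5.0, split uniformly into successes and failures) can be sketched with the Python standard library. The function names are hypothetical, and the way the sampled counts enter the grasp probability follows the Beta mean of Equation 2 with an assumed prior strength of 0.5.

```python
import random
from typing import Tuple


def sample_hidden_grasp_history(rng: random.Random) -> Tuple[float, float]:
    """Return (n_succ, n_fail) with n_succ + n_fail ~ Gamma(shape=0.2, scale=5.0)."""
    n_total = rng.gammavariate(0.2, 5.0)   # gammavariate(alpha=shape, beta=scale)
    n_succ = rng.uniform(0.0, n_total)     # uniform split of the total count
    return n_succ, n_total - n_succ


def hidden_grasp_probability(p_prior: float, n_succ: float, n_fail: float, n_prior: float = 0.5) -> float:
    """Beta mean combining the occlusion based prior with the hidden counts."""
    alpha = p_prior * n_prior + n_succ
    beta = (1.0 - p_prior) * n_prior + n_fail
    return alpha / (alpha + beta)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        n_succ, n_fail = sample_hidden_grasp_history(rng)
        p = hidden_grasp_probability(0.9, n_succ, n_fail)
        print(f"n_succ={n_succ:.3f} n_fail={n_fail:.3f} grasp probability={p:.3f}")
```

Because the Gamma distribution with shape 0.2 concentrates mass near zero, most sampled objects stay close to the occlusion based prior, while an occasional large count produces an object whose grasp behavior deviates strongly from it.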
We also tested different reward scenarios. Fig. 6 shows performance for the heuristic manipulation method and the POMDP method with a planning horizon of three for different reward choices. In the experiment, we varied the reward for lifting a cup/a failed grasp attempt and the reward for putting a clean object into the dishwasher. The POMDP method outperformed the heuristic method in each reward scenario. The reward for failed grasps/lifting a cup had a significant effect on the POMDP method's performance. One explanation is that when lifting cups becomes more expensive, the benefit of planning over complex action-observation sequences decreases.

4.2 Robot arm experiments

In the previous section, we simulated world dynamics using a world model created from real robot grasps and point clouds captured by the visual sensor. In this section, we present experiments with a physical robot arm. In Section 4.2.1, we demonstrate crucial parts of our world model. In Section 4.2.2, we quantitatively compare the performance of the greedy heuristic approach with the proposed POMDP approach. In the demonstrations, we show the usefulness of information gathering actions, such as lifting cups, in occluded settings. Furthermore, we experimentally investigate when object specific adaptive grasp probabilities are required. In addition, we examine in which situations the heuristic manipulation approach suffices for efficient operation, and when more comprehensive POMDP based decision making is required instead. The quantitative experiments show that the POMDP based approach significantly outperforms the simple greedy approach and yield insights, for instance, on why online planning is beneficial. Overall, the experiments show that real world problems require a model that takes occlusion into account, that multi-object manipulation problems require multi-step POMDP planning, and that adaptive action success probabilities are necessary in many situations.

We performed robot arm experiments using the Kinova Jaco arm. In the robot arm experiments, the Kinect sensor observes the scene, a method decides which action to execute, and then the robot arm executes the action. At each time step we estimate a belief from the captured point cloud and add the observation history information to this belief to get the current belief. The method under evaluation decides on an action using the current belief.
In order to maintain a consistent observation history and to detect when a grasp succeeded or failed, we match current objects to objects in the previous time step: if an object is less than 4 cm from its last spatial position, we assume it is the same object. If an object exists at the same location after it was moved or lifted, we assume the grasp failed.

4.2.1 Demonstrations

We claim that in multi-object manipulation, the robot may need to perform information gathering actions when objects are occluded, or when the grasp success probabilities of objects differ. However, when objects are in plain sight and easy to grasp, decision making is easier. In this case, the problem requires no multi-step planning, and the heuristic policy of moving all cups that appear dirty into the dishwasher is sufficient. To test this, and to test whether our observation and state space models are applicable in physical robot arm experiments (we also tested the model in several other robot arm experiments which are discussed below), we performed robotic manipulation in a setup with dirty cups which are not occluded. Fig. 7a shows how the heuristic manipulation approach successfully moves the dirty cups into the dishwasher in this setup.
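The object matching rule between consecutive time steps (same object if it moved less than 4 cm; grasp failed if the manipulated object is still found at its old location) can be sketched as below. The function names, the dictionary based object representation, and the greedy nearest-neighbor assignment are assumptions for illustration; the text only specifies the 4 cm threshold.

```python
import math
from typing import Dict, Hashable, Optional, Tuple

Point = Tuple[float, float, float]  # centroid in metres


def match_objects(
    previous: Dict[Hashable, Point],
    current: Dict[Hashable, Point],
    max_dist: float = 0.04,
) -> Dict[Hashable, Optional[Hashable]]:
    """Map each current object to the nearest unused previous object within max_dist."""
    unused = dict(previous)
    matches: Dict[Hashable, Optional[Hashable]] = {}
    for cur_id, cur_pos in current.items():
        best_id, best_dist = None, max_dist
        for prev_id, prev_pos in unused.items():
            dist = math.dist(cur_pos, prev_pos)
            if dist < best_dist:          # strictly less than 4 cm
                best_id, best_dist = prev_id, dist
        matches[cur_id] = best_id
        if best_id is not None:
            del unused[best_id]           # each previous object is matched at most once
    return matches


def grasp_failed(target_prev_id: Hashable, matches: Dict[Hashable, Optional[Hashable]]) -> bool:
    """The grasp is assumed to have failed if the target is still seen at its old position."""
    return target_prev_id in matches.values()


if __name__ == "__main__":
    previous = {"cup_a": (0.00, 0.00, 0.0), "cup_b": (0.20, 0.00, 0.0)}
    current = {0: (0.01, 0.00, 0.0)}      # cup_b is gone, cup_a barely moved
    matches = match_objects(previous, current)
    print(matches)
    print(grasp_failed("cup_a", matches), grasp_failed("cup_b", matches))
```

A greedy assignment is order dependent in crowded scenes; a global assignment (for example, minimizing total distance) would be a more robust variant of the same idea.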

To test our occlusion model, and to test whether occlusion requires more complex decision making, we performed an experiment where the dirtiness of a cup is not apparent because another cup partly occludes the view of the dirty cup. The experiment in Fig. 7b demonstrates how the heuristic manipulation approach does not consider information gathering, and thus fails in the task. On the other hand, multi-step POMDP planning takes into account that the dirty cup may in fact be dirty, even though the robot observes it as clean, because the robot makes wrong observations on occluded cups with a high probability. The POMDP approach lifts the clean cup, gains new information on the dirty cup, that is, observes the dirty cup as dirty, which increases the probability of the cup being actually dirty, and then successfully moves the dirty cup into the dishwasher.

Previously, we claimed that real world multi-object manipulation problems require an object specific adaptive grasp success probability. To test this claim and to verify that our adaptive grasp success model works, we performed an experiment with two dirty cups where the first cup is slightly occluded, and the second cup contains drinking straws that make correct grasping more difficult. The robot always tries to move the second cup first, because the occlusion on the first cup makes the initial grasp success probability of the second cup higher. For simplicity, we compared the heuristic manipulation approach with and without adaptive grasp success probabilities. As shown in Fig. 8a, both methods fail to grasp the second cup because of the drinking straws. The adaptive grasp success probability method updates the grasp success probability after observing a failed grasp, and moves the first dirty cup successfully into the dishwasher. The method that does not take grasp success history into account tries to grasp the same second cup again, even though an easier to grasp dirty cup would be available. These kinds of situations occur often in practice. During experimentation with the robotic arm, for example, as shown in Fig. 8b, the robot moves dirty cups, but when it moves the third dirty cup, the cup falls down and remains in a harder to grasp pose. We observed that when further grasps on the object failed, the grasp success probability decreased as expected.

4.2.2 Quantitative results

In addition to the demonstrations, we performed a quantitative comparison between the simple greedy heuristic approach and the proposed POMDP approach in physical robot arm experiments. Similar to the experiments with simulated dynamics in Section 4.1, the goal was to move dirty, that is, partly green objects, into a dishwasher. An object was observed dirty if the number of green pixels was at least 100. Fig. 9 shows the four different scenes used. The fourth scene also contains toys to demonstrate the genericity of our approach. In each scene, we placed the objects on the table, and then ran the simple heuristic method and the POMDP method with a planning horizon of 3 after each other, five times each, yielding a total of twenty runs for each method over all four scenes. We reconstructed a scene after each run. Fig. 10 shows the results. Overall, the POMDP approach significantly outperformed the heuristic approach. Moreover, in each scene, the POMDP approach received higher rewards on average. Regarding planning times, on a single low performance AMD A M CPU core the heuristic approach took roughly 0.02% (2 milliseconds per time step) and the POMDP approach took 4.7% (2.6 seconds per time step) of the total execution time. The planning time for both methods was negligible compared to the time that sensor processing and moving the robot arm required.

Performance-wise, the heuristic approach was closest to the POMDP approach in scene 3. In scene 3, the two partly green objects closest to the Kinect were easy to grasp and the heuristic approach always successfully moved them to the dishwasher. Because of heavy occlusion, the two partly green objects farthest from the Kinect were very hard to grasp. Therefore, while usually being able to move the easy to grasp objects, the POMDP approach had more difficulty in moving the other two partly green objects. Interestingly, among individual experiment runs, the POMDP approach had both the lowest (-20) and highest (17) reward. The lowest reward was possible because of the grasp and observation uncertainty, and because the POMDP approach was more active than the heuristic approach.

Another interesting observation from the experiments was that occasionally an object could be dropped or tipped over. Our POMDP model does not explicitly take these kinds of events into account. However, in spite of this, the POMDP approach adapted to these unexpected situations because it always planned actions based on the belief estimated from current sensor readings.

4.3 Discussion

The experiments confirm that multi-step POMDP planning is useful when the order of actions is critical to the successful completion of the task. In particular, a POMDP estimates the value of information optimally. In an uncertain world, the probabilistic model used in POMDPs can weight different action choices in a principled manner. In contrast to a greedy approach, a POMDP may select actions that gather information, but do not yield immediate reward, when the problem so requires. In the multi-object manipulation experiments, the robot had to decide between lifting objects to gather information or moving objects that appear dirty into the dishwasher. Our POMDP model includes grasping success and learns grasping probabilities.
Grasping unknown objects requires object specific grasp probabilities because each object may be different. However, even when predefined object models are available, adaptive object specific grasp probabilities may be useful; especially in heavily cluttered settings with multiple objects, the large uncertainty about object pose and identity makes grasping some objects harder than others and requires an adaptive approach.

5 Conclusion

We presented a POMDP model for multi-object manipulation of unknown objects in a crowded environment. Because objects are occluded, their attributes are harder to observe and they are harder to manipulate. To address this, our POMDP model uses an occlusion ratio to define how much an object occludes another one. We use the occlusion ratio as a parameter in the observation and grasp probabilities of objects. In addition to occlusion specific grasp probabilities, our model also includes automatically adapting object specific grasp probabilities. To compute compact policies for the computationally complex POMDP model, we presented a new POMDP method that optimizes a policy

graph using particle filtering. The method allows multi-step POMDP planning, both offline and online. Experiments confirm that a heuristic greedy manipulation approach is not adequate for multi-object manipulation, but instead, the problem requires complex conditional multi-step POMDP plans that take long term effects into account. Moreover, object specific grasp probabilities are needed in many real-world situations.

In the future we plan to apply the presented POMDP model to other kinds of robotic tasks. Currently, we are extending the POMDP model to take into account the uncertainty in the composition of objects from segments. In general, to obtain true long-term autonomy, we believe that a robot should base its decisions on prior learned knowledge and adjust its world model to the specific environment it operates in. For this purpose a probabilistic Bayesian framework should be used that allows the robot to operate and learn in an uncertain, unstructured environment. In contrast to engineered solutions, learning offers the possibility to find solutions that generalize to unexpected situations and a possibility for autonomous adaptation. Our goal is an autonomous robot which can be placed in a complex new environment and which then knows how to adapt to the new environment. The work presented here is a step towards that goal.

Acknowledgements

This work was supported by the Academy of Finland, decision

References

[1] C. Amato, B. Bonet, and S. Zilberstein. Finite-state controllers based on Mealy machines for centralized and decentralized POMDPs. In Proceedings of the Twenty-Fourth National Conference on Artificial Intelligence (AAAI). AAAI Press.
[2] H. Bai, D. Hsu, W. Lee, and V. Ngo. Monte Carlo value iteration for continuous-state POMDPs. Algorithmic Foundations of Robotics IX.
[3] H. Bai, D. Hsu, and W. S. Lee. Planning How to Learn. In IEEE International Conference on Robotics and Automation (ICRA).
[4] Randy C. Brost. Automatic grasp planning in the presence of uncertainty. International Journal of Robotics Research, 7(1):3-17.
[5] I. Chadès, E. McDonald-Madden, M. A. McCarthy, B. Wintle, M. Linkie, and H. P. Possingham. When to stop managing or surveying cryptic threatened species. Proceedings of the National Academy of Sciences (PNAS), 105(37).
[6] S. Chakravorty and R. S. Erwin. Information space receding horizon control. In IEEE Symposium on Adaptive Dynamic Programming And Reinforcement Learning (ADPRL). IEEE.
[7] Mehmet Dogar and Siddhartha Srinivasa. A planning framework for non-prehensile manipulation under clutter and uncertainty. Autonomous Robots, 33(3), June.
[8] F. Doshi, J. Pineau, and N. Roy. Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. In Proceedings of the 25th International Conference on Machine Learning (ICML). ACM.
[9] Javier Felip, Jonna Laaksonen, Antonio Morales, and Ville Kyrki. Manipulation primitives: A paradigm for abstraction and execution of grasping and manipulation tasks. Robotics and Autonomous Systems, 61(3).
[10] D. Fischinger, M. Vincze, and Y. Jiang. Learning grasps for unknown objects in cluttered scenes. In IEEE International Conference on Robotics and Automation, Karlsruhe, Germany.
[11] J. Hoey, A. Von Bertoldi, P. Poupart, and A. Mihailidis. Assisting persons with dementia during handwashing using a partially observable Markov decision process. In Proceedings of the 5th International Conference on Vision Systems (ICVS). Bielefeld University Library.
[12] Kaijen Hsiao, Matei Ciocarlie, and Peter Brook. Bayesian grasp planning. In ICRA 2011 Workshop on Mobile Manipulation.
[13] Kaijen Hsiao, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Grasping POMDPs. In IEEE International Conference on Robotics and Automation, Rome, Italy.
[14] Kaijen Hsiao, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Robust grasping under object pose uncertainty. Autonomous Robots, 31(2-3).
[15] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99-134.
[16] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated robot task and motion planning in belief space. Technical Report MIT-CSAIL-TR, MIT CSAIL.
[17] Leslie Pack Kaelbling and Tomás Lozano-Pérez. Integrated task and motion planning in the now. Technical Report MIT-CSAIL-TR, MIT CSAIL.
[18] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proceedings of Robotics: Science and Systems IV. MIT Press.
[19] Hanna Kurniawati, Yanzha Du, David Hsu, and Wee Sun Lee. Motion planning under uncertainty for robotic tasks with long time horizons. International Journal of Robotics Research, 30(3).
[20] Jonna Laaksonen, Ekaterina Nikandrova, and Ville Kyrki. Probabilistic sensor-based grasping. In IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2012, Vilamoura, Portugal.
[21] Steven M. LaValle. Planning Algorithms. Cambridge University Press.
[22] Steven M. LaValle and Seth Hutchinson. An objective-based framework for motion planning under sensing and control uncertainties. International Journal of Robotics Research, 17(1):19-42.
[23] Tomás Lozano-Pérez, Matthew T. Mason, and Russell H. Taylor. Automatic synthesis of fine-motion strategies for robots. International Journal of Robotics Research, 3(1):3-24, March.
[24] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, 18(1):50-60.
[25] J. Mattingley, Y. Wang, and S. Boyd. Receding horizon control. IEEE Control Systems, 31(3):52-65.
[26] D. McAllester and S. Singh. Approximate planning for factored POMDPs using belief state simplification. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI). Morgan Kaufmann.
[27] Pol Monso, Guillem Alenya, and Carme Torras. POMDP approach to robotized clothes separation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Portugal.
[28] Ekaterina Nikandrova, Jonna Laaksonen, and Ville Kyrki. Towards informative sensor-based grasp planning. Robotics and Autonomous Systems. Accepted to be published.
[29] J. Pajarinen, J. Peltonen, A. Hottinen, and M. Uusitalo. Efficient Planning in Large POMDPs through Policy Graph Based Factorized Approximations. In Proceedings of The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), volume 6323 of Lecture Notes in Computer Science. Springer.
[30] J. Pajarinen, J. Peltonen, Mikko A. Uusitalo, and A. Hottinen. Latent state models of primary user behavior for opportunistic spectrum access. In Proceedings of IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE.
[31] Joni Pajarinen and Jaakko Peltonen. Periodic Finite State Controllers for Efficient POMDP and DEC-POMDP Planning. In Proceedings of the 25th Annual Conference on Neural Information Processing Systems (NIPS), Dec.
[32] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research.
[33] P. Poupart and C. Boutilier. Value-directed compression of POMDPs. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15 (NIPS). MIT Press.
[34] P. Poupart and C. Boutilier. Bounded finite state controllers. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16 (NIPS). MIT Press.
[35] P. Poupart and N. Vlassis. Model-based Bayesian reinforcement learning in partially observable domains. In Proceedings of the Tenth International Symposium on Artificial Intelligence and Mathematics (ISAIM).
[36] S. Ross, B. Chaib-draa, and J. Pineau. Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In IEEE International Conference on Robotics and Automation (ICRA). IEEE.
[37] S. Ross, J. Pineau, B. Chaib-draa, and P. Kreitmann. A Bayesian approach for learning and planning in partially observable Markov decision processes. Journal of Machine Learning Research, 12(May).
[38] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32(1).
[39] G. Shani, J. Pineau, and R. Kaplow. A survey of point-based POMDP solvers. Autonomous Agents and Multi-Agent Systems, 27(1):1-51.
[40] D. Silver and J. Veness. Monte-Carlo Planning in Large POMDPs. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23 (NIPS). Curran Associates Inc.
[41] T. Smith and R. Simmons. Point-Based POMDP Algorithms: Improved Analysis and Implementation. In Proceedings of the Twenty-First Annual Conference on Uncertainty in Artificial Intelligence (UAI). AUAI Press.
[42] S. Thrun. Probabilistic robotics. Communications of the ACM, 45(3):52-57.
[43] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press.

Figure 5: A policy graph optimized by the POMDP method for four time steps, when starting execution from the configuration shown in the first point cloud in Fig. 3. At each time step, the agent executes the action associated with the current graph node, makes an observation, and moves to the next-layer node along the corresponding observation edge. Each graph node shows its action, the visiting probability in parentheses, the expected reward, and the expected reward divided by the visiting probability. Each graph edge is labeled with the observation, that is, three symbols (e.g. F,D,C), and a visiting probability in parentheses. The first observation symbol denotes grasp success (S) or failure (F); the second and third symbols denote either dirty (D) or clean (C) for the first and second observed object, respectively. The box below a graph node displays, for each object, the dirtiness probability P(dirty), the grasp success probability P(grasp), and the probability that the object is on the table, P(on table). Noteworthy: 1) failed grasps decrease the grasp probability, 2) lifting objects yields information about the dirtiness of objects behind them, 3) POMDP planning yields complex behavior.
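The execution rule described in the Figure 5 caption (execute the node's action, observe, follow the matching observation edge to the next layer) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Node` class, the `execute_policy_graph` function, the toy two-layer graph, and the fallback to node 0 for an unlisted observation are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One policy graph node: an action plus observation-labelled edges."""
    action: str
    # Maps an observation such as ("S", "D", "C") to a node index in the next layer.
    edges: dict = field(default_factory=dict)


def execute_policy_graph(layers, observe, start=0):
    """Walk a layered policy graph; return the (action, observation) pairs seen.

    layers  -- list of layers, each a list of Node objects (hypothetical layout)
    observe -- callable that executes an action and returns an observation tuple
    """
    trace = []
    index = start
    for layer in layers:
        node = layer[index]
        observation = observe(node.action)  # execute the action, then observe
        trace.append((node.action, observation))
        # Follow the matching edge; fall back to node 0 (an assumption) if absent.
        index = node.edges.get(observation, 0)
    return trace


# Toy graph: a successful lift leads to "Wash 4", a failed grasp to another lift.
layers = [
    [Node("Lift 8", {("S", "D", "C"): 0, ("F", "C", "C"): 1})],
    [Node("Wash 4"), Node("Lift 8")],
]
scripted = iter([("S", "D", "C"), ("S", "C", "C")])
trace = execute_policy_graph(layers, lambda action: next(scripted))
```

Because the graph is layered, the controller needs no belief update at run time: the observation alone selects the next node, which is what makes such a policy compact to execute.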


Figure 6: The average reward sum over ten time steps and its 95% confidence interval (computed using bootstrapping) for the heuristic manipulation approach and for the POMDP planning method with a planning horizon of three, for different reward scenarios. Each reward scenario has different rewards for moving a clean cup into the dishwasher (−5 or −10) and for lifting a cup or a failed grasp attempt (−0.25, −0.5, or −1.0).

(a) The heuristic approach moves dirty cups which are not occluded into the dishwasher.

(b) Because of occlusion the robot observes a dirty cup as clean. Top: In order to gain more information, the POMDP approach lifts the occluding cup and then, when observing the dirty cup correctly, moves it to the dishwasher. Bottom: The heuristic approach executes the Finish action, because all cups appear clean.

Figure 7: The robot tries to move possibly occluded dirty cups (partly green color) into the dishwasher (blue box).
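The Figure 6 caption reports 95% confidence intervals computed using bootstrapping. The sketch below shows one common variant, the percentile bootstrap for a mean, as an illustration only: the source does not say which bootstrap variant or how many resamples were used, and the function name `bootstrap_mean_ci`, its parameters, and the reward values are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(samples, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    # Cut off the lower and upper tails of the sorted resample means.
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-run reward sums (not data from the paper).
reward_sums = [4.5, 5.0, 3.75, 6.0, 2.5, 5.25, 4.0, 4.75]
low, high = bootstrap_mean_ci(reward_sums)
```

The percentile bootstrap makes no normality assumption about the reward sums, which suits the small, skewed samples typical of robot experiments.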


(a) The robot fails to grasp a dirty cup that contains drinking straws. Top: The heuristic approach which does not consider grasp history tries to move the same dirty cup again. Bottom: The heuristic approach which takes grasp history into account moves the other dirty cup into the dishwasher.

(b) Grasping becomes harder. The robot has moved two dirty cups into the dishwasher, when another dirty cup drops onto the table and remains resting on its side. Following grasp attempts fail, because the cup is now more difficult to grasp.

Figure 8: The robot tries to move dirty cups (partly green color) into the dishwasher (blue box). Some of the cups are harder to grasp than others.

Figure 9: Top row: cropped Kinect images of the four scenes used in the robot arm experiments. Bottom row: corresponding photographs of the scenes. Each scene contains dirty (partly green color) and clean objects. Scenes one to three contain only cups, but scene four also contains several toys.
