Heuristic Search in MDPs 3/5/18

Logan Richard
5 years ago

1 Heuristic Search in MDPs 3/5/18

2 Thinking about online planning. How can we use ideas we've already seen to help with online planning? Heuristics? Iterative deepening? Monte Carlo simulations? Other ideas?

3 Heuristics What would happen if we started value iteration with non-zero initial values? Suppose we initialized values as follows:

16 LAO* Initialize the graph with just the start state. Repeat until the optimal partial policy is closed:
    Expand a state that's reachable from a state that's reachable by the optimal partial policy.
    Update values that could have been affected by this expansion.
    Update the optimal partial policy.
A partial policy specifies actions for some states. If it's closed, it gives actions for all reachable states.
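The loop above can be sketched in Python. This is a minimal illustration, not the lecture's implementation. The graph layout (`EDGES`), the discount `GAMMA`, the heuristic `h`, and the convention that a terminal state's value equals its reward are all assumptions made for the example. For simplicity it re-runs value iteration over every expanded state after each expansion, rather than only over the states whose values could have changed.

```python
GAMMA = 0.9   # assumed discount factor
GOAL = 4      # terminal state with reward +1

# Hypothetical layout: a short route 0-1-2-3-4 and a longer route that starts by moving down.
EDGES = {0: {"right": 1, "down": 5}, 1: {"right": 2}, 2: {"right": 3}, 3: {"right": 4},
         5: {"right": 6}, 6: {"right": 7}, 7: {"right": 8}, 8: {"right": 9},
         9: {"right": 10}, 10: {"up": 4}}
STEPS_TO_GOAL = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1}

def reward(s):
    return 1.0 if s == GOAL else 0.0

def outcomes(s, a):
    """An action succeeds with probability 1/2; otherwise the agent stays put."""
    return [(0.5, EDGES[s][a]), (0.5, s)]

def h(s):
    """Optimistic estimate (assumed): the value if every move succeeded."""
    return GAMMA ** STEPS_TO_GOAL[s]

def lao_star(start, tol=1e-9):
    def initial(s):
        return reward(s) if s == GOAL else h(s)

    V = {start: initial(start)}      # the graph starts with just the start state
    policy, expanded = {}, set()

    def q(s, a):
        return sum(p * V[t] for p, t in outcomes(s, a))

    def reachable_by_policy():
        seen, stack = {start}, [start]
        while stack:
            s = stack.pop()
            if s in policy:          # only expanded states have an action
                for _, t in outcomes(s, policy[s]):
                    if t not in seen:
                        seen.add(t)
                        stack.append(t)
        return seen

    while True:
        fringe = [s for s in reachable_by_policy()
                  if s not in expanded and s != GOAL]
        if not fringe:               # the greedy partial policy is closed
            return V, policy, expanded
        s = fringe[0]
        expanded.add(s)
        for a in EDGES[s]:           # add successors, valued by the heuristic
            for _, t in outcomes(s, a):
                V.setdefault(t, initial(t))
        while True:                  # update values and the greedy policy
            delta = 0.0
            for x in expanded:
                best = max(EDGES[x], key=lambda a: q(x, a))
                new = reward(x) + GAMMA * q(x, best)
                delta = max(delta, abs(new - V[x]))
                V[x], policy[x] = new, best
            if delta < tol:
                break

V, policy, expanded = lao_star(0)
```

In this example only states 0 to 3 are expanded. State 5 keeps its heuristic value and nothing beyond it is ever added to the graph, because the optimistic estimate for the downward route is already lower than the true value of the short route.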

17 Choosing Heuristics Suppose we had this MDP, where state 4 is terminal and has reward +1, while all other states are non-terminal and have reward 0. Actions succeed with probability ½ and fail (the agent stays put) with probability ½. How could we initialize values to simplify the search? We want to make sure that LAO* doesn't bother fully exploring the path that starts by moving down.
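One way to judge candidate initial values is to work out what a state is actually worth under these dynamics. The derivation below is a sketch under assumptions not stated on the slide: a discount factor γ, the backup V(s) = R(s) + γ · max over actions of the expected V(s'), a terminal value equal to its reward, and a state s whose best route to state 4 takes d successful moves, with s' the next state on that route.

```latex
\begin{align*}
V(s) &= 0 + \gamma\left(\tfrac{1}{2}\,V(s') + \tfrac{1}{2}\,V(s)\right) \\
\left(1 - \tfrac{\gamma}{2}\right) V(s) &= \tfrac{\gamma}{2}\,V(s') \\
V(s) &= \frac{\gamma}{2-\gamma}\,V(s') \\
V(s) &= \left(\frac{\gamma}{2-\gamma}\right)^{d} \quad \text{since } V(4) = 1 .
\end{align*}
```

Because 2 − γ ≥ 1, the ratio γ/(2 − γ) never exceeds γ, so γ^d (the value if every move succeeded) is an optimistic estimate, and it is smaller for states on a longer route. With γ = 0.9, each extra move scales the value by 9/11 ≈ 0.82.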

18 Admissible Heuristics What constitutes an admissible heuristic for LAO*? In A* search, admissibility guarantees that the optimal path will be found. For LAO*, we want to ensure that an optimal closed partial policy is found. If s is terminal: h(s) = 0. Otherwise: h(s) ≥ V*(s). A partial policy specifies actions for some states. If it's closed, it gives actions for all reachable states.
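A numerical check can make the admissibility condition concrete. The sketch below assumes that, with rewards being maximized, an admissible heuristic never falls below the optimal value of a non-terminal state. It uses a hypothetical four-move chain ending in the +1 terminal state, an assumed discount of 0.9, and the convention that the terminal state's value is its reward. The names `NEXT`, `optimal_values`, and `is_admissible` are illustrative.

```python
GAMMA = 0.9                        # assumed discount factor
GOAL = 4                           # terminal state with reward +1
NEXT = {0: 1, 1: 2, 2: 3, 3: 4}    # hypothetical chain, one action per state

def optimal_values(sweeps=500):
    """Value iteration; non-terminal rewards are 0, so they drop out of the backup."""
    V = {s: 0.0 for s in NEXT}
    V[GOAL] = 1.0
    for _ in range(sweeps):
        for s, t in NEXT.items():
            # The move succeeds with probability 1/2; otherwise the agent stays put.
            V[s] = GAMMA * (0.5 * V[t] + 0.5 * V[s])
    return V

def is_admissible(h, v_star, tol=1e-9):
    """True if h is at least the optimal value on every non-terminal state."""
    return all(h(s) >= v_star[s] - tol for s in v_star if s != GOAL)

v_star = optimal_values()
candidates = {
    "every move succeeds": lambda s: GAMMA ** (GOAL - s),
    "constant 1": lambda s: 1.0,
    "constant 0": lambda s: 0.0,
}
results = {name: is_admissible(h, v_star) for name, h in candidates.items()}
```

Under these assumptions the first two candidates pass and the constant 0 fails, since every state on the chain is worth more than 0. The constant 1 is admissible but uninformative: it ranks every state equally, so it gives the search no reason to prefer one route over another.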

19 Real-Time Dynamic Programming What does admissibility guarantee in RTDP?
Repeat while there's time remaining:
    state ← start state
    repeat until terminal (or depth bound):
        action ← optimal action in current state
        V(state) ← R(state) + discount * Q(state, action)
        state ← result of taking action
Q(state, action) is calculated from V(s') for all reachable s'. If s' hasn't been seen before, initialize V(s') ← h(s').
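The pseudocode above translates fairly directly into Python. This is a rough sketch: the graph `EDGES`, the discount, the heuristic `h`, the fixed trial count standing in for "time remaining", and the choice to give the terminal state its reward as its value are assumptions for illustration.

```python
import random

GAMMA = 0.9   # assumed discount factor
GOAL = 4      # terminal state with reward +1

# Hypothetical layout: a short route 0-1-2-3-4 and a longer route that starts by moving down.
EDGES = {0: {"right": 1, "down": 5}, 1: {"right": 2}, 2: {"right": 3}, 3: {"right": 4},
         5: {"right": 6}, 6: {"right": 7}, 7: {"right": 8}, 8: {"right": 9},
         9: {"right": 10}, 10: {"up": 4}}
STEPS_TO_GOAL = {0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 5: 6, 6: 5, 7: 4, 8: 3, 9: 2, 10: 1}

def reward(s):
    return 1.0 if s == GOAL else 0.0

def h(s):
    """Optimistic estimate (assumed): the value if every move succeeded."""
    return GAMMA ** STEPS_TO_GOAL[s]

def rtdp(start, trials=200, max_depth=100, seed=0):
    rng = random.Random(seed)
    V = {}

    def value(s):
        # Unseen states are initialized from the heuristic on first use.
        if s not in V:
            V[s] = reward(s) if s == GOAL else h(s)
        return V[s]

    def q(s, a):
        # The action succeeds with probability 1/2; otherwise the agent stays put.
        return 0.5 * value(EDGES[s][a]) + 0.5 * value(s)

    for _ in range(trials):              # stands in for "while there's time remaining"
        state = start
        for _ in range(max_depth):       # depth bound
            if state == GOAL:
                break
            action = max(EDGES[state], key=lambda a: q(state, a))
            V[state] = reward(state) + GAMMA * q(state, action)
            # Sample the result of taking the action.
            state = EDGES[state][action] if rng.random() < 0.5 else state
    return V

V = rtdp(0)
```

In this run the greedy action at state 0 is always "right", so the trials never enter the downward route. State 5 appears in `V` only with its heuristic value, and the states beyond it are never touched.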

20 Online Planning An online planner is one that interleaves planning and acting. Are LAO* and RTDP online planners? If not, how could we modify them to work online?
