XV International PhD Workshop OWD 2013, 19-22 October 2013

Machine Learning for the Efficient Control of a Multi-Wheeled Mobile Robot

Uladzimir Dziomin, Brest State Technical University
(prof. Vladimir Golovko, Brest State Technical University)

Abstract

This paper presents an application of the multi-agent reinforcement learning approach to the efficient control of a mobile robot. The approach is based on a multi-agent system applied to multi-wheel control. The robot's platform is decomposed into driving module agents that are trained independently. The proposed approach incorporates multiple Q-learning agents, which permits them to effectively control every wheel relative to the other wheels. The power reward policy with a common error reward is adjusted to produce efficient control. The proposed approach is applied to the distributed control of a multi-wheel platform in order to provide energy consumption optimization.

1. Introduction

Efficient robot control is one of the important tasks for the application of mobile robots in production. The important control tasks are power consumption optimization and optimal trajectory planning. Control subsystems should provide energy consumption optimization in a robot control system. The power consumption problem is addressed by motor power control optimization [1], [2] and by efficient motion planning [3]. The robot control subsystem cannot influence the motor parameters, but it must have a policy for efficient control (optimal speed parameter, maximum start driving power, safe slow-down distance). Trajectory planning is usually implemented by a planning subsystem [4], [5]. Such a subsystem builds a trajectory and divides it into parts that are reproduced by circular arcs and straight lines. The robot control subsystem should provide movement along these trajectory parts.

The problem of efficient control is an important prerequisite for the application of the mobile robot platform that was developed at Hochschule Ravensburg-Weingarten. A 3D image of the robot is shown in Fig. 1a. The platform is based on four innovative vehicle steering modules [6]. The steering module is shown in Fig. 2b; it consists of two wheels powered by separate motors and behaves like a differential drive. It is mounted to the platform by means of a bearing which allows unlimited rotation of the module with respect to the platform. The platform can contain three or more modules.

Fig. 1. a) Robot platform 3D model; b) The driving module.

The problem of multi-agent control is usually studied as a problem of formation composition, trajectory planning, distributed control and others. In this paper we consider the problem of circular motion for the single-module and multi-module cases. One solution of this problem [7]-[9] is to calculate the kinematics of a one-wheeled robot for circle driving and then generalize it to a multi-vehicle system. This approach has shown good modeling results; its disadvantages are low flexibility and high computational complexity. An alternative approach is to use a coordination architecture with one or more leaders [10], where a virtual coordinate frame follows the circular trajectory. Experimental results on a multi-robot platform have shown the effectiveness of this approach; its limitation is the explicit leader requirement.

In this paper we solve the problem of optimal control for the multi-module case in cooperative circular motion. The objective is to achieve circular motion around a virtual reference beacon with optimal forward and angular speed. We develop a reinforcement learning technique that produces an efficient control rule based on the relative pose of the module with respect to the beacon.
In order to illustrate the main features of the problem, the single-module case and the multi-module scenario are examined one after the other. The key contributions of this paper are a reinforcement learning model for robot positioning, which allows a module to position itself around a beacon even if the beacon dynamically changes its position, and a reinforcement learning model for the multi-module scenario, which allows a module's speed to be adjusted to the required value within the multi-module platform. The latter model requires only the positions of the modules relative to the center of the multi-module platform. By combining positioning and speed control, the modules are able to produce a cooperative control law for efficient circular motion. As a result, the developed system can easily be scaled to any number of modules.

2. Robot Control

The conventional approach for platform control is kinematics calculation and inverse kinematics modeling [6]. If any module is added to or removed from the platform, the kinematics equations have to be recalculated and the control subsystem reconfigured. Kinematics calculations can be applied only to symmetric turning. For example, we cannot drive using the Ackermann (car-like) driving scheme [11], because it is not sufficient for moving along difficult trajectory parts.

Fig. 3. Robot kinematics for symmetric turning.

It should be noted that the previously developed control model uses only one rotation center, which lies on the line perpendicular to the robot center (Fig. 3). The point G is the rotation center, the point S is the robot center, and the line SG is the robot turning radius. This is an important restriction for an industrial control system, because the robot cannot drive in any other way.

3. Steering Module Agent

Let us decompose the robot's platform into independent driving module agents. An agent operates in a physical 2D environment with a reference beacon, as shown in Fig. 4. The beacon position is defined by the coordinates (x_b, y_b). The rotation radius ρ is the distance from the center of the module to the beacon.

Fig. 4. State of the module with respect to the reference beacon.

The angle error is calculated by the following equations:

    φ_center = arctan2(x_b - x, y_b - y)                    (1)
    φ_err = φ_center - φ_robot                              (2)

Here φ_center and φ_robot are known from the environment. In this paper the environment is represented by a physical 2D model simulated in Player/Stage. The environment provides all the necessary information about the relative positions of the agent and the rotation point. The environment information is listed in Tab. 1.

Tab. 1. Environmental information
    #  Value                                     Type
    1  X robot position, x                       Coordinate, m
    2  Y robot position, y                       Coordinate, m
    3  X of beacon center, x_b                   Coordinate, m
    4  Y of beacon center, y_b                   Coordinate, m
    5  Robot orientation angle, φ_robot          Float number, radians, -π < φ_robot ≤ π
    6  Beacon orientation relative to robot,
       φ_center                                  Float number, radians, -π < φ_center ≤ π
    7  Radius size, ρ                            Float number, m

In the presented platform, the navigation subsystem of the real steering module uses odometry sensors for navigation purposes. The full set of actions available to the agent is presented in Tab. 2. The agent can change the angle error φ_err around the beacon by controlling the linear speed ν and the angular speed ω.

Tab. 2. Agent actions
    #  Action                          Value
    1  Increase force, v+              +0.01 m/s
    2  Reduce force, v-                -0.01 m/s
    3  Increase turning left, ω+       +0.01 rad/s
    4  Increase turning right, ω-      -0.01 rad/s
    5  Do nothing, Ø                   0 m/s, 0 rad/s
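As a minimal sketch (the function and variable names are ours, not from the paper), equations (1)-(2) and the discrete action set of Tab. 2 can be written as follows. Note that the paper writes arctan2 with the x-difference as the first argument, while the sketch assumes the usual atan2(dy, dx) convention and wraps the result into (-π, π] as required for φ_err.

```python
import math

def angle_error(x, y, x_b, y_b, phi_robot):
    """Angle error of eqs. (1)-(2): bearing from the module to the beacon minus the
    module's current orientation, wrapped into (-pi, pi]."""
    # eq. (1); the paper writes arctan2(x_b - x, y_b - y) -- the standard
    # atan2(dy, dx) convention is assumed here
    phi_center = math.atan2(y_b - y, x_b - x)
    phi_err = phi_center - phi_robot              # eq. (2)
    # wrap into (-pi, pi] so the agent always turns the shorter way
    while phi_err <= -math.pi:
        phi_err += 2.0 * math.pi
    while phi_err > math.pi:
        phi_err -= 2.0 * math.pi
    return phi_err

# Discrete action set of Tab. 2: (forward-speed increment, angular-speed increment)
ACTIONS = {
    "v+": (+0.01, 0.0),   # increase force: +0.01 m/s
    "v-": (-0.01, 0.0),   # reduce force: -0.01 m/s
    "w+": (0.0, +0.01),   # increase turning left: +0.01 rad/s
    "w-": (0.0, -0.01),   # increase turning right: -0.01 rad/s
    "0":  (0.0, 0.0),     # do nothing
}
```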

4. Multi-agent system of driving modules

One solution for formation control is the virtual structure approach [10]. The basic idea is to specify a virtual leader, or a virtual coordinate frame located at the virtual center of the formation, as a reference for the whole group, such that each module's desired states can be defined relative to the virtual leader or the virtual coordinate frame. Once the desired dynamics of the virtual structure are defined, the desired motion for each agent is derived. As a result, single-module path planning and trajectory generation techniques can be employed for the virtual leader or the virtual coordinate frame, while trajectory tracking strategies can be employed for each module.

Let N steering module agents together with a virtual leader form a multi-agent system called a platform. Fig. 5 shows an illustrative example of such a structure with a formation composed of four modules, where (x_b, y_b) represents the beacon and C represents a virtual coordinate frame located at a virtual center (x_c, y_c) with an orientation φ_c relative to the beacon and a rotation radius ρ_c.

Fig. 5. The steering modules platform.

The platform contains additional information, such as the area of the platform and the required module topology, including the desired positions of the modules relative to the centroid of the platform. The virtual leader is seen by the environment analogously to a single steering module agent: it has a state and can perform actions. It receives the same information from the environment as defined in Tab. 1 and uses the action set defined in Tab. 2. It should be noted that the modules are not directly controlled by the virtual leader. The modules remain independent entities and adapt their behavior to conform to the desired position in the platform.

In Fig. 6, (x_i, y_i) and (x_opt, y_opt) represent, respectively, the i-th module's actual and desired position, and d_err represents the deviation of the i-th module from its desired position, where

    d_err = d_t - d_opt                                     (3)

Fig. 6. State of the platform with respect to the i-th module.

Here d_t is the distance from the virtual center to the current module position, and d_opt is the required distance between the virtual center and the i-th module's position, derived from the platform topology. The virtual center position is derived as the centroid of the platform area. The virtual leader agent knows the goal optimal forward speed of the whole platform, bounded as ν_opt ∈ [ν_opt_min, ν_opt_max], where ν_opt_min and ν_opt_max are, respectively, the minimum and maximum values of the optimal speed.
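As a rough illustration of equation (3), the per-module distance error could be computed as in the sketch below. The helper names are ours, and the mean-of-positions centroid is an assumption; the paper only states that the virtual center is derived from the centroid of the platform area.

```python
import math

def platform_centroid(module_positions):
    """Virtual center (x_c, y_c), taken here as the mean of the module positions
    (an assumption; the paper derives it from the centroid of the platform area)."""
    n = len(module_positions)
    x_c = sum(x for x, _ in module_positions) / n
    y_c = sum(y for _, y in module_positions) / n
    return x_c, y_c

def distance_error(module_position, d_opt, center):
    """Eq. (3): d_err = d_t - d_opt for one module.
    Positive when the module is farther from the virtual center than desired."""
    x_i, y_i = module_position
    x_c, y_c = center
    d_t = math.hypot(x_i - x_c, y_i - y_c)
    return d_t - d_opt
```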
5. Cooperative circular motion problem

Let us examine the control problem of circular motion. The main objective is to build a control strategy so that all the modules within the platform achieve circular motion around the beacon, with a prescribed radius of rotation and prescribed distances between neighbors. The task is then to create a cooperative control rule for any configuration of N modules within the platform in circular motion around a fixed beacon, with the rotation radius ρ_c defined for the center of the platform. Fig. 7 depicts such a behavior. If each module can track its desired position accurately, then the desired formation shape can be preserved accurately. Further requirements are to take module positioning into account before movement and to adapt the angular and linear speed during circular movement. The control strategy should be scalable to various numbers of agents and their configurations.

The problem of finding a multi-agent control law for circular motion can be decomposed into the following steps: (a) module positioning and (b) forward speed adjustment to fit the desired radius and position. For both problems we use reinforcement learning, which permits us to achieve generalization ability. For example, a new beacon position can be dynamically assigned to the platform, and the same control law can be used for module positioning.

6. Module positioning

This section discusses a reinforcement learning method producing an efficient control law for module orientation around the beacon.

6.1 Reinforcement Learning Framework

Reinforcement learning (RL) is one of the techniques used to learn optimal control for autonomous agents in an unknown environment [12]. The reinforcement learning framework is shown in Fig. 7. The main idea is that the agent executes an action a_t in a particular state s_t, goes to the next state s_t+1 and receives a numerical reward r_t+1 as feedback on the recent action. The agent should explore the state space and, for every state, find the actions which are more rewarded than others over some finite horizon.

Fig. 7. Reinforcement learning framework.

Let Q(s, a) be a Q-function reflecting the quality of selecting a specified action a in state s. For a given Q-function, the optimal action a* in a specified state s is defined as follows:

    a* = argmax_{a ∈ A(s)} Q(s, a)                          (4)

The initial values of the Q-function are unknown and equal to zero. The learning goal is to approximate the optimal Q-function, i.e. to find the true Q-values for each action in every state using the sequences of rewards received during state transitions. For the moment of time t, the change of the Q-value can be calculated as follows:

    ΔQ(s_t, a_t) = α δ_t                                    (5)

where α ∈ (0, 1] is the learning rate and δ_t is the Temporal Difference (TD) error. The agent is trained in such a way that the TD error decreases. Using the Q-learning rule [12], the temporal difference error is calculated by:

    δ_t = r_t+1 + γ max_{a ∈ A(s_t+1)} Q(s_t+1, a) - Q(s_t, a_t)        (6)

where r_t+1 is the reward value obtained for the action a_t selected in s_t, γ is the discount rate, and A(s_t+1) is the set of actions available in s_t+1.

6.2 RL-model for Module Positioning

Using the Q-learning rule defined above, we now define the RL-model for module positioning more precisely, including the state, action and reward function descriptions. It can be formulated as learning to find a behavior policy that minimizes φ_err. Let the state of the agent be the pair of values s_t = [φ_err_t, ω_t]. The action set A_ω = {Ø, ω+, ω-} is represented by the angular speed values from Tab. 2, and an action a_t ∈ A_ω is a change of the angular speed ω_t at the given moment of time t.

The learning system is given a positive reward when the robot orientation gets closer to the goal orientation (φ_err_t → 0) using the optimal speed ω_opt, and a penalty when the orientation of the robot deviates from the correct one or the selected action is not optimal for the given position. The value of the reward is defined as:

    r_t = R(φ_err_t, ω_t)                                   (7)

where R is the reward function, represented by the decision tree depicted in Fig. 8.

Fig. 8. Reward function decision tree.

Here φ_stop is the value of the angle at which the robot reduces speed in order to stop at the correct orientation, and ω_opt ∈ [0.6, 0.8] rad/s is the optimal speed minimizing the module's power consumption. Angular speed within this range is given the highest reward, with exceptions in the cases of acceleration and deceleration.
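A compact sketch of the tabular Q-learning agent of equations (4)-(6), combined with a positioning reward in the spirit of equation (7) and Fig. 8, is given below. The learning rate and discount factor are the values from Tab. 3; the exact branching and reward magnitudes of the decision tree are not reproduced in the text, so the numbers in positioning_reward are placeholders.

```python
import random
from collections import defaultdict

class QAgent:
    """Tabular Q-learning over a discretized state, following eqs. (4)-(6)."""

    def __init__(self, actions, alpha=0.4, gamma=0.7, epsilon=0.1):
        self.q = defaultdict(float)               # Q(s, a), initialized to zero
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def best_action(self, state):                 # eq. (4): a* = argmax_a Q(s, a)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def act(self, state):                         # epsilon-greedy exploration
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return self.best_action(state)

    def update(self, s, a, r, s_next):
        # eqs. (5)-(6): Q(s, a) += alpha * TD error
        td = r + self.gamma * max(self.q[(s_next, b)] for b in self.actions) - self.q[(s, a)]
        self.q[(s, a)] += self.alpha * td

def positioning_reward(phi_err, omega, phi_stop=0.6, omega_opt=(0.6, 0.8)):
    """Placeholder reward in the spirit of eq. (7) and Fig. 8: highest reward when the
    orientation error is small and |omega| lies in the optimal band [0.6, 0.8] rad/s."""
    if abs(phi_err) < phi_stop and omega_opt[0] <= abs(omega) <= omega_opt[1]:
        return 1.0
    if abs(phi_err) < phi_stop:
        return 0.1      # orientation is fine, but the speed is outside the optimal band
    return -0.1         # penalty: orientation deviates from the goal
```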

7. Cooperative Moving

In this section, we consider a multi-agent reinforcement learning model for the cooperative moving problem. The problem is to control each module's individual speed in order to achieve stable circular motion of the whole platform. Modules with different distances to the beacon should have different speeds: for two modules i and j, with distances to the beacon ρ_i and ρ_j respectively, the speed v_j will be greater than v_i if the distance ρ_j is greater than ρ_i. Every module should have an additional policy to control its forward speed with respect to the speeds of the other modules.

7.1 Multi-Agent Reinforcement Learning Framework

The main principles of such a technique are described in [13], [14]. The basic idea of the selected approach is to use influences between a module and the platform's virtual leader to determine sequences of correct actions in order to coordinate the behavior among them. Good influences should be rewarded and negative ones should be punished. The core design question is how to determine such influences in terms of the received individual reward. The RL-framework used for this control problem is illustrated in Fig. 9.

Fig. 9. Multi-Agent RL framework.

The i-th module at the state s_t selects an action a_t using its current policy Q_i and goes to the next state s_t+1 by applying the action to the environment. The platform observes the changes produced by the executed action, then calculates and assigns a reward r_t+1 to the module as feedback reflecting the success of the specified action. The same Q-learning rule (5), (6) can be used to update the module's control policy. The main difference between the two rules is that in the second case the reward is assigned by the virtual leader instead of the environment:

    ΔQ_i(s_t, a_t) = α [ r_p,t+1 + γ max_{a ∈ A(s_t+1)} Q_i(s_t+1, a) - Q_i(s_t, a_t) ]        (8)

Instead of trying to build a global Q-function Q({s_1, s_2, ..., s_n}, {a_1, a_2, ..., a_n}) for n modules, we decompose the problem and build a set of local Q-functions Q_1(s, a), Q_2(s, a), ..., Q_N(s, a), where every policy contains the specific control rule for its module. The combination of such individual policies produces the cooperative control law.

7.2 RL-model for cooperative moving

Let the state of the i-th module be the pair s_t = {v_t, d_err}, where ν_t is the current value of the linear speed and d_err is the distance error calculated by (3). The action set A_ν = {Ø, ν+, ν-} is represented by the increase/decrease of the linear speed from Tab. 2, and an action a_t ∈ A_ν is a change of the forward speed ν_t at the given moment of time t. The virtual agent receives error information for each module and calculates the displacement error. This error can be positive (the module is ahead of the platform) or negative (the module is behind the platform). The learning process is directed toward the minimization of d_err for every module. The maximum reward is given in the case where d_err → 0, and a penalty is given when the position of the module deviates from the predefined one.
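The decomposition into local Q-functions and the leader-assigned reward of equation (8) can be sketched as follows. This is a minimal illustration: the shape of the leader's reward and the tolerance value are placeholders, not taken from the paper, while the learning rate and discount factor are the Tab. 3 values.

```python
import random
from collections import defaultdict

ACTIONS_V = ["v+", "v-", "0"]      # forward-speed actions from Tab. 2
N_MODULES = 4
ALPHA, GAMMA = 0.4, 0.7            # learning rate and discount factor from Tab. 3

# One local Q-function per module: Q_1(s, a), ..., Q_N(s, a) instead of a global one
q_tables = [defaultdict(float) for _ in range(N_MODULES)]

def leader_reward(d_err, tol=0.05):
    """Placeholder reward r_p assigned by the virtual leader: maximal when d_err -> 0,
    growing penalty as the module drifts from its desired position in the platform."""
    return 1.0 if abs(d_err) < tol else -abs(d_err)

def choose_action(i, s, epsilon=0.1):
    """Epsilon-greedy choice of a forward-speed action for module i."""
    if random.random() < epsilon:
        return random.choice(ACTIONS_V)
    return max(ACTIONS_V, key=lambda a: q_tables[i][(s, a)])

def update_module(i, s, a, s_next, d_err):
    """Eq. (8): Q_i(s, a) += alpha * [r_p + gamma * max_a' Q_i(s', a') - Q_i(s, a)]."""
    q = q_tables[i]
    r_p = leader_reward(d_err)
    td = r_p + GAMMA * max(q[(s_next, b)] for b in ACTIONS_V) - q[(s, a)]
    q[(s, a)] += ALPHA * td
```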

8. Simulation results

For simulation purposes the Player/Stage [15] modeling environment was used. The environment is represented by a physics-like 2D world with four modules, the virtual leader, the platform description and the beacon position. The preparatory part of the collective movement simulation is the learning of robot positioning. This step is done once for an individual module before any cooperative simulation sessions; the learned policy is stored and copied to the other modules. The topology of the Q-function trained during 720 epochs is shown in Fig. 10.

Fig. 10. Resulting topology of the Q-function.

When the correct orientation of the robot is determined, the angular speed is calculated to fit the specified driving radius using the linear speed produced by the cooperative control law:

    ω = v / ρ                                               (10)

Fig. 11 shows the platform's initial state (left) and the positioning auto-adjustment (right) using the learned policy.

Fig. 11. Initial and final agents' positions.

Fig. 12 shows the experimental result of cooperative movement after learning the positioning. It takes 1000 epochs on average. The external parameters of the simulation are summarized in Tab. 3. In the modeling, ω_opt is chosen from predefined bounds to show the applicability of the proposed approach. For a real robot, the bounds of the optimal speed are derived from the documentation.

Fig. 12. Agents' team driving process.

Tab. 3. External parameters
    #  Parameter                       Value
    1  α, learning rate                0.4
    2  γ, discount factor              0.7
    3  ω_opt, optimal speed            0.8 rad/s
    4  φ_stop, angle for slow down     0.6 radians

9. Conclusions and future works

The experimental results show that the described multi-agent reinforcement learning framework can solve the problem of efficient multi-wheel robot control. The proposed approach incorporates multiple Q-learning agents, which permits them to effectively control every wheel relative to the virtual leader. The reward functions are designed so as to produce efficient control. A virtual leader is used to coordinate the module speeds; meanwhile, this role could be assigned to any module which has access to global information at the platform level.

The advantages of this method are as follows:
- Decomposition: instead of trying to build a global Q-function, we build a set of local Q-functions.
- Adaptability: the platform will adapt its behavior to a dynamically assigned beacon and will automatically reconfigure the moving trajectory.
- Scalability and generalization: the same learning technique is used for every agent, for every beacon position and for every platform configuration.

Acknowledgement

I express sincere gratitude to my advisors Prof. Vladimir Golovko and Prof. Ralf Stetter, and to my colleague Anton Kabysh for ideas and discussions.

Bibliography

[1] J. C. Andreas. Energy-Efficient Electric Motors. Marcel Dekker, 2nd edition, 1992.
[2] P. Bertoldi, A. de Almeida, H. Falkner. Energy Efficiency Improvements in Electric Motors and Drives. Berlin: Springer, 1997.
[3] A. Barili, M. Ceresa, and C. Parisi. Energy-Saving Motion Control for an Autonomous Mobile Robot. International Symposium on Industrial Electronics, pp. 674-676, 1995.
[4] D. J. Balkcom and M. T. Mason. Extremal Trajectories for Bounded Velocity Differential Drive Robots. ICRA, pp. 2479-2484, 2000.
[5] Yongguo Mei, Yung-Hsiang Lu, Y. Charlie Hu, C. S. George Lee. Energy-Efficient Motion Planning for Mobile Robots. Robotics and Automation, 2004.
[6] R. Stetter, P. Ziemniak and A. Paczynski. Development, Realization and Control of a Mobile Robot. Research and Education in Robotics - EUROBOT 2010, Communications in Computer and Information Science, Springer, vol. 156, pp. 130-140, 2011.
[7] N. Ceccarelli, M. Di Marco, A. Garulli, and A. Giannitrapani. Collective circular motion of multi-vehicle systems with sensory limitations. Proceedings of the 44th Conference on Decision and Control, Seville, Spain, pp. 740-745, December 2005.
[8] N. Ceccarelli, M. Di Marco, A. Garulli, and A. Giannitrapani. Collective circular motion of multi-vehicle systems. Automatica, 44(12):3025-3035, 2008.
[9] D. Benedettelli, A. Garulli, and A. Giannitrapani. Experimental validation of collective circular motion for nonholonomic multi-vehicle systems. Robotics and Autonomous Systems, 58:1028-1036, 2010.
[10] W. Ren, N. Sorensen. Distributed coordination architecture for multi-robot formation control. Robotics and Autonomous Systems, 56(4):324-333, 2008.
[11] R. Siegwart, Illah R. Nourbakhsh. Introduction to Autonomous Mobile Robots. MIT Press, 2004.
[12] Richard S. Sutton, Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[13] A. Kabysh, V. Golovko. General model for organizing interactions in multi-agent systems. International Journal of Computing, 11(3):224-233, Sept. 2012.
[14] A. Kabysh, V. Golovko. Influence Learning for Multi-Agent Systems Based on Reinforcement Learning. International Journal of Computing, 11(1):39-44, 2012.
[15] Richard Vaughan. Massively Multiple Robot Simulations in Stage. Swarm Intelligence, 2(2-4):189-208, 2008. Springer.

Authors:
Mr. Uladzimir Dziomin
Brest State Technical University
267 Moskovskaja str.
224017 Brest
The Republic of Belarus
tel. (+375) 297 9 84 22
email: spas.work@gmail.com