Self-learning navigation algorithm for vision-based mobile robots using machine learning algorithms


Journal of Mechanical Science and Technology 25 (1) (2011)

Self-learning navigation algorithm for vision-based mobile robots using machine learning algorithms

Jeong-Min Choi 1, Sang-Jin Lee 2 and Mooncheol Won 2,*

1 Hyundai Wia Corp. Machine Tool Research Center, #462-18, Sam-dong, Uiwang, Gyeonggi-do, Korea
2 Department of Mechatronics Engineering, Chungnam National University, Daejeon, Korea

(Manuscript Received May 22, 2010; Revised September 28, 2010; Accepted November 11, 2010)

Abstract

Many mobile robot navigation methods use laser scanners, ultrasonic sensors, and vision cameras, among others, for detecting obstacles and following paths. Humans, however, use only visual (e.g. eye) information for navigation. In this paper, we propose a mobile robot control method based on machine learning algorithms that uses only camera vision. To define the state of the robot efficiently from raw images, our algorithm uses image-processing and feature-selection steps to choose the feature subset for a neural network, which is trained through supervised learning. The output of the neural network is then used as the state of a reinforcement learning algorithm that learns obstacle-avoiding and path-following strategies from camera images. The algorithm is verified by two experiments: line tracking and obstacle avoidance.

Keywords: Pattern recognition; Feature selection; Reinforcement learning; Mobile robot; Robot vision; Obstacle avoidance

1. Introduction

This paper was recommended for publication in revised form by Associate Editor Yang Shi.
* Corresponding author. E-mail address: mcwon@cnu.ac.kr
© KSME & Springer 2011

In contrast to humans, who use only visual information for navigation, many mobile robots use laser scanners and ultrasonic sensors along with vision cameras to navigate. The goal of our research is to develop a navigation algorithm for mobile robots using only visual information.
Also, by using a reinforcement learning algorithm [1], we expect the mobile robot to learn the right actions for navigation by itself. Research on task learning using visual information has been introduced in the following papers: Gaskett et al. use reinforcement learning (Advantage Learning) to teach mobile robot navigation with a match matrix consisting of a stored carpet image and a camera image [2]. Asada et al. use reinforcement learning with the position and size of a goal and a ball detected from the camera image to teach a robot how to shoot at a goal [3]. Nehmzow uses supervised learning with a neural network and the distribution of edge pixels in the extracted edge image for navigation [4]. Regueiro et al. proposed a learning method using reinforcement learning with the state defined by the existence or nonexistence of edges in a grid image [5]. Shibata et al. use the original image to teach a box-pushing task using reinforcement learning (actor-critic architecture) [6]. However, these methods need complex image processing to detect specific patterns [2, 3] or use camera vision along with other sensors such as IR and ultrasonic sensors for learning [5, 6]. In contrast to the above-mentioned studies, we propose a navigation method using only camera-vision information processed by a simple image-processing algorithm and a machine learning algorithm. We verify our algorithm by two experiments: line tracking and obstacle avoidance.

2. Learning process and experimental settings

2.1 Learning process

The suggested learning process consists of two parts, as shown in Fig. 1. The first part (Fig. 1(a)) is the off-line learning of a neural network [7] as a road environment recognizer. The input features of the neural network are obtained from camera images. The raw training dataset for the neural network is obtained by manual operation of the robot's angular velocity under a constant forward velocity of 0.3 m/s.
The training data consist of camera images (inputs of the neural network) and momentary angular velocities (outputs of the neural network). Also, we adopted the feature selection technique to find the best input features from camera images to optimize the performance of

the environment recognizer.

Fig. 1. The suggested learning process: (a) the neural network learning process (off-line); (b) the navigation learning process (on-line).

The second part (Fig. 1(b)) is navigation learning using reinforcement learning in on-line experiments with the mobile robot. We use the Q-learning algorithm [8], one of the popular reinforcement learning methods. The mobile robot learns the proper angular velocity and forward velocity for the navigation environment by using the Q-learning algorithm. The state of Q-learning is the output of the environment recognizer for the current image together with the current forward velocity of the mobile robot. The action of Q-learning is the forward velocity and angular velocity of the mobile robot. The reward of Q-learning is defined by the forward velocity, the angular velocity, and the change of edge pixels.

2.2 The mobile robot and experiment settings

We used the mobile robot depicted in Fig. 2 for our experiments. The mobile robot is 0.5 m wide, 0.4 m long, and 0.77 m high. It has a monocular vision camera that can change its pitch angle. The maximum velocity of the mobile robot is about 1.5 m/s.

Fig. 2. The mobile robot for experiments.

Our first experiment consisted of tracking the line shown in Fig. 3. The goal of this experiment was to track the desired path, which is composed of a straight and a curved line.

Fig. 3. The desired path for the line tracking experiment.

The second experiment is obstacle avoidance. One of the test environments is depicted in Fig. 4. We set box obstacles in a corridor with black strips at the border between the walls and the floor. The goal of this experiment is to avoid the obstacles and to follow the corridor.

Fig. 4. Experiment environment for obstacle avoidance.

3. Supervised learning of the neural network for environment recognition

The navigation method proposed in this paper uses only

camera-generated visual information in order to recognize the navigation environment. Therefore, this paper uses the pattern recognition capability of a neural network to detect lanes or obstacles more effectively. Finding the optimal input data for the neural network is also necessary to improve the performance of pattern recognition; therefore, we used a feature selection algorithm [9] to choose the best input feature set for the neural network.

Table 1. Feature candidates for the neural network.

No.  Feature (edge extraction method, distribution direction)
 1   Sobel filter, horizontal direction
 2   Sobel filter, vertical direction
 3   X differentiation filter, horizontal direction
 4   X differentiation filter, vertical direction
 5   Y differentiation filter, horizontal direction
 6   Y differentiation filter, vertical direction
 7   -45° differentiation filter, horizontal direction
 8   -45° differentiation filter, vertical direction
 9   +45° differentiation filter, horizontal direction
10   +45° differentiation filter, vertical direction

Fig. 5. The image processing procedure.
Fig. 6. The image processing procedure.
Fig. 7. The generation of the distribution data: (a) the horizontal (x direction) distribution data; (b) the vertical (y direction) distribution data.

3.1 Training data acquisition for the neural network

The raw training data for the neural network were obtained in the following way: we acquired camera images and momentary angular velocities ranging between -20°/s and 20°/s during manual operation with a computer keyboard and RF communication every 0.1 s. The forward velocity of the mobile robot was fixed at 0.3 m/s. Edges are detected from the acquired images, and the edge pixel distribution is defined as the input; the momentary angular velocity is defined as the output to compose the training data set of the neural network.
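As a sketch of how such an edge-pixel distribution input can be computed (the exact kernels, the edge threshold, and the bin layout are not specified in the paper and are assumptions here), using the directional filters described in Section 3.2:

```python
import numpy as np

# Directional edge kernels standing in for the paper's five filters
# (Sobel, x, y, -45 deg, +45 deg differentiation). The exact kernels
# are an assumption; the paper does not list them.
KERNELS = {
    "sobel_x":  np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]),
    "diff_x":   np.array([[-1, 0, 1]]),
    "diff_y":   np.array([[-1], [0], [1]]),
    "diff_m45": np.array([[0, 0, 1], [0, 0, 0], [-1, 0, 0]]),
    "diff_p45": np.array([[1, 0, 0], [0, 0, 0], [0, 0, -1]]),
}

def convolve2d(img, k):
    """Naive 'valid' 2-D correlation (kept dependency-free)."""
    kh, kw = k.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

def edge_distribution(gray, kernel, threshold=50, bins=10):
    """Return the 10-bin horizontal and vertical edge-pixel histograms."""
    edges = np.abs(convolve2d(gray.astype(float), kernel)) > threshold
    h, w = edges.shape
    horiz = np.array([edges[:, j * w // bins:(j + 1) * w // bins].sum()
                      for j in range(bins)])
    vert = np.array([edges[i * h // bins:(i + 1) * h // bins, :].sum()
                     for i in range(bins)])
    return horiz, vert
```

A 40x30 image with a vertical line, for example, yields a horizontal histogram concentrated in the bins covering the line, which is exactly the kind of location cue the recognizer consumes.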
Since the momentary angular velocity was recorded at the fixed forward velocity of 0.3 m/s, it is not an absolute value applicable to every velocity; in this paper, it is used as an index describing the situation in which the robot is placed. Through this process we made 110 data sets for the line-tracking experiment and 518 data sets for the obstacle-avoidance experiment.

3.2 Image processing to generate input features

After data acquisition, we processed the original images to obtain more meaningful and useful input features for the neural network. Fig. 5 shows the image-processing procedure that we adopted. The first step reduces the image size from 160x140 to 40x30. The second step converts the color image into a gray-level one. The third step extracts edge images using five kinds of filters (Sobel filter and x, y, -45°, and +45° differentiation filters); Fig. 6 shows the result of this step for an image from the obstacle avoidance experiment. The last step counts edge pixels in separate areas, 10 each for the horizontal and vertical (x and y) directions, to obtain the distribution data. This procedure is shown in Fig. 7.

After the image processing procedure, we finally obtain 10 distribution data sets from each image, as listed in Table 1. These data become the input feature candidates for the neural network to recognize the road environment. The reason the horizontal and vertical distributions of edge pixels are generated as the final result of the image processing is that information such as the distance to lanes or obstacles and obstacle volumes is indirectly contained in them. Although this paper does not directly calculate obstacle volumes or distances, the distribution data carrying those values are used as inputs of the neural network that recognizes the navigation environment, so the mobile robot can navigate using only one camera image.

Table 2. Feature combinations of each subset number.

Subset No.  Combined features (refer to Table 1)
4_210       7, 8, 9, 10
4_115       3, 4, 9, 10
3_22        1, 5, 6
2_1         1, 2
1_2         2

Fig. 8. The structure of the neural network for road state recognition.
Fig. 9. Feature selection results of the obstacle avoidance experiment training data.

3.3 Feature selection for the optimal feature subset

Because the 10 input feature candidates do not have the same importance for the relation between the input and output of the neural network, we needed a procedure to choose the optimal input feature subset. This procedure is called feature selection. We carried out feature selection to optimize the performance of the neural network as the navigation environment recognizer and to reduce its training time. Three popular methods for feature selection are the wrapper method [10], the embedded method [11], and the filter method [12]. The feature subset search strategy and the way to measure the performance of a feature subset are important factors in feature selection. In this paper, we use the wrapper method with forward search as the feature subset search strategy and the validation error from cross validation as the performance measure. The wrapper method trains the neural network for every feature subset, so it has the weak point of a high computation cost. Therefore, we carried out feature selection with 385 feature subsets (combinations of 1 to 4 features) instead of 1024 feature subsets (combinations of 1 to 10 features).
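The wrapper search over all feature subsets of size 1 to 4 can be sketched as follows; the `cv_error` callback stands in for training the recognizer network on a subset and returning its cross-validation error:

```python
from itertools import combinations

def wrapper_select(features, max_size, cv_error):
    """Exhaustive wrapper search over all subsets of size 1..max_size.

    `cv_error(subset)` is expected to train the model (here, the neural
    network) on that feature subset and return its validation error.
    Returns the subset with the minimum validation error.
    """
    best_subset, best_err = None, float("inf")
    for k in range(1, max_size + 1):
        for subset in combinations(features, k):
            err = cv_error(subset)
            if err < best_err:
                best_subset, best_err = subset, err
    return best_subset, best_err
```

With 10 candidate features and `max_size=4`, this evaluates exactly C(10,1) + C(10,2) + C(10,3) + C(10,4) = 385 subsets, matching the count in the text.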
After training the neural network for each of the 385 feature subsets with cross validation, we choose the feature subset with the minimum validation error as the optimal feature subset.

3.4 Results of the feature selection and the neural network training

The neural network for road state recognition used in the feature selection was designed as a multi-layer feed-forward neural network with two hidden layers. Fig. 8 shows the structure of the neural network. The number of inputs is 10 to 40, in accordance with the number of combined features. The output layer has one neuron, corresponding to the momentary angular velocity for the input image. The two hidden layers have 10 and 5 neurons, respectively. The scaled conjugate gradient method [13] was used to train the weights of the neural network. Fig. 9 shows a part of the feature selection results obtained from the obstacle avoidance experiment training data. It is difficult to show all results for the 385 feature subsets, so we chose 5 results and show their training and validation errors. The 4_210 feature combination is the optimal feature subset, because it has the smallest validation error. So we can expect that the 4_210 feature combination will give the best performance when used as the neural network input. Table 2 shows the combined features of each feature subset number in Fig. 9. By comparing the 4_210 and 4_115 feature subsets, we can notice the difference in importance between features 7, 8 and features 3, 4: although both subsets contain features 9 and 10, there is a difference in their validation errors. Also, the 2_1 feature subset, combining features 1 and 2, contains edge information for all directions because it is generated by the Sobel filter; however, it shows poor performance compared to the 4_210 feature combination.
The reason the 4_210 feature combination is optimal for the obstacle avoidance experiment is that most images contain the diagonal black areas representing the borders between the walls and the floor. As a result of the feature selection, the optimal feature subset is the combination of features 2, 5, 7 and 8 for the line tracking experiment and features 7, 8, 9 and 10 for the obstacle avoidance experiment. Therefore, we used these optimal feature subsets as input features to train the weights of each neural network for the two experiments. We carried out the on-line experiments with these trained neural networks as the navigation environment recognizer.
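The recognizer of Fig. 8 (10 to 40 inputs, hidden layers of 10 and 5 neurons, one linear output) can be sketched as a forward pass. The tanh activations and the weight initialization are assumptions; training the weights would use the scaled conjugate gradient method [13] or any other gradient-based optimizer:

```python
import numpy as np

def init_mlp(n_in, seed=0):
    """Random weights for a 2-hidden-layer network: n_in -> 10 -> 5 -> 1."""
    rng = np.random.default_rng(seed)
    sizes = [n_in, 10, 5, 1]
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Map one edge-distribution feature vector to the angular-velocity index."""
    h = np.asarray(x, dtype=float)
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:   # nonlinearity on hidden layers only
            h = np.tanh(h)
    return float(h[0])            # single output neuron
```

For the 4_210 subset, for instance, the input is the 4 x 10 = 40 concatenated distribution values of features 7, 8, 9 and 10.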

4. Navigation learning algorithm

We used reinforcement learning to build the navigation learning algorithm. Reinforcement learning is a learning method in which the action for each state is learned by maximizing the reward from the environment through trial and error. Unlike supervised learning, which needs pairs of input-output training data, reinforcement learning needs only a reward to evaluate the adequacy of actions. In this paper, we use the Q-learning algorithm, one of the reinforcement learning algorithms.

4.1 Q-learning algorithm

The Q-learning algorithm is a model-free learning algorithm. It learns the Q-value, which is the expected return of a state-action pair, by iterating over a Q-value table with the delayed reward of the action taken in each state. The update rule of the Q-value is given as

U_{t+1} = r_{t+1} + γ max_a Q(s_{t+1}, a)   (1)

Q(s_t, a_t) ← Q(s_t, a_t) + α [U_{t+1} − Q(s_t, a_t)]   (2)

Eq. (1) reflects the delayed reward and the Q-value of the next state, and Eq. (2) is the update equation of the Q-value. The action in a given state is chosen among the possible actions as the one maximizing the Q-value:

a ← arg max_{a'} Q(s, a')   (3)

Also, to learn optimal actions it is necessary to select random actions by exploration instead of always selecting learned actions based on Eq. (3). So, we use the ε-greedy strategy [1], which selects a random action with probability ε.

4.2 States, actions and rewards design

The design of states, actions and rewards in reinforcement learning is the major factor affecting learning results. Therefore, it is important to design them to suit the goal of our experiments.

4.2.1 States, actions and rewards for the line tracking experiment

We designed states using the output of the neural network trained for the line tracking experiment. States were discretized into 7 discrete output ranges of the neural network. Because the neural network output ranges from -1.0 to 1.0, we discretize states as shown in Table 3. State 0 means that the mobile robot should turn left rapidly, state 6 means that it should turn right rapidly, and state 3 means that it will continue along a straight line.

Table 3. State definition for the line tracking experiment (state vs. output range).

We considered only the angular velocity for the design of actions, and fixed the forward velocity at 0.3 m/s. Table 4 shows the definition of actions: five angular velocities ranging between -15°/s and 15°/s.

Table 4. Action definition for the line tracking experiment (action vs. angular velocity in °/s).

We introduce a reward area to design rewards and use the number of edge pixels within it. Fig. 10 shows the reward area for the line tracking experiment. If the number of edge pixels in it is large, the mobile robot is driving well along the line; if it contains no edge pixels, the mobile robot is off the line. So, we design rewards such that the more edge pixels are in the reward area, the bigger the reward. Eq. (4) is the designed reward for the line tracking experiment:

r_t = 1.0,   if N_t ≥ 40 and N_{t-1} ≥ 40
    = -10.0, if N_t ≤ 10 and N_{t-1} ≥ 30
    = -1.0,  otherwise   (4)

Fig. 10. The reward area for the line tracking experiment.

The mobile robot receives a positive reward when more than 40 edge pixels are in the reward area for two control loop times (500 ms). On the other hand, it receives a large negative reward when it runs off the line at the current loop time.

Table 5. Visual state definition for the obstacle avoidance experiment (visual state vs. output range).

Table 6. Velocity state definition for the obstacle avoidance experiment (velocity state vs. forward velocity range in m/s).
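The tabular Q-learning update of Eqs. (1)-(3) with ε-greedy exploration, together with the reward-area reward of Eq. (4), can be sketched as follows. The learning rate and discount factor are assumptions (the paper does not report them), and the thresholds in `line_reward` are reconstructed from the surrounding text where the scanned equation is ambiguous:

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning with epsilon-greedy exploration (Eqs. (1)-(3))."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, eps=0.1):
        self.q = defaultdict(float)   # Q(s, a), zero-initialised
        self.actions = actions
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def choose(self, s):
        """Epsilon-greedy action selection around Eq. (3)."""
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def update(self, s, a, r, s_next):
        # Eq. (1): U_{t+1} = r_{t+1} + gamma * max_a' Q(s_{t+1}, a')
        u = r + self.gamma * max(self.q[(s_next, a2)] for a2 in self.actions)
        # Eq. (2): Q(s,a) <- Q(s,a) + alpha * (U_{t+1} - Q(s,a))
        self.q[(s, a)] += self.alpha * (u - self.q[(s, a)])

def line_reward(n_t, n_prev):
    """Reward-area reward for line tracking, a sketch of Eq. (4);
    exact thresholds and signs are reconstructed, not verbatim."""
    if n_t >= 40 and n_prev >= 40:   # on the line for two loop times
        return 1.0
    if n_t <= 10 and n_prev >= 30:   # just ran off the line
        return -10.0
    return -1.0
```

For line tracking the state `s` would be the discretized recognizer output (Table 3) and the actions the five angular velocities of Table 4.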

Table 7. Action definition for the obstacle avoidance experiment (velocity change x angular velocity).

Velocity change      Angular velocity: -20°/s, 0°/s, 20°/s
Decrease (-0.2 m/s)
No change (0 m/s)
Increase (0.2 m/s)

Fig. 11. The reward area for the obstacle avoidance experiment.
Fig. 12. Learning time and the average reward for each episode of the line tracking experiment.

4.2.2 States, actions and rewards for the obstacle avoidance experiment

States for the obstacle avoidance experiment are designed from the output of the neural network and the forward velocity of the mobile robot. By including the forward velocity as part of the state, the mobile robot can learn the right actions, unlike in the line tracking experiment. Tables 5 and 6 show the state definitions for the visual states and the forward velocity states, respectively. As in the line tracking experiment, visual states 0 and 4 mean that the mobile robot should turn left and right, respectively. Actions of the mobile robot are designed as combinations of a forward velocity change and an angular velocity. The forward velocity change is keeping the current velocity, a 0.2 m/s increase, or a 0.2 m/s decrease; the angular velocities are -20°/s, 0°/s, and 20°/s. Therefore, the number of actions is 9 (see Table 7). With this action design, the mobile robot can learn proper forward velocities and angular velocities for each state. We adopted a reward area as in the line tracking experiment; Fig. 11 shows the reward area for the obstacle avoidance experiment. In this experiment, the edge pixels in the reward area represent obstacles or walls to avoid. Therefore, we designed the rewards so that the fewer edge pixels are in the reward area, the bigger the reward.
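The 3 x 3 action set of Table 7, and a sketch of the reward combining the edge-pixel, speed, and steering terms described here, might look like the following. The exact form of Eq. (5) is only partly legible in the scanned text, so the edge-change term and the weights below are assumptions:

```python
from itertools import product

# Table 7: velocity change (m/s) x angular velocity (deg/s) -> 9 actions
VEL_CHANGES = (-0.2, 0.0, 0.2)
ANG_VELS = (-20.0, 0.0, 20.0)
ACTIONS = list(product(VEL_CHANGES, ANG_VELS))

def avoid_reward(n_t, n_prev, v_t, w_t, c_w=0.02):
    """Sketch of the obstacle-avoidance reward: penalise growth of edge
    pixels in the reward area, reward speed above 0.5 m/s, penalise
    steering. Term weights are assumed, not taken from the paper."""
    return -(n_t - n_prev) + (v_t - 0.5) - c_w * abs(w_t)
```

Driving fast and straight while the reward area empties of edge pixels maximizes this reward, which matches the behaviour the reward design aims for.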
Also, we considered the forward velocity and the angular velocity in the design of the reward, to encourage the mobile robot to go as fast and as straight as possible:

r_t = -(N_t − N_{t-1}) + (v_t − 0.5) − 0.02 |w_t|   (5)

Eq. (5) is the designed reward for the obstacle avoidance experiment. It consists of terms linked to the number of edge pixels in the reward area, the forward velocity, and the angular velocity of the mobile robot. So, the mobile robot obtains big rewards when it runs in a straight line at a fast speed and avoids obstacles in the reward area. We tuned the contribution of each term by adjusting the constant in front of the corresponding term.

5. Experiment results

Fig. 13. Images during the line tracking after learning: (a) an image at the start point; (b) an image at the beginning of a curvy path; (c) an image at the middle of a curvy path; (d) an image after the turning.

Our algorithm is verified by two on-line experiments. All experiments were run in good lighting conditions, because the performance of the vision camera, our only sensor, is seriously affected by lighting conditions. The pitch angle of the camera is set at 45° for the line tracking experiment and at 40° for the obstacle avoidance experiment in order to capture images at a greater distance. The loop time of the learning algorithm is 500 ms.

5.1 Results of the line tracking experiment

An episode of the line tracking experiment ends when the mobile robot reaches the end point or runs markedly off the desired path. Fig. 12 shows the learning time and the average reward for each episode; the average reward is the reward per second. At the beginning of the learning process the mobile robot obtains small average rewards and achieves little learning time, because it frequently runs off the line. However, as the number of episodes increases, the learning time and the average rewards increase. After the 15th episode, both average rewards and learning time are roughly at saturation levels. From this we can infer that the mobile robot has learned proper actions for the learned states and drives well along the desired path. Fig. 13 shows images taken during line tracking after learning is completed. These images show that the mobile robot has learned to keep the black line in the reward area while driving. This result comes from the reward design in which bigger rewards are obtained when more edge pixels are in the reward area. Although we conducted the line tracking test in an environment combining only straight and curved lanes, as seen in Fig. 3, the suggested algorithm can be applied to other situations. The reason is that the algorithm uses the vertical and horizontal distributions of edge pixels; even a corner line with a right-angled shape, rather than the curved line used in this test, would show distribution data similar to those of the curved line.

5.2 Results of the obstacle avoidance experiment

Fig. 14 shows the learning time and the average reward for each episode of the obstacle avoidance experiment. An episode ends when the robot reaches the end point or approaches an obstacle or a wall too closely.

Fig. 14. Learning time and the average reward for each episode of the obstacle avoidance experiment.
Fig. 15. Navigation paths of the mobile robot during and after learning: (a) the path of the mobile robot at the beginning of the learning process; (b) the path of the mobile robot after learning.
As in the line tracking experiment, the learning time and average rewards are small in the early episodes because the mobile robot had not yet learned the right actions. Also, the learning time is long but the average reward is not big at the 12th episode, because the mobile robot runs around obstacles at a slow speed. However, after the 17th episode (after the mobile robot has learned enough), the mobile robot gets nearly equal rewards and learning times, except for a few episodes (20, 24, 25, 31) during which it executed random actions under the ε-greedy strategy with a 10% exploration probability (ε = 0.1). After the 35th episode, when the exploration rate is set to 0, the mobile robot gets equal rewards over nearly the same learning time. So, we can conclude that the Q-table is learned correctly for this navigation environment. Navigation paths of the mobile robot during learning are shown in Fig. 15; the black vertical lines represent the boundaries between the walls and the floor. At the beginning of learning (Fig. 15(a)), the navigation ends because the mobile robot approaches the obstacle too closely. After enough learning (Fig. 15(b)), the mobile robot runs well while avoiding obstacles and walls.

Fig. 16. Navigation paths of the mobile robot in new environments: (a) the path of the mobile robot in new environment #1; (b) the path of the mobile robot in new environment #2.

Fig. 16 shows the navigation paths using the learned Q-table in new environments with two obstacles. Fig. 16(a) shows that the mobile robot turns right to avoid the first obstacle and then turns left, unlike the movement in Fig. 15(b), because the second obstacle is detected. Fig. 16(b) shows the navigation path in a more difficult environment than that of Fig. 16(a): the first obstacle is located closer, and the second obstacle is bigger.
In this case, the mobile robot follows a more curved path than in Fig. 16(a) in order to avoid the second obstacle. Through these experiments, we see that the robot can navigate in environments different from the learning environment. The suggested algorithm can also be applied to circular or other shapes of obstacles in place of the square boxes used in this paper: since it uses the distribution of edge pixels, obstacles placed in similar locations yield edge pixels in similar locations of the image, so similar distribution data will be acquired regardless of the obstacle shapes. Therefore, the robot can perform obstacle avoidance with the algorithm suggested here

regardless of the shape of the obstacle.

6. Conclusions

We proposed a navigation learning algorithm for mobile robots that uses only visual information, mimicking human behavior. We used the pattern recognition capability of a neural network to enable the mobile robot to recognize the environment from the camera image, and we adopted a feature selection procedure to optimize the performance of the neural network; using this procedure, we found the best feature subset for the neural network input. The output of the neural network and the forward velocity of the mobile robot were used as the state of Q-learning, our navigation learning algorithm. We introduced the reward area to define the rewards for Q-learning; the actions are the forward velocity and the angular velocity. We verified the proposed algorithm by the line tracking and obstacle avoidance experiments and confirmed that the mobile robot navigates well in both after sufficient learning. Our future work will focus on developing an algorithm that can navigate more complex indoor environments such as offices.

Nomenclature

Q(s, a) : Q-value
s : State
a : Action
r : Reward
γ : Discount factor
α : Learning rate
ε : Exploration rate
N_t : The number of edge pixels in the reward area at time t
v_t : The forward velocity of the mobile robot at time t
w_t : The angular velocity of the mobile robot at time t

References

[1] Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, The MIT Press (1998).
[2] Chris Gaskett, Luke Fletcher and Alexander Zelinsky, Reinforcement learning for a vision based mobile robot, IEEE (2000).
[3] Minoru Asada, Shoichi Koda, Sukoya Tawaratsumida and Koh Hosoda, Vision-based reinforcement learning for purposive behavior acquisition, IEEE International Conference on Robotics and Automation (1995).
[4] Ulrich Nehmzow, Vision processing for robot learning, Industrial Robot, 26 (2) (1999).
[5] Carlos V. Regueiro, José E. Domenech, Roberto Iglesias and José L. Correa, Acquiring contour following behaviour in robotics through Q-learning and image-based states, PWASET, 15 (2006).
[6] Katsunari Shibata and Masaru Iida, Acquisition of box pushing by direct-vision-based reinforcement learning, SICE Annual Conference (2003).
[7] Christopher M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press (1997).
[8] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. thesis, Cambridge University (1989).
[9] Isabelle Guyon and André Elisseeff, An introduction to variable and feature selection, Journal of Machine Learning Research, 3 (2003).
[10] R. Kohavi and G. John, Wrappers for feature selection, Artificial Intelligence, 97 (1-2) (1997).
[11] L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth and Brooks (1984).
[12] H. Stoppiglia, G. Dreyfus, R. Dubois and Y. Oussar, Ranking a random feature for variable and feature selection, Journal of Machine Learning Research, 3 (2003).
[13] Martin F. Møller, A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, 6 (1990).

Mooncheol Won received a B.Sc. and an M.Sc. from Seoul National University, Korea, in the Department of Naval Architecture and Ocean Engineering, and a Ph.D. in mechanical engineering from the University of California at Berkeley, USA. He is currently a professor in the Department of Mechatronics Engineering at Chungnam National University, Korea. His research interests include the control of maritime and mechatronics systems and machine learning applications of robotic systems.

Jeong-Min Choi received a B.Sc. degree from Chungnam National University, Korea, in the Department of Mechatronics Engineering. He is currently a researcher at the Machine Tool Research Center of Hyundai Wia Corp., Korea.
His research interests include machine learning applications of robotic systems, especially reinforcement learning and neural networks.

Sang-Jin Lee received a B.Sc. degree in the Department of Mechatronics Engineering from Chungnam National University, Korea, where he is now in the master's course in the same department. His research interests include machine learning applications of robotic systems.


More information

CS4758: Rovio Augmented Vision Mapping Project

CS4758: Rovio Augmented Vision Mapping Project CS4758: Rovio Augmented Vision Mapping Project Sam Fladung, James Mwaura Abstract The goal of this project is to use the Rovio to create a 2D map of its environment using a camera and a fixed laser pointer

More information

3D Grid Size Optimization of Automatic Space Analysis for Plant Facility Using Point Cloud Data

3D Grid Size Optimization of Automatic Space Analysis for Plant Facility Using Point Cloud Data 33 rd International Symposium on Automation and Robotics in Construction (ISARC 2016) 3D Grid Size Optimization of Automatic Space Analysis for Plant Facility Using Point Cloud Data Gyu seong Choi a, S.W.

More information

Gauss-Sigmoid Neural Network

Gauss-Sigmoid Neural Network Gauss-Sigmoid Neural Network Katsunari SHIBATA and Koji ITO Tokyo Institute of Technology, Yokohama, JAPAN shibata@ito.dis.titech.ac.jp Abstract- Recently RBF(Radial Basis Function)-based networks have

More information

Localization algorithm using a virtual label for a mobile robot in indoor and outdoor environments

Localization algorithm using a virtual label for a mobile robot in indoor and outdoor environments Artif Life Robotics (2011) 16:361 365 ISAROB 2011 DOI 10.1007/s10015-011-0951-7 ORIGINAL ARTICLE Ki Ho Yu Min Cheol Lee Jung Hun Heo Youn Geun Moon Localization algorithm using a virtual label for a mobile

More information

3D Corner Detection from Room Environment Using the Handy Video Camera

3D Corner Detection from Room Environment Using the Handy Video Camera 3D Corner Detection from Room Environment Using the Handy Video Camera Ryo HIROSE, Hideo SAITO and Masaaki MOCHIMARU : Graduated School of Science and Technology, Keio University, Japan {ryo, saito}@ozawa.ics.keio.ac.jp

More information

Neural Networks for Obstacle Avoidance

Neural Networks for Obstacle Avoidance Neural Networks for Obstacle Avoidance Joseph Djugash Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 josephad@andrew.cmu.edu Bradley Hamner Robotics Institute Carnegie Mellon University

More information

Calibration of Inertial Measurement Units Using Pendulum Motion

Calibration of Inertial Measurement Units Using Pendulum Motion Technical Paper Int l J. of Aeronautical & Space Sci. 11(3), 234 239 (2010) DOI:10.5139/IJASS.2010.11.3.234 Calibration of Inertial Measurement Units Using Pendulum Motion Keeyoung Choi* and Se-ah Jang**

More information

Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving

Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving Combining Deep Reinforcement Learning and Safety Based Control for Autonomous Driving Xi Xiong Jianqiang Wang Fang Zhang Keqiang Li State Key Laboratory of Automotive Safety and Energy, Tsinghua University

More information

CS4758: Moving Person Avoider

CS4758: Moving Person Avoider CS4758: Moving Person Avoider Yi Heng Lee, Sze Kiat Sim Abstract We attempt to have a quadrotor autonomously avoid people while moving through an indoor environment. Our algorithm for detecting people

More information

Stacked Denoising Autoencoders for Face Pose Normalization

Stacked Denoising Autoencoders for Face Pose Normalization Stacked Denoising Autoencoders for Face Pose Normalization Yoonseop Kang 1, Kang-Tae Lee 2,JihyunEun 2, Sung Eun Park 2 and Seungjin Choi 1 1 Department of Computer Science and Engineering Pohang University

More information

Efficient L-Shape Fitting for Vehicle Detection Using Laser Scanners

Efficient L-Shape Fitting for Vehicle Detection Using Laser Scanners Efficient L-Shape Fitting for Vehicle Detection Using Laser Scanners Xiao Zhang, Wenda Xu, Chiyu Dong, John M. Dolan, Electrical and Computer Engineering, Carnegie Mellon University Robotics Institute,

More information

Flexible Calibration of a Portable Structured Light System through Surface Plane

Flexible Calibration of a Portable Structured Light System through Surface Plane Vol. 34, No. 11 ACTA AUTOMATICA SINICA November, 2008 Flexible Calibration of a Portable Structured Light System through Surface Plane GAO Wei 1 WANG Liang 1 HU Zhan-Yi 1 Abstract For a portable structured

More information

Self-Organization of Place Cells and Reward-Based Navigation for a Mobile Robot

Self-Organization of Place Cells and Reward-Based Navigation for a Mobile Robot Self-Organization of Place Cells and Reward-Based Navigation for a Mobile Robot Takashi TAKAHASHI Toshio TANAKA Kenji NISHIDA Takio KURITA Postdoctoral Research Fellow of the Japan Society for the Promotion

More information

View-based Programming with Reinforcement Learning for Robotic Manipulation

View-based Programming with Reinforcement Learning for Robotic Manipulation View-based Programming with Reinforcement Learning for Robotic Manipulation Yusuke MAEDA*, Takumi WATANABE** and Yuki MORIYAMA* *Yokohama National University **Seiko Epson Corp. Background Conventional

More information

HOG-Based Person Following and Autonomous Returning Using Generated Map by Mobile Robot Equipped with Camera and Laser Range Finder

HOG-Based Person Following and Autonomous Returning Using Generated Map by Mobile Robot Equipped with Camera and Laser Range Finder HOG-Based Person Following and Autonomous Returning Using Generated Map by Mobile Robot Equipped with Camera and Laser Range Finder Masashi Awai, Takahito Shimizu and Toru Kaneko Department of Mechanical

More information

A Fast and Accurate Eyelids and Eyelashes Detection Approach for Iris Segmentation

A Fast and Accurate Eyelids and Eyelashes Detection Approach for Iris Segmentation A Fast and Accurate Eyelids and Eyelashes Detection Approach for Iris Segmentation Walid Aydi, Lotfi Kamoun, Nouri Masmoudi Department of Electrical National Engineering School of Sfax Sfax University

More information

Behavior Learning for a Mobile Robot with Omnidirectional Vision Enhanced by an Active Zoom Mechanism

Behavior Learning for a Mobile Robot with Omnidirectional Vision Enhanced by an Active Zoom Mechanism Behavior Learning for a Mobile Robot with Omnidirectional Vision Enhanced by an Active Zoom Mechanism Sho ji Suzuki, Tatsunori Kato, Minoru Asada, and Koh Hosoda Dept. of Adaptive Machine Systems, Graduate

More information

A Study on Object Tracking Signal Generation of Pan, Tilt, and Zoom Data

A Study on Object Tracking Signal Generation of Pan, Tilt, and Zoom Data Vol.8, No.3 (214), pp.133-142 http://dx.doi.org/1.14257/ijseia.214.8.3.13 A Study on Object Tracking Signal Generation of Pan, Tilt, and Zoom Data Jin-Tae Kim Department of Aerospace Software Engineering,

More information

Study on Synchronization for Laser Scanner and Industrial Robot

Study on Synchronization for Laser Scanner and Industrial Robot Study on Synchronization for Laser Scanner and Industrial Robot Heeshin Kang 1 1 Korea Institute of Machinery and Materials, Daejeon, Korea 173 Abstract On this paper, a study of robot based remote laser

More information

A Fast Circular Edge Detector for the Iris Region Segmentation

A Fast Circular Edge Detector for the Iris Region Segmentation A Fast Circular Edge Detector for the Iris Region Segmentation Yeunggyu Park, Hoonju Yun, Myongseop Song, and Jaihie Kim I.V. Lab. Dept. of Electrical and Computer Engineering, Yonsei University, 134Shinchon-dong,

More information

Visual Servoing Utilizing Zoom Mechanism

Visual Servoing Utilizing Zoom Mechanism IEEE Int. Conf. on Robotics and Automation 1995, pp.178 183, Nagoya, May. 12 16, 1995 1 Visual Servoing Utilizing Zoom Mechanism Koh HOSODA, Hitoshi MORIYAMA and Minoru ASADA Dept. of Mechanical Engineering

More information

Adaptive Building of Decision Trees by Reinforcement Learning

Adaptive Building of Decision Trees by Reinforcement Learning Proceedings of the 7th WSEAS International Conference on Applied Informatics and Communications, Athens, Greece, August 24-26, 2007 34 Adaptive Building of Decision Trees by Reinforcement Learning MIRCEA

More information

A Fuzzy Reinforcement Learning for a Ball Interception Problem

A Fuzzy Reinforcement Learning for a Ball Interception Problem A Fuzzy Reinforcement Learning for a Ball Interception Problem Tomoharu Nakashima, Masayo Udo, and Hisao Ishibuchi Department of Industrial Engineering, Osaka Prefecture University Gakuen-cho 1-1, Sakai,

More information

Design of Obstacle Avoidance System for Mobile Robot using Fuzzy Logic Systems

Design of Obstacle Avoidance System for Mobile Robot using Fuzzy Logic Systems ol. 7, No. 3, May, 2013 Design of Obstacle Avoidance System for Mobile Robot using Fuzzy ogic Systems Xi i and Byung-Jae Choi School of Electronic Engineering, Daegu University Jillyang Gyeongsan-city

More information

OBSTACLE DETECTION USING STRUCTURED BACKGROUND

OBSTACLE DETECTION USING STRUCTURED BACKGROUND OBSTACLE DETECTION USING STRUCTURED BACKGROUND Ghaida Al Zeer, Adnan Abou Nabout and Bernd Tibken Chair of Automatic Control, Faculty of Electrical, Information and Media Engineering University of Wuppertal,

More information

Matching Evaluation of 2D Laser Scan Points using Observed Probability in Unstable Measurement Environment

Matching Evaluation of 2D Laser Scan Points using Observed Probability in Unstable Measurement Environment Matching Evaluation of D Laser Scan Points using Observed Probability in Unstable Measurement Environment Taichi Yamada, and Akihisa Ohya Abstract In the real environment such as urban areas sidewalk,

More information

Real time game field limits recognition for robot self-localization using collinearity in Middle-Size RoboCup Soccer

Real time game field limits recognition for robot self-localization using collinearity in Middle-Size RoboCup Soccer Real time game field limits recognition for robot self-localization using collinearity in Middle-Size RoboCup Soccer Fernando Ribeiro (1) Gil Lopes (2) (1) Department of Industrial Electronics, Guimarães,

More information

Learn to Swing Up and Balance a Real Pole Based on Raw Visual Input Data

Learn to Swing Up and Balance a Real Pole Based on Raw Visual Input Data Learn to Swing Up and Balance a Real Pole Based on Raw Visual Input Data Jan Mattner*, Sascha Lange, and Martin Riedmiller Machine Learning Lab Department of Computer Science University of Freiburg 79110,

More information

Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System

Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System Toward Interlinking Asian Resources Effectively: Chinese to Korean Frequency-Based Machine Translation System Eun Ji Kim and Mun Yong Yi (&) Department of Knowledge Service Engineering, KAIST, Daejeon,

More information

DEVELOPMENT OF POSITION MEASUREMENT SYSTEM FOR CONSTRUCTION PILE USING LASER RANGE FINDER

DEVELOPMENT OF POSITION MEASUREMENT SYSTEM FOR CONSTRUCTION PILE USING LASER RANGE FINDER S17- DEVELOPMENT OF POSITION MEASUREMENT SYSTEM FOR CONSTRUCTION PILE USING LASER RANGE FINDER Fumihiro Inoue 1 *, Takeshi Sasaki, Xiangqi Huang 3, and Hideki Hashimoto 4 1 Technica Research Institute,

More information

Dynamic Obstacle Detection Based on Background Compensation in Robot s Movement Space

Dynamic Obstacle Detection Based on Background Compensation in Robot s Movement Space MATEC Web of Conferences 95 83 (7) DOI:.5/ matecconf/79583 ICMME 6 Dynamic Obstacle Detection Based on Background Compensation in Robot s Movement Space Tao Ni Qidong Li Le Sun and Lingtao Huang School

More information

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models Gleidson Pegoretti da Silva, Masaki Nakagawa Department of Computer and Information Sciences Tokyo University

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Gaussian Processes Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann SS08, University of Freiburg, Department for Computer Science Announcement

More information

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers

Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers Traffic Signs Recognition using HP and HOG Descriptors Combined to MLP and SVM Classifiers A. Salhi, B. Minaoui, M. Fakir, H. Chakib, H. Grimech Faculty of science and Technology Sultan Moulay Slimane

More information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information

Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Fully Automatic Methodology for Human Action Recognition Incorporating Dynamic Information Ana González, Marcos Ortega Hortas, and Manuel G. Penedo University of A Coruña, VARPA group, A Coruña 15071,

More information

Modifications of VFH navigation methods for mobile robots

Modifications of VFH navigation methods for mobile robots Available online at www.sciencedirect.com Procedia Engineering 48 (01 ) 10 14 MMaMS 01 Modifications of VFH navigation methods for mobile robots Andre Babinec a * Martin Dean a Františe Ducho a Anton Vito

More information

ECE 285 Class Project Report

ECE 285 Class Project Report ECE 285 Class Project Report Based on Source localization in an ocean waveguide using supervised machine learning Yiwen Gong ( yig122@eng.ucsd.edu), Yu Chai( yuc385@eng.ucsd.edu ), Yifeng Bu( ybu@eng.ucsd.edu

More information

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go

Convolutional Restricted Boltzmann Machine Features for TD Learning in Go ConvolutionalRestrictedBoltzmannMachineFeatures fortdlearningingo ByYanLargmanandPeterPham AdvisedbyHonglakLee 1.Background&Motivation AlthoughrecentadvancesinAIhaveallowed Go playing programs to become

More information

A Method of weld Edge Extraction in the X-ray Linear Diode Arrays. Real-time imaging

A Method of weld Edge Extraction in the X-ray Linear Diode Arrays. Real-time imaging 17th World Conference on Nondestructive Testing, 25-28 Oct 2008, Shanghai, China A Method of weld Edge Extraction in the X-ray Linear Diode Arrays Real-time imaging Guang CHEN, Keqin DING, Lihong LIANG

More information

Predict the box office of US movies

Predict the box office of US movies Predict the box office of US movies Group members: Hanqing Ma, Jin Sun, Zeyu Zhang 1. Introduction Our task is to predict the box office of the upcoming movies using the properties of the movies, such

More information

Study on the Signboard Region Detection in Natural Image

Study on the Signboard Region Detection in Natural Image , pp.179-184 http://dx.doi.org/10.14257/astl.2016.140.34 Study on the Signboard Region Detection in Natural Image Daeyeong Lim 1, Youngbaik Kim 2, Incheol Park 1, Jihoon seung 1, Kilto Chong 1,* 1 1567

More information

Geometrical Feature Extraction Using 2D Range Scanner

Geometrical Feature Extraction Using 2D Range Scanner Geometrical Feature Extraction Using 2D Range Scanner Sen Zhang Lihua Xie Martin Adams Fan Tang BLK S2, School of Electrical and Electronic Engineering Nanyang Technological University, Singapore 639798

More information

Laser Sensor for Obstacle Detection of AGV

Laser Sensor for Obstacle Detection of AGV Laser Sensor for Obstacle Detection of AGV Kyoung-Taik Park*, Young-Tae Shin and Byung-Su Kang * Nano-Mechanism Lab, Department of Intelligent Precision Machine Korea Institute of Machinery & Materials

More information

Classification Algorithm for Road Surface Condition

Classification Algorithm for Road Surface Condition IJCSNS International Journal of Computer Science and Network Security, VOL.4 No., January 04 Classification Algorithm for Road Surface Condition Hun-Jun Yang, Hyeok Jang, Jong-Wook Kang and Dong-Seok Jeong,

More information

Future Computer Vision Algorithms for Traffic Sign Recognition Systems

Future Computer Vision Algorithms for Traffic Sign Recognition Systems Future Computer Vision Algorithms for Traffic Sign Recognition Systems Dr. Stefan Eickeler Future of Traffic Sign Recognition Triangular Signs Complex Signs All Circular Signs Recognition of Circular Traffic

More information

A Symmetry Operator and Its Application to the RoboCup

A Symmetry Operator and Its Application to the RoboCup A Symmetry Operator and Its Application to the RoboCup Kai Huebner Bremen Institute of Safe Systems, TZI, FB3 Universität Bremen, Postfach 330440, 28334 Bremen, Germany khuebner@tzi.de Abstract. At present,

More information

COLLABORATIVE AGENT LEARNING USING HYBRID NEUROCOMPUTING

COLLABORATIVE AGENT LEARNING USING HYBRID NEUROCOMPUTING COLLABORATIVE AGENT LEARNING USING HYBRID NEUROCOMPUTING Saulat Farooque and Lakhmi Jain School of Electrical and Information Engineering, University of South Australia, Adelaide, Australia saulat.farooque@tenix.com,

More information

Augmented Reality of Robust Tracking with Realistic Illumination 1

Augmented Reality of Robust Tracking with Realistic Illumination 1 International Journal of Fuzzy Logic and Intelligent Systems, vol. 10, no. 3, June 2010, pp. 178-183 DOI : 10.5391/IJFIS.2010.10.3.178 Augmented Reality of Robust Tracking with Realistic Illumination 1

More information

Implementation of Enhanced Web Crawler for Deep-Web Interfaces

Implementation of Enhanced Web Crawler for Deep-Web Interfaces Implementation of Enhanced Web Crawler for Deep-Web Interfaces Yugandhara Patil 1, Sonal Patil 2 1Student, Department of Computer Science & Engineering, G.H.Raisoni Institute of Engineering & Management,

More information

Cover Page. Abstract ID Paper Title. Automated extraction of linear features from vehicle-borne laser data

Cover Page. Abstract ID Paper Title. Automated extraction of linear features from vehicle-borne laser data Cover Page Abstract ID 8181 Paper Title Automated extraction of linear features from vehicle-borne laser data Contact Author Email Dinesh Manandhar (author1) dinesh@skl.iis.u-tokyo.ac.jp Phone +81-3-5452-6417

More information

3D Modelling with Structured Light Gamma Calibration

3D Modelling with Structured Light Gamma Calibration 3D Modelling with Structured Light Gamma Calibration Eser SERT 1, Ibrahim Taner OKUMUS 1, Deniz TASKIN 2 1 Computer Engineering Department, Engineering and Architecture Faculty, Kahramanmaras Sutcu Imam

More information

A Nondestructive Bump Inspection in Flip Chip Component using Fuzzy Filtering and Image Processing

A Nondestructive Bump Inspection in Flip Chip Component using Fuzzy Filtering and Image Processing A Nondestructive Bump Inspection in Flip Chip Component using Fuzzy Filtering and Image Processing 103 A Nondestructive Bump Inspection in Flip Chip Component using Fuzzy Filtering and Image Processing

More information

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization

More on Learning. Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization More on Learning Neural Nets Support Vectors Machines Unsupervised Learning (Clustering) K-Means Expectation-Maximization Neural Net Learning Motivated by studies of the brain. A network of artificial

More information

Subpixel Corner Detection Using Spatial Moment 1)

Subpixel Corner Detection Using Spatial Moment 1) Vol.31, No.5 ACTA AUTOMATICA SINICA September, 25 Subpixel Corner Detection Using Spatial Moment 1) WANG She-Yang SONG Shen-Min QIANG Wen-Yi CHEN Xing-Lin (Department of Control Engineering, Harbin Institute

More information

Performance analysis of a MLP weight initialization algorithm

Performance analysis of a MLP weight initialization algorithm Performance analysis of a MLP weight initialization algorithm Mohamed Karouia (1,2), Régis Lengellé (1) and Thierry Denœux (1) (1) Université de Compiègne U.R.A. CNRS 817 Heudiasyc BP 49 - F-2 Compiègne

More information

Forward Feature Selection Using Residual Mutual Information

Forward Feature Selection Using Residual Mutual Information Forward Feature Selection Using Residual Mutual Information Erik Schaffernicht, Christoph Möller, Klaus Debes and Horst-Michael Gross Ilmenau University of Technology - Neuroinformatics and Cognitive Robotics

More information

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification

Analysis of Image and Video Using Color, Texture and Shape Features for Object Identification IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 16, Issue 6, Ver. VI (Nov Dec. 2014), PP 29-33 Analysis of Image and Video Using Color, Texture and Shape Features

More information

Notes 9: Optical Flow

Notes 9: Optical Flow Course 049064: Variational Methods in Image Processing Notes 9: Optical Flow Guy Gilboa 1 Basic Model 1.1 Background Optical flow is a fundamental problem in computer vision. The general goal is to find

More information

Applicability Estimation of Mobile Mapping. System for Road Management

Applicability Estimation of Mobile Mapping. System for Road Management Contemporary Engineering Sciences, Vol. 7, 2014, no. 24, 1407-1414 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ces.2014.49173 Applicability Estimation of Mobile Mapping System for Road Management

More information

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 12: Deep Reinforcement Learning

Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound. Lecture 12: Deep Reinforcement Learning Topics in AI (CPSC 532L): Multimodal Learning with Vision, Language and Sound Lecture 12: Deep Reinforcement Learning Types of Learning Supervised training Learning from the teacher Training data includes

More information

Practical Characteristics of Neural Network and Conventional Pattern Classifiers on Artificial and Speech Problems*

Practical Characteristics of Neural Network and Conventional Pattern Classifiers on Artificial and Speech Problems* 168 Lee and Lippmann Practical Characteristics of Neural Network and Conventional Pattern Classifiers on Artificial and Speech Problems* Yuchun Lee Digital Equipment Corp. 40 Old Bolton Road, OGOl-2Ull

More information

Autonomous Sensor Center Position Calibration with Linear Laser-Vision Sensor

Autonomous Sensor Center Position Calibration with Linear Laser-Vision Sensor International Journal of the Korean Society of Precision Engineering Vol. 4, No. 1, January 2003. Autonomous Sensor Center Position Calibration with Linear Laser-Vision Sensor Jeong-Woo Jeong 1, Hee-Jun

More information

Dept. of Adaptive Machine Systems, Graduate School of Engineering Osaka University, Suita, Osaka , Japan

Dept. of Adaptive Machine Systems, Graduate School of Engineering Osaka University, Suita, Osaka , Japan An Application of Vision-Based Learning for a Real Robot in RoboCup - A Goal Keeping Behavior for a Robot with an Omnidirectional Vision and an Embedded Servoing - Sho ji Suzuki 1, Tatsunori Kato 1, Hiroshi

More information

The Fly & Anti-Fly Missile

The Fly & Anti-Fly Missile The Fly & Anti-Fly Missile Rick Tilley Florida State University (USA) rt05c@my.fsu.edu Abstract Linear Regression with Gradient Descent are used in many machine learning applications. The algorithms are

More information

Jo-Car2 Autonomous Mode. Path Planning (Cost Matrix Algorithm)

Jo-Car2 Autonomous Mode. Path Planning (Cost Matrix Algorithm) Chapter 8.2 Jo-Car2 Autonomous Mode Path Planning (Cost Matrix Algorithm) Introduction: In order to achieve its mission and reach the GPS goal safely; without crashing into obstacles or leaving the lane,

More information

Wide area tracking method for augmented reality supporting nuclear power plant maintenance work

Wide area tracking method for augmented reality supporting nuclear power plant maintenance work Journal of Marine Science and Application, Vol.6, No.1, January 2006, PP***-*** Wide area tracking method for augmented reality supporting nuclear power plant maintenance work ISHII Hirotake 1, YAN Weida

More information

10-701/15-781, Fall 2006, Final

10-701/15-781, Fall 2006, Final -7/-78, Fall 6, Final Dec, :pm-8:pm There are 9 questions in this exam ( pages including this cover sheet). If you need more room to work out your answer to a question, use the back of the page and clearly

More information

Research on Evaluation Method of Product Style Semantics Based on Neural Network

Research on Evaluation Method of Product Style Semantics Based on Neural Network Research Journal of Applied Sciences, Engineering and Technology 6(23): 4330-4335, 2013 ISSN: 2040-7459; e-issn: 2040-7467 Maxwell Scientific Organization, 2013 Submitted: September 28, 2012 Accepted:

More information

Clustering with Reinforcement Learning

Clustering with Reinforcement Learning Clustering with Reinforcement Learning Wesam Barbakh and Colin Fyfe, The University of Paisley, Scotland. email:wesam.barbakh,colin.fyfe@paisley.ac.uk Abstract We show how a previously derived method of

More information

Wearable Master Device Using Optical Fiber Curvature Sensors for the Disabled

Wearable Master Device Using Optical Fiber Curvature Sensors for the Disabled Proceedings of the 2001 IEEE International Conference on Robotics & Automation Seoul, Korea May 21-26, 2001 Wearable Master Device Using Optical Fiber Curvature Sensors for the Disabled Kyoobin Lee*, Dong-Soo

More information

Notes on Multilayer, Feedforward Neural Networks

Notes on Multilayer, Feedforward Neural Networks Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book

More information

Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification

Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification Robustness of Selective Desensitization Perceptron Against Irrelevant and Partially Relevant Features in Pattern Classification Tomohiro Tanno, Kazumasa Horie, Jun Izawa, and Masahiko Morita University

More information

Improvement of SURF Feature Image Registration Algorithm Based on Cluster Analysis

Improvement of SURF Feature Image Registration Algorithm Based on Cluster Analysis Sensors & Transducers 2014 by IFSA Publishing, S. L. http://www.sensorsportal.com Improvement of SURF Feature Image Registration Algorithm Based on Cluster Analysis 1 Xulin LONG, 1,* Qiang CHEN, 2 Xiaoya

More information

State Estimation for Continuous-Time Systems with Perspective Outputs from Discrete Noisy Time-Delayed Measurements

State Estimation for Continuous-Time Systems with Perspective Outputs from Discrete Noisy Time-Delayed Measurements State Estimation for Continuous-Time Systems with Perspective Outputs from Discrete Noisy Time-Delayed Measurements António Pedro Aguiar aguiar@ece.ucsb.edu João Pedro Hespanha hespanha@ece.ucsb.edu Dept.

More information

UNIVERSITY OF NORTH CAROLINA AT CHARLOTTE

UNIVERSITY OF NORTH CAROLINA AT CHARLOTTE UNIVERSITY OF NORTH CAROLINA AT CHARLOTTE Department of Electrical and Computer Engineering ECGR 4161/5196 Introduction to Robotics Experiment No. 5 A* Path Planning Overview: The purpose of this experiment

More information

Review on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network
