Learning Orientation Information for Robotic Soccer using Neural Nets
Jacky Baltes and John Anderson
Department of Computer Science, University of Manitoba, Winnipeg, Canada
August 6, 2003

Abstract

Robotic soccer teams using both local and global vision traditionally rely on a set of pre-determined markers (e.g., a group of small colored circles mounted on the top surface of the robot) to provide easy targets for visual analysis, in order to determine the team membership, identity, and orientation of robots in the visual field. This approach requires calibration before any competition, as well as advance agreement on color codes different enough between teams to avoid recognition errors at run-time. Even after extensive calibration, small lighting variations can cause extensive misidentification. In this paper, we examine an alternative approach: training a neural network to recognize the orientation of the robots on a team so that visual tracking can occur in real time without special markers of any kind. This paper describes the design and implementation of such an approach and presents the results of an empirical evaluation.

1 Introduction

In robotic soccer leagues such as the F180 league, visual tracking of robots relies on colored markings identifying teams and individuals, with those markers placed so that individual orientation can easily be determined. While the F180 league uses an overhead camera for global vision, similar requirements exist for local vision teams as well. There is a significant set of difficulties that accompany this approach, both prior to play and during play itself. For this approach to work in practice, the vision system must be finely calibrated in advance of play to recognize the colors used for these markers across the range of lighting over the playing field. Colors must also be chosen far enough apart in the spectrum from one another to avoid misidentification.
During play itself, lighting must vary as little as possible, since any variation changes the apparent colors of the markers and misidentification ensues. Even if lighting variance could be controlled to this unrealistic degree, shadows and occlusion cause similar problems in visual recognition. A promising alternative is to learn to recognize the identity and orientation of robots from their image without using additional markers. There are three major points that would make this an improvement over existing approaches:
1. Recognition without markers would scale much better to more complex domains. Marker-based approaches do not scale well in situations where larger playing fields or larger teams of robots are employed, for example.

2. Recognition without markers would allow the number of colors used on the robot to be reduced. This would in turn directly support scaling, since more objects could be tracked over a greater visual field if color information could be simplified.

3. The ability to infer a robot's orientation without prior knowledge allows a team to infer the orientation and identity of the opponent's robots. This allows for more sophisticated tactical decision making. For example, robots that have a strong kicker can be extremely dangerous; if the kicker is facing away from the current locus of activity, the situation is not as dangerous.

What would ultimately be desirable is a system that could determine the identity and orientation of robots from a visual image without an involved initial calibration process: simply allow it to view the field with the robots moving on it for a reasonably limited time, and the system would self-calibrate. Such a system would also ideally adjust itself as play ensued, allowing it to compensate for small changes in lighting, shadow, etc. There are two sub-tasks to accomplishing this: first, the ability to isolate the image of a moving robot from the overall visual field, and then the ability to recognize identity and orientation. Video servers already exist that can accomplish the former. The Doraemon video server [6, 1], for example, allows capture of moving objects and can supply both location information and a small bit-mapped graphic to any number of recipient machines connected to the video server over a network.
This paper presents an approach to the latter sub-task: take this video server data and extract a robot's orientation from each image frame, given that the robot has no additional features for identification or calibration. An artificial neural network is trained on time-sequenced image data for which orientation and identity are provided. We then empirically test this training using real data. The remainder of this paper describes the approach employed and its evaluation. Section 2 gives an overview of related work in the area. The design and implementation of the neural network learner are explained in section 3. The results of an empirical evaluation of the system are described in section 4. The paper concludes in section 5.

2 Related Work

In leagues such as the F180 league, team descriptions illustrate the heavy reliance on pre-defined markers for the recognition of object location and orientation (e.g. [3, 14]). While this is true of teams using global vision, local vision teams rely heavily on this as well (e.g. [10]). Colored patches mounted on the robot are the most commonly employed approach; however, bar codes or geometric features are also used [2]. Using geometric features or bar codes does not change the nature of the recognition problem itself, nor the disadvantages of requiring pre-defined markers that have already been discussed in section 1. Work continues on better ways of performing color calibration and real-time detection of colored blobs and other markers within this approach [13, 4]. Removing the ability to build a recognition and tracking system around pre-defined markers results in a much more difficult object recognition problem, since features for recognition must be discovered as part of the recognition process itself.
Some of these features may at times be shadowed or otherwise occluded, but the fact that more subtle features than the obvious colored markers currently used can be employed should result in a more robust recognition and tracking process if these features can be learned. Since we are working with subtle features across a noisy image, a more decentralized approach that can consider recognition across the image as a whole is indicated. The natural choice for such a task is a neural-net based approach. Neural nets have been used extensively in image processing tasks, from face recognition [12] and recognition of the human form [7] to oil-spill detection from
satellite images and other forms of remote sensing [8, 9]. Their robustness in the face of noisy and uncertain data is well known [11]. Within this range of tasks, it is important to distinguish between the problem of recognition or classification in single images and the much larger problem of ongoing recognition in real time over a sequence of temporal data. Detecting an oil spill, classifying vegetation, or classifying clouds in a single satellite image, for example, are all simpler recognition tasks than the ongoing recognition required to track a cloud formation, ocean current, or other entity over time (e.g. [5]). These previous efforts at recognition over time show that neural networks appear to be a viable choice for performing real-time object recognition and tracking in a robotic soccer domain. The question we attempt to answer in this paper is whether a neural network can be trained to robustly recognize the orientation of a robot solely from a sequence of images, thus completely eliminating the need for special markers on the robot. The next section describes the design of a neural network for this task.

Figure 1: A sequence of sample images taken and annotated by the video server

Figure 2: A second sequence of sample images taken and annotated by the video server

3 Design

In order to gather data suitable for use with a neural network, we use the output of the Doraemon video server [6, 1]. This server provides a small bit-mapped image of moving objects on the playing field by analyzing frames from a camera and looking for differences. Those differences allow the server to focus on portions of the image containing moving objects, and the direction of change also allows orientation information to be derived.
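The idea of difference-based detection can be illustrated with a minimal sketch. This is hypothetical code, not the Doraemon implementation: the threshold value and the single-object assumption are ours, and the real server additionally derives orientation from the direction of change.

```python
import numpy as np

def detect_moving_object(prev, curr, thresh=30):
    """Return the bounding box (x0, y0, x1, y1) of the changed region
    between two grayscale frames (H x W uint8 arrays), or None if
    nothing moved. A simplified sketch of frame differencing."""
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    mask = diff > thresh
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
```

The bounding box can then be used to crop the small pixmap that is shipped to clients, so that the neural network never sees the full field image.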
The system can therefore be used to record training data by assuming the robot is only being driven in a forward direction, so that the orientation information provided will be accurate for reinforcement during training. For a test data set, we place one robot on the field to serve as a test subject for recognition and drive it around at random (including across the lines of the field, in order to expose the system to interference from the lines). Other stationary robots are also placed on the field so that the system is exposed to more varied examples of robot images and to avoid over-fitting. Two sets of sample images are shown in Figures 1 and 2. Figure 1 is a set using a robot platform based on a toy car, while Figure 2 uses a robot based on the Lego Mindstorms platform. Both of these were used in the evaluation trials described in section 4. These images are sent as 64x32-pixel pixmaps in 24-bit color (stored in PPM format) with annotation information from the video server. Figure 3 shows the pixmap structure for the upper left image in Figure 1. The orientation shown is in radians, where 0 is approximately straight up on the playing field.

Figure 3: Sample pixmap of the image as transmitted by the video server (the pixmap for the first image in Fig. 1):

P6
# Object name=robot1,found=yes
# Position x= ,y= ,z=50
# Theta= ,dx= ,dy=
(64*32*3 bytes of pixel data)

We use a 3-layer feed-forward backpropagation network to perform recognition on these datasets. The network architecture is shown in Figure 4. The size of each image supplied by the video server is 6,144 bytes. We tested a neural net with 6,144 input nodes, but quickly found that this net performed very poorly and training was extremely slow. Sub-sampling of the data is therefore necessary. All pixels in a 4x4 neighborhood are averaged to create a 16x8 pixmap, which is fed to an input layer consisting of a corresponding 384 nodes. The hidden layer consists of 32 nodes, roughly 10% of the input layer. The output layer is a 1-hot encoding of the orientation angle discretized into 5-degree steps: there are 72 output nodes, each corresponding to a 5-degree interval. For a correctly-trained neural net, exactly one of these nodes should have a high activation, with the others low.

Figure 4: The Neural Network Architecture

3.1 Pre-processing of Input Images

In addition to studying the efficacy of this approach with raw image data, we wished to investigate the effects of other representations of the input data on the performance of the system. Because edge information may be more important than pixel values alone, we applied Sobel edge detection to the 2x2 sub-sampled pixmaps, producing sample images such as those shown in Figure 6. Since the resulting edge map is a gray scale image, the size of the input layer can be reduced.

4 Evaluation

This section describes the results of a series of empirical evaluations of the neural network. Firstly, we determined the number of hidden units needed for the network; the results of this investigation are described in subsection 4.1. Secondly, we investigated the influence of the input representation (4x4 sub-sampled color image, 2x2 sub-sampled gray scale image, and 2x2 sub-sampled edge map) on the accuracy and the speed of learning; these results are described in subsection 4.2.
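The input and output encodings described in section 3 can be sketched as follows. The sizes (4x4 averaging, 384 inputs, 72 five-degree bins) come from the paper; the reshape-based averaging and the convention that bin 0 starts at 0 degrees are our assumptions.

```python
import numpy as np

def subsample_4x4(img):
    """Average every 4x4 pixel neighborhood: a 32x64 RGB image
    becomes the 16x8 pixmap (8 rows x 16 columns) fed to the net."""
    h, w, c = img.shape
    return img.reshape(h // 4, 4, w // 4, 4, c).mean(axis=(1, 3))

def orientation_to_onehot(theta):
    """Discretize an orientation angle (radians) into a 1-hot vector
    over 72 five-degree bins, as used for the output layer."""
    deg = np.degrees(theta) % 360.0
    onehot = np.zeros(72)
    onehot[int(deg // 5) % 72] = 1.0
    return onehot

img = np.random.randint(0, 256, size=(32, 64, 3)).astype(float)
x = subsample_4x4(img).reshape(-1)   # 384 input activations
t = orientation_to_onehot(np.pi)     # 180 degrees -> bin 36
```

During training, `x` would be presented at the 384-node input layer and `t` used as the 72-node target pattern for backpropagation.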
Thirdly, subsection 4.3 describes an investigation into the generalization ability of the neural network. Fourthly, the computational cost of classification using the trained neural network is described in subsection 4.4.

4.1 Accuracy of Learning versus Number of Hidden Units

The first experiment was used to determine a suitable number of hidden units for the network. Figure 7 shows the development of the mean squared error (MSE) for neural nets with 8, 16, 24, and 32
hidden units.

Figure 5: The output of edge detection applied to the sample images shown in Fig. 1

Figure 6: The output of edge detection applied to the sample images shown in Fig. 2

Figure 7: Evolution of the MSE for Varying Number of Hidden Nodes in the Toycar Image Sequence

There are two common error measures for neural nets. The sum of squared errors (SSE) is the squared error summed over the output units and over the entire set of training examples (a so-called epoch). The MSE is the SSE divided by the number of training patterns in the epoch; in other words, it is the mean error for a given pattern. Unless otherwise specified, the network was trained using a learning rate of 0.3 and a momentum term of 0.2. The learning rate was increased slightly from the more popular value of 0.2 to speed up learning. The training data was randomly reshuffled after each epoch to avoid over-fitting the neural network to the specific sequence of training images. The tests were run on a dual Athlon MP PC with 1 GB of RAM; on this system, a training run took about 30 minutes. We applied 4x4 sub-sampling to reduce the size of the image to 16x8 pixels and the number of nodes in the input layer to 384. Sub-sampling is necessary since otherwise the training of the neural net takes too long to be practical. As can be seen, the neural nets with 8, 16, and 24 hidden nodes were not able to learn the correct orientation information from the image sequence: after 5000 epochs, these networks had accuracies of 23%, 59.5%, and 88% respectively. The network with 32 hidden nodes is able to learn the correct orientation from the training data and achieves 100% accuracy after 2000 epochs. Similar results were obtained when training the network with 2x2 sub-sampled gray scale images, which results in a network with 512 input nodes.
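The two error measures can be stated concretely. In this sketch, `outputs` and `targets` are assumed to be arrays of shape (patterns, 72), one row per training pattern in the epoch.

```python
import numpy as np

def sse_mse(outputs, targets):
    """SSE: squared error summed over all output units and all
    training patterns in an epoch. MSE: SSE divided by the number
    of patterns, i.e. the mean squared error per pattern."""
    outputs, targets = np.asarray(outputs), np.asarray(targets)
    sse = float(np.sum((outputs - targets) ** 2))
    return sse, sse / outputs.shape[0]
```

Because the SSE grows with the number of training patterns, the MSE is the more useful quantity when comparing training sets of different sizes, as in the figures below.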
The network with 32 hidden units learned to classify the input images with 100% accuracy after 10,000 epochs over all 200 input images.

4.2 Input Representation versus Learning Accuracy

After determining a suitable number of hidden units, we evaluated the learning ability and speed of learning for three different representations of the input data:

1. a 4x4 sub-sampled color image
6 Figure 8: Evolution of the MSE for different input representations for the Toycar Image Sequence Figure 9: Evolution of the MSE for different input representations for the Lego Image Sequence 2. a 2x2 sub-sampled gray scale image 3. a 2x2 sub-sampled edge map of the original image Figure 8 shows the MSE for different input representations. This figure shows that the representation makes little difference in the overall performance of the neural network. The accuracy for all three representations is similar (with an MSE of 0.10). Although the edge detected image improves quicker in the first few hundred epochs, the performance after longer training is worse than the color or gray scale representation. However, the differences in performance after training for more than 4000 epochs are too small and not statistically significant. We tested the neural networks and computed not only the accuracy based on the MSE, but also when applied to the task of returning the orientation (as one would use the neural network in the vision server). The results were also very similar. The neural network trained using color images was able to classify 99% of all images within 5 degrees. The neural network trained using gray scale or edge maps was able to classify 97% of all images correctly. There was no correlation between the missed images in our tests. That is, the three representations did mis-classify different images in our test. We are currently investigating more closely why these images posed problems for the neural network. The results of training the neural network on 200 sample images taken from the Lego image sequence are shown in Figure 9. The Lego robots have more features to work with. In fact, both the color and gray-scale representation terminated training after 4500 and 2500 epochs respectively with 100% accuracy. The small residual error shown in the graph is negligible. The edge representation again performed more poorly. 
After 5000 epochs, its classification accuracy was only 93%. The evaluation gives some indication that pre-processing and selecting features for the neural network is unnecessary and may even decrease the performance of the network. This result is at first counter-intuitive. However, a more careful analysis of the results shows that determining the orientation of a robot depends more on gathering small pieces of evidence and combining them into a consistent view than on extracting features such as edges. Each pixel provides only a small amount of information by itself; only its relationship to other pixels makes this information important. The edge information in the images is hard to extract and skews the results. For example, after checking some of the misclassified images in the Lego image sequence, we found that the front edge of the robot was occluded by shadows in some of those images. Artificial neural networks, on the
other hand, are especially suited to combining large numbers of inputs into a consistent view. They are therefore useful and promising tools for this research.

4.3 Generalization Ability

We then investigated the generalization ability of the neural net. We extracted 20 images (10%) at random from the set of 200 training images. After training the network on the remaining training set, its performance on the 20 unseen instances was tested. The results of these tests were not as encouraging as those of the previous experiments: instead of classifying at least some of the previously unseen images correctly, the network was unable to generalize well to unseen patterns. Since it is infeasible to show all possible input patterns to the system, the generalization ability of the neural network is extremely important. Further investigation showed that the generalization ability of the network was limited because there were few examples for each 5-degree bucket. Simply increasing the number of training examples is not a workable solution, since the training time increases. The generalization ability of the network can instead be improved by selecting a more representative training set: images can be selected to generate a training set with equal proportions of all 5-degree steps.

4.4 Computational Cost of Classification

Another concern is the computational requirements of the neural net after training. We therefore ran the classification program repeatedly on a modern PC (1.6 GHz Athlon processor). The speed of classification using the fully connected feed-forward network for 16x8 images (i.e., 384 input units, 32 hidden units, 72 output units, and their associated weights) was around 0.07 ms per classification. In practice, this means that the neural net is fast enough to classify all 10 robots in a typical RoboCup F-180 game.
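When the trained net is used in the vision server, the 72 output activations must be mapped back to an angle. A minimal sketch follows; taking the centre of the winning bin is our assumption, since the paper does not state how bin centres or ties are handled.

```python
import numpy as np

def decode_orientation(outputs):
    """Return the orientation in degrees for a 72-element output
    vector: the centre of the most strongly activated 5-degree bin."""
    best = int(np.argmax(outputs))
    return best * 5.0 + 2.5
```

Since this decoding is a single pass over 72 values, its cost is negligible next to the forward propagation through the network's weights.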
5 Conclusion

This is a first step. We are currently working on adding to the process of matching visual images the ability to consider the commands that were sent to the robots, giving a greater basis for reasoning about uncertainty during the tracking process. For example, if we are uncertain of a robot's location and orientation at the current time, we can start with the robot's last known location and orientation at the previous time t-1. Knowing that the robot was sent a move-forward command at time t-1 allows an assumption of where the robot should be now, and this assumption can be used to improve the recognition process. This will assist in increasing the recognition capabilities in cases of occlusion, for example. The biggest obstacle to the use of neural nets to determine the orientation of a robot is the inadequate generalization ability of the neural net. As discussed previously, this is partly due to the fact that even in a set of 200 training examples, there are only a few samples for each orientation. We are currently addressing this problem by pre-processing the training data to create a training set with a similar number of examples for each orientation. We are also currently exploring options to increase the speed of classification after the neural net has been trained. A simple example is that once one of the output neurons has fired, the computation can terminate, since there should only be one activated output neuron. Another idea is to extract a small set of rules that can be tested faster than calculating the output of the entire network.

References

[1] John Anderson and Jacky Baltes. Doraemon user's manual.

[2] Jacky Baltes. Yuefei: Object orientation and id without additional markers. In Andreas Birk, Silvia Coradeschi, and Satoshi Tadokoro, editors, RoboCup-2001: Robot Soccer World Cup V. Springer Verlag.
[3] Andreas Birk, Silvia Coradeschi, and Satoshi Tadokoro, editors. RoboCup-2001: Robot Soccer World Cup V. Springer Verlag, Berlin.

[4] Paulo Costa, Paulo Marques, Antonio Moreira, Armando Sousa, and Pedro Costa. Tracking and identifying in real time the robots of an F-180 team. In Proceedings of the Third International RoboCup Workshop, IJCAI-99, Stockholm. IJCAI.

[5] S. Cote and A. R. L. Tatnall. The Hopfield neural network as a tool for feature tracking and recognition from satellite sensor images. International Journal of Remote Sensing, 18(4).

[6] Jacky Baltes et al. The Doraemon video server.

[7] Tony Jan, Massimo Piccardi, and Thomas Hintz. Detection of suspicious pedestrian behavior using modified probabilistic neural networks. In Proceedings of Image and Vision Computing-02, Auckland, New Zealand, November.

[8] Miroslav Kubat, Robert C. Holte, and Stan Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2-3).

[9] M. Schaale and R. Furrer. Land surface classification by neural networks. International Journal of Remote Sensing, 16(16).

[10] Emanuele Menegatti, Francesco Nori, Enrico Pagello, Carlo Pellizzari, and David Spagnoli. Designing an omnidirectional vision system for a goalkeeper robot. In Andreas Birk, Silvia Coradeschi, and Satoshi Tadokoro, editors, RoboCup-2001: Robot Soccer World Cup V. Springer Verlag, Berlin.

[11] Tom Mitchell. Machine Learning. McGraw-Hill.

[12] Rein-Lien Hsu, Mohamed Abdel-Mottaleb, and Anil K. Jain. Face detection in color images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5).

[13] Mark Simon, Sven Behnke, and Raul Rojas. Robust real time color tracking. In Peter Stone, Tucker Balch, and Gerhard Kraetzschmar, editors, RoboCup-2000: Robot Soccer World Cup IV. Springer Verlag, Berlin.

[14] Peter Stone, Tucker Balch, and Gerhard Kraetzschmar, editors. RoboCup-2000: Robot Soccer World Cup IV. Springer Verlag, Berlin.
Real time game field limits recognition for robot self-localization using collinearity in Middle-Size RoboCup Soccer Fernando Ribeiro (1) Gil Lopes (2) (1) Department of Industrial Electronics, Guimarães,
More informationTechniques for Dealing with Missing Values in Feedforward Networks
Techniques for Dealing with Missing Values in Feedforward Networks Peter Vamplew, David Clark*, Anthony Adams, Jason Muench Artificial Neural Networks Research Group, Department of Computer Science, University
More informationIRIS SEGMENTATION OF NON-IDEAL IMAGES
IRIS SEGMENTATION OF NON-IDEAL IMAGES William S. Weld St. Lawrence University Computer Science Department Canton, NY 13617 Xiaojun Qi, Ph.D Utah State University Computer Science Department Logan, UT 84322
More informationCanny Edge Based Self-localization of a RoboCup Middle-sized League Robot
Canny Edge Based Self-localization of a RoboCup Middle-sized League Robot Yoichi Nakaguro Sirindhorn International Institute of Technology, Thammasat University P.O. Box 22, Thammasat-Rangsit Post Office,
More informationVisual Attention Control by Sensor Space Segmentation for a Small Quadruped Robot based on Information Criterion
Visual Attention Control by Sensor Space Segmentation for a Small Quadruped Robot based on Information Criterion Noriaki Mitsunaga and Minoru Asada Dept. of Adaptive Machine Systems, Osaka University,
More informationDetecting Spam with Artificial Neural Networks
Detecting Spam with Artificial Neural Networks Andrew Edstrom University of Wisconsin - Madison Abstract This is my final project for CS 539. In this project, I demonstrate the suitability of neural networks
More informationPedestrian Detection Using Correlated Lidar and Image Data EECS442 Final Project Fall 2016
edestrian Detection Using Correlated Lidar and Image Data EECS442 Final roject Fall 2016 Samuel Rohrer University of Michigan rohrer@umich.edu Ian Lin University of Michigan tiannis@umich.edu Abstract
More information2. Basic Task of Pattern Classification
2. Basic Task of Pattern Classification Definition of the Task Informal Definition: Telling things apart 3 Definition: http://www.webopedia.com/term/p/pattern_recognition.html pattern recognition Last
More informationLine Based Robot Localization under Natural Light Conditions
Line Based Robot Localization under Natural Light Conditions Artur Merke 1 and Stefan Welker 2 and Martin Riedmiller 3 Abstract. In this paper we present a framework for robot selflocalization in the robocup
More informationGesture Recognition using Neural Networks
Gesture Recognition using Neural Networks Jeremy Smith Department of Computer Science George Mason University Fairfax, VA Email: jsmitq@masonlive.gmu.edu ABSTRACT A gesture recognition method for body
More informationMotion Control in Dynamic Multi-Robot Environments
Motion Control in Dynamic Multi-Robot Environments Michael Bowling mhb@cs.cmu.edu Manuela Veloso mmv@cs.cmu.edu Computer Science Department Carnegie Mellon University Pittsburgh, PA 15213-3890 Abstract
More informationAN ADAPTIVE COLOR SEGMENTATION ALGORITHM FOR SONY LEGGED ROBOTS
AN ADAPTIE COLOR SEGMENTATION ALGORITHM FOR SONY LEGGED ROBOTS Bo Li, Huosheng Hu, Libor Spacek Department of Computer Science, University of Essex Wivenhoe Park, Colchester CO4 3SQ, United Kingdom Email:
More informationAN APPROACH OF SEMIAUTOMATED ROAD EXTRACTION FROM AERIAL IMAGE BASED ON TEMPLATE MATCHING AND NEURAL NETWORK
AN APPROACH OF SEMIAUTOMATED ROAD EXTRACTION FROM AERIAL IMAGE BASED ON TEMPLATE MATCHING AND NEURAL NETWORK Xiangyun HU, Zuxun ZHANG, Jianqing ZHANG Wuhan Technique University of Surveying and Mapping,
More informationVideo Inter-frame Forgery Identification Based on Optical Flow Consistency
Sensors & Transducers 24 by IFSA Publishing, S. L. http://www.sensorsportal.com Video Inter-frame Forgery Identification Based on Optical Flow Consistency Qi Wang, Zhaohong Li, Zhenzhen Zhang, Qinglong
More informationApplying Synthetic Images to Learning Grasping Orientation from Single Monocular Images
Applying Synthetic Images to Learning Grasping Orientation from Single Monocular Images 1 Introduction - Steve Chuang and Eric Shan - Determining object orientation in images is a well-established topic
More informationComponent-based Face Recognition with 3D Morphable Models
Component-based Face Recognition with 3D Morphable Models B. Weyrauch J. Huang benjamin.weyrauch@vitronic.com jenniferhuang@alum.mit.edu Center for Biological and Center for Biological and Computational
More informationFace Quality Assessment System in Video Sequences
Face Quality Assessment System in Video Sequences Kamal Nasrollahi, Thomas B. Moeslund Laboratory of Computer Vision and Media Technology, Aalborg University Niels Jernes Vej 14, 9220 Aalborg Øst, Denmark
More informationAutomated Canvas Analysis for Painting Conservation. By Brendan Tobin
Automated Canvas Analysis for Painting Conservation By Brendan Tobin 1. Motivation Distinctive variations in the spacings between threads in a painting's canvas can be used to show that two sections of
More informationFast Vehicle Detection and Counting Using Background Subtraction Technique and Prewitt Edge Detection
International Journal of Computer Science and Telecommunications [Volume 6, Issue 10, November 2015] 8 ISSN 2047-3338 Fast Vehicle Detection and Counting Using Background Subtraction Technique and Prewitt
More informationAnalytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.
Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied
More informationHybrid Feature Selection for Modeling Intrusion Detection Systems
Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,
More informationParticle-Filter-Based Self-Localization Using Landmarks and Directed Lines
Particle-Filter-Based Self-Localization Using Landmarks and Directed Lines Thomas Röfer 1, Tim Laue 1, and Dirk Thomas 2 1 Center for Computing Technology (TZI), FB 3, Universität Bremen roefer@tzi.de,
More informationPredicting Messaging Response Time in a Long Distance Relationship
Predicting Messaging Response Time in a Long Distance Relationship Meng-Chen Shieh m3shieh@ucsd.edu I. Introduction The key to any successful relationship is communication, especially during times when
More informationA *69>H>N6 #DJGC6A DG C<>C::G>C<,8>:C8:H /DA 'D 2:6G, ()-"&"3 -"(' ( +-" " " % '.+ % ' -0(+$,
The structure is a very important aspect in neural network design, it is not only impossible to determine an optimal structure for a given problem, it is even impossible to prove that a given structure
More informationHuman Detection. A state-of-the-art survey. Mohammad Dorgham. University of Hamburg
Human Detection A state-of-the-art survey Mohammad Dorgham University of Hamburg Presentation outline Motivation Applications Overview of approaches (categorized) Approaches details References Motivation
More informationDynamic Routing Between Capsules
Report Explainable Machine Learning Dynamic Routing Between Capsules Author: Michael Dorkenwald Supervisor: Dr. Ullrich Köthe 28. Juni 2018 Inhaltsverzeichnis 1 Introduction 2 2 Motivation 2 3 CapusleNet
More informationTesting omnidirectional vision-based Monte-Carlo Localization under occlusion
Testing omnidirectional vision-based Monte-Carlo Localization under occlusion E. Menegatti, A. Pretto and E. Pagello Intelligent Autonomous Systems Laboratory Department of Information Engineering The
More informationNeuron Selectivity as a Biologically Plausible Alternative to Backpropagation
Neuron Selectivity as a Biologically Plausible Alternative to Backpropagation C.J. Norsigian Department of Bioengineering cnorsigi@eng.ucsd.edu Vishwajith Ramesh Department of Bioengineering vramesh@eng.ucsd.edu
More informationFace Detection System Based on MLP Neural Network
Face Detection System Based on MLP Neural Network NIDAL F. SHILBAYEH and GAITH A. AL-QUDAH Computer Science Department Middle East University Amman JORDAN n_shilbayeh@yahoo.com gaith@psut.edu.jo Abstract:
More informationSingle Pass Connected Components Analysis
D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected
More informationVideo Alignment. Literature Survey. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin
Literature Survey Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This literature survey compares various methods
More informationIn this assignment, we investigated the use of neural networks for supervised classification
Paul Couchman Fabien Imbault Ronan Tigreat Gorka Urchegui Tellechea Classification assignment (group 6) Image processing MSc Embedded Systems March 2003 Classification includes a broad range of decision-theoric
More informationBayesian Learning Networks Approach to Cybercrime Detection
Bayesian Learning Networks Approach to Cybercrime Detection N S ABOUZAKHAR, A GANI and G MANSON The Centre for Mobile Communications Research (C4MCR), University of Sheffield, Sheffield Regent Court, 211
More informationEstimating the wavelength composition of scene illumination from image data is an
Chapter 3 The Principle and Improvement for AWB in DSC 3.1 Introduction Estimating the wavelength composition of scene illumination from image data is an important topics in color engineering. Solutions
More informationCS 231A Computer Vision (Fall 2012) Problem Set 3
CS 231A Computer Vision (Fall 2012) Problem Set 3 Due: Nov. 13 th, 2012 (2:15pm) 1 Probabilistic Recursion for Tracking (20 points) In this problem you will derive a method for tracking a point of interest
More information3D object recognition used by team robotto
3D object recognition used by team robotto Workshop Juliane Hoebel February 1, 2016 Faculty of Computer Science, Otto-von-Guericke University Magdeburg Content 1. Introduction 2. Depth sensor 3. 3D object
More informationVisual localization using global visual features and vanishing points
Visual localization using global visual features and vanishing points Olivier Saurer, Friedrich Fraundorfer, and Marc Pollefeys Computer Vision and Geometry Group, ETH Zürich, Switzerland {saurero,fraundorfer,marc.pollefeys}@inf.ethz.ch
More informationA Behavior Architecture for Autonomous Mobile Robots Based on Potential Fields
A Behavior Architecture for Autonomous Mobile Robots Based on Potential Fields Tim Laue and Thomas Röfer Bremer Institut für Sichere Systeme, Technologie-Zentrum Informatik, FB 3, Universität Bremen, Postfach
More informationDEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla
DEEP LEARNING REVIEW Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature 2015 -Presented by Divya Chitimalla What is deep learning Deep learning allows computational models that are composed of multiple
More informationMATRIX: A force field pattern matching method for mobile robots
MATRIX: A force field pattern matching method for mobile robots Felix von Hundelshausen, Michael Schreiber, Fabian Wiesel, Achim Liers, and Raúl Rojas Technical Report B-8-3 Freie Universität Berlin, Takustraße
More informationFourier analysis of low-resolution satellite images of cloud
New Zealand Journal of Geology and Geophysics, 1991, Vol. 34: 549-553 0028-8306/91/3404-0549 $2.50/0 Crown copyright 1991 549 Note Fourier analysis of low-resolution satellite images of cloud S. G. BRADLEY
More informationFabric Defect Detection Based on Computer Vision
Fabric Defect Detection Based on Computer Vision Jing Sun and Zhiyu Zhou College of Information and Electronics, Zhejiang Sci-Tech University, Hangzhou, China {jings531,zhouzhiyu1993}@163.com Abstract.
More informationPost-Classification Change Detection of High Resolution Satellite Images Using AdaBoost Classifier
, pp.34-38 http://dx.doi.org/10.14257/astl.2015.117.08 Post-Classification Change Detection of High Resolution Satellite Images Using AdaBoost Classifier Dong-Min Woo 1 and Viet Dung Do 1 1 Department
More informationTexture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map
Texture Classification by Combining Local Binary Pattern Features and a Self-Organizing Map Markus Turtinen, Topi Mäenpää, and Matti Pietikäinen Machine Vision Group, P.O.Box 4500, FIN-90014 University
More informationClassification Lecture Notes cse352. Neural Networks. Professor Anita Wasilewska
Classification Lecture Notes cse352 Neural Networks Professor Anita Wasilewska Neural Networks Classification Introduction INPUT: classification data, i.e. it contains an classification (class) attribute
More informationTask Driven Perceptual Organization for Extraction of Rooftop Polygons. Christopher Jaynes, Frank Stolle and Robert Collins
Task Driven Perceptual Organization for Extraction of Rooftop Polygons Christopher Jaynes, Frank Stolle and Robert Collins Abstract A new method for extracting planar polygonal rooftops in monocular aerial
More informationA Developmental Framework for Visual Learning in Robotics
A Developmental Framework for Visual Learning in Robotics Amol Ambardekar 1, Alireza Tavakoli 2, Mircea Nicolescu 1, and Monica Nicolescu 1 1 Department of Computer Science and Engineering, University
More informationA Multi-Scale Focus Pseudo Omni-directional Robot Vision System with Intelligent Image Grabbers
Proceedings of the 2005 IEEE/ASME International Conference on Advanced Intelligent Mechatronics Monterey, California, USA, 24-28 July, 2005 WD2-03 A Multi-Scale Focus Pseudo Omni-directional Robot Vision
More informationA Method for Identifying Irregular Lattices of Hexagonal Tiles in Real-time
S. E. Ashley, R. Green, A Method for Identifying Irregular Lattices of Hexagonal Tiles in Real-Time, Proceedings of Image and Vision Computing New Zealand 2007, pp. 271 275, Hamilton, New Zealand, December
More informationECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall Midterm Examination
ECE 172A: Introduction to Intelligent Systems: Machine Vision, Fall 2008 October 29, 2008 Notes: Midterm Examination This is a closed book and closed notes examination. Please be precise and to the point.
More informationImproving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah
Improving the way neural networks learn Srikumar Ramalingam School of Computing University of Utah Reference Most of the slides are taken from the third chapter of the online book by Michael Nielson: neuralnetworksanddeeplearning.com
More informationUsing Geometric Blur for Point Correspondence
1 Using Geometric Blur for Point Correspondence Nisarg Vyas Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA Abstract In computer vision applications, point correspondence
More informationImage Processing Strategies for Artefact Classification
Peter Durham Paul Lewis Stephen J. Shennan Image Processing Strategies for Artefact Classification 1 Introduction It need hardly be stated that the identification of objects (usually artefacts) is a fundamental
More informationAn Illumination Identification System for the AIBO Robot
An Illumination Identification System for the AIBO Robot Michael S. Zehmeister, Yang Sok Kim and Byeong Ho Kang School of Computing, University of Tasmania Sandy Bay, Tasmania 7001, Australia {yangsokk,
More informationFlexible Lag Definition for Experimental Variogram Calculation
Flexible Lag Definition for Experimental Variogram Calculation Yupeng Li and Miguel Cuba The inference of the experimental variogram in geostatistics commonly relies on the method of moments approach.
More informationAIBO Goes to School Basic Math
AIBO Goes to School Basic Math S. RADHAKRISHNAN and W. D. POTTER Artificial Intelligence Center University of Georgia 111 Boyd GRSC, Athens, GA 30602 USA http://www.ai.uga.edu Abstract: -. One of the objectives
More informationCeiling Analysis of Pedestrian Recognition Pipeline for an Autonomous Car Application
Ceiling Analysis of Pedestrian Recognition Pipeline for an Autonomous Car Application Henry Roncancio, André Carmona Hernandes and Marcelo Becker Mobile Robotics Lab (LabRoM) São Carlos School of Engineering
More informationWave front Method Based Path Planning Algorithm for Mobile Robots
Wave front Method Based Path Planning Algorithm for Mobile Robots Bhavya Ghai 1 and Anupam Shukla 2 ABV- Indian Institute of Information Technology and Management, Gwalior, India 1 bhavyaghai@gmail.com,
More informationTexture Segmentation by Windowed Projection
Texture Segmentation by Windowed Projection 1, 2 Fan-Chen Tseng, 2 Ching-Chi Hsu, 2 Chiou-Shann Fuh 1 Department of Electronic Engineering National I-Lan Institute of Technology e-mail : fctseng@ccmail.ilantech.edu.tw
More information6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION
6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm
More informationImplementation of a Library for Artificial Neural Networks in C
Implementation of a Library for Artificial Neural Networks in C Jack Breese TJHSST Computer Systems Lab 2007-2008 June 10, 2008 1 Abstract In modern computing, there are several approaches to pattern recognition
More informationDESIGNING A REAL TIME SYSTEM FOR CAR NUMBER DETECTION USING DISCRETE HOPFIELD NETWORK
DESIGNING A REAL TIME SYSTEM FOR CAR NUMBER DETECTION USING DISCRETE HOPFIELD NETWORK A.BANERJEE 1, K.BASU 2 and A.KONAR 3 COMPUTER VISION AND ROBOTICS LAB ELECTRONICS AND TELECOMMUNICATION ENGG JADAVPUR
More informationEagle Knights: RoboCup Small Size League. Dr. Alfredo Weitzenfeld ITAM - Mexico
Eagle Knights: RoboCup Small Size League Dr. Alfredo Weitzenfeld ITAM - Mexico Playing Field and Robot Size camera 15 cm 4 m 18 cm 4 m 5.5 m Distributed Robot Control Eagle Knights (2003-2006) 2003 2004
More information