Wearable Context-Aware Food Recognition for Nutrition Monitoring


Wearable Context-Aware Food Recognition for Nutrition Monitoring

Geeta Shroff
May 1, 2008

Undergraduate Thesis
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA

Thesis Committee:
Asim Smailagic (Advisor)
Mark Stehlik
Klaus Sutner

Abstract

We propose DiaWear, a novel assistive mobile phone-based calorie estimation and monitoring system to improve the quality of life of diabetes patients and individuals with unique nutrition management needs. Our goal is to achieve near automatic food recognition using mobile wearable cell phone platforms. DiaWear currently uses a neural network classification scheme to identify food items from a captured image. However, it is difficult to account for the varying and implicit nature of certain foods using traditional image recognition techniques. To overcome these limitations, we also introduce the role of the mobile phone as a platform to gather static and dynamic contextual information in obtaining better food recognition.

CONTENTS

1 Introduction
1.1 Previous work
1.2 Our proposed approach
1.3 Contributions
1.4 Thesis Outline
2 Motivation
2.1 Initial Studies
2.2 CogTool
2.3 CogTool Implementation Details
2.4 CogTool Experimental Results
2.5 Conclusions/Discussion
3 System Architecture
3.1 Client
3.2 Server
3.3 Data Transfer
4 Food Image Recognition Aspects
4.1 Common Image Recognition Aspects
4.2 Food Specific Image Recognition Aspects
4.3 Contextual Information Aspects
4.4 Our Assumptions
5 Food Recognition Algorithm
5.1 Background Work
5.2 Steps Before Feature Extraction
5.3 Image Acquisition
5.4 Preprocessing
5.5 Background Removal
5.6 Segmentation and Labeling

6 Feature Extraction Details
6.1 Feature Vector
6.2 Reference Object
6.3 Color Based Features
6.4 Size Based Features
6.5 Texture Based Features
6.6 Shape Based Features
6.7 Context Based Features
7 Neural Network Based Food Classification
7.1 Training the Neural Net
7.2 Classifying Outputs of Neural Net
8 Contextual Improvements
9 Experimental Work
9.1 Experiment #1: Color, Size Features
9.2 Experiment #2: Additional Texture Features
9.3 Experiment #3: Additional Shape Features
9.4 Experiment #4: Additional Contextual Features
10 Evaluation
10.1 Evaluation Setup, Scenario, and Results
10.2 Discussion of Results
11 Conclusions and Future Work
11.1 Current Work
11.2 Future Work
12 References
Appendix: Tools, SDKs

LIST OF FIGURES

Figure 1. Typical DiaWear Use Case
Figure 2. CogTool Tasks and User Interaction Times
Figure 3. DiaWear System Architecture
Figure 4. Example of an image that fits our assumptions, and the same image after preprocessing, background removal, and segmentation
Figure 5. Non-context features based Training and Classification
Figure 6. Context and non-context features based Training and Classification
Figure 7. Steps before Feature Extraction
Figure 8. Background Mask
Figure 9. Segmented Objects with Blue Reference Object
Figure 10. Neural Network (NN) Model
Figure 11. Training Convergence Example
Figure 12. Mean-Squared Error (MSE) performance goal reached in 1295 epochs
Figure 13. NN hits with non-contextual texture, size, color features
Figure 14. NN hits with non-contextual shape, texture, size, color features
Figure 15. Hits when context added to NN inputs and compared to non-contextual hits
Figure 16. Hits when context added to NN outputs compared to non-contextual hits
Figure 17. Evaluation Results
Figure 18. The DiaWear Vision

LIST OF TABLES

Table 1. CogTool User Interaction Times
Table 2. CogTool Mean User Interaction Times
Table 3. General Image Assumptions
Table 4. Verification results for object recognition over the 120 trained images
Table 5. Testing results for object recognition over the 120 trained images

1 Introduction

Living with diabetes today involves a substantial amount of effort on the patient's side in regularly and frequently managing personal nutrition. This discourages patients from the careful management of their disease. Two target groups most prone to neglecting this task are young children with Type I diabetes and busy adults with Type II diabetes [1]. This is also the case with pancreatitis, gall bladder disease, and other such diet-related conditions. In many cases, negligence on the patient's side may result in serious illness or death. A Diabetes Care (2006) study shows that about 73 million persons in the United States either have diabetes or are at risk [2]. Moreover, the total direct and indirect costs associated with diabetes amount to about $132 billion in the USA [1]. With this widespread and growing number of patients with costly dietary management needs, we have identified the need for a low-cost, automatic, personalized, mobile assistive system that these patients can use to manage their disease in a more efficient, effective, and convenient manner. Recorded and displayed data will also assist their caretakers and medical professionals in providing more accurate forms of treatment and medication. Such a system could also be used in fitness and weight loss related applications.

1.1 Previous work

Previous work on a wearable diabetes management system at Carnegie Mellon University employed menu-based recording of consumed food [3]. In an effort to minimize user interaction time with the device, we are shifting to an automatic food recognition technique for calorie monitoring. The Nokia Wellness Diary is another mobile device based application that allows for monitoring health on a daily basis [4]. However, this application also requires the user to enter information about their eating habits manually.

The task of recognizing food in general through image recognition is extremely difficult. Past efforts have been limited to recognition within specific food categories, for example in the fish, meat, or citrus fruit industries [5][6][7]. One study of classification methods for color-based feature detection in food processing (e.g., meats and fish) used a pre-filter to specifically ignore the background of image objects. It classified images using a statistically based fast bounded box (SFBB) algorithm, which compared favorably against common classification algorithms based on SVMs and neural networks [5]. Other efforts have been limited by the placement of food on a plate or the compartmentalized location of food items [8][9]. Many of these and other efforts have been used for assessing the quality of specific foods [5][6]. One such food recognition algorithm is based on dish recognition and has been developed to adjust microwave settings according to the food in a lunch box, using predefined rules regarding the positioning of foods in the lunch box according to the categories assigned to its different sections, called a food arrangement map [8]. However, very little effort has been geared towards using food recognition at large in the medical and assistive technology domain. One effort for hospital environments was a dish extraction method using a neural network based image recognition algorithm for food intake measurements [9]. However, even in this case, the algorithm is limited to the known domain of hospital foods and their positions in the tray.

None of these methods are available to the user on a daily basis, since they are not based on mobile wearable platforms or architectures. We are bridging this gap between personal health and technology, and between the user and their nutrition information, by bringing food monitoring to their personal handheld computing device.

Allowing them to monitor their health conditions on a daily basis will allow for a reduction in overall health costs as well.

1.2 Our proposed approach

We propose the DiaWear architecture, shown in Section 3, as a solution to the nutrition management problems described above. DiaWear is a context-aware wearable food recognition and calorie estimation mobile system. After images and context are captured by the cell phone client, recognition of the images is completed by the server. On classifying the food, the corresponding calorie estimate is made available to the user on their handheld personal computing device. The tools and kits that the author had to learn and use to make this possible are listed in Appendix A. A typical use case for such a system is shown in Figure 1: the user orders a food item or meal, takes a picture of the food on the plate, and the picture is sent to the DiaWear server; the items in the image are recognized and translated into calories, and the calorie information is sent back to the user.

Figure 1. Typical DiaWear Use Case

1.3 Contributions

We are building on past work through our Neural Network based food recognition technique. We are additionally bringing nutrition and calorie management to the user's daily life through a mobile wearable cell phone platform. Daily nutrition monitoring will provide the user, caretaker, and medical professionals with useful data to make better decisions about future treatment and medications. We are also aiming to cover a larger food base through the use of static and dynamic contextual information. The growing number of sensors and capabilities of cell phones provides us with a good platform to capture images and gather contextual information that can be used to improve our food recognition and calorie estimation.

1.4 Thesis Outline

In Section 2, we start with an explanation of the purpose of our food recognition, followed by an initial motivation experiment using CogTool. Section 3 explains the DiaWear system architecture and our client-server model. In Section 4, we introduce the reader to common image recognition issues, followed by a description of food-specific issues that we have observed. We then go on to discuss how context can be used to approach these different aspects. Sections 5, 6, 7, and 8 discuss the food recognition algorithm and the merging of Neural Network outputs with our contextual features. We also give the reader a detailed explanation of the different features we have implemented. In Section 9, we demonstrate how our image recognition algorithm evolved and how we added more features through various experiments. In Section 10, we evaluate our conclusion of using contextual improvements by generating user context profiles. Section 11 then describes our future steps and overall conclusions. The Appendix contains a list of tools, platforms, and software the student learned about and used to build the complete end-to-end system.

2 Motivation

We performed CogTool experiments that motivated our use of image recognition to improve automatic food recognition and calorie estimation in mobile personal computing based systems.

2.1 Initial Studies

Our initial studies consisted of understanding improvements to DIMA, a mobile PDA-based diabetes management system developed by our research group in the past [3]. The food monitoring functionality of DIMA employs a menu-based system that allows the user to click through menus loaded with relevant items from a food database before logging calorie information for the food item that the user has consumed.

2.2 CogTool

CogTool is a cognitive performance modeling tool developed at Carnegie Mellon University to measure user interaction time with a mobile device. Since it takes into account user learning time, it is an improvement over earlier keystroke-level model (KLM) techniques for estimating interaction times from storyboards [10]. We used CogTool because we wanted a cost-effective method to measure the relative benefit of a smart context-aware system versus a non-context system without requiring special user studies.

2.3 CogTool Implementation Details

We measured the relative user interaction time as our metric for a menu-based food recognition system using CogTool. We performed four sets of task experiments for each of the six selected food items:

1. Using non-context based system storyboards
2. Utilizing time context (e.g., morning time of day to filter breakfast items) based system storyboards
3. Utilizing user preferences (e.g., the user's frequent foods) based system storyboards
4. Utilizing multiple contexts (time and user preference) based system storyboards

Figure 2. CogTool Tasks and User Interaction Times

Pictures for the storyboards were obtained from our PDA emulator, and the required menu items were marked as active to connect to the next menu(s). Figure 2 summarizes the scenarios and tasks used for our experiments in CogTool.

2.4 CogTool Experimental Results

Table 1 lists our results for identifying different food items without and with the contextual filters described above.

Tasks                        | No Context (s) | Time Context (s) | User Preference (s) | Both Contexts (s)
Identify Hamburger           |                |                  |                     |
Identify Veggie Pizza Slice  |                |                  |                     |
Identify Honey Nut Cheerios  |                |                  |                     |
Identify Sunny Side Up       |                |                  |                     |
Identify Strawberry Smoothie |                |                  |                     |
Identify Mac n Cheese        |                |                  |                     |

Table 1. CogTool User Interaction Times

From the above table, we calculate the mean user interaction times for the four different cases, as shown in Table 2.

Storyboard Implementation Case | Mean User Interaction Time (s)
No Context                     |
Time Context                   |
User Preference                |
Both Contexts                  |

Table 2. CogTool Mean User Interaction Times

2.5 Conclusions/Discussion

We successfully verified the role of contextual information such as time of day and user preferences in cutting down menus to achieve shorter user input times. Ignoring the time to start the phone and application, the results using both contexts in Figure 2 give us near automatic detection in a menu-based approach. The general trend showed that some context is better than no context, where better is defined as requiring less user interaction time with the system to input information about the food being eaten. Combining multiple contexts resulted in an even better user interaction curve. An additional finding was that 'Time of day' used to determine meal time was, individually, a stronger context than 'Static User Preference' when applied to filter out samples in the search for the food being eaten.

However, some minimal amount of clicking, searching, or skimming of menus is still required for the user to get to the final end result. Users with specific usability or accessibility needs may not find enough value in fewer menus that still require some amount of searching and clicking. Moreover, a large number of diabetes patients fall into the blind and elderly user communities [1]. These groups may not be as comfortable using cell phones and menu based systems, for varying accessibility and usability reasons. Other users might also appreciate a more automatic approach for monitoring food on a daily basis. We propose DiaWear to bring further automation to food monitoring by reducing menus through the automatic image recognition of food, leveraging the on-board device camera and Wi-Fi capabilities. Our studies of using mobile devices to gain useful context, such as system time and user preference based filters, have motivated us to incorporate similar contextual filters into our image recognition based food recognition approach, as discussed in detail in Section 8.

3 System Architecture

DiaWear is built using a client-server architecture, as shown in Figure 3.

Figure 3. DiaWear System Architecture

3.1 Client

The client consists of a Nokia N95 phone equipped with a 5.0 megapixel camera for capturing food images. We also have logging functionality on the phone to log food items recognized in past sessions by the user. This log allows users to monitor their diet, and also allows for analysis to derive other contextual information about current sessions, such as the user preference based on the frequency count of a food, overall and at a particular time of the day. The system data on the phone, such as the system time, allows us to extract further contextual features like time of day. After images are captured on this cell phone client, the calories are estimated by the server and provided back to the user on their cell phone display. The phone also has Wi-Fi capabilities for this data transfer.

3.2 Server

Recognition of the images is completed on our server. The server machine is a Pentium 4, 2.4 GHz CPU with 1 GB RAM. A running Apache server hosts a .NET platform website that links to a Matlab engine in conjunction with the Neural Networks Toolbox. The server machine resources allow for up to ten simultaneous client requests at a time. During classification, the Neural Network calculates weights, our contextual algorithm calculates weights, and these weights are merged to calculate a final set of weights. The food item is then classified based on these weights, as described in Section 8. On classifying the food, the corresponding calorie estimate is obtained via a lookup table, and the calorie range is then made available to the website and the user's cell phone screen.
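As a concrete illustration of this final lookup step, the sketch below shows how a classified label could be mapped to a calorie range. This is a minimal Python sketch; the thesis system ran on a .NET/Matlab stack, and the ranges shown are illustrative placeholders rather than values from the actual lookup table.

```python
# Illustrative sketch of the server-side calorie lookup step (Section 3.2).
# The ranges below are placeholder values, not the thesis's actual table.
CALORIE_RANGES = {
    "Hamburger":      (250, 550),   # placeholder range in kcal
    "Fries":          (220, 500),   # placeholder range in kcal
    "Chicken nugget": (40, 60),     # placeholder range per piece, kcal
    "Apple Pie":      (230, 320),   # placeholder range in kcal
}

def lookup_calories(food_class: str) -> tuple:
    """Return the (low, high) calorie range for a classified food item."""
    return CALORIE_RANGES.get(food_class, (0, 0))

# Example: after the classifier labels the image "Fries",
# the server returns the corresponding range to the phone client.
low, high = lookup_calories("Fries")
print(f"Estimated calories: {low}-{high} kcal")
```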

3.3 Data Transfer

Data transfer occurs for three main items: the user's food image, the context information gathered from the phone, and the calorie range returned from the server. We have implemented this client-server architecture keeping modularity as the highest priority in our design decisions. The tools and kits that the student had to learn and that were used to make this possible are listed in Appendix A. Since this client-server interaction was not the main focus of this senior thesis project, more emphasis is given in this thesis to the food recognition methods and algorithms involved.

4 Food Image Recognition Aspects

The task of food image recognition is very complex, and there are many aspects that need to be considered or ignored. Some of these aspects are listed below.

4.1 Common Image Recognition Aspects

Some common image recognition aspects are as follows:

Scaling/Perspective: Images of food items can be taken from different distances. This changes the perceived size of the food with respect to the image, making it hard to distinguish whether it is a small, medium, or large food item, and hence to deduce the exact number or even range of calories contained in the image.

Rotation: Images of food items taken at different rotations need to be considered. While a hamburger looks more or less the same under varying rotations, fries do not, since their height and width change drastically for shifted and rotated coordinate systems in the same plane.

Translation: When food items are captured at a different angle, they might look completely different from a top-down or other view. Features such as color ranges, textures, shapes, and other such data in the image may change drastically. A hamburger from the top view will show the bun, but from a side view will show less of the bun and more of the burger.

Variation in Camera Quality: It is important to note that not all cell phones have the same camera quality. Not even all digital or other cameras have the same quality, and settings will differ from person to person. This introduces variation in the quality of food images.

Variation in Lighting: Not all images will be taken under the same lighting conditions. A change in the direction, type, or amount of light can greatly vary the appearance of a food item. Sometimes a shadow may not be detected, and this may distort many extracted features of the food item.

Variation in Color: Different lighting conditions can change the color of a food item. Different cooking practices or setups can also change the color of a food item, and it is hard to capture these differences. A burnt chocolate chip cookie may appear to be a chocolate chocolate chip cookie, or the presence of food coloring may change items drastically in this respect.

4.2 Food Specific Image Recognition Aspects

Some common food specific image recognition aspects are as follows:

Location in Image: The user may not place the food in the same position every time they order a particular type of food. Perhaps over one meal a user may place food back in similar positions, but maybe not over different meals. This matters in cases where the actual number of calories consumed needs to be estimated based on before/after images of the food.

Variation in Shape: Not all food items will have the same shape every time. Some examples include mashed potatoes and chicken nuggets. This adds to the complexity of food recognition algorithms.

Variation in Texture: Different lighting conditions and cooking conditions, practices, or ingredients may also affect the texture of a food sample. Using all-purpose flour instead of wheat flour for a particular style of pasta, or adding significantly more water or milk to mashed potatoes than before, are a couple of examples.

Occlusion: It is difficult to tell whether a certain ingredient exists or not, because it can be covered by other layers of the food. For example, it can sometimes be difficult to tell whether a hamburger has a slice of cheese, or whether the user has included mayonnaise in his or her sub.

Hidden ingredients: Certain hidden ingredients exist in foods, such as sugar, salt, butter, or certain spices. Other hidden aspects that cannot be captured by an image include temperature. For example, it is difficult to tell whether tea has sugar in it, is hot, or is iced tea.

Un-cluttered Background: A cluttered background makes it difficult to detect objects in general, but in the case of food there is usually an even plate or dish, or just the design of a border to remove.

4.3 Contextual Information Aspects

Contextual information comes into play to limit the difficulties mentioned above. We explain this concept in detail in Section 2, and in Section 8 we show how context information such as the examples listed below can be combined with image recognition to achieve improved automatic recognition of food.

User variation: Users themselves can vary greatly from person to person or situation to situation. Learned user habits (eating cereal with sugar in it, or a chicken sandwich with mayonnaise instead of ketchup), user preferences (disliking foods with eggplant or spinach), and user needs or restrictions (such as being vegetarian, lactose intolerant, or unable to eat foods with sugar due to a medical problem) can help us limit the comparison of a food item to specific foods rather than the entire database of possible food items.

Environment variation: The environments in which users consume food can vary greatly from person to person or situation to situation. The time of day (breakfast items being eaten during the morning), the weather (hot tea on a cold day as opposed to iced tea), the location (the food set at Pizza Hut versus the food set at Burger King), and so on can have an impact on what a certain food item might be.

Text Recognition: By performing text recognition on nearby items or a grocery bill, it is possible to limit the food set further or detect items that cannot be detected from an image alone. It can sometimes be difficult to distinguish root beer from Pepsi or Coke, and from their diet or caffeine-free counterparts. Textual context from the soda can or bottle can solve this problem.

Other: Other types of context include tasks performed the same day, taking a picture during cooking or when setting up ingredients for cooking, and so on. Using information from sensors for temperature, texture, and movement detection of the method used by the user to eat the food, and other such context, can also help in making a better decision about what the food item is. By learning what combinations of food are usually eaten by the user, other options can be filtered out. As more contextual information can be gathered about a particular situation, more accuracy will be brought to the field of general automatic food recognition.

4.4 Our Assumptions

Based on the above aspects of food image recognition, we have considered images with the following assumptions:

Non-touching objects: To limit the possibility of objects being occluded, and also for segmentation purposes.
Complete objects: To avoid missing the correct size of an object and hence miscalculating its calories.
Reference object: To correct variation in scaling, color, and lighting.
Single colored background: To take advantage of the fact that most foods are eaten on a plate or plain background.
Background lighter than objects: To allow for a simplified adaptive thresholding technique for background removal.
Minimal shadows and no flash lighting: To prevent cases where food item features are compromised due to improper detection of shadows or flash spots.

Figure 4 shows an image that fits our assumptions and demonstrates how the food items can easily be segmented as part of the initial steps of our image recognition algorithm.

Figure 4. Example of an image that fits our assumptions, and the same image after preprocessing, background removal, and segmentation

5 Food Recognition Algorithm

The user places a designated reference object next to the food to be recognized, and a picture is taken. After preprocessing, background removal, and segmentation, a special reference object based feature extraction algorithm is run on the detected food components to obtain relative color and size features. Shape and texture based features are also extracted. Our non-contextual feature vector is then input to the Neural Network (NN) classifier. We have employed a two-layer feed forward back propagation (FFBP) NN. The output layer consists of 4 neurons, representing each of the four food classes {Hamburger, Fries, Chicken nugget, Apple Pie} under consideration here. The output with the lowest Euclidean distance from a true-value NN output vector, provided it is below our experimentally calculated threshold, is accepted as a valid food item. Figure 5 shows the high-level flow of the non-context based food recognition and calorie estimation algorithm.

Figure 5. Non-context features based Training and Classification

When additional context information is also collected, weights for the context and non-context cases are calculated separately for each of the possible food classes the item may belong to. The Law of Total Probability is applied to calculate merged context and non-context probabilities for each of these classes [12]. For this, probabilities are weighted according to the number of features contributed by each case out of the total feature set.

Figure 6. Context and non-context features based Training and Classification

Figure 6 shows the high-level flow of the context-based food recognition and calorie estimation algorithm.

5.1 Background Work

An initial phase of this project involved studying the different image recognition techniques being used to classify different foods and understanding the underlying issues involved, as discussed in Section 1. It is also appropriate to mention that the student took Surgery for Engineers, a Robotics special topics course offered by Carnegie Mellon University, as an attempt to understand the relationships between the computing and medical disciplines. Since the student did not have prior image recognition related work or course background, a simple simulation of trained data and feature vectors was created to better understand the concepts involved in image recognition, and a preliminary model using Support Vector Machines, based on Cornell University's freely available SVMlight package, was studied to classify the generated vectors. After understanding the process of feature extraction and studying pattern recognition using neural networks [11], a neural network classification based approach was created for the problem at hand using the following steps.

5.2 Steps Before Feature Extraction

After scaling the input image to the desired size, minor lighting problems are resolved by adjusting the intensity of the grayscale version of the input image. We then perform adaptive equalization of the color histogram over different sub-parts of the entire image, and finally pass it through a linearly averaging kernel to remove some noise and blur. Minor lighting problems are resolved, but extreme lighting such as bright camera flashes is not handled well, so we assume the image is taken with the flash turned off.

Pixel values with minimum intensity are first removed, and the entire image is then normalized by dividing by the pixel value of maximum intensity. To correct miscalculated masked and unmasked pixels, improper regions are flipped using an adaptive thresholding method. This accounts for variance in color, lighting, texture, and other intensities over different sections of the image. Using this method, a binary segmentation mask is generated. This method works best for plain backgrounds in cases where the background is lighter than the foreground non-touching objects.

Disjoint background connected components are labeled using a 4-connected neighbors labeling algorithm [12]. Miscalculated background and foreground connected components are corrected based on area. The binary mask is then dilated and eroded to remove other unwanted noise before performing feature extraction on the detected objects (connected components). Figure 7 gives a visual overview of the steps involved in our algorithm before feature extraction.

Figure 7. Steps before Feature Extraction

5.3 Image Acquisition

The Nokia N95 cell phone camera is set to 4.0 megapixel resolution to capture the different images. A total of 200 images were captured in different conditions such as daylight, indoor light, flash, shadows, camera blur, scale variation, and rotation. This set of images comprises single food items (hamburger, apple pie, chicken nugget, or fries), each accompanied by a non-touching, easily recognizable special blue reference object. This reference object based approach is novel to the field of food recognition in the medical and health domain that we are working in. Each image is scaled down to 100x100x3 RGB pixels. Of these images, 60% are used for training and verification, while the remaining 40% are used for testing in each of our experiments, as described in more detail in Section 9 and Section 10.

5.4 Preprocessing

After scaling the input image to the desired size, minor lighting problems are resolved by adjusting the intensity of the grayscale version of the input image. We then perform adaptive equalization of the color histogram over different sub-parts of the entire image, and finally pass it through a linearly averaging kernel to remove some noise and blur. The resized image has fewer pixels, which allows the algorithm to run faster, but it also results in the loss of information. Although the equalized histogram allows different lighting and shading effects to be equalized over different areas of the image, extreme lighting problems such as bright camera flashes are not resolved. The linear averaging filter does not solve this problem either and also leads to information loss. Since these negative effects do not greatly affect the performance of the overall algorithm under our assumptions as listed in Table 3, they are not considered for the purposes of the described experiments.

Images will contain only:
Non-touching objects
One reference object
Complete objects
Objects darker than the background
A single colored background
Minimal shadows
No flash lighting

Table 3. General Image Assumptions
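The sketch below illustrates these preprocessing steps under the assumption of a Python stack with NumPy, SciPy, and scikit-image; the thesis implementation used Matlab, and the filter size and clip limit here are illustrative choices.

```python
# A minimal sketch of the preprocessing stage (Section 5.4), assuming a
# Python/NumPy stack; the thesis implementation used MATLAB.
import numpy as np
from scipy.ndimage import uniform_filter
from skimage import color, exposure, transform

def preprocess(rgb_image: np.ndarray) -> np.ndarray:
    """Resize, adjust intensity, equalize locally, and smooth an input image."""
    # Scale the image down (the thesis uses 100x100x3 RGB inputs).
    resized = transform.resize(rgb_image, (100, 100) + rgb_image.shape[2:],
                               anti_aliasing=True)

    # Work on the grayscale version and stretch its intensity range
    # to compensate for minor lighting problems.
    gray = color.rgb2gray(resized)
    adjusted = exposure.rescale_intensity(gray, out_range=(0.0, 1.0))

    # Adaptive (local) histogram equalization over image sub-regions.
    equalized = exposure.equalize_adapthist(adjusted, clip_limit=0.03)

    # Linearly averaging kernel to suppress residual noise and blur.
    smoothed = uniform_filter(equalized, size=3)
    return smoothed
```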

5.5 Background Removal

Pixel values with minimum intensity are first removed, and the entire image is then normalized by dividing by the pixel value of maximum intensity. To correct miscalculated masked and unmasked pixels, improper regions are flipped using an adaptive thresholding method. This correction technique is applied due to the variance in color, lighting, and other intensities over different sections of the image. Sub-image matrices are extracted, and an average threshold is calculated for each matrix. Pixels in the sub-image above the local threshold are marked as background or foreground depending on their color intensity values. Using this method, a black and white mask, as shown in Figure 8, is generated in which background pixels are blacked out.

Figure 8. Background Mask

This method mainly works for cases where the background is lighter than the foreground objects. In cases where touching objects are lighter or darker than each other, the adaptive algorithm masks out a portion of the lighter object. This can still be ignored if the lighter intensity object is not too small. However, if it is very narrow, it will lead to a false positive region in the detected background mask. To overcome these limitations, we only consider images that have lighter backgrounds with non-touching objects.
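The following is a minimal sketch of the block-wise adaptive thresholding described above; the sub-image (block) size is an assumption, since the thesis does not state its dimensions.

```python
# A minimal sketch of the block-wise adaptive thresholding used for
# background removal (Section 5.5). The block size is an assumption.
import numpy as np

def background_mask(gray: np.ndarray, block: int = 20) -> np.ndarray:
    """Return a boolean mask that is True for foreground (food) pixels."""
    # Normalize: subtract the minimum intensity, divide by the maximum.
    img = gray - gray.min()
    img = img / img.max()

    mask = np.zeros_like(img, dtype=bool)
    h, w = img.shape
    for r in range(0, h, block):
        for c in range(0, w, block):
            sub = img[r:r + block, c:c + block]
            # Average intensity of this sub-image is the local threshold.
            local_threshold = sub.mean()
            # Background is assumed lighter than the food objects,
            # so pixels above the local threshold are masked out.
            mask[r:r + block, c:c + block] = sub < local_threshold
    return mask
```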

5.6 Segmentation and Labeling

Disjoint background connected components are labeled using a 4-connected neighbors labeling algorithm. If the size of any of these connected components is below a certain threshold relative to the largest foreground connected component in the image, then the component is reversed to be part of the foreground. The foreground connected components are similarly labeled, and if the size of any of these connected components is below the same threshold relative to the largest foreground component, then it is reversed to be part of the background. After applying this small connected components removal algorithm to the adaptive normalized binary mask, the mask is dilated and eroded to remove other unwanted noise. Dilation allows the areas of foreground to grow and the holes to become smaller. Erosion of the mask allows for the removal of any extra dilation. The mask is then transposed onto the original image to obtain the segmented image, with the final detected objects separated from the background as shown in Figure 9. The final number of connected components is also noted to get a general idea of how many objects were detected in the image. This technique allows for the removal of unwanted components and for the segmentation of detected objects from the background for further processing. Connected components may sometimes be false positives, but these are processed out in further steps of the algorithm. For objects touching each other in the image, further segmentation would have to be performed to separate the objects. This case is ignored, since we are considering only non-touching objects.

Figure 9. Segmented Objects with Blue Reference Object
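A simplified sketch of this labeling and cleanup step is shown below, assuming SciPy and restricting the small-component correction to foreground components; the 10% relative area threshold is an illustrative assumption.

```python
# A minimal sketch of the segmentation and labeling step (Section 5.6).
import numpy as np
from scipy import ndimage

def clean_and_label(mask: np.ndarray, rel_area_threshold: float = 0.1):
    """Remove small connected components, then dilate/erode the mask."""
    # Label foreground components using 4-connectivity.
    structure = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]])
    labels, n = ndimage.label(mask, structure=structure)
    if n == 0:
        return labels, 0

    # Flip components whose area is too small relative to the largest one.
    areas = ndimage.sum(mask, labels, index=range(1, n + 1))
    keep = [i + 1 for i, a in enumerate(areas)
            if a >= rel_area_threshold * areas.max()]
    cleaned = np.isin(labels, keep)

    # Dilate to grow foreground regions and shrink holes, then erode
    # to undo the extra dilation.
    cleaned = ndimage.binary_dilation(cleaned, structure=structure, iterations=2)
    cleaned = ndimage.binary_erosion(cleaned, structure=structure, iterations=2)

    # Relabel and report the final number of detected objects.
    labels, n = ndimage.label(cleaned, structure=structure)
    return labels, n
```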

6 Feature Extraction Details

Color and size based features are easy to extract and keep the feature extraction process fast, while texture and shape features allow us to overcome overlaps in color intensities and other features.

6.1 Feature Vector

The feature vector for a detected object consists of features extracted from the connected component representing the object. A feature vector F is given as

F = [color features ; size features ; texture features ; shape features]

The values of all features are normalized into the interval [-1, 1] before being input to the Neural Network.

6.2 Reference Object

A special, known, predetermined static reference object that is the same across all images is used as a basis for extracting features such as size and color. Our special reference object based feature extraction method allows for the correction of errors, since features are compared against trained models of food objects with respect to the same reference object. Without the reference object, it becomes difficult, as described in Section 4, to understand the size of an item or its color, since these may vary with the distance of the food item from the camera, or with lighting and other conditions that affect the entire image. This special reference object based feature extraction algorithm is run on the labeled connected components derived from the segmentation and labeling stage.

6.3 Color Based Features

Color based features include the normalized mean intensities along the red, green, and blue channels as compared to the color based features extracted from the reference object included in the training or test image. If detected_obj represents the detected object mean colors and ref_obj represents the reference object mean colors, we represent the normalized color features as

R_diff = detected_obj_r - ref_obj_r
G_diff = detected_obj_g - ref_obj_g
B_diff = detected_obj_b - ref_obj_b
RGB_diffs = R_diff + G_diff + B_diff

R_feature = R_diff / RGB_diffs
G_feature = G_diff / RGB_diffs
B_feature = B_diff / RGB_diffs
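The formulas above translate directly into the following sketch (Python/NumPy assumed), where the mean colors are computed over each object's connected component.

```python
# A minimal sketch of the reference-object-relative color features
# (Section 6.3), following the formulas above.
import numpy as np

def color_features(detected_rgb_mean: np.ndarray, ref_rgb_mean: np.ndarray) -> np.ndarray:
    """Normalized R/G/B differences between a detected object and the reference object."""
    diffs = detected_rgb_mean - ref_rgb_mean   # [R_diff, G_diff, B_diff]
    total = diffs.sum()                        # RGB_diffs (assumed nonzero)
    return diffs / total                       # [R_feature, G_feature, B_feature]

# Example with hypothetical mean colors for the two components.
detected = np.array([0.55, 0.40, 0.20])    # hypothetical detected-object means
reference = np.array([0.10, 0.20, 0.70])   # hypothetical blue reference means
print(color_features(detected, reference))
```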

6.4 Size Based Features

The size based feature is the ratio between the area of the detected object and that of the reference object, where area is represented by the number of pixels in the respective connected components. If N_DO represents the number of pixels in the detected object and N_RO represents the number of pixels in the reference object, then

Area_feature = N_RO / N_DO

6.5 Texture Based Features

The texture of the object is characterized statistically by three different feature methods, as follows.

Local Entropies: We calculate the local entropy over different neighborhoods in the connected component. Each neighborhood is a 15x15 matrix around a pixel.

p = imhist(neighborhood)            % histogram of local area
local_entropy = -sum(p .* log(p))   % entropy of local area
texture_feature = mean(all local entropies)

Local Standard Deviations: Similarly, local standard deviations over 15x15 sub-matrices of the object are calculated and their mean is taken.

Local Ranges: Similarly, local ranges over 15x15 sub-matrices of the object are calculated and their mean is taken.

6.6 Shape Based Features

In addition, the shape of the object is characterized by three different feature methods based on special connected component region properties.

Eccentricity: The eccentricity of the ellipse encapsulating a detected connected component is calculated. Eccentricity lies in the range between 0 and 1 and is defined as the ratio of the distance between the center and a focus to the semi-major axis of the ellipse. A value nearer to 0 means the ellipse is more circular, and a value farther from 0 means it is more elongated. Eccentricity, ε, is given by

ε = c/a

where c is the distance of the focus from the center and a is the semi-major axis.

Axes Ratio: The ratio between the major and minor axes of the encapsulating ellipse is also calculated as a measure of shape. This gives us an idea of the height-to-width ratio of the object.

Axes_feature = major axis / minor axis

Convex Hull Vertices: In addition, the number of vertices of the smallest enclosing convex polygon is counted as a measure of shape. A circular shape will have a more closely fitting convex polygon, so its number of vertices will be much greater than that of a rectangular shape, and so on.
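As an illustration of the texture features in Section 6.5, the sketch below computes the mean local entropy, standard deviation, and range; for brevity it tiles the object into non-overlapping 15x15 patches, whereas the thesis uses 15x15 neighborhoods around each pixel.

```python
# A minimal sketch of the local-entropy, local-std, and local-range texture
# features (Section 6.5) over an object's grayscale pixels in [0, 1].
import numpy as np

def local_entropy(patch: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy of the intensity histogram of one neighborhood."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0].astype(float)
    p /= p.sum()
    return float(-np.sum(p * np.log2(p)))

def texture_features(gray_object: np.ndarray, win: int = 15) -> tuple:
    """Mean local entropy, standard deviation, and range over the object."""
    entropies, stds, ranges = [], [], []
    h, w = gray_object.shape
    for r in range(0, h - win + 1, win):
        for c in range(0, w - win + 1, win):
            patch = gray_object[r:r + win, c:c + win]
            entropies.append(local_entropy(patch))
            stds.append(patch.std())
            ranges.append(patch.max() - patch.min())
    return float(np.mean(entropies)), float(np.mean(stds)), float(np.mean(ranges))
```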

6.7 Context Based Features

In addition to the physical features extracted from the image components, we also extract contextual features from the user and environment. We have tested the use of two features, namely time of day and user preference. The time of day is obtained from the system calendar time for the requested time zone, and the user preference is obtained from information regarding whether the user likes a particular item or not. When using time of day, if it is morning, then breakfast items are weighted higher than other items as per the specific user's profile, and similarly for lunch and dinner. For user preference, the user can either like or dislike an item, and this is noted as part of the user's profile.

Other pieces of contextual information that have been implemented but not yet integrated with the system include a count of the number of main dish items detected, since in most cases the user generally orders one main course. A third piece of useful context information that has been implemented is GPS data about the location of the user. Once integrated with our system, it will tell us which location based database to use for our image matching and recognition.
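A minimal sketch of how these two contextual features could be turned into per-class weights is given below; the meal-time boundaries and the profile format are illustrative assumptions rather than the thesis's exact implementation.

```python
# A minimal sketch of turning the two contextual features into per-class
# weights (Section 6.7). The meal-time mapping and the profile contents
# are hypothetical.
from datetime import datetime

CLASSES = ["Hamburger", "Fries", "Chicken nugget", "Apple Pie"]

# Hypothetical user profile: which classes the user likes, and which classes
# the user tends to eat at each meal time.
USER_PROFILE = {
    "likes": {"Hamburger", "Fries", "Apple Pie"},
    "meals": {
        "morning":   {"Apple Pie"},
        "afternoon": {"Hamburger", "Fries"},
        "evening":   {"Hamburger", "Fries", "Chicken nugget"},
    },
}

def time_of_day_weights(now: datetime) -> dict:
    """Equal weight over classes valid for the current meal time; others get 0."""
    hour = now.hour
    meal = "morning" if hour < 11 else ("afternoon" if hour < 17 else "evening")
    valid = USER_PROFILE["meals"][meal]
    return {c: (1.0 / len(valid) if c in valid else 0.0) for c in CLASSES}

def preference_weights() -> dict:
    """Equal weight over classes the user likes; disliked classes get 0."""
    liked = USER_PROFILE["likes"]
    return {c: (1.0 / len(liked) if c in liked else 0.0) for c in CLASSES}
```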

7 Neural Network Based Food Classification

Now that we have seen where the features come from, and before discussing our training and recognition algorithms and results, we describe our Neural Network classification model in detail. We have employed a feed forward back propagation (FFBP) neural network [13]. The first layer of the net comprises 10 inputs, each representing a normalized element of the feature vector extracted from the detected object as described in the previous section. These inputs to our network are limited to a range between the minimum and maximum values in our normalized feature vectors, which are guaranteed to be between -1 and 1 after normalization. The hidden layer of the network includes 10 neurons, since experimentally too many hidden neurons decrease efficiency while too few decrease reliability. The output layer consists of 4 neurons, one neuron representing each of the four food classes {Hamburger, Fries, Chicken nugget, Apple Pie} under consideration here. We use a logarithmic sigmoid differentiable transfer function for the first layer, and a hyperbolic tangent sigmoid differentiable transfer function for the second layer, to calculate the values at the output nodes. These are common transfer functions for pattern recognition uses of neural networks [11]. Figure 10 displays a high-level representation of our neural network classification model.

Figure 10. Neural Network (NN) Model

7.1 Training the Neural Net

During training of the neural network, we input a set of images with one training object in each image and one reference object accompanying each training object. We also input the respective target classes (e.g., [0; 1; 0; 0]) for the training set of images. The training function of our feed forward back-propagation network is based on gradient descent to reach our Mean Square Error (MSE) goal. For a true output food class vector {t1, t2, t3, t4}, the MSE in our case is defined by

MSE = (Σ (ti - xi)^2) / 4

where ti is the true value of the food classification vector and xi is the estimate of ti (the output value from the NN). We allow for up to 2500 epochs to reach the target Mean Square Error goal using a gradient descent algorithm. An example of how this works is given in Figure 11. The X-axis represents the number of epochs (rounds of training the net), while the Y-axis shows the Mean Square Error (MSE) and how it moves towards the desired target error goal line. In this case, we have set the MSE goal to 0.08 for reasons described below, with a fixed momentum constant.

Figure 11. Training Convergence Example

In the graph above, we reach our target MSE within 192 epochs, which is less than the maximum of 2500 allowed epochs. Initially, the error is very far from the goal, since the network weights have not yet been adjusted and readjusted enough to lead to a good classification. Slowly, with more epochs or rounds of training, we get closer to the MSE goal because of our adaptive learning rate for adjusting the weights and bias values in each subsequent training epoch. Sometimes misleading information may cause unevenness, so the curve may not be smooth throughout. A higher momentum constant is directly linked to how far the gradient descent will jump in trying to get closer to the true minimum. When a local minimum is hit and the jump is not high or low enough, the training may not reach the target goal for a while.

Our parameters have been set according to several considerations. If the number of epochs is decreased, there may not be enough training rounds to reach our MSE goal, and if the number of epochs is increased, this may cause more training than needed. In the case of our MSE goal, if the error goal is decreased too much, training may reach more local minima and may over-train at a wrong point with misleading information or features. However, if the error goal is increased, then we may not reach a low enough error rate, which will cause insufficient training and more incorrect classifications. According to this training goal, our neural network will roughly misclassify about 8 out of 100 pictures. We can further improve this by lowering the error goal and increasing the number of training images.
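For readers who want to see the model concretely, the following is a minimal NumPy re-implementation sketch of the network described above (10 inputs, 10 log-sigmoid hidden neurons, 4 tanh-sigmoid outputs) trained by plain gradient descent on the MSE. The thesis used the Matlab Neural Networks Toolbox; the learning rate here is an assumption, and momentum and the adaptive learning rate are omitted.

```python
# A minimal NumPy sketch of the two-layer FFBP network (Section 7).
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.5, size=(10, 10)), np.zeros(10)   # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(4, 10)), np.zeros(4)     # hidden -> output

def logsig(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Return hidden activations and the 4-element class output vector."""
    a1 = logsig(W1 @ x + b1)      # logarithmic sigmoid hidden layer
    a2 = np.tanh(W2 @ a1 + b2)    # hyperbolic tangent sigmoid output layer
    return a1, a2

def train_step(x, t, lr=0.05):
    """One gradient-descent update on MSE = mean((t - a2)^2) for one sample."""
    global W1, b1, W2, b2
    a1, a2 = forward(x)
    # Backpropagate the error through both layers.
    delta2 = (a2 - t) * 0.5 * (1.0 - a2 ** 2)     # d(MSE)/d(z2), 4 outputs
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)    # d(MSE)/d(z1), 10 hidden
    W2 -= lr * np.outer(delta2, a1); b2 -= lr * delta2
    W1 -= lr * np.outer(delta1, x);  b1 -= lr * delta1
    return float(np.mean((t - a2) ** 2))

# Example: one normalized 10-feature vector labeled as class 2 (Fries).
x = rng.uniform(-1, 1, size=10)
t = np.array([0.0, 1.0, 0.0, 0.0])
for epoch in range(200):
    mse = train_step(x, t)
```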

7.2 Classifying Outputs of Neural Net

During recognition, Euclidean distances are calculated between each NN output and the true values of each class. True values are of the form [1; 0; 0; 0], [0; 1; 0; 0], and so on, while NN estimations are real-valued vectors. Since not all output values will be of the form where only one element has a high value and the others have low values (e.g., [0.12; 0.35; -0.43; 0.28]), we cannot merely accept the highest value as indicating a food object. We must also account for cases where a component was falsely detected and is not an actual food item. For these reasons, the lowest distance below our experimentally calculated threshold is taken to indicate the classified food item. If the training values are given by X = [x1; x2; x3; x4] and the classification values are given by Y = [y1; y2; y3; y4], then the squared Euclidean distance, d, used to detect the closest food class is

d = Σ (xi - yi)^2

If the lowest distance is less than T (where T is an experimental threshold), then we recognize the food as part of a food class. Given that the object is a valid food item, the probability for each possible class of the NN output, o, is then calculated as

P_NC = Pr(o = c_i_NC) = (1/d_i) / Σ_j (1/d_j)

where c_i_NC is the ith non-context (NC) class, d_i is the Euclidean distance of the current NN estimation from the true value of the ith class, N is the number of output classes, and j goes from 1 to N. These probabilities are later merged with the context-based derived probabilities for each class, to get a final set of probabilities for each possible food classification of the object under consideration, as described in Section 8.
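The decision rule and per-class probabilities defined above can be sketched as follows; the threshold value used here is an illustrative choice, not the experimentally calculated one.

```python
# A minimal sketch of the distance-based decision rule (Section 7.2).
import numpy as np

TRUE_VECTORS = np.eye(4)                     # [1;0;0;0], [0;1;0;0], ...
CLASSES = ["Hamburger", "Fries", "Chicken nugget", "Apple Pie"]

def classify(nn_output: np.ndarray, threshold: float = 0.5):
    """Return (class name, per-class probabilities), or (None, None) if no
    class is close enough to count as a valid food item."""
    # Squared Euclidean distance d_i from each true class vector.
    d = np.sum((TRUE_VECTORS - nn_output) ** 2, axis=1)
    if d.min() >= threshold:
        return None, None                    # falsely detected component
    # P_NC(i) = (1/d_i) / sum_j (1/d_j); guard against a zero distance.
    inv = 1.0 / np.maximum(d, 1e-12)
    probs = inv / inv.sum()
    return CLASSES[int(np.argmin(d))], probs

name, probs = classify(np.array([0.12, 0.85, -0.03, 0.10]))
```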

8 Contextual Improvements

As mentioned earlier, calorie estimation based on image recognition over a broad range of food items is a very difficult problem. There exist many physical and implicit variations in different foods. Contextual information from the user and cell phone can help overcome these limitations. The amount of sugar in tea and the presence of cheese in a hamburger are examples of hidden variations that cannot always be picked up by a mere photo. Users may have personal preferences with respect to many of these variations. User preferences can also tell us whether a user will ever buy a certain item, which allows us to eliminate unwanted items from our training set database. Such user preferences can be learnt by the system over time, and can add useful value to the classification segment of our algorithm. Currently, time-based context picked up from the system time field allows us to weight breakfast foods higher in the morning, and so on for other times of the day. We have also incorporated user preference based on whether the user likes a particular food or not.

Equal contextual probabilities are calculated for each of the context features based on the number of classes for which the context is valid. If the context is not valid for a particular class, then the probability is taken as 0. The individual contextual probabilities are then merged by the law of total probability as

Pr(x ∈ A) = (1/N_C) Σ_{i=1..N_C} Pr(x ∈ A | W_i)

where N_C is the number of context features and W_i is the ith context based class weight. To further merge the NN non-contextual weights and the contextual feature derived class weights, the probability that a valid item x is of class A is given by the Law of Total Probability [12] as

Pr(x ∈ A) = (N_NC / N_T) Pr(x ∈ A | W_NN) + (N_C / N_T) Pr(x ∈ A | W_C)

where N_NC is the number of non-context features, N_C is the number of context features, N_T is the total number of features, W_NN is the NN based class weight, and W_C is the context derived class weight.
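A minimal sketch of this merging step, assuming the class weights are passed in as 4-element arrays and using the feature counts described above (10 non-context features, 2 context features), is shown below.

```python
# A minimal sketch of merging context and non-context class probabilities
# (Section 8) by the Law of Total Probability.
import numpy as np

def merge_context(p_nn: np.ndarray, context_probs: list,
                  n_noncontext: int = 10) -> np.ndarray:
    """Combine NN class probabilities with context-derived class probabilities."""
    n_context = len(context_probs)
    # Average the individual context probabilities: (1/N_C) * sum_i Pr(A | W_i)
    p_context = np.mean(context_probs, axis=0)
    n_total = n_noncontext + n_context
    # Pr(A) = (N_NC/N_T) * Pr(A | W_NN) + (N_C/N_T) * Pr(A | W_C)
    return (n_noncontext / n_total) * p_nn + (n_context / n_total) * p_context

# Example with hypothetical weights for the four classes.
p_nn = np.array([0.40, 0.30, 0.20, 0.10])      # from the NN outputs
time_ctx = np.array([0.0, 0.5, 0.5, 0.0])      # time-of-day context weights
pref_ctx = np.array([1/3, 1/3, 0.0, 1/3])      # user-preference context weights
print(merge_context(p_nn, [time_ctx, pref_ctx]))
```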

9 Experimental Work

Various experiments were carried out during the course of this research project. Some of the key experiments that led to progressive steps are described below.

9.1 Experiment #1

Training convergence and recognition testing for only color and size based features extracted with respect to the reference object.

Training Setup and Convergence

Feature vectors comprising only color and size based features are extracted from the detected objects and normalized before being passed through the Neural Network classifier. The input is a set of n feature column vectors, each of size m, where n is the number of training images and m is the number of features extracted from each image. The output is a corresponding set of n target column vectors of size 4; each of the n training images has an output column vector with a 1 in the position of its target food class and a 0 in the other 3 food class positions. The neural network is trained using 30 training images for each of the four food classes. With every new epoch, biases and weights are altered by the back propagation of the mean-squared error (MSE) until we reach our desired training error goal. After observing the performance of the neural network over 10 trials, and after accounting for variability in performance, we allow for a maximum of 2500 epochs to reach our desired MSE goal of 0.08. In this experiment, 1295 epochs are required to reach this MSE on the 120 training images using our chosen momentum constant. These values, which are within our specified parameters, demonstrate how the neural network classification of our training feature vectors ultimately converges to the desired target outputs within the specified error. The graph in Figure 12 shows the MSE performance of our neural network based training algorithm as the number of epochs increases. We have included the graph so that the reader can visualize how our training algorithm converges within a reasonable number of epochs. At first there is random behavior, followed by a curve that eventually converges with the desired target behavior within our target MSE goal.

Figure 12. Mean-Squared Error (MSE) performance goal reached in 1295 epochs

Verification and Test Results

After successful completion of training, the trained network is first verified against 20 random images per class drawn from the set of images used during the training step. The trained neural network is further tested with 20 additional images per item, taken in different conditions, that had not been included during training. After performing image preprocessing, background removal, and feature extraction on the images, the feature vectors are normalized and then passed through the neural network for classification. The classification values are compared to the known values, and errors and successes are noted.

Verification Results

Food Category  | % Accuracy | % False Positives
Hamburger      |            |
Fries          |            | 2.5%
Chicken Nugget |            |
Apple Pie      |            |

Table 4. Verification results for object recognition over the 120 trained images

Testing Results

Food Category  | % Accuracy | % False Positives
Hamburger      |            |
Fries          |            |
Chicken Nugget |            |
Apple Pie      |            |

Table 5. Testing results for object recognition over the 120 trained images

The percent accuracy for a class is determined by counting the total number of true hits for that class and dividing by the total number of images tested for that class. The percent false positive figure is determined by counting the total number of times the class was falsely identified in place of another class, and dividing this number by the total number of images tested overall. The results of the verification step on 20 randomly selected images from among the training set of images for each food class are presented in Table 4. The results of detecting and classifying objects in 80 test images that were not used during training can be found in Table 5.

Discussion of Results

As we can see from the tables, the percent accuracy is quite low over all four food classes. In two cases it is near 50%, and in two cases it is near 70%-75%. We concluded that using only a reference object based relative size and color approach is not enough. The low accuracy necessitated extracting more features, since we wanted to maintain the same sizes for our test, training, and verification image sets as before.

9.2 Experiment #2

Non-contextual texture features added to the non-contextual color and size based feature vector.

Setup and Results

Under the same training image setup and a similar training process as above, but with the addition of our three texture features (local entropies, standard deviations, and min-max ranges), we obtained the results below for the number of correct hits. In all cases, the number of false positives was below 6.67%. Figure 13 shows our recognition results (number of hits) when using only non-contextual texture, color, and size based features.

Figure 13. NN hits with non-contextual texture, size, color features

Discussion of Results

As we can see from the figure, the percent accuracy is a little higher than before, but still quite low over all four food classes. In one case it is near 65%, and in two cases it is near 70%-75%. We concluded that using a reference object based relative size and color approach in conjunction with only texture based features is still not enough. The still-low accuracy necessitated extracting more features, since we wanted to maintain the same sizes for our test, training, and verification image sets as before.

9.3 Experiment #3

Non-contextual shape features added to the non-contextual texture, color, and size features.

Setup and Results

Under the same training image setup and a similar training process as in Experiment #2 described above, but with the addition of our three shape based features (eccentricity, axes ratio, convex hull vertices), we obtained the results below for the number of correct hits. In all cases, the number of false positives was below 3.33%. Figure 14 shows our recognition results (number of hits) when using only non-contextual shape, texture, color, and size based features.

Figure 14. NN hits with non-contextual shape, texture, size, color features

Discussion of Results

As we can see from the figure, there is an improvement in the number of hits for most of the four food classes. In two cases it is near 70% accuracy (detecting correct food items). We concluded that using only a reference object based relative size and color approach is not enough, and that the addition of texture and shape based features is also not enough, due to the varying nature of certain foods from image to image. The low accuracy necessitated extracting still more features, since we wanted to maintain the same sizes for our test, training, and verification image sets as before. For this, we moved on to using context as a way to gain more information about a particular food image before recognizing it.

9.4 Experiment #4

A comparison of contextual features used as inputs to the NN versus context features applied to the outputs of the NN, through user context profile simulations for two users (one with context features extracted and one without), in two sets of sub-experiments.

Setup

As a recap, the Nokia N95 cell phone camera is set to 4.0 megapixel resolution to capture the different images. A total of 200 images were captured in different conditions such as daylight, indoor light, flash, shadows, camera blur, scale variation, and rotation. This set of images comprises single food items (hamburger, apple pie, chicken nugget, or fries), each accompanied by a non-touching, easily recognizable special blue reference object. Each image is scaled down to 100x100x3 RGB pixels. Of these images, 60% are used for training and verification, while the remaining 40% are used for testing. For this experiment, we generated static user preference and dynamic time of day context information for one user and their mobile device according to typical user profiles, as motivated by previous experimental methods for context-aware mobile platforms [14]. The non-contextual results represent the second user.

Sub-experiments

Two sub-experiments were carried out to determine the proper placement of contextual information in our algorithm. The first allowed normalized contextual features to be added to our non-contextual feature vector containing the color, size, texture, and shape based features. This vector was then input to the NN, and the outputs were analyzed according to the steps mentioned in previous sections before making a classification. The second allowed contextual features to be weighted separately and then merged with the NN outputs to calculate final probabilities, as described in Section 8, before coming to a classification decision.

Sub-experiment Results

The results of the first sub-experiment, where context is used on the inputs to our Neural Network, are shown in Figure 15.

Figure 15. Hits when context added to NN inputs and compared to non-contextual hits

The results of the second sub-experiment, where context is used on the outputs of our Neural Network, are shown in Figure 16.

The results of the second sub-experiment, in which context is used on the outputs of our Neural Network, are shown in Figure 16.

Figure 16. Hits when context added to NN outputs compared to non-contextual hits

Discussion of Results
Both graphs compare our results with and without contextual information in food classification. When context is used on the inputs, as seen in Figure 15, we observe negative trends. Because context features vary far more from object to object than the regular non-context features (size, color, texture, shape), it is harder for the network to find a pattern in the contextual information given our limited image sets. This makes training more problematic, with greater susceptibility to local minima and to incorrect overtraining, and it lowers accuracy on the test images, as shown in the first graph's comparison against not using context at all. We therefore concluded that training is slower and less effective when context is placed on the inputs to our Neural Network classifier.

However, as seen in Figure 16, context features applied to the outputs of the NN can improve accuracy. Classification improved for 75% of the food classes with the addition of contextual features and was unaffected for the remaining 25%. Using context, the Hamburger, Fries, and Chicken Nugget classes achieved 35.71%, 21.42%, and 5.88% improvements in the number of correctly detected food items, respectively. The number of false positives for each class was below 3.33% for both implementations. Since contextual information can be obtained from a simple lookup table rather than consuming the resources of our Neural Network, we decided to place context on the outputs of our Neural Network classifier, rather than on the inputs, to improve the accuracy of our non-context based results.
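To make the chosen design concrete, the sketch below shows one plausible way the lookup-table idea could work: per-class context weights retrieved for the current time of day are blended with the NN's output weights before the final decision. The class ordering, the example weights, and the blending factor are all illustrative assumptions; the actual combination rule is the one described in Section 8.

import numpy as np

CLASSES = ["hamburger", "fries", "chicken nugget", "apple pie"]

# Hypothetical lookup table: how likely each class is at a given time of day.
TIME_OF_DAY_WEIGHTS = {
    "breakfast": np.array([0.10, 0.10, 0.10, 0.70]),
    "lunch":     np.array([0.40, 0.30, 0.20, 0.10]),
    "dinner":    np.array([0.35, 0.30, 0.25, 0.10]),
}

def classify_with_context(nn_outputs: np.ndarray, time_of_day: str, alpha: float = 0.3) -> str:
    """Blend the NN's per-class output weights with context weights, then pick a class."""
    combined = (1.0 - alpha) * nn_outputs + alpha * TIME_OF_DAY_WEIGHTS[time_of_day]
    return CLASSES[int(np.argmax(combined))]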

10 Evaluation
User context profile simulations for five users (four with context features extracted and one without).

10.1 Evaluation Setup, Scenario, and Results
We attempt to further verify how context information can improve the accuracy of the food class weights output by our NN, using a larger sample of users. We use the training and test images from exactly the same setup as in Experiment #4. For this experiment, we generated static user preference and dynamic time-of-day context information for four users and their mobile devices according to typical user profiles, as motivated by previous experimental methods for context-aware mobile platforms [14]. Because the same images are used for every person, the non-contextual NN outputs are identical in all cases (without contextual information there is nothing to filter or correct in the food recognition parts of the algorithm), so we selected a single user for the non-context system, who serves as our fifth user. Figure 17 compares our test results with and without time-of-day and user preference context in food classification over the five users.

Figure 17. Evaluation Results

10.2 Discussion of Results
We see that 75% of the classes showed improved classification with the addition of contextual features and 25% had nearly unaffected results. Using context, the Hamburger, Fries, and Chicken Nugget classes achieved 37.50%, 16.07%, and 5.88% improvements in the mean number of correctly detected food items, respectively. The average number of false positives for each class across all users was below 3.55% for both implementations. These positive results over more user profiles further strengthen the case for using context features in automatic food recognition.
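As an illustration of how such simulated profiles could be generated, the sketch below draws a static food preference weighting and per-meal time-of-day weightings for each of the four context users. The use of Dirichlet-distributed weights and the profile field names are assumptions, not the procedure used in the thesis.

import numpy as np

CLASSES = ["hamburger", "fries", "chicken nugget", "apple pie"]

def make_user_profile(rng: np.random.Generator) -> dict:
    """One simulated user: static food preferences plus time-of-day weightings."""
    return {
        "preference_weights": dict(zip(CLASSES, rng.dirichlet(np.ones(len(CLASSES))))),
        "time_of_day_weights": {
            meal: rng.dirichlet(np.ones(len(CLASSES)))
            for meal in ("breakfast", "lunch", "dinner")
        },
    }

rng = np.random.default_rng(42)
context_users = [make_user_profile(rng) for _ in range(4)]   # four context users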

11 Conclusions and Future Work
We conclude that it is possible to achieve nearly automatic recognition and monitoring of food through a wearable cell phone platform. Our test scenarios also show that context can improve, or at worst maintain, results relative to the non-context case.

11.1 Current Work
Now that we have an end-to-end system up and running, we are currently extending it to calculate how many calories the user has consumed from before/after meal images, using Scale Invariant Feature Transform (SIFT) feature vectors [15] to match test images against a trained image database with a Support Vector Machine (SVM) classifier [13] based algorithm. This will allow more items to be included in our database and provide greater extensibility. Other current work includes deciding on relevant user groups (college, working, and elderly adults, as well as medical personnel and caretakers) and planning the materials needed for our user studies (including the necessary IRB forms). User surveys and code to log useful information will also be written, and usability, accessibility, latency, satisfaction, and other metrics will be determined accordingly. Additional context will be implemented after determining context priorities from our user studies.

11.2 Future Work
Future plans and next steps include the following approaches.

Additional Features to Extract
Additional information that could be extracted from objects to improve the performance of current and future versions of the algorithm includes more color, size, texture, and shape based features. Keypoint descriptors such as the SIFT features described above will be very useful. Color based additions include the relative mean intensity, relative standard deviation, and relative correlation between the detected objects and the reference object. Texture data can be refined by counting Hough transform lines, and the gray level co-occurrence matrix contrast, correlation, energy, and homogeneity will provide more accurate texture details than the texture filters we have used so far. Shape descriptors can be improved with corner detection techniques based on region properties, edge detection, comparisons of the perimeter to other parameters, and similar additions that capture more of the visual characteristics of a food item in an image.
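As a rough illustration of the SIFT and SVM based matching mentioned under current work, and of the keypoint descriptors noted in this subsection, the sketch below extracts SIFT descriptors with OpenCV, pools them into a fixed-length vector, and trains a scikit-learn SVM over labeled food images. The mean-pooling step and the specific libraries are assumptions; the thesis does not detail the SIFT-to-SVM pipeline.

import cv2
import numpy as np
from sklearn.svm import SVC

sift = cv2.SIFT_create()

def sift_feature_vector(image_path: str) -> np.ndarray:
    """Mean-pool an image's 128-D SIFT descriptors into one fixed-length vector."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(gray, None)
    if descriptors is None:                        # no keypoints detected
        return np.zeros(128)
    return descriptors.mean(axis=0)

def train_food_svm(image_paths, labels) -> SVC:
    """Fit an SVM over pooled SIFT vectors for a list of labeled training images."""
    features = np.stack([sift_feature_vector(p) for p in image_paths])
    classifier = SVC(kernel="rbf", probability=True)
    return classifier.fit(features, labels)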

More Contextual Information
Contextual information has further scope to filter the training image database so that matching is limited to relevant training data. We will also present the user with dynamically reordered options according to our contextually prioritized lists. Examples include using the context of other detected items (for example, only one main dish is possible), identifying other user preferences (no sugar, no fat, no beef, vegetarian only, etc.), and using information obtained from a user's grocery or restaurant bill. Location-based GPS context to filter the database (McDonald's versus Pizza Hut) will also serve as promising contextual information, so long as the user eats according to their learned profile. Adding text recognition capabilities to read information on items such as food packages and soda cans will also be useful. Further research can examine how much users actually deviate from their profiles, and cognition aspects may also be considered.

Activity Recognition and Other Sensors/Platforms
We plan to combine our system with an easily accessible status bar on the e-watch platform and with activity recognition sensors to provide the user with an enhanced wearable nutrition management experience.

Figure 18. The DiaWear Vision

Researching the placement of sensors on the body, analyzing cell-phone accelerometer readings, and seamlessly integrating the ArmBand, e-watch, and other sensor readings will be given priority, as shown in Figure 18.
