arxiv: v1 [cs.cv] 5 May 2016


Learning Action Maps of Large Environments via First-Person Vision

Nicholas Rhinehart, Kris M. Kitani
The Robotics Institute, Carnegie Mellon University
{nrhineha,

Abstract

When people observe and interact with physical spaces, they are able to associate functionality with regions in the environment. Our goal is to automate dense functional understanding of large spaces by leveraging sparse activity demonstrations recorded from an egocentric viewpoint. The method we describe enables functionality estimation in large scenes where people have behaved, as well as in novel scenes where no behaviors are observed. Our method learns and predicts Action Maps, which encode the ability of a user to perform activities at various locations. By using an egocentric camera to observe human activities, our method scales with the size of the scene without the need to mount multiple static surveillance cameras, and is well suited to the task of observing activities up close. We demonstrate that by capturing appearance-based attributes of the environment and associating these attributes with activity demonstrations, our proposed mathematical framework allows for the prediction of Action Maps in new environments. Additionally, we offer a preliminary glimpse of the applicability of Action Maps by demonstrating a proof-of-concept application in which they are used in concert with activity detections to perform localization.

1. Introduction

The goal of this work is to endow intelligent systems with the ability to understand the functional attributes of their environment. Such functional understanding of spaces is a crucial component of holistic understanding and decision making by any agent, human or robotic. Functional understanding of a scene can range from the immediate environment to the distant.
For example, at the scale of a single room, a person can perceive the arrangement of tables, chairs, and computers in an office environment, and reason that they could sit down and type at the computer. People can also reason about the functionality of nearby rooms; for example, the presence of a kitchen down the hall from the office is useful functional and spatial information for when the person decides to prepare a meal. The goal of this work is to learn a computational model of the functionality of large environments, called Action Maps (AMs), by observing human interactions and the visual context of those actions within a large environment.

Figure 1: Action Map prediction for the sit activity by using our method to combine appearance data and activity observations. Activity and appearance information from the top scene in combination with only appearance information (no activity observations) from the bottom scene is used to model the relationship between activities, scene information, and object information to make predictions for both scenes. Areas in the scenes where a person can sit are estimated by our method, such as the chairs and couches in both views.

There has been significant work in the area of automating the functional understanding of an environment, though much has focused on single scenes [9, 4, 11, 10, 2, 5]. In this work, we aim to extend automated functional understanding to very large spaces (e.g., an entire office building or home). This presents two key technical challenges: How can we capture observations of activity across large environments? How can we generalize functional understanding to handle the inevitable data sparsity of less explored or new areas?

In order to address the first challenge of observing activity across large environments, we depart from the fixed surveillance camera paradigm and propose an approach that uses a first-person point-of-view camera. By virtue of its placement, its view of the wearer's interactions with the environment is usually unobstructed by the wearer's body and other elements in the scene. An egocentric camera is portable across multiple rooms, whereas fixed cameras are not. An egocentric camera allows for the observation of hand-based activities, such as typing or opening doors, as well as the observation of some ego-motion based activities, such as sitting down or standing. The first-person paradigm is well suited for large-scale sensing and allows observation of interactions with many environments.

Although we can capture a large number of observations of activity across large environments with wearable cameras, it is still not practical to wait to observe all possible actions in all possible locations. This leads to the second technical challenge of generalizing functional understanding from a sparse set of action observations, which requires generalization to new locations. Our method generalizes by using another source of visual observation, which we call side-information, that encodes per-location cues relevant to activities.
In particular, we propose to extract visual side-information using scene classification [24] and object detection [6] techniques. With this information, our method learns to model the relationship between actions, scenes, and objects. In a scene with no actions, we use scene and object information, coupled with actions in a separate scene, to infer possible actions. We propose to solve the problem of generalizing functional understanding (i.e., generating dense AMs) by formulating the problem as matrix completion. Our method constructs a matrix where each row represents a location and each column represents an action type (e.g., read, sit, type, write, open, wash). The goal of matrix completion is to use the observed entries to fill the missing entries. In this work, we make use of Regularized Weighted Non-Negative Matrix Factorization (RWNMF) [7], allowing us to elegantly leverage side-information to model the relationship between activities, scenes, and objects, and predict missing activity affordances.

Figure 2: Projected Action Map examples learned by our method (panels: estimated opendoor, typing, sit, and wash Action Maps). With global estimates of large Action Maps produced by our method, we use localized images within the scene to show visualizations of the Action Maps by projecting them to the images.

1.1. Contributions

To the best of our knowledge, this is the first work to generate Action Maps, such as those in Figures 1 and 2, over large spaces using a wearable camera. The first-person vision paradigm is an essential tool for this problem, as it can capture a wide range of visual information across a large environment. Our approach unifies scene functionality information via a regularized matrix completion framework that appropriately addresses the issue of sparse observations and provides a vehicle to leverage visual side-information.
We demonstrate the efficacy of our proposed approach on five different multi-room scenes: one home and four office environments. Our experiments in real large-scale environments show how first-person sensing can be used to efficiently observe human activity along with visual side-information across large spaces. 1) We show that our method can be used to model visual information from both single and multiple scenes simultaneously, and makes efficient use of all available activity information. 2) We show that our method's power increases as the set of performed activities increases. 3) Furthermore, we demonstrate how our proposed matrix factorization framework can be used to leverage sparse observations of human actions along with visual side-information to perform functionality estimation of large novel scenes in which no activities have been demonstrated. We compare our proposed method against natural baselines such as object-detection-based Action Maps and scene classification, and show that our approach outperforms them in nearly all of our experiments. 4) Additionally, as a proof-of-concept application of the rich information in an Action Map, we present an application of our Action Maps as priors for localization.

1.2. Background

Human actions are deeply connected to the scene. Scene context (e.g., a chair or common room) can be a strong indicator of actions (e.g., sitting). Likewise, observing an action like sitting is a strong indicator that there must be a sittable surface in the scene. In the context of time lapse video, Fouhey et al. [4] used detection of sitting, standing, and walking actions to obtain better estimates of 3D geometry for a single densely explored room. Gupta et al. [9] addressed the inverse problem of inferring actions from estimated 3D scene geometry using a single image of a room. Their approach synthetically inserted skeleton models into the 3D scene to reason about possible functional attributes of the scene. Delaitre et al. [2] also used time lapse video of human actions to learn the functional attributes of objects in a single scene. The work of Savva et al. [18] obtains a dense 3D representation of a small workspace (e.g., desk and chair space) and learns the functional attributes of the scene by observing human interactions.

Similar to previous work, this work seeks to understand the functionality of scenes. However, limitations of previous work include the reduced size of the physical space and the presumed density of interactions. In contrast, our approach attempts to infer the dense functionality over an entire building (e.g., office floor or house), and reasons about multiple large scenes simultaneously by modeling the relationship between scene information, object information, and sparse activities.

Another flavor of approaches reasons in the joint space of activities and objects. In Moore et al. [13], human actions are recognized by using information about objects in the scene. Gall et al. [5] use human interaction information to perform unsupervised categorization of objects. Other approaches have capitalized on the interplay between actions and objects: Gupta et al.
[8] demonstrate an approach that uses object information for pose detection, and Yao et al. [22] jointly model objects and poses to perform recognition of both objects and actions. The approach of [15] performs object recognition by observing human activities, and notes an important idea that our approach also uses: whereas object information may sometimes be too small in detail, human activities usually are not. We capitalize on this observation through the close-up observation capability of an egocentric camera.

The egocentric paradigm is an excellent method for understanding human activities at close range [20, 3, 16, 12]. Our work builds on such egocentric action recognition techniques by associating actions with physical locations in a single holistic framework. By bringing together ideas from single image functional scene understanding, object functionality understanding, and egocentric action analysis, we propose a computational model that enables cross-building level functional understanding of scenes.

2. Constructing Action Maps

Our goal is to build Action Maps that associate possible actions with every spatial location on a map over a large environment. We decompose the process into three steps. First, we build a physical map of the environment by using egocentric videos to obtain a 3D reconstruction of the scene using structure from motion. Second, we use a collection of human activity videos recorded with an egocentric camera to detect and spatially localize actions. This collection of videos is also used to learn the visual context of actions (i.e., scene appearance and object detections), which is later used as a source of side-information for modeling and inference. Third, we aggregate the localized action detection and visual context data using a matrix completion framework to generate the final Action Map. The focus of our method is the third step, which we describe next.
We mention how we obtain the visual context in Section 2.1.1, and describe the first two steps in detail in Section 3.2.

2.1. Action Map Prediction as Matrix Factorization

We now describe our method for integrating the sparse set of localized actions and visual side-information to generate a dense Action Map (AM) using regularized matrix completion. Our goal is to recover an AM in matrix form $R \in \mathbb{R}_+^{M \times A}$, where $M$ is the number of locations on the discretized ground plane and $A$ is the number of possible actions. Each row of the AM matrix $R$ contains the action scores $r_m$, where $m$ is a location index, and each entry $r_{ma}$ describes the extent to which an activity $a$ can be performed at location $m$. To complete the missing entries of $R$, we design a similarity metric for our side-information, enabling the method to model the relationship between activities, scenes, and objects. We impose structure on the rows and columns of the AM matrix by computing similarity scores with the side-information. Examples of this side-information are shown in Figure 3, where two features from scene classification plus one feature from object detection are shown in the same physical space as the AM. Figure 3 serves to further motivate the idea of exploiting scene and object information between two different scenes to relate the functionality of the scenes. We define three kernel functions based on scene appearance, object detections, and spatial continuity. This structure is integrated as regularization in the RWNMF objective function (Equation 2).

2.1.1. Integrating Side-Information

To integrate side-information into our formulation, we build two weighted graphs that describe the cross-location (row) similarities and the cross-action (column) similarities. We are primarily interested in the cross-location similarities, and

discuss how we handle the cross-action similarities in Section 2.2. To build the cross-location graph, we aggregate the spatial proximity, scene classification, and object detection information as a linear combination of kernel-based similarities, as shown in Equation 1. For every location $a$ in the AM, we compute the scene classification score $p_a = [p_{1a}, \ldots, p_{Ca}]$ as the average of the $C$-dimensional outputs from the Places-CNN of images within a small radius. We use Structure-from-Motion (SFM) keypoints inside each detection to estimate the back-projected 3D location of the detected object in the environment by taking the mean of their 3D locations, which are then projected to the ground plane to form a set $D_f$ for each object category $f \in [1, \ldots, F]$. The SFM reconstruction is also used to localize images and is described further in Section 3.2. We calculate the object detection scores $o_a = [o_{1a}, \ldots, o_{Fa}]$ for each location $a$ as the max score of object detection of the nearby back-projected object detections $d \in D_f$ within an $r = 2$ grid-cell radius, exponentially weighted by its distance $z_d$ along the floor from the object:

$$o_{fa} = \max_{d \in D_f} \frac{1}{2 r^2 \pi} \exp\left(-\frac{z_d^2}{2 r^2}\right).$$

Figure 3: Several Office Flr. A and Office Flr. B features (panels: (a) Office Flr. A Features, (b) Office Flr. B Features). The office and corridor layers correspond to the features from the scene classification CNN, and the sit layer corresponds to the object detection CNN features aggregated across all sittable objects, which is also one of the baselines as described in Section 3.2. This figure demonstrates our idea that object information and scene information can be used to relate scenes to each other. This relationship is the basis for transferring and sharing activity functionality between scenes. Heatmaps from several layers are shown projected into localized images from the scene. Note that the office portion of Office Flr. A also contains sittable regions, and that the much larger office area in Office Flr. B contains a select few sittable regions. The corridors in both scenes are described well by the features, and these areas strongly correlate with an absence of functionality, as seen in Figure 1.

We wish to enforce similarity of activities between nearby locations, as well as between locations that have similar object detections and scene classification descriptions. Between any two locations $a, b$, and given associated scene classification scores $p_a, p_b$, object detection scores $o_a, o_b$, and 2D grid locations $x_a, x_b$, the kernel is of the form:

$$k(a, b) = (1 - \alpha)\, k_s(x_a, x_b) + \frac{\alpha}{2}\, k_p(p_a, p_b) + \frac{\alpha}{2}\, k_o(o_a, o_b), \qquad (1)$$

where $k_s$ is an RBF kernel between the spatial coordinates of each location, $k_p$ and $k_o$ are $\chi^2$ kernels on scene classification scores and object detection scores, and $k_o$ has 0 similarity between locations with no object score. Thus, there is a tradeoff between the $k_s$, $k_p$, and $k_o$ kernels, controlled by $\alpha$. When $\alpha = 0$, only spatial smoothness is considered, and when $\alpha = 1$, only scene classification and object detection terms are considered, ignoring spatial smoothness. When a location in one scene is compared to a location in a new scene or the same scene, $k(\cdot, \cdot)$ returns higher scores for locations with similar objects and places, and, as shown in Section 2.2, places a stronger regularization constraint on the objective function, rewarding solutions that predict similar functionalities for both locations.

2.2. Completing the Action Map Matrix

To build our model, we seek to minimize the RWNMF objective function in Equation 2:

$$J(U, V) = \left\| W \circ (R - U V^T) \right\|_F^2 + \frac{\lambda}{2} \sum_{i,j}^{M} \| u_i - u_j \|^2 K^U_{ij} + \frac{\mu}{2} \sum_{i,j}^{A} \| v_i - v_j \|^2 K^V_{ij}, \qquad (2)$$

where $U \in \mathbb{R}_+^{M \times D}$ and $V \in \mathbb{R}_+^{A \times D}$ together form the decomposition, $W \in \mathbb{R}_+^{M \times A}$ is the weight matrix with 0s for unexplored locations, and $K^U$ is the kernel Gram matrix of the side-information defined by its elements: $K^U_{ij} = k(i, j)$. The squared-loss term penalizes decompositions with values different from the observed values in $R$. The term involving $K^U$ penalizes decompositions in which highly similar locations have different decompositions in the rows ($u_i^T$) of $U$. Roughly, locations with high similarity in scene appearance, object presence, or position impose a penalty on the resulting decomposition for predicting different affordance values in the AM. The term involving $K^V$ corresponds to the cross-action smoothing, which we take as the identity matrix, enforcing no penalty for differences across per-location action labels.

[Table 1 columns: # GT locs., # Actions, Length (min), $r_e$, $r_a$. Rows: Office Flr. A, Office Flr. D, Office Flr. C, Office Flr. B, Home A.]

Table 1: Scene stats. The number of GT locations is the number of distinct places a specific activity can be performed. The number of activity demonstrations is the total number of demonstrations collected in each environment. $r_e = \frac{\#\text{cells explored}}{\#\text{total cells}}$, $r_a = \frac{\#\text{cells with non-empty actions}}{\#\text{total cells}}$.

To minimize the objective function, we use the regularized multiplicative update rules following [7]. Multiplicative update schemes for NMF are generally constructed such that their iterative application yields a non-increasing update to the objective function; [7] showed that these update rules yield non-increasing updates to the objective function. Thus, after enough iterations, a local minimum of the objective function is found, and the resulting decomposition and its predictions are returned. Values in $W$ are set to counteract class imbalance. The number of observed values for each activity is computed as $n_c$ and assigned to each nonempty location $i$'s corresponding entry as $w_{ic} = 1/n_c$, and the zeros from observed cameras associated with no activities are assigned $w = 1/n_z$.

3.
Experiments

Our dataset consists of 5 large, multi-room scenes from various locations. Three scenes, Office Flr. A, Office Flr. D, and Office Flr. C, are taken from three distinct office buildings in the United States, and another scene, Office Flr. B, comes from an office building in Japan. Each office scene has standard office rooms, common rooms, and a small kitchen area. A final scene, Home A, consists of a kitchen, a living room, and a dining room. See Table 1 for scene activity and sparsity statistics. Our goal is to predict dense Action Maps from sparse activity demonstrations. The first experiments (Section 3.3) measure our method's performance when supplied with all observed action data, which covers on average about half of all locations and some actions (see Table 1 for the coverage statistics). Additionally, this experiment compares against the performance of the spatial kernel-only approach, which serves to illustrate the utility of including side-information. However, as it takes some time to collect the observations of each scene, we present a second set of experiments (Section 3.4) to showcase our method handling fractions of the already sparse observation data while still maintaining reasonable performance. In Section 3.5, our third set of experiments shows that if our method is presented with novel scenes for which there are zero activity demonstrations, it can still make predictions in these new environments. This final set of experiments also investigates which side-information is most helpful for our task.

[Table 2 columns: W. Max F1, W. Mean F1, Max F1, Mean F1. Rows per scene (Of. Flr. A, Of. Flr. B, Of. Flr. C, Of. Flr. D, Home A): S sng, SOP D sng, SOP sng, SOP D all, SOP all; Of. Flr. B has no SOP D sng row.]

Table 2: Prediction results by using the activity observations for each scene ("sng"), and, as separate results, by simultaneously fitting data from all scenes ("all"). By using observations from all scenes, the performance of our method on each scene improves over using each scene's observation data alone. Additionally, our method is able to integrate activity detections without much performance loss: a D suffix indicates activity detection predictions were used; otherwise, labelled activities were used. S stands for spatial kernel only, and SOP stands for Spatial+Object Detection+Scene Classification kernels. The spatial kernel alone is useful yet outperformed by the full model. Side-information from multiple scenes generally improves the performance.

Figure 4: Performance improves as a function of available data (panels: (a) Office Flr. A Elapse, (b) Office Flr. D Elapse, (c) Home A Elapse). For each parameter setting, we show the $F_1$ scores for each activity label, as well as the mean and weighted mean of the $F_1$ scores across all parameter settings and activity labels. Some variations in performance are observed as new activities are introduced, as the correlations between established activities and newly introduced activities are initially sparse. As more data is collected, erroneous correlations are unlearnt, and correct ones are reinforced.

[Table 3 columns, per scene (Office Flr. B, Office Flr. C, Office Flr. D, Home A): W. Max F1, W. Mean F1, Max F1, Mean F1. Rows: RFC, Det., NMF, SO, SP, SOP.]

Table 3: Performance of our algorithm by using activity observations from Office Flr. A to make predictions in novel scenes. Each baseline method is run with a single parameter setting, and thus their maxes and means are equivalent. The baseline methods RFC, Det., and NMF correspond to the Random Forest Classification, Object Detection AMs, and non-regularized NMF augmented matrix approaches, respectively. Variants of our approach, SO, SP, and SOP, correspond to using Spatial+Object Detection kernels, Spatial+Scene Classification kernels, and Spatial+Object Detection+Scene Classification kernels. Multiple metrics are considered to observe the effects of ground-truth class imbalance, and means are used to quantify performance across a variety of parameter settings.

3.1. Performance scoring

To evaluate an AM, we perform binary classification across all activities and compute mean $F_1$ scores. We collect the ground truth activity classes for every image in the scene by retrieving them from labelled grid cells, as shown in Figure 5, in a small triangle in front of each camera, which represents the viewable space. We collect the predicted AM scores from the same grid cells and average the scores to produce per-image AM scores.
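As a rough illustration of this kind of evaluation, the sketch below averages binary $F_1$ over evenly spaced thresholds and then forms unweighted and count-weighted means over activity classes. The function names, the assumption that per-image AM scores lie in [0, 1], and the convention that $F_1 = 0$ when there are no true positives are illustrative choices, not details taken from the paper.

```python
def f1_score(y_true, y_pred):
    """Binary F1 for equal-length lists of 0/1 labels (0.0 if no true positives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def threshold_averaged_f1(y_true, scores, n_thresholds=100):
    """Mean F1 over evenly spaced thresholds in [0, 1] (score range is an assumption)."""
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    f1s = [
        f1_score(y_true, [1 if s >= t else 0 for s in scores])
        for t in thresholds
    ]
    return sum(f1s) / len(f1s)


def summarize(per_class_f1, class_counts):
    """Unweighted mean and ground-truth-count-weighted mean of per-class F1."""
    classes = sorted(per_class_f1)
    mean = sum(per_class_f1[c] for c in classes) / len(classes)
    total = sum(class_counts[c] for c in classes)
    weighted = sum(per_class_f1[c] * class_counts[c] / total for c in classes)
    return mean, weighted
```

The weighted mean makes frequent classes dominate the summary, while the unweighted mean treats rare classes such as wash on equal footing with common ones such as sit.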
We used 100 evenly spaced thresholds to evaluate binary classification performance by averaging $F_1$ scores across the thresholds. We report $F_1$ scores as opposed to the overall accuracy, as the overall accuracy of our method is very high due to the large amount of space in each scene with no labelled functionality (a large number of "true negatives"). The activity classes we use are sit, type, open-door, read, write-whiteboard, and wash. This set of activities provides good coverage of common activities that a person can do in an office or home setting. To summarize results, we compute the unweighted and weighted averages of per-class $F_1$ scores, where the weighted average is computed by using the normalized counts of the GT classes in the images.

Figure 5: Ground truth labels and SFM points in each scene (panels: (a) Office Flr. A GT, (b) Office Flr. B GT, (c) Office Flr. D GT, (d) Office Flr. C GT, (e) Home A GT, (f) Legend). Dotted lines indicate a doorway, solid lines indicate walls.

3.2. Preprocessing and parameters

The first step to build the AM is to build a physical map of the environment. We use Structure-From-Motion (SFM) [21] with egocentric videos of a walk-through of the environment to obtain a 3D reconstruction of the scene. Next, we consider two important categories of detectable actions: (1) those that involve the user's hands (gesture-based activities), and (2) those that involve significant motion of the user's head (egomotion-based activities). We used a deep network architecture inspired by [19] to perform activity detection, as the two-stream network takes into account both appearance (e.g., hands and objects) and motion (e.g., optical flow induced by ego-motion and local hand-object manipulations). When actions are detected by our action recognition module, we need a method for estimating the location of each action. We use the SFM model to compute the 3D camera pose of new images. As we define an AM over a 2D ground plane (floor layout), we project the 3D camera pose associated with an action to the ground plane.

To obtain a ground plane estimate, we fit a plane to a collection of localized cameras using SFM. We assume that the egocentric camera lies approximately at eye level, thus this height plane is tangent to the top of the camera wearer's head. We then translate this plane along its normal, while iteratively refitting planes with RANSAC to points in the SFM model. Once we have an estimate of the 2D ground plane in 3D space, we can use it to project the localized actions onto the ground plane. When dealing with multiple scenes, distances must be calibrated between them. We use prior knowledge of the user's height to form estimates of the absolute scale of each scene. Specifically, we use the distance between the ground plane and the user height plane, along with a known user height, to convert distances in the reconstruction to meters. Finally, we grid each scene with cells of size 0.25 meters (we use a radius of 2 grid cells, which is 0.5 meters after metric estimation).

Since actions are often strongly correlated with the surrounding area and objects, as shown in Figure 3, we also extract the visual context of each action as a source of side-information. For every image obtained with the wearable camera, we run scene classification and object detection with [24] and [6].
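The scale calibration and gridding steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the ground plane is given by a point and a unit normal, and the 2D in-plane coordinates are assumed to be already expressed in reconstruction units. The RANSAC plane fitting itself is not shown.

```python
import math

CELL_SIZE_M = 0.25  # grid cell size in meters, as stated in the text


def metric_scale(plane_gap_units, user_height_m):
    """Meters per reconstruction unit.

    plane_gap_units: distance between the ground plane and the user-height
    plane in reconstruction units; user_height_m: known user height.
    """
    return user_height_m / plane_gap_units


def project_to_plane(point, plane_point, unit_normal):
    """Orthogonally project a 3D point onto the plane through plane_point."""
    offset = sum((p - q) * n for p, q, n in zip(point, plane_point, unit_normal))
    return tuple(p - offset * n for p, n in zip(point, unit_normal))


def grid_cell(xy_units, scale, cell_size_m=CELL_SIZE_M):
    """Grid cell index of a 2D ground-plane position given in reconstruction units."""
    return tuple(math.floor(c * scale / cell_size_m) for c in xy_units)
```

For example, if the two planes are 3.4 reconstruction units apart and the user is 1.7 m tall, one unit corresponds to 0.5 m, so a camera projected to (1.0, 2.0) in plane coordinates falls in cell (2, 4).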
We use the pre-trained Places205-GoogLeNet model for scene classification, which yields 205 features per image, one per scene type, and a radius of 2 grid cells inside which to average the classification scores. For object detection, we use the pre-trained bvlc_reference_rcnn_ilsvrc13 model, which performs object detection for 205 different object categories, and use NMS with overlap ratio 0.3 and minimum detection score 0.5. We use a small grid of parameters for our method ($\alpha \in \{0, 0.1, 0.3, 0.5, 0.7, 0.9, 1\}$, $\lambda \in \{10^{-3}, 10^{-2}\}$, $\gamma \in \{100, 1000\}$), where each $\gamma$ is used for the $\chi^2$ kernels, and evaluate performance of multiple runs as the cross-run maximum and cross-run average of each of the various scores. In a scenario with many additional test scenes, a single choice of parameters could be selected via cross-validation. We also consider variations of our kernel that use different combinations of side-information: Spatial+object detection (SO), Spatial+scene classification (SP), and Spatial+object detection+scene classification (SOP). In the first two cases, the $\frac{\alpha}{2}$ weight of Equation 1 becomes $\alpha$ for the object detection or scene classification kernel that is on, and 0 for the other.

3.3. Full observation experiments

When all activity observations are available, our method is able to perform quite well. The dominant source of error is camera localization, which reduces the spatial precision of the AM. In Table 2, we evaluate the performance of our method run on each scene separately, as well as run once with all of the scenes in a single matrix. When multiple scenes are used, side-information is crucial: without it, there is no similarity enforced across scenes. In the single-scene case, we find that using a spatial kernel only can perform well, yet is generally outperformed by using all side-information, especially when side-information and activity demonstrations are present from other scenes.
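The combined kernel of Equation 1 and its SO, SP, and SOP variants can be sketched as below. Several details are assumptions made for illustration: the $\chi^2$ kernel is taken in its common exponential form $\exp(-\gamma \sum_i (a_i - b_i)^2 / (a_i + b_i))$, the RBF bandwidth `sigma` is a hypothetical parameter, and "no object score" is interpreted as an all-zero object vector at either location.

```python
import math


def rbf_kernel(x_a, x_b, sigma=1.0):
    """RBF kernel on 2D grid coordinates (sigma is an assumed bandwidth)."""
    d2 = sum((u - v) ** 2 for u, v in zip(x_a, x_b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def chi2_kernel(h_a, h_b, gamma=100.0):
    """Exponential chi-squared kernel on non-negative feature vectors (assumed form)."""
    dist = sum((u - v) ** 2 / (u + v) for u, v in zip(h_a, h_b) if u + v > 0)
    return math.exp(-gamma * dist)


def combined_kernel(x_a, x_b, p_a, p_b, o_a, o_b, alpha,
                    variant="SOP", gamma=100.0, sigma=1.0):
    """Equation 1, with the alpha/2 weights collapsing to alpha for SO and SP."""
    k_s = rbf_kernel(x_a, x_b, sigma)
    k_p = chi2_kernel(p_a, p_b, gamma)
    # k_o is 0 when either location has no object score (assumed interpretation).
    k_o = chi2_kernel(o_a, o_b, gamma) if any(o_a) and any(o_b) else 0.0
    if variant == "SOP":
        side = 0.5 * k_p + 0.5 * k_o
    elif variant == "SP":
        side = k_p
    elif variant == "SO":
        side = k_o
    else:
        raise ValueError("variant must be 'SOP', 'SP', or 'SO'")
    return (1.0 - alpha) * k_s + alpha * side
```

With `alpha=0` the function reduces to the spatial kernel alone, which matches the S (spatial only) setting; because the spatial term compares grid coordinates, locations in different scenes can only be related through the side-information terms.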
By using the data from all scenes simultaneously in a global factorization, performance increases globally over using each single scene's data alone. This is expected and desirable: simultaneous understanding of multiple scenes can improve as the set of available scenes with observation data grows.
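A minimal NumPy sketch of such a factorization is given below. It minimizes the objective of Equation 2 with $K^V$ taken as the identity (so the $\mu$ term vanishes), using multiplicative updates of the graph-regularized weighted NMF type. The exact update rules of [7] are not reproduced in this text, so the updates here are a standard derivation from the objective rather than the authors' implementation; the function names, the rank, and the interpretation of the class-balancing weights are assumptions.

```python
import numpy as np


def class_balanced_weights(R, observed):
    """W with 1/n_c on observed non-zero entries and 1/n_z on observed zeros.

    R: (M, A) non-negative scores; observed: (M, A) boolean mask.
    This reading of the weighting scheme is an assumption.
    """
    W = np.zeros_like(R, dtype=float)
    pos = observed & (R > 0)
    n_c = pos.sum(axis=0)                  # observations per activity
    for c in range(R.shape[1]):
        if n_c[c] > 0:
            W[pos[:, c], c] = 1.0 / n_c[c]
    zeros = observed & (R == 0)
    n_z = zeros.sum()
    if n_z > 0:
        W[zeros] = 1.0 / n_z
    return W


def objective(R, W, K, U, V, lam):
    """Equation 2 with the cross-action term dropped; K must be symmetric."""
    L = np.diag(K.sum(axis=1)) - K         # graph Laplacian of K
    fit = np.sum((W * (R - U @ V.T)) ** 2)
    # (lam / 2) * sum_ij K_ij ||u_i - u_j||^2 equals lam * trace(U^T L U).
    return fit + lam * np.trace(U.T @ L @ U)


def rwnmf(R, W, K, rank=2, lam=1e-2, n_iter=200, seed=0, eps=1e-12):
    """Multiplicative updates keeping U and V non-negative."""
    rng = np.random.default_rng(seed)
    M, A = R.shape
    U = rng.random((M, rank)) + 0.1
    V = rng.random((A, rank)) + 0.1
    S = W * W                              # W sits inside the squared norm
    D = np.diag(K.sum(axis=1))
    for _ in range(n_iter):
        UV = U @ V.T
        U *= ((S * R) @ V + lam * K @ U) / ((S * UV) @ V + lam * D @ U + eps)
        UV = U @ V.T
        V *= ((S * R).T @ U) / ((S * UV).T @ U + eps)
    return U, V
```

Stacking the locations of several scenes as rows of one `R`, with `K` computed between all pairs of rows, gives the global factorization described above; the dense Action Map estimate is then `U @ V.T`.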

3.4. Partial observation experiments

We expose our algorithm to various fractions of the total activity demonstrations to simulate an increasing amount of observed actions. We find that performance is high even with only a few demonstrations and steadily increases as the number of activity demonstrations increases. The Office Flr. A, Office Flr. D, and Home A scenes have enough activity demonstration data to illustrate the performance gains of our method as a function of the available data. We show quantitative per-class results for these in Figure 4. Sharp increases can be observed in the per-class trends, which correspond to increases in the coverage of each activity class. In Figure 6, we show the overhead view of the AM for the sit and type labels for Office Flr. A as a function of the available data, where it can be seen how the AM qualitatively improves over time as observations are collected.

Figure 6: Sit (top row) and Type (bottom row) AMs as the amount of observed data increases on Office Flr. A. The columns stand for 10%, 80%, and 100% of the data.

Figure 7: Localizing with an Action Map and observed activities. Activities that are more specialized are localized with fewer guesses. (Plot: Average Minimum Distance of WNMF Action Map Modes to Localized Sequence vs. Size of Query K; curves for sit, writing, opendoor, wash, typing, and reading.)

3.5. Novel scene experiments

Another scenario is the task of predicting AMs for novel scenes containing zero activity observation data. Our method leverages the appearance and activity observation data in one scene, and only appearance data in the novel scene, to make predictions. We now introduce the three baselines we consider. The first baseline is to perform per-image classification with the object detection and scene classification features, which serves to estimate image-wise performance of using the object detection and scene classification information. This baseline requires observations in a labelled scene for training. We use Random Forests [1] as the classification method, trained on images from the source scene. The second baseline is non-regularized Weighted Non-negative Matrix Factorization, obtained by augmenting the target matrix $R$ with the object detection and scene classification features for each location. This baseline does not explicitly enforce the similarity that the regularized framework does; thus, we expect it to not perform as well as our framework. The third baseline is to build AMs from the back-projected object detections by directly associating each detection category with an activity category.

We use the Office Flr. A demonstration and appearance data as input and evaluate the performance by applying the learned model to each of the other scenes. These results (Table 3) illustrate that our method's AM predictions outperform the baselines in 13 of 16 cases, and that the appearance information is capitalized upon the most by our method. We find that scene classification is particularly beneficial to performance, a phenomenon for which we present two hypothesized factors: 1) as shown in [23], object detectors emerge in deep scene CNNs, suggesting that the scene classification features subsume the cues present in the object detector features, and 2) due to localization noise, correlations between localized activities and localized objects are not as strong, and can serve to introduce noise to the Spatial+Scene Classification kernel combination when this object information is integrated.
Overall, we find that our model harnesses the power of activity observations in concert with the availability of rich scene classification and object detection information to estimate the functionality of environments both with and without activity observations. See Appendix A, including Tables 4 and 5, for additional visualizations and novel scene prediction demonstrations.

4. Action Maps for Localization

We demonstrate a proof-of-concept application of Action Maps to the task of localization. Intuitively, by leveraging the "where an activity can be done" functional-spatial information from Action Maps, along with the "what activity has been done" functional information from activity detection, the user's spatial location is constrained to be in one of several areas. We localize activity sequences in each 2D map based on the combination of predicted action locations from the Action Map and observed actions in each frame. In Figure 7, we show that the spatial discrepancy, in grid cells, between the K-best AM location guesses and the localized activity sequence decreases as K increases. Thus, an Action Map can be used to localize a person with observations of their activity.
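The K-best localization idea above can be sketched as follows. This is a minimal illustration rather than the paper's procedure: it treats the K highest-scoring grid cells for the observed activity as the location guesses and measures the minimum Euclidean distance, in grid cells, to a known true cell. The function names, the grid, and the scores are all hypothetical.

```python
import numpy as np


def k_best_cells(action_map, activity, k):
    """Return the k grid cells (row, col) with the highest score for one activity."""
    scores = action_map[:, :, activity]
    flat = np.argsort(scores, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, scores.shape)
    return np.stack([rows, cols], axis=1)


def min_distance(guesses, true_cell):
    """Minimum Euclidean distance, in grid cells, from any guess to the true cell."""
    diffs = guesses - np.asarray(true_cell)
    return float(np.min(np.linalg.norm(diffs, axis=1)))


# Hypothetical 5x5 Action Map with two activities: 0 = sit, 1 = wash.
am = np.zeros((5, 5, 2))
am[0, 0, 0] = 0.9   # sitting is possible in several places
am[4, 4, 0] = 0.8
am[2, 2, 0] = 0.7
am[1, 3, 1] = 0.95  # washing is possible in only one place

# A person observed sitting at cell (2, 2): one guess is not enough.
d_sit_k1 = min_distance(k_best_cells(am, 0, 1), (2, 2))
d_sit_k3 = min_distance(k_best_cells(am, 0, 3), (2, 2))

# A person observed washing at cell (1, 3): a single guess suffices.
d_wash_k1 = min_distance(k_best_cells(am, 1, 1), (1, 3))

print(d_sit_k1, d_sit_k3, d_wash_k1)
```

In this toy setting the specialized activity (wash) is localized with a single guess, while the common activity (sit) needs a larger K, which mirrors the trend described for Figure 7.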

5. Conclusion

We have demonstrated a novel method for generating functional maps of uninstrumented common environments. Our model jointly considers scene appearance and functionality while consolidating evidence from the natural vantage point of the user, and is able to learn from a user's demonstrations to make predictions of functionality of less explored and completely novel areas. Finally, our proof-of-concept application hints at the breadth of future work that can exploit the rich spatial and functional information present in Action Maps.

Acknowledgements

This research was funded in part by grants from the PA Dept. of Health's Commonwealth Universal Research Enhancement Program, IBM Research Open Collaborative Research initiative, CREST (JST), and an NVIDIA hardware grant. We thank Ryo Yonetani for valuable data collection assistance and discussion.

References

[1] L. Breiman. Random forests. Machine Learning, 45(1):5-32.
[2] V. Delaitre, D. F. Fouhey, I. Laptev, J. Sivic, A. Gupta, and A. A. Efros. Scene semantics from long-term observation of people. In Computer Vision - ECCV 2012. Springer.
[3] A. Fathi, A. Farhadi, and J. M. Rehg. Understanding egocentric activities. In Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE.
[4] D. F. Fouhey, V. Delaitre, A. Gupta, A. A. Efros, I. Laptev, and J. Sivic. People watching: Human actions as a cue for single-view geometry. In Proc. 12th European Conference on Computer Vision.
[5] J. Gall, A. Fossati, and L. Van Gool. Functional categorization of objects using real-time markerless motion capture. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition.
[7] Q. Gu, J. Zhou, and C. H. Ding. Collaborative filtering: Weighted nonnegative matrix factorization incorporating user and item graphs. In SDM. SIAM.
[8] A. Gupta, T. Chen, F. Chen, D. Kimber, and L. S. Davis. Context and observation driven latent variable model for human pose estimation. In Computer Vision and Pattern Recognition, CVPR IEEE Conference on, pages 1-8. IEEE.
[9] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert. From 3d scene geometry to human workspace. In Computer Vision and Pattern Recognition (CVPR).
[10] Y. Jiang, H. Koppula, and A. Saxena. Hallucinated humans as the hidden context for labeling 3d scenes. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on. IEEE.
[11] H. S. Koppula, R. Gupta, and A. Saxena. Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research, 32(8).
[12] Y. Li, Z. Ye, and J. M. Rehg. Delving into egocentric actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[13] D. J. Moore, I. Essa, M. H. Hayes III, et al. Exploiting human actions and object context for recognition tasks. In Computer Vision, The Proceedings of the Seventh IEEE International Conference on, volume 1. IEEE.
[14] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV.
[15] P. Peursum, G. West, and S. Venkatesh. Combining image regions and human activity for indirect object recognition in indoor wide-angle views. In Computer Vision, ICCV Tenth IEEE International Conference on, volume 1. IEEE.
[16] H. Pirsiavash and D. Ramanan. Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE.
[17] P. Ramachandran and G. Varoquaux. Mayavi: 3D Visualization of Scientific Data. Computing in Science & Engineering, 13(2):40-51.
[18] M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner. Scenegrok: Inferring action maps in 3d environments. ACM Transactions on Graphics (TOG), 33(6).
[19] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems.
[20] E. H. Spriggs, F. De la Torre Frade, and M. Hebert. Temporal segmentation and activity classification from first-person sensing. In IEEE Workshop on Egocentric Vision, CVPR 2009, June.
[21] C. Wu. Towards linear-time incremental structure from motion. In 3D Vision-3DV 2013, 2013 International Conference on. IEEE.
[22] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE.
[23] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. arXiv preprint.
[24] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems.

A. Examples

Tables 4 and 5 visualize various components of our method. In both examples, activity demonstration data is available for only one scene. In Table 4, we display object detection results for images from each scene, as well as projected onto the floor planes. In Table 5, we display scene classification results for images from each scene, as well as projected onto the floor planes. In Table 4, the target scene without activity demonstration is a scene from the NYU V2 Depth Dataset [14], reconstructed and processed by our method. The per-row verbose description is as follows:

1. Scene Names
2. Scene reconstruction with localized cameras visualized as vectors, colored by temporal ordering
3. (Object detections) or (Scene classifications), with scores for example images
4. (Object detections and sit) or (Scene classification corridor) features visualized, with the example images (and objects from row 3) localized in the scene
5. Available localized activity demonstrations, height corresponds to bin count, color corresponds to activity type (with the same coloration scheme as Figure 5)
6. Final sit Action Maps as produced by our method visualized in 3D
7. Final sit Action Maps as produced by our method visualized as projected onto the example images (with no occlusion filtering).

Footnote 1: We used [17] to produce 3D visualizations throughout the paper.

Table 4: Visualizations of various aspects of the method. (Image table. Columns: Home A (training); NYUV2 Home Office 0001. Rows: Scene Reconstructions with Localized Cameras; Example Images with Detections; Object Detections in 3D and Object sit Action Maps; Binned Localized Detected Actions, colored by type ("None - Novel Scene" for the target scene); sit Action Maps; Projected sit Action Maps.)

Table 5: Visualizations of various aspects of the method. (Image table. Columns: Office Flr. A (training); Office Flr. B. Rows: Scene Reconstructions with Localized Cameras; Example Images with Classification Score; Corridor Scene Classifications in 3D with Localized Example Images; Binned Localized Detected Actions, colored by type ("None - Novel Scene" for the target scene); sit Action Maps; Projected sit Action Maps.)


More information

Deep Learning-driven Depth from Defocus via Active Multispectral Quasi-random Projections with Complex Subpatterns

Deep Learning-driven Depth from Defocus via Active Multispectral Quasi-random Projections with Complex Subpatterns Deep Learning-driven Depth from Defocus via Active Multispectral Quasi-random Projections with Complex Subpatterns Avery Ma avery.ma@uwaterloo.ca Alexander Wong a28wong@uwaterloo.ca David A Clausi dclausi@uwaterloo.ca

More information

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong

Proceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA

More information

Spatial Latent Dirichlet Allocation

Spatial Latent Dirichlet Allocation Spatial Latent Dirichlet Allocation Xiaogang Wang and Eric Grimson Computer Science and Computer Science and Artificial Intelligence Lab Massachusetts Tnstitute of Technology, Cambridge, MA, 02139, USA

More information

Augmenting Reality, Naturally:

Augmenting Reality, Naturally: Augmenting Reality, Naturally: Scene Modelling, Recognition and Tracking with Invariant Image Features by Iryna Gordon in collaboration with David G. Lowe Laboratory for Computational Intelligence Department

More information

Identifying Layout Classes for Mathematical Symbols Using Layout Context

Identifying Layout Classes for Mathematical Symbols Using Layout Context Rochester Institute of Technology RIT Scholar Works Articles 2009 Identifying Layout Classes for Mathematical Symbols Using Layout Context Ling Ouyang Rochester Institute of Technology Richard Zanibbi

More information

Part-based and local feature models for generic object recognition

Part-based and local feature models for generic object recognition Part-based and local feature models for generic object recognition May 28 th, 2015 Yong Jae Lee UC Davis Announcements PS2 grades up on SmartSite PS2 stats: Mean: 80.15 Standard Dev: 22.77 Vote on piazza

More information

Viewpoint Invariant Features from Single Images Using 3D Geometry

Viewpoint Invariant Features from Single Images Using 3D Geometry Viewpoint Invariant Features from Single Images Using 3D Geometry Yanpeng Cao and John McDonald Department of Computer Science National University of Ireland, Maynooth, Ireland {y.cao,johnmcd}@cs.nuim.ie

More information

Markov Networks in Computer Vision

Markov Networks in Computer Vision Markov Networks in Computer Vision Sargur Srihari srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Some applications: 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction

More information

Beyond Bags of Features

Beyond Bags of Features : for Recognizing Natural Scene Categories Matching and Modeling Seminar Instructed by Prof. Haim J. Wolfson School of Computer Science Tel Aviv University December 9 th, 2015

More information

arxiv: v1 [cs.cv] 20 Dec 2016

arxiv: v1 [cs.cv] 20 Dec 2016 End-to-End Pedestrian Collision Warning System based on a Convolutional Neural Network with Semantic Segmentation arxiv:1612.06558v1 [cs.cv] 20 Dec 2016 Heechul Jung heechul@dgist.ac.kr Min-Kook Choi mkchoi@dgist.ac.kr

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Supplementary: Cross-modal Deep Variational Hand Pose Estimation

Supplementary: Cross-modal Deep Variational Hand Pose Estimation Supplementary: Cross-modal Deep Variational Hand Pose Estimation Adrian Spurr, Jie Song, Seonwook Park, Otmar Hilliges ETH Zurich {spurra,jsong,spark,otmarh}@inf.ethz.ch Encoder/Decoder Linear(512) Table

More information

The Hilbert Problems of Computer Vision. Jitendra Malik UC Berkeley & Google, Inc.

The Hilbert Problems of Computer Vision. Jitendra Malik UC Berkeley & Google, Inc. The Hilbert Problems of Computer Vision Jitendra Malik UC Berkeley & Google, Inc. This talk The computational power of the human brain Research is the art of the soluble Hilbert problems, circa 2004 Hilbert

More information

CS395T paper review. Indoor Segmentation and Support Inference from RGBD Images. Chao Jia Sep

CS395T paper review. Indoor Segmentation and Support Inference from RGBD Images. Chao Jia Sep CS395T paper review Indoor Segmentation and Support Inference from RGBD Images Chao Jia Sep 28 2012 Introduction What do we want -- Indoor scene parsing Segmentation and labeling Support relationships

More information

Human Pose Estimation with Deep Learning. Wei Yang

Human Pose Estimation with Deep Learning. Wei Yang Human Pose Estimation with Deep Learning Wei Yang Applications Understand Activities Family Robots American Heist (2014) - The Bank Robbery Scene 2 What do we need to know to recognize a crime scene? 3

More information

Markov Networks in Computer Vision. Sargur Srihari

Markov Networks in Computer Vision. Sargur Srihari Markov Networks in Computer Vision Sargur srihari@cedar.buffalo.edu 1 Markov Networks for Computer Vision Important application area for MNs 1. Image segmentation 2. Removal of blur/noise 3. Stereo reconstruction

More information

CS 231A Computer Vision (Winter 2014) Problem Set 3

CS 231A Computer Vision (Winter 2014) Problem Set 3 CS 231A Computer Vision (Winter 2014) Problem Set 3 Due: Feb. 18 th, 2015 (11:59pm) 1 Single Object Recognition Via SIFT (45 points) In his 2004 SIFT paper, David Lowe demonstrates impressive object recognition

More information

Fully Convolutional Network for Depth Estimation and Semantic Segmentation

Fully Convolutional Network for Depth Estimation and Semantic Segmentation Fully Convolutional Network for Depth Estimation and Semantic Segmentation Yokila Arora ICME Stanford University yarora@stanford.edu Ishan Patil Department of Electrical Engineering Stanford University

More information

Dimensionality Reduction using Relative Attributes

Dimensionality Reduction using Relative Attributes Dimensionality Reduction using Relative Attributes Mohammadreza Babaee 1, Stefanos Tsoukalas 1, Maryam Babaee Gerhard Rigoll 1, and Mihai Datcu 1 Institute for Human-Machine Communication, Technische Universität

More information

Recognize Complex Events from Static Images by Fusing Deep Channels Supplementary Materials

Recognize Complex Events from Static Images by Fusing Deep Channels Supplementary Materials Recognize Complex Events from Static Images by Fusing Deep Channels Supplementary Materials Yuanjun Xiong 1 Kai Zhu 1 Dahua Lin 1 Xiaoou Tang 1,2 1 Department of Information Engineering, The Chinese University

More information

RSRN: Rich Side-output Residual Network for Medial Axis Detection

RSRN: Rich Side-output Residual Network for Medial Axis Detection RSRN: Rich Side-output Residual Network for Medial Axis Detection Chang Liu, Wei Ke, Jianbin Jiao, and Qixiang Ye University of Chinese Academy of Sciences, Beijing, China {liuchang615, kewei11}@mails.ucas.ac.cn,

More information

Lecture 5: Object Detection

Lecture 5: Object Detection Object Detection CSED703R: Deep Learning for Visual Recognition (2017F) Lecture 5: Object Detection Bohyung Han Computer Vision Lab. bhhan@postech.ac.kr 2 Traditional Object Detection Algorithms Region-based

More information

Deep Incremental Scene Understanding. Federico Tombari & Christian Rupprecht Technical University of Munich, Germany

Deep Incremental Scene Understanding. Federico Tombari & Christian Rupprecht Technical University of Munich, Germany Deep Incremental Scene Understanding Federico Tombari & Christian Rupprecht Technical University of Munich, Germany C. Couprie et al. "Toward Real-time Indoor Semantic Segmentation Using Depth Information"

More information

Detecting Object Instances Without Discriminative Features

Detecting Object Instances Without Discriminative Features Detecting Object Instances Without Discriminative Features Edward Hsiao June 19, 2013 Thesis Committee: Martial Hebert, Chair Alexei Efros Takeo Kanade Andrew Zisserman, University of Oxford 1 Object Instance

More information

A Summary of Projective Geometry

A Summary of Projective Geometry A Summary of Projective Geometry Copyright 22 Acuity Technologies Inc. In the last years a unified approach to creating D models from multiple images has been developed by Beardsley[],Hartley[4,5,9],Torr[,6]

More information

arxiv: v2 [cs.cv] 14 May 2018

arxiv: v2 [cs.cv] 14 May 2018 ContextVP: Fully Context-Aware Video Prediction Wonmin Byeon 1234, Qin Wang 1, Rupesh Kumar Srivastava 3, and Petros Koumoutsakos 1 arxiv:1710.08518v2 [cs.cv] 14 May 2018 Abstract Video prediction models

More information