When 3D Reconstruction Meets Ubiquitous RGB-D Images

Size: px

Start display at page:

Download "When 3D Reconstruction Meets Ubiquitous RGB-D Images"

Melinda Lindsey
6 years ago
Views:

When 3D Reconruction Meets Ubiquitous RGB-D Images Quanshi Zhang, Xuan Song, Xiaowei Shao, Huijing Zhao, and Ryosuke Shibasaki University of Tokyo, Peking University Abract 3D reconruction from a

In this aer, we roose to learn 3D reconruction knowledge from informally catured 2 RGB-D images, which will robably be ubiquitously used in daily life.

The category model eimates the ixellevel 3D ructure of an object from its 2D aearance, by taking into account considerable variations in rotation, 3D ructure, and texture.

Introduction 3D reconruction is a classical area in the field of comuter vision, but 3D reconruction from a single image ill has great challenges.

Category modeling & task outut: For the bottleneck in single-view 3D reconruction, i.e. the reconruction of objects with irregular 1 ructures, we have to return to the concet of category modeling.

The category model encodes the knowledge of how intra-category ructure deformation and object rotations affect 2D object shaes.

categories without setting rong assumtions for object shaes. In contra, Cheeger-set-based methods focus on ball-like surfaces, and ersective-based methods require vanishing oints.

Flowchart for training (cyan arrows) and teing (green arrows) rocesses. Single-view 3D reconruction is hamered by objects with irregular 1 shaes.

We roose to directly train the category model from informally catured RGB-D images (where target objects are NOT aligned) to save human labeling, thereby ensuring high efficiency in model learning.

For teing, the category detector and category model are used in sequence to localize the target object and eimate its 3D ructure. knowledge of 3D reconruction for a huge number of categories.

1 When 3D Reconruction Meets Ubiquitous RGB-D Images Quanshi Zhang, Xuan Song, Xiaowei Shao, Huijing Zhao, and Ryosuke Shibasaki University of Tokyo, Peking University Abract 3D reconruction from a single image is a classical roblem in comuter vision. However, it ill oses great challenges for the reconruction of daily-use objects with irregular 1 shaes. In this aer, we roose to learn 3D reconruction knowledge from informally catured 2 RGB-D images, which will robably be ubiquitously used in daily life. The learning of 3D reconruction is defined as a category modeling roblem, in which a model for each category is trained to encode category-secific knowledge for 3D reconruction. The category model eimates the ixellevel 3D ructure of an object from its 2D aearance, by taking into account considerable variations in rotation, 3D ructure, and texture. Learning 3D reconruction from u- biquitous RGB-D images creates a new set of challenges. Exerimental results have demonrated the effectiveness of the roosed aroach. 1. Introduction 3D reconruction is a classical area in the field of comuter vision, but 3D reconruction from a single image ill has great challenges. In this aer, we re-consider the roblem of single-view 3D reconruction in terms of two CV areas: category modeling and knowledge mining from big visual data (see Fig. 1). Category modeling & task outut: For the bottleneck in single-view 3D reconruction, i.e. the reconruction of objects with irregular 1 ructures, we have to return to the concet of category modeling. Therefore, the objective is to train a model to detect objects in the target category in large images, while simultaneously rojecting their 2D shaes into the 3D sace at the ixel level. The category model encodes the knowledge of how intra-category ructure deformation and object rotations affect 2D object shaes. Mining from big visual data & task inut: Another bottleneck lies in efficiently learning the category-secific 1 The word irregular is used to indicate that our aroach focuses on general object categories without setting rong assumtions for object shaes. In contra, Cheeger-set-based methods focus on ball-like surfaces, and ersective-based methods require vanishing oints. Informally catured RGB-D images Category detector A RGB image Collected RGB-D object samles Category detector Category model 2D shae 3D ructure Figure 1. Flowchart for training (cyan arrows) and teing (green arrows) rocesses. Single-view 3D reconruction is hamered by objects with irregular 1 shaes. Therefore, for objects in each category, we aim to learn a secific category model for 3D reconruction. We roose to directly train the category model from informally catured RGB-D images (where target objects are NOT aligned) to save human labeling, thereby ensuring high efficiency in model learning. We fir mine the category detector from informally catured RGB-D images to collect RGB-D objects, and then use these object samles to train the category model. For teing, the category detector and category model are used in sequence to localize the target object and eimate its 3D ructure. knowledge of 3D reconruction for a huge number of categories. Ideally, we would need to train a model for each object category in daily use, so as to conruct a knowledge base to rovide a 3D reconruction service for arbitrary RGB images. Therefore, we hoe to learn from big visual data 2 to avoid the labor of manually rearing training samles (e.g. well built 3D models), and thus ensure a high learning efficiency. In this aer, we train category models directly from informally catured and unaligned RGB-D images. We use the hrase informally catured to describe loose requirements for ubiquitous 2 images. They are tyical of what can be directly collected by search engines (Figures 1 and 2). The informally catured images are not manually aligned, and they consi of small objects that are randomly ositioned. In articular, these daily-use objects are 2 Actually, learning from big visual data is ill in the early ages of develoment, and its scoe is quite extensive. It usually involves two asects. This fir is learning from a large amount of web data, such as the widely used dee learning [17]. The second is learning under challenging conditions of informally catured images that contain small objects and are ubiquitous in everyday life, as is the case in our method. [26] also rovides detailed descrition of the second asect. 1

Duan Srayer Inut images for training Detailed view of srayer samles Detailed view of duan samles Figure 2. Overview of challenges.

designed with texture variations, various rotations, and some ructure deformation for commercial uroses. Technically seaking, these images can be loosely regarded as a kind of big visual data 2.

The use of informally catured images raises a number of challenges, as shown in Fig. 2.

Similarly, the learning rocess should also be able to handle these challenges to collect RGB-D objects in cluttered RGB-D images.

Then, we analyze the challenges of learning 3D reconruction knowledge from the collected RGB-D objects 3.

2) Deformation of 3D object ructures. 3) Deth errors near object edges in training samles (measured by the Kinect) Proosed method: As shown in Fig. 1, the training of category detectors is the fir e.

We use unsuervised category modeling based on grah matching [28] to train the category detector (secified in [26]).

3 In our udy, these objects are collected from RGB images (or RGB dimensions of the RGB-D images) using the trained category detectors. The 3D object ructure is rojected into the 2D shae afterwards.

Unlike conventional methods that use well built 3D models to rovide multi-view object aearances and rior regularity of 3D ructure deformation, our aroach needs to learn this knowledge in an

The category model consis of a number of eimators, each corresonding to a secific osition inside the 2D shae.

2 Duan Srayer Inut images for training Detailed view of srayer samles Detailed view of duan samles Figure 2. Overview of challenges. Automatic category modeling from cluttered scenes Robuness to intracategory variations Eimate deformation & remove object-rotation effects in 3D reconruct Learn with deth measurement noises usually designed with texture variations, various rotations, and some ructure deformation for commercial uroses. Technically seaking, these images can be loosely regarded as a kind of big visual data 2. Challenge analysis: Ju like the training, the RGB images for teing 3D reconruction are also such kind of informally catured images in this research. The use of informally catured images raises a number of challenges, as shown in Fig. 2. For alication uroses, the category model should have the ability to detect small target objects with various textures, rotations, and scales in large RGB images before 3D reconruction. Similarly, the learning rocess should also be able to handle these challenges to collect RGB-D objects in cluttered RGB-D images. Moreover, this detection has to be accurate to the art level, since 3D reconruction uses the arts 2D ositions to eimate the ose and 3D ructure of an object. Then, we analyze the challenges of learning 3D reconruction knowledge from the collected RGB-D objects 3. We need to simultaneously consider the following three factors that affect the corresondence between the 2D and 3D ructure of the object. 1) Rotation changes. This is the rimary factor. 2) Deformation of 3D object ructures. 3) Deth errors near object edges in training samles (measured by the Kinect) Proosed method: As shown in Fig. 1, the training of category detectors is the fir e. The category detectors collect 3 RGB-D/RGB object samles from the informally catured RGB-D/RGB images, which serves as the basis for further training/teing of 3D reconruction. We use unsuervised category modeling based on grah matching [28] to train the category detector (secified in [26]). This algorithm automatically discovers a set of relatively diinguishing arts (namely key arts) of objects for each category. 3 In our udy, these objects are collected from RGB images (or RGB dimensions of the RGB-D images) using the trained category detectors. The 3D object ructure is rojected into the 2D shae afterwards. Thus, the category detector achieves a art-level detection. We then train the category model to use the arts 2D ositions to determine the object s ose and 3D ructure. Unlike conventional methods that use well built 3D models to rovide multi-view object aearances and rior regularity of 3D ructure deformation, our aroach needs to learn this knowledge in an unsuervised manner. In other words, we need to train the category model to simultaneously remove the effects of viewoint changes, and eimate the 3D ructure deformation using the 2D shae of an object. The category model consis of a number of eimators, each corresonding to a secific osition inside the 2D shae. These eimators use the satial relationshi between key object arts to eimate the corresonding 3D coordinates of their ositions. We connect neighboring eimators to form a continuous Markov random field (MRF) and thus train these eimators. The contributions of this aer can be summarized as follows. We regard 3D reconruction from a single image as a category modeling roblem to overcome the difficulties in the reconruction of irregular 1 shaes. To ensure a high learning efficiency and a wide alication, this is the fir attemt to learn category models from informally catured RGB-D images, and then aly the category model back to the informally catured RGB images for 3D reconruction. We exlore a new set of challenges raised by our choice of training and teing data, and rovide a solution. 2. Related work 3D reconruction is a large area in the field of comuter vision. However, in this aer, we limit our discussion to 3D reconruction from a single image. Many methods for single-view 3D reconruction have rong assumtions for the image environment. For examle, shaes from shading methods [23, 4, 3, 14] had secial requirements for lighting conditions and textures. A large number of udies use the ersective rincile for 3D reconruction. They were tyically based on the vanishing oints in images, and therefore assumed that these images contained enough cues to extract vanishing oints [24, 15]. Such methods were mainly alied to large-scale indoor and urban environments. In addition, some udies used the ground/vertical assumtion to assi in vanishing oint extraction [13, 2, 7, 8]. Saxena et al. [19] roosed to learn an MRF for 3D reconruction in large-scale environments. Fouhey et al. [9] roosed to learn 3D rimitives from RGB-D images for single-view reconruction of indoor environment. A number of methods relied on the assumtion that the object ructure in queion had smooth surfaces. They usually extracted object-level ructures from locally catured images that contained no ersective cues. For examle, the Cheeger Set aroach [18] assumed that the target ob-

ject had ball-like surfaces. Given the object contour and the object volume, it comuted the 3D ructure with the minimum surface area.

Therefore, many methods [25, 22, 21, 14] combined human interactions with these assumtions to guide the 3D reconruction.

From this ersective, our research is related to the examle-based methods [11, 10, 20]. They required a comrehensive dataset of 3D object samles that were catured in all oses from different viewoints.

[6, 5] learned 3D reconruction knowledge from well built 3D object models. These 3D object models rovided knowledge on ructure deformation and the multi-view aearance.

In contra, we aly much less suervision to idealize the concet of automatic category modeling in terms of training.

The challenges of object detection, rotation changes, and the eimation of 3D ructure deformation are all involved in the training rocess.

reconruction, without examle comarison. 3. Training of category detectors Category detectors are not only trained from, but also alied to informally catured and unaligned images.

In our udy, we use the grahical model defined in [26] as the category detector, considering the need for robuness to shae deformations and rotations in the informally catured images.

The detector is trained to encode the attern for the common objects in all the images. A set of key object arts are discovered to form the attern for a category (see Fig. 3(c)).

3(b), continuous object edges (magenta) are discretized into line segments (black), and these line segments form the grah nodes. Thus the objects within it are reresented as sub-grahs.

Object edge segments (cyan) form the grah nodes. (b) Notation for the grahical model. The initial grahical model consis of five nodes (colored lines) and edges (dotted lines) between them.

The node set and node/edge features are modified so that 1) the node number (size) of G is maximized and 2) all the node/edge features reresent common atterns in all the grahs.

3 ject had ball-like surfaces. Given the object contour and the object volume, it comuted the 3D ructure with the minimum surface area. However, such assumtions for object ructure are usually not valid for the reconruction of irregular 1 shaes. Therefore, many methods [25, 22, 21, 14] combined human interactions with these assumtions to guide the 3D reconruction. By and large, the lack of cues for the eimation of irregular 1 object ructures is the main challenge for single-view 3D reconruction. From this ersective, our research is related to the examle-based methods [11, 10, 20]. They required a comrehensive dataset of 3D object samles that were catured in all oses from different viewoints. The 3D ructure knowledge of a target object was extracted via a large number of comarisons between the object and the samles in the dataset. Chen et al. [6, 5] learned 3D reconruction knowledge from well built 3D object models. These 3D object models rovided knowledge on ructure deformation and the multi-view aearance. In this way, they learned to match the 2D shae of a given object to the 3D object models, and thus eimate the object s viewoint, ose, and 3D ructure. In contra, we aly much less suervision to idealize the concet of automatic category modeling in terms of training. Learning from informally catured RGB-D images without manual alignment saves great labor for rearation for 3D object models or a large 3D-examle dataset. The challenges of object detection, rotation changes, and the eimation of 3D ructure deformation are all involved in the training rocess. The trained category model is then alied back to the informally catured RGB images for 3D reconruction, without examle comarison. 3. Training of category detectors Category detectors are not only trained from, but also alied to informally catured and unaligned images. The category models for 3D reconruction are designed on the basis of category detectors. In our udy, we use the grahical model defined in [26] as the category detector, considering the need for robuness to shae deformations and rotations in the informally catured images. We then aly unsuervised learning for grah matching [28] to mine the category detectors. In reality, we simly use the RGB channels of RGB-D images for training. The detector is trained to encode the attern for the common objects in all the images. A set of key object arts are discovered to form the attern for a category (see Fig. 3(c)). Category detectors: As shown in Fig. 3(a), the image is reresented as a comlete attributed relational grah (ARG). In Fig. 3(b), continuous object edges (magenta) are discretized into line segments (black), and these line segments form the grah nodes. Thus the objects within it are reresented as sub-grahs. as in (b) SAP ( s includes i texture features) Old node/edge s features i j New node/edge s features i j Node Node t Nodet i j s i Nodes Figure 3. Object detectors. (a) Grah-based image reresentation. Object edge segments (cyan) form the grah nodes. (b) Notation for the grahical model. The initial grahical model consis of five nodes (colored lines) and edges (dotted lines) between them. (c) Learning the grahical model (category detector). The algorithm modifies (i) the initial grahical model into (ii) the SAP (G). The node set and node/edge features are modified so that 1) the node number (size) of G is maximized and 2) all the node/edge features reresent common atterns in all the grahs. The category detector is a grahical model in [26], which will be automatically trained to be the soft attributed attern (SAP) among the target sub-grahs (or objects) in different images. Let grah G with node set V denote the model. G is described by a set of node and edge features. Let {F s i } and {F j } (i = 1, 2,..., N P, j = 1, 2,..., N Q ) denote the i-th node feature for node s and the j-th edge feature for edge (s, t), resectively. Here, N P and N Q indicate the feature numbers for each node and edge, resectively. F V = {Fi s} {Fj }, s, t V. Note that F i s includes the texture features on the terminals of lines segments. The category detector thus also contains the textural knowledge. (Please see [26] for details of the feature settings.) Object detection: Object detection is achieved via grah matching. Given a target RGB image reresented by grah G with a node set V, grah matching is erformed to comute the matching assignments between G and G. Let x denote the label set, and let the label x s x denote the matched node in G for node s in G. Grah matching is formulated as the minimization of the following energy function w.r.t x. E(x F V,F V )= P s (x s F V,F V )+ Q (x s,x t F V,F V ) (1) s V s,t V,s t This is a tyical formula for grah matching, where the functions P s (, ) and Q (,, ) measure matching enalties (feature dissimilarity) between the two matched nodes and edges. These are generally defined using squared differences. E.g. P s (x s F V,F V ) = w T [ F1 s F xk s 1 2,..., FN s P F xk s N P 2 ]. (Please see [28] for details.) Learning grah matching with SAPs: As shown in Fig. 3(c), [28] roosed an algorithm for mining SAPs. The Node

4 SAP is a fuzzy attributed attern, which describes the common sub-grahs in a set of ARGs, in which node and edge features have considerable variations. In other words, the category detector (i.e. the SAP) reresents a set of key object arts (nodes) that frequently aear in training images of the entire category. The category detector is trained taking into account the robuness to texture variations and ructure deformations. To art the training rocess, this method requires eole only to label the sub-grah of a single object for initializing the grahical model G. [28] is designed to modify G from the secific shae of the labeled object into the SAP among all grahs. Given a set of N training images, each image is denoted by {G k } (k = 1, 2,..., N) with node set V k. F V k are the attribute sets of G k, and xk s x k X indicates the node in V k matched by s V. The frequency of node s in G is defined as the average enalty for matching s to all the {G k }, as follows. E s( ˆX,ˆF V )= 1 N N [P s(ˆx k s ˆF V,F V )+ ] Q k (ˆx k s, ˆx k t ˆF V,F V k ) k=1 t V,t s If E s ( ˆX, ˆF V ) is less than a given threshold τ, node s is regarded as a frequently aearing art; otherwise not. Hence, the goal of learning grah matching is to train the node set V and attributes F V of model G, in such a manner that 1) all nodes of G frequently aear in grahs {G k } and 2) the size of G is maximized. max V V s.t. (ˆF V, ˆX) = argmin F V,X={x k } N s V, E s ( ˆX, ˆF V ) τ. k=1 E(xk F V, F V k ); [28] rooses an EM framework to solve (2). 4. Learning 3D reconruction We introduce the rearation of training samles, model learning, and the alication of 3D reconruction in the following three subsections Object samle collection & ose normalization The rearation of training samles involves two es, i.e. object samle collection and 3D ose normalization. In the fir e, we use the trained category detector to collect RGB-D samles from informally catured RGB-D images for training. Each training samle contains its 2D shae and the 3D coordinates of each 2D ixel within it. The 3D coordinates are measured in the global coordinate syem of each RGB-D image, rather than a coordinate syem w.r.t the object. Therefore, in the second e, we comute the 3D ose of the object, and thereby normalize the 3D coordinates into the object coordinate syem. (2) Object samle collection: We use the category detector to match a set of key object arts (grah nodes) in each image by alying (1). However, as mentioned above, the key arts consi of line segments on the object, and do not comrise the entire object body. Therefore, we fir recover the entire body area of the object from the detected line segments, before the further learning. Actually, a number of methods for interactive 4 object segmentation are oriented to this alication. Nevertheless, in order to make a reliable evaluation of the roosed 3D reconruction method, we need to simlify the whole syem and thereby avoid bringing in uncertainties related to object segmentation. Consequently, given the key line segments of the object, we roughly eimate its body area as the convex hull of the line terminals (see Fig. 4(c)). 3D ose normalization: We define centers and 3D oses for the object samles in the RGB-D images, thus conructing a relative coordinate syem for each samle, as shown in Fig. 4(f,bottom). We then roject 3D oint clouds of the objects onto these coordinate syems, as the ground truth of their 3D ructures. Fig. 4(e) shows the notation. The center of an object is defined as the mean of the 3D coordinates of its key arts c = 1 s m s/m, where m is the art (node) number and s denotes the center coordinates of art s. We define the orientations of the three orthogonal coordinate axes < v 1,v 2,v 3 > as follows. v 1 =u(v 1 + v 2), v 2 =u(v 1 v 2), v 3 =v 1 v 2 (3) where function u( ) normalizes a vector to a unit vector, v 1 = u( 1 s<t m u( t s )), v 2 = u( 1 s<t m (o t o s )), and o s denotes the 3D orientation of line segment s Model learning We train a category model from RGB-D object samles to eimate the ixel-level 3D reconruction. Essentially, even if we are given a rior 3D ructure for a category, 3D reconruction from a single image ill has great challenges. There are two general hyotheses, i.e. object rotations and ructure deformations, to interreting a secific 2D object shae in 3D reconruction. Each hyothesis can indeendently eimate the 3D object ructure from the 2D shae (e.g. comuting the rotation or 3D deformation of the rior 3D ructure that be fits the contour of the 2D shae). Obviously, the 3D reconruction needs to combine the both hyotheses. Unlike [5, 6] using well built 3D object models to rovide or train rior regularity of ructure deformation, we need to train the category model to simultaneously identify the effects of object rotations and ructure deformations from 2D shaes, without rior knowledge. This greatly increases the challenge. 4 The key object arts can be labeled as the foreground.

a b c d e t o t f Local 2D coordinate t syem o s s Centerline s Almo erendicular to the an Almo arallel to the an M q M s,t Lean again a box s,(i) node t q,(i),(j) q,(j) s Lean again a box Figure 4.

(e) Notation for determining the local 3D coordinate syem. (f) Three axes of the local 3D coordinate syem. (bottom) Local 3D coordinate syems of three objects. Fig.

Considering the model s robuness to shae deformation, we use a set of local 2D coordinate syems to simultaneously localize each ixel inside the 2D shae.

The oint features are designed to reresent the satial relationshi between the oint and key objects arts, as well as the 2D shae of these key object arts.

ϑ 1 and ϑ 2 denote the lengths of s and t, resectively, and ϑ 3 denotes the angle between s and t. ϑ 4 and ϑ 5 measure the diance between and the line segments s and t, resectively.

Then, the local regressor F ( ) is modeled as a linear regression of the 3D coordinates of using θ : F (θ ) = (θ ) T M (4) node t node s node u node v i Six local 2D coordinate syems uv sv Figure 5.

Given a secific 2D coordinate syem w.r.t each air of key arts s and t, we generate a MRF to train the arameter M for the local regressor F( ) in each different oint.

5 a b c d e t o t f Local 2D coordinate t syem o s s Centerline s Almo erendicular to the an Almo arallel to the an M q M s,t Lean again a box s,(i) node t q,(i),(j) q,(j) s Lean again a box Figure 4. Some es in learning 3D reconruction. (a,b) Visualization of the detected key object arts in the RGB and deth image. (c) Recovery of the object body area. (d) Notation for the local 2D coordinate syem. (e) Notation for determining the local 3D coordinate syem. (f) Three axes of the local 3D coordinate syem. (bottom) Local 3D coordinate syems of three objects. Fig. 5 shows the basic design of the category model. The category model handles both the effects and directly rojects each 2D ixel into a 3D sace. Considering the model s robuness to shae deformation, we use a set of local 2D coordinate syems to simultaneously localize each ixel inside the 2D shae. Each local 2D coordinate syem is conructed using each air of key object arts. For each oint in every 2D coordinate syem, we train a local regressor to eimate its 3D coordinates from its oint features. The oint features are designed to reresent the satial relationshi between the oint and key objects arts, as well as the 2D shae of these key object arts. Thus, the oint features contain sufficient cues for object rotations and ructure deformation to guide the eimation of the 3D coordinates of each 2D oint. For oint in the local 2D coordinate syem w.r.t key arts s and t of a given object samle, its oint features are defined as θ = [ϑ 1, ϑ 2, ϑ 3, ϑ 4, ϑ 5, 1] T. ϑ 1 and ϑ 2 denote the lengths of s and t, resectively, and ϑ 3 denotes the angle between s and t. ϑ 4 and ϑ 5 measure the diance between and the line segments s and t, resectively. ϑ 1, ϑ 2, ϑ 4, and ϑ 5 are normalized by the centerline length between s and t (Fig. 4(d)). Then, the local regressor F ( ) is modeled as a linear regression of the 3D coordinates of using θ : F (θ ) = (θ ) T M (4) node t node s node u node v i Six local 2D coordinate syems uv sv Figure 5. Learning 3D reconruction. The category model uses different local 2D coordinate syems (bottom) to localize the 2D oints, and eimates the 3D ructure in these coordinate syems. Given a secific 2D coordinate syem w.r.t each air of key arts s and t, we generate a MRF to train the arameter M for the local regressor F( ) in each different oint. F( ) uses the oint feature θ,(i) in object samle i to eimate 3D coordinates of, thus overcoming i s secific object rotations, ructure deformation, and deth errors. tv su where matrix M is the arameter for the function F ( ). The local regressor F ( ) is the basic element of the category model. The category model is formulated as a linear combination of these local regressors, as follows. Y(x) = 1 Z 1 s<t m j tu w F (θ ) =(x) (5) where Z = 1 s<t m w=(x). x is a ixel inside the 2D object shae, and Y(x) eimates its 3D coordinates. s and t indicate two key arts of the object; m denotes the key art number. Given the local coordinate syem conructed using s and t, function (x) determines the 2D osition of x in this coordinate syem. Thus, function F( ) is the local regressor of 3D coordinates for oint in the coordinate syem of s and t, as mentioned above. w measures the reliability of the local regressor F( ) and is regarded as its weight. Local 2D coordinate syem: As shown in Fig. 4(d), given the line segments of each air of key arts s and t (1 s < t m), we generate a local 2D coordinate syem. The line segment connecting the center of s to that of t is t

6 called the centerline between s and t. The origin of the local coordinate syem w.r.t s and t is defined as the center of the centerline. We take the centerline s orientation as the orientation of the fir coordinate axis, and determine the orientation of the second axis using the right-hand rule. The unit vector of the fir axis is normalized using the centerline length, and that of the second axis is normalized using the average segment length of s and t. Model learning based on a continuous MRF: In the local 2D coordinate syem determined by the key arts s and t, the local regressors of the neighboring ositions are connected to form an MRF. We generate the following objective function to learn the arameters of the regressors. argmax M φ 1 (M ) + V (,q) E φ 2 (M, M q ) (6) where M = {M V}; V and E denote node and edge sets of the MRF, resectively. In the MRF, the unary term φ 1 ( ) measures the similarities between the ground truth and the 3D coordinates eimated by the regressor for each oint. φ 1 (M ) = 1 ex( η F S (θ,(i) ) y i ) (7) i S where S is the set of training samles. For each object samle i, θ,(i) and y i denote the oint feature and true 3D coordinates of, resectively. η is a scaling arameter. The airwise term φ 2 (, ) is a smoothing term between neighboring oints. φ 2 (M, M q ) = λ(m M q ) 2 ex[ α(u + U q ) 2 ] U = 1 max S (, ) E y i y i (8) i S where U use true 3D ructures of the training samles to measure the discontinuity around oint. Object areas with high discontinuity, such as object edges, have low weights for smoothing. λ and α are scaling arameters. We use the MCMC method to solve the continuous MRF. Finally, when the regressor F ( ) has been trained, the weight for F ( ) in (5) is comuted as w = 1 S i S ex[ η(f (θ,(i) ) y i )2 ] (9) D reconruction based on category models As shown in Fig. 1, we use a trained category model to directly erform 3D reconruction on teing RGB images. We fir use the trained category detector to collect target objects in the RGB images, and simultaneously determine the body area of the objects (see Section 4.1). We then use the category model in (5) to eimate the 3D coordinates for each ixel inside the object. 5. Exeriments 5.1. Data We use the category dataset of Kinect RGB-D images [1, 26], which is ublished as a andard RGB-D object dataset 5 for the learning of grah-matching-based models, such as [26, 27]. These RGB-D images deict cluttered scenes containing objects with different textures and rotations. Four categories notebook PC, drink box, srayer, and duan in this dataset contain a sufficient number of RGB-D objects for training and are thus chosen in the exeriments Imlementation details Training of category detectors: We train the SAP (category detector) from a set of large grahs (informally catured images). Parameter τ controls the grah size of the SAP, or in other word, the number of key object arts contained by the category detector. When we set a higher value for τ, we can extract more key arts for a category, but the key arts are less reliable in art detection. To simlify the learning rocess, we require the SAPs (detectors) to have four nodes (key arts) for all the four categories. We try different values of τ in the training rocess, until the category detector satisfies this requirement. Hence, we can detect four key arts for each object, and the rotation and deformation of the object are determined by the comlex satial relationshi between the four key arts. Learning 3D reconruction: As shown in Fig. 5, in each local 2D coordinate syem, we divide the entire object area comrising of grids to identify different 2D oints. A local regressor with arameter M is generated for each grid, and we thus use the local regressors conruct the MRF. For the alication of 3D reconruction, many ixels in the 2D shae are not accurately localized in the grid centers. To achieve accurate ixel-level 3D reconruction, we interolate the regressor arameter value for each 2D ixel from the trained arameters M of the nearby grids. We set arameter α as 0.1 for all the categories in model learning, and use different values of λ and η to te the syem Quantitative comarison Cometing method: We comare our aroach with conventional methods for 3D reconruction from a single image. As discussed in Section 2, mo related techniques have their own secific assumtions for the target image and are thus not suitable for 3D reconruction of irregular 1 shaes. From this viewoint, examle-based 3D reconruction is close to our method, but conventional techniques are 5 Comared to other RGB-D datasets i.e. [16, 12], this is one of the large RGB-D object datasets, and reflects challenges of grah matching.

7 Figure 6. 3D reconruction comarison. The roosed method exhibits smaller reconruction errors and surface roughness than the examle-based method. The learning of category models is not sensitive to different settings of arameters λ and η. λ and η are not involved in the examle-based method hamered by the use of informally catured RGB-D images for training. Nevertheless, considering the challenges in the training dataset, we design a cometing method that achieves a rough idea of the examle-based 3D reconruction, as follows. Fir, given a RGB image, we use the category detector trained in Section 3 to detect the target object in it, as well as its key object arts. We then randomly select an RGB-D object from the training samle set as the examle for 3D reconruction. The 3D coordinate eimation is based on (5), ju like the roosed method. However, without training, the weight for local regressors (w ) is set to 1. The local regressors F is redefined as a conant, i.e. the 3D coordinate for oint on the examle, rather than a function in (4) w.r.t. oint features θ. Pixel-level evaluation: We evaluate the reconruction errors and surface roughness (non-smoothness) of the roosed aroach at the ixel level. Reconruction errors are widely used to evaluate 3D reconruction. We manually reare the ground the truth of the 3D ructure for each object using the Kinect measurement. We measure the diance between the eimated 3D coordinates of each ixel and its true coordinates, as the ixel reconruction error. We define the reconruction error of an object as the average reconruction error of its conituent ixels. The reconruction error of a category is defined as the average reconruction error of all its objects. The other evaluation metric is the surface roughness. The roughness of ixel is measured as r = Y 1 N() q N() Yq, where ixel N() is the set of 4-connectivity neighboring ixels of, Y denotes the eimated 3D coordinates of ixel on the object. Ju like the reconruction error, the object roughness is the average of the ixel roughness, and the surface roughness of a category is defined as the average of object-level roughness. The evaluation is achieved via cross validation. Ju as in [26, 27], we randomly select 2/3 and 1/3 of the RGB- D images in the entire dataset as a air of samle ools for training and teing, resectively. Each air of samle ools can train a category model, and rovide values of the reconruction error and surface roughness for the category. We reare different airs of training and teing ools and comute the average erformance by cross validation. Fig. 6 and Fig. 7 show the comarison of 3D reconruction erformance. Without sufficient learning, the cometing method cannot correctly eimate ructure deformation for objects, and suffers from the unavoidable errors in Kinect s deth measurement. 6. Conclusions In this aer, we roosed to learn a category model for single-view 3D reconruction, which encoded knowledge for ructure deformation, texture variations, and rotation changes. The category model was both trained from and alied to the informally catured images. Exeriments demonrated the effectiveness of the roosed method. For the training of category detectors, we used various node and edge features. These included texture features on local image atches. Thus, the texture information contributes to the extraction of key object arts. However, textures contribute much less in further 3D reconruction, as many daily-use objects (e.g. the drink box) are designed with different textures that are unrelated to object ructures for commercial uroses. The satial relationshi between key object arts was, therefore, taken as more reliable cues for 3D reconruction and used in model learning. We did not aly object segmentation and thus avoided the uncertainties related to it, so as to simlify the syem and enable a reliable evaluation. The category model did not erformed so well on object edges as in other object area due to the time-of-the-flight errors and calibration errors between 3D oints and RGB images. References [1] Category Dataset of Kinect RGBD Images, htt://sites.google.com/site/quanshizhang. 6 [2] O. Barinova, V. Konushin, A. Yakubenko, K. Lee, H. Lim, and A. Konushin. Fa automatic single-view 3-d reconruction of urban scenes. In ECCV,

a1 a2 b1 b2 a3 b3 a4 Learn texture knowledge for object detection a5 b4 a6 a7 b5 c2 c3 a9 b7 c9 b9 b6 b8 Eimate ructure deformation from 2D shaes c1 a8 c4 c5

(c1 9) Reconruction results of the roosed method. [3] J. T. Barron and J. Malik. Color conancy, intrinsic images, and shae eimation. In ECCV, 2012. 2 [4] J. T. Barron and J. Malik. Shae, albedo, and illumination from a single image of an unknown object.

Kim, and R. Ciolla. Inferring 3d shaes and deformations from single views. In ECCV, 2010. 3, 4 [7] E. Delage, H. Lee, and A. Ng.

A dynamic bayesian network model for autonomous 3d reconruction from a single indoor image. In CVPR, 2006. 2 [9] D. Fouhey, A. Guta, and M. Hebert.

In CVPR worksho, 2006. 3 [11] T. Hassner and R. Basri. Single view deth eimation from examles. In CoRR, abs/1304.3915, 2013. 3 [12] E. Herb, X. Ren, and D. Fox.

2 [14] K. Karsch, Z. Liao, J. Rock, J. T. Barron, and D. Hoiem. Boundary cues for 3d object shae recovery. In CVPR, 2013. 2, 3 [15] K. Ko ser, C. Zach, and M.

A large-scale hierarchical multi-view rgb-d object dataset. In Proc. of IEEE International Conference on Robotics and Automation (ICRA), ages 1817 1824, 2011.

In ICML, 2012. 1 [18] M. R. Oswald, E. To e, and D. Cremers. Fa and globally otimal single view reconruction of curved objects. In CVPR, 2012. 2 [19] A.

Leibe, T. Tuytelaars, and L. Gool. Shae-from recognition: Recognition enables meta-data transfer. In CVIU, 2009. 3 [21] E. Toe, C. Nieuwenhuis, and D. Cremers.

Imagebased 3d modeling via cheeger sets. In ACCV, 2010. 3 [23] A. Varol, A. Shaji, M. Salzmann, and P. Fua.

Symmetric iecewise lanar object reconruction from a single image. In CVPR, 2011. 2 [25] L. Zhang, G. Dugas-Phocion, J.-S. Samson, and S. M. Seitz.

Category modeling from ju a single labeling: Use deth information to guide the learning of 2d models.

8 a1 a2 b1 b2 a3 b3 a4 Learn texture knowledge for object detection a5 b4 a6 a7 b5 c2 c3 a9 b7 c9 b9 b6 b8 Eimate ructure deformation from 2D shaes c1 a8 c4 c5 c6 c7 c8 Figure 7. 3D reconruction erformance. (a1 9) Object detection (cyan lines) in RGB images. (b1 9) Detailed view of the target objects. (c1 9) Reconruction results of the roosed method. [3] J. T. Barron and J. Malik. Color conancy, intrinsic images, and shae eimation. In ECCV, [4] J. T. Barron and J. Malik. Shae, albedo, and illumination from a single image of an unknown object. In CVPR, [5] Y. Chen and R. Ciolla. Single and sarse view 3d reconruction by learning shae riors. In CVIU, 115(5): , , 4 [6] Y. Chen, T.-K. Kim, and R. Ciolla. Inferring 3d shaes and deformations from single views. In ECCV, , 4 [7] E. Delage, H. Lee, and A. Ng. Automatic single-image 3d reconructions of indoor manhattan world scenes. In Robotics Research, [8] E. Delage, H. Lee, and A. Ng. A dynamic bayesian network model for autonomous 3d reconruction from a single indoor image. In CVPR, [9] D. Fouhey, A. Guta, and M. Hebert. Data-driven 3d rimitives for single image underanding. In ICCV, [10] T. Hassner and R. Basri. Examle based 3d reconruction from single 2d images. In CVPR worksho, [11] T. Hassner and R. Basri. Single view deth eimation from examles. In CoRR, abs/ , [12] E. Herb, X. Ren, and D. Fox. Rgb-d object discovery via multi-scene analysis. In IROS, [13] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In ICCV, [14] K. Karsch, Z. Liao, J. Rock, J. T. Barron, and D. Hoiem. Boundary cues for 3d object shae recovery. In CVPR, , 3 [15] K. Ko ser, C. Zach, and M. Pollefeys. Dense 3d reconruction of symmetric scenes from a single image. In DAGM, [16] K. Lai, L. Bo, X. Ren, and D. Fox. A large-scale hierarchical multi-view rgb-d object dataset. In Proc. of IEEE International Conference on Robotics and Automation (ICRA), ages , [17] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsuervised learning. In ICML, [18] M. R. Oswald, E. To e, and D. Cremers. Fa and globally otimal single view reconruction of curved objects. In CVPR, [19] A. Saxena, M. Sun, and A. Y. Ng. Make3d: Learning 3d scene ructure from a single ill image. In PAMI, 31(5): , [20] A. Thomas, V. Ferrari, B. Leibe, T. Tuytelaars, and L. Gool. Shae-from recognition: Recognition enables meta-data transfer. In CVIU, [21] E. Toe, C. Nieuwenhuis, and D. Cremers. Relative volume conraints for single view 3d reconruction. In CVPR, [22] E. To e, M. R. Oswald, D. Cremers, and C. Rother. Imagebased 3d modeling via cheeger sets. In ACCV, [23] A. Varol, A. Shaji, M. Salzmann, and P. Fua. Monocular 3d reconruction of locally textured surfaces. In PAMI, 34(6): , [24] T. Xue, J. Liu, and X. Tang. Symmetric iecewise lanar object reconruction from a single image. In CVPR, [25] L. Zhang, G. Dugas-Phocion, J.-S. Samson, and S. M. Seitz. Single view modeling of free-form scenes. In CVPR, [26] Q. Zhang, X. Song, X. Shao, R. Shibasaki, and H. Zhao. Category modeling from ju a single labeling: Use deth information to guide the learning of 2d models. In IEEE Conference on Comuter Vision and Pattern Recognition (CVPR), ages , , 2, 3, 6, 7 [27] Q. Zhang, X. Song, X. Shao, H. Zhao, and R. Shibasaki. Learning grah matching for category modeling from cluttered scenes. In ICCV, , 7 [28] Q. Zhang, X. Song, X. Shao, H. Zhao, and R. Shibasaki. Attributed grah mining and matching: An attemt to define and extract soft attributed atterns. In CVPR, , 3, 4

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical