Brand-Aware Fashion Clothing Search using CNN Feature Encoding and Re-ranking


Dipu Manandhar, Kim Hui Yap, Muhammet Bastan, Zhao Heng
School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore

Abstract

Brand plays a significant role in fashion clothing, and consumers are brand conscious when searching for and purchasing clothing. Existing visual fashion search methods [1], [8], [17]-[19] often do not explicitly consider brand information such as logos. Brand logos in clothing images are quite small and often suffer from various deformations, and hence pose a significant challenge for branded clothing search. In view of this, this paper presents a new Brand-Aware Fashion Search (BAFS) framework that exploits brand information during the visual search. We construct a new brand fashion dataset which consists of 10K images of branded clothing with trademark logos. The proposed framework first jointly detects the brand logo and clothing items in the images. Next, to extract rich visual information from the clothing images, we propose a new deep feature encoding known as Principal Component Maximum Activation of Convolutions (PMAC) that leverages hierarchies of CNN activations. The PMAC feature aims to capture both low-level visual information and high-level global abstraction from the images. The proposed method further uses a brand-aware re-ranking technique to improve the search. Experiments conducted on the brand fashion dataset show that the proposed framework achieves superior performance to other comparative methods.

I. INTRODUCTION

In recent years, e-commerce and online shopping for clothing [1], [2] have been growing on various commercial platforms such as eBay, Amazon, and Taobao. In fashion clothing, the brand is an integral part of the product. Brands often reflect customers' self-image and personality. Customers are often brand conscious, and hence brand information plays an important role in shopping.
Consider a case where a user would like to search for a branded clothing item. Fig. 1 shows clothing retrieval examples using off-the-shelf CNN features extracted from VGG16-Net [3]. The first column shows the query images, and the corresponding retrieved images are shown in the respective rows. For instance, in the first row, the customer wants to search for a black Adidas hoodie. Although the search has retrieved visually similar images, the results are not brand aware: hoodies from other brands such as Puma and Ferrari are also retrieved in the top ranks. In this work, we explore the problem of clothing retrieval using a deep learning framework that leverages brand information during the visual search. Clothing retrieval has a wide range of commercial applications [4]-[8]. The advantage of the proposed BAFS is that it considers the user's brand preference and incorporates it into the search process during online shopping. In addition, the proposed method will also help fashion companies to promote their brands [9].

Fig. 1. Example of branded clothing search. Relevant and irrelevant retrieved images are shown with green and red borders, respectively.

Clothing retrieval has emerged as a popular topic in recent years. Earlier works [6], [7], [10], [11] used handcrafted features to describe clothing attributes and perform the search. In recent years, convolutional neural networks (CNN) and deep learning [12] have proven to be powerful techniques for various vision-related tasks such as image classification [13] and detection [14], [15]. Fashion retrieval using deep learning is developed in [5], [8], [16]-[19]. Features from fine-tuned CNN architectures have been used for clothing recognition and retrieval in [16]-[18]. Kiapour et al. [8] and Wang et al. [5] have explored the problem of matching street images to online shop images using pretrained network features. Recently, Liu et al.
[19] proposed FashionNet, which jointly uses features pooled from clothing landmarks and a fully connected layer to perform fashion retrieval.

Existing visual search methods [2], [8], [16]-[18], [20] do not encode brand information during the search. Moreover, these methods mainly rely on features extracted from the fully connected layers of CNNs, which provide global representations of images. Thus, local information and small objects are not well encoded in these global image signatures. Hence, existing methods cannot effectively integrate brand information, which plays a key role in clothing retrieval. In view of this, we propose a new brand-aware fashion search using robust CNN features.

The main contribution of this work is threefold. First, we construct a new brand fashion clothing dataset in which images belonging to different brands are characterized by different trademark logos, and we explore the importance of brand information in clothing retrieval. Second, we propose a new PMAC feature which captures the hierarchical visual information learned by different convolutional layers. We show that PMAC features are more effective for instance retrieval than traditional FC layer features or single-layer features. Third, we propose a clothing re-ranking engine based on the available clothing and brand information. The engine shows good performance improvement in fashion search. Overall, the proposed fashion search framework outperforms other state-of-the-art methods by 16% mAP.

II. OVERVIEW OF THE PROPOSED METHOD

The proposed fashion search framework is given in Fig. 2. Given a query image, the framework first detects the clothing and brand logo in the image. Next, using the detected clothing region, we extract the PMAC features from the activations of different CNN layers. We compute the similarity score between the query and database images. The initially retrieved images are then passed through a brand-aware re-ranking engine, and the system returns the relevant images to the user.

III. DATASET CONSTRUCTION

We constructed a new Brand Fashion dataset which contains images of various clothing categories with brand information centered on trademark logos. We collected clothing images from 15 popular fashion brands such as Adidas, Puma, and Nike. We will make the dataset publicly available in the near future.

A. Image Collection and Cleaning

We crawled the images from Google Images using relevant search keywords. For each brand, we created a pool of keywords such as "Denim Versace Hoodie", "Playboy Tee", and "Floral Adidas Tank" to query the website. For the 15 brands, around 150K images were downloaded. Irrelevant images were then removed by human screening. We also removed single-channel images and low-resolution images below a minimum pixel size. Next, to remove redundant and highly similar images, we use the FC7 response from VGGNet [3]. For each brand, pairwise image similarities are computed using FC7 features.
For images with similarity scores greater than a threshold (thres = 0.97), we keep one copy and remove the rest. At the end of this step, the dataset contains 9,498 clean and relevant images.

Fig. 2. Proposed brand-aware fashion framework.

B. Annotation

The images in the dataset carry two types of annotation, namely, clothing item category and brand logo information. We use the same fine-grained clothing categories as the DeepFashion dataset [19]. Each image is first labeled with its clothing category together with the bounding box coordinates. Next, all the logos in the dataset are annotated with the brand information and the corresponding bounding boxes. Interactive annotation tools have been developed to support fast annotation.

IV. PROPOSED BRAND-AWARE FASHION SEARCH (BAFS) FRAMEWORK

The proposed fashion search framework consists of three main modules, which are described in the following sections.

A. Clothing and Brand Logo Detector

The proposed framework simultaneously detects clothing items and brand logos in images. We develop a joint clothing and logo detector using Faster R-CNN [14]. The detector contains two sub-networks, namely, a Region Proposal Network (RPN) and Fast R-CNN, which share common convolutional layers for efficiency. To generate region proposals for clothing and logos, the RPN slides predefined anchor boxes of different scales and aspect ratios over the last convolutional layer. In this paper, in order to target small brand logos, we sample the anchors with 4 scales as opposed to the 3 scales used in [14]. Our experiments show that using finer scales of region proposals improves logo detection performance by 8% mAP. The RPN is trained to generate proposals using the multi-task loss function in (1).

$$L(p_i, b_i) = L_{cls}(p_i, p_i^*) + \lambda \, p_i^* \, L_{reg}(b_i, b_i^*) \tag{1}$$

The first term in (1) is the soft-max loss, where $p_i$ and $p_i^*$ are the predicted probability of anchor $i$ being an object and the ground-truth label, respectively.
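As a numerical illustration of (1) for a single anchor, the following is a minimal sketch and not the authors' implementation. It assumes a binary log loss for $L_{cls}$ and a smooth-L1 loss for $L_{reg}$, as in Faster R-CNN [14]; the function names and the default `lam=1.0` are hypothetical choices, not values taken from the paper.

```python
import math


def smooth_l1(x):
    """Smooth-L1 penalty on one coordinate difference (assumed form of L_reg)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def rpn_anchor_loss(p, p_star, b, b_star, lam=1.0):
    """Loss of eq. (1) for one anchor.

    p      : predicted probability that the anchor contains an object
    p_star : ground-truth label, 0 or 1
    b      : predicted box coordinates (4 values)
    b_star : ground-truth box coordinates (4 values)
    lam    : balancing weight lambda (hypothetical default)
    """
    eps = 1e-12  # guards against log(0)
    l_cls = -(p_star * math.log(p + eps) + (1 - p_star) * math.log(1 - p + eps))
    l_reg = sum(smooth_l1(bi - gi) for bi, gi in zip(b, b_star))
    # The regression term is switched on only when p_star == 1.
    return l_cls + lam * p_star * l_reg


negative = rpn_anchor_loss(0.5, 0, [0, 0, 0, 0], [5, 5, 5, 5])
positive = rpn_anchor_loss(0.5, 1, [0.5, 0, 0, 0], [0, 0, 0, 0])
print(negative, positive)
```

For the negative anchor the box term vanishes because it is multiplied by $p_i^* = 0$, so only the classification term contributes, however far the predicted box is from the ground truth.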
The value of $p_i^* \in \{0, 1\}$ is determined by the Intersection-over-Union (IoU) of the anchor with the ground-truth box: $p_i^* = 0$ if IoU < 0.3 and $p_i^* = 1$ if IoU > 0.7. The second term is active only for positive anchors ($p_i^* = 1$) and represents the regression loss for bounding box prediction, where $b_i$ and $b_i^*$ are the predicted and ground-truth box coordinates. Next, the proposals obtained for clothing and logos are used to pool features from the last convolutional layer, which are then used to classify the proposals into clothing and brand logos. We use the 4-step alternating training method [14] to train the detector network. Fig. 3 shows examples of joint detection of clothing and logos from the brand fashion dataset.

B. PMAC Feature Extraction

Several previous works have exploited FC layer features [2], [16], [17], [20]-[23] and features pooled from convolutional layers [24], [25] for instance image retrieval. However, these features do not encode the hierarchy of information required for instance search. In view of this, we propose a new Principal Component Maximum Activation of Convolutions (PMAC) feature encoding which leverages both low-level and high-level abstractions to extract rich features for retrieval.

To extract the PMAC features from an image, we operate on the activations of $n$ convolutional layers $L = \{l_i\}_{i=1}^{n}$ of the network. For a particular layer $l$, we pool features from its 3D activation maps of dimensions $W^{(l)} \times H^{(l)} \times K^{(l)}$, where $K^{(l)}$ is the number of filters in layer $l$ and $W^{(l)} \times H^{(l)}$ is the spatial extent of each feature map $R_i^{(l)}$, $i = 1, 2, \ldots, K^{(l)}$. For each feature map $R_i^{(l)}$, we spatially pool the maximum activation of the map to construct a feature $f^{(l)} = \{f_1^{(l)}, f_2^{(l)}, \ldots, f_i^{(l)}, \ldots, f_{K^{(l)}}^{(l)}\}$ for layer $l$:

$$f_i^{(l)} = \max R_i^{(l)}, \quad i = 1, \ldots, K^{(l)} \tag{2}$$

Next, we concatenate the features from multiple layers to generate a single descriptor $F = [f^{(l)}]$, $l \in \{1, \ldots, n\}$. This concatenation strategy is a simple yet effective way to represent an image. Then, in order to extract discriminative information, the features $F$ are projected to a new subspace using PCA and whitening. Mathematically, we construct a $k$-dimensional PMAC feature vector $X = \{x_1, x_2, \ldots, x_i, \ldots, x_k\}$, where each component $x_i$ is computed using (3):

$$x_i = \frac{1}{\sqrt{\lambda_i}} \left( U^T F \right)_i \tag{3}$$

where $U$ is the transformation matrix formed by the $k$ principal eigenvectors and $\lambda_i$ are the corresponding eigenvalues.
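The pooling, concatenation and whitening steps of (2) and (3) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' code: activations are taken to be channel-first arrays of shape (K, H, W), the descriptors are mean-centered before PCA (the paper does not say whether centering is applied), random arrays stand in for real CNN activations, and the filter counts and all function names are hypothetical.

```python
import numpy as np


def mac(activations):
    """Eq. (2): spatial maximum of each feature map; input shape (K, H, W)."""
    return activations.reshape(activations.shape[0], -1).max(axis=1)


def concat_mac(layer_activations):
    """Concatenate the per-layer MAC vectors into one descriptor F."""
    return np.concatenate([mac(a) for a in layer_activations])


def fit_pca_whitening(F_train, k):
    """Learn the mean, the top-k eigenvectors U and their eigenvalues."""
    mean = F_train.mean(axis=0)
    cov = np.cov(F_train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1][:k]   # keep the k largest
    return mean, eigvecs[:, order], eigvals[order]


def pmac(F, mean, U, eigvals, eps=1e-8):
    """Eq. (3): project onto U and scale component i by 1 / sqrt(lambda_i)."""
    return (U.T @ (F - mean)) / np.sqrt(eigvals + eps)


rng = np.random.default_rng(0)
channels = (8, 8, 4)  # hypothetical filter counts for three layers
train = np.stack([
    concat_mac([rng.random((k, 6, 6)) for k in channels])
    for _ in range(200)
])
mean, U, eigvals = fit_pca_whitening(train, k=5)
x = pmac(train[0], mean, U, eigvals)
print(train.shape, x.shape)
```

After whitening, each retained component has approximately unit variance over the training descriptors, so no single layer or filter dominates the cosine similarity used for retrieval.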
We demonstrate the effectiveness of the proposed PMAC in clothing retrieval in Section V-B. The similarity between the query image and a database image is computed as the cosine similarity (4) of their PMAC representations $q$ and $D$, respectively.

$$\mathrm{Sim}(q, D) = \frac{q^T D}{\|q\|_2 \, \|D\|_2} \tag{4}$$

C. Brand-Aware Re-ranking

Although the PMAC image encoding captures both the low-level details and the high-level abstraction of the images, it does not prioritize brand information. This is because brand logos generally cover a small area of the image, so the visual information from such small objects is not captured well in the descriptor. The proposed framework thus uses a brand-aware re-ranking engine to re-rank the initial retrieval shortlist. The engine is a two-stage re-ranking system which uses the information from brand and clothing detection. First, we retain the images which belong to the same brand as the brand detected in the query image. Second, similar filtering is performed based on the detected clothing category. We show that the re-ranking engine significantly improves the retrieval results.

Fig. 3. Sample images from the branded fashion dataset with clothing and logo detection.

V. EXPERIMENTS AND RESULTS

A. Experimental Setting

The clothing retrieval experiments are conducted on the Brand Fashion dataset, which contains 9,498 clothing images. In the experiments, we use 50 query images. For each query, the ground truths are the images of the same clothing item in the database. For each query, the Average Precision (AP) is calculated from the Precision-Recall curve. The APs of all the query images are averaged to obtain the mean Average Precision (mAP), which is used as the performance metric. To demonstrate the effectiveness of the proposed method, we use ZF-Net as the backbone CNN, as it is a lightweight model with fewer parameters than other architectures such as AlexNet, VGGNet and ResNet.
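The mAP metric described above can be sketched as follows. This minimal sketch uses the common non-interpolated form of AP, the mean of the precision values at the ranks of the relevant images. The paper does not state which summary of the Precision-Recall curve it uses, so this form is an assumption, and the function names are hypothetical.

```python
def average_precision(relevance, num_relevant=None):
    """AP for one query.

    relevance    : ranked list of 0/1 flags (1 = retrieved image is a ground truth)
    num_relevant : total number of ground truths; defaults to the number of
                   relevant images in the list (assumes all were retrieved)
    """
    if num_relevant is None:
        num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / num_relevant


def mean_average_precision(all_relevance):
    """Average the per-query APs to obtain mAP."""
    return sum(average_precision(r) for r in all_relevance) / len(all_relevance)


# Two toy queries: relevant hits at ranks 1 and 3, and at rank 2.
queries = [[1, 0, 1, 0], [0, 1]]
ap_first = average_precision(queries[0])
map_score = mean_average_precision(queries)
print(ap_first, map_score)
```

In the first toy query the precision is 1/1 at rank 1 and 2/3 at rank 3, so the AP is their mean; the second query has a single hit at rank 2 and an AP of 0.5.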
We employ transfer learning by fine-tuning the network pre-trained on ImageNet. The algorithms are implemented using the Caffe [26] framework.

B. Results and Discussion

1) Joint Logo and Clothing Detector: The joint logo and clothing detector achieves 96% and 98% detection accuracy for logos and clothing, respectively, in the query images. Examples of joint detection of logos and clothing are shown in Fig. 3. From the figure, we can observe that the detector can simultaneously detect various logos and clothing items in fashion images under different scales, deformations and background clutter. In the first image in Fig. 3, the proposed detector detects the clothing item (a hoodie) and the brand (Adidas) with high confidence, as shown by the blue and red boxes, respectively. Similarly, in the other images, clothing items and logos are accurately detected despite large variations in scale, orientation and location. The detected clothing boxes and logo information are incorporated into the subsequent search during feature extraction and clothing re-ranking.

2) Retrieval with PMAC Encoding and Brand-Aware Re-ranking: This section first presents an analysis of the choice of convolutional layers used for PMAC encoding. Next, the performance of the proposed BAFS method is discussed. Table I shows the retrieval performance using different convolutional layers and their combinations. The early layers achieve poor performance, as early features are too generic and only represent low-level visual information such as edges, colors and blobs [27]. We observe that features from the third layer to the last layer provide relatively similar performance (about 30%). Although these layers provide similar performance, they encode different levels of information. Therefore, in order to exploit the rich hierarchy of feature information, we explore various ways to combine layers for feature representation. The combined use of the three penultimate layers of the network shows the best performance (about 35%). Hence, we choose these layers for the feature embedding described in (2) and (3) in our experiments.

TABLE I
RETRIEVAL PERFORMANCE (mAP) VS. CONVOLUTIONAL LAYERS USED FOR FEATURE ENCODING
[Table body not recoverable: the columns are Single Layer, mAP, Multi-Layers, mAP; the only legible entry is conv1 with an mAP of 8.9.]

The retrieval performance of the proposed method is shown in Table II, which compares the features used and the different re-ranking steps. The initial retrieval using the directly concatenated feature vector achieves an mAP of 34.8%. Retrieval using the proposed PMAC feature outperforms this with an mAP of 45.9% (a gain of about 11 percentage points). Although the directly concatenated features capture multi-layer feature information, they contain significant redundancy. In contrast, the proposed PMAC extracts discriminative features from various layers which capture low-to-high level visual information, and it uses PCA to further project the high-dimensional feature into a meaningful subspace. The initial performance of the proposed PMAC is further improved by brand-aware re-ranking to 51.9% and 53.6% using clothing and logo re-ranking, respectively. The experimental results clearly show the advantage of PMAC feature encoding and brand-aware re-ranking.
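The initial retrieval with the cosine similarity of (4) and the two-stage re-ranking of Section IV-C can be sketched as below. This is a minimal sketch, not the authors' engine. It assumes that a brand label and a clothing category are already available for every database image (in the paper they come from the joint detector), that non-matching images are dropped rather than demoted, and that the shortlist size, the toy data and the function names are hypothetical.

```python
import numpy as np


def cosine_similarity(q, D):
    """Eq. (4) for one query q of shape (d,) against database rows D of shape (n, d)."""
    return (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))


def brand_aware_search(q, D, brands, categories, q_brand, q_category, shortlist=100):
    """Rank by cosine similarity, then filter by brand and by clothing category."""
    sims = cosine_similarity(q, D)
    ranked = [int(i) for i in np.argsort(-sims)[:shortlist]]  # initial retrieval
    stage1 = [i for i in ranked if brands[i] == q_brand]          # same brand
    stage2 = [i for i in stage1 if categories[i] == q_category]   # same category
    return ranked, stage1, stage2


# Toy database of four 2-D descriptors with detector outputs attached.
D = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.6], [0.0, 1.0]])
brands = ["Adidas", "Puma", "Adidas", "Adidas"]
categories = ["Hoodie", "Hoodie", "Tee", "Hoodie"]
q = np.array([1.0, 0.0])

ranked, stage1, stage2 = brand_aware_search(q, D, brands, categories, "Adidas", "Hoodie")
print(ranked, stage1, stage2)
```

In the toy example the visually close Puma hoodie at index 1 is removed by the brand stage and the Adidas tee at index 2 by the category stage, while the surviving images keep the order given by the cosine similarity.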
TABLE II
RETRIEVAL PERFORMANCE (mAP, %) OF THE PROPOSED BAFS FRAMEWORK

Feature Used          Initial Retrieval   Clothing Re-ranking   Logo Re-ranking
Direct Concatenated   34.8                [missing]             [missing]
PMAC                  45.9                51.9                  53.6

3) Comparison with Other State-of-the-art Methods: We compare our proposed method with other state-of-the-art works [16], [17], [20] which use various fully connected layer features from CNNs after clothing detection. Table III compares the proposed BAFS method with three other methods. The baseline methods achieve the lowest performance, as they directly use features from pre-trained networks which are not well adapted to the clothing domain. The R-MAC [24] method uses features pooled from the last convolutional layer and achieves an mAP of 30.03%. The other methods [17] and [20] use features extracted from a network fine-tuned on the domain dataset and achieve 33.17% and 37.11%, respectively. All of these methods rely on features extracted from a single layer of the network, which often does not capture the full range of information that is crucial for clothing instance search. Moreover, they do not effectively incorporate brand information during the search. In contrast, the proposed BAFS method uses a rich hierarchy of CNN features and incorporates brand information, and hence achieves an mAP of 53.6%, which clearly outperforms the other state-of-the-art methods. A qualitative comparison of the retrieval results of the proposed method and the state-of-the-art methods is shown in Fig. 4, which demonstrates the effectiveness of the proposed BAFS framework.

TABLE III
COMPARISON OF THE PROPOSED METHOD WITH OTHER METHODS

Method                          mAP (%)
Baseline 1: VGG16 - FC          [missing]
Baseline 2: AlexNet - FC        [missing]
R-MAC [24]                      30.03
Rapid-Clothing [17]             33.17
Visual-Search@Pinterest [20]    37.11
Proposed method                 53.6

Fig. 4. Qualitative comparison of retrieval results of various methods. Retrieved relevant and irrelevant images are shown with green and red borders, respectively.

VI. CONCLUSION

This paper proposes a new brand-aware fashion search framework. We introduce a new brand fashion dataset which consists of 10K images of branded fashion clothing. A joint detector for clothing items and logos is employed to extract the clothing and brand information. A new feature encoding method, PMAC, is proposed, which captures a hierarchy of low-level to high-level information in the feature descriptor. A brand-aware re-ranking engine is also proposed to improve the visual clothing search. The experimental results clearly show the effectiveness of the proposed method.

VII. ACKNOWLEDGMENT

This research was carried out at the Rapid-Rich Object Search (ROSE) Lab at the Nanyang Technological University, Singapore. The ROSE Lab is supported by the Infocomm Media Development Authority, Singapore. We gratefully acknowledge the support of NVIDIA AI Technology Center for their donation of a K40m GPU used for our research at the ROSE Lab.

REFERENCES

[1] D. Shankar, S. Narumanchi, H. Ananya, P. Kompalli, and K. Chaudhury, "Deep learning based large scale visual recommendation and search for e-commerce," arXiv preprint.
[2] F. Yang, A. Kale, Y. Bubnov, L. Stein, Q. Wang, H. Kiapour, and R. Piramuthu, "Visual search at eBay," arXiv preprint.
[3] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint.
[4] S. Bell and K. Bala, "Learning visual similarity for product design with convolutional neural networks," ACM Transactions on Graphics (TOG), vol. 34, no. 4, p. 98.
[5] X. Wang, Z. Sun, W. Zhang, Y. Zhou, and Y.-G. Jiang, "Matching user photos to online products with robust deep features," in Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval. ACM, 2016.
[6] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan, "Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
[7] W. Di, C. Wah, A. Bhardwaj, R. Piramuthu, and N. Sundaresan, "Style finder: Fine-grained clothing style detection and retrieval," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013.
[8] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg, "Where to buy it: Matching street clothing photos in online shops," in Proceedings of the IEEE International Conference on Computer Vision, 2015.
[9] Z.-Q. Cheng, X. Wu, Y. Liu, and X.-S. Hua, "Video ecommerce++: Towards large scale online video advertising," IEEE Transactions on Multimedia.
[10] M. Mizuochi, A. Kanezaki, and T. Harada, "Clothing retrieval based on local similarity with multiple images," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014.
[11] J. Fu, J. Wang, Z. Li, M. Xu, and H. Lu, "Efficient clothing retrieval with semantic-preserving visual phrases," in Asian Conference on Computer Vision. Springer, 2012.
[12] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems (NIPS), 2012.
[14] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems (NIPS).
[15] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[16] J. Huang, W. Xia, and S. Yan, "Deep search with attribute-aware deep network," in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014.
[17] K. Lin, H.-F. Yang, K.-H. Liu, J.-H. Hsiao, and C.-S. Chen, "Rapid clothing retrieval via deep learning of binary codes and hierarchical search," in Proceedings of the 5th ACM on International Conference on Multimedia Retrieval. ACM, 2015.
[18] J.-C. Chen and C.-F. Liu, "Visual-based deep learning for clothing from large database," in Proceedings of the ASE BigData & SocialInformatics. ACM, 2015, p. 42.
[19] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[20] Y. Jing, D. Liu, D. Kislyuk, A. Zhai, J. Xu, J. Donahue, and S. Tavel, "Visual search at Pinterest," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015.
[21] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in European Conference on Computer Vision (ECCV). Springer, 2014.
[22] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[23] Y. Kalantidis, C. Mellina, and S. Osindero, "Cross-dimensional weighting for aggregated deep convolutional features," in European Conference on Computer Vision. Springer, 2016.
[24] G. Tolias, R. Sicre, and H. Jégou, "Particular object retrieval with integral max-pooling of CNN activations," arXiv preprint.
[25] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, "A baseline for visual instance retrieval with deep convolutional networks," CoRR.
[26] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," arXiv preprint.
[27] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks." Cham: Springer International Publishing, 2014.
