Fashion Analytics and Systems

1 Learning and Vision Group, NUS (NUS-LV)
Fashion Analytics and Systems
Shuicheng YAN, National University of Singapore
[Special thanks to Luoqi LIU, Xiaodan LIANG, Si LIU, Jianshu LI]

2 Deep Learning Ecosystem in NUS-LV Lab
Purine: general, bi-graph based DL framework; multi-PC, multi-CPU/GPU; linear speedup; model parallel, data parallel.
Brain-like + Baby-like: brain-like network structures and a baby-like self/endless learning process.
Algorithms. Landing: visual perception + big data modeling/learning; object/products/human analytics, and other non-visual big data.
Architecture. 1. 4 winner awards in VOC; 2. One 2nd prize in VOC; 3. 2nd prize in ImageNet 13; 4. 1st prize in ImageNet 14.
Best paper/demo awards: ACM MM13, ACM MM12.
Licensed with M-scale users daily.
Fashion Analytics and Systems. Applications: LFW: 99.70%; best human parsing performance; cross-age synthesis; face analysis with occlusions.

3 Fashion Analytics and Systems in NUS-LV
Diagram labels: Human Parsing; Fashion Item Search; Smart Advertisement; Body; Beauty e-experts; Face.

4 Task I: Human/Fashion Parsing (Pixel-to-Pixel Deep Prediction: 44.76%--> 85.36%) Time for Human+

5 Goal: Human Parsing
Decompose a human photo into semantic fashion/body items (pixel-level semantic labeling).
Labels: upper-clothes, sunglasses, skirt, scarf, right-shoe, right-leg, right-arm, pants, left-shoe, left-leg, left-arm, hat, face, dress, belt, bag, hair, null.

6 Human Parsing = Engine for Applications

7 Stage-1: Pipeline Solutions
Hand-designed pipelines
--- Rely heavily on the performance of each individual component
--- Founded on hand-designed features and complex context models
Components shown: parametric model [1]; segmentation hypotheses; extract handcrafted features; non-parametric model (retrieval) [2]; postprocessing (e.g. CRF).
[1] Jian Dong, Qiang Chen, Wei Xia, ZhongYang Huang, and Shuicheng Yan. A deformable mixture parsing model with parselets. In ICCV, 2013.
[2] K. Yamaguchi, M.H. Kiapour, and T.L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.

8 Exemplar Solution: Paper Doll Parsing

9 Stage II: Deep Regression for Components
Items shown: hair, face, left-arm, sunglasses, right-arm, bag, upper-clothes, left-leg, skirt, right-leg, left-shoe, right-shoe.
Assumption: pixel-to-pixel impossible!
Deformable Human Items Model (similar to ASM): predict the normalized item masks and their active shape/location parameters with two CNNs.

10 Normalized Item Mask
The masks of different items tend to appear in a variety of item-specific shapes.
Each mask can be approximated as a linear combination of the learned templates.

11 Our Framework
Active Template Network: predicts item template coefficients.
Active Shape Network: predicts active shape/location parameters.
Combine the resulting structure outputs, then refine the parsing result.
Flowchart labels: human detection; Active Template Network (item mask reconstruction, item confidence map); Active Shape Network (item position, scale and visibility); background confidence map; structure labels; superpixel smoothness.

12 Active Template Network
Learn 50 templates for each item offline by Non-negative Matrix Factorization (NMF).
Regress the template coefficients: 50*17 = 850 outputs for the 17 human items.
[Figure: network architecture. Input image (size 227), Layers 1-7 with 3x3 max pooling and contrast normalization, output of 850 template coefficients.]
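To make the template idea concrete, here is a minimal pure-Python sketch of NMF on toy data. It is not the authors' implementation: the multiplicative-update rule, the toy 2x2 masks, the rank `k=2`, and all function names are assumptions chosen for illustration (the slide uses 50 templates per item and does not say which NMF solver was used).

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (n_masks x n_pixels) as W @ H with W, H >= 0.

    Rows of H act as templates; rows of W are per-mask coefficients.
    Uses multiplicative updates for the squared reconstruction error.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def reconstruct(coeffs, templates):
    """Approximate one flattened mask as a linear combination of templates."""
    return matmul([coeffs], templates)[0]

def frob_error(v, w, h):
    """Squared Frobenius error between V and W @ H."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb))

# Toy data: four flattened 2x2 binary masks built from two underlying shapes.
masks = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]
W, H = nmf(masks, k=2)
approx_mask = reconstruct(W[2], H)
```

In the slide's setting, the rows of `H` would be learned offline per item, and the network would regress the coefficient vector (the role played by a row of `W` here) from the image.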

13 Active Shape Network
Predict the x, y coordinates, width, height and visibility flag for each item.
Eliminate max-pooling from the CNN to preserve position sensitivity.
[Figure: network architecture. Input image (size 227), Layers 1-7, a 1024-unit layer, output of 85 active shape parameters.]
[Table: Original Structure vs. Ours, compared on accuracy, foreground accuracy, average precision, average recall and average F-1 score.]

14 Structure Output Combination
Combine the structure outputs from the two networks and generate 17 confidence maps, one per human item (examples shown: hair, sunglasses, face, dress, left arm).
Optional bounding-box refinement and superpixel smoothing.
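One way the two structure outputs could be combined is sketched below: a normalized item mask is resized into the predicted box to form that item's confidence map. This is a hedged illustration, not the method from the slides; the item-major layout of the 85-value vector, the use of (x, y) as the top-left corner in pixel units, the 0.5 visibility threshold, nearest-neighbour resizing, and the function names are all assumptions.

```python
def decode_shape_output(vec, n_items=17):
    """Split a flat 5*n_items vector into per-item (x, y, w, h, visible).

    Assumes an item-major layout, which the slides do not specify.
    """
    assert len(vec) == 5 * n_items
    return [tuple(vec[5 * i: 5 * i + 5]) for i in range(n_items)]

def paste_mask(mask, box, out_h, out_w):
    """Resize a normalized mask (nearest neighbour) into box = (x, y, w, h),
    given in integer pixel units, and return an out_h x out_w confidence map."""
    x0, y0, bw, bh = box
    mh, mw = len(mask), len(mask[0])
    conf = [[0.0] * out_w for _ in range(out_h)]
    for r in range(max(0, y0), min(out_h, y0 + bh)):
        for c in range(max(0, x0), min(out_w, x0 + bw)):
            src_r = (r - y0) * mh // bh
            src_c = (c - x0) * mw // bw
            conf[r][c] = mask[src_r][src_c]
    return conf

def item_confidence(mask, params, out_h, out_w, vis_threshold=0.5):
    """Confidence map for one item; all zeros if the item is predicted invisible."""
    x, y, w, h, visible = params
    box = (int(round(x)), int(round(y)), int(round(w)), int(round(h)))
    if visible < vis_threshold or box[2] <= 0 or box[3] <= 0:
        return [[0.0] * out_w for _ in range(out_h)]
    return paste_mask(mask, box, out_h, out_w)

# Toy example: a 2x2 normalized mask placed in a 4-wide, 2-high box at (1, 1).
toy_mask = [[1.0, 0.5],
            [0.0, 1.0]]
conf_map = item_confidence(toy_mask, (1.0, 1.0, 4.0, 2.0, 0.9), out_h=4, out_w=6)
hidden = item_confidence(toy_mask, (1.0, 1.0, 4.0, 2.0, 0.1), out_h=4, out_w=6)
```

Repeating this for each of the 17 items would yield the 17 confidence maps; the optional bounding-box refinement and superpixel smoothing mentioned on the slide are not covered here.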

15 Results
Dataset: 7,700 images (6,000 for training, 1,000 for testing and 700 for validation).
Training: the learning rate is decreased manually according to the validation error.
Training time: 120 epochs take 2-3 days on two NVIDIA GTX TITAN 6GB GPUs.
Testing time: about 0.5 seconds per image.

16 Results
Comparison of parsing performance with two state-of-the-art methods.
[Table: Yamaguchi [3], Paper-doll [2], ATR (noSPR), ATR and ATR + BBox Regression, compared on accuracy, foreground accuracy, average precision, average recall and average F-1 score.]
[2] K. Yamaguchi, M.H. Kiapour, and T.L. Berg. Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV, 2013.
[3] K. Yamaguchi, M.H. Kiapour, L.E. Ortiz, and T.L. Berg. Parsing clothing in fashion photographs. In CVPR, 2012.
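The five reported metrics can be computed from a predicted and a ground-truth label map roughly as follows. This sketch reflects common usage rather than the exact evaluation protocol behind the slides: treating label 0 as background, averaging per-class F-1 scores, and skipping classes absent from both maps are assumptions, as is the function name.

```python
def parsing_metrics(pred, truth, n_labels, background=0):
    """Pixel-level parsing metrics from two label maps (lists of rows of ints)."""
    pairs = [(p, t) for prow, trow in zip(pred, truth) for p, t in zip(prow, trow)]
    accuracy = sum(p == t for p, t in pairs) / len(pairs)

    # Foreground accuracy: only pixels whose ground truth is not background.
    fg = [(p, t) for p, t in pairs if t != background]
    fg_accuracy = sum(p == t for p, t in fg) / len(fg) if fg else 0.0

    precisions, recalls, f1s = [], [], []
    for label in range(n_labels):
        tp = sum(p == label and t == label for p, t in pairs)
        n_pred = sum(p == label for p, _ in pairs)
        n_true = sum(t == label for _, t in pairs)
        if n_pred == 0 and n_true == 0:
            continue  # class absent from both maps: skipped (assumption)
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_true if n_true else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(f1s)
    return {
        "accuracy": accuracy,
        "fg_accuracy": fg_accuracy,
        "avg_precision": sum(precisions) / n,
        "avg_recall": sum(recalls) / n,
        "avg_f1": sum(f1s) / n,
    }

# Toy 2x3 label maps with labels 0 (background), 1 and 2.
truth = [[0, 0, 1],
         [2, 2, 1]]
pred = [[0, 1, 1],
        [2, 0, 1]]
metrics = parsing_metrics(pred, truth, n_labels=3)
```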

17 Parsing Results
[Figure: columns show test image, Paperdoll, ATR (noSPR), ATR.]

18 Stage-III: Pixel-to-pixel Deep Prediction
Contexts + fully convolutional neural network
Cross-layer context: multi-level feature fusion
Global image-level context: coherence between pixel-wise labelling and image label prediction
Local super-pixel context: within-superpixel consistency and cross-superpixel appearance consistency

19 Contextualized Network: Cross-layer context
Four feature map fusions.
Feature map sizes shown in the figure: image 150*100, then 150*100, 75*50, 37*25, 18*12, 37*25, 37*25, 75*50, 75*50, 150*[...]; [...]*5 convolutions.
Hierarchically combine the low-level local details and high-level semantic information.
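A single cross-layer fusion step can be sketched as "upsample the coarser map to the finer map's size, then merge". The slide only says the maps are fused; nearest-neighbour upsampling, element-wise summation, single-channel maps, and the function names below are assumptions for illustration. Resizing to an explicit target size matters here because the listed sizes are not exact multiples (18*12 to 37*25).

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D feature map (list of rows)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def fuse(coarse, fine):
    """Upsample the coarse map to the fine map's size and add element-wise."""
    up = resize_nearest(coarse, len(fine), len(fine[0]))
    return [[u + f for u, f in zip(urow, frow)] for urow, frow in zip(up, fine)]

# Tiny example: a 2x2 coarse map fused into a 3x3 fine map of zeros.
small = fuse([[1, 2], [3, 4]], [[0] * 3 for _ in range(3)])

# Sizes from the slide: an 18*12 map fused with a 37*25 map.
coarse = [[1.0] * 12 for _ in range(18)]
fine = [[2.0] * 25 for _ in range(37)]
fused = fuse(coarse, fine)
```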

20 Contextualized Network Global image-level context Incorporate global image label prediction

21 Contextualized Network Local Super-pixel context Integrate within-super-pixel smoothing and cross-super-pixel neighbourhood voting
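The within-superpixel part of this step can be illustrated by averaging per-pixel label confidences inside each superpixel before taking the arg-max. This is a simplified sketch under assumptions: the superpixel ids are given as a precomputed map, plain averaging is used, and the cross-superpixel neighbourhood voting is left out. The function names are hypothetical.

```python
def smooth_within_superpixels(scores, superpixels):
    """Replace each pixel's confidence vector by its superpixel's mean vector.

    scores[r][c] is a list of per-label confidences;
    superpixels[r][c] is the id of the superpixel containing that pixel.
    """
    sums, counts = {}, {}
    for srow, prow in zip(scores, superpixels):
        for vec, sp in zip(srow, prow):
            if sp not in sums:
                sums[sp] = [0.0] * len(vec)
                counts[sp] = 0
            sums[sp] = [a + b for a, b in zip(sums[sp], vec)]
            counts[sp] += 1
    means = {sp: [v / counts[sp] for v in total] for sp, total in sums.items()}
    return [[means[sp] for sp in prow] for prow in superpixels]

def label_map(scores):
    """Arg-max label per pixel."""
    return [[max(range(len(vec)), key=vec.__getitem__) for vec in row]
            for row in scores]

# Toy 1x4 image with two labels; the second pixel is a noisy outlier.
scores = [[[0.9, 0.1], [0.4, 0.6], [0.8, 0.2], [0.3, 0.7]]]
superpixels = [[0, 0, 0, 1]]
before = label_map(scores)
after = label_map(smooth_within_superpixels(scores, superpixels))
```

In the toy example the outlier pixel is relabelled to agree with the rest of its superpixel, while the pixel in the second superpixel keeps its own label.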

22 Contextualized Network
Global image-level context helps distinguish ambiguous labels (e.g. skirt vs. dress).
[Figure: Co-CNN w/o global label, global image label, Co-CNN; labels shown: skirt, dress, upper-clothes.]

23 Contextualized Network
Local super-pixel context retains within-superpixel and appearance consistency.
[Figure: input, Co-CNN w/o sp, Co-CNN.]

24 Results
Comparison of parsing performance with four state-of-the-art methods on the ATR dataset.
[Table: Yamaguchi et al., Paperdoll, ATR and Co-CNN, compared on accuracy, foreground accuracy, average precision, average recall and average F-1 score.]

25 Results Analyses on architectural variants of our model Cross-layer context

26 Results Analyses on architectural variants of our model Global image label context

27 Results Analyses on architectural variants of our model Local super-pixel context

28 Results
Adding 10,000 human pictures from chictopia.com.
[Table: Paperdoll, ATR, Co-CNN and Co-CNN (+Chictopia10k), compared on accuracy, foreground accuracy, average precision, average recall and average F-1 score. Surviving value: Paperdoll accuracy 88.[...]]

29 Multi-task with Semantic Edge Detection
Semantic edge detection task: input, semantic edge.
Motivations: incorrect edges; within-item edge vs. cross-item edge.

30 Multi-task with Semantic Edge Detection
Integrate the semantic edge into the Co-CNN.
Multi-resolution fusion.

31 Results
Human parsing results with semantic edge co-prediction (figure panels: w/o edge, w/ edge).
Semantic edge accuracy: 92.31%
Columns: accuracy, foreground accuracy, average precision, average recall, average F-1 score.
Paperdoll:
ATR: [...]11, 71.04, 71.69, 60.25, 64.38
Co-CNN: 95.23, 80.90, 81.55, 74.42, 76.[...]
Co-CNN (+Chictopia10k):
Co-CNN (+Chictopia10k) (semantic edge):

32 Parsing Results
[Figure: columns show test image, Paperdoll, ATR, Co-CNN.]

33 Online Human Parsing Engine (<0.15s)
44.76% --> 85.36%: nearly ready for many industry applications

34 Ongoing Work: Robust Snap&Buy

35 Task II: Face Beautification (Beauty e-experts)

36 Makeover (makeup + hairstyle) Process
The makeover process: (1) Recommendation (color & shape); (2) Synthesis.
Synthesis steps, starting from the before-makeup face: (1) foundation, (2) lip, (3) eye shadow, (4) hairstyle.

37 System Flowchart
Flowchart labels: database; Recommendation Module (image features, beauty attributes, beauty-related attributes); Synthesis Module (hair synthesis, makeup synthesis); testing face (without makeover); Try & Buy.

38 Recommendation Module
Beauty attributes; beauty-related attributes; visual features.

39 Beauty Attributes
Hair length, hair color, hair shape, hair bangs, hair volume, eye shadow shape, eye shadow color, foundation, lip gloss.
Extraction methods shown in the figure: spectral matting; spectral matting + clustering; clustering.
In total, we define 9 kinds of beauty attributes (directly related to real cosmetic products).

40 Beauty-related Attributes
Examples: face shape (long, oval, round); lip thickness (thick, normal); ocular distance (wide, normal, narrow); race (western, eastern).
In total, we define 21 kinds of beauty-related attributes, which (1) are unchanged during the makeover process and (2) have strong correlations with beauty attributes.

41 Visual Features
Color histograms, color moments, histogram of gradients, local binary patterns, ASM parameters, shape context.

42 Recommendation Model: Formulation
Variables: beauty-related attributes ($a_r$), beauty attributes ($a_b$), visual features ($x$).
Gibbs distribution:
$$p(a_b, a_r \mid x) = \frac{1}{Z(x)} \exp\{-E(a_b, a_r, x)\}, \qquad Z(x) = \sum_{a_b, a_r} \exp\{-E(a_b, a_r, x)\}$$
A super-graph (super-vertices and a Maximum Spanning Tree) is used to define $E$.
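For a toy attribute space, the Gibbs distribution can be evaluated by brute-force enumeration, as in the sketch below. It assumes the usual sign convention p proportional to exp(-E); the two attributes, their values, the unary and pairwise energy tables, and the function names are all hypothetical. The super-graph and Maximum Spanning Tree construction of E from the slide is not implemented, and enumeration is only feasible for very small domains.

```python
import itertools
import math

def gibbs_posterior(energy, domains):
    """Return {assignment: probability} with p = exp(-E) / Z over all assignments."""
    assignments = list(itertools.product(*domains))
    weights = [math.exp(-energy(a)) for a in assignments]
    z = sum(weights)  # partition function Z(x)
    return {a: w / z for a, w in zip(assignments, weights)}

# Hypothetical toy problem: one beauty attribute (hair length) and one
# beauty-related attribute (face shape). The visual features x are folded
# into the unary terms, which stand in for per-attribute classifier outputs.
domains = [("short", "long"), ("round", "oval")]
unary = {"short": 0.5, "long": 0.2, "round": 0.1, "oval": 0.4}
pairwise = {
    ("short", "round"): 1.0,
    ("short", "oval"): 0.0,
    ("long", "round"): 0.0,
    ("long", "oval"): 0.5,
}

def toy_energy(assignment):
    a_b, a_r = assignment
    return unary[a_b] + unary[a_r] + pairwise[(a_b, a_r)]

posterior = gibbs_posterior(toy_energy, domains)
best = max(posterior, key=posterior.get)  # lowest-energy assignment
```

A recommendation would then correspond to the most probable assignment of the beauty attributes, here read off from `best`.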

43 Synthesis Module
Steps: alignment; alpha blending.
Inputs: recommended beauty templates/attributes (examples shown: #3, #12, "short, straight, side part", #4, #3).
Output: synthesis result.
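Alpha blending itself is a per-pixel convex combination of the aligned template and the face. The sketch below assumes the template is already aligned to the face, that pixels are RGB tuples, and that a per-pixel alpha map in [0, 1] is available; the function name and the sample colors are hypothetical.

```python
def alpha_blend(face, template, alpha):
    """Per-pixel blend: out = alpha * template + (1 - alpha) * face.

    face and template are same-sized images given as rows of RGB tuples;
    alpha is a same-sized map of weights in [0, 1].
    """
    out = []
    for frow, trow, arow in zip(face, template, alpha):
        out.append([
            tuple(round(a * t + (1 - a) * f) for f, t in zip(fpix, tpix))
            for fpix, tpix, a in zip(frow, trow, arow)
        ])
    return out

# Toy 1x3 image: alpha 0 keeps the face, 1 takes the template, 0.5 mixes them.
face = [[(200, 150, 120), (200, 150, 120), (200, 150, 120)]]
template = [[(180, 40, 60), (180, 40, 60), (180, 40, 60)]]
alpha = [[0.0, 0.5, 1.0]]
blended = alpha_blend(face, template, alpha)
```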

44 Exemplar Synthesis Process
Foundation, lip gloss, eye shadow, hairstyle.

45 Recommendation and Synthesis Results

46 Recommendation and Synthesis Results

47 Beyond Makeover: Shape Beautification

48 Beyond Makeover: Shape Beautification

49 Fashion Analytics and Systems in NUS-LV
Diagram labels: Human Parsing; Fashion Item Search; Smart Advertisement; Body; Beauty e-experts; Face.
