CAP 6412 Advanced Computer Vision
1 CAP 6412 Advanced Computer Vision Boqing Gong Feb 11, 2016
2 Today Administrivia Neural networks & Backpropagation (Part VII) Edge detection, by Goran
3 Next week: CNN & videos Tuesday (02/16) Abdullah Jamal Thursday (02/18) Amar Kelu Nair [Optical flow] Fischer, Philipp, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. "FlowNet: Learning optical flow with convolutional networks." arXiv preprint (2015). & Secondary papers [Pose estimation] Pfister, Tomas, James Charles, and Andrew Zisserman. "Flowing ConvNets for human pose estimation in videos." In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. & Secondary papers
4 Project 1: Due in two weeks (02/28) If you choose option 2, your own project Deadline for discussion & approval: 02/11/2016 (This Thursday) See instructions on how to prepare the slides for discussion
5 Upload slides before or after class See Paper Presentation on UCF webcourse Sharing your slides: Refer to the original sources of images, figures, etc. in your slides Convert them to a PDF file Upload the PDF file to Paper Presentation after your presentation
6 Today Administrivia Neural networks & Backpropagation (Part VI) Image super-resolution, by Jose
7 Training algorithm INPUT: BSDS (training image Xn, edge annotations Yn1, ..., Yn5), n = 1, 2, ..., N Configure the network: Trim VGG-net (see right for network specification) Generate gold target Yn for each image Xn: Majority vote: 5 annotated edge maps Yn1, ..., Yn5 → one edge map Yn Duplicate Yn to have a gold side output Augment data: Rotate images by 16 angles Flip images Tune hyper-parameters: mini-batch size (10), learning rate (1e-6), loss weight (1), momentum (0.9), filter initialization (0), fusion initialization (1/5), weight decay (2e-4), #iterations (10,000) Train the neural network with those hyper-parameters and augmented data OUTPUT: A well-trained network
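The gold-target step above (majority vote over the five annotators' edge maps) can be sketched in a few lines of NumPy. The function name and toy data are illustrative, not from the paper's released code:

```python
import numpy as np

def majority_vote_edge_map(annotations):
    """Fuse K binary edge annotations of one image into a single gold
    edge map: a pixel becomes an edge if more than half of the
    annotators marked it."""
    stack = np.stack(annotations, axis=0)      # shape (K, H, W), values in {0, 1}
    votes = stack.sum(axis=0)                  # per-pixel vote count
    return (votes > stack.shape[0] / 2).astype(np.uint8)

# Toy example: 5 annotators labeling a 2x2 image.
ann = [np.array([[1, 0], [1, 0]]),
       np.array([[1, 0], [0, 0]]),
       np.array([[1, 1], [1, 0]]),
       np.array([[0, 0], [1, 0]]),
       np.array([[1, 0], [1, 1]])]
gold = majority_vote_edge_map(ann)             # [[1, 0], [1, 0]]
```

The resulting map Yn is then duplicated once per side output, as the slide describes.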
8 Test INPUT: A natural image, the trained network OUTPUT: Predicted edge maps
9 Holistically-Nested Edge Detection Authors: Saining Xie, Zhuowen Tu, UC San Diego In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015 Presented at UCF Advanced Computer Vision Class by: Goran Igic, February 11th, 2016.
10 About the paper and important links The paper was presented at the ICCV 2015 conference, held in December 2015 in Santiago, Chile. It was announced on the second day (Tue, Dec 15th) that the paper got a Marr Prize honorable mention. The first author (Saining Xie) maintains a web site, and the code used in the paper is open-sourced and published online. The same page has links to other repositories, including Caffe and a modified Caffe for HED. The paper used Piotr Dollár's Structured Forest MATLAB toolbox.
11 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
12 Motivation of Research To address a fundamental and important vision problem: edge detection The problem has been studied for years in image processing, computer vision, 3D vision, and robotics There are plenty of solutions: Early pioneering methods: Sobel, zero crossing, and the Canny detector; Information-theory methods on top of features: Statistical Edges, Pb, and gPb; Learning-based methods: BEL, Multiscale, Sketch Tokens, and Structured Edges; The newest wave of CNN-based methods: N4-Fields, DeepContour, DeepEdge, and CSCNN
13 Edge Detection Classic Approach There are three basic types of intensity discontinuities in digital images: points, lines, and edges. The most common way to look for discontinuities is to run a mask through the image. The response R of the mask at any point of the image is given by: R = Σ_{i=1}^{9} w_i z_i for the 3x3 mask, where the w_i are the mask coefficients and the z_i are the gray levels under the mask. An isolated point is detected when |R| ≥ T, where T is a threshold. Line detection: |R_i| > |R_j| for all j ≠ i, where R_i is the response of the i-th line-direction mask. Points and lines are important in image segmentation; edge detection is the most common approach. Convolution and Edge Detection: Computational Photography. Alexei Efros, CMU, Fall 2005.
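A minimal NumPy sketch of the mask-response test above, using the classic 3x3 point-detection (Laplacian-like) mask; the image values and the threshold T are illustrative:

```python
import numpy as np

def mask_response(patch, mask):
    """R = sum_i w_i * z_i over the 3x3 neighborhood."""
    return float(np.sum(mask * patch))

# Classic 3x3 point-detection mask: 8 at the center, -1 around it.
mask = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]])

flat = np.full((3, 3), 10)                 # uniform region: R = 0
point = flat.copy()
point[1, 1] = 50                           # isolated bright point

T = 100                                    # illustrative threshold
flat_hit = abs(mask_response(flat, mask)) >= T     # False: no discontinuity
point_hit = abs(mask_response(point, mask)) >= T   # True: point detected
```

On the uniform patch the positive and negative coefficients cancel (R = 0), while the isolated point yields R = 320, well above the threshold.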
14 Effects of noise, solution: smooth first Convolution and Edge Detection : Computational Photography. Alexei Efros, CMU, Fall 2005.
15 Derivative theorem of convolution Convolution and Edge Detection : Computational Photography. Alexei Efros, CMU, Fall 2005.
16 Laplacian of Gaussian, 2D edge detection filters Convolution and Edge Detection : Computational Photography. Alexei Efros, CMU, Fall 2005.
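The derivative theorem of convolution referenced above says d/dx (f * g) = f * (dg/dx): smoothing with a Gaussian and then differentiating equals a single convolution with the derivative of the Gaussian. A 1-D NumPy check on a step edge (the sigma and support sizes are illustrative):

```python
import numpy as np

def gaussian(x, sigma):
    return np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def dgaussian(x, sigma):
    # Analytic derivative of the Gaussian: g'(x) = -x / sigma^2 * g(x)
    return -x / sigma**2 * gaussian(x, sigma)

x = np.arange(-10, 11)                          # filter support
signal = (np.arange(100) >= 50).astype(float)   # 1-D step edge at index 50

# Route 1: smooth with G, then take a finite-difference derivative.
route1 = np.gradient(np.convolve(signal, gaussian(x, 2.0), mode='same'))

# Route 2: a single convolution with the derivative of the Gaussian.
route2 = np.convolve(signal, dgaussian(x, 2.0), mode='same')

# Both responses peak at the step edge.
peak1, peak2 = np.argmax(route1), np.argmax(np.abs(route2))
```

Both routes localize the edge at the step, which is why the one-pass derivative-of-Gaussian filter is preferred in practice.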
17 Canny Edge Detector (1) The authors compare their work to the 1986 Canny edge detector First, the image needs to be converted to gray scale:
18 Canny Edge Detector (2) The thresholds are auto-detected: T = [0.05, 0.125] The standard deviation of the smoothing filter is σ = 2.0 and σ = 4.0
19 Canny Edge Detector (3) σ = 8.0
20 Classical Methods On the Canny edge detector, from the MATLAB help: The Canny method finds edges by looking for local maxima of the gradient of I. The gradient is calculated using the derivative of a Gaussian filter. The method uses two thresholds to detect strong and weak edges, and includes the weak edges in the output only if they are connected to strong edges. For more information regarding classical methods, see chapter 10 of the book: R. C. Gonzalez, R. E. Woods, S. L. Eddins, Digital Image Processing Using MATLAB, 2nd edition, 2010
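The two-threshold hysteresis step described above can be sketched in NumPy: pixels above the high threshold are strong edges, and weak pixels (above the low threshold) are kept only if 8-connected to a strong pixel. The gradient values below are toy numbers; the threshold pair matches the T = [0.05, 0.125] example from the earlier slide:

```python
import numpy as np
from collections import deque

def hysteresis(grad_mag, low, high):
    """Keep pixels above `high` (strong), plus pixels above `low`
    that are 8-connected to a strong pixel (weak but linked)."""
    strong = grad_mag >= high
    weak = grad_mag >= low
    out = strong.copy()
    q = deque(zip(*np.nonzero(strong)))        # BFS from all strong pixels
    H, W = grad_mag.shape
    while q:
        i, j = q.popleft()
        for di in (-1, 0, 1):
            for dj in (-1, 0, 1):
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W and weak[ni, nj] and not out[ni, nj]:
                    out[ni, nj] = True
                    q.append((ni, nj))
    return out

# Toy gradient magnitudes: a strong edge with a weak continuation,
# plus one isolated weak response (noise) that should be dropped.
g = np.array([[0.00, 0.20, 0.00, 0.00],
              [0.00, 0.08, 0.00, 0.00],
              [0.00, 0.00, 0.00, 0.07]])
edges = hysteresis(g, low=0.05, high=0.125)
```

The weak pixel at (1, 1) survives because it touches the strong pixel at (0, 1); the isolated weak pixel at (2, 3) is discarded.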
21 Information Theory on top of features Methods: Statistical Edges, Pb, and gPb: 1. P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 2011. The contour detector combines multiple local cues into a globalization framework based on spectral clustering. It utilizes a segmentation algorithm for transforming the output of any contour detector into a hierarchical region tree. This paper is also important because it introduces the contour benchmark test.
22 Information Theory on top of features
23 Learning Based Models Structured Edges (SE): P. Dollár and C. L. Zitnick. Fast edge detection using structured forests. PAMI. Patches of edges exhibit well-known forms of local structure, such as straight lines or T-junctions. This paper takes advantage of the structure present in local image patches to learn both an accurate and computationally efficient edge detector. The paper formulates the problem of predicting local edge masks in a structured learning framework applied to random decision forests.
24 Learning Based Models
25 Learning Based Models As input, the method takes an image that may contain multiple channels, such as an RGB or RGB-D image. The task is to label each pixel with a binary variable indicating whether the pixel contains an edge or not. The labels within a small image patch are highly interdependent, providing a promising candidate problem for the structured forest approach. The method needs segmented training images, in which the boundaries between the segments correspond to contours. To train the decision trees, a mapping Π : Y → Z is defined from the structured label space Y to a discrete space Z. Ensemble model: random forests achieve robust results by combining the output of multiple trees.
26 Models based on Convolutional Neural Networks Models with integrated automatic hierarchical feature learning: N4-Fields, DeepContour, DeepEdge, CSCNN Holistically-Nested Edge Detection (HED): Holistic: the system takes an image as input and directly produces the edge map as output. Nested: the system produces edge maps as side outputs.
27 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
28 Problem Statement The paper needs to address two important issues of the edge detection vision problem: 1. Holistic image training and prediction, used in image-to-image classification, CNN 2. Nested multiscale and multilevel feature learning, deeply-supervised nets to guide early predictions, deep layer supervision to guide early classification results.
29 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
30 Main Contributions of the paper Develops an end-to-end edge detection system, the holistically-nested edge detection system (HED): Holistic: it aims to train and predict edges in an image-to-image fashion; Nested: the path along with each prediction is common to each of these edge maps; The system has integrated learning of hierarchical features
31 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
32 Approach Outline HED (Holistically-Nested Edge Detection) uses DSN (Deeply-Supervised Nets) training to fine-tune the VGG network (Visual Geometry Group, Oxford University) for the task of boundary detection. The principle behind DSN is classifier stacking adapted to deep learning, where each layer is informed about the final objective. The VGG net has great depth (16 layers), great density (stride-1 convolutional kernels), and multiple stages (five stride-2 downsampling layers). J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
33 Figure #2: Illustration of different multiscale deep-learning architecture configurations (a) Multi-stream architecture: multiple parallel networks with different parameter numbers and receptive field sizes, corresponding to multiple scales. The same input data is fed to all streams, and the concatenated feature responses are fed to the global output. (b) Skip-layer net architecture: there is a primary stream, and the contributions from different layers are added to this stream. There is only one output loss function. The edge prediction is better if there are multiple predictions to combine.
34 Figure #2: Illustration of different multiscale deep-learning architecture configurations (c) A single model running on multi-scale inputs: used in non-deep-learning-based methods. Not very efficient prediction. (d) Separate training of different networks: multiple independent networks with different depths and different output loss layers. The method is more resource-demanding.
35 Figure #2: Illustration of different multiscale deep-learning architecture configurations (e) Holistically-nested architecture, where multiple side outputs are added. This is a single-stream deep network with multiple side outputs. Multiple-scale prediction is possible.
36 Long et al. FCN/VGG Network J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
37 Approach Outline The authors of the paper changed the net: A side-output layer is connected to the last convolutional layer of each stage The last stage of the VGG net is cut out, including the 5th pooling layer J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
38 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
39 Details of Proposed Approach Training phase Input training data set: S = {(Xn, Yn), n = 1, ..., N}, where a raw image sample is Xn = {x_j^(n), j = 1, ..., |Xn|}, and the corresponding ground-truth binary edge map for image Xn is Yn = {y_j^(n), j = 1, ..., |Xn|}, y_j^(n) ∈ {0, 1}. Each image is treated holistically and independently, so the superscript n can be omitted. The goal is to have a network that learns the features from which it is possible to produce an edge map approaching the ground truth.
40 Details of Proposed Approach W is the collection of all standard network layer parameters. The network has M side-output layers, each associated with a classifier with corresponding weights w = (w^(1), ..., w^(M)). The side-output objective function of HED: L_side(W, w) = Σ_{m=1}^{M} α_m ℓ_side^(m)(W, w^(m)), where ℓ_side is the image-level loss function for side outputs. The loss function is computed over all pixels of a training image X = {x_j, j = 1, ..., |X|} and its edge map Y = {y_j, j = 1, ..., |X|}, y_j ∈ {0, 1}.
41 Details of Proposed Approach A typical natural image has a distribution of edge / non-edge pixels such that about 90% of the pixels are non-edge. A simpler strategy (vs. other papers) is used here to automatically balance the loss between the positive / negative classes: a class-balancing weight β on a per-pixel basis. Class-balanced cross-entropy loss function: ℓ_side^(m)(W, w^(m)) = −β Σ_{j∈Y+} log Pr(y_j = 1 | X; W, w^(m)) − (1 − β) Σ_{j∈Y−} log Pr(y_j = 0 | X; W, w^(m)), where β = |Y−| / |Y| and 1 − β = |Y+| / |Y|; Y+ is the edge ground-truth label set and Y− is the non-edge ground-truth label set; Pr(y_j = 1 | X; W, w^(m)) = σ(a_j^(m)) ∈ [0, 1] is computed using the sigmoid function σ(·) on the activation value at pixel j.
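A NumPy sketch of this class-balanced loss (illustrative, not the paper's Caffe code), taking the per-pixel probabilities Pr(y_j = 1) as already-computed sigmoid outputs:

```python
import numpy as np

def class_balanced_bce(pred, gt, eps=1e-12):
    """Class-balanced cross-entropy: beta = |Y-|/|Y| weights the (rare)
    edge pixels, 1 - beta = |Y+|/|Y| weights the non-edge pixels.
    `pred` holds Pr(y=1) per pixel (sigmoid outputs); `gt` is binary."""
    pred, gt = pred.ravel(), gt.ravel().astype(bool)
    beta = (~gt).sum() / gt.size                         # fraction of non-edge pixels
    pos = -beta * np.log(pred[gt] + eps).sum()           # edge-pixel term
    neg = -(1 - beta) * np.log(1 - pred[~gt] + eps).sum()  # non-edge-pixel term
    return pos + neg

# Toy image with 1 edge pixel out of 5, so beta = 0.8: each edge
# pixel's term is weighted 4x more than each non-edge pixel's term.
gt   = np.array([[1, 0, 0, 0, 0]])
pred = np.array([[0.8, 0.1, 0.1, 0.1, 0.1]])
loss = class_balanced_bce(pred, gt)
```

With the typical ~90% non-edge statistics quoted above, β ≈ 0.9, which keeps the abundant non-edge pixels from overwhelming the gradient.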
42 Details of Proposed Approach Edge-map predictions at each side-output layer: Ŷ_side^(m) = σ(Â_side^(m)), where Â_side^(m) = {a_j^(m), j = 1, ..., |Y|} are the activations of the side output of layer m. To directly utilize the side-output predictions, a weighted-fusion layer is added; the fusion weight is learned during training. The fusion-layer loss function: L_fuse(W, w, h) = Dist(Y, Ŷ_fuse), where Ŷ_fuse = σ(Σ_{m=1}^{M} h_m Â_side^(m)) and h = (h_1, ..., h_M) are the fusion weights.
43 Details of Proposed Approach Dist(Y, Ŷ_fuse) is the distance between the fused prediction and the ground-truth label map; it is the cross-entropy loss. Minimized objective function: (W, w, h)* = argmin (L_side(W, w) + L_fuse(W, w, h))
44 Details of Proposed Approach Testing phase: Given an image X, edge-map predictions are obtained from both the side-output layers and the weighted-fusion layer: (Ŷ_fuse, Ŷ_side^(1), ..., Ŷ_side^(M)) = CNN(X, (W, w, h)*) The final unified output: Ŷ_HED = Average(Ŷ_fuse, Ŷ_side^(1), ..., Ŷ_side^(M))
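The test-time computation (sigmoid side outputs, weighted fusion, final averaging) in NumPy. The activations are toy values, and the uniform fusion weights h_m = 1/5 match the fusion-weight initialization mentioned in the training slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def hed_predict(side_activations, h):
    """Given side-output activations A^(m) (already upsampled to image
    size) and fusion weights h, form Y_fuse = sigmoid(sum_m h_m A^(m))
    and return the final unified output: the average of Y_fuse and all
    side-output maps."""
    side_maps = [sigmoid(a) for a in side_activations]
    fuse = sigmoid(sum(hm * a for hm, a in zip(h, side_activations)))
    return np.mean([fuse] + side_maps, axis=0)

# Toy run: M = 5 side outputs on a 2x2 "image", uniform fusion weights.
A = [np.full((2, 2), v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
y = hed_predict(A, h=[0.2] * 5)            # per-pixel values in (0, 1)
```

Averaging the fused map with the individual side outputs is what the slide calls the final unified output Ŷ_HED.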
45 Figure #3: HED network architecture for edge detection. The error backpropagation path is highlighted. Side-output layers are inserted after convolutional layers, and deep supervision is in place at each side-output layer. The HED outputs are multiscale and multilevel: the side-output-plane size gets smaller while the receptive field size becomes larger. The weighted-fusion layer is added to automatically learn how to combine outputs from multiple scales.
46 Details of Proposed Approach C1-C5 are intermediate layers of the DCNN, E1-E5 are side layers, and the loss function is shown in red. Simplified presentation of the HED/DSN architecture, from the work "Surpassing Humans in Boundary Detection Using Deep Learning", Iasonas Kokkinos, Center for Visual Computing, CentraleSupélec and INRIA.
47 Figure #2, results from HED
48 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
49 Experiments The work uses the publicly available Caffe library and publicly available implementations of FCN and DSN. The whole network is fine-tuned from an initialization with the pre-trained VGG-16 net model. The hyper-parameters (mini-batch size, learning rate, loss weight α_m for each side-output layer) are tuned. Data augmentation, different pooling functions, and in-network bilinear interpolation are used, with not much improvement. Running time: training takes about 7 hours on a single NVIDIA K40 GPU. For a 320x480-pixel image, HED produces the final edge map in about 400 ms.
50 Arbelaez et al. P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. PAMI, 33(5), 2011. This paper proposed the benchmark test: One option is to regard the segment boundaries as contours and evaluate them as such. One might argue that the boundary benchmark favors contour detectors over segmentation methods, since the former are not burdened with the constraint of producing closed curves. Leading contour detection approaches are ranked according to their maximum F-measure, F = 2 · Precision · Recall / (Precision + Recall), with respect to human ground-truth boundaries. Detector performance is measured in terms of precision, the fraction of detections that are true positives, and recall, the fraction of ground-truth boundary pixels detected. The global F-measure, or harmonic mean of precision and recall at the optimal detector threshold, provides a summary score. ODS uses a fixed contour threshold for the entire data set, OIS uses the best threshold per image, and AP is the average precision.
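A simplified NumPy sketch of these measures: the F-measure at a threshold, with ODS picking one threshold for the whole dataset and OIS the best threshold per image. Exact pixel matching is used here for brevity, whereas the real benchmark matches boundaries within a small distance tolerance:

```python
import numpy as np

def f_measure(precision, recall):
    """F = 2 * Precision * Recall / (Precision + Recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pr_at_threshold(prob_map, gt, t):
    """Precision/recall of a thresholded edge-probability map against a
    binary ground truth (simplified exact-pixel matching)."""
    pred = prob_map >= t
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    return precision, recall

def ods_ois(prob_maps, gts, thresholds):
    """ODS: best dataset-wide threshold; OIS: best threshold per image."""
    per_t = [np.mean([f_measure(*pr_at_threshold(p, g, t))
                      for p, g in zip(prob_maps, gts)]) for t in thresholds]
    ods = max(per_t)
    ois = np.mean([max(f_measure(*pr_at_threshold(p, g, t)) for t in thresholds)
                   for p, g in zip(prob_maps, gts)])
    return ods, ois
```

By construction OIS ≥ ODS, since choosing a threshold per image can only help; this is why the reported OIS scores in the tables sit above ODS.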
51 BSDS500 dataset Results on the BSDS500 data set: 200 training, 100 validation, and 200 test images. The ground-truth contours are manually annotated. Edge detection is evaluated using three standard measures: Fixed contour threshold (ODS) Per-image best threshold (OIS) Average precision (AP) HED performed better than the other CNN-based detectors
52 BSDS500 dataset - Fixed contour threshold (ODS), Per-image best threshold (OIS), Average precision (AP) Table 3. Results of single and averaged side output in HED on the BSDS 500 dataset.
53 NYUDv2 dataset Precision/recall curves on the NYUD dataset. Holistically-nested edge detection (HED) trained with RGB and HHA features achieves the best result (ODS = .746).
54 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
55 Related Work Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick W., Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. In Proc. AISTATS, 2015. Kokkinos, Iasonas. Surpassing humans in boundary detection using deep learning. Liu, Ziwei, Li, Xiaoxiao, Luo, Ping, Loy, Chen Change, and Tang, Xiaoou. Semantic image segmentation via deep parsing network. arXiv preprint, 2015. Yu, Yizhou, Fang, Chaowei, and Liao, Zicheng. Piecewise flat embedding for image segmentation.
56 Content Motivation of Research Problem Statement Main Contributions of the paper Conclusion Strengths and Weakness of the Paper Overall Rating Future Directions Approach Outline Details of Proposed Approach Experiments Related Work
57 Conclusion - Strengths and Weaknesses The known problem is consensus sampling: the ground truth is duplicated at each side-output layer, and the side output is downsampled to its original scale. The mismatch exists, and noise may cause convergence issues in the high-level side outputs even with the help of a pre-trained model. Develops an end-to-end edge detection system, the holistically-nested edge detection system (HED): Holistic: it aims to train and predict edges in an image-to-image fashion; Nested: the path along with each prediction is common to each of these edge maps; The system has integrated learning of hierarchical features The experimental work was done well
58 Conclusion - Overall Rating The paper is very fresh and novel; it is a major part of the current wave of CNN implementations and machine learning in vision, among the works I have read for ICCV 2016 The paper got a Marr Prize honorable mention at the ICCV 2015 conference, held in December 2015 in Santiago, Chile The experimental results show a step forward The code is publicly available An illustration of different multiscale deep-learning architecture configurations is given My rating is 0
59 Conclusion - Future Directions DCNs can outperform humans (as Iasonas Kokkinos claims). There is certainly great potential in the usage of DCNs. The system idea is not very complex. The training can be improved. There are works including multi-resolution architectures.
More informationDeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection
DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection Gedas Bertasius University of Pennsylvania gberta@seas.upenn.edu Jianbo Shi University of Pennsylvania jshi@seas.upenn.edu
More informationDIGITAL IMAGE PROCESSING
The image part with relationship ID rid2 was not found in the file. DIGITAL IMAGE PROCESSING Lecture 6 Wavelets (cont), Lines and edges Tammy Riklin Raviv Electrical and Computer Engineering Ben-Gurion
More informationDeep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon
Deep Learning For Video Classification Presented by Natalie Carlebach & Gil Sharon Overview Of Presentation Motivation Challenges of video classification Common datasets 4 different methods presented in
More informationEE-559 Deep learning Networks for semantic segmentation
EE-559 Deep learning 7.4. Networks for semantic segmentation François Fleuret https://fleuret.org/ee559/ Mon Feb 8 3:35:5 UTC 209 ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE The historical approach to image
More informationObject Recognition II
Object Recognition II Linda Shapiro EE/CSE 576 with CNN slides from Ross Girshick 1 Outline Object detection the task, evaluation, datasets Convolutional Neural Networks (CNNs) overview and history Region-based
More informationObject detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation
Object detection using Region Proposals (RCNN) Ernest Cheung COMP790-125 Presentation 1 2 Problem to solve Object detection Input: Image Output: Bounding box of the object 3 Object detection using CNN
More informationPredicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus
Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture David Eigen, Rob Fergus Presented by: Rex Ying and Charles Qi Input: A Single RGB Image Estimate
More informationSupplemental Material for End-to-End Learning of Video Super-Resolution with Motion Compensation
Supplemental Material for End-to-End Learning of Video Super-Resolution with Motion Compensation Osama Makansi, Eddy Ilg, and Thomas Brox Department of Computer Science, University of Freiburg 1 Computation
More informationSupplementary Material: Unconstrained Salient Object Detection via Proposal Subset Optimization
Supplementary Material: Unconstrained Salient Object via Proposal Subset Optimization 1. Proof of the Submodularity According to Eqns. 10-12 in our paper, the objective function of the proposed optimization
More informationReal-time Object Detection CS 229 Course Project
Real-time Object Detection CS 229 Course Project Zibo Gong 1, Tianchang He 1, and Ziyi Yang 1 1 Department of Electrical Engineering, Stanford University December 17, 2016 Abstract Objection detection
More informationTRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK
TRANSPARENT OBJECT DETECTION USING REGIONS WITH CONVOLUTIONAL NEURAL NETWORK 1 Po-Jen Lai ( 賴柏任 ), 2 Chiou-Shann Fuh ( 傅楸善 ) 1 Dept. of Electrical Engineering, National Taiwan University, Taiwan 2 Dept.
More informationWhat is an edge? Paint. Depth discontinuity. Material change. Texture boundary
EDGES AND TEXTURES The slides are from several sources through James Hays (Brown); Srinivasa Narasimhan (CMU); Silvio Savarese (U. of Michigan); Bill Freeman and Antonio Torralba (MIT), including their
More informationCAP 6412 Advanced Computer Vision
CAP 6412 Advanced Computer Vision http://www.cs.ucf.edu/~bgong/cap6412.html Boqing Gong April 7th, 2016 Today Administrivia A guest lecture by David Hill on LSTM Attribute in computer vision, by Abdullah
More informationComputer Vision I. Announcements. Fourier Tansform. Efficient Implementation. Edge and Corner Detection. CSE252A Lecture 13.
Announcements Edge and Corner Detection HW3 assigned CSE252A Lecture 13 Efficient Implementation Both, the Box filter and the Gaussian filter are separable: First convolve each row of input image I with
More informationProceedings of the International MultiConference of Engineers and Computer Scientists 2018 Vol I IMECS 2018, March 14-16, 2018, Hong Kong
, March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong , March 14-16, 2018, Hong Kong TABLE I CLASSIFICATION ACCURACY OF DIFFERENT PRE-TRAINED MODELS ON THE TEST DATA
More information3D Shape Analysis with Multi-view Convolutional Networks. Evangelos Kalogerakis
3D Shape Analysis with Multi-view Convolutional Networks Evangelos Kalogerakis 3D model repositories [3D Warehouse - video] 3D geometry acquisition [KinectFusion - video] 3D shapes come in various flavors
More informationSpatial Localization and Detection. Lecture 8-1
Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday
More informationEdge detection. Winter in Kraków photographed by Marcin Ryczek
Edge detection Winter in Kraków photographed by Marcin Ryczek Edge detection Goal: Identify sudden changes (discontinuities) in an image Intuitively, most semantic and shape information from the image
More informationarxiv: v1 [cs.cv] 29 Sep 2016
arxiv:1609.09545v1 [cs.cv] 29 Sep 2016 Two-stage Convolutional Part Heatmap Regression for the 1st 3D Face Alignment in the Wild (3DFAW) Challenge Adrian Bulat and Georgios Tzimiropoulos Computer Vision
More informationDeep Crisp Boundaries
Deep Crisp Boundaries Yupei Wang 1,2, Xin Zhao 1,2, Kaiqi Huang 1,2,3 1 CRIPAC & NLPR, CASIA 2 University of Chinese Academy of Sciences 3 CAS Center for Excellence in Brain Science and Intelligence Technology
More informationMULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK. Wenjie Guan, YueXian Zou*, Xiaoqun Zhou
MULTI-SCALE OBJECT DETECTION WITH FEATURE FUSION AND REGION OBJECTNESS NETWORK Wenjie Guan, YueXian Zou*, Xiaoqun Zhou ADSPLAB/Intelligent Lab, School of ECE, Peking University, Shenzhen,518055, China
More informationSEMANTIC SEGMENTATION AVIRAM BAR HAIM & IRIS TAL
SEMANTIC SEGMENTATION AVIRAM BAR HAIM & IRIS TAL IMAGE DESCRIPTIONS IN THE WILD (IDW-CNN) LARGE KERNEL MATTERS (GCN) DEEP LEARNING SEMINAR, TAU NOVEMBER 2017 TOPICS IDW-CNN: Improving Semantic Segmentation
More informationGlobal Probability of Boundary
Global Probability of Boundary Learning to Detect Natural Image Boundaries Using Local Brightness, Color, and Texture Cues Martin, Fowlkes, Malik Using Contours to Detect and Localize Junctions in Natural
More informationCS 4495 Computer Vision. Linear Filtering 2: Templates, Edges. Aaron Bobick. School of Interactive Computing. Templates/Edges
CS 4495 Computer Vision Linear Filtering 2: Templates, Edges Aaron Bobick School of Interactive Computing Last time: Convolution Convolution: Flip the filter in both dimensions (right to left, bottom to
More informationMask R-CNN. presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma
Mask R-CNN presented by Jiageng Zhang, Jingyao Zhan, Yunhan Ma Mask R-CNN Background Related Work Architecture Experiment Mask R-CNN Background Related Work Architecture Experiment Background From left
More information(Deep) Learning for Robot Perception and Navigation. Wolfram Burgard
(Deep) Learning for Robot Perception and Navigation Wolfram Burgard Deep Learning for Robot Perception (and Navigation) Lifeng Bo, Claas Bollen, Thomas Brox, Andreas Eitel, Dieter Fox, Gabriel L. Oliveira,
More informationUnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss
UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss AAAI 2018, New Orleans, USA Simon Meister, Junhwa Hur, and Stefan Roth Department of Computer Science, TU Darmstadt 2 Deep
More informationYOLO9000: Better, Faster, Stronger
YOLO9000: Better, Faster, Stronger Date: January 24, 2018 Prepared by Haris Khan (University of Toronto) Haris Khan CSC2548: Machine Learning in Computer Vision 1 Overview 1. Motivation for one-shot object
More informationarxiv: v2 [cs.cv] 29 Nov 2016 Abstract
Object Detection Free Instance Segmentation With Labeling Transformations Long Jin 1, Zeyu Chen 1, Zhuowen Tu 2,1 1 Dept. of CSE and 2 Dept. of CogSci, University of California, San Diego 9500 Gilman Drive,
More informationCamera-based Vehicle Velocity Estimation using Spatiotemporal Depth and Motion Features
Camera-based Vehicle Velocity Estimation using Spatiotemporal Depth and Motion Features Moritz Kampelmuehler* kampelmuehler@student.tugraz.at Michael Mueller* michael.g.mueller@student.tugraz.at Christoph
More informationProject 3 Q&A. Jonathan Krause
Project 3 Q&A Jonathan Krause 1 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations 2 Outline R-CNN Review Error metrics Code Overview Project 3 Report Project 3 Presentations
More informationEdge Detection CSC 767
Edge Detection CSC 767 Edge detection Goal: Identify sudden changes (discontinuities) in an image Most semantic and shape information from the image can be encoded in the edges More compact than pixels
More informationDigital Image Processing COSC 6380/4393
Digital Image Processing COSC 6380/4393 Lecture 21 Nov 16 th, 2017 Pranav Mantini Ack: Shah. M Image Processing Geometric Transformation Point Operations Filtering (spatial, Frequency) Input Restoration/
More informationCEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015
CEA LIST s participation to the Scalable Concept Image Annotation task of ImageCLEF 2015 Etienne Gadeski, Hervé Le Borgne, and Adrian Popescu CEA, LIST, Laboratory of Vision and Content Engineering, France
More informationCS231N Section. Video Understanding 6/1/2018
CS231N Section Video Understanding 6/1/2018 Outline Background / Motivation / History Video Datasets Models Pre-deep learning CNN + RNN 3D convolution Two-stream What we ve seen in class so far... Image
More informationMULTI-LEVEL 3D CONVOLUTIONAL NEURAL NETWORK FOR OBJECT RECOGNITION SAMBIT GHADAI XIAN LEE ADITYA BALU SOUMIK SARKAR ADARSH KRISHNAMURTHY
MULTI-LEVEL 3D CONVOLUTIONAL NEURAL NETWORK FOR OBJECT RECOGNITION SAMBIT GHADAI XIAN LEE ADITYA BALU SOUMIK SARKAR ADARSH KRISHNAMURTHY Outline Object Recognition Multi-Level Volumetric Representations
More informationComputer Vision Lecture 16
Computer Vision Lecture 16 Deep Learning for Object Categorization 14.01.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Seminar registration period
More informationFinal Report: Smart Trash Net: Waste Localization and Classification
Final Report: Smart Trash Net: Waste Localization and Classification Oluwasanya Awe oawe@stanford.edu Robel Mengistu robel@stanford.edu December 15, 2017 Vikram Sreedhar vsreed@stanford.edu Abstract Given
More informationKaggle Data Science Bowl 2017 Technical Report
Kaggle Data Science Bowl 2017 Technical Report qfpxfd Team May 11, 2017 1 Team Members Table 1: Team members Name E-Mail University Jia Ding dingjia@pku.edu.cn Peking University, Beijing, China Aoxue Li
More informationSSD: Single Shot MultiBox Detector. Author: Wei Liu et al. Presenter: Siyu Jiang
SSD: Single Shot MultiBox Detector Author: Wei Liu et al. Presenter: Siyu Jiang Outline 1. Motivations 2. Contributions 3. Methodology 4. Experiments 5. Conclusions 6. Extensions Motivation Motivation
More informationCS 2770: Computer Vision. Edges and Segments. Prof. Adriana Kovashka University of Pittsburgh February 21, 2017
CS 2770: Computer Vision Edges and Segments Prof. Adriana Kovashka University of Pittsburgh February 21, 2017 Edges vs Segments Figure adapted from J. Hays Edges vs Segments Edges More low-level Don t
More informationAn Exploration of Computer Vision Techniques for Bird Species Classification
An Exploration of Computer Vision Techniques for Bird Species Classification Anne L. Alter, Karen M. Wang December 15, 2017 Abstract Bird classification, a fine-grained categorization task, is a complex
More informationEdge Detection. Today s reading. Cipolla & Gee on edge detection (available online) From Sandlot Science
Edge Detection From Sandlot Science Today s reading Cipolla & Gee on edge detection (available online) Project 1a assigned last Friday due this Friday Last time: Cross-correlation Let be the image, be
More informationMulti-View 3D Object Detection Network for Autonomous Driving
Multi-View 3D Object Detection Network for Autonomous Driving Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, Tian Xia CVPR 2017 (Spotlight) Presented By: Jason Ku Overview Motivation Dataset Network Architecture
More informationDetection of a Single Hand Shape in the Foreground of Still Images
CS229 Project Final Report Detection of a Single Hand Shape in the Foreground of Still Images Toan Tran (dtoan@stanford.edu) 1. Introduction This paper is about an image detection system that can detect
More informationVisual features detection based on deep neural network in autonomous driving tasks
430 Fomin I., Gromoshinskii D., Stepanov D. Visual features detection based on deep neural network in autonomous driving tasks Ivan Fomin, Dmitrii Gromoshinskii, Dmitry Stepanov Computer vision lab Russian
More informationMicroscopy Cell Counting with Fully Convolutional Regression Networks
Microscopy Cell Counting with Fully Convolutional Regression Networks Weidi Xie, J. Alison Noble, Andrew Zisserman Department of Engineering Science, University of Oxford,UK Abstract. This paper concerns
More informationMartian lava field, NASA, Wikipedia
Martian lava field, NASA, Wikipedia Old Man of the Mountain, Franconia, New Hampshire Pareidolia http://smrt.ccel.ca/203/2/6/pareidolia/ Reddit for more : ) https://www.reddit.com/r/pareidolia/top/ Pareidolia
More informationRyerson University CP8208. Soft Computing and Machine Intelligence. Naive Road-Detection using CNNS. Authors: Sarah Asiri - Domenic Curro
Ryerson University CP8208 Soft Computing and Machine Intelligence Naive Road-Detection using CNNS Authors: Sarah Asiri - Domenic Curro April 24 2016 Contents 1 Abstract 2 2 Introduction 2 3 Motivation
More informationDeeply Cascaded Networks
Deeply Cascaded Networks Eunbyung Park Department of Computer Science University of North Carolina at Chapel Hill eunbyung@cs.unc.edu 1 Introduction After the seminal work of Viola-Jones[15] fast object
More information