Object Geolocation from Crowdsourced Street Level Imagery


Vladimir A. Krylov and Rozenn Dahyot
ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
{vladimir.krylov,rozenn.dahyot}@tcd.ie

Abstract. We explore the applicability and limitations of a state-of-the-art object detection and geotagging system [4] applied to crowdsourced image data. Our experiments with imagery from the Mapillary crowdsourcing platform demonstrate that, as the number of images increases, the detection accuracy approaches that obtained with high-end street level data. Nevertheless, due to excessive camera position noise, the estimated geolocation (position) of the detected objects is less accurate on crowdsourced Mapillary imagery than on high-end street level imagery such as Google Street View.

Keywords: Crowdsourced street level imagery, object geolocation, traffic lights.

1 Introduction

In recent years the massive availability of street level imagery has triggered a growing interest in machine learning-based methods addressing a large variety of urban management, monitoring and detection problems that can be solved with this imaging modality [1, 2, 4, 5]. Of particular interest is the use of crowdsourced imagery, due to its free access and unrestricted terms of use. Furthermore, the Mapillary platform has recently run very successful campaigns collecting hundreds of thousands of new images crowdsourced by users as part of challenges in specific areas all over the world. On the other hand, the quality of crowdsourced data varies dramatically. This includes both imaging quality (camera properties, image resolution, blurring, restricted field of view, reduced visibility) and camera position noise. The latter is particularly disruptive for object geolocation, which relies on accurate camera positions for triangulation.
Importantly, crowdsourced street imagery typically comes with no information about the spatial bearing of the camera, nor about the effective field of view (i.e. the camera focal distance), so these quantities must be estimated from the image data. Expert street level imaging systems, such as Google Street View (GSV), ensure consistent data quality by using calibrated high-end imaging systems and by supplementing GPS trackers with inertial measurement units to obtain reliable camera position information, which is of critical importance in urban areas where buildings and interference limit the GPS signal. Here, we modify and validate the object detection and geotagging pipeline previously proposed in [4] to process crowdsourced street level imagery. The experiments are performed on Mapillary crowdsourced images in a case study of traffic light detection in central Dublin, Ireland.

This research was supported by the ADAPT Centre for Digital Content Technology, funded by the Science Foundation Ireland Research Centres Programme (Grant 13/RC/2106) and the European Regional Development Fund. This work was also supported by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 713567.

Fig. 1. Top: the original street level image processing pipeline proposed in [4] for object geolocation. Bottom: the modified pipeline, with yellow components inserted to process crowdsourced street level imagery.

2 Methodology

We rely on the general processing pipeline proposed in [4], with semantic segmentation and monocular depth estimation modules operating on street level images via custom-trained fully convolutional neural networks (Fig. 1). A modified Markov Random Field (MRF) model is used to fuse this information for object geolocation. The MRF is optimised on the space X of intersections of all view-rays (each ray going from a camera location towards an object position estimated via image segmentation). Each intersection location x_i carries a state z_i (0: discarded, 1: included in the final object detection map), and the MRF energy comprises several terms. The full energy of a configuration z in Z is defined as the sum of all energy contributions over all sites:

U(z) = Σ_{x_i ∈ X} [ c_d u_d(z_i) + c_c u_c(z_i) + c_b u_b(z_i) ] + Σ_{x_i, x_j on the same ray} c_m u_m(z_i, z_j),

with parameter vector C = (c_d, c_c, c_b, c_m), whose non-negative components are subject to c_d + c_c + c_b + c_m = 1. The unary term u_d(z_i) promotes consistency with monocular depth estimates, and the pairwise term u_m(z_i, z_j) penalizes occlusions; both are defined as in [4]. To address the specific challenges of crowdsourced imagery, the other two terms are modified compared to [3, 4].

First, a second unary term is introduced to penalize intersections lying in close proximity to other intersections (inside clusters):

u_c(z_i | X, Z) = z_i [ Σ_{j ≠ i} I(|x_i − x_j| < C) ] / C,

where I is the indicator function. In practice, the fewer intersections are found within C meters of the current location x_i, the more its inclusion in the final configuration is encouraged, whereas inside intersection clusters the inclusion of a site is penalized more strongly, to discourage overestimation from multiple viewings. This term is a modification of the high-order energy term proposed in [4], and has the advantage of allowing more stable minimization procedures for the total energy.

Second, crowdsourced imagery is collected predominantly from dashboard cameras with a fixed orientation and a limited field of view (60-90 degrees). Hence, a unary bearing-based term is added to penalize intersections defined by rays with a small intersection angle, because these are particularly sensitive to camera position noise. Such intersections typically occur when an object is recognized several times in images from the same camera with a fixed angle of view: with a dashboard camera, the viewing bearing changes little as the vehicle approaches the object. When several image sequences cover the same area, this term favours mixed intersections between object instances detected in images from different sequences.
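As an illustration, the cluster penalty above can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions: candidate intersections are given as planar coordinates in meters, and the function name and the normalization by C follow our reading of the term, not code from the paper.

```python
from math import hypot

def cluster_penalty(points, C=5.0):
    """Sketch of the cluster term: for each candidate intersection x_i,
    count the other intersections within C meters and normalize by C.
    A site inside a dense cluster receives a larger penalty when included."""
    penalties = []
    for i, (xi, yi) in enumerate(points):
        neighbours = sum(
            1 for j, (xj, yj) in enumerate(points)
            if j != i and hypot(xi - xj, yi - yj) < C
        )
        penalties.append(neighbours / C)
    return penalties

# Two candidates 2 m apart form a cluster; the third is isolated.
print(cluster_penalty([(0.0, 0.0), (2.0, 0.0), (100.0, 0.0)]))  # [0.2, 0.2, 0.0]
```

Since the penalty depends only on the fixed intersection geometry, these values can be precomputed once and simply gated by z_i during energy minimization.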
The term is defined as:

u_b(z_i | X, Z) = z_i (1 − α(R_i1, R_i2)/90), with x_i = R_i1 ∩ R_i2,

where α(R_i1, R_i2) is the smaller angle, in degrees, between the rays R_i1 and R_i2 intersecting at x_i. The optimal configuration is reached at the global minimum of U(z). Energy minimization is achieved with Iterated Conditional Modes (ICM) starting from an empty configuration, z_i^0 = 0 for all i; see [4].

3 Experimental study and conclusions

We demonstrate experiments on Mapillary crowdsourced image data. We study a central Dublin, Ireland, area of about 0.75 km^2 and employ the 2017 traffic lights dataset [3] as ground truth. Altogether, 2659 crowdsourced images are available, collected between June 2014 and May 2018. We first remove strongly blurred images, identified by weak edges (low variance of the response to a Laplacian filter), which leaves 2521 images. We then resort to a Structure from Motion (SfM) approach, OpenSfM (available at https://github.com/mapillary/opensfm), developed by Mapillary, to adjust camera positions and recover estimates of image bearing and field of view for the cameras. This results in 2047 images post-SfM, with the rest discarded due to failure to establish image matches using ORB/SIFT image features. The image resolutions are 960x720 (12%), 2048x1152 (34%), and 2048x1536 (54%), collected from cameras with estimated fields of view ranging from 58 to 65 degrees.

Fig. 2. Examples of successful and failed traffic light segmentations on Mapillary data.

Object detection is performed at the native resolution by cropping square subimages. Pixel-level segmentations are aggregated into 1180 individual detections, of which 780 have a mean CNN confidence score above 0.55 after the softmax layer; see examples in Fig. 2. In this study, contrary to [4], we adopt a threshold based on the CNN confidence, due to the variation in detection quality across camera settings and imaging conditions. In the reported experiments, the energy term weights are set to c_d = c_m = 0.15, c_b = 0.3, c_c = 0.4, with C = 5 meters in the u_c energy term.

To compare the performance of the proposed method we also report the results of traffic light detection on GSV 2017 imagery (totaling 1291 panoramas) in the same area. The object recall on the Mapillary (GSV) dataset reaches 9.8% (51%) at a 2 m threshold (i.e. a ground truth object is located within this distance from an estimate), 27% (75%) at 5 m, and 65% (91%) at 10 m. As can be seen in Fig. 3, the coverage of the considered area is not complete, and several traffic light clusters are covered by few or no Mapillary images. This caps the possible recall at about 94% on the given dataset.

Fig. 3. Left: Dublin TL dataset in the 0.75 km^2 area inside the green polygon, and Mapillary image locations. Center: detections on Mapillary and on GSV imagery. Right: precision plots as a function of distance between estimates and ground truth.

The precision is plotted for increasing object detection radii in Fig. 3 (right) for the complete Mapillary dataset (all 2521 images) and for smaller subsets, to highlight the improvement associated with increased image volume. The subsets are obtained by restricting the years during which the Mapillary imagery was collected: 950 images taken in or after 2017, and 1664 in or after 2016, out of the 2521 images inside the area. It can be seen that the introduction of the bearing penalty u_b improves the detection, and that precision grows with larger image volumes. Our preliminary conclusion is that, in high volume, crowdsourced data can potentially allow similar detection performance, at a potential loss of geolocation estimation accuracy. Future work focuses on the analysis of multiple sources of data (e.g. mixed GSV + Mapillary and Twitter data, as well as fusion with different imaging modalities, like satellite and LiDAR imagery) and scenarios, to establish the benefits of using mixed imagery for object detection and position adjustment with weighted SfM methods.

References

1. Bulbul, A., Dahyot, R.: Social media based 3D visual popularity. Computers & Graphics 63, 28-36 (2017)
2. Hara, K., Le, V., Froehlich, J.: Combining crowdsourcing and Google Street View to identify street-level accessibility problems. In: Proc. SIGCHI Conf. Human Factors in Computing Systems, pp. 631-640. ACM (2013)
3. Krylov, V.A., Dahyot, R.: Object geolocation using MRF-based multi-sensor fusion. In: Proc. IEEE Int. Conf. Image Processing (2018)
4. Krylov, V.A., Kenny, E., Dahyot, R.: Automatic discovery and geotagging of objects from street view imagery. Remote Sensing 10(5) (2018)
5. Wegner, J.D., Branson, S., Hall, D., Schindler, K., Perona, P.: Cataloging public objects using aerial and street-level images - urban trees. In: Proc. IEEE Conf. on CVPR, pp. 6014-6023 (2016)
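Returning to the preprocessing of Section 3: the blur filter there (discarding frames whose Laplacian response has low variance) can be sketched as follows. This is a minimal pure-Python illustration on a grayscale intensity grid; the function name, the 4-neighbour Laplacian kernel and the toy inputs are our own choices, and any rejection threshold would have to be tuned on real imagery.

```python
def laplacian_variance(img):
    """Variance of the 4-neighbour Laplacian response over the interior of a
    grayscale image (a list of equal-length rows of intensities). Blurred
    frames have weak edges and therefore a low variance."""
    h, w = len(img), len(img[0])
    resp = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # discrete Laplacian: sum of the 4 neighbours minus 4x the centre
            resp.append(img[y - 1][x] + img[y + 1][x]
                        + img[y][x - 1] + img[y][x + 1] - 4 * img[y][x])
    mean = sum(resp) / len(resp)
    return sum((r - mean) ** 2 for r in resp) / len(resp)

# A flat (fully blurred) patch has zero variance; a checkerboard has strong edges.
flat = [[128] * 6 for _ in range(6)]
checker = [[255 * ((x + y) % 2) for x in range(6)] for y in range(6)]
assert laplacian_variance(flat) == 0.0
assert laplacian_variance(checker) > laplacian_variance(flat)
```

In practice this score would be computed on full frames with an optimized convolution (e.g. an image-processing library), keeping only images whose score exceeds a dataset-specific threshold.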