AudioSet: Real-world Audio Event Classification

Similar documents
CAP 6412 Advanced Computer Vision

Fuzzy Set Theory in Computer Vision: Example 3

arxiv: v1 [cs.sd] 1 Nov 2017

Know your data - many types of networks

Lecture 37: ConvNets (Cont d) and Training

Convolutional Neural Networks

Deep Face Recognition. Nathan Sun

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Spatial Localization and Detection. Lecture 8-1

R-FCN: OBJECT DETECTION VIA REGION-BASED FULLY CONVOLUTIONAL NETWORKS

FUSION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORKS WITH TWO FEATURES FOR ACOUSTIC SCENE CLASSIFICATION

Final Report: Smart Trash Net: Waste Localization and Classification

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014

YouTube-8M Video Classification

FaceNet. Florian Schroff, Dmitry Kalenichenko, James Philbin Google Inc. Presentation by Ignacio Aranguren and Rahul Rana

Project 3 Q&A. Jonathan Krause

SVM Segment Video Machine. Jiaming Song Yankai Zhang

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Defense Data Generation in Distributed Deep Learning System Se-Yoon Oh / ADD-IDAR

An Exploration of Computer Vision Techniques for Bird Species Classification

Exploiting noisy web data for largescale visual recognition

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Object Detection Based on Deep Learning

Object detection with CNNs

YOLO9000: Better, Faster, Stronger

Deep Learning for Object detection & localization

Fine-tuning Pre-trained Large Scaled ImageNet model on smaller dataset for Detection task

Yiqi Yan. May 10, 2017

Real Time Monitoring of CCTV Camera Images Using Object Detectors and Scene Classification for Retail and Surveillance Applications

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /10/2017

Inception Network Overview. David White CS793

Large Scale Data Analysis Using Deep Learning

Binary Convolutional Neural Network on RRAM

INTRODUCTION TO DEEP LEARNING

Content-Based Image Recovery

Internet of things that video

Real-time Object Detection CS 229 Course Project

Deep Model Compression

Object Detection on Self-Driving Cars in China. Lingyun Li

Online Open World Face Recognition From Video Streams

DeepFace: Closing the Gap to Human-Level Performance in Face Verification

Introduction to Deep Learning in Signal Processing & Communications with MATLAB

CNNS FROM THE BASICS TO RECENT ADVANCES. Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague

Using Machine Learning for Classification of Cancer Cells

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets)

Structured Prediction using Convolutional Neural Networks

Martian lava field, NASA, Wikipedia

Recognition of Animal Skin Texture Attributes in the Wild. Amey Dharwadker (aap2174) Kai Zhang (kz2213)

Rich feature hierarchies for accurate object detection and semantic segmentation

Convolutional Neural Networks: Applications and a short timeline. 7th Deep Learning Meetup Kornel Kis Vienna,

Automated Diagnosis of Vertebral Fractures using 2D and 3D Convolutional Networks

Optimizing Object Detection:

Supporting Information

Pouya Kousha Fall 2018 CSE 5194 Prof. DK Panda

Classifying Images with Visual/Textual Cues. By Steven Kappes and Yan Cao

Convolutional Layer Pooling Layer Fully Connected Layer Regularization

CS231N Section. Video Understanding 6/1/2018

Flow-Based Video Recognition

Research at Google: With a Case of YouTube-8M. Joonseok Lee, Google Research

Visual Concept Detection and Linked Open Data at the TIB AV- Portal. Felix Saurbier, Matthias Springstein Hamburg, November 6 SWIB 2017

Unsupervised Learning of Spatiotemporally Coherent Metrics

OBJECT DETECTION HYUNG IL KOO

Automatic Detection of Multiple Organs Using Convolutional Neural Networks

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri

UTS submission to Google YouTube-8M Challenge 2017

All You Want To Know About CNNs. Yukun Zhu

Computer Vision Lecture 16

HENet: A Highly Efficient Convolutional Neural. Networks Optimized for Accuracy, Speed and Storage

Introduction to Data Mining and Data Analytics

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Computer Vision Lecture 16

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

ImageCLEF 2011

Learning Deep Representations for Visual Recognition

Lab 9. Julia Janicki. Introduction

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Hide-and-Seek: Forcing a network to be Meticulous for Weakly-supervised Object and Action Localization

DL Tutorial. Xudong Cao

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn

Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Como funciona o Deep Learning

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Unsupervised Learning

Applying Supervised Learning

Deep Learning Based Real-time Object Recognition System with Image Web Crawler

Rich feature hierarchies for accurate object detection and semant

LabROSA Research Overview

Real-time convolutional networks for sonar image classification in low-power embedded systems

Caffe2C: A Framework for Easy Implementation of CNN-based Mobile Applications

arxiv: v1 [cs.mm] 12 Jan 2016

Kaggle Data Science Bowl 2017 Technical Report

Detecting Thoracic Diseases from Chest X-Ray Images Binit Topiwala, Mariam Alawadi, Hari Prasad { topbinit, malawadi, hprasad

Deep Learning with Tensorflow AlexNet

Learning Deep Features for Visual Recognition

DEEP LEARNING WITH GPUS Maxim Milakov, Senior HPC DevTech Engineer, NVIDIA

Transcription:

AudioSet: Real-world Audio Event Classification g.co/audioset Rif A. Saurous, Shawn Hershey, Dan Ellis, Aren Jansen and the Google Sound Understanding Team 2017-10-20

Outline The Early Years: Weakly-Supervised YouTube Videos AudioSet Is Born AudioSet: Supervised and Unsupervised

General Audio Event Classification Audio Event Classification Using ideas from: ImageNet Object Recognizers DNN Speech Recognition Abdel-Hamid et al 2013

Audio Event Detection: Applications Content-based Archive Search Surveillance/Event Detection Context Awareness

Web Video Classification - Summary A TON of Weakly Labeled YouTube Audio CNN architectures from computer vision community + = Awesome Audio Event Classification

Tasks of Interest Soundtrack Classification 3 minutes Bagpipes Scotland Song Audio Event Detection Speech Bagpipes Tap dancing

YouTube-100M DataSet Size ~100M videos with video-level labels (~5 million hours, 600 years) 20 billion input examples Labels 30K labels (not all obviously acoustically relevant) ~3 labels per videos It's BIG

Training 64 log mel bands Cartoon, Animated Cartoon Cartoon, Animated Cartoon Dance, Music, Myanmar, Song, Hip hop music 96 10 ms frames (960ms) Train frame level classifier (very weak labeling)

Evaluation Run frame-level classifier over each non-overlapping 960ms Aggregate over frames to evaluate video level scores Calculate map, AUC (DPrime)

Gut-feel Evaluation Look at ratings against a few favorite test cases Potential problem: Focus on a few minor classes?

Questions Architectures - How well do various CNN architectures perform? Training Size - How do we benefit from training set size? Useful embeddings - Can we learn generally useful audio embeddings from our large dataset. (Embeddings that can be used as features to predict labels not in the original training set).

Architectures - Fully Connected

Architectures - AlexNet [Alex Krizhevsky et al. 2012]

Architectures - VGG E [Karen Simonyan et al. 2015]

Architectures - Inception V3 [Christian Szegedy et al. 2015]

Architectures - ResNet [Kaiming He et al. 2015]

Architectures - Results 5M Minibatches

Training Size - Results Model Used: ResNet-50 (16M mini batches)

DPrime vs Prior Good Less Good Least Common Most Common

What s Next? Web video tags are not sound terms We want the soundtrack described We need a set of sound-description terms

The AudioSet Ontology Need a set of sound events 635 sound terms in 7 categories Start from Hearst patterns:.. sounds, such as X.. Refined via: Other sound event lists (Salamon 14, Burger 12,..) Feedback from raters Manual inspection... github.com/audioset/ontology

More About The AudioSet Ontology Ontology Class Set goals: Not too fine: Non-expert can discriminate consistently Not too few: Cover all normally-encountered sounds { "id": "/m/0160x5", "name": "Digestive", "description": "Sounds associated with the human function of eating and processing nutrition (food).", "citation_uri": "", "positive_examples": [], "child_ids": ["/m/03cczk", "/m/07pdhp0", "/m/0939n_", "/m/01g90h", "restrictions": ["abstract"] }, { "id": "/m/03cczk", "name": "Chewing, mastication", Evolution "description": "Food being crushed and ground by teeth.", "citation_uri": "http://en.wikipedia.org/wiki/mastication", Merges: Tire squeal + Skidding Deletions: Sidetone Additions: Ukulele, Stairs "positive_examples": ["youtu.be/ebnra85wsc4?start=530&end=540", "y "child_ids": [], "restrictions": [] },

AudioSet Labeled Data Verification task Using metadata, identify videos that may contain sound class X Check for other possible classes Extract random 10 s excerpt Ask labeler Present/Absent/Unsure

When Metadata Fails For obscure sounds, metadata fails to find a large number of good candidates Exemplar-based Mining (assumes a few 10s segments for class): 1. 2. 3. 4. Extract frame-level embeddings using YT-100M bottleneck layer Cluster the frame-level embeddings to find frames shared across segments Use those frames in multiquery-by-example search over a millions of YT videos Retrieval score is average distance to individual example frames from step 2 Present retrieved frames (padded to 10s) to labeler for verification Pro: recovers a more diverse collection of new positives and difficult negatives for rare classes Con: sampling biased by existing models (still not as diverse as desired)

AudioSet Data Release A large-scale collection of Labeled sound examples Like ImageNet for sounds 2M+ ten-second excerpts from high-viewcount YT videos (1000x smaller than YT-100M But strongly labeled) At least 120 examples for 500+ classes

Training Classifiers on AudioSet Strategy 1 Training Data 3M AudioSet 10-second segments 0.96 s audio Deep CNN (ResNet-50) or FC DNN Posterior probability of AudioSet classes

Training Classifiers on AudioSet Strategy 2 Training Data 0.96 s audio Training Data Weakly Labeled YouTube 100M Videos 3M AudioSet 10-sec segments YT100M ResNet-50 Embedding Model Fully Connected DNN Posterior probability of AudioSet classes

AudioSet Performance Training: 3M AudioSet 10s segments, 527 classes Evaluation: explicit hard negatives (average prior: 0.282) Feature Model EER map Log Mel Fully connected 33.3% 0.445 Log Mel ResNet-50 24.5% 0.605 YT100 Embeddings Fully connected 25.7% 0.580 Strategy 1 Strategy 2 Convolutional ResNet model huge improvement over fully connected Transfer of embedding from large-scale weakly labeled model does not help overall (but greatly reduces training data requirements in semi-supervised experiments)

Complication #1: Extreme Class Imbalance Problem: priors range from 0.0001 (e.g. toothbrush, gargling, creak) to 0.5 (e.g. music, speech, vehicle) Results in poor score calibration across classes High-prior classes like speech and music are always strongest detections 1990s libsvm solution: per-class loss weights that balance positive and negative examples Simply does not work at our level of imbalance and model complexity Shared network: one toothbrush example is not worth 5000 speech examples

Complication #1: Extreme Class Imbalance Solution that works: per-class loss weights that balance to mean prior (~0.003), not 0.5 We weight class C positive example loss contribution by Mean prior Hyperparameter in [0, 1] Class prior Negative examples not weighted. Exponent hyperparameter allows easy backing off from full balancing

Weighted Loss Performance Average accuracy of the NL highest scoring classes, where NL=number of ground truth labels for segment Slight improvement to map for beta = 0.1, 0.25 More than doubling of prediction accuracy with full weighting (beta = 1)

Complication #2: Weak Labels Problem: AudioSet segments are still weakly labeled: Positive label implies event occurs in 10-sec segment, but we do not know extent Solution: apply simple label refinement of training data: 1. Train model on original data 2. For training segment 3. Discard frames where with label L, compute max-normalized frame scores falls below some prescribed threshold The Catch: target class needs to be most prevalent sound in present in labeled segments (relative to prevalence in set as a whole)

Weak Label Refinement Demo Without refinement Speech activated throughout With refinement (0.66 threshold) Speech activated only when speaking or breathing (difficult confound)

Unsupervised Triplet Loss Embedding The Idea: train triplet loss embedding models using unsupervised (a.k.a. self-supervised) constraints: 1. 2. 3. 4. Noise corrupted audio retains the categorical content of the clean signal. Sound is transparent: mixing two sound classes results in an (often-natural sounding) example of both classes. Sound classes are translation invariant in time and, to some extent, frequency. Sounds in close proximity or in same source content are likely to be categorically similar When expressed as triplets, trivial to combine all constraints into single huge convolutional network

Semi-Supervised AudioSet Classifier 0.96 s audio Training Data Training Data 3M AudioSet 10-sec segments without labels 20 labeled AudioSet segments/class 0.5% of dataset Unsupervised ResNet-50 Triplet Embedding Fully Connected DNN Posterior probability of AudioSet classes

Semi-Supervised AudioSet Performance % Data Labeled Feature Model EER map 100% Log Mel Fully connected 33.3% 0.445 100% Log Mel ResNet-50 24.5% 0.605 0.5% Log Mel Fully connected 40.6% 0.338 0.5% Log Mel ResNet-50 37.7% 0.385 0.5% Unsup. Triplet Embedding Fully connected 34.0% 0.429

AudioSet Demo Video

Future Work Transient Events Sound Mixtures Other Data Sources Closed captions Sound Effects Databases Direct solicitation Other modalities & label sources