LabROSA Research Overview

Similar documents
Mining Large-Scale Music Data Sets

Minimal-Impact Personal Audio Archives

Audio & Music Research at LabROSA

The Intervalgram: An Audio Feature for Large-scale Cover-song Recognition

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri

Multimedia Indexing. Lecture 12: EE E6820: Speech & Audio Processing & Recognition. Spoken document retrieval Audio databases.

Lecture 12: Multimedia Indexing. Spoken Document Retrieval (SDR)

Hands On: Multimedia Methods for Large Scale Video Analysis (Project Meeting) Dr. Gerald Friedland,

MUSIC/VOICE SEPARATION USING THE 2D FOURIER TRANSFORM. Prem Seetharaman, Fatemeh Pishdadian, Bryan Pardo

Short-Term Audio-Visual Atoms for Generic Video Concept Classification

Multimedia Event Detection for Large Scale Video. Benjamin Elizalde

Advanced techniques for management of personal digital music libraries

Comparing MFCC and MPEG-7 Audio Features for Feature Extraction, Maximum Likelihood HMM and Entropic Prior HMM for Sports Audio Classification

LARGE-SCALE COVER SONG RECOGNITION USING THE 2D FOURIER TRANSFORM MAGNITUDE

Multiple Kernel Learning for Emotion Recognition in the Wild

Detection of goal event in soccer videos

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2

Principles of Audio Coding

EVALUATING MUSIC SEQUENCE MODELS THROUGH MISSING DATA

Sparse Models in Image Understanding And Computer Vision

Robustness and independence of voice timbre features under live performance acoustic degradations

Online music recognition: the Echoprint system

CCRMA MIR Workshop 2014 Evaluating Information Retrieval Systems. Leigh M. Smith Humtap Inc.

Available online Journal of Scientific and Engineering Research, 2016, 3(4): Research Article

Hello, I am from the State University of Library Studies and Information Technologies, Bulgaria

TWO-STEP SEMI-SUPERVISED APPROACH FOR MUSIC STRUCTURAL CLASSIFICATION. Prateek Verma, Yang-Kai Lin, Li-Fan Yu. Stanford University

EE482: Digital Signal Processing Applications

AudioSet: Real-world Audio Event Classification

INTERACTIVE REFINEMENT OF SUPERVISED AND SEMI-SUPERVISED SOUND SOURCE SEPARATION ESTIMATES

Audio-coding standards

Separating Speech From Noise Challenge

ALIGNED HIERARCHIES: A MULTI-SCALE STRUCTURE-BASED REPRESENTATION FOR MUSIC-BASED DATA STREAMS

DUPLICATE DETECTION AND AUDIO THUMBNAILS WITH AUDIO FINGERPRINTING

METRIC LEARNING BASED DATA AUGMENTATION FOR ENVIRONMENTAL SOUND CLASSIFICATION

The Sensitivity Matrix

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014

SCREAM AND GUNSHOT DETECTION IN NOISY ENVIRONMENTS

Multimedia Database Systems. Retrieval by Content

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition

Predicting Song Popularity

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

Columbia University High-Level Feature Detection: Parts-based Concept Detectors

Display. 2-Line VA LCD Display. 2-Zone Variable-Color Illumination

MASK: Robust Local Features for Audio Fingerprinting

Discriminative training and Feature combination

IN the recent decades digital music has become more. Codebook based Audio Feature Representation for Music Information Retrieval

Sparse coding for image classification

Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery

TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING 1. Yonatan Vaizman, Brian McFee, and Gert Lanckriet

Codebook-Based Audio Feature Representation for Music Information Retrieval

Robust and Efficient Multiple Alignment of Unsynchronized Meeting Recordings

High-level Event Recognition in Internet Videos

MPEG-1 Bitstreams Processing for Audio Content Analysis

Using Noise Substitution for Backwards-Compatible Audio Codec Improvement

Robotics Programming Laboratory

Analyzing Vocal Patterns to Determine Emotion Maisy Wieman, Andy Sun

THE IMPORTANCE OF F0 TRACKING IN QUERY-BY-SINGING-HUMMING

Previous Lecture - Coded aperture photography

An efficient face recognition algorithm based on multi-kernel regularization learning

Detecting Burnscar from Hyperspectral Imagery via Sparse Representation with Low-Rank Interference

Search Engines. Information Retrieval in Practice

Audio-coding standards

Consumer Video Understanding

Accelerating Multimodal Sequence Retrieval with Convolutional Networks

Mpeg 1 layer 3 (mp3) general overview

The Automatic Musicologist

Speaker Verification with Adaptive Spectral Subband Centroids

Seamless transfer of ambient media from environment to mobile device

Lecture 7: Audio Compression & Coding

Content-based Video Genre Classification Using Multiple Cues

The Stanford/Technicolor/Fraunhofer HHI Video Semantic Indexing System

Second author Retain these fake authors in submission to preserve the formatting

Computer Vesion Based Music Information Retrieval

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

BIRD-PHRASE SEGMENTATION AND VERIFICATION: A NOISE-ROBUST TEMPLATE-BASED APPROACH

KD-X240BT. Digital Media Receiver

Video search requires efficient annotation of video content To some extent this can be done automatically

IDMT Transcription API Documentation

RECOMMENDATION ITU-R BS Procedure for the performance test of automated query-by-humming systems

CHAPTER 3. Preprocessing and Feature Extraction. Techniques

Music Genre Classification

Online PLCA for Real-time Semi-supervised Source Separation

Lecture Video Indexing and Retrieval Using Topic Keywords

AUTOMATIC CONSUMER VIDEO SUMMARIZATION BY AUDIO AND VISUAL ANALYSIS

Image Matching. AKA: Image registration, the correspondence problem, Tracking,

Audio-Based Action Scene Classification Using HMM-SVM Algorithm

Chapter 14 MPEG Audio Compression

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION

Robust Shape Retrieval Using Maximum Likelihood Theory

Repeating Segment Detection in Songs using Audio Fingerprint Matching

Two-Layered Audio-Visual Speech Recognition for Robots in Noisy Environments

Novel Subband Autoencoder Features for Non-intrusive Quality Assessment of Noise Suppressed Speech

Video annotation based on adaptive annular spatial partition scheme

CHAPTER 8 Multimedia Information Retrieval

Separation Of Speech From Noise Challenge

TRAX SP User Guide. Direct any questions or issues you may encounter with the use or installation of ADX TRAX SP to:

Voice Command Based Computer Application Control Using MFCC

Auditory Sparse Coding

MuBu for Max/MSP. IMTR IRCAM Centre Pompidou. Norbert Schnell Riccardo Borghesi 20/10/2010

Transcription:

LabROSA Research Overview Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu 1. Music 2. Environmental sound 3. Speech Enhancement http://labrosa.ee.columbia.edu/ Laboratory for the Recognition and Organization of Speech and Audio COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK LabROSA Research Overview - Dan Ellis 2014-06-12-1 /20

LabROSA Getting information from sound Information Extraction Music Machine Learning Recognition Separation Retrieval Speech Environment Signal Processing LabROSA Research Overview - Dan Ellis 2014-06-12-2 /20

1. Music Audio Analysis Trained classifiers for low-level information notes, chords, beats, section boundaries E.g. Polyphonic transcription feature agnostic needs training data Poliner & Ellis 06 LabROSA Research Overview - Dan Ellis 2014-06-12-3 /20

Million Song Dataset Industrial-scale database for music information research Many facets: Echo Nest audio features + metadata Echo Nest taste profile user-song-listen count Second Hand Song covers musixmatch lyric BoW last.fm tags Now with audio? resolving artist / album / track / duration against what.cd Bertin-Mahieux McFee LabROSA Research Overview - Dan Ellis 2014-06-12-4 /20

MIDI-to-MSD Raffel Aligned MIDI to Audio is a nice transcription Can we find matches in large databases? LabROSA Research Overview - Dan Ellis 2014-06-12-5 /20

Singing ASR McVicar Speech recognition adapted to singing needs aligned data Extensive work to line up scraped acapellas and full mix including jumps LabROSA Research Overview - Dan Ellis 2014-06-12-6 /20

4, one can hear some high frequency isolated coefficients superimposed to the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as Papadopoulos proposed in [20]. Ground versus estimated voice activity location. ImRPCAtruth separates vocals and background perfect voice location still allows an improvebasedactivity on low rankinformation optimization ment, although to a lesser extent than with ground-truth voice acparameter single trade-off tivity information. The decrease in the results mainly comes from basedclassified on higher-level features? adjust background segments as vocalmusical segments. Block Structure RPCA e. Fig. 4. Separated for various LabROSA Research Overview voice - Dan Ellis values of λ for the Pink Noise Party song - 7 /20 2014-06-12

Ordinal LDA Segmentation McFee Low-rank decomposition of skewed self-similarity to identify repeats Learned weighting of multiple factors to segment Linear Discriminant Analysis between adjacent segments LabROSA Research Overview - Dan Ellis 2014-06-12-8 /20

2. Environmental Sound Extracting useful information from soundtracks e.g. TRECVID Multimedia Event Detection (MED) Making a Sandwich, Getting a Vehicle Unstuck 100 examples, find matches in 100k videos manual annotations for ~10 h E009 Getting a Vehicle Unstuck LabROSA Research Overview - Dan Ellis 2014-06-12-9 /20

Foreground Event Recognition Cotton, Ellis, Loui 11 Transients = foreground events? Onset detector finds energy bursts best SNR PCA basis to represent each 300 ms x auditory freq bag of transients LabROSA Research Overview - Dan Ellis 2014-06-12-10/20

NMF Transient Features Decompose spectrograms into templates + activation X = W H well-behaved gradient descent 2D patches sparsity control computation time Basis 1 (L2) Basis 2 (L1) Basis 3 (L1) freq / Hz 5478 3474 2203 1398 Original mixture Smaragdis & Brown 03 Abdallah & Plumbley 04 Virtanen 07 Cotton & Ellis 11 LabROSA Research Overview - Dan Ellis 2014-06-12-11/20 883 442 0 1 2 3 4 5 6 7 8 9 10 time / s

Background Retrieval Classify soundtracks by statistics of ambience E.g. Texture features Sound Automatic gain control Subband distributions Envelope cross-corrs mel x filterbank x (18 chans) x freq / Hz 2404 1273 617 mel band 15 10 5 Envelope correlation Cross-band correlations (318 samples) mean, var, skew, kurt (18 x 4) Modulation energy (18 x 6) LabROSA Research Overview - Dan Ellis 2014-06-12-12/20 x x x 1159_10 urban cheer clap Texture features FFT Histogram 0 2 4 6 8 10 McDermott et al. 09 Ellis, Zheng, McDermott 11 Octave bins 0.5,1,2,4,8,16 Hz 1062_60 quiet dubbed speech music 0 2 4 6 8 10 time / s M V S K 0.5 2 8 32 5 10 15 M V S K 0.5 2 8 32 5 10 15 moments mod frq / Hz mel band moments mod frq / Hz mel band 1 0 level

Auditory Model Features Lyon et al. 2010 Lee & Ellis 2012 Cotton & Ellis 2013 Subband Autocorrelation PCA Simplified version of autocorrelogram 10x faster than Lyon original Capture fine time structure in multiple bands information lost in MFCCs delay line short-time autocorrelation Subband VQ Sound Cochlea filterbank frequency channels Subband VQ Subband VQ Subband VQ Histogram Feature Vector freq lag time Correlogram slice LabROSA Research Overview - Dan Ellis 2014-06-12-13/20

Subband Autocorrelation delay line short-time autocorrelation Autocorrelation stabilizes fine time structure Sound Cochlea filterbank frequency channels freq lag Correlogram time slice Subband VQ Subband VQ Subband VQ Subband VQ Histogram Feature Vector 25 ms window, lags up to 25 ms calculated every 10 ms normalized to max (zero lag) LabROSA Research Overview - Dan Ellis 2014-06-12-14/20

Retrieval Examples High precision for in-domain top hits LabROSA Research Overview - Dan Ellis 2014-06-12-15/20

3. Speech Enhancement Noisy speech scenarios Ambient recording (background noise) Communication channel (processing distortion) CAR KIT - BP 101 43086 20111025 160708 in 100 HOME LAND - BP 101 84943 20111020 144955 in db 50 freq / Hz level / db 4000 2000 100 0 0 1000 2000 3000 50 0 freq / Hz 1500 Hz chan 87 90 93 96 99 time / s 0 1000 2000 3000 freq / Hz 67 70 73 76 79 time / s 100 50 0 level / db LabROSA Research Overview - Dan Ellis 2014-06-12-16/20

RPCA Enhancement Chen, McFee & Ellis 14 Decompose spectrogram into sparse + low-rank Sparse activation H of dictionary W min H,L,S H khk 1 + L klk + S ksk 1 + I + (H) s.t. Y = WH + L + S ASR benefits: C S D I Orig 6.8 10.6 82.6 0.7 RPCA 10.8 36.5 52.7 0.5 wie+rpca 10.4 40.1 49.5 2.1 LabROSA Research Overview - Dan Ellis 2014-06-12-17/20

Classification Pitch Tracker SAcC: MLP trained on noisy speech with ground-truth pitch track targets Large benefits for in-domain noisy speech PTE (%) 100 10 FDA RBF and pink noise YIN Wu get_f0 10 SAcC 10 0 10 20 30 SNR (db) Lee & Ellis 12 LabROSA Research Overview - Dan Ellis 2014-06-12-18/20

Pitch-Normalized Enhancement Use noise-robust pitch tracker for enhancement? 1000 Clean signal Normalize voice pitch Fixed-pitch enhancement Reimpose pitch Frequency 500 0 0 1000 0.5 1 1.5 2 Noisy signal 2.5 3 pitch 500 smoothed pvx 0 0 0.5 1 1.5 2 2.5 3 resampled to pitch = 100 Hz 1000 500 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Filtered pvsmooth 1000 500 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Resampled back to original pitch 1000 500 0 0 0.5 1 1.5 2 2.5 3 LabROSA Research Overview - Dan Ellis 2014-06-12-19/20 Time

Summary Music transcription, segmentation, alignment for ground truth Soundtracks foreground events, background ambience Noisy Speech classification pitch tracking spectrogram enhancement LabROSA Research Overview - Dan Ellis 2014-06-12-20/20