LabROSA Research Overview


1 LabROSA Research Overview. Dan Ellis, Laboratory for Recognition and Organization of Speech and Audio, Dept. Electrical Eng., Columbia Univ., NY USA. Outline: 1. Music; 2. Environmental sound; 3. Speech Enhancement. LabROSA Research Overview - Dan Ellis /20

2 LabROSA: Getting information from sound. [Diagram: information extraction (recognition, separation, retrieval) from music, speech, and environment, drawing on machine learning and signal processing.]

3 1. Music Audio Analysis. Trained classifiers for low-level information: notes, chords, beats, section boundaries. E.g. polyphonic transcription: feature-agnostic, but needs training data (Poliner & Ellis 06).

4 Million Song Dataset (Bertin-Mahieux, McFee). Industrial-scale database for music information research. Many facets: Echo Nest audio features + metadata; Echo Nest taste profile (user-song-listen count); Second Hand Song covers; musixmatch lyric BoW; last.fm tags. Now with audio? Resolving artist / album / track / duration against what.cd.

5 MIDI-to-MSD (Raffel). MIDI aligned to audio is a nice transcription. Can we find matches in large databases?

6 Singing ASR (McVicar). Speech recognition adapted to singing needs aligned data. Extensive work to line up scraped a cappellas and full mixes, including jumps.

7 Block Structure RPCA (Papadopoulos). RPCA separates vocals and background based on low-rank optimization, with a single trade-off parameter. Adjust based on higher-level musical features? [Excerpt from the accompanying paper shown on the slide: "... one can hear some high frequency isolated coefficients superimposed to the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as proposed in [20]. Ground truth versus estimated voice activity location: imperfect voice activity location information still allows an improvement, although to a lesser extent than with ground-truth voice activity information. The decrease in the results mainly comes from background segments classified as vocal segments." Fig. 4: Separated voice for various values of λ for the Pink Noise Party song.]

8 Ordinal LDA Segmentation (McFee). Low-rank decomposition of skewed self-similarity to identify repeats. Learned weighting of multiple factors to segment. Linear Discriminant Analysis between adjacent segments.


9 2. Environmental Sound. Extracting useful information from soundtracks, e.g. TRECVID Multimedia Event Detection (MED): "Making a Sandwich", "Getting a Vehicle Unstuck". 100 examples; find matches in 100k videos; manual annotations for ~10 h. [Figure: E009 Getting a Vehicle Unstuck.]

10 Foreground Event Recognition (Cotton, Ellis, Loui 11). Transients = foreground events? Onset detector finds energy bursts (best SNR). PCA basis to represent each (300 ms x auditory freq). Bag of transients.

11 NMF Transient Features (Smaragdis & Brown 03; Abdallah & Plumbley 04; Virtanen 07; Cotton & Ellis 11). Decompose spectrograms into templates + activation: X = W H. Well-behaved gradient descent; 2D patches; sparsity control; computation time. [Figure: original mixture spectrogram (freq / Hz vs. time / s) with Basis 1 (L2), Basis 2 (L1), Basis 3 (L1).]
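As a rough illustration of the X = W H decomposition on this slide, the sketch below factors a nonnegative matrix with the standard multiplicative updates for squared error. It is a minimal example under stated assumptions, not the method of the cited papers: the function name `nmf`, the rank, the iteration count, and the toy data are all hypothetical, and the slide's 2D patches and sparsity control are not included.

```python
import numpy as np

def nmf(X, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor nonnegative X (freq x time) as W @ H.

    W holds spectral templates (freq x rank), H holds their
    activations over time (rank x time). Uses the standard
    multiplicative updates for the squared-error objective.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_time = X.shape
    W = rng.random((n_freq, rank)) + eps
    H = rng.random((rank, n_time)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors nonnegative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "spectrogram": an exact rank-3 nonnegative matrix (illustrative data only).
rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 50))
W, H = nmf(X, rank=3)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

On real audio, X would be a magnitude spectrogram, and the columns of W could then serve as candidate transient templates.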


12 Background Retrieval (McDermott et al. 09; Ellis, Zheng, McDermott 11). Classify soundtracks by statistics of ambience, e.g. texture features. Processing: sound, automatic gain control, mel filterbank (18 chans), subband envelopes. Features: subband distribution moments: mean, var, skew, kurt (18 x 4); modulation energy via FFT in octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 x 6); cross-band envelope correlations (318 samples). [Figure: texture features for two example clips, 1159_10 (urban cheer clap) and 1062_60 (quiet dubbed speech music); panels show mel band vs. time / s, moments (M V S K) per mel band, modulation frequency / Hz per mel band, and cross-band correlations, with a level colour scale.]
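The moment and modulation-energy features listed on this slide can be sketched as follows. This is an illustrative sketch only: it assumes the subband envelopes are already available as an array of shape (bands, frames), and the function names, the frame rate, and the octave-bin edges (centre divided and multiplied by the square root of 2) are assumptions, not details taken from the cited papers.

```python
import numpy as np

def envelope_moments(env):
    """env: (bands, frames) -> (bands, 4) of mean, variance, skewness, kurtosis."""
    mean = env.mean(axis=1)
    centered = env - mean[:, None]
    var = (centered ** 2).mean(axis=1)
    std = np.sqrt(var) + 1e-12  # guard against constant envelopes
    skew = (centered ** 3).mean(axis=1) / std ** 3
    kurt = (centered ** 4).mean(axis=1) / std ** 4
    return np.stack([mean, var, skew, kurt], axis=1)

def modulation_energy(env, frame_rate, centers=(0.5, 1, 2, 4, 8, 16)):
    """Energy of each band's envelope spectrum in octave-wide bins (bands x 6)."""
    centered = env - env.mean(axis=1, keepdims=True)
    spec = np.abs(np.fft.rfft(centered, axis=1)) ** 2
    freqs = np.fft.rfftfreq(env.shape[1], d=1.0 / frame_rate)
    out = []
    for fc in centers:
        # Assumed bin edges: half an octave either side of the centre.
        mask = (freqs >= fc / np.sqrt(2)) & (freqs < fc * np.sqrt(2))
        out.append(spec[:, mask].sum(axis=1))
    return np.stack(out, axis=1)

# Toy input: 18 bands, 10 s at 100 frames/s, each envelope modulated at 4 Hz.
t = np.arange(1000) / 100.0
env = np.tile(1.0 + 0.5 * np.sin(2 * np.pi * 4 * t), (18, 1))
moments = envelope_moments(env)          # shape (18, 4)
mod = modulation_energy(env, 100.0)      # shape (18, 6)
```

Concatenating the two arrays per clip gives a fixed-length descriptor regardless of clip duration, which is what makes such statistics convenient for a classifier.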

13 Auditory Model Features (Lyon et al; Lee & Ellis 2012; Cotton & Ellis 2013). Subband Autocorrelation PCA: simplified version of autocorrelogram, 10x faster than Lyon original. Captures fine time structure in multiple bands: information lost in MFCCs. [Figure: sound, cochlea filterbank (frequency channels), delay line with short-time autocorrelation, correlogram slice (freq x lag, over time), subband VQ per channel group, histogram, feature vector.]


14 Subband Autocorrelation. Autocorrelation stabilizes fine time structure. Parameters: 25 ms window, lags up to 25 ms, calculated every 10 ms, normalized to max (zero lag). [Figure: sound, cochlea filterbank (frequency channels), delay line with short-time autocorrelation, correlogram slice (freq x lag, over time), subband VQ per channel group, histogram, feature vector.]
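The parameters on this slide (25 ms window, lags up to 25 ms, one frame every 10 ms, normalized to the zero-lag value) can be sketched for a single subband signal as below. This is a simplified illustration, not the LabROSA implementation: the function name and the toy sinusoid are assumptions, and the cochlea filterbank, PCA, and VQ stages are omitted.

```python
import numpy as np

def short_time_autocorr(x, sr, win_s=0.025, max_lag_s=0.025, hop_s=0.010):
    """Short-time autocorrelation of one subband signal.

    Returns an array of shape (frames, lags + 1); each row is
    normalized so that the zero-lag value is 1.
    """
    win = int(round(win_s * sr))
    max_lag = int(round(max_lag_s * sr))
    hop = int(round(hop_s * sr))
    n_frames = 1 + (len(x) - win - max_lag) // hop
    ac = np.zeros((n_frames, max_lag + 1))
    for i in range(n_frames):
        start = i * hop
        seg = x[start:start + win]
        for lag in range(max_lag + 1):
            # Correlate the window with a copy of itself delayed by `lag` samples.
            ac[i, lag] = np.dot(seg, x[start + lag:start + lag + win])
        ac[i] /= ac[i, 0] + 1e-12  # normalize to the zero-lag maximum
    return ac

# Toy subband: a 200 Hz sinusoid at 8 kHz (period = 40 samples).
sr = 8000
x = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)
ac = short_time_autocorr(x, sr)
```

For a periodic input the rows peak at lags equal to multiples of the period (here lag 40), which is the fine time structure the slide says the autocorrelation stabilizes.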

15 Retrieval Examples. High precision for in-domain top hits.


16 3. Speech Enhancement. Noisy speech scenarios: ambient recording (background noise); communication channel (processing distortion). [Figure: spectrograms (freq / Hz vs. time / s, level / dB) for the examples "CAR KIT - BP" and "HOME LAND - BP", with a level trace for the 1500 Hz channel.]


17 RPCA Enhancement (Chen, McFee & Ellis 14). Decompose spectrogram into sparse + low-rank, with sparse activation H of dictionary W: $\min_{H,L,S} \lambda_H \|H\|_1 + \lambda_L \|L\|_* + \lambda_S \|S\|_1 + I_+(H)$ s.t. $Y = WH + L + S$. ASR benefits: [Table: columns C, S, D, I; rows Orig, RPCA, wie+rpca; values lost in extraction.]
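Objectives of this sparse + low-rank form are commonly solved with iterative schemes built from two proximal operators. The sketch below shows those two generic building blocks, assuming the sparse terms are ℓ1 norms and the low-rank term is a nuclear norm. It is not the solver used by Chen, McFee & Ellis, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1: shrink every entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(M, tau):
    """Proximal operator of tau * ||M||_* (singular value thresholding)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    # Shrinking the singular values lowers the rank of the result.
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Small checks on hand-made inputs (illustrative values only).
shrunk = soft_threshold(np.array([-3.0, -0.5, 0.5, 3.0]), 1.0)
low_rank = svt(np.diag([3.0, 1.0]), 2.0)
```

Soft thresholding zeroes small entries, which promotes sparsity in H and S, while thresholding singular values pushes L toward low rank.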

18 Classification Pitch Tracker (Lee & Ellis 12). SAcC: MLP trained on noisy speech with ground-truth pitch track targets. Large benefits for in-domain noisy speech. [Figure: PTE (%) vs. SNR (dB) on FDA with RBF and pink noise, comparing YIN, Wu, get_f0, and SAcC.]


19 Pitch-Normalized Enhancement. Use noise-robust pitch tracker for enhancement? Steps: normalize voice pitch; fixed-pitch enhancement; reimpose pitch. [Figure: spectrograms (frequency vs. time) of the clean signal, noisy signal, pitch, smoothed pvx, signal resampled to pitch = 100 Hz, filtered pvsmooth, and signal resampled back to original pitch.]

20 Summary. Music: transcription, segmentation, alignment for ground truth. Soundtracks: foreground events, background ambience. Noisy speech: classification pitch tracking, spectrogram enhancement.
