LabROSA Research Overview Dan Ellis Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu 1. Music 2. Environmental sound 3. Speech Enhancement http://labrosa.ee.columbia.edu/ Laboratory for the Recognition and Organization of Speech and Audio COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK LabROSA Research Overview - Dan Ellis 2014-06-12-1 /20
LabROSA Getting information from sound Information Extraction Music Machine Learning Recognition Separation Retrieval Speech Environment Signal Processing LabROSA Research Overview - Dan Ellis 2014-06-12-2 /20
1. Music Audio Analysis Trained classifiers for low-level information notes, chords, beats, section boundaries E.g. Polyphonic transcription feature agnostic needs training data Poliner & Ellis 06 LabROSA Research Overview - Dan Ellis 2014-06-12-3 /20
Million Song Dataset Industrial-scale database for music information research Many facets: Echo Nest audio features + metadata Echo Nest taste profile user-song-listen count Second Hand Song covers musixmatch lyric BoW last.fm tags Now with audio? resolving artist / album / track / duration against what.cd Bertin-Mahieux McFee LabROSA Research Overview - Dan Ellis 2014-06-12-4 /20
MIDI-to-MSD Raffel Aligned MIDI to Audio is a nice transcription Can we find matches in large databases? LabROSA Research Overview - Dan Ellis 2014-06-12-5 /20
Singing ASR McVicar Speech recognition adapted to singing needs aligned data Extensive work to line up scraped acapellas and full mix including jumps LabROSA Research Overview - Dan Ellis 2014-06-12-6 /20
4, one can hear some high frequency isolated coefficients superimposed to the separated voice. This drawback could be reduced by including harmonicity priors in the sparse component of RPCA, as Papadopoulos proposed in [20]. Ground versus estimated voice activity location. ImRPCAtruth separates vocals and background perfect voice location still allows an improvebasedactivity on low rankinformation optimization ment, although to a lesser extent than with ground-truth voice acparameter single trade-off tivity information. The decrease in the results mainly comes from basedclassified on higher-level features? adjust background segments as vocalmusical segments. Block Structure RPCA e. Fig. 4. Separated for various LabROSA Research Overview voice - Dan Ellis values of λ for the Pink Noise Party song - 7 /20 2014-06-12
Ordinal LDA Segmentation McFee Low-rank decomposition of skewed self-similarity to identify repeats Learned weighting of multiple factors to segment Linear Discriminant Analysis between adjacent segments LabROSA Research Overview - Dan Ellis 2014-06-12-8 /20
2. Environmental Sound Extracting useful information from soundtracks e.g. TRECVID Multimedia Event Detection (MED) Making a Sandwich, Getting a Vehicle Unstuck 100 examples, find matches in 100k videos manual annotations for ~10 h E009 Getting a Vehicle Unstuck LabROSA Research Overview - Dan Ellis 2014-06-12-9 /20
Foreground Event Recognition Cotton, Ellis, Loui 11 Transients = foreground events? Onset detector finds energy bursts best SNR PCA basis to represent each 300 ms x auditory freq bag of transients LabROSA Research Overview - Dan Ellis 2014-06-12-10/20
NMF Transient Features Decompose spectrograms into templates + activation X = W H well-behaved gradient descent 2D patches sparsity control computation time Basis 1 (L2) Basis 2 (L1) Basis 3 (L1) freq / Hz 5478 3474 2203 1398 Original mixture Smaragdis & Brown 03 Abdallah & Plumbley 04 Virtanen 07 Cotton & Ellis 11 LabROSA Research Overview - Dan Ellis 2014-06-12-11/20 883 442 0 1 2 3 4 5 6 7 8 9 10 time / s
Background Retrieval Classify soundtracks by statistics of ambience E.g. Texture features Sound Automatic gain control Subband distributions Envelope cross-corrs mel x filterbank x (18 chans) x freq / Hz 2404 1273 617 mel band 15 10 5 Envelope correlation Cross-band correlations (318 samples) mean, var, skew, kurt (18 x 4) Modulation energy (18 x 6) LabROSA Research Overview - Dan Ellis 2014-06-12-12/20 x x x 1159_10 urban cheer clap Texture features FFT Histogram 0 2 4 6 8 10 McDermott et al. 09 Ellis, Zheng, McDermott 11 Octave bins 0.5,1,2,4,8,16 Hz 1062_60 quiet dubbed speech music 0 2 4 6 8 10 time / s M V S K 0.5 2 8 32 5 10 15 M V S K 0.5 2 8 32 5 10 15 moments mod frq / Hz mel band moments mod frq / Hz mel band 1 0 level
Auditory Model Features Lyon et al. 2010 Lee & Ellis 2012 Cotton & Ellis 2013 Subband Autocorrelation PCA Simplified version of autocorrelogram 10x faster than Lyon original Capture fine time structure in multiple bands information lost in MFCCs delay line short-time autocorrelation Subband VQ Sound Cochlea filterbank frequency channels Subband VQ Subband VQ Subband VQ Histogram Feature Vector freq lag time Correlogram slice LabROSA Research Overview - Dan Ellis 2014-06-12-13/20
Subband Autocorrelation delay line short-time autocorrelation Autocorrelation stabilizes fine time structure Sound Cochlea filterbank frequency channels freq lag Correlogram time slice Subband VQ Subband VQ Subband VQ Subband VQ Histogram Feature Vector 25 ms window, lags up to 25 ms calculated every 10 ms normalized to max (zero lag) LabROSA Research Overview - Dan Ellis 2014-06-12-14/20
Retrieval Examples High precision for in-domain top hits LabROSA Research Overview - Dan Ellis 2014-06-12-15/20
3. Speech Enhancement Noisy speech scenarios Ambient recording (background noise) Communication channel (processing distortion) CAR KIT - BP 101 43086 20111025 160708 in 100 HOME LAND - BP 101 84943 20111020 144955 in db 50 freq / Hz level / db 4000 2000 100 0 0 1000 2000 3000 50 0 freq / Hz 1500 Hz chan 87 90 93 96 99 time / s 0 1000 2000 3000 freq / Hz 67 70 73 76 79 time / s 100 50 0 level / db LabROSA Research Overview - Dan Ellis 2014-06-12-16/20
RPCA Enhancement Chen, McFee & Ellis 14 Decompose spectrogram into sparse + low-rank Sparse activation H of dictionary W min H,L,S H khk 1 + L klk + S ksk 1 + I + (H) s.t. Y = WH + L + S ASR benefits: C S D I Orig 6.8 10.6 82.6 0.7 RPCA 10.8 36.5 52.7 0.5 wie+rpca 10.4 40.1 49.5 2.1 LabROSA Research Overview - Dan Ellis 2014-06-12-17/20
Classification Pitch Tracker SAcC: MLP trained on noisy speech with ground-truth pitch track targets Large benefits for in-domain noisy speech PTE (%) 100 10 FDA RBF and pink noise YIN Wu get_f0 10 SAcC 10 0 10 20 30 SNR (db) Lee & Ellis 12 LabROSA Research Overview - Dan Ellis 2014-06-12-18/20
Pitch-Normalized Enhancement Use noise-robust pitch tracker for enhancement? 1000 Clean signal Normalize voice pitch Fixed-pitch enhancement Reimpose pitch Frequency 500 0 0 1000 0.5 1 1.5 2 Noisy signal 2.5 3 pitch 500 smoothed pvx 0 0 0.5 1 1.5 2 2.5 3 resampled to pitch = 100 Hz 1000 500 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Filtered pvsmooth 1000 500 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 Resampled back to original pitch 1000 500 0 0 0.5 1 1.5 2 2.5 3 LabROSA Research Overview - Dan Ellis 2014-06-12-19/20 Time
Summary Music transcription, segmentation, alignment for ground truth Soundtracks foreground events, background ambience Noisy Speech classification pitch tracking spectrogram enhancement LabROSA Research Overview - Dan Ellis 2014-06-12-20/20