Audio & Music Research at LabROSA


1 Audio & Music Research at LabROSA. Dan Ellis, Laboratory for Recognition and Organization of Speech and Audio, Dept. Electrical Eng., Columbia Univ., NY USA. Outline: 1. Eigenrhythms: Representing drum tracks; 2. Frequency-Domain Linear Prediction; 3. Segmenting meeting turns; 4. Analyzing personal audio recordings.

2 LabROSA Projects Overview. [Diagram: the projects Eigenrhythms, Personal audio, Meeting turns, and FDLP, placed among the domains Music, Environment, and Speech and the methods Information Extraction, Machine Learning, and Signal Processing.]

3 1. Eigenrhythms: Drum Pattern Space (with John Arroyo). Pop songs are built on a repeating drum loop (bass drum, snare, hi-hat), with small variations on a few basic patterns. Can eigen-analysis (PCA) capture these variations, by analyzing lots of (MIDI) data? Applications: music categorization; "beat box" synthesis.

4 Aligning the Data. Patterns need to be aligned prior to PCA: tempo (stretch) by inferring BPM and normalizing; downbeat (shift) by correlating against a mean template.
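The downbeat (shift) step can be sketched as a circular correlation against the running mean pattern. This is only an illustration under stated assumptions: the patterns are taken to be already tempo-normalized onto a common grid, and the function names `best_circular_shift` and `align_patterns` are hypothetical, not from the slides.

```python
import numpy as np

def best_circular_shift(pattern, template):
    """Return the circular shift (in grid steps) that best matches the template."""
    scores = [np.dot(np.roll(pattern, s), template) for s in range(len(pattern))]
    return int(np.argmax(scores))

def align_patterns(patterns, n_iter=3):
    """Repeatedly shift each 1-D pattern to correlate best with the mean template."""
    aligned = np.array(patterns, dtype=float)
    for _ in range(n_iter):
        template = aligned.mean(axis=0)  # mean template of the current alignment
        aligned = np.array(
            [np.roll(p, best_circular_shift(p, template)) for p in aligned]
        )
    return aligned

# Toy example: a 16-step pattern, one copy rotated by 3 steps.
base = np.array([1, 0, 0, 0, .5, 0, 0, 0, .8, 0, 0, 0, .5, 0, 0, 0])
aligned = align_patterns([base, base, base, np.roll(base, 3)])
```

Because the template is itself the mean of the patterns being aligned, a few iterations are used so that the template sharpens as the patterns line up.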

5 Eigenrhythms. Need 20+ eigenvectors for good coverage of 100 training patterns (1200 dims). Top patterns: [figure not reproduced].
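A minimal sketch of the eigen-analysis, assuming each aligned drum pattern has been flattened into one row of a matrix. The SVD-based PCA and the helper names (`eigenrhythms`, `project`, `reconstruct`) are illustrative choices, not the authors' code; `reconstruct` is the kind of step a resynthesis from eigenspace would need.

```python
import numpy as np

def eigenrhythms(patterns, n_components):
    """PCA via SVD: returns (mean pattern, basis rows, explained-variance fractions)."""
    X = np.asarray(patterns, dtype=float)
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    return mean, Vt[:n_components], explained[:n_components]

def project(pattern, mean, basis):
    """Coordinates of one pattern in the eigenrhythm space."""
    return basis @ (np.asarray(pattern, dtype=float) - mean)

def reconstruct(coeffs, mean, basis):
    """Map eigenspace coordinates back to a pattern."""
    return mean + coeffs @ basis

# Synthetic data that truly lives in a 2-D subspace of a 12-D pattern space.
rng = np.random.default_rng(0)
X = 5 + rng.normal(size=(100, 2)) @ rng.normal(size=(2, 12))
mean, basis, explained = eigenrhythms(X, n_components=2)
x_hat = reconstruct(project(X[0], mean, basis), mean, basis)
```

On real data the explained-variance fractions indicate how many eigenvectors are needed for a given coverage.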

6 Eigenrhythms for Classification. Clusters in eigenspace: all tracks projected onto the first two eigenrhythms. [Figure: scatter plot, Eigenrhythm 1 vs. Eigenrhythm 2, with points labelled genre:track using the genre prefixes bl, co, di, hh, ho, nw, pp, pu, rb, rc.] Genre classification? (10-way) nearest neighbor in 4D eigenspace: 21% correct.

7 Eigenrhythm BeatBox Resynthesize rhythms from eigen-space

8 2. Frequency-Domain Lin. Pred. (with Marios Athineos). (Time-domain) Linear Prediction (TDLP) is the well-known spectral estimator: $y[n] = \sum_{i=1}^{p} a_i\, y[n-i] + e[n]$. Apply it to a frequency-domain signal (the DCT) and, by duality, it estimates the temporal envelope (FDLP): $Y[k] = \sum_{i=1}^{p} b_i\, Y[k-i] + E[k]$.
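The duality can be tried numerically: fit an all-pole model to the DCT of a signal, then read the model's "spectrum" as a function of time. The sketch below is an assumption-laden illustration (unnormalized DCT-II built from an explicit cosine matrix, autocorrelation-method LP, hypothetical function names), not the implementation used in the work described.

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II via an explicit cosine matrix (O(N^2); short signals only)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    return np.cos(np.pi * (n + 0.5) * k / N) @ x

def lpc_autocorr(seq, order):
    """Autocorrelation-method LP: seq[k] ~ sum_i a[i-1] * seq[k-i]. Returns (a, gain)."""
    seq = np.asarray(seq, dtype=float)
    r = np.array([np.dot(seq[: len(seq) - l], seq[l:]) for l in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)  # small diagonal loading for numerical safety
    a = np.linalg.solve(R, r[1:])
    gain = r[0] - np.dot(a, r[1:])
    return a, gain

def fdlp_envelope(x, order=8):
    """All-pole fit to the DCT of x, evaluated so that index n corresponds to time n."""
    N = len(x)
    a, gain = lpc_autocorr(dct2(x), order)
    omega = np.pi * (np.arange(N) + 0.5) / N  # DCT-II maps time n to this angle
    i = np.arange(1, order + 1)
    A = 1.0 - np.exp(-1j * np.outer(omega, i)) @ a
    return gain / np.abs(A) ** 2

# A click at sample 300 should give an envelope peaking near sample 300.
x = np.zeros(512)
x[300] = 1.0
env = fdlp_envelope(x, order=2)
peak = int(np.argmax(env))
```

An impulse in time becomes a cosine across DCT index, so the linear predictor places a pole pair at the matching angle, which is the time-domain counterpart of TDLP tracking a spectral peak. For longer signals a fast DCT (for example `scipy.fft.dct`) would replace the explicit matrix.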

9 Aside: Spectrogram of the DCT. The DCT gives a pure-real signal: can we treat it like a waveform?

10 FDLP and TDLP Duality. [Figure not reproduced.]

11 Subband FDLP. Temporal envelopes without 25 ms windows. [Figure panels: Auditory STFT (10-25 ms + Bark bin); TDLP (per time frame); Subband FDLP (per frequency subband).]

12 FDLP Applications. Time-scale modification. Modulation-domain temporal equalization. Perceptual audio features... [Block diagram labels: DCT (1 sec up to whole sample); residual in freq.; flat temporal envelopes; IDCT; overlap & OLA.]

13 PLP-squared (Marios Athineos, Hynek Hermansky). FDLP fits the temporal envelope with LP; Perceptual Linear Prediction (PLP) smooths across frequency. Can we do both... iteratively? Speech features without short-time (ST) windows. [Figure axes: Bark band vs. t / sec.]

14 3. Meeting Turns (with Jerry Liu and ICSI). Multi-mic recordings for speaker turns: every voice reaches every mic... (?)... but with differing coupling filters (delays, gains). Find turns with minimal assumptions, e.g. ad-hoc sensor setups (multiple PDAs). Use between-channel differences to remove the effect of the source signal: no spectral models, and faster than real time (< 1xRT).

15 Between-channel cues: Timing (ITD) & Level. [Figure panels, all against time / s: speaker activity (ground truth); timing differences (ITD) as cross-correlation peak lags in samples (2 mic pairs, 250 ms window, 5-point median filter); peak normalized correlation coefficient r; per-channel energy (dB); between-channel energy differences (dB).]

16 Pre-whitening for ITD. Inverse-filter by 12-pole LPC models (32 ms windows) to remove local resonances. Filter out noise < 500 Hz and > 6 kHz. Then cross-correlate. [Figure: short-time cross-correlation (lag / samples vs. time / sec) for raw signals and for whitened+filtered signals, each shown with the speaker ground truth (speaker ID vs. time / sec).]

17 Choosing Good Frames. Correlation coef. r ~ channel similarity: $r_{ij}[l] = \frac{\sum_n m_i[n]\, m_j[n+l]}{\sqrt{\sum m_i^2 \sum m_j^2}}$. Select frames with r in the top 50% in both pairs (435 of 1201 points, about 35%): a cleaner basis for models. [Figure: scatter of Skew12 vs. Skew34 (in samples), for all ITD points and for high-correlation points only.]
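The correlation coefficient and the frame selection can be sketched as follows. This assumes two equal-length frames from one microphone pair and interprets "top 50%" as "at or above the median"; the function names are hypothetical.

```python
import numpy as np

def norm_xcorr_peak(mi, mj, max_lag):
    """Peak of r_ij[l] = sum_n mi[n] mj[n+l] / sqrt(sum mi^2 * sum mj^2) over |l| <= max_lag."""
    mi = np.asarray(mi, dtype=float)
    mj = np.asarray(mj, dtype=float)
    n = len(mi)
    denom = np.sqrt(np.sum(mi ** 2) * np.sum(mj ** 2))
    best_lag, best_r = 0, -np.inf
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            num = np.dot(mi[: n - lag], mj[lag:])
        else:
            num = np.dot(mi[-lag:], mj[: n + lag])
        r = num / denom
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

def select_good_frames(r_pair_a, r_pair_b):
    """Boolean mask of frames whose peak r is in the top half for both mic pairs."""
    a = np.asarray(r_pair_a, dtype=float)
    b = np.asarray(r_pair_b, dtype=float)
    return (a >= np.median(a)) & (b >= np.median(b))

# Toy frame pair: channel j is channel i delayed by 5 samples.
rng = np.random.default_rng(0)
s = rng.normal(size=600)
lag, r = norm_xcorr_peak(s[50:550], s[45:545], max_lag=20)
mask = select_good_frames([.1, .9, .8, .2], [.9, .8, .1, .2])
```

Requiring a high r in both pairs at once keeps fewer than half of the frames, which is consistent with the roughly one-third retention reported on the slide.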

18 Spectral clustering. Eigenvectors of the affinity matrix A pick out similar points: $a_{mn} = \exp\{-\|x[m] - x[n]\|^2 / 2\sigma^2\}$. Ad-hoc mapping to clusters; number of clusters K from eigenvalues. [Figures: affinity matrix A (point index vs. point index); first 12 eigenvectors (normalized), plotted against points.]
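A small sketch of the affinity matrix and its leading eigenvectors. The labelling rule at the end (each point takes the eigenvector on which it has the largest magnitude) is just one simple stand-in for the "ad-hoc mapping" and is an assumption, as are the function names and the toy data.

```python
import numpy as np

def affinity_matrix(X, sigma):
    """a_mn = exp(-||x[m] - x[n]||^2 / (2 sigma^2)) for the rows of X."""
    X = np.asarray(X, dtype=float)
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def leading_eigenvectors(A, k):
    """Top-k eigenvalues and eigenvectors of a symmetric matrix, largest first."""
    vals, vecs = np.linalg.eigh(A)  # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]

# Two well-separated groups of 2-D points (e.g. pairs of timing differences).
points = [(0, 0), (.1, 0), (0, .1), (.1, .1), (.05, .05),
          (10, 10), (10.1, 10), (10, 10.1)]
A = affinity_matrix(points, sigma=1.0)
vals, vecs = leading_eigenvectors(A, k=2)
labels = np.argmax(np.abs(vecs), axis=1)  # assumed mapping: dominant eigenvector
```

With clean clusters the affinity matrix is nearly block diagonal, so the large eigenvalues count the clusters and each leading eigenvector is concentrated on one block. The result depends on the choice of sigma.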

19 Speaker Models & Classification. Actual clusters depend on σ and the K heuristic. Fit Gaussians to each cluster and assign that class to all frames within a radius; or consider dimensions independently and choose the best. [Figure panels: ICSI0: good points; all pts: nearest class; all pts: closest dimension.]

20 Performance Analysis. Compare reference & system activity maps: system misses quiet speakers 2, 3, 4 (deletions); system splits speaker 6 (deletions + insertions); many short gaps (deletions). ~52% avg. error on NIST 2004 dev set; speaker-characteristic-based systems get ~25%.

21 4. Segmenting Personal Audio (with Keansub Lee). Easy to record everything you hear: ~100 GB / year @ 64 kbps. Very hard to find anything: how to scan? how to visualize? how to index? Starting point: collect data, ~60 hours (8 days, ~7.5 hr/day); hand-mark 139 segments (26 min/seg avg.); assign to 16 classes (8 have multiple instances).
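The storage and segment-length figures follow from simple arithmetic, sketched below. The number of recording hours per day used for the yearly total is an assumption chosen for illustration (the slide does not state it), and `archive_size_gb` is a hypothetical helper.

```python
def archive_size_gb(bitrate_kbps, hours_per_day, days=365):
    """Storage in GB (10^9 bytes) for audio recorded at a constant bitrate."""
    bytes_per_sec = bitrate_kbps * 1000 / 8
    return bytes_per_sec * 3600 * hours_per_day * days / 1e9

# Assumed 10 recording hours per day at 64 kbps: about 105 GB per year.
per_year = archive_size_gb(64, hours_per_day=10)

# Recording around the clock at the same rate: about 252 GB per year.
per_year_24h = archive_size_gb(64, hours_per_day=24)

# Average segment length for 60 hours split into 139 hand-marked segments.
avg_segment_min = 60 * 60 / 139  # about 25.9 minutes
```

So a figure of roughly 100 GB per year corresponds to recording during waking hours rather than continuously.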

22 Features for Long Recordings. Feature frames = 1 min (not 25 ms!). Characterize variation within each frame, and structure within coarse auditory bands. [Figure panels, freq / bark vs. time / min: Average Linear Energy; Normalized Energy Deviation; Average Log Energy (dB); Log Energy Deviation (dB); Average Spectral Entropy (bits); Spectral Entropy Deviation (bits).]

23 BIC Segmentation. Untrained segmentation technique; a statistical test indicates good change points: $\log \frac{L(X_1; M_1)\, L(X_2; M_2)}{L(X; M_0)} \gtrless \frac{\lambda}{2} \log(N)\, \Delta\#(M)$. Evaluate: 60 hr of hand-marked boundaries; different features & combinations. Correct Accept at False Accept = 2%: µdB 80.8%; µH 81.1%; σH/µH 81.6%; µdB + σH/µH 84.0%; µdB + σH/µH + µH 83.6%. [Figure: Sensitivity vs. Specificity curves for the same five feature sets.]
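The test can be sketched for the common case where each model M is a single full-covariance Gaussian. That model choice, the small ridge on the covariance, and the names `delta_bic` and `n_logdet` are assumptions for illustration; the slide does not specify them.

```python
import numpy as np

def n_logdet(X):
    """N * log|Sigma| for the maximum-likelihood covariance of the rows of X."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True).reshape(d, d) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    return n * logdet

def delta_bic(X, t, lam=1.0):
    """Log-likelihood ratio for a boundary at frame t, minus the BIC penalty.

    Positive values favor two separate Gaussians (a change point at t).
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    llr = 0.5 * (n_logdet(X) - n_logdet(X[:t]) - n_logdet(X[t:]))
    extra_params = d + d * (d + 1) / 2  # one additional mean and covariance
    return llr - 0.5 * lam * extra_params * np.log(n)

rng = np.random.default_rng(0)
changed = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
steady = rng.normal(0, 1, (400, 2))
score_changed = delta_bic(changed, 200)
score_steady = delta_bic(steady, 200)
```

In practice the score would be evaluated at every candidate boundary in a window, and λ tuned to trade false accepts against correct accepts.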

24 Segment clustering. Daily activity has lots of repetition: automatically cluster similar segments. Segment labels shown: supermkt, meeting, karaoke, barber, lecture2, billiard, break, lecture1, car/taxi, home, bowling, street, restaurant, library, campus. Spectral clustering achieves ~70% correct against 16-way ground-truth labels, using KL distance and smoothed covariance estimates. [Figure: matrix with abbreviated segment labels (cmp, lib, rst, str, ...) on the axes.]
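One plausible reading of "KL distance, smoothed covariance estimates" is a symmetrized Kullback-Leibler divergence between per-segment Gaussians. The sketch below assumes that reading; in particular, the shrinkage toward the diagonal in `smoothed_cov` is an assumed form of smoothing, not one taken from the slides.

```python
import numpy as np

def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    """KL(p || q) between two multivariate Gaussians."""
    mu_p, mu_q = np.asarray(mu_p, dtype=float), np.asarray(mu_q, dtype=float)
    cov_p, cov_q = np.asarray(cov_p, dtype=float), np.asarray(cov_q, dtype=float)
    d = len(mu_p)
    inv_q = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff
                  - d + logdet_q - logdet_p)

def symmetric_kl(mu_p, cov_p, mu_q, cov_q):
    """Symmetrized KL 'distance': KL(p||q) + KL(q||p)."""
    return kl_gauss(mu_p, cov_p, mu_q, cov_q) + kl_gauss(mu_q, cov_q, mu_p, cov_p)

def smoothed_cov(X, alpha=0.1):
    """Sample covariance shrunk toward its own diagonal (assumed smoothing)."""
    cov = np.cov(np.asarray(X, dtype=float), rowvar=False)
    return (1 - alpha) * cov + alpha * np.diag(np.diag(cov))

I2 = np.eye(2)
d_same = symmetric_kl([0, 0], I2, [0, 0], I2)
d_shift = symmetric_kl([0, 0], I2, [1, 0], I2)
cov_s = smoothed_cov([[0, 0], [1, 1], [2, 2]], alpha=0.5)
```

A pairwise distance matrix built this way can be turned into affinities, for example with a Gaussian kernel, before spectral clustering.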

25 Future Work. Visualization / browsing / diary inference: link to other information sources. Privacy protection: speaker/speech "search and destroy".

26 LabROSA Summary. LabROSA: signal processing + machine learning + information extraction. Applications: Eigenrhythms (drum pattern models); FDLP temporal envelopes; meeting recordings; personal audio analysis. Also... music similarity, signal separation, ...

Minimal-Impact Personal Audio Archives

Minimal-Impact Personal Audio Archives Dan Ellis, Keansub Lee, Jim Ogle Laboratory for Recognition and Organization of Speech and Audio Dept. Electrical Eng., Columbia Univ., NY USA dpwe@ee.columbia.edu