Minimal-Impact Personal Audio Archives



1 Minimal-Impact Personal Audio Archives
Dan Ellis, Keansub Lee, Jim Ogle
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA

1. Personal Audio Archives
2. Segmenting & Clustering
3. Speech Detection
4. Repeated Events
5. Future

Personal Audio Archives - Ellis, Lee, Ogle p. 1 /18

2 1. Personal Audio Archives
- Easy to record everything you hear
  - < 2 GB / week @ 64 kbps
- Hard to find anything
  - how to scan?
  - how to visualize?
  - how to index?
- Need automatic analysis
- Need minimal impact
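
The storage figure on this slide can be checked with a few lines of arithmetic. This is only a sketch: the number of recording hours per day is an assumption (the slide does not state it), and the function names are hypothetical.

```python
def weekly_bytes(kbps, hours_per_day, days=7):
    """Bytes needed to store `days` of audio at a constant bit rate."""
    bytes_per_second = kbps * 1000 / 8          # kilobits/s -> bytes/s
    return bytes_per_second * 3600 * hours_per_day * days


def max_hours_per_day(kbps, budget_bytes, days=7):
    """Recording hours per day that fit inside a weekly byte budget."""
    bytes_per_day_hour = kbps * 1000 / 8 * 3600 * days
    return budget_bytes / bytes_per_day_hour


# Assumed usage pattern: 8 hours of recording per day at 64 kbps.
print(weekly_bytes(64, 8) / 1e9, "GB per week")
print(max_hours_per_day(64, 2e9), "hours/day fit in 2 GB per week")
```

Under these assumptions, 64 kbps stays below 2 GB per week as long as recording runs for a little under 10 hours a day.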

3 Information in Audio
Long-duration recordings contain info on:
- location: type (restaurant, street, ...) and specific
- activity: talking, walking, typing
- people: generic (2 males), specific (Chuck & John)
- spoken content ... maybe
but not:
- what people and things looked like
- day/night
- gaze, posture, motion, ...

4 Applications
- Automatic appointment-book history
  - fills in when & where of movements
- Life statistics
  - how long did I spend in meetings this week?
  - most frequent conversations
  - favorite phrases?
- Retrieving details
  - what exactly did I promise?
  - privacy issues...
- Nostalgia
- ... or what?

5 2. Segmentation & Clustering
- Top-level structure for long recordings: where are the major boundaries?
  - e.g. for diary application
  - support for manual browsing
- Length of fundamental time-frame
  - 60 s rather than 10 ms?
  - background more important than foreground
  - average out uncharacteristic transients
- Perceptually-motivated features
  - ... so results have perceptual relevance
  - broad spectrum + some detail

6 Features
[Figure: six panels plotted as freq / bark against time / min: Average Linear Energy, Normalized Energy Deviation, Average Log Energy (dB), Log Energy Deviation (dB), Average Spectral Entropy (bits), Spectral Entropy Deviation (bits)]
- Capture both average and variation
- Capture a little more detail in subbands...
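
As an illustration of the "average and variation" idea, the sketch below summarizes the short-time frames that fall inside one long time-frame for a single subband. It is a rough sketch, not the authors' feature extractor: the input layout (one list of power values per short frame), the energy floor, and all function and key names are assumptions.

```python
import math
from statistics import mean, pstdev


def spectral_entropy(power_bins):
    """Entropy in bits of a power spectrum treated as a distribution."""
    total = sum(power_bins)
    if total <= 0:
        return 0.0
    probs = [p / total for p in power_bins if p > 0]
    return -sum(p * math.log2(p) for p in probs)


def long_frame_features(short_frames, floor=1e-12):
    """Mean and deviation of energy, log energy and entropy.

    short_frames: list of power-spectrum slices (one per short frame)
    covering one long time-frame of one subband (assumed layout).
    """
    energies = [sum(f) for f in short_frames]
    log_e = [10.0 * math.log10(max(e, floor)) for e in energies]
    entropies = [spectral_entropy(f) for f in short_frames]
    mu_e = mean(energies)
    mu_h = mean(entropies)
    return {
        "mean_linear_energy": mu_e,
        "norm_energy_dev": pstdev(energies) / mu_e if mu_e > 0 else 0.0,
        "mean_db": mean(log_e),
        "dev_db": pstdev(log_e),
        "mean_entropy": mu_h,
        "norm_entropy_dev": pstdev(entropies) / mu_h if mu_h > 0 else 0.0,
    }


# Two toy frames with equal energy: one flat (noise-like), one peaked (tonal).
feats = long_frame_features([[1, 1, 1, 1], [4, 0, 0, 0]])
print(feats)
```

The two toy frames have identical energy, so only the entropy statistics separate them, which is the kind of extra detail the entropy features are meant to capture.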

7 BIC Segmentation Results
- Evaluate: 62 hr hand-marked dataset
  - 8 days, 139 segments, 16 categories
  - measure Correct Accept at False Accept = 2%:

Feature               Correct Accept
μdB                   80.8%
μH                    81.1%
σH/μH                 81.6%
μdB + σH/μH           84.0%
μdB + σH/μH + μH      83.6%
mfcc                  73.6%

[Figure: Sensitivity plotted against Specificity, with curves for μdB, μH, σH/μH, μdB + σH/μH, and μdB + μH + σH/μH]
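
For readers unfamiliar with BIC segmentation, the following sketch scores a candidate boundary by comparing one Gaussian for the whole window against two Gaussians, one on each side. It is a simplified illustration and not the authors' code: it assumes diagonal covariances, and the penalty weight `lam`, the variance floor, and the function names are hypothetical choices.

```python
import math


def _log_det_diag(frames, var_floor=1e-6):
    """Log-determinant of the diagonal covariance fitted to the frames."""
    n, d = len(frames), len(frames[0])
    total = 0.0
    for k in range(d):
        col = [f[k] for f in frames]
        m = sum(col) / n
        var = sum((x - m) ** 2 for x in col) / n
        total += math.log(max(var, var_floor))
    return total


def delta_bic(frames, split, lam=1.0):
    """Positive values favor a boundary at index `split`."""
    n, d = len(frames), len(frames[0])
    left, right = frames[:split], frames[split:]
    gain = 0.5 * (n * _log_det_diag(frames)
                  - len(left) * _log_det_diag(left)
                  - len(right) * _log_det_diag(right))
    # Splitting adds one extra diagonal Gaussian: d means + d variances.
    penalty = 0.5 * lam * (2 * d) * math.log(n)
    return gain - penalty


def best_boundary(frames, min_len=2, lam=1.0):
    """Return (split index, delta BIC) of the strongest candidate."""
    scores = {i: delta_bic(frames, i, lam)
              for i in range(min_len, len(frames) - min_len + 1)}
    split = max(scores, key=scores.get)
    return split, scores[split]


# Toy 1-D features: ten frames near 0 followed by ten frames near 10.
quiet = [[0.1 * ((i % 3) - 1)] for i in range(10)]
loud = [[10 + 0.1 * ((i % 3) - 1)] for i in range(10)]
split, score = best_boundary(quiet + loud)
print(split, score)
```

A boundary is accepted when the score is positive; raising `lam` trades Correct Accepts for fewer False Accepts.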

8 Segment Clustering
- Daily activity has lots of repetition: automatically cluster similar segments
  - affinity of segments as KL2 distances
[Figure: segment affinity display; category labels not recoverable from the extraction]
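
KL2 here is the symmetrized Kullback-Leibler divergence. The sketch below computes it between two segments, assuming each segment is modeled by a single diagonal-covariance Gaussian; that modeling choice, the exponential mapping to an affinity, the `scale` parameter, and the function names are all assumptions for illustration.

```python
import math


def fit_diag_gaussian(frames, var_floor=1e-6):
    """Per-dimension mean and variance of a segment's feature frames."""
    n, d = len(frames), len(frames[0])
    means = [sum(f[k] for f in frames) / n for k in range(d)]
    variances = [max(sum((f[k] - means[k]) ** 2 for f in frames) / n,
                     var_floor)
                 for k in range(d)]
    return means, variances


def kl_diag(p, q):
    """KL(p || q) for diagonal Gaussians given as (means, variances)."""
    (mp, vp), (mq, vq) = p, q
    return 0.5 * sum(math.log(b / a) + (a + (m - n) ** 2) / b - 1.0
                     for m, a, n, b in zip(mp, vp, mq, vq))


def kl2(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p)."""
    return kl_diag(p, q) + kl_diag(q, p)


def affinity(p, q, scale=1.0):
    """Map a KL2 distance to a similarity in (0, 1] (assumed mapping)."""
    return math.exp(-kl2(p, q) / scale)


a = ([0.0], [1.0])   # unit-variance Gaussian at 0
b = ([1.0], [1.0])   # unit-variance Gaussian at 1
print(kl2(a, b), affinity(a, b))
```

A matrix of such affinities between all segment pairs is the usual input to a clustering step.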

9 Clustering Results
- Clustering of automatic segments gives anonymous classes
  - BIC criterion to choose number of clusters
  - make best correspondence to 16 ground-truth (GT) clusters
- Frame-level scoring gives ~70% correct
  - errors when same place has multiple ambiences

10 3. Speech Detection
- Speech emerges as most interesting content
- Just identifying speech would be useful
  - goal is speaker identification / labeling
- Lots of background noise
  - conventional Voice Activity Detection inadequate
- Insight: listeners detect pitch track (melody)
  - look for voice-like periodicity in noise
[Figure: coffeeshop excerpt, Frequency (axis to 4000) against Time]

11 Voice Periodicity Enhancement
- Noise-robust subband autocorrelation
- Subtract local average
  - suppresses steady background, e.g. machine noise
- 15 min test set: 88% acc (79% w/o enhancement)
- also for enhancing speech (harmonic filtering)
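
The effect of subtracting a local average can be shown on a toy signal. This is a single-band sketch under stated assumptions, not the subband system described on the slide: the frame length, lag range, averaging half-width, and all function names are hypothetical, and a steady sinusoid stands in for machine noise.

```python
import math


def normalized_autocorr(frame, min_lag, max_lag):
    """Autocorrelation over a lag range, normalized by frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0:
        return [0.0] * (max_lag - min_lag + 1)
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
            for lag in range(min_lag, max_lag + 1)]


def subtract_local_average(ac_rows, half_width):
    """Remove, per lag, the mean over neighboring frames in time."""
    out = []
    for t, row in enumerate(ac_rows):
        ctx = ac_rows[max(0, t - half_width): t + half_width + 1]
        avg = [sum(r[k] for r in ctx) / len(ctx) for k in range(len(row))]
        out.append([v - a for v, a in zip(row, avg)])
    return out


def periodicity_scores(frames, min_lag, max_lag, half_width):
    """Peak of the background-suppressed autocorrelation in each frame."""
    rows = [normalized_autocorr(f, min_lag, max_lag) for f in frames]
    return [max(r) for r in subtract_local_average(rows, half_width)]


# Toy data: a steady hum (period 25 samples) in every frame, plus a
# voice-like tone (period 20 samples) that appears only in frame 5.
n = 400
hum = [math.sin(2 * math.pi * i / 25) for i in range(n)]
voice = [math.sin(2 * math.pi * i / 20) for i in range(n)]
frames = [hum] * 10
frames[5] = [h + v for h, v in zip(hum, voice)]
scores = periodicity_scores(frames, min_lag=10, max_lag=30, half_width=3)
print(scores)
```

The hum is strongly periodic, so raw autocorrelation would flag every frame; after the local average is removed, only the frame where the periodicity changes stands out.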

12 4. Repeating Events
- Recurring sound events can be informative
  - indicate similar circumstance...
  - but: define "event" (sound organization)
  - define "recurring event" (how similar?)
  - ... and how to find them (tractable?)
- Idea: use hashing (fingerprints)
  - index points to other occurrences of each hash; intersection of hashes points to match: much quicker search
  - use a fingerprint insensitive to background?

13 Shazam Fingerprints
- Prominent spectral onsets are landmarks; use relations {f1, f2, t} as hashes
[Figure: Phone ring - Shazam fingerprint; frequency axis to 4000]
- intrinsically robust to background noise
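
The landmark-pair hashing idea can be sketched with an inverted index and a vote over time offsets. This is an illustrative sketch only, not Shazam's or the authors' implementation: landmarks are assumed to be already extracted as (time, frequency-bin) integers, and `fan_out`, `max_dt`, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def landmark_hashes(landmarks, fan_out=3, max_dt=50):
    """Pair each landmark with a few later ones: hash = (f1, f2, dt)."""
    pts = sorted(landmarks)
    out = []
    for i, (t1, f1) in enumerate(pts):
        paired = 0
        for t2, f2 in pts[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt <= 0:
                continue
            out.append(((f1, f2, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return out


def build_index(landmarks, **kw):
    """Inverted index: hash -> times at which it occurs in the archive."""
    index = defaultdict(list)
    for h, t in landmark_hashes(landmarks, **kw):
        index[h].append(t)
    return index


def best_offset(index, query_landmarks, **kw):
    """Vote on (archive time - query time); return (offset, votes)."""
    votes = Counter()
    for h, tq in landmark_hashes(query_landmarks, **kw):
        for tr in index.get(h, ()):
            votes[tr - tq] += 1
    return votes.most_common(1)[0] if votes else (None, 0)


# Toy landmarks for a ring tone, embedded in an archive at time 200.
ring = [(0, 10), (3, 25), (7, 14), (12, 30), (15, 10)]
archive = [(50, 5), (120, 40)] + [(t + 200, f) for t, f in ring]
index = build_index(archive)
clean = best_offset(index, ring)
noisy = best_offset(index, ring + [(5, 50)])   # one extra background landmark
print(clean, noisy)
```

The extra background landmark displaces some pairs, so fewer hashes match, but the surviving ones still agree on the same offset, which is the sense in which the fingerprint tolerates added background.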

14 Exhaustive Search for Repeats
- More selective hashes
  - few hits required to confirm match (faster; better precision)
  - but less robust to background (reduced recall)
- Works well when exact structure repeats
  - recorded music, electronic alerts
  - no good for "organic" sounds, e.g. garage door

15 5. Future: Browsing Tools / Diary interface
- Browsing
  - links to other information (diary, , photos)
  - synchronize with note taking? (Stifelman & Arons)
- Release Tools + "how to" for capture
[Figure: diary interface display; labels not recoverable from the extraction]

16 Future: Speech Recognition
- Most audio is too noisy for standard ASR
  - actually reassuring for privacy issues
- But... similar to Meeting Recordings
  - NIST distant microphone conditions
- Speech enhancement
  - directional filtering: 2 channels a big improvement over one
  - ... use a more special-purpose directional mic?

17 Privacy and Security
- Recordings are controversial
  - privacy expectations: speech should be ephemeral?
  - "Oops" button, delayed review (Roy)
  - subpoenas... (Golubchik)
- Access to recordings is very sensitive
  - ... but preservation is important too
- Approaches
  - don't store intelligible audio ... but lessens utility; maybe store ASR output?
  - split and store on multiple machines: tiered, distributed trust/access protocols
- Big issue!

18 Conclusions
- Personal Audio is easy & cheap to collect
  - but is it any use?
- Segmentation/clustering works well
- Voice detection in noise is harder
  - prospects for speaker identification
- Hashing to find arbitrary repeating events
- Tools distribution as a goal
