Minimal-Impact Personal Audio Archives

Dan Ellis, Keansub Lee, Jim Ogle
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
dpwe@ee.columbia.edu

1. Personal Audio Archives
2. Segmenting & Clustering
3. Speech Detection
4. Repeated Events
5. Future

Personal Audio Archives - Ellis, Lee, Ogle 2006-07-19 p. 1 /18

1. Personal Audio Archives

- Easy to record everything you hear
  - <2GB / week @ 64 kbps
- Hard to find anything
  - how to scan?
  - how to visualize?
  - how to index?
- Need automatic analysis
- Need minimal impact
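The storage figure above can be checked with a little arithmetic. This is a minimal sketch that assumes a constant 64 kbps bitrate and decimal gigabytes; the slide does not say how many hours per day are recorded, so the function only reports how many hours fit in a given budget. The function names are illustrative.

```python
def bytes_per_hour(bitrate_kbps: float) -> float:
    """Bytes produced by one hour of audio at a constant bitrate."""
    bits_per_second = bitrate_kbps * 1000
    return bits_per_second / 8 * 3600


def hours_in_budget(budget_gb: float, bitrate_kbps: float) -> float:
    """Hours of audio that fit in budget_gb (decimal GB, 1e9 bytes)."""
    return budget_gb * 1e9 / bytes_per_hour(bitrate_kbps)


print(bytes_per_hour(64) / 1e6)   # megabytes per recorded hour
print(hours_in_budget(2, 64))     # recorded hours that fit in 2 GB
```

At 64 kbps an hour of audio is 28.8 MB, so 2 GB holds roughly 69 hours, which is about 10 hours of recording per day over a week.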

Information in Audio

- Long-duration recordings contain info on:
  - location: type (restaurant, street, ...) and specific
  - activity: talking, walking, typing
  - people: generic (2 males), specific (Chuck & John)
  - spoken content ... maybe
- but not:
  - what people and things looked like
  - day/night
  - gaze, posture, motion, ...

Applications

- Automatic appointment-book history
  - fills in when & where of movements
- Life statistics
  - how long did I spend in meetings this week?
  - most frequent conversations
  - favorite phrases?
- Retrieving details
  - what exactly did I promise?
  - privacy issues...
- Nostalgia... or what?

2. Segmentation & Clustering

- Top-level structure for long recordings: where are the major boundaries?
  - e.g. for diary application
  - support for manual browsing
- Length of fundamental time-frame
  - 60s rather than 10ms?
  - background more important than foreground
  - average out uncharacteristic transients
- Perceptually-motivated features
  - .. so results have perceptual relevance
  - broad spectrum + some detail

Features

[Figure: six panels, frequency (Bark) vs. time (min): Average Linear Energy (dB), Normalized Energy Deviation (dB), Average Log Energy (dB), Log Energy Deviation (dB), Average Spectral Entropy (bits), Spectral Entropy Deviation (bits)]

- Capture both average and variation
- Capture a little more detail in subbands...
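The long-time-frame features named in the panels can be sketched as follows. This is only an illustration under stated assumptions: the input is a precomputed short-time power spectrum, the subbands are plain rectangular bin ranges (`band_edges`, a hypothetical stand-in for Bark-scale bands), and spectral entropy is taken over the energy distribution inside each band. The slides do not give the exact definitions, so the function and key names here are illustrative.

```python
import numpy as np


def band_features(power, band_edges, frames_per_block):
    """Summarize short-time subband statistics over long blocks.

    power: (n_bins, n_frames) short-time power spectrum.
    band_edges: list of (lo, hi) bin ranges, one per subband (assumed
        rectangular bands, not true Bark filters).
    frames_per_block: short frames per long block (e.g. 60 s worth).
    Returns a dict of (n_bands, n_blocks) arrays.
    """
    eps = 1e-12
    energy, entropy = [], []
    for lo, hi in band_edges:
        band = power[lo:hi] + eps               # (bins_in_band, n_frames)
        total = band.sum(axis=0)
        p = band / total                        # energy distribution in band
        energy.append(total)
        entropy.append(-(p * np.log2(p)).sum(axis=0))   # bits
    energy = np.array(energy)
    entropy = np.array(entropy)
    log_energy = 10 * np.log10(energy)          # dB

    n_blocks = energy.shape[1] // frames_per_block

    def blocks(x):
        # Drop the ragged tail, then group frames into long blocks.
        x = x[:, :n_blocks * frames_per_block]
        return x.reshape(x.shape[0], n_blocks, frames_per_block)

    e, le, h = blocks(energy), blocks(log_energy), blocks(entropy)
    return {
        "avg_linear_energy": e.mean(axis=2),
        "norm_energy_dev": e.std(axis=2) / e.mean(axis=2),
        "avg_log_energy": le.mean(axis=2),
        "log_energy_dev": le.std(axis=2),
        "avg_entropy": h.mean(axis=2),
        # Deviation normalized by the mean (sigma_H / mu_H).
        "norm_entropy_dev": h.std(axis=2) / (h.mean(axis=2) + eps),
    }


# Synthetic example: random power spectrum, two 4-bin bands, 10-frame blocks.
rng = np.random.default_rng(0)
feats = band_features(rng.random((8, 40)), [(0, 4), (4, 8)], 10)
print({name: values.shape for name, values in feats.items()})
```

Each output array has one row per subband and one column per long block, so the average and the variation of each quantity are both available at the coarse time scale.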

BIC Segmentation Results

- Evaluate: 62 hr hand-marked dataset
  - 8 days, 139 segments, 16 categories
  - measure Correct Accept % @ False Accept = 2%:

  Feature              Correct Accept
  μdB                  80.8%
  μH                   81.1%
  σH/μH                81.6%
  μdB + σH/μH          84.0%
  μdB + σH/μH + μH     83.6%
  mfcc                 73.6%

[Figure: ROC curves, Sensitivity vs. 1 - Specificity, for μdB, μH, σH/μH, μdB + σH/μH, and μdB + μH + σH/μH]
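The slides do not spell out the BIC test itself. A common formulation, assumed here, compares one full-covariance Gaussian over a window of feature frames against two Gaussians split at a candidate boundary, and accepts the boundary when the likelihood gain exceeds a model-size penalty. The penalty weight `lam` and the `margin` parameter are hypothetical tuning knobs, not values from the talk.

```python
import numpy as np


def delta_bic(x, t, lam=1.0):
    """BIC gain from splitting x (n_frames, n_dims) at frame t.

    Positive values favor a boundary at t.
    """
    n, d = x.shape

    def logdet(seg):
        # Maximum-likelihood covariance, lightly regularized.
        cov = np.atleast_2d(np.cov(seg, rowvar=False, bias=True))
        cov = cov + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    gain = 0.5 * (n * logdet(x)
                  - t * logdet(x[:t])
                  - (n - t) * logdet(x[t:]))
    # Extra parameters of the second Gaussian: mean plus covariance.
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
    return gain - penalty


def best_boundary(x, margin=5, lam=1.0):
    """Return (frame index, delta BIC) of the strongest candidate."""
    scores = [delta_bic(x, t, lam) for t in range(margin, len(x) - margin)]
    k = int(np.argmax(scores))
    return k + margin, scores[k]


# Synthetic example: the feature mean shifts halfway through.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
print(best_boundary(x))
```

Raising `lam` makes the detector more conservative, which is one way to trade correct accepts against false accepts.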

Segment Clustering

- Daily activity has lots of repetition: automatically cluster similar segments
  - affinity of segments as KL2 distances

[Figure: segment affinity matrix; labels not recoverable]
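As a rough illustration of a KL2-based affinity, the sketch below assumes that KL2 means the symmetrized Kullback-Leibler divergence between single Gaussians fitted to each segment's feature frames. Mapping distance to affinity with `exp(-d / (2 * scale**2))` and the `scale` parameter are assumptions made for this example; the slide does not specify them.

```python
import numpy as np


def gaussian_stats(frames):
    """Mean and regularized covariance of (n_frames, n_dims) features."""
    mu = frames.mean(axis=0)
    cov = np.atleast_2d(np.cov(frames, rowvar=False))
    return mu, cov + 1e-6 * np.eye(frames.shape[1])


def kl2(seg_a, seg_b):
    """Symmetrized KL divergence KL(a||b) + KL(b||a) between Gaussians."""
    mu_a, cov_a = gaussian_stats(seg_a)
    mu_b, cov_b = gaussian_stats(seg_b)
    inv_a, inv_b = np.linalg.inv(cov_a), np.linalg.inv(cov_b)
    d = len(mu_a)
    dmu = mu_a - mu_b
    # The log-determinant terms cancel in the symmetric sum.
    return 0.5 * (np.trace(inv_b @ cov_a) + np.trace(inv_a @ cov_b)
                  - 2 * d + dmu @ (inv_a + inv_b) @ dmu)


def affinity_matrix(segments, scale=1.0):
    """Pairwise affinities in (0, 1]; larger means more similar."""
    dist = np.array([[kl2(a, b) for b in segments] for a in segments])
    return np.exp(-dist / (2 * scale ** 2))


# Synthetic example: two similar segments and one different one.
rng = np.random.default_rng(0)
segs = [rng.normal(0, 1, (200, 3)),
        rng.normal(0, 1, (200, 3)),
        rng.normal(4, 1, (200, 3))]
print(np.round(affinity_matrix(segs, scale=2.0), 3))
```

An affinity matrix of this kind can then be passed to a clustering method that works on pairwise similarities.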

Clustering Results

- Clustering of automatic segments gives anonymous classes
  - BIC criterion to choose number of clusters
  - make best correspondence to 16 GT clusters
- Frame-level scoring gives ~70% correct
  - errors when same place has multiple ambiences

3. Speech Detection

- Speech emerges as most interesting content
- Just identifying speech would be useful
  - goal is speaker identification / labeling
- Lots of background noise
  - conventional Voice Activity Detection inadequate
- Insight: listeners detect pitch track (melody)
  - look for voice-like periodicity in noise

[Figure: spectrogram of coffeeshop excerpt, Frequency (0 to 4000) vs. Time (0 to 4.5)]

Voice Periodicity Enhancement

- Noise-robust subband autocorrelation
- Subtract local average
  - suppresses steady background, e.g. machine noise
- 15 min test set; 88% acc (79% w/o enhancement)
- also for enhancing speech (harmonic filtering)
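The idea of subtracting a local average from the autocorrelation can be sketched in a simplified form. This example is not the method from the talk: it works on the full band rather than on subbands, and it stands in for the local average with the per-lag median over all frames of the excerpt. The frame sizes and the 60 to 400 Hz pitch range are likewise assumptions.

```python
import numpy as np


def frame_autocorr(x, frame_len, hop, max_lag):
    """Rows are frames; columns are autocorrelation at lags 0..max_lag."""
    rows = []
    for start in range(0, len(x) - frame_len + 1, hop):
        f = x[start:start + frame_len]
        full = np.correlate(f, f, mode="full")   # lags -(L-1)..(L-1)
        rows.append(full[frame_len - 1:frame_len + max_lag])
    return np.array(rows)


def periodicity_scores(x, sr, frame_len=512, hop=256, fmin=60.0, fmax=400.0):
    """Per-frame periodicity after removing the steady background."""
    lo, hi = int(sr / fmax), int(sr / fmin)      # pitch lag range in samples
    ac = frame_autocorr(x, frame_len, hop, hi)
    # Assumed stand-in for the local average: per-lag median over frames.
    background = np.median(ac, axis=0)
    enhanced = ac - background
    # Strongest remaining peak, relative to the frame's raw energy.
    return enhanced[:, lo:hi + 1].max(axis=1) / (ac[:, 0] + 1e-12)


# Synthetic example: steady 100 Hz hum plus a brief 200 Hz "voiced" burst.
sr = 8000
t = np.arange(sr) / sr
x = 0.5 * np.sin(2 * np.pi * 100 * t)
x[3200:4800] += np.sin(2 * np.pi * 200 * t[3200:4800])
scores = periodicity_scores(x, sr)
print(int(np.argmax(scores)), np.round(scores.max(), 2))
```

Because the hum contributes almost the same autocorrelation to every frame, it largely cancels in the subtraction, while the short periodic burst remains.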

4. Repeating Events

- Recurring sound events can be informative
  - indicate similar circumstance...
  - but: define event (sound organization)
  - define recurring event (how similar?)
  - .. and how to find them (tractable?)
- Idea: use hashing (fingerprints)
  - index points to other occurrences of each hash; intersection of hashes points to match - much quicker search
  - use a fingerprint insensitive to background?

Shazam Fingerprints

- Prominent spectral onsets are landmarks; use relations {f1, f2, t} as hashes
- intrinsically robust to background noise

[Figure: spectrogram, "Phone ring - Shazam fingerprint", frequency 0 to 4000 vs. time 0 to 3]
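A landmark-pair index of this kind can be sketched in a few lines. The example assumes the spectral peaks have already been picked and are given as (time frame, frequency bin) pairs; the pairing limits `max_dt` and `fan_out` and all function names are illustrative, not taken from the talk or from any particular fingerprinting library.

```python
from collections import Counter, defaultdict


def landmark_hashes(peaks, max_dt=3, fan_out=3):
    """Pair each peak with a few later peaks; return [(hash, anchor_time)].

    Each hash is the relation (f1, f2, dt) between two landmarks.
    """
    peaks = sorted(peaks)
    out = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt <= 0:
                continue                      # skip peaks in the same frame
            out.append(((f1, f2, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return out


def build_index(hashes):
    """Map each hash to the list of times at which it occurs."""
    index = defaultdict(list)
    for h, t in hashes:
        index[h].append(t)
    return index


def best_offset(index, query_hashes):
    """Vote on time offsets; return (offset, votes) or None."""
    votes = Counter()
    for h, tq in query_hashes:
        for tr in index.get(h, []):
            votes[tr - tq] += 1
    return votes.most_common(1)[0] if votes else None


# Synthetic example: an event recurs 100 frames into the archive, and the
# query contains one extra background peak.
event = [(0, 10), (1, 25), (2, 17), (3, 30)]
archive = [(t + 100, f) for t, f in event] + [(5, 40), (50, 12)]
query = event + [(1, 60)]
index = build_index(landmark_hashes(archive))
print(best_offset(index, landmark_hashes(query)))
```

A match is declared when many hashes agree on the same time offset, so extra peaks from background sound add stray hashes without removing the consistent ones.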

Exhaustive Search for Repeats

- More selective hashes
  - few hits required to confirm match (faster; better precision)
  - but less robust to background (reduce recall)
- Works well when exact structure repeats
  - recorded music, electronic alerts
  - no good for organic sounds, e.g. garage door

5. Future: Browsing Tools / Diary interface

- Browsing
  - links to other information (diary, email, photos)
  - synchronize with note taking? (Stifelman & Arons)
- Release Tools + how-to for capture

[Figure: screenshot of diary-style browsing interface showing labeled segments across several days; labels not recoverable]

Future: Speech Recognition

- Most audio is too noisy for standard ASR
  - actually reassuring for privacy issues
- But... similar to Meeting Recordings
  - NIST distant microphone conditions
- Speech enhancement - directional filtering
  - 2 channels a big improvement over one...
  - use a more special-purpose directional mic?

Privacy and Security

- Recordings are controversial
  - privacy expectations: speech should be ephemeral?
  - Oops button, delayed review (Roy)
  - subpoenas... (Golubchik)
- Access to recordings is very sensitive
  - .. but preservation is important too
- Approaches
  - don't store intelligible audio
    .. but lessens utility - maybe store ASR output?
  - split and store on multiple machines - tiered, distributed trust/access protocols
- Big issue!

Conclusions

- Personal Audio is easy & cheap to collect
  - but is it any use?
- Segmentation/clustering works well
- Voice detection in noise is harder
  - prospects for speaker identification
- Hashing to find arbitrary repeating events
- Tools distribution as a goal