GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

Similar documents
Neetha Das Prof. Andy Khong

Real Time Speaker Recognition System using MFCC and Vector Quantization Technique

Voice Command Based Computer Application Control Using MFCC

Dia: AutoDirective Audio Capturing Through a Synchronized Smartphone Array

Dietrich Paulus Joachim Hornegger. Pattern Recognition of Images and Speech in C++

Speaker Diarization System Based on GMM and BIC

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi

Accelerometer Gesture Recognition

Detection of Acoustic Events in Meeting-Room Environment

Input speech signal. Selected /Rejected. Pre-processing Feature extraction Matching algorithm. Database. Figure 1: Process flow in ASR

Dynamic Time Warping

The Automatic Musicologist

Activity recognition and energy expenditure estimation

Multifactor Fusion for Audio-Visual Speaker Recognition

Comparative Evaluation of Feature Normalization Techniques for Speaker Verification

Flexible Stand-Alone Keyword Recognition Application Using Dynamic Time Warping

Embedded Audio & Robotic Ear

Separating Speech From Noise Challenge

Implementation of Speech Based Stress Level Monitoring System

A text-independent speaker verification model: A comparative analysis

SCREAM AND GUNSHOT DETECTION IN NOISY ENVIRONMENTS

i436 ons Studio is no mic plugged into false? Answer 1 : This is the false roll-off MicW TN 1402

2014, IJARCSSE All Rights Reserved Page 461

FUSION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORKS WITH TWO FEATURES FOR ACOUSTIC SCENE CLASSIFICATION

Audio-coding standards

Pitch Prediction from Mel-frequency Cepstral Coefficients Using Sparse Spectrum Recovery

THE PERFORMANCE of automatic speech recognition

IMPROVED SPEAKER RECOGNITION USING DCT COEFFICIENTS AS FEATURES. Mitchell McLaren, Yun Lei

Environment Recognition for Digital Audio Forensics Using MPEG-7 and Mel Cepstral Features

Audio-coding standards

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition

ON THE WINDOW-DISJOINT-ORTHOGONALITY OF SPEECH SOURCES IN REVERBERANT HUMANOID SCENARIOS

Frequency Response (min max)

Replacing ECMs with Premium MEMS Microphones

Improving Mobile Phone Based Query Recognition with a Microphone Array

Dimensionality reduction for acoustic vehicle classification with spectral clustering

SpeakUp click. Contents. Applications. SpeakUp Firwmware. Algorithm. SpeakUp and SpeakUp 2 click. From MikroElektonika Documentation

2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto , Japan 2 Graduate School of Information Science, Nara Institute of Science and Technology

Speaker Verification with Adaptive Spectral Subband Centroids

ENVIRONMENT RECOGNITION FOR DIGITAL AUDIO FORENSICS USING MPEG 7 AND MEL CEPSTRAL FEATURES

An Environmental Audio Based Context Recognition System Using Smartphones

Introducing Audio Signal Processing & Audio Coding. Dr Michael Mason Senior Manager, CE Technology Dolby Australia Pty Ltd

Bo#leneck Features from SNR- Adap9ve Denoising Deep Classifier for Speaker Iden9fica9on

Introducing Audio Signal Processing & Audio Coding. Dr Michael Mason Snr Staff Eng., Team Lead (Applied Research) Dolby Australia Pty Ltd

Mpeg 1 layer 3 (mp3) general overview

Secure E- Commerce Transaction using Noisy Password with Voiceprint and OTP

Aditi Upadhyay Research Scholar, Department of Electronics & Communication Engineering Jaipur National University, Jaipur, Rajasthan, India

MACHINE LEARNING: CLUSTERING, AND CLASSIFICATION. Steve Tjoa June 25, 2014

MP3 Speech and Speaker Recognition with Nearest Neighbor. ECE417 Multimedia Signal Processing Fall 2017

DETECTING INDOOR SOUND EVENTS

Speech Recognition on DSP: Algorithm Optimization and Performance Analysis

Multimedia Database Systems. Retrieval by Content

RECOGNITION OF EMOTION FROM MARATHI SPEECH USING MFCC AND DWT ALGORITHMS

Introduction to Massive Data Interpretation

Lecture 16 Perceptual Audio Coding

[N569] Wavelet speech enhancement based on voiced/unvoiced decision

IVM 4. Audio Presets Details SST 4. AKG Acoustics 2008

Sound Lab by Gold Line. Noise Level Analysis

Authentication of Fingerprint Recognition Using Natural Language Processing

Measurement of 3D Room Impulse Responses with a Spherical Microphone Array

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION

ACOUSTIC BASED SKETCH RECOGNITION. A Thesis WENZHE LI

Testing Approaches for Characterization and Selection of MEMS Inertial Sensors 2016, 2016, ACUTRONIC 1

VSEW_mk2-8g. Data Sheet. Dec Bruno Paillard

MATLAB Apps for Teaching Digital Speech Processing

Dr Andrew Abel University of Stirling, Scotland

Auto-Digitizer for Fast Graph-to-Data Conversion

Programming-By-Example Gesture Recognition Kevin Gabayan, Steven Lansel December 15, 2006

The 5.8mm diameter ROM-2235L-HD-R is designed for extreme fidelity in a compact housing with true 20 Hz to 20 khz performance.

Further Studies of a FFT-Based Auditory Spectrum with Application in Audio Classification

Acoustic to Articulatory Mapping using Memory Based Regression and Trajectory Smoothing

Multimedia Systems Speech I Mahdi Amiri February 2011 Sharif University of Technology

Separation Of Speech From Noise Challenge

Human Activity Recognition via Cellphone Sensor Data

Environment Independent Speech Recognition System using MFCC (Mel-frequency cepstral coefficient)

SOUND EVENT DETECTION AND CONTEXT RECOGNITION 1 INTRODUCTION. Toni Heittola 1, Annamaria Mesaros 1, Tuomas Virtanen 1, Antti Eronen 2

A Gaussian Mixture Model Spectral Representation for Speech Recognition

VentriLock: Exploring voice-based authentication systems

Complex Identification Decision Based on Several Independent Speaker Recognition Methods. Ilya Oparin Speech Technology Center

Detection of goal event in soccer videos

SAS: A speaker verification spoofing database containing diverse attacks

Audio-visual interaction in sparse representation features for noise robust audio-visual speech recognition

Archetypal Analysis Based Sparse Convex Sequence Kernel for Bird Activity Detection

NON-UNIFORM SPEAKER NORMALIZATION USING FREQUENCY-DEPENDENT SCALING FUNCTION

Analyzing Vocal Patterns to Determine Emotion Maisy Wieman, Andy Sun

BINAURAL SOUND LOCALIZATION FOR UNTRAINED DIRECTIONS BASED ON A GAUSSIAN MIXTURE MODEL

Speech-Coding Techniques. Chapter 3

Multi-Modal Audio, Video, and Physiological Sensor Learning for Continuous Emotion Prediction

Sound Source Separation Algorithm Using Phase Difference and Angle Distribution Modeling Near the Target

Variable-Component Deep Neural Network for Robust Speech Recognition

Least Squares Signal Declipping for Robust Speech Recognition

Chapter 3. Speech segmentation. 3.1 Preprocessing

AUTOMATIC speech recognition (ASR) systems suffer

CHROMA AND MFCC BASED PATTERN RECOGNITION IN AUDIO FILES UTILIZING HIDDEN MARKOV MODELS AND DYNAMIC PROGRAMMING. Alexander Wankhammer Peter Sciri

SPEECH is the natural and easiest way to communicate

Voiced-Unvoiced-Silence Classification via Hierarchical Dual Geometry Analysis

Applications of Keyword-Constraining in Speaker Recognition. Howard Lei. July 2, Introduction 3

Xing Fan, Carlos Busso and John H.L. Hansen

Vulnerability of Voice Verification System with STC anti-spoofing detector to different methods of spoofing attacks

Chapter 14 MPEG Audio Compression

Transcription:

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) (1) Stanford University (2) National Research and Simulation Center, Rafael Ltd. 0

MICROPHONE ACCESS REQUIRES PERMISSIONS

GYROSCOPE ACCESS DOES NOT REQUIRE PERMISSIONS

MEMS GYROSCOPES Major vendors: STM Microelectronics (Samsung Galaxy) InvenSense (Google Nexus)

GYROSCOPES ARE SUSCEPTIBLE TO SOUND 70 HZ TONE POWER SPECTRAL DENSITY 50 HZ TONE POWER SPECTRAL DENSITY

GYROSCOPES ARE (LOUSY, BUT STILL) MICROPHONES Hardware sampling frequency: InvenSense: up to 8000 Hz STM Microelectronics: 800 Hz Software sampling frequency: Android: 200 Hz ios: 100 Hz

GYROSCOPES ARE (LOUSY, BUT STILL) MICROPHONES Very low SNR (Signal-to-Noise Ratio) Acoustic sensitivity threshold: ~70 db Comparable to a loud conversation. Sensitive to sound angle of arrival Directional microphone (due to 3 axes)

BROWSERS ALLOW GYROSCOPE ACCESS TOO

BROWSERS ALLOW GYROSCOPE ACCESS TOO

BROWSERS ALLOW GYROSCOPE ACCESS TOO

BROWSERS ALLOW GYROSCOPE ACCESS TOO

PROBLEM: HOW DO WE LOOK INTO HIGHER FREQUENCIES? SPEECH RANGE Adult male 85-180 Hz Adult female 165-255 Hz

ALIASING

WE CAN SENSE HIGH FREQUENCY SIGNALS DUE TO ALIASING THE RESULT OF RECORDING TONES BETWEEN 120 AND 160 HZ ON A NEXUS 7 DEVICE

EXPERIMENTAL SETUP Room. Simple speakers. Smartphone. Subset of TIDigits speech recognition corpus 10 speakers 11 samples 2 pronunciations = 220 total samples

SPEECH ANALYSIS USING A SINGLE GYROSCOPE Gender identification Speaker identification Isolated word recognition

Speech recognition engine developed at CMU Tested for isolated word recognition 14% success rate (random guess is 9%)

PREPROCESSING All samples are converted to audio files in WAV format Upsampled to 8 KHz Silence removal (based on voiced/unvoiced segment classification)

FEATURES MFCC - Mel-Frequency Cepstral Coefficients Statistical features are used (mean and variance) delta-mfcc Spectral centroid RMS energy STFT - Short-Time Fourier Transform

CLASSIFIERS SVM (and Multi-class SVM) GMM (Gaussian Mixture Model) DTW (Dynamic Time Warping)

DYNAMIC TIME WARPING EUCLIDEAN DISTANCE DTW DISTANCE

GENDER IDENTIFICATION Binary SVM with spectral features DTW with STFT features Window size: 512 samples - corresponds to 64 ms under 8 KHz sampling rate

WE CAN SUCCESSFULLY IDENTIFY GENDER NEXUS 4 84% (DTW) GALAXY S III 82% (SVM) Random guess probability is 50%

SPEAKER IDENTIFICATION Multi-class SVM and GMM with spectral features DTW with STFT features (same as before)

A GOOD CHANCE TO IDENTIFY THE SPEAKER Nexus 4 Mixed Female/Male 50% (DTW) Female speakers 45% (DTW) Male speakers 65% (DTW) Random guess probability is 20% for one gender and 10% for a mixed set

ISOLATED WORDS RECOGNITION SPEAKER INDEPENDENT Nexus 4 Mixed Female/Male 17% (DTW) Female speakers 26% (DTW) Male speakers 23% (DTW) Confusion matrix corresponds to the mixed set results using DTW Random guess probability is 9%

ISOLATED WORDS RECOGNITION SPEAKER DEPENDENT Confusion matrix corresponds to the DTW results Random guess probability is 9%

HOW CAN WE LEVERAGE EAVESDROPPING SIMULTANEOUSLY ON TWO DEVICES?

SIMILAR TO TIME-INTERLEAVED ADC's

SIMILAR TO TIME-INTERLEAVED ADC's DC component removal

SIMILAR TO TIME-INTERLEAVED ADC's Normalization / use a reference signal

SIMILAR TO TIME-INTERLEAVED ADC's Background or foreground calibration

NON-UNIFORM RECONSTRUCTION REQUIRES KNOWING PRECISE TIME-SKEWS Filterbank interpolation based on Eldar and Oppenheim's paper

PRACTICAL COMPROMISE Interleaving samples from multiple devices

EVALUATION (Tested for the case of speaker dependent word recognition) Single device Two devices Exhibits improvement over using a single device Using even more devices might yield even better results Not a proper non-uniform reconstruction

FURTHER ATTACKS

SOURCE SEPARATION Use the 3 axes of the gyro Learn the number of sound sources around Use angle of arrival information for source separation

AMBIENT SOUND RECOGNITION IS THE USER IN A ROOM/OUTDOORS/ON A STREET?

DEFENSES

SOFTWARE DEFENSES Low-pass filter the raw samples 0-20 Hz range should be enough for browser based applications (according to WebKit) Access to high sampling rate should require a special permission

HARDWARE DEFENSES Hardware filtering of sensor signals (Not subject to configuration) Acoustic masking (won't help against vibration of the surface)

CONCLUSION Giving applications direct access to hardware is dangerous. Especially given the high sampling rate.

THANK YOU VERY MUCH QUESTIONS? CRYPTO.STANFORD.EDU/GYROPHONE

IT IS POSSIBLE TO SAMPLE THROUGH JAVASCRIPT

FAQ Did you experiment with an anechoic chamber? Yes, and did not find it beneficial at this stage.

FAQ Perhaps the gyro actually measures the vibrations of the surface? Maybe, but tests suggest it's not only that. In any case it is still dangerous.

FAQ Is it possible to use measurements from multiple devices in other ways? Yes. For example as in MIMO: EGC (Equal Gain Combining).