Extraction and Representation of Features, Spring Lecture 4: Speech and Audio: Basics and Resources. Zheng-Hua Tan

Similar documents
2.4 Audio Compression

Audio Fundamentals, Compression Techniques & Standards. Hamid R. Rabiee Mostafa Salehi, Fatemeh Dabiran, Hoda Ayatollahi Spring 2011

Speech-Coding Techniques. Chapter 3

Digital Media. Daniel Fuller ITEC 2110

AUDIO. Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015

Principles of Audio Coding

Mahdi Amiri. February Sharif University of Technology

3 Sound / Audio. CS 5513 Multimedia Systems Spring 2009 LECTURE. Imran Ihsan Principal Design Consultant

CT516 Advanced Digital Communications Lecture 7: Speech Encoder

Digital Speech Coding

Multimedia Systems Speech II Hmid R. Rabiee Mahdi Amiri February 2015 Sharif University of Technology

Multimedia Systems Speech II Mahdi Amiri February 2012 Sharif University of Technology

Data Compression. Audio compression

Transporting audio-video. over the Internet

Application of wavelet filtering to image compression

Audio Coding and MP3

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Multimedia Systems Speech I Mahdi Amiri February 2011 Sharif University of Technology

Source Coding Basics and Speech Coding. Yao Wang Polytechnic University, Brooklyn, NY11201

CHAPTER 10: SOUND AND VIDEO EDITING

Chapter 5.5 Audio Programming

GSM Network and Services

08 Sound. Multimedia Systems. Nature of Sound, Store Audio, Sound Editing, MIDI

MATLAB Apps for Teaching Digital Speech Processing

Speech and audio coding

ITNP80: Multimedia! Sound-II!

Skill Area 214: Use a Multimedia Software. Software Application (SWA)

Fundamental of Digital Media Design. Introduction to Audio

Topics in Linguistic Theory: Laboratory Phonology Spring 2007

Multimedia Systems Speech I Mahdi Amiri September 2015 Sharif University of Technology

Digital Audio. Amplitude Analogue signal

ON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES

CS 074 The Digital World. Digital Audio

Squeeze Play: The State of Ady0 Cmprshn. Scott Selfon Senior Development Lead Xbox Advanced Technology Group Microsoft

Perceptual Pre-weighting and Post-inverse weighting for Speech Coding

KINGS COLLEGE OF ENGINEERING DEPARTMENT OF INFORMATION TECHNOLOGY ACADEMIC YEAR / ODD SEMESTER QUESTION BANK

COS 116 The Computational Universe Laboratory 4: Digital Sound and Music

Compressed Audio Demystified by Hendrik Gideonse and Connor Smith. All Rights Reserved.

Lecture #3: Digital Music and Sound

Digital Audio Basics

The Effect of Bit-Errors on Compressed Speech, Music and Images

COS 116 The Computational Universe Laboratory 4: Digital Sound and Music

MULTIMODE TREE CODING OF SPEECH WITH PERCEPTUAL PRE-WEIGHTING AND POST-WEIGHTING

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.

5: Music Compression. Music Coding. Mark Handley

Introducing Audio Signal Processing & Audio Coding. Dr Michael Mason Snr Staff Eng., Team Lead (Applied Research) Dolby Australia Pty Ltd

REVIEW: PROCEDURES FOR PRODUCING AUDIO

Synopsis of Basic VoIP Concepts

Perceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects.

EE482: Digital Signal Processing Applications

Sampling & Sequencing. Combining MIDI and audio

DSP. Presented to the IEEE Central Texas Consultants Network by Sergio Liberman

Proceedings of Meetings on Acoustics

For Mac and iphone. James McCartney Core Audio Engineer. Eric Allamanche Core Audio Engineer

Introducing Audio Signal Processing & Audio Coding. Dr Michael Mason Senior Manager, CE Technology Dolby Australia Pty Ltd

CSC 101: Lab #7 Digital Audio Due Date: 5:00pm, day after lab session

RESIDUAL-EXCITED LINEAR PREDICTIVE (RELP) VOCODER SYSTEM WITH TMS320C6711 DSK AND VOWEL CHARACTERIZATION

Digital Recording and Playback

AN EFFICIENT TRANSCODING SCHEME FOR G.729 AND G SPEECH CODECS: INTEROPERABILITY OVER THE INTERNET. Received July 2010; revised October 2011

AudioLiquid Converter User Guide

1 Introduction. 2 Speech Compression

Sound 5 11/21/2016. Use MIDI and understand its attributes, especially relative to digitized audio.

Mpeg 1 layer 3 (mp3) general overview

Using Sweep: Fun with Scrubby

MPEG-4 Version 2 Audio Workshop: HILN - Parametric Audio Coding

Improved Tamil Text to Speech Synthesis

Effect of MPEG Audio Compression on HMM-based Speech Synthesis

Chapter 14 MPEG Audio Compression

SAOC and USAC. Spatial Audio Object Coding / Unified Speech and Audio Coding. Lecture Audio Coding WS 2013/14. Dr.-Ing.

Audio Compression. Audio Compression. Absolute Threshold. CD quality audio:

CHAPTER 2 - DIGITAL DATA REPRESENTATION AND NUMBERING SYSTEMS

A NEURAL NETWORK APPLICATION FOR A COMPUTER ACCESS SECURITY SYSTEM: KEYSTROKE DYNAMICS VERSUS VOICE PATTERNS

AET 1380 Digital Audio Formats

CS 4455: Video Game Design & Implementation

Lossy compression. CSCI 470: Web Science Keith Vertanen

Audio and video compression

计算原理导论. Introduction to Computing Principles 智能与计算学部刘志磊

DESIGN AND DEVELOPMENT OF A MULTIRATE FILTERS IN SOFTWARE DEFINED RADIO ENVIRONMENT

Digital Music. You can download this file from Dig Music May

DIY Audio Rosanne Hutton, MindCross Training

Implementation of G.729E Speech Coding Algorithm based on TMS320VC5416 YANG Xiaojin 1, a, PAN Jinjin 2,b

UNDERSTANDING MUSIC & VIDEO FORMATS

CHAPTER 6 Audio compression in practice

Quick Guide to Getting Started with:

Massachusetts Institute of Technology Department of Electrical Engineering & Computer Science Automatic Speech Recognition Spring, 2003

Adobe Sound Booth Tutorial

McFunSoft Audio Studio

A Digital Audio Primer

Perspectives on Multimedia Quality Prediction Methodologies for Advanced Mobile and IP-based Telephony

VoIP Forgery Detection

Speech-Music Discrimination from MPEG-1 Bitstream

Elementary Computing CSC 100. M. Cheng, Computer Science

Assignment 1: Speech Production and Models EN2300 Speech Signal Processing

Audio-coding standards

Binary representation and data

Voice Quality Assessment for Mobile to SIP Call over Live 3G Network

Audio for Everybody. OCPUG/PATACS 21 January Tom Gutnick. Copyright by Tom Gutnick. All rights reserved.

Audacity: How- To. Import audio (a song or SFX) Before we start. Import song into Audacity

ETSI TS V2.1.3 ( )

Speech Synthesis. Simon King University of Edinburgh

Transcription:

Extraction and Representation of Features, Spring 2011 Lecture 4: Speech and Audio: Basics and Resources Zheng-Hua Tan Multimedia Information and Signal Processing Department of Electronic Systems Aalborg University, Denmark zt@es.aau.dk Extraction of Features, IV, Zheng-Hua Tan 1 Outline Introduction Speech basics Sound basics Resources Extraction of Features, IV, Zheng-Hua Tan 2 1

Computer as dream of human being HAL talks, listens, reads lips and solves problems Nature and effortless for huamn Hard for computer Dream of AI scientists and human True in 2001: A Space Odyssey (After 2001: A Space Odyssey, 1968 ) Extraction of Features, IV, Zheng-Hua Tan 3 Computer as a reality: state-of-the-art Demos Microsoft Nuance Text to speech (TTS) Festival TTS @ CSTR Edinburg University Next generation TTS @ AT&T Extraction of Features, IV, Zheng-Hua Tan 4 2

Information in Speech Speech coding data rates Rate (bits/sec) 200k 100k 64k 32k 16k 12k 9k 4.8k 2k 1k 500 100 60 ADPCM, DPCM, PCM LPC, CELP, MELP, Vocoders Waveform coding Parametric (source) coding Human can understand text: 10 char/sec x 6 bits/ascii char = 60 bits/sec Is content in speech more than 60 bits/sec? Extraction of Features, IV, Zheng-Hua Tan 5 Information in Speech cont. Examples That's one small step for man; one giant leap for mankind. -- Neil Armstrong, Apollo 11 Moon Landing Speech "I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today!" -- Martin Luther King, Jr., I Have a Dream Speech contains speaker identity, emotion, meaning, text. speech techniques Extraction of Features, IV, Zheng-Hua Tan 6 3

Speech is a complex process Physiology Linguistics Speech Acoustics Extraction of Features, IV, Zheng-Hua Tan 7 Human speech communication process Rabiner and Levinson, IEEE Tans. Communications, 1981 (After Rabiner & Levinson, 1981) Speech synthesis Speech understanding Speech coding Speech recognition Extraction of Features, IV, Zheng-Hua Tan 8 4

Literature Textbook: J Deller, J Hansen and J Proakis, Discrete-Time Processing of Speech Signals, IEEE Press, 2000. References: Huang, Acero and Hon, Spoken Language Processing, Prentice-Hall, 2001. D. O Shaughnessy, Speech Communications, IEEE Press, 2000 Extraction of Features, IV, Zheng-Hua Tan 9 Schematic diagram of speech production Vocal folds The frequency of vocal cord vibration determines the pitch of the voice (for a male, 50-200Hz; for a female, up to 500Hz). Extraction of Features, IV, Zheng-Hua Tan 10 5

Model of speech production Digital model of speech production Time-varying parameters: fundamental frequency (pitch), voiced/unvoiced/silence, gain, formants, vocal tract area functions, Extraction of Features, IV, Zheng-Hua Tan 11 Vocal tract modelling Source-filter model Source Filter Vocal tract Output Type of excitation (source) Voiced: produced by forcing air through the glottis. Vowels are voiced. Unvoiced: generated by forming a constriction at some point along the vocal tract and forcing air through the constriction. Extraction of Features, IV, Zheng-Hua Tan 12 6

Role of the vocal tract Vowels: produced by exciting a fixed vocal tract with quasi-periodic pulsed of air caused by vibration of the vocal cords Consonants: a significant restriction and thus weaker in amplitude and noisy-like Formants: resonances determined by the shape of vocal tract, which form the overall spectrum and the properties of the filter Extraction of Features, IV, Zheng-Hua Tan 13 The speech signal Speech is a sequence of highly changing sounds When producing sounds, the vocal cords and the various articulators slowly change over time Extraction of Features, IV, Zheng-Hua Tan 14 7

Vowel production: examples (After Joseph Picone ) Fixed vocal tract shape Voiced Cross-sectional area F i Extraction of Features, IV, Zheng-Hua Tan 15 The vowel space By the locations of the first and second formant frequencies: (After Peterson & Barney, 1952) F1 F2 F3 Extraction of Features, IV, Zheng-Hua Tan 16 8

Consonant production: examples (After Joseph Picone ) Extraction of Features, IV, Zheng-Hua Tan 17 Speech sounds and waveforms sixteen /s/ /i/ /k/ /s/ /t/ /ee/ /n/ six periodicity, intensity, duration, boundary, etc Extraction of Features, IV, Zheng-Hua Tan 18 9

Observing pitch from waveforms Extraction of Features, IV, Zheng-Hua Tan 19 Spectrogram Spectrogram two-dimensional waveform (amplitude/time) is converted into a three-dimensional pattern (amplitude/frequency/time) Wideband spectrogram: analyzed on 15ms sections of waveform with a step of 1ms voiced regions with vertical striations due to the periodicity of the time waveform (each vertical line represents a pulse of vocal folds) while unvoiced regions are solid/random, or snowy Narrowband spectrogram: on 50ms pitch for voiced intervals in horizontal lines Extraction of Features, V, Zheng-Hua Tan 20 10

Wide- and narrow-band spectrograms waveform F3 F2 F1 Wideband spectrogram narrowband spectrogram Extraction of Features, V, Zheng-Hua Tan 21 Outline Introduction Speech basics Sound basics Resources Extraction of Features, IV, Zheng-Hua Tan 22 11

Sound basics Audio (sound) wave one-dimensional acoustic pressure wave causes vibration in the eardrum or in a microphone Frequency range of human ear 20 20.000 Hz (20 KHz) perception nearly logarithmic, relation of amplitudes A and B is expressed as db = 20 log 10 (A/B) Extraction of Features, IV, Zheng-Hua Tan 23 Sounds (From Sound Jay) Extraction of Features, IV, Zheng-Hua Tan 24 12

Heartbeat Extraction of Features, IV, Zheng-Hua Tan 25 Footstep Extraction of Features, IV, Zheng-Hua Tan 26 13

Piano note Extraction of Features, IV, Zheng-Hua Tan 27 Piano Extraction of Features, IV, Zheng-Hua Tan 28 14

Scratch sound B. Lemoine, J. Nicolaou, A. Osmar, A. Palsky, C. Petitimbert, 2011. Extraction of Features, IV, Zheng-Hua Tan 29 Analog-to-digital conversion Converting an analog signal into a digital signal has 2 sub-processes: 1. sampling - conversion of a continuousspace/time (audio, video) signal into a discretespace/time (audio, video) signal. 2. quantization - converting a continuous-valued (audio, video) signal that has a continuous range (set of values that it can take) of intensities and/or colors into a discrete-valued (audio, video) signal that has a discrete range of intensities and/or colors. Extraction of Features, IV, Zheng-Hua Tan 30 15

Analog-to-digital converter Sampling What the ADC circuit does is to take samples from the analog signal from time to time. Each sample will be converted into a number, based on its voltage level. Extraction of Features, IV, Zheng-Hua Tan 31 Representation of digital audio Temporal resolution, i.e. sampling frequency Audible frequency range: ~ 20Hz-20kHz Nyquist-Shannon theorem: 2*20kHz = 40kHz Actual frequency due to production: 44.1 khz Bit depth, i.e. amplitude quantization 16-bit linear PCM (Pulse-code modulation) Digital audio stored in computers: Windows WAV, Apple AIF, Sun AU, Blu-ray (incl. 20-, 24-bit as well) Compact Disc Digital Audio 16 bit/sample per channel ~ 2^16 = 65,536 Extraction of Features, IV, Zheng-Hua Tan 32 16

Representation of digital audio Bit-rate stereo 16*2*44100 ~ 1.4 Mbit/s A CD can store up to 74 minutes of music Total amount of data = 44,100 samples/(channel*second) * 2 bytes/sample * 2 channels * 60 seconds/minute * 74 minutes = 783,216,000 bytes There are CD-Rs that can hold 700 megabytes (734,003,200 bytes) of error corrected data, or 80 minutes of stereo 16 bit 44.1 khz audio (846,739,200 bytes) (807.51 megabytes) without error correction code. Extraction of Features, IV, Zheng-Hua Tan 33 Coding standards Sampling Rate (KHz) Quantization level (bits) Bit Rate (Kbps) Telephone 8 8 64 AM Radio 16 16 256 FM Radio 22.05 16 352.8 CD Stereo 44.1 16 1411.2 DAT 48 16 1536 DVD (Stereo) 192 24 9216 What about mobile communication and VoIP? Extraction of Features, IV, Zheng-Hua Tan 34 17

Coding standards The International Telecommunications Union (ITU) Standard Method Bit rete (kb/s) MOS Complexity (MIPS) Release Time ITU G.711 Mu/A-law PCM 64 4.3 0.01 1972 ITU G.729 CS-ACELP 8 4.0 20 1996 The European Telecommunications Standards Institutes (ETSI) Standard Method Bit rete (kb/s) MOS Complexity (MIPS) Release Time GSM FR RPE-LTP 13 1987 GSM AMR ACELP 4.75-12.2 1998 Extraction of Features, IV, Zheng-Hua Tan 35 Outline Introduction Speech basics Sound basics Resources Extraction of Features, IV, Zheng-Hua Tan 36 18

Sound and speech databases http://www.soundjay.com/ http://soundjax.com http://www.ldc.upenn.edu/ Extraction of Features, IV, Zheng-Hua Tan 37 Speech Tool Audacity http://audacity.sourceforge.net Audacity is a free audio editor and recorder for many operating systems. Key features: Record live audio. Convert tapes and records into digital recordings or CDs. Edit Ogg Vorbis, MP3, WAV or AIFF sound files. Cut, copy, splice or mix sounds together. Change the speed or pitch of a recording. Analyze audio signal Extraction of Features, IV, Zheng-Hua Tan 38 19

Speech Tool speech filing system Speech Filing System- Tools for Speech Research It performs standard operations such as recording, replay, waveform editing and labelling, spectrographic and formant analysis and fundamental frequency estimation. http://www.phon.ucl.ac.uk/resource/sfs/ Demo Extraction of Features, IV, Zheng-Hua Tan 39 Speech tool voicebox VOICEBOX: Speech Processing Toolbox for MATLAB http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/ voicebox.html Demo Extraction of Features, IV, Zheng-Hua Tan 40 20

Summary Introduction Speech basics Sound basics Resources Extraction of Features, IV, Zheng-Hua Tan 41 21