
LIVE MUSIC PERFORMANCES OVER HIGH-SPEED IP NETWORKS

Stefan Karapetkov
Polycom, Inc.
e-mail: Stefan.Karapetkov@polycom.com

ABSTRACT

High-speed IP networks are creating opportunities for new kinds of real-time applications that connect artists and audiences across the world. A new generation of audio-visual technology is required to deliver the exceptionally high quality needed to enjoy performances over IP networks. There are three major challenges around the transmission of high-quality live music performances over IP networks: true acoustic representation, efficient compression/decompression that preserves the performance quality, and recovery from IP packet loss in the network. This paper analyzes each of these three elements and provides an overview of the mechanisms developed by Polycom to deliver exceptional audio and video quality, and of the special modifications for the Live Music Mode in Polycom equipment. The paper also discusses how the Manhattan School of Music uses this technology to transmit its live music performances over the high-speed Internet2 network, which connects schools and universities in the United States.

KEYWORDS

Internet2; Live Music Mode; audio-visual technology; Lost Packet Recovery; Polycom

INTRODUCTION

When the Manhattan School of Music (MSM) approached Polycom with a request for equipment that could deliver its live music performances to remote schools across the United States, we suggested using our high-end audio and video codecs (endpoints). These systems deliver the highest level of voice and video quality for applications such as video conferencing, telepresence, and vertical applications for the education, healthcare, and government markets. Very soon we realized that the technologies that guarantee true music reproduction are often in conflict with the technologies developed to improve the user experience in a standard bi-directional voice call. The key technologies are Automatic Gain Control (AGC), Automatic Noise Suppression (ANS), Noise Fill, and Acoustic Echo Cancellation (AEC). This paper explains how each of these mechanisms works for voice transmission, and focuses on the modifications that have to be made in standard audio-visual communications equipment for the transmission of live music performances (Live Music Mode). While secondary to audio, the video component is important for the overall audience experience. The paper therefore also briefly discusses the key video technologies relevant to music performance transmission.

AUDIO AND VIDEO TRANSMISSION

Due to bandwidth limitations, sending uncompressed audio and video is not an option in most IP networks today. Audio and video streams have to be encoded (compressed) at the sender side to reduce the network bandwidth used. At the receiver side, the real-time audio-video stream has to be decoded and played at full audio and video quality. Figure 1 summarizes the encoding/decoding concept. The critical element here is the equipment's ability to preserve high audio quality of 50Hz-22kHz and high video quality of at least HD 720p.

Figure 1: Audio and video transmission basics
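To put the bandwidth constraint in numbers, here is a back-of-the-envelope sketch of the raw bit rates behind the quality targets above. The sample formats - 16-bit PCM audio and 8-bit 4:2:0 video (about 12 bits per pixel) - are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope bandwidth for uncompressed media. Assumptions:
# 16-bit PCM audio and 8-bit 4:2:0 video (the paper does not specify
# sample formats).

def pcm_audio_bps(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw PCM audio bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def raw_video_bps(width: int, height: int, fps: int, bits_per_pixel: int) -> int:
    """Raw video bit rate; 4:2:0 sampling averages 12 bits per pixel."""
    return width * height * fps * bits_per_pixel

# Stereo audio sampled at 48kHz (enough to carry 50Hz-22kHz):
audio = pcm_audio_bps(48_000, 16, 2)        # 1,536,000 bps ~ 1.5 Mbps

# HD 720p video at 30 frames per second:
video = raw_video_bps(1280, 720, 30, 12)    # ~331.8 Mbps

print(f"uncompressed stereo audio: {audio / 1e6:.2f} Mbps")
print(f"uncompressed 720p video:   {video / 1e6:.1f} Mbps")
```

Even under these conservative assumptions, raw HD video alone is two orders of magnitude above typical access-network capacity, which is why encoding is unavoidable.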

IP networks inherently lose packets when there are bottlenecks and congestion, and mechanisms for recovery from packet loss in the IP network are required to deliver a high-quality performance to the receiver. Compressed audio and video are packetized into so-called Real Time Protocol (RTP) packets, and then transmitted over the IP network. The Lost Packet Recovery (LPR) mechanism developed by Polycom to overcome this issue is discussed later in this paper.

ADVANCED AUDIO CODING

The standard for voice transmission quality was set about 120 years ago with the invention of the telephone. Based on the technical capabilities at the time, it was decided that transmitting the acoustic frequencies from 300Hz to 3300Hz is sufficient for a regular conversation. Even today, basic narrow-band voice encoders, such as ITU-T G.711, work in this frequency range, and are therefore referred to as 3.3kHz voice codecs. Another important characteristic of a voice codec is the bit rate. For example, G.711 has a bit rate of 64kbps, i.e. transmitting voice in G.711 format requires network bandwidth of 64kbps (plus network protocol overhead; see the sketch at the end of this section). Other narrow-band codecs are G.729A (3.3kHz, 8kbps), G.728 (50Hz-3.3kHz, 16kbps), and AMR-NB (3.3kHz, 4.75-12.2kbps).

Advanced audio coding went far beyond basic voice quality with the development of wide-band codecs supporting 7kHz, 14kHz, and most recently 22kHz audio. ITU-T G.722 was the first practical wideband codec, providing 7kHz audio bandwidth at 48 to 64kbps. Polycom entered the forefront of audio codec development with Siren 7 (7kHz audio at 16 to 32kbps), standardized by ITU-T as G.722.1 (7kHz, 16/24/32kbps). Figure 2 provides an overview of the most popular codecs.

Figure 2: Advanced audio encoding

Polycom later developed two super wide-band codecs. The Siren 14 codec (14kHz, 48/64/96kbps) supports stereo and mono; the mono version then became ITU-T G.722.1C (14kHz, 24/32/48kbps). Polycom also developed Siren 22 (22kHz, 64/96/128kbps, stereo and mono) and is collaborating in the development of the G.719 (Full Band) codec under ITU.
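As a rough illustration of the "plus network protocol overhead" remark above, the following sketch estimates on-the-wire bandwidth for G.711. It assumes the common 20ms packetization interval and RTP/UDP/IPv4 headers (12 + 8 + 20 = 40 bytes per packet); the paper itself does not state these parameters.

```python
# Rough on-the-wire bandwidth for a voice codec, including RTP/UDP/IPv4
# headers (40 bytes per packet). A 20ms packetization interval is a
# common choice; the paper only notes that overhead exists.

def wire_bps(codec_bps: int, packet_interval_s: float = 0.020,
             header_bytes: int = 40) -> float:
    packets_per_s = 1.0 / packet_interval_s
    payload_bits_per_packet = codec_bps * packet_interval_s
    return packets_per_s * (payload_bits_per_packet + header_bytes * 8)

# G.711 at 64kbps: 50 packets/s * (1280 + 320) bits = 80kbps on the wire.
print(f"G.711 on the wire: {wire_bps(64_000) / 1000:.0f} kbps")
```

The same calculation applies to the other codecs listed: the lower the codec bit rate and the shorter the packetization interval, the larger the relative share of header overhead.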

SIREN 22 CODEC HIGHLIGHTS

Siren 22 Stereo is the highest audio quality available on the market. To illustrate the quality, we have compared it to another stereo codec - the popular MP3 (MPEG-1/2 Layer-3) audio codec used in portable music players. Siren 22 Stereo covers the acoustic frequency range up to 22kHz while MP3 stops at 18kHz. While the human ear usually does not hear above 18kHz, the higher upper limit of Siren 22 delivers additional audio dynamics that are especially important for music.

Siren 22 was developed for real-time applications and is therefore optimized for low latency (40ms). MP3 has an algorithmic delay of 2591 samples. For a sample rate of 48kHz this corresponds to about 54ms latency; at 32kHz, it is even up to about 81ms.

A major advantage of Siren 22 is its low complexity, i.e. it can run on inexpensive hardware (CPU or DSP chips). Complexity is measured in Million Instructions Per Second (MIPS) - Siren 22 requires 15 MIPS. In comparison, MP3 has much higher complexity, and requires 100 MIPS. The final advantage of Siren 22 is its relatively low bit rate - a maximum of 128kbps. MP3 bit rates are usually higher than 128kbps.
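The latency figures above follow directly from the algorithmic delay: the delay in milliseconds equals the delay in samples divided by the sample rate. A minimal check:

```python
# Algorithmic delay expressed as latency: delay_ms = samples / rate * 1000.
# Reproduces the MP3 figures quoted above (2591 samples of delay).

def algorithmic_latency_ms(delay_samples: int, sample_rate_hz: int) -> float:
    return delay_samples / sample_rate_hz * 1000.0

print(f"{algorithmic_latency_ms(2591, 48_000):.0f} ms")  # ~54 ms at 48kHz
print(f"{algorithmic_latency_ms(2591, 32_000):.0f} ms")  # ~81 ms at 32kHz
```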

AUTOMATIC GAIN CONTROL (AGC)

AGC is a technology that compensates for audio or video input level changes by boosting or lowering incoming signals to match a preset level. There is a variety of options for implementing audio AGC in audio-visual equipment. Polycom systems have multiple AGC instances - for room microphones, for the VCR/DVD input, etc. - and they are all turned on by default. We focus here on the AGC for the connected microphones. The function is activated by speech and music. It ignores white noise; e.g. if a computer or projector fan is working close to the microphone, AGC will not ramp up the gain. The Polycom AGC is very powerful, and automatically levels an input signal in the range of ±6dB from nominal while covering distances of up to 3.6 meters (depending on the room characteristics and on the person talking). Nominal is the volume of an average person speaking with a normal voice about 60cm away from the microphone. As a rule of thumb, 3dB is twice the nominal signal strength (10^(3/10) = 10^0.3 = 1.995) and 6dB is four times the nominal signal strength (10^(6/10) = 10^0.6 = 3.981). Figure 3 is a graphical representation of the AGC concept.

Figure 3: Automatic Gain Control

In the nominal case, AGC does not modify the gain, i.e. the input volume is equal to the output volume. When the speaker moves further away, the microphone picks up a lower volume, and the AGC in the audio-visual equipment has to amplify the volume to keep the output constant. As we found in our research, however, AGC completely destroys the natural dynamic range of music. If the music volume increases, AGC kicks in and compensates by decreasing it. If the music becomes quiet, AGC automatically increases the volume. There is no good way to overcome this behaviour, and we had to disable AGC in Live Music Mode.

AUTOMATIC NOISE SUPPRESSION (ANS) AND NOISE FILL

White noise is a mix of all frequencies. ANS detects white noise at the sender side and filters it out. The sender codec measures the signal strength when the audio volume is at its lowest, i.e. no one is speaking, and uses this value as the filter threshold for the automatic noise suppression algorithm. Polycom ANS reduces background white noise by up to 9dB, or eight times (10^(9/10) = 10^0.9 = 7.943). If the noise energy is above the threshold, the gain stays at 1. If it is below the threshold, the gain is reduced by 9dB, or eight times. Many noise suppression implementations make voice sound artificial, like on a cell phone. Sophisticated technical implementations, such as the one in Polycom endpoints, deliver artefact-free audio and make the noise suppression completely transparent, i.e. not perceptible to users. The algorithm's reaction time is also important: e.g. if a fan turns on, it should not take more than a few seconds for ANS to recalculate the new threshold and remove the noise. Note that while ANS is very useful for eliminating the noise of laptop and projector fans and vents, it does not eliminate clicks and snaps - this is done by a separate algorithm in Polycom systems (the keyboard auto mute function).

Noise Fill is a receiver-side function and works like comfort noise in telephony - a low level of noise is added at the receiver side to reassure the user that the line is up. Noise Fill is necessary because all echo cancellers mute the microphones in single talk - when one side is talking and the other is not - so that the quiet side does not send any traffic over the network. To make sure users do not think the line is completely down, the receiver system adds some noise, and hides the fact that the other side is muted in single talk. Figure 4 provides a graphic representation of ANS and Noise Fill.

Figure 4: ANS and Noise Fill

Unfortunately, ANS is not designed for music. The problem is that if you have a sustained musical note which lasts for several seconds, the ANS algorithm thinks it is noise and removes it. Imagine how ANS can destroy a classical music piece. Adding noise at the receiver side would not help either - we would be far better off just playing the audio that comes over the network. After analyzing ANS and Noise Fill, we concluded that both have to be disabled in Live Music Mode.
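The power-ratio arithmetic quoted for AGC and ANS, together with a toy version of the ANS gating rule, can be sketched as follows. This is an illustration only; Polycom's actual algorithm is not published.

```python
# Power-ratio arithmetic used above: ratio = 10^(dB/10).

def db_to_power_ratio(db: float) -> float:
    return 10 ** (db / 10.0)

for db in (3, 6, 9):
    print(f"{db} dB -> x{db_to_power_ratio(db):.3f}")
# 3 dB -> x1.995 (about twice the nominal signal strength)
# 6 dB -> x3.981 (about four times)
# 9 dB -> x7.943 (about eight times, the ANS suppression depth)

def ans_gain(frame_energy: float, noise_threshold: float,
             suppression_db: float = 9.0) -> float:
    """Toy ANS gating rule: energy above the measured noise threshold
    passes at unity gain; energy below it is attenuated by the
    suppression depth (9 dB, i.e. roughly 1/8)."""
    if frame_energy >= noise_threshold:
        return 1.0
    return 1.0 / db_to_power_ratio(suppression_db)
```

The sketch also makes the sustained-note problem visible: a long, steady note keeps the frame energy near the measured "quiet" level, so a threshold-based rule treats it as noise.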

ACOUSTIC ECHO CANCELLATION (AEC)

Echo in telecommunications can have many causes: it may be a result of acoustic coupling at the remote site, or it may be caused by the communications equipment itself. Whatever the reason, it causes audio which is sent to a remote site to feed back through the speakers at the sender site. Through acoustic coupling between speakers and microphones at the sender site, this feedback is sent again through the created loop and adds even more echo on the line. AEC technology has to be deployed at the receiver site to filter out the echo. A graphical representation of the AEC concept is shown in Figure 5.

Figure 5: Acoustic Echo Cancellation

Basic AEC characteristics include the operating range and the adaptive filter length. For systems that support the 50Hz-22kHz acoustic frequency range, AEC should be able to support the same range. The adaptive filter length is the maximum delay of the echo that the system can compensate for. Polycom is leading in this area with a maximum adaptive filter length of 260ms. An additional benefit of our implementation is that the AEC algorithm does not need any learning sequence, i.e. we do not send out white noise to train it. The AEC algorithm trains quickly enough on real voice/audio. If the system supports stereo, it is important that AEC can identify the multiple paths of the stereo loudspeakers, and quickly adapt to microphones that are moved (if you move the microphone, you change the echo path, and the adaptive filter has to learn the new path). Our AEC implementation adapts within two words of speech, i.e. echo comes back for a short time (one or two words) and then the canceller adjusts.

Musicians want to be able to play a sustained note (press the sustain pedal on a piano) and hear it all the way, even 1dB over the noise floor. If the standard AEC is active during a music performance transmission, the receiver will get artifacts, and low notes can be cut out. Therefore, we developed special tunings in the echo canceller that prevent very quiet musical sounds from being cut out. Knowing that people will use Live Music Mode in a quiet studio environment, we changed the AEC thresholds to be much more aggressive than in standard mode.
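Polycom does not disclose its AEC algorithm, but the textbook approach to learning an echo path is a normalized LMS (NLMS) adaptive filter, sketched below. The tap count and step size are arbitrary illustrative values; 260ms of echo path at 48kHz would correspond to roughly 12,500 taps.

```python
import numpy as np

# Textbook NLMS echo-canceller sketch, for illustration only. It shows
# how an adaptive filter learns an echo path from the far-end reference
# signal; it is not Polycom's implementation.

def nlms_cancel(far_end: np.ndarray, mic: np.ndarray,
                taps: int = 1024, mu: float = 0.5, eps: float = 1e-8) -> np.ndarray:
    """Return the echo-cancelled microphone signal.

    far_end: samples played out of the loudspeaker (the reference).
    mic:     microphone samples = near-end audio + echo of far_end.
    """
    w = np.zeros(taps)                   # adaptive estimate of the echo path
    out = np.zeros_like(mic)
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]    # most recent reference samples
        echo_estimate = w @ x
        e = mic[n] - echo_estimate       # error = near end + residual echo
        out[n] = e
        w += mu * e * x / (x @ x + eps)  # normalized LMS update
    return out
```

The filter converges on whatever signal is present, which matches the observation above that no white-noise training sequence is needed; moving a microphone changes the true path, and the update rule re-learns it over the next stretch of audio.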

ADVANCED VIDEO TECHNOLOGY

While secondary to audio, video plays an important role in transmitting live music performances. The ability to see the performers in high resolution enhances the experience, as does the ability to zoom in on a solo player or see additional content related to the performance. Let's look at the changed expectations for video quality first. The vertical axis of the diagram in Figure 6 shows the video quality, while the horizontal axis shows the bandwidth necessary to transmit video at this quality across the network.

Figure 6: HD Video

The Common Intermediate Format (CIF) has a resolution of 352x288 pixels. Current video technology allows CIF-quality video to be transmitted at 30fps over 384kbps. Standard Definition (SD) is the quality of a regular TV, and has a resolution of 704x480 pixels. Current video technology allows transmitting SD-quality video at 30fps over 512kbps-1Mbps. The latest generation of products supports High Definition (HD), starting with 720p HD, i.e. 1280x720 pixels with progressive scan. Current video codec technology allows transmitting HD-quality video at 30fps over 1-2Mbps. Technologies like telepresence combine multiple codecs and deliver even higher quality by using higher bandwidth in the IP network.

Far End Camera Control is a feature that allows controlling the video camera at the remote site. The prerequisite is that the remote site supports the feature and uses a PTZ (pan, tilt, zoom) camera. The feature can be used, for example, to see a particular group of musicians within an orchestra, or to focus on a solo player. Technologies for automatic detection of the audio source allow the system to focus automatically on a solo player and then quickly refocus on another musician or on the whole orchestra, based on intelligent audio source detection algorithms.

Video systems today support two parallel video channels - while the live channel is used for the live video, the content channel is used for sending text, pictures, or videos associated with the live channel. When transmitting live music performances, the content channel can be used for transmitting any supporting information related to the performance - pictures, background videos, or, in the case of opera in a foreign language, subtitles in the local language. The above video technologies in concert can not only enhance the experience of traditional music performances but also allow for experimentation with new art forms.

PACKET LOSS AND LOST PACKET RECOVERY (LPR)

Packet loss is a common problem in IP networks because IP routers and switches drop packets when links get congested and their buffers overflow. Real-time streams, such as voice and video, are sent over the User Datagram Protocol (UDP), which does not retransmit packets. Even if a UDP/IP packet gets lost, retransmission does not make much sense, since the retransmitted packet would arrive too late; playing it would destroy the real-time experience.

Lost Packet Recovery (LPR) is a new method of video error concealment for packet-based networks, and is based upon Forward Error Correction (FEC). Additional packets that contain recovery information are sent along with the original packets in order to reconstruct packets that were lost during transmission. For example, suppose you have 2% packet loss at 4Mbps (losing approximately 8 packets for every 406 packets in a second). After engaging LPR, the packet loss can be reduced to, for example, less than 1 packet per 5 minutes: 1 packet / (406 packets/sec * 300 sec) = 1 packet / 121,800 packets = 0.00082%. The sketch below reproduces this arithmetic.
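A short check of the numbers in the example above. The per-packet size implied by 406 packets/s at 4Mbps - roughly 1230 bytes - is our inference, not a figure from the paper.

```python
# The arithmetic behind the LPR example: 2% loss at 4 Mbps, and the
# residual loss after LPR engages.

rate_bps = 4_000_000
packets_per_s = 406          # implies ~1230-byte packets: 4e6 / 406 / 8
loss_rate = 0.02

lost_per_s = loss_rate * packets_per_s       # ~8 packets lost every second
residual = 1 / (packets_per_s * 300)         # 1 packet per 5 minutes

print(f"without LPR: ~{lost_per_s:.0f} lost packets/s")
print(f"with LPR:    {residual:.5%} residual loss")   # ~0.00082%
```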

Figure 7 depicts the functions in LPR, both sender side (top) and receiver side (bottom).

Figure 7: Lost Packet Recovery

LPR includes a network model which estimates the amount of bandwidth the network can currently carry. This network model is driven by the received-packet and lost-packet statistics. From the model, the system determines the optimal bandwidth and the strength of the FEC protection that is required to protect the media flow from irrecoverable packet loss. This information is fed back to the transmitter via the signalling channel, which then adjusts its bit rate and FEC protection level to match the measured capacity and quality of the communication channel. The algorithm is designed to adapt extremely quickly to changing network conditions, such as cross congestion from competing network traffic over a Wide Area Network (WAN).

The LPR encoder takes RTP packets from the RTP Tx channel, encapsulates them into LPR data packets, and inserts the appropriate number of LPR recovery packets. The encoder is configured over the signalling connection, usually when the remote LPR decoder signals that it needs a different level of protection. The LPR decoder takes incoming LPR packets from the RTP Rx channel. It reconstructs any missing LPR data packets from the received data packets and recovery packets. It then removes the LPR encapsulation (thereby converting them back into the original RTP packets) and hands them on to be processed and forwarded to the video decoder. The decoder has been optimized for compute and latency.

LPR has advanced dynamic bandwidth allocation capabilities. Figure 8 illustrates the behaviour.

Figure 8: LPR dynamic bandwidth allocation example

When packet loss is detected, the bit rate is initially dropped by approximately the same percentage as the packet loss rate. At the same time, we turn on FEC and begin inserting recovery packets. This two-pronged approach provides the fastest possible restoration of the media streams, ensuring that little or no loss is perceived by the user. The behaviour can be modified by the system administrator through configuration. When the system determines that the network is no longer congested, it reduces the amount of protection, and ultimately goes back to no protection. If the system determines that the network congestion has lessened, it will also increase the bandwidth back towards the original call rate (up-speed). This allows the system to deliver media at the fastest rate that the network can safely carry. Of course, the system will not increase the bandwidth above the limits allowed by the system administrator. The up-speed is gradual, and the media is protected by recovery packets while the up-speed occurs. This ensures that there is no impact on the user experience. In the example in Figure 8, up-speeds are in steps of 10% of the previous bit rate. If the packet loss does not disappear, the system continues to monitor the network, and finds the protection amount and bandwidth that deliver the best user experience possible given the network conditions.

A recent evaluation of Polycom LPR by the independent Wainhouse Research [1] concluded: "While most of the video systems on the market today include some form of packet loss / error concealment capability, Polycom LPR is one of only two error protection schemes available today that uses forward error correction (FEC) to recover lost data. One of LPR's differentiators and strengths is that it protects all parts of the video call, including the audio, video, and content / H.239 channels, from packet loss."
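LPR's exact FEC code is not published, but the simplest scheme in this family - a single XOR parity packet per group of data packets - shows how recovery packets let the receiver rebuild a lost packet without retransmission. A minimal sketch, not Polycom's implementation:

```python
# Minimal XOR-parity FEC. One recovery packet protects a group of
# equally sized data packets against the loss of any single packet.

from functools import reduce

def make_recovery(packets: list[bytes]) -> bytes:
    """XOR all data packets together into one recovery packet."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets)

def recover(received: list[bytes | None], recovery: bytes) -> list[bytes]:
    """Rebuild the single missing packet (None) from the rest + parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    assert len(missing) == 1, "XOR parity recovers at most one loss per group"
    survivors = [p for p in received if p is not None]
    received[missing[0]] = make_recovery(survivors + [recovery])
    return received

group = [b"pkt0", b"pkt1", b"pkt2", b"pkt3"]
parity = make_recovery(group)
assert recover([b"pkt0", None, b"pkt2", b"pkt3"], parity)[1] == b"pkt1"
```

Stronger protection, as described above, amounts to sending more recovery packets per group (or smaller groups), trading bandwidth for resilience - exactly the dial the LPR network model adjusts.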

CONCLUSION

When MSM approached Polycom with the request for equipment that could deliver its live music performances to remote schools across the United States, we suggested using our high-end audio and video codecs - in the newly developed Live Music Mode. In the system's administration web page, LMM is just one check mark. Users clicking the LMM option do not realize the amount of work and the wealth of technology know-how that is behind this simple option. Neither should they. Users such as the Manhattan School of Music are musicians, not engineers, and the only thing they care about is that, thousands of miles away, their notes are played true to the original and without any distortions or artifacts. Details about the MSM experience are available at http://youtube.com/watch?v=_gqjylphfac&mode=related&search.

REFERENCES

[1] I. M. Weinstein: "Polycom's Lost Packet Recovery (LPR) Capability", Wainhouse Research Evaluation, January 2008

ACKNOWLEDGEMENTS

I would like to thank my colleagues Peter Chu, Stephen Botzko, and Minjie Xie for their contributions to this work. Special thanks to Jeff Rodman for his support.