WebRTC Server Side Media Processing: Simplified

Meeting the challenges of the rising need of server side media processing in WebRTC

Tsahi Levent-Levi
tsahil@bloggeek.me

Contents

Executive Summary
WebRTC in 2016
Three Modern WebRTC Requirements
1. Everyday Transcoding
   Archive and Playback
   Streaming
   Video Conferencing
   Telephony
   Transcoding is Here to Stay
2. The Telephony Gateway
   Connectivity to VoIP Systems
   Connectivity to Legacy Video Conferencing Systems
   Connectivity to VoLTE
3. Rise of the Hybrid SFU/MCU
Filling the Enterprise WebRTC Needs in the Cloud
The Rise of the GPU
About SURF-HMP
About the Author

Executive Summary

WebRTC has been with us for five years now. During that time, we've gone from exploring WebRTC to industrializing it. The main trend we see today is the introduction of requirements and use cases where server side media processing is part and parcel of the service.

This need for WebRTC server side media processing stems from three main functions that are required by many modern use cases: transcoding between different codecs, gatewaying across different protocols, and supporting the emerging hybrid multi-party video architectures. These functions are driven by many different high-end requirements, from the need to archive an interaction taking place over WebRTC or connect to legacy video conferencing devices and the traditional telephony system, to the desire to live stream an event to large audiences.

In this whitepaper, we will explore these functions and the high-end requirements that they serve. We will list the various use cases that necessitate server side media processing, ending with a look at the need for large scale software solutions, and how these solutions rely heavily today on the efficiency brought by GPU capabilities.

WebRTC in 2016

WebRTC is an open source technology and standard specification enabling real time voice and video communications inside browsers without the need to install any plugin. By adding a few lines of JavaScript code, websites and web apps can connect people as if they were using Skype. WebRTC emerged in 2011 and is now available in Chrome, Firefox, and to some extent in Microsoft's new Edge browser.

Companies large and small are using WebRTC already: from Facebook Messenger and Google Hangouts, which use it to drive voice and video calling in their mobile apps and desktop browsers, to AT&T and NTT, who offer developers APIs that connect web-based communications to their traditional phone subscriber base.

At its inception, WebRTC's main use cases focused on connecting simple voice and video calls through the browser. Today, after several years of availability, developers have had time to understand what can be achieved with WebRTC, giving rise to more sophisticated use cases such as live broadcasts. This rise in use case sophistication is enabled mainly by the availability of server side media processing.

Three Modern WebRTC Requirements

A simple use case of a one-to-one video call can become more elaborate as additional requirements are stacked on the initial proof of concept. Features such as recording and archiving, call escalation, and connectivity to VoIP equipment often require the introduction of a server component capable of processing media. Such needs can generally be traced to three modern requirements that product managers now introduce to their WebRTC-enabled use cases:

1. Everyday transcoding
2. The telephony gateway
3. Rise of the hybrid SFU/MCU

1. Everyday Transcoding

Transcoding is the act of converting a media stream from one codec or format to another. While the cost and complexity associated with transcoding are rather high, we often cannot deliver a service or support certain features without it.

One of the reasons we need transcoding is a mismatch in codec support. As WebRTC matures, the number of codecs it can support is growing as well. On one hand, this can be said to reduce potential codec mismatch; on the other hand, it shows that the range of codec alternatives is also growing, and with it the potential for codec mismatch.

Until recently, Chrome supported only VP8 as its video codec. Chrome is now well on its way to supporting five codecs for WebRTC (G.711 and Opus for voice; VP8, VP9, and H.264 for video), but this will not mean an end to the need to transcode. Different systems and services make use of different codecs, and at times supporting a specific codec is a necessity due to a technical need, such as the desire to reduce bandwidth consumption or exploit hardware coding on devices.

There are many cases where transcoding becomes important in WebRTC:

Archive and Playback

Imagine a consulting service, one where a person can receive online assistance from an expert. Now think of being the one on the receiving end of the consultation. Would you like the ability to play back this session later on? This is one of the many cases where there is a requirement to record and archive a live session and play it back later.

The archiving and playback of the session do not necessarily use the same technologies or codecs. For playing back a media file on mobile, H.264 is the most common video codec today, while most WebRTC services today make use of VP8. Supporting archiving and playback therefore usually requires transcoding both codecs and media file formats.

Streaming

There is a rise in live broadcast-type services at the moment, stemming from the popular Meerkat and Periscope platforms. These new live broadcast use cases include everything from teens starting their own radio channels and cooking shows, to chatting with sports fans and breaking news stories.

This type of live content created using WebRTC needs to be streamed over the Internet to a large number of passive listeners or viewers. That means switching from real time communications protocols such as WebRTC to protocols suited to CDN (Content Delivery Network) delivery and media streaming, such as RTMP or HLS. This change of protocol usually brings a change in supported codecs as well, translating into the need to transcode.

Video Conferencing

In the last two decades, we have seen steady growth in the use of video conferencing within the enterprise. This growth usually targeted multinational companies that needed better means of communication within the organization to raise productivity.

Enterprise video conferencing systems have their own set of protocols, codecs, and existing products and services. In most cases, these legacy systems support the H.323 and SIP protocols, which are widespread in enterprises. For multiple reasons, these communication protocols and video conferencing systems use different codecs than the ones selected and used by WebRTC.

Enterprises expect to be able to use new services and enable browser-based connectivity while continuing to use their legacy devices. This requires transcoding between the codecs used by WebRTC and those used by the video conferencing devices already deployed in enterprises. While this may change over time, there will be a need to support legacy devices in enterprises for years to come. In many cases, such support translates to the need to transcode.

Telephony

Contact centers are one of the main areas where WebRTC is finding a home for itself. A leading use case today is to replace agent IP phones with an integrated WebRTC phone as part of the CRM system itself. This leads to the need to interconnect that WebRTC call from the browser with the carrier telephony system. Both support G.711, a ubiquitous voice codec. The only problem is G.711's limited quality and lack of resilience to packet loss. This is why, in many cases, it is advisable to use Opus on the WebRTC side, and transcode it to a wideband codec or even to G.711 when connecting to traditional telephony or on-premise VoIP systems.
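
The codec decision described above can be sketched in a few lines of Python. This is a minimal illustration and not part of any real gateway: the function name, the `LegPlan` type, and the example preference lists (including G.722 as the wideband telephony codec) are assumptions made for the example. Each leg either uses its own preferred codec, which forces a transcode when the two differ, or both legs fall back to the first codec they have in common.

```python
from typing import NamedTuple, Sequence


class LegPlan(NamedTuple):
    webrtc_codec: str
    far_end_codec: str
    transcode: bool


def plan_audio_legs(webrtc_prefs: Sequence[str],
                    far_end_prefs: Sequence[str],
                    prefer_quality: bool = True) -> LegPlan:
    """Choose one audio codec per leg (lists are ordered best first).

    prefer_quality=True : every leg uses its own best codec; the gateway
                          transcodes whenever the two choices differ.
    prefer_quality=False: both legs use the first common codec if there is
                          one, so the gateway can pass audio through.
    """
    if not webrtc_prefs or not far_end_prefs:
        raise ValueError("each leg must offer at least one codec")

    near, far = webrtc_prefs[0], far_end_prefs[0]
    if not prefer_quality:
        common = [c for c in webrtc_prefs if c in far_end_prefs]
        if common:
            near = far = common[0]
    return LegPlan(near, far, transcode=(near != far))


# Hypothetical offers: a browser on one side, a telephony system on the other.
browser = ["opus", "G.711"]
telephony = ["G.722", "G.711"]
print(plan_audio_legs(browser, telephony))
print(plan_audio_legs(browser, telephony, prefer_quality=False))
```

With these example lists, the first call keeps Opus in the browser and transcodes to the wideband codec, while the second settles on G.711 end to end and avoids transcoding at the price of voice quality.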

Transcoding is Here to Stay

With about two decades of experience in the video conferencing market, we can generally assume that the need to transcode will continue to exist for many years to come. There will always be an incentive to advance and use newer codecs to improve user experience, coupled with the need to support existing deployments that make use of older codecs.

In many cases, transcoding will be needed due to other types of requirements, where it will be coupled with a different set of media processing capabilities. For example, if we take three separate video streams, mix them into a single stream, and add to that stream other UI elements such as a text box or a logo, we need to:

1. Decode video feeds
2. Scale video feeds
3. Combine video feeds
4. Overlay video feeds with visual effects
5. Encode video feeds
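
To make the middle of that pipeline concrete, here is a toy Python sketch of steps 2 to 4 (scale, combine, overlay). It is an illustration under simplifying assumptions: a frame is a plain list of rows of grayscale values standing in for a decoded picture, all helper names are hypothetical, and steps 1 and 5 (decode and encode) are left out because they require a real codec library.

```python
Frame = list[list[int]]  # row-major grayscale pixels; stands in for a decoded picture


def scale(frame: Frame, width: int, height: int) -> Frame:
    """Step 2: nearest-neighbour resize to width x height."""
    src_h, src_w = len(frame), len(frame[0])
    return [[frame[y * src_h // height][x * src_w // width]
             for x in range(width)]
            for y in range(height)]


def combine(tiles: list[Frame]) -> Frame:
    """Step 3: place equally sized tiles side by side on one canvas."""
    rows = len(tiles[0])
    return [sum((tile[y] for tile in tiles), []) for y in range(rows)]


def overlay(canvas: Frame, logo: Frame, top: int, left: int) -> Frame:
    """Step 4: copy the logo pixels over the canvas at (top, left)."""
    out = [row[:] for row in canvas]
    for dy, logo_row in enumerate(logo):
        for dx, pixel in enumerate(logo_row):
            out[top + dy][left + dx] = pixel
    return out


def mix(decoded: list[Frame], logo: Frame, tile_w: int, tile_h: int) -> Frame:
    """Run steps 2-4 on frames that have already been decoded (step 1)."""
    tiles = [scale(frame, tile_w, tile_h) for frame in decoded]
    return overlay(combine(tiles), logo, top=0, left=0)


# Three 4x4 "feeds" filled with 1, 2 and 3, mixed into 2x2 tiles with a 1x1 logo.
feeds = [[[value] * 4 for _ in range(4)] for value in (1, 2, 3)]
mixed = mix(feeds, logo=[[9]], tile_w=2, tile_h=2)
print(mixed)  # [[9, 1, 2, 2, 3, 3], [1, 1, 2, 2, 3, 3]]
```

Even in this toy form, every output frame touches every pixel of every input feed, which hints at why mixing at real video resolutions and frame rates is so demanding.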

2. The Telephony Gateway

One of the main reasons to transcode stems from the need to gateway, that is, to interconnect one protocol or service with another. In the context of WebRTC, gatewaying will happen most of the time in front of VoIP-based systems that are not making direct use of WebRTC. There are three broad markets where WebRTC needs to rely on a gateway:

1. Connectivity to VoIP systems
2. Connectivity to legacy video conferencing systems
3. Connectivity to VoLTE/ViLTE

Connectivity to VoIP Systems

In many cases, vendors are looking to connect WebRTC to an existing VoIP system. This may be in order to integrate with contact centers or offer telephony services to enterprises and SMBs. When that happens, there tends to be a need to bridge WebRTC with an existing SIP deployment.

Besides protocol translation, the main concern is audio codec transcoding. Most SIP deployments do not support Opus, but rely on a slew of other voice codecs. Transcoding between these codecs is necessary, as relying on G.711 usually means a reduction in voice quality.

Connectivity to Legacy Video Conferencing Systems

There is a lot of pressure on video conferencing vendors who support WebRTC to introduce interoperability with legacy video conferencing systems. At the same time, video conferencing vendors are pressured to support WebRTC in their systems. The result is a need to better bridge the protocols and codecs in legacy video conferencing products with the existing WebRTC implementations in the browser.

There are four main challenges to overcome:

1. Transcoding, as seen previously
2. Protocol translation, from SIP and H.323 that are prevalent in video conferencing to proprietary WebRTC signaling
3. Dealing with protocol discrepancies between the ways SRTP and ICE are implemented in WebRTC, SIP, and H.323
4. Handling on-the-fly adjustments of stream parameters in order to optimize the end user experience, such as bitrate/frame-rate alterations, resending of reference frames, and more

These challenges necessitate a server side media processing component capable of taking care of these issues.

Connectivity to VoLTE

VoLTE stands for Voice over LTE. It is the approach carriers have selected for making voice calls on LTE networks, using IMS protocol extensions (SIP IMS + Diameter for billing). It is a relatively new protocol that is set to replace circuit switched telephony in our 4G networks with a more modern packet switched alternative. A variant/extension of VoLTE is ViLTE (Video over LTE). Both need to be accessible via WebRTC as well.

Carriers who are working on deploying VoLTE are also looking for ways to connect it to WebRTC to increase the reachability of their network, as well as open routes for integration. The challenge here is that the voice codecs selected for VoLTE are different from the ones used in WebRTC. VoLTE uses AMR-NB and AMR-WB for its voice codecs, while WebRTC relies on G.711 and Opus.

When it comes to video codecs in ViLTE versus WebRTC, there is a need to handle codec conversions between H.264 and VP8, with the added complication of supporting multi-stream scenarios, where the streams of multiple participants in a conversation need to be converted into a single media stream to support a ViLTE device.

To support WebRTC properly, anyone who wants to connect to VoLTE will need to transcode across these codecs as well as handle any other media-related issues, similar to what we have seen in VoIP and video conferencing systems.

3. Rise of the Hybrid SFU/MCU

There are two main models of handling large video conferences: the MCU and the SFU.

MCU - Multipoint Control Unit. In this model, all participants operate in the same fashion as they would in a peer-to-peer session. They send a single media stream and receive a single media stream. The MCU is in charge of decoding all streams, combining them into a single view, and then encoding and sending it out to each participant.

SFU - Selective Forwarding Unit. In this model, participants send their media to the SFU, which in turn selectively decides which of the media streams to send to the participants. Effectively, each participant sends out a single media stream but receives multiple incoming streams.

Although each of these alternatives has its own advantages and challenges, there is a growing trend these days to opt for a hybrid approach: one which uses an MCU and an SFU to drive the same use case. We use an SFU to push processing to the devices as much as possible, and when needed, we employ an MCU.
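
The difference between the two models shows up clearly when counting streams. The sketch below is a back-of-the-envelope illustration with hypothetical names; it assumes the worst case for both models: the MCU encodes a separate output for every participant, and the SFU forwards every other participant's stream to each participant. A real SFU can forward fewer streams, since forwarding is selective.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Load:
    client_sends: int      # streams each participant uploads
    client_receives: int   # streams each participant downloads and decodes
    server_decodes: int    # streams the server must decode
    server_encodes: int    # streams the server must encode
    server_forwards: int   # streams the server relays without touching the media


def mcu_load(participants: int) -> Load:
    """MCU: decode everything, mix, encode one view per participant (assumed)."""
    return Load(client_sends=1, client_receives=1,
                server_decodes=participants, server_encodes=participants,
                server_forwards=0)


def sfu_load(participants: int) -> Load:
    """SFU: no decoding or encoding; relay each stream to all other participants."""
    return Load(client_sends=1, client_receives=participants - 1,
                server_decodes=0, server_encodes=0,
                server_forwards=participants * (participants - 1))


for n in (3, 5, 10):
    print(n, mcu_load(n), sfu_load(n))
```

For five participants, the MCU decodes and encodes five streams while each client handles one in each direction; the SFU does no codec work but relays up to twenty streams, and each client must decode four.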

The diagram above shows a few of the cases where a hybrid model is advisable:

- A few active participants are engaging in a conversation. Towards that goal, they use an SFU architecture.
- An MCU is tethered to the conversation, providing the necessary conversions:
  - Downscaling and encoding the stream to fit into smaller mobile devices that may not be capable of processing multiple incoming video streams
  - Generating and recording a single video stream of the conversation for future playback purposes
  - Creating a transcoded video stream connecting to a video CDN for large-scale passive viewing
  - Connecting to legacy room systems that support the H.323 and SIP communication protocols, where an MCU architecture is popular and an SFU architecture is usually not supported at all

This approach is becoming a necessity for many of the recent use cases coming to the market. It is flexible enough to offer a good mix between usability and costs, along with powerful backend media processing capabilities.
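
The cases listed above amount to a routing decision per endpoint. The following Python sketch shows one way such a rule could look; it is an assumption-laden illustration, not a description of any product. The endpoint kinds, the `max_incoming_video` capability hint, and the function name are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    SFU = "sfu"   # receive the individual forwarded streams
    MCU = "mcu"   # receive a single mixed and transcoded stream


# Hypothetical endpoint kinds that can only consume one composed stream.
SINGLE_STREAM_KINDS = {"recorder", "cdn", "room_system"}


@dataclass(frozen=True)
class Endpoint:
    name: str
    kind: str                    # e.g. "browser", "mobile", "recorder", "cdn", "room_system"
    max_incoming_video: int = 8  # assumed hint: how many streams the device can decode


def choose_path(endpoint: Endpoint, incoming_streams: int) -> Path:
    """Serve an endpoint from the SFU when it can cope, otherwise from the MCU."""
    if endpoint.kind in SINGLE_STREAM_KINDS:
        return Path.MCU
    if incoming_streams > endpoint.max_incoming_video:
        return Path.MCU
    return Path.SFU


endpoints = [
    Endpoint("alice-laptop", "browser"),
    Endpoint("bob-phone", "mobile", max_incoming_video=1),
    Endpoint("archive", "recorder"),
    Endpoint("boardroom", "room_system"),
]
for ep in endpoints:
    print(ep.name, choose_path(ep, incoming_streams=3).value)
```

The design choice mirrors the text: the SFU is the default because it pushes processing to the devices, and the MCU is engaged only for endpoints that need a single composed stream.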

Filling the Enterprise WebRTC Needs in the Cloud

There is a growing shift toward moving enterprise services into cloud deployments. For many years, video conferencing and other real time applications resisted that migration, but now, with WebRTC being a first-class citizen in the web browser, they are part of that trend.

Video conferencing and server media processing were traditionally designed and deployed by way of proprietary hardware. These days, the market requires even media-intensive real time processing platforms to be cloud-oriented. This boils down to two main needs:

1. Have media processing run as pure software on commodity datacenter machines
2. Have media processing run in virtualized environments

These needs align with the greater flexibility sought after today in how a service gets deployed, maintained, and scaled. By having the ability to run media processing on commodity hardware in a virtualized environment, services can put in place simple rules that automate the growth or reduction of the system across machines based on user demand, closely aligning the costs of the service infrastructure with actual use.

This places a rather challenging need on media processing software: it needs to be scalable to make sense. Transcoding, encoding, decoding, mixing, and overlays all require considerable horsepower, especially if you factor in the steady increase in video resolutions (from VGA, to HD, and towards 4K).
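
A "simple rule" of the kind mentioned above could look like the following Python sketch. It is only an illustration: the function names, the capacity figure of 25 sessions per machine, and the 20% headroom are assumptions chosen for the example, not measured values.

```python
import math


def target_instances(active_sessions: int,
                     sessions_per_instance: int,
                     headroom: float = 0.2,
                     min_instances: int = 1) -> int:
    """Number of media servers needed for the current demand plus spare headroom."""
    if sessions_per_instance <= 0:
        raise ValueError("sessions_per_instance must be positive")
    needed = math.ceil(active_sessions * (1 + headroom) / sessions_per_instance)
    return max(min_instances, needed)


def scaling_delta(current_instances: int, target: int) -> int:
    """Positive: machines to add. Negative: machines that can be released."""
    return target - current_instances


# Assumed capacity: one virtual machine handles 25 transcoding sessions.
for sessions in (0, 40, 100):
    target = target_instances(sessions, sessions_per_instance=25)
    print(sessions, target, scaling_delta(current_instances=2, target=target))
```

Because the rule depends only on demand and per-machine capacity, the same logic works whether the capacity figure comes from CPU-only machines or from denser GPU-accelerated ones.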

The Rise of the GPU

There are three main alternatives for processing media:

1. Running pure software on CPUs
2. Using DSPs, which is common with video coding on mobile devices and in legacy hardware-based video conferencing solutions
3. Utilizing GPUs

GPUs started as graphics accelerators for PCs. Their focus was mainly in the area of 2D and 3D graphics. In recent years, a trend started where general purpose workloads are run on GPUs in order to speed up processing while lowering power consumption. Lately, codec acceleration started appearing in GPUs, which better addresses the scale challenge for media servers.

This reliance on GPUs is important: while CPU speeds and performance per core have not changed much in the last decade, GPUs still double in performance every couple of years, a trend that seems to be accelerating with the embedding of powerful GPUs within CPUs, which is being done increasingly by major CPU manufacturers.

The result:

- Higher performance and density of media processing on GPUs compared to CPUs and DSPs
- Lower power consumption
- Lower price of video processing compared to CPUs and DSPs

This positions software-based server media processing solutions that make use of GPUs optimally in terms of both CAPEX and OPEX: with less hardware and power, you can cram in more processing.

About SURF-HMP

SURF-HMP marks a breakthrough in multimedia service provisioning, revolutionizing cost, performance and functionality, extended via superb 4K video resolution and ultrahigh capacity voice and video transcoding, mixing and processing on any GPU-accelerated Intel processor. Its architecture enables it to be offered in a wide variety of licensing models to meet requirements ranging from evolving to full-blown, large-scale deployments. It is driven by a powerful multimedia processing engine that facilitates a multitude of applications (transcoding, conferencing/mixing, MRF, playout, recording, messaging, video surveillance, and more), and it is instrumental in bridging the battle between codecs in the WebRTC world, supporting any-to-any transcoding between the H.264, H.265, VP8 and VP9 codecs.

In addition to making WebRTC accessible for users residing in legacy environments (SIP, H.323, E1/T1), it offers comprehensive collaboration (supporting both H.239 and BFCP), extremely low latency, extensive encryption capabilities that are compulsory for security/surveillance related deployments, and many additional capabilities. Last but not least, SURF-HMP is unique in its ability to cope with signaling and termination gap challenges, which is mandatory when interoperability is required between new era communications and existing, legacy environments. This makes it an imperative building block when interconnecting legacy/incumbent operator networks and emerging WebRTC environments.

For more information, visit our new website at www.surfsolutions.com.

About the Author

Tsahi Levent-Levi is an Independent Analyst and Consultant for WebRTC. He has over 15 years of experience in the telecommunications, VoIP, and 3G industry as an engineer, manager, marketer, and CTO. Tsahi is an entrepreneur, assisting companies with bridging technologies and business strategy in the domain of telecommunications.

He has an M.Sc in Computer Science and an MBA degree specializing in entrepreneurship and strategy. Tsahi has been granted three patents related to 3G-324M and VoIP. He acted as the chairman of various activity groups within the IMTC, an organization focused on the interoperability of multimedia communications. Tsahi is also the author and editor of bloggeek.me, which focuses on the ecosystem and business opportunities around WebRTC.