2017 IEEE Symposium on Service-Oriented System Engineering

A Multilingual Video Chat System Based on the Service-Oriented Architecture

Jayanti Andhale, Chandrima Dadi, Zongming Fei
Laboratory for Advanced Networking, University of Kentucky, Lexington, Kentucky 40506, USA
Emails: jran229@g.uky.edu, cda232@g.uky.edu, fei@netlab.uky.edu

Abstract—The use of video chat and video conference applications is ubiquitous, especially in this era of wireless networking and mobility. However, video chat systems that allow the communicating parties to speak or type in different languages are not widely available. In this paper, we propose a service-oriented architecture for a multilingual video chat system that supports people speaking and texting in different languages. It uses the Web Real-Time Communication (WebRTC) technology and takes advantage of services available on the Internet, including the Google Web Speech API, the Google Transliterate API, and Microsoft Translator. It is a browser-based solution that allows users to connect from various platforms, such as Windows, Linux, or Mac. Since the application uses WebRTC, users do not have to download or install any plugins. The service-oriented architecture design based on WebRTC allowed us to develop and implement the whole system in a short period of time.

Keywords—video conferencing; multilingual; translation; social networking

I. INTRODUCTION

Video conference applications are widely used in the workplace and in our daily lives. They bring us closer and enable us to see each other even when we are located in different places. They give us the look and feel of a face-to-face meeting and greatly shorten the distance between people. Most existing video conferencing applications, such as iChat [1], Google Hangouts [2], and Skype [3], require users to download native applications (apps) or plugins. These plugins or downloads can sometimes be difficult to install.
In addition, users are required to keep these applications constantly updated. Recently, the Web Real-Time Communication (WebRTC) technology [4] has attracted attention because it enables in-browser communications by integrating voice and video directly into the browser without the need for any download or plugin. The Web API it provides can easily be used to develop web-based real-time applications.

Another issue with most existing video conferencing applications, including WebRTC-based applications such as OpenTok [5], Vline [6] and GoInstant [7], is that they require users in the same virtual room to use a common (natural) language for communication. One of the design goals of this project is to enable people speaking different languages to use the application for video chat and text messaging. To achieve this goal, we adopt a service-oriented architecture for the system design. Instead of implementing all the functions related to facilitating multilingual communication from scratch, we take advantage of services available on the Internet and invoke them in our multilingual video chat application. In particular, we make use of the Google Web Speech service, the Google Transliterate service, and Microsoft Translator in our system. Our experience with the WebRTC technology and the service-oriented architecture highlights some challenges and lessons learned in the implementation. The approach allowed us to develop and implement the whole system in a short period of time.

The rest of the paper is organized as follows. Section II presents the design of the service-oriented architecture of the multilingual video conference system. Section III illustrates the steps for establishing media and data channels between browsers. Section IV describes the implementation of the modules in the system. Section V discusses related work and Section VI concludes the paper.

II. THE SYSTEM DESIGN BASED ON THE SERVICE-ORIENTED ARCHITECTURE

As mentioned before, the goal of the multilingual video conference/chat system (called MLChat) is to enable people speaking different languages to video conference/chat with each other directly. Each person can select the language of his/her preference and use that language when speaking in the video call and when typing text messages. If two persons A and B using different languages communicate with each other using MLChat, the video displayed at A will be captioned in A's language. Ideally, we would like the audio to be translated and synthesized in A's language. Similarly, the text messages displayed at A will be in A's language, even if B uses a different language. The situation on B's side is symmetric, i.e., everything is presented in B's language.

The design of the multilingual video conference/chat system is based on the service-oriented architecture, as shown in Figure 1. When two users want to communicate via video conferencing/video chat, they enter the system from their respective browsers. WebRTC provides basic functions for video/audio communications between browsers. It is distributed as a part of the browser code. Many popular browsers, such as Google Chrome and Mozilla Firefox, support WebRTC. MLChat uses the HTML5 and JavaScript APIs provided by WebRTC to capture user media, establish a peer-to-peer connection, and establish a data channel between browsers. To establish the peer-to-peer connections, MLChat implements a signaling server that facilitates the process. It is co-located with the web server that serves the application pages to the browsers. The HTML5 Web Speech service from Google is used to capture the voice data, which is sent over to the server and translated into the language of the receiving side. It is then displayed as captioning in the language of the receiving side. The Web Speech service is invoked by each browser independently. The translation support is achieved by using the Microsoft Translator service, which is invoked by the web/signaling server. The Google Transliteration service is also used to allow users to type in the selected script.

Figure 1. The Architecture of the System

978-1-5090-6320-8/17 $31.00 © 2017 IEEE. DOI 10.1109/SOSE.2017.17

A. Web Browser and WebRTC

The web browser is the portal and the interface for users. To start the system, the user simply loads the page using the URL provided. One of the key components in the browser is WebRTC, which provides an abstraction layer and APIs that allow developers to ignore the low-level details of dealing with voice, video, and cross-browser communication. WebRTC defines an abstract session layer that performs session management, including conference initiation and conference management. The video and audio engine modules capture media content from the camera and microphone, respectively, and send it to the application. In addition, the video and audio engines render received media data back. These engines contain the audio codec, video codec, text codec, and other components essential for encoding and decoding audio/video data and processing text messages. Several techniques have been used to improve the quality of voice and video playback, including an equalizer for audio, image enhancement techniques for video, and echo cancellation. WebRTC uses real-time protocols (e.g., RTP [8] or SRTP) for sending audio and video data over UDP. These real-time protocols carry timing information, and their associated control protocols specify information such as the media codec, frame rate, and bit rate. Besides the WebRTC APIs, signaling is another important task in establishing browser-to-browser communication. Signaling is used to exchange control messages about the communication channels. Signaling methods and protocols are not specified by WebRTC and are not part of the WebRTC API. These functionalities are implemented by the signaling server.

B. Browser to Browser Media Channel

We use the web APIs provided by WebRTC, together with the signaling server, to establish the browser-to-browser media channel and data channel. One important feature of WebRTC is that it provides a way to traverse NATs and firewalls. Even when two browsers are behind a NAT and/or firewall, they will still be able to communicate with each other. We used the three APIs provided by WebRTC in developing the MLChat system, i.e., getUserMedia, RTCPeerConnection, and RTCDataChannel. We use the getUserMedia JavaScript API and HTML5 elements to obtain the local media stream. We use the RTCPeerConnection API to instantiate an RTCPeerConnection object at each browser and start the signaling for establishing browser-to-browser communication. We use RTCDataChannel for the transmission of other data, such as video captioning and text messages. After a successful signaling process, a media channel and a data channel between the browsers will be established, as shown in Figure 1.
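The browser-side use of these three APIs can be sketched as follows. This is a minimal illustration rather than MLChat's actual code: the element IDs (localVideo, remoteVideo), the sendSignal() callback that hands messages to the signaling server, and the choice of a public Google STUN server are all assumptions for illustration.

```javascript
// Sketch of the caller side: capture local media, create the peer
// connection, open a data channel, and start signaling with an SDP offer.
// Hypothetical helpers: sendSignal() forwards a message via the signaling
// server; localVideo/remoteVideo are <video> elements on the page.
async function startCall(sendSignal) {
  // Capture camera and microphone; the browser prompts the user for permission.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: true });
  document.getElementById('localVideo').srcObject = stream;

  // A public STUN server lets each browser discover its public address (assumption).
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
  });
  stream.getTracks().forEach(track => pc.addTrack(track, stream));

  // Render the remote media stream when it arrives over the media channel.
  pc.ontrack = event => {
    document.getElementById('remoteVideo').srcObject = event.streams[0];
  };

  // A data channel for captions and text messages.
  const channel = pc.createDataChannel('captions');
  channel.onmessage = event => console.log('caption:', event.data);

  // Relay ICE candidates and the SDP offer through the signaling server.
  pc.onicecandidate = event => {
    if (event.candidate) sendSignal({ candidate: event.candidate });
  };
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendSignal({ sdp: pc.localDescription });

  return pc;
}
```

The answering browser runs the mirror image of this sketch, calling setRemoteDescription() with the received offer and createAnswer() instead of createOffer().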
The media channel is used for the exchange of audio and video data between the browsers and hence provides the video chat feature of the application. The data channel can be used for communication of other data.

C. Application/Signaling Server

We implement a single server that performs the functions of both the application server and the signaling server. As an application server, it serves the HTML file for the application to the browsers when the users load the page. As a signaling server, it acts as the forwarding engine for the signaling process when local media streams are active in the browser.

D. Google Web Speech and Transliteration Engine

We use the HTML5 Web Speech API from Google for voice recognition (speech to text). It allows continuous speech dictation and can record recognized speech in text format. Because the session limit of the speech recognition API is 60 seconds, the application has to restart the speech recognition every 60 seconds. With the help of the established data channel, we can provide a real-time captioning feature for video conferencing. If the communicating parties prefer different languages, we can provide the captioning in the language chosen by the receiver with the help of the translation service discussed below.

We use the HTML5 transliteration API from Google to transliterate the text entered into a specific text area into the script selected by the user. It can only handle UTF-8 text and uses a dictionary-based phonetic transliteration approach. It provides the transliteration typing service for Chinese, Hindi, and other languages.

E. Microsoft Translator

We use Microsoft Translator to translate text between different languages. It is invoked by the application/signaling server because it is a subscribed service. The Microsoft Translator API is a cloud-based automatic translation service supporting multiple languages. When a browser gets text chats or text from the speech recognition service, it sends them to the application server. After the translation, the content in the selected language is sent to the receiver for display.

III. ESTABLISHING THE MEDIA CHANNEL

One of the key steps in starting the video conference is to establish the media channel between the two browsers using the signaling process. The application follows the steps illustrated in Figure 2 for this process. The browsers can be either Google Chrome or Mozilla Firefox. During the process, we set up the channel using Interactive Connectivity Establishment (ICE), which employs Session Traversal Utilities for NAT (STUN) and Traversal Using Relays around NAT (TURN) to traverse NATs and firewalls.

Figure 2. MLChat Peer-to-Peer Media Channel Setup

1: Browser 1 loads the MLChat URL.
2: Browser 2 loads the MLChat URL.
3: Browser 1 loads the local user media.
4: Browser 2 loads the local user media.
5: Browser 1 instantiates its peer connection object and prepares the call offer using the Session Description Protocol (SDP). The offer contains the supported configuration for the session, i.e., the description of the local MediaStreams.
6: The offer is sent to the signaling server.
7: The signaling server forwards the offer to Browser 2.
8: Browser 2 instantiates its peer connection object and sets its local configuration based on the received offer. It prepares an answer using SDP. The answer contains a supported configuration for the session that is compatible with the parameters in the offer.
Figure 3. MLChat System

Figure 4. MLChat: User Local Media

9: The answer is sent to the signaling server.
10: The signaling server forwards the answer to Browser 1.
11: Browser 1 sets its local configuration based on the received answer. Based on the type of connectivity discovered by running tests using the STUN and TURN servers, Browser 1 calls the startIce() method of the peer connection and prepares an ICE candidate.
12: The ICE candidate is sent to the signaling server.
13: The signaling server forwards the ICE candidate to Browser 2.
14: Browser 2 adds the received candidate to the current session and prepares its own ICE candidate.
15: The ICE candidate is sent to the signaling server.
16: The signaling server forwards the ICE candidate to Browser 1.
17: Browser 1 adds the received candidate to the session.
18: Browser 1 formulates the best candidate pairing and tries to establish a connection with Browser 2. If successful, Browser 1 and Browser 2 establish the media channel between them.

After the media channel is established, the audio and video streams are sent from one browser to the other, as shown in Figure 1. During the process, both the STUN server and the TURN server are used, as illustrated in Figure 3. STUN servers are used to find the public IP addresses of the browsers if they are located behind NAT boxes. The TURN server is used as a fallback mechanism for browsers that cannot establish a peer-to-peer connection; it acts as a media proxy server between the browsers. The RTCDataChannel is then established over the same peer connection. It provides the text messaging feature of the application and transfers the captioning for the video content.

IV. IMPLEMENTATION OF THE MULTILINGUAL VIDEO CHAT SYSTEM

MLChat uses several APIs, JavaScript libraries and packages. The server runs in the Node.js environment and uses the Express development framework. Additionally, the server uses the socket.io library for secure communication and listens on port 8443.
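The forwarding role the signaling server plays in steps 6-7, 9-10, and 12-16 can be sketched independently of the transport. The sketch below models the relay logic with plain callbacks; MLChat wires the equivalent logic to socket.io connections, and the class and method names here are hypothetical.

```javascript
// Transport-agnostic sketch of the signaling server's forwarding engine.
// Each peer registers a delivery callback; offers, answers, and ICE
// candidates from one peer are relayed to every other peer in the room.
class SignalingHub {
  constructor() {
    this.rooms = new Map(); // roomId -> Map(peerId -> deliver callback)
  }

  join(roomId, peerId, deliver) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Map());
    this.rooms.get(roomId).set(peerId, deliver);
  }

  // Forward a signaling message to every peer in the room except the sender.
  forward(roomId, fromId, message) {
    const peers = this.rooms.get(roomId) || new Map();
    for (const [peerId, deliver] of peers) {
      if (peerId !== fromId) deliver(fromId, message);
    }
  }
}

// Example: Browser 1's offer (steps 6-7) reaches Browser 2 only.
const hub = new SignalingHub();
const received = [];
hub.join('mlchat', 'browser1', () => {});
hub.join('mlchat', 'browser2', (from, msg) => received.push(`${from}:${msg.type}`));
hub.forward('mlchat', 'browser1', { type: 'offer', sdp: '<sdp>' });
```

In the real server the deliver callback would be a socket.io emit to the peer's connection, and the same forward path carries answers and ICE candidates in the opposite direction.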
The messages sent and received during the RTCPeerConnection establishment phase are exchanged through this server. When the user starts the MLChat system and loads the page, the screen shown in Figure 4 appears. The user can click on the Start Video button to request access to the microphone and the camera. For privacy reasons, the user has to explicitly give the application permission to use these devices to capture media data. The browser uses the navigator.getUserMedia() API provided by WebRTC to get the video and audio streams. Each MediaStream has one input and one output. The input can be the MediaStream generated by navigator.getUserMedia(). The output stream can be the one sent by the other peer.

After the browsers obtain the local media data using the getUserMedia API, they establish a peer connection with each other and exchange their session, network and media information as described before. They use the ICE framework to find the other peer via the STUN server and may use the TURN server as a proxy for transferring media data.

In addition to providing the basic functions of video conferencing using WebRTC, MLChat also aims to provide additional support, especially for users speaking different languages. As a starting point, we implemented a real-time captioning function. Using the HTML5 speech-to-text API provided by the Google Chrome browser, the speech is transcribed and transferred to the other side so that real-time captioning can be displayed along with the video content. Figure 5 shows the captioning under the received video window (the big window). In particular, MLChat allows the user to select the language of his/her preference. In Figure 5 the user selected Hindi in the selection box. The transcribed text from the peer is sent to the server, where it is translated into Hindi and then sent to the browser. During this process, the application uses Microsoft Translator for translation.
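One common way to realize continuous captioning, including the restart workaround for the recognition session limit described in Section II-D, is to restart recognition from the recognizer's onend handler. The sketch below assumes Chrome's prefixed Web Speech API (webkitSpeechRecognition); the startCaptioning function and its parameters are hypothetical names, not MLChat's actual code.

```javascript
// Browser-side sketch: transcribe the local speaker continuously and hand
// each finalized transcript to the app, which sends it over the data channel
// (via the server, for translation) to be shown as a caption on the other side.
function startCaptioning(lang, onFinalText) {
  const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  const recognizer = new Recognition();
  recognizer.lang = lang;            // e.g. 'en-US': the speaker's own language
  recognizer.continuous = true;      // keep listening across utterances
  recognizer.interimResults = false; // deliver only finalized transcripts

  recognizer.onresult = event => {
    for (let i = event.resultIndex; i < event.results.length; i++) {
      if (event.results[i].isFinal) {
        onFinalText(event.results[i][0].transcript);
      }
    }
  };

  // Work around the session limit: whenever recognition stops
  // (e.g. at the session boundary), immediately start it again.
  recognizer.onend = () => recognizer.start();

  recognizer.start();
  return recognizer;
}
```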
As shown in Figure 5, the caption is displayed in Hindi, even though the other user may speak a different language.

Figure 5. MLChat: Translated Subtitles

When entering text messages to send to the other user, MLChat uses Google's Transliteration API to allow the user to type in the script of the language the user has selected, as shown in Figure 6. The transliteration API supports multiple destination languages, such as Arabic, Chinese, Greek, Hindi, Russian, Urdu, Serbian and Persian. If the preferred language selected by the user is not one of the languages supported by the Transliteration API, the text message can be typed in English. Similar to captioning, the text messages are sent to the server and translated into the language of the other side. In the figure, the user types in Hindi, the selected language. When the browser receives a text message from the other side, it is displayed in Hindi, even if the other side uses a different language.

Figure 6. MLChat: Translation and Transliteration for text messages

V. RELATED WORK

Over the decades, we have seen increasing demand for Real-Time Communication (RTC) [9], such as video conferencing. These applications have been used in distance learning, business meetings, and social networking sites. Distance learning allows students to take classes remotely, saving time and travel. In business, these online applications allow employees in different parts of the world to share their ideas and collaborate with each other without the need to be physically present in the same place. Furthermore, social networking sites allow friends and families to get together virtually and share their thoughts. Some popular real-time applications are iChat [1], Google Hangouts [2] and Skype [3]. iChat is a video chat application from Apple, included in the latest Mac Operating System as a built-in plugin. Google Hangouts provides support for video conferencing, desktop sharing and instant messaging. However, similar to iChat, Google Hangouts requires the user to install the Google Talk plugin. Skype also allows instant messaging and video conferencing, but the user has to install the application. We have seen some web applications start using WebRTC [4] for real-time communication, such as Appear.in [10], OpenTok [5], Vline [6], Bistri [11], GoInstant [7], GetOnSIP [12] and 1Click [13].
These applications implement browser-based video conferencing, file sharing and/or instant messaging for a group of people. However, they do not provide multilingual support. A closely related work is the multilingual chat system [14], which supports multilingual chat and focuses on using images automatically generated at the sender side to detect mistranslations. However, it only supports text chat, without a video chat function, and is a traditional stand-alone system rather than a web-based system.

VI. CONCLUSION

In this paper, we developed a video chat application that is plugin-free and platform-independent. The application allows two users in any part of the world to communicate using their own preferred languages. The service-oriented architecture greatly facilitated the design and development of the system.
ACKNOWLEDGMENT

This research work was supported in part by a grant from the Kentucky Science and Engineering Foundation as per Grant Agreement #KSEF-148-502-16-394 with the Kentucky Science and Technology Corporation.

REFERENCES

[1] Apple iChat, https://www.macupdate.com/app/mac/12174/apple-ichat.
[2] Google Hangouts, https://hangouts.google.com/.
[3] Skype, https://www.skype.com/en/.
[4] WebRTC, https://webrtc.org/.
[5] The OpenTok Platform - A Cloud Platform for Embedding, http://www.tokbox.com.
[6] Vline, http://blog.vline.com/.
[7] GoInstant, https://github.com/goinstant.
[8] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A transport protocol for real-time applications," RFC 1889, 1996.
[9] S. Casner and S. Deering, "First IETF internet audiocast," ACM Computer Communication Review, pp. 92-97, July 1992.
[10] Appear.in, https://appear.in/.
[11] Bistri, https://bistri.com/.
[12] GetOnSIP, https://www.onsip.com/getonsip.
[13] 1Click, https://1click.io/.
[14] E. Hosogai, T. Mukai, S. Jung, Y. Kowase, A. Bossard, Y. Xu, M. Ishikawa, and K. Kaneko, "A multilingual chat system with image presentation for detecting mistranslation," Journal of Computing and Information Technology, vol. 19, no. 4, pp. 247-253, 2011.