Automatic Subtitle Generation for Sound in Videos


ISSN: 2454-132X Impact factor: 4.295 (Volume 4, Issue 2) Available online at: www.ijariit.com Automatic Subtitle Generation for Sound in Videos Anshul Ganvir anshulganvir65@gmail.

Anshul Ganvir, Sanket Jagtap, Kunal Pal, Pranita Katole, Mayur Bhalavi
Datta Meghe Institute of Engineering Technology and

ABSTRACT
The last ten years have witnessed the emergence of video content of every kind. Moreover, the appearance of websites dedicated to video has increased the importance the public gives to it. At the same time, some individuals are deaf and cannot always understand the meaning of such videos because no text transcription is available. It is therefore necessary to find solutions that make these media accessible to most people. Several software tools offer utilities to create subtitles for videos, but all of them require extensive participation from the user. Hence, a more automated approach is envisaged. This report describes a way to generate standards-compliant subtitles by using speech recognition. Three stages are distinguished. The first separates the audio from the video and converts the audio to a suitable format if necessary. The second performs recognition of the speech contained in the audio. The final stage generates a subtitle file from the recognition results of the previous step. Directions of implementation are proposed for the three modules. The experimental results were not satisfactory, and adjustments are needed in further work. Decoding parallelization, the use of well-trained models, and punctuation insertion are some of the improvements to be made.

Keywords: Audio Extraction, Java Media Framework, Speech Recognition, Acoustic Model, Subtitle Generation, FFMPEG.

1. INTRODUCTION
This application describes a way to generate standards-compliant subtitles by using speech recognition. The system should take a video file as input and generate a subtitle file as output.
Consequently, the study of automatic subtitle generation appears to be a valid subject of research. Nowadays, much software deals with subtitle generation. Some tools work on copyrighted DVDs by extracting the original subtitle track and converting it to a format recognized by media players, for example ImTOO DVD Subtitle Ripper and Xilisoft DVD Subtitle Ripper. Others allow the user to watch the video and insert subtitles using the timeline of the video, e.g. Subtitle Editor and Subtitle Workshop. There are also subtitle editors that provide facilities to handle subtitle formats and ease changes, for instance Jubler and Gaupol. Nonetheless, software that generates subtitles through speech recognition, without the intervention of an individual, has not been developed. Therefore, it seems necessary to start investigating this concept.

2. LITERATURE SURVEY
In this paper by Prof. Sanjib Das, the authors have collected information about the speech recognition technique, also known as Automatic Speech Recognition (ASR) or computer speech recognition, which is the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program. It has the potential to be an important mode of interaction between humans and computers. Generally, machine recognition of spoken words is carried out by matching the given speech signal against the sequence of words that best matches the given speech sample. The main goal of the speech recognition area is to develop techniques and systems for speech input to the machine. Speech is the primary means of communication between humans. For reasons ranging from technological curiosity about the mechanical realization of human speech capabilities to the desire to automate simple tasks that require human-machine interaction, research in ASR has attracted a great deal of attention for about sixty years. ASR today finds widespread application in tasks that require a human-machine interface, such as automatic call processing. India is a linguistically rich area which has 18 constitutional languages written in 10 different scripts. Hence there is a special need to develop ASR systems for the different native languages.

2018, All Rights Reserved Page 410

[1] Sultana, S.; Akhand, M. A. H.; Das, P. K.; Hafizur Rahman, M. M. investigate Speech-to-Text (STT) conversion using SAPI for the Bangla language. They report that an experimental study of the technique was carried out on a newspaper article and that the recognition rate was approximately 78% on average. Although the achieved performance is promising for STT-related studies, they identified several elements that could be improved to give better accuracy, and they expect the theme of the study to be helpful for Speech-to-Text conversion and similar tasks in other languages.

[2] Moulines, E., in the paper "Text-to-speech algorithms based on FFT synthesis," presents FFT synthesis algorithms for a French text-to-speech system based on diphone concatenation. FFT synthesis techniques are capable of producing high-quality prosodic modifications of natural speech. Several approaches are presented to reduce the distortions due to diphone concatenation.
[3] Martinez, M.; Quilis, A.; Bernstein, J. describe research aiming to develop a text-to-speech converter (TSC) for Spanish that accepts a continuous source of alphanumeric characters (up to 250 words per minute) and produces good-quality, natural Spanish output. Four sets of problems are considered in this work: the hardware structure adopted for real-time operation; the complex control software needed to handle the orthographic input and linguistic programs; the linguistic processing rules; and the parameterization of the Spanish language matched to a TSC. Emphasis is placed on the problems of adapting a general hardware structure to a specific language. [4] By Boris Guenebaut.

After this research and literature survey, we want to design a system that generates subtitles for sound in videos. Reviewing all these papers, we came up with a system in which subtitles are generated automatically from the video, so that the user does not need to download them from a third-party website. [5]

3. OBJECTIVES
The main objective is to generate subtitles automatically, without human intervention. Our objective is to generate subtitles by using an FFMPEG library. Our objective is to produce output that is properly time-synchronized and displays accurate subtitles. Our aim is to make a media player that is more convenient to use.

4. PROBLEM STATEMENT
In the majority of videos, sound holds an important place. It appears essential to make a video with sound understandable for people with auditory problems, and the most natural way to do so lies in the use of subtitles. At present, we have to download subtitles on our own and copy them to the video. Manual subtitle creation, however, is a long and tedious activity and requires the presence of the user.

5. PROJECT DESCRIPTION
Fig (a): Architecture. Flowchart nodes: Start, Media File, Audio Extraction, Audio File, Speech Recognition, Time Synchronization, Subtitle Generation, Subtitle File, End.

Breakdown structure of the AutoSubGen experimental system. A media file (either video or directly audio) is given as input. The audio track is extracted and then read chunk by chunk until the end of the track is reached. Within this loop, three tasks happen successively: speech recognition, time synchronization, and subtitle generation. Finally, a subtitle file is returned as output.
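The breakdown above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the AutoSubGen implementation: the function name run_pipeline and the three callables passed to it (extract_audio, read_chunks, recognize) are hypothetical stand-ins for the modules described in this section.

```python
from typing import Callable, Iterable, List, Tuple

# One recognized chunk: (start_seconds, end_seconds, text).
Cue = Tuple[float, float, str]


def run_pipeline(
    media_path: str,
    extract_audio: Callable[[str], str],
    read_chunks: Callable[[str], Iterable[Tuple[float, float, bytes]]],
    recognize: Callable[[bytes], str],
) -> List[Cue]:
    """Extract the audio track, then process it chunk by chunk.

    All three callables are hypothetical placeholders for the
    audio extraction and speech recognition modules.
    """
    audio_path = extract_audio(media_path)       # audio extraction
    cues: List[Cue] = []
    for start, end, samples in read_chunks(audio_path):
        text = recognize(samples)                # speech recognition
        if text:                                 # skip chunks without speech
            cues.append((start, end, text))      # keep timing with the text
    return cues                                  # input to subtitle generation
```

Passing the stages in as functions keeps the loop independent of any particular recognizer or audio library, which matches the modular breakdown of the architecture.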

A. FFMPEG
FFMPEG libraries are used to do most of our multimedia tasks quickly and easily, such as audio compression, audio/video format conversion, extracting images from a video, and a lot more. They can be used by developers for transcoding, streaming, and playing. FFMPEG is a very stable framework for transcoding video and audio. ffmpeg is a command-line tool that converts audio or video formats. It can also capture and encode in real time from various hardware and software sources, such as a TV capture card. ffplay is a simple media player utilizing SDL and the FFmpeg libraries. ffprobe is a command-line tool that displays media information (text, CSV, XML, JSON); see also Mediainfo. FFmpeg is used by software such as VLC media player, xine, Plex, Kodi, Blender, YouTube, and MPC-HC; it handles video and audio playback in Google Chrome and the Linux version of Firefox. Graphical user interface front-ends for FFmpeg have been developed, including Avanti, XMedia Recode, and Multimedia Xpert. JavaCV, a Java wrapper for OpenCV, includes a supplementary Java wrapper for FFmpeg. FFmpeg is used by ffdshow, LAV Filters, the GStreamer FFmpeg plug-in, Perian, and OpenMAX IL to expand the encoding and decoding capabilities of their respective multimedia platforms.

B. Audio Extraction
The audio extraction routine is expected to return audio in a suitable format that the speech recognition module can use as input. It must handle a defined list of video and audio formats. It has to verify the file given as input so that it can evaluate whether extraction is feasible. The audio track has to be returned in the most reliable format.

C. Speech Recognition
The speech recognition routine is the key part of the system. Indeed, it directly affects performance and the evaluation results. First, it must obtain the type of the input file (film, music, information, home-made, etc.) as often as possible. Then, if the type is provided, an appropriate processing method is chosen.
Otherwise, the routine uses a default configuration.

D. Subtitle Generation
The subtitle generation routine creates a file and writes to it multiple chunks of text, each corresponding to an utterance bounded by silences, together with their respective start and end times. Time synchronization considerations are of prime importance.

6. IMPLEMENTATION METHODOLOGY
A. Audio Extraction
Fig (b): Audio Extraction. Flowchart nodes: FFMPEG, Input Video, FFMPEG Process in PowerShell, Output Audio File.

The activity diagram for audio extraction describes the successive steps the audio extraction module takes to obtain an audio file from a media file given as input. However, we face some limitations. In particular, our system will not be able to insert punctuation, since that involves much more speech analysis and a deeper design. The task was to figure out how to convert the output audio file into a format recognized by FFMPEG. Although we followed guidelines to do so in Java, we did not obtain the expected result.
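As an illustration of the audio extraction step, the sketch below builds an ffmpeg command line and runs it as an external process from Python. It is an assumption-laden example rather than the authors' PowerShell process: the function names are hypothetical, and the choice of 16 kHz mono 16-bit PCM WAV output is an assumption about what a speech recognizer would accept, not something stated in this paper. The flags themselves (-i, -vn, -acodec, -ar, -ac, -y) are standard ffmpeg options.

```python
import subprocess
from typing import List


def build_extract_command(video_path: str, audio_path: str,
                          sample_rate: int = 16000) -> List[str]:
    """Return an ffmpeg command that writes the audio track as WAV.

    The 16 kHz mono PCM target is an assumed recognizer input format.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", video_path,        # input media file
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", str(sample_rate), # audio sample rate in Hz
        "-ac", "1",              # downmix to one channel
        audio_path,
    ]


def extract_audio(video_path: str, audio_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if extraction fails."""
    subprocess.run(build_extract_command(video_path, audio_path), check=True)
```

Calling extract_audio("talk.mp4", "talk.wav") requires the ffmpeg executable to be on the PATH. Keeping the command construction separate from the process call makes the arguments easy to inspect without running ffmpeg.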

B. Subtitle Generation
Fig (c): Subtitle Generation. Flowchart nodes: Current Directory, Audio File Path 1, Subtitle Generation, Bubble Timeout, Initial Silence Timeout, End Silence Timeout, String Builder, If (rec text == null), True, Break, False.

The activity diagram for subtitle generation shows the principal steps of the subtitle generation module. First, it receives a list of Utterance-Speech Time pairs. Then, it traverses the list to the end. In each iteration, the current utterance is checked. If it is a real utterance, we verify whether the current line is empty. If so, the subtitle number is incremented and the start time of the current line is set to the utterance speech time. Then, the utterance is added to the current line. If it is a SIL utterance, we check whether the current line is empty: if not, the end time of the current line is set to the SIL speech time. If the line is empty, we ignore the SIL utterance. Once the list has been traversed, the file is finalized and released to the user.

C. Speech Recognition
Fig (d): Speech Recognition. Flowchart nodes: Prompt for input audio file and Parameters, Check Audio File, Filter Input Media Category, Valid Format, Wrong Format, Select Suitable Model, Throw Exception, Adjust SR Config, Allocate Retained Components, Show Helper, Launch Decode Process, Store Result for Later Usage.

The activity diagram for speech recognition shows the successive steps executed during the speech recognition process. An audio file and some parameters are given as arguments to the module. First, the audio file is checked: if its format is valid, the process continues; otherwise, an exception is thrown and the execution ends. According to the category given as an argument (potentially amateur, movie, news, series, music), the related acoustic and language models are selected. Some adjustments are made to the FFMPEG configuration based on the set parameters. Then, all components used in the ASR process are allocated the resources they require. Finally, the decoding phase takes place, and results are periodically saved to be reused later.

7. CONCLUSION
By using this application, subtitles or a subtitle file will be generated for any English video. This software will minimize the effort of downloading or manually writing the subtitle file. It supports all the MPEG standards. The video and subtitles are synchronized. The user can extract audio in any MPEG standard format.

8. REFERENCES
[1] Santosh K. Gaikwad, Bharti W. Gawali, Pravin Yannawar, "A Review on Speech Recognition Technique," International Journal of Computer Applications ( ), Volume 10, No. 3, November
[2] Penagarikano, M.; Bordel, G., "Speech-to-text translation by a non-word lexical unit based system," Signal Processing and Its Applications, ISSPA '99. Proceedings of the Fifth International Symposium on, vol. 1, pp. 111-114, 1999
[3] Olabe, J. C.; Santos, A.; Martinez, R.; Munoz, E.; Martinez, M.; Quilis, A.; Bernstein, J., "Real time text-to-speech conversion system for Spanish," Acoustics, Speech, and Signal Processing, IEEE International Conference on ICASSP '84, vol. 9, pp. 85-87, Mar
[4] Kavala, R. et al., "A Dynamic Time Warp Integrated Circuit for a 1000-Word Recognition System," IEEE Journal of Solid-State Circuits, vol. SC-22, no. 1, February 1987, pp. 3-14
[5] F.; Moulines, E., "Text-to-speech algorithms based on FFT synthesis," Acoustics, Speech, and Signal Processing, ICASSP-88, 1988 International Conference on, vol. 1, pp. 667-670, Apr
[6] Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, and Joe Woelfel, "Sphinx-4: A flexible open source framework for speech recognition," in SMLI TR, SUN MICROSYSTEMS INC.,
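As a supplementary illustration of the traversal described in Section 6.B, the following minimal sketch turns a list of (utterance, speech time) pairs into subtitle text. Several details are assumptions rather than part of the paper: the SubRip (.srt) layout is chosen as an example output format, the marker "<sil>" stands in for a SIL utterance, the names build_srt and format_time are hypothetical, and a final line that is not followed by a silence is closed at the time of the last utterance.

```python
from typing import List, Tuple

SIL = "<sil>"  # assumed marker for a silence (SIL) utterance


def format_time(seconds: float) -> str:
    """Format seconds as a SubRip timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(utterances: List[Tuple[str, float]]) -> str:
    """Group utterances between silences into numbered subtitle blocks."""
    blocks: List[str] = []
    number = 0              # subtitle number of the current line
    words: List[str] = []   # current line; empty means no open line
    start = 0.0
    last_time = 0.0

    def close(end: float) -> None:
        blocks.append(
            f"{number}\n{format_time(start)} --> {format_time(end)}\n"
            f"{' '.join(words)}\n"
        )

    for text, time in utterances:
        last_time = time
        if text != SIL:
            if not words:       # a new line starts at this utterance
                number += 1
                start = time
            words.append(text)
        elif words:             # silence ends the current line
            close(time)
            words = []
        # a silence while the line is empty is ignored
    if words:                   # assumption: close an unterminated line
        close(last_time)
    return "\n".join(blocks)
```

For example, build_srt([("hello", 1.0), ("world", 1.5), ("<sil>", 2.25)]) yields a single block numbered 1 that runs from 00:00:01,000 to 00:00:02,250 with the text "hello world".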
