D 5.2 Time stretching modules with synchronized multimedia prototype


Karin Wilkins
6 years ago

The time-stretching factor is sent to the audio processing engine in order to change the analysis hop size, and the audio output frame timestamp is calculated accordingly. However, this timestamp alone is not sufficient for proper A/V synchronisation, since it only represents the time at which the audio frame is sent to the audio hardware buffer. For example, if an audio frame is 1024 samples and the sample rate is 44100 Hz, the time resolution is 23.2 ms. At normal playback speed this may be sufficient, but when the playback speed is doubled, the time span between two audio sample points on the media timeline becomes 46.4 ms. Hence, some measure of the fullness of the audio hardware buffer needs to be introduced for precise timing of the output audio samples.

The fullness of the hardware audio buffer is hardware dependent and measuring it is often a complex task, so we propose to approximate the timing of the audio sample by measuring the time difference (Δt) between the moment the audio frame is sent to the hardware buffer and the current time. This value is added to the timestamp of the audio frame that was sent to the audio buffer (T_audio), and the sum is compared with the video frame timestamp (T_video). The display is refreshed with this frame when the video frame timestamp is smaller than or equal to the calculated audio time:

$$T_{video} \le T_{audio} + \Delta t \qquad (9)$$

Another issue is the precision of the timer used to measure Δt. In Windows OS, the best precision that can be achieved with the standard timer is 15 ms, which is hardly enough for a synchronisation application. Hence, Δt is obtained by counting the CPU counter ticks elapsed since the moment the frame was sent to the hardware buffer (ΔCNT_cpu) and dividing by the CPU counter frequency (f_cnt). Since this gives a value on the real playback timeline, it is transposed to the media timeline by dividing it by the time-stretching factor α:

$$\Delta t = \frac{1}{\alpha} \cdot \frac{\Delta CNT_{cpu}}{f_{cnt}} \qquad (10)$$
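The timing scheme described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the EASAIER implementation: the names `AudioClock`, `on_frame_sent`, `should_refresh`, `frame_duration_ms` and `media_span_ms` are hypothetical, timestamps are assumed to be in seconds, and α is assumed to be the ratio of output duration to input duration (so α = 0.5 corresponds to doubled playback speed, consistent with the 46.4 ms figure). Python's `time.perf_counter_ns` stands in for the CPU counter, with a frequency of 10^9 counts per second.

```python
import time


def frame_duration_ms(frame_size, sample_rate_hz):
    """Real-time duration of one audio frame, in milliseconds."""
    return 1000.0 * frame_size / sample_rate_hz


def media_span_ms(frame_size, sample_rate_hz, alpha):
    """Span of the media timeline covered by one output frame.

    Assumption: alpha = output duration / input duration,
    so alpha = 0.5 means playback at double speed.
    """
    return frame_duration_ms(frame_size, sample_rate_hz) / alpha


class AudioClock:
    """Estimates the current audio position on the media timeline."""

    def __init__(self, counter=time.perf_counter_ns,
                 counts_per_second=1_000_000_000):
        self._counter = counter          # high-resolution tick source
        self._freq = counts_per_second   # f_cnt
        self._t_audio = 0.0              # T_audio of the last frame sent
        self._alpha = 1.0
        self._cnt_sent = counter()

    def on_frame_sent(self, t_audio, alpha):
        """Call when a frame is handed to the audio hardware buffer."""
        self._t_audio = t_audio
        self._alpha = alpha
        self._cnt_sent = self._counter()

    def now(self):
        """Return T_audio + delta_t, with delta_t on the media timeline."""
        delta_cnt = self._counter() - self._cnt_sent
        delta_t = (delta_cnt / self._freq) / self._alpha
        return self._t_audio + delta_t


def should_refresh(t_video, clock):
    """Refresh the display when T_video <= T_audio + delta_t."""
    return t_video <= clock.now()


if __name__ == "__main__":
    print(round(frame_duration_ms(1024, 44100), 1))   # frame resolution in ms
    print(round(media_span_ms(1024, 44100, 0.5), 1))  # span at double speed
```

Passing the tick source as a parameter keeps the sketch testable and makes it easy to swap in a platform-specific counter; a video loop would call `should_refresh` for the next pending frame on every iteration rather than sleeping for a fixed frame period.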

5. A/V Synchronisation Evaluation

To measure the quality of the A/V synchronisation algorithm, we compared it with our integration of time-stretching in ffplay on the Linux platform and also with the MPlayer implementation on Linux. MPlayer is a robust, open source video player for Linux based on the ffmpeg libraries. One of the many features of MPlayer is the possibility to change playback speed, but without independent pitch-shifting. Nevertheless, this feature, the robust implementation and the possibility to extract A/V synchronisation information make MPlayer useful for evaluation and comparison with our algorithm.

We compared the video players on the Casino Royale trailer sequence coded in MPEG1 format with video frame dimension 640x352 at frames per second and an audio sample rate of Hz. The video frame lag with respect to audio is presented for 100 video frames from the middle of the sequence, both when playing the video at half of the original speed (Figure 15) and at double the original speed (Figure 16). It can be seen that our adaptive video refresh rate algorithm (marked Easaier on the figures, after the name of the project it was implemented for) clearly outperforms the other two, because of the precise matching of the video timestamp to the audio clock. The video lag of the Easaier time-stretching algorithm is also well below the ITU lip sync error recommendation, with the maximal video lag being 14 ms and the maximal video advance being 13 ms in the case of doubled playback speed. Moreover, the standard deviation of video lag is ms, showing the stability of this solution.

Figure 15. Comparison of video lag for three video player implementations when playback speed is half of original.

Figure 16. Video lag when playback speed is doubled.
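The statistics quoted in this evaluation (maximal video lag, maximal video advance and standard deviation of the lag) could be computed from logged timestamps along the following lines. This is a hedged sketch rather than the evaluation code used for the figures: the function name `lag_summary`, the input layout and the sign convention (positive lag means the video frame is shown after its audio position) are assumptions, and the sample values in the usage lines are made up purely for illustration.

```python
import statistics


def lag_summary(video_ts, audio_clock_ts):
    """Summarise video lag relative to the audio clock.

    video_ts[i]       : media timestamp of displayed frame i, in seconds
    audio_clock_ts[i] : estimated audio time when frame i was displayed

    Assumed sign convention: lag = audio time - video timestamp, so a
    positive value means the video lags the audio and a negative value
    means the video is ahead of the audio.
    """
    if not video_ts or len(video_ts) != len(audio_clock_ts):
        raise ValueError("need two non-empty sequences of equal length")
    lags_ms = [1000.0 * (a - v) for v, a in zip(video_ts, audio_clock_ts)]
    return {
        "max_lag_ms": max(0.0, max(lags_ms)),
        "max_advance_ms": max(0.0, -min(lags_ms)),
        "std_ms": statistics.pstdev(lags_ms),
    }


if __name__ == "__main__":
    # Illustrative values only, not measurements from the report.
    video = [0.00, 0.04, 0.08, 0.12]
    audio = [0.010, 0.030, 0.080, 0.125]
    print(lag_summary(video, audio))
```

The population standard deviation (`statistics.pstdev`) is used here because the logged frames are treated as the whole set being described; `statistics.stdev` would be the choice if they were regarded as a sample of a longer run.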

6. Conclusions

A framework for real-time synchronised video/audio time scaling and pitch shifting was developed for EASAIER. Careful consideration was given to the problems which arise in a real-time context, and novel solutions to these issues have been provided. It was shown how time-scale changes can be achieved in real time with almost imperceptible latency and no transitional artefacts. The approach is based on a modified phase vocoder with optional phase locking and an integrated transient detector which enables high quality transient preservation in real time. The framework presented is the basis for the development of applications which allow for a seamless real-time transition between continually varying, independent video/audio time-scale and pitch-scale parameters.

A novel solution for audio/visual synchronisation, called adaptive video refresh rate, has also been developed. Because synchronisation errors in the foreseen applications will be easier to detect, special focus was given to minimising video lags and advances, resulting in an algorithm that significantly outperforms existing algorithms. This work has also been submitted for review to the IEEE Transactions on Multimedia [23].

The framework and described algorithms have been successfully integrated into the EASAIER client application, as shown in Figure 17.

Figure 17. The EASAIER client application, showing the time scale modification tool along with synchronised video playback. Also shown is the freehand EQ with synchronised spectral display. All dynamic screen objects are synchronised to the time-scaled time-base.


7. References

[1] LaBarbera P. and MacLachlan J., Time-Compressed Speech in Radio Advertising, Journal of Marketing, vol. 43, no. 1, January 1979.
[2] Landone C., Harrop J. and Reiss J., Enabling Access to Sound Archives through Integration, Enrichment and Retrieval: the EASAIER Project, 8th ISMIR Conference, Vienna, 2007.
[3] Barrett S., Duffy C. and Marshalsay K., HOTBED (Handing On Tradition By Electronic Dissemination), Royal Scottish Academy of Music and Drama, Glasgow, Report, March.
[4] Harrigan K., The SPECIAL system: Self-paced education with compressed interactive audio learning, Journal of Research on Computing in Education, vol. 27, no. 3, 1995.
[5] Harrigan K., The SPECIAL system: Searching time-compressed digital video lectures, Journal of Research on Computing in Education, vol. 33, no. 1, 2000.
[6] King P. E. and Behnke R. R., The Effect of Time-Compressed Speech on Comprehension, Interpretive and Short-Term Listening, Human Communication Research, vol. 15, no. 3.
[7] Olson J. S., A Study of the relative effectiveness of verbal and visual augmentation of rate-modified speech in the presentation of technical material, Annual Conference of the Association for Educational Communications and Technology (AECT), Anaheim, CA.
[8] Orr D. B., Friedman H. L. and Williams J. C., Trainability of listening comprehension of speeded discourse, Journal of Educational Psychology, vol. 56, 1965.
[9] Short S., A Comparison of Variable Time-Compressed Speech and Normal Rate Speech Based on Time Spent and Performance in a Course Taught with Self-Instructional Methods, British Journal of Educational Technology, vol. 8, no. 2, 1977.
[10] Li F. C., Gupta A., Sanocki E., He L. and Rui Y., Browsing digital video, ACM CHI 2000, The Hague, Netherlands, April 2000.
[11] Flanagan J. L. and Golden R. M., Phase Vocoder, Bell System Technical Journal, vol. 45.
[12] Dolson M., The phase vocoder: A tutorial, Computer Music Journal, vol. 10, 1986.
[13] Portnoff M., Implementation of the digital phase vocoder using the fast Fourier transform, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 3, June 1976.
[14] Laroche J. and Dolson M., Improved phase vocoder, IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, May 1999.
[15] Bonada J., Automatic technique in frequency domain for near-lossless time-scale modification of audio, Proceedings of the International Computer Music Conference, Berlin, Germany, 2000.
[16] McAulay R. J. and Quatieri T. F., Speech Transformations Based on a Sinusoidal Representation, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 6, August 1986.
[17] Laroche J., Autocorrelation method for high quality time/pitch scaling, IEEE WASPAA, Mohonk, NY.
[18] Verma T. S. and Meng T. H. Y., An analysis/synthesis tool for transient signals, Proc. 16th International Congress on Acoustics/135th Meeting of the Acoustical Society of America, vol. 1, June 1998.
[19] Duxbury C., Davies M. and Sandler M., Improved time-scaling of musical audio using phase locking at transients, 112th AES Convention, Convention Paper 5530, 2002.
[20] Barry D., FitzGerald D. and Coyle E., Drum Source Separation using Percussive Feature Detection and Spectral Modulation, IEE Irish Signals and Systems Conference, Dublin, Ireland, 2005.
[21] International Telecommunication Union, Document 11A/47-E, 13 October 1993.
[22] International Telecommunication Union, Relative Timing of Sound and Vision for Broadcasting, Recommendation ITU-R BT.
[23] Damnjanovic I., Barry D., Dorran D. and Reiss J., Real-time Synchronised Audio/Video Time and Pitch Scale Modification, submitted to IEEE Transactions on Multimedia, September.
