FINAL REVIEW ABOUT APPLIED FFT APPROACH IN PHASE VOCODER TO ACHIEVE TIME/PITCH SCALING

Digital Audio Systems, DESC9115, 2018
Graduate Program in Audio and Acoustics
Sydney School of Architecture, Design and Planning, The University of Sydney

ABSTRACT

Because of the good performance of the phase vocoder and the potential value of time and pitch shifting, this paper attempts to create a phase vocoder APP in MATLAB, intended initially for teaching or research. The main expectation is that the APP can compare sound, window type, spectrogram and waveform in real time, and switch quickly between diagrams (or sounds). At a later stage it can be developed into a stand-alone vocoder or a third-party plug-in. Hence there are three problems to address. First, what is the principle of the phase vocoder? Second, how can it be implemented in MATLAB? Finally, how can these functions be integrated into a GUI? The results show that the GUI works as expected: waveform, spectrogram, sound and window functions can appear simultaneously, and the switching speed is as expected. The drawback is that the overall design of the interface is not very intelligent, so it needs to be improved in the future.

INTRODUCTION

Thanks to its humorous voice, Talking Tom Cat became one of the most popular APPs in the world a few years ago. Even now, the official Talking Tom Cat channel still has more than 7 million followers on YouTube, and each newly uploaded video gets more than 2 million views within two weeks [1]. Moreover, features that play voices back at different rates are increasingly sought after by users, and many people have gained attention by uploading such sounds to Instagram and Facebook. Most strikingly, the music APP Tik Tok reached 120 million daily active users in April 2018 [2]. These cases show the research value of audio production technology, and the key to these technologies is time and pitch scaling.

A sound can be understood as a superposition of sine waves of various frequencies, and the pitch is the main (fundamental) frequency of a sound, so pitch shifting means changing that main frequency. Time and pitch scaling comprise two parts: time scaling (pitch constant) and pitch scaling (time constant). Time scaling keeps the pitch unchanged while the speech rate becomes faster or slower, which means the fundamental frequency stays almost constant. Pitch scaling keeps the speech rate constant but changes the fundamental frequency. Time and pitch scaling can normally be processed either in the time domain or in the frequency domain. Unfortunately, some approaches cannot separate the two effects, since scaling the length of a signal can also change its pitch. The phase vocoder, however, is a way of scaling the length of a signal without affecting its pitch [3].

Because of the good performance of the phase vocoder and the potential value of time and pitch shifting, this paper attempts to create a phase vocoder APP in MATLAB, intended initially for teaching or research. The main aim of this APP is to demonstrate time and pitch scaling, to compare window type, spectrogram and waveform in real time, and to switch quickly between diagrams (or sounds). In the future it can be developed into a stand-alone vocoder or a third-party plug-in. Therefore, there are three main problems to solve: first, to understand the principle of the phase vocoder; then, to implement it in MATLAB; and finally, to integrate these functions into a GUI.

1. THEORETICAL CONSIDERATION

The short-time Fourier transform (STFT) is the basic theory needed to implement a phase vocoder. Its algorithm and principle were introduced in detail in the initial review and the lab report, so this final review only cites the key information.

Short Time Fourier Transform

The Short Time Fourier Transform was proposed to analyse a signal locally, one short windowed segment at a time.
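
Equations 1 to 3 did not survive the extraction of this document. For reference only, standard discrete forms consistent with the text are given below: the STFT with time index t and radian frequency ω, and the Gauss and Hanning windows in the forms documented for MATLAB's gausswin() (shape parameter α, default 2.5) and hann() (window length L = N + 1). These forms are assumptions and may differ from the author's notation.

$$X(t,\omega)=\sum_{m=-\infty}^{\infty} x(m)\,w(m-t)\,e^{-j\omega m} \qquad (1)$$

$$w(n)=e^{-\frac{1}{2}\left(\frac{\alpha\,n}{N/2}\right)^{2}}, \qquad -\frac{N}{2}\le n\le \frac{N}{2} \qquad (2)$$

$$w(n)=0.5\left(1-\cos\frac{2\pi n}{N}\right), \qquad 0\le n\le N,\quad L=N+1 \qquad (3)$$

Substituting (2) into (1) gives the Gabor transform, i.e. the Gaussian-windowed STFT:

$$X_{G}(t,\omega)=\sum_{m=t-N/2}^{t+N/2} x(m)\,e^{-\frac{1}{2}\left(\frac{\alpha\,(m-t)}{N/2}\right)^{2}}\,e^{-j\omega m}$$
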
Moreover, perfect reconstruction can be achieved if the overlapping windows sum to unity [4]. There are many types of window function, including the Hamming, Hanning and Gauss windows [4][5][6]. Figure 1 illustrates how overlapping Hanning windows sum up, so that a signal can be reconstructed from its short-time DFTs [4]. When a Gaussian window is used, the STFT is generally called the Gabor transform [6]. The output of the STFT (equation 1, where t is time and ω is radian frequency) is a function of two independent variables, time and frequency [4], so it gives a time-frequency representation at each sampling point.

Figure 1. Sum of small windows. Figure source: DAFX [4].

Normally, the Hanning, Hamming or Gauss window can be chosen freely. To compare the effect of the window function on the signal, two kinds of window function are adopted in this phase vocoder: the Gauss window (equation 2) and the Hanning window (equation 3, where the window length L is N+1).
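
The perfect-reconstruction condition above can be checked numerically. The following is a minimal NumPy sketch, not the author's MATLAB code, and the function names are hypothetical. It uses the periodic Hann form (length N, without the repeated end point), an assumption made because shifted copies of that form sum exactly to 1 at a hop of N/2 and to the constant 2 at a hop of N/4. The round trip applies the window once at analysis and simply overlap-adds the IFFT frames, so the interior of the signal should come back unchanged.

```python
import numpy as np


def periodic_hann(N):
    """Periodic Hann window of length N: w(n) = 0.5 * (1 - cos(2*pi*n/N)), 0 <= n < N."""
    n = np.arange(N)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / N))


def overlap_sum(w, hop, length):
    """Add copies of w shifted by multiples of hop (the 'sum of small windows').

    Only the steady-state part, away from both ends, is returned.
    """
    total = np.zeros(length + len(w))
    for start in range(0, length, hop):
        total[start:start + len(w)] += w
    return total[len(w):length]


def stft_ola_roundtrip(x, N=1024, hop=512):
    """Windowed FFT -> IFFT -> overlap-add, with no spectral modification."""
    w = periodic_hann(N)
    y = np.zeros(len(x))
    for start in range(0, len(x) - N + 1, hop):
        X = np.fft.rfft(w * x[start:start + N])    # analysis: windowed FFT of one frame
        y[start:start + N] += np.fft.irfft(X, N)   # synthesis: IFFT, then overlap-add
    return y


w = periodic_hann(1024)
print(np.allclose(overlap_sum(w, 512, 8192), 1.0))   # True: hop N/2, windows sum to 1
print(np.allclose(overlap_sum(w, 256, 8192), 2.0))   # True: hop N/4, windows sum to 2
x = np.random.default_rng(0).standard_normal(8192)
y = stft_ola_roundtrip(x)
print(np.allclose(y[1024:-1024], x[1024:-1024]))     # True: interior reconstructed
```

Only the first and last N samples differ, because there fewer than two frames overlap.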

The Window Fourier Transform with a Gaussian function is known as the Gabor transform, which has the smallest time-frequency window. For this reason, the Gabor transform is regarded as the optimal STFT [4][7][8]. The Gabor transform equation is obtained by substituting equation 2 into equation 1.

2. IMPLEMENTATION

The phase vocoder can stretch a sound in the time domain or in the frequency domain by modifying the phase of its short-time spectra. Stretching in the time domain corresponds to time scaling (changing the playback rate), and stretching in the frequency domain corresponds to pitch scaling (changing the fundamental frequency) [8]. The basic principle of the FFT/IFFT approach is that the time-frequency representation can be regarded as a series of overlapping FFTs (possibly with a window function). Since the FFT is invertible, summing the IFFTs of the vertical lines (the individual frames) of this representation reconstructs the sound, i.e. the sound is reconstructed from the time-frequency domain [4]. The signal flow is shown in Figure 2.

Figure 2. Phase vocoder implementation diagram

Based on this principle, implementing the vocoder with the direct FFT/IFFT approach takes three steps. First, use the FFT to compute the instantaneous frequency/amplitude relationship of the signal. Second, resample the FFT blocks. Third, perform an IFFT by taking the inverse Fourier transform of each block and adding up the resulting waveform chunks (overlap-add) [3].

Window function comparison

In step one, the window function can be chosen and its effect on the signal compared; the MATLAB built-in functions gausswin() and hanning() are used for this.

Figure 3. Comparison between Hanning window and Gauss window

Figure 3 shows that the main lobe of the Hanning window is slightly wider than that of the Gauss window, which is equivalent to widening the analysis bandwidth and lowering the frequency resolution. In terms of frequency resolution, therefore, the Gauss window is superior to the Hanning window.
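
The main-lobe comparison of Figure 3 can be reproduced approximately with the sketch below (NumPy, hypothetical function names). It assumes the Hanning window in the form of equation 3 and the Gauss window in the form used by MATLAB's gausswin() with its default shape parameter α = 2.5; the author's exact settings are not stated. Each window's spectrum is zero-padded 64 times so that the -3 dB main-lobe width can be read off in fractions of a bin, and the highest side lobe, the quantity that governs leakage, is taken as the largest value beyond the first spectral minimum.

```python
import numpy as np


def hann_window(L):
    """Hanning window in the form of equation 3: 0.5*(1 - cos(2*pi*n/N)), 0 <= n <= N, L = N + 1."""
    N = L - 1
    n = np.arange(L)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / N))


def gauss_window(L, alpha=2.5):
    """Gauss window in gausswin form: exp(-0.5*(alpha*n/(N/2))**2), -N/2 <= n <= N/2."""
    N = L - 1
    n = np.arange(L) - N / 2.0
    return np.exp(-0.5 * (alpha * n / (N / 2.0)) ** 2)


def lobe_stats(w, pad=64):
    """Return (-3 dB main-lobe width in DFT bins, highest side lobe in dB re main lobe)."""
    mag = np.abs(np.fft.rfft(w, pad * len(w)))    # zero-padded spectrum: pad points per bin
    mag = mag / mag[0]                             # 0 dB at the main-lobe peak (DC)
    k3 = np.argmax(mag < 1.0 / np.sqrt(2.0))       # first point below -3 dB
    width_3db = 2.0 * k3 / pad                     # two-sided width, in bins
    k_min = np.argmax(np.diff(mag) > 0)            # first local minimum ends the main lobe
    side_db = 20.0 * np.log10(mag[k_min:].max())   # highest side lobe beyond that point
    return width_3db, side_db


L = 1025
for name, w in (("Hanning", hann_window(L)), ("Gauss", gauss_window(L))):
    width, side = lobe_stats(w)
    print(f"{name:8s} 3 dB width = {width:.2f} bins, highest side lobe = {side:.1f} dB")
```

Under these assumptions the Gauss window comes out with the narrower -3 dB main lobe, in line with Figure 3. Increasing alpha makes the Gaussian narrower in time and therefore widens its main lobe, so the comparison depends on that parameter.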
The highest side lobe of the Gauss window is lower than that of the Hanning window, so from the point of view of spectral leakage the Gauss window is also better than the Hanning window [6].

FFT/IFFT approach to implement time scaling

Figures 4 and 5 show the time-scaling result of the direct FFT/IFFT approach as a waveform diagram and a spectrogram. The function's inputs are a vector containing a male speech sample ("I am speaking from over here"), a window size of 2048 samples, an analysis hop size of 512 samples and a synthesis hop size of 384 samples. Because the frames are read at the analysis hop and written out at the synthesis hop, the output lasts about 384/512 = 0.75 times as long as the input, with the pitch unchanged.

Figure 4. FFT approaches implement time scaling (waveform diagram)
Figure 5. FFT approaches implement time scaling (spectrogram)
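
The three steps above can be sketched in NumPy as follows. This is a minimal sketch, assuming a Hann window and the standard phase-propagation scheme of direct FFT/IFFT vocoders; it is not the author's MATLAB routine. The names pv_time_stretch and pv_pitch_shift are hypothetical, and their defaults mirror the parameters quoted here (window 2048, analysis hop 512, synthesis hop 384) and in Section 2.3 (window 2048, pitch ratio 0.9, synthesis hop 512). The pitch function assumes the common "stretch, then resample" construction; the paper only says that its pitch function differs from the time-scaling one, so its actual method may differ.

```python
import numpy as np


def hann(L):
    """Hanning window, 0.5*(1 - cos(2*pi*n/(L-1))), n = 0..L-1 (equation 3 form)."""
    n = np.arange(L)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (L - 1)))


def pv_time_stretch(x, win_size=2048, hop_a=512, hop_s=384):
    """Direct FFT/IFFT phase vocoder; output lasts about len(x) * hop_s / hop_a samples."""
    if len(x) < win_size:
        raise ValueError("signal shorter than one window")
    w = hann(win_size)
    n_frames = 1 + (len(x) - win_size) // hop_a
    omega = 2.0 * np.pi * np.arange(win_size // 2 + 1) / win_size  # bin centres, rad/sample
    y = np.zeros((n_frames - 1) * hop_s + win_size)
    wsum = np.zeros_like(y)                        # running sum of w**2 for normalisation
    prev_phase = out_phase = None
    for m in range(n_frames):
        # Step 1: windowed FFT of analysis frame m -> magnitude and phase
        X = np.fft.rfft(w * x[m * hop_a : m * hop_a + win_size])
        mag, phase = np.abs(X), np.angle(X)
        if m == 0:
            out_phase = phase
        else:
            # Step 2: instantaneous frequency from the phase advance over hop_a,
            # then advance the output phase over hop_s instead (the "resampling")
            dphi = phase - prev_phase - omega * hop_a
            dphi = np.mod(dphi + np.pi, 2.0 * np.pi) - np.pi   # wrap to [-pi, pi)
            out_phase = out_phase + (omega + dphi / hop_a) * hop_s
        prev_phase = phase
        # Step 3: IFFT and overlap-add at the synthesis hop
        frame = np.fft.irfft(mag * np.exp(1j * out_phase), win_size)
        start = m * hop_s
        y[start : start + win_size] += w * frame
        wsum[start : start + win_size] += w ** 2
    return y / np.where(wsum > 1e-3, wsum, 1.0)


def pv_pitch_shift(x, ratio=0.9, win_size=2048, hop_s=512):
    """Pitch scaling by `ratio` (< 1 lowers the pitch) with roughly the original duration."""
    hop_a = int(round(hop_s / ratio))              # assumed: scale the duration by `ratio`
    stretched = pv_time_stretch(x, win_size, hop_a, hop_s)
    # Resample back to len(x) by linear interpolation; this scales every frequency by ~ratio
    t_new = np.linspace(0.0, len(stretched) - 1.0, len(x))
    return np.interp(t_new, np.arange(len(stretched)), stretched)


fs = 16000
t = np.arange(3 * fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)     # 3 s test tone standing in for the speech sample
y = pv_time_stretch(x)                # about 0.75 of the duration, pitch unchanged
z = pv_pitch_shift(x, ratio=0.9)      # same length, pitch lowered by roughly 10 %
print(len(x), len(y), len(z))         # 48000 36224 48000
```

The wrap to [-π, π) in step 2 lets each bin follow a component that lies slightly off its centre frequency. With win_size = 2048 and hop_a = 512 the phase advance is unambiguous for components within win_size / (2 · hop_a) = 2 bins of a bin centre, which covers the Hann main lobe.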

2.3. FFT/IFFT to Implement Pitch Scaling

Figures 6 and 7 show the pitch-scaling result of the direct FFT/IFFT approach as a waveform diagram and a spectrogram. This function is different from the time-scaling function: its inputs are the same male speech vector ("I am speaking from over here"), a window size of 2048 samples, a pitch ratio of 0.9 and a synthesis hop size of 512 samples.

Figure 6. FFT approaches implement pitch scaling (waveform diagram)
Figure 7. FFT approaches implement pitch scaling (spectrogram)

2.4. GUI development

Figure 8 shows the design idea of the phase vocoder APP. The GUI is divided into 4 portions.

Portion 1 is the window function display area, controlled from portion 3: once a control button in portion 3 is clicked, the selected window function is displayed in portion 1.

Portion 2 is the waveform and spectrogram display area, controlled from portion 4: once a control button in portion 4 is clicked, the waveform and spectrogram of the selected time/pitch scaling are displayed separately in portion 2.

Portion 3 is the window function control area, designed with 3 buttons: one for each window function and one that compares the two.

Portion 4 is the main control area, used to control the diagrams in portion 2. It has two groups, one for the Hanning window function and the other for the Gauss window function. Each group contains time-scaling and pitch-scaling buttons. Time scaling is controlled by the playback rate, with three buttons for 1.5, 2 and 3 times the playback rate; pitch scaling has three buttons for low, normal and high pitch ratio. In addition, the time/pitch-scaled sound can be heard at the same time.

Figure 8. GUI development

3. EVALUATION

3.1. Test APP Run

Run the function file Pitch_shift.m. As shown in Figure 9, when the app is started the system reports errors in some parts of the program; these errors do not affect the test results, so the author plans to correct them in detail later. After clicking OK, the app interface appears (Figure 10).

Figure 9. Error when starting to run

Figure 10. Interface of this APP

3.2. Test APP Window Function Test

The three diagrams in Figure 11 evaluate the operation of the different window functions, and the results show that they work correctly.

Figure 11. Interface of this APP

3.3. Test APP Pitch Scaling Test

Figure 12 evaluates the operation of pitch scaling with the Hanning and Gauss window functions, and the results show that it works correctly. All the scaled sounds were as expected (all demos will be attached), and the low pitch ratio sounded even more muffled. In the listening test there was no obvious difference between the sounds produced with the two window functions; the only difference is at the end of the sound, where the Gauss window version is slightly louder than the Hanning window version.

Figure 12. Low pitch ratio of pitch scaling with Hanning window function (left) and Gauss window function (right)

3.4. Test APP Time Scaling Test

Figure 13 evaluates the operation of time scaling with the Hanning and Gauss window functions, and the results show that it works correctly. As in the pitch test, all sounds were as expected (all demos will be attached), and the scaled sounds were sped up considerably. In the listening test the sounds produced with the two window functions could not be told apart, although their waveforms and spectrograms appear somewhat different.

Figure 13. Time scaling (rate is 1.5) with Hanning window function (left) and Gauss window function (right)

3.5. Test APP Switching Time

Figure 14. Calculation time for each button

To test the speed of each operation, the built-in functions tic and toc were added to every function to measure its run time. As shown in Figure 14, the operation time is displayed above each button. Strangely, the operation time of Rate (Gauss) is almost 50 times that of Rate (Hanning).
One possibility is that the Rate (Gauss) algorithm uses the Gaussian-window (Gaboret) approach: its basic principle is similar to the FFT algorithm, but it has an extra calculation step, the generation of the gaborets [4]. The increase in time is therefore likely to be spent generating the gaborets.

4. CONCLUSION

The results show that the GUI works as expected: waveform, spectrogram, sound and window functions can appear simultaneously. The switching speed is also satisfactory (except for the Rate (Gauss) group), which should give users a good experience. The disadvantage is that the overall design of the interface is not very intelligent, so it needs to be improved in the future.

5. REFERENCES

[1] "Talking Tom", YouTube, [Online]. Available: [Accessed: 07-Jun-2018].
[2] "Tik Tok (app)", En.wikipedia.org, [Online]. Available: [Accessed: 08-Jun-2018].
[3] "Audio time stretching and pitch scaling", En.wikipedia.org, [Online]. Available: scaling. [Accessed: 08-Jun-2018].
[4] U. Zölzer, DAFX, 2nd ed. Chichester: Wiley, 2011, pp. Chapter 7 page
[5] "Gabor transform", En.wikipedia.org, [Online]. Available:
[6] "Window function", En.wikipedia.org, [Online]. Available:
[7] J. Proakis and D. Manolakis, Digital Signal Processing, 4th ed. Harlow, Essex: Pearson, 2014, pp. chapter 4, page
[8] "The Fast Fourier Transform Algorithm", YouTube, [Online]. Available:
