Automatic Video Segmentation for Czech TV Broadcast Transcription


Josef Chaloupka
Laboratory of Computer Speech Processing, Institute of Information Technology and Electronics
Technical University of Liberec, Liberec, Czech Republic
Josef.chaloupka@tul.cz

Abstract. This contribution deals with the testing and selection of methods and algorithms for automatic video image (visual signal) segmentation. The aim of this work has been to select a reliable and fast method for visual signal segmentation that can be used in a system for audio-visual automatic TV broadcast transcription.

Keywords: visual signal segmentation, shot change detection, audio-visual TV broadcast processing

I. INTRODUCTION

Video recordings of TV broadcasts (or movies) are made up of single short shots; a short shot is a sequence of consecutive video images in which the visual information changes very little. In the task of automatic visual signal segmentation, we want to find the time boundaries (shot changes) of these short shots. The parts between the time boundaries are visual segments, which should correspond to the original short shots. Two subsequent visual segments are most often separated by a shot cut, or a special video effect such as a dissolve, fade or wipe is used. Only shot cuts and dissolves occur in our TV broadcast video recordings (see Fig. 1); therefore, shot change detection has been solved for these two cases in our work.

At present, the segmentation of a visual signal is used mainly for the indexing of audio-visual data (video recordings). It is possible to represent each indexed visual segment by a key frame and to work only with the key frames in very large audio-visual databases. Visual signal segmentation is further used in the research area of modern voice technologies, where visual information from visual segments can improve the recognition rate of information from audio signals.
We have used visual segmentation in our very large vocabulary speech recognition system for the automatic transcription of Czech broadcasts (the ATT system, Audio Transcription Toolkit). This system has been under development in our Laboratory of Computer Speech Processing at the Technical University of Liberec since 2004 [1]. The ATT uses our own recognizer for automatic continuous speech recognition of the Czech language. The speech recognizer works with a vocabulary of more than […] Czech words and with a Czech language model.

The principle of the ATT is as follows (see Fig. 2). In the first step, an input audio signal is preprocessed and parameterized. The parameterized signal is segmented into smaller audio segments containing homogeneous information only, e.g. only one speaker speaks, some music plays, silence and so on. Audio segmentation is based on speech/non-speech and speaker-change detection [2]. In the next step, speech is recognized in the audio speech segments. Our speech recognizer is based on Hidden Markov Models (HMM) of single Czech phonemes. It is possible to use Speaker Independent (SI) HMM or Gender Dependent (GD) HMM. The recognition accuracy is higher with GD HMM than with SI HMM, and the accuracy of gender identification from voice exceeds 99% in our systems; we are, therefore, using GD HMM in our speech recognizer.

Another strategy for improving recognition accuracy is to use Speaker Adaptive (SA) HMM, where the HMM are adapted for specific speakers [3]. Some TV broadcasters, politicians and other well-known people appear in TV broadcasts very often; it is, therefore, possible to adapt HMM for them. The speakers are then identified and verified before speech recognition in the ATT. Speaker verification is used after speaker identification because it is necessary to find out whether the identified speaker has been identified correctly. However, a speaker may have been identified correctly and still sometimes fail verification.
Therefore, we have modified the ATT system [4] to use audio-visual speaker identification instead of audio identification and subsequent verification. The visual signal is first segmented and the visual segments are compared with the audio segments according to their time boundaries. It can be assumed that the information from an audio segment is similar (or equivalent) to the information in a visual segment if the time boundaries of the audio segment are more or less the same as the time boundaries of the visual segment. For example, a speaker is filmed by the camera and recorded in both the audio and the visual segment. It is, therefore, possible to identify the speaker from the audio signal and to compare this information with the visual speaker identification, where the speaker is identified from the face detected in a video image from the visual segment. Several different methods and algorithms exist for the visual speaker identification based on


the image of a human face at present. The method based on Principal Component Analysis (PCA) is most often used for visual speaker (face) identification [5]. Audio speaker verification is not used when the identified speaker is the same in the visual segment and in the relevant audio segment. The recognition accuracy is slightly higher when the module for audio-visual identification is incorporated into the ATT.

One of the important tasks for audio-visual speaker identification in the system for automatic TV broadcast transcription is to find a reliable algorithm for visual signal segmentation, so several visual segmentation methods have been tested in this work. There are many algorithms and methods for visual signal segmentation [9], but only the methods and algorithms that were used in this work are described here.

Figure 1. Video effects: shot cut, dissolve (x-axis: frame no.)

Figure 2. The principle of the ATT system

II. VISUAL SIGNAL SEGMENTATION METHODS

A. Pixel Based Methods

The simplest method for visual signal segmentation is based on the comparison of corresponding pixels in two successive video images (1, 2) [6]. Alternatively, we can determine how likely it is for the corresponding pixels to be identical, or follow the development of changes in the color values of the corresponding pixels over several consecutive video images.

$$\left| \sum_{x=1}^{X} \sum_{y=1}^{Y} f_i(x, y) - \sum_{x=1}^{X} \sum_{y=1}^{Y} f_{i+\Delta}(x, y) \right| > T, \qquad (1)$$

where $f_i(x, y)$ is the image function of video image $i$ from the video signal; the values of the image function can be an RGB color vector, brightness or another color component from some color space. $X$ and $Y$ are the dimensions (width and height) of the video image, $\Delta$ is the shift to the next video image (usually 1) and $T$ is a threshold used for setting the boundaries of the visual segments.

The resulting value from (1) may be close to zero even when comparing two completely different video images, so it is better to directly compare pixels in two successive video images (2):

$$\sum_{x=1}^{X} \sum_{y=1}^{Y} \left| f_i(x, y) - f_{i+\Delta}(x, y) \right| > T. \qquad (2)$$

The criterion for creating a boundary of a visual segment is, in almost all visual segmentation methods, based on comparing the resulting value with a threshold $T$. Such methods are quite reliable in the case of static video shots. However, over-segmentation can be expected if an object (or the camera) moves. Over-segmentation means that one expected visual segment is split into multiple visual segments.

B. Histogram Based Methods

One of the few global information sources that characterize an image is the image histogram. An image histogram is created from the frequency of single color values in single pixels. We get one image histogram for each video image, but different video images may have the same image

histogram, which may be a disadvantage of this method. However, because the image histogram captures global information, histogram-based visual segmentation methods [7] are more robust than pixel-based methods, mainly against slight shaking or turning of the camera or movement of an object in the video shot. The simplest calculation uses the difference of the values in the image histograms of two successive video images:

$$\sum_{v=1}^{V} \left| H_i(v) - H_{i+\Delta}(v) \right| > T, \qquad (3)$$

where $H_i(v)$ are the values of the histogram of video image $i$, $V$ is the number of histogram values, $\Delta$ is the shift to the next video image and $T$ is a threshold. An image histogram is computed from the brightness or RGB values of single pixels, but other color components from different color spaces (HSV: Hue, Saturation, Value; YCbCr; ...) are used for the computation of image histograms in some other works. These color components may carry different amounts of information in each video image; therefore, some visual segmentation methods set different weights for each color component. Other histogram-based segmentation methods search for the intersections of image histograms, or normalize the image histograms for a better comparison.

C. Feature Based Methods

A video image (a matrix of pixels) can be described by features that characterize it well. The video signal is segmented with the help of these features: a boundary of a visual segment is set if the features from two successive video images differ. The color values of pixels or the values of the image histogram can serve as features, but feature-based visual segmentation methods work with a smaller group of features. Useful features are, for example, image moments, edges, parameters from statistical methods, coefficients of 2D transforms and so on. We have developed a feature-based visual segmentation method where the features are extracted from the coefficients of the 2D Discrete Cosine Transform (DCT) [4].
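The pixel-based criteria (1) and (2) and the histogram criterion (3) can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: it assumes grayscale frames stored as nested lists of integers in the range 0 to 255, and the function names (`sum_difference`, `pixel_difference`, `histogram_difference`, `detect_shot_changes`) are hypothetical.

```python
from typing import Callable, List

Frame = List[List[int]]  # assumed representation: rows of grayscale values 0..255


def sum_difference(a: Frame, b: Frame) -> int:
    """Criterion (1): absolute difference between the two pixel sums."""
    return abs(sum(map(sum, a)) - sum(map(sum, b)))


def pixel_difference(a: Frame, b: Frame) -> int:
    """Criterion (2): sum of absolute differences of corresponding pixels."""
    return sum(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def histogram(frame: Frame, bins: int = 256) -> List[int]:
    """Count how often each gray value occurs in the frame."""
    hist = [0] * bins
    for row in frame:
        for value in row:
            hist[value] += 1
    return hist


def histogram_difference(a: Frame, b: Frame) -> int:
    """Criterion (3): sum of absolute differences of histogram values."""
    return sum(abs(x - y) for x, y in zip(histogram(a), histogram(b)))


def detect_shot_changes(frames: List[Frame],
                        measure: Callable[[Frame, Frame], int],
                        threshold: float,
                        shift: int = 1) -> List[int]:
    """Return every index i whose measure against frame i + shift exceeds the threshold."""
    return [i for i in range(len(frames) - shift)
            if measure(frames[i], frames[i + shift]) > threshold]
```

For two inverted 2 x 2 checkerboards, `[[0, 255], [255, 0]]` and `[[255, 0], [0, 255]]`, both criterion (1) and criterion (3) evaluate to 0 although every pixel differs, while criterion (2) gives 1020. This mirrors the weaknesses of the sum-based and histogram-based comparisons noted in the text.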
The principle of the feature extraction and the subsequent visual segmentation is as follows. A video image is transformed by the 2D DCT:

$$F(u, v) = \frac{2}{\sqrt{XY}}\, c(u)\, c(v) \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} f(x, y) \cos\frac{(2x+1)u\pi}{2X} \cos\frac{(2y+1)v\pi}{2Y}, \qquad (4)$$

where $F(u, v)$ are the DCT coefficients of the transformed image $f(x, y)$ and $c$ are the coefficients

$$c(k) = \begin{cases} 1/\sqrt{2} & \text{for } k = 0, \\ 1 & \text{otherwise.} \end{cases} \qquad (5)$$

The computation of the DCT is relatively fast because there is an algorithm very similar to the FFT (Fast Fourier Transform) algorithm for computing the DFT (Discrete Fourier Transform). The squares of the DCT coefficients are computed:

$$E(u, v) = F(u, v)^2. \qquad (6)$$

The $P$ highest coefficients $E$ are selected as features. In the last step, the distance between the features from two successive video images is computed (7). The criterion for shot change detection is very similar to the one in the previous methods, where a specific threshold is used:

$$\sum_{p=1}^{P} \left| VP_i(p) - VP_{i+\Delta}(p) \right| > T, \qquad (7)$$

where $VP_i(p)$ is the feature vector of video image $i$. The advantage of our method is that the distance between two similar successive video images is several times lower than for two different ones. The disadvantage is that the first visual feature $VP_i(1)$ is usually several times higher than the others. Therefore, it is good to normalize the feature vector; the logarithm of the feature vector is used in our algorithm.

D. Block Based Methods

A block-based visual segmentation method divides a video image into several parts (blocks). Visual information is compared in the same blocks of two successive video images, and the results of the comparison from the single blocks are then evaluated. We can assign different weights to the single blocks, so that each block contributes to the result of the evaluation in a different way. The methods described above can be used for the evaluation within blocks: the features, the image histograms or the sums of pixel values are computed and compared only in blocks of the video images.
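The DCT feature extraction of Section C (transform, squared coefficients, selection of the P highest values, logarithm, distance) can be sketched as follows. This is only a sketch under stated assumptions: the transform is a naive orthonormal 2D DCT-II written for clarity rather than speed, frames are nested lists indexed as `frame[y][x]`, the logarithm is taken as `log(1 + E)` to avoid the logarithm of zero (the paper does not say how zeros are handled), and all function names are hypothetical.

```python
import math
from typing import List

Frame = List[List[float]]  # assumed representation: frame[y][x]


def dct_2d(frame: Frame) -> Frame:
    """Naive orthonormal 2D DCT-II; the result is indexed as out[v][u]."""
    height, width = len(frame), len(frame[0])
    scale = 2.0 / math.sqrt(width * height)

    def c(k: int) -> float:
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    out = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            total = 0.0
            for y in range(height):
                cos_y = math.cos((2 * y + 1) * v * math.pi / (2 * height))
                for x in range(width):
                    cos_x = math.cos((2 * x + 1) * u * math.pi / (2 * width))
                    total += frame[y][x] * cos_x * cos_y
            out[v][u] = scale * c(u) * c(v) * total
    return out


def dct_features(frame: Frame, p: int) -> List[float]:
    """Log-scaled values of the p highest squared DCT coefficients."""
    energies = sorted((f * f for row in dct_2d(frame) for f in row), reverse=True)
    # log1p is an assumption made here so that zero energies stay finite.
    return [math.log1p(e) for e in energies[:p]]


def feature_distance(fa: List[float], fb: List[float]) -> float:
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(fa, fb))
```

A shot change between two frames would then be reported when `feature_distance(dct_features(a, P), dct_features(b, P))` exceeds the threshold. For real frame sizes a fast DCT routine from a numerical library would replace `dct_2d`.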
Another possibility is to compute statistical values such as the variance and the mean in single blocks [8]; the function $L(i, b)$ is then calculated for two corresponding blocks:

$$L(i, b) = \frac{\left[ \dfrac{\sigma_{i,b} + \sigma_{i+\Delta,b}}{2} + \left( \dfrac{\mu_{i,b} - \mu_{i+\Delta,b}}{2} \right)^{2} \right]^{2}}{\sigma_{i,b}\, \sigma_{i+\Delta,b}}, \qquad (8)$$

where $\sigma_{i,b}$ is the variance and $\mu_{i,b}$ is the mean of the color values in block $b$ of video image $i$. The value $L(i, b)$ is then compared with a threshold $T_b$: $L(i, b) = 1$ if it is higher than the threshold, otherwise $L(i, b) = 0$. The criterion for shot change detection is:

$$\sum_{b=1}^{B} w_b\, L(i, b) > T, \qquad (9)$$

where $B$ is the number of blocks, $T$ is the segmentation threshold and $w_b$ is the weight of block $b$. For block-based segmentation methods, it is necessary to properly determine the number and distribution of the blocks in the video image. It is easier to adjust the threshold value $T$ correctly if we choose a suitable number of blocks and a suitable distribution of them in the video image.
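The block statistics of (8) and the weighted vote of (9) can be sketched as below. The sketch assumes grayscale frames as nested lists whose dimensions are divisible by the block grid, uses the population variance, and guards the denominator with a small `eps` for perfectly flat blocks; that guard and all function names are assumptions made for illustration, not part of the paper.

```python
from typing import List, Optional, Sequence, Tuple

Frame = List[List[float]]  # assumed representation: frame[y][x]


def split_blocks(frame: Frame, rows: int, cols: int) -> List[List[float]]:
    """Cut the frame into rows x cols blocks, each returned as a flat pixel list."""
    bh, bw = len(frame) // rows, len(frame[0]) // cols
    return [[frame[y][x]
             for y in range(r * bh, (r + 1) * bh)
             for x in range(c * bw, (c + 1) * bw)]
            for r in range(rows) for c in range(cols)]


def mean_and_variance(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and population variance of the pixel values of one block."""
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


def likelihood_ratio(block_a: Sequence[float], block_b: Sequence[float],
                     eps: float = 1e-9) -> float:
    """Equation (8) for two corresponding blocks; equals 1 for identical blocks."""
    mean_a, var_a = mean_and_variance(block_a)
    mean_b, var_b = mean_and_variance(block_b)
    numerator = ((var_a + var_b) / 2 + ((mean_a - mean_b) / 2) ** 2) ** 2
    return numerator / max(var_a * var_b, eps)  # eps guard is an assumption


def block_score(a: Frame, b: Frame, rows: int, cols: int, block_threshold: float,
                weights: Optional[Sequence[float]] = None) -> float:
    """Left side of criterion (9): weighted count of blocks that changed."""
    blocks_a, blocks_b = split_blocks(a, rows, cols), split_blocks(b, rows, cols)
    if weights is None:
        weights = [1.0] * len(blocks_a)
    return sum(w for w, x, y in zip(weights, blocks_a, blocks_b)
               if likelihood_ratio(x, y) > block_threshold)
```

Because identical blocks with non-zero variance give a ratio of exactly 1, the per-block threshold would have to be chosen above 1 in this formulation; a shot change is reported when `block_score` exceeds the segmentation threshold.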


III. EXPERIMENTS

Seven methods for visual signal segmentation have been tested in our experiments: M1_PB, a pixel-based segmentation method (equation 1); M2_PB, a pixel-based segmentation method (equation 2); M3_HB, a histogram-based segmentation method (equation 3); M4_FB, the DCT feature-based segmentation method; M5_BB, a block-based segmentation method where the video image is divided into 4 blocks (2 x 2) and the segmentation value is computed by equation […]; M6_BB, a block-based segmentation method with 16 blocks (4 x 4), where the segmentation value is computed as in the previous method; and M7_BB, a block-based segmentation method with 16 blocks (4 x 4), computed by equations 8 and 9.

Figure 3. The sample of a short visual signal (x-axis: frame no. with time stamps, hour:minute:second)

A database with almost […] hours of video recordings of Czech TV broadcast news has been used for automatic threshold setting for the single segmentation methods. The boundaries (shot changes) of the visual segments were found and set manually in this database. For each segmentation method, the threshold was then varied over an interval, and the resulting threshold was selected according to the highest value of the Visual Segmentation Rate, VSR (10). In the next step, the segmentation methods (with the set thresholds) were tested using another database, COST278 [10], which includes video recordings of TV broadcasts from 3 Czech TV stations (1 hour). The shot changes of the visual segments were also set manually in this database; the reliability (VSR) of the single segmentation methods can, therefore, be evaluated.

$$VSR = \frac{CS - IS}{S} \cdot 100\ [\%], \qquad (10)$$

where $S$ is the number of all manually selected shot changes, $CS$ is the number of correctly recognized shot changes and $IS$ is the number of shot changes that were detected in addition.
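The evaluation by (10) and the threshold search described above can be sketched as follows. The paper does not state how a detected shot change is matched to a manually marked one, so the `tolerance` parameter (maximum frame offset for a match) is an assumption, as are the function names and the representation of shot changes as frame indices.

```python
from typing import List, Sequence


def vsr(reference: Sequence[int], detected: Sequence[int], tolerance: int = 0) -> float:
    """Visual Segmentation Rate (10) in percent: 100 * (CS - IS) / S."""
    if not reference:
        raise ValueError("reference must contain at least one shot change")
    remaining = sorted(reference)
    correct = inserted = 0
    for frame in sorted(detected):
        # Match each detection to at most one unused reference shot change.
        match = next((r for r in remaining if abs(r - frame) <= tolerance), None)
        if match is None:
            inserted += 1
        else:
            remaining.remove(match)
            correct += 1
    return 100.0 * (correct - inserted) / len(reference)


def select_threshold(scores: Sequence[float], reference: Sequence[int],
                     candidates: Sequence[float], tolerance: int = 0) -> float:
    """Pick the candidate threshold with the highest VSR on annotated data.

    scores[i] is the segmentation value between frame i and the next frame.
    """
    def rate(threshold: float) -> float:
        detected: List[int] = [i for i, s in enumerate(scores) if s > threshold]
        return vsr(reference, detected, tolerance)

    return max(candidates, key=rate)
```

With manually marked shot changes at frames 10, 20, 30 and 40 and detections at 10, 20, 31 and 55, a tolerance of one frame gives CS = 3 and IS = 1, so the VSR is 50%. Note that the VSR becomes negative when the additional detections outnumber the correct ones.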
Figure 4. The results of the visual signal segmentation methods: a) M1_PB, b) M2_PB, c) M3_HB, d) M4_FB, e) M5_BB, f) M6_BB, g) M7_BB

The best tested method has been the DCT feature-based visual segmentation method, with a VSR of 7,3%. The results of the visual segmentation methods for a short visual signal (Figure 3) are shown in Figure 4. The y-axis (the segmentation value) is normalized to the interval from 0 to 100 for a better comparison, but another interval is used for the last method because its segmentation value is almost zero between two similar video images and highly variable at a detected shot change. Only one video effect (dissolve) has been used in our video recordings, but it has not been necessary to prepare a special algorithm for shot change detection in this effect because all segmentation methods detected the shot change in the dissolve video effect.

IV. CONCLUSION AND FUTURE WORK

Several methods for visual signal segmentation have been tested in this work: two pixel-based, one histogram-based, one feature-based and three block-based visual segmentation methods were used in the experiments. The best result has been reached by the feature-based visual segmentation method, where the visual features are computed from the DCT coefficients. The advantage of this method is that it is possible to find a robust segmentation threshold for reliable visual signal segmentation. The DCT feature-based segmentation method is used in our experiments with our system for automatic TV broadcast transcription. In the near future, we would like to improve our visual segmentation method and add algorithms for the visual segmentation task in video recordings that contain several special video effects (fade, wipe, ...).

ACKNOWLEDGMENT

The research reported in this paper was partly supported by the grant (TACR) no. TA0004 and by the Czech Science Foundation (GACR) through the project no. 0/08/0707.

REFERENCES

[1] Nouza, J., Nejedlová, D., Žďánský, J., Kolorenč, J.: Very Large Vocabulary Speech Recognition System for Automatic Transcription of Czech Broadcast. In: Proc. of ICSLP 2004, Jeju Island, Korea, ISSN 5-44x, 2004
[2] Žďánský, J.: BISEG: An Efficient Speaker-based Segmentation Technique. In: International Conference on Spoken Language Processing, Interspeech 2006 ICSLP 2006, September 2006, Pittsburgh, USA, pp. 8-85
[3] Červa, P., Nouza, J., Silovský, J.: Two-Step Unsupervised Speaker Adaptation Based on Speaker and Gender Recognition and HMM Combination. In: International Conference on Spoken Language Processing, Interspeech 2006 ICSLP 2006, September 2006, Pittsburgh, USA
[4] Chaloupka, J.: Visual Speech Segmentation and Speaker Recognition for Transcription of TV News. In: Proc. of International Conference on Spoken Language Processing, Interspeech 2006 ICSLP 2006, Pittsburgh, USA, 2006
[5] Chan, L. H., Salleh, S. H., Ting, C. M.: PCA, LDA and neural network for face identification. In: IEEE Conference on Industrial Electronics and Applications, ICIEA 2009, 2009
[6] Nagasaka, A., Tanaka, Y.: Automatic video indexing and full-video search for object appearances. In: IFIP Working Conference on Visual Database Systems, Hungary, pp. 3-7, 1991
[7] Tonomura, Y., Abe, S.: Content oriented visual interface using video icons for visual database systems. In: Journal of Visual Languages and Computing, 1990
[8] Kasturi, R., Jain, R. C.: Dynamic Vision. In: Computer Vision: Principles, editors: Kasturi and Jain, IEEE Computer Society Press, USA, 1991
[9] Lefèvre, S., Holler, J., Vincent, N.: A review of real-time segmentation of uncompressed video sequences for content-based search and retrieval. In: Real-Time Imaging 9, 2003
[10] Vandecatseye et al.: The COST278 pan-European broadcast news database. In: Proc. of LREC 2004, Lisbon, Portugal, May 2004
