OPTIMAL VIDEO SUMMARY GENERATION AND ENCODING

+ Zhu Li, * Aggelos Katsaggelos and + Bhavan Gandhi
(ICIP Draft v.2, -2-23)
+ Multimedia Communication Research Lab, Motorola Labs, Schaumburg
* Department of Electrical & Computer Engineering, Northwestern University, Evanston

ABSTRACT

Video summary work originates from view time constraints; a shorter version of the original video sequence is desirable in some applications. Our work is based on visual significance analysis, where visual significance is a function of the visual features of interest over time. Once the frames in a sequence are labeled with visual significance, one-pass or two-pass frame selection algorithms are proposed to generate a perceptually optimal video summary according to the visual significance function. We also propose several optimal bit allocation strategies for encoding the video summary.

1. INTRODUCTION

The demand for video summary work originates from security, military and entertainment applications. A view time constraint is essential in many applications. In a military situation, a battalion commander may request a 5 minutes summary of Company B's fighting at a mountain pass for the last 2 minutes; or in a security application, a supervisor wants to see a 2 minutes summary of what happened at airport gate B2, from camera #22, in the last minutes; or in an entertainment scenario a viewer just wants to spend 1 hour to view a 1.5 hour movie. A shorter version of the original video sequence also has the benefit of requiring fewer bits to encode. This also makes it attractive as a rate control mechanism.

Examples of recent work in video summary are [1][2][3][4]. These approaches use various visual features and their statistics to identify video shot boundaries and determine key frames by thresholding and clustering. Generally they are computationally complex and require a two-pass approach. They do not address the temporal resolution within a video shot gracefully. Our approach is based on visual significance analysis.
Visual significance is a real function defined for each frame to indicate how visually important a frame is for the user to comprehend the events in a video sequence. We obtain our visual significance value from the color layout change from the previous frame and the motion activity within the frame. With some predetermined threshold T on visual significance, we can divide the video sequence into video shots and pick the very first frame of each shot as its key frame. Within each video shot, a cumulative visual significance function is computed from the visual significance. Then a step size Δ is picked to select the temporal enhancement frames. Note that the visual significance and cumulative visual significance functions are causal functions.

The paper is organized as follows: Section 2 covers visual significance analysis; Section 3, video summary frame selection with the one-pass approach; Section 4, video summary generation with the two-pass approach; and Section 5, optimal coding.

2. VISUAL SIGNIFICANCE ANALYSIS

The Visual Significance (VS) function characterizes the importance of frames in the original video sequence in helping users understand the events. VS for frame k is computed as a weighted sum of color change and motion:

$$VS_k = w_1 VS_k^{CLD} + w_2 VS_k^{MAD} \qquad (1)$$

where the color change $VS_k^{CLD}$ is the L2 distance between the MPEG-7 [5] Color Layout Descriptors (CLD) [6] of frames k and k-1:

$$VS_k^{CLD} = L_2\,\mathrm{Dist}(CLD_k, CLD_{k-1}) \qquad (2)$$

in which the CLD for frame k is the 8x8 DCT transform coefficients of the 8x8 pixel thumbnail image downsampled from the original frame. The DCT in effect performs a Principal Component Analysis (PCA) to capture most of the energy of the frame, while the L2 distance provides a metric for how much has changed from frame to frame.
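As an illustration of (2), the following Python sketch computes a CLD-style descriptor and the resulting color-change VS. It is a minimal sketch under stated assumptions, not the authors' implementation: frames are plain lists of luminance rows with at least 8 pixels in each dimension, the thumbnail is produced by block averaging, the DCT is orthonormal, and the quantization and zigzag scan of the MPEG-7 CLD are omitted. The function names thumbnail, dct2, cld and vs_cld are hypothetical.

```python
import math

N = 8  # thumbnail size used for the descriptor


def thumbnail(frame, n=N):
    """Downsample a 2-D luminance frame (list of rows) to n x n by block averaging.

    Assumes the frame has at least n rows and n columns.
    """
    h, w = len(frame), len(frame[0])
    thumb = []
    for by in range(n):
        y0, y1 = by * h // n, (by + 1) * h // n
        row = []
        for bx in range(n):
            x0, x1 = bx * w // n, (bx + 1) * w // n
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        thumb.append(row)
    return thumb


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block."""
    n = len(block)
    scale = [math.sqrt(1.0 / n)] + [math.sqrt(2.0 / n)] * (n - 1)
    cos = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
           for u in range(n)]
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += block[y][x] * cos[u][y] * cos[v][x]
            out[u][v] = scale[u] * scale[v] * s
    return out


def cld(frame):
    """Simplified descriptor: the 64 DCT coefficients of the 8x8 thumbnail."""
    return [coef for row in dct2(thumbnail(frame)) for coef in row]


def vs_cld(prev_frame, frame):
    """Color-change VS as in (2): L2 distance between consecutive descriptors."""
    a, b = cld(prev_frame), cld(frame)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Because the DCT in this sketch is orthonormal, the distance between two descriptors equals the L2 distance between the two thumbnails, which makes the function easy to sanity-check on flat test frames.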
The motion activity $VS_k^{MAD}$ is based on MPEG-7's Motion Activity Descriptor (MAD) [7]:

$$VS_k^{MAD} = \mathrm{VAR}(|MV_k|) \qquad (3)$$

which is the variance of the magnitude of the motion vectors (MV) in frame k. Note that if frame k is a scene change frame, then the motion VS is set to zero. Examples of visual significance analysis are shown in Figure 1.

[Figure 1. Visual Significance Analysis Examples. Panels: visual significance, bond sequence; visual significance, foreman sequence.]

When the VS plot is compared with the actual video sequence, it is clear that VS captures the importance of frames well. For example, in the foreman sequence, the spikes before frame 2 correspond to the head shaking, the spikes around frame 25 correspond to the hand waving, and those around 3 to the camera panning. In another sequence, bond, which contains several scene cuts and high motion activity scenes, VS captures them accurately as well. Note that the motion VS is not required in visual significance analysis, since the CLD based VS captures most of the information and is computationally much simpler. We find that only in dense motion sequences like soccer, where there is not much color layout change, is the MAD based VS needed.

By a thresholding operation on the VS function, the original video sequence is divided into video shots, i.e. perceptually consistent groups of video frames. In our experiment we picked threshold T = 2. for the VS function. With this threshold the foreman sequence is a single shot and the bond sequence is broken into 6 shots. Note that T can be changed to suit different applications. Within a video shot, the Cumulative Visual Significance (CVS) is computed as the summation of the VS of frames:

$$CVS_k = \begin{cases} \sum_{i=n}^{k} VS_i, & \text{if } VS_k < T \\ 0,\; n = k, & \text{if } VS_k \ge T \end{cases} \qquad (4)$$

In (4), n is the last key frame and T is the threshold. Examples of CVS functions are shown in Figure 2 for the foreman and bond sequences.

[Figure 2. Cumulative Visual Significance Analysis. Panels: cumulative visual significance, foreman sequence; cumulative visual significance, bond sequence.]

Note that the slope of the CVS function indicates the rate of visual significance at frame time.
The steeper the slope, the more information is contained. We will design our frame selection mechanism accordingly in the next section.

3. FRAME SELECTION: ONE PASS APPROACH

The VS and CVS functions give us an approximation of how visually significant events or information are distributed among frames. With this information, the video sequence is broken up into video shots, and for each video shot, a video summary consisting of a key frame and multiple temporal enhancement frames is selected. The overall operation is illustrated in the pseudo code listed below, in which CVS[k] is the cumulative VS value for frame k, Δ is the step size, and F_SEL[k] is the frame selection indicator:

    V = 0; k = 0;
    While (k < n) {
        If (CVS[k] == 0)
            F_SEL[k] = K;    #key frame
            V = 0;
        Else if (CVS[k] - V > Δ)
            F_SEL[k] = E;    #enh frame
            V = CVS[k];
        Else
            F_SEL[k] = 0;    #skip
        k = k + 1;
    }

The start of a video shot is identified by thresholding the VS function. The first frame in a video shot is always picked as the key frame for the shot; note that key frame here may have a different meaning from some previous works. The total number of key frames of a video sequence is determined by the threshold T, which is pre-determined by experiment. A stepping operation is performed on the CVS function to select temporal enhancement frames for the shot. The increment of CVS since the last frame selection time is computed; if it is greater than the step size Δ, then the frame is selected as a temporal enhancement frame. Examples of video summary frame selection for the bond and foreman sequences are illustrated in Figure 3 and Figure 4.

[Figure 3. Video Summary Frame Selection For Bond. Plot of cumulative visual significance with key frames and enh frames marked; Threshold T = 2., Stepsize Delta = 2.]

[Figure 4. Video Summary Frame Selection For Foreman. Plot of cumulative visual significance with key frames and enh frames marked; Threshold T = 2., Stepsize Delta = 2.]

Note that more frames are selected on steeper slopes of the CVS function. This is reasonable since more information is conveyed at those instances. For a one-pass solution, an initial Δ will be picked by the system, and according to view-time and bandwidth constraints, it can be adjusted on the fly.

4. FRAME SELECTION: TWO-PASS APPROACH

For a two-pass solution, we have the luxury of fully analyzing the sequence before generating its summary. For a given sequence of N frames, if we want to reduce it to a video summary of M frames, then the step size can be determined precisely as the total visual significance conveyed divided by the total frames available for temporal enhancement:

$$\Delta = \frac{\sum_{j=1}^{K} CVS_{n_j}}{M - K} \qquad (5)$$

where K is the total number of shots in the sequence, thus we will have K key frames and M-K total enhancement frames; and $n_j$ is the last frame in video shot j.
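The one-pass stepping rule of Section 3 and the step size of (5) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the first frame of the sequence is assumed to start a shot, shot starts are detected directly from VS >= T rather than from CVS == 0, and the sum in (4) is taken to start at the key frame itself.

```python
def cumulative_vs(vs, T):
    """Return (cvs, is_key) for a list of per-frame VS values, following (4).

    Assumption: frame 0 always starts a shot. At a shot start CVS is reset to 0
    and the running sum restarts at the key frame's own VS.
    """
    cvs, is_key, running = [], [], 0.0
    for k, v in enumerate(vs):
        if k == 0 or v >= T:
            is_key.append(True)
            running = v
            cvs.append(0.0)
        else:
            is_key.append(False)
            running += v
            cvs.append(running)
    return cvs, is_key


def select_one_pass(vs, T, delta):
    """Label each frame 'K' (key), 'E' (enhancement) or '0' (skipped)."""
    cvs, is_key = cumulative_vs(vs, T)
    selection, last = [], 0.0
    for c, key in zip(cvs, is_key):
        if key:
            selection.append("K")
            last = 0.0
        elif c - last > delta:  # CVS grew by more than one step since last pick
            selection.append("E")
            last = c
        else:
            selection.append("0")
    return selection


def two_pass_step(vs, T, M):
    """Step size from (5): total CVS at the last frame of each shot over M - K."""
    cvs, is_key = cumulative_vs(vs, T)
    K = sum(is_key)
    if M <= K:
        raise ValueError("summary length M must exceed the number of key frames K")
    n = len(vs)
    total = sum(cvs[k] for k in range(n) if k == n - 1 or is_key[k + 1])
    return total / (M - K)


if __name__ == "__main__":
    vs = [0.0, 0.5, 0.7, 0.4, 3.0, 0.2, 0.9, 0.3]  # made-up VS values
    print(select_one_pass(vs, T=2.0, delta=1.0))
    print(two_pass_step(vs, T=2.0, M=5))
```

Since a frame is picked only when the CVS increment strictly exceeds the step, the number of frames selected with the step from (5) can differ from M; in this sketch (5) is best read as a target rather than a guarantee.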
Note that Δ can also serve as a temporal distortion metric for the video summary.

5. OPTIMAL CODING OF THE VIDEO SUMMARY

For any given video coder, the optimal coding of the video summary becomes a frame level optimal bit allocation problem. This needs accurate modeling of the frame level Rate-Distortion function. Attempts to model the rate
distortion curve of a video coder are listed in [8][9]. However, these approaches suffer from inaccuracy when employed in a real video coder. In [10] a numerical solution is proposed, but the computation cost involved is quite high. We try to solve this problem with a compromise between a pure model based approach and an operational rate-distortion optimization approach. An analytical distortion model is assumed for both Intra and Inter frames, with parameters providing extra freedom to fit the actual operational Rate-Distortion (R-D) curve of the chosen coder. For intra frames, we assume:

$$d_i = f(b_i; X_i) \qquad (6)$$

and for inter frames, we assume:

$$d_j = g(b_j; Y_j) \qquad (7)$$

where the distortion of intra frame i or inter frame j is a function of the bits spent, $b_i$ or $b_j$, and the coding complexity parameter vector, $X_i$ or $Y_j$. Functions f and g are convex functions and can be of inverse proportional or exponential type. Parameter vectors X and Y are solved from encoding an inter and an intra frame with different QPs. A recent work [11] showed that fast R-D operating point estimation is possible by computing the ratio of zeros in the transform coefficients. Note also that the activity of the coding complexity parameter vectors X and Y is strongly correlated with that of the VS function, so we only update X and Y after the VS activity rises above a certain threshold. Then we formulate the optimal coding problem as:

$$\arg\min_{\{b_i, b_j\}} \sum_{i=1}^{K} f(b_i; X_i) + \sum_{j=K+1}^{M} g(b_j; Y_j), \quad \text{sub. to: } \sum_{i=1}^{K} b_i + \sum_{j=K+1}^{M} b_j = B \qquad (8)$$

which minimizes the average distortion among video summary frames under the bit budget constraint B. Since functions f and g are convex and differentiable, by introducing a Lagrangian multiplier λ we can reduce (8) to an unconstrained problem of minimizing J:

$$J(\lambda) = \sum_{i=1}^{K} \left[ f(b_i; X_i) + \lambda b_i \right] + \sum_{j=K+1}^{M} \left[ g(b_j; Y_j) + \lambda b_j \right] \qquad (9)$$

To satisfy the first order condition, set the derivatives of J to zero:

$$\frac{\partial J}{\partial b_i} = f'(b_i; X_i) + \lambda = 0, \qquad \frac{\partial J}{\partial b_j} = g'(b_j; Y_j) + \lambda = 0 \qquad (10)$$

Solving (10) together with the total bit budget constraint in (8), we can find the optimal bit allocation $\{b_i, b_j\}$.
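As a worked instance of (8) to (10), assume the inverse proportional model f(b; X) = X / b and g(b; Y) = Y / b with scalar parameters. This is one of the model types mentioned above, but it is an assumption made here for illustration, not the fitted model of any coder. Under it, (10) gives X_i / b_i^2 = λ, so b_i = sqrt(X_i / λ), and the budget constraint in (8) fixes λ. The Python sketch below computes this closed form; allocate_bits is a hypothetical name.

```python
import math


def allocate_bits(intra_x, inter_y, budget):
    """Closed-form bit allocation for the assumed model d = c / b.

    intra_x: scalar complexity parameters X_i of the intra frames
    inter_y: scalar complexity parameters Y_j of the inter frames
    budget:  total bit budget B
    Returns (intra_bits, inter_bits, lam), where lam is the multiplier in (9).
    """
    params = list(intra_x) + list(inter_y)
    if budget <= 0 or any(p <= 0 for p in params):
        raise ValueError("budget and complexity parameters must be positive")
    roots = [math.sqrt(p) for p in params]
    # From (10): c / b^2 = lam, so b = sqrt(c) / sqrt(lam); the bits must sum to B.
    scale = budget / sum(roots)  # this is 1 / sqrt(lam)
    bits = [scale * r for r in roots]
    lam = 1.0 / (scale * scale)
    k = len(intra_x)
    return bits[:k], bits[k:], lam


if __name__ == "__main__":
    # Made-up parameters: one intra frame and two inter frames.
    intra_bits, inter_bits, lam = allocate_bits([400.0], [100.0, 100.0], 4000.0)
    print(intra_bits, inter_bits, lam)
```

With this model each frame receives bits in proportion to the square root of its complexity parameter, so a frame that is four times as complex gets twice the bits rather than four times.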
If constant distortion is desired, an alternative formulation is: for a given bit budget B, find the minimum constant distortion d and the optimal bit allocation $\{b_i, b_j\}$ that will meet the bit budget:

$$\arg\min\, d, \quad \text{sub. to: } \sum_{i=1}^{K} f^{-1}(d; X_i) + \sum_{j=K+1}^{M} g^{-1}(d; Y_j) = B \qquad (11)$$

where $f^{-1}$ and $g^{-1}$ give the bits needed to reach distortion d. This can be achieved by bisection search. Work is underway to implement this method with the H.263 reference software TMN8 [12]. Various models and coding complexity parameter estimation methods are under investigation.

6. SPATIAL AND TEMPORAL DISTORTION TRADE OFF

When encoding a sequence of pictures, we need to consider both temporal and spatial distortions. If we define Δ as the temporal distortion metric and the average MSE distortion D as the spatial distortion metric for a video sequence, then the Rate-Distortion function becomes a convex surface defined over the temporal and spatial distortion axes:

$$R = h(\Delta, D) \qquad (12)$$

For any given R = B, admissible (Δ, D) pairs lie on a curve in the Δ-D plane. To find an optimal solution, a utility function for perceptual quality is defined as:

$$Q = q(\Delta, D) \qquad (13)$$
which is concave over the Δ-D plane. An optimal solution can be found by solving the constrained optimization problem:

$$\arg\max_{\{\Delta, D\}} q(\Delta, D), \quad \text{sub. to: } h(\Delta, D) = B \qquad (14)$$

Once Δ is picked by solving (14), along with the VS function threshold T, we can select frames for the optimal video summary, and by solving (10) with the bit budget constraint in (8) we can find the optimal bit allocation among video summary frames. Work is underway to find parameterized analytic forms of functions h and q. For q, subjective evaluation experiments need to be set up.

7. EXPERIMENTAL RESULTS

We encoded video summaries of the foreman and bond sequences with fixed threshold T = 2. and various step sizes Δ. The results show that the perceptual quality of the video summary degrades gracefully as Δ increases. This observation is subjective, but an analytical explanation can be found from Figure 3 and Figure 4. For any given Δ, we always pick enhancement frames for the summary after a fixed amount of visual information, or significant events, has been conveyed. This ensures the temporal smoothness of the video summary as compared with some previous clustering approaches. It is difficult to find an objective temporal distortion measurement, but a plot of the bits spent encoding the sequence as a function of Δ may offer some intuitive clue, as shown in Figure 5.

[Figure 5. Bits as a Function of Δ, Bond Sequence. Plot of bits expenditure as a function of delta.]

Some sample video summaries with different Δ are available on the web for evaluation at: http://www.ece.northwestern.edu/~zl/research/cme3/demo.html.

8. CONCLUSION AND FUTURE WORK

In this paper we demonstrated a visual significance analysis based video summary generation method. It is computationally simple and can operate in one-pass and two-pass scenarios. The summary generated by this method achieves graceful degradation of perceptual quality with view time reduction and can be used in a variety of applications in security, military and entertainment situations. Work is underway to find an optimal encoding strategy for the video summary; a good compromise between temporal resolution and spatial PSNR quality; and subjective/objective metrics for video summary quality evaluation.

9. REFERENCES

[1] Y. Wang, Z. Liu and J-C. Huang, "Multimedia Content Analysis", IEEE Signal Processing Magazine, vol. 17, November 2000.
[2] H. Sundaram and S-F. Chang, "Constrained Utility Maximization for Generating Visual Skims", IEEE Workshop on Content-Based Access of Image & Video Library, 2001.
[3] A. Girgensohn and J. Boreczky, "Time-Constrained Key frame Selection Technique", Proc. of IEEE Multimedia Computing and Systems (ICMCS), 1999.
[4] Y. Gong and X. Liu, "Video Summarization with Minimal Visual Content Redundancies", Proc. of Int'l Conference on Image Processing, 2001.
[5] "Information Technology - Multimedia Content Description Interface - Part 3: Visual", ISO/IEC FCD 15938-3.
[6] B. S. Manjunath, J-R. Ohm, V. V. Vasudevan and A. Yamada, "Color and Texture Descriptors", IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, June 2001.
[7] S. Jeannin and A. Divakaran, "MPEG-7 Visual Motion Descriptors", IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, June 2001.
[8] T. Chiang and Y-Q. Zhang, "A New Rate Control Scheme Using Quadratic Rate Distortion Model", IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, February 1997.
[9] H-M. Hang and J-J. Chen, "Source Model for Transform Video Coder and Its Application - Part I: Fundamental Theory", IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, April 1997.
[10] L-J. Lin and A. Ortega, "Bit-Rate Control Using Piecewise Approximation Rate-Distortion Characteristics", IEEE Trans. on Circuits and Systems for Video Technology, vol. 8, August 1998.
[11] Z. He, J. Cai and C-W. Chen, "Joint Source Channel Rate-Distortion Analysis for Adaptive Mode Selection and Rate Control in Wireless Video Coding", IEEE Trans. on Circuits and Systems for Video Technology, vol. 12, June 2002.
[12] University of British Columbia, "H.263 Reference Software Model: TMN8".