Linear Hashtable Motion Estimation Algorithm for Distributed Video Processing

Yunsong Wu 1,2, Graham Megson 2

1 Jiangxi Science & Technology Normal University, Nanchang, China
2 School of Systems Engineering, Reading University, Reading, UK
{sryw, g.m.megson}@rdg.ac.uk

Abstract. This paper presents a parallel Linear Hashtable Motion Estimation Algorithm (LHMEA). Most parallel video compression algorithms focus on the Group of Pictures (GOP). Based on the LHMEA we proposed earlier [1][2], we developed a parallel motion estimation algorithm that works inside the frame. We divide each reference frame into equally sized regions, which are processed in parallel to increase the encoding speed significantly. The theoretical and practical speedups of the parallel LHMEA as a function of the number of PCs in the cluster are compared and discussed. Motion Vectors (MVs) are generated by the first-pass LHMEA and used as predictors for the second-pass Hexagonal Search (HEXBS) motion estimation, which searches only a small number of Macroblocks (MBs). We evaluated a distributed parallel implementation of the LHMEA of the Two-Pass Algorithm (TPA) for real-time video compression.

Keywords: Parallel Algorithm, Distributed Computing, Distributed Video Coding, Linear Hashtable, Motion Estimation

1 Introduction

In this paper, a parallel Linear Hashtable Motion Estimation Algorithm (LHMEA) is proposed for the Two-Pass Algorithm (TPA), which consists of LHMEA and Hexagonal Search (HEXBS) and predicts motion vectors for inter-coding [1]. The objective of the motion estimation scheme is to achieve good-quality video with very low computational time and a low transmission rate. It is hard to find software solutions that efficiently code high-quality video in real time or faster. We propose and evaluate distributed parallel implementations of the LHMEA of TPA on clusters of workstations, using real-time video compression as the test case. The paper discusses how distributed video coding on load-balanced multiprocessor systems can help, especially with motion estimation. The software platform used is the Parallel Virtual Machine (PVM) programming model together with C.
The effect of load balancing on performance is also discussed. This paper is concerned only with Block Matching Algorithms (BMA), which are widely used in MPEG, MPEG4, and H.263. In BMA, each block of the current video frame is compared to blocks in the reference frame in the vicinity of its corresponding position. It is highly desirable to speed up compression without introducing serious distortion. HEXBS is a widely accepted fast motion estimation algorithm [2]. The Linear Algorithm and Hexagonal Search Based Two-Pass Algorithm (LAHSBTPA) proposed previously improves on HEXBS in compression rate, PSNR, and compression time.

In recent years, many fast algorithms have been proposed to reduce the exhaustive checking of candidate Motion Vectors (MVs), such as Two Level Search (TS), Two Dimensional Logarithmic Search (DLS), and Subsample Search (SS) [3], the Three-Step Search (TSS), Four-Step Search (4SS) [4], Block-Based Gradient Descent Search (BBGDS) [5], and Diamond Search (DS) [6], [7]. A very interesting method called HEXBS has been proposed by Zhu, Lin, and Chau [8]. Fast BMAs increase the search speed by taking the nature of most real-world sequences into account while maintaining a prediction quality comparable to Full Search. Most of these algorithms are easily trapped in a non-optimal solution; the LHMEA-based TPA addresses this problem.

Normally, video encoders are very effective at reducing the size of the video stream, but the processing cost is very high for high-quality video sequences. Although hardware video encoders are available, they have severe restrictions (resolution, coding options, etc.). A more flexible choice is to use distributed parallel implementations. Processing video with high-performance distributed computing has great potential, but studies in this field have mainly concentrated on Group of Pictures (GOP) separation. To take advantage of the potential processing power of distributed computing, we use distributed programming techniques based on message passing. We use PVM because free implementations are available and it is a widely accepted standard.

Various image and video compression algorithms use parallel processing. The approaches used can largely be divided into four areas.
The first is the use of special-purpose architectures designed specifically for image and video compression, for example an array of DSP chips that implements a version of MPEG. The second is the use of VLSI techniques. The third is algorithm driven: the structure of the compression algorithm dictates the architecture, e.g. pyramid algorithms. The fourth is the implementation of algorithms on high-performance parallel computers.

The TPA we proposed achieved the best result among all the algorithms in the survey. To improve the result and the speed further, the most suitable and easiest way is to implement the algorithm in parallel on high-performance parallel computers. In the first-pass coding of TPA, LHMEA is employed to search all Macroblocks (MBs) in the picture. Because LHMEA is based on a linear algorithm, which fully exploits the computer's addition-optimized structure, it is easy to parallelize. Meanwhile, HEXBS is one of the best motion estimation methods to date. The new method proposed in this paper achieves the best results so far among all the algorithms investigated in compression rate, time, and PSNR.

The contributions of this paper are:

1. The TPA achieves the best results among all investigated BMA algorithms.
2. An improved hashtable is used in video encoding.
3. The parallel algorithm improves the LHMEA of TPA. It achieves better compression speed than the original TPA, with a fair compression rate and PSNR.
4. A workload balancing algorithm is implemented in the hashtable image encoding process.

The rest of the paper is organized as follows. Section 2 continues with an introduction to the improved LHMEA and TPA. The proposed parallel algorithm and its implementation for LHMEA are introduced in Section 2.2, together with experimental results comparing the parallel hashtable with the original. The paper concludes in Section 3 with some remarks and discussion about the proposed scheme.

2 Sequential and Parallel Implementation of Linear Hashtable Motion Estimation Algorithm (LHMEA)

Our method attempts to predict the motion vectors using a linear algorithm [1][2]. It brings the hashtable method into video compression. After investigating most traditional and cutting-edge motion estimation methods, we adopt the latest optimization criteria and prediction search methods. Spatial MB information is used to generate the best motion vectors [8].

We designed a vector hashtable lookup matching algorithm, which is a more efficient way to perform an exhaustive search: it considers every macroblock in the search window. This block-matching algorithm processes each block to set up a hashtable, that is, a dictionary in which keys are mapped to array positions by a hash function. We try to find as few variables as possible to represent the whole macroblock. Through some preprocessing steps, integral projections are calculated for each macroblock. These projections differ from algorithm to algorithm, and the aim is to find the best projection function. The algorithm presented here has two projections. One is the massive projection, a scalar denoting the sum of all pixels in the macroblock, which is also the DC coefficient of the macroblock. The other is A in y = Ax + B (y is luminance, x is location). Each of these projections is mathematically related to the error metric.
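To make the hashtable idea concrete, the following C sketch computes the massive projection of one macroblock and maps it to an array position. It is only an illustration: the names mass_projection, bucket_of, and NUM_BUCKETS, the row-major 8-bit luminance layout, and the modulo hash are assumptions and are not taken from the codec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB_SIZE 16          /* macroblock edge length in pixels */
#define NUM_BUCKETS 1024u   /* hypothetical table size, not from the paper */

/* Massive projection: sum of all luminance samples of the
 * MB_SIZE x MB_SIZE block whose top-left corner is (x0, y0).
 * 'stride' is the frame width in pixels (row-major layout assumed). */
uint32_t mass_projection(const uint8_t *luma, size_t stride,
                         size_t x0, size_t y0)
{
    uint32_t sum = 0;
    assert(luma != NULL);
    for (size_t y = 0; y < MB_SIZE; y++) {
        for (size_t x = 0; x < MB_SIZE; x++) {
            sum += luma[(y0 + y) * stride + (x0 + x)];
        }
    }
    return sum;
}

/* Hypothetical hash function: map a projection value to an array position. */
uint32_t bucket_of(uint32_t projection)
{
    return projection % NUM_BUCKETS;
}
```

Blocks with equal projections land in the same bucket, so a lookup only has to examine the candidates stored there instead of every position in the search window.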
Under certain conditions, the value of the projection indicates whether or not the candidate macroblock will do better than the best-so-far match.

2.1 Sequential Implementation of LHMEA

This section gives the pseudo code and the theoretical and practical time calculations for the linear hashtable motion estimation algorithm. The algorithm is used in the pre-computation part of an MPEG codec and is implemented both sequentially and in parallel. In the program, we use a polynomial approximation y = mx + c, where y is the luminance value of a pixel and x is the location of the pixel in the macroblock. The pixels are scanned from left to right and from top to bottom. The coefficients m and c are what we are looking for, as shown in the figure below. In the function y = f(x), x runs from 0 to 255 in a 16*16-pixel macroblock, and y = f(x) = mx + c.
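The line fit can be sketched in C as follows, assuming an ordinary least-squares fit of the luminance samples against their raster-scan index x = 0..255. The names LineFit and fit_macroblock and the row-major 8-bit frame layout are hypothetical; the sketch uses floating point for clarity, whereas a codec would more likely use scaled integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB_SIZE 16
#define MB_PIXELS (MB_SIZE * MB_SIZE)   /* N = 256 samples per macroblock */

typedef struct {
    double m;   /* slope of the fitted line */
    double c;   /* intercept of the fitted line */
} LineFit;

/* Least-squares fit y = m*x + c over one macroblock, where x is the
 * raster-scan index (left to right, top to bottom) and y the luminance. */
LineFit fit_macroblock(const uint8_t *luma, size_t stride,
                       size_t x0, size_t y0)
{
    /* Sums that depend only on N have closed forms for x = 0..N-1. */
    const double n = MB_PIXELS;
    const double sum_x = n * (n - 1.0) / 2.0;
    const double sum_xx = (n - 1.0) * n * (2.0 * n - 1.0) / 6.0;
    const double denom = n * sum_xx - sum_x * sum_x;
    double sum_y = 0.0, sum_xy = 0.0;
    size_t idx = 0;
    LineFit fit;

    assert(luma != NULL);
    for (size_t row = 0; row < MB_SIZE; row++) {
        for (size_t col = 0; col < MB_SIZE; col++) {
            double v = luma[(y0 + row) * stride + (x0 + col)];
            sum_y += v;
            sum_xy += (double)idx * v;
            idx++;
        }
    }
    fit.m = (n * sum_xy - sum_x * sum_y) / denom;
    fit.c = (sum_y * sum_xx - sum_x * sum_xy) / denom;
    return fit;
}
```

Only sum_y and sum_xy depend on the pixel data, so the per-block cost is one pass over the 256 samples.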

\[ m = \frac{N \sum (x \cdot y) - \sum x \sum y}{N \sum (x \cdot x) - \sum x \sum x}, \qquad c = \frac{\sum y \sum (x \cdot x) - \sum x \sum (x \cdot y)}{N \sum (x \cdot x) - \sum x \sum x} \]

Fig. 1. Linear algorithm for discrete algorithm

Here N is the number of pixels in the macroblock. We now state the pseudo code to calculate the hashtable function. The function that implements the algorithm is encapsulated in MyMotionSearchPreComputation(MpegFrame *frame).

Sequential Code:

    Step 1: if ((psearchalg == VECTOR_HASH || psearchalg == HEX_VECTOR_HASH || psearchalg == HEX)
                && (frame->type == I_FRAME || frame->type == P_FRAME))
    Step 2: EnterTimeCount()
    Step 3 (parallel): if (IsSetUpHashTablePVM) { call PVM Motion Search PreComputation; }
            else {
    Step 4:   if (hashtablesearchtype) InitMHashTable();
    Step 5:   for (y = 0; y < Fsize_y - 16; y++) {
    Step 6:     for (x = 0; x < Fsize_x - 16; x++)
    Step 7:     { call different hashtable setup functions }
    Step 8:     if (use HashTable) { add M, C, X, Y into hashtable } } } }
    Step 9: LeaveTimeCount();

Each MB is transformed by the hash function into hash coefficients, and the resulting M, C, X, Y are added to the hashtable.

In previous methods, the block that best matches a predefined block in the current frame was found by computing the SAD (the difference between the current block and the reference block). In the Linear Hashtable Motion Estimation Algorithm (LHMEA), we only need to compare two coefficients of two blocks. In existing methods, the MB moves inside a search window centered on the position of the current block in the current frame. In LHMEA, the coefficients move inside the hashtable to find matched blocks. If the coefficients are powerful enough to hold enough information about the MB, the motion estimators should be accurate, so LHMEA can increase both speed and accuracy to a large extent.

From the pseudo code above, we can obtain the theoretical calculation time. The precomputation complexity is the function

\[ T(n, s, \varphi) = n \times s \times \varphi \qquad (3) \]

The variables in the function are:

1. n: the number of reference frames, which is also the number of I and P frames.
2. s: the frame size, which in the program is (width_frame) × (length_frame) (4)
3. φ: the complexity of calculating the hash function per macroblock, which is explained later.

So the complexity of the linear hashtable motion estimation algorithm depends on these three variables. To demonstrate the complexity of the calculation, the following example is given. The video sequence used in the experiment is a YUV (352x240 pixels) test sequence known as the Flower Garden sequence. There are 15 frames in the original sequence, which is subsampled to the 4:1:1 format in the YUV color space. The video sequence was divided into several sections (GOPs), each of which contained 15 frames to be compressed and a reference frame. A frame pattern of IBBPBBIBBPBBPBB was used. The average time is defined as the overall execution time of the group, including the I/O time, the computation time, and the communication time. The motion vector search algorithm used is the LHMEA-based TPA, which produces integer-pixel motion vectors. We carry out the calculation in detail here to show how it works.

n = 5 out of 15 frames.
s = (width_frame − MB_size) × (length_frame − MB_size) = 75264.

φ depends on the method used to calculate the hash function of a macroblock. For the coefficients m and c mentioned earlier:

\[ m = \frac{N \sum (x \cdot y) - \sum x \sum y}{N \sum (x \cdot x) - \sum x \sum x} \qquad (1) \]

\[ c = \frac{\sum y \sum (x \cdot x) - \sum x \sum (x \cdot y)}{N \sum (x \cdot x) - \sum x \sum x} \]

In the C codec, we only calculate Σ(x·y) and Σy, because for a 16x16 macroblock N, Σx, Σ(x·x), and Σx·Σx can be precalculated before calling the function. In the codec, the following pseudo code determines the complexity φ:

    for (y2 = 0; y2 < mb_size; y2++) {
        for (x2 = 0; x2 < mb_size; x2++) {
            temp1 = frame->ref_y[y + y2][x + x2];
            sum_y += temp1;
            sum_xy += count * temp1;
            count++;
        }
    }
    (*pnowbuildtable)[y][x].b = .15 * sum_y;
    (*pnowbuildtable)[y][x].a = (12 * sum_xy - 6 * (total_size + 1) * sum_y) >> 1;

so

φ = 16*16*[1 (*) + 5 (+)] + 4 (*) + 2 (+) (2)

In this example, the total sequential time in theory is

T(n, s, φ) = n × s(M_frame_dimension) × φ(MB_dimension × γ_hashfunctioncomplexity) (5)
= n × M_frame_dimension × MB_dimension × γ_hashfunctioncomplexity
= 5*[(Fsize_x − MB_size)*(Fsize_y − MB_size)]*{16*16*[1 (*) + 5 (+)] + 4 (*) + 2 (+)}
= 97843200 (*) + 482442240 (+)

Practical sequential time: T(n, s, φ) = 7.763 (s) (6)

2.2 Parallel Implementation of LHMEA

To parallelize the encoder, we divide each reference frame (which can be an I or P frame) into equally sized regions. Current frames are also divided into non-overlapping regions. These regions are processed in parallel to increase the encoding speed significantly. Each region is divided into non-overlapping range blocks. Each region is sent to the corresponding slave, which stays alive until encoding finishes. Each slave generates its own hashtable and Motion Vector table and sends the MV table back to the master. However, there is an upper limit on the number of PEs that can be used, owing to the limited spatial resolution of a video sequence. A massively spatial parallel algorithm usually also has to tolerate a relatively large communication overhead. In our approach to spatial parallelism, load balancing was implemented to ensure an equal distribution of the frame data among the processors.

Here we state the pseudo code to calculate the hashtable function in parallel. The function that implements the algorithm is encapsulated in PreComputation().

Parallel Code:

    Input: part of reference frame from master
    Output: part of hashtable
    Step 1: rcode = pvm_upkint(FrameData, Fsize_X * (rows), 1); /* Get data from master */
    Step 2: /* Copy data from buffer to reference frame */
            for (i = 0; i < Fsize_X * (rows); i++) { prevframe.ref_y[tempy][tempx] = FrameData[i]; }
    Step 3: for (i = 0; i < rows; i++)
    Step 4:   for (k = 0; k < Fsize_x - 16; k++) {
    Step 5:     for (y = 0; y < 16; y++) {
    Step 6:       for (x = 0; x < 16; x++) { calculate sum_x*y and sum_y for each pixel; } }
    Step 7:     calculate M and C for each pixel; } }

The structure of the algorithm is shown in Fig. 2. The reference frame is divided into several parts, and rows = (width_frame − width_MB) / Number_PCs + search_windows rows are sent to each client.

Fig. 2. The Parallel Structure of Hashtable

From the pseudo code above, we can obtain the theoretical calculation time. The precomputation complexity is the function

\[ T(n, s, \varphi) = n \times s \times \varphi \qquad (7) \]

The variables have meanings similar to those in the sequential function. The number of reference frames n is unchanged, and s is the frame size handled by one PC, that is, the whole frame divided by Number_PCs:

s = ((width_frame − width_MB) / Number_PCs + search_windows) × (length_frame) (8)
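The row partitioning can be sketched in C as follows. This is a minimal illustration that assumes the frame rows are split as evenly as possible and that each slice is extended downward by the search window and clamped at the frame edge. RowRange and slave_rows are hypothetical names, and the codec's own load-balancing rule may differ.

```c
#include <assert.h>

typedef struct {
    int first_row;  /* first frame row sent to this slave */
    int num_rows;   /* rows sent, including the search-window overlap */
} RowRange;

/* Rows of the reference frame that slave number 'slave' (0-based) receives
 * when 'frame_rows' rows are shared among 'num_pcs' slaves and each slice
 * is extended by 'window' rows so blocks near the border can be matched. */
RowRange slave_rows(int frame_rows, int num_pcs, int window, int slave)
{
    assert(frame_rows > 0 && window >= 0);
    assert(num_pcs > 0 && slave >= 0 && slave < num_pcs);

    int base = frame_rows / num_pcs;
    int extra = frame_rows % num_pcs;   /* first 'extra' slaves get one more row */
    int first = slave * base + (slave < extra ? slave : extra);
    int count = base + (slave < extra ? 1 : 0);
    int last = first + count + window;  /* one past the last row, with overlap */

    if (last > frame_rows) {
        last = frame_rows;              /* clamp at the bottom of the frame */
    }
    return (RowRange){ first, last - first };
}
```

Because every slice except the last carries the extra window rows, the total amount of data processed grows with the number of slaves, which is one reason the speedup is not perfectly linear.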

As before, φ is the complexity of calculating the hash function per macroblock, and the pseudo code determines its value.

Fig. 3. Process of parallel LHMEA setup. [Flowchart: after start-up and process allocation, the master process and the slave processes interact to obtain the setup information. The master initializes the environment, allocates memory, broadcasts the setups, and scatters the frame data. Each slave obtains the current frame data and the reference frame to be encoded, sets up its own part of the hashtable from its part of the reference frame (its share of the frame plus the search window), searches its own part of the hashtable, builds its own part of the MV table, and sends the MV table to the master. The master collects the data for the following process; when encoding is finished, it outputs the video and kills the slaves.]

Using the same example sequence, consider 2 and 4 slaves. With 2 slaves, each slave gets rows = 352/2 + search window. With 4 slaves, each slave gets rows = 352/4 + search window, the same for all four. We use the largest share to calculate the total time for the slaves. In this case, when the number of PCs is 2:

T(n, s, φ) = n × s × φ (9)
= n × s(M_frame_dimension) × φ(MB_dimension × γ_hashfunctioncomplexity)
= n × (M_frame_dimension / Number_PCs) × MB_dimension × γ_hashfunctioncomplexity (10)
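The operation-count model T = n × s × φ can be evaluated with a few lines of C. The sketch below is illustrative only: OpCount, hash_ops_per_position, and total_ops are hypothetical names, and the per-pixel costs (1 multiplication and 5 additions, plus 4 multiplications and a few additions after the loop) are taken from the complexity expression for φ.

```c
#include <assert.h>

typedef struct {
    long long mul;  /* number of multiplications */
    long long add;  /* number of additions */
} OpCount;

/* phi: cost of computing the hash coefficients at one block position.
 * 'tail_adds' is the number of additions performed after the pixel loop. */
OpCount hash_ops_per_position(int mb_size, int tail_adds)
{
    OpCount phi;
    assert(mb_size > 0 && tail_adds >= 0);
    phi.mul = (long long)mb_size * mb_size * 1 + 4;
    phi.add = (long long)mb_size * mb_size * 5 + tail_adds;
    return phi;
}

/* T(n, s, phi) = n * s * phi, where 'positions' plays the role of s. */
OpCount total_ops(long long n_ref, long long positions, OpCount phi)
{
    OpCount total;
    assert(n_ref >= 0 && positions >= 0);
    total.mul = n_ref * positions * phi.mul;
    total.add = n_ref * positions * phi.add;
    return total;
}
```

Calling total_ops once with the block positions of the whole frame and once with the positions of the largest slave share, then dividing the two multiplication counts, gives the theoretical speedup for a given number of PCs.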

For two PCs, substituting the values gives

5*[(rows − MB_size)*(Fsize_x − MB_size)]*{16*16*[1 (*) + 5 (+)] + 4 (*) + 4 (+)}
= 5*[124*336]*[260 (*) + 1284 (+)]
= 54163200 (*) + 267482880 (+)

Speedup, the ratio of the sequential to the parallel theoretical time:

(97843200 (*) + 482442240 (+)) / (54163200 (*) + 267482880 (+)) ≈ 1.8065

Practical sequential time: T(n, s, φ) = 7.763 (s)

Figures 4 and 5 compare the time spent, the actual speedup, and the theoretical speedup for the parallel LHMEA based on the 15 frames of the Flower Garden sequence. PSNR and compression rate remain the same as for the sequential algorithm [1][2].

Fig. 4. Time cost decreases with the number of PCs. [Plot: Time Cost (s) versus Number_PCs.]

Fig. 5. Actual Speed Up and Theory Speed Up comparison. [Plot: Speed Up Rate versus Number_PCs; series: Actual Speed Up, Theory Speed Up.]

In theory, the speedup should increase linearly with the number of PCs. It does not follow a linear model because we do not send exactly Frame/Number_PCs data to each slave; instead, we send Fsize_y/Number_PCs plus search-window-size rows of data. It is also limited by the resolution of the images. More data (window_size × Frame_width) is processed than in the original frame. In theory, the larger the number of PCs, the more redundant data there is, so the curve of speedup against the number of PCs becomes less steep as the number of PCs increases. Time cost also depends on the speed of the PCs. We use a network of similar workstations linked together by a common network, e.g. Ethernet. When CPU clock time is counted, the faster the PC, the less time it takes.

3 Conclusion

In this paper, a parallel Linear Hashtable Motion Estimation Algorithm (LHMEA) is proposed for the LHMEA and Hexagonal Search Based Two-Pass Algorithm (TPA) in video compression. The hashtable is used in video compression and implemented with parallel computing in motion estimation. The algorithm searches the hashtable to find the motion estimator instead of performing a full search over the whole frame. The LHMEA was then implemented as a parallel algorithm, and the speedup of the parallel LHMEA is compared with the original sequential LHMEA. The parallel video coding is implemented inside the frame rather than between frames. The key point of the method is to find a suitable hash function to produce the hashtable.

References

1. Yunsong Wu, Graham Megson: Linear Predicted Hexagonal Search Algorithm with Moments. ICIC 2005, Part I, Springer LNCS 3644, pp. 136-145 (2005).
2. Yunsong Wu, G. Megson: Two-pass hexagonal algorithm with improved hashtable structure for motion estimation. Proceedings, IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 564-569 (2005).
3. Ce Zhu, Xiao Lin, Lap-Pui Chau, and Lai-Man Po: Enhanced Hexagonal Search for Fast Block Motion Estimation. IEEE Trans. on Circuits and Systems for Video Technology, Vol. 14, No. 10 (Oct. 2004).
4. Qiang Peng, Yulin Zhao: Study on parallel approach in H.26L video encoder. PDCAT'2003, Proc. of the Fourth International Conference, pp. 834-837, Aug. (2003).
5. K. Shen, L. A. Rowe, E. J. Delp: A Parallel Implementation of an MPEG1 Encoder: Faster Than Real-Time! Proc. of the SPIE, The International Society for Optical Engineering, Vol. 2419, pp. 407-418.
6. M. Ribeiro, O. Sinnen, L. Sousa: MPEG-4 Natural Video Parallel Implementation on a Cluster. 12th (RECPAD), Portugal, June (2002).
7. H. Ning, J. T. Li, and S. X. Lin: A Study of Parallelism in MPEG-4 Video Encoder. Journal of Computer Engineering and Applications, Vol. 38, pp. 9-1, July (2002).
8. Alexis M. Tourapis, Oscar C. Au, Ming L. Liou: Predictive Motion Vector Field Adaptive Search Technique (PMVFAST) Enhancing Block Based Motion Estimation. Proc. Visual Communications and Image Processing, San Jose, CA, January (2001).