A Full-mode FME VLSI Architecture Based on 8x8/4x4 Adaptive Hadamard Transform For QFHD H.264/AVC Encoder

20 IEEE/IFIP 9th Intenational Confeence on VLSI and System-on-Chip A Full-mode FME VLSI Achitectue Based on 8x8/ Adaptive Hadamad Tansfom Fo QFHD H264/AVC Encode Jialiang Liu, Xinhua Chen College of Infomation Science and Engineeing Shandong Univesity of Science and Technology Qingdao, China Abstact Adaptive Block-size Tansfom (ABT) has been added to H264/AVC standad with the Fidelity Range Extension In this pape, we apply this ABT concept to ou FME design and popose a full-mode FME achitectue based on 8x8/ adaptive Hadamad Tansfom This technique can avoid unifying all vaiable block-size blocks into -size blocks and impove the encoding pefomance We also exploit the lineaity of Hadamad Tansfom in quate-pel efinement and decease the cycles caused by the second long seach pocess In achitectue level, we employ two intepolating engines that can suppot 8-pel and 4-pel input to time-shae one SATD (Sum of Absolute Hadamad Tansfom) Geneato These stategies can incease paallelism and educe the cycles efficiently Besides, this design can suppot full modes, which guaantees the encoding pefomance Expeimental esults show that ou design can achieve eal-time pocessing fo QFHD@30fps at the opeation fequency of 320MHz with 4446K gates hadwae Keywods: H264/AVC; factional motion estimation; adaptive block-size hadamad tansfom; quad full high definition I INTRODUCTION H264/MPEG-4 AVC is the latest video-coding standad jointly developed by ITU-T Video Coding Expets Goup (VCEG) and ISO/IEC Moving Pictue Expets Goup (MPEG) Afte the fist vesion, H264/MPEG-4 AVC didn t stop its development Aftewads Fidelity Range Extension (ie High Pofile), Scalable Video Coding (SVC) and Multiview Video Coding (MVC) extensions wee added to H264/MPEG-4 AVC standad Nowadays, H264/MPEG-4 standad can be widely applied to all kinds of multimedia devices o occasions because of its moe pedominant pefomance Vaiable block size motion estimation (VBSME) is one of the most impotant techniques And this technique exploits fully the tempoal edundancy to povide much bette compession pefomance In geneal, most implementation woks sepaate Motion Estimation (ME) into two stages: intege motion estimation (IME) and factional motion estimation (FME) Fistly, IME pefoms motion seach within the entie seach window to find an intege motion vecto (IMV) fo each of 4 subblocks In the next step, FME pefoms motion seach aound the efinement cente pointed to by IMV and futhe efines 4 IMVs into quate-pel pecision [3] In the whole ME pocess, FME plays an impotant ole in Yibo Fan, Xiaoyang Zeng State Key Lab of ASIC and System Fudan Univesity Shanghai, China Email: fanyibo@fudaneducn impoving the video quality (about 4+ db) and educing the bit ate (about 20%~40%) Almost all pevious woks about ME mentioned this point Howeve, FME will occupy almost ove 45% of the computation complexity of encoding pocess [4] Hadamad Tansfom (HT) is main pat of Sum of Absolute Tansfomed Diffeence (SATD), which is widely used fo estimating cost of ate distotion The 8x8/ adaptive tansfom (Adaptive Block-size Tansfom, ABT) technology with Fidelity Range Extension is added to the main pofile of H264/AVC standad Ou simulation shows that this technique can impove about 005dB video quality and save about 3% bit ate In this pape, we popose a FME achitectue that suppots 8x8/ adaptive HT (Adaptive Block-size HT, ABHT) ABHT is a subset of ABT technique When the ABT technique is applied into the entie encode, the ABHT will obtain its maximum pefomance The est of this pape is oganized as follows In Section Ⅱ, we fistly point out the challenges of FME and intoduce some othes effots Then we descibe elated algoithm which is the foundation of ou design in Section Ⅲ In Section Ⅳ, we intoduce the details of ou poposed achitectue And finally, we give out the implementation esults and conclusion in Section Ⅴ and Section Ⅵ espectively II CHALLENGES AND PREVIOUS WORKS A The Challenges of FME As one pat of entie ME, FME isn t also nomative Howeve, thee is a commonly used method fo FME in JM efeence softwae Most of FME VLSI achitectues poposed late adopt this method [][2][3] The seach pocess of this method is also sepaated into two steps: in the fist step, we pefom half-pel efinement aound the best intege-pel candidates, and then find the best quate-pel candidate aound the best half-pel candidate in the second step The second step doesn t wok until the accomplishment of the fist step As a esult, this two-step method citically limits the seaching speed The second challenge of FME is the multiple modes caused by VBSME VBSME splits one MacoBlock (MB) into one 6x6, two 6x8, two 8x6, o fou 8x8 patitions, and that evey 8x8 patition can be futhe split into one 8x8, two 8x4, two 4x8 o fou sub-patitions In the two-step seach 978--4577-070-2//$2600 20 IEEE 434

method, all patitions ae oganized as shown in Fig Thee ae total 7 modes, ie 4 patitions with 4 IMVs If one FME achitectue suppots all modes, it will need vey high pefomance to meet the equiement of eal-time highdefinition video encoding Figue 7 modes of all patitions B Pevious Woks Nowadays, thee ae some VLSI achitectues poposed, some of them suppot two-step full modes seach, but most of them employ the low complexity algoithm fo high pefomance application at the cost of visual quality dop Chen et al[] poposes a achitectue based on the two-step full seach algoithm This achitectue unifies diffeent size blocks into size block, which obtains 00% hadwae utilization This achitectue can suppot full modes, howeve, which also limits its pefomance Finally, they adopt one Advanced Mode Pe-Decision Algoithm (AMPA) to educe 7 modes to 3 modes only fo SDTV specification Wu s achitectue [2] also suppots all modes It employs thee intepolato engines to shae two SATD geneatos, which obtains the high pefomance by inceasing the data thoughput It still takes moe than 600 cycles @54MHz to suppot the 080p specification, but also the contol of this achitectue would be moe complex The full-mode and two-step seach algoithm limits the high specification application of FME So most achitectues based on the low complexity algoithm ae poposed Wang et al[4] exploit the unimodal eo suface featue of FME and only exploe the neighbohood position aound the minimum one to skip othe unlikely ones Though this method educes the hadwae cost by educing seach points, it doesn t esolve the poblem caused by the two-iteation seach Lin et al[5], Kuo et al[6] and Ta et al[7] popose thei single-iteation achitectues based on the pediction-based diectional FME algoithm This algoithm only seaches six candidates, so it is had to find the best matching candidates, especially when encoding the complex motion video sequences Huang et al[8] popose a high paallel achitectue meeting SHV application They emove the modes below 8x8 and then exploit the edge detection techniques to futhe educe modes Tsung et al[8] adopt a completely diffeent solution They abandon the seach method uled by JM efeence softwae, and then utilize the lineaity of HT to obtain the HT coefficients of sub-pels This method educed the high bandwidth caused by 6-tap FIR and only 002dB PSNR dop TABLE I THE NUMBER OF MODES SUPPORTED BY SOME IMPLEMENTATIONS Refeences Techniques Modes [] IME mode decision 3 [4] IME mode decision 2 educe candidates [6] educe modes 5-7 [7] emove lowe8x8 modes 2 edge detection [0] adaptive mode decision -7 [2][3] full seach 7 ou wok single-iteation 7 Table I gives the numbe of modes suppoted by some implementations, and most of pevious woks achieved thei equied pefomance by educing modes In this pape, we popose a FME achitectue that can suppot full modes on the basis of pevious studies In ou achitectue, we employ two intepolatos to shae one SATD geneato in tun Moeove, we use ABT technology in the HT fo SATD Geneating which can impove the encoding pefomance In ode to avoid the long latency caused by the second step, we also exploit the lineaity of HT to obtain HT coefficients of quate-pels This idea has been bought up in [9], howeve, thee ae some diffeences We make use of the aival delay of the SATD of quate-pels and half-pels to avoid adopting 25-input compaato, which educes hadwae cost and long data path III RELATED ALGORITHM A Reusable elationship between 8x8 and Hadamad Tansfom Fistly, we give out the 8x8 and HT matices () and (2) Hee we ignoe the scaling facto of tansfom matix = H () 8x8 = H (2) 435

By analyzing the elationship between H matix and H 8x8 matix, we can easily expess 8x8 HT matix as (3) H = H H H 8x8 (3) H Assume we have a 8x vecto, x, which can be expessed as two 4x vectos, x 0 andx, as shown in (4) x0 x = (4) x Now, we opeate one dimension (-d) HT to vectox, which is expessed as (5) X = H 8x8 H = H 0 0 + H H Fom (2), we can see that one -d 8x8 HT unit contains two -d HT units So we can design a eusable cicuit that suppots one 8x8 HT and two HT adaptively Fig2 shows this -d 8x8/ HT cicuit The shaded pat indicates the -d HT In ou design, we utilize this -d 8x8/ adaptive HT cicuit to implement the 2-d 8x8/ HT cicuit used in the SATD calculation x0 x x2 x0+x x0-x x2+x3 X0 X X2 (5) y0 y y2 fo geneating quate pixels vecto by bilinea filte that is defined in standad, as shown in (7) Q = ( H + H + ) 2 (7) The HT of diffeences between quate-pel candidates and CMB pixels can be expessed as (8) H + H + HT ( Q C) = HT ( C) 2 H C H C = HT ( + + ) 2 2 2 = ( HT ( H C) + HT ( H C) + HT ( )) 2 (8) Hee, HT( C) and HT( C) ae the HT of diffeences between half pixels (o intege pixels) and CMB pixels, the values of which have been yielded in the fist step seach So, we can use this featue to geneate the HT coefficients of quate-pel candidates in the fist step, which can emove the long latency caused by the second step seach IV ht_coef of half-pels H H THE PROPOSED ARCHITECTURE FOR FME 8x8/ intepolato 8x8/ intepolato0 H HPU3 ½-pels and int-pels HPU6 ht_coef of half-pels x3 x4 x2-x3 x4+x5 X3 X0 y3 y4 Q QPU4 satd_quatepels satd_halfpels satd_quatepels QPU8 QPU2 x5 x6 x4-x5 x6+x7 X X2 y5 y6 FME Contolle fmv MV Cost MUX Compaato fmv Mode Decision best mode x7 x6-x7 X3 Figue 2 Reusable 8x8/ -d Hadamad Tansfom B The lineaity of Hadamad Tansfom The lineaity of HT is defined as (6): HT( ax by) = aht( x) + bht( y) y7 + (6) Hee, HT () pesents HT opeation a and b ae constants x and y ae tansfomed data Let Q and C denote the quate pixels vecto and cuent MB (CMB) pixels vecto espectively H and H epesent the half pixels (o intege pixels) vectos used Figue 3 The poposed FME achitectue Aiming at high pefomance application, we popose a high paallelism and high thoughput achitectue, as shown in Fig3 In this achitectue, we employ two intepolatos These two intepolatos can continuously feed the half candidates to SATD Geneato in tun, which can incease the utilization ate of SATD Geneato in unit time In Fig3, SATD Geneato is divided into two pats: one pat is half-pel SATD calculation (HPU, Pocessing Unit of half pixels) and the othe is quatepel SATD calculation (QPU, Pocessing Unit of quate pixels) In ou design, we employ 9 HPUs and 6 QPUs fo high paallelism The HT cicuit in HPU can suppot 8x8/ adaptive tansfom The uppe 8x8 blocks (including 8x8 block) ae aanged into 8x8 HT, while the lowe 8x8 blocks un at HT Compaed with those achitectues in which thee is only HT, this design can incease the data thoughput and 436

vfilte dfilte encoding pefomance In addition, we also exploit the lineaity of HT to obtain the quate-pel HT fo quate-pel seach The half-pel HT can be obtained diectly by bilinea filteing of half-pel HT The QPUs in Fig3 include the bilinea filtes and add-tee fo geneating SATD of quate-pel candidates A Two Intepolatos Time-shaing One SATD Geneato The poposed intepolato is shown as Fig4 This intepolato can suppot 8-pixel intepolation and two 4-pixel intepolations, which can adapt to all vaiable block sizes One 6x6 block is split into two 8x6 blocks, and 6x8 block is split into two 8x8 size blocks Since ou intepolato can suppot two 4-pixel intepolations, it allows loading concuently two 4xn (n=4, 8) blocks One poblem exists in the conventional achitectues based on single intepolato The available data thown out by the intepolato ae not continuous, while SATD Geneato has the capacity of accepting the input data continuously So if we only adopt single intepolato, the utilization ate of SATD Geneato in unit time is vey low To solve this poblem, we employ two intepolatos which can continuously send the candidates geneated by these two intepolatos in tun to SATD Geneato These two intepolatos keep SATD Geneato woking Fig5 depicts the schedule of loading efeence pixels fo all patitions in evey mode If we only use one intepolato applied to ou design, fom Fig5 (a) can we see that it will take 49 cycles in loading efeence data Howeve, accoding to ou two-intepolato scheme, the cycles can be educed to 292 cycles This is because we adopt a peloading stategy Now, let us take 8x8 mode as an example An 8x8 mode has fou 8x8 blocks, and evey block needs a size efeence block When the efeence pixels of the fist 8x8 block ae continuously fed into one intepolato by ow, accoding to ou design, only the outputs fom 0 th to 7 th afte cycles latency ae available If thee is only one intepolato and we continuously feed data to it, thee will be 6 cycles latency duing switching blocks, and in this inteval SATD Geneato is ideal state By employing anothe intepolato, we Figue 4 8x8/ intepolato 0 44 00 44 200 mode0 6x6 mode 6x8 mode2 8x6 mode3 8x8 22 22 4 4 4 4 22 22 4 4 4 4 200 280 336 mode4 8x4 mode5 4x8 0 0 0 0 0 0 0 0 4 4 4 4 336 mode6 46 0 0 0 0 0 0 0 0 (a) The loading opeation fo only one intepolato cycle 0 mode0 6x6 38 mode 6x8 76 mode2 8x6 4 mode3 8x8 52 22 4 4 22 4 4 22 4 4 22 4 4 52 200 244 mode4 8x4 292 mode5 4x8 mode6 0 0 0 0 0 0 0 0 4 4 4 cycle 4 0 0 0 0 0 0 0 0 (b) The loading opeation fo two intepolatos Figue 5 the schedule of loading efeence pixels fo all the patitions in evey mode can load the efeence data in advance of 6 cycles So, when one intepolato ends outputting available data, anothe one is just in time to yield the available data of the next block Compaed with conventional single-intepolato scheme, this method can cut down cycles efficiently B SATD Geneato Based 8x8/ Adaptive Hadamad Tansfom Most of pevious solutions only adopted HT, because the minimum block size is Howeve, the vaiable blocksize tansfom can bette pesent the natual featue of the ealwold image In this design, we intoduce the 8x8 HT into ou FME achitectue, as shown in Fig2 The eusable elationship between 8x8 and HT matices detemines that the cicuit used fo 8x8 HT can also be used fo HT The uppe 8x8- size blocks pefom 8x8 HT, while the est blocks un at HT This SATD Geneato can suppot two-way HT calculations, so we can tansfom two size blocks simultaneously Accoding to ou expeimental esults (JM 60, IPPP, RDO off, QP=25/30/35/40/45), ABT technology 437

cause about 005dB PSNR incease and about %3 bitate saving, as shown in Fig6 PSNR(dB) 42 40 38 36 we still select the quate-pel candidates accoding to the best half candidate as oiginal two-iteation pocess The quate-pel candidates ae aound the best half-pel candidate but limited in the ed squae This can avoid seaching simultaneously among 6 quate-pel candidates, and sequentially we can avoid the long compaing data path Accoding to ou simulation, thee is about 003 db quality dop in aveage compaed with the situation in which thee is only HT If we didn t exploit 8x8/ adaptive HT techniques, the quality loss would be moe Fig0 shows the ate-distotion pefomance afte intoducing ou scheme 34 32 HT(Tansfom8x8Mode off) 8x8/HT(Tansfom8x8Mode on) 30 00 200 300 400 500 600 700 800 BitRate(kbps) Figue 6 Rate-distotion cuve of adopting 8x8/ adaptive HT, compaed with only HT C Impoved Two-iteation Pocess The oiginal second iteative pocess includes bilinea filteing fo quate-pel candidates, SATD Geneating and best candidates selecting Now, we apply the lineaity of HT to quate-pel efinement pocess and educe the second iteative pocess to one pocess that only includes Best Candidate Selecting Fig7 shows compaison of the oiginal iteation and the impoved two-iteation pocess the atio of evey candidates(%) 2 0 8 6 4 Figue 8 Sub-pel candidates in ed squae 2 Second iteation Second iteation 0 3 2 0 - -2-3 facy -3-2 - 0 2 3 facx Figue 9 The statistic of the best candidate distibution among 49 sub-pel candidates (city, socce,football,mobile) 42 40 Figue 7 The two-iteation seach pocess(a) the oiginal two-iteation seach pocess(b) the impoved two-iteation seach pocess Afte adopting this method, we have to face one poblem Accoding to ou intepolato, thee will be 9 candidates aiving contempoaily Using the HT coefficients coesponding to these nine candidates, we only get HT coefficients of 6 quate-pel candidates in the ed squae, as shown in Fig8 If we want get HT coefficients of all quatepel candidates, we have to incease 2 HT cicuits to seve othe 2 positions (yellow cicles in Fig8) that ae useless fo FME Fotunately, ou expeimental esults show that about 774% of the best matching candidates ae located at the cental 25 candidates, as shown in Fig9 Based on this featue, we decide to abandon the maginal 2 quate-pel candidates fo saving hadwae cost Besides, in the second iteation pocess, PSNR(dB) 38 36 34 8x8/HT + 7candidates(Tansfom8x8Mode on) 32 HT + 7 candidates(tansfom8x8mode off) HT + ou scheme(tansfom8x8mode off) 8x8/HT + ou scheme(tansfom8x8mode on) 30 00 200 300 400 500 600 700 800 BitRate(kbps) Figue 0 Rate-distotion compaison between ou poposed method and oiginal method 438

Modules TABLE II COMPARISON BETWEEN OUR PROPOSED WORK AND SOME OTHER S WORKS Tsung s[8] Yang s[0] Huang s[7] Wu s[2] Ou Poposed Methods Using lineaity of AMS () full-mode full-mode (-7mode) 2 modes HT to /2 and /4 (7modes) (7modes) Max Spec 4Kx2K@24fps 4Kx2K@30fps 4Kx4K@60fps 920x080@30fps 4Kx2K@30fps Lat/MB(cycles) -- 300cycles -- 63cycles 330cyles Opeating Feq 280MHz 300MHz 45MHz 54MHz 340MHz Gate Count(K) 448 254 9765/3 32 4446 CMOS Tech TSMC 90nm UMC 90nm TSMC 80nm TSMC 80nm SMIC 30nm () ASM: Adaptive Selected Modes V IMPLEMENTATION RESULTS AND COMPRASION Tansfom (ABHT) Compaed with conventional achitectues based on HT, ou achitectue can educe computational A Implementation Results We have implemented the poposed FME in Veilog HDL and synthesized it with the SMIC 30nm CMOS libay unde the constaint of 350MHz (The Max Fequency)The detailed implementation esults ae listed in Table III complexity and incease the paallelism In addition, we also exploit the lineaity of HT to the quate-pel efinement This method can cut down the long latency caused by the second iteation efficiently Using the SMIC 30nm technology, the poposed FME design cost 4446K gates with the maximum woking fequency of 350MHz Ou design can meet the equiement of encoding QFHD 3840 260 @30fps at the TABLE III IMPLEMENTATION RESULTS OF OUR DESIGN fequency of 320MHz Gate Count 2xIntepolatos 75746 9xHalf-pel SATD(HPU) 27425 6xQuate-pel SATD(QPU) 32409 Best Candidates Selecting 458 Mode Decision 4505 Contolle 48 Total 444648 B Compaison Table II gives out the compaison between some othe s woks and ou wok Since we employ some paallel schemes, such as 8x8/ adaptive HT, two paallel intepolatos and impoved two- iteation pocess, ou implementation efficiently deceases the cycles needed by one MB So, we can suppot highe pefomance compaed with [2] Though Huang s implementation could each SHV specification, the numbe of modes is educed to 2 modes Reducing modes is a vey efficient method to decease the latency at the cost of highe video quality dop Ou pefomance is close to Yang s, but Yang s also adopts the method of educing modes fo saving hadwae and deceasing cycles Tsung s[8] changes the JM algoithm and apply the lineaity of HT to half-pel and quatepel efinement Thei design still needed high paallelism, so the cost of hadwae is still vey high unde the 90nm technology Ou implementation unde the 30 nm technology can suppot QFHD 3840 260 @30fps at the fequency of 320MHz and 4Kx2K@30fps at the fequency of 340MHz VI CONCLUSION In this pape, we pesent a full-mode FME achitectue The achitectue can suppot 8x8/ adaptive Hadamad REFERENCES [] T C Chen, Y W Huang, and L G Chen, Fully utilized and eusable achitectue fo factional moiton estimation of h264/avc, Poc IEEE Int Conf Acoustics, Speech, and Signal Pocessing, 2004, ppv9-v2 [2] C Y Wu, C L Wu, Y L Lin, A high-pefomance thee-engine achitectue fo h264/avc factional motion estimation, IEEE Tans Vey Lage Scale Integation(VLSI) Systems, vol8, 200, pp662-666 [3] C Q Yang, S S Goto, T S Ikenaga, High pefomance VLSI achitectue of factional motion estimation in h264 fo hdtv, in Poc IEEE Int Symp Cicuits and Systems, 2006, pp2605-2608 [4] Y K Lin, C C Lin, T Y Kuo, and T S Chang, A hadwae-efficient h264/avc motion-estimation design fo high-definition video, IEEE Tans Cicuits and Systems, vol 55, 2008, pp526-533 [5] T Y Kuo, Y K Lin, T S Chang, SIFME A single iteation factionalpel motion estimation algoithm and achitectue fo hdtv sized h264 video coding, IEEE Int Conf Acoustics, Speech and Signal Pocessing, 2007, ppi-85-i88 [6] N Th Ta, J R Ch, High pefomance factional motion estimation in h264/avc based on one-step algoithm and 8x4 element block pocessing, in Jounal of Image Communication, vol26,20, pp85-92 [7] Y Q Huang, Q liu, T S Ikenaga, Highly paallel factional motion estimation engine fo supe hi-vision 4kx4k@60fps, IEEE Int Woks Multimedia Signal Pocessing, 2009 [8] P K Tsung, W Y Chen, L F Ding, C Y Tsai, T D Chuang, L G Chen, Single-iteation full-seach factional motion estimation fo quad full hd h264/avc encoding, IEEE Int Conf Multimedia and Expo, 2009, pp 9-2 [9] Y H Chen, T Ch Chen, Ch Y Tsai, S F Tsai, L G Chen, Algoithm and achitectue design of powe-oiented h264/avc baseline pofile encode fo potable devices, IEEE Tans Cicuits and Systems, 2009, vol 9, pp8-27 [0] C C Yang, K J Tan, Y Ch Yang, J I Guo, Low complexity factional motion estimation with adaptive mode selection fo h264/avc, IEEE Int Conf Multimedia and Expo, 200, pp 673-678 [] Z Y Liu, D S Wang, T S Ikenaga, Hadwae optimizations of vaiable block size hadamad tansfom fo h264/avc fext, IEEE Int Conf Image Pocessing, 2009, pp 270-2704 439