ALU Augmentation for MPEG-4 Repetitive Padding

Size: px

Start display at page:

Download "ALU Augmentation for MPEG-4 Repetitive Padding"

Camron Griffin
5 years ago
Views:

1 ALU Augmetatio for MPEG-4 Repetitive Paddig Georgi Kuzmaov Stamatis Vassiliadis Computer Egieerig Lab, Electrical Egieerig Departmet, Faculty of formatio Techology ad Systems, Delft Uiversity of Techology, P.O. Box 5031, 2600 GA Delft, The Netherlads fG.Kuzmaov, Abstract this paper we augmet a geeral purpose ALU with a extra fuctioality - a repetitive paddig operatio. The proposed solutio eables the processor to perform the time exhaustive MPEG-4 paddig algorithm i real time. At trivial hardware costs of a few hudred 2x2 AND-OR (or equivalet) logical gates, we achieve a order of magitude speed-up whe compared to software ruig o a geeral purpose processor. Our approach allows the MPEG-4 software paddig algorithm to ru i real time for its most demadig profiles. As a example, a 64-bit pipelied implemetatio is discussed i details. The approach is geeral ad fits ito differet architectural cocepts ad ALU operad widths. More specifically it is show that 344 extra gates are ecessary for a 64-bit implemetatio ad at these expeses a processig speed of more tha 7 millio macroblocks per secod (MB/s) ca be achieved. This is a order of magitude higher tha the requiremets of the most-demadig MPEG-4 profile levels. Speed ad hardware estimatios are also reported for 32 ad 128-bit ALUs. 1. troductio Assumig MPEG stadards, MPEG-4 [5, 6] is the first to deal with cotet-based codig of audio-visual scees. To allow the efficiet implemetatio of the stadard, MPEG-4 defies several profiles. These profiles group the large set of required tools, accordig to the targeted classes of applicatios. Withi each profile, a umber of levels costrai the computatioal complexity ad the required data badwidth of the applicatio. literature [8, 9], complexity aalysis results idicate that the computatioal requiremets of the highest profiles ad levels of MPEG-4 are measured i billios of RSC-like istructios per secod. These umbers will sigificatly exceed the capabilities of the geeral purpose processors, despite the ear future techology achievemets. The work preseted i this paper is a part of a research activity, i which we focus o speedig MPEG-4 applicatios up. Our basic approach is to augmet geeral-purpose Arithmetic- Logical-Uits (ALU) with applicatio specific fuctioalities. The specific fuctios are used to redefie the architecture ad are implemeted i hardware. Due to the differet requiremets of the MPEG-4 profiles ad the large umber of fuctioalities ivolved i it, the requiremets for costeffective implemetatios are essetial. Oe importat ew feature i MPEG-4 is the paddig techique, defied at all Levels i the Core ad Mai Profiles of the stadard. Software profilig results, reported i [2, 3, 8, 14, 15], idicate that paddig is a computatioally demadig ad time cosumig process, which restricts the real time operatio of the MPEG-4 codecs. this paper we propose a ALU augmetatio circuitry, which eables a geeral purpose processor to perform the paddig algorithm i real time. Thus we utilize the available resources of the ALU, ad by exploitig the sub-word parallelism ad dedicated istructios, we dramatically icrease the computatioal power of the architecture. The proposed solutio proved the followig advatages: ffl Real time processig for all MPEG-4 profiles ad levels, utilizig the paddig algorithm ca be achieved ffl Scalable implemetatio, tuable to differet operad sizes ffl Geeral implemetatio for differet architectural cocepts The implemetatio is estimated aalytically ad its ifluece o the performace ad the hardware cost of augmeted ALUs with differet operad widths is discussed. Our results idicate high processig speed, achieved at trivial implemetatio costs. The achieved worst-case beefits for a 64-bit ALU example are: ffl Approximately 30 times faster processig tha the highest MPEG-4 requiremets. 45 SBN:

2 ffl Over 1000 times less MPS (Millios of RSC-like structios per Secod) tha the software implemetatio o geeral purpose processors. ffl Oly 344 AND-OR gates extra hardware pealty. Our results strogly suggest that software ruig o a processor icorporatig our proposal will always meet the highest stadard requiremets. The remaider of the discussio i this paper is orgaized as follows. Sectio 2 gives some backgroud iformatio ad the motivatio for the preseted research, icludig a descriptio of the repetitive paddig algorithm. Sectio 3, we describe the proposed implemetatio i details. Sectio 4 gives a quatitative evaluatio of the desig. Fially, the coclusios are preseted i Sectio Backgroud ad Motivatio For cotet-based codig, MPEG-4 uses the cocept of a Video Object Plae (VOP). A VOP is a arbitrarily shaped regio of a frame, which usually correspods to a sematic object i the visual scee. A sequece of VOPs i the time domai is referred to as a Video Object (VO). Each VOP is described by its shape ad texture. Shape is maily represeted i biary format. This format represets the shape as a bitmap, referred to as biary alpha plae. Each pixel i this plae takes oe of two possible values, which idicate whether the pixel belogs to the object or ot. The biary alpha plae is divided ito 16x16-pixel blocks called Biary Alpha Blocks (B). The texture of a VOP represets its color by macroblocks. Each macroblock cosists of oe 16x16 array of lumiace (grayscale) pixels ad two 8x8 arrays of chromiace (color) pixels, which represet the full-color of the correspodig 16x16 area of a VOP. As its precedig visual data compressio stadards, MPEG-4 adopts motio compesatio techiques to exploit temporal redudacies i the ecoded video sequeces. MPEG-4, this process icludes a search algorithm for best matchig betwee the macroblock to be ecoded ad a area of previously ecoded frame. The Repetitive Paddig Algorithm. The purpose of paddig i MPEG-4 is to esure more accurate block matchig i motio compesatio algorithms for arbitrary shaped visual objects. The paddig process defies the fullcolor values (lumiace + chromiace) for pixels outside the shape of a VOP. paddig, two types of macroblocks are of iterest. Macroblocks, which lie o the boudary of the VOP are referred to as boudary blocks. They are processed by the so called repetitive paddig. Exterior macroblocks (completely outside the VOP) are padded usig the exteded paddig method. Sice repetitive paddig is the most demadig paddig algorithm, i this paper we will cosider the paddig of boudary macroblocks. The repetitive paddig algorithm is described i [5, 13], but i literature some modificatios ca be met. [4, 7, 11] ew algorithms or algorithm modificatios are proposed to redefie or eve substitute the origial repetitive paddig. All of them, however, suggest software improvemets of the codig efficiecy ad visual quality ad do ot focus o the real time performace. A hardware acceleratio of the paddig is discussed i [1]. the same paper, the paddig algorithm is modified to support specific istructio set extesios as the horizotal ad vertical paddig processes are divided ito two phases each. These two phases cosequetly sca the lies/colums ito two opposite directios ad perform the paddig operatios. The proposed hardware solutio there is a hardwired dedicated paddig uit. the preset paper we use the stadard repetitive paddig algorithm (differetiates from [4, 7, 11]) o a geeral ALU (differetiates from [1, 4, 7, 11]), where a boudary block is separately processed horizotally, per sca-lie basis ad vertically - per colums. We still sca each lie ad colum of a macroblock bidirectioally, but we do it i parallel (differetiates from [1]), thus savig a umber of processig cycles. The stadard repetitive paddig algorithm, as defied i [5], is equivalet to the followig steps: 1. Defie ay pixel outside the object boudary as a zero pixel. Make a duplicate biary alpha map. 2. Sca each horizotal lie of a block. Each sca lie is possibly composed of zero ad ozero lie segmets (accordig to the shape bits i the biary alpha map). (a) zero segmets, betwee a ed poit of the sca lie ad the ed poit of a ozero segmet, all zero pixels are replaced by the pixel value of the ed pixel of ozero segmet. (b) zero segmets, betwee the ed poits of two differet ozero segmets, all zero pixels take the average value of these two ed poits. Nozero segmets are ot processed. All shape bits, correspodig to padded pixels are set i the duplicate biary alpha map. 3. Sca each vertical lie of the block ad perform the idetical procedure as described for the horizotal lie. The updated shape iformatio from the duplicate biary alpha map is used. Ulike its predecessors, MPEG-4 is much more demadig i terms of computatioal complexity with eve more data itesive algorithms, which is illustrated i Table 1. While the computatioal complexity of the future geeral purpose processors may meet the demads of the lowest 46 SBN:

3 Simple Profile Levels, the challege is still to meet the requiremets of the most-demadig Core ad Mai Visual Profile of MPEG-4. Table 1. Visual Defiitios ad Processig Speed i MacroBlocks per secod [MB/s] Profile Le- Sessio # Max. Boudavel Size VO MB/s ry MB/s Mai L4 1920x L3 CCR L2 CF L1 N.A. N.A. N.A. N.A. Core L2 CF L1 QCF Simple L2 CF N.A. Scalable L1 CF N.A. Simple L3 CF N.A. L2 CF N.A. L1 QCF N.A. Motivatio. A summary of the computatioal complexity of the QCF, Core Profile Level 1 of MPEG-4 is reported i [8]. Sice this is the lowest profile level, utilizig the paddig algorithm, we shall cosider its real-time requiremets as the miimum for a hardware implemetatio. At this level, the computatioal power, reported for the software ecodig of a sigle object is i the order of 4500 Millio RSC-like structios Per Secod (MPS). Assumig a software performace optimizatio by a factor of up to 10, the total computatioal complexity is just withi the computatioal capabilities of the cotemporary geeral purpose processors( MPS). the case of 4 video objects (see Table 1), however, the real-time software feasibility becomes problematic. Therefore, the eed of a hardware acceleratio of MPEG-4 is evidet, eve at this relatively low profile level. Further aalysis of the requiremets for the software implemetatio idicates that the paddig algorithm occupies some 175 MPS for a sigle video object, or aroud 700 MPS for the maximum 4 video objects, stated at Level 1 of the Core profile (Table 1). A geeral purpose processor with such computatioal power is still very expesive to be dedicated just for the repetitive paddig calculatios. f we cosider Table 1, we ca estimate that the required speed of 5940 MB/s for the Core Profile Level 1 is approximately 82 times lower tha the speed requiremets of the highest - Mai@Level4 Profile ( MB/s). A simple arithmetic estimatio idicates that for the highest MPEG-4 profile level, the o-optimized software paddig would require approximately MPS ad whe extremely optimized (10 times speed-up) - i the order of 6000 MPS. Eve for the sigificatly less complex decoder part of MPEG-4, the paddig algorithm will require some MPS for o-optimized software implemetatio dow to 2500 MPS i dramatically optimized programmig. These umbers, huge eve for the expected techology levels of the ear future, motivated our research to focus o a cost-effective hardware solutio of the MPEG- 4 algorithms ad the repetitive paddig i particular. The large umber of alteratively used algorithms i MPEG-4, however, makes the implemetatio of dedicated hardware uits iefficiet, sice a lot of them may remai uutilized. A good example is the paddig algorithm, which is ot icluded i the Simple Profile Levels of MPEG-4. Thus, a multi-profile codec would ot utilize a hardwired paddig accelerator, whe ruig at ay of the Simple Profile levels. A promisig solutio of this problem is the recofigurable implemetatio of hardware accelerators [10, 16]. Aother possible approach to icrease the efficiecy of a geeral purpose architecture is adopted i this paper. We augmet the ALU with additioal circuitry that eriches its fuctioality with oe more operatio - the repetitive paddig. Thus we utilize the available resources of the ALU, ad by exploitig a sub-word parallelism ad dedicated istructios, we dramatically icrease the computatioal power of the architecture. 3. The Augmeted ALU Sice paddig is performed over horizotal ad vertical pixel lies i idetical maer ad the width of the pixel data is relatively small (8 or 12 bits i MPEG-4), a wider geeral purpose ALU ca ot compute it efficietly. To improve the computatioal efficiecy of such ALUs, we ca exploit a sub-word data parallelism. this sectio we augmet a geeral purpose ALU, to make it capable to perform a paddig operatio. Sice 8-bit iteger chromiace ad lumiace data represetatios are the most frequetly used, i this paper we assume the same data formats. Pixel Processig. A sigle byte processig structure, which is dedicated to process each pixel of a block, is depicted i Figure 1. The same structure is used both for lumiace ad chromiace blocks paddig. We pipelie the processig flow by dividig it ito two stages. The first stage cotais a Propagatio Node (PN) ad two multiplexers. The multiplexers are required to preserve the origial fuctioality of the ALU. A byte adder ad a output multiplexer build the secod pipelie stage. The byte adder is a part of the origial multi-byte ALU adder, with cotrollable carries betwee the bytes. The paddig output multiplexer ca be merged with the existig ALU output multiplexer ad that is depicted i Figure 1 by the dash-lied arrow, from the logical part of the ALU (LU). The fuctio of the PN is to propagate the appropriate 47 SBN:

4 L (OP1) S(OP2) R S ffl f the iput shape bit S is set (the pixel belogs to the object), the: Stage 1 Stage 2 Cotrol A R N C OUT L N Byte Adder O B S C =0 N From LU RO OP1 OP2 Cotrol L 3 - Register PN AB "PN" - Propagatio Node - Multiplexer RO R 1. The output O takes the value of the iput, i.e. the pixel keeps its color. 2. The value of the iput (pixel) is propagated to the left ad to the right (via outputs ad RO) for further processig. The shape iput bit S is propagated by the same multiplexers ad occupies the most-sigificat bits of ad RO. 3. The output bit S 0 is set, meaig the pixel has bee processed. Figure 1. A ALU augmetatio for a sigle pixel paddig values to its eighborig paddig PNs ad to supply data ad cotrol sigals to the byte adder ad the output multiplexer. Table 2 represets the cotrol sigals of the output multiplexer (SR(C OUT,ADD) meas Shift Right with 1-bit the ADDer output together with its carry). Sigal Cotrol determies whether the ALU will perform paddig, or its origial operatio(s). Table 2. Truth Table for the Cotrol Sigals of the Output Multiplexer Cotrol S R N R N O 0 ADD ADD ADD SR(C OUT,ADD) 1 1 The followig equatios describe the fuctioality of the PN: = S js; j _S R; RO = S js; j _S L; (1) S 0 = S _ L 8 _ R 8 ; (2) L; R are left ad right 9 bit iput vectors; ; RO are left ad right 9 bit output vectors; is the data iput 8 bit vector; S is the shape (iput) bit before processig; S 0 is a mask output bit after processig; js; j deotes the cocateatio of bit S ad vector ; ad L N represets the N th bit of vector L. The operatio of the PN is as follows. ffl f the iput shape bit S is zero (the pixel does ot belog to the object ad has to be padded), the: 1. The outputs A ad B propagate to the byte adder the least sigificat 8-bits of R ad L respectively. The propagated eighborig shape bit values (L 8, R 8 ) ad bit S, are issued as cotrol sigals to the output multiplexer. The multiplexer redirects to O either the output of the adder or its right shifted value (see Table 2) i.e. the pixel takes the padded value. 2. The L value is propagated via RO ad the R - via icludig color ad shape iformatio. 3. The output bit S 0 is set, meaig the pixel has bee processed. Lie / Colum Paddig. To process a lie or a colum from a block by a -byte ALU, we have to implemet a chai of PN (i.e. -pixel parallel paddig). A sectio of such processig circuitry for two eighborig pixels is depicted i Figure 2. This structure is scalable ad ca cotai a arbitrary umber of propagatio odes, depedig o the ALU width (i.e. elemets for a -byte ALU). t has the followig features: 1. Operads are bypassig the first pipelie stage for a covetioal ALU operatio, thus preservig the origial timig. 2. Operads are passed through the Propagatio Nodes oly whe a paddig operatio is performed 3. The critical path of the operads through the PNs for a -byte adder is equal to the delay of multiplexers. 4. The critical path pealty for a read ALU operad is oly oe 2-1 multiplexer. 5. The ALU critical path pealty is a sigle 2-way AND elemet. 48 SBN:

5 Op A Op B Op A Op B 64 Stage 1 L PN S R RO L PN S Stage 1 PN Chai 64 Stage 2 C N + + C OUT C N Shape A B + From Logical Uit Stage 2 64-bit Adder Shape Result (a) Geeral View of the Pipelie Stages Figure 2. A Sca Lie / Colum Paddig Augmetatio of a ALU PN Chai The last two pealties may be avoided if special desigs are employed (see [12]). Puttig Everythig Together. Sice a macroblock cosist of oe 16x16 lumiace ad two 8x8 chromiace blocks, it is efficiet to implemet structures processig 8 or 16 pixels simultaeously. This meas that with 64 or 128- bit ALUs we will be able to pad a etire lie or colum of chromiace ad/or lumiace blocks i pipelie maer for two ALU cycles. However, the proposed structure is capable to perform paddig eve whe it is implemeted o smaller ALUs. Here, we will explai i more details the paddig process flow, performed by a 64-bit ALU. the ext sectio we will give geeralized estimatios for -byte ALUs. First, for the proper circuit operatio, the left-most ad right-most iputs of the structure should be iitialized. Figure 3(a) depicts a geeral view of the 64-bit iitializatio for lumiace lie / colum paddig. The ad buffers are 8-bits wide ad cotai the iitializatio values, required by the uit to start the operatio. Figure 3(b), the cycle partitioig of the lumiace lie paddig is illustrated. Sice a lumiace lie (16x8-bit) ca ot be processed i oe pass by a 64 bit (8x8-bit) ALU, we assume that the left-most half of the lie is processed first. Depedig o the right-most shape bit of this first half of the lie, the full-lie paddig would require oe or two more cycles. f the right-most shape bit of the first half of the lumiace lie is 0, it meas that data from the other half of the lie is required to pad the first half. Thus the data, propagated to the right is stored ito register. the ext cycle this data is used to fully pad the secod half of the lie, ad the byte, which has to be propagated to the left is PN Chai PN Chai BN - Buffer N; -Do t care; 0-zero. (b) Data Bufferig by Cycles 0 Figure 3. Data itializatio ad Bufferig for Lumiace Lie / Colum Processig by a 64- bit ALU. stored i buffer. Now, a third cycle is executed o the first half of the lumiace lie, with the proper right-to-left propagatio value stored i the buffer. f the right-most shape bit of the first half of the lumiace lie is 1, the pixel value to be propagated right is stored ito buffer, ad the first half of the lumiace lie is completely padded. Just oe more cycle is required to pad the secod half of the lumiace lie, provided the right propagatio value is drive to the left iput of the circuit from. Sice the right-most shape bit of the first half of the lumiace lie is available before the ext operads (the other half of the padded lie) are issued, we have brach determiatio (a perfect brach predictio) for the pipelie. Therefore, we ca coclude that a complete lumiace lie paddig by a 64-bit paddig augmeted ad pipelied ALU would take o average 2.5 cycles to perform. The paddig 49 SBN:

6 of a chromiace lie by the same ALU would take approximately oe cycle for a log data sequece. The processig flow is similar for a 32-bit ALU, but the required average cycle umber is 5.5 (at least 4 at most 7 cycles), ad the aalysis is made o the right-most shape bits of each byte (4x8-bits) to be processed. Give the shape iformatio of the whole lie is available a priory we ca still make a perfect brach predictio, i.e. we ca predict the operads issue sequece perfectly. Note o Figure 3(b) that the left ad right-most iputs of the whole lumiace lie are iitialized with 0, icludig the propagatio of the shape bit. This is also valid for the chromiace lie processig ad for all up or dow scalig ALU implemetatios (say 32 or 128 bit ALUs). The ext sectio gives more accurate estimatio of the processig speed of several differet implemetatios. 4. Evaluatio whole macroblock will be padded for 32 F (N P 8 + N P 16 ) [s]. We ca formulate the Processig speed of the ALU as follows: F P rocessig speed ß 32 (N P 8 + N P 16 [MB=s] (3) ) Assumig a realistic value of F = 1GHz ad usig the data from Table 3 ito Equatio 3, we calculated the processig speed results, give i Table 4. Table 4. Processig Speed at F =1GHz ALU Speed [MB/s] bits Average Worst Case 32-bit bit bit Speed estimatio. We ca easily evaluate the processig speed of the structure, give its operatig speed 1. Let s assume a 8-bit paddig augmeted ALU like the oe, depicted i Figure 2, operatig at frequecy F [Hz]. The values of with practical sigificace are 4, 8, 16 (32, 64 or eve 128-bit ALU). We state two more parameters of the particular implemetatio: N P8 ad N P16 respectively deote the umbers of cycles, ecessary to process a 8-pixel (chromiace) ad a 16- pixel (lumiace) lie from a log data sequece. Some potetial values of these parameters are show i Table 3. Because of the data structures, imposed by MPEG-4, N P 8 [MacroBlocks / secod] Our scheme MPEG-4 Requiremets Table 3. Values of N P 8 P 16 ad N ALU Average Worst Case bits N P 8 N P 16 N P 8 N P bit bit bit is a variable for < 8, ad approximately costat for 8 (64-bit ALU). N P 16 is a variable for < 16, ad a costat for 16 (128-bit ALU). Thus the processig 16 NP of 16 pixels by ay 8 ALU will take [secods] F NP F ad for a 256-pixel lumiace block- [s]. detically, the processig of two 8x8-pixel chromiace blocks will take 16 NP 8 [s] i the same ALU cofiguratio. Sice F a macroblock cosists of 256 lumiace ad 128 (2 x 64) chromiace pixels, padded vertically ad horizotally, a 1 this paper we distiguish (data) processig speed, measured i [macroblocks/sec] (or [MB/s]) from the device operatig speed (frequecy), measured i [Hz] Number of ALU bits Figure 4. Processig Speed for Differet ALU Operad Sizes ad F =1GHz (ote the logarithmic scale) The most demadig profile level, level 4 of the Mai MPEG-4 profile, requires Boudary MB/s (maximum MB/s) for a high resolutio sessio type (1920 x 1088) ad 32 objects (Table 1). These rates are a order of magitude lower tha the reported speed results for the feasible paddig augmeted ALU implemetatios (see Figure 4). The potetials of the structure idicate capabilities to meet eve more-demadig future profiles of the visual data compressio stadards. Furthermore, a software implemetatio of the paddig algorithm o a paddig-augmeted ALU would cost a eglectable umber of MPS (Millios of RSC-like structios per Secod), compared to the MPS umbers, discussed i the Sectio 2. Table 5 cotais the estimated computatioal load of differet paddig augmeted ALU sizes for Level4 of the Mai MPEG-4 profile. 50 SBN:

7 Table 5. Computatioal load for Mai Profile - Level 4 at F =1GHz ALU [MPS] bits Average Worst Case 32-bit bit bit Hardware estimatios. We choose the 2x2 AND-OR logic block as a basis for the hardware estimatios. A oebit 2 to 1 multiplexor is a 2x2 AND-OR gate. The hardware pealty for a sigle byte paddig structure is: 2 x 9-bit multiplexers, 2 x 8-bit multiplexers ad 1 OR gate. That makes 2x9+2x8+1=35 2x2 AND-OR gates. A -byte implemetatio will cost 35 AND-OR gates plus additioal cost for the ALU multiplexer of 8 gates, i.e. 43 2x2 AND- OR gates. Table 6 cotais the exact values of the hardware pealties for differet ALU sizes. 5. Coclusios Table 6. Hardware estimatio ALU Number of bits extra gates 32-bit bit bit this paper we itroduced a scheme for geeral purpose ALU augmetatio, which accelerates the MPEG-4 paddig algorithm with orders of magitude. We proposed a pipelied implemetatio of the idea, thus preservig the origial fuctioality ad timig scheme of the target ALU. At a trivial hardware cost of oly a few hudred elemetary 2x2 AND-OR logical gates, we could easily achieve a real-time performace at the most-demadig profile levels of MPEG-4. We proved that the proposed desig is scalable by applyig it o ALUs with differet operad widths. The approach is geeral ad ca fit ito differet architectural cocepts. 6. Ackowledgemets This research is supported by PROGRESS, the embedded systems research program of the Dutch orgaizatio for Scietific Research NWO, the Dutch Miistry of Ecoomic Affairs, the Techology Foudatio STW (project AES.5021) ad PHLPS Research Laboratories, Eidhove, The Netherlads. Refereces [1] M. Berekovic, H.-J. Stolberg, M. B. Kulaczewski, P. Pirsh, H. Moler, H. Ruge, J. Keip, ad B. Staberack. structio set extesios for mpeg-4 video. Joural of VLS Sigal Processig, 23(1):27 49, October [2] H.-C. Chag, L.-G. Che, M.-Y. Hsu, ad Y.-C. Chag. Performace aalysis ad architecture evaluatio of MPEG-4 video codec system. EEE teratioal Symposium o Circuits ad Systems, volume, pages , Geeva, Switzerlad, May [3] H.-C. Chag, Y.-C. Wag, M.-Y. Hsu, ad L.-G. Che. Efficiet algorithms ad architectures for MPEG-4 object-based video codig. EEE Workshop o Sigal Processig Systems, pages 13 22, Oct [4] E. A. Edirisighe, J. Jiag, ad C. Grecos. Shape adaptive paddig for MPEG-4. EEE Trasactios o Cosumer Electroics, 46(3): , August [5] SO/EC JTC11/SC29/WG11, N3312. MPEG-4 video verificatio model versio [6] SO/EC JTC11/SC29/WG11 N4030. MPEG-4 overview, March [7] A. Kaup. Object-based texture codig of movig video i MPEG-4. EEE Trasactios o Circuits ad Systems for Video Techology, 9(1):5 15, February [8] J. Keip, S. Bauer, J. Vollmer, B. Schmale, P. Kuh, ad M. Reissma. The MPEG-4 video codig stadard - a VLS poit of view. EEE Workshop o Sigal Processig Systems,(SPS98), pages 43 52, 8-10 Oct [9] P. Kuh ad W. Stechele. Complexity aalysis of the emergig MPEG-4 stadard as a basis for VLS implemetatio. SPE Visual Comuicatios ad mage Processig (VCP), volume 3309, pages , Sa Jose, Ja [10] G. Kuzmaov, S. Vassiliadis, ad J. va Eijdhove. A Paddig Processor for MPEG-4. 12th Aual Workshop o Circuits, Systems, ad Sigal Processig (ProR- SC2001), Veldhove, The Netherlads, October [11] J.-H. Moo, J.-H. Kweo, ad H.-K. Kim. Boudary blockmergig (M) techique for efficiet texture codig of arbitrarily shaped object. EEE Trasactios o Circuits ad Systems for Video Techology, 9(1):35 43, February [12] M. Putrio, S. Vassiliadis, ad E. Schwarz. Parallel biary byte adder / subtracter. teratioal Joural of Electroics, 65(2): , February [13] Y. Q. Shi ad H. Su. mage ad Video Compressio for Multimedia Egieerig. Boca Rato CRC Press, [14] H.-J. Stolberg, M. Berekovic, P. Pirsch, H. Ruge, H. Moller, ad J. Keip. The M-PRE MPEG-4 codec DSP ad its macroblock egie. EEE teratioal Symposium o Circuits ad Systems, volume, pages , Geeva, Switzerlad, May [15] S. Vassiliadis, G. Kuzmaov, ad S. Wog. MPEG-4 ad the New Multimedia Architectural Challeges. 15th teratioal Coferece SAER 2001, St.Kostati, Bulgaria, Sept [16] S. Vassiliadis, S. Wog, ad S. Cotofaa. The MOLEN rm-coded processor. 11th teratioal Coferece o Field Programmable Logic ad Applicatios (FPL), Belfast, Norther relad, UK, August SBN:

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);