Kluwer Journal on VLSI Signal Processing, 18 (1998)
© 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

System-level power optimization of video codecs on embedded cores : a systematic approach

LODE NACHTERGAELE, DENNIS MOOLENAAR, BART VANHOOF, FRANCKY CATTHOOR AND HUGO DE MAN
nachterg@imec.be, dennism@imec.be, vanhoofb@imec.be, catthoor@imec.be, deman@imec.be
Interuniversity Micro Electronics Centrum (IMEC), Leuven, Belgium

Received March, 1997; Revised July, 1997

Abstract. A battery powered multimedia communication device requires a very energy efficient implementation. The required efficiency can only be acquired by careful optimization at all levels of the design. System-level power optimizations have a dramatic impact on the overall power budget. We have proposed a system-level step-wise methodology to reduce the power in hardware realizations of data-dominated applications, which is partly supported with our ATOMIUM environment. In this paper, we extend the methodology to the realization of embedded software on processor cores. Starting from a high level algorithm description (e.g. in C), a set of optimizations gradually refines the code and the corresponding memory organization of the array data types. These array data types represent a fully detailed, optimized data storage and transfer organization. Instead of creating the physical memories, a mapping can be done either on a general memory architecture, including a cache, or on a custom memory architecture. First, typical optimizations addressed by our methodology are applied to a didactical example. The effectiveness of the methodology is then demonstrated by the optimization of two complex applications in an embedded processor context: an MPEG-2 and an H.263 video decoder. The impact of the power optimizations on the typical power consumption is demonstrated by simulating the optimized decoders with real video streams.

Keywords: low power, system power optimizations, embedded system design

1. Introduction

Nowadays, people see the need for a portable multimedia device capable of handling complex and advanced data communication. Text and speech services are already available, but extensions towards audio, speech recognition/synthesis and video communication are still requested [1, 2, 3]. To provide those with a battery powered device is a real challenge, stimulating technological advances in several areas ranging from circuit techniques to new algorithms. The detailed H.263 design to which this paper makes some links was performed in co-operation with Texas Instruments Inc.

This paper addresses this power reduction challenge at the system level by showing that our data storage and transfer optimization methodology [4], ATOMIUM, oriented to hardware realization of data-dominated applications, can also be applied in an embedded software context if applied with care and if some extra transformations are incorporated to keep the cycle impact low. The traditional approach for software is to compile the input C-code on a general memory architecture as shown in Figure 1(a) and to rely on the general purpose cache to perform the data transfer management. In embedded systems, extra freedom in the memory architecture allows this cache to be replaced with a more effective alternative, since the application is known in

Fig. 1. Conventionally, C-code is compiled onto architecture (a); for embedded systems an optimized C-code together with a memory organization like (b) will be more power efficient. (a) Conventional memory organization: main memory and a general purpose cache between the multi-dimensional data structures Dx of the C-code and the processor. (b) Application specific memory organization: dedicated layer 1 and layer 2 memories holding the data structures.

advance and can therefore be fully analyzed. Hence we propose to replace the general purpose cache by an application specific memory architecture as shown in Figure 1(b). This memory organization, fully matched to the optimized code which is obtained after applying global system-level control- and data-flow transformations on the original C-code, heavily reduces the overall power consumption for data-dominated applications. This is demonstrated on two complex algorithms, namely an MPEG-2 and an H.263 video decoder.

Multimedia applications are almost by definition data intensive applications, meaning that the data transfer and memory organization will have a dominant impact on the power and area cost of the realization. Providing relevant feedback at an early phase of the design makes algorithmic designers more implementation aware, enabling them to make global algorithm/implementation trade-offs. This is crucial if energy efficiency of complex multimedia systems is of vital importance. Therefore we have proposed to optimize the data transfer and memory organization prior to the realization of the data-path and controller functions, since they have the biggest relative impact on the global implementation cost [5, 6]. The optimized description of the data transfer and memory organization is then used as input for system realization. As long as the memory organization can be customized, the globally optimized result will yield a better overall result. This has been demonstrated by us for application-specific implementations where the designer has full freedom [5, 6]. However, even if the number and size of the memories are predefined, the freedom in the data layout inside the memories still provides a major opportunity for optimization, as we will show in this paper.

Here, the impact of the potential system level data transfer and storage optimizations is demonstrated by comparing simulations of the optimized decoders with the reference. The outcome of these simulations shows that embedded software implementations of multimedia standards can be made much more energy efficient than the conventional method starting from C. This is obtained without paying a significant penalty in terms of the size of the object code and the number of instruction cycles needed. The resulting optimized software can then be mapped on a general or a custom memory architecture. The performance of both implementations in terms of power and execution time will be better compared to the original code. This is an important new contribution. It allows the main advantages of software techniques used in embedded software design to be combined with the energy efficiency of application specific implementations for data-dominated applications.

2. Related work

Many video processing architectures have been published (see e.g. [7, 8, 9, 10]). Several of these exhibit low power consumption (see e.g. [2, 1, 5, 11]). A very nice example of what can be reached is a video decompression circuit dissipating less than 9 mW [11]. The low power requirements were met by making use of special circuits, a low voltage supply (1.35 V) and several system-level measures, including the avoidance of external memories. Based on this number, one could conclude that video for portable devices is already possible. This is indeed true for small image sizes and certain decompression techniques. However, a considerable design effort is needed to implement advanced algorithms in an efficient way. Moreover, the system-level design approaches used in such designs are ad hoc up to now. These ad hoc methods will not suffice for the more complex future applications. Next generation information presentation technologies such as 3D graphics, volume rendering and virtual reality will strain the implementation requirements over their current design limits. The magic trick that reduces the power by a factor of 25 by lowering the voltage supply from 5 V to 1 V [12] by the year 2000 will run out of steam too. Further lowering the supply voltage is expected to be problematic due to problems with the threshold voltage V_T. Hence lowering the supply voltage is only part of the answer. Therefore, also (much) more power efficient system-level architectures, and algorithms optimized to exploit these, are needed to cope with upcoming implementation challenges.

Chips that decode [13, 14, 15] or encode [16, 15] MPEG-2 streams at main level (ML), main profile (MP) dissipate an acceptable few Watts. However, they typically need 4 to 16 Mbit of fast RAM with a high bandwidth. The resulting global multi-chip system dissipation prohibits acceptable lifetimes of the battery [11, 2]. The need for I/O buffers could be eliminated if logic and DRAM functionality can be provided by one and the same technology. This technology is not only required for energy efficiency reasons but also to reduce the system cost and to increase the memory bandwidth. The M32R/D from Hitachi is a step in this direction [17]. This chip combines a 52.4 MIPS (Dhrystone V2.1 rating) 32 bit RISC processor with 16 Mbit of DRAM, which still does not allow real-time MPEG-2 decoding at all. The typical power consumption varies between 275 mW and 700 mW. One reason for the still relatively high power consumption is that the memory architecture of the M32R/D is not customized. We believe that even when an effective combination of memory-logic technology is provided, for power reasons the memory organization for future applications must still be better adapted to the application.

In this paper a system level data storage and transfer exploration [4, 5, 18, 19] is extended to an embedded software context. The impact of this is first explained by applying it to a relatively simple but realistic and easy to follow didactical example (Section 5). The effect of the optimizations is demonstrated with the high-level model of the power consumption presented in Section 3. Because a methodology cannot be validated with a simple design alone (that example took less than 1 man-month of design time), the system design optimization of an MPEG-2 video decoder and the impact on embedded power are also presented in Section 6.
Optimizations of the worst case power of an H.263 video decoder in a hardware context have been published in [5, 6]. In Section 7, estimates of the average energy dissipation in an embedded processor context before and after optimization are given. They are obtained by counting the number of transfers during simulation of the decoders with real video streams.

3. Power model

Obtaining a precise power estimation from a high-level system description requires a major design effort. Luckily this is not needed to take relevant high-level decisions during the system exploration. It is sufficient to have a relative comparison that selects the most promising candidates. For data-intensive applications, power due to memory transfers is dominant [20, 21, 11, 5]. We can neglect the power consumption in operators, control and internal routing during memory optimizations, since their power consumption will not be drastically affected by the memory optimizations compared to the achieved power reduction for data transfers and memory accesses. Clocking and most data transfers are for a large part related to the actual communication and storage access, and are therefore assumed to be proportional to it. In this way, our storage based power model is good enough to perform relative comparisons between design alternatives. It also provides a lower bound on the absolute access related power.

Unlike in operators, the power due to a memory access is independent of the data value transferred [20]. The power is a function of the size of the memory, the frequency of access and the technology:

Size: usually has a proportional, sublinear influence, but depends very much on the memory organization. Nowadays memories are often partitioned into several memory banks. Each time another bank is accessed, extra energy is consumed to power up the bank. In the extreme case, the power bottleneck is moved mostly to the periphery and a logarithmic dependence is created.

Frequency of access: the power is considered linearly proportional to the frequency of access. This assumes that the memory is in power down mode when no transfer is going on. Most modern embedded RAMs have a power down mode [22, 23]. Some memories require that the address and data bits are kept stable when in power down mode.

Technology: since this is the same for all on-chip memories considered in this paper, we exclude it from the power model. For off-chip memories different technologies are normally available. However, a low power 1 MB SRAM [24] will be assumed for external memory, since power consumption figures of other memories are unavailable. Our external memory model will only be used for the optimization of the MPEG-2 decoder. All other memories are assumed to be on-chip. This is a conservative assumption since accesses to off-chip memories are very costly in terms of energy consumption [11].

The simple power model used in this paper is:

    P_Transfers = E_Tr * (#Transfers / second)        (1)
    E_Tr = f(#words, #bits)                           (2)

In [25] a function f is proposed to estimate the energy per transfer E_Tr in terms of the number of words and the width in bits. Some possible values for the parameters are also provided there.

4. Target architecture

This section describes the target architecture on which the initial behavior description is mapped using the step-wise optimization methodology ATOMIUM [4]. An example of the optimization steps is shown in the didactical example of Section 5. Our goal is to map the behavioral description on an embedded core with a dedicated memory hierarchy. However, a general-purpose memory hierarchy, consisting of a data and instruction cache connected to data memory and instruction memory, could also be used. Both memory hierarchies are assumed to have an instruction cache in the instruction path. For data-dominated applications, which are mostly built around loop nests, the small (active part of the) instruction cache will seldom have a cache miss, so little power is spent in this component during actual execution. For the flow of data between memory and data-path, two main options are available: a general purpose memory hierarchy (a data-cache with data-memory) or a dedicated memory hierarchy. The dedicated memory hierarchy in an embedded core solution replaces the data-cache by several memory blocks which serve as buffers for the large data. One of these memory blocks could still behave as a cache (with an update protocol).

The outcome of the system exploration is a detailed memory organization. This consists of the number and the type of memories that are allocated. These can be selected from a library. For each memory, the parameters that fully characterize the memory are decided. These parameters are the number of locations, the width in number of bits and the number of read, write and read/write ports.
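As an illustration of how these memory parameters feed the power model of equations 1 and 2, consider the following minimal C sketch. The exact energy function f of [25] is not reproduced here; energy_per_transfer below uses placeholder coefficients and is only meant to show the structure of the estimation.

    #include <math.h>

    /* Parameters that characterize one allocated memory (cf. Section 4). */
    typedef struct {
        long words;      /* number of locations                       */
        int  bits;       /* width in bits                             */
        int  ports;      /* read, write and read/write ports (total)  */
    } memory_t;

    /* Placeholder for f(#words, #bits) of equation 2; the real
       coefficients come from [25] and are not reproduced here.        */
    static double energy_per_transfer(const memory_t *m)
    {
        return 1.0e-9 * m->bits * (1.0 + 0.1 * log((double)m->words));
    }

    /* Equation 1: power due to the transfers to/from one memory.      */
    static double transfer_power(const memory_t *m, double transfers_per_second)
    {
        return energy_per_transfer(m) * transfers_per_second;
    }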
Every multi-dimensional array of the description is assigned to one of the allocated memories. Also the base address, the storage order (e.g. row-wise or column-wise) and the window are known. The best result is obtained when the freedom exists to build the physical memory system as described by the system exploration. If this freedom does not exist, an improvement of the power consumption and performance can still be obtained by using the obtained results as a guideline for mapping the data structures on the available memory hierarchy. When a general memory hierarchy is used instead of the memory hierarchy proposed by the system exploration, a better overall result can still be obtained, since the system exploration methodology improves the temporal and spatial locality of the memory references, which for instance results in an improved cache hit ratio of the data-cache. However, a dedicated memory hierarchy will show the best improvement, since a distinction can then be made between

cacheable data and non-cacheable data. This distinction will greatly improve the hit rate of the data cache, or could even remove the need for a data-cache altogether. How the controller, data-paths and address generators are realized is not decided yet at this design stage. They can for example be mapped on a general purpose processor, or application specific components can be designed. A mixture of both approaches is also possible. This freedom is an important advantage of our approach.

5. Design methodology explained on a simple but realistic example

In this section we will apply the ATOMIUM system methodology proposed in [4] to a basic image processing kernel mapped to both an embedded processor core and to a hardware target. Starting from a behavioral description, each step of the ATOMIUM script will be applied and explained. In Sections 6 and 7, the methodology illustrated in this section will be applied to two real-life applications, an MPEG-2 and an H.263 video decoder respectively.

Describing the behavior

A convolution of a picture stored in a 2-dimensional array p[][] of size (W x H) with a 2-dimensional mask m[][] of size ((2N+1) x (2N+1)) is defined by (r : row, c : column):

    for all r in [0..H-1], for all c in [0..W-1] :
        cp[r][c] = sum_{y=-N..N} sum_{x=-N..N} p[r+y][c+x] * m[y][x]        (3)

Although most people would call this already a formal definition, it is not complete. The formula does not define what happens in case the free variables r or c are at the boundary of the allowed interval. In case of a negative address, or a value that exceeds a limit value max, a wrap-around is performed. For an index i valid in the range [0..max-1], the following index filter is used:

    f(i) = -i                      if i < 0
    f(i) = max - 2 - (i mod max)   if i >= max                              (4)
    f(i) = i                       otherwise

Equations 3 and 4 can easily be converted to a C program. To avoid problems at the border, the array p is extended with 2N rows and 2N columns:

    const int W = 256;
    const int H = 256;
    const int N = 1;

    convol2d()
    {
      int p[H+2*N][W+2*N];
      int cp[H][W];
      int m[2*N+1][2*N+1];
      int r, c, x, y;

      m[0][0] = 1; m[0][1] = 2; m[0][2] = 1;
      m[1][0] = 2; m[1][1] = 4; m[1][2] = 2;
      m[2][0] = 1; m[2][1] = 2; m[2][2] = 1;

      /* loop 1 : read in the picture */
      for (r = 0; r <= H-1; r++)
        for (c = 0; c <= W-1; c++)
          p[N+r][N+c] = input;

      /* loop 2 : calculate the convolution */
      for (r = 0; r <= H-1; r++) {
        for (c = 0; c <= W-1; c++) {
          int sum = 0;
          for (y = -N; y <= N; y++)
            for (x = -N; x <= N; x++)
              sum += p[N+r+y][N+c+x] * m[N+y][N+x];
          cp[r][c] = sum/16;
        }
      }

      /* loop 3 : write out the result */
      for (r = 0; r <= H-1; r++)
        for (c = 0; c <= W-1; c++)
          output = cp[r][c];
    }

The comments mark the three main loops: 1. read in the picture, 2. calculate the convolution, 3. write out the result. The above code exhibits (2N+1)*(2N+1)*W*H = 9 * 65,536 = 589,824 reads and (H+2N)*(W+2N) = 258 * 258 = 66,564 writes of array p. (The code for the initialization of the border is left out.) Array cp is read and written W*H = 65,536 times each.

In total, 589,824 + 66,564 + 2 * 65,536 = 787,460 transfers are performed to the memory for one picture. This would require a memory access time of 42 ns (at the targeted 30 frames/second). Using the power model of [25], the energy for one read to a memory of 66,564 words of eight bit is estimated to be 1.17 Joule. Hence, the power due to reads is estimated to be 76 mW. In this way, the power consumption due to memory transfers is estimated to be 727 mW for array p and 123 mW for array cp. The convolution needs 6 shift operations and 9 adds (not counting the operations required for address calculation). To achieve a frame rate of 30 frames/second, we need to perform (6+9) * 65,536 * 30 = 29,491,200 operations/second, i.e. about 30 MOPS. For 30 frames/second, 19.7 M memory accesses are required. Although the number of memory transfers is less than the number of operations, the power of the memory transfers is much higher [1]. Furthermore, it is very difficult to decrease the basic number of operations of a given algorithm, while it is possible to decrease the number of memory transfers. The number of memory transfers depends on the implementation of the algorithm, while the number of operations is implied by the algorithm.

Global data-flow optimizations

In general, many system-level data-flow transformations can be performed to optimize the data transfer and storage [18]. In this particular case, only the conditional signal propagation class can be illustrated. Indeed, instead of enlarging the array p to cope with border effects, conditions can be inserted into the inner loop to handle boundary exceptions. The loop body of the second loop becomes:

    int j = r+y;
    if (j < 0)  j = -j;
    if (j >= H) j = H-2 - (j % H);
    int i = c+x;
    if (i < 0)  i = -i;
    if (i >= W) i = W-2 - (i % W);
    sum += p[j][i] * m[y][x];

These conditions allow the size and the number of writes of array p to be reduced from (H+2N)*(W+2N) to H*W, hence reducing the power. The increase in the number of instructions will affect the execution time, but since all these instructions can be found in the instruction cache, this increase will be very small. The reason for applying this optimization is that, by reducing the number of memory transfers, the number of address calculations will also be decreased, which has a positive effect on the cycle count again. If the overall effect is still (too) negative, it is however still possible to trade off some gain in power consumption against the number of instructions and the code size. We assume that such an optimization is incorporated further on to avoid any significant cycle overhead.

Global loop and control flow transformations

Loop and other global control flow transformations mainly introduce locality of array accesses, hereby enabling other steps further in the design script. For example, in the following code:

    FOR i:=1 TO 10 DO
        A[i] := A[i-1] + 1;
    FOR j:=1 TO 10 DO
        B[j] := A[j] + 1;

all elements of A are still needed after execution of loop i. Merging loop i with loop j will only change the order of computation. The reordering enables the in-place optimization of an array further in the system design script. This transformation is combined with the elimination of potential foreground signals from the arrays to be stored in the background memories. For instance, when an array can be reduced to one scalar, which is the case for the A[] array when merging loops i and j, it is done at this stage in the design script.
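For illustration, a minimal C sketch of the merged form is given below; it assumes for simplicity that the initial value A[0] is 0, so the background array A[] disappears completely and only a scalar remains.

    int B[11];
    int a = 0;                  /* A[0], assumed to be 0 here for illustration */
    for (int i = 1; i <= 10; i++) {
        a = a + 1;              /* was: A[i] := A[i-1] + 1;                    */
        B[i] = a + 1;           /* was: B[j] := A[j]   + 1;                    */
    }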
Let us return to our convolution test-vehicle. In the initial ordering, loops 1, 2 and 3 are executed one after the other. We will now alter the loop ordering to enable reductions to scalars or in-place optimizations further on in the script. Since there is only a data-dependence between the production of cp[r][c] and its consumption, loops 2 and 3 can be merged. Due to this merger, the value that is written into array cp is immediately read again. Hence only 1 scalar is necessary and the array cp becomes obsolete. This reduces the number of words to be stored from 2*W*H (for arrays p and cp) to W*H = 65,536 words of 8 bit (for array p only). Because the convolution of pixel p[r][c] can only be completed when pixel p[r+N][c+N] is read, the

origin of the combined loop 2-3 has to be translated with respect to the origin of the index space of loop 1. The smallest translation vector that avoids read-before-write conflicts is (0, N). This means that the first operation of the combined loop 2-3 can be executed after reading N lines of the input image. This does not directly reduce the number of locations to be stored, but it improves the locality and it enables the in-place optimization further in the optimization script.

Memory hierarchy, allocation and assignment

At this stage we introduce layers of memory according to the distance between the data and the data operators. We define three layers:

layer 0 : registers in data-paths or processor cores.
layer 1 : relatively small, fast on-chip buffers and caches.
layer 2 : relatively big and slow memories.

By introduction of a (2N+1) x (2N+1) = 3 x 3 = 9 element buffer at layer 1, the (2N+1) x (2N+1) = 3 x 3 = 9 reads per pixel from array p at layer 2 can be reduced to 2N = 2 reads per pixel.

In-place optimization

A pixel at coordinate (r, c) can be convoluted when the bottom right pixel (r+N, c+N) is available. After this, the top left pixel at coordinate (r-N, c-N) is not needed any longer. This reasoning clearly shows that there is an opportunity for in-place array optimization. Instead of storing the incoming pixels into an array p of size H x W, a circular ("snake") buffer can be used. This buffer holds all pixels that will be needed to calculate the convolutions. The length of the buffer is the Manhattan distance between the head and the tail of the buffer if a row-major storage is assumed. This distance is 2N rows of length W. Hence, the buffer length is 2N*W + 2N. If N is small, 2N locations can be stored in registers by careful manipulation of the updates of the delay line. This reduces the remaining memory size to only 2N*W.

Mapping on the target architecture

For this illustrative example we selected two target architectures:

1. embedded software: the ARM7 RISC processor core with 1 external off-chip memory
2. hardware: custom standard cell design

The ARM7 core is capable of processing 17 MIPS at 25 MHz and 3 V. It consumes 0.6 mA/MHz at 3 V when fabricated in a 0.8 µm CMOS technology [26]. For the mapping on the ARM7 RISC processor, the cycle count oriented optimizations discussed in [27] are applied on both the initial and the optimized C description. The code was compiled with the Norcroft ARM C compiler (Version 4.66b) using the -Otime option. The resulting code was simulated with the ARM Source-level Debugger (version 4.45b). The clock frequency was set to 33.3 MHz. The memory system used was a 1 port read/write memory of 8 bits wide. This memory system was used since the ARM simulator does not incorporate caches or take into account the differences in memory speeds; therefore no difference will result from using our memory hierarchies or a single memory. We assumed 115 ns and 85 ns for non-sequential and sequential access respectively. A comparison of the original version (R) and the optimized version (V4) is shown in Table 1. We see that neither the reference description R (0.64 frames/second) nor the optimized description V4 (1.02 frames/second) reaches the design goal of 30 frames/second. Still, the optimizations we propose have a very large impact on the memory size (factor 4) and cycle count (40% less). The execution times could be improved further by making use of ARM's special instructions to load/store multiple words (the so-called STM and LDM instructions).
A factor of 4 speed-up is expected, but this is still not enough for real-time image processing. The main reason for not reaching the real-time image processing speed is that the ARM instruction set has no instructions to perform data access and data calculations in parallel. Several cycles are needed to read (3 cycles) or write (2 cycles) data from/to main memory. The number of execution cycles is lowered because fewer address calculations are needed for the reduced number of memory transfers. In case a data cache is available, as in our general memory architecture, a large speed-up can be obtained since only the snake buffer is used instead of two arrays. But the best performance is obtained when an on-chip memory is used for the circular buffer.
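For illustration, a minimal C sketch of the circular ("snake") buffer update described above is given below, assuming row-major storage, W = 256 and N = 1 (so the buffer holds the last 2N*W = 512 pixels); the window registers and the convolution itself are left out.

    #define W 256
    #define N 1
    #define SNAKE_LEN (2*N*W)                /* 512 locations                  */

    static unsigned char snake[SNAKE_LEN];

    /* Store the incoming pixel (r,c) and return the pixel that entered the
       buffer 2N lines earlier, i.e. the one that is about to leave it.        */
    static unsigned char snake_push(int r, int c, unsigned char in)
    {
        int pos = (r * W + c) % SNAKE_LEN;   /* circular, row-major address    */
        unsigned char old = snake[pos];
        snake[pos] = in;
        return old;
    }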

The second target architecture style is a standard cell design. The optimized C code was manually expanded to a VHDL description of the controller and the data-path. The VHDL was synthesized using the MTC22000 MIETEC 0.7 µm CMOS library with Synopsys Design Analyzer (Version 3.4). The resulting data-path and controller contain 1118 and 188 gates respectively. Note that no scan-test circuitry has been inserted. The resulting floorplan is depicted in Figure 3.

We can estimate the power breakdown of the chip (see Figure 2) as follows. The on-chip memory of 512 words of 8 bit is about 2 mm² big. Based on the data sheet, its power consumption is estimated to be 15 mW at 5 V. The off-chip driving power is estimated using:

    P_offchip = 1/2 * (C_offchip + C_offchipdriver) * V_DD^2 * (Nr. transitions/second)        (5)

where V_DD is the supply voltage, C_offchip is 30 pF for an advanced package and advanced printed circuit technology, and C_offchipdriver = 0.3 * C_offchip = 0.3 * 30 pF = 9 pF if the width decrease ratio in the inverter chain is 4 [28]. We counted the transitions in one output image of 256 by 256 pixels of eight bit; this corresponds to an activity of 0.29. At 30 frames/second, the off-chip driving power evaluates to P_offchip = 2.2 mW. The power consumption of the data-path and controller is estimated using PowerMill. The test length of the stimuli, applied during the power estimation simulation, was 3 lines of 256 pixels of an image of a natural type (Lena). The data-path and controller consume 12.3 mW on average for typical operation and 5.4 mW in standby mode. The latter is rather high. It is caused by the registers: regardless of the mode they are in, part of their circuitry is activated in each clock cycle and consumes power.

Fig. 2. Estimate of the power breakdown of the convolution chip: RAM 52%, data-path 41%, output 7%.

Data-path, controller, off-chip drivers and memory together consume about 29 mW while performing about 2.9 x 10^7 operations/second. This corresponds to about 10^9 operations/second per Watt. Remark that no special low-power library or logic synthesis scripts have been used. This is the main reason why, for this small example, the heavily optimized memory organization needs only a little more power than the unoptimized data-path.

Comparing relative improvements

In Table 2 and Table 3, analytical formulas are listed for the number of transfers and the size of signals p and cp respectively.

Table 1. Mapping to the ARM RISC processor (columns: Version, Size (8b), Time (s), #Instr., #Cycles; rows: R, V4).
R is the original version that serves as reference; V4 is the version after in-place optimizations; Size is the length of the program.

Fig. 3. Standard cell realization of the 2D convolution kernel.

The reference description needed (256+2)*(256+2) = 66,564 words of 8 bit to store p and

256 * 256 = 65,536 words of 8 bit for cp, or 132,100 words in total. The optimized description needs 2N*W = 512 words of 8 bit. Using power equation 1 shows that the power of version V4 is 31% of the power consumption of reference version R, which results in significant savings.

Table 2. Accesses and size of array p in several stages of the script.
Version | #Reads p            | #Writes p      | Size p
R       | (2N+1)*(2N+1)*W*H   | (H+2N)*(W+2N)  | (H+2N)*(W+2N)
V1      | id.                 | H*W            | H*W
V2      | id.                 | id.            | id.
V3      | (2N+1)*W*H          | id.            | id.
V4      | id.                 | id.            | 2N*W
R is the original version that serves as reference; V1 is the version after data-flow transformations; V2 is the version after loop transformations; V3 is the version after introduction of memory hierarchy; V4 is the version after in-place optimizations.

Table 3. Accesses and size of array cp in several stages of the script.
Version | #Reads cp | #Writes cp | Size cp
R       | H*W       | H*W        | H*W
V1      | id.       | id.        | id.
V2      |           |            |
V3      |           |            |
V4      |           |            |
R is the original version that serves as reference; V1 is the version after data-flow transformations; V2 is the version after loop transformations; V3 is the version after introduction of memory hierarchy; V4 is the version after in-place optimizations.

6. System exploration of an MPEG-2 decoder

The system level power exploration methodology as defined in the previous sections will now be used on the first full size example, namely a public domain MPEG-2 video decoder program. As the basis for the optimization, the MPEG-2 decoder software of the Software Simulation Group (SSG) was used. The version of the used software is 1.1a. The objective of the system exploration of the MPEG-2 decoder is to decrease the power consumption of the MPEG-2 video decoder program without changing or restricting its functionality. The goal of the optimizations is not to obtain the best possible result for each optimization discussed below, but to show the main principles and to illustrate that the steps of the ATOMIUM methodology arrive at an overall optimized result. Not all possible optimizations will be given here; details can be found in [29]. But first the most important facts about the MPEG-2 algorithm and software are provided. Then the steps of the ATOMIUM methodology, adapted for embedded software realizations, are applied to show the optimization impact on the C-code. After these optimizations the software is ported to an ARM core in section 6.7. This exploration is ended with the conclusions of the results obtained.

Overview of the MPEG-2 video compression standard

The aim of the Motion Pictures Expert Group (MPEG) is to produce video compression standards for different application areas. MPEG-1 is aimed at interactive games and low quality video, while the MPEG-2 standard is a general video compression standard for high quality universal video coding. An MPEG-2 picture consists of three components: Luminance (brightness), Chrominance 1 (colour component Cb) and Chrominance 2 (colour component Cr). The reason for splitting the picture into these three components is that the eye is less sensitive to colour information than to luminance (brightness). Thus MPEG-2 uses chrominance subsampling to increase the compression ratio. Other methods of compression used by MPEG-2 are Motion Compensation (temporal compression) and the Discrete Cosine Transform (DCT) (spatial compression). Both of these compression methods are block based. Motion compensation is performed on Macroblocks of 16x16 pixels of the luminance picture. The DCT blocks are 8x8 pixels in size.
MPEG-2 exhibits the following features:
- Frame (non-interlaced) and Field (interlaced) pictures
- 3 motion compensation modes per Frame or Field picture
- Interlaced and non-interlaced DCT coding
- Multiple chroma subsampling factors for different picture qualities
- Scalable video streams

A scalable MPEG-2 stream involves several bit streams, which consist of a normal bit stream and several enhancement layer bit streams. The normal bit stream represents, for instance, normal TV images, and the enhancement layer stream can upgrade the normal stream in:
- Transmission quality (data partitioning)
- Picture quality (SNR-scalable streams)
- Picture size (spatial scalability)
- Picture rate (temporal scalability)

For a more extensive comprehension of MPEG-1/MPEG-2 we refer to [30, 31, 29].

Behavior of the MPEG-2 decoder program

The used MPEG-2 decoder program is mainly written for explanation of the MPEG-2 standard and is therefore not fully optimized for speed. However, a fast idct algorithm and a fast variable length decoding algorithm are used, resulting in a good overall performance of the software. The structure of the MPEG-2 decoder software follows the MPEG-2 standard, so first the MPEG-2 picture headers are read. These contain the control information required by the decoder. Next, the procedure getMBs is started. This is the main procedure, which recreates a complete picture by traversing all Macroblocks of a picture. For each encoded Macroblock the Macroblock header is read, followed by the encoded DCT blocks. The last steps of the MPEG-2 algorithm are to perform motion compensation for the Macroblocks, and to add the result of the inverse transformed DCT blocks (idct_block) to the result of motion compensation.

The main procedure in the MPEG-2 software is Reconstruct. It performs the motion compensation for one Macroblock. It first performs the forward motion compensation. This involves checking the picture type and the motion compensation type. Each of the picture types (Frame and Field pictures) has 3 motion compensation types. For each motion compensation, a number of predictions are made (1, 2 or 4 predictions). A prediction performs the reconstruction from an odd field or even field of the reference picture. This happens even for the Frame motion compensation type. A graphical representation of the Reconstruct procedure is shown in Figure 4. The Reconstruct procedure calls the Recon procedure. Recon calls the Recon_comp procedure for each colour component, while Recon_comp performs the actual motion compensation. It copies the pixels from the reference picture, pointed at by the motion vector, to the current picture. Since a motion vector can have half pixel resolution in the X- and Y-direction, 1, 2 or even 4 pixels of the reference picture are needed to predict one pixel of the current picture. Hence, the Recon_comp procedure consists of four parts:
- Prediction for motion vectors with full pixel resolution
- Prediction for motion vectors with half pixel resolution in the X-direction
- Prediction for motion vectors with half pixel resolution in the Y-direction
- Prediction for motion vectors with half pixel resolution in the X- and Y-direction
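The actual Recon_comp code is not reproduced here; the following minimal C sketch only illustrates the half-pixel averaging these four cases perform, assuming 8-bit reference samples and hx, hy being the half-pixel flags (0 or 1) of the motion vector.

    /* One predicted pixel from the reference picture at integer position
       (x, y); stride is the line width of the reference in pixels.        */
    static unsigned char predict_pixel(const unsigned char *ref, int stride,
                                       int x, int y, int hx, int hy)
    {
        const unsigned char *p = ref + y * stride + x;
        if (!hx && !hy) return p[0];                              /* full pel     */
        if ( hx && !hy) return (p[0] + p[1] + 1) >> 1;            /* half pel X   */
        if (!hx &&  hy) return (p[0] + p[stride] + 1) >> 1;       /* half pel Y   */
        return (p[0] + p[1] + p[stride] + p[stride+1] + 2) >> 2;  /* half X and Y */
    }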
Now that a short description of the structure of the MPEG-2 software has been given, it is important to investigate which data structures are used within the MPEG-2 software. This is shown for the getMBs procedure in Figure 5. All these data structures are counted and the results are shown in Table 4. Figure 5 and Table 4 show that the most important data structures in the getMBs procedure are:
- idct_block array
- Current picture array

The accesses to all other data structures can be ignored. For the Reconstruct procedure the only main data structures are:
- Current picture array
- Forward reference picture array
- Backward reference picture array

The list of the number of transfers for picture 29 of the Tceh v2 test stream, as shown in Table 4, will be used as reference for the optimization steps. The verification of all optimizations is done by using 6 test streams. These test streams differ in the picture types/motion compensation methods used. One test stream even uses SNR scalability. A description of the six test streams is provided in Table 5.

Fig. 4. Structure of the procedure Reconstruct (decision tree over forward/backward MC, picture type and motion type, with the Recon calls made for the predictions of the top and bottom field).

Table 4. Memory transfers for picture 29 of the Tceh v2 test stream (columns: Source of transfers, #Reads, #Writes).
IDCT BLOCKS DATA STRUCTURE : Saturate procedure, Clearblock, Transform, Getblock, Addblock
Zigzag array
QUANTIZER ARRAY
VLC TABLE FOR MACROBLOCK : Header decoding
VLC TABLE FOR idct VALUES
FORWARD REFERENCE FRAME : Forward Prediction Y, Forward Prediction C
BACKWARD REFERENCE FRAME : Backward Prediction Y, Backward Prediction C
CURRENT FRAME : Forward Prediction Y, Forward Prediction C, Backward Prediction Y, Backward Prediction C, Addblock (Y and C)
Total

Global data-flow optimizations

The first phase of our high-level optimization methodology is global data-flow optimization. Data-flow optimizations can reduce the number of transfers to the memory

by inserting conditions to handle border exceptions. Furthermore, it is an enabling step for other optimization steps like loop transformations and memory hierarchy exploitation.

Table 5. Test video streams used for verification (columns: Name, Picture structure, Motion types, Picture size, Nr. of pictures, Order).
Tceh v2  | Frame       | Frame/Field      | 720 x | I B B P
Sony-ct1 | Frame/Field | Frame/Field/16x8 | 352 x | BB B IP B
Sony-5   | Frame       | Frame/Field/DP   | 256 x | I P P
Sony-11  | Field       | Field/16x8/DP    | 256 x | IP PP PP
Field3   | Field       | Frame/16x8       | 704 x | IP BB BB PP
Nokia3*  | Frame       | Frame/Field      | 352 x | I B B P
* This is an SNR scalable video stream.

Reducing the number of transfers to the idct_block structure

Table 4 shows that a large number of transfers are made to the idct_block data structure. But it is also the data structure where large gains can be obtained. The following optimizations are possible to reduce the number of transfers to the idct_block structure:
- Reduce the number of invocations of the idct_transform, Saturate, Addblock and Clearblock procedures.
- Reduce the number of transfers by removing the Saturate procedure.
- Reduce the number of transfers by removing the Clearblock procedure.
- Reduce the number of transfers by removing the Sumblock procedure.

Fig. 5. The structure of getMBs with the referenced data structures (reading the Macroblock header and VLC tables, Clearblock for all blocks in a Macroblock, reading the idct blocks for all coded blocks, motion compensation reading the forward and backward reference frames, and Saturate, idct transform and Addblock reading and writing the idct_blocks and the current frame).

The first optimization is explained in the next paragraph. The last three optimizations will be discussed later, since they are not based only on data-flow optimizations but also involve loop transformations.

General reduction of the number of transfers to the idct data structure. Figure 5 shows the basic structure of the getMBs procedure of the MPEG-2 decoder. The procedures that use the idct_block data structure (Figure 5) are always executed for all idct blocks in a Macroblock (minimum 6 and maximum 12). If an idct block is empty, unnecessary processing is performed. Therefore the number of transfers to the idct_block data structure can be reduced when the procedures are only executed where necessary. Thus Sumblock, Saturate, idct_transform and Addblock should only be executed for the encoded idct blocks in the normal or the SNR-enhancement layer stream; the condition has to be extended to include the idct blocks of the SNR-scalable stream to allow correct decoding of these streams. To reduce the number of transfers for the Clearblock procedure, the same approach can be used. The problem is how to deal with SNR-scalable streams, since the idct blocks of an SNR-scalable stream are added to the lower layer idct blocks. Therefore the idct blocks which are not used, or which could be overwritten by an SNR-scalable stream, should be cleared.
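The following minimal sketch illustrates the restructured per-Macroblock loop; the names (coded_block_pattern, block_count) and the assumption that the decoder's Clearblock, Getblock, Saturate, idct_transform and Addblock procedures take a block index are illustrative only, and the SNR-scalable subtlety discussed above is left out.

    /* Process only the idct blocks that were actually coded in this
       Macroblock, instead of invoking the procedures for all blocks.        */
    static void process_macroblock(unsigned coded_block_pattern, int block_count)
    {
        int b;
        for (b = 0; b < block_count; b++) {
            if (!(coded_block_pattern & (1u << b)))
                continue;            /* empty idct block: skip all processing */
            Clearblock(b);           /* clear only blocks that will be filled */
            Getblock(b);             /* variable length + run length decoding */
            Saturate(b);
            idct_transform(b);
            Addblock(b);             /* add to the motion compensated data    */
        }
    }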

The power consumption for the six test streams, relative to the original power consumption of each test stream, is indicated as Opt 1 in Figure 8.

Global loop and control flow transformations

The second step in the ATOMIUM methodology involves loop and global control flow transformations. The goal is to change the global loop organization so that loops which access the same data structure can be merged, and to enable optimizations in subsequent steps like memory hierarchy exploration.

Combine forward and backward prediction. Table 4 shows that the number of reads of the current picture is not zero. Ideally this should be zero, since the current frame is created for display and consequently does not have to be read in principle. The reason for reading the current picture is found when examining Figure 4. It shows that backward motion compensation is performed after forward motion compensation. Thus when a macroblock is reconstructed with backward motion compensation, the result of forward motion compensation is read from the current picture and added to the result of backward motion compensation. The number of read operations to the current picture can thus be reduced by performing the backward and forward motion compensation at the same time. Consequently, the result from reconstruction is written only once into the current picture, while it is no longer read from it. The principle of this optimization is based on global code reordering and loop merging. The loop for forward motion compensation is merged with the loop of backward motion compensation. The result of this optimization is that on a pixel level a decision has to be made whether forward/backward motion compensation needs to be performed. This decision used to be on a block basis. The result for the Reconstruct procedure after combining forward and backward prediction is shown in Figure 6. The corresponding power figures are shown as Opt 2 in Figure 8. The results show that there seems to be no improvement for the Field3 test stream. The reason for this is that there are hardly any forward and backward predicted Macroblocks in this stream (which is exceptional).

Combine motion compensation and idct results. The number of reads from the current picture is still not zero after the previous optimization. The current picture is still read by the Addblock procedure. Addblock reads the result from motion compensation and adds the idct transformed data to it. Thus if we want to reduce the number of reads from the current picture to zero, we must also combine the idct with the motion compensation procedure. The basic principle used is again based on global code reorganization and loop merging. Only here, the loop in the Addblock procedure is merged with the complete reconstruction loop. Thus for each pixel the possible forward component, the possible backward component and/or the possible idct component are added. Although the idea is seemingly straightforward, the implementation is much more complicated. The problem with combining the idct and the motion compensation is that they don't have to traverse the same area in the same way. Remember that motion compensation is performed on fields while the idct can be frame based, as shown in Figure 7. The implemented solution is to find the corresponding block which holds the idct value per motion compensated pixel. The relative results for the power consumption are shown under Opt 3 in Figure 8.
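As an illustration of the merged reconstruction loop, the C sketch below adds the forward, backward and idct contributions per pixel; all names are illustrative, the prediction samples are assumed to have been fetched already (by Recon_comp), and the field/frame addressing subtleties discussed above are left out.

    /* Reconstruct one 16x16 luminance Macroblock after merging forward MC,
       backward MC and the idct contribution (illustrative only).            */
    static void reconstruct_mb(unsigned char cur[16][16],
                               const unsigned char fwd[16][16],
                               const unsigned char bwd[16][16],
                               const short idct_val[16][16],
                               int forward_mc, int backward_mc, int coded)
    {
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                int v = 0;
                if (forward_mc)                v += fwd[y][x];
                if (backward_mc)               v += bwd[y][x];
                if (forward_mc && backward_mc) v = (v + 1) >> 1;    /* average */
                if (coded)                     v += idct_val[y][x];
                cur[y][x] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
            }
        }
    }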
Fig. 6. Structure for motion compensation with reduced number of reads/writes into the current frame.

Removal of the Saturate procedure. The Saturate procedure limits all elements to the allowed range for the idct procedure. However, the number of valid entries in the idct procedure is smaller than the number of elements read by the Read_idct_block procedure. When the Saturate procedure is merged with the Read_idct_block procedure, the results for the power consumption are obtained as shown as Opt 4 in Figure 8.

Removal of the Clearblock procedure. The MPEG-2 stream contains as data the idct information of the encoded Macroblocks. One idct block is variable length encoded and run length coded, which means that the most common values require a small amount of bits (variable length coding) and only a non-zero value is transmitted together with the number of zero values before this non-zero value (run length coding). The MPEG-2 decoder program implements this by writing all zeroes into the idct block, which is then followed by filling the block with the non-zero values. This initialization step cannot be skipped, since the idct transform uses all values of an idct block. It is, however, possible to extend the idct_block data structure with valid bits which indicate whether a valid value is present in the corresponding position of the idct block. In this case the array does not have to be reset to zero, only the valid bits. To implement the valid bits, one 64 bit array has to be used for each idct block. If we assume a processor with a 32 bit data-word, then only two data-words are needed per idct block. This decreases the number of writes per idct block from 64 values to only two 32-bit words. The resulting power figures are shown under Opt 5 in Figure 8.
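A minimal C sketch of this valid-bit scheme is given below; the structure layout and names are illustrative and not taken from the decoder source.

    /* An 8x8 idct block extended with two 32-bit words of valid flags.        */
    typedef struct {
        short        coef[64];   /* only positions with a set valid bit hold data */
        unsigned int valid[2];   /* bit i set  =>  coef[i] was received           */
    } idct_block_t;

    static void clear_block(idct_block_t *b)
    {
        b->valid[0] = 0;          /* 2 writes instead of 64 */
        b->valid[1] = 0;
    }

    static void store_coef(idct_block_t *b, int pos, short val)
    {
        b->coef[pos] = val;
        b->valid[pos >> 5] |= 1u << (pos & 31);
    }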

Removal of the Sumblock procedure. Examination of Figure 5 shows that for SNR-scalable streams the Sumblock procedure is used to add the idct blocks of the enhancement layer to the normal layer, after which normal processing is performed. This requires, however, 2 data structures of (12 x 64) 16-bit values each. It is possible to load the normal layer idct values and then to add the received enhancement layer idct values to the normal layer. Thus instead of storing both layers' idct values in two data structures, the values of the second layer are immediately added to the lower layer. Again, global loop merging is used to merge the loop of the Addblock procedure with the loop of the Read_idct_block procedure. The final power consumption is shown as Opt 6 in Figure 8.

Memory hierarchy, allocation and assignment

The previous optimizations have changed the algorithm to reduce the number of transfers and to enable this optimization. Small data structures were introduced which can now be placed in a small on-chip buffer. However, this modification will also improve the performance of an embedded system with only a data-cache, since the buffers hold frequently referenced data and these are very good caching candidates. Especially if the operation of the software controlled cache is known, it can be prevented that data which is only read once replaces these small and frequently referenced data. Although the total number of transfers is not reduced, the power consumption can be significantly lowered, because the number of layer 2 transfers is drastically reduced by inserting the layer 1 buffers. There are three data structures which could be moved from layer 2 memory to layer 1 memory. These data structures are:
- idct_block data structure


More information

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Video Transcoding Architectures and Techniques: An Overview IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Outline Background & Introduction Bit-rate Reduction Spatial Resolution

More information

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn Basic Video Compression Techniques Chapter 10 10.1 Introduction to Video Compression

More information

Digital Video Processing

Digital Video Processing Video signal is basically any sequence of time varying images. In a digital video, the picture information is digitized both spatially and temporally and the resultant pixel intensities are quantized.

More information

In the name of Allah. the compassionate, the merciful

In the name of Allah. the compassionate, the merciful In the name of Allah the compassionate, the merciful Digital Video Systems S. Kasaei Room: CE 315 Department of Computer Engineering Sharif University of Technology E-Mail: skasaei@sharif.edu Webpage:

More information

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Upcoming Video Standards Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Outline Brief history of Video Coding standards Scalable Video Coding (SVC) standard Multiview Video Coding

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Cross Layer Protocol Design

Cross Layer Protocol Design Cross Layer Protocol Design Radio Communication III The layered world of protocols Video Compression for Mobile Communication » Image formats» Pixel representation Overview» Still image compression Introduction»

More information

An Infrastructural IP for Interactive MPEG-4 SoC Functional Verification

An Infrastructural IP for Interactive MPEG-4 SoC Functional Verification ITB J. ICT Vol. 3, No. 1, 2009, 51-66 51 An Infrastructural IP for Interactive MPEG-4 SoC Functional Verification 1 Trio Adiono, 2 Hans G. Kerkhoff & 3 Hiroaki Kunieda 1 Institut Teknologi Bandung, Bandung,

More information

Audio-coding standards

Audio-coding standards Audio-coding standards The goal is to provide CD-quality audio over telecommunications networks. Almost all CD audio coders are based on the so-called psychoacoustic model of the human auditory system.

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Data Storage Exploration and Bandwidth Analysis for Distributed MPEG-4 Decoding

Data Storage Exploration and Bandwidth Analysis for Distributed MPEG-4 Decoding Data Storage Exploration and Bandwidth Analysis for Distributed MPEG-4 oding Milan Pastrnak, Peter H. N. de With, Senior Member, IEEE Abstract The low bit-rate profiles of the MPEG-4 standard enable video-streaming

More information

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM Thinh PQ Nguyen, Avideh Zakhor, and Kathy Yelick * Department of Electrical Engineering and Computer Sciences University of California at Berkeley,

More information

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework System Modeling and Implementation of MPEG-4 Encoder under Fine-Granular-Scalability Framework Literature Survey Embedded Software Systems Prof. B. L. Evans by Wei Li and Zhenxun Xiao March 25, 2002 Abstract

More information

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING Dieison Silveira, Guilherme Povala,

More information

Lecture 5: Error Resilience & Scalability

Lecture 5: Error Resilience & Scalability Lecture 5: Error Resilience & Scalability Dr Reji Mathew A/Prof. Jian Zhang NICTA & CSE UNSW COMP9519 Multimedia Systems S 010 jzhang@cse.unsw.edu.au Outline Error Resilience Scalability Including slides

More information

Compressed-Domain Video Processing and Transcoding

Compressed-Domain Video Processing and Transcoding Compressed-Domain Video Processing and Transcoding Susie Wee, John Apostolopoulos Mobile & Media Systems Lab HP Labs Stanford EE392J Lecture 2006 Hewlett-Packard Development Company, L.P. The information

More information

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota

More information

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

International Journal of Emerging Technology and Advanced Engineering Website:   (ISSN , Volume 2, Issue 4, April 2012) A Technical Analysis Towards Digital Video Compression Rutika Joshi 1, Rajesh Rai 2, Rajesh Nema 3 1 Student, Electronics and Communication Department, NIIST College, Bhopal, 2,3 Prof., Electronics and

More information

Performance Tuning on the Blackfin Processor

Performance Tuning on the Blackfin Processor 1 Performance Tuning on the Blackfin Processor Outline Introduction Building a Framework Memory Considerations Benchmarks Managing Shared Resources Interrupt Management An Example Summary 2 Introduction

More information

Design guidelines for embedded real time face detection application

Design guidelines for embedded real time face detection application Design guidelines for embedded real time face detection application White paper for Embedded Vision Alliance By Eldad Melamed Much like the human visual system, embedded computer vision systems perform

More information

A real-time SNR scalable transcoder for MPEG-2 video streams

A real-time SNR scalable transcoder for MPEG-2 video streams EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics and Computer Science A real-time SNR scalable transcoder for MPEG-2 video streams by Mohammad Al-khrayshah Supervisors: Prof. J.J. Lukkien Eindhoven

More information

TKT-2431 SoC design. Introduction to exercises. SoC design / September 10

TKT-2431 SoC design. Introduction to exercises. SoC design / September 10 TKT-2431 SoC design Introduction to exercises Assistants: Exercises and the project work Juha Arvio juha.arvio@tut.fi, Otto Esko otto.esko@tut.fi In the project work, a simplified H.263 video encoder is

More information

DigiPoints Volume 1. Student Workbook. Module 8 Digital Compression

DigiPoints Volume 1. Student Workbook. Module 8 Digital Compression Digital Compression Page 8.1 DigiPoints Volume 1 Module 8 Digital Compression Summary This module describes the techniques by which digital signals are compressed in order to make it possible to carry

More information

LECTURE VIII: BASIC VIDEO COMPRESSION TECHNIQUE DR. OUIEM BCHIR

LECTURE VIII: BASIC VIDEO COMPRESSION TECHNIQUE DR. OUIEM BCHIR 1 LECTURE VIII: BASIC VIDEO COMPRESSION TECHNIQUE DR. OUIEM BCHIR 2 VIDEO COMPRESSION A video consists of a time-ordered sequence of frames, i.e., images. Trivial solution to video compression Predictive

More information

A Video CoDec Based on the TMS320C6X DSP José Brito, Leonel Sousa EST IPCB / INESC Av. Do Empresário Castelo Branco Portugal

A Video CoDec Based on the TMS320C6X DSP José Brito, Leonel Sousa EST IPCB / INESC Av. Do Empresário Castelo Branco Portugal A Video CoDec Based on the TMS320C6X DSP José Brito, Leonel Sousa EST IPCB / INESC Av. Do Empresário Castelo Branco Portugal jbrito@est.ipcb.pt IST / INESC Rua Alves Redol, Nº 9 1000 029 Lisboa Portugal

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 ISSCC 26 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 22.1 A 125µW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications Tsu-Ming Liu 1, Ting-An Lin 2, Sheng-Zen Wang 2, Wen-Ping Lee

More information

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Project Title: Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Midterm Report CS 584 Multimedia Communications Submitted by: Syed Jawwad Bukhari 2004-03-0028 About

More information

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2 ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2 9.2 A 80/20MHz 160mW Multimedia Processor integrated with Embedded DRAM MPEG-4 Accelerator and 3D Rendering Engine for Mobile Applications

More information

FPGA based High Performance CAVLC Implementation for H.264 Video Coding

FPGA based High Performance CAVLC Implementation for H.264 Video Coding FPGA based High Performance CAVLC Implementation for H.264 Video Coding Arun Kumar Pradhan Trident Academy of Technology Bhubaneswar,India Lalit Kumar Kanoje Trident Academy of Technology Bhubaneswar,India

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

Image Compression for Mobile Devices using Prediction and Direct Coding Approach

Image Compression for Mobile Devices using Prediction and Direct Coding Approach Image Compression for Mobile Devices using Prediction and Direct Coding Approach Joshua Rajah Devadason M.E. scholar, CIT Coimbatore, India Mr. T. Ramraj Assistant Professor, CIT Coimbatore, India Abstract

More information

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform Circuits and Systems, 2010, 1, 12-17 doi:10.4236/cs.2010.11003 Published Online July 2010 (http://www.scirp.org/journal/cs) Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block

More information

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM 1 KALIKI SRI HARSHA REDDY, 2 R.SARAVANAN 1 M.Tech VLSI Design, SASTRA University, Thanjavur, Tamilnadu,

More information

Audio-coding standards

Audio-coding standards Audio-coding standards The goal is to provide CD-quality audio over telecommunications networks. Almost all CD audio coders are based on the so-called psychoacoustic model of the human auditory system.

More information

Design of Low Power Wide Gates used in Register File and Tag Comparator

Design of Low Power Wide Gates used in Register File and Tag Comparator www..org 1 Design of Low Power Wide Gates used in Register File and Tag Comparator Isac Daimary 1, Mohammed Aneesh 2 1,2 Department of Electronics Engineering, Pondicherry University Pondicherry, 605014,

More information

Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review

Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review Memories: Review Chapter 7 Large and Fast: Exploiting Hierarchy DRAM (Dynamic Random Access ): value is stored as a charge on capacitor that must be periodically refreshed, which is why it is called dynamic

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

TKT-2431 SoC design. Introduction to exercises

TKT-2431 SoC design. Introduction to exercises TKT-2431 SoC design Introduction to exercises Assistants: Exercises Jussi Raasakka jussi.raasakka@tut.fi Otto Esko otto.esko@tut.fi In the project work, a simplified H.263 video encoder is implemented

More information

7.5 Dictionary-based Coding

7.5 Dictionary-based Coding 7.5 Dictionary-based Coding LZW uses fixed-length code words to represent variable-length strings of symbols/characters that commonly occur together, e.g., words in English text LZW encoder and decoder

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Digital video coding systems MPEG-1/2 Video

Digital video coding systems MPEG-1/2 Video Digital video coding systems MPEG-1/2 Video Introduction What is MPEG? Moving Picture Experts Group Standard body for delivery of video and audio. Part of ISO/IEC/JTC1/SC29/WG11 150 companies & research

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER 84 CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER 3.1 INTRODUCTION The introduction of several new asynchronous designs which provides high throughput and low latency is the significance of this chapter. The

More information

ESE532 Spring University of Pennsylvania Department of Electrical and System Engineering System-on-a-Chip Architecture

ESE532 Spring University of Pennsylvania Department of Electrical and System Engineering System-on-a-Chip Architecture University of Pennsylvania Department of Electrical and System Engineering System-on-a-Chip Architecture ESE532, Spring 2017 HW2: Profiling Wednesday, January 18 Due: Friday, January 27, 5:00pm In this

More information

Advanced optimizations of cache performance ( 2.2)

Advanced optimizations of cache performance ( 2.2) Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

The Basics of Video Compression

The Basics of Video Compression The Basics of Video Compression Marko Slyz February 18, 2003 (Sourcecoders talk) 1/18 Outline 1. Non-technical Survey of Video Compressors 2. Basic Description of MPEG 1 3. Discussion of Other Compressors

More information

Zonal MPEG-2. Cheng-Hsiung Hsieh *, Chen-Wei Fu and Wei-Lung Hung

Zonal MPEG-2. Cheng-Hsiung Hsieh *, Chen-Wei Fu and Wei-Lung Hung International Journal of Applied Science and Engineering 2007. 5, 2: 151-158 Zonal MPEG-2 Cheng-Hsiung Hsieh *, Chen-Wei Fu and Wei-Lung Hung Department of Computer Science and Information Engineering

More information

Lecture 5: Compression I. This Week s Schedule

Lecture 5: Compression I. This Week s Schedule Lecture 5: Compression I Reading: book chapter 6, section 3 &5 chapter 7, section 1, 2, 3, 4, 8 Today: This Week s Schedule The concept behind compression Rate distortion theory Image compression via DCT

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Ch. 4: Video Compression Multimedia Systems

Ch. 4: Video Compression Multimedia Systems Ch. 4: Video Compression Multimedia Systems Prof. Ben Lee (modified by Prof. Nguyen) Oregon State University School of Electrical Engineering and Computer Science 1 Outline Introduction MPEG Overview MPEG

More information

Advanced Encoding Features of the Sencore TXS Transcoder

Advanced Encoding Features of the Sencore TXS Transcoder Advanced Encoding Features of the Sencore TXS Transcoder White Paper November 2011 Page 1 (11) www.sencore.com 1.605.978.4600 Revision 1.0 Document Revision History Date Version Description Author 11/7/2011

More information

Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB

Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB R. Challoo, I.P. Thota, and L. Challoo Texas A&M University-Kingsville Kingsville, Texas 78363-8202, U.S.A. ABSTRACT

More information

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Subash Chandar G (g-chandar1@ti.com), Vaideeswaran S (vaidee@ti.com) DSP Design, Texas Instruments India

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

Module 10 MULTIMEDIA SYNCHRONIZATION

Module 10 MULTIMEDIA SYNCHRONIZATION Module 10 MULTIMEDIA SYNCHRONIZATION Lesson 36 Packet architectures and audio-video interleaving Instructional objectives At the end of this lesson, the students should be able to: 1. Show the packet architecture

More information

The Implement of MPEG-4 Video Encoding Based on NiosII Embedded Platform

The Implement of MPEG-4 Video Encoding Based on NiosII Embedded Platform The Implement of MPEG-4 Video Encoding Based on NiosII Embedded Platform Fugang Duan School of Optical-Electrical and Computer Engineering, USST Shanghai, China E-mail: dfgvvvdfgvvv@126.com Zhan Shi School

More information

Video Coding Standards: H.261, H.263 and H.26L

Video Coding Standards: H.261, H.263 and H.26L 5 Video Coding Standards: H.261, H.263 and H.26L Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) 5.1 INTRODUCTION

More information

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

Scalable Video Coding

Scalable Video Coding Introduction to Multimedia Computing Scalable Video Coding 1 Topics Video On Demand Requirements Video Transcoding Scalable Video Coding Spatial Scalability Temporal Scalability Signal to Noise Scalability

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Design of 2-D DWT VLSI Architecture for Image Processing

Design of 2-D DWT VLSI Architecture for Image Processing Design of 2-D DWT VLSI Architecture for Image Processing Betsy Jose 1 1 ME VLSI Design student Sri Ramakrishna Engineering College, Coimbatore B. Sathish Kumar 2 2 Assistant Professor, ECE Sri Ramakrishna

More information

Lecture 7, Video Coding, Motion Compensation Accuracy

Lecture 7, Video Coding, Motion Compensation Accuracy Lecture 7, Video Coding, Motion Compensation Accuracy Last time we saw several methods to obtain a good motion estimation, with reduced complexity (efficient search), and with the possibility of sub-pixel

More information

Low-Power Data Address Bus Encoding Method

Low-Power Data Address Bus Encoding Method Low-Power Data Address Bus Encoding Method Tsung-Hsi Weng, Wei-Hao Chiao, Jean Jyh-Jiun Shann, Chung-Ping Chung, and Jimmy Lu Dept. of Computer Science and Information Engineering, National Chao Tung University,

More information

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner Sandbridge Technologies, 1 North Lexington Avenue, White Plains, NY 10601 sjinturkar@sandbridgetech.com

More information

COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity

More information

Advanced Video Coding: The new H.264 video compression standard

Advanced Video Coding: The new H.264 video compression standard Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission

More information

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen

More information

FPGA Implementation of 2-D DCT Architecture for JPEG Image Compression

FPGA Implementation of 2-D DCT Architecture for JPEG Image Compression FPGA Implementation of 2-D DCT Architecture for JPEG Image Compression Prashant Chaturvedi 1, Tarun Verma 2, Rita Jain 3 1 Department of Electronics & Communication Engineering Lakshmi Narayan College

More information