Kluwer Journal on VLSI Signal Processing, 18 (1998)
© 1998 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

System-level power optimization of video codecs on embedded cores : a systematic approach

LODE NACHTERGAELE, DENNIS MOOLENAAR, BART VANHOOF, FRANCKY CATTHOOR AND HUGO DE MAN
nachterg@imec.be, dennism@imec.be, vanhoofb@imec.be, catthoor@imec.be, deman@imec.be
Interuniversity Micro Electronics Centrum (IMEC), Leuven, Belgium

Received March, 1997; Revised July, 1997

Abstract. A battery powered multimedia communication device requires a very energy efficient implementation. The required efficiency can only be acquired by careful optimization at all levels of the design. System-level power optimizations have a dramatic impact on the overall power budget. We have proposed a system-level step-wise methodology to reduce the power in hardware realizations of data-dominated applications, which is partly supported with our ATOMIUM environment. In this paper, we extend the methodology to the realization of embedded software on processor cores. Starting from a high level algorithm description (e.g. in C), a set of optimizations gradually refines the code and the corresponding memory organization of the array data types. These array data types represent a fully detailed, optimized data storage and transfer organization. Instead of creating the physical memories, a mapping can be done either on a general memory architecture, including a cache, or on a custom memory architecture. First, typical optimizations addressed by our methodology are applied to a didactical example. The effectiveness of the methodology is then demonstrated by the optimization of two complex applications in an embedded processor context: an MPEG-2 and an H.263 video decoder. The impact of the power optimizations on the typical power consumption is demonstrated by simulating the optimized decoders with real video streams.

Keywords: low power, system power optimizations, embedded system design

1. Introduction

Nowadays, people see the need for a portable multimedia device capable of handling complex and advanced data communication. Text and speech services are already available, but extensions towards audio, speech recognition/synthesis and video communication are still requested [1, 2, 3]. To provide those with a battery powered device is a real challenge, stimulating technological advances in several areas ranging from circuit techniques to new algorithms. The detailed H.263 design to which this paper makes some links was performed in co-operation with Texas Instruments Inc.

This paper addresses this power reduction challenge at the system level by showing that our data storage and transfer optimization methodology [4], ATOMIUM, oriented to hardware realization of data-dominated applications, can also be applied in an embedded software context if applied with care and if some extra transformations are incorporated to keep the cycle impact low. The traditional approach for software is to compile the input C-code on a general memory architecture as shown in Figure 1(a) and to rely on the general purpose cache to perform the data transfer management. In embedded systems, extra freedom in the memory architecture allows this cache to be replaced with a more effective alternative, since the application is known in

Fig. 1. Conventionally, C-code is compiled onto architecture (a); for embedded systems an optimized C-code together with a memory organization like (b) will be more power efficient. (a) Conventional memory organization: main memory and a general purpose cache between the multi-dimensional data structures Dx of the C-code and the processor. (b) Application specific memory organization: dedicated layer 1 and layer 2 memories holding the data structures.

advance and can therefore be fully analyzed. Hence we propose to replace the general purpose cache by an application specific memory architecture as shown in Figure 1(b). This memory organization, fully matched to the optimized code which is obtained after applying global system-level control- and data-flow transformations on the original C-code, heavily reduces the overall power consumption for data-dominated applications. This is demonstrated on two complex algorithms, namely an MPEG-2 and an H.263 video decoder.

Multimedia applications are almost by definition data intensive applications, meaning that the data transfer and memory organization will have a dominant impact on the power and area cost of the realization. Providing relevant feedback at an early phase of the design makes algorithmic designers more implementation aware, enabling them to make global algorithm/implementation trade-offs. This is crucial if energy efficiency of complex multimedia systems is of vital importance. Therefore we have proposed to optimize the data transfer and memory organization prior to the realization of the data-path and controller functions, since they have the biggest relative impact on the global implementation cost [5, 6]. The optimized description of the data transfer and memory organization is then used as input for system realization. As long as the memory organization can be customized, the globally optimized result will yield a better overall result. This has been demonstrated by us for application-specific implementations where the designer has full freedom [5, 6]. However, even if the number and size of the memories are predefined, the freedom in the data layout inside the memories still provides a major opportunity for optimization, as we will show in this paper.

Here, the impact of the potential system level data transfer and storage optimizations is demonstrated by comparing simulations of the optimized decoders with the reference. The outcome of these simulations shows that embedded software implementations of multimedia standards can be made much more energy efficient than the conventional method starting from C. This is obtained without paying a significant penalty in terms of the size of the object code and the number of instruction cycles needed. The resulting optimized software can then be mapped on a general or a custom memory architecture. The performance of both implementations in terms of power and execution time will be better compared to the original code. This is an important new contribution. It allows the main advantages of software techniques used in embedded software design to be combined with the energy efficiency of application specific implementations for data-dominated applications.

2. Related work

Many video processing architectures have been published (see e.g. [7, 8, 9, 10]). Several of these exhibit low power consumption (see e.g. [2, 1, 5, 11]). A very nice example of what can be reached is a video decompression circuit dissipating less than 9 mW [11]. The low power requirements were met by making use of special circuits, a low voltage supply (1.35 V) and several system-level measures, including the avoidance of external memories. Based on this number, one could conclude that video for portable devices is already possible. This is indeed true for small image sizes and certain decompression techniques. However, a considerable design effort is needed to implement advanced algorithms in an efficient way. Moreover, the system-level design approaches used in such designs are ad hoc up to now. These ad hoc methods will not suffice for the more complex future applications. Next generation information presentation technologies such as 3D graphics, volume rendering and virtual reality will strain the implementation requirements over their current design limits. The magic trick that reduces the power by a factor of 25 by lowering the voltage supply from 5 V to 1 V [12] by the year 2000 will run out of steam too. Further lowering the supply voltage is expected to be problematic due to problems with the threshold voltage V_T. Hence lowering the supply voltage is only part of the answer. Therefore, also (much) more power efficient system-level architectures, and algorithms optimized to exploit these, are needed to cope with upcoming implementation challenges.

Chips that decode [13, 14, 15] or encode [16, 15] MPEG-2 streams at main level (ML), main profile (MP) dissipate an acceptable few Watts. However, they typically need 4 to 16 Mbit of fast RAM with a high bandwidth. The resulting global multi-chip system dissipation prohibits acceptable lifetimes of the battery [11, 2]. The need for I/O buffers could be eliminated if logic and DRAM functionality can be provided by one and the same technology. This technology is not only required for energy efficiency reasons but also to reduce the system cost and to increase the memory bandwidth. The M32R/D from Hitachi is a step in this direction [17]. This chip combines a 52.4 MIPS (Dhrystone V2.1 rating) 32 bit RISC processor with 16 Mbit of DRAM, which still does not allow real-time MPEG-2 decoding at all. The typical power consumption varies between 275 mW and 700 mW. One reason for the still relatively high power consumption is that the memory architecture of the M32R/D is not customized. We believe that even when an effective combination of memory-logic technology is provided, for power reasons the memory organization for future applications must still be better adapted to the application.

In this paper a system level data storage and transfer exploration [4, 5, 18, 19] is extended to an embedded software context. The impact of this is first explained by applying it to a relatively simple but realistic and easy to follow didactical example (Section 5). The effect of the optimizations is demonstrated with the high-level model of the power consumption presented in Section 3. Because a methodology cannot be validated with a simple design alone (that example took less than 1 man-month of design time), the system design optimization of an MPEG-2 video decoder and the impact on embedded power are also presented in Section 6.
Optimizations of the worst case power of an H.263 video decoder in a hardware context have been published in [5, 6]. In Section 7, estimates of the average energy dissipation in an embedded processor context before and after optimization are given. They are obtained by counting the number of transfers during simulation of the decoders with real video streams.

3. Power model

Obtaining a precise power estimation from a high-level system description requires a major design effort. Luckily this is not needed to take relevant high-level decisions during the system exploration. It is sufficient to have a relative comparison that selects the most promising candidates. For data-intensive applications, power due to memory transfers is dominant [20, 21, 11, 5]. We can neglect the power consumption in operators, control and internal routing during memory optimizations, since their power consumption will not be drastically affected by the memory optimizations compared to the achieved power reduction for data transfers and memory accesses. Clocking and most data transfers are for a large part related to the actual communication and storage access, and are therefore assumed to be proportional to it. In this way, our storage based power model is good enough to perform relative comparisons between design alternatives. It also provides a lower bound on the absolute access related power.

Unlike in operators, the power due to a memory access is independent of the data value transferred [20]. The power is a function of the size of the memory, the frequency of access and the technology:

Size: usually has a proportional, sublinear influence, but depends very much on the memory organization. Nowadays memories are often partitioned into several memory banks. Each time another bank is accessed, extra energy is consumed to power up the bank. In the extreme case, the power bottleneck is moved mostly to the periphery and a logarithmic dependence is created.

Frequency of access: the power is considered linearly proportional to the frequency of access. This assumes that the memory is in power down mode when no transfer is going on. Most modern embedded RAMs have a power down mode [22, 23]. Some memories require that the address and data bits are kept stable when in power down mode.

Technology: since this is the same for all on-chip memories considered in this paper, we exclude it from the power model. For off-chip memories different technologies are normally available. However, a low power 1 MB SRAM [24] will be assumed for external memory, since power consumption figures of other memories are unavailable. Our external memory model will only be used for the optimization of the MPEG-2 decoder. All other memories are assumed to be on-chip. This is a conservative assumption since accesses to off-chip memories are very costly in terms of energy consumption [11].

The simple power model used in this paper is:

    P_Transfers = E_Tr * (#Transfers / second)        (1)
    E_Tr = f(#words, #bits)                           (2)

In [25] a function f is proposed to estimate the energy per transfer E_Tr in terms of the number of words and the width in bits. Some possible values for the parameters are also provided there.

4. Target architecture

This section describes the target architecture on which the initial behavior description is mapped using the step-wise optimization methodology ATOMIUM [4]. An example of the optimization steps is shown in the didactical example of Section 5. Our goal is to map the behavioral description on an embedded core with a dedicated memory hierarchy. However, a general-purpose memory hierarchy, consisting of a data and instruction cache connected to data memory and instruction memory, could also be used. Both memory hierarchies are assumed to have an instruction cache in the instruction path. For data-dominated applications, which are mostly built around loop nests, the small (active part of the) instruction cache will seldom have a cache miss, so little power is spent in this component during actual execution. For the flow of data between memory and data-path, two main options are available: a general purpose memory hierarchy (a data-cache with data-memory) or a dedicated memory hierarchy. The dedicated memory hierarchy in an embedded core solution replaces the data-cache by several memory blocks which serve as buffers for the large data. One of these memory blocks could still behave as a cache (with an update protocol).

The outcome of the system exploration is a detailed memory organization. This consists of the number and the type of memories that are allocated. These can be selected from a library. For each memory, the parameters that fully characterize the memory are decided. These parameters are the number of locations, the width in number of bits and the number of read, write and read/write ports.
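As an illustration of how these memory parameters feed the power model of equations 1 and 2, consider the following minimal C sketch. The exact energy function f of [25] is not reproduced here; energy_per_transfer below uses placeholder coefficients and is only meant to show the structure of the estimation.

    #include <math.h>

    /* Parameters that characterize one allocated memory (cf. Section 4). */
    typedef struct {
        long words;      /* number of locations                       */
        int  bits;       /* width in bits                             */
        int  ports;      /* read, write and read/write ports (total)  */
    } memory_t;

    /* Placeholder for f(#words, #bits) of equation 2; the real
       coefficients come from [25] and are not reproduced here.        */
    static double energy_per_transfer(const memory_t *m)
    {
        return 1.0e-9 * m->bits * (1.0 + 0.1 * log((double)m->words));
    }

    /* Equation 1: power due to the transfers to/from one memory.      */
    static double transfer_power(const memory_t *m, double transfers_per_second)
    {
        return energy_per_transfer(m) * transfers_per_second;
    }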
Every multi-dimensional array of the description is assigned to one of the allocated memories. Also the base address, the storage order (e.g. row-wise or column-wise) and the window are known. The best result is obtained when the freedom exists to build the physical memory system as described by the system exploration. If this freedom does not exist, an improvement of the power consumption and performance can still be obtained by using the obtained results as a guideline for mapping the data structures on the available memory hierarchy. When a general memory hierarchy is used instead of the memory hierarchy proposed by the system exploration, a better overall result can still be obtained, since the system exploration methodology improves the temporal and spatial locality of the memory references, which for instance results in an improved cache hit ratio of the data-cache. However, a dedicated memory hierarchy will show the best improvement, since a distinction can then be made between

cacheable data and non-cacheable data. This distinction will greatly improve the hit rate of the data cache, or could even remove the need for a data-cache altogether. How the controller, data-paths and address generators are realized is not decided yet at this design stage. They can for example be mapped on a general purpose processor, or application specific components can be designed. A mixture of both approaches is also possible. This freedom is an important advantage of our approach.

5. Design methodology explained on a simple but realistic example

In this section we will apply the ATOMIUM system methodology proposed in [4] to a basic image processing kernel mapped to both an embedded processor core and to a hardware target. Starting from a behavioral description, each step of the ATOMIUM script will be applied and explained. In Sections 6 and 7, the methodology illustrated in this section will be applied to two real-life applications, an MPEG-2 and an H.263 video decoder respectively.

Describing the behavior

A convolution of a picture stored in a 2-dimensional array p[][] of size (W x H) with a 2-dimensional mask m[][] of size ((2N+1) x (2N+1)) is defined by (r : row, c : column):

    for all r in [0..H-1], for all c in [0..W-1] :
        cp[r][c] = sum_{y=-N..N} sum_{x=-N..N} p[r+y][c+x] * m[y][x]        (3)

Although most people would call this already a formal definition, it is not complete. The formula does not define what happens in case the free variables r or c are at the boundary of the allowed interval. In case of a negative address, or a value that exceeds a limit value max, a wrap-around is performed. For an index i valid in the range [0..max-1], the following index filter is used:

    f(i) = -i                      if i < 0
    f(i) = max - 2 - (i mod max)   if i >= max                              (4)
    f(i) = i                       otherwise

Equations 3 and 4 can easily be converted to a C program. To avoid problems at the border, the array p is extended with 2N rows and 2N columns:

    const int W = 256;
    const int H = 256;
    const int N = 1;

    convol2d()
    {
      int p[H+2*N][W+2*N];
      int cp[H][W];
      int m[2*N+1][2*N+1];
      int r, c, x, y;

      m[0][0] = 1; m[0][1] = 2; m[0][2] = 1;
      m[1][0] = 2; m[1][1] = 4; m[1][2] = 2;
      m[2][0] = 1; m[2][1] = 2; m[2][2] = 1;

      /* loop 1 : read in the picture */
      for (r = 0; r <= H-1; r++)
        for (c = 0; c <= W-1; c++)
          p[N+r][N+c] = input;

      /* loop 2 : calculate the convolution */
      for (r = 0; r <= H-1; r++) {
        for (c = 0; c <= W-1; c++) {
          int sum = 0;
          for (y = -N; y <= N; y++)
            for (x = -N; x <= N; x++)
              sum += p[N+r+y][N+c+x] * m[N+y][N+x];
          cp[r][c] = sum/16;
        }
      }

      /* loop 3 : write out the result */
      for (r = 0; r <= H-1; r++)
        for (c = 0; c <= W-1; c++)
          output = cp[r][c];
    }

The comments mark the three main loops: 1. read in the picture, 2. calculate the convolution, 3. write out the result. The above code exhibits (2N+1)*(2N+1)*W*H = 9 * 65,536 = 589,824 reads and (H+2N)*(W+2N) = 258 * 258 = 66,564 writes of array p. (The code for the initialization of the border is left out.) Array cp is read and written W*H = 65,536 times each.

In total, 589,824 + 66,564 + 2 * 65,536 = 787,460 transfers are performed to the memory for one picture. This would require a memory access time of 42 ns (at the targeted 30 frames/second). Using the power model of [25], the energy for one read to a memory of 66,564 words of eight bit is estimated to be 1.17 Joule. Hence, the power due to reads is estimated to be 76 mW. In this way, the power consumption due to memory transfers is estimated to be 727 mW for array p and 123 mW for array cp. The convolution needs 6 shift operations and 9 adds (not counting the operations required for address calculation). To achieve a frame rate of 30 frames/second, we need to perform (6+9) * 65,536 * 30 = 29,491,200 operations/second, i.e. about 30 MOPS. For 30 frames/second, 19.7 M memory accesses are required. Although the number of memory transfers is less than the number of operations, the power of the memory transfers is much higher [1]. Furthermore, it is very difficult to decrease the basic number of operations of a given algorithm, while it is possible to decrease the number of memory transfers. The number of memory transfers depends on the implementation of the algorithm, while the number of operations is implied by the algorithm.

Global data-flow optimizations

In general, many system-level data-flow transformations can be performed to optimize the data transfer and storage [18]. In this particular case, only the conditional signal propagation class can be illustrated. Indeed, instead of enlarging the array p to cope with border effects, conditions can be inserted into the inner loop to handle boundary exceptions. The loop body of the second loop becomes:

    int j = r+y;
    if (j < 0)  j = -j;
    if (j >= H) j = H-2 - (j % H);
    int i = c+x;
    if (i < 0)  i = -i;
    if (i >= W) i = W-2 - (i % W);
    sum += p[j][i] * m[y][x];

These conditions allow the size and the number of writes of array p to be reduced from (H+2N)*(W+2N) to H*W, hence reducing the power. The increase in the number of instructions will affect the execution time, but since all these instructions can be found in the instruction cache, this increase will be very small. The reason for applying this optimization is that, by reducing the number of memory transfers, the number of address calculations will also be decreased, which has a positive effect on the cycle count again. If the overall effect is still (too) negative, it is however still possible to trade off some gain in power consumption against the number of instructions and the code size. We assume that such an optimization is incorporated further on to avoid any significant cycle overhead.

Global loop and control flow transformations

Loop and other global control flow transformations mainly introduce locality of array accesses, hereby enabling other steps further in the design script. For example, in the following code:

    FOR i:=1 TO 10 DO
        A[i] := A[i-1] + 1;
    FOR j:=1 TO 10 DO
        B[j] := A[j] + 1;

all elements of A are still needed after execution of loop i. Merging loop i with loop j will only change the order of computation. The reordering enables the in-place optimization of an array further in the system design script. This transformation is combined with the elimination of potential foreground signals from the arrays to be stored in the background memories. For instance, when an array can be reduced to one scalar, which is the case for the A[] array when merging loops i and j, it is done at this stage in the design script.
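For illustration, a minimal C sketch of the merged form is given below; it assumes for simplicity that the initial value A[0] is 0, so the background array A[] disappears completely and only a scalar remains.

    int B[11];
    int a = 0;                  /* A[0], assumed to be 0 here for illustration */
    for (int i = 1; i <= 10; i++) {
        a = a + 1;              /* was: A[i] := A[i-1] + 1;                    */
        B[i] = a + 1;           /* was: B[j] := A[j]   + 1;                    */
    }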
Let us return to our convolution test-vehicle. In the initial ordering, loops 1, 2 and 3 are executed one after the other. We will now alter the loop ordering to enable reductions to scalars or in-place optimizations further on in the script. Since there is only a data-dependence between the production of cp[r][c] and its consumption, loops 2 and 3 can be merged. Due to this merger, the value that is written into array cp is immediately read again. Hence only 1 scalar is necessary and the array cp becomes obsolete. This reduces the number of words to be stored from 2*W*H (for arrays p and cp) to W*H = 65,536 words of 8 bit (for array p only). Because the convolution of pixel p[r][c] can only be completed when pixel p[r+N][c+N] is read, the

origin of the combined loop 2-3 has to be translated with respect to the origin of the index space of loop 1. The smallest translation vector that avoids read-before-write conflicts is (0, N). This means that the first operation of the combined loop 2-3 can be executed after reading N lines of the input image. This does not directly reduce the number of locations to be stored, but it improves the locality and it enables the in-place optimization further in the optimization script.

Memory hierarchy, allocation and assignment

At this stage we introduce layers of memory according to the distance between the data and the data operators. We define three layers:

layer 0 : registers in data-paths or processor cores.
layer 1 : relatively small, fast on-chip buffers and caches.
layer 2 : relatively big and slow memories.

By introduction of a (2N+1) x (2N+1) = 3 x 3 = 9 element buffer at layer 1, the (2N+1) x (2N+1) = 3 x 3 = 9 reads per pixel from array p at layer 2 can be reduced to 2N = 2 reads per pixel.

In-place optimization

A pixel at coordinate (r, c) can be convoluted when the bottom right pixel (r+N, c+N) is available. After this, the top left pixel at coordinate (r-N, c-N) is not needed any longer. This reasoning clearly shows that there is an opportunity for in-place array optimization. Instead of storing the incoming pixels into an array p of size H x W, a circular ("snake") buffer can be used. This buffer holds all pixels that will be needed to calculate the convolutions. The length of the buffer is the Manhattan distance between the head and the tail of the buffer if a row-major storage is assumed. This distance is 2N rows of length W. Hence, the buffer length is 2N*W + 2N. If N is small, 2N locations can be stored in registers by careful manipulation of the updates of the delay line. This reduces the remaining memory size to only 2N*W.

Mapping on the target architecture

For this illustrative example we selected two target architectures:

1. embedded software: the ARM7 RISC processor core with 1 external off-chip memory
2. hardware: custom standard cell design

The ARM7 core is capable of processing 17 MIPS at 25 MHz and 3 V. It consumes 0.6 mA/MHz at 3 V when fabricated in a 0.8 µm CMOS technology [26]. For the mapping on the ARM7 RISC processor, the cycle count oriented optimizations discussed in [27] are applied on both the initial and the optimized C description. The code was compiled with the Norcroft ARM C compiler (Version 4.66b) using the -Otime option. The resulting code was simulated with the ARM Source-level Debugger (version 4.45b). The clock frequency was set to 33.3 MHz. The memory system used was a 1 port read/write memory of 8 bits wide. This memory system was used since the ARM simulator does not incorporate caches or take into account the differences in memory speeds; therefore no difference will result from using our memory hierarchies or a single memory. We assumed 115 ns and 85 ns for non-sequential and sequential access respectively. A comparison of the original version (R) and the optimized version (V4) is shown in Table 1. We see that neither the reference description R (0.64 frames/second) nor the optimized description V4 (1.02 frames/second) reaches the design goal of 30 frames/second. Still, the optimizations we propose have a very large impact on the memory size (factor 4) and cycle count (40% less). The execution times could be improved further by making use of ARM's special instructions to load/store multiple words (the so-called STM and LDM instructions).
A factor of 4 speed-up is expected, but this is still not enough for real-time image processing. The main reason for not reaching the real-time image processing speed is that the ARM instruction set has no instructions to perform data access and data calculations in parallel. Several cycles are needed to read (3 cycles) or write (2 cycles) data from/to main memory. The number of execution cycles is lowered because fewer address calculations are needed for the reduced number of memory transfers. In case a data cache is available, as in our general memory architecture, a large speed-up can be obtained since only the snake buffer is used instead of two arrays. But the best performance is obtained when an on-chip memory is used for the circular buffer.
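For illustration, a minimal C sketch of the circular ("snake") buffer update described above is given below, assuming row-major storage, W = 256 and N = 1 (so the buffer holds the last 2N*W = 512 pixels); the window registers and the convolution itself are left out.

    #define W 256
    #define N 1
    #define SNAKE_LEN (2*N*W)                /* 512 locations                  */

    static unsigned char snake[SNAKE_LEN];

    /* Store the incoming pixel (r,c) and return the pixel that entered the
       buffer 2N lines earlier, i.e. the one that is about to leave it.        */
    static unsigned char snake_push(int r, int c, unsigned char in)
    {
        int pos = (r * W + c) % SNAKE_LEN;   /* circular, row-major address    */
        unsigned char old = snake[pos];
        snake[pos] = in;
        return old;
    }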

The second target architecture style is a standard cell design. The optimized C code was manually expanded to a VHDL description of the controller and the data-path. The VHDL was synthesized using the MTC22000 MIETEC 0.7 µm CMOS library with Synopsys Design Analyzer (Version 3.4). The resulting data-path and controller contain 1118 and 188 gates respectively. Note that no scan-test circuitry has been inserted. The resulting floorplan is depicted in Figure 3.

We can estimate the power breakdown of the chip (see Figure 2) as follows. The on-chip memory of 512 words of 8 bit is about 2 mm² big. Based on the data sheet, its power consumption is estimated to be 15 mW at 5 V. The off-chip driving power is estimated using:

    P_offchip = 1/2 * (C_offchip + C_offchipdriver) * V_DD^2 * (Nr. transitions/second)        (5)

where V_DD is the supply voltage, C_offchip is 30 pF for an advanced package and advanced printed circuit technology, and C_offchipdriver = 0.3 * C_offchip = 0.3 * 30 pF = 9 pF if the width decrease ratio in the inverter chain is 4 [28]. We counted the transitions in one output image of 256 by 256 pixels of eight bit; this corresponds to an activity of 0.29. At 30 frames/second, the off-chip driving power evaluates to P_offchip = 2.2 mW. The power consumption of the data-path and controller is estimated using PowerMill. The test length of the stimuli, applied during the power estimation simulation, was 3 lines of 256 pixels of an image of a natural type (Lena). The data-path and controller consume 12.3 mW on average for typical operation and 5.4 mW in standby mode. The latter is rather high. It is caused by the registers: regardless of the mode they are in, part of their circuitry is activated in each clock cycle and consumes power.

Fig. 2. Estimate of the power breakdown of the convolution chip: RAM 52%, data-path 41%, output 7%.

Data-path, controller, off-chip drivers and memory together consume about 29 mW while performing about 2.9 x 10^7 operations/second. This corresponds to about 10^9 operations/second per Watt. Remark that no special low-power library or logic synthesis scripts have been used. This is the main reason why, for this small example, the heavily optimized memory organization needs only a little more power than the unoptimized data-path.

Comparing relative improvements

In Table 2 and Table 3, analytical formulas are listed for the number of transfers and the size of signals p and cp respectively.

Table 1. Mapping to the ARM RISC processor (columns: Version, Size (8b), Time (s), #Instr., #Cycles; rows: R, V4).
R is the original version that serves as reference; V4 is the version after in-place optimizations; Size is the length of the program.

Fig. 3. Standard cell realization of the 2D convolution kernel.

The reference description needed (256+2)*(256+2) = 66,564 words of 8 bit to store p and

256 * 256 = 65,536 words of 8 bit for cp, or 132,100 words in total. The optimized description needs 2N*W = 512 words of 8 bit. Using power equation 1 shows that the power of version V4 is 31% of the power consumption of reference version R, which results in significant savings.

Table 2. Accesses and size of array p in several stages of the script.
Version | #Reads p            | #Writes p      | Size p
R       | (2N+1)*(2N+1)*W*H   | (H+2N)*(W+2N)  | (H+2N)*(W+2N)
V1      | id.                 | H*W            | H*W
V2      | id.                 | id.            | id.
V3      | (2N+1)*W*H          | id.            | id.
V4      | id.                 | id.            | 2N*W
R is the original version that serves as reference; V1 is the version after data-flow transformations; V2 is the version after loop transformations; V3 is the version after introduction of memory hierarchy; V4 is the version after in-place optimizations.

Table 3. Accesses and size of array cp in several stages of the script.
Version | #Reads cp | #Writes cp | Size cp
R       | H*W       | H*W        | H*W
V1      | id.       | id.        | id.
V2      |           |            |
V3      |           |            |
V4      |           |            |
R is the original version that serves as reference; V1 is the version after data-flow transformations; V2 is the version after loop transformations; V3 is the version after introduction of memory hierarchy; V4 is the version after in-place optimizations.

6. System exploration of an MPEG-2 decoder

The system level power exploration methodology as defined in the previous sections will now be used on the first full size example, namely a public domain MPEG-2 video decoder program. As the basis for the optimization, the MPEG-2 decoder software of the Software Simulation Group (SSG) was used. The version of the used software is 1.1a. The objective of the system exploration of the MPEG-2 decoder is to decrease the power consumption of the MPEG-2 video decoder program without changing or restricting its functionality. The goal of the optimizations is not to obtain the best possible result for each optimization discussed below, but to show the main principles and to illustrate that the steps of the ATOMIUM methodology arrive at an overall optimized result. Not all possible optimizations will be given here; details can be found in [29]. But first the most important facts about the MPEG-2 algorithm and software are provided. Then the steps of the ATOMIUM methodology, adapted for embedded software realizations, are applied to show the optimization impact on the C-code. After these optimizations the software is ported to an ARM core in section 6.7. This exploration is ended with the conclusions of the results obtained.

Overview of the MPEG-2 video compression standard

The aim of the Motion Pictures Expert Group (MPEG) is to produce video compression standards for different application areas. MPEG-1 is aimed at interactive games and low quality video, while the MPEG-2 standard is a general video compression standard for high quality universal video coding. An MPEG-2 picture consists of three components: Luminance (brightness), Chrominance 1 (colour component Cb) and Chrominance 2 (colour component Cr). The reason for splitting the picture into these three components is that the eye is less sensitive to colour information than to luminance (brightness). Thus MPEG-2 uses chrominance subsampling to increase the compression ratio. Other methods of compression used by MPEG-2 are Motion Compensation (temporal compression) and the Discrete Cosine Transform (DCT) (spatial compression). Both of these compression methods are block based. Motion compensation is performed on Macroblocks of 16x16 pixels of the luminance picture. The DCT blocks are 8x8 pixels in size.
MPEG-2 exhibits the following features:
- Frame (non-interlaced) and Field (interlaced) pictures
- 3 motion compensation modes per Frame or Field picture
- Interlaced and non-interlaced DCT coding
- Multiple chroma subsampling factors for different picture qualities
- Scalable video streams

A scalable MPEG-2 stream involves several bit streams, which consist of a normal bit stream and several enhancement layer bit streams. The normal bit stream represents, for instance, normal TV images, and the enhancement layer stream can upgrade the normal stream in:
- Transmission quality (data partitioning)
- Picture quality (SNR-scalable streams)
- Picture size (spatial scalability)
- Picture rate (temporal scalability)

For a more extensive comprehension of MPEG-1/MPEG-2 we refer to [30, 31, 29].

Behavior of the MPEG-2 decoder program

The used MPEG-2 decoder program is mainly written for explanation of the MPEG-2 standard and is therefore not fully optimized for speed. However, a fast idct algorithm and a fast variable length decoding algorithm are used, resulting in a good overall performance of the software. The structure of the MPEG-2 decoder software follows the MPEG-2 standard, so first the MPEG-2 picture headers are read. These contain the control information required by the decoder. Next, the procedure getMBs is started. This is the main procedure, which recreates a complete picture by traversing all Macroblocks of a picture. For each encoded Macroblock the Macroblock header is read, followed by the encoded DCT blocks. The last steps of the MPEG-2 algorithm are to perform motion compensation for the Macroblocks, and to add the result of the inverse transformed DCT blocks (idct_block) to the result of motion compensation.

The main procedure in the MPEG-2 software is Reconstruct. It performs the motion compensation for one Macroblock. It first performs the forward motion compensation. This involves checking the picture type and the motion compensation type. Each of the picture types (Frame and Field pictures) has 3 motion compensation types. For each motion compensation, a number of predictions are made (1, 2 or 4 predictions). A prediction performs the reconstruction from an odd field or even field of the reference picture. This happens even for the Frame motion compensation type. A graphical representation of the Reconstruct procedure is shown in Figure 4. The Reconstruct procedure calls the Recon procedure. Recon calls the Recon_comp procedure for each colour component, while Recon_comp performs the actual motion compensation. It copies the pixels from the reference picture, pointed at by the motion vector, to the current picture. Since a motion vector can have half pixel resolution in the X- and Y-direction, 1, 2 or even 4 pixels of the reference picture are needed to predict one pixel of the current picture. Hence, the Recon_comp procedure consists of four parts:
- Prediction for motion vectors with full pixel resolution
- Prediction for motion vectors with half pixel resolution in the X-direction
- Prediction for motion vectors with half pixel resolution in the Y-direction
- Prediction for motion vectors with half pixel resolution in the X- and Y-direction
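The actual Recon_comp code is not reproduced here; the following minimal C sketch only illustrates the half-pixel averaging these four cases perform, assuming 8-bit reference samples and hx, hy being the half-pixel flags (0 or 1) of the motion vector.

    /* One predicted pixel from the reference picture at integer position
       (x, y); stride is the line width of the reference in pixels.        */
    static unsigned char predict_pixel(const unsigned char *ref, int stride,
                                       int x, int y, int hx, int hy)
    {
        const unsigned char *p = ref + y * stride + x;
        if (!hx && !hy) return p[0];                              /* full pel     */
        if ( hx && !hy) return (p[0] + p[1] + 1) >> 1;            /* half pel X   */
        if (!hx &&  hy) return (p[0] + p[stride] + 1) >> 1;       /* half pel Y   */
        return (p[0] + p[1] + p[stride] + p[stride+1] + 2) >> 2;  /* half X and Y */
    }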
Now that a short description of the structure of the MPEG-2 software has been given, it is important to investigate which data structures are used within the MPEG-2 software. This is shown for the getMBs procedure in Figure 5. All these data structures are counted and the results are shown in Table 4. Figure 5 and Table 4 show that the most important data structures in the getMBs procedure are:
- idct_block array
- Current picture array

The accesses to all other data structures can be ignored. For the Reconstruct procedure the only main data structures are:
- Current picture array
- Forward reference picture array
- Backward reference picture array

The list of the number of transfers for picture 29 of the Tceh v2 test stream, as shown in Table 4, will be used as reference for the optimization steps. The verification of all optimizations is done by using 6 test streams. These test streams differ in the picture types/motion compensation methods used. One test stream even uses SNR scalability. A description of the six test streams is provided in Table 5.

Fig. 4. Structure of the procedure Reconstruct (decision tree over forward/backward MC, picture type and motion type, with the Recon calls made for the predictions of the top and bottom field).

Table 4. Memory transfers for picture 29 of the Tceh v2 test stream (columns: Source of transfers, #Reads, #Writes).
IDCT BLOCKS DATA STRUCTURE : Saturate procedure, Clearblock, Transform, Getblock, Addblock
Zigzag array
QUANTIZER ARRAY
VLC TABLE FOR MACROBLOCK : Header decoding
VLC TABLE FOR idct VALUES
FORWARD REFERENCE FRAME : Forward Prediction Y, Forward Prediction C
BACKWARD REFERENCE FRAME : Backward Prediction Y, Backward Prediction C
CURRENT FRAME : Forward Prediction Y, Forward Prediction C, Backward Prediction Y, Backward Prediction C, Addblock (Y and C)
Total

Global data-flow optimizations

The first phase of our high-level optimization methodology is global data-flow optimization. Data-flow optimizations can reduce the number of transfers to the memory

by inserting conditions to handle border exceptions. Furthermore, it is an enabling step for other optimization steps like loop transformations and memory hierarchy exploitation.

Table 5. Test video streams used for verification (columns: Name, Picture structure, Motion types, Picture size, Nr. of pictures, Order).
Tceh v2  | Frame       | Frame/Field      | 720 x | I B B P
Sony-ct1 | Frame/Field | Frame/Field/16x8 | 352 x | BB B IP B
Sony-5   | Frame       | Frame/Field/DP   | 256 x | I P P
Sony-11  | Field       | Field/16x8/DP    | 256 x | IP PP PP
Field3   | Field       | Frame/16x8       | 704 x | IP BB BB PP
Nokia3*  | Frame       | Frame/Field      | 352 x | I B B P
* This is an SNR scalable video stream.

Reducing the number of transfers to the idct_block structure

Table 4 shows that a large number of transfers are made to the idct_block data structure. But it is also the data structure where large gains can be obtained. The following optimizations are possible to reduce the number of transfers to the idct_block structure:
- Reduce the number of invocations of the idct_transform, Saturate, Addblock and Clearblock procedures.
- Reduce the number of transfers by removing the Saturate procedure.
- Reduce the number of transfers by removing the Clearblock procedure.
- Reduce the number of transfers by removing the Sumblock procedure.

Fig. 5. The structure of getMBs with the referenced data structures (reading the Macroblock header and VLC tables, Clearblock for all blocks in a Macroblock, reading the idct blocks for all coded blocks, motion compensation reading the forward and backward reference frames, and Saturate, idct transform and Addblock reading and writing the idct_blocks and the current frame).

The first optimization is explained in the next paragraph. The last three optimizations will be discussed later, since they are not based only on data-flow optimizations but also involve loop transformations.

General reduction of the number of transfers to the idct data structure. Figure 5 shows the basic structure of the getMBs procedure of the MPEG-2 decoder. The procedures that use the idct_block data structure (Figure 5) are always executed for all idct blocks in a Macroblock (minimum 6 and maximum 12). If an idct block is empty, unnecessary processing is performed. Therefore the number of transfers to the idct_block data structure can be reduced when the procedures are only executed where necessary. Thus Sumblock, Saturate, idct_transform and Addblock should only be executed for the encoded idct blocks in the normal or the SNR-enhancement layer stream; the condition has to be extended to include the idct blocks of the SNR-scalable stream to allow correct decoding of these streams. To reduce the number of transfers for the Clearblock procedure, the same approach can be used. The problem is how to deal with SNR-scalable streams, since the idct blocks of an SNR-scalable stream are added to the lower layer idct blocks. Therefore the idct blocks which are not used, or which could be overwritten by an SNR-scalable stream, should be cleared.
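The following minimal sketch illustrates the restructured per-Macroblock loop; the names (coded_block_pattern, block_count) and the assumption that the decoder's Clearblock, Getblock, Saturate, idct_transform and Addblock procedures take a block index are illustrative only, and the SNR-scalable subtlety discussed above is left out.

    /* Process only the idct blocks that were actually coded in this
       Macroblock, instead of invoking the procedures for all blocks.        */
    static void process_macroblock(unsigned coded_block_pattern, int block_count)
    {
        int b;
        for (b = 0; b < block_count; b++) {
            if (!(coded_block_pattern & (1u << b)))
                continue;            /* empty idct block: skip all processing */
            Clearblock(b);           /* clear only blocks that will be filled */
            Getblock(b);             /* variable length + run length decoding */
            Saturate(b);
            idct_transform(b);
            Addblock(b);             /* add to the motion compensated data    */
        }
    }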

The power consumption for the six test streams, relative to the original power consumption of each test stream, is indicated as Opt 1 in Figure 8.

Global loop and control flow transformations

The second step in the ATOMIUM methodology involves loop and global control flow transformations. The goal is to change the global loop organization so that loops which access the same data structure can be merged, and to enable optimizations in subsequent steps like memory hierarchy exploration.

Combine forward and backward prediction. Table 4 shows that the number of reads of the current picture is not zero. Ideally this should be zero, since the current frame is created for display and consequently does not have to be read in principle. The reason for reading the current picture is found when examining Figure 4. It shows that backward motion compensation is performed after forward motion compensation. Thus when a macroblock is reconstructed with backward motion compensation, the result of forward motion compensation is read from the current picture and added to the result of backward motion compensation. The number of read operations to the current picture can thus be reduced by performing the backward and forward motion compensation at the same time. Consequently, the result from reconstruction is written only once into the current picture, while it is no longer read from it. The principle of this optimization is based on global code reordering and loop merging. The loop for forward motion compensation is merged with the loop of backward motion compensation. The result of this optimization is that on a pixel level a decision has to be made whether forward/backward motion compensation needs to be performed. This decision used to be on a block basis. The result for the Reconstruct procedure after combining forward and backward prediction is shown in Figure 6. The corresponding power figures are shown as Opt 2 in Figure 8. The results show that there seems to be no improvement for the Field3 test stream. The reason for this is that there are hardly any forward and backward predicted Macroblocks in this stream (which is exceptional).

Combine motion compensation and idct results. The number of reads from the current picture is still not zero after the previous optimization. The current picture is still read by the Addblock procedure. Addblock reads the result from motion compensation and adds the idct transformed data to it. Thus if we want to reduce the number of reads from the current picture to zero, we must also combine the idct with the motion compensation procedure. The basic principle used is again based on global code reorganization and loop merging. Only here, the loop in the Addblock procedure is merged with the complete reconstruction loop. Thus for each pixel the possible forward component, the possible backward component and/or the possible idct component are added. Although the idea is seemingly straightforward, the implementation is much more complicated. The problem with combining the idct and the motion compensation is that they don't have to traverse the same area in the same way. Remember that motion compensation is performed on fields while the idct can be frame based, as shown in Figure 7. The implemented solution is to find the corresponding block which holds the idct value per motion compensated pixel. The relative results for the power consumption are shown under Opt 3 in Figure 8.
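As an illustration of the merged reconstruction loop, the C sketch below adds the forward, backward and idct contributions per pixel; all names are illustrative, the prediction samples are assumed to have been fetched already (by Recon_comp), and the field/frame addressing subtleties discussed above are left out.

    /* Reconstruct one 16x16 luminance Macroblock after merging forward MC,
       backward MC and the idct contribution (illustrative only).            */
    static void reconstruct_mb(unsigned char cur[16][16],
                               const unsigned char fwd[16][16],
                               const unsigned char bwd[16][16],
                               const short idct_val[16][16],
                               int forward_mc, int backward_mc, int coded)
    {
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                int v = 0;
                if (forward_mc)                v += fwd[y][x];
                if (backward_mc)               v += bwd[y][x];
                if (forward_mc && backward_mc) v = (v + 1) >> 1;    /* average */
                if (coded)                     v += idct_val[y][x];
                cur[y][x] = (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
            }
        }
    }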
Fig. 6. Structure for motion compensation with reduced number of reads/writes into the current frame.

Removal of the Saturate procedure. The Saturate procedure limits all elements to the allowed range for the idct procedure. However, the number of valid entries in the idct procedure is smaller than the number of elements read by the Read_idct_block procedure. When the Saturate procedure is merged with the Read_idct_block procedure, the results for the power consumption are obtained as shown as Opt 4 in Figure 8.

Removal of the Clearblock procedure. The MPEG-2 stream contains as data the idct information of the encoded Macroblocks. One idct block is variable length encoded and run length coded, which means that the most common values require a small amount of bits (variable length coding) and only a non-zero value is transmitted together with the number of zero values before this non-zero value (run length coding). The MPEG-2 decoder program implements this by writing all zeroes into the idct block, which is then followed by filling the block with the non-zero values. This initialization step cannot be skipped, since the idct transform uses all values of an idct block. It is, however, possible to extend the idct_block data structure with valid bits which indicate whether a valid value is present in the corresponding position of the idct block. In this case the array does not have to be reset to zero, only the valid bits. To implement the valid bits, one 64 bit array has to be used for each idct block. If we assume a processor with a 32 bit data-word, then only two data-words are needed per idct block. This decreases the number of writes per idct block from 64 values to only two 32-bit words. The resulting power figures are shown under Opt 5 in Figure 8.
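A minimal C sketch of this valid-bit scheme is given below; the structure layout and names are illustrative and not taken from the decoder source.

    /* An 8x8 idct block extended with two 32-bit words of valid flags.        */
    typedef struct {
        short        coef[64];   /* only positions with a set valid bit hold data */
        unsigned int valid[2];   /* bit i set  =>  coef[i] was received           */
    } idct_block_t;

    static void clear_block(idct_block_t *b)
    {
        b->valid[0] = 0;          /* 2 writes instead of 64 */
        b->valid[1] = 0;
    }

    static void store_coef(idct_block_t *b, int pos, short val)
    {
        b->coef[pos] = val;
        b->valid[pos >> 5] |= 1u << (pos & 31);
    }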

Removal of the Sumblock procedure. Examination of Figure 5 shows that for SNR-scalable streams the Sumblock procedure is used to add the idct blocks of the enhancement layer to the normal layer, after which normal processing is performed. This requires, however, 2 data structures of (12 x 64) 16-bit values each. It is possible to load the normal layer idct values and then to add the received enhancement layer idct values to the normal layer. Thus instead of storing both layers' idct values in two data structures, the values of the second layer are immediately added to the lower layer. Again, global loop merging is used to merge the loop of the Addblock procedure with the loop of the Read_idct_block procedure. The final power consumption is shown as Opt 6 in Figure 8.

Memory hierarchy, allocation and assignment

The previous optimizations have changed the algorithm to reduce the number of transfers and to enable this optimization. Small data structures were introduced which can now be placed in a small on-chip buffer. However, this modification will also improve the performance of an embedded system with only a data-cache, since the buffers hold frequently referenced data and these are very good caching candidates. Especially if the operation of the software controlled cache is known, it can be prevented that data which is only read once replaces these small and frequently referenced data. Although the total number of transfers is not reduced, the power consumption can be significantly lowered, because the number of layer 2 transfers is drastically reduced by inserting the layer 1 buffers. There are three data structures which could be moved from layer 2 memory to layer 1 memory. These data structures are:
- idct_block data structure


More information

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Video Transcoding Architectures and Techniques: An Overview IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Outline Background & Introduction Bit-rate Reduction Spatial Resolution

More information

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson zhuyongxin@sjtu.edu.cn Basic Video Compression Techniques Chapter 10 10.1 Introduction to Video Compression

More information

Digital Video Processing

Digital Video Processing Video signal is basically any sequence of time varying images. In a digital video, the picture information is digitized both spatially and temporally and the resultant pixel intensities are quantized.

More information

In the name of Allah. the compassionate, the merciful

In the name of Allah. the compassionate, the merciful In the name of Allah the compassionate, the merciful Digital Video Systems S. Kasaei Room: CE 315 Department of Computer Engineering Sharif University of Technology E-Mail: skasaei@sharif.edu Webpage:

More information

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Upcoming Video Standards Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc. Outline Brief history of Video Coding standards Scalable Video Coding (SVC) standard Multiview Video Coding

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Cross Layer Protocol Design

Cross Layer Protocol Design Cross Layer Protocol Design Radio Communication III The layered world of protocols Video Compression for Mobile Communication » Image formats» Pixel representation Overview» Still image compression Introduction»

More information

An Infrastructural IP for Interactive MPEG-4 SoC Functional Verification

An Infrastructural IP for Interactive MPEG-4 SoC Functional Verification ITB J. ICT Vol. 3, No. 1, 2009, 51-66 51 An Infrastructural IP for Interactive MPEG-4 SoC Functional Verification 1 Trio Adiono, 2 Hans G. Kerkhoff & 3 Hiroaki Kunieda 1 Institut Teknologi Bandung, Bandung,

More information

Audio-coding standards

Audio-coding standards Audio-coding standards The goal is to provide CD-quality audio over telecommunications networks. Almost all CD audio coders are based on the so-called psychoacoustic model of the human auditory system.

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Data Storage Exploration and Bandwidth Analysis for Distributed MPEG-4 Decoding

Data Storage Exploration and Bandwidth Analysis for Distributed MPEG-4 Decoding Data Storage Exploration and Bandwidth Analysis for Distributed MPEG-4 oding Milan Pastrnak, Peter H. N. de With, Senior Member, IEEE Abstract The low bit-rate profiles of the MPEG-4 standard enable video-streaming

More information

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM

PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM PERFORMANCE ANALYSIS OF AN H.263 VIDEO ENCODER FOR VIRAM Thinh PQ Nguyen, Avideh Zakhor, and Kathy Yelick * Department of Electrical Engineering and Computer Sciences University of California at Berkeley,

More information

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework System Modeling and Implementation of MPEG-4 Encoder under Fine-Granular-Scalability Framework Literature Survey Embedded Software Systems Prof. B. L. Evans by Wei Li and Zhenxun Xiao March 25, 2002 Abstract

More information

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING Dieison Silveira, Guilherme Povala,

More information

Lecture 5: Error Resilience & Scalability

Lecture 5: Error Resilience & Scalability Lecture 5: Error Resilience & Scalability Dr Reji Mathew A/Prof. Jian Zhang NICTA & CSE UNSW COMP9519 Multimedia Systems S 010 jzhang@cse.unsw.edu.au Outline Error Resilience Scalability Including slides

More information

Compressed-Domain Video Processing and Transcoding

Compressed-Domain Video Processing and Transcoding Compressed-Domain Video Processing and Transcoding Susie Wee, John Apostolopoulos Mobile & Media Systems Lab HP Labs Stanford EE392J Lecture 2006 Hewlett-Packard Development Company, L.P. The information

More information

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota

More information

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

International Journal of Emerging Technology and Advanced Engineering Website:   (ISSN , Volume 2, Issue 4, April 2012) A Technical Analysis Towards Digital Video Compression Rutika Joshi 1, Rajesh Rai 2, Rajesh Nema 3 1 Student, Electronics and Communication Department, NIIST College, Bhopal, 2,3 Prof., Electronics and

More information

Performance Tuning on the Blackfin Processor

Performance Tuning on the Blackfin Processor 1 Performance Tuning on the Blackfin Processor Outline Introduction Building a Framework Memory Considerations Benchmarks Managing Shared Resources Interrupt Management An Example Summary 2 Introduction

More information

Design guidelines for embedded real time face detection application

Design guidelines for embedded real time face detection application Design guidelines for embedded real time face detection application White paper for Embedded Vision Alliance By Eldad Melamed Much like the human visual system, embedded computer vision systems perform

More information

A real-time SNR scalable transcoder for MPEG-2 video streams

A real-time SNR scalable transcoder for MPEG-2 video streams EINDHOVEN UNIVERSITY OF TECHNOLOGY Department of Mathematics and Computer Science A real-time SNR scalable transcoder for MPEG-2 video streams by Mohammad Al-khrayshah Supervisors: Prof. J.J. Lukkien Eindhoven

More information

TKT-2431 SoC design. Introduction to exercises. SoC design / September 10

TKT-2431 SoC design. Introduction to exercises. SoC design / September 10 TKT-2431 SoC design Introduction to exercises Assistants: Exercises and the project work Juha Arvio juha.arvio@tut.fi, Otto Esko otto.esko@tut.fi In the project work, a simplified H.263 video encoder is

More information

DigiPoints Volume 1. Student Workbook. Module 8 Digital Compression

DigiPoints Volume 1. Student Workbook. Module 8 Digital Compression Digital Compression Page 8.1 DigiPoints Volume 1 Module 8 Digital Compression Summary This module describes the techniques by which digital signals are compressed in order to make it possible to carry

More information

LECTURE VIII: BASIC VIDEO COMPRESSION TECHNIQUE DR. OUIEM BCHIR

LECTURE VIII: BASIC VIDEO COMPRESSION TECHNIQUE DR. OUIEM BCHIR 1 LECTURE VIII: BASIC VIDEO COMPRESSION TECHNIQUE DR. OUIEM BCHIR 2 VIDEO COMPRESSION A video consists of a time-ordered sequence of frames, i.e., images. Trivial solution to video compression Predictive

More information

A Video CoDec Based on the TMS320C6X DSP José Brito, Leonel Sousa EST IPCB / INESC Av. Do Empresário Castelo Branco Portugal

A Video CoDec Based on the TMS320C6X DSP José Brito, Leonel Sousa EST IPCB / INESC Av. Do Empresário Castelo Branco Portugal A Video CoDec Based on the TMS320C6X DSP José Brito, Leonel Sousa EST IPCB / INESC Av. Do Empresário Castelo Branco Portugal jbrito@est.ipcb.pt IST / INESC Rua Alves Redol, Nº 9 1000 029 Lisboa Portugal

More information

VIDEO COMPRESSION STANDARDS

VIDEO COMPRESSION STANDARDS VIDEO COMPRESSION STANDARDS Family of standards: the evolution of the coding model state of the art (and implementation technology support): H.261: videoconference x64 (1988) MPEG-1: CD storage (up to

More information

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 ISSCC 26 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1 22.1 A 125µW, Fully Scalable MPEG-2 and H.264/AVC Video Decoder for Mobile Applications Tsu-Ming Liu 1, Ting-An Lin 2, Sheng-Zen Wang 2, Wen-Ping Lee

More information

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Project Title: Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding. Midterm Report CS 584 Multimedia Communications Submitted by: Syed Jawwad Bukhari 2004-03-0028 About

More information

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2 ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2 9.2 A 80/20MHz 160mW Multimedia Processor integrated with Embedded DRAM MPEG-4 Accelerator and 3D Rendering Engine for Mobile Applications

More information

FPGA based High Performance CAVLC Implementation for H.264 Video Coding

FPGA based High Performance CAVLC Implementation for H.264 Video Coding FPGA based High Performance CAVLC Implementation for H.264 Video Coding Arun Kumar Pradhan Trident Academy of Technology Bhubaneswar,India Lalit Kumar Kanoje Trident Academy of Technology Bhubaneswar,India

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

Image Compression for Mobile Devices using Prediction and Direct Coding Approach

Image Compression for Mobile Devices using Prediction and Direct Coding Approach Image Compression for Mobile Devices using Prediction and Direct Coding Approach Joshua Rajah Devadason M.E. scholar, CIT Coimbatore, India Mr. T. Ramraj Assistant Professor, CIT Coimbatore, India Abstract

More information

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform

Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block Transform Circuits and Systems, 2010, 1, 12-17 doi:10.4236/cs.2010.11003 Published Online July 2010 (http://www.scirp.org/journal/cs) Fast Implementation of VC-1 with Modified Motion Estimation and Adaptive Block

More information

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM

EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM EFFICIENT DEISGN OF LOW AREA BASED H.264 COMPRESSOR AND DECOMPRESSOR WITH H.264 INTEGER TRANSFORM 1 KALIKI SRI HARSHA REDDY, 2 R.SARAVANAN 1 M.Tech VLSI Design, SASTRA University, Thanjavur, Tamilnadu,

More information

Audio-coding standards

Audio-coding standards Audio-coding standards The goal is to provide CD-quality audio over telecommunications networks. Almost all CD audio coders are based on the so-called psychoacoustic model of the human auditory system.

More information

Design of Low Power Wide Gates used in Register File and Tag Comparator

Design of Low Power Wide Gates used in Register File and Tag Comparator www..org 1 Design of Low Power Wide Gates used in Register File and Tag Comparator Isac Daimary 1, Mohammed Aneesh 2 1,2 Department of Electronics Engineering, Pondicherry University Pondicherry, 605014,

More information

Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review

Chapter 7 Large and Fast: Exploiting Memory Hierarchy. Memory Hierarchy. Locality. Memories: Review Memories: Review Chapter 7 Large and Fast: Exploiting Hierarchy DRAM (Dynamic Random Access ): value is stored as a charge on capacitor that must be periodically refreshed, which is why it is called dynamic

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

TKT-2431 SoC design. Introduction to exercises

TKT-2431 SoC design. Introduction to exercises TKT-2431 SoC design Introduction to exercises Assistants: Exercises Jussi Raasakka jussi.raasakka@tut.fi Otto Esko otto.esko@tut.fi In the project work, a simplified H.263 video encoder is implemented

More information

7.5 Dictionary-based Coding

7.5 Dictionary-based Coding 7.5 Dictionary-based Coding LZW uses fixed-length code words to represent variable-length strings of symbols/characters that commonly occur together, e.g., words in English text LZW encoder and decoder

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Digital video coding systems MPEG-1/2 Video

Digital video coding systems MPEG-1/2 Video Digital video coding systems MPEG-1/2 Video Introduction What is MPEG? Moving Picture Experts Group Standard body for delivery of video and audio. Part of ISO/IEC/JTC1/SC29/WG11 150 companies & research

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER

CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER 84 CHAPTER 3 ASYNCHRONOUS PIPELINE CONTROLLER 3.1 INTRODUCTION The introduction of several new asynchronous designs which provides high throughput and low latency is the significance of this chapter. The

More information

ESE532 Spring University of Pennsylvania Department of Electrical and System Engineering System-on-a-Chip Architecture

ESE532 Spring University of Pennsylvania Department of Electrical and System Engineering System-on-a-Chip Architecture University of Pennsylvania Department of Electrical and System Engineering System-on-a-Chip Architecture ESE532, Spring 2017 HW2: Profiling Wednesday, January 18 Due: Friday, January 27, 5:00pm In this

More information

Advanced optimizations of cache performance ( 2.2)

Advanced optimizations of cache performance ( 2.2) Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

The Basics of Video Compression

The Basics of Video Compression The Basics of Video Compression Marko Slyz February 18, 2003 (Sourcecoders talk) 1/18 Outline 1. Non-technical Survey of Video Compressors 2. Basic Description of MPEG 1 3. Discussion of Other Compressors

More information

Zonal MPEG-2. Cheng-Hsiung Hsieh *, Chen-Wei Fu and Wei-Lung Hung

Zonal MPEG-2. Cheng-Hsiung Hsieh *, Chen-Wei Fu and Wei-Lung Hung International Journal of Applied Science and Engineering 2007. 5, 2: 151-158 Zonal MPEG-2 Cheng-Hsiung Hsieh *, Chen-Wei Fu and Wei-Lung Hung Department of Computer Science and Information Engineering

More information

Lecture 5: Compression I. This Week s Schedule

Lecture 5: Compression I. This Week s Schedule Lecture 5: Compression I Reading: book chapter 6, section 3 &5 chapter 7, section 1, 2, 3, 4, 8 Today: This Week s Schedule The concept behind compression Rate distortion theory Image compression via DCT

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

Ch. 4: Video Compression Multimedia Systems

Ch. 4: Video Compression Multimedia Systems Ch. 4: Video Compression Multimedia Systems Prof. Ben Lee (modified by Prof. Nguyen) Oregon State University School of Electrical Engineering and Computer Science 1 Outline Introduction MPEG Overview MPEG

More information

Advanced Encoding Features of the Sencore TXS Transcoder

Advanced Encoding Features of the Sencore TXS Transcoder Advanced Encoding Features of the Sencore TXS Transcoder White Paper November 2011 Page 1 (11) www.sencore.com 1.605.978.4600 Revision 1.0 Document Revision History Date Version Description Author 11/7/2011

More information

Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB

Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB Redundant Data Elimination for Image Compression and Internet Transmission using MATLAB R. Challoo, I.P. Thota, and L. Challoo Texas A&M University-Kingsville Kingsville, Texas 78363-8202, U.S.A. ABSTRACT

More information

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Subash Chandar G (g-chandar1@ti.com), Vaideeswaran S (vaidee@ti.com) DSP Design, Texas Instruments India

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

Module 10 MULTIMEDIA SYNCHRONIZATION

Module 10 MULTIMEDIA SYNCHRONIZATION Module 10 MULTIMEDIA SYNCHRONIZATION Lesson 36 Packet architectures and audio-video interleaving Instructional objectives At the end of this lesson, the students should be able to: 1. Show the packet architecture

More information

The Implement of MPEG-4 Video Encoding Based on NiosII Embedded Platform

The Implement of MPEG-4 Video Encoding Based on NiosII Embedded Platform The Implement of MPEG-4 Video Encoding Based on NiosII Embedded Platform Fugang Duan School of Optical-Electrical and Computer Engineering, USST Shanghai, China E-mail: dfgvvvdfgvvv@126.com Zhan Shi School

More information

Video Coding Standards: H.261, H.263 and H.26L

Video Coding Standards: H.261, H.263 and H.26L 5 Video Coding Standards: H.261, H.263 and H.26L Video Codec Design Iain E. G. Richardson Copyright q 2002 John Wiley & Sons, Ltd ISBNs: 0-471-48553-5 (Hardback); 0-470-84783-2 (Electronic) 5.1 INTRODUCTION

More information

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

Scalable Video Coding

Scalable Video Coding Introduction to Multimedia Computing Scalable Video Coding 1 Topics Video On Demand Requirements Video Transcoding Scalable Video Coding Spatial Scalability Temporal Scalability Signal to Noise Scalability

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Design of 2-D DWT VLSI Architecture for Image Processing

Design of 2-D DWT VLSI Architecture for Image Processing Design of 2-D DWT VLSI Architecture for Image Processing Betsy Jose 1 1 ME VLSI Design student Sri Ramakrishna Engineering College, Coimbatore B. Sathish Kumar 2 2 Assistant Professor, ECE Sri Ramakrishna

More information

Lecture 7, Video Coding, Motion Compensation Accuracy

Lecture 7, Video Coding, Motion Compensation Accuracy Lecture 7, Video Coding, Motion Compensation Accuracy Last time we saw several methods to obtain a good motion estimation, with reduced complexity (efficient search), and with the possibility of sub-pixel

More information

Low-Power Data Address Bus Encoding Method

Low-Power Data Address Bus Encoding Method Low-Power Data Address Bus Encoding Method Tsung-Hsi Weng, Wei-Hao Chiao, Jean Jyh-Jiun Shann, Chung-Ping Chung, and Jimmy Lu Dept. of Computer Science and Information Engineering, National Chao Tung University,

More information

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner Sandbridge Technologies, 1 North Lexington Avenue, White Plains, NY 10601 sjinturkar@sandbridgetech.com

More information

COSC 6385 Computer Architecture - Memory Hierarchies (II)

COSC 6385 Computer Architecture - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Edgar Gabriel Spring 2018 Types of cache misses Compulsory Misses: first access to a block cannot be in the cache (cold start misses) Capacity

More information

Advanced Video Coding: The new H.264 video compression standard

Advanced Video Coding: The new H.264 video compression standard Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission

More information

DUE to the high computational complexity and real-time

DUE to the high computational complexity and real-time IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 15, NO. 3, MARCH 2005 445 A Memory-Efficient Realization of Cyclic Convolution and Its Application to Discrete Cosine Transform Hun-Chen

More information

FPGA Implementation of 2-D DCT Architecture for JPEG Image Compression

FPGA Implementation of 2-D DCT Architecture for JPEG Image Compression FPGA Implementation of 2-D DCT Architecture for JPEG Image Compression Prashant Chaturvedi 1, Tarun Verma 2, Rita Jain 3 1 Department of Electronics & Communication Engineering Lakshmi Narayan College

More information