Adaptive Power Management of On-Chip Video Memory for Multiview Video Coding


Muhammad Shafique 1, Bruno Zatt 1,2, Fabio Leandro Walter 2, Sergio Bampi 2, Jörg Henkel 1
1 Karlsruhe Institute of Technology (KIT), Chair for Embedded Systems, Karlsruhe, Germany
2 Federal University of Rio Grande do Sul (UFRGS), Informatics Institute/PGMICRO, Porto Alegre, Brazil
{muhammad.shafique, bruno.zatt, henkel}@kit.edu; {bzatt, bampi}@inf.ufrgs.br

Abstract
An adaptive power management scheme for on-chip video memory in Multiview Video Coding is presented. It leverages the texture, motion, and disparity properties of objects and their correlations in the 3D-neighborhood. It groups different Macroblocks of a frame and predicts the highly-probable motion/disparity search direction in order to power-gate idle memory regions. The statistical properties of Macroblock groups are exploited to predict idle sectors. Our approach achieves on average 32% and 61% energy reduction (averaged over various video sequences) compared to state-of-the-art DSW [7] and Level C [12], respectively. The Motion/Disparity Estimation architecture with video memory and power management scheme is implemented using an ASIC flow (IBM 65nm Low-Power technology) and processes 4-view HD1080p@33fps.

Categories and Subject Descriptors: C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; B.3.2 [Design Styles]: Cache Memories; I.4.2 [Compression (Coding)]: Approximate Methods
General Terms: Algorithms, Design, Management
Keywords: MVC, Video Coding, Motion Estimation, Disparity Estimation, Low-Power, Power-Management, On-Chip Memory, Video Memory, Adaptivity, Power-Gating

I. INTRODUCTION AND RELATED WORK
The Multiview Video Coding (MVC) standard [2] compresses multiview video sequences (captured using multiple cameras) to realize emerging 3D-multimedia applications (like 3D-video recording/playback) on mobile devices [3][4].
MVC provides 20%-50% improved compression compared to simulcast H.264 (i.e. independent encoding of each view) by employing multiple block-sized Motion and Disparity Estimation (ME, DE), which exploit temporal and inter-view correlations at the cost of significantly higher complexity and energy consumption [3]. Typically, ME/DE accounts for more than 90% of the total MVC energy consumption, of which the major part is consumed by the (on-chip and off-chip) memory [7]. Therefore, memory is the key focus for energy reduction in ME and DE in order to implement MVC on battery-powered devices. The high memory energy consumption is primarily due to the frequent accesses to reference pixel data used in SAD (Sum of Absolute Differences) computations during the block matching process [7]. ME and DE search for the best match of a Macroblock (MB, 16x16 pixel block) in different search directions (i.e. neighboring reference frames in the left, right, top, and down directions). For a given search direction, the search is performed in a predefined search window, such that the reference pixels in a search window can be used for multiple SAD computations (see Section S1 for an ME/DE overview). The search direction that contains the best match of an MB is denoted as the best search direction.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2012, June 3-7, 2012, San Francisco, California, USA. Copyright 2012 ACM.

State-of-the-art techniques employ an on-chip memory (footnote 1) to incorporate search window prefetching and data reuse [12]-[15], a search window follower [11], or asymmetric search windows [16] for reducing the off-chip memory power. These on-chip memories suffer from non-negligible leakage power due to their large footprint (2 Mbit of memory is required for a search range of ±128 and 4 search directions). Furthermore, not all parts of the search window stored in the on-chip memory are accessed, because of the adaptive nature of fast ME/DE algorithms (like TZ Search [5]) and the diverse texture/motion properties of MBs (see the memory usage analysis in Section II.A). To address this issue, the work of [14] employs adaptive window sizing. However, this work targets a fixed Four-Step Search and does not account for DE or for leakage power, which is a crucial power component.

To reduce the leakage power, advanced power-gating techniques switch off parts of the on-chip memory using sleep transistors with multiple sleep modes [17][18][27]. Some sleep modes are data-retentive, i.e. data loss in the memory is avoided while providing relatively little leakage savings. For power management, state-of-the-art techniques employ prediction based on either hardware monitoring [24] or limited application knowledge at frame level []. As a result, these techniques are power-inefficient due to severe mispredictions under high variations of memory usage. The work in [29][30] illustrates the feasibility of application-aware power management for power-gating idle ASIP cores in a multimedia pipelined processor. The work in [7][8][9] presents an MVC ME/DE architecture with an on-chip video memory and a dynamic search window formation algorithm. Its power-gating scheme evaluates the predicted memory usage of consecutive MBs to make a power-gating decision for the idle memory sectors, but does not employ methods (like computation reordering) to increase sleep durations.
The above-discussed techniques provide limited leakage savings as they do not exploit (i) the relationship between MB properties (texture, motion, and disparity) and the distribution of ME and DE as the best search direction, and (ii) the correlation of the best search direction and memory usage in the 3D-neighborhood (i.e. spatial, temporal, and view domains). These multiview video content characteristics may provide a higher potential for leakage savings.

Summarizing: in order to realize ME/DE of real-time Full-HD MVC with low power consumption, an adaptive power management scheme for on-chip memory is required that leverages the multiview video content characteristics at various levels (search direction, frame, MB, etc.) to predict the memory requirements (number and duration of idle memory sectors) and to power-gate the idle sectors in an appropriate sleep mode. Before proceeding to our novel contribution, we present an analysis of the best search direction and memory usage during ME/DE, which provides the motivation and foundation for this work.

Footnote 1: An on-chip memory in this paper denotes an on-chip video memory.


II. MOTIVATION AND NOVEL CONTRIBUTION
A. Motivational Analysis of Motion and Disparity Estimation
Our experiments on the Rena test video sequence in Fig. 1 and Fig. 2 illustrate the distribution of ME and DE as the best search direction (see Section S2 for a detailed analysis of other test video sequences). Fig. 1 shows that the majority of the Macroblocks (MBs; 70%-90% of the cases) are encoded using ME as the best search direction. Note that in case of the first view V0, all MBs are encoded using ME, because no neighboring views are available for prediction. Similarly, in case of the other views (V1-V2), only DE is performed for the first frame of each GOP (Group of Pictures; i.e. T0 and T8 in Fig. 2).

A detailed analysis of various frames in view V1 at QP=27 (Fig. 2) shows that background objects (low-texture, low-motion, static blocks) are mostly encoded using ME. In contrast, foreground objects (medium-to-high texture, high motion) are encoded using DE. Moreover, some background objects with medium-to-high texture may also be coded using DE (see the curtains in Fig. 2). The decision between ME and DE also depends upon the available correlation in the temporal domain. For instance, the number of DE-coded MBs is higher in some frames than in others (Fig. 2).

Hint-1: If the best search direction (ME or DE) can be correctly predicted, significant energy savings can be obtained by avoiding ME/DE over unused search directions and by power-gating the sectors that store the search windows for these unused search directions. This also reduces the amount of external memory transfers and computations. The key is to use the texture and motion/disparity properties of the MBs in the 3D-neighborhood for a correct prediction of the best search direction.

Fig. 1 ME and DE distribution for three views (V0, V1, V2) of the Rena and Ballroom test multiview video sequences (y-axis: mode distribution [%]; x-axis: QP)
Fig. 2 ME/DE distribution in view V1 of the Rena sequence (frames T0, T3, T6, T7, T8)

In state-of-the-art schemes [7]-[16], ME/DE of MBs is processed in raster scan order. However, our experiments (Fig. 2, see also Section S2) illustrate that objects often consist of MBs that do not lie on the raster scan order. Therefore, these schemes suffer from severe variations in memory usage, as MBs of different objects typically exhibit diverse memory requirements for ME/DE. This leads to reduced sleep durations and frequent wakeups of memory sectors, and thus to low leakage savings. Our analysis in Sections IV.A and S2 shows that different MBs sharing similar texture and motion/disparity properties have similar memory requirements.

Hint-2: Longer sleep durations (and thus higher leakage savings) can be achieved if the ME/DE processing of MBs with similar properties is performed together, i.e. in a non-raster scan order. The key challenges are MB grouping and the reordering of ME/DE computations.

Summarizing the analysis, the key research challenges for reducing the power of the on-chip video memory of ME/DE are:
a) Grouping the MBs of a frame w.r.t. their texture, motion, and disparity properties,
b) Adaptively predicting the best search direction for MBs in different groups to power-gate the on-chip memory sectors of unused search directions,
c) Reordering the ME/DE processing computations to increase the sleep durations of on-chip memory sectors,
d) Leveraging the multiview video content characteristics to enable a content-driven power management at various granularities (group-level, MB-level).

B. Overview of Our Concept and Novel Contributions
To address these challenges, a novel adaptive power management scheme is proposed for on-chip video memory that incorporates:
1. An MB-Group Formation Scheme (Section IV.A) that performs texture and activity (i.e. motion and disparity) classification for MBs considering the correlated neighboring MBs in their 3D-neighborhood (i.e. spatial, temporal, view domains).
This classification is used to form groups of MBs that share similar texture and activity properties.
2. An Adaptive ME/DE Search Direction Prediction Algorithm (Section IV.A) that adaptively predicts the highly-probable ME/DE search direction for MBs in different groups based on the best search direction correlation in the 3D-neighborhood and their respective texture differences. For each MB in a group, our scheme power-gates the memory sectors of the unused search direction.
3. A Content-Driven Power Management Scheme (Section IV.B) that leverages the multiview video content characteristics to manage the power at multiple levels (i.e. search direction, MB-group, MB).

III. MEMORY AND POWER MODELS
We now describe the model of our multibank on-chip video memory [7][8] and the power-gate model of [17] that is used in this work to enable power-gating of the multibank memory at a fine granularity.

Memory Model: The on-chip memory consists of $N_B$ banks. Each bank $B_k$, $k = 1 \ldots N_B$, contains $N_S$ sectors, each having $S_L$ 128-bit memory lines (see Fig. 5 for an abstract view). The size of a sector is given as $S = S_L \times 128$ bits. In order to provide parallel data access for the SAD computing hardware accelerators, different rows of an MB are stored in different banks. The leakage energy is given as $E_{Leak} = T_{MEDE} \times P_{Leak}$, where $T_{MEDE}$ denotes the time for processing motion and disparity estimation. The miss energy is given as $E_{Miss} = \sum_{i=1}^{N_{Miss}} E_{Miss_i}$, where $N_{Miss}$ is the number of misses. Such a memory model can also be realized with multiple SRAM blocks, each having multiple sub-arrays [27], or by considering the SRAM model of [26].

Power-Gate Model: We assume a power-gate model with three power modes: $P_{ON}$, $P_{DR}$, and $P_{OFF}$. $P_{ON}$ is the Power-ON mode. $P_{DR}$ is the Data-Retentive (DR) low-leakage mode that preserves the data in the SRAM cells. $P_{OFF}$ is the Power-OFF mode with data loss; it requires re-fetching of data from the external memory. Fig. 3 shows the power state machine with leakage energy savings and wakeup latency/energy overhead [17]. The wakeup latency of $P_{DR}$ is quite short compared to that of $P_{OFF}$; therefore, $P_{DR}$ is beneficial for short sleep durations (see values in Table I; Section S3). In contrast, $P_{OFF}$ is beneficial for long sleep durations. Multiple sleep modes offer different tradeoffs between wakeup overhead and leakage savings. Since different collocated sectors in different banks store the data from the same MB, the same sleep control is issued to these sectors. Power-gating at sector level enables fine-grained power management control. A similar style of power-gating can be found in sub-array-level power-gating [27] or, at an even finer granularity, in wordline-level power-gating [28].

Fig. 3 Power state machine with multiple sleep modes [17] (states $P_{ON}$, $P_{DR}$, $P_{OFF}$; annotations include $E_{ON} = \sum V_{dd} \cdot i_i \cdot t_i$ and $E_{wakeup} = \frac{1}{2} C_{circuit} V_{dd}^2$)

Fig. 4 Rena test video sequence encoded at QP=22: (a) Distribution of ME and DE; (b) Macroblock grouping with computation reordering; (c) Distribution of memory usage for ME; (d) Memory requirement prediction using PDFs for ON, OFF, and DR mode.

IV. OUR ADAPTIVE POWER MANAGEMENT OF ON-CHIP VIDEO MEMORY FOR ME/DE IN MVC
Fig. 5 shows an overview of our adaptive power management scheme (novel contribution in green boxes) for an on-chip multibank video memory integrated with an ME/DE architecture.

Fig. 5 ME/DE architecture with an on-chip memory and our power management scheme (novel contribution in green boxes). Blocks: Offline Statistical Analysis of multiview videos (search direction and memory usage analysis; Section II.A); Macroblock Grouping (texture and motion classification; Section IV.A); Adaptive ME/DE Search Direction Prediction (Section IV.A); MB-Group Memory Usage Prediction (Section IV.B); Content-Driven Power Management (MB-group level, MB level; Section IV.B); Monitoring (memory usage, etc.); SAD accelerators; core processor executing an ME/DE algorithm; memory banks 1 to n with sleep transistors (ST) per sector.

Our scheme works in five phases:
i) Macroblock (MB) Grouping: First, the texture and activity classification of MBs is performed, and MBs with similar texture and activity properties are grouped together,
ii) Predicting the highly-probable best search direction based on the correlation in the 3D-neighborhood,
iii) Predicting the memory usage of MB-groups using a statistical analysis of the memory usage of different groups and the memory usage correlation of the same groups in the 3D-neighborhood,
iv) Power-gating the unused memory sectors in appropriate sleep modes based on the predicted memory requirements,
v) Computation reordering and fine-tuning of the power modes of different sectors at MB level: Since all MBs in a given group exhibit similar memory requirements for ME and DE, the ME/DE processing computations of MBs are reordered in order to increase the sleep durations of on-chip memory sectors. Computation reordering is performed within a group, where the next MB for ME/DE processing is selected by evaluating its texture difference w.r.t. the currently processed MB.

A. Macroblock Grouping and Search Direction Prediction
As discussed in Section II.A, in a conventional raster scan coding order, ME/DE is performed for all MBs in a row-wise fashion. Each row typically contains MBs from different objects, and each object typically spans many MBs both horizontally and vertically (see the Rena dancing picture in Fig. 2 and Fig. 4). Since MBs from different objects exhibit distinct memory usage properties, this results in memory usage variations that lead to short sleep durations and frequent ON/OFF switching of the unused memory sectors. To avoid this, our scheme aggregates different MBs that share similar texture and activity (i.e. motion and disparity) properties in so-called MB-groups (see an example in Fig. 4a). Fig. 6 shows the algorithm for MB grouping.
The input is the frame $F_{(T,V)}$, where T denotes the temporal location of the frame in view V. Other inputs are the variance of the MB ($\sigma$, Eq. 2) as a lightweight texture approximation and the texture difference ($\xi$, Eq. 3) w.r.t. the neighboring MBs in the 3D-neighborhood (i.e. spatial, temporal, and view domains). There are 4 spatial, 18 temporal, and 18 view neighboring MBs (see Fig. 15 in Section S1). First, the texture of the current MB is classified as low-texture (L), medium-texture (M), or high-texture (H); line 5. Afterwards, the matching neighbors (i.e. MBs in the 3D-neighborhood having texture properties similar to those of the current MB) are found (lines 6-7). Since MBs with similar texture properties most probably belong to the same object, these MBs share the motion/disparity properties, i.e. the so-called activity. Therefore, the activity of the current MB is predicted as low-motion (L), medium-motion (M), or high-motion (H) from the average activity of the matching neighbors (lines 8-9). Based on the texture and activity classification, an MB is assigned to a group, such that all the MBs in that group exhibit similar texture and activity properties (lines 10-11). The output is the composition of all three groups and the set of matching neighbors (line 13).

1.  groupmbs(Input: Frame F(T,V), Variance σ, Texture Difference ξ; Output: MB-Group G, Matching Neighbors N_match)
2.    G ← ∅; N_match ← ∅;
3.    ∀ mb ∈ F(T,V) {
4.      N_MBMatch ← ∅;
5.      T := (σ_MB ≤ τ_σ1) ? L : (σ_MB > τ_σ2) ? H : M;
6.      N ← mb.getneighbors(); // see Fig. 15 in Section S1
7.      ∀ n ∈ N: if (ξ_n < τ_ξ) N_MBMatch ← N_MBMatch ∪ n;
8.      M_MB := Σ_{n ∈ N_MBMatch} (ν_x + ν_y)_n / size(N_MBMatch);
9.      β := (M_MB ≤ τ_m1) ? L : (M_MB > τ_m2) ? H : M;
10.     G_MB := (T = L & β = L) ? G1 : (T = H & β = H) ? G3 : G2;
11.     G.store(mb, G_MB);
12.     N_match.store(mb, N_MBMatch); }
13.   return (G, N_match);
Fig. 6 Pseudo-code for macroblock grouping

The thresholds ($\tau_{\sigma 1}$, $\tau_{\sigma 2}$, $\tau_{\xi}$, $\tau_{m1}$, $\tau_{m2}$) are obtained from a statistical distribution analysis of the texture and activity properties of MBs of numerous background and foreground objects in various test video sequences (like Rena, Ballroom, Vassar, etc.) [1]. The highly-probable values of these thresholds are obtained as $\mu + 3\sigma$ ($\mu$ denotes the mean, $\sigma$ the standard deviation) using probability density functions (PDF) that follow a Gaussian distribution (Eq. 1).
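
The grouping of Fig. 6 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Macroblock` container, the helper names, and all threshold values are hypothetical, the texture difference is assumed to be the relative variance difference w.r.t. the current MB, the activity is assumed to use absolute vector components, and an MB without matching neighbors is assumed to have low activity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical thresholds; the paper derives its thresholds offline from training sequences.
TAU_SIGMA1, TAU_SIGMA2 = 100.0, 400.0  # texture (variance) thresholds
TAU_XI = 0.25                          # max. texture difference of a matching neighbor
TAU_M1, TAU_M2 = 2.0, 8.0              # activity thresholds


@dataclass
class Macroblock:
    variance: float                                   # lightweight texture measure
    vector: Tuple[int, int] = (0, 0)                  # motion/disparity vector (vx, vy)
    neighbors: List["Macroblock"] = field(default_factory=list)  # 3D-neighborhood


def classify(value: float, low: float, high: float) -> str:
    """Three-way classification into L(ow), M(edium), or H(igh)."""
    if value <= low:
        return "L"
    return "H" if value > high else "M"


def texture_difference(cur: Macroblock, nb: Macroblock) -> float:
    """Relative variance difference (assumed form of the texture difference)."""
    if cur.variance == 0:
        return 0.0 if nb.variance == 0 else float("inf")
    return abs(cur.variance - nb.variance) / cur.variance


def group_mb(mb: Macroblock) -> Tuple[str, List[Macroblock]]:
    """Return the group (G1, G2, or G3) of mb and its matching neighbors."""
    texture = classify(mb.variance, TAU_SIGMA1, TAU_SIGMA2)
    matching = [n for n in mb.neighbors if texture_difference(mb, n) < TAU_XI]
    if matching:
        total = sum(abs(n.vector[0]) + abs(n.vector[1]) for n in matching)
        avg_activity = total / len(matching)
    else:
        avg_activity = 0.0  # assumption: no matching neighbor means low activity
    activity = classify(avg_activity, TAU_M1, TAU_M2)
    if texture == "L" and activity == "L":
        return "G1", matching
    if texture == "H" and activity == "H":
        return "G3", matching
    return "G2", matching
```

With these assumed thresholds, a low-variance MB surrounded by near-static neighbors of similar variance falls into G1, a high-variance MB whose matching neighbors carry large vectors falls into G3, and everything else falls into G2.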

$$F(\mu_k + 3\sigma_k;\ \mu_k, \sigma_k^2) - F(0;\ \mu_k, \sigma_k^2) \geq 0.99, \quad k = [\text{variance, motion/disparity vectors}] \qquad (1)$$

$$\sigma_{MB} = \frac{1}{256} \sum_{i=1}^{16} \sum_{j=1}^{16} \left( \rho(i,j) - \rho_{avg} \right)^2 \qquad (2)$$

$$\xi_n = \frac{\left| \sigma_{CurrMB} - \sigma_n \right|}{\sigma_{CurrMB}} \qquad (3)$$

MBs in a group share the best prediction direction due to their correlation, as they most probably belong to the same object. Fig. 4 illustrates an example scenario where DE is selected as the best search direction for the MBs of the dancing girl (group G3). In contrast, ME is selected as the best search direction for the MBs of the background curtains (group G1). Therefore, grouping MBs also provides a potential for search direction prediction for the complete group (as we will discuss using Fig. 7). Note that in case of group G2 the decision becomes challenging, because for medium texture with slow-to-medium motion the best match can be found using either ME or DE. Therefore, in case of group G2, our scheme adaptively selects the highly-probable search direction depending upon the 3D-neighborhood.

Adaptive Search Direction Prediction: Fig. 7 shows the algorithm for adaptively predicting the highly-probable best search direction for the three groups (Fig. 6). As discussed in Section II.A and Fig. 4, background/low-textured MBs with low motion (i.e. MBs in group G1) are typically encoded using ME, and MBs with high texture and high motion (i.e. MBs in group G3) are encoded using DE. Therefore, our algorithm predicts ME and DE as the best search directions for groups G1 and G3, respectively (lines 2-3). The decision about the MBs in group G2 is made adaptively by taking into consideration the best search directions of the matching neighboring MBs (lines 4-12). If there is a sufficient number of matching neighbors (for a high confidence of prediction), a prediction is performed considering the texture difference of the matching neighbors (lines 6-9). A cost $cost_{ME}$ is computed by accumulating the inverse of the texture differences of all neighbors with ME as the best search direction (line 8).
Similarly, $cost_{DE}$ is computed (line 9). If $cost_{ME}$ is greater than or equal to $cost_{DE}$, ME is predicted as the best search direction; otherwise, DE is selected (lines 10-11). In case of insufficient correlation in the 3D-neighborhood, ME is predicted as the best search direction (line 12). Finally, the best search direction $D_{Best}$ is returned (line 13).

1.  searchdirectionprediction(Input: MB-Group G, Matching Neighbors N_match, Texture Difference ξ; Output: Best Search Directions D_Best)
2.    if (G1) ∀ mb ∈ G1: mb.D_Best := ME;
3.    if (G3) ∀ mb ∈ G3: mb.D_Best := DE;
4.    if (G2) { // adaptively select ME or DE for MBs in group G2
5.      ∀ mb ∈ G2
6.        if (size(mb.N_match) > τ_match) {
7.          ∀ n ∈ mb.N_match
8.            if (n.D_Best = ME) cost_ME := cost_ME + (1/ξ_n);
9.            else cost_DE := cost_DE + (1/ξ_n);
10.         preddir := (cost_ME ≥ cost_DE) ? ME : DE;
11.         mb.D_Best := preddir; }
12.       else mb.D_Best := ME; }
13.   return D_Best;
Fig. 7 Pseudo-code for adaptive search direction prediction

For each MB, only one motion or disparity search in the selected search direction is performed. This leads to significant energy savings by avoiding external memory transfers and excessive computations. Furthermore, the sectors storing the search windows for the unused search directions are power-gated to reduce the leakage, which provides further energy savings (see Fig. 8). Note that Fig. 4a shows a few MBs in group G1 that have DE as the best search direction. However, our scheme predicts ME as the best search direction for the whole group G1, so it might incur some video quality loss. Furthermore, a misprediction may also result in quality loss. Experiments in Sections V and S3 show that this loss is visually imperceptible.

B. Video Content-Driven Power Management
Once the MB-groups are formed, the challenge is to accurately predict the memory usage requirements of an MB-group.
The key is to leverage the multiview video content properties and the offline statistical analysis of the memory usage of different groups.

Step-1: Memory Usage Prediction of MB-Groups: Fig. 4c shows the memory usage of different groups, where the memory usage of G1 is much lower than that of the other groups. Our scheme computes two different highly-probable memory requirement predictions ($M_1$ and $M_2$) from the probability density function (PDF obtained through an offline analysis over various test video sequences; see details in Section S2). The $M_1$ amount of memory is kept in $P_{ON}$ mode, as the probability of using these memory sectors is high. The memory amount $M_2 - M_1$ is kept in $P_{DR}$ mode, as other MBs of the same group may use this data and the wakeup overhead is small enough to avoid delay. Fig. 4d shows an abstract representation of obtaining these predictions. The memory requirements [$M_1$, $M_2$] of an MB-group can also be predicted with high accuracy from the memory usage of the same MB-group in the neighboring frames or even views (see experimental evidence in Section S2). These predicted memory requirement values are then forwarded to the power management scheme to determine the number and mode of gated sectors.

Step-2: MB-Group-Level Power Management: Fig. 8 presents the algorithm of our content-driven power management. First, the MB grouping is performed (line 2). Afterwards, each group is processed sequentially, i.e. ME/DE of the MBs from group G1 is processed first, followed by the MBs from groups G2 and G3, respectively. This constitutes the first reordering of the ME/DE computations, as MBs are now processed in a non-raster scan order (lines 3-28). The second reordering occurs when processing MBs within a group (lines 23-27). First, for each group, the best search direction is predicted and the memory sectors of the unused search directions are power-gated in power state $P_{OFF}$ (lines 4-5), as they will not be used during the complete ME/DE of this MB.
Afterwards, the highly-probable memory usage is predicted from the PDF obtained by the offline statistical analysis (line 6), as also shown in Fig. 4. Based on this predicted memory usage, the number of sectors that are candidates for the different power modes ($P_{ON}$, $P_{OFF}$, $P_{DR}$) is computed (lines 7-8). To cope with potential mispredictions, the correlation of the monitored memory usage of similar MB-groups in the 3D-neighborhood is exploited (line 9). For G1 and G3, the average memory usage of the same group in the temporal neighbors (Frame Left, Frame Right) and in the disparity neighbors (Frame Top, Frame Down) is considered, respectively. For G2, the average over all four neighbors is computed. The candidate sectors for the $P_{ON}$ and $P_{DR}$ power modes are determined considering this correlated memory usage (line 10). The PDF-based and neighborhood-based predicted memory usages are averaged to obtain the number of sectors that are candidates for the $P_{ON}$, $P_{OFF}$, and $P_{DR}$ power modes (lines 11-12).

To amortize the wakeup energy overhead, our scheme predicts the leakage energy benefit of gating sectors in the different power modes. For this, the sleep duration is first predicted as the predicted ME/DE processing time of all the MBs in the group (line 13). The ME/DE processing time of an MB is predicted as the average ME/DE processing time of all its matching neighbors in the 3D-neighborhood. Afterwards, the leakage savings are compared with the wakeup energy overhead, and the sectors are set to their respective power modes (lines 14-17). In case of $P_{OFF}$, $E_{MissGroup}$ is additionally considered, as $P_{OFF}$ results in the loss of data in the memory sectors and requires a re-fetch (line 16).
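
The group-level decision described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the implementation evaluated in the paper: the function names `partition_sectors`, `mode_for_dr_candidates`, and `mode_for_off_candidates` are introduced here only for the example, memory amounts are expressed in bits, sector counts are rounded up, and sectors whose data-retentive savings do not pay off are assumed to stay powered on.

```python
import math
from typing import Tuple


def partition_sectors(total_bits: int, sector_bits: int,
                      m1_pdf: float, m2_pdf: float,
                      m_nbs: float, s_dr_nbs: float) -> Tuple[int, int, int]:
    """Return (S_ON, S_DR, S_OFF) as sector counts.

    The PDF-based prediction [M1, M2] and the neighborhood-based
    prediction are averaged; everything left over is an OFF candidate.
    """
    total = total_bits // sector_bits
    s_on_pdf = math.ceil(m1_pdf / sector_bits)             # M1 stays powered on
    s_dr_pdf = math.ceil((m2_pdf - m1_pdf) / sector_bits)  # M2 - M1 is data-retentive
    s_on_nbs = math.ceil(m_nbs / sector_bits)              # usage observed in neighbors
    s_on = min(total, math.ceil((s_on_pdf + s_on_nbs) / 2))
    s_dr = min(total - s_on, math.ceil((s_dr_pdf + s_dr_nbs) / 2))
    return s_on, s_dr, total - s_on - s_dr


def mode_for_dr_candidates(sleep_s: float, p_leak_saved_w: float,
                           e_wakeup_j: float) -> str:
    """Data-retentive mode pays off only if the saved leakage exceeds the wakeup energy."""
    return "P_DR" if sleep_s * p_leak_saved_w > e_wakeup_j else "P_ON"


def mode_for_off_candidates(sleep_s: float, p_leak_saved_w: float,
                            e_wakeup_j: float, e_miss_group_j: float) -> str:
    """Power-off loses data, so the re-fetch (miss) energy is added to the overhead."""
    if sleep_s * p_leak_saved_w > e_wakeup_j + e_miss_group_j:
        return "P_OFF"
    return "P_DR"  # fall back to the data-retentive mode
```

For example, with a hypothetical 64-sector memory of 4096-bit sectors, a PDF-based prediction of 8 to 16 sectors and a neighborhood-based prediction of 12 ON and 4 DR sectors, `partition_sectors(64 * 4096, 4096, 8 * 4096, 16 * 4096, 12 * 4096, 4)` yields 10 ON, 6 DR, and 48 OFF-candidate sectors. The longer the predicted sleep duration of a group, the more likely the OFF candidates amortize the wakeup and re-fetch overhead.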

1. ContentDrivenPM(Input: Frame F(T,V))
2. (G, N_match) ← groupmbs(F(T,V), σ, ξ); // Fig.
3. ∀ g ∈ G {
4.   D_Best ← searchdirectionprediction(g, N_match, ξ); // Fig.
5.   if (G1 or G3) PowerGate(D \ D_Best, P_OFF, S(D \ D_Best));
6.   [M1, M2] ← MemUsagePDF(g); // Fig.
7.   S_OFF(PDF) := (S − M2) / S_Sector; S_DR(PDF) := (M2 − M1) / S_Sector;
8.   S_ON(PDF) := S / S_Sector − (S_OFF + S_DR);
9.   M_Nbs ← AVG_{d ∈ N, N=[Left,Right,Top,Down]} d.group(g).getmemusage();
10.  S_ON(Nbs) := M_Nbs / S_Sector; S_DR(Nbs) := AVG_{mb ∈ g} mb.S_DR;
11.  S_ON := (S_ON(Nbs) + S_ON(PDF)) / 2; S_DR := (S_DR(Nbs) + S_DR(PDF)) / 2;
12.  S_OFF := (S − S_ON − S_DR);
13.  ∀ mb ∈ g: mb.t_pred := AVG_{n ∈ N_match} n.t_MEDE;
14.  if ((Σ_{mb ∈ g} mb.t_pred × P_Leak(S_DR)) > E_wakeup(DR→ON))
15.    PowerGate(g, P_DR, S_DR);
16.  if ((Σ_{mb ∈ g} mb.t_pred × P_Leak(OFF)) > (E_wakeup(OFF→ON) + E_MissGroup))
17.    PowerGate(g, P_OFF, S_OFF);
18.  else PowerGate(g, P_DR, S_DR);
19.  PowerON(g, S_ON);
20.  g' ← g; mb' ← ∅; mb ← g.getfirstmb();
21.  while (g' ≠ ∅) {
22.    if (G2) PowerGate(D \ mb.D_Best, P_OFF, S(D \ mb.D_Best));
23.    mb' ← g'.getcorrelatedmb(mb, mb_n ∈ N, N=[Left,Right,Top,Down]);
24.    if (mb' = ∅) mb' ← g'.getnextmb(mb);
25.    MBLevelPM(mb', S_ON, S_OFF, S_DR);
26.    [E_MEDE, E_Miss, E_Leak, M_mb] ← performsearch(mb', D_Best);
27.    g' ← g' \ mb'; } }
Fig. 8 Pseudo-code for our content-driven power management

Step-3: Computation Reordering: In the next step, the MBs of the group are processed one by one (lines 21-27, Fig. 8). As discussed in Section I, processing the ME/DE of MBs in raster-scan order causes frequent sleep and wakeup fluctuations, because MBs in a row may belong to different objects. Since an object typically spans MBs of different rows (see Fig. 4, Section S2), the sleep durations of the unused sectors can be lengthened (thus increasing the potential to put them in P_OFF mode) by processing MBs group by group, as MBs of the same group exhibit similar memory requirements. This reduces the sleep-wakeup fluctuations and leads to relatively higher leakage savings. Fig. 4 shows that MB-groups can be of non-rectangular shape and that the ME/DE processing order of MBs in different groups is not a raster-scan order; see Fig. 4b for a possible ME/DE processing order of MBs in group G1. To avoid sleep-wakeup fluctuations at fine granularity, even the computations inside the MB-groups are reordered. Inside a group, the next MB for ME/DE processing is selected by evaluating its texture difference with respect to the current MB, such that consecutively executed MBs exhibit similar memory requirements. The algorithm in Fig. 8 first determines a correlated MB in the spatial neighborhood (Left, Right, Top, Down); line 23. Then the MB-level fine-grained power management (see Fig. 9) is performed; line 25. Afterwards, the ME/DE is performed based on the decision of the best search direction, and E_MEDE, E_Miss, E_Leak, and M_mb are monitored as the ME/DE processing energy, miss energy, leakage energy, and actual memory usage, respectively (line 26).

Step-4: Macroblock-Level Power Management: Fig. 9 shows the algorithm for MB-level power management. First, the memory requirements of the MB are predicted from the matching neighbors (line 2) and the number of required memory sectors is computed (line 3). If the number of required sectors equals the number of ON sectors, the power modes of the sectors are not changed (line 5). If the required memory is less than the P_ON memory, the difference is put into the data-retentive sleep mode P_DR (lines 6, 8). Otherwise, more sectors are switched from P_DR mode to P_ON mode (lines 7, 8).

1. MBLevelPM(Macroblock mb, Group-Level number of memory sectors in different power modes S_ON, S_OFF, S_DR)
2. M_pred := AVG_{n ∈ mb.N_match} n.getmemusage();
3. S_MB := M_pred / S_Sector;
4. ΔS := S_ON − S_MB;
5. if (ΔS == 0) return;
6. else if (ΔS > 0) S_ONmb := S_ON − ΔS; S_DRmb := S_DR + ΔS;
7. else S_ONmb := S_ON + ΔS; S_DRmb := S_DR − ΔS;
8. PowerGate(g, P_DR, S_DRmb); PowerON(g, S_ONmb);
Fig. 9 Pseudo-code for macroblock-level power management

V. RESULTS AND EVALUATION
For the energy and quality comparison, several multiview video sequences with different resolutions are used: VGA (640x480; Ballroom, Exit, Flamenco2, and Vassar) and XGA (1024x768; Breakdancers and Ballet) [1]. Rena is part of the training set, so we do not use it for the evaluation, to avoid biasing effects. Further test conditions are: TZ Search ME/DE algorithm, 193x193 search window, QP={22,27,32,37}. Note that the energy results include the overhead of our scheme.

A. Comparison to State-of-the-Art
We compare the on-chip memory energy savings of our adaptive power-management scheme with state-of-the-art memory energy reduction techniques: Level-C and Level-C+ [12] (search window data reuse) and a memory power management scheme with dynamic search windows (DSW) [7]. For a fair comparison, the same test videos and QP set are used for all schemes. Fig. 10 shows the on-chip energy consumption normalized to Level-C+, which has the highest energy consumption among all schemes. Level-C+ and Level-C incur significant on-chip memory energy because their large search window is active all the time, i.e. they do not exploit the idle periods of the memory to save power. Compared to Level-C and Level-C+, our scheme provides on average 61% and 67% on-chip memory energy reduction, respectively. Compared to the DSW scheme [7], which power-gates memory sectors based on the memory requirements of consecutive MBs, our scheme provides on average 32% higher energy savings. The energy reduction is achieved by (i) increasing the sleep duration through computation reordering, (ii) power-gating memory sectors based on the best search direction prediction, and (iii) leveraging knowledge of the video content for a multi-level power management. On average, 51% of the sectors are in P_OFF mode while 9.5% are in P_DR mode (see further details in Section S3).
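To make the sector bookkeeping of Fig. 8 (lines 7-18) and Fig. 9 more concrete, the following is a minimal Python sketch, not the authors' implementation. The names `SectorPlan`, `sectors_from_pdf`, `group_plan`, `sleep_mode`, and `mb_level_pm` are hypothetical, as are all numeric values. Two assumptions go beyond the pseudo-code: sector counts are rounded up to whole sectors, and an MB can only wake as many sectors as are currently in P_DR.

```python
import math
from dataclasses import dataclass


@dataclass
class SectorPlan:
    on: int   # sectors kept active (P_ON)
    dr: int   # sectors in data-retentive sleep (P_DR)
    off: int  # sectors fully power-gated (P_OFF)


def sectors_from_pdf(total, sector_size, m1, m2):
    """Split `total` sectors using the PDF predictions M1 <= M2 (Fig. 8, lines 7-8)."""
    on = min(total, math.ceil(m1 / sector_size))
    on_or_dr = min(total, math.ceil(m2 / sector_size))
    return SectorPlan(on, on_or_dr - on, total - on_or_dr)


def group_plan(total, pdf_plan, nbs_on, nbs_dr):
    """Average the PDF-based and neighbor-based sector counts (Fig. 8, lines 11-12)."""
    on = min(total, math.ceil((pdf_plan.on + nbs_on) / 2))
    dr = min(total - on, math.ceil((pdf_plan.dr + nbs_dr) / 2))
    return SectorPlan(on, dr, total - on - dr)


def sleep_mode(t_pred, p_leak_saved, e_wakeup, e_miss_group):
    """Pick P_OFF only if the leakage saved over the predicted idle time
    exceeds the wakeup plus miss energy (Fig. 8, lines 16-18)."""
    return "P_OFF" if t_pred * p_leak_saved > e_wakeup + e_miss_group else "P_DR"


def mb_level_pm(plan, m_pred, sector_size):
    """Adjust the ON/DR split for one MB (Fig. 9); P_OFF sectors stay gated."""
    delta = plan.on - math.ceil(m_pred / sector_size)
    if delta == 0:
        return plan
    if delta > 0:  # fewer sectors needed: move the surplus from P_ON to P_DR
        return SectorPlan(plan.on - delta, plan.dr + delta, plan.off)
    wake = min(-delta, plan.dr)  # more sectors needed: wake them from P_DR
    return SectorPlan(plan.on + wake, plan.dr - wake, plan.off)


# Hypothetical example: 16 sectors of 4096 samples each.
pdf_plan = sectors_from_pdf(total=16, sector_size=4096, m1=10000, m2=20000)
plan = group_plan(16, pdf_plan, nbs_on=5, nbs_dr=2)
print(pdf_plan, plan, mb_level_pm(plan, m_pred=6000, sector_size=4096))
```

The sketch moves the magnitude of the difference between P_ON and P_DR in both directions, which follows the prose description of Fig. 9 (lines 6-8).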
Our experiments show that sequences with high motion/texture allow relatively lower energy savings, because more data from the search area is accessed and fewer sectors are gated.

[Fig. 10 plot: on-chip energy normalized to Level-C+ for Level-C+ [12], Level-C [12], DSW'11 [7], and our scheme; sequences Ballroom, Exit, Flamenco2, Vassar, Breakdancers, Ballet]
Fig. 10 On-chip memory energy savings comparison

B. Overhead: Mispredictions and Memory Misses
Fig. 11a shows that our scheme predicts the best search direction with an accuracy of 87% for high-activity sequences and 94% for low-activity sequences. This incurs an average video quality loss of .54 dB BD-PSNR (Bjøntegaard Delta PSNR) with an average increase of 1.86% BD-BR (Bjøntegaard Delta Bitrate), compared to the exhaustive search of JMVC 6.0 [2]. However, this loss is visually imperceptible (see Fig. 21, Section S3). Due to its predictive nature, our scheme incurs on average 8.5% on-chip memory misses compared to storing the complete search window (Fig. 11b). However, even including all latency overhead due to mispredictions and the power-management decision logic, our scheme still provides a minimum throughput of 33fps (see Fig. 12).

[Fig. 11 plots: (a) ME/DE prediction hits and misses [%]; (b) on-chip memory misses [%]; sequences Ballroom, Exit, Flamenco, Vassar, Breakdancers, Ballet]
Fig. 11 Our ME/DE mispredictions and on-chip memory misses

C. Hardware Implementation
The hardware prototype is implemented in an ASIC flow, using the Cadence tool chain for standard-cell synthesis with an IBM 65nm Low-Power technology. The IC layout and the comparison table are shown in Fig. 12. The designed architecture employs 64x4-sample SAD operators and 21 SAD trees fed by the 16 on-chip memory banks. Compared to the state-of-the-art, our architecture reduces the on-chip energy by 76% relative to [13] and also reduces it relative to [7]. Note that the work of [13] is implemented in a 90nm technology. Considering a 30% power reduction (in case of SRAMs) when moving from the 90nm to the 65nm technology node [6], our proposed scheme still provides >60% reduction in the on-chip energy. The throughput is sufficient for real-time HD1080p ME and DE at 33fps. The performance increase in relation to [7] is mainly due to the complexity reduction resulting from the search direction prediction. Note that the 8x increase in the number of on-chip bits in comparison to [13] is due to the different-sized search windows. Our scheme supports 193x193 search windows (which are mandatory for DE to provide good video quality), while the architecture of [13] supports 33x33 search windows.
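The technology-normalized comparison with [13] can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name `normalized_reduction` is hypothetical, it takes the 76% reduction and the 30% SRAM power reduction between the 90nm and 65nm nodes as inputs, and it assumes that the power reduction carries over to energy unchanged.

```python
def normalized_reduction(reported_reduction, scaling_reduction):
    """Energy reduction that remains after scaling the baseline to the newer node."""
    ours = 1.0 - reported_reduction             # our energy relative to the 90nm baseline
    scaled_baseline = 1.0 - scaling_reduction   # baseline energy if ported to 65nm
    return 1.0 - ours / scaled_baseline


# 76% reduction vs. the 90nm design, 30% gain attributed to the node change alone.
print(f"{normalized_reduction(0.76, 0.30):.1%}")  # prints 65.7%
```

Under these assumptions the remaining reduction is about 66%, which is consistent with the ">60%" figure stated in the text.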
[Fig. 12a: chip layout showing the 16 memory banks, SAD Units, AGU, ME/DE Ctrl, and APM. SAD Units: Sum of Absolute Differences Operators; ME/DE Ctrl: Motion/Disparity Estimation Control; AGU: Address Generation Unit; APM: Adaptive Power Management]

             | Tsung'09 [13]               | DSW'11 [7]               | Our
Technology   | TSMC 90nm Low Power LowK Cu | ST 65nm LP 7 metal layer | ST 65nm LP 7 metal layer
Gate Count   | 23k                         | 12k                      | 14k
SRAM         | 64 Kbits                    | 512 Kbits                | 512 Kbits
Frequency    | 300 MHz                     | 300 MHz                  | 300 MHz
Power        | 265mW, 1.2V                 | 74mW, 1.0V               | 63mW, 1.0V
Throughput (Resolution, Frame Rate) | 4-views | 4-views              | 4-views

Fig. 12 (a) Chip Layout, (b) Hardware results comparison

VI. CONCLUSIONS
We propose a novel adaptive power management scheme for on-chip video memory targeting MVC. It leverages knowledge of the multiview video content and computation reordering to achieve high energy savings with an imperceptible video quality loss. The key enabling attributes are MB-grouping based on texture and activity classification, best search direction prediction, and a video content-driven multi-level power management policy. Our scheme achieves on average 32%-61% on-chip energy reduction compared to the state-of-the-art [7][12]. We demonstrate the potential of leveraging multiview video properties for low-power MVC realization on battery-powered devices.

REFERENCES
[1] Y. Su, A. Vetro, A. Smolic, "Common Test Conditions for Multiview Video Coding", ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Doc. JVT-7, July 2006.
[2] JMVC 6.0, garcon.ient.rwthaachen.de, Sep. 2009; "Joint Draft 8.0 on Multiview Video Coding", JVT-AB24, 2008.
[3] P. Merkle et al., "Efficient Prediction Structures for Multiview Video Coding", IEEE TCSVT, vol.17, no.11, 2007.
[4] Lynx:
[5] J. Yang et al., "Multiview video coding based on rectified epipolar lines", International CICSP, pp. 1-5, 2009.
[6] Cypress Semiconductor Corp., "Advantages of 65 nm Technology over 90 nm Technology QDR Family of SRAMs", 2010.
[7] B. Zatt, M. Shafique, F. Sampaio, L. Agostini, S. Bampi, J. Henkel, "Run-time adaptive energy-aware motion and disparity estimation in multiview video coding", IEEE DAC, 2011.
[8] B. Zatt, M. Shafique, S. Bampi, J. Henkel, "A Low-Power Memory Architecture with Application-Aware Power Management for Motion & Disparity Estimation in Multiview Video Coding", IEEE ICCAD, pp. 4-47, 2011.
[9] B. Zatt, M. Shafique, S. Bampi, J. Henkel, "Multi-Level Pipelined Parallel Hardware Architecture for High Throughput Motion and Disparity Estimation in Multiview Video Coding", IEEE DATE, 2011.
[10] M. Shafique, B. Zatt, J. Henkel, "A Complexity Reduction Scheme with Adaptive Search Direction and Mode Elimination for Multiview Video Coding", Picture Coding Symposium, 2012.
[11] S. Saponara, L. Fanucci, "Data-adaptive motion estimation algorithm and VLSI architecture design for low-power video systems", IEE Comp. & Digital Tech., vol.151, no.1, 2004.
[12] C.-Y. Chen et al., "Level C+ data reuse scheme for motion estimation with corresponding coding orders", IEEE TCSVT, vol.16, no.4, 2006.
[13] P.-K. Tsung et al., "Cache-based integer motion/disparity estimation for quad-hd H.264/AVC and HD multiview video coding", IEEE ICASSP, 2009.
[14] C.-Y. Tsai et al., "Low power cache algorithm and architecture design for fast motion estimation in H.264/AVC encoder system", IEEE ICASSP, vol. 2, pp. II-97-II-1, 2007.
[15] H. Shim, C.-M. Kyung, "Selective search area reuse algorithm for low external memory access motion estimation", IEEE TCSVT, vol.19, no.7, 2009.
[16] X. Xu, Y. He, "Fast disparity motion estimation in MVC based on range prediction", IEEE ICIP, pp.2-23, 2008.
[17] H. Singh et al., "Enhanced leakage reduction techniques using intermediate strength power gating", IEEE TVLSI, vol. 15, no. 11, 2007.
[18] S. Roy, N. Ranganathan, S. Katkoori, "State-retentive power gating of register files in multi-core processors featuring multithreaded inorder cores", IEEE Transaction on Computers, 2010.
[19] L. Shen et al., "View-adaptive motion estimation and disparity estimation for low complexity multiview video coding", IEEE TCSVT, vol.20, no.6, pp.9-93, 2010.
[20] H.-C. Chang et al., "A dynamic quality-adjustable H.264 video encoder for power-aware video applications", IEEE TCSVT, vol.19, no.12, pp.17-14, Dec. 2009.
[21] S.-H. Wang, S.-H. Tai, T. Chiang, "A low-power and bandwidth-efficient motion estimation IP core design using binary search", IEEE TCSVT, vol.19, no.5, 2009.
[22] T. Tuan et al., "A 90nm low-power FPGA for battery-powered applications", ACM/SIGDA FPL, pp. 3-11, 2006.
[23] X. Xu, Y. He, "Fast disparity motion estimation in MVC based on range prediction", IEEE ICIP, pp.2-23, 2008.
[24] S. Mondal, S.O. Memik, "Fine-grain leakage optimization in SRAM based FPGAs", IEEE GLSVLSI.
[25] X. Liu, P. J. Shenoy, M. D. Corner, "Chameleon: application-level power management", IEEE TMC, vol. 7, no. 8, 2008.
[26] G. Fukano et al., "A 65nm 1Mb SRAM Macro with Dynamic Voltage Scaling in Dual Power Supply Scheme for Low Power SoCs", NVSMW/ICMTD, pp.97-98, 2008.
[27] M. Khellah et al., "A 4.2GHz 0.3mm2 256kb Dual-Vcc SRAM Building Block in 65nm CMOS", IEEE ISSCC, pp.72-81, 2006.
[28] G. Gerosa et al., "A Sub-2 W Low Power IA Processor for Mobile Internet Devices in 45 nm High-k Metal Gate CMOS", IEEE ISSCC, pp.73-82, 2009.
[29] H. Javaid, M. Shafique, S. Parameswaren, J. Henkel, "Low-power adaptive pipelined MPSoCs for multimedia: an H.264 video encoder case study", IEEE DAC, 2011.
[30] H. Javaid, M. Shafique, J. Henkel, S. Parameswaren, "System-Level Application-Aware Dynamic Power Management in Adaptive Pipelined MPSoCs for Multimedia", IEEE ICCAD.


Supplementary Material

S1. Motion and Disparity Estimation in MVC
MVC exploits the redundancies available in the temporal and inter-view domains using multiple block-sized Motion Estimation (ME) and Disparity Estimation (DE), respectively. The ME/DE search is performed in previously encoded frames (i.e. reference frames) to find the block that best matches the currently encoded Macroblock (MB) under a given similarity criterion (like the Sum of Absolute Differences, SAD). ME searches in temporally neighboring reference frames, while DE searches in frames of the neighboring views (see Fig. 13). Note that a search direction refers to the relative position of a reference frame with respect to the current frame. According to the MVC standard, multiple reference frames may be used to further improve the coding efficiency. However, in this work, we consider one reference frame per search direction, i.e. one forward and one backward reference in the temporal domain plus one forward and one backward reference in the view domain (if available). The search is performed by comparing a set of candidate blocks (selected according to given search patterns) inside a predefined search window (see Fig. 13) in order to find the best matching block.

[Fig. 13 diagram: current MB in the current frame; temporal reference frame with best match and Motion Vector (MV) for Motion Estimation; disparity reference frame with best match and Disparity Vector (DV) for Disparity Estimation]
Fig. 13 Overview of motion and disparity estimation

[Fig. 14 diagram: I, P, and B frames over time instances and views V0 to V3, with anchor and non-anchor frames]
Fig. 14 MVC Hierarchical Prediction Structure

Fig. 15 Neighboring MBs in the 3D-neighborhood

Once the best matching block is found, a Motion or Disparity Vector (MV, DV) is determined to represent the displacement between the current MB position and the best matching block position.
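The block-matching search described above can be illustrated with a small, self-contained Python sketch. It uses an exhaustive search over the window for clarity, which is a simplification: the evaluation in this work uses the TZ Search fast algorithm, not a full search. The helper names `sad` and `full_search`, the 8x8 toy frames, and the 2x2 block size are hypothetical.

```python
def sad(cur, cx, cy, ref, rx, ry, size):
    """Sum of Absolute Differences between two size x size blocks."""
    return sum(abs(cur[cy + r][cx + c] - ref[ry + r][rx + c])
               for r in range(size) for c in range(size))


def full_search(cur, ref, x, y, size, search_range):
    """Return (best SAD, (dx, dy)) for the block at (x, y) of the current frame.

    The same routine yields a motion vector when `ref` is a temporal
    reference frame and a disparity vector when it is a neighboring view.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            # Skip candidates that fall outside the reference frame.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, x, y, ref, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, (dx, dy))
    return best


# Toy data: the reference is a gradient; the current block at (3, 2) is a
# copy of the reference block at (4, 3), so the expected vector is (1, 1).
ref = [[r * 8 + c for c in range(8)] for r in range(8)]
cur = [[0] * 8 for _ in range(8)]
for r in range(2):
    for c in range(2):
        cur[2 + r][3 + c] = ref[3 + r][4 + c]

print(full_search(cur, ref, x=3, y=2, size=2, search_range=2))  # (0, (1, 1))
```

The number of candidates visited grows with the square of the search range, which is why the memory footprint of the search window matters for the 193x193 windows considered here.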
Note that although ME and DE are conceptually similar, their search behavior and, consequently, their computational requirements, memory access patterns, and vector properties are distinct (see discussion in Section II.A). Fig. 14 illustrates the MVC prediction structure and coding sequence. Fig. 15 shows the neighboring MBs in the spatial, temporal, and view domains (i.e. the 3D-neighborhood).

S2. Detailed Analysis of Multiview Videos
A fast ME/DE algorithm, TZ Search [5], is deployed for this analysis in order to represent a real-world scenario. Fast ME/DE algorithms are based on multiple search stages and patterns. These algorithms evaluate a different number of search candidates for different MBs and thus exhibit a highly varying memory usage profile.

A. Motion and Disparity Estimation Distribution
This section reinforces our analysis of the ME/DE search direction distribution (presented in Section II.A) by evaluating different video sequences with diverse motion/disparity and texture properties. The distributions in Fig. 16 and Fig. 17 illustrate that most of the MBs (typically from background objects with low texture, low motion, and static blocks) are encoded using ME, while the MBs of foreground objects and object borders (with medium to high texture and high motion) are encoded using DE. It is noteworthy in Fig. 16 that the view V1 exhibits a higher number of DE-encoded MBs than the other views. This is because V1 has two reference views available, which increases the possibility of finding a good match. The view V0 has no reference view available and, consequently, all its MBs are encoded using ME.

[Fig. 16 plots: mode distribution [%] of ME and DE over QP for the views V0, V1, V2, V3 of Rena, Ballroom, Exit, and Vassar]
Fig. 16 (a) ME and DE distribution for four views of Rena, Ballroom, Exit and Vassar test video sequences

Fig. 17 ME/DE distribution for different frames in the view V1 of the Rena test multiview video sequence

The decision between ME and DE (as the best search direction) also depends upon the correlation available in the temporal domain. For instance, the number of DE-coded MBs is higher in frames that lie farther from their temporal references. Our memory usage analysis in Fig. 18 shows that the pattern of memory usage in ME is less scattered than in DE, especially for low-motion sequences with smaller objects like Ballroom. The probability density function (PDF) in Fig. 18 shows that the distribution patterns of the three groups are quite diverse. The PDF of group G1 is concentrated in a low range (8-15Kpixels), while the PDF of group G3 is dispersed over a large range (1-35Kpixels). Moreover, there is minimal overlap between the PDFs of different groups, which hints towards distinct memory predictions using the PDFs. Therefore, based on this PDF analysis, our scheme computes two different highly-probable memory requirement predictions (M1 and M2) from the PDF (obtained through an offline analysis over various test video sequences), considering a Gaussian distribution. M1 is obtained with a probability of 0.84 [F(µ+σ; µ, σ²) − F(0; µ, σ²)] and M2 is obtained with a probability of 0.9 [F(µ+2σ; µ, σ²) − F(0; µ, σ²)], where µ and σ are the mean and standard deviation, respectively. The M1 amount of memory is kept in P_ON mode, as the probability of using these memory sectors is high. The memory amount M2−M1 is kept in P_DR mode, as other MBs of the same group may use this data and the wakeup overhead is minimal, which avoids delay. These predicted memory requirement values are then forwarded to the power-management scheme to determine the number and mode of the gated sectors (see details in Section IV.B).

[Fig. 18 plots: (a) PDFs over required memory [K samples] for Group 1, Group 2, Group 3 in ME and DE; (b) occurrences per frame over accessed pixels (KPixel) for ME and DE, Rena and Ballroom]
Fig. 18 (a) Probability density function (PDF) for the memory usage requirements of different groups for various test video sequences; (b) Histograms of memory usage during ME and DE processes for Rena and Ballroom sequences

Furthermore, the neighborhood correlation of memory usage can also be exploited to predict the memory usage of a given MB-group, because MBs of the same group typically contain the same object and thus exhibit similar memory requirements for ME and DE. Fig. 19 shows that there is an extensive correlation between neighboring frames. Therefore, the memory requirements [M1, M2] of an MB-group can also be predicted with high accuracy from the memory usage of the same MB-group in the neighboring frames or even views. A similar observation can be made from the memory requirements correlation shown in Fig. 20. The regions that require more memory are located in the same area at different time instants. This shows that it is possible to infer the memory behavior from knowledge of the neighborhood. A similar observation can be made for view neighbors.

Fig. 19 MB-group correlation in different neighboring frames

[Fig. 20 plots: memory usage [samples] per MB for Motion Estimation and Disparity Estimation, showing correlated memory requirements behavior]
Fig. 20 3D-plots showing the correlation in the memory usage of MBs in the same frame and its temporal neighbors
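The two PDF-based predictions can be sketched in a few lines of Python using the standard library's `statistics.NormalDist`. This is only an illustration under the Gaussian assumption stated above: the function name `memory_predictions` and the example values of µ and σ are hypothetical, and the real parameters come from the offline analysis of the test sequences.

```python
from statistics import NormalDist


def memory_predictions(mu, sigma):
    """Return (M1, M2, P1, P2) for a Gaussian memory-usage model.

    M1 = mu + sigma and M2 = mu + 2*sigma; P1 and P2 are the probabilities
    F(M; mu, sigma^2) - F(0; mu, sigma^2) that the usage lies in [0, M].
    """
    dist = NormalDist(mu, sigma)
    m1, m2 = mu + sigma, mu + 2 * sigma
    p1 = dist.cdf(m1) - dist.cdf(0.0)
    p2 = dist.cdf(m2) - dist.cdf(0.0)
    return m1, m2, p1, p2


# Hypothetical group statistics in samples.
m1, m2, p1, p2 = memory_predictions(mu=12000.0, sigma=2000.0)
print(m1, m2, round(p1, 2), round(p2, 2))
```

For a mean that is several standard deviations above zero, the first probability evaluates to about 0.84, matching the value quoted for M1. Memory up to M1 would then stay in P_ON, the range between M1 and M2 in P_DR, and the remainder becomes a candidate for P_OFF.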


[Fig. 21 plots: PSNR [dB] rate-distortion curves for Ballroom, Exit, Breakdancers, and Ballet; curves: Our, JMVC]
Fig. 21 Comparing the objective video quality (rate-distortion curves) and subjective video quality (pictures) of our scheme with the exhaustive ME/DE search of JMVC 6.0 [2]

S3. Additional Detailed Results
The on-chip memory power reduction is achieved by applying the computation reordering (which increases the number and the sleep durations of idle memory sectors) and power management at different levels (MB-groups, MBs, etc.). The power state machine parameters are provided in Table I, based on the model of [17] (see Section III for details of the power model). Fig. 22 shows that on average 51% of the sectors are in P_OFF mode (up to 63%) while 9.5% are in P_DR mode (up to 15%). These results depend strongly on the accuracy of the MB-level memory requirements prediction. Fig. 24 compares our application-driven memory requirements predictor with a traditional history-based median predictor. Note that our proposed predictor reacts better and faster to sudden variations in the memory requirements. The high prediction accuracy is achieved by taking into consideration the correlation in the 3D-neighborhood along with the texture and activity properties of different MBs, frames, and views.

[Fig. 22 plot: share of memory blocks in the ON, DR, and OFF states for Ballroom, Exit, Flamenco, Vassar, Breakdancers, Ballet]
Fig. 22 Power modes distribution of the on-chip video memory

Compared to search window-based schemes (like in [12]), our approach requires far fewer external memory accesses, since only a part of the search window is prefetched. Fig. 23 shows that our approach reduces the off-chip energy by 89% and 95% (on average) compared to Level-C and Level-C+ [12], respectively. Due to the computation reordering, our scheme reduces the external memory accesses by 15% on average compared to our previous work [7].
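The difference between the two predictor types compared in Fig. 24 can be illustrated with a toy Python sketch. It is a deliberately reduced model: `history_median_predictor` takes the median of the most recently processed MBs, while `neighborhood_predictor` averages the memory usage of matching neighbors, in the spirit of line 2 of Fig. 9. Both function names, the window length, and the sample values are hypothetical, and the texture and activity terms of the actual predictor are omitted.

```python
from statistics import mean, median


def history_median_predictor(history, window=5):
    """Predict the next usage as the median of the last `window` MBs."""
    recent = history[-window:]
    return median(recent) if recent else 0


def neighborhood_predictor(neighbor_usages):
    """Predict the usage as the average over matching 3D-neighborhood MBs."""
    return mean(neighbor_usages) if neighbor_usages else 0


# Hypothetical scenario (K samples): the processing order moves from a static
# background region into a moving object whose neighbors were already coded.
history = [10, 10, 11, 10, 10]   # usage of the previously processed MBs
neighbors = [28, 32, 30]         # usage of matching neighbors of the next MB
actual = 30

print(history_median_predictor(history), neighborhood_predictor(neighbors), actual)
```

In this constructed case the median of the recent history stays at the background level, while the neighborhood average already reflects the object, which is the kind of sudden variation discussed above.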
Table I: Power state machine parameters
[Table I: columns Sleep Mode, Leakage Energy, Wakeup Energy, Wakeup Latency; rows P_ON, P_DR, P_OFF]

[Fig. 23 plot: off-chip energy [%] for Level-C+ [12], Level-C [12], DSW [7], and our scheme; sequences Ballroom, Exit, Flamenco2, Vassar, Breakdancers, Ballet]
Fig. 23 Off-chip memory energy savings compared to state-of-the-art search window prefetching techniques

[Fig. 24 plot: memory requirement [Ksamples] per MB; curves: Actual Memory Requirements, History Based (Median), Our]
Fig. 24 Comparing the accuracy of our application-driven memory requirement predictor with the history-based median predictor at MB-level

The detailed video quality results are shown in Fig. 21 and Fig. 25. The objective video quality (rate-distortion curves) and subjective video quality (decoded frames) results in Fig. 21 illustrate that our
