Bus Encoding Techniques for System-Level Power Optimization


The switching activity on system-level buses is often responsible for a substantial fraction of the total power consumption of large VLSI systems. Thus, minimizing the transitions at the I/O interfaces can provide significant savings on the overall power budget. This chapter introduces innovative encoding techniques suitable for minimizing the switching activity of system-level address buses in microprocessor-based systems. In particular, the proposed schemes target the reduction of the average number of bus line transitions per clock cycle by exploiting the spatial locality principle. Experimental results, obtained on address streams generated by a real microprocessor executing benchmark programs, demonstrate the effectiveness of the proposed methods with respect to previous encoding approaches. A trade-off analysis has been carried out to verify that the power saved by bus activity reduction is not offset by the encoding/decoding logic.

5.1 Introduction

In microprocessor-based systems, significant power savings can be achieved by reducing the transition activity of the on- and off-chip buses. This is because the total capacitance switched when a voltage change occurs on a bus line is usually considerably larger than the capacitive load that must be charged or discharged when internal nodes toggle. Heavy loads are usually connected to off-chip buses due to I/O pins, long external board wires, and board-level connected devices. In order to drive these large board-level capacitances, the devices in the I/O pads need to be much larger than the average on-chip features, which also increases the pads' intrinsic parasitic capacitance.
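The magnitude of this effect can be illustrated with the familiar dynamic-power relation P = α·C·V²·f, where α is the per-cycle switching activity. The sketch below uses capacitance and operating figures we have assumed purely for illustration (a 50 pF off-chip bus line versus a 0.05 pF internal node), not measurements from this chapter:

```python
# Rough dynamic-power estimate for a single switching node:
#   P = alpha * C * Vdd^2 * f
# (alpha = average transitions per cycle; some texts fold a 1/2 factor
# into alpha). All numeric values below are illustrative assumptions.

def dynamic_power(alpha, cap_farads, vdd, freq_hz):
    """Average dynamic power of one switching node, in watts."""
    return alpha * cap_farads * vdd ** 2 * freq_hz

# Assumed figures: a 50 pF off-chip bus line vs. a 0.05 pF internal node,
# both with activity 0.5, at 100 MHz and a 3.3 V supply.
p_bus = dynamic_power(0.5, 50e-12, 3.3, 100e6)     # off-chip bus line
p_node = dynamic_power(0.5, 0.05e-12, 3.3, 100e6)  # internal node

print(p_bus, p_node, p_bus / p_node)
```

With these assumed loads the off-chip line dissipates three orders of magnitude more power than the internal node, which is why removing transitions from the bus pays off so directly.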

In this chapter, we deal with the communication occurring on both data and address buses in general-purpose microprocessor-based systems. The increasing performance requirements and the existing gap between the speed of current microprocessors and the speed of the system interfaces have pushed system designers to increase the bandwidth of data transfers. For example, the PowerPC 620 processor can be configured with either a 64- or a 128-bit data bus to support the transfer of the 64-byte cache block within four clock cycles. Moreover, modern software applications span a very large address space. The latest versions of existing processors, such as the DEC Alpha AXP and the PowerPC 620, support a 64-bit address space. Hence, both data and address buses have become very wide in modern microprocessor-based systems. Due to the intrinsic capacitances of the bus lines, a considerable amount of power is required at the I/O pins of the microprocessor when data have to be transmitted over the bus. More specifically, it has been estimated that the capacitance driven by the I/O nodes is usually much larger (up to three orders of magnitude [146]) than that seen by the internal nodes of the microprocessor. As a consequence, dramatic optimizations of the average power consumption can be achieved by minimizing the number of transitions (i.e., the switching activity) on system-level buses. In this chapter, we focus on the problem of reducing the switching activity on system-level buses through the application of dedicated encoding schemes. The aim is to propose innovative encoding techniques targeting the reduction of the average number of transitions on the system-level address buses of microprocessor-based systems. In these systems, the data and address buses are at the core of the interface between the processor and the memory and I/O sub-systems. 
More specifically, they are used, on-processor, to access the first- and second-level caches, as well as, off-processor, to access the external lower-level caches, the main memory (usually through a memory controller), and the I/O sub-systems (to support direct memory accesses from the I/O controllers). Our goal is to avoid any modification to the standard memory components; hence the encoding circuitry is added inside the processor, and the decoding logic inside the memory and I/O controllers. First, we analyzed the execution traces of benchmark programs on the MIPS RISC-based architecture [92], and we observed that the address sequentiality of instruction address streams is equal to 63.04% on average, and can reach up to 76.56%. As a matter of fact, instruction

address streams are typically composed of bursts of in-sequence addresses, intermingled with out-of-sequence ones corresponding to taken branches and jumps. On the other hand, the percentage of in-sequence data addresses that we derived for the MIPS processor is equal to 11.39% on average and 52.47% at most. In fact, data streams accessing data arrays generate in-sequence addresses, while other accesses to non-structured data generate addresses randomly distributed in time, with very limited temporal correlation among them. Then, we defined the asymptotic zero-transition encoding scheme, which is suitable for reducing the switching activity on the address bus lines. The technique relies on the observation that, in most cases, patterns travelling on address buses are consecutive, as the spatial locality principle states and as confirmed by the MIPS tracing. Under this condition it is therefore possible, for the devices located at the receiving end of the bus, to automatically compute the address to be received at the next clock cycle; consequently, the transmission of the new pattern can be avoided, resulting in an overall decrease in switching activity. The main idea exploited by our encoding scheme, called in the sequel the T0 code, consists of avoiding the transfer of consecutive addresses on the bus by using a redundant line, INC, to communicate to the receiving sub-system the information on the sequentiality of the addresses. When two addresses in the stream to be transmitted are consecutive, the INC line is set to 1 and the address bus lines are frozen, to avoid unnecessary switching, while the new address is computed directly by the receiver. On the other hand, when two addresses are not consecutive, the INC line is driven to 0 and the bus lines operate normally. 
With the hypothesis of infinite streams of consecutive addresses, the T0 code enjoys the property of zero transitions occurring on the bus. Therefore, it outperforms the Gray code since, under the same assumption, Gray addressing requires one line transition per pair of patterns. The increments between consecutive patterns can be parametric, reflecting the addressability scheme adopted in the given architecture. In this respect, our code has the same capabilities as the Gray scheme [127]. The proposed encoding has been applied to several address streams, broken down into instruction-only, data-only, and multiplexed instruction-data traces. The goal is to evaluate the effects of encoding on separated and shared address buses for both instruction and data streams. The chapter also introduces mixed encoding schemes, aiming at combining the best

properties of existing approaches, with the ultimate objective of exploiting the advantages offered by each technique and thus identifying the best method for the specific processor (the MIPS RISC we have used for the experiments). The combined techniques are suitable for address buses carrying both instruction and data streams. In fact, the separated accesses to instruction and data arrays occur mainly in sequence, but unfortunately they are often interleaved on the same address bus, and the sequentiality is thus destroyed. The mixed encoding schemes aim at restoring this sequentiality. Furthermore, we present analytical and experimental analyses showing the improved performance of our encoding schemes when compared to existing bus encoding techniques. Since our global target is the reduction of the power consumed by the overall system, we need to guarantee that the power savings achieved by our encoding schemes are not offset by the power overhead dissipated by the encoder/decoder. Moreover, bus latency is as critical a design constraint as power. Performance of the encoding-decoding scheme is an essential requirement, since in modern microprocessor-based systems bus width and clock rate are both constantly increasing. Therefore, the simultaneous optimization of power and timing has been targeted during the synthesis of the encoding/decoding logic. Although the T0 encoder/decoder demand more area than the Gray ones, the T0 bus interface turns out to be faster. In fact, the critical delay of a Gray decoder grows linearly with the number of bus lines to be decoded, while in the T0 decoder we propose the critical delay grows logarithmically. In spite of the fact that the encoding and decoding logic implementing the T0 and related codes is more expensive, in terms of area, than the Gray one, the power savings achieved by bus encoding are not offset by the energy consumed by the additional circuitry implementing the proposed codes. 
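The asymmetry between Gray encoding and decoding delay can be seen directly in the standard conversion formulas: encoding XORs each bit with its higher-order neighbor and is fully parallel, while decoding is a prefix XOR in which every output bit depends on all higher-order input bits, so a naive implementation is a linear XOR chain (a parallel-prefix tree brings the depth down to logarithmic). A Python sketch of the two conversions, given here only to illustrate the data dependency:

```python
def binary_to_gray(b: int) -> int:
    """Gray encoding: each output bit is b[i] XOR b[i+1] -- a single
    XOR level, computed for all bits in parallel."""
    return b ^ (b >> 1)

def gray_to_binary(g: int, width: int = 32) -> int:
    """Gray decoding: output bit i is the XOR of input bits width-1..i.
    The running accumulator mirrors the serial XOR chain of a naive
    hardware decoder (depth linear in the bus width)."""
    acc = 0
    result = 0
    for i in reversed(range(width)):   # MSB first
        acc ^= (g >> i) & 1            # depends on ALL higher-order bits
        result |= acc << i
    return result

# Consecutive binary values always map to Gray codes at Hamming distance 1.
for x in range(15):
    diff = binary_to_gray(x) ^ binary_to_gray(x + 1)
    assert bin(diff).count("1") == 1
assert gray_to_binary(binary_to_gray(0xCAFE), 32) == 0xCAFE
```

The encoder's single XOR level against the decoder's bit-to-bit dependency is precisely the delay asymmetry discussed above.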
To support this claim, we present detailed implementations of such additional logic, and we report power dissipation results obtained through simulation of a large set of properly selected input patterns for typical on- and off-chip capacitive loads. The rest of this chapter is organized as follows. Previous work on power-oriented bus encoding techniques is outlined in Section 5.2, while Section 5.3 discusses the main characteristics of the proposed T0 code with respect to previous encoding schemes. Mixed bus encoding schemes are proposed in Section 5.4, and their performance, in terms of average bus transitions, is also reported for real address streams. Section 5.5

discusses the encoder/decoder architectures and their power consumption for typical on-chip and off-chip system-level bus loads. Finally, in Section 5.6 we draw some conclusions from our analysis.

5.2 Previous Work on Bus Encoding Techniques

Encoding paradigms for reducing the switching activity on the bus lines of the processor-to-memory interface have recently been investigated. They all rely on encoding the binary patterns before transmitting them, depending on the distinctive spectral characteristics of the pattern streams to be exchanged. In the well-known one-hot encoding technique [35], the bus is composed of m = 2^n wires. An n-bit value is coded by a logical 1 on the i-th wire, where 0 <= i <= 2^n - 1 is the binary value corresponding to the bit pattern, while a logical 0 is put on the remaining m - 1 wires. The one-hot encoding has the drawback of requiring an exponential number of wires, but it guarantees exactly one 0-to-1 and one 1-to-0 transition when a new value is sent over the bus. In [167], Stan and Burleson have proposed the use of a redundant encoding scheme, the bus-invert code, to limit the average power. The bus-invert method performs well when the patterns to be transmitted are randomly distributed in time and no information about pattern correlation is available. Therefore, the method seems to be appropriate for encoding the information travelling on data buses. A redundant bus line, INV, is needed to signal to the receiving end of the bus which polarity is used for the transmission of the incoming pattern. The encoding depends on the Hamming distance (i.e., the number of bit differences) between the value of the encoded bus lines at time t-1 (also counting the redundant line at time t-1) and the value of the address bus lines at time t. The Hamming distance is compared to N/2, where N is the bus width (assuming N even without loss of generality). 
If the Hamming distance between two successive patterns is larger than N/2, the current address is transmitted with inverted polarity and the redundant line is asserted; otherwise, the current address is transmitted as is, and the INV line is de-asserted. The bus-invert method can be expressed by the following equations:

(B(t), INV(t)) = (b(t), 0)     if H(t) <= N/2
                 (~b(t), 1)    if t > 0 and H(t) > N/2        (1)

where B(t) is the value of the encoded bus lines at time t, INV(t) is the additional bus line, b(t) is the address value at time t, N is the bus width of b(t), and H(t) is the Hamming distance between the previously transmitted value (B(t-1), INV(t-1)) and the candidate value (b(t), 0). The corresponding decoding scheme is simply defined as:

b(t) = B(t)     if INV = 0
       ~B(t)    if INV = 1        (2)

where ~ denotes bitwise complementation. If the words transmitted on the bus are independent and uniformly distributed, the average number of transitions per clock cycle is lowered by less than 25% of the original value, due to the binomial distribution of the distance between consecutive patterns. The major drawbacks of this approach are the required redundant bus line and the overhead of the logic implementing the voter that decides whether the Hamming distance exceeds N/2. The encoding methods discussed so far perform well when no information about possible data correlation is available. In particular, they work well when the patterns to be transmitted are randomly distributed in time, as for the data exchange between a microprocessor and the data cache. This is because, except for some specific applications such as arithmetic and DSP circuits, patterns transmitted over these buses usually have very limited correlation. When the objective shifts to address buses, a radically different behavior can be observed. Other techniques have been explored, all relying on the well-known fact that the addresses generated by a microprocessor are often consecutive, due to the spatial locality principle [78], which states that memory elements whose addresses are close to each other tend to be referenced during successive clock cycles. The addresses generated by a running microprocessor are often consecutive, since instructions are stored in adjacent sections of the memory space, and structured data are stored in consecutive memory locations for better locality. 
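The bus-invert rules in equations (1) and (2) are straightforward to prototype in software. The sketch below is our own illustrative model, not an implementation from [167]; it drives uniform random 8-bit words through the encoder and checks that the measured saving stays below the 25% bound implied by the binomial distance distribution:

```python
import random

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def bus_invert_encode(prev_bus: int, prev_inv: int, word: int, n: int):
    """Equation (1): invert the word when the Hamming distance from the
    previous (bus, INV) state, with INV tentatively 0, exceeds N/2."""
    mask = (1 << n) - 1
    dist = hamming(prev_bus, word) + prev_inv   # data lines + INV line
    if dist > n // 2:
        return (~word) & mask, 1
    return word, 0

def bus_invert_decode(bus: int, inv: int, n: int) -> int:
    """Equation (2): re-invert when INV is asserted."""
    mask = (1 << n) - 1
    return (~bus) & mask if inv else bus

# Simulation with uniform random words (N = 8): the transition count
# should drop, but by less than 25%.
random.seed(0)
N = 8
words = [random.getrandbits(N) for _ in range(20000)]

plain = sum(hamming(a, b) for a, b in zip(words, words[1:]))

coded = 0
bus, inv = 0, 0
for w in words:
    nb, ni = bus_invert_encode(bus, inv, w, N)
    coded += hamming(bus, nb) + (inv ^ ni)   # count data lines + INV line
    assert bus_invert_decode(nb, ni, N) == w
    bus, inv = nb, ni

saving = 1 - coded / plain
print(f"saving: {saving:.1%}")
```

For uniform random words the measured saving is positive but, as predicted above, remains below the 25% bound.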
Clearly, there are exceptions to this behavior such as control-flow instructions, which cause interruptions in the sequence of consecutive addresses on the instruction flow, and data not stored in arrays are often addressed without any regular pattern. In any case, sequential addressing usually dominates and so the temporal correlation between successive addresses is usually very high. To exploit the peculiar characteristics of the address buses, Su et al. [169] have proposed to encode the patterns with the Gray code, since it guarantees a single bit transition when consecutive addresses are accessed. The results reported in [169] show that the number of bit switches is reduced by 37%, on average, on several benchmark programs, when binary

encoding is replaced with Gray encoding. Although the Gray code is suitable for reducing the switching activity, we need to consider the power overhead caused by the presence of additional circuitry for encoding and decoding. In fact, it is unrealistic to assume that the address computation units, the data-path, the memory decoders and even the compiler could be modified to generate Gray-coded addresses. Therefore, a Gray encoder must be placed at the transmitting end of the bus, and a Gray decoder is required at all the receiving ends. In [169], the trade-off between the cost of encoding/decoding and the savings on the address buses is not discussed. Further investigation of Gray code addressing has been carried out in [127], where architectural solutions are proposed for the realization of the Gray addressing circuitry, and their performance is compared with pure binary addressing in the case of a 16-bit address bus. In addition, the issue of modifying the Gray code so as to preserve the one-transition property for consecutive addresses of byte-addressable machines has been extensively discussed. A further encoding method for decreasing the address bus activity still relies on the locality property of the memory references ([132], [133]). In more detail, this work originates from the consideration that a program favors a limited number of working zones of the memory address space at each instant of time. Consequently, given a reference to one of these zones, we need to send over the bus only the information related to the offset of this reference with respect to the previous reference to that zone, along with the identifier of the current working zone. The offset information is coded on the bus by adopting the Modified One-hot Encoding mode (MOE) on the entire bus width, so that the activity of the n-bit address bus is reduced to one transition per transmitted pattern. 
Basically, the MOE encoding of a value v (where -n/2 <= v <= n/2 - 1) consists of two steps. First, the one-hot encoding of v into n bits is performed; second, the one-hot encoding of v is XOR-ed with the previous value sent over the n-bit address bus. The Working Zone Encoding (WZE) scheme is suitable for data address buses and for shared address buses. In the former case, the address sequentiality is often destroyed by the interleaved accesses to different data arrays, while in the latter case, the address sequentiality is diminished by the interleaved accesses to instruction and data locations. The WZE method restores sequentiality by memorizing the reference address of each working zone on the

receiver side and by sending only the highly sequential offset. Whenever the memory access moves to a new working zone, this information is sent to the receiver with a special codeword, so that the receiver changes the default reference address and the offset transmission can resume. The WZE technique has been extended in [103] to data buses, based on the observation that the data of successive accesses to a working zone frequently differ by a small amount; therefore, it is also effective to use an encoding based on offsets for the data buses. Basically, the data is sent as an offset, coded in one-hot encoding with transition signaling, and an additional register indicates the current working zone. The main limitation of the WZE is its larger encoder and decoder logic overhead, which limits the power benefits of the I/O switching activity reduction and introduces additional delays on signal paths. Another main drawback of the WZE is that it requires extra bus lines for communicating the working zone change. In fact, besides the n bits of the original address bus, the proposed method requires (log2 B + 1) extra wires: 1 bit signals whether the value sent over the bus is an offset, while the remaining log2 B bits identify which of the B previous references to each working zone is being used. Furthermore, WZE relies on strong assumptions on the patterns in the stream. If the data-access policy is not array-based, or if the number of working zones is too large, this encoding scheme loses its effectiveness. Other encoding techniques at the system level have been reviewed in [168], where the authors discuss the introduction of redundancy in space (number of bus lines), time (number of cycles) and voltage (number of distinct voltage levels). In particular, time redundancy has been demonstrated to be as effective as space redundancy for decreasing the average switching activity. 
Nevertheless, it is impractical to apply in high-performance systems, since it requires extra transfer cycles. In [168], codes have also been reported for both level signaling and transition signaling. Among transition signaling codes, Limited-Weight Codes (LWCs) are suitable for reducing the bus switching activity. In fact, LWCs reduce the number of logical 1s transmitted over the bus, which are responsible for all the high-to-low and low-to-high transitions, while logical 0s are represented by the absence of transitions. A further work on encoding [154] discusses the lower and upper bounds on the transition activity provided by encoding techniques, and a low-power encoder/decoder suitable for high

capacitance buses, where the extra power due to the encoder/decoder is offset by the power savings at the bus level. The bounds on the expected transition activity have been obtained through an information-theoretic approach, where a signal with a certain entropy rate is coded with additional bits per sample on average. The architecture proposed in this work consists of a data source that is passed through a de-correlating function followed by a variable entropy coding function, aiming at reducing the switching activity. Several encoding schemes can then be employed for both functions. All the aforementioned schemes are suitable for general-purpose microprocessor-based systems, while an application-dependent encoding method for special-purpose embedded systems has been proposed in ([20], [22]). The approach, called the Beach solution, reduces the address bus activity of special-purpose systems, where a dedicated or intellectual property core processor executes the same portion of embedded code repeatedly. The analysis of the execution traces of a given program allows an accurate computation of the correlations existing between blocks of bits in consecutive patterns. This information is exploited to determine an application-dependent encoding that reduces the transition activity. The target of the Beach code is thus to reduce the bus activity even in cases where the percentage of in-sequence addresses is limited. In this situation, it may be possible to exploit types of temporal correlation other than arithmetic sequentiality between the patterns being transmitted. Although the sequentiality of addresses may be lost after a taken branch or jump instruction, some portions of the patterns may still maintain high correlation. 
For processors adopting segment- or page-based memory organizations, this can be justified by the fact that intra-segment/page branches and jumps are much more frequent than inter-segment/page ones. The encoding strategy is then determined depending on the particular stream being transmitted. The main limitation of the Beach solution is that it is strongly application-dependent; thus it can find applicability only in dedicated core-based systems executing a given program many times. Moreover, the power saving may be offset by the extra power due to the encoding/decoding logic. The bus encoding techniques described so far aim at reducing the switching activity at the memory-processor interface by changing the format of the information transmitted over the bus. An alternative approach consists of directly changing the way the information is stored in

memory, so that the address streams generated by the processor already have low transition activity. Memory organization techniques have mainly investigated two research areas. The first one exploits the low power cost associated with accesses to the highest levels of the memory hierarchy ([192], [50]). Basically, power is reduced by organizing data in the hierarchy so that transfers from the lowest levels are minimized. The second area of research targets the mapping of data in memory so as to increase the number of sequential accesses at the processor-memory interface ([192], [146], [193]). From our analysis, we can see that bus encoding and memory organization techniques constitute a relatively new research area coping with the problem of power reduction from a system-level standpoint. The bus encoding and memory organization techniques that have appeared so far all have advantages and limitations. Most bus encoding schemes can be applied without any knowledge about the functionality of the application being executed, while the Beach solution and the memory organization techniques need detailed information on the given application in terms of data structures, loops, and control-flow structures. On the other hand, bus encoding imposes, in most cases, hardware overhead in terms of encoder/decoder and redundant lines, while memory organization techniques exploit the existing hardware. Furthermore, memory organization and bus encoding techniques are not mutually exclusive: optimal power minimization strategies should exploit their synergy. In this scenario, the primary goal of our research is to provide original address encoding techniques suitable for a general processor-memory architecture, usable in general-purpose microprocessor-based systems as well as in dedicated embedded systems applications. 
For in-sequence addresses, the Gray code achieves the minimum switching activity among irredundant codes; however, better performance can be obtained by introducing a minimal redundancy, as proposed in our T0 code. The basic idea of the T0 code consists of exploiting the strong temporal correlation between successive addresses related to instructions and data arrays to achieve zero asymptotic transitions on the address bus. We also propose power- and timing-efficient implementations of the T0 and related mixed encoding/decoding logic, and we discuss the applicability of the technique to real microprocessor-based designs. Moreover, the codes proposed in this chapter can be used in conjunction with data allocation techniques emphasizing sequential memory access to further minimize the average number of bus line transitions.

5.3 Asymptotic Zero-Transition Encoding

In the ideal case of an infinite stream of consecutive instructions, the Gray code achieves its asymptotic best performance of a single transition per emitted address. It appears that the Gray code is the best possible code for reducing the switching activity, because one bit difference is the minimum needed to distinguish two binary numbers. The key observation that allows us to improve upon this result is that the Gray code is optimum only among irredundant codes, that is, codes that employ exactly N-bit patterns to encode a maximum of 2^N data words. If we add redundancy to the code, we can achieve better performance. Let us provide an additional redundant line, INC, to the address bus. Its purpose is to signal, with value one, that a consecutive stream of addresses is being output on the bus. If INC is high, all other lines on the bus are frozen, to avoid unnecessary switching, and the new address is computed directly by the receiver. On the other hand, when two addresses are not consecutive, the INC line is low and the remaining bus lines are used as standard binary codes for the new addresses. Obviously, this redundant code outperforms the Gray code on unlimited streams of consecutive addresses: since all addresses are consecutive, the INC line is always asserted, and the bus lines never switch. As a consequence, the asymptotic performance of the T0 code is zero transitions per emitted consecutive address. The increments between consecutive patterns can be parametric, reflecting the addressability scheme adopted in the given architecture. In this respect, our code has the same capabilities as the Gray schemes reported in [127]. 
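The behavior just described can be prototyped directly. The following sketch is our own illustrative model (the class names and the stride value of 4, matching word-aligned byte addressing, are assumptions made for this example, not part of the formal definition):

```python
class T0Encoder:
    """T0 code: freeze the bus and raise INC for consecutive addresses."""
    def __init__(self, stride: int = 4):
        self.stride = stride
        self.prev_addr = None   # last address value b(t-1)
        self.bus = 0            # last value driven on the bus, B(t-1)

    def encode(self, addr: int):
        if self.prev_addr is not None and addr == self.prev_addr + self.stride:
            inc = 1             # bus lines frozen: zero transitions
        else:
            inc = 0
            self.bus = addr     # bus operates normally
        self.prev_addr = addr
        return self.bus, inc

class T0Decoder:
    def __init__(self, stride: int = 4):
        self.stride = stride
        self.prev_addr = None

    def decode(self, bus: int, inc: int) -> int:
        if inc:
            addr = self.prev_addr + self.stride   # computed locally
        else:
            addr = bus
        self.prev_addr = addr
        return addr

# Two in-sequence bursts separated by one out-of-sequence jump.
enc, dec = T0Encoder(), T0Decoder()
for a in [0x1000, 0x1004, 0x1008, 0x2000, 0x2004]:
    bus, inc = enc.encode(a)
    assert dec.decode(bus, inc) == a
```

During an in-sequence burst the encoder output never changes, so the bus lines contribute zero transitions; only the out-of-sequence address (0x2000 in the example) forces the bus to switch.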
More formally, our encoding scheme can be described as follows:

(B(t), INC(t)) = (B(t-1), 1)    if t > 0 and b(t) = b(t-1) + S
                 (b(t), 0)      otherwise        (3)

where B(t) is the value on the encoded bus lines at time t, INC(t) is the additional bus line, b(t) is the address value at time t, and S is a constant power of 2, called the stride. The corresponding decoding scheme can be formally defined as follows:

b(t) = b(t-1) + S    if INC = 1 and t > 0
       B(t)          if INC = 0        (4)

The T0 code retains its zero-transition property even if the addresses are incremented by a constant stride equal to a power of two (as is often the case for practical machines, which are

byte addressable, but are able to access data or instructions aligned at word boundaries). Obviously, the stride does not correspond to the memory granularity, but to the memory word length or to the cache block size.

Analytical Performance Comparison

The aim of this paragraph is to compare the T0 and bus-invert codes, taking the binary code as reference. For the analysis, we consider a stream of unlimited length whose addresses have a random uniform distribution, and a stream of unlimited length whose addresses are consecutive. Table 1 reports the results of the comparison, where N is the address bus width. Notice that, in the case of the bus-invert code, the average number of transitions per clock cycle, T_N, is given by:

T_N = (1/2^N) [ sum_{k=0}^{N/2} k C(N,k) + sum_{k=N/2+1}^{N} (N-k+1) C(N,k) ]        (5)

where C(N,k) is the binomial coefficient.

Table 1: Analytical Performance Comparison of the Binary, T0, and Bus-Invert codes (for random out-of-sequence streams, Binary and T0 both average N/2 transitions per clock, while Bus-Invert averages T_N; for in-sequence streams, T0 achieves zero transitions per clock).

Experimental Performance Comparison

This paragraph aims at evaluating the performance of the T0 code in terms of the average number of transitions required by transmission over the bus, for sequences of both artificial and real patterns. Since the code is designed specifically for patterns that satisfy, in a large number of cases, the sequentiality hypothesis, we first study its behavior by encoding artificially generated streams in which out-of-sequence addresses are inserted with controlled probability. For the experiment, streams of addresses have been generated with percentages of sequential addresses ranging from 0 to 100. The diagrams of Figure 1 and

Figure 2 summarize the results of our comparison analysis with respect to the binary and Gray codes, respectively.

Figure 1: Comparison Binary-T0 for Artificially Generated Addresses.

In particular, the diagrams clearly show that the average number of transitions per bus line is smaller with T0 addressing than with pure binary encoding. As expected, the advantage of the T0 code becomes more remarkable as the percentage of in-sequence addresses in the address stream increases. Compared with Gray encoding, T0 yields fewer transitions starting from approximately 50% of in-sequence addresses.

Figure 2: Comparison Gray-T0 for Artificially Generated Addresses.

The simulation of address streams generated ad hoc substantiates the theoretical performance of the T0 code. However, to prove its applicability to real cases, sequences of addresses

produced by a commercial microprocessor running real programs have been considered. First, we briefly analyze the behavior of the addresses generated by a RISC microprocessor; then we provide experimental results comparing the encoding techniques discussed above. The number of transitions occurring on the address bus during the execution of benchmark programs, drawn from different application domains, has been used as the metric to estimate the power consumption. The addresses generated by a microprocessor obey the spatial locality principle, that is, they are often in-sequence and, in general, they present very high correlation. However, things are quite different depending on whether the addresses correspond to instructions or data. In fact, instruction addresses usually show a more sequential behavior than data addresses: instructions are usually stored in adjacent locations of the memory space, while only structured data such as arrays are generally stored in consecutive memory locations to achieve better locality. Clearly, there are exceptions to this behavior: control-flow instructions cause interruptions in the sequence of consecutive addresses on the instruction flow, and data not stored in arrays are often accessed without any regular pattern. Nevertheless, sequential addressing dominates on average. These considerations are supported by the data contained in Table 2 to Table 4, where we report the percentage of in-sequence addresses measured on real address streams occurring on the 32-bit address bus of the MIPS microprocessor, when different benchmark programs are executed. Three distinct cases have been considered: the instruction address bus, the data address bus, and the instruction/data multiplexed address bus. The tables also report the number of transitions on the address bus, for the above-mentioned classes of codes, and the percentage of transitions saved with respect to binary encoding. 
The average percentage of sequential addresses in the benchmark streams is higher for instruction addresses (63.04%) than for data address streams (11.39%). This behavior can be attributed to the fact that references to automatic variables, such as loop counters, destroy the sequentiality of the address streams even when array data structures are accessed sequentially.
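The behavior illustrated in Figures 1 and 2, and the transition metric used in the tables below, can be reproduced with a small simulation. The following Python sketch (the stream model and all names are ours, not from the chapter) encodes a synthetic address stream with the T0 code, assuming a unit stride, and compares the average number of transitions per bus line against pure binary:

```python
import random

N = 32           # bus width (bits)
S = 1            # stride assumed by the T0 code
MASK = (1 << N) - 1
LENGTH = 10_000

def transitions(words):
    """Total number of bit toggles between consecutive words on a bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

def in_sequence_pct(stream):
    """Percentage of addresses equal to the previous address plus the stride."""
    hits = sum(b == ((a + S) & MASK) for a, b in zip(stream, stream[1:]))
    return 100.0 * hits / (len(stream) - 1)

def t0_encode(stream):
    """T0 code: freeze the bus lines and raise INC for in-sequence addresses.
    Each encoded word carries the N bus lines plus the INC line as bit N."""
    out, prev_b, lines = [], None, 0
    for b in stream:
        if prev_b is not None and b == ((prev_b + S) & MASK):
            inc = 1                      # in-sequence: bus frozen, INC raised
        else:
            lines, inc = b, 0            # out-of-sequence: address sent as-is
        out.append((inc << N) | lines)
        prev_b = b
    return out

def synthetic_stream(p_inseq, length=LENGTH):
    """Each address is in-sequence with probability p_inseq, random otherwise."""
    addr, out = random.randrange(1 << N), []
    for _ in range(length):
        out.append(addr)
        addr = (addr + S) & MASK if random.random() < p_inseq else random.randrange(1 << N)
    return out

random.seed(0)
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    s = synthetic_stream(p)
    pairs = len(s) - 1
    print(f"in-seq={in_sequence_pct(s):5.1f}%  "
          f"binary={transitions(s) / (N * pairs):.3f}  "
          f"T0={transitions(t0_encode(s)) / ((N + 1) * pairs):.3f}")
```

At 100% in-sequence addresses the T0 bus is essentially frozen, while binary still toggles on every increment; as the in-sequence fraction varies, the crossover behavior of the figures emerges.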

Table 2: Experimental Comparison of Binary, T0 and Bus-Invert Schemes for Instruction Address Streams. Benchmarks: gzip, gunzip, ghostview, espresso, nova, jedi, latex, matlab, oracle. For each benchmark the table reports the stream length, the percentage of in-sequence addresses, the binary transition count, and the T0 and Bus-Invert transition counts with their savings over binary. Average: 63.04% in-sequence addresses; T0 savings 35.52%; Bus-Invert savings 0.03%.

Table 3: Experimental Comparison of Binary, T0 and Bus-Invert Schemes for Data Address Streams (same benchmarks and columns as Table 2). Average: 11.39% in-sequence addresses; T0 savings 3.37%; Bus-Invert savings 10.78%.

Table 4: Experimental Comparison of Binary, T0 and Bus-Invert Schemes for Multiplexed Address Streams (same benchmarks and columns as Table 2). Average: 57.62% in-sequence addresses; T0 savings 10.25%; Bus-Invert savings 9.79%.

As expected, when the probability of consecutive addresses is high, as for instruction addresses, a scheme such as T0 is effective in reducing the transition count, offering an average savings of 35.52% (see Table 2), while the Bus-Invert approach does not introduce any savings.

On the other hand, when the probability of in-sequence addresses is very low, as in the case of the data addresses reported in Table 3, the T0 code provides only marginal advantages with respect to binary encoding (3.37%). The binary encoding can thus represent a good alternative for data addresses, since it does not require any redundancy and therefore no encoding and decoding circuitry. Among redundant codes, another alternative for data address buses is the Bus-Invert code, which is advantageous for the minimization of the average number of transitions (10.78% savings on average). When the address bus is multiplexed, or shared, between instruction and data addresses, as in the MIPS architecture, the sequential behavior is often interrupted when the selection signal switches from instruction to data and vice versa. Hence, the multiplexed address bus shows an intermediate behavior (57.62%), as shown in Table 4. As a matter of fact, the sequentiality of the addresses on the bus is somewhat reduced by the time multiplexing and by the inherent randomness of the data addresses. For shared address buses, both the T0 and Bus-Invert codes provide transition savings of approximately 10%.

Mixed Bus Encoding Techniques

The different effects of the above-discussed encoding schemes on the number of bus transitions suggest the use of combined techniques that exploit the best properties of each method to further improve the overall system-level power budget.
In this section, we first introduce some novel encoding options; then, we present transition results collected during the execution of the usual set of benchmark programs on the MIPS processor.

T0_BI Encoding

For architectures based on a single address bus used to transmit both instruction and data addresses, as in the case of external second-level unified data and instruction caches, or for data address buses, a new encoding can be applied that exploits the advantages provided by both the T0 and Bus-Invert approaches. Basically, the T0 code is used for in-sequence addresses, while Bus-Invert is adopted when the addresses are out-of-sequence. The combined code, called the T0_BI code, reduces the average power and requires two redundant lines, INC and INV. The formal definition of the code is the following:

(B(t), INC(t), INV(t)) =
  (B(t-1), 1, 0)   if t > 0 and b(t) = b(t-1) + S
  (~b(t), 0, 1)    if t > 0 and b(t) ≠ b(t-1) + S and H(t) > (N + 2) / 2
  (b(t), 0, 0)     otherwise                                               (6)

where B(t) is the value of the encoded bus lines at time t, INV(t) and INC(t) are the additional bus lines, b(t) is the address value at time t, ~b(t) denotes the bitwise complement of b(t), S is the stride, H(t) = HD(B(t-1) · INC(t-1) · INV(t-1), b(t) · 0 · 0) is the Hamming distance between the previous encoded word (bus lines concatenated with the INC and INV lines) and the current address concatenated with two zeros, and N is the bus width of b(t). The corresponding decoding scheme is given by:

b(t) =
  ~B(t)         if t > 0 and INC = 0 and INV = 1
  b(t-1) + S    if t > 0 and INC = 1
  B(t)          if INC = 0 and INV = 0                                     (7)

Dual_T0 Encoding

For architectures based on multiplexed address buses, such as that of the MIPS microprocessor, two address streams, α and β, with quite different behavior are time-multiplexed on the same address bus. Stream α, corresponding to instruction addresses, has a high probability of placing two consecutive addresses on the bus in two successive clock cycles; on the contrary, stream β, corresponding to data addresses, has almost no in-sequence patterns. The control signal SEL, available in the standard bus interface to de-multiplex the bus at the receiver side, is asserted when stream α is transmitted; otherwise, SEL is deasserted. An extension of the T0 code, called the Dual_T0 code, can be effective in these cases: the T0 code is applied and the encoding/decoding registers are updated whenever SEL is asserted; otherwise, the binary code is applied and the encoding/decoding registers hold.
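Returning to the T0_BI code for a moment: equations (6) and (7) can be modeled behaviorally. The following Python sketch (our illustration; the bus width, stride, and names are arbitrary) encodes and decodes a short address stream with the T0_BI rules and checks the round trip:

```python
N = 8                    # bus width (illustrative)
S = 1                    # stride
MASK = (1 << N) - 1

def hamming(x, y):
    """Hamming distance between two bit vectors."""
    return bin(x ^ y).count("1")

def t0bi_encode(stream):
    """T0_BI encoding, following eq. (6): T0 for in-sequence addresses,
    Bus-Invert for out-of-sequence ones. Words are (lines, INC, INV) triples."""
    out, prev_b = [], None
    B, inc, inv = 0, 0, 0                # encoded word at time t-1
    for b in stream:
        if prev_b is not None and b == prev_b + S:
            inc, inv = 1, 0              # bus lines frozen, INC raised
        else:
            # Hamming distance over the N+2 lines (bus plus INC and INV)
            h = hamming((B << 2) | (inc << 1) | inv, b << 2)
            if prev_b is not None and h > (N + 2) / 2:
                B, inc, inv = (~b) & MASK, 0, 1   # send complemented address
            else:
                B, inc, inv = b, 0, 0             # send address as-is
        out.append((B, inc, inv))
        prev_b = b
    return out

def t0bi_decode(words):
    """Inverse mapping, following eq. (7)."""
    out, prev = [], None
    for B, inc, inv in words:
        if inc:
            b = prev + S                 # in-sequence: increment last address
        elif inv:
            b = (~B) & MASK              # inverted word: complement the lines
        else:
            b = B
        out.append(b)
        prev = b
    return out

addrs = [16, 17, 18, 240, 15, 16]
assert t0bi_decode(t0bi_encode(addrs)) == addrs
```

In the example stream, the jump from 18 to 240 is transmitted as-is (small Hamming distance), while the jump from 240 to 15 is transmitted complemented with INV raised.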
Analytically, the Dual_T0 encoding can be expressed as:

(B(t), INC(t)) =
  (B(t-1), 1)   if t > 0 and SEL = 1 and b(t) = b'(t) + S
  (b(t), 0)     otherwise                                                  (8)

where B(t) is the value on the encoded bus lines at time t, INC(t) is the additional bus line, b(t) is the address value at time t, S is the stride, SEL is the selection signal, and b'(t), the value held in the encoding/decoding registers, is given by:

b'(t) =
  b'(t-1)   if SEL = 0
  b(t-1)    if t > 0 and SEL = 1                                           (9)

The corresponding decoding scheme follows:

b(t) =
  b'(t) + S   if t > 0 and INC = 1
  B(t)        if INC = 0                                                   (10)

where b'(t) denotes the value held in the encoding/decoding registers.

Dual_T0_BI Encoding

Similarly to what we have done for the T0_BI code, we can define the Dual_T0_BI code. With this scheme, which is expected to work well on multiplexed buses, the overall switching activity can be reduced by resorting to the T0 code anytime stream α is transmitted on the bus, while Bus-Invert is used for stream β. As for the T0_BI code, the Dual_T0_BI can offer significant savings in average power. The Dual_T0_BI encoding is defined as:

(B(t), INC_INV(t)) =
  (B(t-1), 1)   if t > 0 and SEL = 1 and b(t) = b'(t) + S
  (~b(t), 1)    if t > 0 and SEL = 0 and H(t) > N / 2
  (b(t), 0)     otherwise                                                  (11)

where B(t) is the value on the encoded bus lines at time t, INC_INV(t) is the additional bus line, b(t) is the address value at time t, ~b(t) denotes the bitwise complement of b(t), S is the stride, SEL is the selection signal, H(t) = HD(B(t-1) · INC_INV(t-1), b(t) · 0) is the Hamming distance, N is the bus width of b(t), and b'(t) is defined as before. The decoding scheme can be summarized with the following equation:

b(t) =
  b'(t) + S   if t > 0 and INC_INV = 1 and SEL = 1
  ~B(t)       if t > 0 and INC_INV = 1 and SEL = 0
  B(t)        if INC_INV = 0                                               (12)

Experimental Performance Comparison

To verify the performance of the newly proposed encoding techniques, we again used the MIPS processor as the reference architecture for measuring the number of address bus transitions required by the execution of the usual set of benchmark programs. Table 5 to Table 7 summarize the results, in terms of number of transitions and transition savings with respect to the pure binary encoding.
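Before looking at the results, the Dual_T0_BI rules in equations (11) and (12) can be sketched behaviorally as well. In this Python model (ours; bus width, stride, and names are illustrative), each bus word is a (lines, INC_INV, SEL) triple, and the register holding the last instruction address updates only when SEL = 1:

```python
N = 8                    # bus width (illustrative)
S = 4                    # word-aligned instruction stride (assumed)
MASK = (1 << N) - 1

def hamming(x, y):
    return bin(x ^ y).count("1")

def dual_t0bi_encode(stream):
    """Dual_T0_BI encoding of a stream of (address, SEL) pairs, SEL = 1 for
    instruction addresses: T0 on stream alpha, Bus-Invert on stream beta."""
    out = []
    B, inc_inv = 0, 0                    # encoded word at time t-1
    last_instr = None                    # register: holds while SEL = 0
    first = True
    for b, sel in stream:
        if sel and not first and last_instr is not None and b == last_instr + S:
            inc_inv = 1                  # in-sequence instruction: bus frozen
        elif (not sel and not first
              and hamming((B << 1) | inc_inv, b << 1) > N / 2):
            B, inc_inv = (~b) & MASK, 1  # data address sent complemented
        else:
            B, inc_inv = b, 0
        out.append((B, inc_inv, sel))
        if sel:
            last_instr = b               # register updates only when SEL = 1
        first = False
    return out

def dual_t0bi_decode(words):
    """Inverse mapping: SEL and INC_INV select among increment, complement,
    and pass-through, mirroring the encoder."""
    out, last_instr = [], None
    for B, inc_inv, sel in words:
        if inc_inv and sel:
            b = last_instr + S
        elif inc_inv:
            b = (~B) & MASK
        else:
            b = B
        out.append(b)
        if sel:
            last_instr = b
    return out

stream = [(16, 1), (20, 1), (24, 1), (200, 0), (28, 1), (60, 0)]
assert dual_t0bi_decode(dual_t0bi_encode(stream)) == [a for a, _ in stream]
```

Note how the interleaved data access at address 200 does not disturb the instruction register: the following instruction address (28) is still recognized as in-sequence and keeps the bus frozen.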

Table 5: Experimental Comparison of Mixed Encoding Schemes for Instruction Address Streams. Benchmarks: gzip, gunzip, ghostview, espresso, nova, jedi, latex, matlab, oracle. For each benchmark the table reports the stream length, the percentage of in-sequence addresses, the binary transition count, and the T0_BI, Dual_T0 and Dual_T0_BI transition counts with their savings over binary. Average: 63.04% in-sequence addresses; T0_BI savings 34.92%; Dual_T0 savings 35.52%; Dual_T0_BI savings 35.52%.

Table 6: Experimental Comparison of Mixed Encoding Schemes for Data Address Streams (same benchmarks and columns as Table 5). Average: 11.39% in-sequence addresses; T0_BI savings 12.82%; Dual_T0 savings 0.00%; Dual_T0_BI savings 10.66%.

Table 7: Experimental Comparison of Mixed Encoding Schemes for Multiplexed Address Streams (same benchmarks and columns as Table 5). Average: 57.62% in-sequence addresses; T0_BI savings 19.56%; Dual_T0 savings 12.15%; Dual_T0_BI savings 22.25%.

When the percentage of in-sequence addresses is high, as in the case of instruction address streams (see Table 5), the savings provided by the T0_BI, the Dual_T0 and the Dual_T0_BI

codes are very remarkable (35.32% on average over the three codes, with respect to pure binary). However, almost the same savings have been obtained by using the simple T0 code. Thus, the latter seems to be the preferable solution in this case, due to the relatively low cost of the T0 encoding/decoding circuitry. Regarding the data address streams, shown in Table 6, no savings are provided by the Dual_T0 code, while the T0_BI and the Dual_T0_BI offer some advantages with respect to binary encoding (12.82% and 10.66%, respectively). By comparing these results to those provided by the Bus-Invert method, we can claim that the T0_BI represents the most effective solution for average power minimization. If we consider multiplexed address buses (see Table 7), the Dual_T0_BI code shows the best savings (22.25% on average) compared to the T0, T0_BI, and Dual_T0 codes (10.25%, 19.56%, and 12.15% over pure binary). The Dual_T0_BI thus provides the absolute best savings, making it the most effective code, in terms of average power, for the address bus of the MIPS microprocessor.

Encoding and Decoding Logic

The results of the experimental comparisons show that, for a multiplexed bus, the Dual_T0_BI code is the most effective in terms of transition count. It is now necessary to evaluate whether the power savings achievable through activity reduction are offset by the circuitry required to implement the code on a system bus. After describing the basic encoder/decoder architecture for the T0 and Dual_T0_BI codes, we analyze their power consumption when the encoding scheme is used to minimize the power of on-chip and off-chip buses.

Architectures

At a given clock cycle, t, the encoder computes the incremented address of cycle t-1 and compares it to the address generated at cycle t. If the incremented old (t-1) address and the new (t) address are equal, the INC line is raised and the old address is left on the bus. The encoder architecture is shown on the left of Figure 3.
The incrementer can be made programmable, so that the constant increment S can be defined flexibly. The encoder inserts one cycle of delay between the arrival of the address b and the output of the encoded bus B. We do not consider this delay an overhead of the encoding. Even if binary code is used (i.e., no encoding), the

FFs on the output B would be needed, because the address b is generated by complex logic which produces glitches and misaligned transitions. The FFs filter out glitches and align the transitions on B to the clock edge. Glitches on B must be avoided because B is connected to large output buffers that should always be driven by clean, fast edges, to eliminate excessive power dissipation and signal quality degradation. The decoder architecture is even simpler. At any given clock cycle, the last cycle's address is incremented. If the INC line is high, the old incremented value is used for addressing; otherwise, the value coming from the bus lines is selected. The decoder is depicted on the right of Figure 3.

Figure 3: Block Diagrams of the T0 Code Encoder and Decoder.

The encoder/decoder implementation is optimized for minimum delay. If the speed constraints are not tight, low-power implementations should be considered. For example, it is possible to disable the incrementer in the decoder when INC = 0. In this case, however, the incrementer delay would be added to the critical path (starting from the late-arriving signal INC), and performance would be penalized. We consider delay constraints, which are the most critical ones in high-performance microprocessors. Since the bus input b is produced by the complex address computation logic, it is expected to have a late arrival time. Thus, the critical path will be in the encoder, from b through the comparator EQ and the control-to-output delay of the multiplexer (the setup time of the register controlling the bus B must be added to the critical path as well). If the arrival time of b is not critical, the next critical path is through the incrementer, the comparator and the multiplexer.
It is unlikely that this relatively simple logic will ever become the critical path of a complex microprocessor design, where much more complex tasks are usually performed in a single clock cycle. Consequently, we will discuss the

possibility that the encoder could constrain the critical path because of the delay from the late-arriving b to the output of the multiplexer (going through the comparator). Fortunately, the combination of the T0 encoder and decoder is very fast on the critical path: the comparator can be implemented with an XOR tree structure, which has a logarithmic delay D_cmp = K_cmp log(N), and the delay through a multiplexer is weakly dependent on the width of the bus. The dependence is due to the load on the control input, which increases linearly with the width of the bus. If we drive the control input (SEL) with a tapered buffer, the delay has a logarithmic dependence on the bus width: D_mux = K_mux log(N). In the formulas for D_cmp and D_mux, the constants K_cmp and K_mux are technology dependent. The incrementers in the encoder and decoder are not strongly timing constrained, thus we can implement them in a power-efficient fashion, as long as their delays do not become critical. Compared to the Gray encoder and decoder [127], our architectures consume more area and power. However, the performance of the Gray scheme is limited by the decoder (implemented as a chain of EXOR gates [127]), which has a delay D_gray = K_gray N. For wide buses in performance-constrained systems, the delay penalty of Gray addressing may be simply unacceptable, leaving the T0 code as the only alternative to standard binary encoding. For purely power-constrained systems, the designer's choice will be based on the trade-off between the additional power savings on the bus provided by the T0 code and the reduced power dissipation of the Gray encoder/decoder. Gray code would probably be the preferred choice for area-constrained systems where power dissipation is a secondary concern.
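The delay asymptotics just discussed can be illustrated numerically. This Python sketch (with placeholder unit constants; real values of K_cmp, K_mux and K_gray are technology dependent) contrasts the logarithmic T0 critical path with the linear Gray decoder delay:

```python
import math

# Placeholder unit constants; real K_cmp, K_mux, K_gray are technology dependent.
K_CMP = K_MUX = K_GRAY = 1.0

def t0_critical_path(n):
    """XOR-tree comparator plus buffered multiplexer: logarithmic in bus width."""
    return K_CMP * math.log2(n) + K_MUX * math.log2(n)

def gray_decoder_delay(n):
    """Chain of EXOR gates: linear in bus width."""
    return K_GRAY * n

for n in (16, 32, 64, 128):
    print(f"N={n:3d}  T0 path={t0_critical_path(n):5.1f}  Gray decoder={gray_decoder_delay(n):6.1f}")
```

Doubling the bus width adds one gate level to the T0 path but doubles the Gray decoder chain, which is why Gray addressing becomes unattractive for wide, performance-constrained buses.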
Concerning the Dual_T0_BI code, the encoder architecture consists of a section for the T0 encoding, which generates the INC signal; a section for the Bus-Invert logic, which provides the INV signal; and the output multiplexer, which is controlled by the input signal SEL and by the signal INC_INV = INC + INV. While the architecture of the T0 section has been fully described, the Bus-Invert section is realized with a Hamming distance evaluator, which compares the encoded bus lines at time t-1, concatenated with the INC_INV signal, against the address value at the present time t, followed by a majority voter that decides whether the computed Hamming distance is greater than half of the bus width. Besides the encoded bus lines, the circuit outputs the INC_INV signal (the SEL signal, needed by the decoder, is already present on the bus). Concerning the decoder, its architecture can be obtained easily from the Dual_T0_BI decoding equations (SEL and INC_INV are the control signals of the multiplexer).
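As a sketch of the Bus-Invert section just described (our illustration; names and bus width are arbitrary), the Hamming distance evaluator and majority voter reduce to a population count followed by a threshold comparison:

```python
N = 8                    # bus width (illustrative)

def popcount(x):
    """Number of 1 bits in a bit vector."""
    return bin(x).count("1")

def bus_invert_decision(prev_lines, prev_inc_inv, addr):
    """Majority decision of the Bus-Invert section: evaluate the Hamming
    distance between the previous encoded word (bus lines concatenated with
    INC_INV) and the current address concatenated with a zero, then vote to
    invert when the distance exceeds half of the bus width."""
    h = popcount(((prev_lines << 1) | prev_inc_inv) ^ (addr << 1))
    return h > N / 2
```

In hardware, the popcount corresponds to a tree of adders and the comparison to a fixed threshold check, both off the critical path discussed above.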


More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition The Processor - Introduction

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor.

Chapter 4. Instruction Execution. Introduction. CPU Overview. Multiplexers. Chapter 4 The Processor 1. The Processor. COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor The Processor - Introduction

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

Power Aware Encoding for the Instruction Address Buses Using Program Constructs

Power Aware Encoding for the Instruction Address Buses Using Program Constructs Power Aware Encoding for the Instruction Address Buses Using Program Constructs Prakash Krishnamoorthy and Meghanad D. Wagh Abstract This paper examines the address traces produced by various program constructs.

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Ten Reasons to Optimize a Processor

Ten Reasons to Optimize a Processor By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

Computer Architecture V Fall Practice Exam Questions

Computer Architecture V Fall Practice Exam Questions Computer Architecture V22.0436 Fall 2002 Practice Exam Questions These are practice exam questions for the material covered since the mid-term exam. Please note that the final exam is cumulative. See the

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu

EXAM 1 SOLUTIONS. Midterm Exam. ECE 741 Advanced Computer Architecture, Spring Instructor: Onur Mutlu Midterm Exam ECE 741 Advanced Computer Architecture, Spring 2009 Instructor: Onur Mutlu TAs: Michael Papamichael, Theodoros Strigkos, Evangelos Vlachos February 25, 2009 EXAM 1 SOLUTIONS Problem Points

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Comparing Multiported Cache Schemes

Comparing Multiported Cache Schemes Comparing Multiported Cache Schemes Smaїl Niar University of Valenciennes, France Smail.Niar@univ-valenciennes.fr Lieven Eeckhout Koen De Bosschere Ghent University, Belgium {leeckhou,kdb}@elis.rug.ac.be

More information

Chapter-5 Memory Hierarchy Design

Chapter-5 Memory Hierarchy Design Chapter-5 Memory Hierarchy Design Unlimited amount of fast memory - Economical solution is memory hierarchy - Locality - Cost performance Principle of locality - most programs do not access all code or

More information

CHAPTER 5 PROPAGATION DELAY

CHAPTER 5 PROPAGATION DELAY 98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,

More information

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy

Chapter 5B. Large and Fast: Exploiting Memory Hierarchy Chapter 5B Large and Fast: Exploiting Memory Hierarchy One Transistor Dynamic RAM 1-T DRAM Cell word access transistor V REF TiN top electrode (V REF ) Ta 2 O 5 dielectric bit Storage capacitor (FET gate,

More information

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas

FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS. Waqas Akram, Cirrus Logic Inc., Austin, Texas FILTER SYNTHESIS USING FINE-GRAIN DATA-FLOW GRAPHS Waqas Akram, Cirrus Logic Inc., Austin, Texas Abstract: This project is concerned with finding ways to synthesize hardware-efficient digital filters given

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

a) Memory management unit b) CPU c) PCI d) None of the mentioned

a) Memory management unit b) CPU c) PCI d) None of the mentioned 1. CPU fetches the instruction from memory according to the value of a) program counter b) status register c) instruction register d) program status word 2. Which one of the following is the address generated

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Maintaining Mutual Consistency for Cached Web Objects

Maintaining Mutual Consistency for Cached Web Objects Maintaining Mutual Consistency for Cached Web Objects Bhuvan Urgaonkar, Anoop George Ninan, Mohammad Salimullah Raunak Prashant Shenoy and Krithi Ramamritham Department of Computer Science, University

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS

ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS ADVANCED COMPUTER ARCHITECTURE TWO MARKS WITH ANSWERS 1.Define Computer Architecture Computer Architecture Is Defined As The Functional Operation Of The Individual H/W Unit In A Computer System And The

More information

Embedded Systems Design: A Unified Hardware/Software Introduction. Outline. Chapter 5 Memory. Introduction. Memory: basic concepts

Embedded Systems Design: A Unified Hardware/Software Introduction. Outline. Chapter 5 Memory. Introduction. Memory: basic concepts Hardware/Software Introduction Chapter 5 Memory Outline Memory Write Ability and Storage Permanence Common Memory Types Composing Memory Memory Hierarchy and Cache Advanced RAM 1 2 Introduction Memory:

More information

Embedded Systems Design: A Unified Hardware/Software Introduction. Chapter 5 Memory. Outline. Introduction

Embedded Systems Design: A Unified Hardware/Software Introduction. Chapter 5 Memory. Outline. Introduction Hardware/Software Introduction Chapter 5 Memory 1 Outline Memory Write Ability and Storage Permanence Common Memory Types Composing Memory Memory Hierarchy and Cache Advanced RAM 2 Introduction Embedded

More information

Chapter 3 - Top Level View of Computer Function

Chapter 3 - Top Level View of Computer Function Chapter 3 - Top Level View of Computer Function Luis Tarrataca luis.tarrataca@gmail.com CEFET-RJ L. Tarrataca Chapter 3 - Top Level View 1 / 127 Table of Contents I 1 Introduction 2 Computer Components

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

THE latest generation of microprocessors uses a combination

THE latest generation of microprocessors uses a combination 1254 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 11, NOVEMBER 1995 A 14-Port 3.8-ns 116-Word 64-b Read-Renaming Register File Creigton Asato Abstract A 116-word by 64-b register file for a 154 MHz

More information

EITF20: Computer Architecture Part4.1.1: Cache - 2

EITF20: Computer Architecture Part4.1.1: Cache - 2 EITF20: Computer Architecture Part4.1.1: Cache - 2 Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Cache performance optimization Bandwidth increase Reduce hit time Reduce miss penalty Reduce miss

More information

Transition Reduction in Memory Buses Using Sector-based Encoding Techniques

Transition Reduction in Memory Buses Using Sector-based Encoding Techniques Transition Reduction in Memory Buses Using Sector-based Encoding Techniques Yazdan Aghaghiri University of Southern California 3740 McClintock Ave Los Angeles, CA 90089 yazdan@sahand.usc.edu Farzan Fallah

More information

SDR Forum Technical Conference 2007

SDR Forum Technical Conference 2007 THE APPLICATION OF A NOVEL ADAPTIVE DYNAMIC VOLTAGE SCALING SCHEME TO SOFTWARE DEFINED RADIO Craig Dolwin (Toshiba Research Europe Ltd, Bristol, UK, craig.dolwin@toshiba-trel.com) ABSTRACT This paper presents

More information

Low-Power Instruction Bus Encoding for Embedded Processors. Peter Petrov, Student Member, IEEE, and Alex Orailoglu, Member, IEEE

Low-Power Instruction Bus Encoding for Embedded Processors. Peter Petrov, Student Member, IEEE, and Alex Orailoglu, Member, IEEE 812 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 12, NO. 8, AUGUST 2004 Low-Power Instruction Bus Encoding for Embedded Processors Peter Petrov, Student Member, IEEE, and Alex

More information

SEMICON Solutions. Bus Structure. Created by: Duong Dang Date: 20 th Oct,2010

SEMICON Solutions. Bus Structure. Created by: Duong Dang Date: 20 th Oct,2010 SEMICON Solutions Bus Structure Created by: Duong Dang Date: 20 th Oct,2010 Introduction Buses are the simplest and most widely used interconnection networks A number of modules is connected via a single

More information

Low Latency via Redundancy

Low Latency via Redundancy Low Latency via Redundancy Ashish Vulimiri, Philip Brighten Godfrey, Radhika Mittal, Justine Sherry, Sylvia Ratnasamy, Scott Shenker Presenter: Meng Wang 2 Low Latency Is Important Injecting just 400 milliseconds

More information

COMPUTER ORGANISATION CHAPTER 1 BASIC STRUCTURE OF COMPUTERS

COMPUTER ORGANISATION CHAPTER 1 BASIC STRUCTURE OF COMPUTERS Computer types: - COMPUTER ORGANISATION CHAPTER 1 BASIC STRUCTURE OF COMPUTERS A computer can be defined as a fast electronic calculating machine that accepts the (data) digitized input information process

More information

Computer Architecture Memory hierarchies and caches

Computer Architecture Memory hierarchies and caches Computer Architecture Memory hierarchies and caches S Coudert and R Pacalet January 23, 2019 Outline Introduction Localities principles Direct-mapped caches Increasing block size Set-associative caches

More information

Memory. Objectives. Introduction. 6.2 Types of Memory

Memory. Objectives. Introduction. 6.2 Types of Memory Memory Objectives Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured. Master the concepts

More information

Minimizing Power Dissipation during. University of Southern California Los Angeles CA August 28 th, 2007

Minimizing Power Dissipation during. University of Southern California Los Angeles CA August 28 th, 2007 Minimizing Power Dissipation during Write Operation to Register Files Kimish Patel, Wonbok Lee, Massoud Pedram University of Southern California Los Angeles CA August 28 th, 2007 Introduction Outline Conditional

More information

Mladen Stefanov F48235 R.A.I.D

Mladen Stefanov F48235 R.A.I.D R.A.I.D Data is the most valuable asset of any business today. Lost data, in most cases, means lost business. Even if you backup regularly, you need a fail-safe way to ensure that your data is protected

More information

Eastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy.

Eastern Mediterranean University School of Computing and Technology CACHE MEMORY. Computer memory is organized into a hierarchy. Eastern Mediterranean University School of Computing and Technology ITEC255 Computer Organization & Architecture CACHE MEMORY Introduction Computer memory is organized into a hierarchy. At the highest

More information

1.3 Data processing; data storage; data movement; and control.

1.3 Data processing; data storage; data movement; and control. CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical

More information

William Stallings Computer Organization and Architecture 8th Edition. Chapter 5 Internal Memory

William Stallings Computer Organization and Architecture 8th Edition. Chapter 5 Internal Memory William Stallings Computer Organization and Architecture 8th Edition Chapter 5 Internal Memory Semiconductor Memory The basic element of a semiconductor memory is the memory cell. Although a variety of

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Errata and Clarifications to the PCI-X Addendum, Revision 1.0a. Update 3/12/01 Rev P

Errata and Clarifications to the PCI-X Addendum, Revision 1.0a. Update 3/12/01 Rev P Errata and Clarifications to the PCI-X Addendum, Revision 1.0a Update 3/12/01 Rev P REVISION REVISION HISTORY DATE P E1a-E6a, C1a-C12a 3/12/01 2 Table of Contents Table of Contents...3 Errata to PCI-X

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

New Advances in Micro-Processors and computer architectures

New Advances in Micro-Processors and computer architectures New Advances in Micro-Processors and computer architectures Prof. (Dr.) K.R. Chowdhary, Director SETG Email: kr.chowdhary@jietjodhpur.com Jodhpur Institute of Engineering and Technology, SETG August 27,

More information