Bus Encoding Technique for hierarchical memory system Anne Pratoomtong and Weiping Liao


Abstract

In microprocessor-based systems, data and address buses are the core of the interface between a microprocessor and the external world. The increasing gap between interfaces has pushed CPU designers to increase the bandwidth of the data transfer. Moreover, modern software applications span a very large address space. With very wide address and data buses, power dissipation on the bus interfaces is becoming a major concern. Large power savings can be achieved by reducing the transition activity of the on- and off-chip buses, because the total capacitance switched when a voltage change occurs on a bus line is usually larger than the capacitive load that must be charged or discharged when internal nodes toggle. Encoding techniques are very effective in limiting the number of signal transitions on the bus lines. Microprocessor-based systems incorporate hierarchical memory systems, and the addresses seen on the buses of a memory hierarchy can be very random. Our goal is therefore to study the performance of existing bus encoding techniques (T0 code [1] and Bus-Invert code [2]) on the buses at different levels of the memory hierarchy (e.g. main memory, L1, and L2 caches).

1 Introduction

Due to the intrinsic capacitances of the bus lines, a considerable amount of power is required at the I/O pins of a microprocessor when data have to be transmitted over the bus. More specifically, the capacitance driven by an I/O node is usually much larger than the one seen by the internal nodes of the microprocessor. As a consequence, the average power consumption can be reduced dramatically by minimizing the number of transitions (i.e. the switching activity) on system-level buses. Encoding paradigms for reducing the switching activity on the bus lines have recently been investigated.
In [4], the authors proposed a bit-level encoding approach to reduce the average number of transitions that occur on a bus. The basic observation behind their work is that using a transition-based encoding instead of a level encoding may limit the number of transitions when the input lines are not equiprobable. The technique in [4] first encodes the data words in such a way that the probabilities of each bit become as unbalanced as possible, and then applies transition encoding at the bit level.

In a later work [2], the Bus-Invert code was proposed. This scheme uses redundancy to save power. If the Hamming distance between two successive patterns is larger than N/2, where N is the bus width, the new pattern is transmitted with inverted polarity, so that at most N/2 signal transitions occur on the bus. An extra line, I, signals to the receiving end of the bus which polarity is used for the incoming pattern. The Bus-Invert code works well when the data patterns to be transmitted are randomly distributed in time, so it is appropriate for encoding the information traveling on data buses.

When the objective shifts to address bus encoding, a radically different behavior is observed. The addresses generated by a running microprocessor are often consecutive, since instructions are stored in adjacent sections of the memory space, and structured data are stored in consecutive memory locations for better locality. To exploit this property, [5] proposed to reduce the switching activity on address buses by adopting Gray code. Gray code is particularly attractive since it guarantees single-bit transitions when consecutive addresses are accessed. However, Gray code does not achieve the minimum switching activity. As a result, the T0 code was proposed in [1]. The main idea of the T0 code is to avoid the transfer of consecutive addresses on the bus by using a redundant line, INC. The T0 code can achieve zero switching activity for consecutive addresses.

2 Previous Work

2.1 Bus-Invert Encoding

Bus-invert [2] is a method of coding I/O that lowers the bus activity and thus decreases both the peak and the average I/O power dissipation.
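The Bus-Invert rule stated in the Introduction (invert the pattern when its Hamming distance to the previous one exceeds N/2) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this project: the function names are hypothetical, the bus is assumed to start at zero, and for simplicity only the N data lines are compared, so the invert line's own toggling is not counted in the decision.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, n=32):
    """Encode n-bit words; returns a list of (bus_value, invert) pairs.

    Assumptions for this sketch: the bus lines start at 0 and only the
    n data lines (not the invert line) enter the distance comparison.
    """
    mask = (1 << n) - 1
    bus = 0
    out = []
    for w in words:
        if hamming(bus, w) > n // 2:
            # Too many lines would toggle: drive the complement instead.
            bus, inv = ~w & mask, 1
        else:
            bus, inv = w, 0
        out.append((bus, inv))
    return out


def bus_invert_decode(pairs, n=32):
    """Undo the inversion wherever the invert line is 1."""
    mask = (1 << n) - 1
    return [(~bus & mask) if inv else bus for bus, inv in pairs]
```

For example, on an 8-bit bus the sequence `[0x00, 0xFF, 0x0F, 0xF0]` never toggles more than four data lines per step once encoded, and decoding returns the original sequence.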
This method is best applied to buses, which are most likely to have very large capacitances associated with them and, as a consequence, dissipate a lot of power. The activity on a typical data bus is characterized by a random, uniformly distributed sequence of values. Under this assumption, in any given time slot the data on an n-bit-wide bus can take any value with equal probability. The average number of transitions per time slot is then n/2, so the average I/O power dissipation is proportional to n/2. When all the bus lines toggle at the same time, there are n transitions in a time slot, so the worst-case power dissipation is proportional to n.

The data value is the piece of information that has to be transmitted over the bus in a given time slot. The bus value is the actual value on the bus. One control bit, called invert, is needed for the coding. If invert is zero, the bus value is equal to the data value. If invert is one, the bus value is the bitwise inverse of the data value. Invert is set to one if the Hamming distance (the number of differing bits) between the present bus value (also counting the present invert line) and the next data value is larger than n/2. Coding the data value with this technique halves the worst-case power dissipation.

2.2 T0 code

The T0 code [3] exploits the property of consecutive addresses to reduce the switching activity of address buses. The T0 code adds a redundant line, INC, to the address bus. When INC is one, it signals that a consecutive stream of addresses is being output, and all other lines on the bus are frozen. When INC is driven to zero, the remaining bus lines carry the new address in standard binary code. If all the addresses of an ideal stream are consecutive, the INC line is always high and the bus lines never switch. As a consequence, the asymptotic performance of the T0 code is zero transitions per emitted consecutive address. More formally, the encoding and decoding schemes of the T0 code are given by Equations 1 and 2, where B(t) is the value on the encoded bus lines at time t, INC(t) is the additional bus line, b(t) is the address value at time t, and S is a constant power of 2 called the stride.
\[
\bigl(B(t),\, INC(t)\bigr) =
\begin{cases}
\bigl(B(t-1),\, 1\bigr) & \text{if } t > 0 \text{ and } b(t) = b(t-1) + S \\
\bigl(b(t),\, 0\bigr) & \text{otherwise}
\end{cases}
\tag{1}
\]

\[
b(t) =
\begin{cases}
b(t-1) + S & \text{if } INC(t) = 1 \text{ and } t > 0 \\
B(t) & \text{if } INC(t) = 0
\end{cases}
\tag{2}
\]

2.3 Hybrid Bus Encoding Technique

In [4], new encoding schemes were proposed for bus encoding. These schemes combine the properties of existing approaches, mainly the T0 code and the Bus-Invert code. In this section, we discuss the coding schemes proposed in [4].

In [3], the T0 and Bus-Invert techniques are compared analytically using address traces generated by a RISC microprocessor. Three distinct cases are considered: an instruction address bus, a data address bus, and a multiplexed instruction/data address bus. The average percentage of sequential addresses in the benchmark streams is higher for an instruction address stream than for a data address stream, so on instruction addresses the T0 code outperforms the Bus-Invert technique. On the other hand, when the probability of in-sequence addresses is very low, as in the case of data addresses, the Bus-Invert technique outperforms the T0 code. When the address bus is multiplexed, as in the MIPS architecture, the sequential behavior is often interrupted when the selection signal switches from instruction to data and vice versa. Hence, the multiplexed address bus shows an intermediate behavior. The hybrid methods were therefore proposed to exploit the best properties of each technique.

Three hybrid methods are proposed in [4]: T0 BI, Dual T0, and Dual T0 BI encoding.

The T0 BI encoding requires two redundant lines, INC and INV. When both INC and INV are zero, the original address is sent without any encoding. When INC is zero and INV is one, the inverted address is sent. When INC is one, the address bus content is frozen to avoid switching, and the decoder at the destination increments the address by the stride.

The Dual T0 encoding requires one redundant line, INC. When the address bus is multiplexed, the control signal SEL is asserted when an instruction address is transmitted and de-asserted when a data address is transmitted. When both SEL and INC are one, the address bus content is frozen and the decoder at the destination increments the address by the stride. When both INC and SEL are zero, the original address is sent without any encoding. The corresponding decoder simply accepts the address when INC is zero and, when INC is one, increments the address of the previous time frame by the stride.

The Dual T0 BI encoding combines the previous two methods and requires one redundant line, INCV. When both SEL and INCV are one, the address bus content is frozen and the decoder at the destination increments the address by the stride. When SEL is zero and the Hamming distance is greater than N/2, where N is the number of address bus lines, INCV is set to one and the inverted address is sent.
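As a concrete companion to the T0 scheme of Section 2.2, the encoder of Equation 1 and the decoder of Equation 2 can be sketched as follows. This is an illustrative sketch only: the function names and the default stride of 4 are assumptions made for the example, and the bus is assumed to hold 0 before the first address.

```python
def t0_encode(addresses, stride=4):
    """Return (bus_value, inc) pairs following Equation 1.

    `stride` corresponds to S; the default of 4 is an assumption.
    """
    out = []
    prev_addr = None  # b(t-1); None means t == 0
    bus = 0           # B(t-1), assumed 0 before the first address
    for addr in addresses:
        if prev_addr is not None and addr == prev_addr + stride:
            inc = 1                 # consecutive: bus lines stay frozen
        else:
            bus, inc = addr, 0      # otherwise send the address as-is
        out.append((bus, inc))
        prev_addr = addr
    return out


def t0_decode(pairs, stride=4):
    """Recover the address stream following Equation 2."""
    out = []
    prev_addr = None
    for bus, inc in pairs:
        if inc == 1 and prev_addr is not None:
            addr = prev_addr + stride   # regenerate the address locally
        else:
            addr = bus
        out.append(addr)
        prev_addr = addr
    return out
```

With the stream `[0x100, 0x104, 0x108, 0x200, 0x204]`, the address lines change only once (from `0x100` to `0x200`); every other step is signalled on the INC line alone.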

3 Methodology

Figure 1: Overview of the project.

This project consists of four main components.

(1) Address trace generator: simulates the SPECint95 inputs for 10 million cycles and generates six binary address trace files:
- Instruction address stream from CPU to L1 I-cache
- Instruction address stream from L1 I-cache to L2 Unified cache
- Instruction address stream from L2 Unified cache to memory
- Data address stream from CPU to L1 D-cache
- Data address stream from L1 D-cache to L2 Unified cache
- Data address stream from L2 Unified cache to memory
(2) Transition counter: reads an input binary file and outputs the total number of bus transitions.
(3) Endian converter: converts big-endian binaries to little-endian and vice versa. The address trace generator runs on a SUN SPARC platform, so the address trace files are in big-endian format. The other three components were developed and run under Linux, so their binary format is little-endian.
(4) T0 and Bus-Invert encoders/decoders: encode the input binary file and write the encoded result to an output binary file. They also report encoder statistics and use the decoder as an error-checking mechanism.
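Components (2) and (3) are simple enough to sketch. The version below is an assumption-laden illustration rather than the project's actual tools: it assumes each trace record is one raw 32-bit address (the trace file format is not specified here), operates on byte strings instead of files, and uses hypothetical function names.

```python
import struct


def read_trace(data: bytes, big_endian: bool = True):
    """Unpack a byte string of 32-bit addresses (assumed record format)."""
    count = len(data) // 4
    fmt = (">" if big_endian else "<") + f"{count}I"
    return list(struct.unpack(fmt, data[:count * 4]))


def swap_endianness(data: bytes) -> bytes:
    """Reverse the byte order of every 32-bit word (big <-> little endian)."""
    words = read_trace(data, big_endian=True)
    return struct.pack(f"<{len(words)}I", *words)


def count_transitions(words):
    """Total number of bit flips between successive bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))
```

Counting transitions before and after encoding the same trace gives the raw numbers from which a percentage reduction can be computed.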

The following table shows the system configuration we used to generate the address traces.

Parameter       Value
Issue width     4 inst/cycle
RUU size        16 entries
LSQ size        8 entries
L1 I-Cache      16 KB, DM, 32 B block
L1 D-Cache      4 KB, 4-way SA, 32 B block
L2 U-Cache      64 KB, 4-way SA, 64 B block
Memory width    8 bytes

4 Simulation results

[Figure 2: Percentage of bus transition reduction using the T0 bus encoding technique. Y-axis: % reduction. X-axis: type of address trace (IL1, IL2, Imem). Series: Go, Gcc, Vortex, Test-math.]

For the instruction address streams in Figure 2, the T0 code is always able to reduce the switching activity, while Bus-Invert fails to do so even for the address streams from L1 to L2 and from L2 to memory. The performance of the T0 code decreases as the address stream moves further away from the CPU, because the address pattern becomes less consecutive. For the data address streams, the benefit of both techniques varies widely with no clear pattern.

For the data streams in Figure 3, in some cases the performance of the T0 code increases as the address stream moves further away from the CPU, which is the opposite of the instruction address streams. This might be because the L1 D-cache is 4 times smaller than the L1 I-cache, so the miss rate of the data cache is higher. It therefore generates more addresses than is the case for the instruction address stream, which increases the effect of the T0 bus encoding technique. Even though Test-math is a small benchmark, it still appears to send a miss stream to memory. This might be because the program runs only briefly, so most of its misses are cold misses. Vortex is a database application. It also appears to have many misses to memory, owing to a large working set that does not fit in the cache. The 10M-cycle window that we ran probably happens to access data at a fixed distance, so T0 performs very well.

[Figure 3: Percentage of bus transition reduction using the T0 and Bus-Invert bus encoding techniques. Y-axis: % reduction. X-axis: type of address trace (DL1, DL2, Dmem). Series: GoT0, GoBI, GccT0, GccBI, VortexT0, VortexBI, Test-mathT0, Test-mathBI.]

In Figure 4, for the application Go, even though Bus-Invert has a substantial percentage of encoding activation, the performance gain is very small. This shows the effect of the high probability of a Hamming distance of n/2 (16 in this case). We expect this effect to be reduced if the bus is partitioned into 8 sub-buses of 4 bits each. From Figure 5, it can be seen that the percentage of encoding activation for the data address traces is very random, which reflects the random performance benefit seen in Figure 3. Again, we see the performance degradation caused by a high probability of a Hamming distance of n/2 for the Bus-Invert technique in the application Vortex.
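The partitioning suggested above was not implemented in this study, but one way it might look is sketched below. Each sub-bus applies the Bus-Invert rule independently against its own threshold (2 for a 4-bit sub-bus). The function names and the layout of the invert bits are hypothetical, and the sketch assumes one invert line per sub-bus, so a 32-bit bus split into 8 parts would need 8 redundant lines instead of one.

```python
def partitioned_bus_invert(words, n=32, parts=8):
    """Apply bus-invert independently to `parts` equal-width sub-buses.

    Returns (bus_value, invert_bits) pairs; bit k of invert_bits is the
    invert line of sub-bus k (a layout assumed for this illustration).
    """
    width = n // parts
    sub_mask = (1 << width) - 1
    bus = 0  # bus assumed to start at 0
    out = []
    for w in words:
        new_bus, inv_bits = 0, 0
        for k in range(parts):
            shift = k * width
            old = (bus >> shift) & sub_mask
            new = (w >> shift) & sub_mask
            if bin(old ^ new).count("1") > width // 2:
                new = ~new & sub_mask      # invert only this sub-bus
                inv_bits |= 1 << k
            new_bus |= new << shift
        bus = new_bus
        out.append((bus, inv_bits))
    return out


def partitioned_bus_invert_decode(pairs, n=32, parts=8):
    """Re-invert every sub-bus whose invert bit is set."""
    width = n // parts
    sub_mask = (1 << width) - 1
    out = []
    for bus, inv_bits in pairs:
        for k in range(parts):
            if (inv_bits >> k) & 1:
                bus ^= sub_mask << (k * width)
        out.append(bus)
    return out
```

Whether the extra invert lines pay for themselves on these traces would have to be measured; the sketch only shows the mechanism.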

[Figure 4: Percentage of encoded activation of the T0 and Bus-Invert bus encoding techniques. Y-axis: % encoded. X-axis: type of address trace (IL1, IL2, Imem). Series: GoT0, GoBI, GccT0, GccBI, VortexT0, VortexBI, Test-mathT0, Test-mathBI.]

[Figure 5: Percentage of encoded activation of the T0 and Bus-Invert bus encoding techniques. Y-axis: % encoded. X-axis: type of address trace (DL1, DL2, Dmem). Series: GoT0, GoBI, GccT0, GccBI, VortexT0, VortexBI, Test-mathT0, Test-mathBI.]

5 Conclusion and Future work

In most cases, the T0 code outperforms Bus-Invert encoding, so the T0 code still provides a benefit in a hierarchical memory system for both instruction and data streams. There are some cases where Bus-Invert coding outperforms T0 coding, so it is useful to implement both techniques in a hierarchical memory system. The data address streams of a hierarchical memory system are very random and hard to predict, and in many cases both techniques give only a very small gain, so it is necessary to explore other bus coding techniques, such as the T0-Xor code and the Offset code [4], to improve performance.

Since we ran the experiments on a single system configuration, it would be interesting to study the effect of different cache/memory configurations and characteristics (e.g. miss/hit rate and latency) on each bus encoding technique, to obtain a more concrete summary of the performance of both techniques. Moreover, most programs contain many phases with varying memory access characteristics. A phase-change detection mechanism would therefore be useful for deciding when to switch from one encoding technique to another, and would help improve performance. Phase-change detection can be implemented with a counter of bus switching activity, a reference history register, or a Hamming distance history register.

References

[1] L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano. Asymptotic zero-transition activity encoding for address busses in low-power microprocessor-based systems. In Proc. of GLS-VLSI-97, March 1997.
[2] M. R. Stan. Bus-invert coding for low-power I/O. IEEE Trans. on VLSI Systems, pp. 49-58, March 1995.
[3] L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano. Address bus encoding techniques for system-level power optimization. In Proc. of DATE-98, Feb. 1998.
[4] Y. Aghaghiri, F. Fallah, and M. Pedram. Irredundant address bus encoding for low power. In Proc. of ISLPED-01, Aug.
[5] H. Mehta, R. M. Owens, and M. J. Irwin. Some issues in Gray code addressing. IEEE 6th Great Lakes Symposium on VLSI, March.
