A Superscalar RISC Processor with 160 FPRs for Large Scale Scientific Processing

Kentaro Shimada *1, Tatsuya Kawashimo *1, Makoto Hanawa *1, Ryo Yamagata *2, and Eiki Kamada *2
*1 Central Research Laboratory, Hitachi, Ltd.
*2 General Purpose Computer Division, Hitachi, Ltd.

Abstract

We have developed a superscalar RISC processor for the super technical server HITACHI SR8000. The processor includes architectural features dedicated to scientific applications in which massive amounts of data in the main memory must be processed. These features are 160 floating-point registers (FPRs) and the simultaneous execution of up to 16 prefetch instructions. Any of the 160 FPRs can be accessed by instructions through the slide windowed registers scheme. The execution mechanism for prefetch instructions keeps the out-of-order superscalar processor efficient despite the long latency of the main memory. The processor is manufactured using 0.25-µm CMOS technology. Our evaluation demonstrated that the processor achieves over 3 floating-point operations per cycle and a memory throughput of over 12 bytes per cycle.

1. Introduction

Many supercomputers dedicated to large scale scientific processing have been developed. Some of these computers have used vector processors that have vector registers

and execute vector instructions [4]. Recently, however, RISC processors, especially superscalar RISCs, have demonstrated high performance. With current CMOS technology, RISC processors have also reduced the cost of achieving high performance and have enabled multiprocessor machines with over 1000 processors. We have therefore developed a new superscalar RISC processor for the super technical server HITACHI SR8000, a multiprocessor supercomputer that uses these superscalar RISC processors as its processing engines.

When designing a superscalar RISC processor for supercomputers, two primary requirements of large scale scientific processing become important considerations:

a) Large scale scientific processing requires high performance on floating-point operations.

b) Large scale scientific processing requires high-speed access to extremely large bodies of floating-point data in the main memory.

To ensure high performance on floating-point operations, we used a superpipelined architecture, in which the processor is highly pipelined to raise the instruction throughput, in addition to the superscalar architecture. A superpipelined architecture usually results in long instruction latency, but we have hidden this latency by unrolling loops and by using software pipelining. Because they process large amounts of floating-point data, scientific applications usually contain many loops, and these loops are often very suitable for unrolling or software

pipelining. To exploit unrolling and software pipelining effectively, we provided 160 floating-point registers (FPRs) in the processor's instruction-set architecture.

To enable fast access to large bodies of floating-point data, we made the access throughput of the processor's memory bus as large as 16 bytes per cycle; this throughput is set by the implementation limits of the processor. Large scale scientific processing often uses multi-dimensional arrays to express data. When the elements of such arrays are in the IEEE 8-byte double-precision format, a two-dimensional 1024-by-1024 array, for example, occupies 8 Mbytes. However, the typical size of the on-chip cache of today's RISC processors ranges from 64 Kbytes to 1 Mbyte, and a cache that could hold arrays as large as 8 Mbytes would be too large to implement on a chip. Therefore we designed the processor so that all of the data initially resides in the main memory and is transferred from the main memory to the on-chip cache before it is used in any operation. Figure 1 shows a simple example of matrix addition.

      DO 10 J=1,1024
      DO 10 I=1,1024
      C(I,J) = A(I,J)+B(I,J)
   10 CONTINUE

Figure 1 A Simple Example of Matrix Addition

In Figure 1, the size of matrices A, B, and C is 8 Mbytes each; the elements of matrices A and B are read only once, while the elements of C are written once and never

read. Thus, the execution time of the program depends entirely on the throughput of the main memory access rather than on the throughput of the cache access, or even on that of the floating-point operations. With a main memory access throughput of 16 bytes per cycle, we can achieve 16/24, or about 0.67, floating-point operations per cycle, because 24 bytes of data (two 8-byte values to be read and one 8-byte value to be written) are transferred to or from the main memory for each addition. RISC processors not designed for large scale scientific processing assume, in principle, that all necessary data is in the cache, and are usually not capable of such a large throughput to or from the main memory; the typical throughput of such processors is from 4 to 8 bytes per cycle.

We also had to consider the factor that relates the memory-access performance to the memory-access throughput: the memory-access latency. The memory-access latency is usually very long compared with the processor's cycle time, and it grows as the processor's operating frequency rises. When designing the basic profile of this processor, we anticipated that the memory latency would be a hundred or more processor cycles. If this long memory latency is not hidden, the effective memory-access performance will be degraded despite the high memory-access throughput. Conventional vector-type supercomputers hide the latency by using vector registers, but superscalar RISC processors need another method.
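To make the byte-per-operation reasoning above concrete, the short sketch below computes the throughput-bound performance for the matrix addition of Figure 1. The 24-bytes-per-addition figure and the 16-bytes-per-cycle bus width come from the text; the rest is just arithmetic.

    #include <stdio.h>

    /* Throughput-bound performance estimate for C(I,J) = A(I,J) + B(I,J).
     * Each addition reads two 8-byte values and writes one, so it moves
     * 24 bytes of main-memory traffic per floating-point operation.
     */
    int main(void)
    {
        const double bus_bytes_per_cycle = 16.0;  /* processor memory bus        */
        const double bytes_per_flop      = 24.0;  /* 2 reads + 1 write, 8 B each */

        double flopc = bus_bytes_per_cycle / bytes_per_flop;
        printf("bound: %.2f floating-point operations per cycle\n", flopc);
        /* prints: bound: 0.67 floating-point operations per cycle */
        return 0;
    }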

We solved this memory latency problem by using a software data prefetch scheme. In this scheme, prefetch instructions are inserted at optimized points before and within loops by the compiler; then, at execution time, many of the prefetch instructions are activated to fetch the required data from the main memory into the on-chip cache well before the load instructions read the on-chip cache.

The processor was manufactured using 0.25-µm CMOS technology. The instruction set architecture of the processor is based on the 64-bit PowerPC *1 [5]. In the rest of this paper, we describe the architectural features of the processor for large scale scientific processing, give an overview of the processor, explain the mechanism of the features, and then evaluate and discuss their effect on processor performance.

2. Architectural Features

Because a superscalar processor for large scale scientific processing must be capable of high floating-point performance and of high-throughput access to large data in the main memory, we incorporated the 160 FPRs and a software prefetch scheme with simultaneous execution of 16 prefetch instructions.

To ensure high floating-point performance, we designed the throughput of the processor's floating-point instructions to be two instructions per cycle. The instruction set of the processor includes fused floating-point multiply-add instructions; thus, the processor has a throughput of up to four floating-point operations per cycle. To keep this throughput high, we used a superpipelined architecture in addition to the superscalar one. To enable enough floating-point instructions to be active, we

*1 PowerPC is a trademark of the International Business Machines Corporation.

used 160 FPRs. Processor instructions access this large number of FPRs in two ways: some instructions can designate FPR numbers from 0 to 127 (the others can designate only numbers from 0 to 31), and all instructions that refer to the FPRs use the slide windowed registers scheme [1][3]. The extended instructions that can designate FPR numbers up to 127 include the floating-point arithmetic instructions and the floating-point load and store instructions. The slide windowed registers scheme enables all instructions, including those that can designate FPR numbers only from 0 to 31, to access any of the 160 FPRs.

To increase the throughput of large data access from the main memory, we made the throughput of the processor's memory bus 16 bytes per cycle. We also made the throughput of the processor's on-chip level-one cache twice that of the memory bus, namely 32 bytes per cycle; because prefetched data must be written into the on-chip cache before it is read from the on-chip cache, the on-chip cache throughput must be no less than twice that of the memory bus. Since we designed the load and store instruction throughput to be two instructions per cycle, we provided load-FPR-pair and store-FPR-pair instructions that move 16 bytes of data (two IEEE double-precision floating-point values) into or out of a pair of FPRs, to obtain the full cache throughput with each instruction.

We used the software prefetch scheme to solve the memory latency problem. For this scheme, we made the processor able to execute up to 16 prefetch instructions simultaneously.

On top of this, we have implemented a special superscalar execution method for the prefetch instructions. The latency of a prefetch instruction is the same as the memory latency, a hundred or more cycles, so using the same execution method as for other, short-latency instructions would make the superscalar control facility ineffective. The central idea of our method is that the superscalar facility of the processor issues and then forgets the prefetch instructions; the issued prefetch instructions are tracked by a special-purpose facility called the block transfer buffer. In this way, the processor can effectively execute prefetch instructions whose latency is a hundred cycles or more. The detailed execution mechanism for prefetch instructions is described in section 4.2.

3. Overview of the Processor's Internal Structure

The internal structure of the processor, including the pipeline stages, is shown in Figure 2. The processor has on the chip a 64-Kbyte two-way set-associative instruction cache and a 128-Kbyte four-way set-associative write-through data cache with a block size of 128 bytes. Up to eight instructions are fetched from the instruction cache into the instruction buffer every cycle, in the IF1 and IF2 stages. Instructions are extracted from the instruction buffer and decoded, and their FPR numbers are translated for the slide windowed registers, at the IF3 stage. Then the registers, both the FPRs and the general purpose registers (GPRs), are renamed, and up to five instructions are dispatched in program order to the reservation stations corresponding to their instruction type in the

D1 and D2 stages. There are four reservation stations: FRS for floating-point instructions; ARS for load and store instructions, including prefetch instructions; GRS for fixed-point instructions; and BRS for branch instructions.

[Figure 2: Overview of the Processor's Internal Structure. Block diagram showing the pipeline stages (IF1-IF3, D1-D2, E0-E5, C1-C3), the 64-Kbyte 2-way instruction cache, the 128-Kbyte 4-way data cache (32-byte width), the 32-entry reorder buffer, the reservation stations (FRS, ARS, GRS, BRS), the rename and architectural registers (FRR/FPR, GRR/GPR), the load-store queue, the store buffer, the 16-entry block transfer buffer, and the 16-byte-wide external bus.]

All dispatched instructions are registered with the reorder buffer at the time of dispatch. The reorder buffer has 32 entries that keep track of individual instructions, and it tracks the sequential control flow of the program while instructions are issued from the reservation stations and executed out of order. Specifically, the reorder buffer allocates entries for the dispatched instructions in program order at the D2 stage, and waits for the

instructions to finish their execution out of order. It then puts the results of their execution back into program order and checks, at the C1 stage, whether any exception was reported. After that, at the C2 and C3 stages, the reorder buffer instructs each execution unit to write back data from the rename registers into the architectural registers, such as the FPRs and GPRs, and, in the case of a store instruction, to move the data into the store buffer. In parallel, the reorder buffer releases the entries in which the instructions were registered, at the C2 stage. This process is called completion. The reorder buffer of the processor completes up to six instructions every cycle.

The number of issue and execution stages varies with the kind of instruction. For example, fixed-point add instructions are issued at the E0 stage, their source registers are read at the E1 stage, and they are executed at the E2 stage. Floating-point multiply-add instructions, on the other hand, are issued at the E0 stage, their source registers are read at the E1 stage, and they are executed from the E2 stage through the E5 stage. The results of these instructions are written into the rename registers for the GPRs (GRRs) or those for the FPRs (FRRs). Load and store instructions are also issued at the E0 stage, and their source registers are read at the E1 stage so that their data addresses can be calculated at the E2 stage. The calculated addresses are written into the load-store queue, which resolves address conflicts between load and store instructions that are executed out of order.
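As a rough illustration of the dispatch/complete discipline described above, the sketch below models a 32-entry reorder buffer as a circular queue: entries are allocated in program order, marked finished out of order, and retired in order, up to six per cycle. This is a simplified model for exposition, not the processor's actual logic, and all identifiers are ours.

    #include <stdbool.h>
    #include <stdio.h>

    #define ROB_ENTRIES 32   /* reorder buffer size, from the text    */
    #define COMPLETE_W   6   /* completion width: six inst. per cycle */

    struct rob {
        bool finished[ROB_ENTRIES];
        int  head, tail, count;   /* head = oldest instruction */
    };

    /* Allocate one entry at dispatch (D2 stage); returns entry id or -1. */
    int rob_dispatch(struct rob *r)
    {
        if (r->count == ROB_ENTRIES)
            return -1;                      /* dispatch stalls when full */
        int id = r->tail;
        r->finished[id] = false;
        r->tail = (r->tail + 1) % ROB_ENTRIES;
        r->count++;
        return id;
    }

    /* An execution unit reports out-of-order finish for entry `id`. */
    void rob_finish(struct rob *r, int id) { r->finished[id] = true; }

    /* C1-C3 stages: retire up to COMPLETE_W oldest finished entries,
     * strictly in program order; returns how many completed this cycle. */
    int rob_complete(struct rob *r)
    {
        int n = 0;
        while (n < COMPLETE_W && r->count > 0 && r->finished[r->head]) {
            r->head = (r->head + 1) % ROB_ENTRIES;
            r->count--;
            n++;
        }
        return n;
    }

    int main(void)
    {
        struct rob r = { .head = 0, .tail = 0, .count = 0 };
        int a = rob_dispatch(&r), b = rob_dispatch(&r);
        rob_finish(&r, b);                            /* younger finishes first */
        printf("completed: %d\n", rob_complete(&r));  /* 0: a not done yet      */
        rob_finish(&r, a);
        printf("completed: %d\n", rob_complete(&r));  /* 2: a then b, in order  */
        return 0;
    }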

Load instructions then refer to the data cache in the E3 and E4 stages, and send the required data to the GRRs or FRRs at the E5 stage. Store instructions wait for the reorder buffer to instruct them to put the required data into the store buffer and to complete them. A prefetch instruction puts its data address into the block transfer buffer and waits for the reorder buffer to instruct it to fetch from the main memory, as described later.

[Figure 3: Processor Chip Photograph, with the major blocks labeled: floating-point unit, reorder buffer, FPRs, instruction control, fixed-point unit, GPRs, branch unit, load-store unit, data cache, instruction cache, and bus control.]

Figure 3 is a photograph of the processor chip on which the major blocks, such as the reorder buffer and the FPRs, are visible.

4. Mechanism of the Architectural Features

4.1 Slide Windowed Registers for 160 FPRs

Slide windowed registers were originally developed for a two-way in-order superscalar processor [1]. We have modified them for use in a processor with an out-of-order superscalar and superpipelined architecture. The number of the physical FPR (the physical register number PN) is translated from the register number designated in the instruction (the logical register number LN) with the following formula:

    PN = (LN + SWBS) mod 128    (GN <= LN)
    PN = LN                     (0 <= LN < GN)

where GN is the number of registers in the global part (the part that does not slide), and SWBS is the slide window base value, which is held in the special purpose register SWSW (slide window status word). GN is selected from 4, 8, 16, or 32, and SWBS is an even number from 0 to 126. Figure 4 shows an example in which GN is 16 and SWBS is 2.

[Figure 4: An Example of Mapping through the Slide Windowed Registers (SWBS = 2, GN = 16). The diagram maps logical register numbers (LN) to physical register numbers (PN), distinguishing the slide part from the global part, and marking which LNs any instruction can designate and which only an extended instruction can designate.]
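The mapping can be written directly as a small function. The sketch below implements the translation formula above; the choice of C and all identifier names are ours, and since corner cases (such as SWBS values that would map a slide-part LN onto the global part) are not spelled out in the text, the code follows the published formula literally.

    #include <stdio.h>

    /* Translate a logical FPR number (LN) to a physical FPR number (PN)
     * per the slide windowed registers formula in section 4.1:
     *   PN = (LN + SWBS) mod 128   for GN <= LN
     *   PN = LN                    for 0 <= LN < GN
     * GN is one of {4, 8, 16, 32}; SWBS is an even number in 0..126.
     */
    unsigned translate_fpr(unsigned ln, unsigned swbs, unsigned gn)
    {
        if (ln < gn)
            return ln;                 /* global part: identity mapping */
        return (ln + swbs) % 128;      /* slide part: shifted by SWBS   */
    }

    int main(void)
    {
        /* Figure 4's example: GN = 16, SWBS = 2. */
        printf("LN 5   -> PN %u\n", translate_fpr(5, 2, 16));    /* 5  */
        printf("LN 16  -> PN %u\n", translate_fpr(16, 2, 16));   /* 18 */
        printf("LN 127 -> PN %u\n", translate_fpr(127, 2, 16));  /* 1  */
        return 0;
    }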

To slide the windowed registers smoothly, we introduced a special instruction, SSWSTP (slide sliding window step). The SSWSTP instruction changes the SWBS value in the SWSW register, thereby sliding the windowed registers in the slide part. This instruction is executed speculatively at the IF3 stage so that the immediately following instructions can use the changed SWBS value for their FPR number translation; as a result, SSWSTP executes with no penalty. A special mechanism recovers speculatively changed SWBS values when the speculation fails because of a branch prediction miss; to speed up the recovery, this mechanism includes buffers that record the SWBS value at the dispatch of each branch instruction.

4.2 Execution Mechanism for the Prefetch Instruction

The execution mechanism for the prefetch instruction is designed for a main-memory latency so long that the superscalar facility of the processor, namely the reorder buffer, cannot keep track of the prefetch instructions effectively. If the reorder buffer kept track of the prefetch instructions, one of its entries would stay occupied until the prefetched data returned from the main memory, which could take more than one hundred cycles. Consequently, the number of entries required to sustain the instruction throughput would be a hundred times the number of active instructions; in other words, the reorder buffer would need a few hundred entries, which is not feasible.

In the execution mechanism of the processor, a prefetch instruction is tracked by a special facility called the block transfer buffer instead of by the reorder buffer (Figure 2). When a prefetch instruction is executed, only its address calculation is performed, and the calculated result is placed into the block transfer buffer. Then, after checking for address exceptions, the reorder buffer instructs the prefetch instruction to fetch the required data from the main memory, and frees the entry for the prefetch instruction without waiting for the data to return. While the instruction is fetching the data from the main memory, functions such as address-conflict checks are handled by the block transfer buffer. The block transfer buffer has 16 entries, so up to 16 prefetch instructions can be active simultaneously. When the required data returns from the main memory, it is written into the on-chip data cache according to the data address registered in the block transfer buffer.

This execution mechanism for the prefetch instruction enables the processor to effectively handle a main memory latency a hundred cycles long without requiring a reorder buffer of a few hundred entries. Because the main memory throughput of the processor is 16 bytes per cycle, a block transfer buffer of 16 entries can cover a main memory latency of up to 128 cycles (= the on-chip cache block size of 128 bytes x 16 active prefetches / the memory throughput of 16 bytes per cycle).
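The 128-cycle bound generalizes: with E outstanding prefetches of B-byte cache blocks, the sustainable memory throughput at a latency of L cycles is min(peak, E x B / L). A minimal sketch of this bound, using the paper's parameters (16 entries, 128-byte blocks, 16-byte-per-cycle peak), is shown below; it reproduces the 128-cycle break-even point here and the throughput limits discussed later in section 5.3.

    #include <stdio.h>

    /* Sustainable main-memory throughput with a fixed number of
     * outstanding prefetches: min(peak, entries * block / latency).
     */
    int main(void)
    {
        const double entries = 16.0, block = 128.0, peak = 16.0;
        const int latencies[] = { 50, 100, 128, 150, 200 };

        for (int i = 0; i < 5; i++) {
            double bound = entries * block / latencies[i];
            double eff   = bound < peak ? bound : peak;
            printf("latency %3d cycles -> throughput limit %5.2f B/cycle\n",
                   latencies[i], eff);
        }
        /* At 128 cycles the limit equals the 16 B/cycle peak; at 200
         * cycles it drops to 10.24 B/cycle, as noted in section 5.3. */
        return 0;
    }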

5. Evaluation

5.1 Evaluation Environment

To evaluate the processor architecture, we ran logic-design simulations using the logic design data of the actual processor. Since the evaluation environment had to let us identify the processor's bottlenecks with sufficient accuracy, we excluded bottlenecks outside the processor, such as the overhead of main memory control, refresh control, and so on, and idealized the memory system so that the processor bottleneck would show clearly in the performance numbers. To ensure accuracy, we used the actual design file of the processor. The simulator used in this environment was developed mainly to verify the logic design of the processor, and its results are identical to the real cycle-by-cycle behavior of the processor.

5.2 Effect of the Large Number of FPRs

First, we evaluated the effect of the large number of FPRs. We used the calculation of the inner product of two arrays as a sample program; its source program is shown in Figure 5. To evaluate the effect of the number of FPRs, we built four model codes from this source program, using 6, 20, 40, or 80 FPRs.

      DO 10 I=1,N
      S = S + A(I) * B(I)
   10 CONTINUE

Figure 5 Source Program of the Inner Product

Figure 6 shows the results for the four codes under the conditions that all

necessary data was in the cache, with a fixed loop length. The performance is expressed in floating-point operations per cycle (FLOPC), and the instruction throughput is expressed in instructions per cycle (IPC).

[Figure 6: Effect of the Number of FPRs. Performance (floating-point operations per cycle) and instruction throughput (instructions per cycle) plotted against the number of FPRs used.]

When the number of FPRs was small, not enough instructions became active to hide the latency of the processor pipelines, and the performance was low. As the number of FPRs increased, the performance rose, reaching 3.15 FLOPC, which is 78.7% of the ideal performance (4 FLOPC), for the code using 80 FPRs. Thus, for this simple inner-product program, 80 FPRs were enough to obtain fairly good performance despite the latency of the highly pipelined instruction execution.
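To see why additional FPRs help, consider the kind of transformation the model codes presumably applied: unrolling the loop of Figure 5 and accumulating into several independent registers lets many multiply-adds be in flight at once. The C sketch below is our own illustration of that transformation, not the actual model code.

    /* Inner product with 8-way unrolling and independent accumulators.
     * Each s[k] carries its own dependence chain, so up to eight
     * multiply-adds can overlap in the pipeline instead of serializing
     * through one register; more accumulators (hence more registers)
     * hide more pipeline latency.  Illustrative sketch only; n is
     * assumed to be a multiple of 8 for brevity.
     */
    double inner_product_unrolled(const double *a, const double *b, int n)
    {
        double s[8] = { 0.0 };
        for (int i = 0; i < n; i += 8)
            for (int k = 0; k < 8; k++)
                s[k] += a[i + k] * b[i + k];
        return s[0] + s[1] + s[2] + s[3] + s[4] + s[5] + s[6] + s[7];
    }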

For more complex and larger programs, the number of FPRs required for good performance rises; for example, the matrix multiplication program used in the following section required 104 FPRs.

5.3 Memory Throughput and Effect of Prefetch

Next, we evaluated the processing speed and the memory throughput versus the main memory latency, and the effect of the software prefetch scheme. For this evaluation, one sample program was the 80-FPR model code of the inner product from the previous evaluation, with prefetch instructions inserted. We also prepared two versions of matrix multiplication: one in which all the required data was assumed to be in the cache (the version with no prefetch instructions inserted) and one with prefetch instructions inserted. The source program of the matrix multiplication is shown in Figure 7; its model code was produced by unrolling the outermost loop 16 times, the middle loop 2 times, and the innermost loop 2 times.

      DO 10 I=1,N
      DO 10 J=1,N
      S = 0
      DO 20 K=1,N
      S = S + A(I,K) * B(K,J)
   20 CONTINUE
      C(I,J) = S
   10 CONTINUE

Figure 7 The Source Program of Matrix Multiplication

In the case of the matrix multiplication, the loop unrollings reduced the amount of data that had to be transferred from the main memory per floating-point operation. If no unrolling were applied, the amount of data for each

floating-point operation would be 8 bytes (one floating-point value), the same as for the inner-product sample program of Figure 5. With the unrollings, the amount of data is reduced to 2.25 bytes (9/32 of a floating-point value) per operation, because 36 floating-point values (16 x 2 A(I,K)s and 2 x 2 B(K,J)s) are read once and used for 128 floating-point operations (64 floating-point multiplications and 64 floating-point additions). In terms of memory throughput, the matrix multiplication code without unrolling and the four inner-product codes need a memory throughput of 32 bytes per cycle to reach the ideal performance of 4 FLOPC, whereas the unrolled matrix multiplication code needs only 9 bytes per cycle. Moreover, with the outermost loop unrolled 16 times, the unrolled data access size becomes 128 bytes (= 8 bytes x 16). This is equal to the on-chip cache block size, so the prefetch instruction can be applied very effectively.

The performance and the memory throughput of the inner product and of the matrix multiplication are shown in Figure 8 for the all-in-cache case (the version with no prefetch instructions inserted) and for main memory latencies of 50, 80, 100, 150, and 200 cycles.

For the inner product, the performance of 3.15 FLOPC and the memory throughput of 25.0 bytes per cycle in the all-in-cache case fell by about half when the data was out of the cache and prefetch instructions were used to fetch it from the main memory.
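The bytes-per-operation figures above follow directly from the unrolling factors. The short sketch below reproduces them; the unroll factors (16, 2, 2) and the 8-byte element size are from the text, and the code is just the arithmetic, not a model of the generated code.

    #include <stdio.h>

    /* Memory traffic per floating-point operation for the matrix
     * multiplication of Figure 7, unrolled 16 (I) x 2 (J) x 2 (K).
     * Per unrolled iteration: 16*2 A values and 2*2 B values are read,
     * and 16*2*2 multiply-adds (two flops each) are performed.
     */
    int main(void)
    {
        const double elem  = 8.0;                   /* IEEE double, bytes */
        const double reads = 16 * 2 + 2 * 2;        /* 36 values          */
        const double flops = 16 * 2 * 2 * 2.0;      /* 128 operations     */

        double bytes_per_flop = reads * elem / flops;   /* 2.25 bytes     */
        printf("traffic: %.2f bytes per flop\n", bytes_per_flop);
        printf("needed at 4 FLOPC: %.0f bytes per cycle\n",
               bytes_per_flop * 4.0);                   /* 9 bytes/cycle  */
        return 0;
    }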

[Figure 8: Performance vs. Main Memory Latency. Two panels, (a) Inner Product and (b) Matrix Multiplication, plot performance (floating-point operations per cycle) and memory throughput (bytes per cycle) against main memory latency in cycles, from the in-cache case up to 200 cycles, with reference lines at the peak cache throughput of 32 bytes/cycle, the peak main memory throughput of 16 bytes/cycle, and the memory throughput limited by the main memory latency.]

This is because the memory throughput of the processor was 16 bytes per cycle, whereas the memory throughput required to achieve 4 FLOPC was 32 bytes per cycle. Nevertheless, for main memory latencies from 50 to 100 cycles, the performance degradation was very small because the execution mechanism of the prefetch instruction worked well. In Figure 8, the effective memory throughput for a main memory latency of 50 cycles was 12.2 bytes per cycle, 76.0% of the 16-byte-per-cycle peak; for a latency of 100 cycles it was 11.4 bytes per cycle, 71.4% of the peak. For a main memory latency of 150 or 200 cycles, the memory throughput could not be sustained and the performance was degraded.

In the matrix multiplication, 3.78 FLOPC (94.6% of the peak performance of 4 FLOPC) was achieved in the all-in-cache case. The required memory throughput of 9 bytes per cycle was much less than the memory throughput of the processor, and also much less than the throughput limit imposed by the main memory latency, so the performance was not heavily degraded for main memory latencies of 50, 80, 100, or even 150 cycles. In all of these cases the performance was more than 3 FLOPC, and the mechanism of the prefetch instruction worked well. With a main memory latency of 200 cycles, the memory throughput was limited to 10.24 bytes per cycle (= the 128-byte cache block size x 16 active prefetches / the latency of 200 cycles), which is close to the required throughput, and the performance fell to 2.65 FLOPC.

6. Conclusion

We have developed a superscalar RISC processor for large scale scientific processing. Our goals were high performance on floating-point operations and maximal throughput for large data access from the main memory. To increase floating-point performance, we equipped the processor with a large number of FPRs, and introduced extended instructions that can designate FPR numbers from 0 to 127 as well as slide windowed registers, hiding the latency of the processor pipeline. To maximize the effective throughput of large data access, we used a software prefetch scheme and implemented a special superscalar execution method for the prefetch instructions. Our evaluation demonstrated that the large number of FPRs improved performance

significantly, and that the execution mechanism of the prefetch instructions worked well despite the long latency of the main memory access.

References

[1] Shimamura, K., Tanaka, S., Shimomura, T., Hotta, T., Kamada, E., Sawamoto, H., Shimizu, T., and Nakazawa, K.: A Superscalar RISC Processor with Pseudo Vector Processing Feature, Proc. of the International Conference on Computer Design '95, IEEE (1995)

[2] Nakazawa, K., Nakamura, H., Imori, H., and Kawabe, S.: Pseudo Vector Processor based on Register-Windowed Superscalar Pipeline, Proc. of Supercomputing '92, IEEE (1992)

[3] Nakamura, H., Imori, H., Nakazawa, K., Boku, T., Nakata, I., Yamashita, Y., Wada, H., and Inagami, Y.: A Scalar Architecture for Pseudo Vector Processing based on Slide-Windowed Registers, Proc. of the International Conference on Supercomputing '93, ACM (1993)

[4] Kitai, K., Isobe, T., Tanaka, Y., Tamaki, Y., Fukagawa, M., Tanaka, T., and Inagami, Y.: Parallel Processing Architecture for the Hitachi S-3800 Shared-Memory Vector Multiprocessor, Proc. of the International Conference on Supercomputing '93, ACM (1993)

[5] May, C., et al. (eds.): The PowerPC Architecture: A Specification for a New Family of RISC Processors, Second Edition, Morgan Kaufmann Publishers, Inc. (1994)


More information

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline?

Assuming ideal conditions (perfect pipelining and no hazards), how much time would it take to execute the same program in: b) A 5-stage pipeline? 1. Imagine we have a non-pipelined processor running at 1MHz and want to run a program with 1000 instructions. a) How much time would it take to execute the program? 1 instruction per cycle. 1MHz clock

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle.

A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. CS 320 Ch. 16 SuperScalar Machines A superscalar machine is one in which multiple instruction streams allow completion of more than one instruction per cycle. A superpipelined machine is one in which a

More information

Architectures for Instruction-Level Parallelism

Architectures for Instruction-Level Parallelism Low Power VLSI System Design Lecture : Low Power Microprocessor Design Prof. R. Iris Bahar October 0, 07 The HW/SW Interface Seminar Series Jointly sponsored by Engineering and Computer Science Hardware-Software

More information

Metodologie di Progettazione Hardware-Software

Metodologie di Progettazione Hardware-Software Metodologie di Progettazione Hardware-Software Advanced Pipelining and Instruction-Level Paralelism Metodologie di Progettazione Hardware/Software LS Ing. Informatica 1 ILP Instruction-level Parallelism

More information

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as

16.10 Exercises. 372 Chapter 16 Code Improvement. be translated as 372 Chapter 16 Code Improvement 16.10 Exercises 16.1 In Section 16.2 we suggested replacing the instruction r1 := r2 / 2 with the instruction r1 := r2 >> 1, and noted that the replacement may not be correct

More information

Superscalar Organization

Superscalar Organization Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average

More information

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS

Website for Students VTU NOTES QUESTION PAPERS NEWS RESULTS Advanced Computer Architecture- 06CS81 Hardware Based Speculation Tomasulu algorithm and Reorder Buffer Tomasulu idea: 1. Have reservation stations where register renaming is possible 2. Results are directly

More information

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company

Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor. David Johnson Systems Technology Division Hewlett-Packard Company Techniques for Mitigating Memory Latency Effects in the PA-8500 Processor David Johnson Systems Technology Division Hewlett-Packard Company Presentation Overview PA-8500 Overview uction Fetch Capabilities

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information