Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System

Hiroaki Fujii, Yoshiko Yasuda, Hideya Akashi, Yasuhiro Inagami, Makoto Koga*, Osamu Ishihara*, Masamori Kashiyama*, Hideo Wada*, and Tsutomu Sumimoto*

Central Research Laboratory, Hitachi Ltd., 1-280, Higashi-Koigakubo, Kokubunji, Tokyo 185, Japan

*General Purpose Computer Division, Hitachi Ltd., 1, Horiyamashita, Hadano, Kanagawa, Japan

Abstract

RISC-based massively parallel processors (MPPs) often show low efficiency in real-world applications because of cache miss penalties, insufficient throughput of the memory system, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and 600 GFLOPS peak performance, overcomes these problems by introducing three novel features. First, its processor, the 150 MHz HARP-1E, solves the cache miss penalty by "pseudo vector processing" (PVP). In PVP, data is loaded by prefetching to a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory system bottleneck. Third, inter-processor communication achieves high performance on the three-dimensional crossbar network, using a "remote DMA transfer" protocol and hardware-based cache coherency. As a result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors in the LINPACK benchmark, which is almost 72% of the peak performance.

1. Introduction

The Hitachi SR2201 is a newly designed massively parallel processor (MPP) computer system that was introduced to the supercomputing market in March. Up to 2048 RISC processors can be connected via a high-speed three-dimensional (3D) crossbar network [1], [10]. Each processor, running at a clock frequency of 150 MHz, has a peak performance of 300 MFLOPS, giving the SR2201 a peak performance of 600 GFLOPS. The main memory for one processing node is up to 1 GB with 1 MB of secondary cache. The 3D crossbar network is able to transfer data at 300 MB/s over each link.
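As a quick consistency check on the figures quoted above, the peak numbers can be reproduced with a short calculation. It assumes that each processor completes one multiplication and one addition per cycle (as the paper later states for the HARP-1E), and it reads the 600 GFLOPS figure as a rounded value.

```latex
\begin{align*}
P_{\text{proc}} &= 2\ \text{flops/cycle} \times 150\ \text{MHz} = 300\ \text{MFLOPS},\\
P_{\text{system}} &= 2048 \times 300\ \text{MFLOPS} = 614.4\ \text{GFLOPS} \approx 600\ \text{GFLOPS},\\
\frac{220.4\ \text{GFLOPS}}{1024 \times 300\ \text{MFLOPS}} &= \frac{220.4}{307.2} \approx 0.717 .
\end{align*}
```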
One of the main design targets of the SR2201 is to solve the low effective performance problem often seen in MPPs. The main causes of this performance degradation (cache miss penalty, insufficient throughput of the memory system, and poor inter-processor communication performance) and the solutions adopted in the SR2201 are discussed below.

Cache memory was introduced in parallel processor systems built around RISC processors in an attempt to close the speed gap between the main memory and the CPU. However, this architecture fails to meet that objective when an application program needs to access a large amount of data which cannot fit into the cache, because such accesses inevitably suffer cache misses, which lead to a heavy loss in effective performance. Conventional cache-based computers are thus likely to show decreased performance in such cases, which are common in large-scale numerical applications.

To eliminate the cache miss penalty and improve performance, the RISC processors of the SR2201 are equipped with a novel mechanism called the "preload" operation [2]. This operation is a non-blocking direct main-memory load that bypasses the cache memory. Since the target of the preload operation is a 128-register bank, preloads can be issued early enough that the fetched value arrives before other instructions in the program use it. The preload operation, together with the superscalar architecture and compiler-provided code scheduling similar to software pipelining [3], [4], solves the cache miss penalty and improves performance. We call the combination of these techniques "pseudo vector processing" (PVP) [2].

Insufficient memory throughput, another cause of performance degradation, becomes critical in the case of PVP. Since PVP issues fetch operations to the main memory almost every machine cycle, it is especially important that the main memory support a sustained high bandwidth.
To this end, the memory system of the SR2201 is composed of a multi-bank memory that operates like a pipeline.

The effective performance of MPPs can also be degraded by low performance of the inter-processor network, due to narrow bandwidth, poor memory system performance, and high software overhead. The memory system features described above not only help PVP work effectively, but also increase throughput in the inter-processor network. In addition, to achieve high performance in inter-processor communications, the SR2201 eliminates the software overhead by using a newly developed transfer protocol, called "remote DMA transfer", along with hardware-based cache coherency.

The rest of this paper is organized as follows. Section 2 gives an architectural overview of the SR2201. In Section 3, the pseudo vector processing feature is described in detail. Section 4 describes issues on inter-processor data transfer. Section 5 describes the memory system. In Section 6, some performance evaluation results are shown. Section 7 concludes the paper with some remarks.

2. System Overview of the SR2201

2.1 Overall Organization

Figure 1 shows the conceptual system configuration of the SR2201. The SR2201 uses a multi-dimensional crossbar network to connect the processors. For example, a 2176 processor element (PE) system of the SR2201 uses a 3D crossbar network. The 2176 PEs are arranged in an 8x17x16 lattice, and the PEs arranged along each dimension (x, y, or z) are connected by a common crossbar switch. There are two types of PEs: processing units (PUs) and I/O units (IOUs). The PUs perform computation and the IOUs mainly control the I/O processes. The 2176 PE system has 2048 PUs and 128 IOUs. One of the IOUs is a supervisory IOU (SIOU), which also performs system management.

Figure 1. Conceptual system configuration of the SR2201. (Labels: LAN, I/O unit (IOU), supervisory IOU (SIOU), processing unit (PU), crossbar switch, hard disk.)

2.2 Organization of Processing Unit

Figure 2 shows the organization of a PU. Each PU has seven components: an instruction processor (IP), a storage controller (SC), a network interface adapter (NIA), a memory controller for addresses (MCA), memory controllers for data (MCDs), main storage (MS), and secondary cache. The HARP-1E RISC processor [5] is used as the IP. The HARP-1E is based on the PA-RISC 1.1 architecture. It runs at 150 MHz and can operate at up to 300 MFLOPS.

Figure 2. Organization of a PU. (Labels: memory controllers (MCA/MCDs), connections to the 3D crossbar network.)

The NIA connects each PU to three crossbar switches. It handles the sending and receiving of data between processors, and handles the routing of data through the network as well.
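To make the lattice arrangement concrete, the following minimal sketch maps a PE number to (x, y, z) coordinates in the 8x17x16 lattice and counts how many crossbars a message crosses if each crossbar corrects one differing coordinate. The PE numbering, the function names, and the simple dimension-order rule are assumptions made for illustration; the text does not specify them.

```python
# Hypothetical model of the 8 x 17 x 16 PE lattice. The PE numbering and the
# dimension-order routing rule are illustrative assumptions, not the
# SR2201's documented scheme.
DIMS = (8, 17, 16)  # lattice extents along x, y, z


def pe_to_coord(pe):
    """Map a PE number in [0, 2176) to (x, y, z), with x varying fastest."""
    nx, ny, nz = DIMS
    if not 0 <= pe < nx * ny * nz:
        raise ValueError("PE number out of range")
    x = pe % nx
    y = (pe // nx) % ny
    z = pe // (nx * ny)
    return (x, y, z)


def crossbar_hops(src, dst):
    """Crossbars traversed when each hop fixes one differing coordinate."""
    return sum(1 for s, d in zip(pe_to_coord(src), pe_to_coord(dst)) if s != d)


if __name__ == "__main__":
    print(crossbar_hops(0, 1))     # PEs that differ only in x
    print(crossbar_hops(0, 2175))  # PEs that differ in x, y, and z
```

Under these assumptions the hop count can never exceed the number of dimensions, because every PE sits on exactly one crossbar per dimension.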
When sending and receiving data, the NIA reads and writes the data from or to the MS directly through the SC by direct memory access (DMA). The SC is connected to the IP, NIA, and memory controllers (MCA and MCDs). It processes MS access requests coming from the IP and NIA and passes the requests on to the memory controllers. The MCA and MCDs manage the address information and the data of the MS access requests, respectively. The IOUs and SIOU have the same organization as the PUs, but also have an I/O bus manager connected to the SC, enabling them to connect to I/O devices.

3. Pseudo Vector Processing Feature

The main target applications of the SR2201 are large-scale numerical applications. These applications need a large amount of data space which cannot fit into the cache, resulting in a high number of cache misses if run on a normal RISC processor system. The SR2201 overcomes this problem with a non-blocking direct main-memory load feature called preload. This feature does not use the cache, so there is no cache miss penalty. However, it needs optimized code scheduling to hide the memory access latency.

Code scheduling in the SR2201 IP is based on the software pipelining technique [3], [4], and achieves highly effective computational performance when used with the preload and superscalar processing features. The IP issues a preload and a floating-point instruction in parallel every cycle when it executes code optimized by software pipelining. We call this feature pseudo vector processing (PVP) [2].

In PVP, the instructions of program segments such as loop iterations are divided into two categories: preloads, which take a long latency to complete, and other instructions (calculations and stores). Using superscalar processing, the preloads for the data used in a segment are issued continuously ahead of the other instructions of that segment, early enough to hide the latency, and in parallel with the calculations of a different segment whose own preloads have already completed.
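The overlap described above can be illustrated with a toy cycle model of the loop s = s + a(i). This is only a sketch: the latency value, the size of the register ring, and the rule of one preload plus one addition per cycle are simplifying assumptions, and the function names are hypothetical.

```python
# Toy cycle model of pseudo vector processing for the loop s = s + a(i).
# The latency, the register-ring size, and the one-preload-plus-one-add per
# cycle issue rule are simplifying assumptions for illustration.

def blocking_load_cycles(n, latency):
    """Cycles when every load stalls for the full latency before its add."""
    return n * (latency + 1)


def pvp_sum(a, latency):
    """Sum a[] with preloads issued `latency` cycles ahead of their use.

    Returns (sum, cycles). Register slot k % window plays the role of the
    preload target register for element k.
    """
    if latency < 1:
        raise ValueError("latency must be at least one cycle")
    n = len(a)
    window = latency + 1          # slots needed so in-flight data survives
    regs = [0.0] * window
    s = 0.0
    cycle = 0
    while cycle < n + latency:
        use = cycle - latency     # element whose preload completes this cycle
        if 0 <= use < n:
            s += regs[use % window]          # calculation pipeline
        if cycle < n:
            regs[cycle % window] = a[cycle]  # load pipeline: issue a preload
        cycle += 1
    return s, cycle


if __name__ == "__main__":
    data = [float(i) for i in range(1000)]
    total, cycles = pvp_sum(data, latency=40)
    print(total, cycles, blocking_load_cycles(len(data), 40))
```

In this model the pipelined loop finishes in n + latency cycles instead of n * (latency + 1), and the ring needs latency + 1 registers, which hints at why a large register bank matters for hiding long memory latencies.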
A processor executing PVP code can therefore keep a load pipeline and a calculation pipeline fully busy in parallel, thus achieving high performance.

PVP needs many floating-point registers as target registers for preloading. Each IP in the SR2201 has 128 floating-point registers, which are managed using a register window. This sliding window feature [2] enables selective access to the preloaded data in the 128 registers in each IP. Owing to this sliding window feature, the HARP-1E made few changes to the usual RISC instruction set architecture. It needed no extension of the register specification fields of floating-point instructions, and only added some new instructions, such as "preload" and "window-switch".

4. Inter-processor Data Transfer

The SR2201 achieves high-performance inter-processor data transfer due to its

1. flexible inter-processor network topology,
2. high-speed inter-processor network, and
3. low-latency inter-processor communication (message passing) protocol.

The first two result from its multi-dimensional crossbar network, and the third from its use of an original inter-processor communication facility, the "remote DMA", and hardware-based cache coherency. This section describes the multi-dimensional crossbar network and the remote DMA transfer facility. Sec. 5.3 describes the hardware-based cache coherency.

4.1 Multi-dimensional Crossbar Network

The multi-dimensional crossbar network is one of the most important features of the SR2201 [1]. Figure 1 shows the structure of the three-dimensional (3D) crossbar network. In the 3D crossbar network, PUs are placed in a three-dimensional arrangement. Several crossbar switches are placed in parallel in each dimension to connect the PUs. The NIA on each PU includes a router that connects it to the three crossbars. Each router can also route data from one crossbar switch to another, enabling data transfer between PUs which are not directly connected by a single crossbar. Therefore, each router is also a small crossbar switch.

This network has three significant features supporting inter-processor data transfer:

1. Short communication distance. Inter-processor data transfer between any two PEs is achieved within at most three hops in the three-dimensional configuration.

2. Great freedom in processor mapping of applications. Because this network is composed of multiple crossbars, far fewer network conflicts occur than in mesh-connected or torus networks. Thus, high performance is achieved for many inter-processor communication patterns, owing to the many independent communication paths. Consequently, there is great freedom in the processor mapping of applications.

3. High-performance broadcast and barrier synchronization facility. The multi-dimensional crossbar topology facilitates collective communication via hardware, achieving high-performance (low-latency) broadcast and barrier synchronization. The entire system can be partitioned into a maximum of eight groups (partitions), in each of which the collective communication facility can be used independently.

Each link of this network can transfer data at 300 MB/s, which matches the computing performance of the PU when the SR2201 is solving large-scale numerical applications.

4.2 Remote DMA Transfer Facility

In the conventional inter-processor communication protocol (send/receive model), when a PU sends data to another, the data is first copied to a send buffer in the operating system (OS), and is then transmitted by the network to a similar buffer in the receiving PU. Finally, the data is copied to the receiving program. This protocol has the following advantages:

1. The send operation is non-blocking.
2. A reliable transmission protocol can be easily implemented.

However, the communication overhead of the send/receive protocol is quite large, because the data must be copied twice and the protocol processing requires context switches. Furthermore, receiving the data generates an interrupt.

To solve these problems, the SR2201 supports a remote DMA transfer facility. The basic concept of this protocol is shown in Fig. 3. In order to avoid the communication overhead, the data is transmitted directly from one program area to another, without any OS operations.
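The contrast between the two protocols can be sketched with a toy model that only counts memory-to-memory copies. The class and function names are hypothetical, and the model ignores interrupts, context switches, and address registration; it is meant to show where the two extra copies of the send/receive model come from, not how the hardware works.

```python
# Toy model contrasting buffered send/receive with remote DMA. The names are
# hypothetical; only the copy counts follow the description in the text.

class Node:
    def __init__(self, size):
        self.user_area = bytearray(size)  # program data area
        self.os_buffer = bytearray(size)  # kernel send/receive buffer
        self.copies = 0                   # memory-to-memory copies on this node


def send_receive(src, dst, nbytes):
    """Buffered protocol: one copy on each side plus the network transfer."""
    src.os_buffer[:nbytes] = src.user_area[:nbytes]  # copy into send buffer
    src.copies += 1
    dst.os_buffer[:nbytes] = src.os_buffer[:nbytes]  # network transfer
    dst.user_area[:nbytes] = dst.os_buffer[:nbytes]  # copy out to the program
    dst.copies += 1


def remote_dma(src, dst, nbytes, dst_offset=0):
    """Remote DMA: the network writes straight into the receiver's user area."""
    dst.user_area[dst_offset:dst_offset + nbytes] = src.user_area[:nbytes]


if __name__ == "__main__":
    a, b = Node(16), Node(16)
    a.user_area[:4] = b"data"
    send_receive(a, b, 4)
    print(bytes(b.user_area[:4]), a.copies + b.copies)

    c, d = Node(16), Node(16)
    c.user_area[:4] = b"data"
    remote_dma(c, d, 4)
    print(bytes(d.user_area[:4]), c.copies + d.copies)
```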
To implement the remote DMA transfer facility, the OS allocates in advance a reserved physical memory area for the user space, which is never moved to another address space. The sender specifies that area of the receiver and writes the data into it directly. Since there is no buffering in the OS kernel, expensive memory copy operations are avoided. Also, there is no need for an OS system call or context switches, since the user program directly invokes the communication.

Figure 3. Basic concept of remote DMA transfer facility. (Labels: no buffering in kernel, no OS system call.)

5. Memory System

To achieve high memory performance, a great amount of hardware, such as LSI pins, memory chips, and control LSI chips, is needed. However, the SR2201 aims at a compact 2048 PU system that achieves high performance, both peak and effective. Thus, implementing the PUs compactly, including the memory system, is important. As a result, the memory system should achieve high effective performance by fully utilizing a limited set of hardware resources. This section describes how the memory system solves this problem.

5.1 Organization of Memory System

As shown in Fig. 2, the SC is implemented using a single LSI chip, whose number of LSI pins has been made as high as possible to widen the data paths and to avoid bottlenecks. Figure 4 shows the inter-LSI interfaces of the memory system. Because the address and data of a store transaction from the IP use a common 8-byte-wide path between the IP and the SC, the IP needs two machine cycles to transmit a storage transaction. This is the only factor degrading the performance of PVP. The paths between the NIA and the SC can simultaneously handle MS reads for sending and MS writes for receiving without performance degradation.

Figure 4. Inter-LSI interfaces of memory system. (Labels: instruction processor, storage controller, memory controller, network interface adapter; 150 MHz and 75 MHz clock domains.)

The SC and the other LSI chips shown in Fig. 2 run at 75 MHz, with the exception of the IP, which runs at 150 MHz. At the interface between the memory controller and the SC there are two sets of MS access paths, which match the throughput of the IP-SC bus at half the clock speed and support the access pitch that PVP requires. Since the data paths are bi-directional, path conflicts sometimes occur between the storage transactions from the SC and the transmission of fetched data from the memory controller. The path controller is able to switch direction without idle cycles, minimizing the penalty of these conflicts.

The time charts in Fig. 5 show the control flow of the bi-directional data paths. As shown in Fig. 5 (a), the conventional method spends an idle machine cycle to switch direction. As shown in Fig. 5 (b), the method used on the SR2201 switches the direction within the interval from the end of one data transfer (the moment at which the latch of the opposite port receives the data) to the start of the next one.

Figure 5. Flow control for bi-directional data path between memory controller and SC: (a) conventional, (b) SR2201. The machine cycle is 13.3 ns.

5.2 Features for Supporting Pseudo Vector Processing

Main storage access using PVP has the following characteristics:

1. Since PVP issues 8-byte fetch operations to the MS almost every machine cycle (150 MHz), the data-supply rate from MS to IP is about 1.2 GB/s. On the other hand, if the cache were used in the SR2201, as it is in conventional systems, the data-supply rate from MS to cache (and then to the PU) would be at most about 6 MB/s due to cache misses.

2. The MS accesses using PVP are uncorrelated to each other in principle, so there is no regularity in their address sequences. This characteristic is the most significant difference between the load/store operations of a vector processor and the MS accesses using PVP.

To achieve the high access pitch needed for PVP described in the first characteristic above, the memory system processes MS accesses in a pipelined manner. Two sets of MS access pipelines in the SC are used to match the throughput of the IP-SC bus at half the clock speed. The system has 16 memory banks in the MS, providing 1.2-GB/s bandwidth for the PVP memory accesses. As shown in Fig. 6, the MS is separated into two groups (bank groups) of eight banks each, based on the two sets of MS access paths. Two MCDs manage the accessed data, one MCD for each bank group. An MCA manages the addresses for all MS accesses, independently of each bank group. To avoid bank conflict penalties caused by the second characteristic above, the SC and the memory controller have access buffers.

Figure 6. Main storage configuration. (Labels: storage controller, MCD0, MCA, MCD1; bank group 0 with banks 0, 2, ..., 14 and bank group 1 with banks 1, 3, ..., 15; paths of 8 B/13.3 ns and 4 B/13.3 ns; data, address, and control lines.)

Table 1. Equations used in experimental measurements. (Columns in the original: Eq. #, equation, # of variables, # of load operations, # of store operations.)

Eq. 1: s=s+a(i)
Eq. 2: A(i)=B(i)
Eq. 3: A(i)=B(i)+C(i)
Eq. 4: s=s+a(i)*b(i)
Eq. 5: C(i)=C(i)+A(i)*B(i)

5.3 Features for Supporting High-speed Inter-processor Data Transfer

As previously stated, each link in the 3D crossbar network can transfer data at 300 MB/s and the NIA handles sending and receiving of data in parallel. As a result, the throughput of MS accesses from the NIA reaches 600 MB/s. The NIA-SC interface described in Sec. 5.1 and the 1.2-GB/s bandwidth of the MS described in Sec. 5.2 allow this 600-MB/s bandwidth to be achieved.

The NIA accesses the data to be transferred directly from the MS, independent of the IP. However, the IP may access and cache the same data areas accessed by the NIA, so cache coherency has to be maintained. Conventional parallel processor systems realize cache coherence by software, which leads to performance degradation during massive data transfer due to high software overhead. To avoid this problem, the following two hardware features are implemented:

1. Store-through cache management, which makes cache coherence operations in data sending unnecessary.
2. A hardware support mechanism in the SC which maintains cache coherence in parallel with data reception from the NIA. This mechanism usually invokes cache coherence operations once per cache line. This strategy hides the overhead of cache coherence operations.
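The bandwidth figures quoted in Secs. 5.2 and 5.3 are mutually consistent, as the short calculation below suggests. It assumes one 8-byte transfer per cycle on each path and treats the 13.3 ns machine cycle as a 75 MHz clock; these are readings of the text rather than additional measurements.

```latex
\begin{align*}
B_{\text{PVP}} &= 8\ \text{B} \times 150\ \text{MHz} = 1.2\ \text{GB/s},\\
B_{\text{MS}} &= 2\ \text{paths} \times 8\ \text{B} \times 75\ \text{MHz} = 1.2\ \text{GB/s},\\
B_{\text{NIA}} &= 300\ \text{MB/s (send)} + 300\ \text{MB/s (receive)} = 600\ \text{MB/s} \le B_{\text{MS}}.
\end{align*}
```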
6. Performance Evaluations of the SR2201

6.1 Performance Measurements of Basic Loops

A set of basic loops corresponding to common vector operations was used to measure the performance of the memory system. To minimize TLB (translation lookaside buffer) miss penalties and measure the true performance of the memory system, the data area of the programs was mapped using the blocked TLB facility, which translates a contiguous memory area of up to 32 MBytes from virtual to physical addresses using one entry in the address translation table.

The equations for each of the basic loops are shown in Table 1. For each equation, the factors that affect the performance of the memory system are shown. These are the number of array variables, the number of load instructions, and the number of store instructions that access the memory in one iteration (when a vector variable appears on both sides of the assignment, it is counted twice because a load and a store instruction need to be issued). All variables are double-precision floating point except for the array indexes.

The experimental measurements on one PU using the basic loop calculations (Table 1) are shown in Figure 7. The horizontal axis shows the stride of the accesses (i.e., the increment applied to the index i). For basic loops that have more than one vector variable, the accesses for all variables have the same stride. All arrays are aligned on 256-byte boundaries.

As shown in Figure 7, all calculations have low performance at the same strides because memory bank conflicts occur. For instance, when the stride is a multiple of 2, the performance is half of the maximum, because only half of the memory banks are accessed. When the stride is a multiple of 4, 8, or 16, the performance of the memory access drops to 1/4, 1/8, and 1/16 of the maximum, because the number of memory banks accessed is reduced to 4, 2, and 1, respectively.

Figure 7. Experimental measurements for basic loop calculations. (Vertical axis: MS bandwidth in MB/s; horizontal axis: stride in number of elements; curves for Eq. 1 to Eq. 5.)

Table 2. Memory access performance (MB/s) for Eq. 2: A(i)=B(i) and Eq. 3: A(i)=B(i)+C(i) on the SR2201, Cray T3D, Cray T3E, and IBM SP2.

Table 3. Performance of inter-processor data transfer (MB/s): theoretical peak and effective peak for the SR2201, Cray T3D, and IBM SP2.

The next analysis shows the differences between the equations based on their features. The features in Figure 7 are as follows:

1. The number of array variables that must be accessed affects performance. The performance of Eq. 3 and 5, which access more than three array variables, is low. When the number of variables increases, accesses to the same bank occur consecutively because all variables are aligned on 256-byte boundaries, and thus performance decreases.

2. The number of store instructions affects performance. Array store operations reduce the access-request issue pitch to the memory because of the IP-SC bus width, possibly lowering the throughput. On the other hand, this reduces the load on the memory system, since it decreases the impact of memory bank conflicts, thus raising the throughput in some cases. Also, since in PVP the array elements being loaded and stored correspond to different iterations, the banks accessed by the store instructions are different from those accessed by the load instructions. This differs from the equations that have only load instructions. When the bandwidths obtained for Eqs. 2 and 4 are compared, each equation has two array variables; however, in Eq. 2 one of the two variables is the target of a store instruction, whereas both variables of Eq. 4 are targets of load instructions.

The performance of the Hitachi SR2201, CRAY T3D, CRAY T3E, and IBM SP2 is compared in Table 2 [6].
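The bank-conflict pattern described for Figure 7 can be reproduced with a small model. This is a minimal sketch that assumes 16 banks interleaved on 8-byte elements and a throughput proportional to the number of banks in use; the interleaving granularity and the function names are assumptions, since the text only gives the bank count.

```python
# Minimal model of stride-induced bank conflicts, assuming 16 banks
# interleaved on 8-byte elements (the interleaving granularity is an
# assumption; the text only gives the bank count).
from math import gcd

NUM_BANKS = 16


def banks_touched(stride, num_banks=NUM_BANKS):
    """Number of distinct banks hit by accesses i, i+stride, i+2*stride, ..."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    return num_banks // gcd(stride, num_banks)


def relative_bandwidth(stride, num_banks=NUM_BANKS):
    """Fraction of peak bandwidth if throughput scales with banks in use."""
    return banks_touched(stride, num_banks) / num_banks


if __name__ == "__main__":
    for stride in (1, 2, 3, 4, 8, 16):
        print(stride, banks_touched(stride), relative_bandwidth(stride))
```

With this model, strides of 2, 4, 8, and 16 use 8, 4, 2, and 1 banks, matching the fractions quoted above, while odd strides still reach all 16 banks.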
The SR2201 had the highest performance of these four machines due to its fully pipelined memory system and PVP facility.

6.2 Evaluation of Inter-processor Data Transfer Performance

Effect of High Memory Bandwidth on Data Transfer and Remote DMA Transfer Facility. The high bandwidth of the SR2201 memory system and the remote DMA transfer facility enable high-bandwidth network data transfer. To illustrate this point, the network data transfer performance of commercial parallel processor systems [7], [8] is shown in Table 3. The SR2201 outperforms the other two in terms of both theoretical and effective peak network throughput.

Effect of Hardware Support on Cache Coherence. Figure 8 shows the network data transfer throughput with the cache coherence management mechanism (CCMM) (coherence kept by hardware) and without it (coherence kept by software). The measured throughput is for the case when two processing units issue a remote DMA transfer towards each other simultaneously. The performance using hardware cache coherence management was almost 4% higher than with the software counterpart.

Figure 8. Inter-processor data transfer. (Vertical axis: throughput in MB/s; horizontal axis: transfer size in bytes; curves with and without CCMM.)

6.3 Evaluation of Numerical Application Performance

Performance of Implicit Method. Using the same assumptions as in Sec. 6.1, we evaluated the performance of the four loops below:

(a) a(i,j) = b(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
(b) b(i,j) = b(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
(c) b(i,j) = a(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
(d) c(i,j) = a(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)

These four equations are simplified forms of the core loop that appears in numerical application programs using implicit methods. The HARP-1E achieves peak performance when it performs a multiplication and an addition in parallel every machine cycle. However, these four equations have only addition operations, which reduces the maximum achievable performance for these calculations to half of the peak value of 300 MFLOPS.

The experimental results are shown in Fig. 9. The horizontal axis is the array size N (all arrays have N x N elements). The vertical axis is the performance. As shown in Fig. 9, the performance changes with the size of the array because of bank conflicts on the accesses of the j-th index. To avoid this problem, dummy elements are added to the i-th index of the array. The relationship between the number of dummy elements and performance when N equals 36 is shown in Fig. 10. In this case, two dummy elements are sufficient to achieve high performance.

Figure 9. Performance of four typical equations used in implicit method. (Vertical axis: performance in MFLOPS; horizontal axis: array size in number of elements; curves for Eq. (a) to Eq. (d).)

Figure 10. Performance of four equations with dummy elements. (Vertical axis: performance in MFLOPS; horizontal axis: number of dummy elements, for the N=36 case.)

Figure 11. Performance of four equations without PVP. (Vertical axis: performance in MFLOPS; horizontal axis: array size in number of elements.)

Figure 11 shows the performance of the four equations without PVP. The performance achieved using PVP (Fig. 9) is far higher than that obtained without it (i.e., access through the cache). In the worst case, when severe bank conflicts occur, the performance with PVP is equal to that without PVP.

Performance of LU Decomposition Program. LU decomposition is the main computation in the LINPACK benchmark, which is commonly used to measure the performance of supercomputers.
In this section we identify the most suitable algorithm for the SR2201 processing unit and show the best performance tuning for this algorithm. The evaluation parameters are equivalent to the ones above.

The core calculation of LU decomposition is: a(i,j) = a(i,j) - a(k,j) * a(i,k). The order of the i, j, and k loops is what differentiates the LU decomposition algorithms. The outer product form (k,j,i order) is commonly used on vector processors because they perform poorly when the inner product form is used. In the inner product Crout form (i,j,k order), the innermost loop (k) performs accumulation into a(i,j), reducing the number of memory store operations. Memory data can be reused by unrolling the i,j loops. On the SR2201, as stated in Sec. 6.1, performance can be improved by using algorithms that use fewer memory store operations, and also by reusing memory data to obtain a higher ratio of numeric instructions to memory load/store instructions. Therefore, the inner product Crout algorithm is the most suitable one for the SR2201.

The outer product and inner product Crout forms of the part of the LU decomposition program that dominates the execution time are shown in Fig. 12. In these programs, loop unrolling has been done by hand-coding to improve register allocation.

The performance of both algorithms is shown in Figure 13. The horizontal axis is the number of elements (N) in each dimension of the array a(i,j). The inner product Crout algorithm delivers better performance because it has fewer store instructions than the outer product form. Both algorithms show the same behavior on bank conflicts.

c     (a) outer product form (j: 2-unrolling, k: 4-unrolling)
      do 1 k=1,n-5,4
      do 1 j=k+4,n-1,2
      do 1 i=k+4,n
        a(i,j)  =a(i,j)   + w(1,j)  *a(i,k)   + w(2,j)  *a(i,k+1)
     &                    + w(3,j)  *a(i,k+2) + w(4,j)  *a(i,k+3)
        a(i,j+1)=a(i,j+1) + w(1,j+1)*a(i,k)   + w(2,j+1)*a(i,k+1)
     &                    + w(3,j+1)*a(i,k+2) + w(4,j+1)*a(i,k+3)
    1 continue

c     (b) inner product Crout form
c         (i: 5-unrolling, j: 2-unrolling, k: 2-unrolling)
c         s1 ... sa are the ten accumulators of the unrolled k loop
      do 1 i=1,n,5
      do 2 j=i+1,n,2
      do 3 k=1,i-1,2
        s1 = s1 + a(j,k)  *a(k,i)   + a(j,k+1)  *a(k+1,i)
        s2 = s2 + a(j+1,k)*a(k,i)   + a(j+1,k+1)*a(k+1,i)
        s3 = s3 + a(j,k)  *a(k,i+1) + a(j,k+1)  *a(k+1,i+1)
        s4 = s4 + a(j+1,k)*a(k,i+1) + a(j+1,k+1)*a(k+1,i+1)
        s5 = s5 + a(j,k)  *a(k,i+2) + a(j,k+1)  *a(k+1,i+2)
        s6 = s6 + a(j+1,k)*a(k,i+2) + a(j+1,k+1)*a(k+1,i+2)
        s7 = s7 + a(j,k)  *a(k,i+3) + a(j,k+1)  *a(k+1,i+3)
        s8 = s8 + a(j+1,k)*a(k,i+3) + a(j+1,k+1)*a(k+1,i+3)
        s9 = s9 + a(j,k)  *a(k,i+4) + a(j,k+1)  *a(k+1,i+4)
        sa = sa + a(j+1,k)*a(k,i+4) + a(j+1,k+1)*a(k+1,i+4)
    3 continue
        a(j,i)     = a(j,i)     - s1
        a(j+1,i)   = a(j+1,i)   - s2
        a(j,i+1)   = a(j,i+1)   - s3 - a(j,i)    *a(i,i+1)
        a(j+1,i+1) = a(j+1,i+1) - s4 - a(j+1,i)  *a(i,i+1)
        a(j,i+2)   = a(j,i+2)   - s5 - a(j,i)    *a(i,i+2)
     &                               - a(j,i+1)  *a(i+1,i+2)
        a(j+1,i+2) = a(j+1,i+2) - s6 - a(j+1,i)  *a(i,i+2)
     &                               - a(j+1,i+1)*a(i+1,i+2)
        a(j,i+3)   = a(j,i+3)   - s7 - a(j,i)    *a(i,i+3)
     &                               - a(j,i+1)  *a(i+1,i+3)
     &                               - a(j,i+2)  *a(i+2,i+3)
        a(j+1,i+3) = a(j+1,i+3) - s8 - a(j+1,i)  *a(i,i+3)
     &                               - a(j+1,i+1)*a(i+1,i+3)
     &                               - a(j+1,i+2)*a(i+2,i+3)
        a(j,i+4)   = a(j,i+4)   - s9 - a(j,i)    *a(i,i+4)
     &                               - a(j,i+1)  *a(i+1,i+4)
     &                               - a(j,i+2)  *a(i+2,i+4)
     &                               - a(j,i+3)  *a(i+3,i+4)
        a(j+1,i+4) = a(j+1,i+4) - sa - a(j+1,i)  *a(i,i+4)
     &                               - a(j+1,i+1)*a(i+1,i+4)
     &                               - a(j+1,i+2)*a(i+2,i+4)
     &                               - a(j+1,i+3)*a(i+3,i+4)
    2 continue
    1 continue

Figure 12. LU decomposition program codes for experimental measurements.

One LINPACK benchmark measures the performance for N=. As shown in Fig. 13, the performance for N= is worse than that of the neighboring points due to bank conflicts. By inserting a dummy element, as described for the implicit method, the performance of the inner product Crout form was improved to 247 MFLOPS.
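The difference in store traffic between the two loop orderings can be illustrated with a small sketch. It is not the unrolled code of Fig. 12: it uses plain Python lists, no pivoting, and no unrolling, and the function names and store counters are assumptions introduced only to count how often each ordering writes back to the matrix.

```python
# Two loop orderings of LU decomposition without pivoting, written to show
# why the inner product form issues fewer stores. Function names and store
# counters are illustrative assumptions, not the paper's code.

def lu_outer_product(a):
    """k-outermost (outer product) form; returns (packed LU, store count)."""
    n = len(a)
    a = [row[:] for row in a]
    stores = 0
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                # multiplier l(i,k)
            stores += 1
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]  # one store per update
                stores += 1
    return a, stores


def lu_inner_product(a):
    """Inner product form: accumulate in a scalar, store each element once."""
    n = len(a)
    a = [row[:] for row in a]
    stores = 0
    for j in range(n):
        for i in range(n):
            s = 0.0
            for k in range(min(i, j)):
                s += a[i][k] * a[k][j]        # accumulation stays in a scalar
            value = a[i][j] - s
            if i > j:
                value /= a[j][j]              # entry of L below the diagonal
            a[i][j] = value
            stores += 1
    return a, stores


if __name__ == "__main__":
    n = 8
    m = [[100.0 * (i == j) + ((7 * i + 3 * j) % 5) for j in range(n)]
         for i in range(n)]
    lu1, stores1 = lu_outer_product(m)
    lu2, stores2 = lu_inner_product(m)
    print(stores1, stores2)
```

Both functions return the same packed factors (unit lower triangular L below the diagonal, U on and above it), but the outer product form writes each trailing element once per elimination step, whereas the inner product form writes each element once in total.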
The 247 MFLOPS result is 82% of the uniprocessor peak performance (300 MFLOPS).

Figure 13. Experimental measurements for LU decomposition programs. (Vertical axis: performance in MFLOPS; horizontal axis: array size in number of elements; curves for the outer product form and the inner product Crout form.)

Performance of Parallel LINPACK Benchmark. In solving the LU decomposition part of parallel LINPACK, a new method named double-blocked Gaussian elimination has been used [9]. This method uses two types of blocking, one for communication and another for calculation. It can achieve high single-processor performance through longer loop lengths and, at the same time, high parallel efficiency through optimized load balancing. The LINPACK benchmark performance of the same three systems (for a 256 PU configuration) is shown in Table 4. The performance of the CRAY T3D and IBM SP2 is derived from the LINPACK benchmark report dated March 28. The SR2201 again outperforms the other two in terms of both peak performance and effective performance ratio.

Table 4. Performance of LINPACK benchmark on 256 PU system. (Columns: system, peak performance of PU (MFLOPS), performance of benchmark (GFLOPS), efficiency compared to peak; rows: SR2201, Cray T3D, IBM SP2.)

7. Conclusion

In the conception of Hitachi's SR2201 massively parallel RISC computer, careful attention was paid both to the processing unit (PU) and to the network architecture in order to achieve high overall effective performance. Several features have been added to address the causes of performance degradation commonly found in conventional parallel processor systems:

1. The PU has a pseudo vector processing (PVP) feature for accelerating performance on large-scale numerical applications. With PVP, the PU loads data by prefetching to a special register bank, bypassing the cache. This solves the cache miss penalties that occur in large-scale numerical applications, allowing high-throughput memory access.

2. The memory system of the SR2201 has a 1.2-GB/s bandwidth. This supports the high throughput required by the PVP feature.

3. For inter-processor data transfer, the high performance of the memory system, the newly proposed remote DMA transfer protocol, and the hardware support for maintaining cache coherence provide efficient data transfer performance.

Due to the combined effect of all these features, the SR2201 showed high effective performance for processing large-scale numerical applications, as well as in inter-processor data transfer. For instance, the 1024 PU system of the SR2201 achieved 220.4 GFLOPS on the LINPACK benchmark, which corresponds to 72% of the peak performance.

References

[1] Yasuda, Y., Fujii, H., Tanaka, T., and Inagami, Y.: "Performance Evaluation of the Hyper Crossbar Network", Technical Report of IEICE. CPSY (1993), (in Japanese).
[2] Nakamura, H., Imori, H., Nakazawa, K., Boku, T., Nakata, I., Yamashita, Y., Wada, H., and Inagami, Y.: "A Scalar Architecture for Pseudo Vector Processing based on Slide-Windowed Registers", Proceedings of International Conference on Supercomputing (July, 1993).
[3] Tirumalai, P., Lee, M., and Schlansker, M.: "Parallelization Of Loops With Exits On Pipelined Architectures", Proceedings of Supercomputing '90 (Nov., 1990).
[4] Rau, R. B., Lee, M., Tirumalai, P. P., and Schlansker, S. M.: "Register Allocation for Software Pipelined Loops", Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation (June, 1992).
[5] Saito, K., Hashimoto, M., Sawamoto, H., Yamagata, R., Kumagai, T., Kamada, E., Matsubara, K., Isobe, T., Hotta, T., Nakano, T., Shimizu, T., and Nakazawa, K.: "A 150MHz Superscalar RISC Processor with Pseudo Vector Processing Feature", Proceedings Notebook for Hot Chips VII (Aug., 1995).
[6] Saini, S. and Bailey, H. D.: "RISC Processors and High Performance Computing", Supercomputing '95, Tutorial S5 (Dec., 1995).
[7] Numrich, W. R., Springer, L. P., and Peterson, C. J.: "Measurement of Communication Rates on the Cray T3D Interprocessor Network", HPCN Europe '94 (1994).
[8] Stunkel, B. C.: "The SP2 High-Performance Switch", IBM System Journal, Vol. 34, No. 2 (1995).
[9] Yamamoto, Y. and Ohkouchi, T.: "The Optimization of the Gaussian Elimination for Massively Parallel Processors", Proceedings of the JSPP '95 (1995), (in Japanese).
[10] Yasuda, Y., Fujii, H., Akashi, H., Inagami, Y., Tanaka, T., Nakagoshi, J., Wada, H., and Sumimoto, T.: "Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201", Proceedings of 11th International Parallel Processing Symposium (IPPS '97) (April, 1997).
