Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System

Hiroaki Fujii, Yoshiko Yasuda, Hideya Akashi, Yasuhiro Inagami, Makoto Koga*, Osamu Ishihara*, Masamori Kashiyama*, Hideo Wada*, and Tsutomu Sumimoto*

Central Research Laboratory, Hitachi Ltd.
1-280, Higashi-Koigakubo, Kokubunji, Tokyo 185, Japan

*General Purpose Computer Division, Hitachi Ltd.
1, Horiyamashita, Hadano, Kanagawa, Japan

Abstract

RISC-based Massively Parallel Processors (MPPs) often show low efficiency in real-world applications because of cache miss penalty, insufficient throughput of the memory system, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and 600 GFLOPS peak performance, overcomes these problems by introducing three novel features. First, its processor, the 150 MHz HARP-1E, solves the cache miss penalty by "pseudo vector processing" (PVP). In PVP, data is loaded by prefetching to a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory system bottleneck. Third, the inter-processor communication achieves high performance on the three-dimensional crossbar network, using a "remote DMA transfer" protocol and hardware-based cache coherency. As a result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors in the LINPACK benchmark, which is almost 72% of the peak performance.

1. Introduction

The Hitachi SR2201 is a newly designed massively parallel processor (MPP) computer system that was introduced to the supercomputing market in March 1996. Up to 2048 RISC processors can be connected via a high-speed three-dimensional (3D) crossbar network [1], [10]. Each processor, running at a clock frequency of 150 MHz, has a peak performance of 300 MFLOPS, giving the SR2201 a peak performance of 600 GFLOPS. The main memory for one processing node is up to 1 GB with 1 MB of secondary cache. The 3D crossbar network is able to transfer data at 300 MB/s over each link.
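The headline figures above are related by simple arithmetic, which the following sketch spells out. It assumes a 150 MHz clock and two floating-point operations per cycle (one multiplication and one addition, as the paper states for the HARP-1E later on); the function names are illustrative and do not come from the paper.

```python
# Back-of-the-envelope check of the peak-performance figures.
# Assumptions: 150 MHz clock, 2 floating-point operations per cycle
# (one multiply and one add issued in parallel).

CLOCK_MHZ = 150
FLOPS_PER_CYCLE = 2


def peak_mflops_per_pu(clock_mhz=CLOCK_MHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Peak MFLOPS of a single processing unit."""
    return clock_mhz * flops_per_cycle


def system_peak_gflops(num_pus):
    """Peak GFLOPS of a system with num_pus processing units."""
    return num_pus * peak_mflops_per_pu() / 1000.0


def efficiency(measured_gflops, num_pus):
    """Measured performance as a fraction of the system peak."""
    return measured_gflops / system_peak_gflops(num_pus)


if __name__ == "__main__":
    print("per-PU peak (MFLOPS):", peak_mflops_per_pu())
    print("2048-PU peak (GFLOPS):", system_peak_gflops(2048))
    print("LINPACK efficiency, 1024 PUs:", round(efficiency(220.4, 1024), 3))
```

Under these assumptions the 2048-processor peak comes to 614.4 GFLOPS, which the text rounds to 600 GFLOPS, and the 1024-processor LINPACK result corresponds to just under 72% of peak.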
One of the main design targets of the SR2201 is to solve the low effective performance problem often seen in MPPs. The main causes of this performance degradation (cache miss penalty, insufficient throughput of the memory system, and poor inter-processor communication performance) and the solutions adopted in the SR2201 are discussed below.

Cache memory has been introduced in parallel processor systems built around RISC processors in an attempt to resolve the speed gap between the main memory and the CPU. However, this architecture fails to fulfill this objective, especially when an application program needs to access a large portion of memory which cannot fit into the cache, because such an access inevitably suffers cache misses, which leads to a heavy loss in effective performance. Conventional cache-based computers are thus likely to show decreased performance in such cases, which are often observed in large-scale numerical applications.

To eliminate the cache miss penalty and improve the performance, the SR2201 is equipped with a novel mechanism in its RISC processors, called the "preload" operation [2]. This operation is a non-blocking direct main-memory load operation that bypasses the cache memory. Since the target of the preload operation is a 128-register bank, it is possible to issue preloads early enough so that the fetched value arrives before its use by other instructions in the program. This preload operation, along with the superscalar architecture and code scheduling similar to software pipelining [3], [4] provided by the compiler, solves the cache miss penalty and improves performance. We call the combination of these techniques "pseudo vector processing" (PVP) [2].

Insufficient memory throughput, which is another cause of performance degradation, becomes critical in the case of PVP. Since PVP issues fetch operations to the main memory almost every machine cycle, it is especially important that the main memory supports a sustained high bandwidth.
To sustain this bandwidth, the memory system of the SR2201 is composed of a multi-bank memory that operates like a pipeline.

The effective performance of MPPs can also be degraded by low performance of the inter-processor network, due to narrow bandwidth, poor memory system performance, and high software overhead. The memory system features described above not only help PVP to work effectively, but also contribute to increased throughput in the inter-processor network. Also, to achieve high performance in inter-processor communications, the SR2201 eliminates the software overhead by using a newly developed transfer protocol, called "remote DMA transfer", along with hardware-based cache coherency.

The rest of this paper is organized as follows. Section 2 gives an architectural overview of the SR2201. In Section 3, the pseudo vector processing feature is described in detail. Section 4 describes issues on inter-processor data transfer. Section 5 describes the memory system. In Section 6, some performance evaluation results are shown. Section 7 concludes the paper with some remarks.

2. System Overview of the SR2201

2.1 Overall Organization

Figure 1 shows the conceptual system configuration of the SR2201. The SR2201 uses a multi-dimensional crossbar network to connect the processors. For example, a 2176 processor element (PE) system of the SR2201 uses a 3D crossbar network. The 2176 PEs are arranged in an 8 x 17 x 16 lattice, and the PEs arranged along each dimension (x, y, or z) are connected by a common crossbar switch. There are two types of PEs: processing units (PUs) and I/O units (IOUs). The PUs perform computation and the IOUs mainly control the I/O processes. The 2176 PE system has 2048 PUs and 128 IOUs. One of the IOUs is a Supervisory IOU (SIOU), which also performs system management.

Figure 1. Conceptual system configuration of the SR2201 (elements shown: processing units (PUs), I/O units (IOUs), supervisory IOU (SIOU), crossbar switches, LAN, hard disk).

2.2 Organization of Processing Unit

Figure 2 shows the organization of a PU. Each PU has seven components: an instruction processor (IP), a storage controller (SC), a network interface adapter (NIA), a memory controller for addresses (MCA), memory controllers for data (MCDs), main storage (MS), and secondary cache. The HARP-1E RISC processor [5] is used as the IP. The HARP-1E is based on the PA-RISC 1.1 architecture. It runs at 150 MHz, and can operate at up to 300 MFLOPS.

Figure 2. Organization of a PU.

The NIA connects each PU to three crossbar switches. It handles the sending and receiving of data between processors, and handles the routing of data through the network as well.
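To make the lattice arrangement concrete, the sketch below places PEs at coordinates in the 8 x 17 x 16 lattice and counts crossbar hops between two of them. It assumes that one crossbar traversal can change exactly one coordinate, so the hop count is the number of coordinates that differ. The row-major numbering in `pe_coords` and the x-then-y-then-z order in `route` are illustrative assumptions only; the routing actually implemented in the SR2201 is the subject of reference [10].

```python
# Toy model of the 3D crossbar lattice (8 x 17 x 16 = 2176 PEs).
# Assumptions (not from the paper): row-major PE numbering and
# dimension-order (x, then y, then z) routing.

DIMS = (8, 17, 16)


def num_pes(dims=DIMS):
    """Total number of processor elements in the lattice."""
    total = 1
    for d in dims:
        total *= d
    return total


def pe_coords(index, dims=DIMS):
    """Map a linear PE index to (x, y, z) using row-major numbering."""
    _, ny, nz = dims
    return (index // (ny * nz), (index // nz) % ny, index % nz)


def hops(src, dst):
    """Each crossbar links all PEs along one dimension, so one hop fixes one coordinate."""
    return sum(1 for a, b in zip(src, dst) if a != b)


def route(src, dst):
    """List of PEs visited when correcting x, then y, then z."""
    path = [tuple(src)]
    cur = list(src)
    for d in range(len(cur)):
        if cur[d] != dst[d]:
            cur[d] = dst[d]
            path.append(tuple(cur))
    return path


if __name__ == "__main__":
    a, b = pe_coords(0), pe_coords(num_pes() - 1)
    print(a, "->", b, "hops:", hops(a, b))
    print(route(a, b))
```

In this model no pair of PEs is ever more than three hops apart, since at most three coordinates can differ.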
When sending and receiving data, the NIA reads and writes the data from or to the MS directly through the SC by direct memory access (DMA). The SC is connected to the IP, NIA, and memory controllers (MCA and MCDs). It processes MS access requests coming from the IP and NIA and passes the requests on to the memory controllers. The MCA and MCDs manage the address information and the data of the MS access requests, respectively.

The IOUs and SIOU have the same organization as the PUs, but also have an I/O bus manager connected to the SC, enabling them to connect to I/O devices.

3. Pseudo Vector Processing Feature

The main target applications of the SR2201 are large-scale numerical applications. These applications need a large amount of data space which cannot fit into the cache, resulting in a high number of cache misses if run on a normal RISC processor system.

This problem is overcome in the SR2201 by using a non-blocking direct main-memory load feature called preload. This feature does not utilize the cache, so there is no cache miss penalty. However, it needs optimized code scheduling to hide the memory access latency. Code scheduling in the SR2201 IP is based on the software pipelining technique [3], [4], and achieves highly effective computational performance when used with the preload and superscalar processing features. The IP issues a preload and a floating-point instruction in parallel every cycle when it executes code optimized by software pipelining. We call this feature pseudo vector processing (PVP) [2].

In PVP, the instructions of program segments such as loop iterations are divided into two categories: preloads, which take a long latency to complete, and other instructions (calculations and stores). Using superscalar processing, preloads for the data to be used in a segment are continuously issued in advance of the other instructions of this segment, early enough to hide the latency, and in parallel with the calculations of a different segment, which has already completed its own preloads.
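The benefit of issuing preloads ahead of the calculations that consume them can be illustrated with a toy cycle-count model. This is only a sketch under stated assumptions: every load has the same fixed latency, each calculation takes one cycle, at most one preload is issued per cycle, and at most `window` preloaded values can be held in registers at once. The latency value and the function names are hypothetical and are not taken from the paper.

```python
# Toy timing model: blocking loads versus preloads issued ahead of use.
# All parameters are illustrative; the model ignores bank conflicts and
# bus contention.


def blocking_cycles(n, latency):
    """Each element waits for its own load (latency cycles), then computes (1 cycle)."""
    return n * (latency + 1)


def pvp_cycles(n, latency, window):
    """Cycles when preloads are issued one per cycle, ahead of the calculations.

    window is the number of preloaded values that can be outstanding at once
    (a stand-in for the preload target registers); it must be at least 1.
    """
    if n == 0:
        return 0
    issue = [0] * n   # cycle in which preload i is issued
    done = [0] * n    # cycle in which calculation i finishes
    for i in range(n):
        earliest = issue[i - 1] + 1 if i > 0 else 0
        if i >= window:
            # A register slot is reused only after its previous value was consumed.
            earliest = max(earliest, done[i - window])
        issue[i] = earliest
        ready = issue[i] + latency
        prev_done = done[i - 1] if i > 0 else 0
        done[i] = max(ready, prev_done) + 1
    return done[-1]


if __name__ == "__main__":
    n, latency = 1000, 50   # hypothetical loop length and load latency
    print("blocking loads:", blocking_cycles(n, latency))
    print("preload, window=128:", pvp_cycles(n, latency, 128))
    print("preload, window=1:", pvp_cycles(n, latency, 1))
```

In this model, once the window exceeds the load latency the loop takes about n + latency cycles instead of n * (latency + 1), which is the effect the software-pipelined schedule aims for; with a window of one the advantage disappears, which hints at why PVP needs a large register bank.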
A processor executing PVP code can thus keep a load pipeline and a calculation pipeline running in parallel, achieving high performance.

PVP needs many floating-point registers as target registers for preloading. Each IP in the SR2201 has 128 floating-point registers, which are managed using a register window. This sliding window feature [2] enables selective access to the preloaded data in the 128 registers of each IP. Owing to this sliding window feature, the HARP-1E required few changes to the usual RISC instruction set architecture. It needed no extension of the register specification fields of floating-point instructions, and only added some new instructions, such as "preload" and "window-switch".

4. Inter-processor Data Transfer

The SR2201 achieves high-performance inter-processor data transfer due to its

1. flexible inter-processor network topology,
2. high-speed inter-processor network, and
3. low-latency inter-processor communication (message passing) protocol.

The first two result from its multi-dimensional crossbar network, and the third from its use of an original inter-processor communication facility, the "remote DMA", and hardware-based cache coherency. This section describes the multi-dimensional crossbar network and the remote DMA transfer facility. The hardware-based cache coherency is described in Sec. 5.3.

4.1 Multi-dimensional Crossbar Network

The multi-dimensional crossbar network is one of the most important features of the SR2201 [1]. Figure 1 shows the structure of the three-dimensional (3D) crossbar network. In the 3D crossbar network, PUs are placed in a three-dimensional arrangement. Several crossbar switches are placed in parallel in each dimension to connect the PUs. The NIA on each PU includes a router for connecting itself to the three crossbars. Each router can also route data from one crossbar switch to another, enabling data transfer between PUs which are not directly connected by a single crossbar. Therefore, each router is also a small crossbar switch.

This network has three significant features supporting inter-processor data transfer:

1. Short communication distance. Inter-processor data transfer between any two PEs is achieved within at most three hops in the three-dimensional configuration.

2. Great freedom in processor mapping of applications. Because this network is composed of multiple crossbars, far fewer network conflicts occur in this network compared to mesh-connected or torus networks. Thus, high performance is achieved for many variations in the inter-processor communication patterns due to the many independent communication paths. Consequently, there is great freedom in the processor mapping of applications.

3. High-performance broadcast and barrier synchronization facility. The multi-dimensional crossbar topology facilitates collective communication via hardware, achieving high performance (low latency) broadcast and barrier synchronization. The entire system can be partitioned into a maximum of eight groups (partitions), in each of which the collective communication facility can be used independently.

Each link of this network can transfer data at 300 MB/s, which matches the computing performance of the PU when the SR2201 is solving large-scale numerical applications.

4.2 Remote DMA Transfer Facility

In the conventional inter-processor communication protocol (send/receive model), when a PU sends data to another, the data is first copied to a send buffer in the operating system (OS), and then is transmitted by the network to a similar buffer in the receiving PU. Finally, the data is copied to the receiving program. This protocol has the following advantages:

1. The send operation is non-blocking.
2. A reliable transmission protocol can be easily implemented.

However, the communication overhead of the send/receive protocol is quite large. This happens because it is necessary to copy the data twice and the processing of the protocol requires context switches. Furthermore, receiving the data generates an interrupt.

To solve these problems, the SR2201 supports a remote DMA transfer facility. The basic concept of this protocol is shown in Fig. 3. In order to avoid the communication overhead, the data is transmitted directly from one program area to another, without any OS operations.
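The contrast between the two protocols can be sketched with a toy model that simply counts memory-to-memory copies and OS involvement on each node. The `Node` class, its counters, and both functions are illustrative assumptions; they model the description above and are not the SR2201 programming interface.

```python
# Toy comparison of the send/receive model and a remote-DMA-style transfer.
# Everything here is a hypothetical model for counting copies, not a real API.


class Node:
    def __init__(self, size):
        self.user = bytearray(size)    # user program data area
        self.os_buf = bytearray(size)  # OS communication buffer
        self.copies = 0                # memory-to-memory copies on this node
        self.os_entries = 0            # system calls or interrupts on this node


def send_receive(src, dst, n):
    """Conventional path: user area -> OS buffer -> network -> OS buffer -> user area."""
    src.os_entries += 1                # sender enters the OS
    src.os_buf[:n] = src.user[:n]      # copy 1: into the send buffer
    src.copies += 1
    dst.os_buf[:n] = src.os_buf[:n]    # network transfer between OS buffers
    dst.os_entries += 1                # arrival raises an interrupt
    dst.user[:n] = dst.os_buf[:n]      # copy 2: out to the receiving program
    dst.copies += 1


def remote_dma(src, dst, n, dst_offset=0):
    """Remote-DMA-style path: the sender's data lands directly in the receiver's user area."""
    dst.user[dst_offset:dst_offset + n] = src.user[:n]


if __name__ == "__main__":
    a, b = Node(16), Node(16)
    a.user[:4] = b"ping"
    send_receive(a, b, 4)
    print("send/receive copies:", a.copies + b.copies)

    c, d = Node(16), Node(16)
    c.user[:4] = b"ping"
    remote_dma(c, d, 4)
    print("remote DMA copies:", c.copies + d.copies)
```

Both paths deliver the same bytes; the difference the model highlights is the two extra copies and the OS entries on the conventional path.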
To achieve the remote DMA transfer facility, the OS allocates in advance a reserved physical memory area for the user space, which is never moved to another address space. The sender specifies that area of the receiver and directly writes the data in it. Since there is no buffering in the OS kernel, expensive memory copy operations are avoided. Also, there is no need for an OS system call and context switches, since the user program directly invokes the communication.

Figure 3. Basic concept of remote DMA transfer facility (data moves over the network between the program data areas of two PUs, with no buffering in the kernel and no OS system call).

5. Memory System

To achieve high memory performance, a great amount of hardware, such as LSI pins, memory chips, and control LSI chips, is needed. However, the SR2201 aims at a compact 2048 PU system which achieves high performance, both peak and effective. Thus, compactly implementing the PUs, including the memory system, is important. As a result, the memory system should achieve high effective performance by fully utilizing a limited set of hardware resources. This section describes how the memory system solves this problem.

5.1 Organization of Memory System

As shown in Fig. 2, the SC is implemented using a single LSI chip, whose number of LSI pins has been made as high as possible to widen the data paths and to avoid bottlenecks. Figure 4 shows the inter-LSI interfaces of the memory system. Because the address and data of a store transaction from the IP share a common 8-byte-wide path between the IP and the SC, the IP needs two machine cycles to transmit such a transaction. This is the only factor degrading the performance of PVP. The data paths between the NIA and the SC can simultaneously handle MS reads for data sending and MS writes for data receiving without performance degradation.

Figure 4. Inter-LSI interfaces of memory system (instruction processor at 150 MHz; storage controller, memory controller, and network interface adapter at 75 MHz).

The SC and other LSI chips shown in Fig. 2 run at 75 MHz, with the exception of the IP, which runs at 150 MHz. At the interface between the memory controller and the SC there are two sets of MS access paths, which keep the same throughput as the IP-SC bus at half of the clock speed and support the required pitch for MS accesses using PVP.

Since the data paths are bi-directional, path conflicts sometimes occur between the store transactions from the SC and the transmission of fetched data from the memory controller. The path controller is able to switch direction without idle cycles, minimizing the penalty of these conflicts. The time charts in Fig. 5 show the control flow of the bi-directional paths. As shown in Fig. 5 (a), the conventional method spends an idle machine cycle to switch direction. As shown in Fig. 5 (b), the method used on the SR2201 switches the direction within the interval from the end of one data transfer (the moment at which the latch of the opposite port receives the data) to the start of the next one.

Figure 5. Flow control for bi-directional path between memory controller and SC: (a) conventional, (b) SR2201 (machine cycle: 13.3 ns).

5.2 Features for Supporting Pseudo Vector Processing

Main storage access using PVP has the following characteristics:

1. Since PVP issues 8-byte fetch operations to the MS almost every machine cycle (150 MHz), the data-supply rate from MS to IP is about 1.2 GB/s. On the other hand, if cache is used in the SR2201, as it is in conventional systems, the data-supply rate from MS to cache (and then to the PU) would be at most about 6 MB/s due to cache misses.

2. The MS accesses using PVP are uncorrelated to each other in principle, so there is no regularity in their address sequences. This characteristic is the most significant difference between the load/store operations of a vector processor and the MS accesses using PVP.

To achieve the high access pitch needed for PVP described in the first characteristic above, the memory system processes MS accesses in a pipelined manner. Two sets of MS access pipelines in the SC keep the same throughput as the IP-SC bus at half of the clock speed. The system has 16 memory banks in the MS, providing 1.2-GB/s bandwidth for the PVP memory accesses. As shown in Fig. 6, the MS is separated into two groups (bank groups) of eight banks each, based on the two sets of MS access paths. Two MCDs manage the accessed data, one MCD for each bank group. An MCA manages the addresses for all MS accesses, independently of each bank group. To avoid bank conflict penalties caused by the second characteristic above, the SC and the memory controller have access buffers.

Figure 6. Main storage configuration (banks 0 to 15 in two bank groups of eight; one MCD per bank group and a shared MCA; data, address, and control paths to the storage controller).

5.3 Features for Supporting High-speed Inter-processor Data Transfer

As previously stated, each link in the 3D crossbar network can transfer data at 300 MB/s and the NIA handles sending and receiving of data in parallel. As a result, the throughput of MS accesses from the NIA reaches 600 MB/s. The use of the NIA-SC interface described in Sec. 5.1, and the 1.2-GB/s bandwidth of the MS described in Sec. 5.2, allows this 600-MB/s bandwidth to be achieved.

The NIA accesses the data to be transferred directly from the MS, independent of the IP. However, the IP may access and cache the same data areas accessed by the NIA, thus cache coherency has to be maintained. Conventional parallel processor systems realize cache coherence by software, which leads to performance degradation during massive data transfer due to high software overhead. To avoid this problem, the following two hardware features are implemented:

1. Store-through cache management, which makes cache coherence operations in data sending unnecessary.

2. A hardware support mechanism in the SC which maintains cache coherence in parallel with data reception from the NIA. This mechanism usually invokes cache coherence operations once per cache line. This strategy hides the overhead of cache coherence operations.

6. Performance Evaluations of the SR2201

6.1 Performance Measurements of Basic Loops

A set of basic loops corresponding to common vector operations was used to measure the performance of the memory system. To minimize the TLB (translation lookaside buffer) miss penalties and measure the true performance of the memory system, the data area of programs was mapped by using the blocked TLB facility, which translates a continuous memory area of up to 32 MBytes from virtual into physical addresses using one entry in the address translation table.

The equations for each of the basic loops used are shown in Table 1. For each equation, the factors that affect the performance of the memory system are shown. These are the number of array variables, the number of load instructions, and the number of store instructions that access the memory in one iteration (when a vector variable appears on both sides of the assignment, it is counted twice because a load and a store instruction need to be issued). All variables are double-precision floating point except for the array indexes.

Table 1. Equations used in experimental measurements. (The original table also lists, per equation, the number of variables, load operations, and store operations; those entries are not preserved in this transcription.)

    Eq. #   Equation
    1       s = s + A(i)
    2       A(i) = B(i)
    3       A(i) = B(i) + C(i)
    4       s = s + A(i)*B(i)
    5       C(i) = C(i) + A(i)*B(i)

The experimental measurements on one PU using the basic loop calculations (Table 1) are shown in Figure 7. The horizontal axis shows the stride of the accesses (i.e., the increment used on the values of the index i). For basic loops that have more than one vector variable, the accesses for all variables have the same stride. All arrays are aligned on 256-byte boundaries.

As shown in Figure 7, all calculations have low performance at the same strides because memory bank conflicts occur. For instance, when the stride is a multiple of 2, the performance is half of the maximum, because only half of the memory banks are accessed. When the stride is a multiple of 4, 8, or 16, the performance of the memory access drops to 1/4, 1/8, and 1/16 of the maximum, because the number of memory banks accessed is reduced to 4, 2, and 1, respectively.

Figure 7. Experimental measurements for basic loop calculations (MS bandwidth in MB/s versus stride in number of elements, for Eq. 1 to Eq. 5).

Table 2. Memory access performance (MB/s) for Eq. 2 and Eq. 3 on the SR2201, Cray T3D, Cray T3E, and IBM SP2. (Numeric entries are not preserved in this transcription.)

Table 3. Performance of inter-processor data transfer (MB/s): theoretical peak and effective peak for the SR2201, Cray T3D, and IBM SP2. (Numeric entries are not preserved in this transcription.)

The next analysis shows the differences between the equations based on their features. The features in Figure 7 are as follows:

1. The number of array variables that must be accessed affects performance. The performance of Eq. 3 and Eq. 5, which access three or more array variables, is low. When the number of variables increases, accesses to the same bank occur continuously because all variables are aligned on 256-byte boundaries, and thus performance decreases.

2. The number of store instructions affects performance. Array store operations lower the access-request issue rate to the memory because of the IP-SC bus width, possibly lowering the throughput. On the other hand, this reduces the load on the memory system and thereby the impact of memory bank conflicts, thus raising the throughput in some cases. Also, since in PVP the array elements being loaded and stored correspond to different iterations, the banks accessed by the store instructions are different from those accessed by the load instructions. This differs from the equations that have only load instructions. Consider the bandwidths obtained for Eq. 2 and Eq. 4: each equation has two array variables; however, in Eq. 2 one of the two variables is the target of a store instruction, whereas both variables of Eq. 4 are targets of load instructions.

The performance of the Hitachi SR2201, CRAY T3D, CRAY T3E, and IBM SP2 is compared in Table 2 [6].
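The stride behaviour reported for Figure 7 is what a simple interleaving model predicts. The sketch below assumes that consecutive 8-byte elements map to consecutive banks (bank = element index mod 16) and that bandwidth scales with the number of distinct banks a strided stream touches; the paper does not spell out the address-to-bank mapping, so both points are assumptions, and the function names are illustrative.

```python
# Bank-interleaving model for strided access over 16 banks.
# Assumption: element i lives in bank (i mod 16); peak bandwidth is 1.2 GB/s.

from math import gcd

NUM_BANKS = 16
PEAK_MB_S = 1200


def banks_touched(stride, num_banks=NUM_BANKS):
    """Distinct banks visited by indices 0, stride, 2*stride, ..."""
    return num_banks // gcd(stride, num_banks)


def banks_by_enumeration(stride, n=1024, num_banks=NUM_BANKS):
    """Same quantity obtained by walking the access stream directly."""
    return len({(i * stride) % num_banks for i in range(n)})


def modeled_bandwidth_mb_s(stride):
    """Bandwidth if throughput is proportional to the banks in use."""
    return PEAK_MB_S * banks_touched(stride) / NUM_BANKS


if __name__ == "__main__":
    for s in (1, 2, 3, 4, 8, 16):
        print(f"stride {s:2d}: {banks_touched(s):2d} banks, "
              f"{modeled_bandwidth_mb_s(s):6.1f} MB/s (model)")
```

The model reproduces the halving at even strides and the drop to 4, 2, and 1 banks at strides that are multiples of 4, 8, and 16; it does not capture the effects of multiple arrays or store instructions discussed in the list above.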
Of the four machines compared in Table 2, the SR2201 had the highest performance, owing to its fully pipelined memory system and PVP facility.

6.2 Evaluation of Inter-processor Data Transfer Performance

6.2.1 Effect of High Memory Bandwidth on Data Transfer and Remote DMA Transfer Facility. The high bandwidth of the SR2201 memory system and the remote DMA transfer facility enable high-bandwidth network transfer. To illustrate this point, the network transfer performance of commercial parallel processor systems [7], [8] is shown in Table 3. The SR2201 outperforms the other two in terms of both theoretical and effective peak network throughput.

6.2.2 Effect of Hardware Support on Cache Coherence. Figure 8 shows the network transfer throughput using the cache coherence management mechanism (CCMM) (coherence kept by hardware) and without using it (coherence kept by software). The measured network transfer throughput is for the case when two processing units issue a remote DMA transfer towards each other simultaneously. The performance using hardware cache coherence management was almost 4% higher than that of the software counterpart.

Figure 8. Inter-processor data transfer (throughput in MB/s versus transfer size in bytes, with and without CCMM).

6.3 Evaluation of Numerical Application Performance

6.3.1 Performance of Implicit Method. Using the same assumptions as in Sec. 6.1, we evaluated the performance of the four loops below:

    (a) a(i,j) = b(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
    (b) b(i,j) = b(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
    (c) b(i,j) = a(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
    (d) c(i,j) = a(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)

These four equations are simplified forms of the core loop that appears in numerical application programs using implicit methods. The HARP-1E achieves peak performance when it performs a multiplication and an addition in parallel every machine cycle. However, these four equations have only addition operations, which reduces the maximum achievable performance for these calculations to half of the peak value of 300 MFLOPS.

The experimental results are shown in Fig. 9. The horizontal axis is the array size N (all arrays have N x N elements). The vertical axis is the performance. As shown in Fig. 9, the performance changes with the size of the array because of bank conflicts on the accesses along the j-th index. To avoid this problem, dummy elements are added to the i-th index of the array. The relationship between the number of dummy elements and performance when N equals 36 is shown in Fig. 10. In this case, two dummy elements are sufficient to achieve high performance.

Figure 11 shows the performance of the four equations without PVP. The performance achieved using PVP (Fig. 9) is far higher than that obtained without it (i.e., with access through the cache). In the worst case, when severe bank conflicts occur, the performance with PVP is equal to that without PVP.

Figure 9. Performance of four typical equations used in implicit method (performance in MFLOPS versus array size in number of elements).

Figure 10. Performance of four equations with dummy elements (performance in MFLOPS versus number of dummy elements, for the N=36 case).

Figure 11. Performance of four equations without PVP (performance in MFLOPS versus array size in number of elements).

6.3.2 Performance of LU Decomposition Program. LU decomposition is the main computation in the LINPACK benchmark, which is commonly used to measure the performance of supercomputers.
In this section we identify the most suitable algorithm for the SR2201 processing unit and show the best performance tuning for this algorithm. The evaluation parameters are equivalent to the ones above.

The core calculation of LU decomposition is: a(i,j) = a(i,j) - a(k,j) * a(i,k). The order of the i, j, and k loops is what differentiates the LU decomposition algorithms. The outer product form (k,j,i order) is commonly used on vector processors because they perform poorly when the inner product form is used. In the inner product Crout form (i,j,k order), the innermost loop (k) performs accumulation into a(i,j), reducing the number of memory operations. Memory data can be reused by unrolling the i and j loops.

On the SR2201, as stated in Sec. 6.1, performance can be improved by using algorithms that use fewer memory operations, and also by reusing memory data to obtain a higher ratio of numeric instructions to memory load/store instructions. Therefore, the inner product Crout algorithm is the most suitable one for the SR2201.

The outer product and inner product Crout forms of the part of the LU decomposition program that dominates the execution time are shown in Fig. 12. In these programs, loop unrolling has been done by hand-coding to improve register allocation.

          do 1 k=1,n-5,4
          do 1 j=k+4,n-1,2
          do 1 i=k+4,n
            a(i,j)   = a(i,j)   + w(1,j)  *a(i,k) + w(2,j)  *a(i,k+1) + w(3,j)  *a(i,k+2) + w(4,j)  *a(i,k+3)
            a(i,j+1) = a(i,j+1) + w(1,j+1)*a(i,k) + w(2,j+1)*a(i,k+1) + w(3,j+1)*a(i,k+2) + w(4,j+1)*a(i,k+3)
        1 continue

    (a) outer product form (j: 2-unrolling, k: 4-unrolling)

          do 1 i=1,n,5
          do 2 j=i+1,n,2
          do 3 k=1,i-1,2
            s1 = s1 + a(j,k)  *a(k,i)   + a(j,k+1)  *a(k+1,i)
            s2 = s2 + a(j+1,k)*a(k,i)   + a(j+1,k+1)*a(k+1,i)
            s3 = s3 + a(j,k)  *a(k,i+1) + a(j,k+1)  *a(k+1,i+1)
            s4 = s4 + a(j+1,k)*a(k,i+1) + a(j+1,k+1)*a(k+1,i+1)
            s5 = s5 + a(j,k)  *a(k,i+2) + a(j,k+1)  *a(k+1,i+2)
            s6 = s6 + a(j+1,k)*a(k,i+2) + a(j+1,k+1)*a(k+1,i+2)
            s7 = s7 + a(j,k)  *a(k,i+3) + a(j,k+1)  *a(k+1,i+3)
            s8 = s8 + a(j+1,k)*a(k,i+3) + a(j+1,k+1)*a(k+1,i+3)
            s9 = s9 + a(j,k)  *a(k,i+4) + a(j,k+1)  *a(k+1,i+4)
            sa = sa + a(j+1,k)*a(k,i+4) + a(j+1,k+1)*a(k+1,i+4)
        3 continue
            a(j,i)     = a(j,i)     - s1
            a(j+1,i)   = a(j+1,i)   - s2
            a(j,i+1)   = a(j,i+1)   - s3 - a(j,i)*a(i,i+1)
            a(j+1,i+1) = a(j+1,i+1) - s4 - a(j+1,i)*a(i,i+1)
            a(j,i+2)   = a(j,i+2)   - s5 - a(j,i)*a(i,i+2) - a(j,i+1)*a(i+1,i+2)
            a(j+1,i+2) = a(j+1,i+2) - s6 - a(j+1,i)*a(i,i+2) - a(j+1,i+1)*a(i+1,i+2)
            a(j,i+3)   = a(j,i+3)   - s7 - a(j,i)*a(i,i+3) - a(j,i+1)*a(i+1,i+3) - a(j,i+2)*a(i+2,i+3)
            a(j+1,i+3) = a(j+1,i+3) - s8 - a(j+1,i)*a(i,i+3) - a(j+1,i+1)*a(i+1,i+3) - a(j+1,i+2)*a(i+2,i+3)
            a(j,i+4)   = a(j,i+4)   - s9 - a(j,i)*a(i,i+4) - a(j,i+1)*a(i+1,i+4) - a(j,i+2)*a(i+2,i+4) - a(j,i+3)*a(i+3,i+4)
            a(j+1,i+4) = a(j+1,i+4) - sa - a(j+1,i)*a(i,i+4) - a(j+1,i+1)*a(i+1,i+4) - a(j+1,i+2)*a(i+2,i+4) - a(j+1,i+3)*a(i+3,i+4)
        2 continue
        1 continue

    (b) inner product Crout form (i: 5-unrolling, j: 2-unrolling, k: 2-unrolling)

Figure 12. LU decomposition program codes for experimental measurements.

The performance of both algorithms is shown in Figure 13. The horizontal axis is the number of elements (N) in each dimension of the array a(i,j). The inner product Crout algorithm delivers better performance because it has fewer store instructions than the outer product form. Both algorithms show the same behavior on bank conflicts.

One LINPACK benchmark measures the performance for a fixed problem size N. As shown in Fig. 13, the performance for that N is worse than that of the neighboring points due to bank conflicts. By inserting a dummy element as stated in Sec. 6.3.1, the performance of the inner product Crout form was improved to 247 MFLOPS.
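Why a single dummy element can help is easy to see in a simplified model. In a column-major Fortran array, a(i,j) and a(i,j+1) are one leading dimension apart, so walking along the second index is a strided access whose stride equals the leading dimension. The sketch below assumes 16 banks with element-wise interleaving (bank = element offset mod 16); the value N = 1000 is purely illustrative and is not taken from the text, and the function names are hypothetical.

```python
# Effect of padding the leading dimension on the banks visited when
# stepping along the second index of a column-major array.
# Assumptions: 16 banks, bank = element offset mod 16; N = 1000 is illustrative.

from math import gcd

NUM_BANKS = 16


def banks_along_second_index(leading_dim, num_banks=NUM_BANKS):
    """Distinct banks touched by a(i,1), a(i,2), ... for a fixed i."""
    return num_banks // gcd(leading_dim, num_banks)


def min_padding(n, wanted_banks=NUM_BANKS):
    """Smallest number of dummy elements added to the first index so that
    stepping along the second index reaches at least wanted_banks banks."""
    pad = 0
    while banks_along_second_index(n + pad) < wanted_banks:
        pad += 1
    return pad


if __name__ == "__main__":
    n = 1000  # illustrative array size
    pad = min_padding(n)
    print("banks without padding:", banks_along_second_index(n))
    print("dummy elements needed:", pad)
    print("banks with padding:", banks_along_second_index(n + pad))
```

In this model an array whose leading dimension is a multiple of 8 reaches only two banks along the second index, while one extra element makes the stride odd and spreads the accesses over all 16 banks. The measured behaviour also depends on how several access streams interleave, which the model ignores.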
The 247 MFLOPS achieved by the inner product Crout form is 82% of the uniprocessor peak performance (300 MFLOPS).

Figure 13. Experimental measurements for LU decomposition programs (performance in MFLOPS versus array size in number of elements, for the outer product form and the inner product Crout form).

6.3.3 Performance of Parallel LINPACK Benchmark. In solving the LU decomposition part of parallel LINPACK, a new method named double-blocked Gaussian elimination has been used [9]. This method uses two types of blocking, one for communication and another for calculation. It achieves high single-processor performance through longer loops and, at the same time, high parallel efficiency through optimized load balancing.

The LINPACK benchmark performance of the same three systems (for a 256 PU configuration) is shown in Table 4. The performance of the CRAY T3D and IBM SP2 is derived from the LINPACK benchmark report dated March 28. The SR2201 again outperforms the other two in terms of both peak performance and effective performance ratio.

Table 4. Performance of LINPACK benchmark on 256 PU system: peak performance of PU (MFLOPS), performance of benchmark (GFLOPS), and efficiency compared to peak for the SR2201, Cray T3D, and IBM SP2. (Most numeric entries are not preserved in this transcription; the surviving efficiency entries are 66% and 65%.)

7. Conclusion

In designing Hitachi's SR2201 massively parallel RISC computer, careful attention was paid both to the processing unit (PU) and to the network architecture in order to achieve high overall effective performance. Several features have been added to address the causes of performance degradation commonly found in conventional parallel processor systems:

1. The PU has a pseudo vector processing (PVP) feature for accelerating the performance of large-scale numerical applications. With PVP the PU loads data by prefetching to a special register bank, bypassing the cache. This solves the cache miss penalties that occur in large-scale numerical applications, allowing high-throughput memory access.

2. The memory system of the SR2201 has a 1.2-GB/s bandwidth. This supports the high throughput required by the PVP feature.

3. For inter-processor data transfer, the high performance of the memory system, the newly proposed remote DMA transfer protocol, and the hardware support for maintaining cache coherence provide efficient data transfer performance.

Due to the combined effect of all these features, the SR2201 showed high effective performance for processing large-scale numerical applications, as well as in inter-processor data transfer. For instance, the 1024 PU system of the SR2201 achieved 220.4 GFLOPS on the LINPACK benchmark, which corresponds to 72% of the peak performance.

References

[1] Yasuda, Y., Fujii, H., Tanaka, T., and Inagami, Y.: "Performance Evaluation of the Hyper Crossbar Network", Technical Report of IEICE, CPSY (1993) (in Japanese).
[2] Nakamura, H., Imori, H., Nakazawa, K., Boku, T., Nakata, I., Yamashita, Y., Wada, H., and Inagami, Y.: "A Scalar Architecture for Pseudo Vector Processing based on Slide-Windowed Registers", Proceedings of International Conference on Supercomputing (July, 1993).
[3] Tirumalai, P., Lee, M., and Schlansker, M.: "Parallelization Of Loops With Exits On Pipelined Architectures", Proceedings of Supercomputing '90 (Nov., 1990).
[4] Rau, R. B., Lee, M., Tirumalai, P. P., and Schlansker, S. M.: "Register Allocation for Software Pipelined Loops", Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation (June, 1992).
[5] Saito, K., Hashimoto, M., Sawamoto, H., Yamagata, R., Kumagai, T., Kamada, E., Matsubara, K., Isobe, T., Hotta, T., Nakano, T., Shimizu, T., and Nakazawa, K.: "A 150MHz Superscalar RISC Processor with Pseudo Vector Processing Feature", Proceedings Notebook for Hot Chips VII (Aug., 1995).
[6] Saini, S. and Bailey, H. D.: "RISC Processors and High Performance Computing", Supercomputing '95, Tutorial S5 (Dec., 1995).
[7] Numrich, W. R., Springer, L. P., and Peterson, C. J.: "Measurement of Communication Rates on the Cray T3D Interprocessor Network", HPCN Europe '94 (1994).
[8] Stunkel, B. C.: "The SP2 High-Performance Switch", IBM Systems Journal, Vol. 34, No. 2 (1995).
[9] Yamamoto, Y. and Ohkouchi, T.: "The Optimization of the Gaussian Elimination for Massively Parallel Processors", Proceedings of the JSPP '95 (1995) (in Japanese).
[10] Yasuda, Y., Fujii, H., Akashi, H., Inagami, Y., Tanaka, T., Nakagoshi, J., Wada, H., and Sumimoto, T.: "Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201", Proceedings of 11th International Parallel Processing Symposium (IPPS '97) (April, 1997).

Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201

Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201 Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201 Yoshiko Yasuda, Hiroaki Fujii, Hideya Akashi, Yasuhiro Inagami, Teruo Tanaka*,

More information

On - Line Path Delay Fault Testing of Omega MINs M. Bellos 1, E. Kalligeros 1, D. Nikolos 1,2 & H. T. Vergos 1,2

On - Line Path Delay Fault Testing of Omega MINs M. Bellos 1, E. Kalligeros 1, D. Nikolos 1,2 & H. T. Vergos 1,2 On - Line Path Delay Fault Testing of Omega MINs M. Bellos, E. Kalligeros, D. Nikolos,2 & H. T. Vergos,2 Dept. of Computer Engineering and Informatis 2 Computer Tehnology Institute University of Patras,

More information

System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications

System-Level Parallelism and Throughput Optimization in Designing Reconfigurable Computing Applications System-Level Parallelism and hroughput Optimization in Designing Reonfigurable Computing Appliations Esam El-Araby 1, Mohamed aher 1, Kris Gaj 2, arek El-Ghazawi 1, David Caliga 3, and Nikitas Alexandridis

More information

Pipelined Multipliers for Reconfigurable Hardware

Pipelined Multipliers for Reconfigurable Hardware Pipelined Multipliers for Reonfigurable Hardware Mithell J. Myjak and José G. Delgado-Frias Shool of Eletrial Engineering and Computer Siene, Washington State University Pullman, WA 99164-2752 USA {mmyjak,

More information

COSSIM An Integrated Solution to Address the Simulator Gap for Parallel Heterogeneous Systems

COSSIM An Integrated Solution to Address the Simulator Gap for Parallel Heterogeneous Systems COSSIM An Integrated Solution to Address the Simulator Gap for Parallel Heterogeneous Systems Andreas Brokalakis Synelixis Solutions Ltd, Greee brokalakis@synelixis.om Nikolaos Tampouratzis Teleommuniation

More information

COST PERFORMANCE ASPECTS OF CCD FAST AUXILIARY MEMORY

COST PERFORMANCE ASPECTS OF CCD FAST AUXILIARY MEMORY COST PERFORMANCE ASPECTS OF CCD FAST AUXILIARY MEMORY Dileep P, Bhondarkor Texas Instruments Inorporated Dallas, Texas ABSTRACT Charge oupled devies (CCD's) hove been mentioned as potential fast auxiliary

More information

A Dual-Hamiltonian-Path-Based Multicasting Strategy for Wormhole-Routed Star Graph Interconnection Networks

A Dual-Hamiltonian-Path-Based Multicasting Strategy for Wormhole-Routed Star Graph Interconnection Networks A Dual-Hamiltonian-Path-Based Multiasting Strategy for Wormhole-Routed Star Graph Interonnetion Networks Nen-Chung Wang Department of Information and Communiation Engineering Chaoyang University of Tehnology,

More information

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks

A Load-Balanced Clustering Protocol for Hierarchical Wireless Sensor Networks International Journal of Advanes in Computer Networks and Its Seurity IJCNS A Load-Balaned Clustering Protool for Hierarhial Wireless Sensor Networks Mehdi Tarhani, Yousef S. Kavian, Saman Siavoshi, Ali

More information

Outline: Software Design

Outline: Software Design Outline: Software Design. Goals History of software design ideas Design priniples Design methods Life belt or leg iron? (Budgen) Copyright Nany Leveson, Sept. 1999 A Little History... At first, struggling

More information

The Tofu Interconnect D

The Tofu Interconnect D 2018 IEEE International Conferene on Cluster Computing The Tofu Interonnet D Yuihiro Ajima, Takahiro Kawashima, Takayuki Okamoto, Naoyuki Shida, Kouihi Hirai, Toshiyuki Shimizu Next Generation Tehnial

More information

Multi-hop Fast Conflict Resolution Algorithm for Ad Hoc Networks

Multi-hop Fast Conflict Resolution Algorithm for Ad Hoc Networks Multi-hop Fast Conflit Resolution Algorithm for Ad Ho Networks Shengwei Wang 1, Jun Liu 2,*, Wei Cai 2, Minghao Yin 2, Lingyun Zhou 2, and Hui Hao 3 1 Power Emergeny Center, Sihuan Eletri Power Corporation,

More information

Learning Convention Propagation in BeerAdvocate Reviews from a etwork Perspective. Abstract

Learning Convention Propagation in BeerAdvocate Reviews from a etwork Perspective. Abstract CS 9 Projet Final Report: Learning Convention Propagation in BeerAdvoate Reviews from a etwork Perspetive Abstrat We look at the way onventions propagate between reviews on the BeerAdvoate dataset, and

More information

Acoustic Links. Maximizing Channel Utilization for Underwater

Acoustic Links. Maximizing Channel Utilization for Underwater Maximizing Channel Utilization for Underwater Aousti Links Albert F Hairris III Davide G. B. Meneghetti Adihele Zorzi Department of Information Engineering University of Padova, Italy Email: {harris,davide.meneghetti,zorzi}@dei.unipd.it

More information

Constructing Transaction Serialization Order for Incremental. Data Warehouse Refresh. Ming-Ling Lo and Hui-I Hsiao. IBM T. J. Watson Research Center

Constructing Transaction Serialization Order for Incremental. Data Warehouse Refresh. Ming-Ling Lo and Hui-I Hsiao. IBM T. J. Watson Research Center Construting Transation Serialization Order for Inremental Data Warehouse Refresh Ming-Ling Lo and Hui-I Hsiao IBM T. J. Watson Researh Center July 11, 1997 Abstrat In typial pratie of data warehouse, the

More information

Multi-Channel Wireless Networks: Capacity and Protocols

Multi-Channel Wireless Networks: Capacity and Protocols Multi-Channel Wireless Networks: Capaity and Protools Tehnial Report April 2005 Pradeep Kyasanur Dept. of Computer Siene, and Coordinated Siene Laboratory, University of Illinois at Urbana-Champaign Email:

More information

Accommodations of QoS DiffServ Over IP and MPLS Networks

Accommodations of QoS DiffServ Over IP and MPLS Networks Aommodations of QoS DiffServ Over IP and MPLS Networks Abdullah AlWehaibi, Anjali Agarwal, Mihael Kadoh and Ahmed ElHakeem Department of Eletrial and Computer Department de Genie Eletrique Engineering

More information

A DYNAMIC ACCESS CONTROL WITH BINARY KEY-PAIR

A DYNAMIC ACCESS CONTROL WITH BINARY KEY-PAIR Malaysian Journal of Computer Siene, Vol 10 No 1, June 1997, pp 36-41 A DYNAMIC ACCESS CONTROL WITH BINARY KEY-PAIR Md Rafiqul Islam, Harihodin Selamat and Mohd Noor Md Sap Faulty of Computer Siene and

More information

Announcements. Lecture Caching Issues for Multi-core Processors. Shared Vs. Private Caches for Small-scale Multi-core

Announcements. Lecture Caching Issues for Multi-core Processors. Shared Vs. Private Caches for Small-scale Multi-core Announements Your fous should be on the lass projet now Leture 17: Cahing Issues for Multi-ore Proessors This week: status update and meeting A short presentation on: projet desription (problem, importane,

More information

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425)

Automatic Physical Design Tuning: Workload as a Sequence Sanjay Agrawal Microsoft Research One Microsoft Way Redmond, WA, USA +1-(425) Automati Physial Design Tuning: Workload as a Sequene Sanjay Agrawal Mirosoft Researh One Mirosoft Way Redmond, WA, USA +1-(425) 75-357 sagrawal@mirosoft.om Eri Chu * Computer Sienes Department University

More information

Space- and Time-Efficient BDD Construction via Working Set Control

Space- and Time-Efficient BDD Construction via Working Set Control Spae- and Time-Effiient BDD Constrution via Working Set Control Bwolen Yang Yirng-An Chen Randal E. Bryant David R. O Hallaron Computer Siene Department Carnegie Mellon University Pittsburgh, PA 15213.

More information

Automatic Generation of Transaction-Level Models for Rapid Design Space Exploration

Automatic Generation of Transaction-Level Models for Rapid Design Space Exploration Automati Generation of Transation-Level Models for Rapid Design Spae Exploration Dongwan Shin, Andreas Gerstlauer, Junyu Peng, Rainer Dömer and Daniel D. Gajski Center for Embedded Computer Systems University

More information

A Novel Validity Index for Determination of the Optimal Number of Clusters

A Novel Validity Index for Determination of the Optimal Number of Clusters IEICE TRANS. INF. & SYST., VOL.E84 D, NO.2 FEBRUARY 2001 281 LETTER A Novel Validity Index for Determination of the Optimal Number of Clusters Do-Jong KIM, Yong-Woon PARK, and Dong-Jo PARK, Nonmembers

More information

SVC-DASH-M: Scalable Video Coding Dynamic Adaptive Streaming Over HTTP Using Multiple Connections

SVC-DASH-M: Scalable Video Coding Dynamic Adaptive Streaming Over HTTP Using Multiple Connections SVC-DASH-M: Salable Video Coding Dynami Adaptive Streaming Over HTTP Using Multiple Connetions Samar Ibrahim, Ahmed H. Zahran and Mahmoud H. Ismail Department of Eletronis and Eletrial Communiations, Faulty

More information

The Minimum Redundancy Maximum Relevance Approach to Building Sparse Support Vector Machines

The Minimum Redundancy Maximum Relevance Approach to Building Sparse Support Vector Machines The Minimum Redundany Maximum Relevane Approah to Building Sparse Support Vetor Mahines Xiaoxing Yang, Ke Tang, and Xin Yao, Nature Inspired Computation and Appliations Laboratory (NICAL), Shool of Computer

More information

Flow Demands Oriented Node Placement in Multi-Hop Wireless Networks

Flow Demands Oriented Node Placement in Multi-Hop Wireless Networks Flow Demands Oriented Node Plaement in Multi-Hop Wireless Networks Zimu Yuan Institute of Computing Tehnology, CAS, China {zimu.yuan}@gmail.om arxiv:153.8396v1 [s.ni] 29 Mar 215 Abstrat In multi-hop wireless

More information

HEXA: Compact Data Structures for Faster Packet Processing

HEXA: Compact Data Structures for Faster Packet Processing Washington University in St. Louis Washington University Open Sholarship All Computer Siene and Engineering Researh Computer Siene and Engineering Report Number: 27-26 27 HEXA: Compat Data Strutures for

More information

What are Cycle-Stealing Systems Good For? A Detailed Performance Model Case Study

What are Cycle-Stealing Systems Good For? A Detailed Performance Model Case Study What are Cyle-Stealing Systems Good For? A Detailed Performane Model Case Study Wayne Kelly and Jiro Sumitomo Queensland University of Tehnology, Australia {w.kelly, j.sumitomo}@qut.edu.au Abstrat The

More information

Folding. Hardware Mapped vs. Time multiplexed. Folding by N (N=folding factor) Node A. Unfolding by J A 1 A J-1. Time multiplexed/microcoded

Folding. Hardware Mapped vs. Time multiplexed. Folding by N (N=folding factor) Node A. Unfolding by J A 1 A J-1. Time multiplexed/microcoded Folding is verse of Unfolding Node A A Folding by N (N=folding fator) Folding A Unfolding by J A A J- Hardware Mapped vs. Time multiplexed l Hardware Mapped vs. Time multiplexed/mirooded FI : y x(n) h

More information

Cluster-based Cooperative Communication with Network Coding in Wireless Networks

Cluster-based Cooperative Communication with Network Coding in Wireless Networks Cluster-based Cooperative Communiation with Network Coding in Wireless Networks Zygmunt J. Haas Shool of Eletrial and Computer Engineering Cornell University Ithaa, NY 4850, U.S.A. Email: haas@ee.ornell.edu

More information

Direct-Mapped Caches

Direct-Mapped Caches A Case for Diret-Mapped Cahes Mark D. Hill University of Wisonsin ahe is a small, fast buffer in whih a system keeps those parts, of the ontents of a larger, slower memory that are likely to be used soon.

More information

Partial Character Decoding for Improved Regular Expression Matching in FPGAs

Partial Character Decoding for Improved Regular Expression Matching in FPGAs Partial Charater Deoding for Improved Regular Expression Mathing in FPGAs Peter Sutton Shool of Information Tehnology and Eletrial Engineering The University of Queensland Brisbane, Queensland, 4072, Australia

More information

Gray Codes for Reflectable Languages

Gray Codes for Reflectable Languages Gray Codes for Refletable Languages Yue Li Joe Sawada Marh 8, 2008 Abstrat We lassify a type of language alled a refletable language. We then develop a generi algorithm that an be used to list all strings

More information

Smooth Trajectory Planning Along Bezier Curve for Mobile Robots with Velocity Constraints

Smooth Trajectory Planning Along Bezier Curve for Mobile Robots with Velocity Constraints Smooth Trajetory Planning Along Bezier Curve for Mobile Robots with Veloity Constraints Gil Jin Yang and Byoung Wook Choi Department of Eletrial and Information Engineering Seoul National University of

More information

DECT Module Installation Manual

DECT Module Installation Manual DECT Module Installation Manual Rev. 2.0 This manual desribes the DECT module registration method to the HUB and fan airflow settings. In order for the HUB to ommuniate with a ompatible fan, the DECT module

More information

This fact makes it difficult to evaluate the cost function to be minimized

This fact makes it difficult to evaluate the cost function to be minimized RSOURC LLOCTION N SSINMNT In the resoure alloation step the amount of resoures required to exeute the different types of proesses is determined. We will refer to the time interval during whih a proess

More information

Parallelizing Frequent Web Access Pattern Mining with Partial Enumeration for High Speedup

Parallelizing Frequent Web Access Pattern Mining with Partial Enumeration for High Speedup Parallelizing Frequent Web Aess Pattern Mining with Partial Enumeration for High Peiyi Tang Markus P. Turkia Department of Computer Siene Department of Computer Siene University of Arkansas at Little Rok

More information

Extracting Partition Statistics from Semistructured Data

Extracting Partition Statistics from Semistructured Data Extrating Partition Statistis from Semistrutured Data John N. Wilson Rihard Gourlay Robert Japp Mathias Neumüller Department of Computer and Information Sienes University of Strathlyde, Glasgow, UK {jnw,rsg,rpj,mathias}@is.strath.a.uk

More information

Establishing Secure Ethernet LANs Using Intelligent Switching Hubs in Internet Environments

Establishing Secure Ethernet LANs Using Intelligent Switching Hubs in Internet Environments Establishing Seure Ethernet LANs Using Intelligent Swithing Hubs in Internet Environments WOEIJIUNN TSAUR AND SHIJINN HORNG Department of Eletrial Engineering, National Taiwan University of Siene and Tehnology,

More information

Zippy - A coarse-grained reconfigurable array with support for hardware virtualization

Zippy - A coarse-grained reconfigurable array with support for hardware virtualization Zippy - A oarse-grained reonfigurable array with support for hardware virtualization Christian Plessl Computer Engineering and Networks Lab ETH Zürih, Switzerland plessl@tik.ee.ethz.h Maro Platzner Department

More information

Algorithms, Mechanisms and Procedures for the Computer-aided Project Generation System

Algorithms, Mechanisms and Procedures for the Computer-aided Project Generation System Algorithms, Mehanisms and Proedures for the Computer-aided Projet Generation System Anton O. Butko 1*, Aleksandr P. Briukhovetskii 2, Dmitry E. Grigoriev 2# and Konstantin S. Kalashnikov 3 1 Department

More information

A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks

A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks A Partial Sorting Algorithm in Multi-Hop Wireless Sensor Networks Abouberine Ould Cheikhna Department of Computer Siene University of Piardie Jules Verne 80039 Amiens Frane Ould.heikhna.abouberine @u-piardie.fr

More information

Performance Improvement of TCP on Wireless Cellular Networks by Adaptive FEC Combined with Explicit Loss Notification

Performance Improvement of TCP on Wireless Cellular Networks by Adaptive FEC Combined with Explicit Loss Notification erformane Improvement of TC on Wireless Cellular Networks by Adaptive Combined with Expliit Loss tifiation Masahiro Miyoshi, Masashi Sugano, Masayuki Murata Department of Infomatis and Mathematial Siene,

More information

Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems

Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems Methods for Multi-Dimensional Robustness Optimization in Complex Embedded Systems Arne Hamann, Razvan Rau, Rolf Ernst Institute of Computer and Communiation Network Engineering Tehnial University of Braunshweig,

More information

SSD Based First Layer File System for the Next Generation Super-computer

SSD Based First Layer File System for the Next Generation Super-computer SSD Based First Layer File System for the Next Generation Super-omputer Shinji Sumimoto, Ph.D. Next Generation Tehnial Computing Unit FUJITSU LIMITED Sept. 24 th, 2018 0 Outline of This Talk A64FX: High

More information

Algorithms for External Memory Lecture 6 Graph Algorithms - Weighted List Ranking

Algorithms for External Memory Lecture 6 Graph Algorithms - Weighted List Ranking Algorithms for External Memory Leture 6 Graph Algorithms - Weighted List Ranking Leturer: Nodari Sithinava Sribe: Andi Hellmund, Simon Ohsenreither 1 Introdution & Motivation After talking about I/O-effiient

More information

The AMDREL Project in Retrospective

The AMDREL Project in Retrospective The AMDREL Projet in Retrospetive K. Siozios 1, G. Koutroumpezis 1, K. Tatas 1, N. Vassiliadis 2, V. Kalenteridis 2, H. Pournara 2, I. Pappas 2, D. Soudris 1, S. Nikolaidis 2, S. Siskos 2, and A. Thanailakis

More information

Approximate logic synthesis for error tolerant applications

Approximate logic synthesis for error tolerant applications Approximate logi synthesis for error tolerant appliations Doohul Shin and Sandeep K. Gupta Eletrial Engineering Department, University of Southern California, Los Angeles, CA 989 {doohuls, sandeep}@us.edu

More information

DECODING OF ARRAY LDPC CODES USING ON-THE FLY COMPUTATION Kiran Gunnam, Weihuang Wang, Euncheol Kim, Gwan Choi, Mark Yeary *

DECODING OF ARRAY LDPC CODES USING ON-THE FLY COMPUTATION Kiran Gunnam, Weihuang Wang, Euncheol Kim, Gwan Choi, Mark Yeary * DECODING OF ARRAY LDPC CODES USING ON-THE FLY COMPUTATION Kiran Gunnam, Weihuang Wang, Eunheol Kim, Gwan Choi, Mark Yeary * Dept. of Eletrial Engineering, Texas A&M University, College Station, TX-77840

More information

The recursive decoupling method for solving tridiagonal linear systems

The recursive decoupling method for solving tridiagonal linear systems Loughborough University Institutional Repository The reursive deoupling method for solving tridiagonal linear systems This item was submitted to Loughborough University's Institutional Repository by the/an

More information

Implementing Load-Balanced Switches With Fat-Tree Networks

Implementing Load-Balanced Switches With Fat-Tree Networks Implementing Load-Balaned Swithes With Fat-Tree Networks Hung-Shih Chueh, Ching-Min Lien, Cheng-Shang Chang, Jay Cheng, and Duan-Shin Lee Department of Eletrial Engineering & Institute of Communiations

More information

Cross-layer Resource Allocation on Broadband Power Line Based on Novel QoS-priority Scheduling Function in MAC Layer

Cross-layer Resource Allocation on Broadband Power Line Based on Novel QoS-priority Scheduling Function in MAC Layer Communiations and Networ, 2013, 5, 69-73 http://dx.doi.org/10.4236/n.2013.53b2014 Published Online September 2013 (http://www.sirp.org/journal/n) Cross-layer Resoure Alloation on Broadband Power Line Based

More information

Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing

Reevaluating the overhead of data preparation for asymmetric multicore system on graphics processing KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS VOL. 10, NO. 7, Jul. 2016 3231 Copyright 2016 KSII Reevaluating the overhead of data preparation for asymmetri multiore system on graphis proessing

More information

Facility Location: Distributed Approximation

Facility Location: Distributed Approximation Faility Loation: Distributed Approximation Thomas Mosibroda Roger Wattenhofer Distributed Computing Group PODC 2005 Where to plae ahes in the Internet? A distributed appliation that has to dynamially plae

More information

Design of High Speed Mac Unit

Design of High Speed Mac Unit Design of High Speed Ma Unit 1 Harish Babu N, 2 Rajeev Pankaj N 1 PG Student, 2 Assistant professor Shools of Eletronis Engineering, VIT University, Vellore -632014, TamilNadu, India. 1 harishharsha72@gmail.om,

More information

13.1 Numerical Evaluation of Integrals Over One Dimension

13.1 Numerical Evaluation of Integrals Over One Dimension 13.1 Numerial Evaluation of Integrals Over One Dimension A. Purpose This olletion of subprograms estimates the value of the integral b a f(x) dx where the integrand f(x) and the limits a and b are supplied

More information

Computing Pool: a Simplified and Practical Computational Grid Model

Computing Pool: a Simplified and Practical Computational Grid Model Computing Pool: a Simplified and Pratial Computational Grid Model Peng Liu, Yao Shi, San-li Li Institute of High Performane Computing, Department of Computer Siene and Tehnology, Tsinghua University, Beijing,

More information

Exploring the Commonality in Feature Modeling Notations

Exploring the Commonality in Feature Modeling Notations Exploring the Commonality in Feature Modeling Notations Miloslav ŠÍPKA Slovak University of Tehnology Faulty of Informatis and Information Tehnologies Ilkovičova 3, 842 16 Bratislava, Slovakia miloslav.sipka@gmail.om

More information

Analysis of input and output configurations for use in four-valued CCD programmable logic arrays

Analysis of input and output configurations for use in four-valued CCD programmable logic arrays nalysis of input and output onfigurations for use in four-valued D programmable logi arrays J.T. utler H.G. Kerkhoff ndexing terms: Logi, iruit theory and design, harge-oupled devies bstrat: s in binary,

More information

Uplink Channel Allocation Scheme and QoS Management Mechanism for Cognitive Cellular- Femtocell Networks

Uplink Channel Allocation Scheme and QoS Management Mechanism for Cognitive Cellular- Femtocell Networks 62 Uplink Channel Alloation Sheme and QoS Management Mehanism for Cognitive Cellular- Femtoell Networks Kien Du Nguyen 1, Hoang Nam Nguyen 1, Hiroaki Morino 2 and Iwao Sasase 3 1 University of Engineering

More information

RAC 2 E: Novel Rendezvous Protocol for Asynchronous Cognitive Radios in Cooperative Environments

RAC 2 E: Novel Rendezvous Protocol for Asynchronous Cognitive Radios in Cooperative Environments 21st Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communiations 1 RAC 2 E: Novel Rendezvous Protool for Asynhronous Cognitive Radios in Cooperative Environments Valentina Pavlovska,

More information

Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA

Post-K Supercomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA Post-K Superomputer with Fujitsu's Original CPU, A64FX Powered by Arm ISA Toshiyuki Shimizu Nov. 15th, 2018 Post-K is under development, information in these slides is subjet to hange without notie 0 Agenda

More information

Design of a Parallel Vector Access Unit for SDRAM Memory Systems

Design of a Parallel Vector Access Unit for SDRAM Memory Systems Design of a Parallel Vetor Aess Unit for SDRAM Memory Systems Binu K. Mathew, Sally A. MKee, John B. Carter, Al Davis Department of Computer Siene University of Utah Salt Lake City, UT 84112 mbinu sam

More information

Reduced-Complexity Column-Layered Decoding and. Implementation for LDPC Codes

Reduced-Complexity Column-Layered Decoding and. Implementation for LDPC Codes Redued-Complexity Column-Layered Deoding and Implementation for LDPC Codes Zhiqiang Cui 1, Zhongfeng Wang 2, Senior Member, IEEE, and Xinmiao Zhang 3 1 Qualomm In., San Diego, CA 92121, USA 2 Broadom Corp.,

More information

Improved flooding of broadcast messages using extended multipoint relaying

Improved flooding of broadcast messages using extended multipoint relaying Improved flooding of broadast messages using extended multipoint relaying Pere Montolio Aranda a, Joaquin Garia-Alfaro a,b, David Megías a a Universitat Oberta de Catalunya, Estudis d Informàtia, Mulimèdia

More information

MATH STUDENT BOOK. 12th Grade Unit 6

MATH STUDENT BOOK. 12th Grade Unit 6 MATH STUDENT BOOK 12th Grade Unit 6 Unit 6 TRIGONOMETRIC APPLICATIONS MATH 1206 TRIGONOMETRIC APPLICATIONS INTRODUCTION 3 1. TRIGONOMETRY OF OBLIQUE TRIANGLES 5 LAW OF SINES 5 AMBIGUITY AND AREA OF A TRIANGLE

More information

Scheduling Multiple Independent Hard-Real-Time Jobs on a Heterogeneous Multiprocessor

Scheduling Multiple Independent Hard-Real-Time Jobs on a Heterogeneous Multiprocessor Sheduling Multiple Independent Hard-Real-Time Jobs on a Heterogeneous Multiproessor Orlando Moreira NXP Semiondutors Researh Eindhoven, Netherlands orlando.moreira@nxp.om Frederio Valente Universidade

More information

mahines. HBSP enhanes the appliability of the BSP model by inorporating parameters that reet the relative speeds of the heterogeneous omputing omponen

mahines. HBSP enhanes the appliability of the BSP model by inorporating parameters that reet the relative speeds of the heterogeneous omputing omponen The Heterogeneous Bulk Synhronous Parallel Model Tiani L. Williams and Rebea J. Parsons Shool of Computer Siene University of Central Florida Orlando, FL 32816-2362 fwilliams,rebeag@s.uf.edu Abstrat. Trends

More information

CleanUp: Improving Quadrilateral Finite Element Meshes

CleanUp: Improving Quadrilateral Finite Element Meshes CleanUp: Improving Quadrilateral Finite Element Meshes Paul Kinney MD-10 ECC P.O. Box 203 Ford Motor Company Dearborn, MI. 8121 (313) 28-1228 pkinney@ford.om Abstrat: Unless an all quadrilateral (quad)

More information

Detection and Recognition of Non-Occluded Objects using Signature Map

Detection and Recognition of Non-Occluded Objects using Signature Map 6th WSEAS International Conferene on CIRCUITS, SYSTEMS, ELECTRONICS,CONTROL & SIGNAL PROCESSING, Cairo, Egypt, De 9-31, 007 65 Detetion and Reognition of Non-Oluded Objets using Signature Map Sangbum Park,

More information

Abstract. Key Words: Image Filters, Fuzzy Filters, Order Statistics Filters, Rank Ordered Mean Filters, Channel Noise. 1.

Abstract. Key Words: Image Filters, Fuzzy Filters, Order Statistics Filters, Rank Ordered Mean Filters, Channel Noise. 1. Fuzzy Weighted Rank Ordered Mean (FWROM) Filters for Mixed Noise Suppression from Images S. Meher, G. Panda, B. Majhi 3, M.R. Meher 4,,4 Department of Eletronis and I.E., National Institute of Tehnology,

More information

Chapter 2: Introduction to Maple V

Chapter 2: Introduction to Maple V Chapter 2: Introdution to Maple V 2-1 Working with Maple Worksheets Try It! (p. 15) Start a Maple session with an empty worksheet. The name of the worksheet should be Untitled (1). Use one of the standard

More information

An Evaluation of Automatic and Interactive Parallel Programming Tools

An Evaluation of Automatic and Interactive Parallel Programming Tools An Evaluation of Automati and Interative Parallel Programming Tools Doreen Y Cheng Computer Siene Co NASA Ames Researh Center MS 258-6 Moffett Field, CA 9435 Douglas M Pase Formerly at NASA (CSC) Cray

More information

Using Augmented Measurements to Improve the Convergence of ICP

Using Augmented Measurements to Improve the Convergence of ICP Using Augmented Measurements to Improve the onvergene of IP Jaopo Serafin, Giorgio Grisetti Dept. of omputer, ontrol and Management Engineering, Sapienza University of Rome, Via Ariosto 25, I-0085, Rome,

More information

Z8530 Programming Guide

Z8530 Programming Guide Z8530 Programming Guide Alan Cox alan@redhat.om Z8530 Programming Guide by Alan Cox Copyright 2000 by Alan Cox This doumentation is free software; you an redistribute it and/or modify it under the terms

More information

Automated System for the Study of Environmental Loads Applied to Production Risers Dustin M. Brandt 1, Celso K. Morooka 2, Ivan R.

Automated System for the Study of Environmental Loads Applied to Production Risers Dustin M. Brandt 1, Celso K. Morooka 2, Ivan R. EngOpt 2008 - International Conferene on Engineering Optimization Rio de Janeiro, Brazil, 01-05 June 2008. Automated System for the Study of Environmental Loads Applied to Prodution Risers Dustin M. Brandt

More information

User-level Fairness Delivered: Network Resource Allocation for Adaptive Video Streaming

User-level Fairness Delivered: Network Resource Allocation for Adaptive Video Streaming User-level Fairness Delivered: Network Resoure Alloation for Adaptive Video Streaming Mu Mu, Steven Simpson, Arsham Farshad, Qiang Ni, Niholas Rae Shool of Computing and Communiations, Lanaster University

More information

Uncovering Hidden Loop Level Parallelism in Sequential Applications

Uncovering Hidden Loop Level Parallelism in Sequential Applications Unovering Hidden Loop Level Parallelism in Sequential Appliations Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Sott Mahlke Advaned Computer Arhiteture Laboratory University of Mihigan, Ann Arbor,

More information

3-D IMAGE MODELS AND COMPRESSION - SYNTHETIC HYBRID OR NATURAL FIT?

3-D IMAGE MODELS AND COMPRESSION - SYNTHETIC HYBRID OR NATURAL FIT? 3-D IMAGE MODELS AND COMPRESSION - SYNTHETIC HYBRID OR NATURAL FIT? Bernd Girod, Peter Eisert, Marus Magnor, Ekehard Steinbah, Thomas Wiegand Te {girod eommuniations Laboratory, University of Erlangen-Nuremberg

More information

Episode 12: TCP/IP & UbiComp

Episode 12: TCP/IP & UbiComp Episode 12: TCP/IP & UbiComp Hannes Frey and Peter Sturm University of Trier Outline Introdution Mobile IP TCP and Mobility Conlusion Referenes [1] James D. Solomon, Mobile IP: The Unplugged, Prentie Hall,

More information

- 1 - S 21. Directory-based Administration of Virtual Private Networks: Policy & Configuration. Charles A Kunzinger.

- 1 - S 21. Directory-based Administration of Virtual Private Networks: Policy & Configuration. Charles A Kunzinger. - 1 - S 21 Diretory-based Administration of Virtual Private Networks: Poliy & Configuration Charles A Kunzinger kunzinge@us.ibm.om - 2 - Clik here Agenda to type page title What is a VPN? What is VPN Poliy?

More information

Performance Benchmarks for an Interactive Video-on-Demand System

Performance Benchmarks for an Interactive Video-on-Demand System Performane Benhmarks for an Interative Video-on-Demand System. Guo,P.G.Taylor,E.W.M.Wong,S.Chan,M.Zukerman andk.s.tang ARC Speial Researh Centre for Ultra-Broadband Information Networks (CUBIN) Department


A {k, n}-secret Sharing Scheme for Color Images

Rastislav Lukac, Konstantinos N. Plataniotis, and Anastasios N. Venetsanopoulos. The Edward S. Rogers Sr. Dept. of Electrical and Computer Engineering, University


COMP 181. Prelude. Intermediate representations. Today. Types of IRs. High-level IR. Intermediate representations and code generation

Prelude. COMP 181: Intermediate representations and code generation. November, 009. What is this device? Large Hadron Collider. What is a hadron? Subatomic particle made up of quarks bound by the strong force. What


Parallelization and Performance of 3D Ultrasound Imaging Beamforming Algorithms on Modern Clusters

F. Zhang, A. Bilas, A. Dhanantwari, K.N. Plataniotis, R. Abiprojo, and S. Stergiopoulos. Dept. of Electrical


References. December 1992, pp. 71-81. pp. 457-467. Magazine, June for very large high throughput database systems,"

the overall working time for other applications. In case data filtering was the only application being run, then using distributed indexing, we can serve 00 times as many requests. 6 Conclusion. We have explored


DoS-Resistant Broadcast Authentication Protocol with Low End-to-end Delay

Ying Huang, Wenbo He and Klara Nahrstedt, {huang, wenbohe, klara}@cs.uiuc.edu. Department of Computer Science, University of Illinois at


Allocating Rotating Registers by Scheduling

Hongbo Rong, Hyunchul Park, Cheng Wang, Youfeng Wu. Programming Systems Lab, Intel Labs. {hongbo.rong,hyunchul.park,cheng.c.wang,youfeng.wu}@intel.com. ABSTRACT: A rotating


Reducing Runtime Complexity of Long-Running Application Services via Dynamic Profiling and Dynamic Bytecode Adaptation for Improved Quality of Service

John Bergin, Performance Engineering Laboratory, University


An Efficient and Scalable Approach to CNN Queries in a Road Network

Hyung-Ju Cho, Chin-Wan Chung. Dept. of Electrical Engineering & Computer Science, Korea Advanced Institute of Science and Technology, 373- Kusong-dong,


Cluster-Based Cumulative Ensembles

Hanan G. Ayad and Mohamed S. Kamel. Pattern Analysis and Machine Intelligence Lab, Electrical and Computer Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1,


Effecting Parallel Graph Eigensolvers Through Library Composition

Alex Breuer, Peter Gottschling, Douglas Gregor, Andrew Lumsdaine. Open Systems Laboratory, Indiana University, Bloomington, IN 47405. {abreuer,pgottsch,dgregor,lums}@osl.iu.edu


High-level synthesis under I/O Timing and Memory constraints

Philippe Coussy, Gwenolé Corre, Pierre Bomel, Eric Senn, Eric Martin. To cite this version: Philippe Coussy, Gwenolé Corre, Pierre Bomel, Eric Senn,


Performance of Histogram-Based Skin Colour Segmentation for Arms Detection in Human Motion Analysis Application

World Academy of Science, Engineering and Technology 8 009. Rosalyn R. Porle, Ali Chekima, Farrah


Intra- and Inter-Stream Synchronisation for Stored Multimedia Streams

IEEE International Conference on Multimedia Computing & Systems, June 17-23, 1996, in Hiroshima, Japan, p 372-381. Ernst Biersack, Werner


A Multi-Head Clustering Algorithm in Vehicular Ad Hoc Networks

International Journal of Computer Theory and Engineering, Vol. 5, No. 2, April 213. Shou-Chih Lo, Yi-Jen Lin, and Jhih-Siao Gao. Abstract: Clustering


Tackling IPv6 Address Scalability from the Root

Mei Wang, Ashish Goel, Balaji Prabhakar. Stanford University. {wmei, ashishg, balaji}@stanford.edu. ABSTRACT: Internet address allocation schemes have a huge impact


1. Introduction. 2. The Probable Stope Algorithm

1. Introduction. Optimization in underground mine design has received less attention than that in open pit mines. This is mostly due to the diversity of underground mining methods and complexity of underground


Alleviating DFT cost using testability driven HLS

M.L. Flottes, R. Pires, B. Rouzeyre. Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, U.M. CNRS 5506, 6 rue Ada, 34392 Montpellier


Staircase Join: Teach a Relational DBMS to Watch its (Axis) Steps

Torsten Grust, Maurice van Keulen, Jens Teubner. University of Konstanz, Department of Computer and Information Science, P.O. Box D 88, 78457 Konstanz,


BSPLND, A B-Spline N-Dimensional Package for Scattered Data Interpolation

Michael P. Weis, Tracker Business Systems, 85 Terminal Drive, Suite Richland, WA 995-59-946-544, mike@vidian.net. Robert R. Lewis, Washington
