Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System
Hiroaki Fujii, Yoshiko Yasuda, Hideya Akashi, Yasuhiro Inagami, Makoto Koga*, Osamu Ishihara*, Masamori Kashiyama*, Hideo Wada*, and Tsutomu Sumimoto*

Central Research Laboratory, Hitachi Ltd., 1-280, Higashi-Koigakubo, Kokubunji, Tokyo 185, Japan
*General Purpose Computer Division, Hitachi Ltd., 1, Horiyamashita, Hadano, Kanagawa, Japan

Abstract

RISC-based Massively Parallel Processors (MPPs) often show low efficiency in real-world applications because of the cache miss penalty, insufficient throughput of the memory system, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and 600 GFLOPS peak performance, overcomes these problems by introducing three novel features. First, its processor, the 150 MHz HARP-1E, solves the cache miss penalty by "pseudo vector processing" (PVP). In PVP, data is loaded by prefetching to a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory system bottleneck. Third, inter-processor communication achieves high performance on the three-dimensional crossbar network, using a "remote DMA transfer" protocol and hardware-based cache coherency. As a result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors in the LINPACK benchmark, which is almost 72% of the peak performance.

1. Introduction

The Hitachi SR2201 is a newly designed massively parallel processor (MPP) computer system that was introduced to the supercomputing market in March [...]. Up to 2048 RISC processors can be connected via a high-speed three-dimensional (3D) crossbar network [1], [10]. Each processor, running at a clock frequency of 150 MHz, has a peak performance of 300 MFLOPS, giving the SR2201 a peak performance of 600 GFLOPS. The main memory for one processing node is up to 1 GB with 1 MB of secondary cache. The 3D crossbar network can transfer data at 300 MB/s over each link.
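The headline figures can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, taking the per-processor peak as 300 MFLOPS at 150 MHz and the LINPACK result as 220.4 GFLOPS on 1024 processors; the function names are illustrative and not part of any Hitachi software.

```python
# Figures quoted in the abstract and introduction, treated here as given inputs.
PEAK_MFLOPS_PER_PU = 300   # per-processor peak at 150 MHz (2 flops per cycle)
MAX_PUS = 2048             # largest configuration
LINPACK_PUS = 1024         # configuration used for the LINPACK run
LINPACK_GFLOPS = 220.4     # measured LINPACK result


def peak_gflops(num_pus, mflops_per_pu=PEAK_MFLOPS_PER_PU):
    """Aggregate peak performance in GFLOPS for num_pus processors."""
    return num_pus * mflops_per_pu / 1000.0


def efficiency(measured_gflops, num_pus):
    """Measured performance as a fraction of the aggregate peak."""
    return measured_gflops / peak_gflops(num_pus)


if __name__ == "__main__":
    print(peak_gflops(MAX_PUS))                                    # 614.4
    print(round(100 * efficiency(LINPACK_GFLOPS, LINPACK_PUS), 1)) # 71.7
```

The 2048-processor peak works out to 614.4 GFLOPS, which the text rounds to 600 GFLOPS, and the LINPACK run reaches about 71.7% of the 1024-processor peak, in line with the "almost 72%" quoted above.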
One of the main design targets of the SR2201 is to solve the low effective performance problem often seen in MPPs. The main causes of this performance degradation (cache miss penalty, insufficient throughput of the memory system, and poor inter-processor communication performance) and the solutions adopted in the SR2201 are discussed below.

Cache memory has been introduced in parallel processor systems built around RISC processors in an attempt to resolve the speed gap between the main memory and the CPU. However, this architecture fails to fulfill this objective when an application program needs to access a large portion of memory that cannot fit into the cache, because such accesses inevitably suffer cache misses, which lead to a heavy loss in effective performance. Conventional cache-based computers are thus likely to show decreased performance in such cases, which are often observed in large-scale numerical applications.

To eliminate the cache miss penalty and improve performance, the SR2201 is equipped with a novel mechanism in its RISC processors, called the "preload" operation [2]. This operation is a non-blocking direct main-memory load that bypasses the cache memory. Since the target of the preload operation is a 128-register bank, preloads can be issued early enough that the fetched value arrives before it is used by other instructions in the program. The preload operation, together with the superscalar architecture and compiler-provided code scheduling similar to software pipelining [3], [4], solves the cache miss penalty and improves performance. We call the combination of these techniques "pseudo vector processing" (PVP) [2].

Insufficient memory throughput, another cause of performance degradation, becomes critical in the case of PVP. Since PVP issues fetch operations to the main memory almost every machine cycle, it is especially important that the main memory supports a sustained high bandwidth.
To this end, the memory system of the SR2201 is composed of a multi-bank memory that operates like a pipeline.

The effective performance of MPPs can also be degraded by low performance of the inter-processor network, due to narrow bandwidth, poor memory system performance, and high software overhead. The memory system features described above not only help PVP to work effectively, but also contribute to increased throughput in the inter-processor network. In addition, to achieve high performance in inter-processor communication, the SR2201 eliminates the software overhead by using a newly developed transfer protocol, called "remote DMA transfer", along with hardware-based cache coherency.

The rest of this paper is organized as follows. Section 2 gives an architectural overview of the SR2201. In Section 3, the pseudo vector processing feature is described in detail. Section 4 describes issues on inter-processor data transfer. Section 5 describes the memory system. In Section 6, some performance evaluation results are shown. Section 7 concludes the paper with some remarks.

2. System Overview of the SR2201

2.1 Overall Organization

Figure 1 shows the conceptual system configuration of the SR2201. The SR2201 uses a multi-dimensional crossbar network to connect the processors. For example, a 2176 processor element (PE) system of the SR2201 uses a 3D crossbar network. The 2176 PEs are arranged in an 8x17x16 lattice, and the PEs arranged along each dimension (x, y, or z) are connected by a common crossbar switch. There are two types of PEs: processing units (PUs) and I/O units (IOUs). The PUs perform computation and the IOUs mainly control the I/O processes. The 2176 PE system has 2048 PUs and 128 IOUs. One of the IOUs is a Supervisory IOU (SIOU), which also performs system management.

Figure 1. Conceptual system configuration of the SR2201. (Labels: LAN, I/O Unit (IOU), Supervisory IOU (SIOU), Processing Unit (PU), crossbar switch, hard disk.)

2.2 Organization of Processing Unit

Figure 2 shows the organization of a PU. Each PU has seven components: an instruction processor (IP), a storage controller (SC), a network interface adapter (NIA), a memory controller for addresses (MCA), memory controllers for data (MCDs), main storage (MS), and secondary cache. The HARP-1E RISC processor [5] is used as the IP. The HARP-1E is based on the PA-RISC 1.1 architecture. It runs at 150 MHz and can operate at up to 300 MFLOPS. The NIA connects each PU to three crossbar switches. It handles the sending and receiving of data between processors, and also handles the routing of data through the network.

Figure 2. Organization of a PU. (Labels: memory controllers (MCA/MCDs), main storage, processing unit, connections to 3D crossbar network.)
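Because PEs that differ in only one coordinate share a common crossbar, a message needs at most one crossbar traversal per coordinate in which source and destination differ. The following sketch illustrates this on the 8x17x16 lattice. It assumes coordinates are corrected in x, y, z order, which is a choice made for the illustration and not a description of the routing algorithm actually implemented in the NIA; `route` and `hops` are hypothetical names.

```python
# Lattice dimensions of the 2176-PE example (8 x 17 x 16).
DIMS = (8, 17, 16)


def route(src, dst, dims=DIMS):
    """Return the PEs visited from src to dst, correcting one coordinate per
    crossbar traversal in x, y, z order (the order is an assumption)."""
    for point in (src, dst):
        if len(point) != len(dims) or any(not 0 <= c < n for c, n in zip(point, dims)):
            raise ValueError(f"{point} is outside the {dims} lattice")
    path = [tuple(src)]
    current = list(src)
    for axis in range(len(dims)):
        if current[axis] != dst[axis]:
            current[axis] = dst[axis]      # one hop through this axis's crossbar
            path.append(tuple(current))
    return path


def hops(src, dst, dims=DIMS):
    """Number of crossbar traversals between two PEs."""
    return len(route(src, dst, dims)) - 1


if __name__ == "__main__":
    print(route((0, 0, 0), (7, 16, 15)))   # three traversals: x, then y, then z
    print(hops((1, 2, 3), (1, 2, 9)))      # same x and y, so a single traversal
```

With three dimensions, no pair of PEs is more than three traversals apart, regardless of how far apart their coordinates are.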
When sending and receiving data, the NIA reads and writes the data from or to the MS directly through the SC by direct memory access (DMA). The SC is connected to the IP, NIA, and memory controllers (MCA and MCDs). It processes MS access requests coming from the IP and NIA and passes the requests on to the memory controllers. The MCA and MCDs manage the address information and the data of the MS access requests, respectively. The IOUs and SIOU have the same organization as the PUs, but also have an I/O bus manager connected to the SC, enabling them to connect to I/O devices.

3. Pseudo Vector Processing Feature

The main target applications of the SR2201 are large-scale numerical applications. These applications need a large amount of data space which cannot fit into the cache, resulting in a high number of cache misses if run on a normal RISC processor system. This problem is overcome in the SR2201 by using a non-blocking direct main-memory load feature called preload. This feature does not use the cache, so there is no cache miss penalty. However, it needs optimized code scheduling to hide the memory access latency.

Code scheduling in the SR2201 IP is based on the software pipelining technique [3], [4], and achieves highly effective computational performance when used with the preload and superscalar processing features. The IP issues a preload and a floating-point instruction in parallel every cycle when it executes code optimized by software pipelining. We call this feature pseudo vector processing (PVP) [2].

In PVP, the instructions of program segments such as loop iterations are divided into two categories: preloads, which take a long latency to complete, and other instructions (calculations and stores). Using superscalar processing, preloads for the data to be used in a segment are issued continuously ahead of the other instructions of that segment, early enough to hide the latency, and in parallel with the calculations of a different segment whose preloads have already completed.
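The effect of issuing preloads ahead of their use can be illustrated with a simple cycle-count model. This is a sketch under stated assumptions, not a model of the HARP-1E pipeline: one preload and one calculation can be issued per cycle, every memory access takes the same `latency`, and the value of 40 cycles used in the example is hypothetical.

```python
def blocking_cycles(n_iters, latency):
    """Cycles when each iteration issues its load, waits `latency` cycles
    for the data, and then spends one cycle on the calculation."""
    return n_iters * (latency + 1)


def pvp_cycles(n_iters, latency):
    """Cycles when the preload for iteration i is issued at cycle i and the
    calculation for iteration i is issued as soon as its data has arrived
    and the previous calculation has been issued."""
    next_free = 0                       # first cycle at which a calculation can issue
    for i in range(n_iters):
        data_ready = i + latency        # preload i was issued at cycle i
        issue = max(data_ready, next_free)
        next_free = issue + 1           # one calculation per cycle
    return next_free


def preloads_in_flight(n_iters, latency):
    """Preloaded values outstanding at once in steady state; each one
    occupies a target register until it is consumed."""
    return min(n_iters, latency)


if __name__ == "__main__":
    n, latency = 1000, 40               # latency is a hypothetical value
    print(blocking_cycles(n, latency))  # 41000
    print(pvp_cycles(n, latency))       # 1040
    print(preloads_in_flight(n, latency))
```

In this model the latency is paid once, at the start of the loop, instead of once per iteration. The price is that roughly `latency` preloaded values are outstanding at any time, each needing its own target register.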
A processor executing PVP code can therefore keep a load pipeline and a calculation pipeline busy in parallel, achieving high performance. PVP needs many floating-point registers as target registers for preloading. Each IP in the SR2201 has 128 floating-point registers, which are managed using a register window. This sliding window feature [2] enables selective access to the preloaded data in the 128 registers of each IP. Owing to this sliding window feature, the HARP-1E required few changes to the usual RISC instruction set architecture. It needed no extension of the register specification fields of the floating-point instructions, and only added some new instructions, such as "preload" and "window-switch".

4. Inter-processor Data Transfer

The SR2201 achieves high-performance inter-processor data transfer due to its
1. flexible inter-processor network topology,
2. high-speed inter-processor network, and
3. low-latency inter-processor communication (message passing) protocol.
The first two result from its multi-dimensional crossbar network, and the third from its use of an original inter-processor communication facility, the "remote DMA", and hardware-based cache coherency. This section describes the multi-dimensional crossbar network and the remote DMA transfer facility. Section 5.3 describes the hardware-based cache coherency.

4.1 Multi-dimensional Crossbar Network

The multi-dimensional crossbar network is one of the most important features of the SR2201 [1]. Figure 1 shows the structure of the three-dimensional (3D) crossbar network. In the 3D crossbar network, PUs are placed in a three-dimensional arrangement. Several crossbar switches are placed in parallel in each dimension to connect the PUs. The NIA on each PU includes a router for connecting itself to the three crossbars. Each router can also route data from one crossbar switch to another, enabling data transfer between PUs which are not directly connected by a single crossbar. Therefore, each router is also a small crossbar switch. This network has three significant features supporting inter-processor data transfer:
1. Short communication distance. Inter-processor data transfer between any two PEs is achieved within at most three hops in the three-dimensional configuration.
2. Great freedom in processor mapping of applications. Because this network is composed of multiple crossbars, far fewer network conflicts occur in this network than in mesh-connected or torus networks. Thus, high performance is achieved for many variations in the inter-processor communication patterns due to the many independent communication paths. Consequently, there is great freedom in the processor mapping of applications.
3. High-performance broadcast and barrier synchronization facility. The multi-dimensional crossbar topology facilitates collective communication via hardware, achieving high-performance (low-latency) broadcast and barrier synchronization. The entire system can be partitioned into a maximum of eight groups (partitions), in each of which the collective communication facility can be used independently.
Each link of this network can transfer data at 300 MB/s, which matches the computing performance of the PU when the SR2201 is solving large-scale numerical applications.

4.2 Remote DMA Transfer Facility

In the conventional inter-processor communication protocol (send/receive model), when a PU sends data to another, the data is first copied to a send buffer in the operating system (OS), and then transmitted over the network to a similar buffer in the receiving PU. Finally, the data is copied to the receiving program. This protocol has the following advantages:
1. The send operation is non-blocking.
2. A reliable transmission protocol can be easily implemented.
However, the communication overhead of the send/receive protocol is quite large, because the data must be copied twice and the protocol processing requires context switches. Furthermore, receiving the data generates an interrupt. To solve these problems, the SR2201 supports a remote DMA transfer facility. The basic concept of this protocol is shown in Fig. 3. To avoid the communication overhead, the data is transmitted directly from one program area to another, without any OS operations.
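The difference between the two protocols can be made concrete with a toy model that only counts memory copies. This is a minimal sketch, not the SR2201 programming interface: the `Node` class, both function names, and the fixed-size receive area are assumptions introduced for the illustration, and context switches and interrupts are not modelled.

```python
class Node:
    """Toy processing unit: an OS-side buffer plus a fixed user-space area."""

    def __init__(self, area_size):
        self.os_buffer = b""
        self.user_area = bytearray(area_size)   # receive area of the user program
        self.copies = 0                         # memory copies performed on this node


def _check_fits(dst, offset, data):
    if offset < 0 or offset + len(data) > len(dst.user_area):
        raise ValueError("transfer does not fit in the receive area")


def send_receive(src, dst, data, offset=0):
    """Conventional model: program -> OS send buffer -> network ->
    OS receive buffer -> receiving program."""
    _check_fits(dst, offset, data)
    src.os_buffer = bytes(data)                 # copy 1: program to send buffer
    src.copies += 1
    dst.os_buffer = src.os_buffer               # network transfer between OS buffers
    dst.user_area[offset:offset + len(data)] = dst.os_buffer   # copy 2: buffer to program
    dst.copies += 1


def remote_dma(src, dst, data, offset=0):
    """Remote DMA model: the network writes straight into the receiver's
    user-space area; no OS buffer is involved on either side."""
    _check_fits(dst, offset, data)
    dst.user_area[offset:offset + len(data)] = data


if __name__ == "__main__":
    a, b = Node(16), Node(16)
    send_receive(a, b, b"hello", offset=4)
    print(a.copies + b.copies)                  # 2

    c, d = Node(16), Node(16)
    remote_dma(c, d, b"hello", offset=4)
    print(c.copies + d.copies)                  # 0
```

Both calls leave the same bytes in the receiver's area; only the number of intermediate copies differs.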
To implement the remote DMA transfer facility, the OS allocates in advance a reserved physical memory area for the user space, which is never moved to another address space. The sender specifies that area of the receiver and writes the data directly into it. Since there is no buffering in the OS kernel, expensive memory copy operations are avoided. Also, there is no need for an OS system call or for context switches, since the user program invokes the communication directly.

Figure 3. Basic concept of remote DMA transfer facility. (Labels: PU, Program, Data, OS, Network; "No Buffering in Kernel", "No OS System Call".)

5. Memory System

To achieve high memory performance, a great amount of hardware, such as LSI pins, memory chips, and control LSI chips, is needed. However, the SR2201 aims at a compact 2048-PU system with high performance, both peak and effective. Thus, compactly implementing the PUs, including the memory system, is important. As a result, the memory system should achieve high effective performance by fully utilizing a limited set of hardware resources. This section describes how the memory system solves this problem.
5.1 Organization of Memory System

As shown in Fig. 2, the SC is implemented using a single LSI chip, whose number of LSI pins has been made as high as possible to widen the data paths and to avoid bottlenecks. Figure 4 shows the inter-LSI interfaces of the memory system. Because the address and data of a storage transaction from the IP use a common 8-byte-wide path between the IP and the SC, the IP needs two machine cycles to transmit a storage transaction. This is the only factor degrading the performance of PVP. The paths between the NIA and the SC can simultaneously handle MS reads for sending and MS writes for receiving without performance degradation.

Figure 4. Inter-LSI interfaces of memory system. (Labels: instruction processor, storage controller, memory controller, network interface adapter; path widths of 8 bytes, 4 bytes x 2, 8 bytes x 2, and 2 bytes; clock domains of 150 MHz and 75 MHz.)

The SC and the other LSI chips shown in Fig. 2 run at 75 MHz, with the exception of the IP, which runs at 150 MHz. At the interface between the memory controller and the SC there are two sets of MS access paths, which keep the same throughput as the IP-SC bus at half the clock speed and support the access pitch required for MS accesses using PVP. Since the data paths are bi-directional, path conflicts sometimes occur between the storage transactions from the SC and the transmission of fetched data from the memory controller. The path controller is able to switch direction without idle cycles, minimizing the penalty of these conflicts. The time charts in Fig. 5 show the control flow of the bi-directional paths. As shown in Fig. 5 (a), the conventional method spends an idle machine cycle to switch direction. As shown in Fig. 5 (b), the method used in the SR2201 switches the direction within the interval from the end of one transfer (the moment at which the latch of the opposite port receives the data) to the start of the next one.

Figure 5. Flow control for bi-directional path between memory controller and SC: (a) Conventional, (b) SR2201. (Time axis in machine cycles of 13.3 ns.)

5.2 Features for Supporting Pseudo Vector Processing

Main storage access using PVP has the following characteristics:
1. Since PVP issues 8-byte fetch operations to the MS almost every machine cycle (150 MHz), the data-supply rate from MS to IP is about 1.2 GB/s. On the other hand, if the cache were used in the SR2201, as it is in conventional systems, the data-supply rate from MS to cache (and then to the PU) would be at most about 6 MB/s due to cache misses.
2. The MS accesses using PVP are uncorrelated with each other in principle, so there is no regularity in their address sequences. This characteristic is the most significant difference between the load/store operations of a vector processor and the MS accesses using PVP.

To achieve the high access pitch needed for PVP described in the first characteristic above, the memory system processes MS accesses in a pipelined manner. Two sets of MS access pipelines in the SC are used to keep the same throughput as the IP-SC bus at half the clock speed. The system has 16 memory banks in the MS, providing 1.2-GB/s bandwidth for the PVP memory accesses. As shown in Fig. 6, the MS is separated into two groups (bank groups) of eight banks each, based on the two sets of MS access paths. Two MCDs manage the accessed data, one MCD for each bank group. An MCA manages the addresses for all MS accesses, independently of each bank group. To avoid bank conflict penalties caused by the second characteristic above, the SC and the memory controller have access buffers.

Figure 6. Main storage configuration. (Labels: storage controller, MCA, MCD0 and MCD1; bank group 0 with banks 0, 2, ..., 14 and bank group 1 with banks 1, 3, ..., 15; path widths of 8B/13.3ns and 4B/13.3ns; data, address, and control lines.)

5.3 Features for Supporting High-speed Inter-processor Data Transfer

As previously stated, each link in the 3D crossbar network can transfer data at 300 MB/s and the NIA handles sending and receiving of data in parallel. As a result, the throughput of MS accesses from the NIA reaches 600 MB/s. The NIA-SC interface described in Sec. 5.1 and the 1.2-GB/s bandwidth of the MS described in Sec. 5.2 allow this 600-MB/s bandwidth to be achieved. The NIA accesses the data to be transferred directly from the MS, independent of the IP. However, the IP may access and cache the same data areas accessed by the NIA, so cache coherency has to be maintained. Conventional parallel processor systems realize cache coherence by software, which leads to performance degradation during massive data transfer due to high software overhead. To avoid this problem, the following two hardware features are implemented:
1. Store-through cache management, which makes cache coherence operations in sending unnecessary.
2. A hardware support mechanism in the SC which maintains cache coherence in parallel with data reception from the NIA. This mechanism usually invokes cache coherence operations once per cache line. This strategy hides the overhead of cache coherence operations.

6. Performance Evaluations of the SR2201

6.1 Performance Measurements of Basic Loops

A set of basic loops corresponding to common vector operations was used to measure the performance of the memory system. To minimize TLB (translation lookaside buffer) miss penalties and measure the true performance of the memory system, the data area of the programs was mapped using the blocked TLB facility, which translates a contiguous memory area of up to 32 MBytes from virtual to physical addresses using one entry in the address translation table. The equations for each of the basic loops are shown in Table 1. For each equation, the factors that affect the performance of the memory system are shown. These are the number of array variables, the number of load instructions, and the number of instructions that access the memory in one iteration (when a vector variable appears on both sides of the assignment, it is counted twice because a load and a store instruction need to be issued). All variables are double-precision floating point except for the array indexes.

Table 1. Equations used in experimental measurements.
  Eq. 1: s = s + A(i)
  Eq. 2: A(i) = B(i)
  Eq. 3: A(i) = B(i) + C(i)
  Eq. 4: s = s + A(i)*B(i)
  Eq. 5: C(i) = C(i) + A(i)*B(i)
  (Further columns: # of variables, # of load operations, # of store operations.)

The experimental measurements on one PU using the basic loop calculations (Table 1) are shown in Figure 7. The horizontal axis shows the stride of the accesses (i.e., the increment applied to the index i). For basic loops that have more than one vector variable, the accesses for all variables have the same stride. All arrays are aligned on 256-byte boundaries. As shown in Figure 7, all calculations have low performance at the same strides because memory bank conflicts occur. For instance, when the stride is a multiple of 2, the performance is half of the maximum, because only half of the memory banks are accessed. When the stride is a multiple of 4, 8, or 16, the performance of the memory access drops to 1/4, 1/8, and 1/16 of the maximum, because the number of memory banks accessed is reduced to 4, 2, and 1, respectively.

Figure 7. Experimental measurements for basic loop calculations. (Vertical axis: MS bandwidth (MB/s); horizontal axis: stride (number of elements); one curve for each of Eq. 1 to Eq. 5.)

Table 2. Memory access performance. (Rows: Eq. 2: A(i)=B(i) and Eq. 3: A(i)=B(i)+C(i); columns: performance (MB/s) for the SR2201, Cray T3D, Cray T3E, and IBM SP2.)

Table 3. Performance of inter-processor data transfer (MB/s). (Rows: theoretical peak and effective peak; columns: SR2201, Cray T3D, IBM SP2.)

The next analysis shows the differences between the equations based on their features. The features in Figure 7 are as follows:
1. The number of array variables that must be accessed affects performance. The performance of Eqs. 3 and 5, which access three array variables, is low. When the number of variables increases, accesses to the same bank occur in succession because all variables are aligned on 256-byte boundaries, and thus performance decreases.
2. The number of store instructions affects performance. Array store operations reduce the access-request issue pitch to the memory because of the IP-SC bus width, possibly lowering the throughput. On the other hand, this reduces the load on the memory system, since it decreases the impact of memory bank conflicts, thus raising the throughput in some cases. Also, since in PVP the array elements being loaded and stored correspond to different iterations, the banks accessed by the store instructions are different from those accessed by the load instructions. This differs from the equations that have only load instructions. When the bandwidths obtained for Eqs. 2 and 4 are compared, each equation has two array variables; however, in Eq. 2 one of the two variables is the target of a store instruction, whereas both variables of Eq. 4 are targets of load instructions.

The performances of the Hitachi SR2201, CRAY T3D, CRAY T3E, and IBM SP2 are compared in Table 2 [6].
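The stride behaviour reported for Figure 7 follows from simple modular arithmetic. The sketch below assumes that consecutive 8-byte elements map to consecutive banks (bank = element index mod 16); this mapping is consistent with the numbers in the text but is an assumption, not a documented detail of the MS. The function names are illustrative.

```python
from math import gcd

NUM_BANKS = 16   # number of MS banks given in the text


def banks_touched(stride, num_banks=NUM_BANKS):
    """Distinct banks visited by an access stream with the given element
    stride, assuming bank = element index mod num_banks."""
    if stride < 1:
        raise ValueError("stride must be a positive number of elements")
    return num_banks // gcd(stride, num_banks)


def relative_bandwidth(stride, num_banks=NUM_BANKS):
    """Fraction of the banks (and so of the peak bandwidth) that is usable."""
    return banks_touched(stride, num_banks) / num_banks


def banks_touched_bruteforce(stride, num_banks=NUM_BANKS, n_accesses=1024):
    """Same quantity obtained by enumerating the accessed banks directly."""
    return len({(i * stride) % num_banks for i in range(n_accesses)})


if __name__ == "__main__":
    for stride in (1, 2, 3, 4, 8, 16):
        print(stride, banks_touched(stride), relative_bandwidth(stride))
```

Odd strides reach all 16 banks, a stride that is a multiple of 2 but not of 4 reaches 8 banks, and strides that are multiples of 4, 8, and 16 reach 4, 2, and 1 banks, matching the 1/2, 1/4, 1/8, and 1/16 bandwidth levels described above.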
Among the four machines in Table 2, the SR2201 had the highest performance, owing to its fully pipelined memory system and PVP facility.

6.2 Evaluation of Inter-processor Data Transfer Performance

6.2.1 Effect of High Memory Bandwidth on Data Transfer and Remote DMA Transfer Facility. The high bandwidth of the SR2201 memory system and the remote DMA transfer facility enable high-bandwidth network transfer. To illustrate this point, the network transfer performance of commercial parallel processor systems [7], [8] is shown in Table 3. The SR2201 outperforms the other two in terms of both theoretical and effective peak network throughput.

6.2.2 Effect of Hardware Support on Cache Coherence. Figure 8 shows the network transfer throughput using the cache coherence management mechanism (CCMM) (coherence kept by hardware) and without using it (coherence kept by software). The measured network transfer throughput is for the case when two processing units issue a remote DMA transfer towards each other simultaneously. The performance using hardware cache coherence management was almost 4% higher than with the software counterpart.

Figure 8. Inter-processor data transfer. (Vertical axis: throughput (MB/s); horizontal axis: transfer size (byte); curves with CCMM and without CCMM.)

6.3 Evaluation of Numerical Application Performance

6.3.1 Performance of Implicit Method. Using the same assumptions as in Sec. 6.1, we evaluated the performance of the four loops below:
(a) a(i,j) = b(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
(b) b(i,j) = b(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
(c) b(i,j) = a(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)
(d) c(i,j) = a(i,j) + b(i-1,j) + b(i+1,j) + b(i,j-1) + b(i,j+1)

These four equations are simplified forms of the core loop that appears in numerical application programs using implicit methods. The HARP-1E achieves peak performance when it performs a multiplication and an addition in parallel every machine cycle. However, these four equations have only addition operations, which reduces the maximum achievable performance for these calculations to half of the peak value of 300 MFLOPS. The experimental results are shown in Fig. 9. The horizontal axis is the array size N (all arrays have N x N elements). The vertical axis is the performance. As shown in Fig. 9, the performance changes with the size of the array because of bank conflicts on the accesses along the j-th index. To avoid this problem, dummy elements are added to the i-th index of the array. The relationship between the number of dummy elements and performance when N equals 36 is shown in Fig. 10. In this case, two dummy elements are sufficient to achieve high performance. Figure 11 shows the performance of the four equations without PVP. The performance achieved using PVP (Fig. 9) is far higher than that obtained without it (i.e., access through the cache). In the worst case, when severe bank conflicts occur, the performance with PVP is equal to that without PVP.

Figure 9. Performance of four typical equations used in implicit method. (Vertical axis: performance (MFLOPS); horizontal axis: array size (number of elements); curves for Eq. (a) to Eq. (d).)

Figure 10. Performance of four equations with dummy elements. (Vertical axis: performance (MFLOPS); horizontal axis: number of dummy elements, for the N=36 case.)

Figure 11. Performance of four equations without PVP. (Vertical axis: performance (MFLOPS); horizontal axis: array size (number of elements).)

6.3.2 Performance of LU Decomposition Program. LU decomposition is the main computation in the LINPACK benchmark, which is commonly used to measure the performance of supercomputers.
In this section we identify the most suitable algorithm for the SR2201 processing unit and show the best performance tuning for this algorithm. The evaluation parameters are equivalent to the ones above. The core calculation of LU decomposition is: a(i,j) = a(i,j) - a(k,j) * a(i,k). The order of the i, j, and k loops is what differentiates the LU decomposition algorithms. The outer product form (k,j,i order) is commonly used on vector processors because they perform poorly when the inner product form is used. In the inner product Crout form (i,j,k order), the innermost loop (k) performs accumulation into a(i,j), reducing the number of memory store operations. Memory data can be reused by unrolling the i,j loops. On the SR2201, as stated in Sec. 6.1, performance can be improved by using algorithms that use fewer memory store operations, and also by reusing memory data to obtain a higher ratio of numeric instructions to memory load/store instructions. Therefore, the inner product Crout algorithm is the most suitable one for the SR2201. The outer product and inner product Crout forms of the part of the LU decomposition program that dominates the execution time are shown in Fig. 12. In these programs, loop unrolling has been done by hand-coding to improve register allocation. The performance of both algorithms is shown in Figure 13. The horizontal axis is the number of elements (N) in each dimension of the array a(i,j). The inner product Crout algorithm delivers better performance because it has fewer store instructions than the outer product form. Both algorithms show the same behavior on bank conflicts. One LINPACK benchmark measures the performance for N = [...].

(a) outer product form (j: 2-unrolling, k: 4-unrolling)

      do 1 k=1,n-5,4
      do 1 j=k+4,n-1,2
      do 1 i=k+4,n
        a(i,j)   = a(i,j)   + w(1,j)*a(i,k)   + w(2,j)*a(i,k+1)   + w(3,j)*a(i,k+2)   + w(4,j)*a(i,k+3)
        a(i,j+1) = a(i,j+1) + w(1,j+1)*a(i,k) + w(2,j+1)*a(i,k+1) + w(3,j+1)*a(i,k+2) + w(4,j+1)*a(i,k+3)
    1 continue

(b) inner product Crout form (i: 5-unrolling, j: 2-unrolling, k: 2-unrolling)

      do 1 i=1,n,5
      do 2 j=i+1,n,2
      do 3 k=1,i-1,2
        s1 = s1 + a(j,k)*a(k,i)     + a(j,k+1)*a(k+1,i)
        s2 = s2 + a(j+1,k)*a(k,i)   + a(j+1,k+1)*a(k+1,i)
        s3 = s3 + a(j,k)*a(k,i+1)   + a(j,k+1)*a(k+1,i+1)
        s4 = s4 + a(j+1,k)*a(k,i+1) + a(j+1,k+1)*a(k+1,i+1)
        s5 = s5 + a(j,k)*a(k,i+2)   + a(j,k+1)*a(k+1,i+2)
        s6 = s6 + a(j+1,k)*a(k,i+2) + a(j+1,k+1)*a(k+1,i+2)
        s7 = s7 + a(j,k)*a(k,i+3)   + a(j,k+1)*a(k+1,i+3)
        s8 = s8 + a(j+1,k)*a(k,i+3) + a(j+1,k+1)*a(k+1,i+3)
        s9 = s9 + a(j,k)*a(k,i+4)   + a(j,k+1)*a(k+1,i+4)
        sa = sa + a(j+1,k)*a(k,i+4) + a(j+1,k+1)*a(k+1,i+4)
    3 continue
        a(j,i)     = a(j,i)     - s1
        a(j+1,i)   = a(j+1,i)   - s2
        a(j,i+1)   = a(j,i+1)   - s3 - a(j,i)*a(i,i+1)
        a(j+1,i+1) = a(j+1,i+1) - s4 - a(j+1,i)*a(i,i+1)
        a(j,i+2)   = a(j,i+2)   - s5 - a(j,i)*a(i,i+2) - a(j,i+1)*a(i+1,i+2)
        a(j+1,i+2) = a(j+1,i+2) - s6 - a(j+1,i)*a(i,i+2) - a(j+1,i+1)*a(i+1,i+2)
        a(j,i+3)   = a(j,i+3)   - s7 - a(j,i)*a(i,i+3) - a(j,i+1)*a(i+1,i+3) - a(j,i+2)*a(i+2,i+3)
        a(j+1,i+3) = a(j+1,i+3) - s8 - a(j+1,i)*a(i,i+3) - a(j+1,i+1)*a(i+1,i+3) - a(j+1,i+2)*a(i+2,i+3)
        a(j,i+4)   = a(j,i+4)   - s9 - a(j,i)*a(i,i+4) - a(j,i+1)*a(i+1,i+4) - a(j,i+2)*a(i+2,i+4) - a(j,i+3)*a(i+3,i+4)
        a(j+1,i+4) = a(j+1,i+4) - sa - a(j+1,i)*a(i,i+4) - a(j+1,i+1)*a(i+1,i+4) - a(j+1,i+2)*a(i+2,i+4) - a(j+1,i+3)*a(i+3,i+4)
    2 continue
    1 continue

Figure 12. LU decomposition program codes for experimental measurements.

As shown in Fig. 13, the performance for N = [...] is worse than that of the neighboring points due to bank conflicts. By inserting a dummy element as stated in Sec. 6.3.1, the performance of the inner product Crout form was improved to 247 MFLOPS.
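The difference in store traffic between the two loop orderings can be seen in a small, unoptimized sketch. It is not a transcription of Fig. 12: there is no unrolling and no pivoting, plain Python lists stand in for the array, and the `stores` counter is only an illustrative proxy for memory store operations. The function names and the diagonally dominant sample matrix are assumptions made for the example.

```python
def lu_outer_product(a):
    """Outer product form: every update reads and writes a[i][j] in memory."""
    n = len(a)
    a = [row[:] for row in a]
    stores = 0
    for k in range(n):
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                 # multiplier l(i,k)
            stores += 1
        for j in range(k + 1, n):
            for i in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]   # one store per multiply-add
                stores += 1
    return a, stores


def lu_inner_product(a):
    """Inner product form: the innermost loop accumulates into a scalar,
    so each matrix element is stored once."""
    n = len(a)
    a = [row[:] for row in a]
    stores = 0
    for col in range(n):
        for row in range(n):
            s = 0.0
            for k in range(min(row, col)):
                s += a[row][k] * a[k][col]     # accumulation stays in a register
            value = a[row][col] - s
            if row > col:
                value /= a[col][col]           # entry of L below the diagonal
            a[row][col] = value
            stores += 1
    return a, stores


def sample_matrix(n):
    """Diagonally dominant matrix, so that no pivoting is needed."""
    return [[(n if i == j else 0.0) + 1.0 / (i + j + 1) for j in range(n)]
            for i in range(n)]


def max_abs_diff(x, y):
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


if __name__ == "__main__":
    m = sample_matrix(8)
    lu_a, stores_a = lu_outer_product(m)
    lu_b, stores_b = lu_inner_product(m)
    print(stores_a, stores_b)                  # 168 64
    print(max_abs_diff(lu_a, lu_b) < 1e-12)    # True
```

Both orderings produce the same factors up to rounding, but the store count of the outer product form grows with the cube of N while that of the inner product form grows with the square, which is the property the text relies on when it prefers the inner product Crout form.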
This is 82% of the uniprocessor peak performance (300 MFLOPS).

[Figure 13. Experimental measurements for LU decomposition programs. Axes: performance (MFLOPS) versus array size (number of elements). Curves: outer product form, inner product, Crout form.]

Performance of Parallel LINPACK Benchmark. The LU decomposition part of parallel LINPACK was solved with a new method named double-blocked Gaussian elimination [9]. This method uses two types of blocking, one for communication and another for calculation. It achieves high single-processor performance through longer loops and, at the same time, high parallel efficiency through optimized load balancing. The LINPACK benchmark performance of the same three systems (in a 256 PU configuration) is shown in Table 4. The figures for the CRAY T3D and IBM SP2 are taken from the LINPACK benchmark report dated March 28. The SR2201 again outperforms the other two in terms of both peak performance and effective performance ratio.

Table 4. Performance of LINPACK benchmark on 256 PU system.
Columns: System; Peak performance of PU (MFLOPS); Performance of benchmark (GFLOPS); Efficiency compared to peak.
Systems: SR2201, Cray T3D, IBM SP2.
Efficiency compared to peak (as extracted): %, 66%, 65%.

7. Conclusion

In the design of Hitachi's SR2201 massively parallel RISC computer, careful attention was paid both to the processing unit (PU) and to the network architecture in order to achieve high overall effective performance. Several features have been added to address the causes of performance degradation commonly found in conventional parallel processor systems:

1. The PU has a pseudo vector processing (PVP) feature for accelerating large-scale numerical applications. With PVP, the PU loads data by prefetching to a special register bank, bypassing the cache. This avoids the cache miss penalties that occur in large-scale numerical applications and allows high-throughput memory access.

2. The memory system of the SR2201 has a 1.2-GB/s bandwidth. This supports the high throughput required by the PVP feature.

3. For inter-processor transfer, the high performance of the memory system, the newly proposed remote DMA transfer protocol, and the hardware support for maintaining cache coherence together provide efficient transfer performance.

Due to the combined effect of all these features, the SR2201 showed high effective performance for processing large-scale numerical applications, as well as in inter-processor transfer. For instance, the 1024 PU system of the SR2201 achieved 220.4 GFLOPS on the LINPACK benchmark, which corresponds to 72% of the peak performance.

References

[1] Yasuda, Y., Fujii, H., Tanaka, T., and Inagami, Y.: "Performance Evaluation of the Hyper Crossbar Network", Technical Report of IEICE. CPSY (1993) (in Japanese).
[2] Nakamura, H., Imori, H., Nakazawa, K., Boku, T., Nakata, I., Yamashita, Y., Wada, H., and Inagami, Y.: "A Scalar Architecture for Pseudo Vector Processing based on Slide-Windowed Registers", Proceedings of International Conference on Supercomputing (July, 1993).
[3] Tirumalai, P., Lee, M., and Schlansker, M.: "Parallelization Of Loops With Exits On Pipelined Architectures", Proceedings of Supercomputing '90 (Nov., 1990).
[4] Rau, R. B., Lee, M., Tirumalai, P. P., and Schlansker, S. M.: "Register Allocation for Software Pipelined Loops", Proceedings of the ACM SIGPLAN '92 Conference on Programming Language Design and Implementation (June, 1992).
[5] Saito, K., Hashimoto, M., Sawamoto, H., Yamagata, R., Kumagai, T., Kamada, E., Matsubara, K., Isobe, T., Hotta, T., Nakano, T., Shimizu, T., and Nakazawa, K.: "A 150MHz Superscalar RISC Processor with Pseudo Vector Processing Feature", Proceedings Notebook for Hot Chips VII (Aug., 1995).
[6] Saini, S. and Bailey, H. D.: "RISC Processors and High Performance Computing", Supercomputing '95. Tutorial S5 (Dec., 1995).
[7] Numrich, W. R., Springer, L. P., and Peterson, C. J.: "Measurement of Communication Rates on the Cray T3D Interprocessor Network", HPCN Europe '94 (1994).
[8] Stunkel, B. C.: "The SP2 High-Performance Switch", IBM System Journal, Vol. 34, No. 2 (1995).
[9] Yamamoto, Y. and Ohkouchi, T.: "The Optimization of the Gaussian Elimination for Massively Parallel Processors", Proceedings of the JSPP '95 (1995) (in Japanese).
[10] Yasuda, Y., Fujii, H., Akashi, H., Inagami, Y., Tanaka, T., Nakagoshi, J., Wada, H., and Sumimoto, T.: "Deadlock-free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201", Proceedings of 11th International Parallel Processing Symposium (IPPS '97) (April, 1997).
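
The efficiency figure quoted in the conclusion can be checked with a few lines of arithmetic. The sketch below is written in Python purely for illustration. The function names are hypothetical, and the inputs are assumptions taken from the figures in the paper: a 1024 PU configuration, a per-PU peak of 300 MFLOPS, and a measured LINPACK result of 220.4 GFLOPS.

```python
def peak_gflops(num_pus, pu_peak_mflops):
    """Aggregate peak performance in GFLOPS for num_pus processing units."""
    return num_pus * pu_peak_mflops / 1000.0


def efficiency(measured_gflops, num_pus, pu_peak_mflops):
    """Ratio of measured benchmark performance to aggregate peak."""
    return measured_gflops / peak_gflops(num_pus, pu_peak_mflops)


# Assumed inputs: 1024 PUs, 300 MFLOPS per PU, 220.4 GFLOPS measured.
peak = peak_gflops(1024, 300.0)
eff = efficiency(220.4, 1024, 300.0)
print(f"peak = {peak:.1f} GFLOPS, efficiency = {eff:.1%}")
```

With these inputs the aggregate peak is 307.2 GFLOPS and the ratio rounds to 72%, which matches the value stated in the conclusion. The same two functions apply to any other configuration once its per-PU peak and measured result are known.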