Towards a Low-Power Accelerator of Many FPGAs for Stencil Computations

2012 Third International Conference on Networking and Computing

Ryohei Kobayashi, Tokyo Institute of Technology, Japan, kobayashi@arch.cs.titech.ac.jp
Shinya Takamaeda-Yamazaki, Tokyo Institute of Technology, Japan; JSPS Research Fellow, Japan, takamaeda@arch.cs.titech.ac.jp
Kenji Kise, Tokyo Institute of Technology, Japan, kise@cs.titech.ac.jp

Abstract. We have proposed an effective stencil computation method and architecture that employ multiple small FPGAs connected in a 2D-mesh topology. In this paper, we show that the proposed architecture works correctly on a real 2D-mesh-connected FPGA array. We developed a software simulator in C++ that emulates the proposed architecture, and implemented two prototype systems in Verilog HDL. One prototype, with communication modules, is for logic verification; the other, without communication modules, is for estimating power consumption. We ran the former prototype for 2M cycles and checked its behavior against the software simulator. Our architecture is developed towards a low-power accelerator of many FPGAs. The evaluation with the second prototype shows that a single FPGA node with eight floating-point adders and eight floating-point multipliers achieves 2.24 GFlop/s at 0.16 GHz with 2.37 W power consumption. This performance per watt is about six times better than that of an NVIDIA GTX280 GPU card.

Index Terms: FPGA accelerator, Stencil computation, Low-power

I. INTRODUCTION

Stencil computation is one of the typical scientific computing kernels [1]. Various accelerators that solve stencil computations at high speed have been designed using multiple high-end FPGAs [2][3]. We have proposed a stencil computing method optimized for a 2D-mesh-connected FPGA array [4]. This paper describes the implementation results of the proposed method and shows that the designed architecture works correctly on a real 2D-mesh-connected FPGA array.
This system is developed towards a low-power accelerator of many FPGAs. We have already developed a 2D-mesh-connected FPGA array, the ScalableCore system, which is a high-speed simulation environment for many-core processor research [5]. The ScalableCore system uses multiple small-capacity FPGAs connected in a 2D mesh. In this paper, we use the hardware components of the ScalableCore system as an infrastructure for HPC hardware accelerators.

To achieve high performance, the pipelines of the execution units should be kept operating effectively throughout the computation. In stencil computation, the whole data set is divided into multiple blocks and each block is assigned to one FPGA. The boundary data of each block is shared by the adjacent FPGAs. In our system, the computation order is customized in each FPGA in order to increase the acceptable latency of data sharing among the FPGAs.

Fig. 1. 2D stencil computation.
Fig. 2. Pseudo code of 2D stencil computation.

II. PARALLEL STENCIL COMPUTATION BY USING MULTI-FPGAS

Fig. 1 shows a typical pattern of 2D stencil computation. In the figure, each circle represents the value of a grid-point, and the value of each grid-point at the next time-step is computed from the values of its four adjacent grid-points at the current time-step. Fig. 2 shows pseudo code for the 2D stencil computation of Fig. 1. In the figure, k represents the time-step and (i, j) represents the coordinate of a grid-point. Two buffers, V0 and V1, are used for the computation. The value of grid-point (i, j) is represented as Vn[i][j], where n is the buffer number (0 or 1). As shown in the fourth line of Fig. 2, Vn[i][j] is updated with the sum of four values. Each value is obtained by multiplying one adjacent grid-point (Vn[i-1][j], Vn[i][j-1], Vn[i][j+1], Vn[i+1][j]) by a weighting factor. As shown in the seventh and eighth lines of Fig. 2, every grid-point is updated for the next time-step.
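The update rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the pseudo code of Fig. 2 itself: the function names, the choice to leave boundary grid-points fixed, and the equal weights in the usage example are all assumptions made for the sketch.

```python
def stencil_step(src, dst, weights):
    """Update every interior grid-point of dst from its four neighbours in src.

    weights = (up, left, right, down); the concrete values are an assumption.
    Each update costs four multiplications and three additions.
    """
    w_up, w_left, w_right, w_down = weights
    for i in range(1, len(src) - 1):
        for j in range(1, len(src[0]) - 1):
            dst[i][j] = (w_up * src[i - 1][j] + w_left * src[i][j - 1]
                         + w_right * src[i][j + 1] + w_down * src[i + 1][j])


def run_stencil(grid, weights, time_steps):
    """Run time_steps Iterations, alternating between the two buffers V0 and V1."""
    buffers = [[row[:] for row in grid], [row[:] for row in grid]]
    for k in range(time_steps):
        # Read from buffer k % 2, write the next time-step into the other buffer.
        stencil_step(buffers[k % 2], buffers[(k + 1) % 2], weights)
    return buffers[time_steps % 2]


if __name__ == "__main__":
    grid = [[0.0, 1.0, 0.0],
            [1.0, 0.0, 1.0],
            [0.0, 1.0, 0.0]]
    result = run_stencil(grid, (0.25, 0.25, 0.25, 0.25), 1)
    print(result[1][1])  # centre point after one time-step
```

Using two buffers means a grid-point is never overwritten while a neighbour still needs its value from the current time-step, which is the same hazard the hardware design has to avoid later in the paper.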

Fig. 3. Block division and assignment to each FPGA.

The whole data set is divided into multiple blocks according to the numbers of FPGAs in the vertical and horizontal directions of the array, and each block is assigned to one FPGA. The boundary data of each block is shared among multiple FPGAs via their communication interfaces. This data sharing incurs some overhead for data traversal. To eliminate this overhead, we customize the computation order for each FPGA.

As shown in Fig. 3, the data set of the stencil computation is divided into several blocks according to the numbers of FPGAs in the vertical and horizontal directions. Each data block is assigned to one FPGA. The computation on each FPGA uses the assigned data and the shared boundary data of each block. The necessary boundary data has to be sent to the adjacent FPGAs. In Fig. 3, each circle represents a grid-point, a group of 4 × 4 grid-points is assigned to one FPGA, and each arrow represents communication to a neighboring FPGA. Gray regions represent the data subsets communicated to other FPGAs.

Fig. 4 shows two cases of computation order. Fig. 4 (a) shows the case in which FPGA (A) and FPGA (B) compute in the same order. A dotted square shows the data subset assigned to an FPGA. In fact, the computations also use extra boundary data that is not shared; this extra data is omitted from the figure for simplicity. We define the sequence of processing that computes all the grid-points for one time-step as an Iteration. Each circle represents one grid-point. The letter in a circle is the ID of the FPGA. The number in a circle is the computing order within the FPGA; the computation in each FPGA therefore proceeds in the order of the arrow. In this example, each FPGA updates its sixteen assigned grid-points (0 to 15) during every Iteration. For simplicity, we assume that updating the value of one grid-point takes exactly one cycle and that several FIFOs are used to avoid illegal modification of the data. In FPGA (A), the value of A0 is computed in the 0th cycle and the value of A1 is computed in the 1st cycle.
Similarly, in FPGA (B), the value of B0 is computed in the 0th cycle and the value of B1 is computed in the 1st cycle. All the computations are processed in this order. We assume that each FPGA can use data it has obtained in a single cycle. After the computations of an Iteration are complete, the process proceeds to the next time-step. In this case, one Iteration takes sixteen cycles. The first Iteration begins at the 0th cycle and the second Iteration begins at the 16th cycle. Therefore, the third Iteration begins at the 32nd cycle.

Fig. 4. The computing order of grid-points on FPGA. (b) is the proposed method [4].
Fig. 5. Computing order with the proposed method applied.

In Fig. 4 (a), the computation of grid-point B1 uses the values of the vertically and horizontally adjacent grid-points A13, B5, B0, and B2. The value of grid-point A13 needs to be communicated between FPGA (A) and FPGA (B) because it is shared by these FPGAs. The others do not need to be communicated between FPGAs. In this computation order, the value of grid-point A13 is computed in the 13th cycle and the value of grid-point B1 is computed in the 17th cycle. The computation of B1 uses the computed value of A13. In order not to stall the computation of B1, the value of A13 must be communicated within the three cycles (14, 15, 16) after its computation. The values of grid-points A12, A14, and A15 must also be communicated within three cycles in order not to stall the computations. By the same argument, if N × M grid-points are assigned to a single FPGA, every shared value must be communicated within N − 1 cycles.

Fig. 4 (b) models the case in which FPGA (C) and FPGA (D) compute in reverse order. The computation order of FPGA (C) is the inverse of the order of FPGA (A) in Fig. 4 (a). FPGA (B) and FPGA (D) use the same computation order. In this case, in order not to stall the computation of D1 in Iteration 2 (17th cycle), the margin for sending the value of C1 (1st cycle) is 15 cycles (cycles 2 to 16).
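The two margins derived above (three cycles when both FPGAs use the same order, 15 cycles when the upper FPGA is reversed) can be reproduced with a small model. The sketch below is an illustration under stated assumptions: one grid-point per cycle, row-major traversal, and the reversed order modelled as "bottom row first". The function names are hypothetical and do not come from the paper.

```python
def compute_cycle(row, col, width, height, bottom_first):
    """Cycle within one Iteration at which grid-point (row, col) is updated.

    Assumes one grid-point per cycle, left to right within a row.
    bottom_first=False starts at the top row, True starts at the bottom row.
    """
    r = height - 1 - row if bottom_first else row
    return r * width + col


def slack(producer_cycle, consumer_cycle, block_size):
    """Cycles available to ship a value produced in Iteration t
    before it is consumed in Iteration t+1."""
    return block_size + consumer_cycle - producer_cycle - 1


def vertical_margin(width, height, upper_bottom_first):
    """Smallest slack across the boundary between an upper and a lower FPGA.

    The lower FPGA always starts at its top row, as FPGA (B) and FPGA (D) do.
    """
    size = width * height
    margins = []
    for col in range(width):
        upper = compute_cycle(height - 1, col, width, height, upper_bottom_first)
        lower = compute_cycle(0, col, width, height, False)
        margins.append(slack(upper, lower, size))  # upper block feeds lower block
        margins.append(slack(lower, upper, size))  # lower block feeds upper block
    return min(margins)


if __name__ == "__main__":
    print(vertical_margin(4, 4, upper_bottom_first=False))  # same order, Fig. 4 (a)
    print(vertical_margin(4, 4, upper_bottom_first=True))   # reversed, Fig. 4 (b)
```

For the 4 × 4 blocks of Fig. 4 the model gives 3 and 15 cycles, matching the text: reversing the upper FPGA makes both sides of the shared boundary compute at the same point in the Iteration, so almost a full Iteration is available for communication.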
By the same argument, if N × M grid-points are assigned to a single FPGA, the communication latency between the vertically adjacent FPGAs must be within N × M − 1 cycles. In this way, changing the computation order increases the acceptable communication latency.

Fig. 5 shows the computation order (proposed method) in each FPGA. Each square represents an FPGA, and each arrow represents the computation order. FPGAs in the 1st and 3rd rows compute in the same order as FPGA (C) in Fig. 4 (b). FPGAs in the 2nd and 4th rows compute in the same order as FPGA (D) in Fig. 4 (b). As discussed for Fig. 4, the proposed method allows a communication latency between FPGAs of about one Iteration. The same holds for communication across boundaries where the arrows face each other: when FPGA (C) and FPGA (D) in Fig. 4 (b) are placed upside down, the adjacent sides are still computed in the same cycles.

Consider the communication between the left and right sides of an FPGA. C3 and C0 are adjacent when two copies of FPGA (C) in Fig. 4 (b) are placed side by side. In this case, the acceptable communication latency is 12 cycles in the right FPGA. This number is the number of cycles in one Iteration minus the number of grid-points along one side. In this way, the proposed method increases the acceptable communication latency by computing vertically adjacent blocks in reverse order; in other words, it ensures a margin of about one Iteration.

Until now, we have assumed that the computation of one grid-point takes one cycle. If the computation of one grid-point takes k cycles, the acceptable communication latency between the left and right FPGAs is (N × M − M) × k cycles.

III. ARCHITECTURE AND IMPLEMENTATION

The node architecture is the part implemented in an FPGA that performs the computation. It assumes a single FPGA, so communication between FPGAs is not considered. We call the node architecture with communication modules added the system architecture. The data type used is single-precision floating point.

Fig. 6. MADD architecture with eight multiply-adder units.
Fig. 7. Relationship between the grid-points and BlockRAM.

A. Node Architecture

Fig. 6 shows the node architecture with eight multiply-adder units. Each square in the figure represents a BlockRAM (footnote 1). MADD stands for Multiply-Adder unit. The square in a MADD represents a register.
Both the multiplier and the adder are single-precision floating-point units that conform to IEEE 754. The multiplier and the adder each have seven pipeline stages. Since the MADD also includes two registers, its data path has sixteen pipeline stages in total, so it can be regarded as an eight-stage multiplier connected to an eight-stage adder. This pipeline scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and of the adder, which is why we gave the multiplier and the adder eight stages each. We explain the reason later in this paper.

Fig. 7 shows the relationship between the BlockRAMs in Fig. 6 and the grid-points. Each number written in a BlockRAM in Fig. 6 corresponds to the same number in Fig. 7. In Fig. 7, the data set assigned to each FPGA is split in the vertical direction and stored in BlockRAMs 0 to 7, which are surrounded by the dashed line. If a data set of 64 × 128 is assigned to one FPGA, each split data set (8 × 128) is stored in one of BlockRAMs 0 to 7. Furthermore, the data of the communication region is stored in one or more other BlockRAMs (not BlockRAMs 0 to 7 inside the dashed line). The communication region is the set of data that is transferred to the adjacent nodes. In a single FPGA, however, the computation always uses the same communication-region data and does not update it, because there are no adjacent FPGA nodes to communicate with. Therefore, the BlockRAM that stores the communication-region data has no input ports.

Fig. 8 shows the MADD pipeline operation. Each circle in the figure represents the value of a grid-point, and each square is a computation result in which the value of a grid-point has been multiplied by a weighting factor. Both the multiplier and the adder have eight pipeline stages. Fig. 8 (a) shows the numbering of the grid-points. We explain the computation of one group of grid-points. First of all, grid-points 1 to 8 are loaded from BlockRAM and input to the multiplier in cycles 0 to 7.
Next, as the products are output from the multiplier, the next group of grid-points is input to the multiplier at the same time. Then, while a further group of grid-points is input to the multiplier, the values of grid-points 1 to 8 multiplied by a weighting factor are summed. Finally, the computation results, in which the data of the upper, lower, left, and right grid-points have been multiplied by weighting factors and summed, are output. The data of a grid-point that is still to be used must not be overwritten by writing a computation result into BlockRAM. Therefore, the general approach is to hold results in a temporary buffer, such as a FIFO, before writing them into BlockRAM.

Footnote 1: BlockRAM is the low-latency SRAM built into each FPGA.

Fig. 8. MADD pipeline operation.

The proposed architecture, however, needs no additional temporary buffer because the MADD pipeline provides the same functionality as a temporary buffer. In the case of Fig. 8, the grid-points being updated are input to the multiplier in cycles 32 to 40 and are not used later. Therefore, when the computation runs in a single FPGA, the update order of the data is protected without using a FIFO. As previously explained, this scheduling is valid only when the width of the computed grid equals the number of pipeline stages of the multiplier and of the adder. The width of the grid processed by one MADD is eight because the multiplier and the adder each have eight pipeline stages. This architecture keeps its pipelines almost 100% filled. The pipeline fill rate is ((N − 8)/N) × 100%, where N is the number of cycles taken by the computation. In addition, this architecture does not use an additional temporary buffer to update the data. Therefore, it can achieve both high computation performance and a small circuit area.

Fig. 9. System architecture.

B. System Architecture

Fig. 9 shows the system architecture. We describe the differences between Fig. 9 and Fig. 6. The DES in the figure is a deserializer, which receives data from an adjacent FPGA, and the SER is a serializer, which sends data to an adjacent FPGA. The data received by the deserializer is stored in a FIFO to maintain the update order. The data received by that FIFO is stored only in the BlockRAM. A FIFO is also provided at the input of the serializer. This FIFO receives the computation results of the MADD, but only the data of the communication region. The GATE then adds a 1-bit valid bit to the computation results of the MADD and inputs this data to the serializer.
This valid bit serves as the read-enable signal of the FIFO at the output of the deserializer that receives the data from the adjacent FPGA. The valid bit therefore ensures that the communication-region data used in the computation has been stored in the FIFO.

C. Development Flow

We implemented a prototype system composed of many FPGAs for logic verification of the proposed method. This implementation uses the boards of ScalableCore. We explain the rationale for an implementation in which multiple FPGAs are connected. Logic verification on small FPGAs is easier than on an implementation in a single large FPGA. Even if an FPGA breaks down, the system operates normally once that FPGA is replaced. The approach thus has several merits. For this prototype the data type is integer, for ease of debugging. We wrote the software simulator in C++; it emulates the stencil computation on multiple FPGA nodes with cycle-level accuracy. The execution results of the software simulator were verified by comparing them with the results of a function-level-accurate stencil computation program written in C. We then implemented the circuits in Verilog HDL with reference to the cycle-level software simulator and verified them using iverilog and GTKWave. We used an integer-type MADD. The Ser/Des implementation uses data recovery and NRZI coding.

D. Initialization Mechanism

As described in Section II, the computation order differs from FPGA to FPGA in order to increase the acceptable communication latency. To determine its computation order, every FPGA uses its own position coordinate in the system. We implemented a mechanism to provide the position coordinates.

Fig. 10. Providing coordinates.
Fig. 11. Sending the start signal of computation.
Fig. 12. Configuration diagram of the mesh-connected FPGA array.

Fig. 10 shows how the position coordinates are provided to all FPGAs. Each square in Fig. 10 represents an FPGA node, and we define the node in the upper left as the Master node. The Master node provides the adjacent nodes with their positions. A horizontal arrow in Fig. 10 represents a delivery in which the x-coordinate is obtained by adding 1 to the sender's own x-coordinate. A vertical arrow in Fig. 10 represents a delivery in which the y-coordinate is obtained by adding 1 to the sender's own y-coordinate. Eventually, all FPGA nodes know their own position coordinates.

In this array system, the start of computation in the first Iteration must be precisely synchronized, because if there is a skew the system cannot obtain the communication-region data needed for the next Iteration. Therefore, we designed a prototype circuit that generates the start signal of computation. Fig. 11 shows the communication pattern of the start signal. Each square in Fig. 11 represents an FPGA node, and each arrow represents the start signal of computation in the first Iteration. The FPGA node in the upper left first sends the start signal to the FPGA nodes to its right and below it. An FPGA node that receives the start signal from the node above or to its left immediately sends the start signal to the nodes to its right and below it. In this way, all FPGA nodes receive the start signal.

IV. EVALUATION

A. Environment

Fig. 12 shows the hardware configuration of the FPGA array (footnote 2). The array system can be scaled freely according to the grid size of the stencil computation by connecting FPGAs in a mesh. Each node in the FPGA array is equipped with an FPGA (Xilinx Spartan-6 XC6SLX16), and the BlockRAM capacity of each FPGA is 72 KB. The MADD is implemented in the FPGA with IP cores generated by the Xilinx CORE Generator.
Implementing a single MADD consumes four of the 32 DSP blocks in a Spartan-6 FPGA. Therefore, eight MADDs can be implemented in a single FPGA. We wrote these circuits in Verilog HDL and used Xilinx ISE 13.3 to generate the circuit information. We used a stencil computation program written in C both to verify the implemented circuits and to compare execution speed. The verification version uses the SoftFloat library, whose computation precision is the same as the floating-point arithmetic of the FPGA. The speed-comparison version does not use the library, because it is important for that version to run fast. Section IV-B shows the performance evaluation of a single FPGA in which eight MADDs are implemented. The grid size (footnote 3) is 64 × 128, and the computation is run for a set number of Iterations. The computation result is output to a PC connected by USB, and we compared it with the result of the program written in C; the data of all grid-points matched.

B. Hardware Resource Consumption

The LUT utilization with a single MADD and with eight MADDs implemented in the FPGA is 9% and 50%, respectively (footnote 4). Table I shows the hardware resource consumption of a single FPGA; this table does not include the module for communicating with adjacent FPGAs. The utilization of DSP blocks is 100% because implementing eight MADDs consumes all the DSP blocks.

Footnote 2: The SRAM in Fig. 12 is not used.
Footnote 3: The total number of grid-points that can be computed is 72 KB (BlockRAM capacity) / 4 B (data size of a single-precision floating-point grid-point) = 18K. However, the width of the grid is 64 because of the number of MADDs and the scheduling condition.
Footnote 4: The 9% includes the communication module for output to the PC, and optimizations are enabled when multiple MADDs are implemented. Therefore, the LUT utilization with eight MADDs implemented in the FPGA is less than 8 × 9%.

TABLE I
HARDWARE RESOURCE CONSUMPTION (Device Utilization Summary)

Slice Logic Utilization   Used / Available   Utilization
LUTs                      4,560 / 9,112      50%
Slices                    1,527 / 2,278      67%
BlockRAM                  24 / 32            75%
DSP48A1                   32 / 32            100%

TABLE II
DESIGN PARAMETERS

Parameter                       Symbol   Unit
operation frequency             F        GHz
number of FPGAs                 N_FPGA
number of MADDs                 N_MADD
hardware peak performance       P_peak   GFlop/s
number of computations          OP
total number of grid-points     GRID
number of Iterations            ITER

Fig. 13. Peak and effective performance of stencil computation in a single FPGA.

C. Performance of Single FPGA Node

Table II shows the design parameters used to analyze the performance of the FPGA array. The operation frequency is F GHz, the number of MADDs implemented in each FPGA is N_MADD, and the number of FPGAs is N_FPGA. Each MADD can perform one addition and one multiplication in every cycle. For this reason, the hardware peak performance of a single MADD is 2F GFlop/s, and the hardware peak performance of a single FPGA is 2 × F × N_MADD GFlop/s. The hardware peak performance of an FPGA array in which N_FPGA FPGAs are connected is therefore:

P_peak = 2 × F × N_FPGA × N_MADD   (1)

When the operation frequency is 0.16 GHz, the hardware peak performance P_peak is 2.56 GFlop/s because N_MADD is 8 and N_FPGA is 1. However, as shown in Fig. 2, the average utilization of the MADD units is 100 × (4 + 3)/8 = 87.5%, because updating a single grid-point takes seven floating-point operations (footnote 5). Therefore, the peak performance at an operation frequency of 0.16 GHz is 2.56 × 0.875 = 2.24 GFlop/s. Fig. 13 shows the peak performance and the effective performance of the stencil computation on a single FPGA as a function of operation frequency. The effective performance is the total number of floating-point operations divided by the execution time. The total number of floating-point operations is given below using OP, GRID, and ITER from Table II (footnote 6). We measured the execution time with a stopwatch.
OP × GRID × ITER

As shown in Fig. 13, the peak and effective performance of the stencil computation are almost the same, which shows that the overhead of the proposed computation method is small. For comparison, we compiled the stencil computation program written in C with the -O3 option. Its effective performance is 8.64 GFlop/s when running on a single thread of an Intel Core i7 processor at an operation frequency of 3.4 GHz. This is adequate performance compared with the 2.8 GFlop/s reported in [6]. The prototype system for power estimation, without communication modules, shows that a single FPGA implementing eight floating-point adders and eight floating-point multipliers achieves 2.24 GFlop/s at 0.16 GHz with 2.37 W power consumption. Therefore, a single FPGA achieves 26% of the performance of the Intel Core i7.

Footnote 5: Four multiplications and three additions.
Footnote 6: OP is the total number of computations required to update the data of one grid-point from time-step k to k+1. In this case, OP is seven because of the four multiplications and three additions.

Fig. 14. Power consumption.

D. Power Consumption in Single FPGA Node

We connected multiple FPGA nodes running at an operation frequency of 0.16 GHz and measured the power consumption per FPGA node with a Watt Checker. The power consumption of an FPGA system with 10 connected FPGA nodes is 25 W. Fig. 14 shows the power consumption as a function of the number of FPGA nodes. Taking a linear approximation of the plotted points, the power consumption of a single FPGA node is about 2.37 W. The constant term (1.1404) of the linear approximation in Fig. 14 is thought to be the power consumption of the power board that supplies each FPGA node.
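The arithmetic behind equation (1), the 87.5% utilization, and the linear power model can be checked with a short script. This is only a sketch: the function names are hypothetical, and the power model assumes the fit in Fig. 14 has the form (watts per node) × (number of nodes) + (constant), with the slope taken as the reported 2.37 W and the constant as 1.1404.

```python
# Four multiplications and three additions per update, out of the eight
# operation slots that four MADD cycles provide.
MADD_UTILIZATION = (4 + 3) / 8


def peak_gflops(freq_ghz, n_fpga, n_madd):
    """Equation (1): each MADD performs one add and one multiply per cycle."""
    return 2.0 * freq_ghz * n_fpga * n_madd


def effective_upper_bound_gflops(freq_ghz, n_fpga, n_madd):
    """Peak performance scaled by the MADD utilization of the stencil."""
    return peak_gflops(freq_ghz, n_fpga, n_madd) * MADD_UTILIZATION


def array_power_watts(n_fpga, watts_per_node=2.37, board_watts=1.1404):
    """Assumed linear model of Fig. 14: slope per node plus a constant term."""
    return watts_per_node * n_fpga + board_watts


if __name__ == "__main__":
    print(peak_gflops(0.16, 1, 8))                   # single node, 8 MADDs
    print(effective_upper_bound_gflops(0.16, 1, 8))  # with 87.5% utilization
    print(array_power_watts(10))                     # ten-node array
```

With these inputs the model gives about 2.56 GFlop/s peak and 2.24 GFlop/s effective for one node, and roughly 24.8 W for ten nodes, which is consistent with the 25 W measurement quoted above.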

Fig. 15. Estimation of effective performance improvement rate.

E. Operation Check in Real System

So far, we have confirmed that an array system composed of four FPGA nodes (2 × 2) runs without stalling when the operation frequency is 40 MHz and the communication frequency is 100 MHz. We verified the array system by comparing the sum of the grid-point data assigned to each FPGA with the execution result of the stencil computation program written in C.

F. Estimation of Effective Performance in 256 FPGA Nodes

In this section, we estimate the effective performance when F is 0.16 GHz, N_MADD is 8, and N_FPGA is 256. Fig. 15 shows the estimated improvement rate of effective performance as a function of the number of FPGAs. From equation (1), P_peak is 655 GFlop/s. However, since the utilization of the MADD units is 87.5%, the upper limit of effective performance without communication overhead is 655 × 0.875 = 573 GFlop/s. We also estimate the effective performance per watt. From the approximate expression in Fig. 14, the power consumption of an FPGA array composed of 256 FPGAs is estimated at 607 W. Therefore, the estimated effective performance per watt is 0.944 GFlop/s/W.

V. RELATED WORK

Many works that optimize stencil computation for multi-core processors and GPUs have been reported. Augustin et al. [6] report stencil computation on an Intel Xeon E5220 quad-core processor running at 2.26 GHz. A single core of the processor achieves 2.8 GFlop/s, just 31% of the peak performance. Moreover, two E5220 processors achieve 15.9 GFlop/s with 8 cores, 21.8% of the peak. Phillips et al. [7] report stencil computation on an NVIDIA Tesla C1060 GPU. A single GPU achieves 51.2 GFlop/s, 65.6% of the peak performance in double-precision arithmetic. This computation performance is reduced further on a GPU cluster, where for the reported grid the computation performance is 42.2% of the peak performance.
Several studies on designing hardware for stencil computation using FPGAs have been reported [2][8]. Sano et al. [2] propose hardware for stencil computation that is composed of a systolic array of programmable processing elements, and implement a prototype using multiple FPGAs (Altera Stratix family). They achieve performance scalability with a constant memory bandwidth by implementing an architecture that applies a pipeline scheduling method proposed for Cellular Automata. However, that work differs from ours in the implemented architecture and the type of FPGA. Sato et al. [8] implement circuits that solve Poisson's equation using an FPGA array.

VI. CONCLUSION

This paper described a high-performance stencil computing method optimized for a 2D-mesh-connected FPGA array, together with the implementation results of the proposed method. We showed that the proposed architecture works correctly on a real 2D-mesh-connected FPGA array. We developed a prototype system, without communication modules, for estimating power consumption. This prototype system of a single FPGA with eight floating-point adders and eight floating-point multipliers achieves 2.24 GFlop/s at 0.16 GHz with 2.37 W power consumption.

ACKNOWLEDGMENT

This work is supported in part by Core Research for Evolutional Science and Technology (CREST), JST.

REFERENCES

[1] Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pp. 4:1-4:12, Piscataway, NJ, USA. IEEE Press.
[2] K. Sano, Y. Hatsuda, and S. Yamamoto. Scalable streaming-array of simple soft-processors for stencil computations with constant memory-bandwidth. In Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on, May.
[3] M. Shafiq, M. Pericas, R. de la Cruz, M. Araya-Polo, N. Navarro, and E. Ayguade. Exploiting memory customization in FPGA for 3D stencil computations. In Field-Programmable Technology, FPT International Conference on, December.
[4] Kobayashi Ryohei, Sano Shintaro, Takamaeda-Yamazaki Shinya, and Kise Kenji. High performance stencil computation on mesh connected FPGA arrays. In Transactions on Symposium on Advanced Computing Systems and Infrastructures, Vol. 2012, May.
[5] Shinya Takamaeda-Yamazaki, Shintaro Sano, Yoshito Sakaguchi, Naoki Fujieda, and Kenji Kise. In International Symposium on Applied Reconfigurable Computing (ARC 2012), March.
[6] Werner Augustin, Vincent Heuveline, and Jan-Philipp Weiss. Optimized stencil computation using in-place calculation on modern multicore systems. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, Berlin, Heidelberg. Springer-Verlag.
[7] E.H. Phillips and M. Fatica. Implementing the Himeno benchmark with CUDA on GPU clusters. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pp. 1-10, April.
[8] Sato Kazuki, Jiang Li, Takahashi Kenichi, Tamukoh Hakaru, Kobayashi Yuichi, and Sekine Masatoshi. Performance evaluation of Poisson equation and CIP method implemented on FPGA array. IEICE Technical Report. Circuits and Systems, Vol. 109, No. 396.
