International IEEE Symposium on Field-Programmable Custom Computing Machines



Scalable Streaming-Array of Simple Soft-Processors for Stencil Computations with Constant Bandwidth

Kentaro Sano, Yoshiaki Hatsuda, Satoru Yamamoto
Graduate School of Information Sciences, Tohoku University, Japan (Sendai, north-east of Tokyo)
International Symposium May 3,

Outline
  Introduction
  Stencil computation and its streaming
  Architecture and design
  Performance model
  Evaluation
  Conclusions
(Photo: 9-FPGA prototype system)

Introduction

General-purpose µ-processors are used for high-performance computing systems.
  Cores are increasing for higher performance, but off-chip I/O bandwidth grows slowly.
  Memory/network bandwidth is insufficient for the entire arithmetic performance.
  Only a fraction of the peak performance is usually utilized in large-scale systems.

Stencil computation
  Typical scientific computing kernel.
  Small operational intensity: 0.5 Flop/byte.
  µ-processors assume a higher operational intensity (AMD Opteron X4: 4.2 Flop/byte).
  3.7 Flop/byte is lost (88% of the peak). Poor performance!

(Figure: four cores and a cache, connected through the off-chip I/O bandwidth to memory/network.)

Stencil Computation on CPUs/GPUs

Highly-optimized implementation techniques achieve:

  Processor                              Sustained efficiency
  1 core of Xeon E522 (quad-core)        31%
  8 cores of Xeon E522 SMP node          22%
  1 x NVidia Tesla C                     65%
  16 x NVidia Tesla C                    42%

Inefficiency is caused by:
  General-purpose structure, unsuitable for stencil computation
  Imbalanced bandwidth for the entire performance

Another promising way: scalable custom computing on FPGAs.
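The 88% loss quoted on the Introduction slide can be reproduced with a few lines of arithmetic. The sketch below assumes a simple bandwidth-bound model in which the attainable fraction of peak is the kernel's operational intensity divided by the intensity the processor is balanced for. The function name `attainable_fraction` is illustrative and does not come from the slides.

```python
def attainable_fraction(kernel_intensity, machine_balance):
    """Fraction of peak reachable when the kernel is limited by memory bandwidth.

    Both arguments are in Flop/byte. The result is capped at 1.0 because a
    kernel cannot exceed the arithmetic peak.
    """
    return min(1.0, kernel_intensity / machine_balance)


stencil_intensity = 0.5  # Flop/byte, stencil computation (from the slide)
opteron_balance = 4.2    # Flop/byte, AMD Opteron X4 (from the slide)

fraction = attainable_fraction(stencil_intensity, opteron_balance)
lost = 1.0 - fraction

print(f"attainable: {fraction:.1%}, lost: {lost:.1%}")
# prints "attainable: 11.9%, lost: 88.1%"
```

The lost share equals 3.7 / 4.2, which is where the slide's "3.7 Flop/byte is lost" comes from.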

This Research

Linear scalability for high-performance stencil computation by custom computing on multiple FPGAs.

Scalable Streaming-Array (SSA)
  Domain-specific architecture
  Policy to overcome high design cost (custom and programmable soft-cores):
    SW layer: easy to change
    Common HW layer: static, but customizable
    FPGA device

Major contributions
  Extensible architecture for a scalable multi-FPGA system with constant memory bandwidth
  Programmable design of soft-cores tailored for the application domain
  Performance model to predict the scalability
  9-FPGA system demonstrating 26 GFlop/s with only 2 GB/s

Iterative Stencil Computation

  for (n = 0; n < N_max; n++)        (time-step iteration)
    for (j = 0; j < J_max; j++)      (grid traverse)
      for (i = 0; i < I_max; i++)
        v(i,j) = f( v(i,j), v(i+1,j), v(i-1,j), v(i,j+1), v(i,j-1) )      (local computation)

(Figure: 2D iterative stencil computation on a computational grid; the stencil covers columns i-1, i, i+1 and rows j-1, j, j+1.)

Update all the grid-points with local stencil computation.
Repeat for successive time-steps.
Ex: Linear equation solvers (Jacobi method, etc.), difference-scheme computations (CFD, etc.)
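The loop nest on the Iterative Stencil Computation slide can be sketched as runnable Python. This is a minimal illustration, not the authors' code: it assumes that f is a weighted sum of the four neighbours with all weights equal to 0.25 (the Jacobi update for Laplace's equation) and that boundary points keep their values. The names `f`, `jacobi_step`, and `jacobi` are chosen here for illustration.

```python
def f(west, east, north, south):
    """Local computation: 4 multiplications and 3 additions (weights assumed 0.25)."""
    return 0.25 * west + 0.25 * east + 0.25 * north + 0.25 * south


def jacobi_step(v):
    """One time-step: every interior point is updated from the OLD neighbour values."""
    rows, cols = len(v), len(v[0])
    new = [row[:] for row in v]  # boundary points are copied unchanged
    for j in range(1, rows - 1):          # grid traverse, rows
        for i in range(1, cols - 1):      # grid traverse, columns
            new[j][i] = f(v[j][i - 1], v[j][i + 1], v[j - 1][i], v[j + 1][i])
    return new


def jacobi(v, n_max):
    """Time-step iteration: repeat the grid update n_max times."""
    for _ in range(n_max):
        v = jacobi_step(v)
    return v


# 5 x 5 grid with a "hot" top edge held at 1.0 and everything else at 0.0.
grid = [[0.0] * 5 for _ in range(5)]
grid[0] = [1.0] * 5

after_one = jacobi_step(grid)
after_many = jacobi(grid, 100)
```

Each call to `jacobi_step` reads the whole grid once and writes it once, which is why the operational intensity of this kernel is so low when every time-step goes through external memory.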

Streaming with a Cyclic Buffer

(Figure: a grid of I x J points is streamed through a cyclic buffer of size 2I. Data v(i,j) of time-step iteration n is input, and the active stencil S(i-1,j-1) produces v(i-1,j-1) of time-step iteration n+1.)

Tomoyoshi Kobori and Tsutomu Maruyama, "A High Speed Computation System for 3D FCHC Lattice Gas Model with FPGA," Proceedings of FPL2003, 2003.

Pipelining Multiple Iterations

(Figure: active stencils S(i-1,j-1), S(i-2,j-2), and S(i-3,j-3), each with a cyclic buffer of size 2I, are chained. The input v(i,j) of time-step iteration n yields v(i-1,j-1) of iteration n+1 and v(i-2,j-2) of iteration n+2.)

Pipelining execution with constant bandwidth!
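The two streaming slides can be illustrated in software with chained generators. The sketch below is an assumption-laden model, not the hardware design: each stage keeps the last 2I + 1 stream values in a cyclic buffer (the slides quote a size of 2I; one extra slot is used here for the incoming value), emits the point that lies one grid row behind the input, and passes boundary points through unchanged. The names `stencil_stage`, `stream_time_steps`, and `jacobi_reference`, and the 0.25 weights, are illustrative.

```python
def f(west, east, north, south):
    """Assumed local computation: equal-weight Jacobi update."""
    return 0.25 * west + 0.25 * east + 0.25 * north + 0.25 * south


def stencil_stage(stream, width, height):
    """Consume one time-step in raster order and yield the next time-step."""
    size = 2 * width + 1           # cyclic buffer: current value + 2*width older ones
    buf = [0.0] * size
    total = width * height

    def emit(c):
        j, i = divmod(c, width)
        if i == 0 or i == width - 1 or j == 0 or j == height - 1:
            return buf[c % size]   # boundary point: pass through
        return f(buf[(c - 1) % size], buf[(c + 1) % size],
                 buf[(c - width) % size], buf[(c + width) % size])

    for k, x in enumerate(stream):
        buf[k % size] = x          # overwrite the oldest slot
        if k >= width:
            yield emit(k - width)  # centre is one row behind the input
    for c in range(total - width, total):
        yield emit(c)              # flush the last row (all boundary points)


def stream_time_steps(grid, steps):
    """Read the grid once and push it through `steps` chained stages."""
    height, width = len(grid), len(grid[0])
    stream = (x for row in grid for x in row)
    for _ in range(steps):
        stream = stencil_stage(stream, width, height)
    flat = list(stream)
    return [flat[j * width:(j + 1) * width] for j in range(height)]


def jacobi_reference(grid, steps):
    """Plain loop-nest version used only to check the streamed result."""
    for _ in range(steps):
        new = [row[:] for row in grid]
        for j in range(1, len(grid) - 1):
            for i in range(1, len(grid[0]) - 1):
                new[j][i] = f(grid[j][i - 1], grid[j][i + 1],
                              grid[j - 1][i], grid[j + 1][i])
        grid = new
    return grid


grid = [[0.0] * 6 for _ in range(5)]
grid[0] = [1.0] * 6
streamed = stream_time_steps(grid, 3)
reference = jacobi_reference(grid, 3)
```

Adding a stage adds one time-step of work per element while the grid is still read and written only once, which is the sense in which the pipeline runs with constant bandwidth.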

Scalable Streaming-Array Architecture

(Figure: the chained active stencils S(i-1,j-1), S(i-2,j-2), and S(i-3,j-3) with their cyclic buffers (size 2I) are mapped to buffers and stages: Buffer, Stage 1; Buffer, Stage 2; Buffer, Stage 3. They compute time-steps n, n+1, and n+2.)

Multiple time-step execution per memory-read.

SSA Design for Jacobi Computation

Requirements
  High density and utilization of FP units: simple processing elements (PEs)
  Flexibility in application domain: programmable PEs and sequencers

(Figure: overview of SSA on a single FPGA, showing a Pipelined Stage Module.)


Processing Element (PE)

  8-stage pipeline (stage labels in the figure: MR, EX1, EX2, EX3, EX4, EX5, MW)
  Floating-point multiply and accumulate unit (MAC)
  Buffer, constants, and multiplexor
  Connections of PE(x,y): from PE(x-1,y); to PE(x+1,y); from/to PE(x,y-1); from/to PE(x,y+1)

Sequencer of each PSM
  Controls the PEs in the PSM
  Micro operations
  PSM = SIMD core

SSA Assembly Language

Two kinds of opcodes (control, computation):

  lset    loop-set with # of iterations
  bnz     branch if loop-counter is not zero
  mulp    multiply
  accp    multiply and accumulate-positive

Performance Model of SSA

(Figure: a stream of 2 words/cycle passes through FPGA 1, FPGA 2, ..., FPGA n; each FPGA holds an SSA built from PSMs of 8 PEs.)

Cycles of pipelined execution on multiple FPGAs
Speedup to a single FPGA
  Grid : to
  Iteration : 5 to 8
  Linear speedup up to 1s FPGAs for medium size

(Plot: speedup to a single FPGA versus number of FPGAs (each has 16 PSMs). Curves: linear; x=y=2048, g=8; x=y=512, g=8; x=y=256, g=8; x=y=128, g=8; x=y=128, g=5; x=y=128, g= )

Prototyping SSA for Evaluation

Master FPGA
  Connected to the host PC via USB I/F (NIOS II) and to DDR2 memory
  Controller, stream controller, I/O modules
  SSA (Scalable Streaming Array) of 8 x 12 PEs: 63% ALUTs, 99% DSPs
  max 17MHz, peak of 25.5 GFlop/s

Slave FPGAs (Slaves 2, 3, 4, 5, 6, 7, 8, 9)
  I/O modules, to/from the next slave
  SSA (Scalable Streaming Array) of 8 x 16 PEs: 79% ALUTs, 1% DSPs
  max 137MHz, peak of 34 GFlop/s

ALTERA Stratix III FPGA x 9
IEEE754 single-precision FP
Sustained 2 GB/s for reading and writing DDR2
PSM has 8 PEs.
Freq = 133MHz
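The two peak figures on the prototyping slide are consistent with a simple count of floating-point operations. The sketch below assumes that each PE's MAC unit completes one multiply and one add per cycle (2 Flop/cycle) and that the arrays run at the 133 MHz quoted on the slide. The function `peak_gflops` and its parameter names are illustrative.

```python
def peak_gflops(pe_columns, pe_rows, freq_mhz, flops_per_cycle=2):
    """Peak GFlop/s of a PE array, assuming every MAC is busy every cycle."""
    return pe_columns * pe_rows * flops_per_cycle * freq_mhz / 1000.0


master_peak = peak_gflops(8, 12, 133)  # master FPGA: 8 x 12 PEs
slave_peak = peak_gflops(8, 16, 133)   # slave FPGA: 8 x 16 PEs

print(f"master: {master_peak:.1f} GFlop/s, slave: {slave_peak:.1f} GFlop/s")
# prints "master: 25.5 GFlop/s, slave: 34.0 GFlop/s"
```

Under these assumptions the results match the "Peak of 25.5" and "Peak of 34" labels for the master and slave FPGAs.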

Photograph of 9-FPGA System

(Photo: Terasic DE3-15 boards. Master with DDR2, and Slaves 2 to 9.)

Benchmark Problems

  2D heat-propagation: Laplace's equation, Jacobi method, 78 iterations
  Grid sizes = 128 x 128, 512 x 512, 2048 x 2048
  Two boundary conditions: simple condition and complex condition
  Same hardware, different programs

(Figure: boundary conditions and results.)

Results

(Plot "Performance": performance [GFlop/s] versus number of FPGAs, showing the peak and the measured curve for each grid size.)
(Plot "Performance per Watt": performance per Watt [GFlop/sW] and power [W] versus number of FPGAs, with and without fans, computing and idle; 1.36 GFlop/sW is marked.)

  Linear scalability already for a small problem
  Sustained 26 GFlop/s for 2 GB/s bandwidth
  Utilization: 87.5% (ideal for 4 mul. & 3 add)
  Power efficiency of 1.36 GFlop/sW, competing with the top 3 of the Green500 list

  Green500, Nov
  Rank   GFlop/sW   Machine
  1st    1.68       Blue Gene/Q, US
  2nd    1.44       GRAPE-DR, JP
  3rd    0.96       Tsubame, JP

Conclusions

Scalable Streaming-Array
  High-performance stencil computation
  Extensible multi-FPGA system
  Programmable simple soft-cores
  Performance model
  Design policy: SW layer, common HW layer, FPGA device

9-FPGA prototype system
  Linear scalability and 26 GFlop/s for only 2 GB/s
  1.36 GFlop/sW, competing with the Green500 list

Future work
  Larger and complex 3D problems (fluid dynamics)
  SSA compiler
  SSA synthesis tool
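The 87.5% utilization on the Results slide can be read as a count of useful operations per MAC issue slot. The sketch below assumes that every multiplication occupies one slot of the multiply-and-accumulate unit, that each slot could deliver 2 Flop (one multiply, one add), and that the first of the four multiplications has nothing to accumulate onto. The function `mac_utilization` is illustrative and not taken from the slides.

```python
def mac_utilization(n_mul, n_add):
    """Useful Flop divided by the Flop a MAC unit could deliver in the same slots."""
    slots = max(n_mul, n_add)      # one issue slot per multiply (or per add, if more)
    return (n_mul + n_add) / (2 * slots)


# Jacobi update as a weighted sum of four neighbours: 4 multiplies, 3 adds.
utilization = mac_utilization(4, 3)

print(f"utilization: {utilization:.1%}")
# prints "utilization: 87.5%"
```

Seven useful operations out of eight possible gives 7/8 = 87.5%, which is why the slide calls this value ideal for a kernel with 4 multiplications and 3 additions.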
