HA-PACS Project Challenge for Next Step of Accelerating Computing

1 HA-PACS Project: Challenge for Next Step of Accelerating Computing
Taisuke Boku, Center for Computational Sciences, University of Tsukuba

2 Outline of talk
- Introduction of CCS, U. Tsukuba
- HA-PACS Project overview
- HA-PACS Base Cluster
- HA-PACS Applications
- TCA (Tightly Coupled Accelerators)
- Summary

3 CCS at University of Tsukuba
- Center for Computational Sciences
- Established in 1992
  - 12 years as Center for Computational Physics
  - Reorganized as Center for Computational Sciences in 2004
- Daily collaborative research between two kinds of researchers (about 30 in total)
  - Computational Scientists, who have NEEDS (applications)
  - Computer Scientists, who have SEEDS (system & solution)

4 CCS (cont'd)
- Application fields
  - Particle Physics
  - Astrophysics
  - Nuclear Physics
  - Quantum Condensed Matter Physics
  - Life Science
  - Global Environment Science
- Computer system fields
  - High Performance Computing Systems
  - Computational Informatics
- Not a general Computer Service Center
- Collaborative Research Center for Computational Sciences and Computer Science

5 Project plan of HA-PACS
- HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
- Accelerating critical problems in various scientific fields at the Center for Computational Sciences, University of Tsukuba
  - The target application fields will be partially limited
  - Current targets: QCD, Astro, QM/MM (quantum mechanics / molecular mechanics, for life science)
- Two parts
  - HA-PACS base cluster: for development of GPU-accelerated code for the target fields, and for production runs of that code
  - HA-PACS/TCA (TCA = Tightly Coupled Accelerators): for elementary research on new technology for accelerated computing
    - Our original communication system based on PCI-Express, named PEARL, and a prototype communication chip named PEACH2

6 GPU Computing: current trend of HPC
- GPU clusters in the TOP500 list of Nov. 2011
  - 2nd 天河 Tianhe-1A (Rpeak = 4.70 PFLOPS)
  - 4th 星雲 Nebulae (Rpeak = 2.98 PFLOPS)
  - 5th TSUBAME2.0 (Rpeak = 2.29 PFLOPS)
  - (1st K Computer, Rpeak = 11.28 PFLOPS)
- Features
  - high peak performance / cost ratio
  - high peak performance / power ratio
- Large-scale applications with GPU acceleration do not yet run in production on GPU clusters
- Our first target is to develop large-scale applications accelerated by GPUs for real computational sciences

7 Issues of GPU Cluster
- Problems of GPGPU for HPC
  - Data I/O performance limitation
    Ex) GPGPU: PCIe gen2 x16, peak performance 8 GB/s (I/O) vs. 665 GFLOPS (NVIDIA M2090)
  - Memory size limitation
    Ex) M2090: 6 GByte, much smaller than CPU host memory
  - Communication between accelerators: no direct path (external), so communication latency via the CPU becomes large
    Ex) GPGPU: GPU mem -> CPU mem -> (MPI) -> CPU mem -> GPU mem
- Research on direct communication between GPUs is required
- Our other target is to develop a direct communication system between external GPUs, as a feasibility study for future accelerated computing
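The I/O limitation above can be put in numbers as a machine balance (bytes moved per floating-point operation at peak). The short Python sketch below uses only figures quoted in these slides (8 GB/s for PCIe gen2 x16, 665 GFLOPS and 177 GB/s for the M2090); the function and variable names are illustrative assumptions, not part of the original talk.

```python
# Figures quoted on the slides for one NVIDIA M2090.
PCIE_GEN2_X16_GBPS = 8.0   # host <-> GPU peak bandwidth, GB/s
GPU_MEM_GBPS = 177.0       # GPU device-memory bandwidth, GB/s
GPU_DP_GFLOPS = 665.0      # double-precision peak, GFLOPS


def bytes_per_flop(bandwidth_gbps: float, gflops: float) -> float:
    """Bytes that can be moved per floating-point operation at peak rates."""
    return bandwidth_gbps / gflops


pcie_balance = bytes_per_flop(PCIE_GEN2_X16_GBPS, GPU_DP_GFLOPS)
mem_balance = bytes_per_flop(GPU_MEM_GBPS, GPU_DP_GFLOPS)
ratio = mem_balance / pcie_balance

print(f"PCIe balance:          {pcie_balance:.4f} byte/flop")
print(f"Device-memory balance: {mem_balance:.4f} byte/flop")
print(f"Device memory is about {ratio:.1f}x faster than the PCIe link")
```

At roughly 0.012 byte/flop over PCIe, data has to be reused many times on the GPU before the link stops being the bottleneck, which is the motivation for keeping data resident on the accelerators.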

8 Project Formation
- HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
  - Apr. 2011 - Mar. 2014, 3-year project (the system will be maintained until Mar. 2016)
- Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura)
  - Develop large-scale GPU applications: 14 members
  - Elementary Particle Physics, Astrophysics, Bioscience, Nuclear Physics, Quantum Matter Physics, Global Environmental Science, Computational Informatics, High Performance Computing Systems
- Project Office for Exascale Computing System Development (Leader: Prof. T. Boku)
  - Develop two types of GPU cluster systems: 15 members

9 HA-PACS base cluster (Feb. 2012)

10 HA-PACS base cluster
[Photos: front view; side view]

11 HA-PACS base cluster
[Photos: front view of 3 blade chassis; rear view of one blade chassis with 4 blades; rear view of InfiniBand switch and cables (yellow = fibre, black = copper)]

12 HA-PACS: base cluster (computation node)
- CPU: AVX (2.6 GHz x 8 flop/clock) = 20.8 GFLOPS per core, x16 = 332.8 GFLOPS
- Host memory: (16 GB, 12.8 GB/s) x8 = 128 GB, 102.4 GB/s
- GPU: 665 GFLOPS x4 = 2660 GFLOPS
- GPU memory: (6 GB, 177 GB/s) x4 = 24 GB, 708 GB/s
- CPU-GPU link: 8 GB/s
- Total: 3 TFLOPS
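The per-node arithmetic on this slide can be reproduced with a few lines of Python. This is only a check of the slide's own numbers; the variable names are illustrative, and the "4 lanes x (add + multiply)" reading of 8 flop/clock follows the "2.6 x 4 x 2" expression on the CPU slide.

```python
# Per-node figures as quoted on the slide.
CPU_CLOCK_GHZ = 2.6
FLOP_PER_CLOCK = 8       # AVX: 4 double-precision lanes x (add + multiply)
CPU_CORES = 16           # 2 sockets x 8 cores
GPU_DP_GFLOPS = 665.0    # one NVIDIA M2090
GPUS = 4

core_gflops = CPU_CLOCK_GHZ * FLOP_PER_CLOCK
cpu_gflops = core_gflops * CPU_CORES
gpu_gflops = GPU_DP_GFLOPS * GPUS
node_gflops = cpu_gflops + gpu_gflops

# Memory: 8 DIMM channels of (16 GB, 12.8 GB/s); 4 GPUs of (6 GB, 177 GB/s).
host_mem_gb, host_bw_gbps = 16 * 8, 12.8 * 8
gpu_mem_gb, gpu_bw_gbps = 6 * GPUS, 177.0 * GPUS

print(f"CPU:  {core_gflops:.1f} GFLOPS/core, {cpu_gflops:.1f} GFLOPS/node")
print(f"GPU:  {gpu_gflops:.0f} GFLOPS/node")
print(f"Node: {node_gflops:.1f} GFLOPS (about 3 TFLOPS)")
print(f"Host memory: {host_mem_gb} GB, {host_bw_gbps:.1f} GB/s")
print(f"GPU memory:  {gpu_mem_gb} GB, {gpu_bw_gbps:.0f} GB/s")
```

About 89% of the node's peak comes from the four GPUs, so application performance depends almost entirely on how well the GPU code runs.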

13 HA-PACS: base cluster unit (CPU)
- Intel Xeon E5 (SandyBridge-EP) x 2
  - 8 cores/socket (16 cores/node) at 2.6 GHz
  - AVX (256-bit SIMD) on each core
    - peak perf./core = 2.6 x 4 x 2 = 20.8 GFLOPS
    - peak perf./socket = 20.8 x 8 = 166.4 GFLOPS
    - peak perf./node = 332.8 GFLOPS
- Each socket supports up to 40 lanes of PCIe gen3
  - great performance for connecting multiple GPUs without an I/O performance bottleneck
  - the current NVIDIA M2090 supports only PCIe gen2, but the next generation (Kepler) will support PCIe gen3
- M2090 x4 can be connected to 2 SandyBridge-EP, still leaving PCIe gen3 x8 x2
  - InfiniBand QDR x 2
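The PCIe lane budget described above can be checked with a short Python sketch. The lane counts are those given on the slide (40 lanes per socket, x16 per GPU, x8 per InfiniBand HCA); the dictionary layout and names are illustrative assumptions.

```python
LANES_PER_SOCKET = 40   # PCIe gen3 lanes per SandyBridge-EP socket
SOCKETS = 2

# device name -> (count, lanes per device), as described on the slide
devices = {
    "NVIDIA M2090": (4, 16),
    "InfiniBand QDR HCA": (2, 8),
}

total_lanes = LANES_PER_SOCKET * SOCKETS
gpu_lanes = devices["NVIDIA M2090"][0] * devices["NVIDIA M2090"][1]
left_after_gpus = total_lanes - gpu_lanes
used_lanes = sum(count * width for count, width in devices.values())

print(f"Total lanes:           {total_lanes}")
print(f"Lanes used by GPUs:    {gpu_lanes}")
print(f"Left after GPUs:       {left_after_gpus} (= x8 x 2)")
print(f"Left after GPUs + IB:  {total_lanes - used_lanes}")
```

With four x16 GPUs and two x8 HCAs, all 80 lanes of the two sockets are accounted for, which is why the later TCA variations need a PCIe switch or narrower links to attach PEACH2.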

14 HA-PACS: base cluster unit (GPU)
- NVIDIA M2090 x 4
  - Number of processor cores: 512
  - Processor core clock: 1.3 GHz
  - DP 665 GFLOPS, SP 1331 GFLOPS
  - PCI Express gen2 x16 system interface
  - Board power dissipation: <= 225 W
  - Memory clock: 1.85 GHz, size: 6 GB with ECC, 177 GB/s
  - Shared/L1 Cache: 64 KB, L2 Cache: 768 KB

15 HA-PACS: base cluster unit (blade node)
- Front view: 2x NVIDIA Tesla M2090, 1x PCIe slot for HCA, 2x 2.6 GHz 8-core SandyBridge-EP, 2x 2.5" HDD, 2x NVIDIA Tesla M2090 (along the air flow)
- Rear view: Power Supply Unit and Fan
- 8U enclosure
- 4 nodes
- 3 PSU (Hot Swappable)
- 6 Fans (Hot Swappable)

16 Basic performance data
- MPI pingpong
  - 6.4 GB/s (N1/2 = 8 KB)
  - with dual-rail InfiniBand QDR (Mellanox ConnectX-3)
  - actually FDR for the HCA and QDR for the switch
- PCIe benchmark (Device -> Host memory copy), aggregated perf. for 4 GPUs simultaneously
  - 24 GB/s (N1/2 = 20 KB)
  - PCIe gen2 x16 x4, theoretical peak = 8 GB/s x4 = 32 GB/s
- Stream (memory)
  - 74.6 GB/s
  - theoretical peak = 102.4 GB/s
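To read these figures, it helps to recall that N1/2 is the message size at which half of the asymptotic bandwidth is reached. A minimal Python sketch follows, assuming the simple Hockney model B(n) = r_inf * n / (n + N1/2) and treating the measured 6.4 GB/s and 24 GB/s as the asymptotic rates; both are modelling assumptions, not statements from the talk. The efficiency lines divide the measured values by the theoretical peaks quoted in the slides.

```python
def effective_bandwidth(n_bytes: float, r_inf: float, n_half: float) -> float:
    """Hockney model: bandwidth for an n-byte message, in the units of r_inf."""
    return r_inf * n_bytes / (n_bytes + n_half)


KB = 1024

# MPI pingpong: 6.4 GB/s with N1/2 = 8 KB (values from the slide).
mpi_at_n_half = effective_bandwidth(8 * KB, 6.4, 8 * KB)
mpi_at_1mb = effective_bandwidth(1024 * KB, 6.4, 8 * KB)

# Efficiency against the theoretical peaks quoted in the slides.
pcie_efficiency = 24.0 / 32.0      # 4 GPUs, Device -> Host
stream_efficiency = 74.6 / 102.4   # host memory

print(f"MPI at 8 KB: {mpi_at_n_half:.2f} GB/s")
print(f"MPI at 1 MB: {mpi_at_1mb:.2f} GB/s")
print(f"PCIe efficiency:   {pcie_efficiency:.1%}")
print(f"Stream efficiency: {stream_efficiency:.1%}")
```

Under this model a 1 MB message already reaches more than 99% of the asymptotic MPI bandwidth, while messages of a few KB are dominated by start-up cost.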

17 PCIe Host:Device communication performance
- Slower start on Host -> Device compared with Device -> Host

18 HA-PACS Application (1): Elementary Particle Physics
- Multi-scale physics
  - Investigate hierarchical properties via direct construction of nuclei in lattice QCD
  - GPU to solve large sparse linear systems of equations
- Finite temperature and density
  - Phase analysis of QCD at finite temperature and density
  - GPU to perform matrix-matrix products of dense matrices
- [Figures: quark, proton, neutron, nucleus; expected QCD phase diagram]

19 HA-PACS Applications (2): Astrophysics
(A) Collisional N-body Simulation
- Globular Clusters
  - Formation of the most primordial objects, formed more than 10 giga-years ago
  - Fossil objects as a clue to investigate the primordial universe
- Massive Black Holes in Galaxies
  - Understanding of the formation of massive black holes in galaxies
- Numerical simulations of complicated gravitational interactions between stars and multiple black holes in galaxy centers
- Direct (brute force) calculations of accelerations and jerks are required to achieve the required numerical accuracy
- Computations of the accelerations of particles and their time derivatives (jerks) are time consuming
- Accelerations and jerks are computed on GPU
(B) Radiation Transfer
- First Stars and Re-ionization of the Universe
  - Understanding of the formation of the first stars in the universe and the subsequent re-ionization of the universe
- Accretion Disks around Black Holes
  - Study of the high-temperature regions around black holes
- Calculation of the physical effects of photons emitted by stars and galaxies on the surrounding matter
- So far poorly investigated because of its huge computational cost, though it is of critical importance in the formation of stars and galaxies
- Computations of the radiation intensity and the resulting chemical reactions based on ray-tracing methods can be highly accelerated with GPUs owing to their high concurrency
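The direct (brute-force) evaluation of accelerations and jerks mentioned in part (A) can be sketched in plain Python. This is a minimal O(N^2) CPU reference, assuming G = 1 and the standard pairwise formulas a_i = sum_j m_j r_ij / |r_ij|^3 and j_i = sum_j m_j [v_ij / |r_ij|^3 - 3 (r_ij . v_ij) r_ij / |r_ij|^5]; it is not the project's GPU code, and the function name and the optional softening parameter `eps2` are illustrative assumptions.

```python
from math import sqrt


def acc_and_jerk(pos, vel, mass, eps2=0.0):
    """Direct-summation accelerations and jerks for N bodies (G = 1).

    pos, vel: lists of [x, y, z]; mass: list of floats.
    eps2 is an optional softening length squared (an assumption, default 0).
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    jerk = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dr = [pos[j][k] - pos[i][k] for k in range(3)]  # r_ij
            dv = [vel[j][k] - vel[i][k] for k in range(3)]  # v_ij
            r2 = sum(x * x for x in dr) + eps2
            rv = sum(x * y for x, y in zip(dr, dv))          # r_ij . v_ij
            inv_r3 = 1.0 / (r2 * sqrt(r2))
            for k in range(3):
                acc[i][k] += mass[j] * dr[k] * inv_r3
                jerk[i][k] += mass[j] * (dv[k] - 3.0 * rv / r2 * dr[k]) * inv_r3
    return acc, jerk


# Two unit masses one length unit apart; the second moves along +y.
acc, jerk = acc_and_jerk(
    pos=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    vel=[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    mass=[1.0, 1.0],
)
print("acc: ", acc)
print("jerk:", jerk)
```

Every (i, j) pair is independent, so the inner loop maps naturally onto GPU threads; that independence is what makes this kernel a good candidate for acceleration.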

20 HA-PACS Application (3): Bioscience
- GPU acceleration
  - Direct Coulomb (Gromacs, NAMD, Amber)
  - Two-electron integral
- DNA-protein complex
- Macroscale MD
- QM region > 100 atoms
- Reaction mechanisms
- QM/MM-MD

21 HA-PACS Application (4)
- Other advanced research in the HPC Division of CCS
  - XcalableMP-dev (XMP-dev): an easy and simple programming language to support distributed-memory and GPU-accelerated computing for large-scale computational sciences
  - G8 NuFuSE (Nuclear Fusion Simulation for Exascale) project: platform for porting plasma simulation code with GPU technology
  - Climate simulation, especially LES (Large Eddy Simulation) for cloud-level resolution in city-model-size simulations
  - Any other collaboration...

22 HA-PACS: TCA (Tightly Coupled Accelerator)
- TCA: Tightly Coupled Accelerator
  - Direct connection between accelerators (GPUs)
  - Using PCIe as a communication device between accelerators
    - Most acceleration devices and other I/O devices are connected by PCIe as PCIe end-points (slave devices)
    - An intelligent PCIe device logically enables an end-point device to communicate directly with other end-point devices
- PEARL: PCI Express Adaptive and Reliable Link
  - We already developed such a PCIe device (PEACH, PCI Express Adaptive Communication Hub) in the JST-CREST project "low power and dependable network for embedded system"
  - It enables direct connection between nodes by a PCIe Gen2 x4 link
- Improving PEACH for HPC to realize TCA

23 PEACH
- PEACH: PCI-Express Adaptive Communication Hub
- An intelligent PCI-Express communication switch that uses the PCIe link directly for node-to-node interconnection
- The edge of a PEACH PCIe link can be connected to any peripheral device, including a GPU
- Prototype PEACH chip
  - 4-port PCI-E gen2 with x4 lanes / port
  - PCI-E link edge control feature: root complex and end points are automatically switched (flipped) according to the connection handling
  - Another fault-tolerance (reliability) function is implemented: flipping the network link to tolerate a single link fault
- In the HA-PACS/TCA prototype development, we will enhance the current PEACH chip -> PEACH2

24 HA-PACS/TCA (Tightly Coupled Accelerator)
- True GPU-direct
  - current GPU clusters require 3-hop communication (3-5 memory copies)
  - for strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput
- Enhanced version of PEACH -> PEACH2
  - x4 lanes -> x8 lanes
  - hardwired on main data path and PCIe interface fabric
- [Figure: two nodes, each containing CPUs, MEM, GPUs (with MEM), an IB HCA and a PEACH2 attached via PCIe; the IB HCAs are connected through an IB switch]
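A rough way to see why the 3-hop path hurts is a latency-plus-bandwidth model of each hop. The Python sketch below is a toy model under stated assumptions: the link bandwidths are the peak values quoted elsewhere in these slides (8 GB/s for PCIe gen2 x16, 4 GB/s per InfiniBand QDR link, 5 GB/s per PEACH2 link), the hops are not pipelined, and the per-hop latencies are hypothetical parameters that default to zero. It does not represent measured HA-PACS results.

```python
def hop_time(n_bytes: float, bandwidth_gbps: float, latency_s: float = 0.0) -> float:
    """Time for one hop: fixed latency plus size over bandwidth (GB = 1e9 bytes)."""
    return latency_s + n_bytes / (bandwidth_gbps * 1e9)


def staged_time(n_bytes, pcie_latency_s=0.0, ib_latency_s=0.0):
    """GPU mem -> CPU mem -> (MPI over IB) -> CPU mem -> GPU mem, no pipelining."""
    return (
        hop_time(n_bytes, 8.0, pcie_latency_s)    # device -> host copy
        + hop_time(n_bytes, 4.0, ib_latency_s)    # MPI over one IB QDR link
        + hop_time(n_bytes, 8.0, pcie_latency_s)  # host -> device copy
    )


def direct_time(n_bytes, link_latency_s=0.0):
    """Single GPU-to-GPU transfer over one PEACH2 link."""
    return hop_time(n_bytes, 5.0, link_latency_s)


message = 1_000_000  # 1 MB message
t_staged = staged_time(message)
t_direct = direct_time(message)
print(f"staged: {t_staged * 1e6:.0f} us, direct: {t_direct * 1e6:.0f} us")
```

For small messages the fixed latencies dominate, so the benefit of removing two hops would depend on the actual per-hop latencies, which this sketch leaves as parameters.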

25 Implementation of PEACH2: ASIC -> FPGA
- FPGA-based implementation
  - today's advanced FPGAs allow a PCIe hub with multiple ports
  - currently gen2 x8 lanes x 4 ports are available; gen3 will be available soon (?)
  - easy modification and enhancement
  - fits a standard (full-size) PCIe board
  - an internal multi-core general-purpose CPU with programmability is available
  - hardwired/firmware partitioning can easily be split at a certain level of the control layer
- Controlling PEACH2 for the GPU communication protocol
  - collaboration with NVIDIA for information sharing and discussion
  - based on the CUDA 4.0 device-to-device direct memory copy protocol

26 HA-PACS/TCA
- Node Cluster = NC: nodes (each with G x4, C x2 and PEACH2) connected by a PEARL ring network
- Node Cluster with 16 nodes
  - GPU x64 (G), CPU x32 (C)
  - GPU comm. with PCIe
  - IB link / node
  - CPU: Xeon E5, GPU: Kepler
- High-speed GPU-GPU comm. by PEACH within an NC (PCI-E gen2 x8 = 5 GB/s/link)
- InfiniBand QDR (x2) for NC-NC comm. (4 GB/s/link)
- 4 NC with 16 nodes, or 8 NC with 8 nodes = 360 TFLOPS extension to base cluster
- [Figure: Node Clusters connected by the InfiniBand network]
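The node-cluster arithmetic and the cost of a ring topology can be illustrated with a short Python sketch. It assumes a bidirectional ring in which a message takes the shorter direction; the slides do not state how PEARL routes traffic, so the hop counts are illustrative only, and `ring_hops` is a hypothetical helper name.

```python
def ring_hops(src: int, dst: int, n_nodes: int) -> int:
    """Hops between two nodes on a bidirectional ring (shorter direction)."""
    d = abs(src - dst) % n_nodes
    return min(d, n_nodes - d)


NODES_PER_NC = 16
GPUS_PER_NODE = 4
CPUS_PER_NODE = 2

gpus_per_nc = NODES_PER_NC * GPUS_PER_NODE
cpus_per_nc = NODES_PER_NC * CPUS_PER_NODE

# Hop counts from node 0 to every other node in a 16-node ring.
hops = [ring_hops(0, dst, NODES_PER_NC) for dst in range(1, NODES_PER_NC)]
max_hops = max(hops)
mean_hops = sum(hops) / len(hops)

print(f"GPUs per NC: {gpus_per_nc}, CPUs per NC: {cpus_per_nc}")
print(f"Ring of {NODES_PER_NC}: max {max_hops} hops, mean {mean_hops:.2f} hops")
```

Under this assumption the worst case grows linearly with ring size (8 hops for 16 nodes, 4 hops for 8 nodes), which is one consideration when choosing between "4 NC with 16 nodes" and "8 NC with 8 nodes".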

27 PEARL/PEACH2 variation (1)
- Option 1:
  - Performance of IB and PEARL can be compared on equal terms
  - Additional latency by PCIe switch
- [Figure labels: QPI, PCIe, GPU x4, PCIe SW, IB HCA, PEACH2; links G3 x16, G3 x8, G2 x8]

28 PEARL/PEACH2 variation (2)
- Option 2:
  - Requires only 72 lanes in total
  - Asymmetric connection among 3 blocks of GPUs
- [Figure labels: QPI, PCIe, GPU x4 (G3 x16 each), IB HCA (G3 x8), PCIe SW, PEACH2 (G2 x8)]

29 PEACH2 prototype board for TCA
- FPGA (Altera Stratix IV GX530)
- daughter board connector
- PCIe external link connector x2 (one more on daughter board)
- PCIe edge connector (to host server)
- power regulators for FPGA

30 Summary
- HA-PACS consists of two elements: the HA-PACS base cluster for application development, and HA-PACS/TCA for elementary study of advanced technology for direct communication among accelerating devices (GPUs)
- The HA-PACS base cluster started operation in Feb. 2012 with 802 TFLOPS peak performance
- The FPGA implementation of PEACH2 is finished for the prototype version in March, and will be enhanced for the final version in the following 6 months
- HA-PACS/TCA, with at least 300 TFLOPS of additional performance, will be installed around March
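The 802 TFLOPS figure can be related to the per-node peak given earlier in the talk (332.8 GFLOPS of CPU plus 2660 GFLOPS of GPU per node). The Python sketch below divides one by the other; the slides shown here do not state the node count, so the result is an inference from the quoted numbers rather than a figure from the talk.

```python
NODE_PEAK_GFLOPS = 332.8 + 2660.0   # CPU + GPU peak per node, from the node slide
SYSTEM_PEAK_TFLOPS = 802.0          # base cluster peak, from the summary

# Node count implied by the two quoted figures (an inference, see lead-in).
implied_nodes = round(SYSTEM_PEAK_TFLOPS * 1000.0 / NODE_PEAK_GFLOPS)
check_tflops = implied_nodes * NODE_PEAK_GFLOPS / 1000.0

print(f"Implied node count: {implied_nodes}")
print(f"{implied_nodes} nodes x {NODE_PEAK_GFLOPS:.1f} GFLOPS = {check_tflops:.2f} TFLOPS")
```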
