Appro Supercomputer Solutions: Appro and Tsukuba University Accelerator Cluster Collaboration


1 Appro Supercomputer Solutions: Appro and Tsukuba University Accelerator Cluster Collaboration. Steven Lyness, VP HPC Solutions Engineering.

2 About Appro. Over 20 years of experience: from OEM server manufacturer, to branded servers and clusters, to solutions manufacturer (2007 to 2012), to end-to-end supercomputer solutions moving forward.

3 Appro on the Top500. Over 2 PFLOPS (peak) from just five Top100 systems added to the Top500 in November 2011. Variety of technologies: Intel, AMD, NVIDIA; multiple server form factors; InfiniBand and GigE; fat tree and 3D torus. Excellent Linpack efficiency on non-optimized Sandy Bridge systems: 85.5% fat tree, 83% to 85% 3D torus.

4 Appro Milestones: installations in 2012 (site, peak performance): Los Alamos (LANL) > 1.8 PFLOPS; Sandia (SNL) > 1.2 PFLOPS; Livermore (LLNL) > 1.5 PFLOPS; Japan (Tsukuba, Kyoto) > 1 PFLOPS.

5 About University of Tsukuba: HA-PACS Project. HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences), Apr. 2011 to Mar. 2014, a 3-year project. Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba). Develop a next-generation GPU system (15 members): Project Office for Exascale Computing System Development (Leader: Prof. T. Boku), a GPU cluster based on the Tightly Coupled Accelerators architecture. Develop large-scale GPU applications (15 members): Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura), covering elementary particle physics, astrophysics, bioscience, nuclear/quantum physics, global environmental science, and high performance computing.

6 University of Tsukuba HA-PACS Project :: Problem Definition. Many technology discussions to determine the key requirements: fixed budget; high availability; latest processors / high FLOPS; 1:2 CPU:accelerator ratio; high bandwidth to the accelerators; high-bandwidth, low-latency interconnect (applications could take advantage of more than QDR IB); high I/O bandwidth to storage; easy to manage.

7 Solution Keys :: Fixed Budget Considerations. Need to find a balance between: performance (FLOPS; memory and I/O bandwidth); capacity (CPU quantity, GPU quantity, memory per core, I/O, storage); availability features; and ease of management / supportability. Architecture needed: high-availability nodes (power supplies, fans), IPC networks (e.g. InfiniBand), and service networks (provisioning and management).

8 Meeting Key Requirements. Challenge: create a solution with high availability: redundant power supplies, redundant hot-swap fan trays, redundant hot-swap disk drives, redundant networks. Solution: the Appro Xtreme-X Supercomputer, the flagship product line using the GreenBlade sub-rack component used for the DoE TLCC2 project, expanded to add support for new custom blade nodes.

9 Solution Architecture :: Appro Xtreme-X Supercomputer. A unified, scalable cluster architecture that can be provisioned and managed as a stand-alone supercomputer. Improved power and cooling efficiency dramatically lowers total cost of ownership. Offers high performance and high availability features with lower latency and higher bandwidth. Appro HPC Software Stack: complete HPC cluster software tools combined with the Appro Cluster Engine (ACE) management software, including the following capabilities: system management, network management, server management, cluster management, and storage management.

10 Meeting Key Requirements :: Optimal Peak Performance. CPU contribution: Sandy Bridge-EP 2.6 GHz Xeon E5 processors (332 GFLOPS per node). GPU contribution: 665 GFLOPS per NVIDIA M2090; four (4) M2090s per node, or 2.66 TFLOPS per node. Combined peak performance is about 3 TFLOPS per node, and two hundred sixty-eight (268) nodes provide 802 TFLOPS. Accelerator performance: a dedicated PCIe Gen3 x16 slot for each NVIDIA GPU; the M2090 runs at Gen2, so up to 8 GB/s per GPU is available. I/O performance: 2x QDR InfiniBand (Mellanox ConnectX-3), up to 4 GB/s per link on a PCIe Gen3 x8 bus, plus GigE for the operations networks.
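The peak figures above are simple products of per-device peaks; as a sanity check, here is a minimal sketch of the arithmetic in Python, using only the numbers quoted on this slide:

```python
# Peak-performance arithmetic for one HA-PACS node and the full system,
# using the figures quoted on this slide (all values in GFLOPS).
CPU_PEAK_PER_NODE = 332.8   # 2x Sandy Bridge-EP, 2.6 GHz, 8 cores/socket, AVX
GPU_PEAK = 665.0            # double precision, one NVIDIA M2090
GPUS_PER_NODE = 4
NODES = 268

node_peak = CPU_PEAK_PER_NODE + GPUS_PER_NODE * GPU_PEAK   # ~2993 GFLOPS, i.e. ~3 TFLOPS
system_peak = node_peak * NODES                            # ~802 TFLOPS

print(f"per-node peak : {node_peak / 1000:.2f} TFLOPS")
print(f"system peak   : {system_peak / 1000:.1f} TFLOPS")
```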

11 Appro GreenBlade Sub-Rack with Accelerator Expansion Blades. Up to 4x 2P GB812X blades, with expandability for HDD, SSD, GPU, and MIC. Six cooling fan units, hot-swappable and redundant. Up to six 1600 W power supplies, Platinum-rated (95%+ efficient), hot-swappable and redundant. Supports one or redundant iSCB platform manager modules with enhanced management capabilities: active and dynamic fan control, power monitoring, remote power control, and an integrated console server.

12 Appro GreenBlade Sub-Rack :: iSCB Modules and Server Board. Increased memory footprint (2 DIMMs per channel). Provides access to two (2) PCIe Gen3 x16 links per socket, for increased I/O capability. QDR or FDR InfiniBand on the motherboard. Internal RAID adapter on the Gen3 bus; up to two (2) 2.5-inch hard drives. NOTE: nodes can run diskless/stateless because of the Appro Cluster Engine, but local scratch was needed.

13 Meeting Key Requirements :: Server Node Design. Challenge: create a server node with the latest generation of processors (need for FLOPS and I/O capacity), high bandwidth to the accelerators, and high memory capacity. Solution: high bandwidth with Intel Sandy Bridge-EP for the CPU and NVIDIA Tesla for the GPU. Worked with Intel EPSD early on to design a motherboard: the Washington Pass (S2600WP) motherboard with dual Sandy Bridge-EP (E5-2600) sockets; it exposes four (4) PCIe Gen3 x16 links for accelerator connectivity and one (1) PCIe Gen3 x8 expansion slot for I/O, supports two (2) DIMMs per channel (16 DIMMs total), and uses a 2U form factor for fit and airflow/cooling.

14 Meeting Key Requirements :: Intel EPSD S2600WP Motherboard (block diagram). Two Sandy Bridge-EP sockets linked by QPI, each with 4 DDR3 channels at 1,600 MHz (51.2 GB/s per socket). The Patsburg PCH (on DMI/ESI) provides dual GbE and hosts the BMC and BIOS. PCIe connectivity: four Gen3 x16 links to the 4x NVIDIA M2090 GPUs, one Gen3 x8 link to the 2x QDR IB adapter, one Gen3 x8 link for expansion, plus PCIe x4 and dual GbE.
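The 51.2 GB/s per socket shown in the diagram follows directly from the channel configuration; a small sketch of that calculation (assuming DDR3-1600 with a 64-bit channel, i.e. 8 bytes per transfer):

```python
# Peak memory bandwidth per socket for the S2600WP configuration shown above:
# 4 DDR3 channels at 1600 MT/s, 8 bytes transferred per channel per transfer.
CHANNELS = 4
TRANSFER_RATE_MT_S = 1600      # mega-transfers per second (DDR3-1600)
BYTES_PER_TRANSFER = 8         # 64-bit channel width

per_socket = CHANNELS * TRANSFER_RATE_MT_S * BYTES_PER_TRANSFER / 1000  # GB/s
per_node = 2 * per_socket                                               # two sockets

print(f"per socket: {per_socket:.1f} GB/s")   # 51.2 GB/s
print(f"per node  : {per_node:.1f} GB/s")     # 102.4 GB/s
```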

15 GreenBlade Node Design (diagram). Node connections: HDD0 and HDD1; GigE cluster management / operations network (prime); QDR InfiniBand (port 0); QDR InfiniBand (port 1); GigE cluster management / operations network (secondary).

16 Meeting Key Requirements :: Network Availability. Challenge: to provide cost-effective redundant networks that eliminate or reduce interruptions (improve MTTI). Solution: build the system with redundant operations Ethernet networks (redundant on-board GigE, each with access to IPMI; redundant iSCB modules for baseboard management, node control, and monitoring) and with redundant InfiniBand networks (dual QDR for price/performance; Mellanox was selected due to its Gen3 x8 support in a dual-port adapter).

17 Meeting Key Requirements :: Operations Networking (diagram). Management node(s) and login node(s) connect to the external network over GbE/10GigE through a 10GigE switch; sub-management nodes (GreenBlade GB812X) and 48-port leaf switches serve the compute nodes, three racks per leaf switch group, from Rack (1) through Rack (N).

18 Meeting Key Requirements :: Ease of Use. Challenge: the system needs to install quickly to get into production; most sites have limited staff; the system must be kept running and doing science. Solution: the Appro HPC Software Stack, tested and validated, a full stack from the hardware layer to the application layer that allows quick bring-up of a cluster.

19 Appro HPC Software Stack. User applications run on top of: performance monitoring (HPCC, Perfctr, IOR, PAPI/IPM, netperf); compilers (Intel Cluster Studio, PGI (PGI CDK), GNU, PathScale); message passing (MVAPICH2, OpenMPI, Intel MPI from Intel Cluster Studio); job scheduling (Grid Engine, SLURM, PBS Pro); storage (NFS 3.x, local file systems (ext3, ext4, XFS), PanFS, Lustre); cluster monitoring (ACE); remote power management (ACE via iSCB and OpenIPMI, PowerMan); console management (ACE, ConMan); provisioning (Appro Cluster Engine (ACE) virtual clusters); and the OS (Linux: Red Hat, CentOS, SuSE). All of this sits on the Appro Xtreme-X Supercomputer building blocks, with Appro turn-key integration and delivery services (HW and SW integration, pre-acceptance testing, dismantle, packing and shipping) and Appro HPC professional services (on-site installation services and/or customized services).

20 Appro Key Advantages :: Summary. Partnering with key technology partners to offer cutting-edge integrated solutions. Performance: storage (IOR) and networking (bandwidth, latencies, and message rates). Features: high availability (high standard MTBF, redundant power supplies), ease of management, flexibility, and price/performance. Training programs: pre-sales (sell everything it does and ONLY that), installation and tuning, and post-install support.

21 Appro Xtreme-X Supercomputer :: Turn-Key Solution Summary. Appro HPC Software Stack and Appro Cluster Engine (ACE) management software suite. The Appro Xtreme-X Supercomputer addresses 4 HPC workload configurations: capacity computing, hybrid computing, data-intensive computing, and capability computing. Turn-key integration and delivery services: node, rack, switch, interconnect, cable, network, storage, software, burn-in; pre-acceptance testing, performance validation, dismantle, packing and shipping. Appro HPC professional services: on-site installation services and/or customized services.

22 Questions? Ask Now or see us at Table #54 Appro Supercomputer Solutions Steve Lyness, VP HPC Solutions Engineering Learn More at

23 HA-PACS: Next Step for the Scientific Frontier by Accelerated Computing. Taisuke Boku, Center for Computational Sciences, University of Tsukuba. GTC2012, San Jose.

24 Project Plan of HA-PACS. HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences): accelerating critical problems in various scientific fields at the Center for Computational Sciences, University of Tsukuba. The target application fields will be partially limited; current targets are QCD, astrophysics, and QM/MM (quantum mechanics / molecular mechanics, for life science). Two parts: the HA-PACS base cluster, for development of GPU-accelerated code for the target fields and for performing production runs of them; and HA-PACS/TCA, for elementary research on direct communication between accelerators (described later).

25 GPU Computing: the current trend in HPC. GPU clusters in the TOP500 (Nov. 2011): 2nd 天河 Tianhe-1A (Rpeak = 4.70 PFLOPS), 4th 星雲 Nebulae (Rpeak = 2.98 PFLOPS), 5th TSUBAME2.0 (Rpeak = 2.29 PFLOPS); for reference, 1st is the K Computer (Rpeak = 11.28 PFLOPS). Features: high peak performance / cost ratio and high peak performance / power ratio, but large-scale applications with GPU acceleration do not yet run in production on GPU clusters. Our first target is to develop large-scale applications accelerated by GPUs in real computational sciences.

26 Problems of GPU Clusters. Problems of GPGPU for HPC: (1) Data I/O performance limitation, e.g. GPGPU on PCIe gen2 x16: 8 GB/s of I/O versus 665 GFLOPS of peak performance (NVIDIA M2090). (2) Memory size limitation, e.g. M2090: 6 GByte vs. CPU memory of 128 GByte per HA-PACS node. (3) Communication between accelerators: there is no direct (external) path, so communication latency via the CPU (GPU memory, to CPU memory, over MPI to the remote CPU memory, and then to the remote GPU memory) becomes large. Our other target is developing a direct communication system between external GPUs as a feasibility study for future accelerated computing, together with research on direct communication.

27 Project Formation. HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences), Apr. 2011 to Mar. 2014, a 3-year project. Project Leader: Prof. M. Sato (Director, CCS, Univ. of Tsukuba). Develop a next-generation GPU system (15 members): Project Office for Exascale Computing System Development (Leader: Prof. T. Boku), a GPU cluster based on the Tightly Coupled Accelerators architecture. Develop large-scale GPU applications (15 members): Project Office for Exascale Computational Sciences (Leader: Prof. M. Umemura).

28 HA-PACS base cluster (Feb. 2012) (photo).

29 HA-PACS base cluster (photos: front view and side view).

30 HA-PACS base cluster (photos): front view of 3 blade chassis; rear view of one blade chassis with 4 blades; rear view of InfiniBand cables (yellow = fibre, black = ...).

31 HA-PACS: base cluster (computation node). CPU: AVX at 2.6 GHz x 8 flop/clock, 20.8 GFLOPS x 16 cores = 332.8 GFLOPS; memory (16 GB, 12.8 GB/s) x 8 = 128 GB. GPU: 665 GFLOPS x 4 = 2,660 GFLOPS; GPU memory (6 GB, 177 GB/s) x 4 = 24 GB, 708 GB/s; 8 GB/s of PCIe per GPU. Node total: approximately 3 TFLOPS.

32 HA-PACS: base cluster unit (CPU). Intel Xeon E5 (Sandy Bridge-EP) x 2; 8 cores/socket (16 cores/node) at 2.6 GHz; AVX (256-bit SIMD) on each core. Peak perf./socket = 2.6 GHz x (4 x 2) flops/clock x 8 cores = 166.4 GFLOPS; peak perf./node = 332.8 GFLOPS. Each socket supports up to 40 lanes of PCIe gen3, giving great capacity to connect multiple GPUs without an I/O performance bottleneck: the current NVIDIA M2090 supports just PCIe gen2, but the next generation (Kepler) will support PCIe gen3. M2090 x 4 can be connected to the 2 Sandy Bridge-EP sockets, with PCIe gen3 x8 x 2 still remaining for InfiniBand QDR x 2.

33 HA-PACS: base cluster unit (GPU). NVIDIA M2090 x 4. Number of processor cores: 512; processor core clock: 1.3 GHz; DP 665 GFLOPS, SP 1331 GFLOPS. PCI Express gen2 x16 system interface. Board power dissipation: <= 225 W. Memory clock: 1.85 GHz; size: 6 GB with ECC; bandwidth: 177 GB/s. Shared memory/L1 cache: 64 KB; L2 cache: 768 KB.
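The SP/DP peak figures follow from the core count and clock; here is a short sketch of the derivation, assuming the Fermi rate of one fused multiply-add per core per cycle in single precision and half that throughput in double precision:

```python
# Peak FLOPS derivation for the Tesla M2090 from the numbers on this slide.
CORES = 512
CLOCK_GHZ = 1.3
SP_FLOPS_PER_CORE_PER_CYCLE = 2   # one fused multiply-add (2 flops) per cycle on Fermi

sp_peak = CORES * CLOCK_GHZ * SP_FLOPS_PER_CORE_PER_CYCLE   # ~1331 GFLOPS single precision
dp_peak = sp_peak / 2                                       # Fermi DP rate is half of SP: ~665 GFLOPS

print(f"SP peak: {sp_peak:.0f} GFLOPS, DP peak: {dp_peak:.0f} GFLOPS")
```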

34 HA-PACS: base cluster unit (blade node). Per node: 4x NVIDIA Tesla M2090 (two at the front, two at the rear), 1x PCIe slot for the HCA, 2x 2.6 GHz 8-core Sandy Bridge-EP, 2x 2.5-inch HDD, with front-to-rear air flow (photos: front view, rear view, power supply units and fans). Enclosure: 8U, 4 nodes, 3 PSUs (hot-swappable), 6 fans (hot-swappable).

35 Basic performance data. MPI ping-pong: 6.4 GB/s (N_1/2 = 8 KB) with dual-rail InfiniBand QDR (Mellanox ConnectX-3); the HCAs are actually FDR parts, with QDR at the switch. PCIe benchmark (device-to-host memory copy), aggregate for 4 GPUs simultaneously: 24 GB/s (N_1/2 = 20 KB), against a theoretical peak of 8 GB/s x 4 = 32 GB/s for PCIe gen2 x16 x 4. STREAM (memory): 74.6 GB/s, against a theoretical peak of 102.4 GB/s (2 sockets x 51.2 GB/s).
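N_1/2 in these figures is the message size at which half of the asymptotic bandwidth is reached. A minimal latency/bandwidth model illustrating the idea is sketched below; the startup latency is a hypothetical value chosen so that N_1/2 comes out near the 8 KB quoted for the dual-rail IB result, not a measured number.

```python
# Simple latency/bandwidth model illustrating what the N_1/2 figures above mean:
#   T(n) = t0 + n / BW_asym, effective bandwidth = n / T(n),
# and half of the asymptotic bandwidth is reached at N_1/2 = t0 * BW_asym.
BW_ASYM = 6.4e9      # bytes/s, asymptotic ping-pong bandwidth from this slide
T0 = 1.28e-6         # seconds, assumed startup latency (hypothetical, gives N_1/2 ~ 8 KB)

def effective_bw(n_bytes: float) -> float:
    """Effective bandwidth in bytes/s for an n-byte message."""
    return n_bytes / (T0 + n_bytes / BW_ASYM)

n_half = T0 * BW_ASYM    # message size that reaches half of BW_ASYM
print(f"N_1/2 = {n_half / 1024:.1f} KB")
for n in (1024, 8192, 1 << 20):
    print(f"{n:>8} B -> {effective_bw(n) / 1e9:.2f} GB/s")
```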

36 PCIe host-device communication performance (plot). Host-to-device copies show a slower start-up than device-to-host copies.

37 HA-PACS Application (1): Elementary Particle Physics. Multi-scale physics: investigate hierarchical properties (quark, proton/neutron, nucleus) via direct construction of nuclei in lattice QCD; the GPU is used to solve large sparse linear systems of equations. Finite temperature and density: phase analysis of QCD at finite temperature and density (expected QCD phase diagram); the GPU is used to perform matrix-matrix products of dense matrices.

38 HA-PACS Applications (2): Astrophysics. (A) Collisional N-body simulation. Globular clusters: formation of the most primordial objects, formed more than 10 giga-years ago; these fossil objects are a clue to investigating the primordial universe. Massive black holes in galaxies: understanding the formation of massive black holes in galaxies through numerical simulations of the complicated gravitational interactions between stars and multiple black holes in galaxy centers. Direct (brute-force) calculations of accelerations and jerks are required to achieve the necessary numerical accuracy; computing the accelerations of particles and their time derivatives (jerks) is time consuming, so accelerations and jerks are computed on the GPU. (B) Radiation transfer. First stars and re-ionization of the universe: understanding the formation of the first stars and the subsequent re-ionization of the universe. Accretion disks around black holes: study of the high-temperature regions around black holes. Radiation transfer is so far poorly investigated due to its huge computational cost, though it is of critical importance in the formation of stars and galaxies; it involves calculating the physical effects of photons emitted by stars and galaxies onto the surrounding matter, and computations of the radiation intensity and the resulting chemical reactions based on ray-tracing methods can be highly accelerated with GPUs owing to their high concurrency.

39 HA-PACS Application (3): Bioscience. GPU acceleration of: DNA-protein complexes (macroscale MD), with direct Coulomb interactions offloaded to the GPU (GROMACS, NAMD, Amber); and reaction mechanisms (QM/MM-MD), with QM regions of > 100 atoms and two-electron integrals offloaded to the GPU.

40 HA-PACS Application (4): other advanced research in the HPC Division at CCS. XcalableMP-dev (XMP-dev): an easy and simple programming language supporting distributed memory and GPU-accelerated computing for large-scale computational sciences. The G8 NuFuSE (Nuclear Fusion Simulation for Exascale) project: a platform for porting plasma simulation code with GPU technology. Climate simulation, especially LES (Large Eddy Simulation) for cloud-level resolution at city-model size. Any other collaboration...

41 HA-PACS: TCA (Tightly Coupled Accelerator). TCA provides direct connection between accelerators (GPUs), using PCIe as the communication device between accelerators. Most acceleration devices and other I/O devices are connected by PCIe as PCIe end-points (slave devices); an intelligent PCIe device logically enables an end-point device to communicate directly with other end-point devices. PEARL: PCI Express Adaptive and Reliable Link. We already developed such a PCIe device (PEACH, PCI Express Adaptive Communication Hub) in the JST-CREST project on a low-power and dependable network for embedded systems; it enables direct connection between nodes by a PCIe Gen2 x4 link. We are improving PEACH for HPC to realize TCA.

42 PEACH: PCI-Express Adaptive Communication Hub. An intelligent PCI-Express communication switch that uses the PCIe link directly for node-to-node interconnection; the edge of a PEACH PCIe link can be connected to any peripheral device, including a GPU. Prototype PEACH chip: 4-port PCI-E gen2 with x4 lanes per port; a PCI-E link edge control feature automatically switches (flips) between root complex and end-point roles.

43 HA-PACS/TCA (Tightly Coupled Accelerator): true GPU-direct. Current GPU clusters require 3-hop communication (3 to 5 memory copies): GPU memory, to CPU memory, over the IB HCA and switch to the remote CPU memory, and then to the remote GPU memory. For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput. PEACH2 is an enhanced version of PEACH: x4 lanes widened to x8 lanes, hardwired on the main data path and PCIe interface fabric. (Diagram: two nodes, each with CPUs, memory, GPUs and an IB HCA on PCIe, plus a PEACH2 chip providing the direct GPU-to-GPU path between nodes.)
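To see why the 3-hop path hurts, a rough back-of-the-envelope model comparing a host-staged transfer with a direct PEACH2-style link is sketched below. The bandwidths are the ones quoted in these slides (8 GB/s PCIe gen2 x16, 4 GB/s per IB QDR link, 5 GB/s PEACH2 gen2 x8); the per-hop latency and the serial, unpipelined hop model are assumptions used only for illustration.

```python
# Back-of-the-envelope comparison of GPU-to-GPU transfer time:
#   staged: GPU -> host memory -> IB -> remote host memory -> GPU (3 hops, extra copies)
#   direct: GPU -> PEACH2 -> remote GPU (single PCIe path)
# Bandwidths come from these slides; the per-hop latency is an assumed value.
PCIE_GEN2_X16 = 8e9      # bytes/s, GPU <-> host memory
IB_QDR_LINK   = 4e9      # bytes/s, one QDR rail
PEACH2_G2_X8  = 5e9      # bytes/s, PCIe gen2 x8 direct link
HOP_LATENCY   = 2e-6     # seconds per hop (assumption for illustration)

def staged(n_bytes: int) -> float:
    """Transfer time when staging through both hosts, hops taken serially."""
    hops = (PCIE_GEN2_X16, IB_QDR_LINK, PCIE_GEN2_X16)
    return sum(HOP_LATENCY + n_bytes / bw for bw in hops)

def direct(n_bytes: int) -> float:
    """Transfer time over a single direct PEACH2-style link."""
    return HOP_LATENCY + n_bytes / PEACH2_G2_X8

for n in (4 * 1024, 64 * 1024, 4 * 1024 * 1024):
    print(f"{n // 1024:>5} KB: staged {staged(n) * 1e6:7.1f} us, direct {direct(n) * 1e6:7.1f} us")
```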

44 Implementation of PEACH2: ASIC vs. FPGA. FPGA-based implementation: today's advanced FPGAs allow a PCIe hub with multiple ports (currently gen2 x8 lanes x 4 ports; gen3 will be available soon (?)); easy modification and enhancement; fits a standard (full-size) PCIe board; an internal multi-core general-purpose CPU with programmability is available, making it easy to split the hardwired/firmware partitioning at a suitable level of the control layer. Controlling PEACH2 for the GPU communication protocol: collaboration with NVIDIA for information.

45 HA-PACS/TCA. A Node Cluster (NC) is 16 nodes with 64 GPUs (G x 4 per node) and 32 CPUs (C x 2 per node), connected by the PEARL ring network (PEACH2, PCI-E gen2 x8 = 5 GB/s per link) for high-speed GPU-GPU communication within the NC, and by InfiniBand QDR (x2, 4 GB/s per link) for NC-to-NC communication: GPU communication goes over PCIe inside the NC and over IB links between NCs. 4 NCs with 16 nodes each, or 8 NCs with 8 nodes each, add 360 TFLOPS to the base cluster. Node CPU: Xeon E5.

46 PEARL/PEACH2 variation (1), Option 1. The performance of IB and PEARL can be compared evenly; the PCIe switch adds extra latency. (Diagram: two sockets linked by QPI; GPUs on Gen3 x16 links; the IB HCA on Gen3 x8; the PEACH2 chip (Gen2 x8) behind a PCIe switch on Gen3 x8.)

47 PEARL/PEACH2 variation (2), Option 2. Requires only 72 lanes in total, with an asymmetric connection among the 3 blocks of GPUs. (Diagram: two sockets linked by QPI; GPUs on Gen3 x16 links; the IB HCA on Gen3 x8; the PEACH2 chip (Gen2 x8) behind a PCIe Gen3 x16 switch shared with one GPU.)

48 PEACH2 prototype board for TCA (photo). FPGA daughter board (Altera Stratix IV GX530); PCIe external link connectors (one more on the daughter board); PCIe edge connector (to the host server); power regulator for the FPGA.

49 Summary. HA-PACS consists of two elements: the HA-PACS base cluster, for application development, and HA-PACS/TCA, for elementary study of advanced technology for direct communication among accelerating devices (GPUs). The HA-PACS base cluster started operation in Feb. 2012 with 802 TFLOPS peak performance (the Linpack result will come in June 2012, and we also expect a good score on the Green500). The FPGA implementation of PEACH2 is finished for the prototype version.
