D.A.V.I.D.E. (Development of an Added-Value Infrastructure Designed in Europe) IWOPH 17 E4. WHEN PERFORMANCE MATTERS
THE COMPANY Since 2002, E4 Computer Engineering has been innovating and actively encouraging the adoption of new computing and storage technologies. Because new ideas are so important, we invest heavily in research, and hence in our future. Thanks to our comprehensive range of hardware, software and services, we can offer our customers complete solutions for their most demanding workloads: HPC, Big Data, AI, Deep Learning, Data Analytics, Cognitive Computing, and any challenging storage and computing requirement. E4. When Performance Matters.
A COUPLE OF FACTS ABOUT E4
CERN (Switzerland, HEP): 8,000+ servers, 170+ PB
CNAF (Italian CERN Tier-1): 26+ PB in a single storage system
e-geos (Italy, aerospace and satellite data processing): 100+ compute nodes, high-performance storage
Pedraforca (Spain): ARM cluster, Mont-Blanc Project (FP7)
PRACE-3IP PCP: Pre-Commercial Procurement concerning R&D services on Whole System Design for Energy Efficient HPC
PRACE (Partnership for Advanced Computing in Europe) The mission of PRACE is to enable high-impact scientific discovery and engineering research and development across all disciplines, enhancing European competitiveness for the benefit of society. PRACE seeks to realize this mission by offering world-class computing and data-management resources and services through a peer-review process. PRACE also seeks to strengthen European industrial users of HPC through various initiatives. PRACE has a strong interest in improving the energy efficiency of computing systems and reducing their environmental impact.
PRACE (Partnership for Advanced Computing in Europe) In response to the requirements set forth in Call No. 10 (FP7-INFRASTRUCTURES-2012-1) of the Framework Programme, in agreement with the European Commission's policy concerning HPC, and as a means for PRACE to take a leading position in the provision of high-end HPC systems and innovative technology, a group of PRACE-3IP project partners agreed that a joint HPC Pre-Commercial Procurement pilot would be conducted, for the first time, by a multi-country, multi-partner consortium within the PRACE 3rd Implementation Phase project (PRACE-3IP), with PRACE AISBL as observer and advisory entity.
GENERAL CONTEXT OF PCPS Pre-Commercial Procurement (PCP) is a relatively new procurement model that is gaining traction in many European Union Member States. PCP stands out as an effective tool to tackle discrepancies in how EU Member States and other countries benefit from their basic-research expenditure. The European Commission is fostering the PCP model and has identified High Performance Computing (HPC) as an area in which basic research and development (R&D) coupled with PCP can drive European innovation.
GENERAL CONTEXT OF PCPS PCP has the following key elements:
It is for R&D services only (R&D activities constitute more than 50% of the overall budget), and the public purchaser does not reserve the R&D results exclusively for its own use.
Risk-benefit sharing between the public purchaser and the R&D service providers, with sharing of Intellectual Property Rights.
A competitive procurement designed to exclude state aid: the PCP R&D work has to be performed at market prices.
PCP is a phased model that aims at conducting R&D up to the development of a limited volume of first products/services in the form of a test series. The target can typically be a solution to a major technical challenge.
GENERAL CONTEXT OF PCPS The model suggested by the EC has three phases:
1. Solution exploration leading to solution design
2. Prototyping
3. Original development of a limited volume of first products/services
The number of suppliers decreases from one phase to the next, so as to select the suppliers that best address the technical challenge on which the PCP is based.
GOALS OF THE PRACE-3IP PCP The main constraint on upgrading today's HPC infrastructures is the availability and cost (both economic and ecological) of the energy required to power and cool new, higher-performing supercomputers. Multiple international expert reports (IESP, EESI) have identified this as one of the major challenges in the design and operation of future multi-petascale and exascale HPC systems. Consequently, through this PCP, solutions are sought for a Whole System Design for Energy Efficient HPC.
GOALS OF THE PRACE-3IP PCP Energy efficiency cannot be ascribed to a single aspect of a system; it must be addressed at all levels, from the materials science of transistor design, through the thermo-hydraulic engineering of the cooling system, up to the computational level. The goal of this PCP is to procure R&D services that result in highly energy-efficient HPC system components, integrated into an HPC architecture capable of providing a floating-point peak performance of up to 100 PFlop/s.
GOALS OF THE PRACE-3IP PCP Energy efficiency must be demonstrated by measuring the total energy-to-solution for a representative set of scientific HPC applications on a self-contained pilot system. This pilot system has to be deployed and operated as a pre-production system at the site of a PRACE member.
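As a sketch of what "energy-to-solution" means in practice, the snippet below integrates a node-level power trace over an application's run time; the function and trace names are illustrative, not part of the PRACE-3IP tooling.

```python
def energy_to_solution(timestamps, watts):
    """Integrate a power trace (W) over time (s) with the trapezoidal
    rule, returning the energy consumed in joules."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(zip(timestamps, watts),
                                  zip(timestamps[1:], watts[1:])):
        joules += (t1 - t0) * (p0 + p1) / 2.0
    return joules

# Example: a node drawing a constant 2 kW for 60 s consumes 120 kJ.
trace_t = [0.0, 30.0, 60.0]
trace_p = [2000.0, 2000.0, 2000.0]
print(energy_to_solution(trace_t, trace_p))  # 120000.0
```

Comparing two system configurations then reduces to comparing this integral for the same application and input.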
DESCRIPTION OF THE PCP PROCEDURE AND CONTRACTS This PCP procedure is articulated in two stages:
The Tendering Stage: bid submission and admission based on participation requirements, a feasibility study and a financial offer.
The Execution Stage, divided into three phases:
Phase I: solution design (duration: 6 months)
Phase II: prototype development (duration: 10 months)
Phase III: pre-commercial small-scale product/service development (duration: 16 months)
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) COMPUTE NODE: Derived from the IBM POWER8 System S822LC (codename Minsky): 2 IBM POWER8 with NVLink and 4 NVIDIA Tesla P100 SXM2, with the intra-node communication layout optimized for best performance. While the original Minsky server design is air-cooled, its implementation for D.A.V.I.D.E. uses direct liquid cooling for CPUs and GPUs. Each compute node has a peak performance of 22 TFLOPS and a power consumption of less than 2 kW.
Total number of nodes: 45 (compute) + 2 (login)
Form factor: 2U
SoC: 2x POWER8 with NVLink
GPU: 4x NVIDIA Tesla P100 SXM2
Network: 2x IB EDR, 1x 1 GbE
Cooling: SoC and GPU with direct hot water
Max performance (node): 22 TFlops
Storage: 1x SSD SATA, 1x NVMe
Power: DC power distribution
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) ACCELERATOR: NVIDIA Tesla P100 (SXM2). NVIDIA Tesla P100 was built to deliver performance for the most demanding compute applications. NVLINK BUS: NVIDIA's new High-Speed Signaling interconnect (NVHS). NVHS transmits data over a differential pair running at up to 20 Gb/s. Eight of these differential connections form a Sub-Link that sends data in one direction, and two Sub-Links (one for each direction) form a Link that connects two processors (GPU-to-GPU or GPU-to-CPU). A single Link supports up to 40 GB/s of bidirectional bandwidth between the endpoints. The NVLink implementation in the NVIDIA Tesla P100 supports up to four Links, enabling ganged configurations with an aggregate maximum bidirectional bandwidth of 160 GB/s.
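The bandwidth figures above follow directly from the signaling rate; a quick sketch of the arithmetic, using only the numbers quoted on this slide:

```python
# NVLink bandwidth arithmetic (numbers from the slide above).
signal_rate_gbps = 20           # Gb/s per NVHS differential pair
pairs_per_sublink = 8           # one Sub-Link carries one direction

sublink_gbs = signal_rate_gbps * pairs_per_sublink / 8  # 160 Gb/s = 20 GB/s one way
link_bidir_gbs = 2 * sublink_gbs                        # two Sub-Links per Link
p100_aggregate_gbs = 4 * link_bidir_gbs                 # P100 supports 4 Links

print(sublink_gbs, link_bidir_gbs, p100_aggregate_gbs)  # 20.0 40.0 160.0
```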
PCP Phase III
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe)
OCP form-factor compute node
Liquid-cooling tunnels
2x IBM POWER8 with NVLink
4x NVIDIA Tesla P100 SXM2
2x IB EDR
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) OPEN RACK, LIQUID COOLED Direct hot-water cooling (35-40 °C) for the CPUs and GPUs, capable of extracting about 80% of the heat produced by the compute nodes. Extremely flexible, requiring only minor modifications of the infrastructure.
Total number of racks: 3
Form factor: 2U
Cooling capacity: 40 kW
Heat exchanger: liquid-liquid, redundant pumps
Each rack has an independent liquid-liquid or liquid-air heat-exchanger unit with redundant pumps. The compute nodes are connected to the heat exchanger through pipes and a side bar for water distribution.
PCP PHASE III/LAYOUT PLEASE NOTE: STORAGE NODES (IN RED) ARE NOT PART OF THE PROJECT
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power- and energy-monitoring & management infrastructure (in collaboration with the University of Bologna and ETHZ):
Off-the-shelf components
High-speed, accurate per-node power sensing, synchronized among the nodes
Data accessible out-of-band and without processor intervention
Out-of-band, synchronized fine-grain performance sensing
Dedicated data-collection subsystem running on the management nodes
Predictive power-aware job scheduler and power manager
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power Monitoring Interposer Design
Directly integrated in the Power Distribution Board (PDB)
Out-of-band power monitoring with a sampling rate of up to 50 kS/s per channel
Estimated precision @ 1 kS/s: ±0.5 W (±σ)
Data sent to the broker with coarse and fine granularity
BeagleBone Black connectors on the PDB
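To illustrate the "coarse and fine granularity" reporting, here is a minimal sketch of downsampling a fine-grained power stream into coarse window averages before shipping them to the broker; the window size and function name are assumptions, not the actual interposer firmware.

```python
def coarsen(samples, window):
    """Average consecutive `window`-sample chunks of a fine-grained
    power trace (W), producing the coarse-granularity stream."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]

# Fine stream: a node idling near 100 W, then a 250 W burst.
fine = [100.0, 102.0, 98.0, 100.0, 250.0, 250.0, 250.0, 250.0]
print(coarsen(fine, 4))  # [100.0, 250.0]
```

The fine stream preserves short power transients for profiling, while the coarse averages keep the out-of-band data volume manageable.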
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power-Aware Job Dispatcher
Machine-learning models to predict the power consumption of HPC applications
Custom SLURM extensions to schedule jobs based on their predicted power consumption
Run-time monitoring and power management
Frequency scaling / RAPL-like mechanism
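As a toy illustration of the idea (not the actual SLURM extensions), jobs whose power draw has been predicted by a model can be admitted greedily under a system-wide power cap:

```python
def dispatch(jobs, power_cap_w, running_w=0.0):
    """Return the queued jobs that fit under the power cap.
    `jobs` is a list of (job_id, predicted_watts), scanned in
    queue order; `running_w` is the load already on the system."""
    started = []
    budget = power_cap_w - running_w
    for job_id, watts in jobs:
        if watts <= budget:
            started.append(job_id)
            budget -= watts
    return started

# Hypothetical queue with model-predicted per-job power draws.
queue = [("job-a", 1800.0), ("job-b", 1500.0), ("job-c", 400.0)]
print(dispatch(queue, power_cap_w=2500.0))  # ['job-a', 'job-c']
```

In the real dispatcher the run-time monitoring can additionally throttle running jobs (frequency scaling, RAPL-like capping) when predictions turn out to be too optimistic.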
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power monitoring & profiling, power management, power capping & prediction
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Programming environment: CentOS, PGI, GNU, scientific libraries, off-the-shelf applications. https://public.dhe.ibm.com/common/ssi/ecm/po/en/pol03251usen/pol03251usen.pdf
CURRENT STATUS (JUNE 2017)
45 nodes (air-cooled) up and running at E4's integration facility
Nodes at nominal configuration: 2x IBM POWER8, 4x NVIDIA P100 SXM2 NVLink, 2x IB EDR, CentOS, GNU, PGI
Running the baseline performance tests is a prerequisite for measuring the improvements once the final PCP Phase III system is deployed
Access granted to selected users
TIMELINE July/August 2017: Phased conversion of the nodes
Nodes are moved to the OCP chassis
Water-cooling components are added
The power- and energy-monitoring & management infrastructure is installed
Each node is tested
TIMELINE July/August 2017: As nodes are converted and tested, they are shipped in batches to CINECA
In-house testing under operating conditions
Envisioned rate: 3 to 4 nodes per week
The first batch of nodes will take longer because of the learning curve
The rate is conservative because it accounts for potential problems/malfunctions requiring additional rework
TIMELINE August/September 2017: Installation and configuration of the system at CINECA
Preliminary access to selected users, according to PRACE practices and policies
Contact me (fabrizio.magugliani@e4company.com) or Carlo Cavazzoni (c.cavazzoni@cineca.it)
TIMELINE October 2017: Run the contractually required codes: QuantumEspresso, BQCD, SPECFEM3D, HPL, NEMO
Check and compare the results against the air-cooled results to demonstrate:
Same (or better) performance on the liquid-cooled configuration than on the air-cooled one
Lower power consumption
Better throughput
Remember: the project's goal is Whole System Design for Energy Efficient HPC
TIMELINE November 2017: Staged access to selected users (according to PRACE practices and policies)
Handing off the system to CINECA (contractual deadline)
Continuous monitoring of the system by E4's staff
PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Key take-aways:
Research and production HPC system
European IP
Liquid cooling
Continuous out-of-band measuring, monitoring and capping of compute-node energy usage and performance, with no impact on application performance
Standard programming model
Lower power consumption without impacting performance
CONTACTS Email contact fabrizio.magugliani@e4company.com E4 Computer Engineering SpA Via Martiri della Libertà, 66, 42019 Scandiano (RE) - Italy Tel. 0039 0522 991811