D.A.V.I.D.E. (Development of an Added-Value Infrastructure Designed in Europe) IWOPH'17 E4. WHEN PERFORMANCE MATTERS

THE COMPANY Since 2002, E4 Computer Engineering has been innovating and actively encouraging the adoption of new computing and storage technologies. Because new ideas are so important, we invest heavily in research and hence in our future. Thanks to our comprehensive range of hardware, software and services, we are able to offer our customers complete solutions for their most demanding workloads: HPC, Big Data, AI, Deep Learning, Data Analytics, Cognitive Computing, and any challenging storage and computing requirement. E4. When Performance Matters.

A COUPLE OF FACTS ABOUT E4
- CERN (Switzerland, HEP): 8,000+ servers, 170+ PB
- CNAF (Italian CERN Tier-1): 26+ PB in a single storage system
- e-GEOS (Italy, aerospace and satellite data processing): 100+ compute nodes, high-performance storage
- Pedraforca (Spain): ARM cluster, Mont-Blanc Project (FP7)
- PRACE-3IP PCP: Pre-Commercial Procurement concerning R&D services on Whole System Design for Energy Efficient HPC

PRACE (Partnership for Advanced Computing in Europe) The mission of PRACE (Partnership for Advanced Computing in Europe) is to enable high-impact scientific discovery and engineering research and development across all disciplines, enhancing European competitiveness for the benefit of society. PRACE seeks to realize this mission by offering world-class computing and data management resources and services through a peer-review process. PRACE also seeks to strengthen European industrial users of HPC through various initiatives. PRACE has a strong interest in improving the energy efficiency of computing systems and reducing their environmental impact.

PRACE (Partnership for Advanced Computing in Europe) In response to the requirements set forth in Call No. 10 (FP7-INFRASTRUCTURES-2012-1) of the Framework Programme, in agreement with the European Commission's policy concerning HPC, and as a means for PRACE to take a leading position in the provision of high-end HPC systems and innovative technology, a group of PRACE-3IP partners agreed that a joint HPC Pre-Commercial Procurement pilot would be conducted, for the first time, by a multi-country, multi-partner consortium within the PRACE 3rd Implementation Phase project (PRACE-3IP), with PRACE AISBL as observer and advisory entity.

GENERAL CONTEXT OF PCPS Pre-Commercial Procurement ("PCP") is a relatively new procurement model that is gaining traction in many European Union Member States. PCP stands out as an effective tool for tackling the discrepancies between how EU Member States and other countries benefit from their basic research expenditure. The European Commission is fostering the PCP model and has identified High Performance Computing ("HPC") as an area in which basic research and development ("R&D") coupled with PCP can drive European innovation.

GENERAL CONTEXT OF PCPS PCP has the following key elements:
- It covers R&D services only (R&D activities constitute more than 50% of the overall budget), and the public purchaser does not reserve the R&D results exclusively for its own use.
- Risk-benefit sharing between the public purchaser and the R&D service providers, including sharing of Intellectual Property Rights.
- A competitive procurement designed to exclude state aid: the PCP R&D work has to be performed at market prices.
PCP is a phased model that carries R&D up to the development of a limited volume of first products/services in the form of a test series. The target is typically a solution to a major technical challenge.

GENERAL CONTEXT OF PCPS The model suggested by the EC has three phases:
1. Solution exploration leading to a solution design
2. Prototyping
3. Original development of a limited volume of first products/services
The number of suppliers decreases from one phase to the next, so as to select the suppliers that best address the technical challenge on which the PCP is based.

GOALS OF THE PRACE-3IP PCP The main constraint on upgrading HPC infrastructures today is the availability and cost (both economic and ecological) of the energy required to power and cool new, higher-performing supercomputers. Multiple international expert reports (IESP, EESI) have identified this as one of the major challenges in the design and operation of future multi-petascale and exascale HPC systems. Consequently, by means of this PCP, solutions are sought for a Whole System Design for Energy Efficient HPC.

GOALS OF THE PRACE-3IP PCP Energy efficiency cannot be ascribed to a single aspect of a system; it must be addressed at all levels, from the materials science of transistor design, through the thermohydraulic engineering of the cooling system, to the computational methods themselves. The goal of this PCP is to procure R&D services that result in highly energy-efficient HPC system components, integrated into an HPC architecture capable of providing a floating-point peak performance of up to 100 PFlop/s.

GOALS OF THE PRACE-3IP PCP Energy efficiency must be demonstrated by measuring the total energy-to-solution for a representative set of scientific HPC applications on a self-contained pilot system. This pilot system has to be deployed and operated as a pre-production system at the site of a PRACE member.
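
Energy-to-solution is simply node power integrated over the application's runtime. A minimal sketch of the metric, assuming hypothetical timestamped power samples (the actual PCP measurement procedure is defined in the tender documents):

```python
def energy_to_solution(samples):
    """Integrate (timestamp_s, watts) power samples over a run
    using the trapezoidal rule; returns joules."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules

# Example: a node drawing a constant 2 kW for 60 s consumes ~120 kJ.
run = [(t, 2000.0) for t in range(0, 61)]
print(energy_to_solution(run))  # 120000.0 J
```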

DESCRIPTION OF THE PCP PROCEDURE AND CONTRACTS This PCP procedure is organised in two stages:
- The Tendering Stage: bid submission and admission based on participation requirements, a feasibility study and a financial offer.
- The Execution Stage, divided into three phases:
  - Phase I: solution design (duration: six months)
  - Phase II: prototype development (duration: 10 months)
  - Phase III: pre-commercial small-scale product/service development (duration: 16 months)

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) COMPUTE NODE: derived from the IBM POWER8 system S822LC (codename Minsky): 2 IBM POWER8 with NVLink and 4 NVIDIA Tesla P100 SXM2, with the intra-node communication layout optimized for best performance. While the original design of the Minsky server is air cooled, its implementation for D.A.V.I.D.E. uses direct liquid cooling for CPUs and GPUs. Each compute node has a peak performance of 22 TFLOPS and a power consumption of less than 2 kW.

Total number of nodes: 45 (compute) + 2 (login)
Form factor: 2U
SoC: 2x POWER8 with NVLink
GPU: 4x NVIDIA Tesla P100 SXM2
Network: 2x IB EDR, 1x 1 GbE
Cooling: SoC and GPU with direct hot water
Max performance (node): 22 TFLOPS
Storage: 1x SATA SSD, 1x NVMe
Power: DC power distribution
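
The 22 TFLOPS node peak is dominated by the four GPUs; a back-of-the-envelope check, assuming NVIDIA's published 5.3 TFLOPS FP64 peak per P100 SXM2 (the POWER8 contribution below is an illustrative estimate, not a figure from the slides):

```python
# Back-of-the-envelope FP64 peak for one D.A.V.I.D.E. compute node.
gpus, gpu_tflops = 4, 5.3       # published P100 SXM2 FP64 peak
cpu_tflops = 0.8                # rough estimate for 2x POWER8 (assumption)
node_peak = gpus * gpu_tflops + cpu_tflops
print(f"{node_peak:.1f} TFLOPS")  # ~22.0 TFLOPS, matching the quoted peak
```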

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) ACCELERATOR NVIDIA Tesla P100 (HSMX2) NVIDIA Tesla P100 was built to deliver performance for the most demanding compute applications. NVLINK BUS NVIDIA s new High-Speed Signaling interconnect (NVHS). NVHS transmits data over a differential pair running at up to 20 Gb/sec. Eight of these differential connections form a Sub-Link that sends data in one direction, and two sub-links one for each direction form a Link that connects two processors (GPU-to-GPU or GPU-to-CPU). A single Link supports up to 40 GB/sec of bidirectional bandwidth between the endpoints. The NVLink implementation in NVIDIA Tesla P100 supports up to four links, enabling ganged configurations with aggregate maximum bidirectional bandwidth of 160 GB/sec. 14
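
The bandwidth figures follow directly from the lane arithmetic described above; a quick check:

```python
# NVLink 1.0 bandwidth arithmetic, as described in the slide.
lane_gbps = 20                       # Gb/s per NVHS differential pair
sublink_GBps = 8 * lane_gbps / 8     # 8 lanes, divided by 8 bits/byte = 20 GB/s one way
link_GBps = 2 * sublink_GBps         # two sub-links = 40 GB/s bidirectional per link
p100_aggregate = 4 * link_GBps       # four links per P100 = 160 GB/s aggregate
print(sublink_GBps, link_GBps, p100_aggregate)  # 20.0 40.0 160.0
```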

PCP PHASE III (system overview figure)

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe)
- OCP form-factor compute node
- Liquid cooling tunnels
- 2x IBM POWER8 with NVLink
- 4x NVIDIA Tesla P100 SXM2
- 2x IB EDR

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) OPEN RACK, LIQUID COOLED
- Direct hot-water cooling (35-40 °C) for the CPUs and GPUs.
- Capable of extracting about 80% of the heat produced by the compute nodes.
- Extremely flexible, requiring only minor modifications to the infrastructure.

Total number of racks: 3
Form factor: 2U
Cooling capacity: 40 kW
Heat exchanger: liquid-liquid, redundant pumps

Each rack has an independent liquid-liquid or liquid-air heat exchanger unit with redundant pumps. The compute nodes are connected to the heat exchanger through pipes and a side bar for water distribution.
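
The 40 kW per-rack cooling capacity comfortably covers the compute load; a rough sanity check, assuming the 47 nodes are spread roughly evenly across the 3 racks (the slides do not give the exact per-rack node count):

```python
# Rough per-rack thermal budget check; node distribution is an assumption.
nodes_per_rack = 16          # ~ceil(47 / 3)
node_kw = 2.0                # < 2 kW per node, from the compute-node slide
it_load = nodes_per_rack * node_kw   # worst-case IT load per rack
to_liquid = 0.80 * it_load           # ~80% captured by the hot-water loop
print(it_load, to_liquid)    # 32.0 kW load, ~25.6 kW to liquid: within 40 kW
```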

PCP PHASE III / LAYOUT (floor-plan figure). Please note: the storage nodes (in red) are not part of the project.

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power- and energy-monitoring & management infrastructure (in collaboration with the University of Bologna and ETH Zurich):
- Off-the-shelf components
- High-speed, accurate per-node power sensing, synchronized among the nodes
- Data accessible out-of-band and without processor intervention
- Out-of-band, synchronized fine-grain performance sensing
- Dedicated data-collection subsystem running on the management nodes (sketched below)
- Predictive power-aware job scheduler and power manager
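
As a sketch of the out-of-band collection path, a collector on a management node might subscribe to per-node power topics on a message broker. The broker host, topic layout and payload format below are illustrative assumptions, not the project's actual interfaces:

```python
# Illustrative out-of-band power-data collector for a management node.
import json
import paho.mqtt.client as mqtt  # requires paho-mqtt >= 2.0

def on_message(client, userdata, msg):
    sample = json.loads(msg.payload)   # assumed payload: {"t": ..., "watts": ...}
    node = msg.topic.split("/")[1]     # assumed topic layout: power/<node>/samples
    print(node, sample["t"], sample["watts"])

client = mqtt.Client(mqtt.CallbackAPIVersion.VERSION2)
client.on_message = on_message
client.connect("mgmt-node", 1883)      # hypothetical broker on the management node
client.subscribe("power/+/samples")    # every node's power-sample topic
client.loop_forever()
```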

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power Monitoring Interposer design:
- Directly integrated in the Power Distribution Board (PDB)
- Out-of-band power monitoring with a sampling rate of up to 50 kS/s per channel
- Estimated precision at 1 kS/s: ±0.5 W (±σ)
- Data sent to the broker at both coarse and fine granularity (see the sketch below)
- BeagleBone Black connectors on the PDB
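
The "coarse and fine granularity" point suggests aggregating the raw stream before publishing. A minimal sketch, assuming fine samples at 1 kS/s are averaged into coarse per-second values (window size and rates are illustrative):

```python
def coarsen(fine_samples, window=1000):
    """Average fine-grained power samples (watts, at an assumed 1 kS/s)
    into coarse per-second values; both streams could then be published."""
    return [
        sum(fine_samples[i:i + window]) / window
        for i in range(0, len(fine_samples) - window + 1, window)
    ]

# Two seconds of fine samples around 1.9 kW -> two coarse values.
fine = [1900.0] * 2000
print(coarsen(fine))  # [1900.0, 1900.0]
```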

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power-aware job dispatcher:
- Machine-learning models to predict the power consumption of HPC applications
- Custom SLURM extensions to schedule jobs based on their predicted power consumption (sketched below)
- Run-time monitoring and power management
- Frequency scaling / RAPL-like mechanisms
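
The dispatcher's core decision can be summarised as: predict each queued job's power draw and start it only if the prediction fits the remaining system power budget. A minimal illustration; the predictor, its per-node estimate and the budget figure are hypothetical stand-ins, not the project's actual SLURM extensions:

```python
# Minimal sketch of power-aware dispatch. All numbers are illustrative.
SYSTEM_BUDGET_KW = 90.0

def predict_power_kw(job):
    # Stand-in for a trained ML model; here, a flat per-node estimate.
    return job["nodes"] * 1.8

def dispatch(queue, running_kw):
    started = []
    for job in queue:
        p = predict_power_kw(job)
        if running_kw + p <= SYSTEM_BUDGET_KW:
            running_kw += p              # prediction fits: start the job
            started.append(job["id"])
    return started, running_kw

jobs = [{"id": "qe-1", "nodes": 16}, {"id": "bqcd-2", "nodes": 32}]
print(dispatch(jobs, running_kw=40.0))  # 'qe-1' fits; 'bqcd-2' would exceed the budget
```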

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Power monitoring & profiling, power management, power capping & prediction.

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Programming environment:
- CentOS
- PGI
- GNU
- Scientific libraries
- Off-the-shelf applications
https://public.dhe.ibm.com/common/ssi/ecm/po/en/pol03251usen/pol03251usen.pdf

CURRENT STATUS (JUNE 2017)
- 45 nodes (air cooled) up and running at E4's integration facility
- Nodes at nominal configuration: 2x IBM POWER8, 4x NVIDIA P100 SXM2 with NVLink, 2x IB EDR
- CentOS, GNU, PGI
- Running the baseline performance tests is a prerequisite for measuring the improvements once the final PCP Phase III system is deployed
- Access granted to selected users

TIMELINE July/August 2017: phased conversion of the nodes
- Nodes are moved to the OCP chassis
- Water-cooling components are added
- The power- and energy-monitoring & management infrastructure is installed
- Each node is tested

TIMELINE July/August 2017
- As nodes are converted and tested, they are shipped in batches to CINECA
- In-house testing under operating conditions
- Envisioned rate: 3 to 4 nodes per week
- The first batch of nodes will take longer because of the learning curve
- The rate is conservative because it accounts for potential problems/malfunctions requiring additional rework

TIMELINE August/September 2017
- Installation and configuration of the system at CINECA
- Preliminary access for selected users, according to PRACE practices and policies
- Contacts: Fabrizio Magugliani (fabrizio.magugliani@e4company.com), Carlo Cavazzoni (c.cavazzoni@cineca.it)

TIMELINE October 2017
- Run the contractually required codes: QuantumEspresso, BQCD, SPECFEM3D, HPL, NEMO
- Check and compare the results against the air-cooled results to demonstrate: same (or better) performance on the liquid-cooled configuration than on the air-cooled one; lower power consumption; better throughput
- Remember: the project's goal is a Whole System Design for Energy Efficient HPC

TIMELINE November 2017
- Staged access for selected users (according to PRACE practices and policies)
- Handing off the system to CINECA (contractual deadline)
- Continuous monitoring of the system by E4's staff

PCP PHASE III D.A.V.I.D.E. SUPERCOMPUTER (Development of an Added Value Infrastructure Designed in Europe) Key take-aways:
- Research and production HPC system
- European IP
- Liquid cooling
- Continuous out-of-band measuring, monitoring and capping of compute-node energy usage and performance, with no impact on application performance
- Standard programming model
- Lower power consumption without impacting performance

CONTACTS
Email: fabrizio.magugliani@e4company.com
E4 Computer Engineering SpA, Via Martiri della Libertà 66, 42019 Scandiano (RE), Italy
Tel. 0039 0522 991811