The ICARUS white paper: A scalable, energy-efficient, solar-powered HPC center based on low power GPUs


The ICARUS white paper: A scalable, energy-efficient, solar-powered HPC center based on low power GPUs
Markus Geveler, Dirk Ribbrock, Daniel Donner, Hannes Ruelmann, Christoph Höppke, David Schneider, Daniel Tomaschewski, Stefan Turek
Unconventional HPC, Euro-Par 2016, Grenoble, 2016/8/23
markus.geveler@math.tu-dortmund.de

Outline
- Introduction: simulation, hardware-oriented numerics and energy efficiency
  - we are energy consumers, and why we cannot continue like this
  - (digital) computer hardware (in a general sense) of the future
- ICARUS
  - architecture and white sheet: system integration of high-end photovoltaic, battery and embedded tech
  - benchmarking of basic kernels and applications: node level and cluster level

Preface: what we usually do
- simulation of technical flows with FEATFLOW
- performance engineering for hardware efficiency, numerical efficiency and energy efficiency

Motivation
Computers will require more energy than the world generates by 2040. "Conventional approaches are running into physical limits. Reducing the 'energy cost' of managing data on-chip requires coordinated research in new materials, devices, and architectures" - the Semiconductor Industry Association (SIA), 2015.
[Figure: projected energy per year E [J/year] on a log scale from 1E+14 J up to 1E+21 J (= 1000 Exajoule), 2015-2040: world energy production; today's and future systems' consumption (extrapolating ever smaller transistors); and the lower bound on systems' consumption (Landauer limit). Extrapolated consumption overtakes world energy production around 2040.]
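The Landauer limit behind that lower-bound curve is a physical constant, not an engineering target; a minimal computation of the per-bit cost at room temperature:

```python
# Landauer limit: erasing one bit of information dissipates at least
# k * T * ln(2) joules; at T = 300 K this is about 2.9e-21 J per bit.
import math

k_boltzmann = 1.380649e-23  # Boltzmann constant [J/K]
t_kelvin = 300.0            # room temperature
print(k_boltzmann * t_kelvin * math.log(2))  # ~2.87e-21 J per bit
```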

Motivation
What we can do:
- enhance supplies (higher production)
- enhance energy efficiency: clever engineering (chip level), more energy efficiency in hardware (system level)
- build a renewable power source into the system
[Figure: the same energy projection (world energy production, today's and future systems' consumption, Landauer limit, 2015-2040), annotated with how higher production and better energy efficiency widen the gap.]

HPC Hardware
Today's HPC facilities (?): Green500 list, Nov 2015 (www.green500.org)

Green500 rank | Top500 rank | Total power [kW] | MFlops/W | Year | Hardware architecture
1 | 133 | 50 | 7031 | 2015 | ExaScaler-1.4 80Brick, Xeon E5-2618Lv3 8C 2.3GHz, Infiniband FDR, PEZY-SC
2 | 392 | 51 | 5331 | 2013 | LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-2620v2 6C 2.1GHz, Infiniband FDR, NVIDIA Tesla K80
3 | 314 | 57 | 5271 | 2014 | ASUS ESC4000 FDR/G2S, Intel Xeon E5-2690v2 10C 3GHz, Infiniband FDR, AMD FirePro S9150
4 | 318 | 65 | 4778 | 2015 | Sugon Cluster W780I, Xeon E5-2640v3 8C 2.6GHz, Infiniband QDR, NVIDIA Tesla K80
5 | 102 | 190 | 4112 | 2015 | Cray CS-Storm, Intel Xeon E5-2680v2 10C 2.8GHz, Infiniband FDR, NVIDIA Tesla K80
6 | 457 | 58 | 3857 | 2015 | Inspur TS10000 HPC Server, Xeon E5-2620v3 6C 2.4GHz, 10G Ethernet, NVIDIA Tesla K40
7 | 225 | 110 | 3775 | 2015 | Inspur TS10000 HPC Server, Intel Xeon E5-2620v2 6C 2.1GHz, 10G Ethernet, NVIDIA Tesla K40

None of these is a Top500 top scorer: the Top500 #1 draws 17,000 kW at about 1,900 MFlop/s/W. The energy-efficiency leaders are unconventional hardware.

What we can expect from hardware
Attainable performance follows the roofline model: performance = min(theoretical peak performance, stream bandwidth × AI), with the arithmetic intensity AI = #flops / #bytes.
[Figure: roofline plot of performance over AI: memory-bound kernels such as copy2dram, SpMV, BLAS 1/2, FEM and lattice methods sit on the stream bandwidth × AI slope; GEMM and particle methods land close to theoretical peak performance.]
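A minimal sketch of this bound, with assumed Tegra K1 ballpark figures (about 327 GFlop/s SP peak and about 14.9 GB/s LPDDR3 bandwidth; the kernel intensities are illustrative, not measured):

```python
# Roofline bound: a kernel can run no faster than the compute peak,
# nor faster than the memory system can feed it (bandwidth * AI).
def roofline_gflops(ai_flops_per_byte, peak_gflops, bandwidth_gb_s):
    return min(peak_gflops, bandwidth_gb_s * ai_flops_per_byte)

PEAK, BW = 327.0, 14.9  # assumed Tegra K1 SP peak [GFlop/s] and bandwidth [GB/s]
for name, ai in [("stream copy", 0.0625), ("SpMV", 0.25), ("GEMM", 32.0)]:
    print(f"{name:12s} {roofline_gflops(ai, PEAK, BW):7.1f} GFlop/s")
# stream copy and SpMV stay far below peak (bandwidth-bound),
# GEMM hits the compute roof.
```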

Total efficiency of simulation software
Aspects:
- Numerical efficiency: dominates asymptotic behaviour and wall clock time.
- Hardware efficiency: exploit all levels of parallelism provided by the hardware (SIMD, multithreading on a chip/device/socket, multiprocessing in a cluster, hybrids), then try to reach good scalability (communication optimisations, blocking/overlapping of communication and computation).
- Energy efficiency:
  - by hardware: what is the most energy-efficient computer hardware? What is the best core frequency? What is the optimal number of cores used? (see the sketch after this list)
  - by software: as a direct result of performance, but it is not all about performance.
Hardware-oriented numerics: enhance hardware and numerical efficiency simultaneously, and use the (most) energy-efficient hardware (and hardware settings) where available. Attention: codependencies!
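The "best core frequency" question can be made concrete with a simple DVFS energy model; a minimal sketch under assumed model parameters (cubic dynamic power, compute-bound runtime), not ICARUS measurements:

```python
# Energy-to-solution E(f) = P(f) * t(f) with an assumed power model
# P(f) = P_static + c * f^3 and compute-bound runtime t(f) = W / f.
# Neither the lowest nor the highest frequency minimises energy.
def energy_to_solution(f_ghz, work_gcycles=100.0, p_static_w=2.0, c=1.0):
    t = work_gcycles / f_ghz         # runtime [s], compute-bound assumption
    p = p_static_w + c * f_ghz ** 3  # power [W], cubic dynamic-power model
    return p * t                     # energy [J]

for f in (0.5, 1.0, 1.5, 2.0):
    print(f"{f} GHz -> {energy_to_solution(f):6.1f} J")
# Minimum near 1.0 GHz for these parameters: static power penalises
# running slowly, dynamic power penalises running fast.
```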

Hardware-oriented Numerics: Energy Efficiency
Energy consumption/efficiency is one of the major challenges for future supercomputers (the 'exascale' challenge). In 2012 we proved that we can solve PDEs for less energy 'than normal' simply by switching the computational hardware from commodity to embedded: Tegra 2 (2x ARM Cortex-A9) in the Tibidabo system of the Mont-Blanc project. The trade-off between energy and wall clock time: ~3x less energy, ~5x more time!

Hardware-oriented Numerics: Energy Efficiency
To be more energy-efficient on different computational hardware, that hardware has to dissipate less power than the other at the same performance: more performance per watt! The power reduction must outweigh the slowdown (powerdown > speeddown), hence ~3x less energy despite ~5x more time.
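A quick sanity check of those numbers via E = P × t; the absolute wattage here is an assumption for illustration, only the ratios matter:

```python
# 5x slower but 3x less energy implies 15x lower average power, since
# E = P * t: (P/15) * (5t) = (P*t)/3.
p_commodity_w, t_commodity_s = 150.0, 100.0   # assumed baseline node
e_commodity = p_commodity_w * t_commodity_s

p_embedded_w = p_commodity_w / 15.0           # powerdown: 15x
t_embedded_s = t_commodity_s * 5.0            # speeddown: 5x
e_embedded = p_embedded_w * t_embedded_s

print(e_commodity / e_embedded)               # -> 3.0 (~3x less energy)
```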

Hardware-oriented Numerics: Energy Efficiency, technology of ARM-based SoCs since 2012
Something has been happening in the mobile computing hardware evolution:
- Tegra 3 (late 2012) was also based on the A9, but had 4 cores
- Tegra 4 (2013) is built upon the A15 core (higher frequency) and had more RAM, and LPDDR3 instead of LPDDR2
- Tegra K1 (32 bit, late 2014): CPU pretty much like the Tegra 4 but with higher frequency and more memory; the TK1 went GPGPU and comprises a programmable Kepler GPU on the same SoC!
- The promise: 350+ GFlop/s for less than 11 W. For comparison, a Tesla K40 + x86 CPU deliver 4200 GFlop/s for 385 W, so ~2.5x higher energy efficiency is promised (see the sketch below).
Interesting for scientific computing: higher energy efficiency than a commodity accelerator (of that time)!
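Checking that promise with the slide's nominal figures (with these round numbers the ratio actually comes out closer to 2.9x; the quoted ~2.5x presumably rests on more conservative sustained values):

```python
# Promised energy efficiency in GFlop/s per watt, from the nominal figures.
tk1_gflops, tk1_watts = 350.0, 11.0     # Tegra K1 SoC (promised, SP)
k40_gflops, k40_watts = 4200.0, 385.0   # Tesla K40 + x86 host

ee_tk1 = tk1_gflops / tk1_watts         # ~31.8 GFlop/s/W
ee_k40 = k40_gflops / k40_watts         # ~10.9 GFlop/s/W
print(ee_tk1, ee_k40, ee_tk1 / ee_k40)  # ratio ~2.9x with these numbers
```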

An off-grid compute center of the future
Vision: an insular compute center for applied mathematics, with
- renewables-provided power supply, based on
- unconventional compute hardware, paired with
- simulation software for technical processes
Motivation:
- system integration for scientific HPC: high-end unconventional compute hardware, high-end renewable power source (photovoltaic), specially tailored numerics and simulation software
- no future spending on energy consumption
- SME-class resource: < 80K
- scalability, modular design (simplicity, maintainability, safety), ...

Cluster Whitesheet
- nodes: 60x NVIDIA Jetson TK1
- #cores (ARM Cortex-A15): 240
- #GPUs (Kepler, 192 cores each): 60
- RAM/node: 2 GB LPDDR3
- switches (GBit Ethernet): 3x L1, 1x L2
- cluster theoretical peak performance: ~20 TFlop/s SP (see the sketch below)
- cluster peak power: < 1 kW, provided by PV
- PV capacity: 8 kWp
- battery: 8 kWh
- software: FEAT (optimised for the Tegra K1): www.featflow.de
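Where the ~20 TFlop/s SP figure comes from, as a back-of-the-envelope sketch (the 852 MHz GPU clock is the published Tegra K1 figure, assumed here rather than taken from the slide):

```python
# Theoretical SP peak of the cluster: cores * 2 flops (FMA) * clock * nodes.
cuda_cores = 192        # Kepler SMX on the Tegra K1
flops_per_cycle = 2     # one FMA = 2 single-precision flops per core
gpu_clock_ghz = 0.852   # assumed published Tegra K1 GPU clock

gflops_per_node = cuda_cores * flops_per_cycle * gpu_clock_ghz  # ~327 GFlop/s
print(60 * gflops_per_node / 1000.0)  # ~19.6 TFlop/s SP for 60 nodes
```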

Cluster Architecture

Housing and power supply
Photovoltaic units and battery rack:
- primary: approx. 16 m x 3 m area, 8 kWp
- secondary: for ventilation and cooling
- battery: Li-ion, 8 kWh
- 2 solar converters, 1 battery converter
Modified overseas cargo container:
- steel dry cargo container (high cube) with dimensions 20 x 8 x 10 feet
- climate insulation (90 mm)
- only connection to infrastructure: network cable

ICARUS compute hardware
We can build better computers.

Testhardware and measuring
Complete 'box': power is measured at the AC converter (inlet), i.e. all power needed for the node is included. Note: this is not a chip-to-chip comparison, but a system-to-system one.
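One way to turn such inlet power readings into an energy-to-solution figure is to integrate sampled power over the run; a minimal sketch, where read_inlet_watts() is a hypothetical stand-in for whatever power meter or PDU interface is actually attached at the inlet:

```python
# Energy-to-solution from sampled inlet power: E ~ sum(P_i * dt) [joules].
import threading
import time

def measure_energy_joules(run_kernel, read_inlet_watts, dt_s=0.1):
    samples = []
    worker = threading.Thread(target=run_kernel)  # workload under test
    t0 = time.time()
    worker.start()
    while worker.is_alive():            # sample power while the kernel runs
        samples.append(read_inlet_watts())
        time.sleep(dt_s)
    worker.join()
    wall_s = time.time() - t0
    return sum(samples) * dt_s, wall_s  # (joules, seconds)
```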

Sample results: compute-bound, CPU

Sample results: compute-bound, GPU

Sample results: memory bandwidth-bound, GPU

Sample results: applications, DG-FEM, single node

Sample results: applications, LBM, full cluster (CPU and GPU)

Sample results: power supply

More experiences so far
Reliability:
- cluster operation since March 2016
- 61 Jetson boards: 57 working permanently, 4 with uncritical fan failure
- uptime: since first startup (almost), except for maintenance
Temperature on warm days (31 degrees Celsius external, 50% humidity):
- 33 degrees Celsius ambient temperature in the container, 35% relative humidity
- 39-43 degrees Celsius on chip in idle mode, approx. 68 degrees Celsius at load
- monitored by rack PDU sensors

Conclusion and outlook
ICARUS:
- built-in power source
- built-in ventilation, cooling, heating
- no infrastructure needs (except for area)
- Jetson boards are 'cool': with the currently installed hardware, no additional cooling is needed
- versatile: compute (or other) hardware can be exchanged easily
- off-grid resource: can be deployed in areas with weak infrastructure, e.g. in developing nations/regions, or as a secondary/emergency system

Conclusion and outlook
To be very clear:
- We do not see a future where the Jetson TK1 or similar boards are the compute nodes.
- We wanted to show that we can build compute hardware differently, and that for a certain kind of application it works.
- We showed that system integration of renewables and compute tech is possible: build the renewable power source into the system.
The future:
- Tegra X1, ..., or other commodity GPUs (?)
- collect data all year (weather!)
- learn how to improve all components
[Figure: the energy projection chart from the motivation slides again (world energy production, today's and future systems' consumption, Landauer limit lower bound, 2015-2040).]

Thank you www.icarus-green-hpc.org This work has been supported in part by the German Research Foundation (DFG) through the Priority Program 1648 Software for Exascale Computing (grant TU 102/48). ICARUS hardware is financed by MIWF NRW under the lead of MERCUR.