SuperMUC @ Leibniz Supercomputer Centre Movie on YouTube

Peak Performance
Peak performance: 3 PetaFlops = 3 * 10^15 Flops.
Prefixes: Mega 10^6 (million), Giga 10^9 (billion), Tera 10^12 (trillion), Peta 10^15 (quadrillion), Exa 10^18 (quintillion), Zetta 10^21 (sextillion).
Flops: Floating Point Operations per Second.

Distributed Memory Architecture
18 partitions, called islands, with 512 nodes each.
A node is a shared-memory system with 2 processors: Sandy Bridge-EP Intel Xeon E5-2680 8C, 2.7 GHz (Turbo 3.5 GHz), 32 GByte memory, Infiniband network interface.
Each processor has 8 cores with 2-way hyperthreading: 21.6 GFlops per core @ 2.7 GHz, 172.8 GFlops per processor.
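
These figures follow from the AVX rate of 8 double-precision flops per cycle per core on Sandy Bridge (4 adds plus 4 multiplies). A minimal shell sketch of the arithmetic (the variable names are ours; the values come from this and the previous slide):

    # Peak-performance arithmetic for the SuperMUC thin-node islands (sketch)
    islands=18; nodes_per_island=512; sockets_per_node=2; cores_per_socket=8
    flops_per_cycle=8    # AVX: 4 DP adds + 4 DP multiplies per cycle
    ghz=2.7
    echo "per core:      $(echo "$flops_per_cycle * $ghz" | bc) GFlops"                      # 21.6
    echo "per processor: $(echo "$cores_per_socket * $flops_per_cycle * $ghz" | bc) GFlops"  # 172.8
    echo "system peak:   $(echo "$islands * $nodes_per_island * $sockets_per_node * $cores_per_socket * $flops_per_cycle * $ghz / 1000000" | bc -l) PFlops"  # ~3.2, i.e. the ~3 PFlops quoted above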

Sandy Bridge Processor
8 multithreaded cores; each core has a private L1 (32 KB) and L2 (256 KB) cache and a 2.5 MB slice of the shared L3 cache.
Latency: L1 4 cycles, L2 12 cycles, L3 31 cycles.
Bandwidth: L1 2*16 Bytes/cycle, L2 32 Bytes/cycle, L3 32 Bytes/cycle.
The on-chip network runs at the core frequency and also connects the cores to memory, QPI and PCIe.
L3 cache: partitioned, with cache coherence based on core valid bits; physical addresses are distributed over the partitions by a hash function.

NUMA Node
2 Sandy Bridge processors, each with 4 * 4 GB of local memory (32 GB per node), connected by 2 QPI links (each 2 GT/s).
8x PCIe 3.0 (8 GB/s) to the Infiniband adapter.
Aggregate memory bandwidth per node: 102.4 GB/s.
Latency: local ~50 ns (~135 cycles @ 2.7 GHz), remote ~90 ns (~240 cycles).
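
The two-socket layout and the local/remote distinction can be inspected and exploited with the standard Linux numactl tool (a sketch, assuming numactl is installed on the nodes; appl stands for your application):

    numactl --hardware                          # show NUMA nodes, their CPUs, memory sizes and distances
    numactl --cpunodebind=0 --membind=0 ./appl  # pin CPUs and memory to socket 0, avoiding the ~90 ns remote accesses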

Interconnection Network
Infiniband FDR-10 (FDR: fourteen data rate); FDR-10 has an effective data rate of 41.25 Gb/s.
Latency: 100 ns per switch, 1 us MPI latency. Vendor: Mellanox.
Intra-island topology: non-blocking tree, 256 communication pairs can talk in parallel.
Inter-island topology: pruned tree 4:1, 128 links per island to the next level.

[Interconnect topology diagram: 126 spine switches built from 36-port switches; each of the 18 compute islands plus the IO island has a 648-port switch; 516 nodes attach to the island switch via 516 links, and 126 links go up to the spine; the remaining 19 links serve the fat nodes and IO.]

9288 Compute Nodes
Cold corridor; Infiniband (red) and Ethernet (green) cabling. (Photo: Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division)

Infiniband Interconnect
19 Orcas (the 648-port island switches), 126 spine switches, 11900 Infiniband cables. (Photo: Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division)

IO System
[Diagram: storage attached via the spine Infiniband switches and the login nodes.]
GPFS for $WORK and $SCRATCH: 10 PB @ 200 GB/s.
$HOME: 5 PB @ 80 Gb/s.
Archive: 30 PB @ 10 GbE.
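
Large job I/O therefore belongs on the GPFS file systems rather than in $HOME. A minimal sketch inside a job step (the run directory name is illustrative, and it assumes $SCRATCH is set as an environment variable on the system):

    mkdir -p $SCRATCH/myrun     # illustrative run directory on the fast parallel file system
    cd $SCRATCH/myrun
    poe appl > run.log          # let the application write its bulk data here
    cp run.log $HOME/           # copy back only the small files worth keeping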

Parallel File System GPFS
10 PByte capacity, 200 GByte/s I/O bandwidth, 9 DDN SFA12K controllers, 5040 * 3 TByte SATA disks. (Photo: Matthias Brehm, Herbert Huber, LRZ High Performance Systems Division)

SuperMIC Intel Xeon Phi Cluster
32 nodes, each with:
2 Xeon Ivy-Bridge processors E5-2650, 8 cores each, 2.6 GHz clock frequency
2 Intel Xeon Phi coprocessors 5110P, 60 cores @ 1.1 GHz
Memory: 64 GB host memory, 2 x 8 GB on the Xeon Phis

Intel Xeon Phi 5110P
Number of cores: 60
Frequency of cores: 1.1 GHz
GDDR5 memory size: 8 GB
Number of hardware threads per core: 4
SIMD vector registers: 32 (512-bit wide) per thread context
Flops/cycle: 16 (DP), 32 (SP)
Theoretical peak performance: 1 TFlop/s (DP), 2 TFlop/s (SP) (60 cores * 16 DP flops/cycle * 1.1 GHz ≈ 1.06 TFlop/s)
L2 cache per core: 512 KB
Connection to host: 6.2 GB/s

Nodes with Coprocessors

Access to SuperMIC
Login to SuperMUC, then login to SuperMIC: ssh supermic.smuc.lrz.de
Batch jobs: LoadLeveler script with class phi.
Interactive access to nodes and coprocessors: submit a batch script with a sleep command (see the sketch below), then
login to the compute nodes: ssh i01r13???
login to the MIC coprocessors: ssh i01r13???-mic0 or ssh i01r13???-mic1 (PPK required).
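
A minimal sketch of that interactive workflow, assuming a LoadLeveler file job_mic.scp whose only purpose is to hold the node (file name, sleep duration and keyword selection are ours; the keywords themselves are the ones used in the job-script slide further down):

    #!/bin/bash
    # job_mic.scp: hold one SuperMIC node for interactive use
    #@ job_type = parallel
    #@ class = phi
    #@ node = 1
    #@ wall_clock_limit = 01:00:00
    #@ queue
    sleep 3600                    # keep the allocation alive for one hour

    # then, once the job is running:
    llsubmit job_mic.scp          # submit; llq shows which i01r13??? node was allocated
    ssh i01r13???                 # log in to that node (fill in the number reported by llq)
    ssh i01r13???-mic0            # and from there to its first coprocessor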

The Compute Cube of LRZ
[Building diagram: cooling towers (Rückkühlwerke); column-free supercomputer hall (Höchstleistungsrechner, säulenfrei); access bridge (Zugangsbrücke); server/network; archive/backup; cooling (Klima); electrical (Elektro).]

Run Jobs in Batch
Advantages: reproducible performance, larger jobs possible, no need to interactively poll for resources.
Test queue: max 1 island, 32 nodes, 2 h, 1 job in queue.
General queue: max 1 island, 512 nodes, 48 h.
Large queue: max 4 islands, 2048 nodes, 48 h.
Special queue: max 18 islands.

Job Script
#!/bin/bash
#@ wall_clock_limit = 00:4:00
#@ job_name = add
#@ job_type = parallel
#@ class = test
#@ network.mpi = sn_all,not_shared,us
#@ output = job$(jobid).out
#@ error = job$(jobid).out
#@ node = 2
#@ total_tasks = 4
#@ node_usage = not_shared
#@ queue
. /etc/profile
cd ~/apptest/application
poe appl

llsubmit job.scp (submit the script to the batch system)
llq -u $USER (check the status of your own jobs)
llcancel <jobid> (kill a job that is no longer needed)

Limited CPU Hours Available
Please specify your job requirements as tightly as possible and do not request more nodes than required: we have to pay for all allocated cores, not only the ones actually used.
Short (< 1 sec) sequential runs can be done on the login node. Even short OpenMP runs can be done on the login node.
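
A short OpenMP smoke test on the login node could look like this (thread count and program name are illustrative; anything that runs longer than a few seconds belongs in the batch system):

    export OMP_NUM_THREADS=4    # keep it small, the login node is shared
    time ./appl                 # must finish within about a second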

Login to SuperMUC, Documentation
First change the standard password: https://idportal.lrz.de/r/entry.pl
Login via lxhalle, due to restrictions on which machines may connect: ssh <userid>@supermuc.lrz.de
No outgoing connections are allowed from SuperMUC.
Documentation:
http://www.lrz.de/services/compute/supermuc/
http://www.lrz.de/services/compute/supermuc/loadleveler/
Intel compiler: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011update/cpp/lin/index.htm
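
Because of this restriction, logging in is effectively a two-hop ssh via an lxhalle machine (the gateway host name below is a placeholder, not an actual LRZ host name):

    ssh <userid>@<lxhalle-host>     # placeholder: one of the lxhalle machines
    ssh <userid>@supermuc.lrz.de    # from there, on to SuperMUC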

Batch Script Parameters
#@ energy_policy_tag = NONE switches off the automatic adaptation of the core frequency (useful for performance measurements).
Task placement keywords: #@ node = 2, #@ total_tasks = 4, #@ task_geometry = {(0,2) (1,3)}, #@ tasks_per_node = 2.
Limitations on how these keywords may be combined are documented on the LRZ web page; see also the sketch below.
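
As a sketch of how these keywords interact (not every combination is legal, see the LRZ documentation): both variants below start 4 tasks on 2 nodes, but task_geometry additionally fixes which ranks share a node and replaces the node/tasks_per_node keywords.

    # Variant 1: LoadLeveler spreads 4 tasks evenly over 2 nodes
    #@ node = 2
    #@ tasks_per_node = 2

    # Variant 2: explicit placement; ranks 0 and 2 share the first node,
    #            ranks 1 and 3 share the second node
    #@ task_geometry = {(0,2) (1,3)}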

Compiler
Intel C++: icc, version 12.1.
Editors: vi, emacs, xedit.
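
A typical compile line for this compiler version might be (flags and file name are illustrative; MPI codes would use the corresponding compiler wrapper on the system instead):

    icc -O2 -openmp -o appl appl.c    # optimization plus OpenMP; -openmp is the icc 12.1 spelling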