HPC Architectures evolution: the case of Marconi, the new CINECA flagship system. Piero Lanucara
1 HPC Architectures evolution: the case of Marconi, the new CINECA flagship system Piero Lanucara
2 BG/Q (Fermi) as a Tier-0 resource. Many advantages as a supercomputing resource: low energy consumption, limited floor-space requirements, a fast internal network, and a homogeneous architecture with a simple usage model. But the low single-core performance and the I/O structure meant that very high parallelism was necessary (at least 1024 cores). For some applications the low memory per core (1 GB) and the I/O performance were also a problem, as were the limited capabilities of the OS on the compute nodes (e.g. no interactive access). Cross-compilation, because the login nodes differ from the compute nodes, can complicate some build procedures. FERMI was scheduled to be decommissioned in mid-to-late 2016.
3 Replacing Fermi at Cineca - considerations. A new procurement is a complicated process that weighs many factors, which must include (together with the price): minimum peak compute power, power consumption, floor space required, availability, disk space, internal network, etc. IBM no longer offers the BlueGene range for supercomputers, so it cannot be the solution. Many computing centres are instead adopting a heterogeneous model for their clusters.
4 Replacing Fermi: the Marconi solution (roadmap figure). Fermi: 2 PFlops, 0.8 MWatt. Galileo & PICO: 1.2 PFlops, 0.4 MWatt. Marconi A1: 2.1 PFlops, 0.7 MWatt. Marconi A2: 11 PFlops, 1.3 MWatt. Marconi A3: 7 PFlops, 1.0 MWatt. System A4 (?): 50 PFlops, 3.2 MWatt. The roadmap also gives, per phase, the total power (1.2 / 2.4 / 2.3 / 2.3 / 3.2 MWatt), rack count (50 / 120 / 120 / 120 / 150 racks) and floor space (100 / 240 / 240 / 240 / 300 m2), plus procurement milestone labels (Alpha, Volume, Beta, Commit, Wins).
5 Marconi high-level system characteristics (tender proposal).
  A1 - Broadwell (2.1 PFlops): installed April 2016, Intel Xeon E5 v4, 1512 nodes.
  A2 - Knights Landing (11 PFlops): installed September 2016, Intel KNL, 3600 nodes.
  A3 - Skylake (7 PFlops expected): installed June 2017, Intel Xeon E5 v5, >1500 nodes.
  Network: Intel Omni-Path.
6 System overview (diagram): 1 PFs non-conventional (KNL), 5 PFs conventional, 1 PFs conventional. A1 BRD: 2x18 cores, 2.3 GHz; 1500 nodes, 2 PFs. A2 KNL: 68 cores, 1.4 GHz; 3600 nodes, 11 PFs. A3 SKL: 2x20 cores, 2.3 GHz; >1500 nodes, 7 PFs.
7 Storage. GSS*: 5 PB scratch area (50 GB/s); by January 2017: 10 PB long-term storage (40 GB/s); 20 PB tape library. (Compute partitions as in the previous slide.)
8 Marconi - Compute. A1 (half reserved to EUROfusion): 1512 Lenovo NeXtScale servers -> 2 PFlops; Intel Xeon E5 v4 (Broadwell), 2.3 GHz, dual-socket nodes: 36 cores and 128 GByte per node. A2 (1 PFlops to EUROfusion): 3600 Intel Adams Pass servers -> 11 PFlops; Intel Xeon Phi, code name Knights Landing (KNL), 1.4 GHz, single-socket nodes: 96 GByte DDR4 + 16 GByte MCDRAM. A3 (greater part reserved to EUROfusion): >1500 Lenovo Stark servers -> 7 PFlops; Intel E5-2680 v5 (Skylake), 2.3 GHz, dual-socket nodes: 40 cores and 196 GByte per node.
11 Marconi - Network. Network type: new Intel Omni-Path - the largest Omni-Path cluster in the world. Network topology: fat tree with 2:1 oversubscription, tapering at the level of the core switches only. Core switches: 5 OPA core switches (Sawtooth Forest), 768 ports each. Edge switches: 216 OPA edge switches (Eldorado Forest), 48 ports each. Maximum system configuration: 5 (OPA) x 768 (ports) x 2 (tapering) -> 7680 servers.
12 Network diagram: 5 x 768-port core switches; 216 x 48-port edge switches; 32 downlinks per edge switch, giving fully interconnected 32-node islands; 6624 compute nodes in total.
13 A1 HPL. Full-system Linpack: 1 MPI task per node. Maximum performance (in the PFlops range) obtained with Turbo OFF; with Turbo ON the system throttles. June 2016: number 46 in the Top500 list.
14 A2 HPL. Full-system Linpack: 3556 nodes, 1 MPI task per node. Maximum performance obtained with Hyper-Threading OFF. November 2016: number 12 in the Top500 list. (HPL output log: run started Fri Nov 4 23:10, finished Sat Nov 5 06:33; residual check ||Ax-b||_oo / (eps * (||A||_oo * ||x||_oo + ||b||_oo) * N) PASSED; 1 test completed and passed residual checks, 0 failed, 0 skipped.)
15 Intel Xeon PHI KNC. An accelerator (like a GPU) but more similar to a conventional multicore CPU. The current version, Knights Corner (KNC), has 61 low-frequency cores, 8-16 GB of RAM and a 512-bit vector unit per core. Cores are connected in a ring topology and MPI is possible. No need to write CUDA or OpenCL, as the Intel compilers will compile Fortran or C code for the MIC. 1-2 TFlops, according to model.
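As a minimal illustration of that last point (a generic sketch, not code from the talk), the same C + OpenMP source runs on a standard Xeon and, with the Intel compiler, can also be built natively for the coprocessor, commonly via the -mmic flag, with no CUDA or OpenCL involved:

    #include <stdio.h>
    #include <omp.h>
    #define N 1000000

    /* Plain C + OpenMP: build for the host (icc -qopenmp) or
     * natively for KNC (icc -qopenmp -mmic) from the same source. */
    int main(void)
    {
        static double a[N], b[N], c[N];

        #pragma omp parallel for
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[0] = %f using up to %d threads\n", c[0], omp_get_max_threads());
        return 0;
    }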
16 A2: Knights Landing (KNL). A big unknown, because very few people currently have access to KNL, but we know the architecture of KNL and its differences from, and similarities to, KNC. The main differences are: KNL will be a standalone processor, not an accelerator (unlike KNC); KNL has more powerful cores and a faster internal network; on-package high-performance memory (16 GB MCDRAM).
17 Intel Xeon Phi KNC-KNL comparison (KNC on Galileo vs KNL on Marconi):
  #cores: 61 (Pentium-based) vs 68 (Atom-based)
  Core frequency: ... vs 1.4 GHz
  Memory: 16 GB GDDR5 vs 96 GB DDR4 + 16 GB MCDRAM
  Internal network: bi-directional ring vs mesh
  Vectorisation: 512-bit unit per core vs 2x AVX-512 per core
  Usage: co-processor vs standalone
  Performance (GFlops): 1208 (DP) / 2416 (SP) vs ~3000 (DP)
  Power: ~300 W vs ~200 W
A KNC core can be 10x slower than a Haswell core; a KNL core is expected to be only 2-3x slower. There are also big differences in memory bandwidth.
20 Coming next: A3. Intel Skylake processors (mid-2017), the successor to the Haswell generation. Expect an increase in performance and power efficiency.
21 Coming next: A3
22 Marconi A1 and A2 exploitation at its best
23 Exploiting the parallel universe. Three levels of parallelism are supported by Intel hardware. Thread-level parallelism: multi-thread/task performance, exposed by programming models, executing tens/hundreds/thousands of tasks concurrently. Vector-level parallelism: single-thread performance, exposed by tools and programming models, operating on 4/8/16 elements at a time. Instruction-level parallelism: single-thread performance, automatically exposed by hardware/tools, effectively limited to a few instructions.
24 A1 exploitation. A1: Broadwell nodes, similar to the Haswell cores present on Galileo. Expect only a small difference in single-core performance wrt Galileo, but a big difference compared to Fermi. More cores per node (36) should mean better OpenMP performance, but MPI performance will also improve (faster network). Life is much easier for SPMD programming models. Cores/node: 36; memory/node: 128 GB. Use SIMD vectorization.
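A minimal sketch of the hybrid SPMD model mentioned above (generic code, not taken from the slides): for example one MPI rank per socket with OpenMP threads filling the remaining cores of a 36-core Broadwell node.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hybrid MPI + OpenMP skeleton, e.g. 2 ranks per node x 18 threads each. */
    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d is running %d OpenMP threads\n",
                   rank, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }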
25 Single Instruction Multiple Data (SIMD) vectorization. A technique for exploiting VLP on a single thread: operate on more than one element at a time, which might decrease instruction counts significantly. Elements are stored in SIMD registers (vectors). Code needs to be vectorized, usually on inner loops; a main loop and a remainder loop are generated.

  Scalar loop:
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

  SIMD loop (4 elements at a time, pseudo-notation):
    for (int i = 0; i < N; i += 4)
        c[i:4] = a[i:4] + b[i:4];
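In practice the SIMD loop is rarely written by hand; a portable way to request it (a generic sketch, assuming an OpenMP 4.0 compiler) is the omp simd directive, which maps to AVX2 lanes (4 doubles) on Broadwell and to AVX-512 lanes (8 doubles) on KNL:

    /* Explicitly ask the compiler to vectorise the inner loop. */
    void vec_add(double *restrict c, const double *restrict a,
                 const double *restrict b, int n)
    {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }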
26 A2 exploitation. A2: KNL. (More) similar to the KNC coprocessors present on Galileo, but with remarkable differences: high-bandwidth MCDRAM available; AVX-512 ISA; binary compatible with standard Xeon. Mixing MPI and OpenMP is still the best choice (performance is less dependent on OpenMP). Cores/node: 68; memory/node: 96 GByte DDR4 + 16 GByte MCDRAM. Use SIMD vectorization.
27 A2: MCDRAM. Memory bandwidth is one of the most common bottlenecks for HPC performance. To meet this demand for memory bandwidth, KNL has an on-package high-bandwidth memory (HBM) based on multi-channel dynamic random access memory (MCDRAM). This memory can deliver up to 5x the performance (~400 GB/s) of the DDR4 memory on the same platform (~90 GB/s).
28 A2: MCDRAM. HBM on KNL can be used as: 1. a last-level cache (LLC), or 2. addressable memory. The configuration is determined at boot time by choosing one of three MCDRAM modes in the BIOS settings: 1. flat mode, 2. cache mode, 3. hybrid mode.
29 A2: MCDRAM The best mode to use will depend on the application.
30 A2: Using HBM as addressable memory. Two methods for this: the numactl tool (works best if the whole app can fit in MCDRAM) and the memkind library (using library calls or compiler directives; needs source modification).
31 A2: Using numactl to access MCDRAM. Run "numactl --hardware" to see the NUMA configuration of your system and look for the node that has memory but no CPUs: that is the MCDRAM. If the total memory footprint of your app is smaller than the size of MCDRAM (check the RSS value with "ps u -C myapp"), use numactl to allocate all of its memory from MCDRAM: numactl --membind=mcdram_id myapp, where mcdram_id is the ID of the MCDRAM "node". If the total memory footprint of your app is larger than the size of MCDRAM, you can still use numactl to allocate part of your app in MCDRAM: numactl --preferred=mcdram_id myapp (allocations that don't fit into MCDRAM spill over to DDR) or numactl --interleave=nodes myapp (allocations are interleaved across all nodes).
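A quick way to see the effect of the binding (a generic micro-benchmark, not from the talk) is to run a triad-style kernel twice, once bound to the DDR node and once to the MCDRAM node, e.g. numactl --membind=0 ./triad versus numactl --membind=1 ./triad, where the actual node IDs come from numactl --hardware:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    /* STREAM-triad-like kernel; the ~3 GiB working set fits in 16 GB MCDRAM. */
    int main(void)
    {
        const long n = 1L << 27;            /* 2^27 doubles = 1 GiB per array */
        double *a = malloc(n * sizeof(double));
        double *b = malloc(n * sizeof(double));
        double *c = malloc(n * sizeof(double));
        if (!a || !b || !c) return 1;

        #pragma omp parallel for
        for (long i = 0; i < n; i++) { b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + 3.0 * c[i];
        double t1 = omp_get_wtime();

        printf("a[0] = %.1f, triad bandwidth: %.1f GB/s\n",
               a[0], 3.0 * n * sizeof(double) / (t1 - t0) / 1e9);
        free(a); free(b); free(c);
        return 0;
    }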
32 A2: Using Memkind to access MCDRAM. The memkind library is a user-extensible heap manager built on top of jemalloc, a C library for general-purpose memory allocation. The library is generalizable to any NUMA architecture, but on Knights Landing processors it is used primarily for manual allocation to HBM: it provides special allocators for C/C++ and has limited support for Fortran.
33 A2: Using Memkind - C case.
  Allocate 1000 floats from DDR:
    float *fv;
    fv = (float *)malloc(sizeof(float) * 1000);
  Allocate 1000 floats from MCDRAM:
    float *fv;
    fv = (float *)hbw_malloc(sizeof(float) * 1000);
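A complete, compilable version of the two snippets above (a sketch assuming memkind's hbwmalloc interface: include hbwmalloc.h and link with -lmemkind):

    #include <stdio.h>
    #include <stdlib.h>
    #include <hbwmalloc.h>      /* hbw_malloc / hbw_free / hbw_check_available */

    int main(void)
    {
        const size_t n = 1000;

        /* Warn if no MCDRAM is visible; hbw_malloc would then fall back to DDR. */
        if (hbw_check_available() != 0)
            fprintf(stderr, "warning: no high-bandwidth memory detected\n");

        float *ddr = (float *)malloc(n * sizeof(float));      /* regular DDR4 */
        float *hbm = (float *)hbw_malloc(n * sizeof(float));  /* MCDRAM */
        if (!ddr || !hbm) return 1;

        for (size_t i = 0; i < n; i++)
            hbm[i] = ddr[i] = (float)i;

        printf("hbm[42] = %f\n", hbm[42]);
        free(ddr);
        hbw_free(hbm);
        return 0;
    }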
34 A2: Using Memkind - Fortran case.
  c     Declare arrays to be dynamic
        REAL, ALLOCATABLE :: A(:), B(:), C(:)
  !DEC$ ATTRIBUTES FASTMEM :: A
        NSIZE = 1024
  c     Allocate array 'A' from MCDRAM
        ALLOCATE (A(1:NSIZE))
  c     Allocate arrays that will come from DDR
        ALLOCATE (B(NSIZE), C(NSIZE))
37 Porting Applications onto Marconi A1 and A2. Application developers must utilise vectorisation (SIMD) and/or MPI (OpenMP) processes (threads), and possibly the fast memory of KNL (MCDRAM). From the user perspective, success will depend on how well software tools and applications can exploit the KNL architecture. Key idea: use a proxy (KNC): SPECFEM3D_GLOBE. Already reasonable results with KNC in the framework of the activity: a good amount of vectorisation (FORCE_VECTORIZATION preprocessing enabled and SIMD optimisation) suitable for KNC and the future KNL, and scaling to a high number of OpenMP threads (up to more than 60 on KNC). Worth noting that up to now KNC hasn't been widely supported by geophysical application developers and users; a remarkable exception is the SPECFEM3D_GLOBE software (CIG repository), where the native version is maintained and tested. Again, this should be fine for its KNL startup.
38 Good practice: vectorization (SPECFEM3D_GLOBE KNC investigation). SPECFEM3D_GLOBE Regional_MiddleEast test case, forward simulation: elapsed times on Haswell (Galileo) and KNC (Galileo) with the speedup wrt Haswell; a comparison without vectorisation gives the slowdown factor wrt the vectorised runs (about 2x on KNC). Based on a 4-node Galileo partition (16 MPI processes; 4 and 60 OpenMP threads on Haswell and KNC respectively).
39 Good practice: choosing MCDRAM memory modes. MCDRAM cache mode should be a good choice for most applications, but applications with cache-unfriendly data are candidates for the other memory modes. In that case, try to identify what to put in MCDRAM: timings with selected data items allocated in fast/slow memory (intrusive), or memory profiling (VTune Amplifier tool).
40 Conclusions. Marconi A1: moderate improvements over recent years, but a big improvement compared to Fermi. High expectations for Marconi A2 (KNL) performance. KNC paves the way for increasing performance: try to manage domain parallelism, increase threading, exploit data parallelism (vectorisation) and improve data locality (a new opportunity: use the on-package memory).