Resources Current and Future Systems. Timothy H. Kaiser, Ph.D.

Resources Current and Future Systems. Timothy H. Kaiser, Ph.D. (tkaiser@mines.edu)

Most likely a talk to be out of date:
- History of the Top 500
- Issues with building bigger machines
- Current and near-future academic machines

Top 500 list
- Ranks computers based on performance on a linear solve (the HPL benchmark)
- http://www.top500.org/

Top500 Benchmarks, Spring 2013

Trends

BlueM: Mines' Supercomputer
- 154 Tflops, 17.4 TBytes of memory, 10,496 cores, 85 kW
- Five racks (not full), dual architecture
- Two distinct compute units: iDataPlex and Blue Gene Q (best of both worlds)
- Shared 480 TByte file system
- Compact, low power consumption

BlueM's Compute Units - AuN
- AuN (Golden): iDataPlex with dual 8-core Intel Sandy Bridge processors per node
- 144 nodes, 2,304 cores, 9,216 GBytes of memory, 50 Tflops
- Features: latest-generation Intel processors, large memory per node
- Common architecture and a user environment similar to RA and Mio, so researchers can get up and running quickly

BlueM's Compute Units - MC2
- MC2 (Energy): Blue Gene Q with 17-core PowerPC A2 processors
- 512 nodes, 8,192 compute cores, 8,192 GBytes of memory, 104 Tflops
- Features: new architecture designed for large core-count jobs, highly scalable
- Multilevel parallelism - the direction of HPC
- Room to grow; a forward-looking machine

Colorado State: Cray XT6m
- Operational January 2011
- Peak performance 12 teraflops (update: 2,016 cores, 20 Tflops)
- Dimensions: 7.5 ft (h) x 2.0 ft (w) x 4.5 ft (d)
- Compute partition: 52 compute nodes, 2 processors per node; 104 AMD Magny-Cours 64-bit 1.9 GHz processors total
- 12 cores per processor; 1,248 total cores
- 32 GB DDR3 ECC SDRAM per node; 1.664 TB total RAM

NCAR's Computational and Information Systems Laboratory (CISL) invites NSF-supported university researchers in the atmospheric, oceanic, and closely related sciences to submit large allocation requests by September 17, 2012. University researchers supported by an NSF award can request up to 30,000 GAUs as a Small Allocation request. Up to 10,000 GAUs are available to graduate students and post-docs; no NSF award is required. https://www2.cisl.ucar.edu/docs/allocations#university

NCAR & CISL systems
- Yellowstone: a 1.5-petaflops high-performance computing system with 72,288 processor cores and 144 terabytes of memory. Production computing operations will begin in the summer of 2012.
- Bluefire: NCAR's 77-teraflops IBM Power6 system used by the Climate Simulation Lab (CSL) and Community Computing Facilities.
- Janus: a Dell Linux cluster housed on the CU-Boulder campus with a high-speed networking connection to NCAR's computing and data storage systems.
- Lynx: a Cray XT5m system deployed as a testing platform and available to NCAR users.
- Mirage and Storm: two data analysis and visualization clusters operated by CISL, with software packages including NCL, Vapor, Matlab, and IDL.
- GLADE: the central GLADE file system significantly expands the disk space available to CISL users and allows users to access their data from both HPC and DAV systems.
- HPSS: CISL has migrated its archival storage to the High-Performance Storage System (HPSS) environment, which currently stores more than 12 PB of data in support of CISL computing facilities and NCAR research activities.

NCAR Resources at the NCAR-Wyoming Supercomputing Center (NWSC)
- Centralized Filesystems and Data Storage (GLADE): >90 GB/sec aggregate I/O bandwidth, GPFS filesystems; 10.9 PetaBytes initially -> 16.4 PetaBytes in 1Q2014
- High Performance Computing (Yellowstone): IBM iDataPlex cluster with Intel Xeon E5-2670 (codenamed Sandy Bridge EP) processors with Advanced Vector Extensions (AVX); 1.50 PetaFLOPs, 28.9 Bluefire-equivalents, 4,518 nodes, 72,288 cores, 145 TeraBytes total memory; Mellanox FDR InfiniBand full fat-tree interconnect
- Data Analysis and Visualization (Geyser & Caldera):
  - Large-memory system with Intel Westmere-EX processors: 16 nodes, 640 Westmere-EX cores, 16 TeraBytes of memory, 16 NVIDIA Quadro 6000 GPUs
  - GPU computation/visualization system with Intel Sandy Bridge EP processors with AVX: 16 nodes, 256 E5-2670 cores, 1 TeraByte of memory, 32 NVIDIA M2070Q GPUs
  - Knights Corner system with Intel Sandy Bridge EP processors with AVX: 16 Knights Corner nodes, 256 E5-2670 cores, >1,600 KC cores, 1 TB of memory; early 2013 delivery
- NCAR HPSS Data Archive: 2 SL8500 tape libraries (20k cartridge slots) at NWSC, >100 PetaByte capacity (with 5 TeraByte cartridges, uncompressed); 2 SL8500 tape libraries (15k slots) at Mesa Lab (current 16 PetaByte archive)

Yellowstone Compute
- 72,288 processor cores: 2.6-GHz Intel Sandy Bridge EP with Advanced Vector Extensions (AVX), 8 flops per clock
- 4,518 nodes: IBM dx360 M4, dual socket, 8 cores per socket
- 144.58 TB total system memory: 2 GB/core, 32 GB/node, DDR3-1600
- FDR Mellanox InfiniBand interconnect: full fat tree, single plane
- Bandwidth 13.6 GB/s bidirectional per node; latency 2.5 µs
- Peak bidirectional bisection bandwidth: 31.7 TB/s
- 1.504 petaflops peak; 1.20 petaflops estimated HPL (see the arithmetic check below)
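
The peak figure is simply cores times clock rate times flops per clock:

$$72{,}288\ \text{cores} \times 2.6\times10^{9}\ \tfrac{\text{cycles}}{\text{s}} \times 8\ \tfrac{\text{flops}}{\text{cycle}} \approx 1.504\times10^{15}\ \text{flops} = 1.504\ \text{Pflops}$$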

XSEDE: Extreme Science and Engineering Discovery Environment
- https://www.xsede.org/
- Mostly the same people as TeraGrid, mostly the same machines

XSEDE Machines: user guides at https://www.xsede.org/user-guides

Future Directions in HPC
Four important concepts that will affect math software (Jack Dongarra):
- Effective use of many-core
- Exploiting mixed precision in our numerical computations (see the sketch below)
- Self-adapting / auto-tuning of software
- Fault-tolerant algorithms
Barriers to progress are increasingly on the software side. Hardware has a half-life measured in years, while software has a half-life measured in decades. The high-performance ecosystem is out of balance: hardware, software, OS, compilers, algorithms, applications.
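
To make the mixed-precision idea concrete, here is a minimal sketch (mine, not Dongarra's or the talk's) of classic iterative refinement in C: the approximate solve runs in single precision, while residuals and the accumulated solution stay in double precision. The Jacobi inner solver, the test matrix, and all sizes are invented purely for illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 64

/* Hypothetical inner solver: a few Jacobi sweeps in single precision,
   giving a rough solution of A*dx = r. */
static void jacobi_single(float A[N][N], const float r[N], float dx[N])
{
    float tmp[N];
    for (int i = 0; i < N; i++) dx[i] = 0.0f;
    for (int sweep = 0; sweep < 60; sweep++) {
        for (int i = 0; i < N; i++) {
            float s = r[i];
            for (int j = 0; j < N; j++)
                if (j != i) s -= A[i][j] * dx[j];
            tmp[i] = s / A[i][i];
        }
        for (int i = 0; i < N; i++) dx[i] = tmp[i];
    }
}

int main(void)
{
    static double A[N][N], x[N], b[N];
    static float  A32[N][N], r32[N], dx32[N];

    /* Build a diagonally dominant test system with solution x = (1,...,1). */
    srand(1234);
    for (int i = 0; i < N; i++) {
        double rowsum = 0.0;
        for (int j = 0; j < N; j++) {
            A[i][j] = (i == j) ? 0.0 : rand() / (double)RAND_MAX;
            rowsum += fabs(A[i][j]);
        }
        A[i][i] = rowsum + 10.0;                /* guarantee diagonal dominance */
        x[i] = 0.0;
        for (int j = 0; j < N; j++) A32[i][j] = (float)A[i][j];
    }
    for (int i = 0; i < N; i++) {
        b[i] = 0.0;
        for (int j = 0; j < N; j++) b[i] += A[i][j];   /* b = A * ones */
    }

    /* Iterative refinement: cheap single-precision correction,
       double-precision residual and accumulation. */
    for (int iter = 0; iter < 12; iter++) {
        double rnorm = 0.0;
        for (int i = 0; i < N; i++) {
            double ax = 0.0;
            for (int j = 0; j < N; j++) ax += A[i][j] * x[j];
            double r = b[i] - ax;
            rnorm += r * r;
            r32[i] = (float)r;
        }
        printf("iter %d  ||r|| = %.3e\n", iter, sqrt(rnorm));
        if (sqrt(rnorm) < 1e-12) break;
        jacobi_single(A32, r32, dx32);
        for (int i = 0; i < N; i++) x[i] += (double)dx32[i];
    }
    return 0;
}
```

Since single precision runs substantially faster than double precision on most accelerators of this era, schemes like this can recover double-precision accuracy at close to single-precision speed.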

Top500 Benchmarks, Spring 2013

Trends
- Hardware: large number of cores, less memory per core, more flops/watt, better interconnect
- Software: hybrid programming, directives-based programming

GPU
- GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications.
- GPUs do real computation; vendors have taken GPU systems and repackaged them to do computation.
- Vendors: IBM, AMD, NVIDIA, Intel
- (NVIDIA Tesla M2090 GPU shown)

Not a completely new concept
- Think coprocessor: the main processor passes off some work to the coprocessor
- Remember the 8087?
- Same issues: programs must be written to take advantage, and data must be moved to/from the coprocessor

Programming (Bottom Level)
- Program is written in two parts: CPU and GPU
- Computation starts on the CPU; data is prepared on the CPU
- Data and program (subroutine) are sent to the GPU
- The subroutine runs on the GPU as a thread
- Data is sent back to the CPU

Issues
- Complexity: separate code for the GPU; easy to write, tough to get to run well
- Bottleneck between CPU and GPU
- Mixed precision
- Efficiency on the GPU: small amount of fast memory; massive number of threads must be managed

GPUs
- Many more cores
- Do not support a normal process; expected to run multiple threads per core
- Very small fast memory
- MUCH less memory per core

Issues (and how they are improving)
- Complexity: directives-based programming similar to OpenMP (see the sketch below); libraries
- Bottleneck between CPU and GPU: getting better
- Mixed precision: (some) newer GPUs have a better ratio of double- to single-precision performance
- Efficiency on the GPU: more memory and a flatter hierarchy; better thread management
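
As a rough illustration of the directive-based style mentioned above (not taken from the talk), the loop below is offloaded to a GPU with a single OpenACC pragma; the copy clauses describe the CPU-GPU data traffic explicitly. This assumes an OpenACC-capable compiler (for example PGI with -acc); other compilers simply ignore the pragma and run the loop on the CPU.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    float *a = malloc(n * sizeof(float));
    float *b = malloc(n * sizeof(float));
    float *c = malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

    /* One directive: copy a and b to the GPU, run the loop there in
       parallel, and copy c back.  The "two-part program" (CPU driver +
       GPU kernel) is generated by the compiler. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    free(a); free(b); free(c);
    return 0;
}
```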

CSM's old GPU node (2009)
- Number of Tesla GPUs: 4
- Streaming processor cores: 960 (240 per GPU)
- Frequency of processor cores: 1.296 to 1.44 GHz
- Single-precision floating point performance (peak): 3.73 to 4.14 Tflops
- Double-precision floating point performance (peak): 311 to 345 Gflops
- Floating point precision: IEEE 754 single & double
- Total dedicated memory: 16 GB
- Memory interface: 512-bit
- Memory bandwidth: 408 GB/sec
- Max power consumption: 800 W
- System interface: PCIe x16 or x8
- Software development tools: C-based CUDA Toolkit

Today's NVIDIA offerings

Intel Many Integrated Core (MIC)
- What? Many (>50) cores on a chip; each core is an x86-type processor
- Why? Massive parallelism; (more or less) the same instruction set as other x86 processors
- When? Knights Corner is a prerelease product, a PCI card; available very soon as the Xeon Phi, also a PCI card
- http://openlab.web.cern.ch/publications/presentations?page=1

Intel MIC differences
- x86 instruction set; can, in theory, run a full OS on the card
- Should most likely run threads (OpenMP)
- Uses the same compilers as normal Intel processors
- Codes optimized for current-generation processors will run well on MIC: threading and vectorization (see the sketch below)
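
A minimal sketch (not from the slides) of the kind of threaded, vectorized loop that the same compilers target on both a regular Xeon and a MIC card: OpenMP provides the threads, and the simd clause (OpenMP 4.0; Intel compilers also accept their own #pragma simd) requests vectorization.

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    int n = 1 << 24;
    double *x = malloc(n * sizeof(double));
    double *y = malloc(n * sizeof(double));
    double a = 2.5;

    for (int i = 0; i < n; i++) { x[i] = i; y[i] = 1.0; }

    /* Threads across cores, SIMD within each core: the same source runs
       on a host Xeon or natively on a Xeon Phi, just with different
       thread counts and vector widths. */
    #pragma omp parallel for simd
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    printf("max OpenMP threads: %d, y[10] = %f\n", omp_get_max_threads(), y[10]);
    free(x); free(y);
    return 0;
}
```

Build with OpenMP enabled (for example -fopenmp with GCC); the point is that the optimization work (threading plus vectorization) is the same on both targets.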

Summary
- Core count is going up
- Memory per core is going down
- Threading will become more important
- Hybrid will be critical

The next few slides are taken from Dr. Jay Boisseau, Director of TACC.

MIC Architecture
- Many cores on the die, each with L1 and L2 cache
- Bidirectional ring network; memory and PCIe connection
- (MIC (KNF) architecture block diagram shown)
Knights Ferry SDP
- Up to 32 cores
- 1-2 GB of GDDR5 RAM
- 512-bit wide SIMD registers
- L1/L2 caches
- Multiple threads (up to 4) per core
- Slow operation in double precision
Knights Corner (first product)
- 50+ cores
- Increased amount of RAM
- Details are under NDA
- Double precision at half the speed of single precision (the canonical ratio)
- 22 nm technology

What we at TACC like about MIC (and we think that you will like this, too)
- Intel's MIC is based on x86 technology: x86 cores with caches and cache coherency, SIMD instruction set
- Programming for MIC is similar to programming for CPUs
  - Familiar languages: C/C++ and Fortran
  - Familiar parallel programming models: OpenMP & MPI
  - MPI on the host and on the coprocessor
  - Any code can run on MIC, not just kernels
- Optimizing for MIC is similar to optimizing for CPUs: make use of existing knowledge!

Differences: Coprocessor (MIC) vs. Accelerator (GPU)
- Architecture: x86 vs. streaming processors; coherent caches vs. shared memory and caches
- HPC programming model: extension to C++/C/Fortran vs. CUDA/OpenCL; OpenCL support
- Threading/MPI: OpenMP and multithreading vs. threads in hardware; MPI on host and/or MIC vs. MPI on host only
- Programming details: offloaded regions vs. kernels
- Support for any code (serial, scripting, etc.): yes vs. no
- Native mode: any code may be offloaded as a whole to the coprocessor

Programming Models
- Ready to use on day one!
- TBB (Threading Building Blocks) will be available to C++ programmers
- MKL will be available, with automatic offloading by the compiler for some MKL features
- Cilk Plus: useful for task-parallel programming (an add-on to OpenMP); may become available for Fortran users as well
- OpenMP: TACC expects that OpenMP will be the most interesting programming model for our HPC users (offload sketch below)
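
To round out the offload model, here is a hedged sketch (not from the slides) of what an offloaded region looks like with Intel's compiler extensions for the Phi: a pragma marks the region to run on the card, in/out clauses describe the data movement, and an ordinary OpenMP loop inside uses the card's threads. The #pragma offload syntax is Intel-specific (icc); other compilers ignore it and run the code on the host.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void)
{
    double a = 2.5;
    double *x = malloc(N * sizeof(double));
    double *y = malloc(N * sizeof(double));

    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    /* Ship x and y to MIC card 0, run the OpenMP loop there using the
       card's threads, and copy y back when the region ends. */
    #pragma offload target(mic:0) in(x:length(N)) inout(y:length(N))
    {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }

    printf("y[10] = %f\n", y[10]);
    free(x); free(y);
    return 0;
}
```

The MKL automatic-offload path mentioned above needs no source changes at all; with Intel MKL it is typically switched on through the environment (MKL_MIC_ENABLE=1).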

IBM Blue Gene Q
- New machine from IBM, an evolution from BG/L and BG/P
- Many cores per node, with less memory per core (but more than L or P)
- Very energy efficient
- 4 of the top 8 on the Top 500 list

BG/Q Rack
- 208 Tflops, 62.5 kW per rack
- 1 rack = 1,024 nodes = 16,384 cores
- 1 node = 16+1 cores, 16 GBytes of memory (1 GByte/core)
- Footprint < 31 ft²

BG/Q Proprietary Parts
- Processors: designed for HPC, 4 threads/core, advanced speculative operation, transactional memory
- Networks: 5D torus, collective and barrier networks, floating-point addition in the network
- Special I/O nodes

5D Torus. What the...?