ECE 574 Cluster Computing Lecture 2

Vince Weaver
http://web.eece.maine.edu/~vweaver
vincent.weaver@maine.edu
24 January 2019

Announcements
Put your name on HW#1 before turning it in!

Top500 List, November 2018

 #  Name                   Country    Arch/Proc      Cores       Max/Peak PFLOPS  Accel       Power (kW)
 1  Summit (IBM)           US/ORNL    Power9          2,397,824  143/200          NVD Volta    9,783
 2  Sierra (IBM)           US/LLNL    Power9          1,572,480   94/125          NVD Volta    7,438
 3  TaihuLight             China      Sunway         10,649,600   93/125          ?           15,371
 4  Tianhe-2A              China      x86/ivb         4,981,760   61/101          MatrixDSP   18,482
 5  Piz Daint (Cray)       Swiss      x86/snb           387,872   21/27           NVD Tesla    2,384
 6  Trinity (Cray)         US/LANL    x86/hsw           979,072   20/158          Xeon Phi     7,578
 7  ABCI (Fujitsu)         Japan      x86/skl           391,680   20/33           NVD Tesla    1,649
 8  SuperMUC-NG (Lenovo)   Germany    x86/skl           305,856   19/26           ?            ?
 9  Titan (Cray)           USA/ORNL   x86/opteron       560,640   17/27           NVD K20      8,209
10  Sequoia (IBM)          USA/LLNL   Power/BGQ       1,572,864   17/20           ?            7,890
11  Lassen (IBM)           USA/LLNL   Power9            248,976   15/19           NVD Tesla    ?
12  Cori (Cray)            USA/LBNL   x86/hsw           622,336   14/27           Xeon Phi     3,939
13  Nurion (Cray)          Korea      x86/??            570,020   14/25           Xeon Phi     ?
14  Oakforest (Fujitsu)    Japan      x86/??            556,104   13/24           Xeon Phi     2,719
15  HPC4 (HPE)             Italy      x86/snb           253,600   12/18           NVD Tesla    1,320
16  Tera-1000-2 (Bull)     France     x86/???           561,408   12/23           Xeon Phi     3,178
17  Stampede2 (Dell)       US         x86/snb           367,024   12/28           Xeon Phi    18,309
18  K computer (Fujitsu)   Japan      SPARC VIIIfx      705,024   10/11           ?           12,660
19  Marconi (Lenovo)       Italy      x86/snb           348,000   10/18           Xeon Phi    18,816
20  Taiwania-2 (Quanta)    Taiwan     x86/skl           170,352    9/15           NVD Tesla      798
21  Mira (IBM)             US/ANL     Power/BGQ         786,432    8/10           ?            3,945
22  Tsubame3.0 (HPE)       Japan      x86/snb           135,828    8/12           NVD Tesla      792
23  UK Meteor (Cray)       UK         x86/ivb           241,920    7/8            ?            8,128
24  Theta (Cray)           US/ANL     x86/???           280,320    7/11           Xeon Phi    11,661
25  MareNostrum (Lenovo)   Spain      x86/skl           153,216    6/10           Xeon Phi     1,632

Top500 List Notes
- Can watch a video presentation on it here?
- Left off my summary: RAM? (#1 has 3 PB) Interconnect?
- Power: does this include cooling or not? The cost of power over the lifetime of use is often higher than the cost to build the machine.
- Power comparison: a small town? 1 MW is around 1,000 homes (this varies).
- How long does it take to run LINPACK? How much money does it cost to run LINPACK?
- Lots of turnover since the last time I taught the class.
- Operating system?
- Cost to run the computer is more than the cost to build it?
- Tianhe-2 was Xeon Phi, but the US banned Intel from exporting any more to China, so it was upgraded and now uses its own custom DSP boards.
- Need to be around 10 PFLOPS to be near the top these days? 100k cores at least?
- First ARM system: Cavium ThunderX2 in Astra (US/LANL) at #204.

What goes into a top supercomputer?
- Commodity or custom?
- Architecture: x86? SPARC? Power? ARM (embedded vs. high-speed)?
- Memory
- Storage: How much? The Large Hadron Collider generates about one petabyte of data every day. Shared? If each node wants the same data, do you need to replicate it, have a network filesystem, copy it around with jobs, etc.? Cluster filesystems?
- Reliability: How long can it stay up without crashing? Can you checkpoint/restart jobs? Sequoia MTBF 1 day. Blue Waters about 2 node failures per day. Titan MTBF less than 1 day.
- Power / Cooling: Big river nearby?
- Accelerator cards / heterogeneous systems
- Network: How fast? Latency? Interconnect topology (torus, cube, hypercube, etc.)? Ethernet? InfiniBand? Custom?
- Operating System: Linux? Custom? If just doing floating point, do you need the overhead of an OS?
- Job submission software, authentication
- Software: how to program it? Too hard to program can doom you. There was a lot of interest in the Cell processor: great performance if programmed well, but hard to do.
- Tools: software that can help you find performance problems

Other stuff
- Rmax vs. Rpeak: Rmax is the maximum measured performance, Rpeak is the theoretical best.
- HPL (LINPACK): embarrassingly parallel linear algebra. Solves a (random) dense linear system in double precision (64-bit) arithmetic.
- HPCG (High Performance Conjugate Gradients) benchmark: more realistic? Does more memory accesses, is more I/O bound.
- #1 on the list is Summit: 3 PFLOPS with HPCG whereas 143 PFLOPS with HPL (see the sketch below).
- Some things can move around: the K computer is 18th in HPL but 3rd with HPCG.
- Green500 (ranks machines by performance per watt).
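As a rough numerical illustration of those ratios, here is a minimal C sketch; it only uses the Summit figures quoted above (143/200 PFLOPS HPL, roughly 3 PFLOPS HPCG), rounded to whole PFLOPS:

```c
/* Minimal sketch: achieved fraction of peak (Rmax/Rpeak) and the
 * HPCG-to-HPL ratio, using the Summit numbers quoted above. */
#include <stdio.h>

int main(void)
{
    double rpeak = 200.0;  /* theoretical peak, PFLOPS       */
    double rmax  = 143.0;  /* measured HPL (LINPACK), PFLOPS */
    double hpcg  =   3.0;  /* measured HPCG, PFLOPS          */

    printf("HPL:  %.1f%% of peak\n", 100.0 * rmax / rpeak);
    printf("HPCG: %.1f%% of peak, %.1f%% of the HPL result\n",
           100.0 * hpcg / rpeak, 100.0 * hpcg / rmax);

    return 0;
}
```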

Historical Note
From the November 2002 list, entry #332:
- Location: Orono, ME
- Proc Arch: x86
- Proc Type: Pentium III, 1 GHz
- Total cores: 416
- Rmax/Rpeak: 225/416 GFLOPS
- Power: ???
- Accelerators: None

Introduction to Performance Analysis

What is Performance?
- Getting results as quickly as possible?
- Getting correct results as quickly as possible?
- What about budget?
- What about development time?
- What about hardware usage?
- What about power consumption?

Motivation for HPC Optimization
HPC environments are expensive:
- Procurement costs: $40 million
- Operational costs: $5 million/year
- Electricity costs: 1 MW for a year is roughly $1 million
- Air conditioning costs: ??

Know Your Limitations
- CPU constrained
- Memory constrained (memory wall)
- I/O constrained
- Thermally constrained
- Energy constrained

Performance Optimization Cycle
[Diagram: develop code until it is functionally complete/correct, then iterate Measure -> Analyze -> Modify/Tune until the code is functionally complete, correct, and optimized, then move to usage/production.]

Wisdom from Knuth
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified."
-- Donald Knuth

Amdahl's Law
[Figure: bars comparing the original runtime with the runtime after speeding up the "blue" portion of the program by 100x and after speeding up the "red" portion by 2x; the horizontal axis is time.]
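The formula is not written out on the slide; the standard statement of Amdahl's Law is that if a fraction f of the runtime can be sped up by a factor s, the overall speedup is 1 / ((1 - f) + f / s). Below is a minimal C sketch of that formula; the 80%/20% blue/red split is an assumed example, since the actual proportions in the figure are not preserved in this transcription:

```c
/* Minimal sketch of Amdahl's Law: overall speedup when only a fraction f
 * of the original runtime is accelerated by a factor s.
 * The 0.8 / 0.2 split below is an assumed example, not from the slide. */
#include <stdio.h>

static double amdahl(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    double f_blue = 0.8;  /* assumed: "blue" part is 80% of the runtime */
    double f_red  = 0.2;  /* assumed: "red" part is the other 20%       */

    printf("Speed up blue 100x: overall %.2fx\n", amdahl(f_blue, 100.0));
    printf("Speed up red    2x: overall %.2fx\n", amdahl(f_red,    2.0));

    return 0;
}
```

Even a 100x speedup of part of the program is limited by the part left untouched.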

Speedup
Speedup is the improvement in latency (time to run):
S = t_old / t_new
So if the original run took 10 s and the new run takes 5 s, the speedup is 2.

Scalability
How a workload behaves as more processors are added.
Parallel efficiency: E_p = S_p / p = T_s / (p * T_p)
where p is the number of processes (threads), T_s is the execution time of the serial code, and T_p is the execution time with p processes.
Linear scaling (ideal): S_p = p.
Super-linear scaling is possible but unusual.
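A minimal C sketch of these definitions; the timings are made-up placeholders, not measurements:

```c
/* Minimal sketch: speedup S_p = T_s / T_p and parallel efficiency
 * E_p = S_p / p.  The timings below are made-up placeholders. */
#include <stdio.h>

int main(void)
{
    double t_serial = 100.0;  /* T_s: serial runtime in seconds (assumed) */
    int    p        = 8;      /* number of processes (threads)            */
    double t_p      = 16.0;   /* T_p: runtime with p processes (assumed)  */

    double speedup    = t_serial / t_p;  /* S_p                             */
    double efficiency = speedup / p;     /* E_p = S_p / p = T_s / (p * T_p) */

    printf("p=%d  S_p=%.2f  E_p=%.0f%%  (linear/ideal would be S_p=%d)\n",
           p, speedup, 100.0 * efficiency, p);

    return 0;
}
```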

Strong vs. Weak Scaling
- Strong scaling: for a fixed problem size, how does adding more processors help?
- Weak scaling: how does adding processors help with the same per-processor workload?

Strong Scaling
- Have a problem of a certain size and want it to get done faster.
- Ideally, with problem size N, the run on 2 cores is twice as fast as the run on 1 core (linear speedup).
- Often processor bound; adding more processing helps, as communication doesn't dominate.
- Hard to achieve for large numbers of nodes, as in many algorithms the communication costs get larger the more nodes are involved.
- Amdahl's Law limits things, as more cores don't help the serial code.
- Strong scaling efficiency: t1 / (N * tN) * 100% (see the sketch after this list)
- Improve by throwing CPUs at the problem.
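A minimal C sketch of the strong-scaling efficiency formula; the single-node time and the per-node-count times are assumed placeholders, not measured data:

```c
/* Minimal sketch: strong-scaling efficiency  t1 / (N * tN) * 100%
 * for a fixed problem size run on increasing node counts.
 * The timings are assumed placeholders, not measured data. */
#include <stdio.h>

int main(void)
{
    double t1      = 640.0;                          /* runtime on 1 node, seconds (assumed) */
    int    nodes[] = { 2, 4, 8, 16 };
    double tN[]    = { 330.0, 175.0, 100.0, 65.0 };  /* assumed runtimes on N nodes          */

    for (int i = 0; i < 4; i++) {
        double eff = t1 / (nodes[i] * tN[i]) * 100.0;
        printf("N=%2d nodes: t=%6.1f s  strong-scaling efficiency %.0f%%\n",
               nodes[i], tN[i], eff);
    }

    return 0;
}
```

Note how the efficiency drops off as nodes are added, which is the behavior described above.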

Weak Scaling
- Have a problem and want to increase the problem size without slowing down.
- Ideally, if a problem of size N runs in a given time on 1 core, a problem of size 2N runs just as fast on 2 cores.
- Often memory or communication bound.
- Gustafson's Law (rough paraphrase): no matter how much you parallelize your code, there will be serial sections that just can't be made parallel.
- Weak scaling efficiency: (t1 / tN) * 100% (see the sketch after this list)
- Improve by adding memory, or improving communication?
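A minimal C sketch of the weak-scaling efficiency formula; the per-node workload is held fixed as nodes are added, and the timings are again assumed placeholders:

```c
/* Minimal sketch: weak-scaling efficiency  (t1 / tN) * 100%
 * where the per-node workload is held constant as nodes are added,
 * so ideally tN stays equal to t1.  Timings are assumed placeholders. */
#include <stdio.h>

int main(void)
{
    double t1      = 120.0;                           /* 1 node, base problem size, seconds (assumed) */
    int    nodes[] = { 2, 4, 8, 16 };
    double tN[]    = { 122.0, 126.0, 133.0, 150.0 };  /* assumed runtimes, problem N times larger     */

    for (int i = 0; i < 4; i++) {
        double eff = (t1 / tN[i]) * 100.0;
        printf("N=%2d nodes (problem %2dx larger): t=%6.1f s  weak-scaling efficiency %.0f%%\n",
               nodes[i], nodes[i], tN[i], eff);
    }

    return 0;
}
```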

Scaling Example
LINPACK (HPL) on a Raspberry Pi cluster. What kind of scaling is shown here?
[Figure: MFLOPS (0 to 5000) versus number of nodes (1, 2, 4, 8, 16, 32), with curves for problem sizes N=5000, 8000, 10,000, 19,000, 25,000, and 35,000 and a linear-scaling reference line.]

Weak scaling. To get linear speedup you need to increase the problem size. If it were strong scaling, the individual colored lines would keep increasing rather than dropping off.