ECE 574 Cluster Computing Lecture 18

ECE 574 Cluster Computing Lecture 18 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 2 April 2019

Announcements
HW#8 was posted.

Project Topic Notes
I responded to everyone's e-mail. If your group didn't get one, let me know.
We have a wide variety of machines to run on.
If you are interested in power measurement, let me know.

CUDA Notes
Nicely, we can use only blocks and threads for our results, even on the biggest files.
In the past we had to split the work three ways: grid/block/thread.
In the past there was a limit of 64k blocks (65,535 per grid dimension) with compute capability 2, but on compute capability 3 we can have up to 2 billion (2^31 - 1 in the x dimension).
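The block-count limits above can be checked with a little arithmetic. The following is a minimal sketch, not part of the original slides: the helper names and the 100-megapixel workload are hypothetical, and it assumes a one-dimensional grid with one thread per element.

```python
# Maximum number of blocks along the grid's x dimension.
MAX_GRID_X_CC2 = 65535        # compute capability 2.x (the "64k" limit)
MAX_GRID_X_CC3 = 2**31 - 1    # compute capability 3.0 and later ("2 billion")


def blocks_needed(n_elements, threads_per_block):
    """Blocks required so that every element gets its own thread."""
    # Integer ceiling division avoids floating-point rounding.
    return (n_elements + threads_per_block - 1) // threads_per_block


def fits_in_1d_grid(n_elements, threads_per_block, max_grid_x):
    """True if a 1-D grid of blocks is enough for the whole problem."""
    return blocks_needed(n_elements, threads_per_block) <= max_grid_x


# Hypothetical workload: a 100-megapixel image, one thread per pixel.
n_pixels = 100_000_000
threads = 256

print("blocks needed:", blocks_needed(n_pixels, threads))
print("fits on compute capability 2:",
      fits_in_1d_grid(n_pixels, threads, MAX_GRID_X_CC2))
print("fits on compute capability 3:",
      fits_in_1d_grid(n_pixels, threads, MAX_GRID_X_CC3))
```

With these assumed numbers the block count exceeds the older limit, which is why a further split across grid dimensions used to be necessary.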

Other Accelerator Options
Xeon Phi: came out of the Larrabee design (an effort to build a GPU out of x86 cores). A large array of x86 cores (P5-class on older models, Atom on newer ones) on a PCIe card. Sort of like a plug-in mini cluster. Runs Linux; you can ssh into the boards over PCIe. Benefit: you can use existing x86 programming tools and knowledge.
FPGA: can be used as an accelerator. Only worthwhile if you don't plan to reprogram it much, as there is a time delay in reprogramming. Also requires special compiler support (OpenMP?).
ASIC: can have hard-coded custom hardware for acceleration. Expensive. Found in Bitcoin mining?
DSPs: can be used as accelerators.

OpenACC
Sort of like OpenMP, but can offload to GPUs as well as CPUs.
From Cray, CAPS, Nvidia, and PGI.
Lots of pragmas.

Other Libraries
Metal: from Apple, their replacement for OpenCL. C++-like, sort of a mix of OpenCL and OpenGL.
Defunct low-level GPU libraries: Glide (3dfx), Mantle (AMD).
Other low-level GPU libraries: GNM (PlayStation 4), NVN (Nvidia/Switch).
WebGPU: GL/GPGPU for JavaScript (currently under development).
WebCL: OpenCL JavaScript bindings.
OpenVG: 2D vector graphics acceleration.

Vulkan
A more modern OpenGL.
Supposedly OpenCL is merging into Vulkan?
Based on AMD Mantle.

OpenCL
CUDA is only for NVIDIA.
What if you have an Intel or AMD (ATI) chip? Or ARM Mali (sadly not the Raspberry Pi)?
OpenCL is sort of like CUDA, but cross-platform.
Not only for GPUs: it can also target regular CPUs, DSPs, FPGAs, etc.
The vendor provides a driver.
From Khronos (the OpenGL + Vulkan people).
Also runs on Windows, OSX, and Linux.

Cluster Computing Power
You can spend a whole class (e.g. ECE 571) discussing where power goes in a modern computing system.

Cluster Computing Power
Why is low-power supercomputing important?

Green500
The Green500 list.
A push for more accurate power reporting in the Top500 list.
Top 5, Nov 2016 (Top500 rank in parentheses):
1. NVIDIA DGX-1, Xeon/Tesla, 350kW, 9.462 GFLOPS/W
2. (#8) Swiss Piz Daint, Cray XC50, Xeon/Tesla, 1.3MW, 7.453 GFLOPS/W
3. Riken ZettaScaler, Xeon/PEZY-SCnp (2 ARM cores/1024 RISC cores, 1.5 TFLOPS), 150kW, 6.7 GFLOPS/W
4. (#1) Sunway TaihuLight, Sunway, 15MW, 6.0 GFLOPS/W
5. Fujitsu PRIMERGY, Xeon Phi, 77kW, 5.8 GFLOPS/W
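Since the Green500 metric is performance divided by power, multiplying the two listed figures gives the performance each entry implies. The following is a rough sketch using only the rounded numbers from the list above; the function name is hypothetical and the results are back-of-the-envelope values, not official benchmark scores.

```python
def implied_gflops(power_kw, gflops_per_watt):
    """Performance implied by a power draw (kW) and an efficiency (GFLOPS/W)."""
    return power_kw * 1000.0 * gflops_per_watt


# (name, power in kW, efficiency in GFLOPS/W), as listed on the slide.
systems = [
    ("NVIDIA DGX-1", 350, 9.462),
    ("Piz Daint", 1300, 7.453),
    ("Riken ZettaScaler", 150, 6.7),
    ("Sunway TaihuLight", 15000, 6.0),
    ("Fujitsu PRIMERGY", 77, 5.8),
]

for name, kw, efficiency in systems:
    pflops = implied_gflops(kw, efficiency) / 1e6  # GFLOPS -> PFLOPS
    print(f"{name}: about {pflops:.2f} PFLOPS at {kw} kW")
```

The comparison shows why the two lists differ: the most efficient machine is far from the fastest, and the fastest draws tens of times more power.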

Supercomputer Power
Cooling
DVFS
Power-capping
Up to 12% is spent by the interconnect.
Pi cluster: of 90W total, 20W or so is the Ethernet switch.

Measuring Power and Energy
A sense resistor or Hall effect sensor gives you the current.
A sense resistor is a small resistor; measure the voltage drop across it.
Ohm's Law: V=IR, so I=V/R.
Voltage drops are often small (why?), so you may need to amplify them with an instrumentation amplifier.
Then you need to measure with an A/D converter.
P=IV, and you know the voltage.
How do you get energy from power?
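The chain above (voltage drop to current to power) and the closing question can be worked through numerically: energy is power integrated over time. The following is a minimal sketch with made-up component values (a 0.1 ohm sense resistor on a 5V rail); the function names are hypothetical, and the small voltage lost across the sense resistor is ignored.

```python
def current_from_sense(v_drop, r_sense):
    """Ohm's Law: I = V / R, using the drop measured across the sense resistor."""
    return v_drop / r_sense


def power_watts(v_supply, current):
    """P = I * V for the supply rail."""
    return v_supply * current


def energy_joules(times_s, powers_w):
    """Integrate power samples over time with the trapezoidal rule."""
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (powers_w[i] + powers_w[i - 1]) * dt
    return total


# Hypothetical measurement: 50 mV drop across 0.1 ohm on a 5 V rail.
amps = current_from_sense(0.05, 0.1)
watts = power_watts(5.0, amps)
print(f"current = {amps:.3f} A, power = {watts:.3f} W")

# Hypothetical samples taken once per second.
times = [0.0, 1.0, 2.0]
samples = [2.0, 4.0, 2.0]
print(f"energy = {energy_joules(times, samples):.3f} J")
```

In a real setup the samples would come from the A/D converter, and a faster sampling rate gives a better estimate of the integral.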

Why?
New, massive HPC machines use impressive amounts of power.
When you have 100k+ cores, saving a few Joules per core quickly adds up.
To improve power/energy draw, you need some way of measuring it.

Energy/Power Measurement is Already Possible
Three common ways of doing this:
1. Hand-instrumenting a system by tapping all power inputs to the CPU, memory, disk, etc., and using a data logger.
2. Using a pass-through power meter that you plug your server into. Often these will log over USB.
3. Estimating power/energy with a software model based on system behavior.

Where does the power go in a system?
CPU
DRAM
GPU
Disk
Network
Cooling/Fan
Power Supply

Measuring system-wide
Can use a Kill A Watt, WattsUpPro, or similar.

Measuring hardware
Really hard.
DRAM, PCI, USB, fans: you can put a sense resistor in line with the power supply and measure current and voltage to calculate power.
The CPU is harder: how do you intercept its power? There is the P4 line (a special 12V power cable from the PSU on recent systems), but it might also power other parts of the system.

Estimating Power
You can construct a model to estimate power.
Inputs are things like performance counters.
CPU power might be related to instructions/cycle, pipeline stalls, FPU instructions retired, etc.
DRAM power might be related to cache misses and bytes/second.
Hopefully you validate the model.
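One simple form such a model can take is a straight-line fit of measured power against a single counter-derived input. The following is a sketch only: the calibration numbers are invented for illustration, the function names are hypothetical, and a real model would use several inputs and be validated against measured power.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_watts(ipc, slope, intercept):
    """Estimate CPU power from instructions per cycle using the fitted line."""
    return intercept + slope * ipc


# Invented calibration data: instructions/cycle vs. measured CPU watts.
ipc_samples = [0.5, 1.0, 1.5, 2.0]
watt_samples = [20.0, 30.0, 40.0, 50.0]

slope, intercept = fit_line(ipc_samples, watt_samples)
print(f"model: watts = {intercept:.1f} + {slope:.1f} * IPC")
print(f"estimate at IPC 1.2: {predict_watts(1.2, slope, intercept):.1f} W")
```

Validation would mean comparing these estimates against a sense-resistor or wall-meter measurement on workloads that were not used for calibration.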

RAPL Power Estimates
RAPL: Running Average Power Limit.
Available on recent Intel CPUs.
The CPU needs to estimate its power usage for power-capping and turbo-boost.
Nicely, it provides the values to userspace.
On most systems (excluding some Haswell models) the value is an estimate from an on-chip power model based on various inputs (temperature, perf counters, etc.).
Very easy to read, using the perf tool.
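Besides perf (for example `perf stat -a -e power/energy-pkg/ sleep 1`), Linux also exposes the RAPL energy counters through the powercap sysfs interface. The following sketch is an assumption-laden illustration, not the lecture's method: it assumes package 0 appears at `/sys/class/powercap/intel-rapl:0`, that the files are readable (recent kernels may require root), and the function names are hypothetical. It converts two readings of the microjoule counter into average watts.

```python
import time

# Assumed location of the package-0 RAPL domain in the Linux powercap interface.
RAPL_DIR = "/sys/class/powercap/intel-rapl:0"


def read_uj(path):
    """Read an integer microjoule value from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def average_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power between two counter readings, allowing for one wraparound."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:
        # The counter wrapped past its maximum range between the readings.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds


def measure_package_watts(interval_s=1.0):
    """Sample the package energy counter twice and return average watts."""
    max_range = read_uj(RAPL_DIR + "/max_energy_range_uj")
    start = read_uj(RAPL_DIR + "/energy_uj")
    time.sleep(interval_s)
    end = read_uj(RAPL_DIR + "/energy_uj")
    return average_watts(start, end, interval_s, max_range)


if __name__ == "__main__":
    try:
        print(f"package power: {measure_package_watts():.2f} W")
    except OSError as err:
        print("RAPL counters not available here:", err)
```

The counter reports accumulated energy, so power is always derived from the difference between two readings divided by the elapsed time.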