ECE 574 Cluster Computing Lecture 16

Similar documents
ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 18

ECE 574 Cluster Computing Lecture 15

ECE 574 Cluster Computing Lecture 4

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Portland State University ECE 588/688. Graphics Processors

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Graphics Processing Unit Architecture (GPU Arch)

Threading Hardware in G80

Windowing System on a 3D Pipeline. February 2005

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

ECE 574 Cluster Computing Lecture 17

GeForce4. John Montrym Henry Moreton

The NVIDIA GeForce 8800 GPU

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Spring 2009 Prof. Hyesoon Kim

ECE 574 Cluster Computing Lecture 18

Spring 2011 Prof. Hyesoon Kim

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

1.2.3 The Graphics Hardware Pipeline

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Real-Time Rendering Architectures

From Brook to CUDA. GPU Technology Conference

Modern Processor Architectures. L25: Modern Compiler Design

From Shader Code to a Teraflop: How Shader Cores Work

EECS 487: Interactive Computer Graphics

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Monday Morning. Graphics Hardware

CONSOLE ARCHITECTURE

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

CS427 Multicore Architecture and Parallel Computing

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Introduction to GPU computing

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPU Memory Model. Adapted from:

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Multi-Processors and GPU

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

Current Trends in Computer Graphics Hardware

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

GPU Architecture. Michael Doggett Department of Computer Science Lund university

ECE 471 Embedded Systems Lecture 2

ECE 574 Cluster Computing Lecture 1

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU Architecture. Samuli Laine NVIDIA Research

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

Motivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University

Mattan Erez. The University of Texas at Austin

Massively Parallel Architectures

Graphics Hardware. Instructor Stephen J. Guy

GPGPU. Peter Laurens 1st-year PhD Student, NSC

The Bifrost GPU architecture and the ARM Mali-G71 GPU

ECE 571 Advanced Microprocessor-Based Design Lecture 21

ECE 574 Cluster Computing Lecture 13

GRAPHICS HARDWARE. Niels Joubert, 4th August 2010, CS147

ECE 571 Advanced Microprocessor-Based Design Lecture 13

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Introduction to CUDA (1 of n*)

Parallel Computing Ideas

PowerVR Hardware. Architecture Overview for Developers

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Introduction to Modern GPU Hardware

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

Lecture 25: Board Notes: Threads and GPUs

BlueGene/L (No. 4 in the Latest Top500 List)

Bifurcation Between CPU and GPU CPUs General purpose, serial GPUs Special purpose, parallel CPUs are becoming more parallel Dual and quad cores, roadm

GPGPU introduction and network applications. PacketShaders, SSLShader

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

GPU Fundamentals Jeff Larkin November 14, 2016

Optimal Shaders Using High-Level Languages

Introduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes

GPU Target Applications

ECE 598 Advanced Operating Systems Lecture 18

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

General Purpose GPU Computing in Partial Wave Analysis

Lecture 13: OpenGL Shading Language (GLSL)

Vertex Shader Design I

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Lecture 1: Gentle Introduction to GPUs

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

GPU for HPC. October 2010

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

Antonio R. Miele Marco D. Santambrogio

What s New with GPGPU?

Performance potential for simulating spin models on GPU

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

ECE 571 Advanced Microprocessor-Based Design Lecture 4

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

Transcription:

ECE 574 Cluster Computing Lecture 16 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 26 March 2019

Announcements HW#7 posted HW#6 and HW#5 returned Don t forget project topics due Thursday! Topics: memory/reliability on Pi? Launch in a balloon? Topics: GPU on Pi (OpenCL) 1

Notes on HW#5 Were supposed to use sections directive for the coarse code Should parallelize your biggest loop, unless it is autocollapsing, parallizing the colors loop of 0..3 won t scale very well This is why some were seeing bigger benefits of combine vs convolve Loop indices don t need to be marked as private, 2

OpenMP assumes they are (so don t change their value outside the for statement) 3

Notes on MPI So many issues were C related. Fortran next time? Got MPI working on haswell-ep with mpich, needed to set MpiDefault=pmi2 in slurm.conf 4

Notes on HW#6 Biggest problem was calculating at different granularity to the gather Other problem was the final tail end of data to calculate in cases where not exact multiple of number or ranks. 5

Pi cluster 1 head node (16GB SD card), 24 sub-nodes. One currently seems to be down (reliability!) Read up on the cluster here: https://www.mdpi.com/2079-9292/5/4/61/htm Added your accounts, same password as haswell-ep (via hashes) Try not to use up too much disk space Also note the SD card is sorta slow, which with the network affects scaling a bit. 6

Use slurm The batch scripts I gave you have a timeout of 5 minutes per job. Last time some people s code went crazy and ran forever and other people s jobs never ran Use sinfo or squeue to see cluster and job stats Use scancel to cancel a job If things going poorly, contact me Did update PAPI on all nodes which should be working 7

MPI and slurm HW #SBATCH --tasks-per-node=4 -N = number of nodes -n = number of tasks, default is one task per node? N=4 tasks-per-node=4, 16 N=4 tasks-per-node=4, sbatch -n 8, 16 (N=nodes, n=tasks) N=4 tasks-per-node=4, sbatch -N 8, 32 8

nothing, sbatch -N 8, 32 nothing, sbatch -n 8, (8, 2 nodes * 4 each) nothing, sbatch -N 8 -n 8 (8, 8 nodes * 1 each) 9

Why use slurm? Can set account to charge Can handle checkpointing Can set constraints (run on machine with gpu, certain proc type) Contiguous allocations CPU freq, power capping Licenses avail (things like Matlab etc) Memory avail 10

Graphics Processing Units Retrospective on old graphics hardware Framebuffer is simple (though annoying pointer match like in sobel or worse). VGA Mode 13h, 0xa0000, 64kB Old video game systems didn t even have that. Why? 1MB for a framebuffer was expensive. Only 64k RAM total. Atari 2600 only had 128B of RAM, total. 40-bit framebuffer. Racing the beam. Also could do sprites or tile based. 11

GPUs 12

Interfaces Originally each vendor had own 3D interface, SGI standardized OpenGL SGI Direct3D Microsoft Vulkan new interface with less baggage WebGL? Originally for HPC/CAD but gaming has brought down prices for everyone. 13

GPGPUS Interfaces needed, as GPU companies do not like to reveal what their chips due at the assembly level. CUDA (Nvidia) OpenCL (Everyone else) can in theory take parallel code and map to CPU, GPU, FPGA, DSP, etc 14

Old example: Why GPUs? 3GHz Pentium 4, 6 GFLOPS, 6GB/sec peak GeForceFX 6800: 53GFLOPS, 34GB/sec peak Newer example Raspberry Pi, 700MHz, 0.177 GFLOPS On-board GPU: Video Core IV: 24 GFLOPS 15

GPGPU Key Ideas Using many slimmed down cores Have single instruction stream operate across many cores (SIMD) Avoid latency (slow textures, etc) by working on another group when one stalls 16

GPU Benefits Specialized hardware, concentrating on arithmetic. Transistors for ALUs not cache. Fast 32-bit floating point (16-bit?) Driven by commodity gaming, so much faster than would be if only HPC people using them. Accuracy? 64-bit floating point? 32-bit floating point? 16-bit floating point? Doesn t matter as much if color slightly off for a frame in your video game. highly parallel 17

GPU Problems Optimized for 3d-graphics, not always ideal for other things Need to port code, usually can t just recompile cpu code. Companies secretive. Serial code with a lot of control flow runs poorly Off-chip memory transfers can be slow 18

Latency vs Throughput CPUs = Low latency, low throughput GPUs = high latency, high throughput CPUs optimized to try to get lowest latency (caches); with no parallelism have to get memory back as soon as possible GPUs optimized for throughput. Best throughput for all better than low-latency for one 19

Older / Traditional GPU Pipeline In old days, fixed pipeline (lots of triangles). Modern chips much more flexible, but the old pipeline can still be implemented in software via the fancier interface. 20

Older / Traditional GPU Pipeline CPU send list of vertices to GPU. Transform (vertex processor) (convert from world space to image space). 3d translation to 2d, calculate lighting. Operate on 4-wide vectors (x,y,z,w in projected space, r,g,b,a color space) Rasterizer transform vertexes/vectors into a grid. Fragments. break up to pixels and anti-alias 21

Shader (Fragment processor) compute color for each pixel. Use textures if necessary (texture memory, mostly read) Write out to framebuffer (mostly write) Z-buffer for depth/visibility 22

GPGPUs Started when the vertex and fragment processors became generically programmable (originally to allow more advanced shading and lighting calculations) By having generic use can adapt to different workloads, some having more vertex operations and some more fragment 23

Graphics vs Programmable Use Vertex Vertex Processing Data MIMD processing Polygon Polygon Setup Lists SIMD Rasterization Fragment Per-pixel math Data Programmable SIMD Texture Data fetch, Blending Data Data Fetch Image Z-buffer, anti-alias Data Predicated Write 24

Example for Shader 3.0, came out DirectX9 They are up to Pixel Shader 5.0 now 25

Shader 3.0 Programming Vertex Processor 512 static / 65536 dynamic instructions Up to 32 temporary registers Simple flow control Texturing texture data can be fetched during vertex operations 26

Can do a four-wide SIMD MAD (multiply ADD) and a scalar op per cycle: EXP, EXPP, LIT, LOGP (exponential) RCP, RSQ (reciprocal, r-square-root) SIN, COS (trig) 27

Shader 3.0 Programming Fragment Processor 65536 static / 65536 dynamic instructions (but can time out if takes too long) Supports conditional branches and loops fp32 and fp16 internal precision Can do 4-wide MAD and 4-wide DP4 (dot product) 28