Overview of High Performance Computing

Timothy H. Kaiser, Ph.D.
tkaiser@mines.edu
http://inside.mines.edu/~tkaiser/csci580fall13/

Near Term Overview
- HPC computing in a nutshell
- Basic MPI - run an example
- A few additional MPI features
- Scripting #1
- OpenMP
- Libraries and other stuff

Introduction
- What is parallel computing?
- Why go parallel?
- When do you go parallel?
- What are some limits of parallel computing?
- Types of parallel computers
- Some terminology

What is Parallelism?
Consider your favorite computational application:
- One processor can give me results in N hours.
- Why not use N processors and get the results in just one hour?
The concept is simple: parallelism = applying multiple processors to a single problem.

Parallel computing is "computing by committee": the use of multiple computers or processors working together on a common task.
- Each processor works on its section of the problem.
- Processors are allowed to exchange information with other processors.
[Figure: grid of a problem to be solved, divided into four regions; process 0 through process 3 each do the work for one region.]
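A minimal sketch of this idea, assuming mpi4py is available (MPI itself is covered later in the course): each process claims one contiguous block of rows of a global grid and reports which section it owns. The script name and grid size are made up for illustration.

# A minimal sketch, assuming mpi4py is installed: each MPI process claims a
# contiguous block of rows of a global grid and works on its own section.
# Run with e.g. "mpiexec -n 4 python grid.py".
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()                 # this process's id
size = comm.Get_size()                 # total number of processes

nrows = 1000                           # rows in the global grid (made up)
start = rank * nrows // size           # first row this process owns
end = (rank + 1) * nrows // size       # one past the last row it owns

print("Process", rank, "does work for rows", start, "through", end - 1)

# When neighboring regions need each other's edge values, adjacent ranks
# would exchange boundary rows, e.g. with comm.Sendrecv(...).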

Why do parallel computing?
Limits of single CPU computing:
- Available memory
- Performance
Parallel computing allows us to:
- Solve problems that don't fit on a single CPU
- Solve problems that can't be solved in a reasonable time

Why do parallel computing?
We can:
- Run larger problems
- Run faster
- Run more cases
- Run simulations at finer resolutions
- Model physical phenomena more realistically

Weather Forecasting
- The atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile (10 cells high): about 500 x 10^6 cells.
- The calculations for each cell are repeated many times to model the passage of time.
- About 200 floating point operations per cell per time step, or about 10^11 floating point operations per time step.
- A 10-day forecast with 10-minute resolution => about 1.5x10^14 flop.
- At 100 Mflops this would take about 17 days; at 1.7 Tflops, about 2 minutes; at 17 Tflops, about 12 seconds.
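A quick back-of-the-envelope check of the arithmetic above (the cell count, flop count per cell, and step size are taken from the slide; small differences from the quoted times are just rounding in the slide's estimates):

# Back-of-the-envelope check of the weather-forecast numbers above.
cells = 500e6                       # ~500 x 10^6 cells
flop_per_cell = 200                 # floating point operations per cell per step
flop_per_step = cells * flop_per_cell          # ~1e11 flop per time step

steps = 10 * 24 * 6                 # 10-day forecast, 10-minute time steps
total_flop = flop_per_step * steps             # ~1.5e14 flop

for rate, name in [(100e6, "100 Mflops"), (1.7e12, "1.7 Tflops"), (17e12, "17 Tflops")]:
    seconds = total_flop / rate
    print(name, "->", round(seconds), "seconds (", round(seconds / 86400, 1), "days )")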

Modeling Motion of Astronomical Bodies (brute force)
- Each body is attracted to every other body by gravitational forces.
- The movement of each body can be predicted by calculating the total force experienced by the body.
- For N bodies, N-1 forces per body yields about N^2 calculations each time step.
- A galaxy has about 10^11 stars => about 10^9 years for one iteration.
- Using an efficient N log N approximate algorithm => about a year.
NOTE: This is closely related to another hot topic: protein folding.
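For concreteness, a tiny brute-force force sum of the kind described, with made-up 2-D positions, unit masses, and the gravitational constant set to 1; it performs the N*(N-1) pair interactions per step that make the method O(N^2):

# A minimal brute-force N-body force sum (the O(N^2) method described above).
# Illustrative only: 2-D, unit masses, no softening.
def forces(pos):
    n = len(pos)
    f = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r3 = (dx * dx + dy * dy) ** 1.5
            f[i][0] += dx / r3          # G * m_i * m_j taken as 1
            f[i][1] += dy / r3
    return f                            # N*(N-1) pair interactions per step

print(forces([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]))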

Types of parallelism: two extremes
Data parallel
- Each processor performs the same task on different data.
- Example: grid problems.
- "Bag of tasks" or "embarrassingly parallel" is a special case.
Task parallel
- Each processor performs a different task.
- Example: signal processing, such as encoding multitrack data.
- Pipeline is a special case.

Simple data parallel program
Example: integrate a 2-D propagation problem.
[The slide shows the starting partial differential equation and its finite difference approximation, with the x-y domain divided among PE #0 through PE #7.]
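The equation on the slide did not transcribe, so the sketch below uses a generic explicit 2-D diffusion stencil as a stand-in for the finite difference update; in the data-parallel version each PE would own a strip of the array u and exchange its boundary rows with neighboring PEs every step:

# Stand-in stencil update (serial); a generic explicit 2-D diffusion step.
import numpy as np

nx, ny, steps, alpha = 64, 64, 100, 0.1
u = np.zeros((nx, ny))
u[nx // 2, ny // 2] = 1.0              # an initial disturbance

for _ in range(steps):
    # each interior point is updated from its four neighbors
    u[1:-1, 1:-1] += alpha * (u[2:, 1:-1] + u[:-2, 1:-1] +
                              u[1:-1, 2:] + u[1:-1, :-2] -
                              4.0 * u[1:-1, 1:-1])
print(u.sum())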

Typical Task Parallel Application
Signal processing: DATA -> Normalize Task -> FFT Task -> Multiply Task -> Inverse FFT Task
- Use one processor for each task.
- Can use more processors if one is overloaded.
- This is a pipeline.
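A hedged sketch of such a pipeline using Python's multiprocessing, with one process per stage connected by queues; the signal data and the "multiply" filter here are stand-ins, not the course's actual example:

# Four-stage pipeline: normalize -> FFT -> multiply -> inverse FFT.
import numpy as np
from multiprocessing import Process, Queue

def normalize(x): return x / np.max(np.abs(x))
def fft(x):       return np.fft.fft(x)
def multiply(x):  return x * 0.5           # stand-in for a filter
def inv_fft(x):   return np.fft.ifft(x)

def stage(fn, q_in, q_out):
    while True:
        item = q_in.get()
        if item is None:                    # sentinel: shut this stage down
            q_out.put(None)
            break
        q_out.put(fn(item))

if __name__ == "__main__":
    funcs = [normalize, fft, multiply, inv_fft]
    qs = [Queue() for _ in range(len(funcs) + 1)]
    procs = [Process(target=stage, args=(f, qs[i], qs[i + 1]))
             for i, f in enumerate(funcs)]
    for p in procs:
        p.start()
    for _ in range(4):                      # four "tracks" of signal data
        qs[0].put(np.random.rand(1024))
    qs[0].put(None)
    out = qs[-1].get()
    while out is not None:                  # drain results from the last stage
        print("processed a track, |mean| =", abs(out.mean()))
        out = qs[-1].get()
    for p in procs:
        p.join()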

Parallel Program Structure
[Diagram: Begin -> start parallel -> N subtasks each do work stages a, b, c, d, with "Communicate & Repeat" between stages -> End Parallel -> End.]

Parallel Problems
[Diagram: the same structure as before, but with a serial section in the middle (no parallel work) before a second parallel section. Two problems are highlighted: the subtasks don't finish together, and during the serial section we are not using all processors.]

A Real example

#!/usr/bin/env python
# Two copies of this script, started with different ids, cooperate through a
# file: process 1 writes a value, the other process polls until the file
# exists and then reads it.
from sys import argv
from os.path import isfile
from time import sleep
from math import sin, cos

fname = "message"
my_id = int(argv[1])
print(my_id, "starting program")

if my_id == 1:
    sleep(2)
    myval = cos(10.0)
    mf = open(fname, "w")
    mf.write(str(myval))
    mf.close()
else:
    myval = sin(10.0)
    notready = True
    while notready:
        if isfile(fname):
            notready = False
            sleep(3)
            mf = open(fname, "r")
            message = float(mf.readline())
            mf.close()
            total = myval**2 + message**2
        else:
            sleep(5)

print(my_id, "done with program")
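One way to run it (the script name wait_example.py is just for illustration): remove any old message file, then in two terminals run python wait_example.py 0 and python wait_example.py 1. Process 1 writes the file and the other process polls until it appears.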

Theoretical upper limits
- All parallel programs contain parallel sections and serial sections.
- Serial sections are where work is being duplicated or no useful work is being done (waiting for others).
- Serial sections limit the parallel effectiveness:
  - If you have a lot of serial computation you will not get good speedup.
  - No serial work allows perfect speedup.
- Amdahl's Law states this formally.

Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
Effect of multiple processors on run time:
    t_p = (f_p/N + f_s) * t_s
Effect of multiple processors on speedup:
    S = t_s / t_p = 1 / (f_p/N + f_s)
where
    f_s = serial fraction of code
    f_p = parallel fraction of code (f_s + f_p = 1)
    N = number of processors
Perfect speedup: t = t_1/N, or S(N) = N.
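A small sketch that evaluates the formula for a few assumed serial fractions; note how even f_s = 0.05 caps the speedup near 1/f_s = 20 no matter how many processors are used:

# Evaluate Amdahl's Law, S(N) = 1 / (f_p/N + f_s), for assumed serial fractions.
def amdahl_speedup(f_s, n):
    f_p = 1.0 - f_s
    return 1.0 / (f_p / n + f_s)

for f_s in (0.0, 0.01, 0.05, 0.10):
    print("f_s =", f_s,
          [round(amdahl_speedup(f_s, n), 1) for n in (2, 8, 64, 1024)])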

Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance.

Amdahl's Law vs. Reality
Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance.
[Figure: speedup vs. number of processors (0-250) for f_p = 0.99, comparing the Amdahl's Law prediction with reality.]

Sometimes you don't get what you expect!

Some other considerations
- Writing an effective parallel application is difficult:
  - Communication can limit parallel efficiency.
  - Serial time can dominate.
  - Load balance is important.
- Is it worth your time to rewrite your application?
  - Do the CPU requirements justify parallelization?
  - Will the code be used just once?

Parallelism Carries a Price Tag
Parallel programming:
- Involves a steep learning curve
- Is effort-intensive
Parallel computing environments:
- Are unstable and unpredictable
- Don't respond to many serial debugging and tuning techniques
- May not yield the results you want, even if you invest a lot of time
Will the investment of your time be worth it?

Terms related to algorithms
- Amdahl's Law (talked about this already)
- Superlinear speedup
- Efficiency
- Cost
- Scalability
- Problem size
- Gustafson's Law

Superlinear Speedup
S(n) > n may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm, or to some unique feature of the architecture that favors the parallel formation.
One common reason for superlinear speedup is the extra cache in the multiprocessor system, which can hold more of the problem data at any instant; this leads to less traffic to the relatively slow main memory.

Efficiency
Efficiency is the execution time using one processor divided by (the execution time using n processors times n):
    E = t_s / (t_p x n) = S(n) / n
It's just the speedup divided by the number of processors.

Cost
The processor-time product, or cost (or work), of a computation is defined as
    Cost = (execution time) x (total number of processors used)
The cost of a sequential computation is simply its execution time, t_s.
The cost of a parallel computation is t_p x n. The parallel execution time t_p is given by t_s/S(n), so the cost of a parallel computation is (t_s x n)/S(n) = t_s/E.

Cost-Optimal Parallel Algorithm
One in which the cost to solve a problem on a multiprocessor is proportional to the cost on a single processor (the sequential execution time).
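Putting the last few definitions together with made-up timings:

# Speedup, efficiency, and cost from the definitions above (timings assumed).
t_s = 100.0           # assumed sequential execution time, seconds
t_p = 7.0             # assumed parallel execution time on n processors
n = 16

speedup = t_s / t_p               # S(n)
efficiency = speedup / n          # fraction of ideal
cost = t_p * n                    # processor-time product
print(speedup, efficiency, cost)  # ~14.3, ~0.89, 112 (vs. sequential cost 100)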

Scalability
- Used to indicate a hardware design that allows the system to be increased in size and, in doing so, to obtain increased performance. This could be described as architecture or hardware scalability.
- Also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps. This could be described as algorithmic scalability.

Problem size
Problem size: the number of basic steps in the best sequential algorithm for a given problem and data set size.
Intuitively, we would think of the number of data elements being processed as a measure of size. However, doubling the data set size does not necessarily double the number of computational steps; it depends on the problem. For example, adding two matrices has this effect, but multiplying matrices quadruples the operations.
Note: bad sequential algorithms tend to scale well.

Other names for Scaling
Strong scaling (engineering):
- For a fixed total problem size, how does the time to solution vary with the number of processors?
Weak scaling:
- How does the time to solution vary with processor count when the problem size per processor is fixed?
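A toy illustration of the difference, with an assumed total problem size: in strong scaling the work per processor shrinks as processors are added, while in weak scaling the per-processor work stays fixed and the total grows:

# Strong vs. weak scaling in terms of problem size (illustrative numbers only).
total_work = 1_000_000               # assumed total problem size

for n in (1, 4, 16, 64):
    strong_per_proc = total_work // n    # strong: fixed total, shrinking share
    weak_total = total_work * n          # weak: fixed share per proc, growing total
    print(n, "procs | strong, per proc:", strong_per_proc,
          "| weak, total:", weak_total)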

Some Classes of Machines
Distributed memory
- Processors only have access to their local memory.
- They talk to other processors over a network.
[Diagram: four processors, each with its own memory, connected by a network.]

Some Classes of Machines
Uniform Shared Memory (UMA)
- All processors have equal access to memory.
- They can talk to each other via memory.
[Diagram: multiple processors all attached to a single shared memory.]

Some Classes of Machines
Hybrid
- Shared memory nodes connected by a network.
[Diagram: several shared-memory nodes linked by a network.]

Some Classes of Machines
More common today: each node has a collection of multicore chips.
Ra has 268 nodes:
- 256 nodes: quad-core, dual-socket
- 12 nodes: dual-core, quad-socket

Some Classes of Machines
Hybrid machines
- Add special purpose processors to normal processors.
- Not a new concept, but regaining traction.
- Example: our Nvidia Tesla node, cuda1.
[Diagram: a "normal" CPU paired with a special purpose processor: FPGA, GPU, vector unit, Cell...]

Network Topology
- For ultimate performance you may be concerned with how your nodes are connected.
- Avoid communications between distant nodes.
- For some machines it might be difficult to control or know the placement of applications.

Network Terminology
Latency
- How long it takes to get between nodes in the network.
Bandwidth
- How much data can be moved per unit time.
- Bandwidth is limited by the number of wires, the rate at which each wire can accept data, and choke points.
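A common way to measure both is a ping-pong test. Here is a hedged sketch assuming mpi4py, to be run with exactly two processes (e.g. mpiexec -n 2 python pingpong.py): the small message estimates latency, the large one bandwidth.

# Ping-pong between ranks 0 and 1 to estimate latency and bandwidth.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
partner = 1 - rank                         # assumes exactly two processes

for nbytes, reps in [(8, 1000), (8 * 1024 * 1024, 20)]:
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=partner)
            comm.Recv(buf, source=partner)
        else:
            comm.Recv(buf, source=partner)
            comm.Send(buf, dest=partner)
    rtt = (MPI.Wtime() - t0) / reps        # average round-trip time
    if rank == 0:
        print(nbytes, "bytes: round trip =", rtt * 1e6, "usec,",
              "bandwidth =", 2 * nbytes / rtt / 1e9, "GB/s")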

Ring
[Diagram: nodes connected in a ring.]

Grid
- Wrapping the edges produces a torus.
- Can also be done in 3-D, with or without wrapping.

Star?
Quality depends on what is in the center.

Example: An InfiniBand Switch
- InfiniBand DDR, Cisco 7024 IB Server Switch, 48 ports.
- Adaptors: each compute node has one DDR 1-port HCA.
- 4X DDR => 16 Gbit/sec.
- 140 nanosecond hardware latency; 1.26 microseconds at the software level.

Measured Bandwidth
[Figure: measured bandwidth results.]

AUN Connectivity
[Two slides of AUN network connectivity diagrams.]

Tree
Fat tree: the links get wider (more bandwidth) as you go up toward the root.

Hypercube
[Diagram: a 3-dimensional hypercube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111.]

4D Hypercube
[Diagram: a 4-dimensional hypercube with nodes labeled 0000 through 1111.]
- Some communications algorithms are hypercube based.
- How big would a 7-d hypercube be?
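To answer the question: a d-dimensional hypercube has 2^d nodes (so 128 nodes for d = 7), each with d links to neighbors whose labels differ in exactly one bit:

# Nodes and links of a d-dimensional hypercube.
for d in range(1, 8):
    nodes = 2 ** d
    links = d * nodes // 2       # each node has d links, each link shared by two nodes
    print(d, "dims:", nodes, "nodes,", links, "links")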

5-d Torus