Overview of High Performance Computing


1 Overview of High Performance Computing Timothy H. Kaiser, Ph.D. Show_me_some_local_HPC_tutorials/ 1

2 Introduction What is High Performance Computing? Why go parallel? When do you go parallel? What are some limits of parallel computing? Types of parallel computers Some terminology What is available How this all works 2

3 What the Exa?
Exa  = 1,152,921,504,606,846,976 = 2**60 = 1024**6 = 10**18.06
Peta = 1,125,899,906,842,624 = 2**50 = 1024**5 = 10**15.05
Tera = 1,099,511,627,776 = 2**40 = 1024**4 = 10**12.04
Giga = 1,073,741,824 = 2**30 = 1024**3 = 10**9.03
Mega = 1,048,576 = 2**20 = 1024**2 = 10**6.02
Kilo = 1,024 = 2**10 = 1024**1 = 10**3.01
3
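
These values are easy to regenerate; a minimal Python sketch that prints the same table:

from math import log10

prefixes = ["Kilo", "Mega", "Giga", "Tera", "Peta", "Exa"]
for i, name in enumerate(prefixes, start=1):
    value = 2 ** (10 * i)               # same as 1024**i
    print("%-4s = %26d = 2**%d = 10**%.2f" % (name, value, 10 * i, log10(value)))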

4 Top 500 4

5 What is Parallelism? Consider your favorite computational application One processor can give me results in N hours Why not use N processors -- and get the results in just one hour? The concept is simple: Parallelism = applying multiple processors to a single problem 5

6 Parallel computing is computing by committee Parallel computing: the use of multiple computers or processors working together on a common task. Each processor works on its section of the problem. [Figure: grid of a problem to be solved, with processes 0 through 3 each doing the work for its own region.] Processors are allowed to exchange information with other processors. 6

7 Why do parallel computing? Limits of single CPU computing: available memory, performance. Parallel computing allows us to: solve problems that don't fit on a single CPU; solve problems that can't be solved in a reasonable time. 7

8 Why do parallel computing? We can run Larger problems Faster More cases Run simulations at finer resolutions Model physical phenomena more realistically 8

9 Weather Forecasting Atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile (10 cells high), about 500 x 10**6 cells. The calculations for each cell are repeated many times to model the passage of time. About 200 floating point operations per cell per time step, or about 10**11 floating point operations per time step. A 10 day forecast with 10 minute resolution => 1.5x10**14 flop. 100 Mflops would take about 17 days. 1.7 Tflops would take 2 minutes. 17 Tflops would take 8 seconds. 105 Tflops would take 1.3 seconds. What might you want to do if running for 1.3 seconds? 9
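
A minimal Python sketch of the arithmetic behind these estimates; the cell count, 200 flops/cell figure, and machine rates come from the slide, the rest is unit conversion:

# Reproduce the weather-forecast flop estimate.
cells = 500e6                 # ~500 x 10**6 cells
flops_per_cell = 200.0        # floating point operations per cell per time step
steps = 10 * 24 * 6           # 10-day forecast at 10-minute resolution

total_flop = cells * flops_per_cell * steps        # about 1.5x10**14 flop
print("total flop: %.2e" % total_flop)

for rate in (100e6, 1.7e12, 17e12, 105e12):        # 100 Mflops ... 105 Tflops
    seconds = total_flop / rate
    print("at %8.2e flops: %10.3g seconds (%.3g days)" % (rate, seconds, seconds / 86400.0))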

10 Modeling Motion of Astronomical bodies (brute force) Each body is attracted to each other body by gravitational forces. Movement of each body can be predicted by calculating the total force experienced by the body. For N bodies, N - 1 forces per body yields N**2 calculations each time step. A galaxy has about 10**11 stars => about 10**9 years for one iteration. Using an N log N efficient approximate algorithm => about a year. NOTE: This is closely related to another hot topic: Protein Folding. 10
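
A minimal sketch of the operation counts; the 10**11 star count is the usual textbook figure and is an assumption here, since the exact number did not survive transcription:

# One N-body time step: brute force O(N**2) vs. tree-code O(N log N) interactions.
from math import log2

N = 1e11                          # assumed number of stars in the galaxy
brute_force = N * N               # ~1e22 force calculations
tree_code = N * log2(N)           # ~3.7e12 force calculations

print("brute force: %.2e calculations" % brute_force)
print("N log N    : %.2e calculations" % tree_code)
print("ratio      : %.2e" % (brute_force / tree_code))

The ratio of roughly 3x10**9 is what turns the 10**9-year brute-force estimate into something under a year.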

11 Types of parallelism two extremes Data parallel Each processor performs the same task on different data Example - grid problems Bag of Tasks or Embarrassingly Parallel is a special case Task parallel Each processor performs a different task Example - signal processing such as encoding multitrack data Pipeline is a special case 11

12 Simple data parallel program Example: integrate a 2-D propagation problem. [Equations on the slide: the starting partial differential equation and its finite difference approximation.] [Figure: the x-y grid is decomposed among PE #0 through PE #7.] 12
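
A minimal data-parallel sketch in plain Python: each "PE" owns a strip of the grid and applies the same update to its own rows. There is no real message passing here, just the decomposition idea, and the grid size and update are made up:

# Toy data-parallel decomposition: split the grid rows among num_pe processing elements.
num_pe = 8
nx, ny = 64, 64
grid = [[0.0] * nx for _ in range(ny)]

rows_per_pe = ny // num_pe
for pe in range(num_pe):                  # in a real code each PE would run concurrently
    start, end = pe * rows_per_pe, (pe + 1) * rows_per_pe
    for j in range(start, end):           # same operation, different data
        for i in range(nx):
            grid[j][i] += 1.0             # stand-in for the finite-difference update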

13 Typical Task Parallel Application Signal processing: DATA -> Normalize Task -> FFT Task -> Multiply Task -> Inverse FFT Task. Use one processor for each task. Can use more processors if one is overloaded. This is a pipeline. 13
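
A minimal task-parallel sketch of the same idea; the stages below are stand-ins (the FFT steps just pass data through), and in a real pipeline each stage would run on its own processor:

# Toy four-stage pipeline.
def normalize(xs):
    peak = max(abs(x) for x in xs)
    return [x / peak for x in xs]

def fft(xs):          return xs           # stand-in for a real FFT
def multiply(xs):     return [2.0 * x for x in xs]
def inverse_fft(xs):  return xs           # stand-in for a real inverse FFT

data = [1.0, -3.0, 2.0]
for stage in (normalize, fft, multiply, inverse_fft):
    data = stage(data)                    # each stage feeds the next
print(data)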

14 Parallel Program Structure [Diagram: Begin, then start parallel; each of the N processes does its own work (work 1a-1d, work 2a-2d, ..., work (N)a-(N)d), with "Communicate & Repeat" steps between the work stages; End Parallel; End.] 14

15 Parallel Problems [Diagram: the same structure as the previous slide, annotated with three problems: subtasks don't finish together; a serial section in which no parallel work is done; and a parallel section that is not using all processors.] 15

16 A Real example

#!/usr/bin/env python
# Two copies of this script, started with ids 0 and 1, communicate through a file.
from sys import argv
from os.path import isfile
from time import sleep
from math import sin,cos
#
fname="message"
my_id=int(argv[1])
print("\n%d starting program \n" % (my_id))
#
if (my_id == 1):
    # copy 1 computes cos(10) and writes it to the message file
    sleep(2)
    myval=cos(10.0)
    mf=open(fname,"w")
    mf.write(str(myval))
    mf.close()
if (my_id == 0):
    # copy 0 computes sin(10), waits for the file, then combines the two values
    myval=sin(10.0)
    notready=True
    while notready :
        if isfile(fname) :
            notready=False
            sleep(3)
            mf=open(fname,"r")
            message=float(mf.readline())
            mf.close()
            total=myval**2+message**2
        else:
            sleep(5)
    print("sin(10)**2+cos(10)**2= %15.12f" % (total))
print("%d done with program \n" %(my_id))

16
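
To try it (assuming the script is saved as twoproc.py, a name not given on the slide), remove any leftover message file and start the two copies by hand with their ids as arguments:

rm -f message
python twoproc.py 0 &
python twoproc.py 1

Copy 1 writes cos(10) to the file; copy 0 waits for the file, reads it, and prints sin(10)**2 + cos(10)**2 = 1.000000000000.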

17 Theoretical upper limits All parallel programs contain: parallel sections and serial sections. Serial sections are when work is being duplicated or no useful work is being done (waiting for others). Serial sections limit the parallel effectiveness: if you have a lot of serial computation then you will not get good speedup. No serial work allows perfect speedup. Amdahl's Law states this formally. 17

18 Amdahl's Law
Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
Effect of multiple processors on run time: t_p = (f_p/N + f_s) t_s
Effect of multiple processors on speedup: S = t_s / t_p = 1 / (f_p/N + f_s)
where f_s = serial fraction of code, f_p = parallel fraction of code, N = number of processors
Perfect speedup: t = t_1/n, or S(n) = n
18
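
A minimal sketch of Amdahl's Law as code, straight from the formula above:

# Amdahl's Law: speedup as a function of the serial fraction f_s and processor count N.
def amdahl_speedup(f_s, n):
    f_p = 1.0 - f_s
    return 1.0 / (f_p / n + f_s)

for f_s in (0.0, 0.01, 0.1):
    row = ["%6.1f" % amdahl_speedup(f_s, n) for n in (1, 4, 16, 64, 256)]
    print("f_s=%.2f:" % f_s, row)

Even a 1% serial fraction caps the speedup at 100, no matter how many processors are used.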

19 Illustration of Amdahl's Law It takes only a small fraction of serial content in a code to degrade the parallel performance. 19

20 Amdahl's Law Vs. Reality Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance. [Plot: speedup vs. number of processors for a fixed f_p, comparing the Amdahl's Law curve with measured reality.] 20

21 Sometimes you don't get what you expect! 21

22 Some other considerations Writing effective parallel applications is difficult: communication can limit parallel efficiency, serial time can dominate, and load balance is important. Is it worth your time to rewrite your application? Do the CPU requirements justify parallelization? Will the code be used just once? 22

23 Parallelism Carries a Price Tag Parallel programming: involves a steep learning curve; is effort-intensive. Parallel computing environments are unstable and unpredictable: don't respond to many serial debugging and tuning techniques; may not yield the results you want, even if you invest a lot of time. Will the investment of your time be worth it? 23

24 Terms related to algorithms Amdahl's Law (talked about this already), Superlinear Speedup, Efficiency, Cost, Scalability, Problem Size, Gustafson's Law 24

25 Superlinear Speedup Superlinear speedup, S(n) > n, may be seen on occasion, but usually this is due to using a suboptimal sequential algorithm or some unique feature of the architecture that favors the parallel formation. One common reason for superlinear speedup is the extra cache in the multiprocessor system, which can hold more of the problem data at any instant; this leads to less traffic to the relatively slow main memory. 25

26 Efficiency Efficiency = (execution time using one processor) / (number of processors x execution time using that number of processors). It's just the speedup divided by the number of processors. 26

27 Cost The processor-time product, or cost (or work), of a computation is defined as Cost = (execution time) x (total number of processors used). The cost of a sequential computation is simply its execution time, t_s. The cost of a parallel computation is t_p x n. The parallel execution time, t_p, is given by t_s/S(n). Hence, the cost of a parallel computation is given by Cost = t_s x n / S(n). Cost-Optimal Parallel Algorithm: one in which the cost to solve a problem on a multiprocessor is proportional to the cost on a single processor system. 27
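
A minimal sketch tying speedup, efficiency, and cost together from measured run times (the timings below are made up for illustration):

# Speedup, efficiency, and cost from measured serial and parallel run times.
def metrics(t_serial, t_parallel, n):
    speedup = t_serial / t_parallel
    efficiency = speedup / n
    cost = t_parallel * n                 # processor-time product
    return speedup, efficiency, cost

s, e, c = metrics(t_serial=100.0, t_parallel=8.0, n=16)
print("speedup=%.1f efficiency=%.2f cost=%.1f (serial cost = 100.0)" % (s, e, c))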

28 Scalability Used to indicate a hardware design that allows the system to be increased in size and in doing so to obtain increased performance - could be described as architecture or hardware scalability. Scalability is also used to indicate that a parallel algorithm can accommodate increased data items with a low and bounded increase in computational steps - could be described as algorithmic scalability. 28

29 Problem size Problem size: the number of basic steps in the best sequential algorithm for a given problem and data set size. Intuitively, we would think of the number of data elements being processed in the algorithm as a measure of size. However, doubling the data set size would not necessarily double the number of computational steps; it will depend upon the problem. For example, adding two matrices has this effect, but multiplying matrices quadruples the number of operations. Note: Bad sequential algorithms tend to scale well. 29

30 Other names for Scaling Strong Scaling (Engineering) For a fixed problem size how does the time to solution vary with the number of processors Weak Scaling How the time to solution varies with processor count with a fixed problem size per processor 30
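
A minimal sketch of how the two are usually measured from timings; the numbers below are made up:

# Strong scaling: fixed total problem size. Weak scaling: fixed problem size per processor.
def strong_scaling_efficiency(t1, tn, n):
    # t1: time on 1 processor, tn: time on n processors for the same total problem
    return t1 / (n * tn)

def weak_scaling_efficiency(t1, tn):
    # t1: base problem on 1 processor, tn: n processors with n times the work (ideal: tn == t1)
    return t1 / tn

print("strong scaling efficiency: %.2f" % strong_scaling_efficiency(100.0, 8.0, 16))
print("weak scaling efficiency  : %.2f" % weak_scaling_efficiency(100.0, 110.0))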

31 Some Classes of machines Distributed Memory: [Diagram: four processors, each with its own local memory, connected by a network.] Processors only have access to their local memory and talk to other processors over a network. 31

32 Some Classes of machines Uniform Memory Access (UMA) shared memory: [Diagram: several processors all connected to a single shared memory.] All processors have equal access to the memory and can talk via memory. 32

33 Some Classes of machines Hybrid Shared memory nodes connected by a network... 33

34 Some Classes of machines More common today: each node has a collection of multicore chips... Ra has 268 nodes: 256 dual-socket quad-core and 12 quad-socket dual-core. 34

35 Some Classes of machines Hybrid Machines: add special purpose processors (FPGA, GPU, Vector, Cell...) to normal CPUs. Not a new concept, but regaining traction. Example: our Power8/K80 nodes. Issue: transfer speed between the units. 35

36 Network Topology For ultimate performance you may be concerned with how your nodes are connected: avoid communications between distant nodes. For some machines it might be difficult to control or know the placement of applications. 36

37 Network Terminology Latency: how long it takes to get between nodes in the network. Bandwidth: how much data can be moved per unit time. Bandwidth is limited by the number of wires, the rate at which each wire can accept data, and choke points. 37
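
A common first-order model built from exactly these two terms is time = latency + size/bandwidth. A minimal sketch, with default numbers taken from the DDR Infiniband example a few slides later (1.26 microseconds software latency, 16 Gbit/sec = 2e9 bytes/sec):

# First-order message-time model: time = latency + message_size / bandwidth.
def message_time(nbytes, latency_s=1.26e-6, bandwidth_bytes_per_s=2.0e9):
    return latency_s + nbytes / bandwidth_bytes_per_s

for nbytes in (8, 1024, 1024**2):
    print("%8d bytes: %.2e seconds" % (nbytes, message_time(nbytes)))

For small messages the latency term dominates; for large messages the bandwidth term does.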

38 Ring 38

39 Grid Wrapping produces torus 39

40 Tree Fat tree: the lines get wider as you go up. 40

41 Hypercube [Figure: a 3-dimensional hypercube] 41

42 4D Hypercube Some communications algorithms are hypercube based How big would a 9d hypercube be? 42
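
A minimal sketch answering the question on the slide: a d-dimensional hypercube has 2**d nodes, each with d links.

# Size of a d-dimensional hypercube: 2**d nodes, d links per node, d*2**(d-1) links total.
for d in (3, 4, 9):
    nodes = 2 ** d
    total_links = d * 2 ** (d - 1)
    print("d=%d: %4d nodes, %d links per node, %5d links total" % (d, nodes, d, total_links))

So a 9d hypercube has 512 nodes.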

43 5d Torus [Figure: a 3d grid of nodes; a 3d torus adds wrap-around links; a 5d torus adds links in two more dimensions.] 43

44 5d - Blue Gene Q MidPlane: 512 nodes, 4x4x4x4x2 44

45 5D Torus Network BGQ Layout The network topology of BlueGene/Q is a five-dimensional (5D) torus, with direct links between the nearest neighbors in the ±A, ±B, ±C, ±D, and ±E directions. As such there are only a few optimum block sizes that will use the network efficiently.
Node Boards          Compute Nodes   Cores    Torus Dimensions
1                    32              512      2x2x2x2x2
2 (adjacent pairs)   64              1,024    2x2x4x2x2
4 (quadrants)        128             2,048    2x2x4x4x2
8 (halves)           256             4,096    4x2x4x4x2
16 (midplane)        512             8,192    4x4x4x4x2
32 (1 rack)          1,024           16,384   4x4x4x8x2
64 (2 racks)         2,048           32,768   4x4x8x8x2
45

46 Star? Quality depends on what is in the center 46

47 Example: An Infiniband Switch Infiniband, DDR, Cisco 7024 IB Server Switch - 48 port. Adaptors: each compute node has one DDR 1-port HCA. 4X DDR => 16 Gbit/sec. 140 nanosecond hardware latency; 1.26 microsecond at the software level. 47

48 Measured Bandwidth 48

49 Infiniband Rates 49

50 New Kid on the Block - Intel Omnipath Designed with the technical and cost requirements of future exascale supercomputers in mind Packet Integrity Protection: a link-level error checking capability that is applied to all data traversing the wire. It allows for transparent detection and recovery of transmission errors as they occur. Dynamic Lane Scaling: maintains link continuity in the event of a lane failure. With the help of PIP, Omni-Path uses the remaining lanes in the link to continue operation. Traffic Flow Optimization: improves quality of service by allowing higher priority data packets to preempt lower priority packets, regardless of packet ordering. 50

51 More Omnipath Info 100 gigabits/sec of bandwidth per port, with port-to-port latencies on par with those of EDR InfiniBand. Intel has stated that their host architecture supports message rates of up to 160 million messages per second. Higher density switches. Host Integration Roadmap: Intel is planning to offer an in-package host adapter configuration, where the fabric ASIC is integrated into the processor socket. Further down the road, the Omni-Path host interface will be integrated directly into the processor. 51

52 Back to coprocessors In the simple case all nodes contain just a collection of normal CPUs and memory, similar to desktop or laptop machines, connected together via some network. There are nonstandard nodes: CPU with GPU, FPGA, high core count (Knights XXX or Phi). 52

53 Graphic Processing Unit - GPU Graphics cards are available in many systems. Some years ago people realized graphics cards are good at some operations - vector and matrix. Why not use them for general computing? Difficulties: initially not designed for it; difficult to program; limited bandwidth to/from CPU memory. 53

54 GPU - Now NVIDIA is the biggest supplier of GPU cards for HPC Cards developed specifically for processing Programming has become easier Bandwidth is much improved Special instructions for AI Many libraries available Lots of applications 54

55 Vintage Nvidia GPU Systems 55

56 56

57 Nvidia - IBM Two computers: Summit (ORNL) and Sierra (LLNL), Pflops-scale systems (about 150 Pflops by the back-of-the-envelope estimate two slides on). IBM Power 9 CPUs with Nvidia Volta GPUs. NVLink high speed interconnect. EDR Infiniband. 57

58 DoE IBM/Nvidia Machines Combines IBM Power 9 CPU Nvidia Volta GPU NVLink interconnect 58

59 Key features Volta will peak out at over 7 Tflops. Stacked memory (very dense and lots of it). NVLink is a key technology in Summit's and Sierra's server node architecture, enabling IBM POWER CPUs and NVIDIA GPUs to access each other's memory (unified memory, >512 GB HBM+DDR4). NVLink will be up to 5 to 12 times faster than PCIe Gen 3. Less than half the watts per flop of current generation chips. > 40 Tflops/node * 3,400 nodes is about 150 PFlops. Back of the envelope calculation - Power 9 = 14 Tflops (Don't quote me on this.) 59

60 What is Intel Knights-xxx or Xeon Phi? Xeon Phi (MIC): a processing chip that contains a large number of cores, >60 with >240 threads. Cores are lower performance than a normal Xeon: slower clock speed; 1st gen is in-order only and missing some Xeon instructions. Runs as a coprocessor; can't boot an OS. 60

61 Current and Coming Knights Landing, 3 versions: 1 card and 2 bootable. Support for the full instruction set. On-package memory. On-board external memory with the bootable versions. Much better memory bandwidth. 3 Tflops each implies 576 Tflops/rack (48*4*3). 61

62 Shipping Knights Landing 62

63 What is Intel Knights-xxx or Xeon Phi? Many programming paradigms: runs regular C and Fortran (Intel compilers); supports OpenMP; can run MPI; supports offloaded calculations. Knights Mill (announced) will have special support for AI. 63

64 Our Resources Documentation Hardware Mio BlueM Golden (AuN) Energy (Mc2) Next Machine? 64

65 Getting Help About Resources Getting accounts Mio node information 65

66 Platforms - Overview Mines has three high performance computing platforms available for campus use: AuN, Mc2, and Mio. AuN and Mc2 share a 480 Tbyte file system and are collectively known as BlueM. AuN is a 144 node, 50 Tflop, x86 system. Mc2 is a 512 node Blue Gene Q rated at 104 Tflop. Mio is a shared resource built up using what is commonly known as the condo model: individual research groups own nodes and have priority access to them. There are also nodes owned by students. Mio currently has ~200 x86 nodes (~104 Tflops CPU), three x86/GPU nodes, two 4-way Phi nodes, and two Power8/K80 GPU nodes, and is serviced by a 240 Tbyte file system. 66

67 2010-current Mio. Nodes: ~200 x86, 11 GPUs, 8 Phi, 2 Power8. ~104 Tflops. It's All Mine. 240 TByte file system. 67

68 Mio Concept CCIT Funds infrastructure Groups purchase nodes Groups can use their nodes when they desire Research Groups have priority access to their nodes Students have priority access to TechFee nodes When nodes are not being used by owners they are available for others Owner starting a job will kick others off 68

69 [Diagram: Mio node details]
Mio head node, management node, and network switch; 240 Tbyte parallel (GPFS) file system
TechFee GPU Nodes:
  gpu001: 2x Intel Xeon 5770 CPU, 24 GB total, 8 cores total, 2.93 GHz; 2x Nvidia T10 processors, 4 GB, 240 cores each
  gpu002: 2x Intel Xeon 5770 CPU, 24 GB total, 8 cores total, 2.93 GHz; 2x Nvidia T10 processors, 4 GB, 240 cores each
  gpu003: 2x Intel Xeon 5650 CPU, 48 GB total, 12 cores total, 2.66 GHz; 3x Nvidia M2070 processors, 5.6 GB, 448 cores each
Power8 Nodes:
  ppc001: IBM Power 8 processor, 256 GB total, 20 cores total, 3.49 GHz; 2x Nvidia K80, 24 GB, 4992 cores each
  ppc002: IBM Power 8 processor, 256 GB total, 20 cores total, 3.49 GHz; 2x Nvidia K80, 24 GB, 4992 cores each
Phi Nodes:
  phi001: 2x Intel Xeon E-series CPU, 32 GB total, 12 cores total, 2.3 GHz; 4x Intel Xeon Phi coprocessors (mic0-mic3), 8 GB, 60 cores each
  phi002: 2x Intel Xeon E-series CPU, 32 GB total, 12 cores total, 2.3 GHz; 4x Intel Xeon Phi coprocessors (mic0-mic3), 8 GB, 60 cores each
Compute Nodes: Intel Xeon CPU based nodes, 8-28 cores each
69

70 BlueM - Mines' Supercomputer 154 Tflops, 17.4 Tbytes, 10,496 cores, 85 KW. Dual architecture - best of both worlds: two distinct compute units, idataplex and Blue Gene Q, with a shared 480 Tbyte file system. Compact, low power consumption. 70

71 BlueM's Compute Units - AuN AuN (Golden): idataplex. Intel SandyBridge, 2 sockets x 8 cores per node. 144 nodes, 2,304 cores, 9,216 Gbytes. Features: latest generation Intel processors; large memory per node; common architecture; similar user environment to RA and Mio. Quickly get researchers up and running. 50 Tflops

72 BlueM's Compute Units - MC2 MC2 (Energy): Blue Gene Q. PowerPC A2, 17 cores. 512 nodes, 8,192 cores, 8,192 Gbytes. 104 Tflops. Features: new architecture; designed for large core count jobs; highly scalable; multilevel parallelism - the direction of HPC; room to grow; a future looking machine.

73 Mc2 - AuN Comparison
Feature               Mc2          AuN
Gflops/Node           ~203         ~347
Memory/Node           16 Gbytes    64 Gbytes
Gflops/Gbyte          ~13          ~5
Recommended Loading   16*4=64      16
Bandwidth             Faster       Fast
73

74 Advertised Layout Cooling Distribution Unit Hose for water cooling

75 75

76 Allocations for BlueM By proposal. Must be faculty to propose. Students can work on a faculty member's grant. 76

77 Other Resources XSEDE, RMACC, NCAR/UoWy (computational-systems/cheyenne) 77

78 Extreme Science and Engineering Discovery Environment The Extreme Science and Engineering Discovery Environment (XSEDE) is the most advanced, powerful, and robust collection of integrated advanced digital resources and services in the world. It is a single virtual system that scientists can use to interactively share computing resources, data, and expertise. Scientists and engineers around the world use these resources and services (things like supercomputers, collections of data, and new tools) to make our lives healthier, safer, and better. XSEDE, and the experts who lead the program, will make these resources easier to use and help more people use them. The five-year, $110-million project is supported by the National Science Foundation. In the summer of 2016, the NSF announced that XSEDE was awarded an additional 5 years of funding after the first 5-year award completed. Originally, XSEDE replaced and expanded on the NSF TeraGrid project. More than 10,000 scientists used the TeraGrid to complete thousands of research projects, at no cost to the scientists. 78

79 79

80 Rocky Mountain Advanced Computing Consortium The Rocky Mountain Advanced Computing Consortium is a collaboration among academic and research institutions located throughout the intermountain states. Our mission is to facilitate widespread effective use of high performance computing throughout the Rocky Mountain region by: educating graduate and undergraduate students, faculty, researchers, and industry partners on the use of computational science and high performance computing; coordinating multi-institutional efforts to advance research, practice, and education in computational science in order to address important regional problems; and bringing together a broad range of researchers, faculty, and industry partners with a depth of experience and expertise not available at any single institution and facilitating their collaboration in multi-disciplinary and multi-institutional teams. Mines is a founding member. RMACC High Performance Computing Symposium. 80

81 RMACC available resources Accessing Summit Summit is a new HPC resource for researchers at CU, CSU, and RMACC partners Key features include 400 TFlops peak performance General compute nodes High-memory nodes GPGPU nodes KNL Xeon Phi nodes Omni-Path interconnect GPFS scratch filesystem 81

82 NCAR/UoWy Cheyenne 82

83 NCAR/UoWy Cheyenne Climate Simulation Laboratory Researchers must have funding from NSF awards to address the climate-related questions for which they are requesting CSL allocations. University Community In general, any U.S.-based researcher with an NSF award in the atmospheric sciences or computational science in support of the atmospheric sciences NCAR Community NCAR investigators have access Wyoming-NCAR Alliance The NWSC represents a collaboration between NCAR and the University of Wyoming. As part of the Wyoming-NCAR Alliance (WNA), a portion of the Cheyenne system about 160 million core-hours per year is reserved for Wyoming-led projects and allocated by a University of Wyoming-managed process. 83

84 How do you run? Getting on Programming Running 84

85 Getting on ssh from your machine to Mio or BlueM 85

86 Hello World in Parallel Compile your program with Parallel compilers 86
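
The slide does not show the source; a minimal parallel "Hello World" sketch in Python, assuming the mpi4py package is available on the machine (compiled languages would instead use the MPI wrapper compilers such as mpicc or mpif90):

#!/usr/bin/env python
# hello.py - parallel hello world using mpi4py (assumed package; the slide does not name one)
from mpi4py import MPI

comm = MPI.COMM_WORLD                 # communicator containing every process that was launched
print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))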

87 Running We write & run a script that tells the system what we want to do 87
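
A minimal batch-script sketch, assuming a Slurm-style scheduler; the slides do not show the actual script or name the scheduler, so the directives below are illustrative only:

#!/bin/bash
#SBATCH --job-name=hello          # illustrative Slurm directives
#SBATCH --nodes=1                 # one node
#SBATCH --ntasks=4                # four MPI tasks
#SBATCH --time=00:05:00           # five-minute wall-clock limit

cd $SLURM_SUBMIT_DIR              # run from the directory the job was submitted from
mpiexec -n 4 python hello.py      # the hello world from the previous slide

Such a script is handed to the scheduler (with sbatch in the Slurm case), and the program's output lands in the job's output file.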

88 Output After some time our program will run 88
