Computing on Mio Introduction


1 Computing on Mio Introduction Timothy H. Kaiser, Ph.D. Director - CSM High Performance Computing Director - Golden Energy Computing Organization 1

2 Purpose To enable people to quickly make use of the high performance computing resource Mio.mines.edu To give people an overview of high performance computing so that they might know what is and is not possible To get people comfortable using other high performance computing resources 2

3 Topics Introduction to High Performance Computing Purpose and methodology Hardware Programming Mio overview History General and Node descriptions File systems Scheduling 3

4 More Topics Running on Mio Documentation Batch scripting Interactive use Staging data Enabling Graphics sessions Useful commands 4

5 Miscellaneous Topics Some tricks from MPI and OpenMP Libraries Nvidia node 5

6 High Performance Computing 6

7 What is Parallelism? Consider your favorite computational application One processor can give me results in N hours Why not use N processors -- and get the results in just one hour? The concept is simple: Parallelism = applying multiple processors to a single problem 7

8 Parallel computing is computing by committee Parallel computing: the use of multiple computers or processors working together on a common task. Each processor works on its section of the problem. Processors are allowed to exchange information with other processors. [Diagram: a grid of a problem to be solved, divided into four regions, with processes 0 through 3 each doing the work for one region] 8

9 Why do parallel computing? Limits of single CPU computing: available memory and performance. Parallel computing allows us to solve problems that don't fit on a single CPU and problems that can't be solved in a reasonable time. 9

10 Why do parallel computing? We can run Larger problems Faster More cases Run simulations at finer resolutions Model physical phenomena more realistically 10

11 Weather Forecasting The atmosphere is modeled by dividing it into three-dimensional regions or cells, 1 mile x 1 mile x 1 mile (10 cells high), about 500 x 10^6 cells. The calculations for each cell are repeated many times to model the passage of time. About 200 floating point operations per cell per time step, or about 10^11 floating point operations per time step. A 10 day forecast with 10 minute resolution => about 1.5x10^14 flop. At 100 Mflops this would take about 17 days; at 1.7 Tflops about 2 minutes; at 17 Tflops about 12 seconds. 11
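Spelled out as a worked calculation, using only the numbers from the slide above (LaTeX notation):

\[ 200\ \mathrm{flop/cell/step} \times 5\times10^{8}\ \mathrm{cells} \approx 10^{11}\ \mathrm{flop/step} \]
\[ 10\ \mathrm{days} \;/\; 10\ \mathrm{min/step} = 1440\ \mathrm{steps}, \qquad 1440 \times 10^{11} \approx 1.5\times10^{14}\ \mathrm{flop} \]

At 100 Mflops (10^8 flop/s) that is about 1.5x10^6 seconds, roughly 17 days, which matches the figure on the slide.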

12 Modeling Motion of Astronomical Bodies (brute force) Each body is attracted to every other body by gravitational forces. The movement of each body can be predicted by calculating the total force experienced by the body. For N bodies, N - 1 forces per body yields roughly N^2 calculations each time step. For a galaxy's worth of stars, brute force => about 10^9 years for one iteration. Using an efficient N log N approximate algorithm => about a year. NOTE: This is closely related to another hot topic: protein folding. 12
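A minimal serial sketch of the brute-force force calculation described above (illustrative only; the data layout and names are my own, not from the slides):

#include <math.h>
#include <stddef.h>

#define G 6.674e-11   /* gravitational constant, SI units */

typedef struct { double x, y, z, mass; } Body;

/* One brute-force force evaluation: every body feels a force from every
   other body, so the work per time step grows as N*(N-1), roughly N^2. */
void accumulate_forces(const Body *b, double (*force)[3], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        force[i][0] = force[i][1] = force[i][2] = 0.0;
        for (size_t j = 0; j < n; j++) {
            if (i == j) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double dz = b[j].z - b[i].z;
            double r2 = dx*dx + dy*dy + dz*dz;
            double r  = sqrt(r2);
            double f  = G * b[i].mass * b[j].mass / r2;
            force[i][0] += f * dx / r;   /* project the force onto each axis */
            force[i][1] += f * dy / r;
            force[i][2] += f * dz / r;
        }
    }
}

Because the inner loop visits every other body, the cost of one step scales as N^2; the N log N approximate methods mentioned on the slide avoid this by treating distant groups of bodies as single masses.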

13 Types of parallelism - two extremes Data parallel: each processor performs the same task on different data. Example - grid problems. Bag of Tasks or Embarrassingly Parallel is a special case. Task parallel: each processor performs a different task. Example - signal processing such as encoding multitrack data. Pipeline is a special case. 13

14 Simple data parallel program Example: integrate a 2-D propagation problem, starting from a partial differential equation and its finite difference approximation. [Diagram: the x-y grid is divided into blocks, one block per processor, PE #0 through PE #7] 14
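A rough serial sketch of the finite-difference update each PE would apply to its own block of rows. A generic 5-point stencil stands in for the actual propagation equation (which is not reproduced in the transcription); NX, the coefficient c, and the halo-exchange comment are assumptions:

#define NX 512          /* global grid width - illustrative value only */

/* One explicit finite-difference sweep over the rows this PE owns.
   In the decomposition pictured above, each PE holds rows ystart..yend
   of the global grid and would exchange its edge rows with neighboring
   PEs (for example with MPI send/receive) before each sweep. */
void sweep(double unew[][NX], const double uold[][NX],
           int ystart, int yend, double c)
{
    for (int j = ystart; j <= yend; j++)
        for (int i = 1; i < NX - 1; i++)
            unew[j][i] = uold[j][i]
                + c * (uold[j][i-1] + uold[j][i+1]
                     + uold[j-1][i] + uold[j+1][i]
                     - 4.0 * uold[j][i]);
}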

15 Typical Task Parallel Application Signal processing: data flows through a pipeline of tasks - Normalize, FFT, Multiply, Inverse FFT. Use one processor for each task; can use more processors if one is overloaded. This is a pipeline. 15

16 Parallel Program Structure [Diagram: after a serial Begin, the program starts a parallel section; each of N workers does its own pieces of work (work 1a-1d, work 2a-2d, ..., work (N)a-(N)d) with a Communicate & Repeat step between pieces; the parallel section then ends and the program Ends] 16

17 Parallel Problems [Diagram: the same structure as the previous slide, annotated with common problems: subtasks don't finish together at the end of a parallel section, a serial section in which no parallel work is done, and a later parallel section that is not using all processors] 17

18 Theoretical upper limits All parallel programs contain parallel sections and serial sections. Serial sections are where work is being duplicated or no useful work is being done (waiting for others). Serial sections limit the parallel effectiveness: if you have a lot of serial computation then you will not get good speedup. No serial work allows perfect speedup. Amdahl's Law states this formally. 18

19 Amdahl's Law Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors. Effect of multiple processors on run time: t_p = (f_p/N + f_s) t_s. Effect of multiple processors on speedup: S = t_s/t_p = 1/(f_p/N + f_s), where f_s = serial fraction of code, f_p = parallel fraction of code, and N = number of processors. Perfect speedup would be t = t_1/N, or S(N) = N. 19
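A tiny C sketch of that formula (the parallel fraction and processor counts below are made-up illustrative values):

#include <stdio.h>

/* Amdahl's Law: speedup S = 1 / (fp/N + fs), where fs = 1 - fp is the
   serial fraction of the code and N is the number of processors. */
static double amdahl_speedup(double fp, int n)
{
    double fs = 1.0 - fp;
    return 1.0 / (fp / n + fs);
}

int main(void)
{
    double fp = 0.95;                        /* assume 95% of the work parallelizes */
    int procs[] = { 1, 2, 4, 8, 16, 32, 64 };

    for (int i = 0; i < 7; i++)
        printf("N = %3d  speedup = %6.2f\n", procs[i], amdahl_speedup(fp, procs[i]));
    return 0;
}

Even with only 5% serial work the speedup flattens out far below N, which is what the next two slides illustrate.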

20 Illustration of Amdahl's Law It takes only a small fraction of serial content in a code to degrade the parallel performance. 20

21 Amdahl's Law vs. Reality Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications. In reality, communications will result in a further degradation of performance. [Plot: speedup vs. number of processors for a fixed f_p, with the measured Reality curve falling below the Amdahl's Law curve] 21

22 Hardware 22

23 Some Classes of machines Distributed Memory [Diagram: several processors, each with its own memory, connected by a network] Processors only have access to their local memory and talk to other processors over a network. 23

24 Some Classes of machines Uniform Shared Memory (UMA) [Diagram: several processors all attached to a single shared memory] All processors have equal access to the memory and can talk via memory. 24

25 Some Classes of machines Hybrid Shared memory nodes connected by a network... 25

26 Some Classes of machines More common today Each node has a collection of multicore chips... Mio has 48 nodes 32 have 8 cores/node (4/socket) 16 have 12 cores/node (6/socket) 26

27 Some Classes of machines Hybrid Machines Add special purpose processors (FPGA, GPU, Vector, Cell...) to normal CPUs. Not a new concept, but regaining traction. Example: our Tesla Nvidia node, cuda1. 27

28 Programming Message passing Threads/Shared Memory Manual 28

29 Message Passing Most common method for large scale applications Usually the same program running on many cores Processes communicate by passing messages Let's look at a silly example... 29

30 Message Passing Message Passing Interface (MPI) library is often used Works on shared and distributed memory machines Each process is given a number (0 to N-1) Most parallel machines ship with MPI Can be built from source 30

31 Fortran MPI Program

program send_receive
  include "mpif.h"
  integer myid, ierr, numprocs, tag, source, destination, count
  integer buffer
  integer status(MPI_STATUS_SIZE)
  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
  tag = 1234
  count = 1
  if (myid .eq. 0) then
    buffer = 5678
    call MPI_Send(buffer, count, MPI_INTEGER, 1, &
                  tag, MPI_COMM_WORLD, ierr)
    write(*,*) "processor ", myid, " sent ", buffer
  endif
  if (myid .eq. 1) then
    call MPI_Recv(buffer, count, MPI_INTEGER, 0, &
                  tag, MPI_COMM_WORLD, status, ierr)
    write(*,*) "processor ", myid, " got ", buffer
  endif
  call MPI_FINALIZE(ierr)
  stop
end

31

32 C MPI Program

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int myid, numprocs, tag, source, destination, count, buffer;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    tag = 1234;
    count = 1;
    if (myid == 0) {
        buffer = 5678;
        MPI_Send(&buffer, count, MPI_INT, 1, tag, MPI_COMM_WORLD);
        printf("processor %d sent %d\n", myid, buffer);
    }
    if (myid == 1) {
        MPI_Recv(&buffer, count, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("processor %d got %d\n", myid, buffer);
    }
    MPI_Finalize();
}

32

33 Threads and Shared Memory Threads are sometimes called light weight processes There is some mapping of threads to cores Often 1-1 mapping Cheap to start, sleep, wake Threads can move between cores GPU can have many threads for each core Threads share memory on a multicore machine Communicate via shared memory 33

34 Threads Programming Fork-join model. Pthreads is a threading package for manually writing threaded programs. In theory threading can be done automatically by compilers. Semiautomatic: directive-driven threading - directives tell the compiler where it can multithread. OpenMP is the most common directive set (a minimal example follows; larger ones are on the next two slides). It is possible to combine MPI and OpenMP in a single program. 34
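A minimal sketch of directive-driven threading in C (the loop and values are made up for illustration; compile with the OpenMP flag, e.g. icc -openmp or gcc -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double sum = 0.0;

    /* The directive tells the compiler this loop may be split across
       threads; the reduction clause keeps the partial sums correct. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1.0);

    printf("threads available: %d, sum = %f\n",
           omp_get_max_threads(), sum);
    return 0;
}

Without the OpenMP flag the pragma is simply ignored and the code runs serially, which is one of the attractions of the directive approach.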

35 Fortran OpenMP

!$OMP PARALLEL DO
do i = 1, maxthreads
   seed(i) = i + i**2
   errcode(i) = vslnewstream(stream(i), brng, seed(i))
!  write(*,*) i, errcode(i)
enddo

!$OMP PARALLEL DO PRIVATE(twod)
do i = 1, nrays
   twod => tarf(:,:,i)
   j = omp_get_thread_num() + 1
   errcode(j) = vdrnggaussian(method, stream(j), msize*msize, twod, a, sigma)
!  write(*,*) "sub array ", i, j, errcode(j), twod(1,1)
enddo

!$OMP PARALLEL DO PRIVATE(twod)
do i = 1, nrays
   twod => tarf(:,:,i)
!  write(*,*) twod
   call my_clock(cnt1(i))
   call DGESV(N, NRHS, twod, LDA, IPIVs(:,i), Bs(:,i), LDB, INFOs(i))
   call my_clock(cnt2(i))
   write(*,'(i5,i5,3(f12.3))') i, INFOs(i), cnt2(i), cnt1(i), real(cnt2(i)-cnt1(i), b8)
enddo

35

36 C OpenMP

#pragma omp parallel sections
{
    #pragma omp section
    {
        system_clock(&t1_start);
        over(m1, n);
        system_clock(&t1_end);
        t1_start = t1_start - t0_start;
        t1_end   = t1_end   - t0_start;
    }
    #pragma omp section
    {
        system_clock(&t2_start);
        over(m2, n);
        system_clock(&t2_end);
        t2_start = t2_start - t0_start;
        t2_end   = t2_end   - t0_start;
    }
    ...
}

A different thread and core will be assigned to run each section 36

37 Manual Parallel Programming Log on to a machine with N cores. Launch N separate tasks, putting each in the background. Usually the N tasks are independent. Each task launched could actually be a pipeline of processes or a multithreaded program, for example Madagascar. 37

38 Golden Energy Computing Organization GECO - a quick overview 38

39 GECO Golden Energy Computing Organization A center intended to be a national hub for computational inquiries aimed at the discovery of new ways to meet the energy needs of our society. GECO's main computational resource is RA. 39

40 GECO HPC Hardware/Software: Ra Architecture A Dell system with Intel quad-core, dual-socket nodes: 2144 processing cores in 268 nodes. 256 nodes with 512 Clovertown E5355 (2.67 GHz) processors (quad core, dual socket); 184 of these nodes have 16 Gbytes of RAM and 72 have 32 Gbytes. 12 nodes with 48 Xeon 7140M (3.4 GHz) processors (quad socket, dual core), 32 Gbytes each. Memory: 5,632 Gbytes of RAM (5.6 terabytes), 16/32 gigabytes per node; 300 terabyte disk; 300 terabyte tape backup. Performance: 18 Tflop sustained, 23 Tflop peak - like every human on the planet doing 2500 calculations per second. 40

41 [Figure only] 41

42 RA problems Only for Energy Science Allocation is by proposal People wanted quick access and RA is heavily loaded Little student access 42

43 Mio.Mines.Edu - It's All Mine 43

44 Mio.mines.edu A new concept in HPC for CSM: researchers own the nodes. The school puts up the money for infrastructure; researchers purchase individual nodes and can use others' nodes when they are not in use. Started with the head node and 4 compute nodes, 32 cores. Current (10/28/10): 48 nodes, 472 cores, 5.53 Tflop. 44

45 Original Mio Compute Nodes Penguin nodes: 2 x Dual Intel Xeon X5570 Quad Core 2.93GHz, 8MB cache, max RAM speed 1333MHz; up to 2 x 48GB DDR3-800 REG, ECC (24 x 4GB); 2 x 160GB SATA 7200RPM disks; 2 x Intel Xeon dual-socket motherboards with integrated Infiniband DDR/CX4 connections. Half the size of RA nodes, more efficient in power and computation, more memory, faster clock. Have since gone to 12-core nodes and 1 Tbyte disks. 45

46 Oct. 28, 2010 Total Installed: 472 cores 5.53 Tflop 46

47 Node Summary [Table: node ranges with cores per node, RAM in GB, and free /scratch disk space in GB] 47

48 Mio: Installing a 200 Tbyte Panasas file system Previously used by BP and donated to CSM. Panasas gave us a one-shot deal on the software. It was working, but we need to rebuild it after a switch upgrade. It will be available in parallel to all nodes. 48

49 File Systems on Mio Four partitions: $HOME - should be kept very small, holding only startup scripts and other simple scripts. $DATA - should contain programs you have built for use in your research, small data sets, and run scripts. $SCRATCH - the main area for running applications; output from parallel runs should be written to this directory. $SETS - this area is for large data sets, mainly read-only, that will be used over a number of months. 49

50 What's in a Name? The name Mio is a play on words. It is a Spanish translation of the word "mine" as in "belongs to me," not the hole in the ground. The phrase "The computer is mine." can be translated as "El ordenador es mío." 50

51 Mio.mines.edu [Diagram: head node, compute nodes, Tux, network switch, and the Panasas file system] 51

52 Mio Links Picture of usage Documentation Nodes in use Jobs running 52

53 Getting on to Mio Address: mio.mines.edu. Only available on campus, using VPN, or through another on-campus machine (such as RA). You must use an ssh client. On Unix and OSX, ssh is built in (OSX: /HD/Applications/Utilities/Terminal). On Windows use putty, installed on Mines machines under All Applications\putty. 53

54 ssh usage We strongly suggest that you set up and use ssh keys: a nontrivial pass phrase plus a connection agent, so that you only need to type the pass phrase once every 4 hours. Good security, easy access. See the Mio documentation pages for setup instructions. 54

55 ssh usage For today Passwords on Mio.mines.edu are the same as your Mines MultiPass password; that is, the same as on the standard unix boxes on campus such as imagine or illuminate. If you don't have a MultiPass password or need to reset it, see the campus MultiPass pages. 55

56 First login You may see:

Creating directory '/u/is/az/joeminer'.
Creating directory '/u/is/az/joeminer/.mozilla'.
Creating directory '/u/is/az/joeminer/.mozilla/extensions'.
Creating directory '/u/is/az/joeminer/.mozilla/plugins'.
Creating directory '/u/is/az/joeminer/.kde'.
Creating directory '/u/is/az/joeminer/.kde/autostart'.
It doesn't appear that you have set up your ssh key.
This process will make the files:
    /u/is/az/joeminer/.ssh/id_rsa.pub
    /u/is/az/joeminer/.ssh/id_rsa
    /u/is/az/joeminer/.ssh/authorized_keys
Generating public/private rsa key pair.
Enter file in which to save the key (/u/is/az/joeminer/.ssh/id_rsa):
Created directory '/u/is/az/joeminer/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /u/is/az/joeminer/.ssh/id_rsa.
Your public key has been saved in /u/is/az/joeminer/.ssh/id_rsa.pub.
The key fingerprint is:
55:bf:89:4d:03:4d:da:21:84:a9:e8:b2:b9:3a:bd:fe
~]$

Hit return until you see the final prompt 56

57 Mio Environment When you ssh to Mio you are on the head node, a shared resource. DO NOT run parallel, compute intensive, or memory intensive jobs on the head node. You have access to basic Unix commands (editors, file commands, gcc...), but you need to set up your environment to gain access to the HPC compilers. 57

58 Mio HPC Environment You want access to the parallel and high performance compilers (Intel, AMD, mpicc, mpif90) and to commands to run and monitor parallel programs. The Mio documentation gives the details; a shortcut is shown on the next slide. 58

59 Shortcut Setup Add the following to your .bashrc file and log out/in:

if [ -f /usr/local/bin/setup/setup ]; then source /usr/local/bin/setup/setup intel ; fi

This gives you the compilers (Intel 11.x, AMD, Portland Group, NAG), the Openmpi parallel MPI environment (MPI compilers and MPI run commands), Python, and more. It should give you:

[tkaiser@mio ~]$ which mpicc
/opt/lib/openmpi/1.4.2/intel/11.1/bin/mpicc
[tkaiser@mio ~]$

59

60 Getting the Examples

tests]$ cd $DATA
tests]$ mkdir cwp
tests]$ cd cwp
cwp]$ wget <cwp.tar URL>
:31:06 (48.6 MB/s) - `cwp.tar' saved [20480/20480]
[tkaiser@mio cwp]$ tar -xf *tar

60

61 A digression - Makefiles Makefiles are instructions for building a program. Normal names are makefile or Makefile. In a directory that contains a makefile, type the command make. They can be very complicated and can have sub-commands. See the make documentation for more. 61

62 Our Makefile

all: c_ex01 f_ex01 pointer invertc

c_ex01: c_ex01.c
	mpicc -o c_ex01 c_ex01.c

f_ex01: f_ex01.f90
	mpif90 -o f_ex01 f_ex01.f90

pointer: pointer.f90
	ifort -openmp pointer.f90 -mkl -o pointer

invertc: invertc.c
	icc -openmp invertc.c -o invertc

clean:
	rm -rf c_ex01 f_ex01 *mod err* out* mynodes* pointer invertc

tar:
	tar -cf introduction.tar makefile batch1 c_ex01.c f_ex01.f90 pointer.f90 invertc.c batch2

62

63 The End 63
