CME342 - Parallel Methods in Numerical Analysis
1 CME342 - Parallel Methods in Numerical Analysis Parallel Architectures April 2, 2014 Lecture 2
2 Announcements
1. Subscribe to the mailing list. Go to lists.stanford.edu and follow the directions in the 1st handout; subscribe to the list named cme342-class.
2. Sign your name up on the list I am passing around if you have not done so. If you are not on the list, you will not have preferred access to cluster resources / GPUs.
3. If you missed lecture notes or handouts, they are on the web page now.
3 Some units
1 MFlop = 10^6 flops/sec    1 MByte = 10^6 bytes
1 GFlop = 10^9 flops/sec    1 GByte = 10^9 bytes
1 TFlop = 10^12 flops/sec   1 TByte = 10^12 bytes
4 Performance goals 4
5 Microprocessor performance 5
6 Top 500 List November
7 Top 500 List Historical Trends 7
8 Parallel Computers
A parallel computer is a collection of CPUs / processing units that cooperate to solve a problem. It can solve a problem faster, or simply solve a bigger problem. How large is the collection? How is the memory organized? How do the units communicate and transfer information? This is not a course on parallel architectures: this lecture is just an overview. Still, you need a certain knowledge of the underlying hardware to use parallel computers efficiently.
9 Categorization of Parallel Architectures Control mechanism: instruction stream and data stream Process granularity Address space organization Interconnection network Static Dynamic 9
10 Control mechanism (Flynn's taxonomy)
SISD: Single Instruction stream, Single Data stream
SIMD: Single Instruction stream, Multiple Data streams
MIMD: Multiple Instruction streams, Multiple Data streams
MISD: Multiple Instruction streams, Single Data stream
11 SIMD Multiple processing elements are under the supervision of a control unit Thinking Machine CM-2, MasPar MP-2, Quadrics SIMD extensions are now present in commercial microprocessors (MMX or SSE in Intel x86, 3DNow in AMD K6 and Athlon, Altivec in Motorola G4) 11
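The SIMD idea can be sketched in plain C: a single instruction stream (the loop body) is applied element-wise to multiple data streams. The function below is an illustrative sketch (the names are ours, not from the lecture); a vectorizing compiler maps such a loop onto MMX/SSE-style vector instructions automatically.

```c
#include <assert.h>

/* Same operation applied to every element of the arrays: the SIMD
 * pattern. Each iteration is independent, so the compiler can issue
 * one vector instruction that processes several elements at once. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```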
12 MIMD Each processing element is capable of executing a different program independent of the other processors Most multiprocessors can be classified in this category 12
13 Process granularity Coarse grain: Cray C90, Fujitsu Medium grain: IBM SP2, CM-5, clusters Fine grain: CM-2, Quadrics, Blue Gene/L? 13
14 Address space organization
Single address space:
- Uniform Memory Access (UMA): SMP
- Non-Uniform Memory Access (NUMA)
Local address spaces:
- Message passing
15 SMP architecture
[Diagram: four CPUs, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O.]
SMP uses shared system resources (memory, I/O) that can be accessed equally from all the processors. Cache coherence is maintained.
16 NUMA architecture
[Diagram: four CPUs, each with its own cache and local memory, connected through a bus or crossbar switch.]
Shared address space. Memory latency varies depending on whether you access local or remote memory. Cache coherence is maintained using a hardware or software protocol.
17 Message-passing / Distributed memory
[Diagram: four CPUs, each with its own cache and local memory, connected by a communication network.]
Local address spaces. No cache coherence.
18 Hybrids?
[Diagram: several nodes, each with a CPU (cache and memory) and a GPU with its own device memory attached over PCIx, connected by a communication network.]
Notice that, currently, a CPU can have many cores that share memory within one processor (and a node may have multiple such processors). In addition, accelerator cards with separate memory can be present, giving heterogeneous architectures (GPUs and others). No cache coherence across nodes.
19 Dynamic interconnections
Crossbar switching: the most expensive and extensive interconnection. [Diagram: processors P1, P2 connected to memories M1, M2 through a crossbar.]
Bus connected: processors are connected to memory through a common datapath.
Multistage interconnection: butterfly, Omega network, perfect shuffle, etc. [Diagram: butterfly network.]
20 Static interconnection networks Complete interconnection Star interconnection Linear array Mesh: 2D/3D mesh, 2D/3D torus Tree and fat tree network Hypercube network 20
21 Characteristics of static networks
Diameter: the maximum distance between any two processors in the network.
- Complete connection: D = 1
- Linear array: D = N - 1
- Ring: D = N/2
- 2D mesh: D = 2(sqrt(N) - 1)
- 2D torus: D = 2 floor(sqrt(N)/2)
- Hypercube: D = log N
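The diameter formulas above can be written as small helper functions. This is a sketch (function names are ours); it assumes N is a perfect square for the 2D mesh/torus and a power of two for the hypercube.

```c
#include <assert.h>

/* Integer square root (avoids linking the math library). */
static int isqrt(int n)
{
    int s = 0;
    while ((s + 1) * (s + 1) <= n) s++;
    return s;
}

/* Diameter of each static network with N processors. */
int diam_complete(int n)  { (void)n; return 1; }          /* D = 1          */
int diam_linear(int n)    { return n - 1; }               /* D = N - 1      */
int diam_ring(int n)      { return n / 2; }               /* D = N/2        */
int diam_mesh2d(int n)    { return 2 * (isqrt(n) - 1); }  /* 2(sqrt N - 1)  */
int diam_torus2d(int n)   { return 2 * (isqrt(n) / 2); }  /* 2 flr(sqrtN/2) */
int diam_hypercube(int n) { int d = 0; while (n > 1) { n >>= 1; d++; } return d; }
```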
22 Characteristics of static networks
Bisection width: the minimum number of communication links that have to be removed to partition the network in half.
Channel rate: the peak rate at which a single wire can deliver bits.
Channel bandwidth: the product of channel rate and channel width.
Bisection bandwidth B: the product of bisection width and channel bandwidth.
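Putting these definitions together as arithmetic, a minimal sketch (the figures used in the test are made-up illustrative numbers, not from any real machine):

```c
#include <assert.h>

/* Bisection bandwidth B = bisection width * channel bandwidth,
 * where channel bandwidth = channel rate * channel width. */
long bisection_bw(int bisection_width, long channel_rate_bps, int channel_width_bits)
{
    long channel_bw = channel_rate_bps * (long)channel_width_bits;
    return (long)bisection_width * channel_bw;
}
```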
23 Linear array, ring, mesh, torus
Processors are arranged as a d-dimensional grid or torus.
[Diagram: a linear array, a ring, and a 2D mesh of processors.]
24 Tree, Fat-tree Tree network: there is only one path between any pair of processors. Fat tree network: increase the number of communication links close to the root 24
25 Hypercube 1-D 2-D 3-D 25
26 Binary Reflected Gray code
G(i,d) denotes the i-th entry in a sequence of Gray codes of d bits. G(i,d+1) is derived from G(i,d) by reflecting the table and prefixing the reflected entries with 1 and the original entries with 0.
27 Example of BRG code
1-bit: 0, 1
2-bit: 00, 01, 11, 10
3-bit: 000, 001, 011, 010, 110, 111, 101, 100
[Table also shows the mapping of an 8-processor ring onto an 8-processor hypercube.]
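The reflect-and-prefix construction has a well-known closed form: the i-th entry of the d-bit Gray code sequence is i XOR (i >> 1). A minimal sketch:

```c
#include <assert.h>

/* Binary reflected Gray code, closed form: G(i) = i ^ (i >> 1).
 * Equivalent to the reflect-and-prefix construction above. */
unsigned gray(unsigned i)
{
    return i ^ (i >> 1);
}
```

For 3 bits this reproduces the sequence 000, 001, 011, 010, 110, 111, 101, 100.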
28 Embedding other networks into hypercubes
The hypercube is a rich topology; many other networks can be easily mapped onto it.
Mapping a linear array onto a hypercube: a linear array (or ring) of 2^d processors can be embedded into a d-dimensional hypercube by mapping processor i onto processor G(i,d) of the hypercube.
Mapping a 2^r x 2^s mesh onto a hypercube: processor (i,j) --> G(i,r) || G(j,s), where || denotes concatenation.
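The ring embedding can be checked mechanically: consecutive ring nodes must map to hypercube nodes at Hamming distance 1, i.e. hypercube neighbours. A sketch (function names are ours), here for d = 3:

```c
#include <assert.h>

/* Binary reflected Gray code, closed form. */
unsigned gray_code(unsigned i) { return i ^ (i >> 1); }

/* Number of set bits (Hamming weight). */
int popcount_u(unsigned x)
{
    int c = 0;
    while (x) { c += x & 1u; x >>= 1; }
    return c;
}

/* Returns 1 if ring neighbours i and i+1 (mod 2^d) land on
 * hypercube nodes that differ in exactly one bit. */
int ring_edge_ok(unsigned i, unsigned d)
{
    unsigned n = 1u << d;
    return popcount_u(gray_code(i) ^ gray_code((i + 1) % n)) == 1;
}
```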
29 Trade-off among different networks

Network               Min latency   Max BW per proc   Wires        Switches     Example
Completely connected  Constant      Constant          O(p^2)       -            -
Crossbar              Constant      Constant          O(p)         O(p^2)       Cray
Bus                   Constant      O(1/p)            O(p)         O(p)         SGI Challenge
Mesh                  O(sqrt p)     Constant          O(p)         -            Intel ASCI Red
Hypercube             O(log p)      Constant          O(p log p)   -            SGI Origin
Switched              O(log p)      Constant          O(p log p)   O(p log p)   IBM SP-2
30 Beowulf
Cluster built with commodity hardware components:
- PC hardware (x86, Alpha, PowerPC)
- Commercial high-speed interconnect (100Base-T, Gigabit Ethernet, Myrinet, SCI, InfiniBand)
- Linux or FreeBSD operating system
31 Appleseed: PowerPC cluster 31
32 Clusters of SMPs
The next generation of supercomputers will have thousands of SMP nodes connected: increase the computational power of the single node, keep the number of nodes low. A new programming approach is needed: MPI + threads (OpenMP, Pthreads, ...). Examples: ASCI White, Compaq SC, IBM SP.
33 Multithreaded architecture
The Tera MTA system (now Cray) provides scalable shared memory, in which every processor has equal access to every memory location: no concerns about the layout of memory. Each MTA processor has up to 128 RISC-like virtual processors. Each virtual processor is a hardware stream with its own instruction counter, register set, stream status word, and target and trap registers. A different hardware stream is activated every clock period.
34 Earth Simulator
For a time, the Earth Simulator was the most powerful supercomputer in the world: 40 TFlops peak, 10 TBytes of memory. It uses a crossbar switch to connect 640 nodes: 3,000 km of cables! Each node has 8 vector processors. Sustained performance of up to 20 TFlops on climate simulation and 15 TFlops on a 4096^3 isotropic turbulence simulation.
35 BlueGene/L
For a time, BGL was the #1 computer on the TOP500 supercomputer list.
- 32 x 32 x 64 3D torus, 131,000 processors
- System on a chip: low-cost, low-power processors
- 360 teraflops peak, 280 teraflops sustained (Linpack)
- 32 tebibytes of memory
36 Tianhe-1 and -2
National Super Computer Center in Guangzhou
- 3,120,000 cores: Intel Xeon E5-2692 (2.2 GHz) plus Intel Xeon Phi 31S1P accelerator cards
- TH Express-2 interconnect
- Linpack performance: 33.8 PFlop/s
- Power: 18.8 MWatt
- Intel CC compiler, Intel MKL; MPICH2 for communication
37 Programming models
Shared memory:
- Automatic parallelization
- Pthreads
- Compiler directives: OpenMP
Message passing:
- MPI: Message Passing Interface
- PVM: Parallel Virtual Machine
- HPF: High Performance Fortran
38 Pthreads
POSIX threads: a standard definition, but non-standard implementations. Hard to code.
39 More Recent Models
Extracting high performance from Graphics Processing Units (GPUs):
- CUDA for NVIDIA cards
- OpenCL for CPUs / GPUs / DSPs / FPGAs
- OpenACC for heterogeneous systems
And the list goes on... What should you do?
40 OpenMP New de-facto standard available on all major platforms Easy to implement Single source for parallel and serial code 40
41 MPI
Standard parallel interface: MPI-1, and MPI-2, which extends MPI-1 with one-sided communication. You need to rewrite the code, but the code is then portable across all architectures.
42 PVM
Parallel Virtual Machine: another popular message-passing interface. Useful in environments with multiple vendors and for the MPMD approach.
43 Simple code to compute π

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
44 OpenMP code

      program compute_pi
      integer n, i
      double precision w, x, sum, pi, f, a
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      print *, 'Enter number of intervals: '
      read *, n
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
!$OMP PARALLEL DO PRIVATE(x), SHARED(w)
!$OMP& REDUCTION(+: sum)
      do i = 1, n
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      end do
!$OMP END PARALLEL DO
      pi = w * sum
      print *, 'computed pi = ', pi
      stop
      end
45 MPI code

      program compute_pi
      include 'mpif.h'
      double precision mypi, pi, w, sum, x, f, a
      integer n, myid, numprocs, i, rc, ierr
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
      if (myid .eq. 0) then
         print *, 'Enter number of intervals: '
         read *, n
      endif
      call MPI_BCAST(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
c     calculate the interval size
      w = 1.0d0/n
      sum = 0.0d0
      do i = myid+1, n, numprocs
         x = w * (i - 0.5d0)
         sum = sum + f(x)
      enddo
      mypi = w * sum
46 MPI code (continued)

c     collect all the partial sums
      call MPI_REDUCE(mypi, pi, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0,
     $                MPI_COMM_WORLD, ierr)
c     node 0 prints the answer
      if (myid .eq. 0) then
         print *, 'computed pi = ', pi
      endif
      call MPI_FINALIZE(rc)
      stop
      end
47 Pthreads code

#include <stdio.h>
#include <unistd.h>
#include <sys/times.h>
#include <pthread.h>
#include <stdlib.h>

float sumall, width;
int i, iend;
/* the following should be thread-private in a correct version */
int ibegin;
int cut;
float x;
float xsum;
void *do_work();

main(argc, argv)
int argc;
char *argv[];
{
    /* Pi - program loops over slices in interval, summing the area of each slice */
    struct tms time_buf;
    float ticks, t1, t2;
    int intrvls, numthreads, istart;
    pthread_t pt[32];

    /* get intervals from command line */
    intrvls = atoi(argv[1]);
    numthreads = atoi(getenv("PL_NUM_THREADS"));
    printf(" intervals = %d PL_NUM_THREADS = %d \n", intrvls, numthreads);

    /* get number of clock ticks per second and initialize timer */
    ticks = sysconf(_SC_CLK_TCK);
    t2 = times(&time_buf);

    /* compute width of cuts */
    width = 1. / intrvls;
    sumall = 0.0;
48 Pthreads code (continued)

    /* loop over interval, summing areas */
    istart = 1;
    iend = intrvls / numthreads;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_create(&pt[i], NULL, do_work, (void *) istart);
        istart += iend;
    }
    do_work(istart);
    istart += iend;
    for (i = 0; i < numthreads - 1; i++)
    {
        pthread_join(pt[i], NULL);
    }
    /* finish any remaining slices */
    iend = intrvls - (intrvls / numthreads) * numthreads;
    if (iend) do_work(istart);
    /* finish overall timing and write results */
    t1 = times(&time_buf);
    printf("time in main = %20.14e sum = %20.14f \n", (t1 - t2) / ticks, sumall);
    printf("error = %20.15e \n", sumall);
}

void *do_work(istart)
int istart;
{
    ibegin = istart;
    xsum = 0.0;
    for (cut = ibegin; cut < ibegin + iend; cut++)
    {
        x = (((float) cut) - .5) * width;
        xsum += width * 4. / (1. + x * x);
    }
    sumall += xsum;   /* unsynchronized update: needs a mutex in a correct version */
    return NULL;
}
49 PVM code 1/3

      program compute_pi_master
      include '~/pvm3/include/fpvm3.h'
      parameter (NTASKS = 5)
      parameter (INTERVALS = 1000)
      integer mytid
      integer tids(NTASKS)
      real sum, area
      real width
      integer i, numt, msgtype, bufid, bytes, who, info

      sum = 0.0
C     Enroll in PVM
      call pvmfmytid(mytid)
C     spawn off NTASKS workers
      call pvmfspawn('comppi.worker', PVMDEFAULT, ' ',
     +               NTASKS, tids, numt)
      width = 0.0
      i = 0
C     Multi-cast initial dummy message to workers
      msgtype = 0
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)
      call pvmfmcast(NTASKS, tids, msgtype, info)
C     compute interval width
      width = 1.0 / INTERVALS
C     for each interval: 1) receive area from a worker, 2) add area to sum,
C     3) send the worker a new interval number and width
      do i = 1, INTERVALS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfinitsend(PVMDATADEFAULT, info)
         call pvmfpack(INTEGER4, i, 1, 1, info)
         call pvmfpack(REAL4, width, 1, 1, info)
         call pvmfsend(who, msgtype, info)
      enddo
50 PVM code 2/3

C     Signal to workers that tasks are done
      i = -1
      call pvmfinitsend(0, info)
      call pvmfpack(INTEGER4, i, 1, 1, info)
      call pvmfpack(REAL4, width, 1, 1, info)

C     Collect the last NTASKS areas and send the completion signal
      do i = 1, NTASKS
         call pvmfrecv(-1, -1, bufid)
         call pvmfbufinfo(bufid, bytes, msgtype, who, info)
         call pvmfunpack(REAL4, area, 1, 1, info)
         sum = sum + area
         call pvmfsend(who, msgtype, info)
      enddo

      print 10, sum
 10   format(x, 'Computed value of Pi is ', F8.6)

      call pvmfexit(info)
      end
51 PVM code 3/3 (worker)

      include '~/pvm3/include/fpvm3.h'
      integer mytid, master, info
      real area
      real width, int_val, height
      integer int_num
C     Enroll in PVM
      call pvmfmytid(mytid)
C     who is sending me work?
      call pvmfparent(master)

C     receive first job from the master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)

C     While I've not been sent the signal to quit, keep processing
 40   if (int_num .eq. -1) goto 50
C     compute interval value from interval number
      int_val = int_num * width
C     compute height of given rectangle
      height = F(int_val)

C     compute area
      area = height * width

C     send area back to master
      call pvmfinitsend(PVMDATADEFAULT, info)
      call pvmfpack(REAL4, area, 1, 1, info)
      call pvmfsend(master, 9, info)

C     Wait for next job from master
      call pvmfrecv(-1, -1, info)
      call pvmfunpack(INTEGER4, int_num, 1, 1, info)
      call pvmfunpack(REAL4, width, 1, 1, info)
      goto 40
C     all done
 50   call pvmfexit(info)
      end

      REAL FUNCTION F(X)
      REAL X
      F = 4.0 / (1.0 + X**2)
      RETURN
      END
Parallel Systems Prof. James L. Frankel Harvard University Version of 6:50 PM 4-Dec-2018 Copyright 2018, 2017 James L. Frankel. All rights reserved. Architectures SISD (Single Instruction, Single Data)
More information30 Nov Dec Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy
Advanced School in High Performance and GRID Computing Concepts and Applications, ICTP, Trieste, Italy Why serial is not enough Computing architectures Parallel paradigms Message Passing Interface How
More informationLecture 3: Intro to parallel machines and models
Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class
More informationCOSC 6385 Computer Architecture - Thread Level Parallelism (I)
COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month
More informationOutline. Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples
Outline Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples OVERVIEW y What is Parallel Computing? Parallel computing: use of multiple processors
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is connected
More informationComputer Systems Architecture
Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student
More informationParallel Computing Platforms. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Elements of a Parallel Computer Hardware Multiple processors Multiple
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationChapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.
Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE
More informationSHARED MEMORY VS DISTRIBUTED MEMORY
OVERVIEW Important Processor Organizations 3 SHARED MEMORY VS DISTRIBUTED MEMORY Classical parallel algorithms were discussed using the shared memory paradigm. In shared memory parallel platform processors
More informationHigh Performance Computing (HPC) Introduction
High Performance Computing (HPC) Introduction Ontario Summer School on High Performance Computing Scott Northrup SciNet HPC Consortium Compute Canada June 25th, 2012 Outline 1 HPC Overview 2 Parallel Computing
More informationHigh Performance Computing - Parallel Computers and Networks. Prof Matt Probert
High Performance Computing - Parallel Computers and Networks Prof Matt Probert http://www-users.york.ac.uk/~mijp1 Overview Parallel on a chip? Shared vs. distributed memory Latency & bandwidth Topology
More informationParallel Computing Platforms
Parallel Computing Platforms Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu SSE3054: Multicore Systems, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)
More informationParallel and Distributed Systems. Hardware Trends. Why Parallel or Distributed Computing? What is a parallel computer?
Parallel and Distributed Systems Instructor: Sandhya Dwarkadas Department of Computer Science University of Rochester What is a parallel computer? A collection of processing elements that communicate and
More informationIntroduction to Parallel Computing
Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen
More information6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT
6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among
More informationCSE Introduction to Parallel Processing. Chapter 4. Models of Parallel Processing
Dr Izadi CSE-4533 Introduction to Parallel Processing Chapter 4 Models of Parallel Processing Elaborate on the taxonomy of parallel processing from chapter Introduce abstract models of shared and distributed
More informationEE 4683/5683: COMPUTER ARCHITECTURE
3/3/205 EE 4683/5683: COMPUTER ARCHITECTURE Lecture 8: Interconnection Networks Avinash Kodi, kodi@ohio.edu Agenda 2 Interconnection Networks Performance Metrics Topology 3/3/205 IN Performance Metrics
More informationIntroduction to Parallel Programming
Introduction to Parallel Programming January 14, 2015 www.cac.cornell.edu What is Parallel Programming? Theoretically a very simple concept Use more than one processor to complete a task Operationally
More informationIntroduction: Modern computer architecture. The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes
Introduction: Modern computer architecture The stored program computer and its inherent bottlenecks Multi- and manycore chips and nodes Motivation: Multi-Cores where and why Introduction: Moore s law Intel
More informationChapter Seven. Idea: create powerful computers by connecting many smaller ones
Chapter Seven Multiprocessors Idea: create powerful computers by connecting many smaller ones good news: works for timesharing (better than supercomputer) vector processing may be coming back bad news:
More informationChapter 9 Multiprocessors
ECE200 Computer Organization Chapter 9 Multiprocessors David H. lbonesi and the University of Rochester Henk Corporaal, TU Eindhoven, Netherlands Jari Nurmi, Tampere University of Technology, Finland University
More informationGeneral introduction: GPUs and the realm of parallel architectures
General introduction: GPUs and the realm of parallel architectures GPU Computing Training August 17-19 th 2015 Jan Lemeire (jan.lemeire@vub.ac.be) Graduated as Engineer in 1994 at VUB Worked for 4 years
More informationCSC630/CSC730: Parallel Computing
CSC630/CSC730: Parallel Computing Parallel Computing Platforms Chapter 2 (2.4.1 2.4.4) Dr. Joe Zhang PDC-4: Topology 1 Content Parallel computing platforms Logical organization (a programmer s view) Control
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationDheeraj Bhardwaj May 12, 2003
HPC Systems and Models Dheeraj Bhardwaj Department of Computer Science & Engineering Indian Institute of Technology, Delhi 110 016 India http://www.cse.iitd.ac.in/~dheerajb 1 Sequential Computers Traditional
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationMultiprocessors - Flynn s Taxonomy (1966)
Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The
More informationIntroduction. CSCI 4850/5850 High-Performance Computing Spring 2018
Introduction CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University What is Parallel
More informationOpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016
OpenACC. Part I Ned Nedialkov McMaster University Canada October 2016 Outline Introduction Execution model Memory model Compiling pgaccelinfo Example Speedups Profiling c 2016 Ned Nedialkov 2/23 Why accelerators
More informationInterconnection Network
Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network
More information