Goals of this Course


CISC 849: High Performance Parallel Algorithms for Computational Science
Instructor: Dr. Michela Taufer, Spring 2009

Goals of this Course
This course is intended to provide students with an understanding of parallelization with MPI and OpenMP. Case studies for parallelization include Molecular Dynamics and Monte Carlo simulations, their principles, and their sequential and parallel algorithms. Emphasis is placed on the algorithmic and code components of these simulations, their performance analysis, and their scalability. (From the syllabus)

Course Topics
- Parallel programming: parallel architectures; parallel programming with the Message Passing Interface (MPI); parallel programming with OpenMP
- Case study I: Molecular Dynamics (MD) simulations; parallelization of the MD algorithm with MPI and OpenMP
- Case study II: Monte Carlo simulations; parallelization of the Monte Carlo algorithm with MPI
- Hybrid parallelism: combining MPI and OpenMP

Course Information and Deadlines
- Webpage:
- Mailing list: cisc849010_sp09@gcl.cis.udel.edu
- Access to course material: user cisc849student, password Work4Fun!
- Schedule: download it from the course webpage. It is a tentative schedule!
- Syllabus: download it from the course webpage. Read it carefully!

Books
- Parallel Programming with MPI by Peter Pacheco
- Parallel Programming in C with MPI and OpenMP by Michael J. Quinn

- Parallel Programming in OpenMP by Rohit Chandra, Leo Dagum, Dave Kohr, Dror Maydan, Jeff McDonald, Ramesh Menon
- The Art of Molecular Dynamics Simulation by D.C. Rapaport, Cambridge University Press

- Molecular Modeling: Principles and Applications by A.R. Leach, Pearson

Modern Scientific Method (I)
[Figure: the classical scientific method: Nature -> Observation -> Physical experiments and models -> Theory]

Modern Scientific Method (II)
[Figure: contemporary science adds numerical simulations alongside physical experiments and models, for cases where experiments are expensive, time-consuming, unethical, or impossible: Nature -> Observation -> Numerical simulations / Physical experiments and models -> Theory]

Grand Challenges
Grand challenges are complex scientific problems:
- Quantum chemistry, statistical mechanics, and relativistic physics
- Cosmology and astrophysics
- Computational fluid dynamics and turbulence
- Biology, pharmacology, genome sequencing, protein folding, and cell modeling
- Global weather and environmental modeling
They require extraordinarily powerful computers when solved via numerical simulations: they need more computational power, and they benefit from parallel computing.

What is Parallel Computing?
Parallel computing: the use of multiple processors or computers working together to solve a single computational problem.
- Each processor works on its section of the problem
- Processors can exchange information
[Figure: a 2-D grid (x, y) of the problem to be solved, partitioned into four areas; CPUs #1-#4 each work on one area and exchange data at the boundaries]

Why Do Parallel Computing?
Limits of single-CPU computing: performance and available memory.
Parallel computing allows one to:
- solve problems that don't fit on a single CPU
- solve problems that can't be solved in a reasonable time
We can solve larger problems, faster, and more cases.

Example: Weather Modeling and Forecasting
Modeling a hurricane region:
- Assume the region of interest is 1000 x 1000 miles, with a height of 10 miles. Partitioning it into segments of 0.1 x 0.1 x 0.1 miles gives 10^10 grid points.
- Simulate 2 days with 30-minute time steps: about 100 time steps in total.
- Assume the computations at each grid point require 100 instructions. A single time step then requires 10^12 instructions; for two days we need 10^14 instructions.
- On a serial computer executing 10^8 instructions/sec, this takes 10^6 seconds (over 10 days!) to predict the next 2 days.
THIS REQUIRES PARALLELISM FOR PERFORMANCE. It also requires lots of memory, which again implies parallelism. Currently all major weather forecast centers (US, Europe, Asia) have supercomputers with 1000s of processors. (A sketch of this arithmetic appears after the next slide.)

Other Examples
Vehicle design and dynamics; analysis of protein structures; human genome work; quantum chromodynamics; cosmology; ocean modeling; imaging and rendering; petroleum exploration; nuclear weapon design; database query; ozone layer monitoring; natural language understanding; study of chemical phenomena; and many other grand challenge projects.
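As promised above, here is the weather-model arithmetic in runnable form (a minimal C sketch; every number is the slide's assumption):

    #include <stdio.h>

    int main(void) {
        double grid_points  = 1e4 * 1e4 * 1e2; /* 1000x1000x10 miles at 0.1-mile resolution = 10^10 */
        double instr_per_pt = 100.0;           /* assumed instructions per grid point per step */
        double time_steps   = 100.0;           /* ~2 days of 30-minute steps */
        double serial_rate  = 1e8;             /* instructions/sec of the serial machine */

        double total_instr = grid_points * instr_per_pt * time_steps; /* 10^14 */
        double seconds     = total_instr / serial_rate;               /* 10^6 s */
        printf("total instructions: %.0e\n", total_instr);
        printf("serial time: %.0e s (about %.1f days)\n", seconds, seconds / 86400.0);
        return 0;
    }

The program prints roughly 11.6 days; the slide rounds this to "10 days".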

What is a Parallel Computer?
A parallel computer is a computer (or collection of computers) with multiple processors that can work together on solving a complex problem, supporting parallel computing.
- Distributed multiprocessor: a parallel computer constructed out of multiple computers and an interconnection network
- Centralized multiprocessor (or symmetric multiprocessor, SMP): all CPUs share access to a single global memory
How do the processors work together?

Distributed Multiprocessor
[Figure: block diagram of a distributed multiprocessor]

Centralized Multiprocessor
[Figure: block diagram of a centralized multiprocessor]

What is Parallel Programming?
Parallel programming: programming in a language that allows you to explicitly indicate how parts of the computation may be executed in parallel (concurrently). Two routes:
- Entrust the task to compiler technology: the compiler detects and exploits the parallelism in existing code written in sequential languages
- Write your own parallel program: e.g., parallel programs written in C/C++/Fortran with MPI or OpenMP

MPI and OpenMP
MPI (Message Passing Interface): a library specification for message passing, proposed as a standard by a broadly based committee of vendors, implementors, and users. (From the MPI website)
OpenMP: the OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms. (From the OpenMP website)

Single Program, Multiple Data (SPMD)
SPMD is the dominant programming model for shared- and distributed-memory machines:
- One source code is written
- The code can have conditional execution based on which processor is executing the copy
- All copies of the code are started simultaneously and communicate and synchronize with each other periodically
MPMD (Multiple Program, Multiple Data) is more general, and possible in hardware, but no system/programming software enables it.
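As a concrete illustration of SPMD (a minimal sketch, not course material; the file name and messages are invented), every process runs the same executable and branches on its MPI rank:

    /* spmd_hello.c -- compile with: mpicc spmd_hello.c -o spmd_hello
       run with:                     mpirun -np 4 ./spmd_hello        */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size;
        MPI_Init(&argc, &argv);                /* all copies start here */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* who am I?             */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* how many copies run?  */

        if (rank == 0)                         /* conditional execution */
            printf("process 0 of %d: I coordinate\n", size);
        else
            printf("process %d of %d: I work\n", rank, size);

        MPI_Finalize();
        return 0;
    }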

SPMD Programming Model
[Figure: the same source.c is loaded onto Processors 0-3; each processor runs its own copy]

Types of Parallelism: Two Extremes
- Data parallelism: each processor performs the same task on different data
- Task parallelism (or functional parallelism): each processor performs a different task
Most applications fall somewhere on the continuum between these two extremes.

Data Parallel Programming Example
One code will run on 2 CPUs. The program has an array of data to be operated on by the 2 CPUs, so the array is split into two parts:

    program:
      if CPU=a then
        low_limit=1; upper_limit=50
      elseif CPU=b then
        low_limit=51; upper_limit=100
      end if
      do I = low_limit, upper_limit
        work on A(I)
      end do
    end program

CPU A effectively runs the loop with low_limit=1, upper_limit=50; CPU B runs it with low_limit=51, upper_limit=100. (A runnable MPI version is sketched after the next example.)

Task Parallel Programming Example
One code will run on 2 CPUs. The program has 2 tasks (a and b) to be done by the 2 CPUs:

    program.f:
      initialize
      if CPU=a then
        do task a
      elseif CPU=b then
        do task b
      end if
    end program

CPU A initializes and does task a; CPU B initializes and does task b.
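A hedged, runnable C/MPI version of the data-parallel example above (the array contents and the "work" are placeholders; the slide's pseudocode hard-codes 2 CPUs, while this version splits the array among however many processes are started):

    #include <stdio.h>
    #include <mpi.h>

    #define N 100

    int main(int argc, char *argv[]) {
        int rank, size;
        double a[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int chunk = N / size;       /* assumes size divides N evenly        */
        int low   = rank * chunk;   /* with 2 CPUs: rank 0 gets 0..49,      */
        int high  = low + chunk;    /*              rank 1 gets 50..99      */

        for (int i = low; i < high; i++)
            a[i] = 2.0 * i;         /* placeholder for "work on A(I)"       */

        printf("rank %d handled elements %d..%d\n", rank, low, high - 1);
        MPI_Finalize();
        return 0;
    }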

Task Parallelism: Protein Folding
Start from the same initial protein structure and run different MD simulations (independent tasks), producing a final set of folded protein structures. The independent tasks change atom velocities to random seed values while preserving temperature.

Data Parallelism: Protein Folding
One single folding process is performed in parallel.
[Figure: the simulation space is partitioned into four regions, one per machine (PC 0-PC 3)]

Data Dependency Graphs
A formal method to identify parallelism. A data dependency graph is a directed graph:
- Vertices (circles) represent tasks to be completed
- Edges denote dependencies among tasks
- If there is no path between two vertices, then the tasks are independent
- Labels inside the circles represent the kind of task being performed; multiple circles with the same label represent tasks performing the same operation on different operands

Parallelism in Data Dependency Graphs
[Figure: three example graphs: one task A feeding several identical B tasks (data parallelism); one task A feeding different tasks (task parallelism); and a chain A -> B -> C (sequential dependency)]

Pipeline
Divide a process into stages and produce several items simultaneously, e.g., an automobile assembly line.

Pipelining
Given a sequential dependence graph (a chain of tasks or stages), assume that:
- all tasks take the same amount of time
- multiple problem instances need to be processed
Then the output of each functional unit is the input to the next.
[Figure: problem instances i-2, i-1, i, i+1 flowing through stages A -> B -> C]
Examples: the von Neumann model, where the various circuits in the CPU are split up into functional units; an automobile assembly line. (A timing sketch follows below.)
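Under the slide's assumptions (k equal stages of duration t, n independent problem instances), the pipelined time is (k + n - 1)t versus n*k*t sequentially; a small sketch with made-up numbers:

    #include <stdio.h>

    int main(void) {
        int    k = 3;    /* stages A, B, C from the figure      */
        int    n = 100;  /* problem instances                   */
        double t = 1.0;  /* time per stage, arbitrary units     */

        /* first result after k*t, then one new result every t  */
        double pipelined  = (k + n - 1) * t;
        double sequential = (double)n * k * t;
        printf("sequential: %.0f, pipelined: %.0f, speedup: %.2f\n",
               sequential, pipelined, sequential / pipelined);
        return 0;
    }

As n grows, the speedup approaches k, the number of stages.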

Limits of Parallel Computing
- Theoretical upper limits: Amdahl's law
- Practical limits: load balancing, non-computational sections, time to rewrite code
- Hardware/system limits: topology, network bandwidth and latency, number of processors

Amdahl's Law
Amdahl's law places a strict limit on the speedup that can be realized by using multiple processors.
Effect of multiple processors on run time:
    t_N = (f_p / N + f_s) t_1
Effect of multiple processors on speedup:
    S = 1 / (f_s + f_p / N)
where
    f_s = serial fraction of the code
    f_p = parallel fraction of the code (f_s + f_p = 1)
    N   = number of processors
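The formulas translate directly into code; a small sketch (the f_s = 0.01 value is an illustrative assumption):

    #include <stdio.h>

    /* Amdahl's law from the slide: S = 1 / (f_s + f_p / N) */
    double speedup(double f_s, int n) {
        double f_p = 1.0 - f_s;
        return 1.0 / (f_s + f_p / n);
    }

    int main(void) {
        /* even 1% serial code caps speedup far below N for large N */
        for (int n = 2; n <= 1024; n *= 4)
            printf("N = %4d: S = %6.2f  (f_s = 0.01)\n", n, speedup(0.01, n));
        printf("limit as N -> infinity: %.1f\n", 1.0 / 0.01);
        return 0;
    }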

It takes only a small fraction of serial content in a code to degrade the parallel performance.
[Figure: illustration of Amdahl's law; speedup S versus number of processors for several values of f_p, saturating well below ideal linear speedup]

Practical Limits: Amdahl's Law vs. Reality
Amdahl's law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communication. In reality, communication results in a further degradation of performance.
[Figure: for f_p = 0.99, measured speedup falls increasingly below the Amdahl's-law curve as the number of processors grows]

Shared and Distributed Memory
[Figure: a shared-memory machine (processors P connected over a bus to one memory) and a distributed-memory machine (processor/memory pairs P-M connected by a network)]
Shared memory: a single address space; all processors have access to a pool of shared memory (examples: Cray SV1, IBM Power4 node). Methods of memory access: bus, crossbar.
Distributed memory: each processor has its own local memory; processors must do message passing to exchange data (examples: clusters, Cray T3E). Methods of memory access: various topological interconnects.

Bus-Based Shared-Memory Architecture (I)
Processors are connected to global memory by means of a common data path called a bus.
[Figure: CPUs attached to a bus leading to global memory]
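Because shared memory provides one address space, the loop-splitting that needed explicit limits in the MPI example can be expressed with a single OpenMP directive (a hedged sketch, not from the slides; compile with an OpenMP flag such as -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
        double a[N];

        /* OpenMP divides the iterations among threads; a[] lives in the
           single shared memory, so no data exchange has to be coded. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;

        printf("a[99] = %.1f, computed by up to %d threads\n",
               a[N - 1], omp_get_max_threads());
        return 0;
    }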

Critical Issues
- Simplicity of construction; provides uniform access to shared memory
- But the bus can carry only a limited amount of data between the memory and the processors
- As the number of processors increases, each processor spends more time waiting for memory access while the bus is used by other processors: saturation of the bus
- For this reason the SGI Challenge XL has only 36 processors

Bus-Based Shared-Memory Architecture (II)
Adding caches to the bus increases performance.
[Figure: each CPU now has a cache between it and the bus to global memory]

Bus with and without Cache
[Figure: performance versus number of processors, with and without caches; performance with cache is higher but still flattens as processors are added]
What is the matter with this picture?

Switch-Based Shared-Memory Architecture
[Figure: a PxM crossbar: CPUs 1..p on one side, memory banks M1..Mm on the other, with a switch element at each crossing]
A PxM crossbar switch connects P processors and M memory banks. Example: a 5x5 crossbar switch is the basic unit of the Convex SPP.

Pros and Cons
Crossbars do not suffer from saturation problems, BUT they are very expensive architectures: an mxn crossbar needs m*n hardware switches.

Cost Estimation
A crossbar switch is a non-blocking network: the connection of a processor to a memory bank does not block the connection of any other processor to any other memory bank.
The total number of switching elements required grows as p*m, which is approximately p^2 (assuming p = m). As p grows, so does the complexity of the switching network. Crossbar switches are therefore not scalable in terms of cost.

Types of Distributed-Memory Architectures
- Bus-based networks, e.g., a cluster of workstations on an Ethernet
- Dynamic interconnects (indirect topology)
- Static interconnects (direct topology)

Dynamic vs. Static Interconnects
DYNAMIC INTERCONNECT (indirect topology): communication links are connected to one another dynamically by switching elements to establish paths among processors. Example: a crossbar network, where specialized switching nodes transfer the messages.
STATIC INTERCONNECT (direct topology): point-to-point communication links among processors (processor/memory pairs). Example: a mesh, where the processors themselves act as the routing nodes.

Dynamic Interconnect: Crossbar Switch Network
[Figure: crossbar switch network]

Dynamic Interconnect: Omega Network (I)
[Figure: an omega network built from small switching elements]

Dynamic Interconnect: Omega Network (II)
[Figure: routing through an omega network]

Examples of Systems with Dynamic Interconnection Networks
- Crossbar switch network: Fujitsu VPP 500 (a 224x224 crossbar with 224 nodes)
- A compromise between crossbar and omega networks: the SP series from IBM; each switch of the omega structure is an 8x8 crossbar, and the largest installation has 512 nodes

Static Interconnection Networks
Completely connected, star connected, linear array, ring, mesh, hypercube.

Static Interconnect
- Completely connected: each processor has a direct communication link to every other processor
- Star connected: the middle processor is the central processor, and every other processor is connected to it; the counterpart of the crossbar switch in dynamic interconnects

Static Interconnect
- Linear array
- Ring
- Mesh network (e.g., 2-D)
- Torus (wraparound mesh)
[Figures: a linear array, a ring, a 2-D mesh, and a 2-D torus]

Static Interconnect
- Hypercube network: a multidimensional mesh of processors with exactly two processors in each dimension. A d-dimensional hypercube consists of p = 2^d processors.
[Figure: 0-, 1-, 2-, and 3-dimensional hypercubes]

Routing
- How is data transmitted between two nodes that are not directly connected? Hardware and hardware+software solutions.
- How is a route between nodes decided if there are multiple routes? Deterministic shortest-path routing algorithms.
- How do intermediate nodes forward communications? Store-and-forward routing or wormhole routing.
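A handy property (not stated on the slide, but standard for hypercubes): label the p = 2^d nodes with d-bit numbers, and each node's neighbors differ from it in exactly one bit, so they can be enumerated by XOR:

    #include <stdio.h>

    int main(void) {
        int d = 3;      /* 3-D hypercube: p = 8 nodes */
        int node = 5;   /* binary 101                 */

        /* flipping each of the d bits yields the d neighbors */
        for (int bit = 0; bit < d; bit++)
            printf("neighbor across dimension %d: %d\n", bit, node ^ (1 << bit));
        /* prints 4 (100), 7 (111), 1 (001): the three links of node 101 */
        return 0;
    }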

Store-and-Forward vs. Wormhole Routing
Store-and-forward routing: data is shipped through the network a packet at a time; the packet is sent to the first intermediate node, then on to the second, and so forth.
Wormhole routing: like a worm crawling through a wormhole. A packet contains a header with routing information, followed by a payload containing the actual data, probably followed by a checksum or something to guarantee integrity. Once the header has arrived at a node, it is possible to make routing decisions and pass it along immediately, rather than waiting for the entire packet to arrive first.
Wormhole routing dramatically reduces latency, but creates new possibilities for deadlock.

Evaluating Network Topologies
Metrics: diameter, connectivity, bisection width, channel width, channel rate, channel bandwidth, bisection bandwidth.
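A simple latency model makes the difference concrete (a sketch under assumed numbers, not taken from the slides: an m-byte packet with an h-byte header crossing l links at B bytes/sec):

    #include <stdio.h>

    int main(void) {
        double m = 4096.0;   /* packet size, bytes            */
        double h = 16.0;     /* header size, bytes            */
        double B = 1e8;      /* link bandwidth, bytes/sec     */
        int    l = 4;        /* number of links on the route  */

        /* store-and-forward retransmits the whole packet at every hop;
           wormhole streams it, paying the per-hop cost only for the header */
        double store_forward = l * (m / B);
        double wormhole      = l * (h / B) + m / B;
        printf("store-and-forward: %.2e s, wormhole: %.2e s\n",
               store_forward, wormhole);
        return 0;
    }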

Metrics
Diameter: the maximum distance between any two processors in the network, where the distance between two processors is defined as the shortest path, in terms of links, between them. This relates to communication time. The diameter of a completely connected network is 1; of a star network, 2; of a ring, p/2 (for p even).

Connectivity: a measure of the multiplicity of paths between any two processors (the number of arcs that must be removed to break the network into two). High connectivity is desired since it lowers contention for communication resources. In the previous examples, connectivity is 1 for a linear array, 1 for a star, 2 for a ring, 2 for a mesh, and 4 for a torus.

Metrics
Bisection width: the minimum number of communication links that have to be removed to partition the network into two equal halves. Bisection width is 2 for a ring, sqrt(p) for a mesh with p processors, p/2 for a hypercube, and p^2/4 for a completely connected network (p even).

Channel width: the number of physical wires in each communication link.
Channel rate: the peak rate at which a single physical wire can deliver bits.
Channel bandwidth: the peak rate at which data can be communicated between the ends of a communication link:
    channel bandwidth = channel width * channel rate
Bisection bandwidth: the minimum volume of communication allowed between any two halves of the network with an equal number of processors:
    bisection bandwidth = bisection width * channel bandwidth
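Applying the two formulas above with assumed example numbers (a 1024-node mesh, so bisection width sqrt(1024) = 32; the wire counts and rates are illustrative):

    #include <stdio.h>

    int main(void) {
        double channel_width = 16;   /* wires per link                       */
        double channel_rate  = 1e9;  /* bits/sec per wire                    */
        double bisection_w   = 32;   /* links cut, sqrt(p) for a mesh, p=1024 */

        double channel_bw   = channel_width * channel_rate;
        double bisection_bw = bisection_w * channel_bw;
        printf("channel BW: %.1e bits/s, bisection BW: %.1e bits/s\n",
               channel_bw, bisection_bw);
        return 0;
    }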

Example 1: 2-D Mesh (without wraparound connections)
    Processor nodes:  n = d^2
    Switch nodes:     n
    Diameter:         2(sqrt(n) - 1)
    Bisection width:  sqrt(n)
    Edges/node:       4
    Edge length:      constant

Example 2: Binary Tree Network
    Processor nodes:  n = 2^d
    Switch nodes:     2n - 1
    Diameter:         2 log n
    Bisection width:  1
    Edges/node:       3
    Edge length:      variable

Next Lecture
Topics: programming with MPI.
Deadlines:
- Course: read the syllabus and tentative schedule; print the slides for the next lecture and take them with you!
- Seminar: choose your seminar day and your paper.
- Project: read the project descriptions; the next deadline is 2/19!
- Homework: no homework assignment this time.

Get Some Practice
Find the memory model and topology of the following machines: Cray T3E, Cray SV1, IBM RS/6000 SP, Hitachi SR8000, Compaq HPC320, IBM eServer p690, SGI Origin 2000, clusters of SMPs, Cray T3D, Fujitsu VPP5000 series.
Next lecture: you give your answers!
