Lecture 4: Locality and parallelism in simulation I


David Bindel, 6 Sep 2011

Logistics

Distributed memory machines. Each node has local memory... and no direct access to memory on other nodes. Nodes communicate via a network interface. Example: our cluster! Or almost any other current multi-node machine.

Message-passing programming model: a collection of named processes; data is partitioned among them; communication is by send/receive of explicit messages. Lingua franca: MPI (Message Passing Interface).
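As a concrete illustration (my sketch, not from the slides), a minimal MPI program in C looks like the following: every process runs the same code, learns its own rank, and from then on interacts only through explicit messages.

    /* Minimal message-passing skeleton: each named process (rank)
       has its own memory and communicates only via MPI calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my process id       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes */

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }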

Message passing dot product: v1.
Processor 1: 1. Compute partial sum s1. 2. Send s1 to P2. 3. Receive s2 from P2. 4. s = s1 + s2.
Processor 2: 1. Compute partial sum s2. 2. Send s2 to P1. 3. Receive s1 from P1. 4. s = s1 + s2.
What could go wrong? Think of phones vs letters...

Message passing dot product: v2.
Processor 1: 1. Compute partial sum s1. 2. Send s1 to P2. 3. Receive s2 from P2. 4. s = s1 + s2.
Processor 2: 1. Compute partial sum s2. 2. Receive s1 from P1. 3. Send s2 to P1. 4. s = s1 + s2.
Better, but what if there are more than two processors?
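A sketch of version 2 in C with MPI, assuming exactly two ranks; the helper name exchange_partial is illustrative, not from the lecture. Ordering the blocking send/receive calls oppositely on the two ranks removes the mutual-wait hazard of version 1.

    /* Sketch (assumed helper, not lecture code): rank 0 sends then
       receives, rank 1 receives then sends, so the blocking calls
       cannot wait on each other forever. */
    #include <mpi.h>

    double exchange_partial(double s_local, int rank, MPI_Comm comm)
    {
        double s_other;
        int partner = 1 - rank;   /* assumes exactly two ranks: 0 and 1 */

        if (rank == 0) {
            MPI_Send(&s_local, 1, MPI_DOUBLE, partner, 0, comm);
            MPI_Recv(&s_other, 1, MPI_DOUBLE, partner, 0, comm,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&s_other, 1, MPI_DOUBLE, partner, 0, comm,
                     MPI_STATUS_IGNORE);
            MPI_Send(&s_local, 1, MPI_DOUBLE, partner, 0, comm);
        }
        return s_local + s_other;  /* both ranks end with the full sum */
    }

For more than two processors, a collective such as MPI_Allreduce with MPI_SUM produces the global sum on every rank in a single call.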

MPI: the de facto standard. Pro: portability. Con: a least common denominator for the mid 80s. The assembly language (or C?) of parallelism... but, alas, assembly language can be high performance.

The story so far. Even serial performance is a complicated function of the underlying architecture and memory system. We need to understand these effects in order to design data structures and algorithms that are fast on modern machines. Good serial performance is the basis for good parallel performance. Parallel performance is additionally complicated by communication and synchronization overheads, and by how much parallel work is available. If a small fraction of the work is completely serial, Amdahl's law bounds the speedup, independent of the number of processors. We have discussed serial architecture and some of the basics of parallel machine models and programming models. Now we want to describe how to think about the shape of parallel algorithms for some scientific applications.
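The Amdahl bound can be written out explicitly. If a fraction s of the work is serial and the rest parallelizes perfectly over p processors, then

    \[
      \text{speedup}(p) = \frac{1}{s + (1-s)/p} \le \frac{1}{s},
    \]

so, for example, s = 0.05 caps the speedup at 20 no matter how many processors we use.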

Reminder: what do we want? High-level: solve big problems fast. Start with good serial performance. Given p processors, we could then ask for: good speedup, 1/p times the serial time; good scaled speedup, p times the work in the same time. Easiest to get good speedup from cruddy serial code!

Parallelism and locality. The real world exhibits parallelism and locality: particles, people, etc. function independently, and nearby objects interact more strongly than distant ones, so we can often simplify the dependence on distant objects. We can get more parallelism / locality through the model: limited range of dependency between adjacent time steps; far-field effects can be neglected or approximated. We often get parallelism at multiple levels: hierarchical circuit simulation, interacting models for climate, parallelizing individual experiments in MC or optimization.

Basic styles of simulation:
Discrete event systems (continuous or discrete time): Game of Life, logic-level circuit simulation, network simulation.
Particle systems (our homework): billiards, electrons, galaxies, ...; ants, cars, ...?
Lumped parameter models (ODEs): circuits (SPICE), structures, chemical kinetics.
Distributed parameter models (PDEs / integral equations): heat, elasticity, electrostatics, ...
Often more than one type of simulation is appropriate. Sometimes more than one at a time!

Discrete events. Basic setup: a finite set of variables, updated via a transition function. Synchronous case: finite state machine. Asynchronous case: event-driven simulation. Synchronous example: Game of Life. Nice starting point: no discretization concerns!

Game of Life (John Conway):
1. A live cell dies with < 2 live neighbors.
2. A live cell dies with > 3 live neighbors.
3. A live cell lives with 2 or 3 live neighbors.
4. A dead cell becomes live with exactly 3 live neighbors.
(Figure: example cells labeled Lonely, Crowded, OK, and Born, indicating which are dead and which are live at the next step.)
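For concreteness, the four rules collapse into a small update function; the C sketch below (mine, not from the slides) assumes the count of live neighbors has already been computed.

    /* Conway's rules for one cell, given the number of live neighbors
       among the 8 surrounding cells. */
    int next_state(int alive, int live_neighbors)
    {
        if (alive)
            return live_neighbors == 2 || live_neighbors == 3;  /* rules 1-3 */
        else
            return live_neighbors == 3;                         /* rule 4    */
    }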

Game of Life. (Figure: the grid divided into strips owned by processors P0 through P3.) Easy to parallelize by domain decomposition. Update work involves the volume of the subdomains; communication per step is on the surface (cyan in the figure).
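One way to realize this decomposition is a strip partition with ghost rows exchanged once per step. The C/MPI sketch below assumes a row-major grid with nlocal interior rows of width n plus two ghost rows; the layout and function name are mine, not from the lecture.

    /* Per-step ghost-row exchange for a 1D strip decomposition:
       rows 1..nlocal are interior, rows 0 and nlocal+1 are ghosts. */
    #include <mpi.h>

    void exchange_ghost_rows(int* grid, int n, int nlocal,
                             int rank, int size, MPI_Comm comm)
    {
        int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send first interior row up; receive ghost row from below */
        MPI_Sendrecv(&grid[1 * n],            n, MPI_INT, up,   0,
                     &grid[(nlocal + 1) * n], n, MPI_INT, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send last interior row down; receive ghost row from above */
        MPI_Sendrecv(&grid[nlocal * n], n, MPI_INT, down, 1,
                     &grid[0],          n, MPI_INT, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }

Communication volume here is proportional to the surface (two rows of n cells), while the update work is proportional to the volume (nlocal rows of n cells).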

Game of Life: Pioneers and Settlers. What if the pattern is dilute? Few or no live cells at the surface at each step. Think of a live cell at a surface as an event, and only communicate events! This is asynchronous. Harder with message passing: when do you receive?

Asynchronous Game of Life. How do we manage events? Could be speculative: assume no communication across the boundary for many steps, and back up if needed. Or conservative: wait whenever communication is possible; but possible does not mean guaranteed! Deadlock: everyone waits for everyone else to send data. Can get around this with NULL messages. How do we manage load balance? No need to simulate quiescent parts of the game! Maybe dynamically assign smaller blocks to processors?

Particle simulation. Particles move via Newton (F = ma), with: external forces (ambient gravity, currents, etc.); local forces (collisions, Van der Waals (1/r^6), etc.); far-field forces (gravity and electrostatics (1/r^2), etc.). Simple approximations often apply (Saint-Venant).

A forced example. Example force:

    f_i = \sum_j G m_i m_j \frac{x_j - x_i}{r_{ij}^3}
          \left( 1 - \left(\frac{a}{r_{ij}}\right)^4 \right),
    \qquad r_{ij} = \|x_i - x_j\|

Long-range attractive force (1/r^2); short-range repulsive force (1/r^6); goes from attraction to repulsion at radius a.
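Written out directly, this force evaluation is a doubly nested loop over particles. The C sketch below (2D for concreteness; array layout and names are illustrative) is the straightforward O(n^2) version that the rest of the lecture tries to improve on.

    /* Direct O(n^2) evaluation of the example force: a 1/r^2
       attraction modulated by (1 - (a/r)^4), which flips to
       repulsion inside radius a. Positions and forces are stored
       as interleaved 2D coordinates. */
    #include <math.h>

    void compute_forces(int n, const double* x, const double* m,
                        double G, double a, double* f)
    {
        for (int i = 0; i < n; ++i) {
            f[2*i] = f[2*i+1] = 0.0;
            for (int j = 0; j < n; ++j) {
                if (j == i) continue;
                double dx = x[2*j]   - x[2*i];
                double dy = x[2*j+1] - x[2*i+1];
                double r  = sqrt(dx*dx + dy*dy);
                double s  = G * m[i] * m[j] / (r*r*r)
                            * (1.0 - pow(a / r, 4.0));
                f[2*i]   += s * dx;
                f[2*i+1] += s * dy;
            }
        }
    }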

A simple serial simulation. In MATLAB, we can write

    npts = 100;
    t = linspace(0, tfinal, npts);
    [tout, xyv] = ode113(@fnbody, ...
                         t, [x; v], [], m, g);
    xout = xyv(:, 1:length(x))';

... but I can't call ode113 in C in parallel (or can I?)

A simple serial simulation. Maybe a fixed step leapfrog will do?

    npts = 100;
    steps_per_pt = 10;
    dt = tfinal / (steps_per_pt*(npts-1));
    xout = zeros(2*n, npts);
    xout(:,1) = x;
    for i = 1:npts-1
      for ii = 1:steps_per_pt
        x = x + v*dt;
        a = fnbody(x, m, g);
        v = v + a*dt;
      end
      xout(:,i+1) = x;
    end

Plotting particles

Pondering particles Where do particles live (esp. in distributed memory)? Decompose in space? By particle number? What about clumping? How are long-range force computations organized? How are short-range force computations organized? How is force computation load balanced? What are the boundary conditions? How are potential singularities handled? What integrator is used? What step control?

External forces. Simplest case: no particle interactions. Embarrassingly parallel (like Monte Carlo)! Could just split particles evenly across processors. Is it that easy? Maybe some trajectories need short time steps? Even with MC, load balance may not be entirely trivial.

Local forces. The simplest all-pairs check is O(n^2) (expensive). Or only check close pairs (via binning, quadtrees?). Communication is required for the pairs checked. Usual model: domain decomposition.
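A sketch of the binning idea in C, assuming a square domain [0, L)^2 and cells at least as large as the interaction cutoff; particles in a cell are chained in a linked list, so candidate pairs come only from a cell and its eight neighbors. The data layout is mine, not from the lecture.

    /* Build cell lists: bin_head[c] is the first particle in cell c,
       bin_next[i] is the next particle in i's cell (-1 ends the chain).
       Assumes L >= cutoff so there is at least one cell per dimension. */
    #include <math.h>

    void build_bins(int n, const double* x, double L, double cutoff,
                    int* bin_head, int* bin_next, int* ncell_out)
    {
        int ncell = (int)floor(L / cutoff);       /* cells per dimension */
        double cell_size = L / ncell;             /* >= cutoff           */
        for (int c = 0; c < ncell * ncell; ++c) bin_head[c] = -1;
        for (int i = 0; i < n; ++i) {
            int cx = (int)(x[2*i]   / cell_size);
            int cy = (int)(x[2*i+1] / cell_size);
            if (cx >= ncell) cx = ncell - 1;      /* clamp boundary hits */
            if (cy >= ncell) cy = ncell - 1;
            int c = cx + ncell * cy;
            bin_next[i] = bin_head[c];
            bin_head[c] = i;
        }
        *ncell_out = ncell;
    }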

Local forces: communication. Minimize communication: send particles that might affect a neighbor soon; trade extra computation against communication; want low surface-area-to-volume ratios on the domains.

Local forces: load balance. Are particles evenly distributed? Do particles remain evenly distributed? Can divide space unevenly (e.g. quadtree/octree).

Far-field forces. (Figure: each processor's own particles, labeled Mine, alongside a buffer of particles passed in from other processors, labeled Buffered.) Every particle affects every other particle, so all-to-all communication is required. Overlap communication with computation. Poor memory scaling if everyone keeps everything! Idea: pass particles in a round-robin manner.

Passing particles for far-field forces.

    copy local particles to current buf
    for phase = 1:p
      send current buf to rank+1 (mod p)
      recv next buf from rank-1 (mod p)
      interact local particles with current buf
      swap current buf with next buf
    end
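In MPI terms, the loop above might look like the C sketch below. MPI_Sendrecv_replace cycles a single buffer around the ring; interact is a placeholder for the local-versus-buffer force update, and nvals is the number of doubles of particle data per processor (assumed equal on all ranks). Unlike the double-buffered pseudocode, this blocking version does not overlap communication with computation.

    #include <mpi.h>
    #include <string.h>

    void interact(double* local, const double* buf, int nvals);  /* placeholder */

    void farfield_ring(double* local, double* buf, int nvals,
                       int rank, int p, MPI_Comm comm)
    {
        memcpy(buf, local, nvals * sizeof(double));   /* buffer starts as my particles */
        for (int phase = 0; phase < p; ++phase) {
            interact(local, buf, nvals);              /* my particles vs. buffer */
            MPI_Sendrecv_replace(buf, nvals, MPI_DOUBLE,
                                 (rank + 1) % p, 0,       /* send to rank+1   */
                                 (rank + p - 1) % p, 0,   /* recv from rank-1 */
                                 comm, MPI_STATUS_IGNORE);
        }
    }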

Passing particles for far-field forces. Suppose there are n = N/p particles in the buffer. At each phase,

    t_{\mathrm{comm}} \approx \alpha + \beta n, \qquad
    t_{\mathrm{comp}} \approx \gamma n^2,

so we can mask communication with computation if

    n \ge \frac{1}{2\gamma}\left(\beta + \sqrt{\beta^2 + 4\alpha\gamma}\right)
      > \frac{\beta}{\gamma}.

More efficient serial code means a larger n is needed to mask communication, which means worse speedup as p gets larger (fixed N); but scaled speedup (fixed n) remains unchanged. This analysis neglects the overhead term in LogP.
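The threshold above comes from requiring that the computation in a phase cover the communication, i.e. solving a quadratic inequality in n:

    \[
      \gamma n^2 \ge \alpha + \beta n
      \quad\Longleftrightarrow\quad
      n \ge \frac{\beta + \sqrt{\beta^2 + 4\alpha\gamma}}{2\gamma}
      > \frac{\beta}{\gamma}.
    \]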

Far-field forces: particle-mesh methods. Consider the 1/r^2 electrostatic potential interaction. Enough charges look like a continuum! The Poisson equation maps a charge distribution to a potential. Use fast Poisson solvers for regular grids (FFT, multigrid). The approximation depends on the mesh and particle density. Can clean up the leading part of the approximation error.

Far-field forces: particle-mesh methods. Map particles to mesh points (multiple strategies). Solve the potential PDE on the mesh. Interpolate the potential back to the particles. Add a correction term (it acts like a local force).
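The first step, mapping particles to mesh points, is often done with cloud-in-cell (linear) weighting. Below is a 1D sketch in C under an assumed uniform periodic mesh of spacing h; it is illustrative, not code from the lecture.

    /* Cloud-in-cell charge deposition, 1D, periodic: each particle's
       charge is split between the two nearest mesh points in
       proportion to distance, giving a charge density rho. */
    void deposit_charge(int n, const double* x, const double* q,
                        double h, int nmesh, double* rho)
    {
        for (int k = 0; k < nmesh; ++k) rho[k] = 0.0;
        for (int i = 0; i < n; ++i) {
            double s = x[i] / h;          /* position in mesh units    */
            int    k = (int)s;            /* left mesh point           */
            double w = s - k;             /* weight toward right point */
            rho[k % nmesh]       += q[i] * (1.0 - w) / h;
            rho[(k + 1) % nmesh] += q[i] * w / h;
        }
    }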

Far-field forces: tree methods. Distance simplifies things: Andromeda looks like a point mass from here? Build a tree, approximating descendants at each node. Several variants: Barnes-Hut, FMM, Anderson's method. More on this later in the semester.

Summary of particle example. Model: continuous motion of particles (could be electrons, cars, whatever...), stepping through discretized time. Local interactions: relatively cheap, but load balance is a pain. All-pairs interactions: the obvious algorithm is expensive (O(n^2)); particle-mesh and tree-based algorithms help. An important special case of lumped/ODE models.