Numerical Modelling in Fortran: day 7. Paul Tackley, 2017

Today's goals
1. Makefiles
2. Intrinsic functions
3. Optimisation: making your code run as fast as possible
4. Combine advection-diffusion and Poisson solvers to make a convection simulation program

Projects: start thinking about a topic
- Agree the topic with me before the final lecture

Makefiles
- If your programs are split over several files, it may be easiest to write the instructions for compiling them in a makefile (this applies mainly to Unix-like systems)
- This also makes sure that the various options you use remain the same
- The makefile defines dependencies, so it only recompiles source files that have changed since the last compilation
- See the manual pages for full information

Example makefile
- FFLAGS lists the flags to be used when compiling Fortran
- ":" gives dependencies (the target on the left depends on the files on the right)
- Usage: just type "make" or "make myproject"
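The makefile shown on the slide is not reproduced in this transcript. As a rough sketch of what such a file might look like, the fragment below builds a target called myproject from three source files. The file names (poisson_solver.f90, adv_diff.f90, convection.f90) are hypothetical, and the flags are only an example. Note that each command line must begin with a tab character.

```make
# Hypothetical project layout: all file names here are illustrative only
FC     = gfortran
FFLAGS = -O2

OBJS = poisson_solver.o adv_diff.o convection.o

# "target: prerequisites", followed by the command (indented with a TAB)
myproject: $(OBJS)
	$(FC) $(FFLAGS) -o myproject $(OBJS)

# The main program uses both modules, so they must be compiled first
convection.o: poisson_solver.o adv_diff.o

# Generic rule: how to turn any .f90 file into a .o file
%.o: %.f90
	$(FC) $(FFLAGS) -c $<

clean:
	rm -f myproject *.o *.mod
```

With this file in the source directory, "make" or "make myproject" recompiles only the files that changed, and "make clean" removes the build products.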

Intrinsic functions
- There are many more intrinsic functions than we have used!
- It is good to browse a list so that you know what is available, for example at http://www.nsc.liu.se/~boein/f77to90/a5.html
- This doesn't list cpu_time(), which is useful for checking how long parts of the code take
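A minimal sketch of timing a section of code with the cpu_time intrinsic. The loop being timed and the variable names are only placeholders for whatever part of your program you want to measure.

```fortran
program time_loop
  implicit none
  integer, parameter :: n = 10000000
  real :: t_start, t_end
  double precision :: total
  integer :: i

  call cpu_time(t_start)          ! CPU time before the section

  total = 0.0d0
  do i = 1, n
     total = total + sqrt(dble(i))
  end do

  call cpu_time(t_end)            ! CPU time after the section

  print *, 'sum          =', total
  print *, 'CPU time (s) =', t_end - t_start
end program time_loop
```

Printing the result of the loop also prevents the compiler from optimising the timed section away.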

Speed and optimization
- Running large simulations can take a long time => speed is important.
- Optimization = making it run as fast as possible
- First consideration: use the most efficient algorithm, e.g., multigrid
- Then: get the code working, using code that is easy to read and debug
- Finally: find out which part(s) of the code are taking the most time, and rewrite those to optimize speed
- Code written for maximum speed may not be the most legible or compact!

Manual versus automatic optimisation
- Many steps can be done automatically by the compiler. Use appropriate compiler options (see documentation), e.g.,
    -O2, -O3: select a bunch of optimisations
    -unroll: unroll loops
    etc. (see compiler documentation)
- Some need to be done manually.
- In general, try to write code in such a way that the compiler can optimise it!

Example of compiler optimisation
Solution convection code (this week), test case, 32-bit, gfortran, MacBook Pro

Options                             Time (s)
None                                31.57
-O1                                 24.741
-O2                                 23.578
-O3                                 23.625
-O2 -ffast-math                     20.958
-O2 -ffast-math -ftree-vectorize    20.552

Manual optimization step 1: Identify bottlenecks
- The 90/10 law: 90% of the time is spent in 10% of the code. Find this 10% and work on that!
- e.g., use a profiler, or put cpu_time() statements in to time different subroutines or loops
- Usually, most of the time is spent in loops. In our multigrid code it is probably the loops that update the field and calculate the residue.
- Optimization of loops is the most important consideration.

To understand optimization it is important to understand how the CPU works
Two aspects are particularly important:
- Cache
- Pipelining

Cache memory
- Fast, small memory close to a CPU that stores copies of data in the main memory
- Substantially reduces latency of memory accesses.
- Designing code such that data fits in cache can greatly improve speed.
- Good design includes memory locality and arrays that are not too large.

[Figure: Athlon64, multiple caches]

Tips to improve cache usage
- Memory locality: data within each block should be close together (small stride).
- Appropriate data structures and ordering of nested loops (see later).
- Arrays shouldn't be too large, e.g., for a matrix*matrix multiply, split each matrix into blocks and treat blocks separately
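A sketch of the blocked matrix*matrix multiply mentioned above. The matrix size n and block size nb are arbitrary assumptions; the best block size depends on the cache size of the machine. The loops over j and k are split into blocks so that only a small part of a and b is in use at any one time.

```fortran
program blocked_matmul
  implicit none
  integer, parameter :: n  = 256   ! matrix size (assumed)
  integer, parameter :: nb = 32    ! block size (assumed, tune for your cache)
  real, allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: i, j, k, jj, kk

  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)
  c = 0.0

  do jj = 1, n, nb                       ! loop over blocks of columns of c
     do kk = 1, n, nb                    ! loop over blocks of the summation index
        do j = jj, min(jj+nb-1, n)
           do k = kk, min(kk+nb-1, n)
              do i = 1, n                ! first index innermost: stride 1
                 c(i,j) = c(i,j) + a(i,k)*b(k,j)
              end do
           end do
        end do
     end do
  end do

  ! Compare with the intrinsic; differences should be at rounding level
  print *, 'max difference from matmul:', maxval(abs(c - matmul(a, b)))
end program blocked_matmul
```

In practice the intrinsic matmul or an optimised BLAS library is usually faster than hand-written blocking; the sketch only illustrates the idea.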

CPU architecture and pipelining
(images from http://en.wikipedia.org/wiki/central_processing_unit)
- Executing an instruction takes several steps.
- In the simplest case these are done sequentially, e.g., 15 cycles to perform 3 instructions!

Pipelining
- If each step can be done independently, then up to 1 instruction/cycle can be sustained => 5 times faster
- Basic 5-stage pipeline: like an assembly line.
- Several cycles are needed to start and end the pipeline.

Superscalar pipeline More than 1 instruction per cycle (2 in the example below)

Pipelining in practice
- Done by the compiler, but you must write code to maximize its success
- Branches (e.g., "if") cause the pipeline to flush and have to restart
- Avoid branching inside loops!
- Helps if data is in cache

Summary Goal is to maximize use of cache and pipelining. Design code to reuse data in cache as much as possible, and to stream data efficiently through the CPU (pipeline / vectorisation)

More information
Wikipedia pages
- http://en.wikipedia.org/wiki/loop_optimization
- http://en.wikipedia.org/wiki/optimization_(computer_science)
- http://en.wikipedia.org/wiki/memory_locality
- http://en.wikipedia.org/wiki/software_pipelining
Other
- http://www.ibiblio.org/pub/languages/fortran/ch1-9.html
- http://www.azillionmonkeys.com/qed/optimize.html
- ETH course "How to write fast numerical code" (Frühjahrssemester)

Time taken for various operations
- Slowest: sin, cos, **, etc.
- sqrt
- /
- *
- Fastest: +, -
Simplify equations in loops to minimize the number of operators, particularly slow ones!
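As an illustration of replacing slow operators with faster ones, the sketch below computes the same quantity twice: once with a power and a division inside the loop, and once with only multiplications inside the loop. The array, the value of h and all names are made up for the example.

```fortran
program cheap_ops
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), slow(n), fast(n)
  real :: h, inv_h2
  integer :: i

  h = 0.01
  call random_number(x)

  ! SLOW: a power and a division for every element
  do i = 1, n
     slow(i) = x(i)**2.0 / h**2.0
  end do

  ! FAST: one division outside the loop, only multiplications inside
  inv_h2 = 1.0 / (h*h)
  do i = 1, n
     fast(i) = x(i)*x(i) * inv_h2
  end do

  ! The two versions agree to rounding error
  print *, 'max relative difference:', &
       maxval(abs(fast - slow) / max(abs(slow), tiny(1.0)))
end program cheap_ops
```

An optimising compiler may make some of these substitutions itself, so it is worth timing both versions before rewriting a lot of code.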

Loop optimization (1)
- Remove conditional statements from loops!
[Slide shows two code versions: SLOW and FAST]
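The code from the slide is not reproduced in this transcript. One possible illustration, with invented variable names: when the condition does not change during the loop, test it once outside instead of on every iteration.

```fortran
program hoist_branch
  implicit none
  integer, parameter :: n = 100000
  real :: a(n), b_slow(n), b_fast(n)
  logical :: scale_it
  integer :: i

  call random_number(a)
  scale_it = .true.

  ! SLOW: the same test is repeated on every iteration
  do i = 1, n
     if (scale_it) then
        b_slow(i) = 2.0*a(i)
     else
        b_slow(i) = a(i)
     end if
  end do

  ! FAST: test once, then run a loop with no branch inside
  if (scale_it) then
     do i = 1, n
        b_fast(i) = 2.0*a(i)
     end do
  else
     do i = 1, n
        b_fast(i) = a(i)
     end do
  end if

  print *, 'identical results:', all(b_slow == b_fast)
end program hoist_branch
```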

Loop optimization (2)
- Data locality: fastest if processing nearby (e.g., consecutive) locations in memory
- Fortran arrays: the first index accesses consecutive locations (opposite in C)
- Order loops such that the first-index loop is innermost, the 2nd-index loop is next, etc.
[Slide shows two code versions: SLOW and FAST]
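The slide's code is not reproduced here. The sketch below, with an arbitrary array size, times the two loop orders with cpu_time. Measured times depend on the machine, and an optimising compiler may swap the loops itself, in which case the two times will be similar.

```fortran
program loop_order
  implicit none
  integer, parameter :: n = 2000
  real, allocatable :: a(:,:)
  real :: t0, t1, t2
  integer :: i, j

  allocate(a(n,n))
  a = 0.0

  call cpu_time(t0)
  ! SLOW: first index in the outer loop, so the inner loop jumps n elements each step
  do i = 1, n
     do j = 1, n
        a(i,j) = a(i,j) + 1.0
     end do
  end do
  call cpu_time(t1)

  ! FAST: first index in the inner loop, so memory is accessed consecutively
  do j = 1, n
     do i = 1, n
        a(i,j) = a(i,j) + 1.0
     end do
  end do
  call cpu_time(t2)

  print *, 'first index outer (s):', t1 - t0
  print *, 'first index inner (s):', t2 - t1
  print *, 'checksum:', sum(a)     ! use the result so the loops are not removed
end program loop_order
```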

Loop optimization (3)
- Unrolling: eliminate loop overhead by writing loops as lots of separate operations
- Partial unrolling: reduces the number of loop iterations, reducing loop overhead
- This can be done automatically by the compiler
[Slide shows two code versions: original, and unrolled by factor 4]
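A sketch of partial unrolling by a factor of 4, using made-up arrays. For simplicity n is assumed to be divisible by 4; otherwise a short clean-up loop is needed for the remaining elements.

```fortran
program unroll4
  implicit none
  integer, parameter :: n = 1000      ! assumed divisible by 4
  real :: a(n), b(n), c_orig(n), c_unrolled(n)
  integer :: i

  call random_number(a)
  call random_number(b)

  ! Original: n iterations, each with its own loop overhead
  do i = 1, n
     c_orig(i) = a(i) + b(i)
  end do

  ! Unrolled by factor 4: n/4 iterations, 4 operations per iteration
  do i = 1, n, 4
     c_unrolled(i)   = a(i)   + b(i)
     c_unrolled(i+1) = a(i+1) + b(i+1)
     c_unrolled(i+2) = a(i+2) + b(i+2)
     c_unrolled(i+3) = a(i+3) + b(i+3)
  end do

  print *, 'identical results:', all(c_orig == c_unrolled)
end program unroll4
```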

Loop optimization (4)
- Fusing + unrolling
- Partial unrolling + fusion reduces the number of writes by a factor of 2
[Slide shows two code versions: original loop, and partial unrolling + fusion]
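The slide's example is not reproduced in this transcript, so the following is only one plausible illustration, not necessarily the one shown in the lecture. In a matrix-vector product, the outer loop is unrolled by 2 and the two resulting inner loops are fused into one, so each y(i) is written half as many times. All names and sizes are assumptions, and n is assumed to be even.

```fortran
program unroll_and_fuse
  implicit none
  integer, parameter :: n = 512       ! assumed even
  real, allocatable :: a(:,:)
  real :: x(n), y1(n), y2(n)
  integer :: i, j

  allocate(a(n,n))
  call random_number(a)
  call random_number(x)

  ! Original: y(i) is read and written once for every column j
  y1 = 0.0
  do j = 1, n
     do i = 1, n
        y1(i) = y1(i) + a(i,j)*x(j)
     end do
  end do

  ! Outer loop unrolled by 2, inner loops fused:
  ! y(i) is read and written once for every pair of columns
  y2 = 0.0
  do j = 1, n, 2
     do i = 1, n
        y2(i) = y2(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
     end do
  end do

  print *, 'max difference:', maxval(abs(y1 - y2))
end program unroll_and_fuse
```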

Loop optimization (5): other things
- Simplify calculated indices
- Use registers for temporary results
- Put invariant expressions (things that don't change each iteration) outside the loop
- Loop blocking/tiling: splitting a big loop or nested loops into smaller ones in order to fit into cache.

Other optimizations
- Use binary I/O, not ASCII
- Avoid splitting code into excessive procedures:
    there is overhead associated with calling functions/subroutines
    it reduces the ability of the compiler to do global optimizations
- Use procedure inlining (done by compiler): the compiler inserts a copy of the function/subroutine each time it is called
- Use simple data structures in major loops to aid compiler optimizations (defined types may slow things down)

Now back to iterative solvers

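Goal for today's exercise: calculate the velocity field from the temperature field => convection
e.g., for highly viscous flow (e.g., Earth's mantle) with constant viscosity (P = pressure, Ra = Rayleigh number):
\[ -\nabla P + \nabla^2 \mathbf{v} = -Ra\, T\, \hat{y} \]
Substituting the streamfunction for velocity,
\[ (v_x, v_y) = \left( \frac{\partial \psi}{\partial y},\; -\frac{\partial \psi}{\partial x} \right), \]
we get:
\[ \nabla^4 \psi = Ra\, \frac{\partial T}{\partial x} \]
Writing as 2 Poisson equations (the streamfunction-vorticity formulation):
\[ \nabla^2 \omega = Ra\, \frac{\partial T}{\partial x}, \qquad \nabla^2 \psi = \omega \]
(y points up)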

Example of convection (in 3-D)

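Mathematical aside
\[ -\nabla P + \nabla^2 \mathbf{v} = -Ra\, T\, \hat{y} \]
is a vector equation with x and y parts:
\[ -\frac{\partial P}{\partial x} + \nabla^2 v_x = 0 \qquad (1) \]
\[ -\frac{\partial P}{\partial y} + \nabla^2 v_y = -Ra\, T \qquad (2) \]
Take \( \partial(1)/\partial y - \partial(2)/\partial x \), which eliminates the pressure:
\[ \nabla^2 \left( \frac{\partial v_x}{\partial y} - \frac{\partial v_y}{\partial x} \right) = Ra\, \frac{\partial T}{\partial x} \]
Use the streamfunction, \( (v_x, v_y) = (\partial\psi/\partial y,\, -\partial\psi/\partial x) \):
\[ \nabla^2 \left( \frac{\partial^2 \psi}{\partial y^2} + \frac{\partial^2 \psi}{\partial x^2} \right) = \nabla^2 \nabla^2 \psi = Ra\, \frac{\partial T}{\partial x} \]
This is the z-component of the curl of the momentum equation (up to an overall sign):
\[ \nabla \times \mathbf{F} = \left( \frac{\partial F_z}{\partial y} - \frac{\partial F_y}{\partial z} \right) \hat{x} + \left( \frac{\partial F_x}{\partial z} - \frac{\partial F_z}{\partial x} \right) \hat{y} + \left( \frac{\partial F_y}{\partial x} - \frac{\partial F_x}{\partial y} \right) \hat{z} \]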

Combine advection-diffusion and Poisson solver routines
(S = streamfunction, W = vorticity)
Initialise
- T = random or spike or...; S, W = 0
- grid spacing h = 1/(ny-1) and diffusive timestep
Timestep until desired time is reached
- calculate right-hand side = Ra*dT/dx
- Poisson solve to get W from rhs
- Poisson solve to get S from W
- calculate V from S
- calculate advective timestep and dt = min(adv, diff)
- take advection and diffusion time step
- t = t + dt: have we finished? If not, loop again.
Notes
- Use the multigrid solver for the Poisson solves
- Use the existing adv-diff routines for the time step
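Two of the steps above ("calculate V from S" and the choice of timestep) can be sketched as follows. This is not the course's own routine: the test streamfunction, the safety factors a_dif and a_adv, and the choice to leave boundary velocities at zero are all assumptions made for the example. Velocities use centred differences with v_x = dS/dy and v_y = -dS/dx.

```fortran
program s_to_v_sketch
  implicit none
  integer, parameter :: nx = 65, ny = 17
  real, parameter :: pi = 3.14159265
  real, parameter :: a_dif = 0.15, a_adv = 0.4   ! assumed safety factors
  real :: s(nx,ny), vx(nx,ny), vy(nx,ny)
  real :: h, dt_dif, dt_adv, dt, vmax
  integer :: i, j

  h = 1.0/real(ny-1)                 ! grid spacing, dx = dy = h

  ! Hypothetical test streamfunction, zero on all boundaries
  do j = 1, ny
     do i = 1, nx
        s(i,j) = sin(pi*real(i-1)/real(nx-1)) * sin(pi*real(j-1)/real(ny-1))
     end do
  end do

  ! v_x = dS/dy, v_y = -dS/dx at interior points (centred differences);
  ! boundary values are simply left at zero in this sketch
  vx = 0.0
  vy = 0.0
  do j = 2, ny-1
     do i = 2, nx-1
        vx(i,j) =  (s(i,j+1) - s(i,j-1)) / (2.0*h)
        vy(i,j) = -(s(i+1,j) - s(i-1,j)) / (2.0*h)
     end do
  end do

  ! Timestep: the smaller of the diffusive and advective limits
  vmax   = max(maxval(abs(vx)), maxval(abs(vy)))
  dt_dif = a_dif * h**2
  dt_adv = a_adv * h / vmax
  dt     = min(dt_dif, dt_adv)

  print *, 'max |v| =', vmax
  print *, 'dt_dif, dt_adv, dt =', dt_dif, dt_adv, dt
end program s_to_v_sketch
```

In the full program these steps would sit inside the time loop, after the two Poisson solves and before the advection-diffusion step.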

Program definition
Input parameters:
- nx, ny: number of grid points (y = vertical; grid spacing is the same, i.e., dx = dy = h)
- Ra: Rayleigh number (typical range 1e3-1e7)
- total_time: large enough to reach a stable state
Boundary conditions
- T: as before, T = 1 at bottom, 0 at top, dT/dx = 0 at sides
- S and W: 0 at all boundaries
Example structure
- Module with Poisson solver routines
- Module with 2_deriv, v_gradt, S_to_V routines
- Main program
Try to write frames at different times and make an animation!

Exercise
Get everything together and run a test case with
- nx=257, ny=65 (i.e., aspect ratio 4)
- Ra=1e5
- total_time=0.1
- err=1.e-3
- Random initial T field
Email code, plots and parameter file for this case.
Due next week: 04.12.2017

Should look something like this
[Figure: example result]
Note: It won't look exactly the same, due to the random initial T field. You might get 2 or 4 cells instead of 3.