Chapter 24a More Numerics and Parallelism

1 Chapter 24a More Numerics and Parallelism
Nick Maclaren
ix-courses/cplusplus
This was written by me, not Bjarne Stroustrup

2 Numeric Algorithms
These are only accumulate(), inner_product(), partial_sum() and adjacent_difference().
Not what numerical programmers call algorithms.
I can't see any particular reason to use them.
C++ developers rarely pay attention to numerical properties, or high performance, unlike Fortran ones.
They are likely to be just the obvious code.
The first three can be implemented much better.
I recommend doing as I show in the exercises: BLAS, long double or compensated summation.

3 Gaussian Elimination
The book teaches Gaussian elimination with pivoting as an example of a typical numeric algorithm.
You may need to write such code, in other contexts.
But DON'T just copy that code, for reasons I shall explain.
I am NOT criticising the book or code, merely stressing the software reuse principle.
The executive summary here is: use LAPACK.

4 Using Libraries
The first approach is to call a (good!) library.
These usually have a Fortran or C interface.
There are some C++ libraries around, too.
They are of VERY mixed quality.
NAG, LAPACK, FFTW are reliable.
Netlib is patchy, but some of it is good.
Numerical Recipes is NOT reliable.

5 How to Write Them
Choose a numerically competent algorithm!
This is the key to accuracy and performance.
Do NOT use Numerical Recipes as a guide.
The NAG documentation is much better.
When coding them, watch out for numeric errors.
Typically accumulation and cancellation errors.
For these, there are some adequate solutions.
Or subtracting/dividing two nearly-equal numbers.
This one is harder to resolve, and I shall skip it.

6 Improving Accuracy
Often arises when using accumulate() or inner_product().
Only simple solution is to use long double for the accumulation.
It's useful for the multiplication in inner_product(), too, but is not essential.
This is left as an exercise (see later).
It may or may not help, for very complicated reasons.
You can actually do a lot better (in accuracy).
But it's NOT a task for the non-expert.
Both numerically and in the C++ and C languages.
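A minimal sketch of the long double approach described above. The function names sum_long_double and dot_long_double are made up for illustration and are not taken from the course's specimen answers. Note that on some platforms long double is no wider than double, in which case this gains nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sum in long double and round to double once at the end.
double sum_long_double(const std::vector<double>& v) {
    // The type of the initial value (0.0L) decides the accumulator type.
    long double total = std::accumulate(v.begin(), v.end(), 0.0L);
    return static_cast<double>(total);
}

// Hypothetical helper: inner product with the multiplication and the
// accumulation both done in long double.
double dot_long_double(const std::vector<double>& x,
                       const std::vector<double>& y) {
    assert(x.size() == y.size());
    long double total = 0.0L;
    for (std::size_t i = 0; i < x.size(); ++i)
        total += static_cast<long double>(x[i]) * y[i];
    return static_cast<double>(total);
}
```

Passing 0.0 instead of 0.0L to std::accumulate would silently accumulate in double, which is an easy mistake to make.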

7 Improving Accuracy
Do not, repeat NOT, simply code Kahan summation.
A nightmare in C++, even for the VERY few experts.
The problem is primarily the C and C++ standards.
They don't specify what most people think they do.
All compilers, versions and options will vary.
Look in the specimen answers for this chapter on the local course Web site: fancy_accumulate.cpp and fancy_inner.cpp.
And read the comments: they are not exaggerated!
Those work as they stand under gcc but not Intel.
I can get them working under Intel, painfully.

8 Doing Better
Rule number one is to look for a better algorithm.
And at the highest level possible, too.
It is tricky, but the potential gains are huge.
You can extend the arithmetic's precision.
Do that only when a few 'operations' are the problem.
Addition/subtraction is the only easy case.
I can and have done multiplication, math. functions etc.
It's anywhere from painful to fiendish or worse.
The C/C++ standards are the real problem.
It can often be easier in assembler :-(

9 BLAS and LAPACK
Always a good idea to use their interface.
Have option of writing your own or calling them.
Optimised libraries can be a LOT faster.
Atlas, MKL, ACML etc., but not standard Linux ones.
Mainly the level 3 BLAS, but can include level 1.
E.g. xgemm matrix multiply.
inner_product() is level 1 (DDOT, ZDOT, ZDOTC).
The BLAS can increase accuracy, but generally don't.
LAPACK generally uses the level 3 BLAS.
Optimised ones include NAG, MKL, ACML.
They are also numerically robust algorithms.

10 Calling Them
Calling the BLAS and LAPACK interface:
The interface is usually Fortran 77.
A vendor may provide a C one, or even a C++ one.
The code may be in anything: it's not your problem.
This is not a big deal, but needs care.
Fortran 77 is to modern Fortran as C is to C++.
And you can usually get between Fortran 77 and C.
BLAS/LAPACK are unmodified Fortran 77.
This can't be called entirely portably.
The next slide gives the USUAL rules.

11 Calling Fortran 77
Call via extern "C".
BLAS name DDOT becomes ddot_.
A Fortran SUBROUTINE is a C void function.
ALL arguments are passed as pointers.
double and int carry across, including function results.
complex and C character arrays are OK, with care.
Do NOT call functions returning either as the result.
Write a small Fortran subroutine and return via arguments.
LOGICAL and character lengths are a bit of a problem.
In Fortran subroutine, translate LOGICAL to int.
For character strings, pass the length separately.
Fortran character strings are not null-terminated.
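The usual rules above can be sketched for DDOT as follows. The wrapper name blas_dot is hypothetical. So that the sketch links without any BLAS installed, it also defines a stand-in ddot_ in C++; that definition is purely illustrative and would be deleted when linking against a real BLAS. As the slides say, the name mangling and argument passing shown are the usual convention, not a guarantee.

```cpp
#include <cassert>
#include <vector>

// Usual (not guaranteed) convention: C linkage, lower-case name with a
// trailing underscore, and every argument passed as a pointer.
extern "C" double ddot_(const int* n, const double* x, const int* incx,
                        const double* y, const int* incy);

// Hypothetical thin C++ wrapper that hides the pointer passing.
double blas_dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const int inc = 1;  // unit stride through both vectors
    return ddot_(&n, x.data(), &inc, y.data(), &inc);
}

// STAND-IN ONLY (an assumption for this sketch): a plain C++ version of
// ddot_ so the example links on its own. It handles positive strides only.
// Remove it and link with a real BLAS in actual use.
extern "C" double ddot_(const int* n, const double* x, const int* incx,
                        const double* y, const int* incy) {
    double s = 0.0;
    for (int i = 0; i < *n; ++i)
        s += x[i * *incx] * y[i * *incy];
    return s;
}
```

Keeping the extern "C" declaration in one small wrapper means the rest of the program never sees the Fortran calling convention.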

12 Performance
It is possible to get array-handling C++ code to run as fast as Fortran (my specimen answers do, for example).
But it is MUCH harder to achieve.
Quite a lot of that has to do with the last dimension varying fastest (row-major order).
The problems are mainly that most good array libraries are Fortran-based.
This includes the BLAS and LAPACK.
But there do seem to be some fundamental ones as well.
E.g. find x such that b=a.x is more natural for column-major.
Left solution (i.e. to find x such that b=x.a) fits row-major better.
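To make the two storage orders concrete, here is a small sketch of the index arithmetic. The names ColMajor, row_major_index and col_major_index are hypothetical and are not the book's Matrix class or the course's specimen code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major (C and C++ convention): the LAST index varies fastest.
inline std::size_t row_major_index(std::size_t i, std::size_t j,
                                   std::size_t cols) {
    return i * cols + j;
}

// Column-major (Fortran, BLAS and LAPACK convention): the FIRST index
// varies fastest.
inline std::size_t col_major_index(std::size_t i, std::size_t j,
                                   std::size_t rows) {
    return i + j * rows;
}

// Hypothetical minimal 2-D matrix stored in column-major order, so its
// data can be handed to a Fortran routine without copying.
struct ColMajor {
    std::size_t rows, cols;
    std::vector<double> data;
    ColMajor(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return data[col_major_index(i, j, rows)];
    }
};
```

A row-major array passed unchanged to a Fortran routine is seen there as its transpose, which is where much of the storage order trouble comes from.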

13 Parallelism
Using multiple processes is easy.
Distributed memory and message passing.
Use MPI via C: see my MPI course for more.
You will need to pack and unpack C++ classes.
CilkPlus looks interesting: currently Intel only.
I can't remember exactly which product, so it may cost.
Intel are funding gcc to include it.
I hope to investigate it and maybe write a course.
It's a shared-memory C++ language extension.

14 Shared Memory
Aargh! This area of POSIX is a nightmare area.
Its specification often makes no sense.
Its memory model isn't compatible with C99's.
Its synchronisation doesn't cover program state.
C threading isn't usable by mere mortals.
Experts could use it to write higher-level primitives.
But I have reason to believe it won't work reliably.
I haven't had time to complete a test program.

15 OpenMP
This is the leader for shared-memory parallelism.
When the requirement is performance.
My OpenMP course describes a defensive strategy.
Its specification makes even POSIX's look good.
And it doesn't fit well with C++.
Realistically, you can parallelise only C-style code.
That's a soluble problem, in most cases.
You can use C++ in serial code, including <vector>.
Theoretically, OpenMP supports a lot more of C++.
In practice, I would expect truly foul problems.
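One way to read the advice above is to keep <vector> in the serial code and expose only raw pointers and a plain counted loop inside the parallel region. The sketch below does that for an inner product; the name omp_dot is hypothetical. If the compiler is not given its OpenMP option, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: C-style loop over raw pointers inside the
// parallel region, with std::vector used only in the serial part.
double omp_dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* px = x.data();
    const double* py = y.data();
    const long n = static_cast<long>(x.size());
    double sum = 0.0;
    // Each thread accumulates a private partial sum; OpenMP combines
    // them when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += px[i] * py[i];
    return sum;
}
```

Because the partial sums are combined in an unspecified order, the rounded result can differ slightly between runs and thread counts.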

16 Other Shared-Memory
There are Boost facilities, too.
DON'T rely on them.
The shared-memory problem is NOT about the calls.
It's not even about synchronisation etc.
It's ALL about the memory consistency model.
The question is whether the compiler agrees with Boost.
And much the same applies to any other facilities.
There are a zillion threading libraries, all dangerous.
As all experts agree, this CAN'T be done by a library.
Language and compiler support is CRITICAL.

17 Exercises
Instead of exercise 10, look up Marsaglia's DIEHARD or Knuth TAOCP, vol. 2.
Code one of the better tests, e.g. the runs test.
Use realistic sample sizes: millions or more.
Or use the spacings test, which I have done.
Generate a U(0,1) sample of size N and sort into order.
The spacings are negative exponential, mean 1/(N+1).
Test using Kolmogorov-Smirnov or otherwise.
LOTS of simulations rely on adjacency properties.
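A rough sketch of the spacings test as outlined above, not the author's own version. The names spacings_ks and uniform_sample are hypothetical, and the generator (std::mt19937_64) is just an assumption standing in for whatever generator is under test. The function returns only the Kolmogorov-Smirnov distance; turning that into a significance level is left out, and since the spacings are not independent the standard KS tables apply only approximately.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// KS distance between the N+1 spacings of a U(0,1) sample of size N and
// the negative exponential distribution with mean 1/(N+1).
double spacings_ks(std::vector<double> u) {
    assert(!u.empty());
    std::sort(u.begin(), u.end());
    const std::size_t n = u.size();
    std::vector<double> s;
    s.reserve(n + 1);
    s.push_back(u.front());                  // gap from 0 to the first point
    for (std::size_t i = 1; i < n; ++i)
        s.push_back(u[i] - u[i - 1]);        // gaps between neighbours
    s.push_back(1.0 - u.back());             // gap from the last point to 1
    std::sort(s.begin(), s.end());
    const double m = static_cast<double>(s.size());   // N+1 spacings
    double d = 0.0;
    for (std::size_t k = 0; k < s.size(); ++k) {
        const double f = 1.0 - std::exp(-m * s[k]);   // model CDF, rate N+1
        d = std::max(d, std::max(f - k / m, (k + 1) / m - f));
    }
    return d;
}

// Hypothetical helper: draw a U(0,1) sample from the generator under test.
std::vector<double> uniform_sample(std::size_t n, unsigned seed) {
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> u(n);
    for (double& x : u) x = dist(gen);
    return u;
}
```

A call such as spacings_ks(uniform_sample(1000000, 1u)) gives one statistic; an evenly spaced "sample" gives a large distance, because its spacings are all nearly equal rather than exponential.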

18 Exercises
The first two extra ones are about basic algorithms and accuracy, to give you a feel for that.
One uses the BLAS, but it probably won't do much.
Look at my code to see why I say what I do.
The others are about using matrices.
I use Cholesky as a basis, because it is simpler than Gaussian elimination.
It is for positive definite real matrices ONLY, and needs no pivoting.
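For orientation, here is a textbook-style sketch of an unpivoted Cholesky factorisation A = L L^T on a column-major array. The name cholesky_lower and the flat-vector layout are assumptions for illustration; this is not the cholesky() from the course files, and for real work the slides' advice to call LAPACK stands.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place Cholesky of an n x n symmetric positive definite matrix held
// in column-major order: element (i,j) is a[i + j*n]. On success the lower
// triangle holds L; the strict upper triangle is left untouched.
// Returns false if a non-positive pivot shows the matrix is not
// positive definite.
bool cholesky_lower(std::vector<double>& a, std::size_t n) {
    assert(a.size() == n * n);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: subtract squares of the row of L already computed.
        double d = a[j + j * n];
        for (std::size_t k = 0; k < j; ++k)
            d -= a[j + k * n] * a[j + k * n];
        if (!(d > 0.0)) return false;
        const double ljj = std::sqrt(d);
        a[j + j * n] = ljj;
        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i + j * n];
            for (std::size_t k = 0; k < j; ++k)
                s -= a[i + k * n] * a[j + k * n];
            a[i + j * n] = s / ljj;
        }
    }
    return true;
}
```

There is no pivot search anywhere, which is what makes this simpler than Gaussian elimination; the price is that it fails outright on matrices that are not positive definite.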

19 Exercises
Exercise 13. Take accumulate.cpp and complete it (see statements marked CHANGE).
It's completed (and more) in fancy_accumulate.cpp.
Exercise 14. Do the same for inner.cpp.
You will need -lblas to link it.
It's completed (and more) in fancy_inner.cpp.
These exercises are fairly easy.
The point of my fancy coding is to show why I make the remarks I do.
There be dragons!

20 Exercises
I recommend doing exercises if you are going to need to do any serious n-d array handling.
They are about the simplest realistic problem possible.
Tackling a 'real' problem as a first step is insane.
FAR WORSE, you are likely to do things in bad ways.
They teach how to call the BLAS/LAPACK.
And provide a proper interface to them!
They will expose some of the gotchas.
Never underestimate the problems these can cause.

21 Exercises
Exercise 15. Take Book_matrix_zero.cpp and complete it according to the instructions.
This will use the book's Matrix.h class to solve Cholesky by calling the BLAS/LAPACK, and by hand.
There is a specimen answer in Book_matrix_one.h.
Do not worry if the matrix multiply is very slow.
Exercise 16. Attempt to optimise matmul().
Aim for same time as cholesky() on 1000x1000.
Clue: transpose matrices to do all inner loops along the fastest varying dimension.
Use slices when you can do that.
There is a specimen answer in Book_matrix_two.cpp.
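The transposition clue can be illustrated on flat row-major arrays. This sketch does not use the book's Matrix.h, and matmul_transposed is a hypothetical name: the second operand is transposed once so that both inner-loop operands are read along the fastest varying dimension.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for n x n matrices in row-major order: element (i,j) of a
// matrix m is m[i*n + j].
std::vector<double> matmul_transposed(const std::vector<double>& a,
                                      const std::vector<double>& b,
                                      std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    // Transpose B once, so that a column of B becomes a contiguous row.
    std::vector<double> bt(n * n);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            bt[j * n + i] = b[i * n + j];
    std::vector<double> c(n * n);
    for (std::size_t i = 0; i < n; ++i) {
        const double* arow = a.data() + i * n;
        for (std::size_t j = 0; j < n; ++j) {
            const double* bcol = bt.data() + j * n;
            double s = 0.0;
            // Both operands are now walked with unit stride.
            for (std::size_t k = 0; k < n; ++k)
                s += arow[k] * bcol[k];
            c[i * n + j] = s;
        }
    }
    return c;
}
```

The transpose costs O(n^2) work against O(n^3) for the multiply, so it is cheap for large matrices; how much time it saves depends on the machine's memory system.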

22 Exercises
Exercise 17. Change matmul(), cholesky() and solver() to use the BLAS and LAPACK.
This will be a LOT faster if you use MKL, ACML etc., and faster (especially the solver) even with GNU versions.
Be warned: this needs a clear head.
I did it by comparing intermediate results with a working version on 3x3 matrices.
The problem is storage order incompatibility.

23 Exercises
Exercise 18. Take My_matrix_zero.cpp, and complete the program.
Write a very simple 2-D double matrix class.
Implement only what you need.
Use first dimension varying fastest (column-major).
Complete the calls to the BLAS and LAPACK.
Write a matmul().
There is a specimen answer in My_matrix_one.cpp.
Do not try to be clever at this stage.
This is a lot easier than you might think.

24 Exercises
Exercise 19. Take the program you wrote in exercise 18 and extend it to work better.
Use the techniques in this chapter.
The higher level code should use inner product calls and A += z*a, where A is a 1-D slice.
Do NOT try to provide a proper interface for slices.
Provide them solely for matmul(), cholesky() and solver().
Do support both row and column slices.
Try to get matrix multiply to run faster.
There is a specimen answer in My_matrix_two.h.

25 Exercises
My_matrix_three.h uses a high-precision inner product (from my fancy answer to exercise 14).
It doesn't make very much difference, to time or accuracy.
The solver is twice as slow and still much less than machine accuracy.
Why? The time is in memory access, and the accuracy limit is in the mathematics.
LAPACK is robust.
But, occasionally, this technique can be necessary.
Exercise 20. For extreme masochists only.
Try repeating these exercises with the <valarray> and <gslice> or Boost::multi_array.

26 Next lecture
There is no next lecture!
We are at the end.
