Development Tools for Parallel Computing. David Lecomber CTO, Allinea Software

1 Development Tools for Parallel Computing David Lecomber CTO, Allinea Software

2 Agenda
- Introduction
- What is HPC
- Bugs and debugging
- Debugging parallel applications
- Challenges for the future

3 About Allinea
- Development tools company for HPC
- Flagship product: Allinea DDT
  - The most scalable debugger
  - Now the leading debugger in parallel computing
  - Record holder for debugging software on the largest machines
  - Production use at extreme scale and on the desktop
- Wide customer base
  - Blue-chip engineering, government and academic research
  - Strong collaborative relationships with customers and partners

4 What is HPC
- High Performance Computing
- Common aliases: supercomputing, scientific computing, parallel computing
- Simulation of some natural process or thing
  - Intense number crunching: CPUs work flat out
  - Historically distinct from data crunching
- Very large number of usually interrelated calculations: too big or too slow for a single machine
- Rarely real time today (but soon?)
- Examples
  - Engineering: aerospace, automotive
  - Sciences: nuclear physics, molecular modelling, astrophysics
  - Oil and gas: reservoir modelling
  - Medical: modelling of the human heart, neurology
  - Climate modelling and weather forecasting

5 Parallel programming in HPC
- A world of pragmatists
  - Scientists, academics, grad students, engineers
  - Fortran, C++
  - Many legacy codebases
  - Distributed development
  - Difficult to test: scale, platforms, ...
- One dominant standard library: MPI
  - Job launch and data transfer between machines
  - Point-to-point communication (send, receive) and collective operations
  - Single program, multiple data (SPMD): multiple processes with separate memory
- Other models: OpenMP, PGAS languages
- Decades of parallel computing
  - Problems are naturally parallel, although sometimes complex to partition

6 Parallel Programming Models
- Shared memory: OpenMP
  - Pragmas added to existing code
  - Can be straightforward: #pragma omp parallel for, followed by for (i = 0; i < n; i++) { ... }
  - Data race conditions are a potential problem
  - Shared memory required
  - Try it with gcc -fopenmp
- Distributed memory: MPI
  - Distributed memory communication library
  - MPI_Send: send bytes to process N
  - MPI_Recv: receive bytes from another process
  - MPI_Bcast: broadcast from one process to all
  - Around 200 functions; many codes use only ~10
  - Free implementations, e.g. Open MPI and MPICH, do not require a cluster or supercomputer
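A minimal sketch of the shared-memory model on this slide, assuming a simple array reduction (the function name sum_of_squares is hypothetical and not from the talk). One pragma parallelises the loop. The reduction clause is what avoids the data race the slide warns about: without it, every thread would update total at the same time. Built without -fopenmp the pragma is ignored and the loop runs serially, giving the same answer up to floating-point rounding.

```c
#include <assert.h>

/* Hypothetical example: sum of squares of x[0..n-1].
 * With reduction(+:total) each thread accumulates a private partial
 * sum, and OpenMP adds the partial sums together after the loop. */
double sum_of_squares(const double *x, int n)
{
    double total = 0.0;
    int i;

    assert(n >= 0);

    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < n; i++) {
        total += x[i] * x[i];
    }
    return total;
}
```

To try it, call sum_of_squares from a small main and build with gcc -fopenmp, as the slide suggests.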

7 Example Code

8 The impact of multicore
- Cannot wait for faster processors to arrive
- Performance/capability leaps only via more parallelism
  - Two strong oxen or 1,024 chickens?
  - 8 Petaflops but near 10 Megawatts: efficiency is important
- Reluctant adopters of multicore, but why?
  - Existing codes are parallel
  - Scalability (performance) often tails off as process counts rise ("weak scaling")
- One survey of 188 supercomputing centres (IDC):
  - 52% of HPC applications run above 1 node
  - 12% of HPC applications scale above 1,000 cores
  - 1% of applications scale above 10,000 cores
- Software development is required to use more parallel resource efficiently

9 How extreme is it?
[Figure: Growth in HPC core counts. Core count against year, with series for the largest, average and smallest machines.]
- Machine sizes are exploding
- Skewed by the largest machines, but a common trend
- Largest system (Nov 2011): Japan, 10 Petaflops
- UK's largest: 90,000 cores and 2/3 of a Petaflop
- Easier to build a machine than it is to program it

10 HPC's current challenge
- GPUs: a rival to traditional processors
  - AMD and NVIDIA
  - OpenCL, CUDA
  - Great bang-for-buck ratios
- A big challenge for HPC developers
  - Data transfer
  - Several memory levels
  - Grid/block layout and thread scheduling
  - Synchronization
  - Tiny granularity: often one thread per single calculation (SIMD)
- New languages, compilers, potential standards

11 Example GPU algorithm
- Matrix-matrix multiplication
- For C
  - Nested loops, ~4 lines of code in C
- For CUDA
  - Transfer whole matrix to device memory
  - Read lines of A for block to shared memory
  - Read columns of B for block to shared memory
  - Synchronise
  - Calculate output (loop): one output cell of C per GPU thread
  - End kernel
  - Write array back to host memory
- Recognizably C but...
  - More complex
  - More concurrent
  - More buggy
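For reference, the C side of this comparison can be sketched as below. This is an illustration rather than the code shown in the talk: the function name matmul and the row-major flat-array layout are assumptions. The innermost loop computes one output cell, which is the unit of work the slide assigns to a single GPU thread in the CUDA version.

```c
#include <assert.h>

/* C = A * B for square n x n matrices stored row-major in flat arrays
 * (the layout is an assumption made for this sketch). */
void matmul(int n, const double *a, const double *b, double *c)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            /* Row i of A times column j of B gives one output cell. */
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```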

12 A parallel hybrid world
- Hardware is determining the software
  - To exploit concurrency within a multicore node: shared memory via OpenMP, pthreads, ...
  - To exploit GPUs: CUDA, OpenCL, ...
  - For multiple nodes: MPI
- Result: mixtures of paradigms
  - Message passing, shared memory and GPU
- Very large GPU systems now in service:
  - Oak Ridge National Laboratory, Tennessee: Titan (Cray XK6), 20,000 nodes, 299,008 CPU cores and 960 NVIDIA Tesla GPUs (and growing...)
  - NUDT, China: Tianhe-1A, 86,038 CPU cores and 7168 NVIDIA Tesla GPUs; $88M to build, $20M to run
- Many software rewrites are in progress because of GPUs
  - Cost/performance versus codebase complexity, longevity and development investment
- ... but what do we do when software fails?

13 So how do we fix software?
- With thousands of threads, millions of variables and terabytes of data, how do you figure out what's going on with your code?
  - Old tricks long dead: multiple terminals, print statements, ...
- Different from (e.g.) web farms
  - Everything is inter-related, not independent
  - We need to see all threads and processes together
- Different from most other fields? From embedded multicore?
  - Only in scale (sometimes in terminology)
- Does it look like your problem? Does it look like your next problem?

14 Bugs in Practice

15 Some types of bug
Some terminology:
- Bohr bug: steady, dependable bug
- Heisenbug: vanishes when you try to debug (observe) it
- Mandelbug: the complexity and obscurity of the cause is so great that it appears chaotic
- Schroedinbug: first occurs after someone reads the source file and deduces that it never worked, after which the program ceases to work

16 How do we debug?
- A scientific process?
  - Identify and reproduce the bug
  - Hypothesis, trial and observation: understand how the code is behaving...
- Printf
- Command line debuggers
- Graphical debuggers
- Other options
  - Static analysis
  - Race detection with automated tools
  - Valgrind
  - Manual source code review
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." Brian W. Kernighan and P. J. Plauger, The Elements of Programming Style.

17 The oldest debugger in the world
- All developers know printf
- As part of more general debugging messages
  - Run the binary in a debug mode to log behaviour
  - Default for many web applications, e.g. HTTP access logs
  - A form of post-mortem debugging
- In response to a specific problem
  - Insert instructions into the code to test a hypothesis: "At line 53, x becomes 7 and the if statement is executed"
  - Recompile, run, examine output
  - Hypothesis incorrect? Loop back and try again
  - Hypothesis correct? Remove the debug output from the binary, and fix the bug
- At scale?
  - Interference with timing
  - Interleaving of output can be misleading
  - Flushing of output can be misleading
  - Too much output
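A hedged sketch of what such inserted debug output often looks like in a parallel code. The helper names format_debug_line and debug_log are hypothetical, and rank stands for whatever process identifier the program has (for example its MPI rank). Tagging each line with the rank and writing it in one call addresses the interleaving problem only partly, and the explicit flush is itself one source of the timing interference listed above.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: build one complete debug line, tagged with the
 * process rank and source location, so lines from different processes
 * can be told apart when their output is interleaved.
 * Returns what snprintf returns: the untruncated length of the line. */
int format_debug_line(char *buf, size_t len, int rank,
                      const char *file, int line, const char *msg)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "[rank %d] %s:%d: %s\n", rank, file, line, msg);
}

/* Write the whole line in a single call and flush straight away, so
 * the message is not left in a buffer if the process then crashes. */
void debug_log(int rank, const char *file, int line, const char *msg)
{
    char buf[256];
    format_debug_line(buf, sizeof buf, rank, file, line, msg);
    fputs(buf, stderr);
    fflush(stderr);
}
```

A typical call site would be debug_log(rank, __FILE__, __LINE__, "x becomes 7"), removed again once the hypothesis has been tested.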

18 Real debuggers...
- Inspect the insides of an application whilst it is alive
- Inspect process state
  - Process registers and memory
  - Variables and stack traces (nesting of function calls)
- Control and observe execution
  - Step line by line, function by function through an execution
  - Stop at a line or function (breakpoint)
  - Stop if a memory location changes
  - Ideal for watching how a program is executed
- Less intrusive on the code than printf
  - See the exact line of a crash, unlike printf
  - Test more hypotheses at a time
- Most well-known examples cater for single process debugging
  - GDB, Visual Studio, ...

19 Debugging Parallel Applications
- The same needs: observation, control, ...
- More complex environment
  - No command prompt
  - Printf unreliable
  - No core files
- More complex problems
  - More processes
  - More data
  - More Heisenbugs: threading and communication introduce non-determinism

20 Allinea DDT in a nutshell
- Graphical source level debugger for
  - Parallel, multi-threaded, scalar or hybrid code
  - C, C++, F90, Co-Array Fortran, UPC
- Strong feature set
  - Memory debugging
  - Data analysis
- Managing concurrency
  - Emphasizing differences
  - Collective control
- Make as simple as possible, no more

21 Fixing everyday crashes
- Typical crash scenario:
  - Threads/processes can be anywhere
  - Too many to examine individually by hand
  - A good overview is important
- Allinea DDT merges stacks from processes and threads into a tree
  - Leap to source for crashes
  - Information delivered scalably, without overload
  - Common fault patterns evident instantly: divergence, deadlock
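The idea behind merging stacks can be illustrated with a much simpler stand-in. The sketch below is not how Allinea DDT is implemented: instead of building a tree it groups processes whose call stacks are textually identical and counts them, which is already enough to show divergence (one small group somewhere unexpected). The StackGroup type, the group_stacks function and the "main>solve>..." string encoding of a stack are all hypothetical.

```c
#include <assert.h>
#include <string.h>

/* One group per distinct call stack: the stack text, how many
 * processes are at that stack, and the lowest rank found there. */
typedef struct {
    const char *stack;
    int count;
    int first_rank;
} StackGroup;

/* Group nprocs stack strings (array index = rank) by equality.
 * Returns the number of groups, or -1 if max_groups is too small.
 * Quadratic in the worst case: fine for a sketch, not for 100,000 ranks. */
int group_stacks(const char *stacks[], int nprocs,
                 StackGroup groups[], int max_groups)
{
    int ngroups = 0;

    assert(nprocs >= 0 && max_groups >= 0);
    for (int rank = 0; rank < nprocs; rank++) {
        int g = 0;
        while (g < ngroups && strcmp(groups[g].stack, stacks[rank]) != 0) {
            g++;
        }
        if (g == ngroups) {             /* first process seen at this stack */
            if (ngroups == max_groups) {
                return -1;
            }
            groups[g].stack = stacks[rank];
            groups[g].count = 0;
            groups[g].first_rank = rank;
            ngroups++;
        }
        groups[g].count++;
    }
    return ngroups;
}
```

With thousands of processes blocked in a receive and a handful elsewhere, the output is two short lines rather than thousands of stacks to read.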

22 Process Control
- Interacting with application progress is easy with DDT
  - Step, breakpoint, play, or set data watchpoints based on groups
  - Change interleaving order by stepping/playing selectively
- Group creation is easy
  - Integrated throughout Allinea DDT, e.g. stack and data views

23 Simplifying data divergence
- Clear need to see data
  - Too many variables to trawl manually
  - Allinea DDT compares data automatically
- Smart highlighting
  - Subtle hints for differences and changes
  - New: now with sparklines!
- More detailed analysis
  - Full cross-process comparison
  - Historical values via tracepoints

24 Large Array Support
- Browse arrays
  - 1, 2, 3, ... dimensions
  - Table view
- Filtering
  - Look for an outlier
- Export
  - Save to a spreadsheet
- View arrays from multiple processes
- Search terabytes for rogue data in parallel

25 A simple parallel debugger
- A basic parallel debugger
  - Aggregate scalar debuggers: they work, and are a good starting point
  - Implement support for many platforms and MPI implementations
  - Develop a user interface: control asynchronously, simplify control and state display
[Diagram: a user interface connected to controllers, each driving a debugger attached to a process.]
- Initial architecture
  - Scalar debuggers connect to the user interface
  - Direct connections: linear performance
- Any per-process item is an eventual bottleneck
  - Operating system limitations: file handles on the GUI; threads, processes
  - I/O limitations: linear access counts on the best networked file systems are still linear
  - Memory and computation limitations
- Machines still getting bigger...

26 Bug fixing at scale
- Can we reproduce at a smaller scale?
  - Attempt to make the problem happen on fewer nodes
  - Often requires a reduced data set: the large one may not fit
  - A smaller data set may not trigger the problem
- Does the bug even exist on smaller problems?
  - Didn't you already try the code at small scale?
  - Is it a system issue, e.g. an MPI problem?
- Is probability stacking up against you?
  - Unlikely to spot on smaller runs without many, many runs
  - But near guaranteed to see it on a many-thousand core run
- Debugging at extreme scale is a necessity
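The probability argument can be made concrete under a simplifying assumption that is not in the slides: each process hits the bug independently with the same small probability p per run. The chance of seeing it at least once across n processes is then 1 - (1 - p)^n. The function name prob_bug_seen is hypothetical.

```c
#include <assert.h>

/* Probability that at least one of n processes hits the bug in a run,
 * assuming each process hits it independently with probability p:
 * 1 - (1 - p)^n. The power is computed with a plain loop so that the
 * sketch does not need the maths library. */
double prob_bug_seen(double p, long n)
{
    double none = 1.0;   /* probability that no process hits the bug */

    assert(p >= 0.0 && p <= 1.0 && n >= 0);
    for (long i = 0; i < n; i++) {
        none *= 1.0 - p;
    }
    return 1.0 - none;
}
```

With an illustrative p of one in 10,000 (an assumed figure), a 16-process run shows the bug roughly 0.16% of the time, whereas a 100,000-process run shows it more than 99.99% of the time.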

27 How to make a Petascale debugger
- A control tree is the solution
  - Ability to send bulk commands and merge responses
  - 100,000 processes in a depth 3 tree
- Compact data type to represent sets of processes
  - E.g. for message envelopes
  - An ordered tree of intervals? Or a bitmap?
- Develop aggregations
  - Merge operations are key
  - Not everything can merge losslessly
  - Maintain the essence of the information, e.g. min, max, distribution
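A minimal sketch of a merge operation of the kind this slide describes, assuming each process reports one numeric value. The Summary type and function names are hypothetical and this is not DDT's actual data structure. The point is that every node of the control tree forwards a fixed-size summary to its parent, so the volume of data reaching the user interface does not grow with the number of processes. On the tree shape: if every node has the same fan-out, a depth 3 tree needs a fan-out of about 47 to reach 100,000 leaves, since 47^3 = 103,823.

```c
#include <assert.h>

/* Summary of one variable across a set of processes: a lossy but
 * constant-size digest (count, min, max, sum) of the values below. */
typedef struct {
    long count;
    double min;
    double max;
    double sum;
} Summary;

/* Leaf of the tree: the summary of a single process's value. */
Summary summary_of(double value)
{
    Summary s = { 1, value, value, value };
    return s;
}

/* Interior node: merge the summaries of two subtrees. The operation is
 * associative and commutative, so the shape of the tree does not change
 * the result (apart from floating-point rounding in sum). */
Summary summary_merge(Summary a, Summary b)
{
    Summary m;

    assert(a.count > 0 && b.count > 0);
    m.count = a.count + b.count;
    m.min = a.min < b.min ? a.min : b.min;
    m.max = a.max > b.max ? a.max : b.max;
    m.sum = a.sum + b.sum;
    return m;
}
```

The individual values are lost in the merge, which is the slide's "not everything can merge losslessly"; the mean is still recoverable as sum / count.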

28 For Petascale and beyond
[Figure: DDT 3.0 performance figures. Time in seconds against number of MPI processes, with series "All Step" and "All Breakpoint".]
- High performance debugging, even at 220,000 cores
  - Step all and display stacks: 0.1 seconds
  - Logarithmic
- Partnership with largest users
  - DoE Oak Ridge National Laboratory
  - LLNL, ANL, CEA and others
- Usability is a big thing
  - Scalable interface and features
- One million cores? Waiting for the machine!

29 The Future
- Concurrency will increase
  - 2012 or early 2013: DDT will debug a million core system
- International and national groups are preparing for Exascale
  - 100x more powerful than today's most powerful system
  - Expected to be multi-level parallel (hybrid)
- Continued adoption of multicore and hybrid programming in the consumer arena
  - Laptops, tablets, mobile phones
