Development tools to enable Multicore

1 Development tools to enable Multicore: From the desktop to the extreme. A perspective on multicore, looking in from HPC. David Lecomber, CTO, Allinea Software, david@allinea.com

2 Introduction. The multicore challenge in a nutshell: a vast collection of existing single-core software, development tools and developers. Multicore processors leave legacy software stuck in first gear. Software is the challenge of multicore.

3 Allinea Software. A development tools company, leading in the HPC software tools market. Global customer base: blue-chip engineering, government and academic research. Allinea DDT: leading debugger in parallel computing; world's only scalable debugger; record holder for debugging software on the largest machines; production use at extreme scale and on the desktop. Allinea OPT: profiling tool for parallel applications.

4 What is HPC? High Performance Computing: simulation of some natural process or thing. Intense number crunching, with CPUs worked flat out. A very large number of usually interrelated calculations, too big for a single machine (in data or time). Rarely real time. Distinct from data crunching.

5 Large users of HPC: Engineering; Aerospace, Automotive; EDA; Sciences; Nuclear; Oil and gas; Pharmaceutical; Climate and weather; Finance.

6 Extreme machine sizes. [Figure: Growth in HPC core counts; core count by year, with series for the largest, smallest and average machines.] Scientific progress requires more CPU hours. Machine sizes are exploding: the figures are skewed by the largest machines, but the trend is common. Software is changing to exploit the machines.

7 Parallel programming in HPC. A world of pragmatists: scientists, academics, grad students, engineers. Fortran and C++; many legacy codebases; distributed development. Single program, multiple data (SPMD): multiple processes with separate memory. One standard library for job launch and data transfer between machines: MPI. Decades of parallel computing. Problems are naturally parallel, although sometimes complex to partition.

8 The challenge of multicore. Cannot wait for faster processors to arrive: performance/capability leaps come only via more parallelism. Reluctant adopters of multicore, but why? Existing codes are parallel, yet scalability (performance) often tails off as process counts rise ("weak scaling"). Two strong oxen or 1,024 chickens? 8 petaflops, but near 10 megawatts: efficiency is important. One survey of 188 supercomputing centres (IDC): 52% of HPC applications run above 1 node; 12% scale above 1,000 cores; 1% scale above 10,000 cores. Software development is required to use more parallel resource efficiently.

9 HPC's current challenge. GPUs are a rival to traditional processors: AMD and NVIDIA; OpenCL, CUDA. Great bang-for-buck ratios. A big challenge for HPC developers: data transfer; several memory levels; grid/block layout and thread scheduling; synchronization; new languages, compilers, potential standards.

10 GPU: a very parallel platform. Typically 20,000 active threads in a GPU today, each doing a tiny fraction of the work for the result. Potentially more parallel, but algorithms are different at this scale and some problems are hard to solve efficiently. Costly to set up, in transfer costs for input and output data. Limitations on available shared memory and registers. Need to remember and use limits for defining thread layout. There's a spreadsheet for this!

11 Example GPU algorithm: matrix-matrix multiplication. For C: nested loops, ~4 lines of code in C. For CUDA: transfer whole matrix to device memory; read lines of A for block to shared memory; read columns of B for block to shared memory; synchronise; calculate output (loop), one output cell of C per GPU thread; end kernel; write array back to host memory. Recognizably C but... more complex, more concurrent, more buggy.
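The contrast between the two versions can be sketched in plain Python. This is only an illustration of the structure, not CUDA: the function names, the `tile` parameter and the restriction to square matrices are assumptions made for the example, and the "blocks" and "threads" are emulated by ordinary sequential loops.

```python
def matmul_naive(a, b):
    """Nested-loop product, the shape of the '~4 lines of C' version."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Sequential emulation of the block-wise scheme for square matrices.

    Each (block_row, block_col) pair stands in for one GPU thread block;
    a_tile and b_tile stand in for the copies held in shared memory.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for block_row in range(0, n, tile):
        for block_col in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Read lines of A and columns of B for this block.
                a_tile = [row[k0:k0 + tile] for row in a[block_row:block_row + tile]]
                b_tile = [row[block_col:block_col + tile] for row in b[k0:k0 + tile]]
                # A real kernel would synchronise its threads here.
                for ty in range(len(a_tile)):          # one "thread" per output cell
                    for tx in range(len(b_tile[0])):
                        acc = 0.0
                        for k in range(len(b_tile)):
                            acc += a_tile[ty][k] * b_tile[k][tx]
                        c[block_row + ty][block_col + tx] += acc
    return c
```

Even in this emulation the tiled version carries more index arithmetic and more places for an off-by-one error than the nested loops, which is the point the slide makes about complexity.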

12 Result: a parallel hybrid world. Hardware is determining the software. Parallel frameworks exploit the single address space within nodes (shared memory via OpenMP, pthreads, ...) or exploit GPUs (CUDA, OpenCL, ...). Issues: cost/performance vs development investment; codebase complexity and longevity. Mixtures of paradigms: message passing, shared memory and GPU. MPI + OpenMP: still not typical; the benefits are not worth the shift for many MPI applications. MPI + GPU: many HPC systems have the GPU as the grunt of the machine, and cannot leave the majority of flops in a system idle! Many software projects are in progress because of GPUs. More complex and heterogeneous than before... but what do we do when software fails?

13 So how do we fix software? With thousands of threads, millions of variables and terabytes of data, how do you figure out what's going on with your code? Old tricks are long dead: multiple terminals, print statements, ... Different from (e.g.) Google: everything is inter-related, not independent, so we need to see all threads and processors together. Different from most other fields? From embedded multicore? Only in scale (sometimes in terminology). Does it look like your problem? Does it look like your next problem?

14 Allinea DDT. Graphical debugger designed for: multithreaded code (single address space); multiprocess code (interdependent or independent processes); parallel code (multi-node software); any mix of the above. Strong feature set: memory debugging, data analysis. Managing concurrency: emphasizing differences, collective control. Make as simple as possible, no more.

15 Understanding control flow. When an application crashes, threads/processes can be anywhere. Cannot scroll through them individually, yet finding them is essential. Allinea DDT merges stacks from processes and threads into a tree: common fault patterns (divergence, deadlock) are evident instantly, and information is presented scalably without overload. The concept works across many scenarios: multiple applications and different binaries; client(s)-server; multiple processes, threads or even GPU threads; MPI or independent processes; from a single thread to millions; pthreads, OpenMP, GPU threads.
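The idea of merging stacks into a tree can be illustrated with a small prefix tree. This is a minimal sketch under assumptions, not Allinea DDT's implementation: the data layout (a dict from rank to a list of frame names, outermost first), the function names and the sample frames are all hypothetical.

```python
def merge_stacks(stacks):
    """Merge per-rank call stacks into one tree keyed by frame name.

    Each node records which ranks passed through it, so identical
    stacks collapse into a single branch.
    """
    root = {"ranks": [], "children": {}}
    for rank, frames in sorted(stacks.items()):
        node = root
        node["ranks"].append(rank)
        for frame in frames:
            node = node["children"].setdefault(frame, {"ranks": [], "children": {}})
            node["ranks"].append(rank)
    return root


def render(node, depth=0):
    """Return indented lines: frame name followed by the number of ranks in it."""
    lines = []
    for name, child in node["children"].items():
        lines.append(f"{'  ' * depth}{name} x{len(child['ranks'])}")
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical job: three ranks wait in a receive, one has moved on.
stacks = {
    0: ["main", "solve", "MPI_Recv"],
    1: ["main", "solve", "MPI_Recv"],
    2: ["main", "solve", "MPI_Recv"],
    3: ["main", "write_output"],
}
tree = merge_stacks(stacks)
```

With this input the tree has two branches under `main`, so the one rank that diverged stands out immediately, however many ranks share the common branch.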

16 Controlling progress. Bulk control is essential for multicore debugging: group processes together; step, breakpoint, play, ... based on group; change interleaving order by stepping/playing selectively. Integrated throughout Allinea DDT: stack and data views for group creation.

17 When data diverges. Developers need to see data, but there are too many variables to trawl manually. Allinea DDT compares the data automatically and subtly highlights a value if it differs on another process or has changed. More detailed analysis: a full cross-process comparison retrieves data, fast even at scale; historical values of variables.
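A cross-process comparison can be thought of as grouping ranks by the value they hold for one variable. The sketch below assumes the values have already been gathered into a dict from rank to value; the function name and sample data are hypothetical and say nothing about how Allinea DDT collects or displays the data.

```python
from collections import defaultdict


def compare_across_ranks(values):
    """Group ranks by the value they hold, largest group first."""
    groups = defaultdict(list)
    for rank, value in sorted(values.items()):
        groups[value].append(rank)
    # Sorting by group size pushes outliers to the end of the list.
    return sorted(groups.items(), key=lambda item: -len(item[1]))


# Hypothetical loop counter read from four ranks.
summary = compare_across_ranks({0: 100, 1: 100, 2: 100, 3: 97})
```

The result is one entry per distinct value rather than one per process, which keeps the summary short when most of a large job agrees.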

18 Searching haystacks. Arrays are the building blocks of HPC. The largest jobs accumulate terabytes of data; ~2GB per core is the typical maximum available, and it is frequently used. Allinea DDT displays and searches across the whole job in parallel. Sometimes need to search for NaN/Inf etc. Export at runtime. Working on real visualization integration.
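As a rough picture of a NaN/Inf search, the sketch below scans each rank's local slice of an array and reports where non-finite values sit. It is an assumption-laden illustration: the dict-of-lists layout and the function name are hypothetical, and the loop over ranks runs sequentially here, whereas in a real job each process would scan its own data and only the hits would be collected.

```python
import math


def find_non_finite(local_arrays):
    """Return (rank, index, value) for every NaN or infinity found."""
    hits = []
    for rank, data in sorted(local_arrays.items()):
        for index, value in enumerate(data):
            if not math.isfinite(value):
                hits.append((rank, index, value))
    return hits


# Hypothetical three-rank job with one NaN and one infinity.
arrays = {
    0: [1.0, 2.0],
    1: [float("nan"), 3.0],
    2: [4.0, float("inf")],
}
hits = find_non_finite(arrays)
```

Only the locations of the hits need to travel back to the user, not the terabytes that were searched.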

19 Heterogeneity example: GPU. Debuggers based on graphics were initially released by hardware vendors: at a 1/30th-second frame rate, a 1000-fold slowdown is painless when debugging a frame. HPC kernels already had runtimes of many seconds. Back to the drawing board! Command-line tool is difficult to see through: fundamentals of control are in place... but intractable thread lists make usage impossible. Now consider multiple such tools!

20 Far easier with a GUI. Allinea DDT introduced NVIDIA GPU support, built on NVIDIA's low-level efforts: cuda-gdb, driver and compiler work. The execution model is unusual and GUI work was required: 32-thread units, in customizable blocks and grids. Mixed GPU and CPU within one interface: interaction with the CPU is clear in DDT; easy to switch between contexts; parallel stacks, thread selectors etc.; data and threads from each context are clear. Supports multiple nodes.

21 Debugging's dirty secret. In spite of usage at the core of parallel computing, multicore was not fully exploited by debuggers: single-threaded interfaces. Core counts increased and clock speeds tailed off; debuggers stayed still but machines grew. Architected for bottlenecks: the GUI directly instructed processes individually, so doubled node counts doubled resources and collective operation time... and the machines kept on growing. By mid-2007 the problem was acute: the largest machine was now 130,000 cores and real applications were running at scale, but debuggers were limited to ~5,000 cores (assuming a 5-10 second pain threshold for collective operations). A new credo was born: traditional debuggers would never scale. Something had to be done! No alternative.

22 Solving performance problems. [Figure: DDT 3.0 Performance Figures, Jaguar Cray XT; time (seconds) against MPI processes, with series All Step and All Breakpoint.] Allinea redesigned the architecture to scale: built a tree network; direct connections are gone; logarithmic performance. Over 220,000 cores debugged simultaneously; step all and display stacks in ~1/10 second. Performance was 50% of the challenge; a usable interface was the other. A partnership with the largest users: US DoE Oak Ridge National Laboratory. Also projects with Argonne National Lab, CEA (France), and others.
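Why a tree network gives logarithmic performance can be seen by counting hops. With direct connections, the front end talks to every process itself, so the work grows linearly with the process count. With a tree, a command only has to pass through as many levels as the tree is deep. The sketch below is an illustration only: the function name is hypothetical and the fan-out of 32 is an assumed value, not a figure from the talk.

```python
def tree_depth(processes, fanout):
    """Number of levels a tree with the given fan-out needs to reach every process."""
    depth = 0
    reach = 1
    while reach < processes:
        reach *= fanout
        depth += 1
    return depth


# Assumed fan-out of 32: 220,000 processes sit within 4 levels of the root,
# and doubling the job to 440,000 processes does not add a level.
levels_now = tree_depth(220_000, 32)
levels_doubled = tree_depth(440_000, 32)
```

Under the direct-connection design, the same doubling would double the collective operation time, which is the bottleneck described on the previous slide.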

23 Examples. First, a random bug in a system library: 1/32,000 per-process probability; one failure in any process killed the whole program; needed a scalable debugger to see the problem; 100% failure rate on a 100,000-core job; never seen on small runs. Fired up the debugger: problem identified in 15 minutes. Second, a standard library crashed only at 98,304 cores: an essential library, universally used at smaller scales; bedrock trusted by users; frequent crashes at very high scale, but not every time. After two runs at scale, location of the segmentation fault identified in 30 minutes; null pointers and corruption clear.
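The arithmetic behind the first example can be checked with a short calculation. It assumes that failures are independent and that each process gets exactly one chance to hit the bug per run; both are simplifying assumptions, and the function name is hypothetical.

```python
def failure_probability(per_process, processes):
    """Chance that at least one process fails, assuming independent failures."""
    return 1.0 - (1.0 - per_process) ** processes


p_large = failure_probability(1 / 32_000, 100_000)  # roughly 0.96
p_small = failure_probability(1 / 32_000, 32)       # roughly 0.001
```

Under these assumptions a 100,000-process run fails about 96% of the time, while a 32-process run fails about once in a thousand, which is consistent with a bug that is never seen on small runs. If each process has more than one opportunity to hit the bug per run, the large-job figure moves closer still to the 100% reported.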

24 Comparison of fields. HPC: long-term hegemony unsettled by new systems (the hybrid challenge); OpenCL, CUDA, simpler language extensions; objective: write once, run anywhere, with the efficiency of expert hand-coded software. Debugging strongly matches need: standard API (MPI); increasingly standard platform (90% x86_64, 90% Linux); new hardware is rarely a massive project; debuggers understand the model (process acquisition, internal objects, message queues); challenge of recent GPU h/w changes now resolved. Embedded multicore: long-term hegemony of single core disrupted by multi-core; OpenCL and other programming models being considered; objective: write once, run anywhere; concurrency needed for performance. Disparate development environments/platforms (e.g. Linus's ARM Linux commentary); difficult to slot in a solution to a changing and varied world; convergence on standards will help: MCAPI, OpenCL.

25 What problems remain? Debuggers only represent state. Some automation for common bugs: memory debugging for leaks and beyond-allocation reads/writes; deadlock detection in MPI. Race conditions (timing-related bugs) are a bigger problem for multithreaded code: great scope for automation via binary rewriting; automatic scheduler perturbation; some interesting work on race conditions, historically memory/CPU constrained. Formally correct? Essential for embedded: controlling a plane or reactor is safety critical. Rare visitor to HPC, in spite of its critical role in design.

26 Summary. The HPC sector has a long tradition of parallel computing, but is newly complicated by heterogeneity. Investment in development for platforms is a risky decision: portable programming models help; memory hierarchies and separate host/coprocessor memory are crippling. Ecosystems and programming models are critical to platform success: Cell processor vs NVIDIA. Tools are an essential component. Debugging parallel applications requires new approaches: HPC is at a higher scale than current embedded multicore; primarily GUI related, but automation techniques are also valued. The impact of multicore is increased similarity between HPC and embedded multicore, in code and in debugging. Other tools: performance tuning is important for HPC and for embedded, and further overlap is likely, e.g. power-efficient programming. Concurrency is hard to do well. Don't slip on the snake oil: beware miracle solutions. There is no silver bullet.

27 Questions
