Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

1 Towards a codelet-based runtime for exascale computing
Chris Lauderdale, ET International, Inc.

2 What will be covered
- Problems & motivation
- Codelet runtime overview
- Codelets & complexes
- Dealing with locality & heterogeneity
- Related work & conclusion

3 Introduction
- Can't reach exascale by continuing past trends
- Need something to
  - Expose and coordinate parallelism
  - Control data and execution locality
  - Abstract platform- and location-specific details
  - Unify software interface for supercomputing

4 Abstract machine model
[Figure: hardware hierarchy, outermost to innermost: Cluster, Node, CPU or GPU/accelerator, Core/group, Core, Execution units and threads. Memory levels shown alongside: Node RAM, chip-local RAM, local RAM, registers.]

5 Hardware-related problems: Scalability
- Present & future reliance on thread-level parallelism for performance increases
  - Can't keep increasing clock rate
  - Can't keep relying on instruction-level parallelism
- Memory access
  - More cores mean higher access latency and power cost
  - Not practical to use coherent caches
  - Small core-/chip-local memories simplify hardware but complicate software
- Need a way to hide access latencies and cross address spaces

6 Hardware-related problems: Heterogeneity
- Increasingly common
- Good solution for
  - Effective utilization of space/power on chip
  - Accelerating matrix-/vector-related operations
- Difficult to actually use in software
  - Special APIs for accelerators
  - Must statically partition work or duplicate code
- Need to handle more transparently (unify and coordinate support software)

7 Software-related problems
- Reliance on sequential processing, coherent memory
- Can use multithreading for parallelism, but
  - High address space/memory overhead for stack
  - High overhead to create, manage, switch threads
  - Stack must remain in fixed address range for its lifetime
- Need a way to sidestep blocking, expose fine-grained parallelism

8 Existing software frameworks
[Figure: the abstract machine model hierarchy (cluster, nodes, CPUs and GPUs/accelerators, core groups, cores, threads), annotated with the frameworks below.]
- MPI, SHMEM: Explicit data transfer
- OpenCL, CUDA, DirectCompute: Self-contained SIMD kernels
- OpenMP: Parallel for-all
- Cilk: Parallel recursion

9 Existing software frameworks
- MPI, SHMEM
  - Must explicitly transfer data to/from specific nodes
  - Are not thread-safe in general (specific to implementation)
- OpenMP, Cilk, TBB
  - OpenMP & Cilk geared to specific algorithm types
  - TBB is C++-only; Cilk is C-only, but techniques could be applied to C++/FORTRAN
  - Work only in one address space
  - Uniform, coherent memory assumed
- OpenCL, CUDA, DirectCompute
  - OpenCL and Direct3D device contexts not thread-safe; CUDA is
  - Must explicitly coordinate CPU and GPU
- Existing frameworks achieve specific goals, but do not interact well.

10 Codelet runtime overview
- Present software stack (top to bottom): Application, System libraries, Operating system, Hardware
- Proposed software stack (top to bottom): Application, Codelet runtime, System libraries, Operating system, Hardware
- Present execution model (top to bottom): Function calls, [User-mode threads], OS/HW threads
- Proposed execution model (top to bottom): [Function calls], Codelet dispatch, [User-mode threads], OS/HW threads

11 Codelet runtime overview: Codelets
- Break application into smaller pieces (= codelets)
  - Codelets shouldn't block or run indefinitely
  - Must explicitly spill/fill at codelet boundaries
- Low-overhead hiding of long-latency operations
  - One codelet starts an operation, another catches the result
  - Runtime provides for inter-address-space mobility
- Simple & rapid exposure of fine-grained parallelism
- Makes scalability easy: just provide work and something will run it
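The "one codelet starts an operation, another catches the result" idea can be sketched in a few lines of Python. Everything here is hypothetical (the `ready` queue, `schedule`, `start_read`, `catch_read`, and the simulated memory read); it only illustrates the split-phase pattern and is not the runtime's API.

```python
from collections import deque

ready = deque()   # hypothetical run queue of (codelet, argument) pairs
results = []


def schedule(codelet, arg=None):
    """Hand a codelet to the toy runtime; it runs later, never inline."""
    ready.append((codelet, arg))


def start_read(address):
    # First codelet: issue the long-latency operation and finish at once.
    # The "remote read" is simulated; its result goes to a second codelet.
    simulated_value = address * 2
    schedule(catch_read, simulated_value)


def catch_read(value):
    # Second codelet: catches the result and continues the computation.
    results.append(value + 1)


def run_all():
    """Drain the queue; each codelet runs to completion without blocking."""
    while ready:
        codelet, arg = ready.popleft()
        codelet(arg)


schedule(start_read, 20)
schedule(start_read, 1)
run_all()
print(results)
```

Because neither codelet waits, the second `start_read` is dispatched before the first result is consumed, which is how the latency gets overlapped in this toy model.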

12 Example: Dual parallel for-all loops
[Figure: codelet graph with nodes A1; B1 (x2); C (x6); B2 (x2); A2.]
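A parallel for-all can be built from codelets with a join counter. The sketch below is an assumption about the general pattern, not a reconstruction of the slide's figure: `for_all`, `schedule`, and the two chained loops are hypothetical names chosen for illustration.

```python
from collections import deque

queue = deque()  # hypothetical dispatch queue of (function, args) pairs


def schedule(fn, *args):
    queue.append((fn, args))


def for_all(items, body, then):
    """Schedule one body codelet per item; schedule `then` after the last."""
    if not items:
        schedule(then, [])
        return
    state = {"remaining": len(items), "results": [None] * len(items)}

    def instance(index, item):
        state["results"][index] = body(item)
        state["remaining"] -= 1
        if state["remaining"] == 0:      # last instance triggers the join
            schedule(then, state["results"])

    for index, item in enumerate(items):
        schedule(instance, index, item)


final = []


def second_loop(squares):
    # Second for-all starts only once the first has fully joined.
    for_all(squares, lambda v: v + 1, final.extend)


for_all([1, 2, 3], lambda v: v * v, second_loop)
while queue:
    fn, args = queue.popleft()
    fn(*args)
print(final)
```

The join counter replaces a blocking barrier: no codelet waits, and the continuation is scheduled by whichever instance finishes last.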

13 Codelet runtime overview: Locales
- High-level description of available hardware
- Region-bound processing + storage capabilities: locale
- Exposed API for placing codelet execution & data
- Codelets + locales enable transparent handling of heterogeneity

14 Codelets
- Fundamental unit of scheduling/execution
- Represented by in-memory descriptor
  - Run fork: Work to be performed to advance program state.
  - Cancel fork: Work to be performed to back out program state, in case an error is encountered.
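As a rough illustration of a descriptor with two forks, the following sketch models a codelet as a small record holding a run function and a cancel function. The `Codelet` class, the `dispatch` helper, and the commit example are all assumptions made for this sketch; the slides do not give the descriptor layout.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Codelet:
    """Hypothetical in-memory descriptor: a name plus two forks."""
    name: str
    run: Callable[[dict, object], None]     # advances program state
    cancel: Callable[[dict, object], None]  # backs out program state


def commit_run(ctx, value):
    # Run fork: consume the staged value and publish the result.
    ctx["committed"] = ctx.pop("staged") + value


def commit_cancel(ctx, error):
    # Cancel fork: discard the staged value and record why.
    ctx.pop("staged", None)
    ctx["error"] = error


COMMIT = Codelet("commit", commit_run, commit_cancel)


def dispatch(codelet, ctx, value, cancelled=False):
    """Toy dispatcher: call exactly one of the two forks."""
    fork = codelet.cancel if cancelled else codelet.run
    fork(ctx, value)


ok = {"staged": 10}
dispatch(COMMIT, ok, 5)

bad = {"staged": 10}
dispatch(COMMIT, bad, "disk full", cancelled=True)
print(ok, bad)
```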

15 Codelet complexes
- Codelet complex: Ad-hoc group of one or more codelets that cooperate to complete some task.
- Can specify chain codelet & context when starting
- Complex must chain run or cancel its chain codelet before completing.
- Used for:
  - Input cleanup
  - Passing return values, taking further inputs
  - Catching and resuming from errors
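A minimal sketch of chaining, under the assumption that a complex is started with a chain codelet and a context and must finish by running or cancelling that chain. The names `start`, `divide`, `REPORT`, and the queue layout are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Codelet:
    run: Callable      # called with (context, value)
    cancel: Callable   # called with (context, error)


queue = deque()        # hypothetical dispatch queue of (function, args)
outcomes = []


def start(entry, value, chain, chain_ctx):
    """Start a complex at `entry`, remembering which codelet to chain to."""
    queue.append((entry, (value, chain, chain_ctx)))


def divide(value, chain, chain_ctx):
    # A one-codelet complex: it must chain-run or chain-cancel before ending.
    numerator, denominator = value
    if denominator == 0:
        queue.append((chain.cancel, (chain_ctx, "division by zero")))
    else:
        queue.append((chain.run, (chain_ctx, numerator / denominator)))


REPORT = Codelet(
    run=lambda ctx, value: ctx.append(("ok", value)),
    cancel=lambda ctx, error: ctx.append(("error", error)),
)

start(divide, (8, 2), REPORT, outcomes)
start(divide, (1, 0), REPORT, outcomes)
while queue:
    fn, args = queue.popleft()
    fn(*args)
print(outcomes)
```

The chain codelet plays the role of a continuation: the caller never waits for `divide`, it simply names what should run (or be cancelled) next.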

16 Codelet/function interoperability
- Codelets/complexes used as implementation of HLL functions:
  - Chain codelet + context corresponds to return IP + SP
  - Input to chain corresponds to return value
  - Error to canceled chain corresponds to thrown exception
- Functions used in implementation of codelets:
  - Run/cancel forks implemented as functions
  - Runtime calls fork function to dispatch codelet
  - Return from fork function = end of codelet
- Complexes can be wrapped as functions and vice versa
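The correspondence between return values, exceptions, and chaining can be sketched with two wrappers. Both `function_as_complex` and `complex_as_function` are hypothetical helpers written for this illustration, and the single-threaded `drain` loop stands in for waiting on a real runtime.

```python
from collections import deque

queue = deque()  # hypothetical dispatch queue of (function, args) pairs


def drain():
    while queue:
        fn, args = queue.popleft()
        fn(*args)


def function_as_complex(fn):
    """Wrap a plain function so it finishes by chaining, not returning."""
    def entry(value, chain_run, chain_cancel, ctx):
        try:
            result = fn(value)
        except Exception as exc:     # thrown exception: cancel the chain
            queue.append((chain_cancel, (ctx, exc)))
        else:                        # return value: input to the chain
            queue.append((chain_run, (ctx, result)))
    return entry


def complex_as_function(entry):
    """Wrap a complex so callers see an ordinary blocking function."""
    def call(value):
        box = {}

        def on_run(ctx, result):
            ctx["result"] = result

        def on_cancel(ctx, error):
            ctx["error"] = error

        queue.append((entry, (value, on_run, on_cancel, box)))
        drain()                      # toy stand-in for waiting on the runtime
        if "error" in box:
            raise box["error"]
        return box["result"]
    return call


def reciprocal(x):
    return 1 / x


reciprocal_via_codelets = complex_as_function(function_as_complex(reciprocal))
print(reciprocal_via_codelets(4))
```

Round-tripping a function through both wrappers preserves its behaviour, including the exception raised for a zero argument.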

17 Locality awareness
- System components grouped into a locale tree
- Each locale has attached scheduler & allocator
- Leaf locales correspond to threads
- Higher-level locales manage children's resources collectively
- Schedulers/allocators push and pull work around the hierarchy
[Figure: locale tree levels, top to bottom: Cluster, Node, Chip, Core group, Core, Thread.]
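A locale tree of this shape could be modelled as follows. The `Locale` class, its `work` queue (standing in for the attached scheduler), and the level names used to build the tree are assumptions for illustration only.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class Locale:
    """Hypothetical locale node; `work` stands in for its scheduler queue."""
    name: str
    parent: "Locale | None" = None
    children: list = field(default_factory=list)
    work: deque = field(default_factory=deque)

    def add(self, name):
        child = Locale(name, parent=self)
        self.children.append(child)
        return child

    def leaves(self):
        """Leaf locales, which correspond to threads."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

    def path(self):
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return "/".join(reversed(names))

    def pending(self):
        """Work held by this locale and, collectively, by its children."""
        return len(self.work) + sum(c.pending() for c in self.children)


cluster = Locale("cluster")
chip = cluster.add("node0").add("chip0")
group = chip.add("group0")
for core_name in ("core0", "core1"):
    group.add(core_name).add("thread0")

threads = cluster.leaves()
threads[0].work.append("codelet-x")   # work placed on a specific thread
chip.work.append("codelet-y")         # work placed at chip level
print([t.path() for t in threads], cluster.pending())
```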

18 Handling heterogeneity
- Global locale tree shared throughout runtime
- Locales describe associated hardware details
  - Code format/ISA differences
- Codelets are identified globally, but different descriptor data may be used in different locales
  - Can provide different run/cancel forks for different architectures using same descriptor
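One way to picture "same descriptor, different forks per architecture" is a table keyed by a global codelet identifier and then by the locale's ISA. The identifier `"saxpy"`, the ISA labels, and both implementations below are made up for this sketch.

```python
def saxpy_scalar(a, x, y):
    # Variant for a general-purpose core: one element at a time.
    return [a * xi + yi for xi, yi in zip(x, y)]


def saxpy_chunked(a, x, y, width=4):
    # Variant imitating an accelerator that works on fixed-width chunks.
    out = []
    for start in range(0, len(x), width):
        xs = x[start:start + width]
        ys = y[start:start + width]
        out.extend(a * xi + yi for xi, yi in zip(xs, ys))
    return out


# Hypothetical global table: codelet id -> {locale ISA -> run fork}
DESCRIPTORS = {
    "saxpy": {"cpu": saxpy_scalar, "accel": saxpy_chunked},
}


def dispatch(codelet_id, locale_isa, *args):
    """Pick the fork that matches the locale the codelet was placed on."""
    return DESCRIPTORS[codelet_id][locale_isa](*args)


x = [1, 2, 3, 4, 5]
y = [10, 10, 10, 10, 10]
on_cpu = dispatch("saxpy", "cpu", 2, x, y)
on_accel = dispatch("saxpy", "accel", 2, x, y)
print(on_cpu, on_accel)
```

The caller names only the codelet; which code actually runs is decided by where the codelet lands.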

19 Scheduling and allocation
- Leaf schedulers/allocators manage time/space on a particular thread; higher-level ones can delegate
- Application can specify sooner/later ordering
[Figure: three snapshots of a chip scheduler with two core schedulers (core 1, core 2).]
  1. B has been scheduled to chip. A is running on core 1. Core 2 is idle. A schedules C to core 1.
  2. B taken by core 2. C runs on core 1. C schedules D & E to core 1.
  3. D runs on core 1. E stolen by core 2.
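The scheduling sequence described on this slide can be replayed with a toy simulation. The policy used here (own queue first, then the chip queue, then stealing from the back of a sibling's queue) and the round-based timing are assumptions of this sketch, not a description of the actual scheduler.

```python
from collections import deque

# Which codelets each codelet schedules onto the core it runs on.
SPAWNS = {"A": ["C"], "C": ["D", "E"]}

chip = deque(["B"])                                  # B scheduled to chip
cores = {"core1": deque(["A"]), "core2": deque()}    # A on core 1, core 2 idle
trace = []


def next_codelet(name):
    own = cores[name]
    if own:
        return own.popleft(), "own queue"
    if chip:
        return chip.popleft(), "chip queue"
    for other, q in cores.items():
        if other != name and q:
            return q.pop(), "stolen from " + other
    return None, None


def step():
    """One round: every core picks a codelet, then the picked codelets run."""
    picked = []
    for name in cores:
        codelet, source = next_codelet(name)
        if codelet is not None:
            picked.append((name, codelet, source))
    for name, codelet, source in picked:
        trace.append((name, codelet, source))
        cores[name].extend(SPAWNS.get(codelet, []))
    return bool(picked)


while step():
    pass
for entry in trace:
    print(entry)
```

In this replay core 2 first pulls B down from the chip queue and later steals E from core 1, matching the end state of the slide's third snapshot.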

20 Applicability to algorithm classes
- Fork-join-style algorithms
  - Recursion-based
    - Can parallelize multiway-recursive algorithms
    - Application-specified scheduling order limits parallelism blowup
  - Data-parallel/SIMD
    - Can do parallel for-all over locales to distribute work
    - Work stealing automatically balances load afterwards
- Dataflow algorithms
  - Can register codelet instances to catch data availability
  - Can use locale-based routing to walk around graphs
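For the dataflow case, "registering codelet instances to catch data availability" can be sketched as a table of waiting codelets that fire once all of their named inputs exist. The `Dataflow` class and its `register`/`put` methods are hypothetical and run single-threaded.

```python
class Dataflow:
    """Toy dataflow store: codelets fire when all their inputs are present."""

    def __init__(self):
        self.values = {}
        self.waiting = []   # (input names, output name, function) entries

    def register(self, inputs, output, fn):
        self.waiting.append((tuple(inputs), output, fn))

    def put(self, name, value):
        self.values[name] = value
        ready = [w for w in self.waiting
                 if all(i in self.values for i in w[0])]
        for w in ready:                 # unregister before firing
            self.waiting.remove(w)
        for inputs, output, fn in ready:
            self.put(output, fn(*(self.values[i] for i in inputs)))


graph = Dataflow()
# Registration order does not matter; firing follows data availability.
graph.register(["sum", "c"], "product", lambda s, c: s * c)
graph.register(["a", "b"], "sum", lambda a, b: a + b)

graph.put("a", 2)
graph.put("c", 4)
graph.put("b", 3)    # completes "sum", which in turn completes "product"
print(graph.values["sum"], graph.values["product"])
```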

21 Related work
- Basis for codelets: Gao et al.'s theoretical model
  - Dropped theoretical limitations
  - Added cancellation and chaining semantics
- Locales closely related to Habanero hierarchical place trees
- Existing frameworks:
  - MPI, SHMEM, OpenMP, Cilk, TBB, OpenCL, CUDA, DirectCompute (already addressed)
  - ParalleX (model) and HPX (runtime implementation)
    - Many higher-level constructs
    - Can implement PX constructs on top of a codelet runtime

22 Ongoing/future work
- SWift Adaptive Runtime Machine (SWARM)
  - Version 0: Experimental prototype; available for download
    - Reduced scheduling capability, codelet semantics, allocator support
  - Version 1: Under development

23 Conclusion
- Need a new execution model for exascale
- Codelet runtime model enables
  - Scalability
    - Feed codelets to the runtime, don't rely on threading
    - Unified model for entire cluster
  - Portability
    - Single portable runtime interface
    - Platform differences can be dealt with by runtime
  - Better hardware utilization
    - Automatic load balancing
    - Transparent use of heterogeneous components

24 Questions/comments?
SWARM v0 download:

ET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc.

Current Programming Models
- Shared Memory Multiprocessing
  - OpenMP: fork/join model
  - Pthreads: Arbitrary SMP parallelism (but hard to program/debug)
  - Cilk: Work stealing