PaRSEC: Distributed task-based runtime for scalable hybrid applications
1 PaRSEC: Distributed task-based runtime for scalable hybrid applications
George Bosilca and many others
Intertwine Workshop, April 20, 2018
2 PaRSEC: a generic runtime system for asynchronous, architecture-aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures

Concepts
- Clear separation of concerns: the compiler optimizes each task class, the developer describes dependencies between tasks, and the runtime orchestrates the dynamic execution
- Interfaces with application developers through specialized domain-specific languages (PTG/JDF, TTG, Python, insert_task, fork/join, ...)
- Separates algorithms from data distribution
- Makes control-flow execution a relic

Runtime
- Portability layer for heterogeneous architectures
- Scheduling policies adapt every execution to the hardware and the ongoing system status
- Data movements between producers and consumers are inferred from dependencies; communication/computation overlap unfolds naturally
- Coherency protocols minimize data movements
- Memory hierarchies (including NVRAM and disk) are an integral part of the scheduling decisions
3 Tasks and data collections
[Figure: a graph of tasks A(k), B(k), C(k), F(A(k)), F(C(k)) reading and writing data collections; user data can be a dense matrix, sparse, structured or unstructured]

PaRSEC is a data-centric programming environment based on asynchronous tasks executing on a heterogeneous distributed environment.
- A task is an execution unit taking a set of input data and generating, upon completion, a different set of output data
- Tasks and data have a coherent distributed scope (managed by the runtime)
- Low-level API allowing the design of domain-specific languages (JDF, DTD, TTG)
- Supports distributed heterogeneous environments (and triggers task execution on them)
- Communications are implicit (the runtime moves data)
- Resources are dynamic (threads, accelerators)
- Built-in resilience, performance instrumentation and analysis (R, Python)
4 PaRSEC Architecture
[Figure: layered architecture. Domain Specific Extensions (dense LA, sparse LA, chemistry, ...) define task classes and specialized kernels through a compact representation (PTG), a dynamically discovered representation (DTD), or TTG. The parallel runtime provides scheduling (local and distributed), a memory allocator, data collections, and data movement with collective patterns and a software MOESI* coherence protocol. The hardware layer covers cores, memory hierarchies, accelerators (NVIDIA, Xeon Phi, OpenACC) and communication (MPI, UCX, GasNet*). Items marked * are not yet componentized.]

- Software design based on the Modular Component Architecture (MCA) of Open MPI
- Clear component APIs
- Runtime selection of components
- Implementing a new component has little impact on the rest of the software stack
5 The PaRSEC data
[Figure: a data collection with data A(k); user-defined and runtime-defined views; data copies at versions v1 and v2]

- A data is a manipulation token, the basic logical element (view) used in the description of the dataflow
  - Locations: can have multiple coherent copies (remote node, device, checkpoint)
  - Shape: can have different memory layouts
  - Visibility: only accessible via the most current version of the data
  - State: can be migrated / logged
- Data collections are ensembles of data distributed among the nodes
  - Can be regular (multi-dimensional matrices) or irregular (sparse data, graphs)
  - Can be regularly distributed (cyclic-k) or user-defined
- A data view is a subset of a data collection used in a particular algorithm (a submatrix, row, column, ...)
- A data copy is the practical unit of data
  - Has a memory layout (think MPI datatype)
  - Has a locality property (device, NUMA domain, node)
  - Has a version associated with it
  - Multiple instances can coexist
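The versioned-copy idea above can be sketched in a few lines of C. This is an illustration only, not the real PaRSEC API: the struct and function names are invented, and a logical data simply keeps a small array of physical copies, each with a location and a version, with readers directed to the most recent one.

```c
#include <stddef.h>

/* Illustrative sketch (not PaRSEC's actual types): a logical "data"
 * owns several physical copies; each copy has its own location and
 * version, and only the most recent version may be read. */
typedef struct data_copy {
    int   device;    /* -1 = host memory, >= 0 = accelerator id */
    int   version;   /* incremented on every write to the data */
    void *ptr;       /* the actual memory, with its own layout */
} data_copy_t;

#define MAX_COPIES 4

typedef struct data {
    data_copy_t copies[MAX_COPIES];
    int         ncopies;
} data_t;

/* Return the copy holding the most recent version, or NULL if none. */
static data_copy_t *data_current(data_t *d)
{
    data_copy_t *best = NULL;
    for (int i = 0; i < d->ncopies; i++)
        if (NULL == best || d->copies[i].version > best->version)
            best = &d->copies[i];
    return best;
}
```

With two coexisting copies, say version 1 on the host and version 2 on device 0, `data_current` selects the device copy; a coherence protocol would transfer or invalidate the stale one before a host task may read.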
6 DSL: The PaRSEC application

    /* Define a distributed collection of data (here a 1-dimensional array of integers) */
    parsec_vector_t dDATA;
    parsec_vector_init( &dDATA, matrix_integer, matrix_tile, nodes, rank,
                        1,    /* tile size */
                        N,    /* global vector size */
                        0,    /* starting point */
                        1 );  /* block size */

    /* Start the PaRSEC engine (resource allocation) */
    parsec_context_t* parsec = parsec_context_init(NULL, NULL);

    /* Create a task placeholder and associate it with the PaRSEC context */
    parsec_handle_t* parsec_dtd_handle = parsec_handle_new(parsec);
    parsec_enqueue(parsec, (parsec_handle_t*)parsec_dtd_handle);
    parsec_context_start(parsec);

    /* Add tasks; a configurable window limits the number of pending tasks */
    for( n = 0; n < N; n++ ) {
        parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_write, "Create_Data",
                            PASSED_BY_REF, DATA_AT(&dDATA, n), OUT | REGION_FULL,
                            0 /* DONE */ );
        for( k = 0; k < K; k++ ) {
            parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_read, "Read_Data",
                                PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT | REGION_FULL,
                                0 /* DONE */ );
        }
    }

    /* Wait until completion */
    parsec_handle_wait( parsec_dtd_handle );

Data initialization and the PaRSEC context setup are common to all DSLs.
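The mechanism behind the example above, deducing dependencies from the sequential insertion order and the declared access modes, can be shown with a self-contained toy. This is not PaRSEC's internal code: it is a deliberately simplified sketch (one datum per task, true dependencies only) where the runtime remembers the last task that wrote each datum, and every later access to that datum depends on it.

```c
/* Toy sketch of insert_task-style dependency inference (simplified;
 * not PaRSEC internals): tasks are discovered in sequential program
 * order, and each access depends on the datum's last writer. */

#define MAX_DATA  8
#define MAX_TASKS 64

enum access_mode { ACCESS_IN, ACCESS_OUT };

static int last_writer[MAX_DATA]; /* last OUT task per datum, -1 if none */
static int dep_on[MAX_TASKS];     /* task -> the task it depends on, -1 if none */
static int ntasks;

static void graph_reset(void)
{
    ntasks = 0;
    for (int i = 0; i < MAX_DATA; i++) last_writer[i] = -1;
}

/* Insert one task touching a single datum; return its id. */
static int toy_insert_task(int datum, enum access_mode mode)
{
    int id = ntasks++;
    dep_on[id] = last_writer[datum];           /* read-/write-after-write */
    if (ACCESS_OUT == mode) last_writer[datum] = id;
    return id;
}
```

Replaying the slide's pattern, one write to dDATA(n) followed by K reads, makes every read task a successor of the write task, which is exactly what lets the runtime move the data from producer to consumers without explicit communication.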
7 Uncountable ways
How to describe a graph of tasks?
- Generic: Dagger (Charm++), Legion, ParalleX, Parameterized Task Graph (PaRSEC), Dynamic Task Discovery (StarPU, StarSs), Yvette (XML), fork/join (spawn), CnC, Uintah, DARMA, Kokkos, RAJA
- Application specific: MADNESS

The PaRSEC runtime is agnostic to the domain-specific language (DSL); different DSLs interoperate through the data collections.
The DSLs share:
- Distributed schedulers
- Communication engine
- Hardware resources
- Data management (coherence, versioning, ...)
They don't share:
- The task structure
- The internal dataflow depiction
8 DSL: The insert_task interface
- Define a distributed collection of data (here a 1-dimensional array of integers)
- Start PaRSEC (resource allocation)
- Create a task placeholder and associate it with the PaRSEC context
- Add tasks; a configurable window limits the number of pending tasks
- Wait until completion

    parsec_vector_t dDATA;
    parsec_vector_init( &dDATA, matrix_integer, matrix_tile, nodes, rank,
                        1,    /* tile size */
                        N,    /* global vector size */
                        0,    /* starting point */
                        1 );  /* block size */

    parsec_context_t* parsec = parsec_context_init(NULL, NULL); /* start the PaRSEC engine */

    parsec_handle_t* parsec_dtd_handle = parsec_handle_new(parsec);
    parsec_enqueue(parsec, (parsec_handle_t*)parsec_dtd_handle);
    parsec_context_start(parsec);

    for( n = 0; n < N; n++ ) {
        parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_write, "Create_Data",
                            PASSED_BY_REF, DATA_AT(&dDATA, n), OUT | REGION_FULL,
                            0 /* DONE */ );
        for( k = 0; k < K; k++ ) {
            parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_read, "Read_Data",
                                PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT | REGION_FULL,
                                0 /* DONE */ );
        }
    }

    parsec_handle_wait( parsec_dtd_handle );

Data initialization and the PaRSEC context setup are common to all DSLs.
9 The Parameterized Task Graph (JDF)

    GEQRT(k)
      k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
      : A(k, k)

      RW A <- (k == 0)    ? A(k, k) : A1 TSMQR(k-1, k, k)
           -> (k < NT-1)  ? A  UNMQR(k, k+1 .. NT-1)   [type = LOWER]
           -> (k < MT-1)  ? A1 TSQRT(k, k+1)           [type = UPPER]
           -> (k == MT-1) ? A(k, k)                    [type = UPPER]
      WRITE T <- T(k, k)
            -> T(k, k)
            -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

    BODY [type = CPU]  /* default */
      zgeqrt( A, T );
    END
    BODY [type = CUDA]
      cuda_zgeqrt( A, T );
    END

- Control flow is eliminated, therefore maximum parallelism is possible
- A parameterized and concise dataflow language
- Accepts non-dense iterators
- Allows inlined C/C++ code to augment the language [any expression]
- The termination mechanism is part of the runtime (i.e., it needs to know the number of tasks per node)
- The dependencies must be globally (and statically) defined prior to the execution
- Dynamic DAGs are unnatural; no data-dependent DAGs
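The key property of the parameterized form is that a task's successors are computed algebraically from its indices instead of being stored in an unrolled DAG. As a sketch, reading the A dataflow of GEQRT(k) off the JDF above (the helper name is ours, not PaRSEC's), the number of direct task successors for that output is:

```c
/* Illustrative helper (not part of PaRSEC): count the task successors
 * of GEQRT(k)'s A output as written in the JDF above, for an MT x NT
 * tiled matrix.  No task list is ever materialized. */
static int geqrt_a_nsuccessors(int k, int MT, int NT)
{
    int n = 0;
    if (k < NT - 1)  n += NT - 1 - k;  /* A  -> UNMQR(k, k+1 .. NT-1) */
    if (k < MT - 1)  n += 1;           /* A1 -> TSQRT(k, k+1)         */
    /* when k == MT-1, the UPPER part flows back to memory A(k, k)    */
    return n;
}
```

For a 4 x 4 tile grid, GEQRT(0) thus feeds four tasks (three UNMQR and one TSQRT) while GEQRT(3) feeds none; the runtime evaluates such expressions on demand, which keeps the graph representation O(1) in the problem size.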
10 Dense Linear Algebra: QR on heterogeneous hardware
[Figure: QR factorization performance on Arc machines (20 cores, gcc 6.3, MKL 2016, CUDA), comparing PaRSEC-2.0-rc1 and StarPU over kernels GEQRT, TSQRT, UNMQR, TSMQR; MAGMA runs out of GPU memory. B: big tile size, b: small tile size, ib: inner block size]
11 Dense Linear Algebra
DPLASMA = ScaLAPACK + runtime (PaRSEC)
[Figure: DGEQRF strong scaling on Cray XT5 (Kraken), N = M = 41,472; performance (TFlop/s) vs. number of cores for Systolic QR over PULSAR, Systolic QR over PaRSEC (2D), DPLASMA HQR (best single tree), and LibSci ScaLAPACK]
[Figure: results on Keeneland with kernels GEQRT, TSQRT, UNMQR, TSMQR; "h" stands for dynamic hierarchical algorithms (a task can divide itself)]
12 Overhead of insert_task
[Figure: Cholesky (double precision) on Haswell, 8 nodes, 20 cores/node, 160 cores; Gflop/s vs. problem size N / number of tasks (millions) for tile sizes 128 and 320, comparing PaRSEC-PTG, PaRSEC-DTD, and PaRSEC-DTD execution only]
13 Overhead of insert_task: DTD overhead
[Figure: same Cholesky (double precision) experiment on Haswell, 8 nodes, 20 cores/node, 160 cores, as on the previous slide]
Model variables:
- T_DTD/PTG: overall time
- N: total number of tasks
- C_T: cost/duration of each task
- P: total number of nodes/processes
- n: total number of cores
- C_D: cost of discovering a task
- C_R: cost of building the DAG/relationships
14 Overhead of insert_task: DTD overhead
[Figure: weak scaling with fixed task duration and a fixed number of tasks per process, up to 6,144 cores, using the model variables T_DTD/PTG, N, C_T, P, n, C_D, C_R defined above]

- Benefits: the critical path is defined by the sequential ordering
- Drawbacks: impossible to build collective patterns; selecting the window size is difficult; all data movements must be known globally (and their order is critically important)

There are three types of insertion scenario:
- Insert All: each rank inserts all tasks, and executes only the local ones
- Select Insert: each rank inserts only local tasks, but iterates over all tasks
- Insert Local: each rank only inserts local tasks
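The difference between the scenarios is where the O(N) discovery cost falls. A minimal sketch of "Select Insert", assuming for illustration a cyclic owner(t) = t mod P distribution (the distribution and function names are ours):

```c
/* Sketch of the "Select Insert" scheme: every rank iterates over the
 * full task range (discovery cost O(N) per rank), but inserts only
 * the ~N/P tasks it owns under an assumed cyclic distribution. */
static int owner(int task, int nranks) { return task % nranks; }

/* Number of tasks `rank` actually inserts out of N. */
static int select_insert(int rank, int nranks, int N)
{
    int inserted = 0;
    for (int t = 0; t < N; t++)        /* every rank still discovers all N */
        if (owner(t, nranks) == rank)  /* ...but inserts only its own */
            inserted++;
    return inserted;
}
```

"Insert All" would insert inside the loop unconditionally (paying C_R for all N tasks on every rank), while "Insert Local" would iterate only over the rank's own index range, avoiding even the discovery cost but requiring the remote data movements to be deducible some other way.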
15 Explicitly expose collective patterns
SLATE - PaRSEC
- Task dependencies are done on the coordination doorbells instead of data
- Need for lookahead to compensate for the high-level lack of parallelism
- One-to-many dependencies such as A(k, k) -> A(k+1 .. N, k) are collective (broadcast) patterns
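Once a dependency like A(k, k) -> A(k+1 .. N, k) is recognized as a broadcast, the runtime need not have the producer send N-k point-to-point messages; it can forward the tile along a tree. As a sketch, one classic binomial-tree parent rule (an illustrative choice, not necessarily the one PaRSEC uses) clears the lowest set bit of a participant's index:

```c
/* Illustrative binomial-tree parent for a broadcast over participants
 * indexed 0..N-1, with the producer at index 0: each non-root
 * receives the data from the index obtained by clearing its lowest
 * set bit, giving O(log N) depth instead of N-1 root sends. */
static int bcast_parent(int idx)
{
    if (0 == idx) return -1;     /* the root (the producer) has no parent */
    return idx & (idx - 1);      /* clear the lowest set bit */
}
```

For example, participants 1 and 2 receive from the root, 5 and 6 from 4, and 12 from 8, so the root sends only log2(N) messages and the rest of the fan-out happens in parallel.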
16 Natural data-dependent DAG composition
Example: POTRI = POTRF + TRTRI + LAUUM
Three approaches:
- Fork/join (traditional): complete POTRF before starting TRTRI
- Compiler-based: give the three sequential algorithms to the Q2J compiler, and get a single PTG for POINV
- Runtime-based (PaRSEC): tell the runtime that after POTRF is done on a tile, TRTRI can start, and let the runtime compose
[Figure: the DAGs of POTRF, TRTRI and LAUUM, composed by PaRSEC vs. executed traditionally]
17 More dynamic applications
NWChem (PaRSEC): beta-carotene (C40H56)
[Figure: execution time (sec) of the original code vs. PaRSEC on 64 nodes, and efficiency vs. number of threads per node]
- Interoperability between GlobalArrays + PaRSEC + MPI
- Not yet HT-ready

DIP: elastodynamic wave propagation
[Figure: scaling of PaRSEC vs. MPI vs. perfect, per cores/node]
- Dynamically redistribute the data: use PAPI counters to estimate the imbalance, and reshuffle the frontiers to balance the workload
18 Dynamic distributed workload
[Figure: trees for functions f(x) and g(x)]
- Nodes in the trees are tasks, and they are distributed across P processes
- We have a forest of trees, each tree representing a different function or intermediary expression to compute
- Active process: has pending actions or pending emissions to perform
- Idle process: no further pending local actions; returns to the active state upon reception of a message
- Workload termination is a global state in which, in a snapshot, all processes are idle and there are no in-transit messages

Wave algorithms (4C): once idle, a process starts a termination wave counting communications (sent and received). Two identical waves are necessary to detect termination.
Optimal delay algorithms (EDOD): reduce the idleness to the root of a tree (a non-leaf becomes idle when all its children are idle and it is locally idle). Acknowledge message receptions to allow processes to become idle (follow the same tree and inverse the parents' idleness).
Credit distribution algorithms (CDA): spread credits via messages; return credits to an entity once idle; detect termination once all the credits are accounted for.
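The CDA scheme lends itself to a tiny single-threaded simulation. This is a sketch of the idea only (process count, credit pool, and the half-the-credit sending rule are arbitrary illustrative choices): a controller grants all credit to the root process, every message carries part of the sender's remaining credit, an idle process flushes its credit back, and termination is detected once the controller has recovered the whole credit.

```c
#include <stdint.h>

/* Toy simulation of credit-distribution termination detection (CDA). */
#define NPROC 4
#define TOTAL_CREDIT (1u << 20)

static uint32_t credit[NPROC];  /* credit currently held by each process */
static uint32_t recovered;      /* credit returned to the controller */

static void cda_start(void)
{
    for (int p = 0; p < NPROC; p++) credit[p] = 0;
    recovered = 0;
    credit[0] = TOTAL_CREDIT;   /* the root holds everything at startup */
}

static void cda_send(int from, int to)
{
    uint32_t share = credit[from] / 2;  /* message carries half the credit */
    credit[from] -= share;
    credit[to]   += share;
}

static void cda_idle(int p)             /* FLUSH: return all local credit */
{
    recovered += credit[p];
    credit[p] = 0;
}

static int cda_terminated(void) { return recovered == TOTAL_CREDIT; }
```

As long as any process still holds credit, or a message carrying credit is in flight, the controller's total stays short of TOTAL_CREDIT, so termination cannot be declared prematurely; once every process has flushed, the sums match exactly.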
19 Random walk application
Implementation in PaRSEC (experiments on Mira)
- MADNESS-style trees; a threshold defines the level of refinement (the accuracy of the operations)
- 10^8 nodes in the tree (i.e., tasks)
- CDA needs about half the control messages compared with 4C, and all of them are FLUSH (no BORROW)
[Figure: number of control messages for a projection operation with a threshold of 10^-13]
20 Conclusions
Programming can be made easy(ier):
- Portability: inherently take advantage of all hardware capabilities
- Efficiency: deliver the best performance on several families of algorithms
- Domain-specific languages to facilitate development
- Interoperability: data is the centric piece

Build a scientific enabler allowing different communities to focus on different problems:
- Application developers on their algorithms
- Language specialists on domain-specific languages
- System developers on system issues
- Compilers on optimizing the task code
- Interact with hardware designers to improve support for runtime needs (HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking)
21 Priority-based wish list
- Portable and efficient API/library for accelerator support
- Portable, efficient and interoperable communication library (UCX, libfabric, ...); moving away from MPI will require an efficient datatype engine, also supported by the rest of the software stack (for interoperability)
- Resource management/allocation system: PaRSEC supports dynamic resource provisioning, but we need a portable system to bridge the gap between different programming runtimes
- Memory allocator: thread-safe, with runtime-defined properties and arenas (with and without sbrk) (memkind?)
- Generic profiling system and tools integration; any type of task-based debugger and performance analysis
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationSearch Algorithms for Discrete Optimization Problems
Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic
More informationOS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation
CSE 421/521 - Operating Systems Fall 2012 Lecture - II OS Structures Roadmap OS Design and Implementation Different Design Approaches Major OS Components!! Memory management! CPU Scheduling! I/O Management
More informationGivy, a software DSM runtime using raw pointers
Givy, a software DSM runtime using raw pointers François Gindraud UJF/Inria Compilation days (21/09/2015) F. Gindraud (UJF/Inria) Givy 21/09/2015 1 / 22 Outline 1 Distributed shared memory systems Examples
More informationOVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI
CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing
More informationExploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API
EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,
More informationEvolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017
Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University http://hpctoolkit.org Scalable Tools Workshop 7 August 2017 HPCToolkit 1 HPCToolkit Workflow source code compile &
More informationHybrid Model Parallel Programs
Hybrid Model Parallel Programs Charlie Peck Intermediate Parallel Programming and Cluster Computing Workshop University of Oklahoma/OSCER, August, 2010 1 Well, How Did We Get Here? Almost all of the clusters
More informationWelcome to the 2017 Charm++ Workshop!
Welcome to the 2017 Charm++ Workshop! Laxmikant (Sanjay) Kale http://charm.cs.illinois.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana Champaign 2017
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationrcuda: an approach to provide remote access to GPU computational power
rcuda: an approach to provide remote access to computational power Rafael Mayo Gual Universitat Jaume I Spain (1 of 60) HPC Advisory Council Workshop Outline computing Cost of a node rcuda goals rcuda
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationA Pluggable Framework for Composable HPC Scheduling Libraries
A Pluggable Framework for Composable HPC Scheduling Libraries Max Grossman 1, Vivek Kumar 2, Nick Vrvilo 1, Zoran Budimlic 1, Vivek Sarkar 1 1 Habanero Extreme Scale So=ware Research Group, Rice University
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationCray Scientific Libraries. Overview
Cray Scientific Libraries Overview What are libraries for? Building blocks for writing scientific applications Historically allowed the first forms of code re-use Later became ways of running optimized
More informationModeling and Simulation of Hybrid Systems
Modeling and Simulation of Hybrid Systems Luka Stanisic Samuel Thibault Arnaud Legrand Brice Videau Jean-François Méhaut CNRS/Inria/University of Grenoble, France University of Bordeaux/Inria, France JointLab
More informationHANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY
HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY Presenters: Harshitha Menon, Seonmyeong Bak PPL Group Phil Miller, Sam White, Nitin Bhat, Tom Quinn, Jim Phillips, Laxmikant Kale MOTIVATION INTEGRATED
More informationNew Mapping Interface
New Mapping Interface Mike Bauer NVIDIA Research 1 Why Have A Mapping Interface? Legion Program Legion Legion Mapper Scheduling is hard Lots of runtimes have heuristics What do you do when they are wrong?
More informationWhy? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators
Remote CUDA (rcuda) Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Better performance-watt, performance-cost
More informationDeep Learning Frameworks. COSC 7336: Advanced Natural Language Processing Fall 2017
Deep Learning Frameworks COSC 7336: Advanced Natural Language Processing Fall 2017 Today s lecture Deep learning software overview TensorFlow Keras Practical Graphical Processing Unit (GPU) From graphical
More informationHPC future trends from a science perspective
HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively
More informationSearch Algorithms for Discrete Optimization Problems
Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. 1 Topic
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationIllinois Proposal Considerations Greg Bauer
- 2016 Greg Bauer Support model Blue Waters provides traditional Partner Consulting as part of its User Services. Standard service requests for assistance with porting, debugging, allocation issues, and
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationPGI Accelerator Programming Model for Fortran & C
PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationMPI in 2020: Opportunities and Challenges. William Gropp
MPI in 2020: Opportunities and Challenges William Gropp www.cs.illinois.edu/~wgropp MPI and Supercomputing The Message Passing Interface (MPI) has been amazingly successful First released in 1992, it is
More information