Progress report on the Integrative Model for Parallelism


1 Progress report on the Integrative Model for Parallelism. Victor Eijkhout. Presented at the 4th Workshop on Extreme-scale Programming Tools, Supercomputing 2015.

2 Table of Contents 1 Introduction 2 IMP programming 3 Results and prospects 4

3 What is wrong with parallel programming?
- Many programming models corresponding to different hardware types: unportable code, low productivity.
- Programming is often CSP / SPMD: coding of the interaction, not the algorithm.
- Programming systems are concerned with algorithms, but computation is almost irrelevant on current architectures: lacking a systematic treatment of data placement and movement.
- Remarkable divorce of the practice and theory of parallelism: no systematic approach to program generation, transformation, execution.

4 My aims for a new parallel programming system:
1. Uniform treatment of parallelism
2. While acknowledging the realities of hardware
3. Strongly theory based
IMP: the Integrative Model for Parallelism

5 Design of a programming system. One needs to distinguish:
- Programming model: how does it look in code?
- Execution model: how is it actually executed?
- Data model: how is data placed and moved about?
Three different vocabularies!

6 Programming model: sequential semantics.

"[A]n HPF program may be understood (and debugged) using sequential semantics, a deterministic world that we are comfortable with. Once again, as in traditional programming, the programmer works with a single address space, treating an array as a single, monolithic object, regardless of how it may be distributed across the memories of a parallel machine." (Nikhil 1993)

As opposed to:

"[H]umans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations." (Sutter and Larus 2005)

7 Programming model. Sequential semantics is close to the mathematics of the problem. Note: sequential semantics in the programming model does not mean BSP synchronization in the execution.

8 Execution model. Data flow expresses the essential dependencies in an algorithm. Data flow applies to multiple parallelism models. But it would be a mistake to program dataflow explicitly.

9 Data model. Distribution: a mapping from processors to data. Needed (and missing from existing systems such as UPC and HPF):
- distributions need to be first-class objects
- we need an algebra of distributions
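
To make the "algebra of distributions" concrete, here is a minimal sketch in the style of the IMP code on later slides (IMP_block_distribution, ioperator, and operate all appear there; using the ">=1" shift on a whole distribution is an illustrative assumption):

  // a block distribution is a first-class object: each processor
  // is mapped to a contiguous block of the nglobal indices
  distribution *blocked = new IMP_block_distribution(arch,nglobal);

  // operating on a distribution yields a new distribution,
  // so derived data placements are computed, never hand-coded
  ioperator *shift = new ioperator(">=1");
  distribution *shifted = blocked->operate(shift);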

10 Integrative Model for Parallelism
- Theoretical model for describing parallelism
- Library (or maybe language) for describing operations on parallel data
- Minimal, yet sufficient, specification of parallel aspects
- Many aspects are formally derived, including messages and task dependencies
- Specify what, not how
- Improve programmer productivity, code quality, efficiency and robustness

11 Table of Contents 1 Introduction 2 IMP programming 3 Results and prospects 4

12 Sequential semantics
- The user programs sequentially
- Objects have user-specified distributions
- The IMP system derives all communication/synchronization

13 Inspector-executor
- The programmer adds operations ("kernels") to a queue.
- Queue analysis and execution are separate steps (sketched below): the analysis can optimize for latency and such.
- Inspector-executor is used in PETSc and Trilinos for the matrix-vector product.
- Queues feature in many DAG models.
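
A minimal sketch of the inspector-executor pattern, using the IMP calls that appear on the following slides; executing one analyzed queue repeatedly is my illustration of how the analysis cost amortizes, not code from the talk:

  task_queue *queue = new IMP_task_queue(arch);
  queue->add_kernel( new IMP_copy_kernel(rr,rrp) ); // one kernel, borrowed from the CG example
  queue->analyze_dependencies();  // inspector: derive messages and task dependencies once
  for (int it=0; it<niterations; it++)
    queue->execute();             // executor: assumes an analyzed queue can be rerun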

14 Structure of the IMP system. [Figure.]

15 Heat equation programming

  distribution *blocked = new IMP_block_distribution(arch,nglobal);
  auto xs = new std::vector<IMP_object*>;
  for (int step=0; step<=nsteps; step++)
    xs->push_back( new IMP_object(blocked) );

  task_queue *queue = new IMP_task_queue(arch);
  queue->add_kernel( new IMP_origin_kernel(xs->at(0)) );
  for (int step=0; step<nsteps; step++) {
    queue->add_kernel( new IMP_diffusion_kernel( xs->at(step),xs->at(step+1) ) );
  }
  queue->analyze_dependencies();
  queue->execute();

16 Heat equation kernels. [Figure: the kernels as programmed.]

17 Heat equation tasks. [Figure: the tasks as derived, in parallel.] (See why I said you don't want to program dataflow?)

18 Conjugate gradients. Looks much like PETSc: one or two lines per major operation.

  kernel *rrcopy = new IMP_copy_kernel( rr,rrp );
  queue->add_kernel(rrcopy);
  kernel *matvec = new IMP_spmvp_kernel( pnew,q,a );
  queue->add_kernel(matvec);
  kernel *pap_inprod = new IMP_innerproduct_kernel( pnew,q,pap );
  queue->add_kernel(pap_inprod);
  kernel *alpha_calc = new IMP_scalar_kernel( rr,"/",pap,alpha );
  queue->add_kernel(alpha_calc);

19 Kernel programming. A kernel contains a function pointer and the dependency structure of the algorithm (not of tasks or processors!).

  class IMP_diffusion_kernel : virtual public IMP_kernel {
  public:
    IMP_diffusion_kernel(object *in,object *out)
      : kernel(in,out),IMP_kernel(in,out) {
      add_beta_oper( new ioperator("none") ); // we need the i index
      add_beta_oper( new ioperator(">=1") );  // we need the i+1 index
      add_beta_oper( new ioperator("<=1") );  // we need the i-1 index
      set_localexecutefn( &local_diffusion );
    };
  };
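
The local function that set_localexecutefn points to is not shown in the talk; here is a hypothetical sketch of a three-point diffusion update, assuming the input buffer is laid out on the β-distribution (one halo point on each side) and the output on the owned indices:

  // hypothetical signature: IMP's actual local-function interface is not shown here
  void local_diffusion(double *out,double *in,int localsize) {
    // in[0] holds the i-1 halo value, in[localsize+1] the i+1 halo value
    for (int i=0; i<localsize; i++)
      out[i] = in[i+1] + 0.25 * ( in[i] - 2*in[i+1] + in[i+2] );
  }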

20 N-body problems. Upper tree levels: redundant computation.

21 Recursive definition of levels. Redundancy is never explicitly programmed!

  ioperator *coarsen = new ioperator("/2");
  for (int level=0; ; level++) {
    new_dist = level_dist->operate(coarsen);
    distributions->push_back(new_dist);
    if (new_dist->global_size()==1) break;
    level_dist = new_dist;
  }

22 Table of Contents 1 Introduction 2 IMP programming 3 Results and prospects 4

23 State so far. Implemented template code for:
- heat equation
- k-means clustering
- CG
- N-body
which can be compiled (sorta) to:
- MPI
- OpenMP tasking
- hybrid

24 CG with MPI backend. [Plot: weak scaling of conjugate gradients; runtime in microseconds for 20 iterations versus number of procs, comparing PETSc and IMP.] (Would be better if I used MPI collectives.)

25 Future
- True hybrid computing
- Latency hiding mechanisms, automatic ("communication avoiding")
- Load balancing
- Redundancy


27 Thanks. Supported by NSF EAGER. Code: Contact:

28 Table of Contents 1 Introduction 2 IMP programming 3 Results and prospects 4

29 (The secret sauce) Distributions alone are not enough: we need to capture how the algorithm interacts with distributions.

β-distribution: a mathematical formalization of the halo region,

  β = σ_f γ

where σ_f is the signature function and γ the distribution of the output object. Then there is a message q → p if

  α(q) ∩ β(p) ≠ ∅

(where α is the distribution of the input).
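
For example, for the three-point diffusion kernel of slide 19 the signature function is σ_f(i) = {i-1, i, i+1}, so on a block distribution where processor p owns γ(p) = [lo_p, hi_p] we get β(p) = [lo_p - 1, hi_p + 1]: exactly the halo region. The condition α(q) ∩ β(p) ≠ ∅ then selects precisely the left and right neighbors of p as senders, and the messages are derived rather than programmed.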
