Runtime Address Space Computation for SDSM Systems

1 Runtime Address Space Computation for SDSM Systems Jairo Balart

2 Outline
  - Introduction
  - Inspector/executor model
  - Implementation
  - Evaluation
  - Conclusions & future work

4 Introduction
  Programming models for distributed memory systems
  - Message passing (MPI): data distribution, work distribution, data communication/memory consistency
  - SDSM systems: Co-Array Fortran (CAF), UPC, OpenMP

                       CAF   UPC      OpenMP
  data distribution    NO    NO/YES   YES
  work distribution    NO    YES      YES
  memory consistency   NO    YES      YES

5 Introduction
  SDSM critical issues
  - memory consistency
  - data sharing
  Sources of overhead
  - Memory monitoring
    - memory access interception (UPC)
    - page fault exception handling (OpenMP)
  - Data & control communication
    - memory consistency (UPC & OpenMP)

6 Introduction
  Optimization opportunities in current SDSM implementations are related to address space information
  - current implementations have limited information
  - data & control communication happens on demand
  - the runtime cannot foresee future communications
  Gather information at runtime
  - Possible solution: code inspection prior to code execution

8 Inspector/executor model
  How to embed inspector/executor in an SDSM system
  - to complement the native mechanisms for memory consistency and data sharing
  - to generate an accurate description of the address space used
  Inspector information increases optimization opportunities
  - communication & computation overlap
  - grouped communications
  - fewer control messages for consistency
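A minimal sketch of the inspector/executor split for a simple vector addition may make the idea concrete. It is not the implementation described in the talk: `page_set`, `inspect_access`, `inspect_vector_add`, `execute_vector_add` and `MAX_PAGES` are hypothetical names chosen for illustration, and the 4 KB page size is the default mentioned later in the slides. The inspector walks the same iteration space as the loop but only records which pages would be read or written; the executor is the unchanged loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* 4 KB pages, the default quoted in the slides */
#define MAX_PAGES 1024u   /* hypothetical capacity of this toy set */

/* Hypothetical set of distinct page numbers touched by a loop. */
typedef struct {
    uintptr_t pages[MAX_PAGES];
    size_t count;
} page_set;

/* Inspector step: record the page that holds addr, at most once. */
static void inspect_access(page_set *set, const void *addr)
{
    uintptr_t page = (uintptr_t)addr / PAGE_SIZE;
    for (size_t i = 0; i < set->count; i++)
        if (set->pages[i] == page)
            return;
    assert(set->count < MAX_PAGES);
    set->pages[set->count++] = page;
}

/* Inspector: visits every access of a[i] = b[i] + c[i] without computing. */
static void inspect_vector_add(page_set *reads, page_set *writes,
                               const double *a, const double *b,
                               const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        inspect_access(writes, &a[i]);
        inspect_access(reads, &b[i]);
        inspect_access(reads, &c[i]);
    }
}

/* Executor: the original loop, run once the needed pages are in place. */
static void execute_vector_add(double *a, const double *b,
                               const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

In this sketch the read and write sets are what a runtime could use to plan communication before the executor starts; the linear search in `inspect_access` is kept only for brevity.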

9 Inspector/executor model
  Issues to be considered
  - Implicit overhead
    - Optimized inspectors: predictable memory accesses
    - Reuse
    - Parallel inspectors
  - Control dependences & pointers
    - Complex memory accesses that cannot be inspected use the original SDSM mechanisms
  - Amount of data generated
    - how to describe the address space?

10 Inspector/executor model
  Implicit overhead: predictable accesses

    for (i = 0; i < dimx; i++)
      for (j = 0; j < dimy; j++)
        a[i][j] = b[i][j] + c[i][j];

  [Figure: matrix a, rows indexed by i from 0 to DIMX, columns indexed by j from 0 to DIMY]

11 Inspector/executor model
  Implicit overhead: reusing inspector data

    #pragma omp parallel private (iteration)
    for (iteration = 1; iteration <= max_iterations; iteration++)
      rank(iteration);

    void rank (int iteration)
    {
      ...
      #pragma omp for nowait
      for (i = 0; i < num_keys; i++) {
      }
      ...
    }
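Reuse can be sketched as a small cache keyed by a loop identifier: when the same loop is entered again with the same access pattern, the stored inspection result is returned instead of inspecting again. The names below (`inspection_record`, `get_inspection`, `counting_inspector`, `MAX_LOOPS`) are hypothetical, and `counting_inspector` is only a stand-in that counts how often inspection really runs; the slides do not show how reuse is stored.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOOPS 16   /* hypothetical number of distinct loop ids */

/* Hypothetical summary of one inspection. */
typedef struct {
    int valid;
    size_t pages_read;
    size_t pages_written;
} inspection_record;

typedef void (*inspector_fn)(inspection_record *out);

static inspection_record cache[MAX_LOOPS];

/* Returns the inspection data for loop_id, running the inspector only
 * when nothing is cached yet or when reuse is not allowed. */
static const inspection_record *get_inspection(int loop_id, int reuse_flag,
                                               inspector_fn inspect)
{
    assert(loop_id >= 0 && loop_id < MAX_LOOPS);
    inspection_record *rec = &cache[loop_id];
    if (!rec->valid || !reuse_flag) {
        inspect(rec);
        rec->valid = 1;
    }
    return rec;
}

/* Stand-in inspector: counts its invocations and fills in dummy sizes. */
static int inspector_calls;

static void counting_inspector(inspection_record *out)
{
    inspector_calls++;
    out->pages_read = 2;
    out->pages_written = 1;
}
```

With a loop such as the `rank` example, calling `get_inspection` once per outer iteration with reuse enabled would run the inspector on the first iteration only.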

12 Inspector/executor model
  Implicit overhead: parallel inspectors
  - If the code to be inspected is parallel, nothing forbids parallel inspection
  - The same scheduling must be applied for inspection and execution
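One way to guarantee that inspection and execution see the same schedule is to derive both from a single partitioning function. The sketch below is an assumption for illustration: `static_block` is a hypothetical helper that splits the half-open range [low, upper) into contiguous blocks, one per thread, and it is not claimed to match the static schedule of any particular OpenMP runtime.

```c
#include <assert.h>

/* Computes the half-open iteration range [*start, *end) owned by thread
 * tid when [low, upper) is split into nthreads contiguous blocks.
 * The first (total % nthreads) threads receive one extra iteration. */
static void static_block(long low, long upper, int nthreads, int tid,
                         long *start, long *end)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads && upper >= low);
    long total = upper - low;
    long base = total / nthreads;
    long extra = total % nthreads;
    long offset = tid * base + (tid < extra ? tid : extra);
    *start = low + offset;
    *end = *start + base + (tid < extra ? 1 : 0);
}
```

If both the inspector and the executor of a thread call this function with the same arguments, the pages recorded during inspection are exactly those the thread touches during execution.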

13 Inspector/executor model
  Control dependences & pointers
  - not very common in numerical applications

  NAS EP:

    for (i = 0; i < NK; i++) {
      x1 = 2.0 * x[2*i] - 1.0;
      x2 = 2.0 * x[2*i+1] - 1.0;
      t1 = pow2(x1) + pow2(x2);
      if (t1 <= 1.0) {
        t2 = sqrt(-2.0 * log(t1) / t1);
        t3 = (x1 * t2);
        t4 = (x2 * t2);
        l = max(fabs(t3), fabs(t4));
        qq[l] += 1.0;
        sx = sx + t3;
        sy = sy + t4;
      }
    }

  NAS CG:

    for (j = 1; j <= lastrow-firstrow+1; j++) {
      sum = 0.0;
      for (k = rowstr[j]; k < rowstr[j+1]; k++) {
        sum = sum + a[k]*p[colidx[k]];
      }
      w[j] = sum;
    }
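For an indirect access such as `p[colidx[k]]` in the CG loop, the pages touched are only known once the index vector is available, which is the classic case for runtime inspection. The following sketch is hypothetical: `inspect_indirect` is an illustrative name, page numbers are taken relative to the start of the indexed array, and the array is assumed to start on a page boundary.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u   /* 4 KB pages, the default quoted in the slides */

/* Marks, in page_touched, every page of the indexed array reached through
 * colidx[begin..end). Returns how many pages were newly marked.
 * Assumption: the indexed array starts on a page boundary. */
static size_t inspect_indirect(const int *colidx, size_t begin, size_t end,
                               size_t elem_size,
                               unsigned char *page_touched, size_t num_pages)
{
    size_t distinct = 0;
    for (size_t k = begin; k < end; k++) {
        assert(colidx[k] >= 0);
        size_t page = ((size_t)colidx[k] * elem_size) / PAGE_SIZE;
        assert(page < num_pages);
        if (!page_touched[page]) {
            page_touched[page] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

The bitmap keeps the inspection result proportional to the number of pages rather than to the number of accesses.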

14 Inspector/executor model
  Amount of generated data
  - at address level: * * 3 =
  - at 4KB-page level: * * 3 / =

    #define SIZE
    #pragma omp for
    for (i = 0; i < SIZE; i++)
      for (k = 0; k < SIZE; k++)
        for (j = 0; j < SIZE; j++)
          matrixc[i][j] += (matrixa[i][k] * matrixb[k][j]);
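The figures on this slide did not survive extraction, so the sketch below only illustrates the kind of arithmetic involved, under loudly stated assumptions: a hypothetical `SIZE` of 1024, 8-byte elements, three matrices, and the reading that the address-level count is one entry per distinct matrix element. `address_entries` and `page_entries` are illustrative names, not part of the talk.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u   /* 4 KB pages, as on the slide */

/* One entry per distinct element of `matrices` square size x size matrices. */
static unsigned long long address_entries(unsigned long long size,
                                          unsigned matrices)
{
    assert(size > 0 && matrices > 0);
    return size * size * matrices;
}

/* One entry per 4 KB page covered by the same matrices (rounded up). */
static unsigned long long page_entries(unsigned long long size,
                                       unsigned matrices, size_t elem_size)
{
    assert(size > 0 && matrices > 0 && elem_size > 0);
    unsigned long long bytes = size * size * elem_size;
    unsigned long long pages_per_matrix = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    return pages_per_matrix * matrices;
}
```

Under these assumed values, `address_entries(1024, 3)` gives 3145728 entries while `page_entries(1024, 3, 8)` gives 6144, which shows why describing the address space at page granularity keeps the inspector output small.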

16 Implementation
  SDSM totally relying on inspector/executor data
  - Evaluate the impact of building inspector data and distributing it
  - Stress the inspector role to the limit
  Communication not allowed during execution (only during inspection)
  - before execution, all data has to be available on its nodes
  Communication/execution decoupled
  - inspection phase
  - communication phase
  - execution phase

17 Implementation
  Relaxed consistency
  - copies + diffs
  Inspection done at page level
  - default 4KB
  - special treatment for scalar objects (1B, 2B, 4B or 8B)
  Programming model: OpenMP
  - compiler infrastructure only supports OpenMP transformations
  - inspectors coded by hand
  - limitations: loop parallelism, static scheduling
  Optimizations
  - Predictable accesses
  - Parallel inspection
  - Reuse done by hand

18 Implementation
  Inspector phase
  - Master broadcasts loop parameters: for (i = start; i < end; i += step), scheduling
  - Code inspection done in parallel
  - Slaves send inspection data to master
    - # pages read & base addresses
    - # pages written & base addresses

    begin_for_sampling (low, upper, step, schedule, chunk, loop_id, reuse_flag);
    while (next_iters_sampling (&start, &end, &last))
      for (p_i = start; (step >= 1) ? (p_i <= end) : (p_i >= end); p_i += step)
        for (p_j = 0; p_j < dimy; p_j++)
          /* a[i][j] = b[i][j] + c[i][j]; */
          sample_stmt (&a[p_i][p_j], 2, &b[p_i][p_j], &c[p_i][p_j]);
    end_for_sampling ();

19 Implementation
  Inspector phase
  - Master broadcasts loop parameters: for (i = start; i < end; i += step), scheduling
  - Code inspection done in parallel
  - Slaves send inspection data to master
    - # pages read & base addresses
    - # pages written & base addresses

    begin_for_sampling (low, upper, step, schedule, chunk, loop_id, reuse_flag);
    while (next_iters_sampling (&start, &end, &last)) {
      sample_vector (&a[start][0], (end - start) * DIMY, WRITE);
      sample_vector (&b[start][0], (end - start) * DIMY, READ);
      sample_vector (&c[start][0], (end - start) * DIMY, READ);
    }
    end_for_sampling ();
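When an access pattern is a contiguous block, as in the `sample_vector` calls above, the pages it covers can be derived in constant time from the base address and the length instead of sampling each element. The sketch below shows that computation only; `page_range` and `pages_of_block` are hypothetical names, and nothing here is claimed to be how `sample_vector` is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* 4 KB pages, the default quoted in the slides */

/* Hypothetical description of a run of consecutive pages. */
typedef struct {
    uintptr_t first_page;   /* page number of the first byte */
    size_t num_pages;       /* number of consecutive pages covered */
} page_range;

/* Pages covered by the byte range [base, base + bytes). */
static page_range pages_of_block(const void *base, size_t bytes)
{
    assert(bytes > 0);
    uintptr_t start = (uintptr_t)base;
    uintptr_t last = start + bytes - 1;
    page_range r;
    r.first_page = start / PAGE_SIZE;
    r.num_pages = (size_t)(last / PAGE_SIZE - r.first_page) + 1;
    return r;
}
```

For a block of rows of a row-major matrix, the byte count would be the number of elements multiplied by the element size.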

20 Implementation
  Communication phase
  - Master computes the needed page interchanges
  - Master sends page interchange queries
  - Pages are interchanged
  - Master computes the pages written on more than one node and makes copies

    begin_for_sampling (low, upper, step, schedule, chunk, loop_id, reuse_flag);
    while (next_iters_sampling (&start, &end, &last)) {
      sample_vector (&a[start][0], (end - start) * DIMY, WRITE);
      sample_vector (&b[start][0], (end - start) * DIMY, READ);
      sample_vector (&c[start][0], (end - start) * DIMY, READ);
    }
    end_for_sampling ();
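The master's planning step can be sketched as two small computations over the inspection data: which pages must move from their current owner to a node that needs them, and which pages are written by more than one node. The data layout is an assumption made for this example: `owner[page]` holds the node that currently has the page, and `needs` and `writes` are flattened nodes-by-pages flag matrices. `transfer`, `plan_transfers` and `count_multi_writer_pages` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one page movement. */
typedef struct {
    int page;
    int from;
    int to;
} transfer;

/* Emits one transfer for every page a node needs but does not own.
 * needs[node * pages + page] is non-zero when node accesses page. */
static size_t plan_transfers(const int *owner, const unsigned char *needs,
                             int nodes, int pages,
                             transfer *out, size_t max_out)
{
    size_t n = 0;
    for (int node = 0; node < nodes; node++) {
        for (int page = 0; page < pages; page++) {
            if (needs[node * pages + page] && owner[page] != node) {
                assert(n < max_out);
                out[n++] = (transfer){page, owner[page], node};
            }
        }
    }
    return n;
}

/* Counts pages written by more than one node; these are the pages that
 * would need copies so that differences can be merged afterwards. */
static int count_multi_writer_pages(const unsigned char *writes,
                                    int nodes, int pages)
{
    int shared = 0;
    for (int page = 0; page < pages; page++) {
        int writers = 0;
        for (int node = 0; node < nodes; node++)
            writers += writes[node * pages + page] ? 1 : 0;
        if (writers > 1)
            shared++;
    }
    return shared;
}
```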

21 Implementation
  Execution phase
  - Each node has all the pages that execution requires
  - No runtime entries during execution
  - After execution, conflicting pages are returned to the master
  - Master finds the differences on conflicting pages and updates its pages

    begin_for ();
    while (next_iters (&start, &end, &last))
      for (p_i = start; (step >= 1) ? (p_i <= end) : (p_i >= end); p_i += step)
        for (p_j = 0; p_j < DIMY; p_j++)
          a[p_i][p_j] = b[p_i][p_j] + c[p_i][p_j];
    end_for_sampling ();
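The "copies + diffs" idea for pages written on several nodes can be sketched as follows: each node's returned page is compared with the pristine copy taken before execution, and only the bytes that changed are applied to the master's page. This is a generic byte-granularity sketch under assumed names (`diff_entry`, `make_diff`, `apply_diff`); the granularity and data structures used in the talk's implementation are not given in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one modified byte within a page. */
typedef struct {
    size_t offset;
    unsigned char value;
} diff_entry;

/* Compares a node's page with the copy made before execution and records
 * every byte that changed. Returns the number of entries written to out. */
static size_t make_diff(const unsigned char *copy, const unsigned char *page,
                        size_t len, diff_entry *out, size_t max_entries)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (copy[i] != page[i]) {
            assert(n < max_entries);
            out[n].offset = i;
            out[n].value = page[i];
            n++;
        }
    }
    return n;
}

/* Applies the recorded changes to the master's version of the page. */
static void apply_diff(unsigned char *master, size_t len,
                       const diff_entry *diff, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(diff[i].offset < len);
        master[diff[i].offset] = diff[i].value;
    }
}
```

Because each node contributes only the bytes it changed, writes from different nodes to disjoint parts of the same page merge without overwriting each other.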

23 Evaluation: the environment
  - 8 nodes of MareNostrum v1 (hosted at BSC, Barcelona)
  - Each node: 2 PowerPC 970FX at 2.2 GHz, 4 GB RAM
  - Myrinet network, 4 Gb/s
  - gcc 3.3, Linux
  - NPBC 2.3 Omni: EP, IS, FT

24 Evaluation: EP.A
  - Works mainly with private data
  - Communications only for the reduction variable
  - Single loop executed just once: no reuse
  [Charts: "Non optimized EP CLASS A" and "Optimized EP CLASS A"; execution time (sec) vs. number of threads, broken down into Runtime, Application, Control Comm. and Data Comm.]

25 Evaluation: IS.B
  - 2 shared vectors of 128MB (32768 pages of 4KB)
  - strided & via index vector accesses
  - Reuse (10 iterations)
  - Reduction
  [Charts: "Non optimized inspection IS CLASS B", "Non optimized IS CLASS B" and "Optimized IS CLASS B"; execution time (sec) vs. number of threads, broken down into Runtime, Application, Control Comm. and Data Comm.]

26 Evaluation: FT.B
  - 3 3-dimensional matrices of 512 MB ( pages of 4KB)
  - strided accesses
  - 20 iterations
  - 4 loops: 3 reused, 1 not reused
  - Data distribution changes at each iteration

    main ()
      for (iter = 1; iter < niter; iter++)
        evolve ()
        cffts3 ()
        cffts2 ()
        cffts1 ()

    cffts3 ()
      #pragma omp for
      for (j = )
        for (i = )
          for (k = )

    cffts1 & cffts2 ()
      #pragma omp for
      for (k = )
        for (j = )
          for (i = )

  [Chart: "Optimized FT CLASS B"; execution time (sec) vs. number of threads, broken down into Runtime, Application, Control Comm. and Data Comm.]

28 Conclusions
  - Explored the possibility of embedding an inspector/executor model in an SDSM system
  - The role of the inspector is to supply an address space description that is as accurate as possible
  - The inspector must be optimized to be affordable:
    - predictable accesses
    - parallel inspection
    - reuse if possible

29 Future work
  - More evaluation
    - Rest of the NPB benchmarks
    - SPEC OpenMP (non-numerical applications)
  - Embed the inspector/executor model in a real SDSM system
    - Automatic reuse mechanism
    - Simultaneous inspection and execution
  - Start to work on the compiler infrastructure
    - Support the rest of the OpenMP constructs

30 Questions? Thanks!

31 Evaluation: CG
  [Charts: "Non optimized inspection FT CLASS B", "Non optimized inspection CG CLASS B" and "Optimized CG CLASS B"; execution time vs. number of threads, broken down into Runtime, Application, Control Comm. and Data Comm.]
