Implementation and Evaluation of Dynamic Load Balancing Using Runtime Performance Monitoring on Omni/SCASH


Yoshiaki Sakae (1), Satoshi Matsuoka (2), Mitsuhisa Sato (3) and Hiroshi Harada (4)

(1) Tokyo Institute of Technology
(2) Tokyo Institute of Technology / JST
(3) Tsukuba University
(4) Hewlett-Packard Japan, Ltd.

In a commodity cluster environment, nodes may differ in performance for several reasons, and the application programmer experiences this heterogeneity as load imbalance. Overcoming it requires a dynamic load balancing mechanism. In this paper, we report our ongoing work on a dynamic load balancing extension to Omni/SCASH. With this mechanism, we expect the runtime system to adjust load imbalances automatically, without the programmer explicitly defining data and task placement, in a commodity cluster environment whose nodes may have heterogeneous performance. The evaluation results indicate that our loop re-partitioning scheme, one of the dynamic load balancing extensions, works well, and that it is important to combine loop re-partitioning with dynamic page migration.

1. [Section text lost in extraction. Surviving terms and citations: HPC, SMP, 9), heterogeneity, OpenMP.]

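2. Omni/SCASH

Omni/SCASH 7) is an implementation of OpenMP on SCASH, a software distributed shared memory (SDSM) system. [The rest of this paragraph, which refers to SCASH page fault counting, was lost in extraction.]

2.1 Omni OpenMP

The Omni OpenMP compiler is an OpenMP-to-C translator. Its front ends, C-front and F-front, parse C and Fortran programs into an intermediate representation called Xobject, a form of AST (Abstract Syntax Tree). The Exc Java tools, a Java class library, operate on the Xobject code, and the transformed OpenMP program is emitted as C.

2.2 SCASH

SCASH 3) is a Software Distributed Shared Memory System built on the PM 8) communication library and on the memory management functions of the OS. It uses the Eager Release Consistency (ERC) memory model and supports both invalidate and update protocols. Each page has a home node, which holds the latest copy of the page, and a base node, which keeps track of the page's current home.

2.3 Translation of OpenMP programs to SCASH

In OpenMP, global variables are shared by default, whereas SCASH follows the shmem model, in which shared memory has to be allocated explicitly at run time. The translator therefore rewrites an OpenMP program so that its shared data is placed in SCASH shared memory. [The surviving text also mentions the fork-join execution of parallel regions and nested parallelism in Omni/SCASH; the details were lost in extraction.]

3. [Section title and opening text lost in extraction. The surviving text refers to Figure 1 and to the chunk queue used by OpenMP dynamic and guided scheduling.]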

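Figure: Example of profiled scheduling (Num Proc = 3, Perf Ratio = 3:1:2).

    #pragma omp parallel for schedule(profiled)
    for (i = 0; i < N; i++) {
        LOOP_BODY;
    }

[Figure residue: with schedule(profiled) the iterations are divided 3n : n : 2n among proc 0, proc 1 and proc 2. The figure also shows schedule(profiled, 2) and schedule(profiled, 2, 3, 1), and its legend distinguishes iterations executed on proc 0, on proc 1, on proc 2, and the profiling loop.]

Profiled scheduling is specified through an extended OpenMP schedule clause:

    schedule(profiled[, chunk_size[, eval_size[, eval_skip]]])

The optional parameters are chunk_size, eval_size and eval_skip. [The description of the parameters was lost in extraction. Surviving fragments: "eval_skip 0", "eval_size = 1", "chunk_size 0 1". The text then refers to the example with 3 processors and a 3 : 1 : 2 performance ratio, to the Omni runtime, and to PAPI 2).]

Figure: Code generated for a profiled loop.

    static void ompc_func(void **ompc_args)
    {
        int i;
        {
            int lb, ub, step;
            double start_time = 0.0, stop_time = 0.0;

            lb = 0, ub = N, step = 1;
            _ompc_profiled_sched_init(lb, ub, step, 1);
            /* Each call hands this thread its next range [lb, ub) and
               receives the timestamps of the previous range. */
            while (_ompc_profiled_sched_next(&lb, &ub, start_time, stop_time)) {
                _ompc_profiled_get_time(&start_time);
                for (i = lb; i < ub; i += step) {
                    LOOP_BODY;
                }
                _ompc_profiled_get_time(&stop_time);
            }
        }
    }

[The explanation of the generated code was lost in extraction. It refers to CPU time, to _ompc_profiled_sched_init(), _ompc_profiled_sched_next() and _ompc_profiled_get_time(), and contrasts _ompc_profiled_sched_next() with the chunk queue used by dynamic and guided scheduling.]

4. [Section text largely lost in extraction. Surviving terms: profiled, NAS Parallel Benchmarks (NPB-2.3), EP, CG, OpenMP, Fortran, RWCP.]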

Figure: Execution Time of EP Class S on Homogeneous Settings (Pentium III nodes only). Axes: Time [sec] against Number of Nodes. Series: static, dynamic, profiled.

Table 1: Performance Heterogeneous Cluster

                  Fast nodes            Slow node
    CPU           Pentium III 500MHz    Celeron 300MHz
    Cache         512KB                 128KB
    Chipset       Intel 440BX           Same as Fast
    Memory        SDRAM 512MB           Same as Fast
    NIC           Myrinet M2M-PCI32C    Same as Fast

EP is the Embarrassingly Parallel kernel. CG is the Conjugate Gradient kernel, which is typical of unstructured grid computation. The Pentium III 500MHz nodes are referred to as Fast and the Celeron 300MHz node as Slow. The software environment is RedHat 7.2 with SCore and gcc. [The compiler flags ("O") and the profiled settings for chunk_size, eval_size and eval_skip were lost in extraction.]

4.2 [Subsection text lost in extraction. It covers profiled scheduling on EP: static, dynamic and profiled scheduling are compared on EP class S using Fast nodes only, with reference to the chunk queue used by dynamic scheduling.]

Figure: Execution Time of EP Class S on Heterogeneous Settings (one Celeron node + Pentium III nodes). Axes: Time [sec] against Number of Nodes. Series: static (Slow x 1 + Fast), dynamic (Slow x 1 + Fast), profiled (Slow x 1 + Fast), static (Fast x 4).

Figure: Execution Time of CG Class A on Heterogeneous Settings (one Celeron node + Pentium III nodes). Axes: Time [sec] against Number of Nodes. Series: static (Slow x 1 + Fast), profiled (Slow x 1 + Fast).

Table 2: Memory Behavior of the CG Class A on Celeron 1 + Pentium III 3 settings for static/profiled scheduling

                           static    profiled
    L2 miss ratio          29.6%     31.1%
    Page Fault at SCASH    [lost]    [lost]
    Barrier Time [sec]     [lost]    [lost]

[The discussion of the heterogeneous results was lost in extraction. Surviving terms: chunk, vector, EP, profiled, CG class A, static, CG kernel, parallel for, affinity.]

4.4 [Subsection text largely lost in extraction. It discusses Table 2, measured for CG Class A on 4 nodes (1 Slow node and 3 Fast nodes) under static and profiled scheduling, in terms of L2 cache miss ratio and SCASH page faults.] The L2 cache miss ratio is obtained from PAPI counters as:

$$\text{L2 cache miss ratio} = \frac{\mathrm{PAPI\_L2\_TCM}}{\mathrm{PAPI\_L1\_DCM}} \times 100$$

[The remaining text compares the L2 cache miss ratio and the SCASH page faults of static and profiled scheduling, and relates the behavior of profiled scheduling to affinity and to page migration. Surviving terms: affinity, page migration, stable.]

5. Ongoing Work: Page Migration by Page Fault Counting

[Surviving terms: CG, SDSM.]

[Section 5 text lost in extraction. Surviving terms and citations: OpenMP, page reference counter 6), 1), 4), 5), SDSM, page fault, SPMD.]

6. [Concluding section; text lost in extraction. Surviving terms: Software Distributed Shared Memory, SCASH, OpenMP, Omni/SCASH, heterogeneity, page fault counting, EP, profiled, static, CG, SDSM.]

References

1) J. Bircsak, P. Craig, R. Crowell, J. Harris, C. A. Nelson, and C. D. Offner. Extending OpenMP for NUMA Machines: The Language. In Proceedings of Workshop on OpenMP Applications and Tool (WOMPAT 2000), July, San Diego, USA.
2) S. Browne, J. Dongarra, N. Garner, G. Ho, and P. Mucci. A Portable Programming Interface for Performance Evaluation on Modern Processors. The International Journal of High Performance Computing Applications, 14(3), Fall.
3) H. Harada, H. Tezuka, A. Hori, S. Sumimoto, T. Takahashi, and Y. Ishikawa. SCASH: Software DSM using High Performance Network on Commodity Hardware and Software. In Proceedings of Eighth Workshop on Scalable Shared-memory Multiprocessors. ACM, May.
4) A. Hasegawa, M. Sato, Y. Ishikawa, and H. Harada. Optimization and Performance Evaluation of NPB on Omni OpenMP Compiler for SCASH, Software Distributed Memory System (in Japanese). In IPSJ SIG Notes, 2001-ARC-142, 2001-HPC-85, Mar.
5) J. Merlin. Distributed OpenMP: Extensions to OpenMP for SMP Clusters. Invited Talk, Second European Workshop on OpenMP (EWOMP 00), Oct., Edinburgh, Scotland.
6) D. S. Nikolopoulos, T. S. Papatheodorou, C. D. Polychronopoulos, J. Labarta, and E. Ayguadé. Is Data Distribution Necessary in OpenMP? In Proc. of Supercomputing 2000, Nov., Dallas, TX.
7) M. Sato, H. Harada, and Y. Ishikawa. OpenMP compiler for Software Distributed Shared Memory System SCASH. In Proceedings of Workshop on OpenMP Applications and Tool (WOMPAT 2000), July, San Diego, USA.
8) H. Tezuka, A. Hori, Y. Ishikawa, and M. Sato. PM: An Operating System Coordinated High Performance Communication Library. In P. Sloot and B. Hertzberger, editors, High-Performance Computing and Networking 97, volume 1225 of Lecture Notes in Computer Science, Apr.
9) [Entry lost in extraction.]
